Thursday 6 November 2008

Catching up with sprints and progress

Been a while since I posted; alas, blogging is not uber-high on my GTD list, so it often gets pushed back... However, I thought it might be fun to share some of the work we're doing in the lexaurus vocabulary editor in relation to SKOS vocabularies. Rob Tice has sent me some screenshots of lexaurus ingesting the SKOS vocabulary for agriculture; here are the results.

Here in image one is the post-ingest status report saying that 28,954 terms have been imported in 269 seconds. As part of the SKOS import we have to do a fair few cross-references and referential lookups, so the performance isn't quite as blistering as the ZThes import, but it's still pretty good.



Here are some more screenies showing the label nodes and editing pages in different languages. The data is held in the lexaurus canonical schema, so it can go out as ZThes or SKOS or whatever else we need.







More to follow, but I thought these were interesting enough to put up now, just to stimulate discussion.

Wednesday 6 August 2008

Sprint 3 - Washup and planning

A few months ago we started to experiment with scrum, sprints and backlogs in an attempt to formalise our agile development processes. I thought it might be fun to try and keep the blog updated every couple of weeks with what's moving in the k-int world... so here goes...

Hana's done some great work on the zero-functionality release for the JISC Transcoder Project and it's basically finished. I'm tasked with uploading this to the Amazon elastic cloud this week whilst Hana pushes on with having a go at SCORM 2004 to IMS CP 1.1 transcoding. Hana has also spent a couple of days fixing a drag-and-drop bug in the vocabulary editor application and getting Reload installed for some testing. We're on the lookout for some SCORM 2004 content to test with (especially older BBC JAM content).


Liam's been doing loads of work refactoring the import and export modules in Vocabulary bank to make sure we can ingest and export both ZThes and SKOS and crosswalk between the two, in preparation for creating some additional export formats for specialist educational / eLearning applications. Liam's also managed to create an installer for the new release of OpenCRM, our open source third-sector call management / CRM application, to go off to Sheffield Advice Link. We've been told that the latest released vocabulary editor can import the full 7,923 terms from IPSV (the Integrated Public Sector Vocabulary) in 17 seconds, which isn't bad going considering the amount of cross-referencing and revision management that's going on under the hood.



I've been on holiday ;) but whilst I've been away I've made huge steps forward on a stand-alone OAI/ORE server that we can use to replace all the custom OAI servers we have dotted around the place. With the new server you just point the app at a JDBC data source and it does introspection to discover the DB schema and detect primary keys and timestamps; a few clicks later you have an OAI/ORE feed for your database. The same infrastructure can be used in online or batch mode (batch mode detects changed records with checksums rather than timestamps, which is useful for normalised schemas where changes might not update datestamps). Initially this is for an update to the Peoples Network Discover application, but I have high hopes it will be useful in a wide range of cultural and governmental settings. This should give us an instant linked-data capability for projects like Peoples Network and other related metadata aggregator projects we're involved with at the moment.
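For the curious, the JDBC introspection involved is roughly the following (a minimal sketch only; the real server does considerably more, and the connection details here are placeholders):

import java.sql.*;

public class SchemaIntrospector {
  public static void main(String[] args) throws Exception {
    // Connect to the target database (URL / credentials supplied on the command line).
    Connection con = DriverManager.getConnection(args[0], args[1], args[2]);
    DatabaseMetaData md = con.getMetaData();

    // Walk every table in the schema...
    ResultSet tables = md.getTables(null, null, "%", new String[] { "TABLE" });
    while (tables.next()) {
      String table = tables.getString("TABLE_NAME");
      System.out.println("Table: " + table);

      // ...note its primary key columns (candidates for minting OAI identifiers)...
      ResultSet pks = md.getPrimaryKeys(null, null, table);
      while (pks.next())
        System.out.println("  pk: " + pks.getString("COLUMN_NAME"));
      pks.close();

      // ...and look for timestamp columns usable as OAI datestamps.
      ResultSet cols = md.getColumns(null, null, table, "%");
      while (cols.next()) {
        if (cols.getInt("DATA_TYPE") == Types.TIMESTAMP)
          System.out.println("  datestamp candidate: " + cols.getString("COLUMN_NAME"));
      }
      cols.close();
    }
    tables.close();
    con.close();
  }
}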

I've also (to my eternal shame in the office) started to do some real work on the JZKit documentation and the Maven auto-generated site, to make it easier for users of the toolkit to get to grips with it and diagnose problems. There seem to be a huge number of Z3950 projects in the offing at the moment, for a protocol that's supposedly legacy. I think that's great!

Rob's made huge steps forward with the assisted tagging tool, which is being used (in anger, by many users) to create descriptive metadata for literally thousands of learning resources. The new tagging tool seamlessly integrates with the bank vocabulary service, so we can update the vocabularies used without updating the tool, and the new assisted tagging feature makes it easy for users to quickly tag resources without needing an in-depth knowledge of curriculum and administrative structures.

Sunday 27 July 2008

Updates and New Projects

Thought it best to just have a quick update as there's so much going on at the moment I'm in danger of missing important announcements.

Just before I left for a week's holiday Neil, Hana and I attended a kick-off meeting for the JISC Transcoder project. There's information about the project at that link, although I'll be posting more about this shortly. We've already had loads of interest both from content providers and from interested HE partners who want to be involved in testing and development. To cause such a stir so early on is great, IMNSHO. I'm going to try and get a project page together in the coming week, and unless anyone has any better ideas I'm planning to tag project announcements and direct links as jisc-transcoder. A project-transcoder tag is available for resources useful to the project, so we can get some feeds going.

Hana has already made huge progress on the first release (Which transcodes a package into itself, but lets us test the upload/download and cloud infrastructure) and we expect to be sitting down real soon now to put some priorities on the next release.

Here's an aggregated pipe of transcoder tags and blog entries.

Monday 14 July 2008

Wanted : HTML digitisation Silos needing an OAI feed

We've just been talking to a really cool Sheffield geek who's got a neat HTML screen-scraping service and has integrated it into JZKit to provide dynamic meta-search across services with no machine-to-machine interface. Very cool.

However, in our discussions we realised that it might have even more value as a tool for converting HTML silos into metadata-rich repositories. By combining this service with a SWORD deposit client and some MD5 checksums, we can help digitisation projects that have large, well-organised web sites but perhaps don't have OAI or SRU/SRW: we can create a preservation hub that exposes their content using open standards.
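To make the checksum part of that concrete, here's a minimal sketch of the idea: fetch a page, compute its MD5, and only hand it to the SWORD deposit client if the checksum has changed since the last run (the URL handling and the storage of previous checksums are left out):

import java.io.InputStream;
import java.math.BigInteger;
import java.net.URL;
import java.security.MessageDigest;

public class PageChecksum {

  // MD5 of a scraped page, so we only (re-)deposit items that have actually changed.
  public static String md5Of(String pageUrl) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    InputStream in = new URL(pageUrl).openStream();
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) != -1)
      md.update(buf, 0, n);
    in.close();
    return new BigInteger(1, md.digest()).toString(16);
  }

  public static void main(String[] args) throws Exception {
    // If this differs from the checksum stored last time, pass the page
    // (or rather the metadata scraped from it) on to the SWORD deposit client.
    System.out.println(args[0] + " -> " + md5Of(args[0]));
  }
}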

If you have an HTML silo of digitised data (or any data for that matter) but no OAI or SRU, and you need such services, we'd love to test this idea by writing the scripts to populate an OAI and SRU repository... Any volunteers?

Going to spend some spare time throwing this and the JDBC/OAI/SRU gateway together hopefully in time for library mashup and bathcamp. Maybe we can come out of those events with some newly exposed data sources contributing to the linked data network.

Fun :)

Sunday 13 July 2008

New version of JZkit Proxy Server

An updated version of the JZKit Proxy server (With an example configuration exposing a number of SOLR backends as a z3950 target) is available for download Here

Since the initial test at the Bath CRIG barcamp we've improved the error handling and diagnostics, improved the mapping of Z3950 database names, made the attribute mapping more configurable (the out-of-the-box sample config is pretty much Bath Profile Level 1 OK), improved the structure==year mappings for date search, and done some pretty hefty load testing, which revealed a thread leak in certain situations (it's fixed now).

Please download, play and get in touch with any questions / comments.

Ian.

Sunday 8 June 2008

JISC CRIG DRY / IEDemonstrator BarCamp - Testing the Z3950/SOLR Bridge

The JZKit-to-SOLR Z3950 bridge has had its first real foray outside of internal k-int projects by providing a Z39.50 server to cross-search Intute and the Oxford University Research Archive in a single Z39.50 search. I've been itching to try this for absolutely ages, and the CRIG/DRY barcamp was the perfect opportunity to talk to the projects and quickly hack out the config needed to set up the server.

Here's the proof of the pudding: a Z39.50 search for title=science, translated into SOLR searches against the two repositories and integrated into a single interleaved result set. Easily integrated into any meta-search engine implementing Z39.50. Cool! (N.B. The XML markup in the result records below has been filtered out by Blogger, so they appear as run-together text.)


Z> open tcp:@:2100
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID : 174
Name : JZkit generic server / JZKit Meta Search Service
Version: 3.0.0-SNAPSHOT
Options: search present delSet triggerResourceCtrl scan sort extendedServices namedResultSets negotiationModel
Elapsed: 0.006905
Z> base oxford intute
Z> format xml
Z> elements solr
Z> find @attr 1=4 science
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 803, setno 1
records returned: 0
Elapsed: 0.003859
Z> show 1+10
Sent presentRequest (1+10).
Records: 10
[SOLR]Record type: XML
No Third Party copyrightGordon L. ClarkGordon L. Clarkora:general>falsegeneral1982John Wiley & Sons, Ltd.D. T. HerbertD. T. HerbertR. J. JohnstonR. J. JohnstonSt Peter's College1982Social Sciences Division - Environment,Centre for the - Geography,School ofUniversity of OxfordGordon L. ClarkD. T. HerbertR. J. JohnstonJohn Wiley & Sons, Ltd.Transformations: Economy, Society and PlacePublisher has copyrightPeer reviewedGeographySocial Sciences Division - Environment,Centre for the - Geography,School ofClarkHerbertJohnstonGordon L. ClarkGordon L. ClarkD. T. HerbertD. T. HerbertR. J. JohnstonR. J. JohnstonJohn Wiley & Sons, Ltd.John Wiley & Sons, Ltd.Book (monograph): Section of book or chapter of bookGordon L.D. T.R. J.http://ora.ouls.ox.ac.uk:8081/10030/2034Geography and the urban environment : progress in research and applicationsuuid:89494f00-a92b-4914-8f9a-bcede5a4d9beUniversity of Oxford04711022531982enSt Peter's College41-61trueGordon L. ClarkD. T. HerbertR. J. Johnstonuuid:89494f00-a92b-4914-8f9a-bcede5a4d9beJohn Wiley & Sons, Ltd.Transformations: Economy, Society and PlacePeer reviewedPublisher has copyrightPublishedGeography2008-06-02T07:36:29.419ZPolicy science and instrumental reasontextuuid:89494f00-a92b-4914-8f9a-bcede5a4d9be5http://www.geog.ox.ac.uk/staff/glclark.phphttp://eu.wiley.com/WileyCDA/Section/index.html
[SOLR]Record type: XML
ScienceScienceurn:ISSN:0036-8075oai:epubs.cclrc.ac.uk:serial/1252oai:epubs.cclrc.ac.uk:serial/125257STFC ePublication Archive57-oai:epubs.cclrc.ac.uk:serial/1252ScienceScience

JISC CRIG / IEDemonstrator BarCamp - Controlled Vocabs


Beautiful sunny morning here in Sheffield, and all seems well with the world. It's taken me a day to recover from the travelling (mostly), but now I'm feeling vaguely human again it's time to write about bits of the CRIG / IEDemonstrator day.

Controlled Vocabularies / Terminology Services

Had a great discussion about this with the Names, HILT and STAR projects. Everyone showed a sample of what kind of vocab service they are working with, and the pattern of a pretty web app fronting a web-service back-end was pretty much the de facto standard. K-Int's interest really centres around the work we are doing with Vocman in the learning sector (see screenshot). Although the lexaurus suite isn't tied to any particular metadata scheme or representation, we have worked almost exclusively with ZThes to date. After talking with these projects it seems critical that we write the SKOS adapters for import and export sooner rather than later, so that's something I'm going to push for ASAP in the Vocman development plan. Hopefully that will add another SRU-searchable terminology service to the IE.

Our small prototyping group was tasked with working out how vocabulary services could be used WRT repositories. We talked around many use cases: improved metadata creation and validation on submission (this works great for both subject headings and name authority services like NAMES), improved precision for searchers, and better current awareness and dissemination services, by allowing subscribers to follow a single controlled term and have that term translated into whatever subject scheme is in use at a given repo. The catch is that none of this works without the initial effort of improved metadata, so (keeping in mind Paul's closing comment about not getting too hung up on the metadata) we decided to focus on ways of improving the metadata of items attached to deposited artifacts.

One of our group (I'm really sorry, memory has failed me, but please comment if it was you) discussed ways they have managed to put an external metadata editing page behind a repository submission page through the use of proxies. Thus the repository is kept unpolluted by the metadata editing app, but the presentation of a form is transparent to the depositor. So our final paper prototype extended the deposit service by adding a response parameter giving the URL at which the metadata for an item could be edited. This editing environment would be pre-configured to use external vocabulary services and assist the user in selecting terms. The tool could then post the metadata back using some repository-specific adapter: for example, adding a datastream to a Fedora object using the REST service, or auto-publishing into an indexing service such as Zebra.
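To pin down what I mean by a "repository-specific adapter", here's a hypothetical sketch; the interface and class names are mine, invented purely for illustration, not anything from the projects discussed:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical plug-point: the editing tool calls publish() on save, and the
// adapter decides how to push the record into a particular back-end
// (a Fedora datastream, a Zebra index, or whatever else).
interface MetadataPublishAdapter {
  void publish(String itemIdentifier, Document metadataRecord) throws Exception;
}

// One possible adapter: POST the serialised record to a configured endpoint.
// The endpoint URL is an assumption; a real adapter would be repository-specific.
class HttpPostPublishAdapter implements MetadataPublishAdapter {
  private final String endpointUrl;

  HttpPostPublishAdapter(String endpointUrl) {
    this.endpointUrl = endpointUrl;
  }

  public void publish(String itemIdentifier, Document metadataRecord) throws Exception {
    HttpURLConnection con = (HttpURLConnection) new URL(endpointUrl + "/" + itemIdentifier).openConnection();
    con.setDoOutput(true);
    con.setRequestMethod("POST");
    con.setRequestProperty("Content-Type", "text/xml");
    OutputStream out = con.getOutputStream();
    TransformerFactory.newInstance().newTransformer()
        .transform(new DOMSource(metadataRecord), new StreamResult(out));
    out.close();
    System.out.println("Repository responded: " + con.getResponseCode());
  }
}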

One interesting side note is that we ran into the old content dis-aggregation issues again when talking about how we can improve the metadata attached to a packaged item.

At k-int we've long discussed the need to take the Tagging Tool and turn it into a web application for editing metadata records using controlled vocab sources and then publishing those records using a pluggable system of adapters. The Controlled Vocab conversations have made me look at this in a new light, and I think it's about time we got to hacking something out. One for next weekend perhaps!

Sunday 1 June 2008

Upcoming Changes to JZKit Configuration

Some Sunday musings on the configuration mechanism in the jzkit_service module...

The jzkit_service module is the glue that pulls together all the other components into a federated meta-search component capable of resolving internal collection and landscape names into a list of external Z39.50/SRU/SRW/OpenSearch/SOLR/JDBC/etc services, broadcasting a search to those services, integrating the results and providing a unified result set.

This gives rise to a problem: for solutions like the previous Z39.50/SOLR bridge, we want a simple XML config file that a user can hack once and then leave. For more complicated applications, we need a real relational database behind the app to manage the complex config that goes along with large information network architectures.

Here's the rub: JZKit carries with it a very detailed service registry. Its purpose isn't, like the JISC IESR, to be a registry in its own right, but to support the search process. More and more, though, there is a need to make the JZKit service registry searchable in its own right (as a Z39.50 Explain database, or as an SRU Explain collection / ZeeRex records).

This has been bothering me for a while, yet the answer has pretty much been staring me in the face all along. So, here's what I'm considering for the final release of JZKit3:

1. The current "InMemoryConfig" which is loaded from XML config files will be deprecated.

2. It will be replaced by an in-memory derby database, essentially just the current database backed config mechanism, but with an in-memory database.

3. The XML config files will be left intact, but considered to be a "BootStrap" mechanism. At startup, JZKit will scan the config files and update/create any entries in the configuration database.

Thus, the "In-Memory" config will remain as before, but instead of being held in hashmaps the data will be inside a derby database. This means we can now define first and foremost a JDBC backed datasource for the explain database and make it searchable even for static content.

--> Free explain service for any JZKit shared collections.
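As a rough sketch of what step 2 boils down to (assuming Derby's in-memory back end is available; the table and row here are purely illustrative, not the real config schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InMemoryConfigSketch {
  public static void main(String[] args) throws Exception {
    // Spin up a throw-away, in-memory Derby database to hold the service registry.
    Connection con = DriverManager.getConnection("jdbc:derby:memory:jzkitConfig;create=true");
    Statement s = con.createStatement();

    // Illustrative table only - the point is that the registry now lives behind JDBC.
    s.executeUpdate("CREATE TABLE SEARCHABLE_RESOURCE (CODE VARCHAR(64) PRIMARY KEY, SERVICE_URL VARCHAR(255))");

    // The bootstrap step would walk the XML config files and update/create rows here.
    s.executeUpdate("INSERT INTO SEARCHABLE_RESOURCE VALUES ('oxford', 'http://example.org/solr/select')");

    s.close();
    con.close();
  }
}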

Tuesday 27 May 2008

Exposing SOLR service(s) as a Z3950 server

JZKit is a pretty large toolkit for developers of search services to embed in their own systems, and it's not always easy to get to grips with ;). Partly that's because if you're using JZKit you're probably already dealing with the Z39.50 specifications along with a host of other concerns.

What developers need are simple starter apps that they can use to hit the ground running. In JZKit3 we've decided to try and address this by putting up some sample configurations of the tool that do useful stuff out of the box. First up: making a SOLR server visible as a Z39.50 server using an easy-to-change XML config file. Why do that? Well, lots of reasons; the most common one at the moment is that SOLR is being used to provide search interfaces into lots of interesting new content, not least of which is the whole new breed of digital object repository projects like Fedora and DSpace. What seems to keep coming back around is institutional librarians saying "But I want that content to be available alongside everything else".

Kewl, a problem we can do something about. For now, the gateway distro lives in our maven repository Here. If you fancy setting up a Z39.50 server to proxy for a local SOLR server, download the compressed tar file from the above URL and unpack it.

Configuration.

After unpacking, look in etc/JZKitConfig.xml; you'll see a number of elements which define the SOLR services we want to search. (Actually, you can define all kinds of searchable resources here, not just SOLR: other Z39.50 targets, SRU/SRW, OpenSearch, etc, but that's for another day.) You should see that the top Repository element points at a public SOLR server; adjust the URL to point at your local repository. The other elements inside the Repository element control the mapping of Z39.50 use attributes onto the SOLR search language and which record elements are requested as part of search results. The code attribute specifies which Z39.50 database name this resource will be made available under. Having adjusted the config to point at a local server, cd ../bin and run "jzkit_service.sh start". This will give you a running Z3950 server, listening by default on port 2100.

From here, it should be plain sailing, here's the output of a yaz client session:

(N.B. The XML markup in the result record is being filtered out by Blogger. The actual result record is XML; in this case it's just the SOLR element.)

yaz-client tcp:@:2100
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID : 174
Name : JZkit generic server / JZKit Meta Search Service
Version: 3.0.0-SNAPSHOT
Options: search present delSet triggerResourceCtrl scan sort extendedServices namedResultSets negotiationModel
Elapsed: 0.006948
Z> base Test
Z> find @attr 1=4 Dell
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 1, setno 1
records returned: 0
Elapsed: 0.002231
Z> format xml
Z> show 1
Sent presentRequest (1+1).
Records: 1
[coll]Record type: XML
3007WFPDell Widescreen UltraSharp 3007WFP6
nextResultSetPosition = 2
Elapsed: 0.008594
Z> quit

Other components of the config file can be used to control the XML records returned, to convert SOLR records into MARC, or to manage meta-searching through the gateway, but that's a subject for next time.

Happy meta-searching.

Friday 25 April 2008

Convention (Profiles?) over Configuration for Repository Interoperability

*Brain dump warning*

As noted elsewhere, I've been playing with Apache Jackrabbit to implement an Institutional Repository (like) system for general use. Initially we're using it to host records created from an aggregation of UK Theatre Web, the Royal Opera House, the Shakespeare Birthplace Trust and the National Theatre. Each performance, artist, contributor, venue and work is cross-referenced and used to augment any existing records in order to create a harmonised historic record of performances in the UK. Theatre researchers want and need this augmented raw data (in XML and MARC format) but also want to search and browse. The bulk of the matching and deduplication work is done in a pretty hefty relational database structure, and the repository really does become a place to store and reference reusable artifacts.

So, we ended up playing with a Jackrabbit-based repository instead of Fedora or DSpace because we already have some pretty sophisticated search tools, and our database is the authoritative source of the data. Our repository is something that sits on the boundary of our system, as a public interface to the raw data rather than to our application. It only needs to do one simple job. In deployment terms, we have a Tomcat instance hosting a DB-driven application, our repository app, a search index app and an OAI app. The data import process orchestrates transactions between these web services. So, after our workflow process has updated the database (giving us a URL for a web-based app page referencing the item) and copies of the generated artifacts have been pushed up to the repository (giving us a URL where the data can be obtained), we push the update up to our indexing service so the whole lot can be searched... which gives rise to the initial question:

Back in the old days of Z39.50 we'd profile something like this and create a repository attribute and tag set. This meant that anyone wanting to search "a repository" could be sure they didn't need to worry about whose implementation it was. In a SOLR environment, for example, should there be a set of shared fields that all repositories support? I've started off with:

repository.item.pid - The persistent identifier of the repo item (I chose the fedora-like naming just to avoid making something new).
repository.item.url - The URL of the item in a specific repo
application.item.url - URL of a user-interface / application for the resolved item

and then for good measure

dc.title - So we have something to display
and
dc.identifier - Just because.

I wonder if there's any consensus that when you search an index service for items which may be found in a repository you should always be able to search on a field set named like this, and always be able to count on result records with agreed field names.

In terms of protocols, this would be:

Z3950 - Do we really want another attrset? repo-1? Probably the "Proper" way to go eg "@attr repo-1 1=1 PID" In terms of result records, we're looking at XML tags?
SRW/SRU - repo context set? eg "repo.pid=xxx" again, in terms of result records, we're looking at a namespace for repository properties xxx
SOLR - Just some pre-agreed fields in the schema for both indexing and retrieval?
Other protocols?

My feeling is that if we're bothered about cross-repository interoperability, then we need to agree how other services like index, search, retrieval and harvest refer, in an implementation-independent way, to the properties of a repository. I'll post something here shortly defining a CQL context set for repositories. In the interim, my SOLR fields relating to repository items are below (with a small query sketch after the list):

<field name="repository.item.pid" type="string" indexed="true" stored="true" required="false"/>
<field name="repository.item.url" type="string" indexed="false" stored="true" required="false"/>
<field name="repository.item.owner" type="string" indexed="true" stored="true" required="false" multiValued="true"/>
<field name="repository.item.deposit_user" type="string" indexed="true" stored="true" required="false" multiValued="true"/>
<field name="app.item.url" type="string" indexed="false" stored="true" required="false"/>
I'll think about this some more and update / rewrite in English if this means anything to anyone.

Monday 14 April 2008

I have this ASN.1 definition and this byte stream, can I use A2J?

I get about one email a week about this, and despite the fact that the docs are out there, for some reason people don't seem to be able to find them. So, if you have an ASN.1 definition file and need to encode/decode a byte stream from a device or some other source, here's how to do it with A2J.

1. Get A2J. There are a couple of options: you can download the source code from http://developer.k-int.com/svn/a2j/a2j_v2/trunk/ and build it yourself, or you can use Maven, which is the approach I'll discuss here. The a2j libraries are available from the public maven2 repositories, so there's no special download or setup; just add the following to the dependencies section of your project's pom file and the jar will be downloaded from one of the maven2 repositories:

<dependency>
  <groupId>org.jzkit</groupId>
  <artifactId>a2j</artifactId>
  <version>2.0.4</version>
</dependency>

2. You need to precompile the ASN.1 definition into codec classes. Use the following plugin configuration:


<plugin>
  <artifactId>maven-antrun-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>run</goal>
      </goals>
      <configuration>
        <tasks>
          <echo>Running ASN.1 Compilation - output to ${project.build.directory}/generated/main/java</echo>
          <!-- Precompile Z39.50: the ant tasks invoking the A2J precompiler on input_file
               were stripped by Blogger; see the example pom linked below for the full version -->
          <!-- Precompile Character Set Negotiation -->
        </tasks>
        <sourceRoot>${project.build.directory}/generated/main/java</sourceRoot>
      </configuration>
    </execution>
  </executions>
  <dependencies>
    <dependency>
      <groupId>com.sun</groupId>
      <artifactId>tools</artifactId>
      <version>1.4</version>
      <scope>system</scope>
      <systemPath>${java.home}/../lib/tools.jar</systemPath>
    </dependency>
  </dependencies>
</plugin>
Obviously, replace input_file with your input file, and the base package with whatever Java package you want to use. This will generate a load of Java stubs that can process input and output byte streams defined by the ASN.1 specification. An example pom can be found here: http://developer.k-int.com/svn/jzkit/jzkit3/trunk/jzkit_z3950_plugin/pom.xml

3. I want to read bytes from an input stream. Again, you can copy code from JZKit3, specifically http://developer.k-int.com/svn/jzkit/jzkit3/trunk/jzkit_z3950_plugin/src/main/java/org/jzkit/z3950/util/ZEndpoint.java, but here's the abbreviated version:

while (running) {
  try {
    log.debug("Waiting for data on input stream.....");

    // Decode the next PDU from the incoming BER-encoded byte stream.
    BERInputStream bds = new BERInputStream(incoming_data, charset_encoding, DEFAULT_BUFF_SIZE, reg);
    PDU_type pdu = null;
    pdu = (PDU_type) codec.serialize(bds, pdu, false, "PDU");

    log.debug("Notify observers");
    notifyAPDUEvent(pdu);

    log.debug("Yield to other threads....");
    yield();
  }
  catch (Exception e) {
    // See ZEndpoint.java for the full error handling; here we just log and stop.
    log.warn("Problem decoding incoming PDU", e);
    running = false;
  }
}

incoming_data is an InputStream, charset_encoding is a character set, and reg is the OID register that can be used to identify any externals / other OIDs appearing in the data.

Have fun!

Saturday 5 April 2008

Jackrabbit Repositories, DUT3 - Data Upload Tool and now Custprops-RDF

A blog post as much for myself, so I don't forget where I'm up to, as anything else, and also an update on building a repository on top of Apache Jackrabbit.

I'd got almost to the finish line with an Apache Jackrabbit-based repository, and was full of excitement about the JSR content repository specification, which (in kind words mode) seemed to parallel and predate much of the thinking currently being done in repositories. At the last step I wanted to take a few OAI sources and the MARC records from the OpenLibrary project and inject them into the repo to test the storage and event mechanisms and, by implication, the SOLR indexer which fires on submission events.

We've had a bit of software hanging around since the early days of the IT For Me project which we installed in local and regional authorities and libraries. This kit, dubbed the "Data Upload Tool", had loads of plugins for talking to Access databases, Excel spreadsheets, directories of documents, etc, which converted these heterogeneous data sources into a common schema and uploaded metadata, and possibly digital artifacts, to the IT For Me repository using a proprietary upload service. Since then we've had the emergence of content packaging, which we looked at for making the metadata/artifact uploading cleaner, and now SWORD, which we've used to replace the proprietary API. This means anyone with authentication can submit data to projects like IT For Me (in this case, for any authority to share Community Information and Service Records), or digitisation projects can submit records to the MLA-funded Peoples Network Discover service. This has been a great step forward for the Data Upload Tool, but when I tried to use it to harvest an OAI collection and then SWORD-publish the records into my new repository, I fell foul of workflow issues. DUT, great as it was, is too rigid. We needed something a bit more graphical, in the vein of a workflow engineering tool. So I've had a brief diversion to work on DUT3 - Data Upload Tool v3 - which is looking pretty neat.

BUT, whilst working on DUT3 it became apparent that there were so many objects that the time had come to bite the bullet and embed a database instead of storing all the data in props files. Hmmm... heterogeneous plugin configuration and storage? Well certainly the expressiveness of RDF is quite useful, but I like my relational databases.

This led to an evolution of our open source custprops library. Often with any software product there is the need to extend the base data model during the implementation phase: users need to be able to store their own widget number alongside an application object. Custprops takes an RDF-like model using URI-based property definitions and extends Hibernate objects with an extra properties map that can be used to define and store arbitrary additional data elements. The only problem now is that we have two APIs for getting at the data. Of course the logical next step was to map the bean properties onto URI properties. Now we have a system that lets you set properties on an object: if they are part of the standard relational model they get directed to the standard database tables; if not, they go down the custprops route. Collections of objects work in a similar vein, although known relations have to be constrained by the underlying database model. COOL! An extensible relational schema in the (limited) spirit of RDF. Hmm...
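A toy version of the idea looks something like this (class and property names invented for illustration; the real custprops library hangs this off Hibernate-mapped objects and persists the extra map to its own tables):

import java.util.HashMap;
import java.util.Map;

// Toy domain object: 'title' is part of the relational model, anything else
// gets routed into a URI-keyed custom-properties map.
public class ExtensibleRecord {

  private String title;
  private final Map<String, Object> custprops = new HashMap<String, Object>();

  public void setProperty(String uri, Object value) {
    if ("http://purl.org/dc/elements/1.1/title".equals(uri)) {
      // Known bean property -> normal relational column.
      this.title = (String) value;
    }
    else {
      // Unknown property -> the custprops route.
      custprops.put(uri, value);
    }
  }

  public Object getProperty(String uri) {
    if ("http://purl.org/dc/elements/1.1/title".equals(uri))
      return title;
    return custprops.get(uri);
  }
}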

OK, so we look at this development, see the recent RDF Test Cases, and think: why not write some test cases for this? See if we can load some ontologies into the schema and map them. The first test case we hit is this example:


<test:PositiveParserTest rdf:about="http://w3.example.org/test001">
  <test:status>APPROVED</test:status>
  <test:description>
    This is a simple positive parser test example.
  </test:description>
  <!-- input/output document references stripped by Blogger -->
  <test:warning>Some parsers may produce a warning when running this test</test:warning>
</test:PositiveParserTest>
All looks good... apart from that status=APPROVED element. That sucks. In our relational model we want some form of reference data table (either specific to status, or a shared status entity). I don't want to store the damn status string; that's just not in the spirit of the relational model. So it looks like the thing to do is to write into the mapping configuration a mechanism to try to resolve values to a related entity instead of storing the value itself. If we do it right, we can codify the use cases (create if not exists, error if not exists, etc) and do a similar job on the output end. Even better, it should be possible to write custom matchers, for example AACR name matching, to have a go at de-duplication. I'm not sure yet if we want to go so far as structures for storing possible matches and asking for user clean-up later on. Such functions would certainly be useful to projects like the TIG (Theatre Information Group) gateway.
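The resolution rules I have in mind would amount to something like this (names entirely hypothetical, just to pin down the use cases):

// Hypothetical policy for resolving a literal value (e.g. "APPROVED") to a
// reference-data row instead of storing the raw string.
enum ResolutionPolicy {
  CREATE_IF_MISSING,   // quietly add a new reference-data row and link to it
  ERROR_IF_MISSING,    // reject the incoming record if no matching row exists
  STORE_AS_LITERAL     // fall back to storing the plain string via custprops
}

// The mapping configuration would attach a policy to each property, and a
// resolver (possibly with a custom matcher, e.g. AACR name matching) does the work.
interface ValueResolver {
  Object resolve(String propertyUri, String literalValue, ResolutionPolicy policy);
}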

*end waffle* back to custprops.

Wednesday 19 March 2008

Apache Jackrabbit as an [Institutional|Cultural|Learning Object] Repository

Over the past months I've looked at DSpace and Fedora and played with both in a pretty serious way. The goal was to determine whether we could *really* use standard IR (Institutional Repo, as opposed to Information Retrieval) software to hold collections of IMS and IEEE-LOM (Learning Object Metadata) records and Peoples Network cultural records (PNDS Dublin Core Application Profile), as well as E20CL (Exploring 20th Century London). The real driver here was that it might be possible to dump our existing OAI code and just use existing solutions. The brick wall in all cases came for me when I tried to integrate the repository "blob" with our rich domain models for each schema type. Ideally, using repository workflow, I can pass these blobs on to domain-specific subsystems that can do real application work with the items. I gave up on integrating with DSpace and Fedora for LOM and cultural items (and now bibliographic resources too). In the end we created our own repository, which made it much easier to integrate with back-end domain models.

I first looked at Apache Jackrabbit and the JSR dealing with content repositories a year or so ago, and decided it wasn't mature enough. David Flanders' observation that content management companies were at the JISC OSS-Watch event and that IRs should "Watch Out" got me thinking about Jackrabbit again. The thing is, IMNSHO, current content management systems are as much vertical applications as IRs are. The trouble I had *usefully* getting non-IR resources into DSpace and Fedora (i.e. it's very doable at the proof-of-concept phase, but after that the 80/20 law quickly takes over) is going to be exactly the same issue content management providers have forcing the square peg of article prints into their web-site round hole. Of course everyone claims to have a "generic model", but they seldom are. In these days of rapid development, keeping a pure abstract model intact is difficult indeed.

Apache Jackrabbit turns this on its head a bit for me. Instead of being a vertical application trying to spread out horizontally into new domains, it's nothing but a horizontal service that's entirely domain-neutral. There's no danger of domain specifics creeping into the model, as there's no application to support directly, only repository services.

So, my new lunchtime project is to revisit Apache Jackrabbit. It looks a whole lot more usable than it did a year ago, and I think the question I want to answer is: can a horizontal tool like Jackrabbit have vertical OAI-PMH (superficially, Jackrabbit looks like it will fit the OAI-ORE model very tightly) and SRW/SRU services added to make it behave functionally well in the vertical sectors of institutional, cultural and learning object repositories? If so, Jackrabbit already has many of the features JISC CRIG is talking about IRs really needing (events, security, etc) and I suspect it could be a really worthwhile approach. Although the startup time won't be as fast as domain-specific tools, the developer resource and wealth of existing mature software give longer-term benefits.

Having said all that, getting started with Jackrabbit is a bit of a learning curve. The docs and samples seem to be geared to those wishing to improve the horizontal framework; what I needed was a vertical application developer guide for Jackrabbit. Over the next few days I'm going to try and invest my lunchtime play hour in documenting the application of Jackrabbit to a vertical domain, with specific emphasis on support for OAI-PMH and OAI-ORE. If you're interested, a maven2 pom file that has all the needed dependencies for my vertical test is here: http://developer.k-int.com/svn/default/sandbox/repo/jackrabbit/pom.xml and a sample unit test that creates a stand-alone repository is here: http://developer.k-int.com/svn/default/sandbox/repo/jackrabbit/src/test/java/com/k_int/repository/test/RepoTest.java Tomorrow's job is unpicking the core authentication mechanism and trying to get some objects in (I've got some LOM records, MARC records, Dublin Core, a PDF and some GIFs, so that's a good starter set I reckon).
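For anyone wanting to follow along, the guts of that unit test are roughly this (a minimal sketch using Jackrabbit's TransientRepository and the stock admin credentials):

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.jackrabbit.core.TransientRepository;

public class StandaloneRepoSketch {
  public static void main(String[] args) throws Exception {
    // TransientRepository starts Jackrabbit on first login and shuts it down
    // when the last session logs out - handy for unit tests.
    Repository repository = new TransientRepository();
    Session session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));
    try {
      // Drop a trivial node in, just to prove the repository is alive.
      Node record = session.getRootNode().addNode("testRecord");
      record.setProperty("title", "Hello Jackrabbit");
      session.save();
      System.out.println("Stored node at " + record.getPath());
    }
    finally {
      session.logout();
    }
  }
}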

Watch this space :)

Monday 4 February 2008

Looking forwards to OR08 April 1st and the Repository Challenge

Been thinking about OR08 and the Repository Challenge. Having been involved with a few code-fests in the past, I'm not too sure about the idea of throwing a load of stuff up in the air on the day and seeing which group of developers it lands on and what happens. In particular, questions of managing the teams, build processes and code sharing all raise practical issues. In the past with code-fests this was less of a problem, as usually projects were 90% their own code and the whole build tree could be shipped around. These days, with so many projects being 5% own code and 95% reuse of other components, the issues of configuration management are that much greater. And that's before we get onto required server infrastructure.

I reckon the people who are going to come out of the repository challenge well are likely to be those who go with a pretty well-defined goal and some infrastructure that lets developers hit the ground running on the actual problems, instead of worrying about the logistics of such an event.

Initially, the repository events API is the one that interests me. Having recently finished a SOLR-based indexing component, I needed to hard-wire it into our LOM repository component; it would have been great to be able to just subscribe the indexing service to the repository and let it run. So I'm thinking it might be worth doing a bit of pre-work on a repository events demonstrator, probably with a SOLR-based indexing component as a proof of concept. Anyone interested, drop me a line :).
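The back-of-a-napkin version of what I mean is just an observer hanging off the deposit event, something like this (names entirely hypothetical; this is the demonstrator I'd like to build, not existing code):

// Hypothetical repository event plug-point...
interface RepositoryEventListener {
  void onDeposit(String itemPid, String metadataXml);
}

// ...with a SOLR indexer as one subscriber. A real implementation would
// transform metadataXml into a Solr add document and POST it to the update
// handler (e.g. http://localhost:8983/solr/update) instead of just logging.
class SolrIndexingListener implements RepositoryEventListener {
  public void onDeposit(String itemPid, String metadataXml) {
    System.out.println("Would index deposited item " + itemPid);
  }
}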

MLA's cultural learning objects project - seamless SRW, OAI, Sword and LOM

In late 2007 MLA, the council for Museums, Libraries and Archives, commissioned a forward-looking project to support the re-purposing of cultural resources as learning objects. Their rationale was that MLA's extensive collection of cultural resources would be of value to the learning community. What follows is a brief description of how MLA's choice to use standards-based interfaces and open source components has allowed us to integrate several resources and technologies to rapidly provide a standards-based solution. Having attended the JISC CRIG Unconference, SWORD seemed to be the natural replacement for the legacy SOAP and FTP services which used to be used to upload LOM records to the Curriculum Online repository.

Overview

The solution essentially uses SRW to harvest descriptive records and unique identifiers from the Peoples Network Discover service (using the RESTful web service interface). These brief results are presented to the user. The user selects the records they wish to create LOM descriptions for and hits the download button. The tagging tool looks at each record and decides the best way to obtain the full original source record; in this case, submitting a unique identifier back to the source OAI service (as the Discover service only contains brief descriptive records). The source record is retrieved and an appropriate XSL transformation applied to convert it into the LOM schema, as fully populated as we can manage. The user then opens the file in the tagging tool proper, where validation forces them to manually complete the fields needed for the selected LOM profile. Finally, the user selects the upload option, which uses the SWORD client libraries to submit the LOM document to a LOM metadata repository.
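The middle step - pulling the full source record back over OAI and crosswalking it to LOM - is roughly the following; the OAI base URL and stylesheet name are placeholders rather than the real service details:

import java.io.File;
import java.net.URL;
import java.net.URLEncoder;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class OaiToLom {
  public static void main(String[] args) throws Exception {
    // Fetch the full source record back from the originating OAI-PMH service.
    String baseUrl = "http://example.org/oai";   // placeholder OAI base URL
    URL getRecord = new URL(baseUrl + "?verb=GetRecord&metadataPrefix=oai_dc&identifier="
        + URLEncoder.encode(args[0], "UTF-8"));

    // Crosswalk it into the LOM schema with the appropriate stylesheet,
    // as fully populated as the source allows.
    Transformer t = TransformerFactory.newInstance()
        .newTransformer(new StreamSource(new File("dc-to-lom.xsl")));   // placeholder XSLT
    t.transform(new StreamSource(getRecord.openStream()),
        new StreamResult(new File("record-lom.xml")));
  }
}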


#1 Tagging tool startup. The user is prompted for their SWORD username and password, along with other personal information that can be defaulted into the created LOM records.

 



#2 The tagging tool starts up and presents the user with an empty LOM document that can be used for authoring from scratch.
 


#3 User selects Discover from the main menu allowing them to search remote SRW repositories.
 



#4 Search results. The user is allowed to select search results for download and conversion into lom.
 









#5 The user edits the newly created LOM document. Missing mandatory fields are highlighted and defaults applied where possible. The user is supported in selecting controlled classification fields from managed ZThes files.


 



#6 Finally the user selects the upload menu option and the file is delivered to the configured repository using the SWORD client.

Knowledge Integration Ltd