Saturday, 5 April 2008

Jackrabbit Repositories, DUT3 - Data Upload Tool and now Custprops-RDF

A blog post as much for myself so I don't forget where I'm up to as anything else, and also an upadte on building a repository on top of apache Jackrabbit.

I'd got almost to the finish line with an Apache Jackrabbit based repository, and was full of excitement about the JSR repository specification, which (In kind words mode) seemed to parallel and predate much of the thinking currently being done in repositories. At the last step, I wanted to take a few OAI sources and the Marc records from the OpenLibrary project and inject them into the repo to test the storage and event mechanisms, and by implication, test the SOLR indexer which fires on submission events.

We've had a bit of software hanging around since the early days of the IT For Me project which we installed in local and regional authorities and libraries. This kit, dubbed the "Data Upload Tool" had loads of plugins for talking to access database, Excel spreadsheets, directories of documents, etc which converted these heterogeneous data sources into a common schema and uploaded metadata, and possibly digital artifacts to the IT For Me Repository using a proprietary upload service. Since then, we've had the emergence of content packaging, which we looked at for making the metadata/artifacts uploading cleaner, and now SWORD which we've used to replace the proprietary API. This means anyone with authentication can submit data to projects like IT For Me (In this case, for any authority to share Community Information and Service Records) or digitisation projects to submit records to the MLA funded peoples network Discover service. This has been a great step forward for the Data Upload Tool but when I tried to use data upload tool to harvest an OIA collection, and the SWORD publish the records into my new repository, I fell foul of the workflow issues. DUT, great as it was, is too rigid. We needed something a bit more graphical in the vein of a workflow engineering tool. So I've had a brief diversion to work on DUT3 - Data Upload Tool v3 which is looking pretty neat.

BUT, whilst working on DUT3 it became apparent that there were so many objects the time had come to bite the bullet and embed a database instead of storing all the data in props files. Hmmm.. Heterogeneous plugin configuration and storage? Well certainly, the expressiveness of RDF is quite useful, but I like my relational databases.

This led to an evolution of our opensource custprops library. Often with any software product there is the need to be able to extend the base datamodel in the implementation phase. Users need to be able to store their own widget number along side an application object. Custprops takes an RDF-like model using URI based property definitions and extends hibernate objects with an extra properties map that can be used to define and store arbitrary additional data elements. The only problem now is that we have 2 api's for getting at the data. Of course the logical next step was to map the bean properties on to URI properties. Now we have a system that lets you set properties on an object and if they are a part of the standard relational model, they get directed to the standard database tables, if not, they go down the custprops route. Collections of objects work in a similar vein, although known relations have to be constrained by the underlying database model. COOL! an extensible relational schema in (limited) spirit of RDF. Hmm..

OK, so we look at this development and see the recent RDF Test Cases why not write some test cases for this? See if we can load some ontologies into the schema and map them. First test case we hit this example:

test:PositiveParserTest rdf:about="">


This is a simple positive parser test example.

Some parsers may produce a warning when running this test

All looks good.. apart from that status=APPROVED element. That sucks. In our relational model, we want some form of reference data table (Either specific to status, or a shared status entity). I don't want to store the damn string status, thats just not in the spirit of the relational model. So, it looks like the thing to do is to write into the mapping configuration a mechanism to try and resolve values to a related entity instead of storing the value itself. If we do it right, we can codify the use cases (Create if non exists, error if non exists, etc) and do a similar job on the output end. Even better, it should be possible to write, for example custom matchers like AACR name matching to have a go at de-duplication. I'm not sure yet if we want to go so far as structures for storing possible matches and asking for user clean-up later on. Such functions would certainly be useful to projects like the TIG (Theater Information Group) gateway.

*end waffle* back to custprops.


Knowledge Integration Ltd