Friday 25 April 2008

Convention (Profiles?) over Configuration for Repository Interoperability

*Brain dump warning*

As noted elsewhere, I've been playing with Apache Jackrabbit to implement an Institutional Repository-like system for general use. Initially, we're using it to host records created from an aggregation of UK Theatre Web, the Royal Opera House, the Shakespeare Birthplace Trust and the National Theatre. Each Performance, Artist, Contributor, Venue and Work is cross-referenced and used to augment any existing records in order to create a harmonised historic record of performances in the UK. Theatre researchers want and need this augmented raw data (in XML and MARC format) but also want to search and browse. The bulk of the matching and deduplication work is done in a pretty hefty relational database structure, and the repository really does become a place to store and reference reusable artifacts.

So, we ended up playing with a Jackrabbit-based repository instead of Fedora or DSpace because we already have some pretty sophisticated search tools, and our database is the authoritative source of the data. Our repository is something that sits on the boundary of our system, as a public interface to the raw data rather than to our application. It only needs to do one simple job. In deployment terms, we have a Tomcat instance hosting a DB-driven application, our repository app, a search index app and an OAI app. The data import process orchestrates transactions between these web services. So, after our workflow process has updated the database (giving us a URL for a web-based app page referencing the item) and copies of the generated artifacts have been pushed up to the repository (giving us a URL where the data can be obtained), we push the update up to our indexing service so the whole lot can be searched... which gives rise to the initial question:

Back in the old days of Z39.50 we'd profile something like this and create a repository attribute and tag set. This meant that anyone wanting to search "a repository" could be sure they didn't need to worry about whose implementation it was. In a Solr environment, for example, should there be a set of shared fields that all repositories support? I've started off with:

repository.item.pid - the persistent identifier of the repo item (I chose the Fedora-like naming just to avoid making something new)
repository.item.url - the URL of the item in a specific repo
application.item.url - the URL of a user-interface / application page for the resolved item

and then for good measure

dc.title - So we have something to display
and
dc.identifier - Just because.

I wonder if there's any consensus that when you search an index service for items which may be found in a repository you should always be able to search on a field set named like this, and always be able to count on result records with agreed field names.

In terms of protocols, this would be:

Z39.50 - do we really want another attrset, say repo-1? That's probably the "proper" way to go, e.g. "@attr repo-1 1=1 PID". In terms of result records, we're looking at XML tags?
SRW/SRU - a repo context set, e.g. "repo.pid=xxx"? Again, in terms of result records, we're looking at a namespace for repository properties.
SOLR - just some pre-agreed fields in the schema for both indexing and retrieval?
Other protocols? (A side-by-side example for the first three is sketched below.)
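
For instance, the same hypothetical lookup of an item by its persistent identifier (the PID value and the attribute/index names are illustrative only, not an agreed profile) might look like this in each of the three:

Z39.50 (PQF): @attr repo-1 1=1 "tig-performance-00001"
SRU (CQL):    repo.pid="tig-performance-00001"
SOLR:         q=repository.item.pid:"tig-performance-00001"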

My feeling is that if we're bothered about cross-repository interoperability, then we need to agree how other services (index, search, retrieval, harvest) refer, in an implementation-independent way, to the properties of a repository. I'll post something here shortly defining a CQL context set for repositories. In the interim, my SOLR fields relating to repository items are as follows (with a short usage sketch after the list):

field name="repository.item.pid" type="string" indexed="true" stored="true" required="false"
field name="repository.item.url" type="string" indexed="false" stored="true" required="false"
field name="repository.item.owner" type="string" indexed="true" stored="true" required="false" multivalued="true"
field name="repository.item.deposit_user" type="string" indexed="true" stored="true" required="false" multivalued="true"
field name="app.item.url" type="string" indexed="false" stored="true" required="false"
I'll think about this some more and update / rewrite in English if this means anything to anyone.

Monday 14 April 2008

I have this ASN.1 definition and this byte stream, can I use A2J?

I get about 1 email per week about this, and despite the fact that the docs are out there, for some reason people don't seem to be able to find them. So, if you have an ASN.1 definition file and need to encode/decode a byte stream from a device or some other source, here's how to do it with A2J.

1. Get A2J. There are a couple of options: you can download the source code from http://developer.k-int.com/svn/a2j/a2j_v2/trunk/ and build it yourself, or you can use Maven, which is the approach I'll discuss here. The A2J libraries are available from the public Maven 2 repositories, so there's no special download or setup; just add the following to the dependencies section of your project's pom.xml and the jar will be downloaded from one of the Maven 2 repositories:

<dependency>
  <groupId>org.jzkit</groupId>
  <artifactId>a2j</artifactId>
  <version>2.0.4</version>
</dependency>

2. You need to precompile the ASN.1 definition into codec classes. Use the following plugin:


<plugin>
  <artifactId>maven-antrun-plugin</artifactId>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>run</goal>
      </goals>
      <configuration>
        <tasks>
          <echo>Running ASN.1 Compilation - output to ${project.build.directory}/generated/main/java</echo>
          <!-- Precompile Z39.50: a <java> task that runs the A2J compiler over your
               input_file with your base package (see the example pom linked below
               for the exact invocation) -->
          <!-- Precompile Character Set Negotiation: a second <java> task, same pattern -->
        </tasks>
        <sourceRoot>${project.build.directory}/generated/main/java</sourceRoot>
      </configuration>
    </execution>
  </executions>
  <dependencies>
    <dependency>
      <groupId>com.sun</groupId>
      <artifactId>tools</artifactId>
      <version>1.4</version>
      <scope>system</scope>
      <systemPath>${java.home}/../lib/tools.jar</systemPath>
    </dependency>
  </dependencies>
</plugin>

Obviously, replace input_file with your input file, and the base package with whatever Java package you want to use. This will generate a load of Java stubs that can process input and output byte streams defined by the ASN.1 specification. An example pom can be found here: http://developer.k-int.com/svn/jzkit/jzkit3/trunk/jzkit_z3950_plugin/pom.xml

3. I want to read bytes from an input stream. Again, you can copy code from jzkit3, specifically http://developer.k-int.com/svn/jzkit/jzkit3/trunk/jzkit_z3950_plugin/src/main/java/org/jzkit/z3950/util/ZEndpoint.java, but here's the abbreviated version:

while (running) {
  try {
    log.debug("Waiting for data on input stream.....");
    // Wrap the incoming stream in a BER decoder
    BERInputStream bds = new BERInputStream(incoming_data, charset_encoding, DEFAULT_BUFF_SIZE, reg);
    // Decode the next PDU from the stream using the generated codec
    PDU_type pdu = null;
    pdu = (PDU_type) codec.serialize(bds, pdu, false, "PDU");
    log.debug("Notify observers");

    notifyAPDUEvent(pdu);

    log.debug("Yield to other threads....");
    yield();
  }
  catch (Exception e) {
    log.error("Problem decoding PDU", e);
    running = false;
  }
}

incoming_data is an input stream, charset_encoding is a character set, and reg is the OID register that can be used to identify any externals / other OIDs appearing in the data.

Have fun!

Saturday 5 April 2008

Jackrabbit Repositories, DUT3 - Data Upload Tool and now Custprops-RDF

A blog post as much for myself, so I don't forget where I'm up to, as anything else, and also an update on building a repository on top of Apache Jackrabbit.

I'd got almost to the finish line with an Apache Jackrabbit-based repository, and was full of excitement about the JSR-170 repository specification, which (in kind words mode) seemed to parallel and predate much of the thinking currently being done in repositories. At the last step, I wanted to take a few OAI sources and the MARC records from the OpenLibrary project and inject them into the repo to test the storage and event mechanisms and, by implication, the SOLR indexer which fires on submission events.

We've had a bit of software hanging around since the early days of the IT For Me project, which we installed in local and regional authorities and libraries. This kit, dubbed the "Data Upload Tool", had loads of plugins for talking to Access databases, Excel spreadsheets, directories of documents, etc., which converted these heterogeneous data sources into a common schema and uploaded metadata, and possibly digital artifacts, to the IT For Me repository using a proprietary upload service. Since then we've had the emergence of content packaging, which we looked at for making the metadata/artifact uploading cleaner, and now SWORD, which we've used to replace the proprietary API. This means anyone with authentication can submit data to projects like IT For Me (in this case, for any authority to share Community Information and Service Records), or digitisation projects can submit records to the MLA-funded People's Network Discover service. This has been a great step forward for the Data Upload Tool, but when I tried to use it to harvest an OAI collection and then SWORD-publish the records into my new repository, I fell foul of its workflow issues. DUT, great as it was, is too rigid. We needed something a bit more graphical, in the vein of a workflow engineering tool. So I've had a brief diversion to work on DUT3 - Data Upload Tool v3 - which is looking pretty neat.

BUT, whilst working on DUT3, it became apparent that there were so many objects that the time had come to bite the bullet and embed a database instead of storing all the data in props files. Hmmm... heterogeneous plugin configuration and storage? Well certainly, the expressiveness of RDF is quite useful, but I like my relational databases.

This led to an evolution of our open source custprops library. Often with any software product there is a need to extend the base datamodel in the implementation phase: users need to be able to store their own widget number alongside an application object. Custprops takes an RDF-like model using URI-based property definitions and extends Hibernate objects with an extra properties map that can be used to define and store arbitrary additional data elements. The only problem now is that we have two APIs for getting at the data. Of course the logical next step was to map the bean properties onto URI properties. Now we have a system that lets you set properties on an object: if they are part of the standard relational model, they get directed to the standard database tables; if not, they go down the custprops route. Collections of objects work in a similar vein, although known relations have to be constrained by the underlying database model. COOL! An extensible relational schema in the (limited) spirit of RDF. Hmm...
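
As a sketch of the routing idea (this is illustrative pseudo-custprops, not the real library API or its configuration), properties addressed by URI are directed to a mapped bean property when one exists and into a free-form properties map otherwise:

import java.util.HashMap;
import java.util.Map;

public class ExtensibleRecord {

  // URI -> bean property mapping (would normally come from configuration)
  private static final Map<String, String> MAPPED = new HashMap<String, String>();
  static { MAPPED.put("http://purl.org/dc/elements/1.1/title", "title"); }

  private String title;                                                          // part of the relational model
  private final Map<String, Object> custprops = new HashMap<String, Object>();   // everything else

  public void setProperty(String uri, Object value) {
    if ("title".equals(MAPPED.get(uri))) {
      title = (String) value;       // goes to the normal database column
    } else {
      custprops.put(uri, value);    // goes down the custprops route
    }
  }

  public Object getProperty(String uri) {
    if ("title".equals(MAPPED.get(uri)))
      return title;
    return custprops.get(uri);
  }
}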

OK, so we look at this development, see the recent RDF Test Cases, and think: why not write some test cases for this? See if we can load some ontologies into the schema and map them. In the first test case we hit this example:


<test:PositiveParserTest rdf:about="http://w3.example.org/test001">
  <test:status>APPROVED</test:status>
  <test:description>This is a simple positive parser test example.</test:description>
  <!-- input / output document references omitted -->
  <test:warning>Some parsers may produce a warning when running this test</test:warning>
</test:PositiveParserTest>

All looks good... apart from that status=APPROVED element. That sucks. In our relational model we want some form of reference data table (either specific to status, or a shared status entity). I don't want to store the damn string status; that's just not in the spirit of the relational model. So it looks like the thing to do is to write into the mapping configuration a mechanism to try to resolve values to a related entity instead of storing the value itself. If we do it right, we can codify the use cases (create if not exists, error if not exists, etc.) and do a similar job on the output end. Even better, it should be possible to write custom matchers, for example AACR name matching, to have a go at de-duplication. I'm not sure yet if we want to go so far as structures for storing possible matches and asking for user clean-up later on. Such functions would certainly be useful to projects like the TIG (Theatre Information Group) gateway.
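
A rough sketch of how that resolution step might be codified - class, policy and table names are hypothetical, not current custprops code:

import java.util.HashMap;
import java.util.Map;

public class StatusResolver {

  public enum Policy { CREATE_IF_MISSING, ERROR_IF_MISSING }

  // Stands in for the reference data table (value -> row id)
  private final Map<String, Long> statusTable = new HashMap<String, Long>();
  private long nextId = 1;

  /** Resolve an incoming literal like "APPROVED" to a reference-data entity id. */
  public Long resolve(String value, Policy policy) {
    Long id = statusTable.get(value);
    if (id == null) {
      if (policy == Policy.ERROR_IF_MISSING)
        throw new IllegalArgumentException("No reference data entry for " + value);
      // CREATE_IF_MISSING: add the new entry and use it from now on
      id = nextId++;
      statusTable.put(value, id);
    }
    return id;
  }
}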

*end waffle* back to custprops.

Knowledge Integration Ltd