Friday, 25 April 2008

Convention (Profiles?) over Configuration for Repository Interoperability

*Brain dump warning*

As noted elsewhere, I've been playing with Apache Jackrabbit to implement a Institutional Repository (like) system for general use. Initially, we're using it to host records created from an aggregation of UK Theatre Web, the Royal Opera House, Shakespeare Birthplace Trust and the National Theatre. Each performance, Artist, Contributor, Venue, Work is cross referenced and used to augment any existing records in order to create a harmonised historic record of performances in the UK. Theatre researchers want and need this augmented raw data (In XML and Marc format) but also want to search and browse. The bulk of the matching and deduplication work is done in a pretty hefty relational database structure and the Repository really does become a place to store and reference reusable artifacts.

So, we ended up playing with a Jackrabbit based repository instead of fedora or dspace because we already have some pretty sophistocated search tools, and our database is the authoritative source of the data. Our repository is something that sits on the boundary of our system, as a public interface to the raw data instead of our application. It only needs to do one simple job. In deployment terms, we have a tomcat instance hosting a DB driven application, our repository app, a search index app and an OAI app. The data import process orchestrates transactions between these web services. So, after our workflow process has updated the database (Giving us a URL for a web based app page referencing the item), copies of the generated artifacts have been pushed up to the repository (Giving us a URL where the data can be obtained) we push the update up to our indexing service so the whole lot can be searched.... Which gives rise to the initial question:

Back in the old days of Z39.50 we'd profile something like this and create a repository attribute and tag set. This means that anyone wanting to search "A repository" could be sure they didn't need to worry about whos implementation it was. In a solr environment for example, should there be a set of shared fields that all repositories support, I've started off with - The persistent identifier of the repo item (I chose the fedora-like naming just to avoid making something new).
repository.item.url The URL of the item in a specific repo
application.item.url - URL of a user-interface / application for the resolved item

and then for good measure

dc.title - So we have something to display
dc.itentifier - Just because.

I wonder if there's any consensus that when you search an index service for items which may be found in a repository you should always be able to search on a field set named like this, and always be able to count on result records with agreed field names.

In terms of protocols, this would be:

Z3950 - Do we really want another attrset? repo-1? Probably the "Proper" way to go eg "@attr repo-1 1=1 PID" In terms of result records, we're looking at XML tags?
SRW/SRU - repo context set? eg "" again, in terms of result records, we're looking at a namespace for repository properties xxx
SOLR - Just some pre-agreed fields in the schema for both indexing and retrieval?
Other protocols?

My feeling is that if we're bothered about cross repository interoperability, then how other services like index,search,retrieval,harvest refer in an implementation independent way to the properties of a repository. I'll post something here shortly defining a CQL context set for repositories. In the interim, my SOLR fields relating to repository items are:

field name="" type="string" indexed="true" stored="true" required="false"
field name="repository.item.url" type="string" indexed="false" stored="true" required="false"
field name="repository.item.owner" type="string" indexed="true" stored="true" required="false" multivalued="true"
field name="repository.item.deposit_user" type="string" indexed="true" stored="true" required="false" multivalued="true"
field name="app.item.url" type="string" indexed="false" stored="true" required="false"
I'll think about this some more and update / rewrite in english if this means anything to anyone


Knowledge Integration Ltd