Ian's work blog: CRIG

Sunday, 8 June 2008

JISC CRIG / IEDemonstrator BarCamp - Controlled Vocabs

Beautiful sunny morning here in Sheffield, and all seems well with the world. It's taken me a day to recover from the traveling (mostly), but now that I'm feeling vaguely human again, it's time to write about bits of the CRIG / IEDemonstrator day.

Controlled Vocabularies / Terminology Services

Had a great discussion about this with the Names, HILT and STAR projects. Everyone showed a sample of the kind of vocab service they are working with, and the pattern of a pretty web app fronting a web-service back-end was pretty much the de facto standard. K-Int's interest really centers around the work we are doing with Vocman in the learning sector (see screen-shot). Although the lexaurus suite isn't tied to any particular metadata scheme or representation, we have worked almost exclusively with ZThes to date. After talking with these projects it seems critical that we write the SKOS adapters for import and export sooner rather than later, so that's something I'm going to push for ASAP in the vocman development plan. Hopefully, that will add another SRU searchable terminology service to the IE.

Our small prototyping group was tasked with working out how vocabulary services could be used WRT repositories. We talked around many use cases: improved metadata creation and validation on submission (this works great for both subject headings and name authority services like NAMES), improved precision for searchers, and better current awareness and dissemination services, achieved by allowing subscribers to follow a single controlled term and have that term translated into whatever subject scheme is in use at a given repo. The issue here is that none of this works without the initial effort of improved metadata, so (keeping in mind Paul's closing comment about not getting too hung up on the metadata) we decided to focus on ways of improving the metadata attached to deposited artifacts.

One of our group (I'm really sorry, memory has failed me, but please comment if it was you) discussed ways they have managed to put an external metadata editing page behind a repository submission page, through use of proxies. Thus, the repository is kept un-polluted by the metadata editing app, but the presentation of a form is transparent to the depositor. So our final paper prototype extended the deposit service by adding a response parameter: the URL at which the metadata for an item could be edited. This editing environment would be pre-configured to use external vocabulary services and assist the user in selecting such terms. The tool could then post back the metadata using some repository specific adapter, for example adding a Datastream to a fedora object using the rest service, or auto publishing into an indexing service such as Zebra.
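As a rough sketch of that paper prototype (in Python, purely for illustration): the deposit response carries one extra value, the URL of the external metadata editor, and the editor posts back through a pluggable adapter. Every name here (DepositResponse, metadata_edit_url, the adapter names and the example.org URLs) is an assumption made up for the sketch, not part of any existing deposit API, and the adapters only describe what a real one would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List
from urllib.parse import quote


@dataclass
class DepositResponse:
    item_id: str
    item_url: str
    metadata_edit_url: str  # hypothetical extra response parameter


# Registry of repository specific "post back" adapters (hypothetical names).
ADAPTERS: Dict[str, Callable[[str, dict], str]] = {}


def adapter(name: str):
    """Decorator that registers a post-back adapter under a name."""
    def register(fn):
        ADAPTERS[name] = fn
        return fn
    return register


@adapter("fedora-datastream")
def post_to_fedora(item_id: str, metadata: dict) -> str:
    # A real adapter would call the repository's REST service here.
    return f"add datastream to {item_id} with {len(metadata)} fields"


@adapter("index")
def post_to_index(item_id: str, metadata: dict) -> str:
    # A real adapter would publish the record to an indexing service.
    return f"index {item_id} with {len(metadata)} fields"


def deposit(item_id: str, repo_base: str, editor_base: str) -> DepositResponse:
    """Pretend deposit: returns the item URL plus the URL of the editor."""
    safe_id = quote(item_id, safe="")
    return DepositResponse(
        item_id=item_id,
        item_url=f"{repo_base}/items/{safe_id}",
        metadata_edit_url=f"{editor_base}/edit?item={safe_id}",
    )


def publish(item_id: str, metadata: dict, targets: List[str]) -> List[str]:
    """Send the edited metadata through each configured adapter."""
    return [ADAPTERS[target](item_id, metadata) for target in targets]
```

The point of the registry is that the editor never needs to know which repository it is talking to; swapping fedora for something else means registering another adapter.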

One interesting side note is that we ran into the old content dis-aggregation issues again a little when talking about how we can improve the metadata attached to a packaged item.

At k-int we've long discussed the need to take the Tagging Tool and turn it into a web application for editing metadata records using controlled vocab sources and then publishing those records using a pluggable system of adapters. The Controlled Vocab conversations have made me look at this in a new light, and I think it's about time we got to hacking something out. One for next weekend perhaps!

Friday, 25 April 2008

Convention (Profiles?) over Configuration for Repository Interoperability

*Brain dump warning*

As noted elsewhere, I've been playing with Apache Jackrabbit to implement an Institutional Repository (like) system for general use. Initially, we're using it to host records created from an aggregation of UK Theatre Web, the Royal Opera House, Shakespeare Birthplace Trust and the National Theatre. Each Performance, Artist, Contributor, Venue and Work is cross referenced and used to augment any existing records in order to create a harmonised historic record of performances in the UK. Theatre researchers want and need this augmented raw data (in XML and Marc format) but also want to search and browse. The bulk of the matching and deduplication work is done in a pretty hefty relational database structure, and the Repository really does become a place to store and reference reusable artifacts.

So, we ended up playing with a Jackrabbit based repository instead of fedora or dspace because we already have some pretty sophisticated search tools, and our database is the authoritative source of the data. Our repository is something that sits on the boundary of our system, as a public interface to the raw data instead of our application. It only needs to do one simple job. In deployment terms, we have a tomcat instance hosting a DB driven application, our repository app, a search index app and an OAI app. The data import process orchestrates transactions between these web services. So, after our workflow process has updated the database (giving us a URL for a web based app page referencing the item) and copies of the generated artifacts have been pushed up to the repository (giving us a URL where the data can be obtained), we push the update up to our indexing service so the whole lot can be searched.... Which gives rise to the initial question:
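The ordering of that import workflow can be sketched in a few lines of Python. This is only an illustration of the three steps described above, not our actual import code: the function name, the record layout and the example.org URLs are all assumptions, and the three services are passed in as plain callables so that stand-ins can be used.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def import_record(
    record: Record,
    update_database: Callable[[Record], str],
    push_to_repository: Callable[[Record], str],
    push_to_index: Callable[[Record], None],
) -> Record:
    """Run the three steps in order and return the document sent to the index."""
    app_url = update_database(record)      # 1. database update gives the app page URL
    repo_url = push_to_repository(record)  # 2. repository push gives the raw data URL
    index_doc = {
        "repository.item.pid": record["pid"],
        "repository.item.url": repo_url,
        "application.item.url": app_url,
        "dc.title": record["title"],
    }
    push_to_index(index_doc)               # 3. index update makes the item searchable
    return index_doc


# Stand-in services so the sketch runs without any real web services.
indexed: List[Record] = []
doc = import_record(
    {"pid": "demo:1", "title": "Hamlet"},
    update_database=lambda r: "http://app.example.org/performance/" + r["pid"],
    push_to_repository=lambda r: "http://repo.example.org/items/" + r["pid"],
    push_to_index=indexed.append,
)
```

The index document only exists once both URLs are known, which is why the index update has to come last.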

Back in the old days of Z39.50 we'd profile something like this and create a repository attribute and tag set. This means that anyone wanting to search "A repository" could be sure they didn't need to worry about whose implementation it was. In a solr environment, for example, should there be a set of shared fields that all repositories support? I've started off with

repository.item.pid - The persistent identifier of the repo item (I chose the fedora-like naming just to avoid making something new).
repository.item.url - The URL of the item in a specific repo
application.item.url - URL of a user-interface / application for the resolved item

and then for good measure

dc.title - So we have something to display
and
dc.identifier - Just because.

I wonder if there's any consensus that when you search an index service for items which may be found in a repository you should always be able to search on a field set named like this, and always be able to count on result records with agreed field names.

In terms of protocols, this would be:

Z3950 - Do we really want another attrset? repo-1? Probably the "Proper" way to go eg "@attr repo-1 1=1 PID" In terms of result records, we're looking at XML tags?
SRW/SRU - repo context set? eg "repo.pid=xxx" again, in terms of result records, we're looking at a namespace for repository properties xxx
SOLR - Just some pre-agreed fields in the schema for both indexing and retrieval?
Other protocols?
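To make the comparison concrete, here is a hedged Python sketch of the same "find by PID" search expressed for each protocol listed above. The repo-1 attribute set and the repo context set do not exist yet; they are the proposals from this post, and the base URLs are placeholders. The SRU parameters (operation, version, query, maximumRecords) and the SOLR q and fl parameters are the standard ones.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def pqf_query(pid: str) -> str:
    """Z39.50 query in PQF, using the proposed (hypothetical) repo-1 attribute set."""
    return f'@attr repo-1 1=1 "{pid}"'


def sru_url(base_url: str, pid: str) -> str:
    """SRU searchRetrieve URL using the proposed (hypothetical) repo context set."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'repo.pid="{pid}"',
        "maximumRecords": "10",
    }
    return base_url + "?" + urlencode(params)


def solr_url(base_url: str, pid: str) -> str:
    """SOLR select URL that searches and returns the pre-agreed field names."""
    params = {
        "q": f'repository.item.pid:"{pid}"',
        "fl": "repository.item.pid,repository.item.url,application.item.url,dc.title",
    }
    return base_url + "/select?" + urlencode(params)


def query_params(url: str) -> dict:
    """Decode the query string again, handy for checking what was sent."""
    return parse_qs(urlparse(url).query)
```

For example, query_params(sru_url("http://example.org/sru", "demo:1"))["query"] gives back the CQL string repo.pid="demo:1", whatever the implementation behind the endpoint.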

My feeling is that if we're bothered about cross repository interoperability, then we need to agree how other services (index, search, retrieval, harvest) refer in an implementation independent way to the properties of a repository. I'll post something here shortly defining a CQL context set for repositories. In the interim, my SOLR fields relating to repository items are:

<field name="repository.item.pid" type="string" indexed="true" stored="true" required="false" />
<field name="repository.item.url" type="string" indexed="false" stored="true" required="false" />
<field name="repository.item.owner" type="string" indexed="true" stored="true" required="false" multiValued="true" />
<field name="repository.item.deposit_user" type="string" indexed="true" stored="true" required="false" multiValued="true" />
<field name="app.item.url" type="string" indexed="false" stored="true" required="false" />

I'll think about this some more and update / rewrite in English if this means anything to anyone.
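For anyone wanting to try these fields out, a minimal Python sketch of building a SOLR XML update message for one repository item might look like this. The add/doc/field structure is the standard SOLR XML update format; the function name and the sample values are made up for illustration, and multi-valued fields are simply repeated.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

FieldValue = Union[str, List[str]]


def solr_add_xml(doc: Dict[str, FieldValue]) -> str:
    """Build a SOLR <add><doc>...</doc></add> message from a dict of fields."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # A list becomes one <field> element per value (multi-valued field).
        values = value if isinstance(value, list) else [value]
        for item in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = item
    return ET.tostring(add, encoding="unicode")


# Sample values only; the PID and URLs are placeholders.
message = solr_add_xml({
    "repository.item.pid": "demo:1",
    "repository.item.url": "http://repo.example.org/items/demo:1",
    "repository.item.owner": ["ian", "theatre-project"],
    "app.item.url": "http://app.example.org/performance/demo:1",
})
```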

Wednesday, 19 March 2008

Apache Jackrabbit as an [Institutional|Cultural|Learning Object] Repository

Over the past months I've looked at dspace and fedora and played with both in a pretty serious way. The goal was to determine if we could *really* use standard IR (Institutional Repo, as opposed to Information Retrieval) software to hold collections of IMS and ieee-LOM (Learning Object Metadata) records, and Peoples Network Cultural records (PNDS-Dublin Core Application Profile) as well as the E20CL (Exploring 20th Century London). The real driver here was that it might be possible to dump our existing OAI code and just use existing solutions. The brick wall in all cases came for me when I tried to integrate the repository "Blob" with our rich domain models for each schema type. Ideally, using repository workflow, I can pass on these blobs to domain specific subsystems that can do real application work with the items. I gave up on integrating with dspace and fedora for LOM and cultural items (and now bibliographic resources too). In the end, we created our own repository, which made it much easier to integrate with backend domain models.

I first looked at apache jackrabbit and the JSR dealing with content repositories a year or so ago, and decided it wasn't mature enough. David Flanders' observation that Content Management companies were at the JISC OSS-Watch event and that IR's should "Watch Out" got me thinking about jackrabbit again. Thing is, IMNSHO, current content management systems are as much vertical applications as IR's are. The trouble I had *usefully* getting non-IR resources into dspace and fedora (i.e. it's very doable at the proof-of-concept phase, but after that the 80/20 law quickly takes over) is going to be exactly the same issue content management providers have forcing the square peg of article prints into their web-site round hole. Of course everyone claims to have a "Generic Model" but it seldom is. In these days of rapid development, keeping a pure abstract model intact is difficult indeed.

Apache jackrabbit turns this on its head a bit for me. Instead of being a vertical application trying to spread out horizontally into new domains, it's nothing but a horizontal service that's entirely domain neutral. There's no danger of domain specifics creeping into the model, as there's no application to support directly, only repository services.

So, my new lunchtime project is to re-visit Apache Jackrabbit. It looks a whole lot more usable than it did a year ago, and I think the question I want to answer is: can a horizontal tool like jackrabbit have vertical OAI-PMH and SRW/SRU services added to make it behave functionally well in the vertical sectors of Institutional, Cultural and Learning Object repositories? (Superficially, to me, Jackrabbit looks like it will fit the OAI-ORE model very tightly.) If so, Jackrabbit already has many of the features Jisc CRIG is talking about IR's really needing (Events, Security, etc) and I suspect it could be a really worthwhile approach. Although the startup time won't be as fast as with domain specific tools, the developer resource and wealth of existing mature software give longer term benefits.

Having said all that, getting started with jackrabbit is a bit of a curve. The docs and samples seem to be geared to those wishing to improve the horizontal framework. What I needed was a vertical application developer guide for jackrabbit. Over the next few days I'm going to try and invest my lunchtime play hour in documenting the application of jackrabbit to a vertical domain, with specific emphasis on support for OAI-PMH and OAI-ORE. If you're interested, a maven2 pom file that has all the needed dependencies for my vertical test is here: http://developer.k-int.com/svn/default/sandbox/repo/jackrabbit/pom.xml and a sample unit test that creates a stand alone repository is here: http://developer.k-int.com/svn/default/sandbox/repo/jackrabbit/src/test/java/com/k_int/repository/test/RepoTest.java

Tomorrow's job is unpicking the core authentication mechanism and trying to get some objects in (I've got some LOM records, Marc records, Dublin Core, a pdf and some gifs, so that's a good starter set I reckon).

Watch this space :)

Monday, 4 February 2008

Looking forwards to OR08 April 1st and the Repository Challenge

Been thinking about OR08 and the Repository Challenge. Having been involved with a few code-fests in the past, I'm not too sure about the idea of throwing a load of stuff up in the air on the day and seeing which group of developers it lands on and what happens. In particular, questions of managing the teams, build processes and code sharing all raise practical issues. In the past with code-fests this was less of a problem, as usually projects were 90% their own code and the whole build tree could be shipped around. These days, with so many projects being 5% own code and 95% reuse of other components, the issues of configuration management are that much greater. And that's before we get onto required server infrastructure.

I reckon the people who are going to come out of the repository challenge best are likely to be those who go with a pretty well defined goal and some infrastructure that lets developers hit the ground running, working on the actual problems instead of worrying about the logistics of such an event.

Initially, the repository events api is one that interests me. I recently finished a solr based indexing component, but needed to hard-wire it into our lom repository component. It would have been great to be able to just subscribe the indexing service to the repository and let it run. So I'm thinking it might be worth doing a bit of pre-work on a repository events demonstrator, probably with a SOLR based indexing component as a proof of concept. Anyone interested drop me a line :).
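The "subscribe the indexing service to the repository" idea is essentially publish/subscribe. A minimal in-memory Python sketch, assuming made-up event names ("deposit", "delete") and with an in-memory dict standing in for the SOLR index, could look like this. None of these names come from an existing repository events API.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Item = Dict[str, str]
Handler = Callable[[Item], None]


class RepositoryEvents:
    """Tiny event hub: the repository publishes, other services subscribe."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, item: Item) -> None:
        # Deliver the event to every handler registered for this type.
        for handler in self._subscribers[event_type]:
            handler(item)


class InMemoryIndex:
    """Stand-in for an indexing component; keeps documents in a dict."""

    def __init__(self) -> None:
        self.docs: Dict[str, Item] = {}

    def on_deposit(self, item: Item) -> None:
        self.docs[item["pid"]] = item

    def on_delete(self, item: Item) -> None:
        self.docs.pop(item["pid"], None)


events = RepositoryEvents()
index = InMemoryIndex()
events.subscribe("deposit", index.on_deposit)
events.subscribe("delete", index.on_delete)
```

After that wiring the repository only ever calls events.publish(...), and the index keeps itself up to date without being hard-wired into the repository component.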

MLA's cultural learning objects project - seamless SRW, OAI, Sword and LOM

In late 2007 MLA, the council for Museums, Libraries and Archives, commissioned a forward looking project to support the re-purposing of cultural resources as learning objects, their rationale being that the MLA's extensive collection of cultural resources was of value to the learning community. What follows is a brief description of how MLA's choice of standards based interfaces and open source components has allowed us to integrate several resources and technologies to rapidly provide a standards based solution. After we attended the JISC CRIG Unconference, SWORD seemed to be the natural replacement for the legacy soap and ftp services which used to be used to upload LOM records to the curriculum online repository.

Overview

The solution essentially uses SRW to harvest descriptive records and unique identifiers from the Peoples network discover service (using the restful web service interface). These brief results are presented to the user. The user selects the records they wish to create LOM descriptions for and hits the download button. The tagging tool looks at each record and decides the best way to obtain the full original source record. In this case, that means submitting a unique identifier back to the source OAI service (as the discover service only contains brief descriptive records). The source record is retrieved and an appropriate XSL transformation applied to convert it into the lom schema, as fully populated as we can manage. The user then opens the file in the tagging tool proper, where validation forces them to manually complete the fields needed for the selected lom profile. Finally, the user selects the upload option, which uses the SWORD client libraries to submit the LOM document to a LOM metadata repository.

#1 Tagging tool startup. The user is prompted for their sword username and password, along with other personal information that can be defaulted in to the created LOM records.

#2 The tagging tool starts up and presents the user with an empty LOM document that can be used for authoring from scratch.

#3 User selects Discover from the main menu allowing them to search remote SRW repositories.

#4 Search results. The user is allowed to select search results for download and conversion into lom.

#5 The user edits the newly created lom document. Missing mandatory fields are highlighted and defaults applied where possible. The user is supported in selecting controlled classification fields from managed ZThes files.

#6 Finally the user selects the upload menu option and the file is delivered to the configured repository using the sword client.
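Two of the steps above are easy to sketch in Python: fetching the full source record from the OAI service by identifier, and the validation that flags missing mandatory fields before upload. This is an illustration only, not the tagging tool's code. The OAI-PMH GetRecord parameters (verb, identifier, metadataPrefix) are the standard ones; the base URL, the identifier and the list of mandatory fields are assumptions, and a real lom profile would define its own.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlencode

# Hypothetical mandatory fields; a real lom profile defines its own list.
MANDATORY_FIELDS = ("title", "description", "language")


def oai_get_record_url(base_url: str, identifier: str,
                       metadata_prefix: str = "oai_dc") -> str:
    """URL that asks an OAI-PMH service for one full record by identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return base_url + "?" + urlencode(params)


def missing_fields(lom: Dict[str, str],
                   mandatory: Iterable[str] = MANDATORY_FIELDS) -> List[str]:
    """Names of mandatory fields that are absent or blank in the draft record."""
    return [name for name in mandatory if not str(lom.get(name, "")).strip()]


def ready_for_upload(lom: Dict[str, str]) -> bool:
    """A draft may only be handed to the upload step once nothing is missing."""
    return not missing_fields(lom)
```

A draft converted by the XSL step with only a title filled in would report ["description", "language"] as missing, which is what the editor would highlight for the user.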

Wednesday, 12 December 2007

JISC CRIG #2 - An Undiscovered Scenario?

I'm pretty sure one of the agendas it was hoped I would push at the CRIG Unconference was the libraries / search one. More specifically, the scenario "I'm a librarian, and I want to see results from the institutional repository in my OPAC". There are tons of variations on this one, but it boils down to the exposing of repository items in a way that is compatible with existing search services. I never really made it as far as putting that on a sheet of paper, mostly because I was trying to engage in other discussions and arrive at a common point where we could discuss this. Some of the barriers to discussing this use case.....

1) The word repository... it has at least two different meanings just in terms of being "A container you can put stuff in". It can be metadata, digital items, content packages, etc. We started from the useful perspective of "It doesn't matter, it's still just a repository" but from the perspective of an OPAC, it certainly does matter to me. It matters even more if from all scenarios you want to be able to provide 1-click access to the actual resource. Perhaps if it doesn't matter what kind of repository it is, we need to be more specific about the classes of item we can put in a repository.

2) Metadata, packages, disaggregation... The "What sort of stuff is in a repository" issue starts to raise fundamental questions about content disaggregation. If we think of the base item, for example a PDF of a paper, as being the actual item we want to give people access to, then we need to ask how a specific metadata record becomes attached to that item: is it via a content package, or via loosely coupled URI references? I'm a programmer and I like loose coupling. The upshot of loose coupling, however, is that our opac really needs to access metadata repositories that can point at content repositories. Repositories of content packages can submit their metadata components for indexing, but that metadata is typically poor.

3) What word do we use for the thing that takes the metadata records and builds and maintains a searchable index? At times the word repository has been used by different communities, but that's right out now I reckon. Search index has specific meaning in the IR community, so that's out. I've heard "Searchable Repository", but I would seriously have to suggest that treating search as a repository service muddies the already murky waters. There is certainly a need for this indexing component that takes content-metadata records (whatever the source) and points at content repositories containing the actual item.

4) SOLR is great but..... when SOLR people talk about federated search they mean federated amongst SOLR instances. There are already well established protocols for remote search. I really wish the SOLR people would quit trying to create a new de facto standard "SOLR search via URI and XML presentation" and adopt one of the more standard ones. I'd personally love it to be SRW, but OpenSearch would be good too. I tried to engage the SOLR community and offered to work on the SRW adapter, but there was absolutely no interest. Whilst I appreciate the "Just do it" nature of open source, it was incredibly hard to gain traction on this. One of the reasons for this is that SOLR has lots of really nice extensions for hit-highlighting and results categorisation that aren't present in other protocols. But many search services don't support those features. That's one of the harder things about federated search, but just using a proprietary protocol is putting off the problem until, as we say, your librarian simply wants to Z39.50 or SRW cross search the Institutional repository content alongside the catalog holdings.

Anyway, all that aside, I just want there to be a way to see an SRW or Z39.50 service which will supply me appropriately profiled metadata records pointing me at the actual resource (Physical or Electronic). I guess that's a challenge. The SOLR SRW project seems like an important one to me in this context; maybe the time has come to try and revive it. A good first step would certainly be the "Explain function" for repositories, which would give us a way to profile standard metadata schemes and access points against the ad-hoc indexes and metadata schemes we find in most SOLR indexes.
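At its simplest, that profiling step is a lookup table from standard access points to whatever fields a particular SOLR index happens to use. A hedged Python sketch: the SOLR field names (title_t, author_s) are invented examples of ad-hoc naming, repo.pid is the hypothetical context set index discussed in an earlier post, and the function names are mine. A real Explain response would be an XML document; here it is reduced to a list of supported access points.

```python
from typing import Dict, List

# Hypothetical profile: standard access point -> ad-hoc SOLR field name.
INDEX_MAP: Dict[str, str] = {
    "dc.title": "title_t",
    "dc.creator": "author_s",
    "repo.pid": "repository.item.pid",
}


def explain(index_map: Dict[str, str] = INDEX_MAP) -> List[str]:
    """Access points this service is prepared to answer, in a stable order."""
    return sorted(index_map)


def cql_term_to_solr(index: str, term: str,
                     index_map: Dict[str, str] = INDEX_MAP) -> str:
    """Translate one 'index = term' clause into a SOLR field query."""
    if index not in index_map:
        raise KeyError(f"unsupported access point: {index}")
    # Escape backslashes and double quotes so the term stays one phrase.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'{index_map[index]}:"{escaped}"'
```

A client that has read the Explain output can then ask for dc.title without ever knowing that this particular index calls the field title_t.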

So, some use cases:

There is a PDF on "The effects of IR spectrum light on bacterial growth" in a content repository, and it has a descriptive entry in a metadata repository. How do I enable an actor typing an appropriate query into their opac to see a descriptive record of sufficient quality to enable them to judge it of interest and retrieve the digital item? (Let's leave aside appropriate copy and authentication for now.)

There's an item in a well structured web-site containing an animated gif of a transverse wave. For lecturers teaching sound engineering, the resource is considered an ideal supplement for wave-mechanics lectures. How do we enable actors to find and use this resource? Actually, this leads into a second more interesting scenario.. that of the old JISC concept of "Search landscapes". Does our "Sound engineering student" want their search landscape to be "The Opac, The Institutional Repository, my course repository (containing more specific materials), and the repositories of my lecturers X, Y and Z"? How do we enable the actor to locate and select the searchable indexes (and the current awareness feeds for that matter)? In this case of course, there is a metadata record, but no item in a content repository.

Finally, there's the case of a known item in a content repository with no metadata description. I think this UC basically just pulls in the auto metadata creation UC and we go from there?

Well that was a bit cathartic, and way too long but hey.

Monday, 10 December 2007

JISC CRIG Unconference #1 - The Meta Stuff

Just back from the JISC CRIG Unconference

Wow!

I think that about covers it. I have to completely take my hat off to the CRIG Support team for the sheer bravery in innovation they've shown with the unconference approach. The event was hugely fun, even if quite draining, and I think practice for all involved can only improve the outputs of these kinds of events in the future. JISC should take the support team and give them a huge pat on the back for their work here.

From a personal perspective, I found the unconference process entirely charming. As someone who quite seriously studied Stafford Beer and Organisational Cybernetics, experiencing the unconference was like living in the pages of Beyond Dispute. What follows is more for my own benefit and memory, but may be of interest I suppose. The result of what I know as the "problem jostle" generated some variety, although a few soap-boxes did seem to skew the work. The team did a great job of assembling enough variety of backgrounds to try and get some emergent activity. I think there needed to be a little more attenuation and coordination as a result of the initial problem jostle; there were terms that needed to be harmonized and some topics that I think probably needed to be toned down. To this end, I think it might have been fun to have a "System 2" (in terms of the VSM) board somewhere in the room, a board where we could scribble common definitions and other coordinative activity. Again, one slight problem with the coordinative activity is that it all seemed to take place in the heads of the facilitators, which is sort of what you want when freeing the participants to think about their areas, but it did raise the slight spectre of agenda setting. In part I think this is a danger of the facilitators having expert domain knowledge. I'm not at all complaining, though; I think the team did a great job. The balloons, apparently functioning as some kind of parasympathetic channel, didn't really work, I don't think. I can see where it might work in the US, but there were too many British sensitivities preventing them being useful. Actually, this is a pity because there was a need for some mechanism like this. I think if the participants had had more time to gel before the actual event, it might have been less of an issue (then again, it might have been more of an issue). Overall, the outputs seemed to be quite rich, although there didn't seem to be quite as much new variety as I expected.

Well that's enough meta-conference for now, on to the details....
