Ian's work blog: SOLR

Sunday, 13 July 2008

New version of JZkit Proxy Server

An updated version of the JZKit Proxy server (With an example configuration exposing a number of SOLR backends as a z3950 target) is available for download Here

Since the initial test at the Bath CRIG barcamp we've improved the error handling and diagnostics, improved the mapping of Z3950 database names, made the attribute mapping more configurable (The out out the box sample config is pretty much Bath profile level 1 OK), improved the structure==year mappings for date search and done some pretty hefty load testing which revealed a thread leak in certain situations (It's fixed now).

Please download, play and get in touch with any questions / comments.

Ian.

Tuesday, 27 May 2008

Exposing SOLR service(s) as a Z3950 server

JZKit is a pretty large toolkit for developers of search services to embed in their own systems, and it's not always easy to get to grips with ;). Partly thats because if you're using JZKit you're probably already dealing with the Z39.50 specifications along with a host of other concerns.

What developers need are simple starter apps that they can use to hit the ground running. In JZKit3 we've decided to try and address this by putting up some sample configurations of the tool that do useful stuff out of the box. First up.. Making a SOLR server visible as a Z39.50 Server using an easy to change XML config file. Why do that? Well lots of reasons, the most common one at the moment is that SOLR is being used to provide search interfaces into lots of interesting new content, not least of which are the whole new breed of digital object repository projects like Fedora and DSpace. What seems to keep coming back around is institutional librarians saying "But I want that content to be available along-side everything else".

Kewl, a problem we can do something about. For now, the gateway distro lives in our maven repository Here. If you fancy setting up a Z39.50 server to proxy for a local SOLR server, download the compressed tar file from the above URL and unpack it.

Configuration.

After unpacking, look in etc/JZKitConfig.xml you'll see a number of elements which define the SOLR services we want to search. (Actually, you can define all kinds of searchable resources here, not just SOLR, but other Z39.50 targets, SRU/SRW, OpenSearch, etc but thats for another day). You should be able to see the top points at a public SOLR server, adjust the URL to your local repository. The other elements inside the Repository element control the mapping of Z39.50 use attributes onto the SOLR search language and what record elements are requested as a part of search results. The code attribute specified what Z39.50 Database name this resource will be made available under. Having adjusted the config to point at a local server, cd ../bin and run "jzkit_service.sh start" this will give you a running Z3950 server, by default running on port 2100.

From here, it should be plain sailing, here's the output of a yaz client session:

(N.B.) XML Markup in the result record is being filtered out by blogger. Actual result record is XML (Actually, in this case it's just the SOLR element)

yaz-client tcp:@:2100
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID : 174
Name : JZkit generic server / JZKit Meta Search Service
Version: 3.0.0-SNAPSHOT
Options: search present delSet triggerResourceCtrl scan sort extendedServices namedResultSets negotiationModel
Elapsed: 0.006948
Z> base Test
Z> find @attr 1=4 Dell
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 1, setno 1
records returned: 0
Elapsed: 0.002231
Z> format xml
Z> show 1
Sent presentRequest (1+1).
Records: 1
[coll]Record type: XML
3007WFPDell Widescreen UltraSharp 3007WFP6
nextResultSetPosition = 2
Elapsed: 0.008594
Z> quit

Other components of the config file can be used to control the XML records returned, to convert SOLR records into MARC or to manage meta searching through the gateway, but thats a subject for next time.

Happy meta-searching.

Wednesday, 12 December 2007

JISC CRIG #2 - An Undiscovered Scenario?

I'm pretty sure one of the agendas it was hoped I would push at the CRIG Unconference was the libraries / search one. More specifically, the scenario "I'm a librarian, and I want to see results from the institutional repository in my OPAC". There are tons of variations on this one, but it boils down to the exposing of repository items in a way that is compatible with existing search services. I never really made it as far as putting that on a sheet of paper, mostly because I was trying to engage in other discussions and arrive at a common point where we could discuss this. Some of the barriers to discussing this use case.....

1) The word repository... it has at least two different meanings just in terms of being "A container you can put stuff in". It can be metadata, digitial items, content packages, etc. We started from the useful perspective of "It doesn't matter, it's still just a repository" but from the perspective of an OPAC, it certainly does matter to me. It matters even more if from all scenarios you want to be able to provide 1-click access to the actual resource. Perhaps if it doesn't matter what kind of repository it is, we need to be more specific about the classes of item we can put in a repository.

2) Metadata, packages, disaggregation... The "What sort of stuff is in a repository" issue starts to raise fundamental questions about content disaggregation. If we think of the base item, for example a PDF of a paper, as being the actual item we want to give people access to, then we need to ask, how does a specific metadata record become attached to that item, is it via a content package, or via loose coupling URI references. I'm a programmer and I like loose coupling. The upshot of loose coupling however is that our opac really needs to access metadata repositories that can point at content repositories. Repositories of content packages can submit their metadata components for indexing, but that metadata is typically poor.

3) What word do we use for the thing that takes the metadata records, builds and maintains a searchable index. At times the word repository has been used by different communities, but thats right out now I reckon. Search index has specific meaning in the IR community, so thats out. I've heard "Searchable Repository" but I think that muddies the water. (Search is a repository service, I would seriously have to suggest muddies the already murky waters). There is certainly a need for this indexing component that takes content-metadata records (Whatever the source) and points at content repositories containing the actual item.

4) SOLR is great but..... when SOLR people talk about federated search they mean federated amongst SOLR instances. There are already well established protocols for remote search. I really wish the SOLR people would quit trying to create a new defacto standard "SOLR search via URI and XML presentation" and adopt one of the more standard ones. I'd personally love it to be SRW but OpenSearch would be good too. I tried to engage the SOLR community and offered to work on the SRW adapter but there was absolutely no interest. Whilst I appreciate the "Just do it" nature of open source, it was incredibly hard to gain traction on this. One of the reasons for this is that SOLR has lots of really nice extensions for hit-highlighting and results categorisation that aren't present in other protocols. But many search services don't support those features. Thats one of the harder things about federated search, but just using a proprietary protocol is putting off the problem until, as we say, your librarian simply wants to Z39.50 or SRW cross search the Institutional repository content alongside the catalog holdings.

Anyway, all that aside, I just want there to be a way to see an SRW or Z39.50 service which will supply me appropriately profiled metadata records pointing me at the actual resource (Physical or Electronic). I guess thats a challenge. The SOLR SRW project seems like an important one to me in this context, maybe the time has come to try and revive it. A good first step would certainly be the "Explain function" for repositories, which would give us a way to profile standard metadata schemes and access points against the ad-hoc indexes and metadata schemes we find in most SOLR indexes.

So, some use cases:

There is a PDF on "The effects of IR spectrum light on bacterial growth" in a content repository, it has a descriptive entry in a metadata repository, how do I enable an actor typing an appropriate query into their opac to see a the descriptive record of sufficient quality to enable them to judge it of interest and retrieve the digital item (Lets leave aside appropriate copy and authentication for now).

There's an item in a well structured web-site containing an animated gif of a transverse wave. For lecturers teaching sound engineering, the resource is considered an ideal supplement for wave-mechanics lectures. How do we enable actors to find and use this resource. Actually, this leads into a second more interesting scenario.. that of the old JISC concept of "Search landscapes". Does our "Sound engineering student" want their search landscape to be "The Opac, The Institutional Repository, my course repository (Containing more specific materials), and my lecturers X,Y and Z repository. How do we enable the actor to locate and select the searchable indexes (And the current awareness feeds for that matter). In this case of course, there is a metadata record, but no item in a content repository.

Finally, there's the case of a known item in a content repoisitory with no metadata description? I think this UC basically just pulls in the auto metadata creation UC and we go from there?

Well that was a bit cathartic, and way too long but hey.

Ian's work blog

Blog Archive