2009-06-05

Moving to SOLR

We are currently running tests with SOLR to index the content of our Library Content Management System. At this moment the system relies on Oracle Text Retrieval. We do not feel this meets our demands, and we find it too tightly coupled to our storage solution, which we want to be database independent. We are planning an architectural description of our complete environment before we start developing, but of course I do have some ideas on which way to go.

To start with, we will leave our complete system as it is and add an extra index layer. This will allow us to gradually leave Oracle Text Retrieval behind and evolve the system instead of disrupting it.
The idea is to store the record id and a root table name after a record is successfully entered or updated. So imagine someone changes the title of a publication. The id and the table containing the bibliographic description are added to a list. The same id and table name will also be registered when a record pointing to it (for example a book item) is modified. This list has the form of a list of URLs, since every record in our Library CMS is uniquely identified by a URL.
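
To make this concrete, registering a change could be as simple as the sketch below. The base URL, the location of the list and the function name are made up for illustration; the important point is that the URL of every affected record ends up in the list.

    # Minimal sketch of registering a changed record in the update list.
    # BASE_URL and UPDATE_LIST are hypothetical; in our LCMS every record
    # already has a unique URL, so appending that URL is all we need.
    BASE_URL = "http://lcms.example.org"
    UPDATE_LIST = "/var/lcms/index-queue.txt"

    def register_update(table, record_id):
        """Append the URL of the record that has to be (re)indexed."""
        url = f"{BASE_URL}/{table}/{record_id}"
        with open(UPDATE_LIST, "a") as queue:
            queue.write(url + "\n")

    # Called after a successful insert or update, e.g. when a title changes,
    # and again when a dependent record (a book item) pointing to it is modified:
    # register_update("bibliographic", 123456)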

There will be a process reading this list and retrieving each URL. Each URL will return an XML file, which will be processed by an XSLT. This XSLT can enrich the XML via web services.
For example, our own web services to add information about the book items to the bibliographic description. We can also add OCLC numbers and LCC headings by consulting web services at both OCLC Pica and OCLC Worldcat.
Many other enrichments of the data may be done before it is offered to SOLR for indexing. This way we will be able to retrieve the records by keys that are not present in the actual record, but which are added just before indexing.
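
Roughly, that indexing process could look like the sketch below. The file names, the SOLR URL and the use of the lxml and requests libraries are my assumptions here; the enrichment itself lives in the stylesheet, which can call out to the web services mentioned above.

    # Rough sketch of the indexing process: read the list of URLs, fetch each
    # record as XML, run it through an enrichment XSLT and post it to SOLR.
    # File names, URLs and the stylesheet name are assumptions for illustration.
    import requests
    from lxml import etree

    UPDATE_LIST = "/var/lcms/index-queue.txt"                 # hypothetical queue file
    SOLR_UPDATE = "http://localhost:8983/solr/update"         # standard SOLR update handler
    enrich = etree.XSLT(etree.parse("enrich-for-solr.xslt"))  # hypothetical stylesheet

    def index_pending_records():
        with open(UPDATE_LIST) as queue:
            for url in (line.strip() for line in queue if line.strip()):
                record = etree.parse(url)      # every record is retrievable by its URL
                solr_doc = enrich(record)      # the XSLT can enrich via web services
                response = requests.post(
                    SOLR_UPDATE,
                    data=etree.tostring(solr_doc),
                    headers={"Content-Type": "text/xml"},
                )
                response.raise_for_status()
        # a final commit makes the new documents searchable
        requests.post(SOLR_UPDATE, data="<commit/>",
                      headers={"Content-Type": "text/xml"})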

This has great potential. We can for example check our SFX server for information and add a flag so we know beforehand that a publication is electronically available. We can check Worldcat and add better normalized author names, and lots of other things we might think of later.
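
Just to show how small such an enrichment step can be: a hypothetical SFX check might boil down to something like this (the SFX address, the request parameters and the field name are all made up, not the real SFX API).

    # Hypothetical enrichment step: ask the SFX server whether a publication is
    # available electronically and add a flag field before indexing.
    # The URL, parameters and the crude availability test are illustrations only.
    import requests

    def add_electronic_flag(solr_fields, issn):
        response = requests.get("http://sfx.example.org/lookup", params={"issn": issn})
        solr_fields["electronic_available"] = "fulltext" in response.text.lower()
        return solr_fields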

We will remain compatible with our current way of searching and presenting results.
We will store the same XML representation in the SOLR index as is stored in our CMS, so we can still use all the XSLTs that are currently used to present results.
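
Whether the XSLT builds it or a bit of code does, the SOLR add message would then carry the complete record XML in a stored field next to the ordinary search fields. A small sketch, with made-up field names:

    # Sketch of building a SOLR add message that keeps the complete CMS record
    # XML in a stored field next to the search fields. The field name
    # "record_xml" and the other field names are made up for illustration.
    from lxml import etree
    from lxml.builder import E

    def build_solr_add(record_xml, fields):
        doc = E.doc(E.field(record_xml, name="record_xml"),
                    *[E.field(value, name=name) for name, value in fields.items()])
        return etree.tostring(E.add(doc), pretty_print=True)

    # Example:
    # build_solr_add(original_xml_string,
    #                {"id": "http://lcms.example.org/bibliographic/123456",
    #                 "title": "Some title",
    #                 "author": "Some author"})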

We will also 'hide' the query interface of SOLR behind our current SRU-like interface. Of course we will have to add ways to make use of the new sorting and faceting features we will be able to work with.
This way we also keep our Z39.50 interface working, since it makes use of the same interface to search the LCMS.
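
As a first impression of what that facade might do: the translation to a SOLR request can stay quite thin. In the sketch below our own parameter names and the SOLR core URL are assumptions; q, start, rows, sort and the facet.* parameters are standard SOLR request parameters.

    # Minimal sketch of an SRU-like facade in front of SOLR: translate our
    # existing query parameters into a SOLR select request and expose the new
    # sorting and faceting features. Our parameter names are made up.
    import requests

    SOLR_SELECT = "http://localhost:8983/solr/select"

    def sru_like_search(query, start=1, maximum_records=10, sort=None, facet_fields=()):
        params = {
            "q": query,
            "start": start - 1,          # SRU counts records from 1, SOLR from 0
            "rows": maximum_records,
            "wt": "xml",                 # keep the XML response our XSLTs expect
        }
        if sort:
            params["sort"] = sort        # e.g. "year desc"
        if facet_fields:
            params["facet"] = "true"
            params["facet.field"] = list(facet_fields)
        return requests.get(SOLR_SELECT, params=params).text

    # e.g. sru_like_search("title:solr", sort="year desc", facet_fields=["author", "year"])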

Well, these are our first ideas, so we will do a lot of testing and rethinking, I suppose. I just wanted to share them with the readers of this blog, so anyone can jump in with even better ideas.
I hope to share some more results with you at the end of this summer. I am really excited about this new journey we are taking.

1 comment:

Mads Villadsen said...

Disclaimer: I am the technical project lead for the Summa project.

Have you considered looking at the Summa project for this?

Like Solr, Summa is a search engine built on top of Lucene, but the advantage I can see for using Summa in this instance is that it comes with a lot of support for creating workflows like the ones you are planning.

In particular it is quite easy to configure ingest of data from a variety of sources, and add extra information to them or manipulate them in other ways using what we call filters.

In fact I would go so far as to say that supporting workflows such as yours is precisely what we designed the workflow part of Summa for. We use it ourselves at the State and University Library in a wide variety of ways.

So have a look, and if you have any questions, comments, or just want to know more feel free to contact me directly or post a question to the Summa mailing list.

Regards
Mads Villadsen (mv@statsbiblioteket.dk)