WebQuery @ Wageningen UR

2010-11-15

The future of libraries. The end of the library as we know it

Recently I visited the largest national library congress in the Netherlands. The new chairman of the Dutch Federation of professional librarians, Michael Wesseling, pointed out that we have moved from an era where access to information was scarce to an era where information is abundantly present. Libraries have to move from gathering information just in case to providing information just in time. The library has an advantage over Google in pointing people to relevant information, since libraries are organisations that have a reputation for transparency. Google relies on PageRank, Google's best-kept secret.

Even though I think he is absolutely right, I am convinced that leading people to relevant information is not the future role of libraries. Libraries, especially scholarly libraries, can teach people information literacy, but we have to stop thinking that libraries are the leading organisations in helping people find the information they need in a global environment.
I think libraries have to change their view of the world 180 degrees. Libraries should no longer aim to be the portal to the world of information for a local audience. Libraries should try to be the portal to local information for a global audience.

A university and research library should focus on making the local production of research output known to the world. This change is already taking place at these institutions. University libraries are quite often involved in organizing local repositories of publications. More and more are active in supporting their users with research data repositories. Some university libraries, like ours, are even involved in unlocking research project information and research staff and skills.
Our most recent involvement concerns the indexing of local data, using Solr and ManifoldCF to create a university enterprise search solution.

This approach may seem obvious for university libraries, but what about national and public libraries? The Koninklijke Bibliotheek was very successful in digitizing its vast collection of Dutch newspapers. Unlocking their unique collections to the world should be a very important goal for national libraries.
Public libraries might put some effort into informing 'the world' about local activities. And wouldn't it be nice if public libraries got involved in unlocking all these badly indexed resources of local governmental bodies?

2009-10-08

Why are we changing our search architecture?

This post was originally written in Dutch. Why? I felt the need to inform my colleagues in Wageningen, especially those in the library, a bit more on this issue. Sorry.

I could of course write an internal memo, but a blog post gives me more opportunities to link to examples and is also accessible to other interested readers.

We are now starting to develop a new search architecture based on Solr. Over the past months we have been thinking about how we want to go about it, but since 1 October we have a new young colleague, Joost van Ingen, whom we will keep free for this project for the time being.

At the moment we index the XML in which we store all our information with an Oracle product, SQL text retrieval. We are not that unhappy with this product, but there is not much development going on in it. It does not support a number of things we would like to have, and it is very tightly coupled to the information stored in the Oracle database. That is why we are switching to different indexing software (Lucene) and a different way of indexing, based on Solr.

So what will be possible then? I will mention a number of things, not necessarily in order of importance.

Paging through large result lists. In the current situation you cannot page properly through a set of more than a few hundred results. After about six pages it really becomes too slow, and your paging starts to slow down other people's searches. This is usually not much of a problem, because you can narrow your search set and people usually do not page beyond a few pages anyway. It does cause problems, however, for harvesters such as Google and OAI harvesters, which do want to keep paging in order to index everything. That is why we do not allow these web crawlers to page beyond 600 results. This limit does not apply to OAI harvesters, but there we have to do special things to make paging work. In a Solr/Lucene-based collection, such as the one in Ghent, this is not a problem.

Sorting the complete result set by title, year of publication or something else. In the current situation that would take a great deal of effort, for every sorting wish there is. That is why we currently sort only the results within each page. As a consolation we return fairly large result pages, from a hundred up to a thousand titles per page. That is a rather heavy load for both the system and your browser. Most search systems return only about 10 titles per page.

Sorting the search result by relevance. This is possible now, but in a collection of metadata it does not give a very satisfying result. Oracle uses traditional techniques from the world of full-text retrieval for this. Metadata records in a result set then differ very little in relevance. Solr by default uses the same kind of techniques to determine relevance, but it offers the possibility of devising new relevance algorithms yourself. We could, for example, give frequently borrowed books a higher relevance, make journals with a higher impact factor more relevant, or make publications by prominent Wageningen authors more important. Mind you, these are more or less fortunate examples, each of which invites its own reservations. I therefore ask you to use your creativity and come up with brilliant solutions for this.
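To make the idea of a custom relevance boost a little more concrete, here is a minimal sketch in Python of how a query could be sent to Solr's dismax query parser with an additive boost function. The field names (title, author, subject, loan_count) are purely hypothetical, as is the choice of a logarithmic boost; our real index may look quite different.

```python
from urllib.parse import urlencode


def build_boosted_query(user_query, rows=10):
    """Build the query string for a Solr dismax request with a loan-count boost.

    All field names are assumptions made for this illustration only.
    """
    params = {
        "q": user_query,
        "defType": "dismax",             # dismax query parser
        "qf": "title author subject",    # hypothetical fields to search in
        "bf": "log(sum(loan_count,1))",  # additive boost on a hypothetical loan counter
        "rows": rows,
    }
    return urlencode(params)


print(build_boosted_query("water management"))
```

The logarithm is used so that a book borrowed a thousand times does not completely drown out the text match. Whether that is the right balance is exactly the kind of thing we would have to test.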

At the moment we offer the possibility to search in different fields and to limit those search results by certain characteristics. Various studies show that users are somewhat intimidated by these complex search screens. During the courses we give to students it turns out that searching with Boolean operators, and the set theory that goes with it, is not something people simply master. In addition, when limiting a search you can often choose values of characteristics that do not occur in the retrieved set at all. You can, for instance, search for all publications with 'voedselkwaliteit in de Betuwe' (food quality in the Betuwe) in the title, limited to titles published in Bulgarian. Or everything about water management in the category weed control. People are used to Google. You have one box, you type something in it, and from that search result you search further. That would be a waste of all the metadata we collect, so we want to use it for 'faceting' the search result. You then get the possibility to limit on certain metadata characteristics, but only on values that occur in the retrieved result set. Solr supports this kind of faceting, as the Ghent catalogue shows, for example. We see this kind of facility more often in combination with Lucene-based indexes, for example in Discover at TU Delft.
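The principle behind faceting can be shown in a few lines of Python. This is only a sketch of the idea, not of how Solr does it internally: the counts are computed over the retrieved set only, so a value such as 'Bulgarian' is simply never offered when no record in the set has it. The record layout (a dict with a list of values per field) is an assumption for the example.

```python
from collections import Counter


def facet_counts(records, field):
    """Count the values of one metadata field over the retrieved records only."""
    counts = Counter()
    for record in records:
        for value in record.get(field, []):
            counts[value] += 1
    # Most frequent values first, as facet lists are usually presented.
    return counts.most_common()


found = [
    {"title": "A", "language": ["Dutch"]},
    {"title": "B", "language": ["Dutch", "English"]},
    {"title": "C", "language": ["English"]},
    {"title": "D", "language": ["Dutch"]},
]
print(facet_counts(found, "language"))
```

In Solr itself this is requested with the facet=true and facet.field parameters, and the counting happens inside the index rather than in application code.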

More content to search on. We may also want to index the full text, so that the content of the publications can be searched as well. There is a problem, however, when you search a collection of publications of which only part is available in full text. The other records then drop out of the result sets. So before we do that, we first want to enrich the records in other ways before they are indexed. At the moment we do a lot of record enrichment, but we do it at presentation time. At that moment we fetch covers, tables of contents of journals where available, abstracts, the meaning of classification codes, etc. This means that this information cannot be used for finding those titles when searching. In the new search architecture a large part of the enrichment will take place before the title is indexed. That gives us the possibility to index that extra information as well.

Searching across multiple collections at once. At the moment we are able to search simultaneously across all types of title descriptions we collect (articles, journals, books, web resources, etc.). We do not offer this to our users, however. In the distant past it turned out that people then assumed that all articles from all journals could be found in the catalogue. By now users are much more familiar with heterogeneous files, and through its facets Solr at least offers the possibility to show which types of title descriptions make up the search result. It will also become possible to index the differing descriptions of Wageningen publications from Metis in one and the same index. That will give us many duplicate titles in the search result, so if we want to do this we will have to think carefully about how to deal with them. We can also consider including other data collections in the index, for example Wageningen research projects or other interesting data collections that come our way. We can even consider indexing still larger external collections, as the Delft University Library does in Discover, where Elsevier journal articles are indexed as well.

In short, our new search architecture will offer us many new possibilities.

We will also keep our old way of searching. Our old indexes will be the only indexes that continue to be updated in 'real time' when we make changes. Updating the powerful Lucene indexes is a time-consuming job. We will do this asynchronously. That is to say, modified records will be collected into groups that are indexed by Solr in batch. At the moment it is unworkable to update the Solr index at the moment a record is saved. First of all, the wait after a record change would be unacceptably long. In addition, updating the index also has a negative effect on search speed. Solr makes heavy use of caching, keeping the indexes permanently in memory, which makes searching very fast. When we update the index, Solr's caching will no longer be valid. The index has been updated on disk and Solr will have to read the index again and load it into memory, which affects search speed enormously. How to deal with this asynchronicity is something we also have to work out in the coming period.
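The batching described above could look roughly like the following Python sketch. It is only meant to illustrate the idea: the class name, the batch size and the send_batch callback (which would post the records to Solr and commit once) are all hypothetical, and a real version would also need a timer so that a half-filled batch does not wait forever.

```python
class IndexQueue:
    """Collect changed record URLs and hand them over in batches."""

    def __init__(self, send_batch, batch_size=100):
        self._send_batch = send_batch  # hypothetical callback that indexes one batch
        self._batch_size = batch_size
        self._pending = {}             # a dict keeps order and removes duplicates

    def record_changed(self, url):
        """Register a changed record; flush when the batch is full."""
        self._pending[url] = None
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send everything collected so far as one batch."""
        if not self._pending:
            return
        batch = list(self._pending)
        self._pending.clear()
        self._send_batch(batch)
```

A record that is changed three times in a row ends up in the batch only once, so the index, and with it Solr's cache, is invalidated once per batch instead of once per save.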

2009-06-16

Using Worldcat services in the library catalog

For the first time we have implemented features into our library catalog that are 'invisible' to most of our library users.

I was inspired to build in these features during my participation in the Worldcat mashathon in Amsterdam on 13-14 May 2009. It was easy to build them in and I hope we serve an unexpected audience this way.

So what are these features ?

If you have ever tried searching Worldcat you have probably noticed that you can see which libraries hold the title you have just found. At the mashathon I learned that there is an API that gives you a list of all the libraries holding copies of a title identified by a specific OCLC number. In our library catalog we do not hold the OCLC number, but as I described in a previous post, we can address an OCLC Pica service to get the OCLC number belonging to the Dutch PPN, which we do hold.

This allowed me to look up all libraries holding copies of a title in our catalog and show them in our full record display. Well, this is not really true. Worldcat will only return the libraries in your 'neighborhood'. This is a pretty large area that is defined by the IP address you are using the service from.

OK, it worked, but I had two problems with it. Firstly. Why would we bother our users with other libraries when we hold the title ourselves ? And secondly. The service always returned libraries in the Wageningen area, since the service was addressed from our server, based in Wageningen. Even though this is a pretty large area, showing not only Dutch libraries, but also German and British libraries and sometimes even libraries on the east coast of Canada and the USA, this is not what you want.

Fortunately, the service allows you to provide an IP address in the request, which makes it ignore our server's IP address and use the IP address provided to decide which libraries to show. So now we pick up the IP address of the browser user. If this is a 'Wageningen IP address' we will not invoke the Worldcat API service, so our users will not be bothered by other library holdings. For other IP addresses we take the IP address and provide the user with a list of other libraries in their 'neighborhood' holding the title.
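The decision itself is simple. Here is a minimal sketch in Python, assuming the 'Wageningen' addresses can be described by a list of network ranges. The range below is a reserved documentation range used as a placeholder, not our real network, and the function name is made up for this example.

```python
import ipaddress

# Placeholder range (reserved for documentation), standing in for the local networks.
LOCAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def holdings_lookup_ip(client_ip):
    """Return the IP address to pass on to the holdings service, or None for local users."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in LOCAL_NETWORKS):
        return None       # local user: do not ask for other libraries' holdings
    return client_ip      # remote user: let the service pick libraries near them


print(holdings_lookup_ip("192.0.2.17"))
print(holdings_lookup_ip("198.51.100.5"))
```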

If you follow a link like this you may not only notice this new feature, but you may also notice another feature. A feature that will only be shown to you when your IP address can be related to a library in the Worldcat registry. And only if the person that registered that library also registered a base URL of an OpenURL resolver for that library.

In that case, we will show a link to your local OpenURL resolver to provide you with (mostly full text) services your library will provide for the title you have found in our catalog. Isn't that great !!

One drawback however. Pretty hard to test for us :)

So I am glad Sarah Miller from the library at the National Science Foundation contacted us about a problem they had with this last service. Her SFX button did not lead to a full text link for a journal they hold. She even pointed out why. We should add 'sfx.ignore_date_threshold=1' to the OpenURL to show full text links from the journal record. She was right. We had this problem when we implemented SFX years ago. At that time the only way to solve this problem was to add year=999 to the OpenURL. We did not put this in the link itself, but added it to the OpenURL when it was parsed and found to be coming from our own journal catalog record. This still works, but only with our own SFX server. We changed this, so you may actually be able to use this new feature.
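For those who want to do the same, adding the parameter to an existing OpenURL can be sketched in Python as follows. The resolver address in the example is made up; only the parameter name sfx.ignore_date_threshold comes from Sarah's message.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_ignore_date_threshold(openurl):
    """Append sfx.ignore_date_threshold=1 to an OpenURL unless it is already present."""
    parts = urlsplit(openurl)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "sfx.ignore_date_threshold" for key, _ in query):
        query.append(("sfx.ignore_date_threshold", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical resolver base URL, for illustration only.
print(add_ignore_date_threshold("http://resolver.example.org/sfx?issn=1530-9932&genre=journal"))
```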

There is one potential problem with this service. OCLC restricts access to the Worldcat search API to 5000 requests a day. I hope we do not hit that limit very often. We will hit it whenever more than 5000 full displays of a catalog record are requested in a day. Well, this does happen, mostly because of crawlers. I try to filter out a few known crawlers and do not hit the Worldcat service if they request a full record display. And now just hope for the best...
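A rough sketch in Python of this kind of guard is given below. The list of crawler names is illustrative and far from complete, the class and function names are made up, and the counter lives in memory only, so it is not how our catalog actually does it, just the general shape of the check.

```python
import datetime

# Illustrative fragments of user-agent strings, not a complete list.
KNOWN_CRAWLERS = ("googlebot", "slurp", "msnbot")


class DailyQuota:
    """Count requests per calendar day and refuse once the limit is reached."""

    def __init__(self, limit=5000):
        self.limit = limit
        self.day = None
        self.used = 0

    def allow(self, today=None):
        today = today or datetime.date.today()
        if today != self.day:          # a new day starts a new count
            self.day, self.used = today, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def should_call_worldcat(user_agent, quota, today=None):
    """Skip the service for known crawlers and when the daily quota is used up."""
    agent = (user_agent or "").lower()
    if any(name in agent for name in KNOWN_CRAWLERS):
        return False
    return quota.allow(today)
```

Crawlers are checked first, so they do not consume any of the 5000 requests.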

2009-06-05

Moving to SOLR

We are currently running tests with SOLR to index the content of our Library Content Management System. At this moment the system relies on Oracle Text Retrieval. We do not feel this meets our demands and we find it too tightly knit to our storage solution, which we want to be database independent. We are planning an architectural description of our complete environment before we start developing, but I do have some ideas on which way to go of course.

To start with, we will leave our complete system as it is and add an extra index layer. This will allow us to gradually leave Oracle text retrieval and evolve the system instead of disrupting it.

The idea is to store the record id and a root table name after a record is successfully entered or updated. So imagine someone changes the title of a publication. The id and the table containing the bibliographic description are added to a list. The same id and table name will also be registered when a record pointing to it (for example that of a book item) is modified. This list has the form of a list of URLs, since every record in our Library CMS is uniquely identified by a URL.

There will be a process that reads this file and retrieves each URL. The URL will return an XML file, which will be processed by an XSLT. This XSLT can enrich the XML via web services.

For example, it can call our own web services to add information about the book items to the bibliographic description. We can also add OCLC numbers and LCC headings by consulting web services at both OCLC Pica and OCLC Worldcat.

Many other enrichments of the data may be done before it is offered to SOLR for indexing. This way we will be able to retrieve the records by keys that are not present in the actual record, but which are added just before indexing.

This has great potential. We can for example check our SFX server for information and add a flag so we know beforehand that a publication is electronically available. We can check Worldcat and add better normalized author names and lots of stuff we might think of later.
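To give an impression of what enrichment before indexing amounts to, here is a small sketch. Note that it uses Python's ElementTree rather than the XSLT we plan to use, and that the element names (ppn, oclc_number, lccn) and the lookup functions are hypothetical stand-ins for real web service calls.

```python
import xml.etree.ElementTree as ET


def enrich_record(xml_text, lookups):
    """Add one extra element per lookup that returns a value for the record's PPN."""
    root = ET.fromstring(xml_text)
    ppn = root.findtext("ppn")
    for element_name, lookup in lookups.items():
        value = lookup(ppn)            # in reality this would be a web service call
        if value is not None:
            ET.SubElement(root, element_name).text = value
    return ET.tostring(root, encoding="unicode")


record = "<record><ppn>123</ppn><title>Example</title></record>"
lookups = {
    "oclc_number": lambda ppn: "ocm" + ppn,  # stand-in for an OCLC number service
    "lccn": lambda ppn: None,                # a service that finds nothing
}
print(enrich_record(record, lookups))
```

The enriched XML is what would be offered to the indexer, so a search on the added OCLC number would find the record even though the number is not stored in the record itself.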

We will remain compatible with our current way to search and present results.

We will store the same XML representation in the SOLR index as is stored in our CMS, so we can still use all XSLTs that are currently used to present results.

We will also 'hide' the query interface of SOLR behind our current SRU-like interface. Of course we will have to add ways to make use of the new sorting and faceting features we will be able to work with.

This way we also keep our Z39.50 working since it makes use of the same interface to search the LCMS.

Well, these are our first ideas, so we will do a lot of tests and rethinking I suppose. I just like to share them with the readers of this blog, so anyone can jump in with even better ideas.

I hope to share some more results with you at the end of this summer. I am really excited about this new journey we are taking.

2009-03-12

RSS services from journals in library catalog

About a month ago, Terry Bucknell of the University of Liverpool announced on the code4lib list that the ticTOCs project (a project funded by JISC in the UK to create a single, freely available source of RSS feeds for tables of contents - see http://www.tictocs.ac.uk/ ) is exposing their information about journals and their RSS feeds as a TAB delimited file.

I immediately spent two hours to develop a way to read that table into our Library Content Management System and join this table to our catalog records using their ISSN. It took a little longer to get everybody to agree on the way this should be presented in the user interface, but today it can be seen in our production environment. RSS buttons appear in our Journal A-Z list and also in the full record presentation of a journal. Within the full presentation we also pick up the feed and show the recent articles within the record presentation. Most of the discussion during implementation was about whether these recent articles would appear directly or whether they should be hidden to begin with. The hiders won.
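For readers who want to try something similar, the join on ISSN can be sketched in a few lines of Python. The column positions used here are an assumption made for the example, so check the actual layout of the ticTOCs file before relying on them. The function names and the sample records are made up as well.

```python
import csv
import io


def load_feeds(tsv_text, issn_column=1, feed_column=2):
    """Map each ISSN to the feed URLs listed for it (column positions are assumed)."""
    feeds = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) <= max(issn_column, feed_column):
            continue                   # skip short or malformed lines
        issn = row[issn_column].strip().upper()
        url = row[feed_column].strip()
        if issn and url:
            feeds.setdefault(issn, []).append(url)
    return feeds


def attach_feeds(catalog_records, feeds):
    """Return copies of the catalog records with the matching feed URLs added."""
    return [dict(record, feeds=feeds.get(record["issn"].upper(), []))
            for record in catalog_records]
```

Upper-casing the ISSN makes sure a check digit written as 'x' still matches one written as 'X'.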

Terry mentioned that they were thinking of making an API for this service, but because our LCMS is completely based on XML services, I accidentally created the API Terry is mentioning : http://library.wur.nl/WebQuery/toc/xml?toc=1530-9932. (Leave out the /xml and it will give you the RSS feed itself transformed with our local stylesheet)

It is this kind of service that should be shared more often between libraries in the world. We already discovered journals that are missing from Terry's list and will send them to him to update this wonderful service.

At Wageningen we are able to add this sort of service to our library catalogue easily, since we develop it all ourselves, but if you are using something you have bought from a library vendor, you might be able to do the same thing.

If you are able to add some JavaScript to your full record presentation, and can pick up the ISSN from the record, you could use JavaScript to retrieve the URL of the RSS feed and present this to the user. For this you could ask our service, but this will be costly when you present a list of journals and have to go to the service for each journal. If this became very popular we would also have a performance issue on our side, I'm afraid.

If you are using an OpenURL linker, I think the best way is to convince your supplier to process http://www.tictocs.ac.uk/text.php and populate your link menu with RSS links. Or if you are using SFX and are familiar with Perl, put the file on your SFX server and write a parser to find the URLs for the relevant ISSNs and use that to add links to your SFX menu.

2008-09-28

Finally a component based library environment

At IGeLU 2008 in Madrid, the international Ex Libris user group meeting, Ex Libris explained their future strategy. Their software products are going to be part of a component based architecture with open interfaces. They are going to use one central data store for all meta data, a Unified Resource Management System to control this information and a Unified Resource Discovery and Delivery environment to present information to the end user.

In Wageningen we decided to rebuild our own library system some 7 years ago. We discussed buying an Integrated Library System, but realized this sort of product was outdated. The traditional ILS does not cover all the tools a modern library uses nowadays. More importantly, these systems were not open. They did not enable you to build additional components, basically because these systems were not component based to start with.

Because of this, for example, almost every library ends up with separate systems for their electronic resources and for their paper resources, forcing their users to choose between electronic or paper, before they start their search for information.

We decided to develop what we call a Library Content Management System, sharing one single data store for all meta data, accessible to all system components. We would have liked to buy these system components, but they were unavailable. The thing that came close was Ex Libris' URL resolver. It had its own meta data store, the so-called knowledge base, but because of its open interfaces we could extend it to use our own content management system as an additional knowledge base. So we decided to buy this component and develop all other components ourselves. (Actually we also bought Ex Libris' meta search product, Metalib, because it also fits in our architecture, but there are a lot of issues when it comes to meta searching, which I will discuss in a future blog post.)

Now Ex Libris chooses to follow a similar strategy. They have chosen better names. I prefer Unified Resource Management System over Library Content Management System, which we have chosen. We had some internal discussion about this and I feel that the term Content Management System has been misused so much over the last decade, we did the wrong thing picking this name. (Although I do feel it covers quite well what the system is doing)

I think we would have invested in Ex Libris' URM solution if it had been introduced 6 years ago. Even though it still has to prove itself, since it is basically drawing board technology for the moment, I feel they have chosen the right architecture. I suppose they can do a better job than we can, with our small development team.
For now we will keep a keen eye on what is evolving and see what we can learn. If their interfaces are going to be really open, we may want to replace components by Ex Libris products in the future or make our components work within their environment.

But who knows, maybe other vendors will see the light as well and we will have even more choices in the future.

2008-03-20

Hooray for OCLC Pica customer response !

In my post about our Google Book Search implementation, I mentioned that we could only do this for records containing an ISBN. Google also accepts other identifiers, like OCLC numbers and the numbers of the Library of Congress.
We don't have that data recorded, but all of our records end up in Worldcat. Time to get in touch with OCLC Pica, the European branch of OCLC, which manages our Dutch Union Catalog and makes sure these records are also uploaded to Worldcat.

I had a short but fruitful email exchange. Their first reaction helped me to understand that I can locate a corresponding record in Worldcat using a URL containing the PPN (Pica Production Number), which we do record, since that is the identifier for the Dutch Union Catalog. That's neat. I can now point to Worldcat's 'Find in a Library close to you' page from the catalog record, for books out on loan or for non Wageningen UR users. On the resulting page the OCLC number is present. I could do some page scraping (which is pretty easy, since we only use XML tools and Worldcat returns XHTML pages (bravo !)), but it is not elegant and pretty slow as well. I mentioned this to OCLC Pica and also pointed out that the link from Worldcat to our local catalog should always be based upon the PPN. (Worldcat only does this for records without an ISBN, and accidentally does this using the OCLC number instead of the PPN.) OCLC Pica responded quickly that it was indeed better to have a small service that returns the OCLC number for a given PPN and that they would make this available to me before the end of March. I was astonished. Isn't that great? When I thanked them for this immediate response, I took the liberty of asking whether they would add the Library of Congress number to the response as well. Thanks Martin.

2008-03-19

Hooray for Google customer response !

It was not easy to find an appropriate response form on Google's web site to complain about my problems using the Google Books API. I found a form that was supposedly for authors and publishers wanting to advocate their book on Google Books and used it. Google responded today:

Thank you for notifying us of this problem regarding our API. I have forwarded these issues on to our specialists, who will look into the matter. Please feel free to reply to this email if you have any further details about the difficulties you are experiencing.

Sincerely,
Greg
The Google Book Search Team

I noticed earlier (when Google introduced URL resolving in Google Scholar) that they can be quite responsive. Of course I haven't got a solution yet. But have you ever had a response on your problems with Microsoft, even though we pay them for their products ?

2008-03-18

Google books API. Do they really want you to use it ......

Last week there was a lot of discussion about the Google Books API, which allows one to check whether Google has a book description, can provide you with a cover, and tell you whether it has scanned the book completely or partly. Example scripts appeared on the Google Books site, Tim Spalding gave examples on the LibraryThing Thingology blog and Godmar Back responded with some alternative scripts on the code4lib discussion list.
Ex Libris announced proudly that they had implemented the 'About this book' feature in their products and that it only took a week to get the link in place. Sunday evening at 11:00 pm I decided to see whether it would be difficult to implement this in our OPAC. Just after midnight I had implemented it and it has been running since.
As Wouter Gerritsma explains in his blog, we can only check Google for a book when we have an ISBN. Now we want to be able to do it for books that do not have an ISBN, using the OCLC number, which we have not registered in our records. However, we do have a PPN (Pica Production Number) and OCLC Pica makes sure our titles end up in Worldcat, so we should be able to get hold of the OCLC number.

So far so good.
But Google has some policies that obstruct the usage of their API. A product like SFX may suffer severely from this. (Depending on the way they are going to implement this) It surely affected our implementation severely and now I am trying to find a way to get around this.

I don't know if you have ever ended up with Google's 'We're sorry ....' message, telling you that you are probably infected with spyware or some virus. (Some people are really shocked when they see this warning !!)
Google sends this message when it detects 'anomalous queries' from one single IP address. We occasionally see this error in Wageningen, and I am not sure whether some computer on the university network is accessing Google excessively or whether it is just a lot of people searching Google. Because of the network address translation on the firewall, all requests from the network appear to Google to come from one or just a few computers. Anyway, Google Books seems to suffer much more from this problem than other Google services. Just a few hours after implementation, the API no longer responded with a JSON object (containing the requested information for this service) but with an ordinary HTML page, the 'We're sorry' page, messing up this service completely.
I can hardly believe this is caused just by implementing this service, so I have now defined a ProxyPass directive on the web server so that requests to Google for the API go via our library web server. Google will now see all requests as coming from this server. This way it no longer sees the requests as coming from the firewall gateway, and we will not suffer from all the other Wageningen UR PCs searching Google. If this does not solve the problem, I will be sure that Google sees normal usage of the API as unwanted traffic. If so, what kind of API are they offering us ?? For the Google Maps API and the Google Custom Search API they have a so-called access key to use the service; I guess that would be the way to go for this API too, to prevent unwanted use.
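Whatever the cause turns out to be, a service that expects JSON should survive getting an HTML page instead. A minimal defensive sketch in Python is given below. It assumes the response body is plain JSON text and the function name is made up; our actual implementation runs in the browser and looks different.

```python
import json


def parse_api_response(body):
    """Return the decoded JSON object, or None when the body is something else."""
    try:
        data = json.loads(body)
    except ValueError:                 # raised for HTML pages and other non-JSON text
        return None
    return data if isinstance(data, dict) else None


print(parse_api_response('{"ISBN:0000000000": {"available": true}}'))
print(parse_api_response("<html><body>We're sorry...</body></html>"))
```

When the result is None, the catalog page can simply leave out the Google Books link instead of breaking.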

2008-02-06

Validating XML records and validating input, how to proceed ?

WebQuery posts forms to an XML database. To be honest, the Oracle database consists of tables with just a few columns. The most important is a CLOB (Character Large Object) containing the actual XML record. The only real reason we put it in an Oracle table at this time is that we make use of SQL Text Retrieval to get the stuff out. So we don't use Oracle at all to do any record validation or constraint checking. We are happy with this, because we want to be independent of proprietary database features. Before posting the record, the XML is validated against its schema. We have two problems with this. The first problem is that this validation is done at the server side. Often this is not very user friendly, so we do additional JavaScript validation at the client level. The second problem is that the schema language does not have sufficient validation syntax. What we need is something like Schematron. However, this is not widely used yet.
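To illustrate the kind of rule the schema language cannot express, here is a small sketch in Python of rule-based validation in the spirit of Schematron: each rule is an assertion about the record as a whole, with a message for the user. It is not Schematron itself, and the element names (type, issn, title) are made up for the example.

```python
import xml.etree.ElementTree as ET

# Each rule pairs a message with a check that must hold for a valid record.
RULES = [
    ("a journal record needs an issn",
     lambda rec: rec.findtext("type") != "journal" or bool(rec.findtext("issn"))),
    ("a record needs a non-empty title",
     lambda rec: bool((rec.findtext("title") or "").strip())),
]


def validate(xml_text):
    """Return the messages of all rules the record violates."""
    record = ET.fromstring(xml_text)
    return [message for message, check in RULES if not check(record)]


print(validate("<record><type>journal</type><title>Example</title></record>"))
```

The first rule is a dependency between two fields, the sort of constraint we now have to program by hand.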

An issue related with this is the generation of forms for updating and modifying XML records. We have a lot of them and up until now they have mostly been developed one by one. It would be most efficient to make them automagically, based on the schema. Schema is not suited for this either and I guess schematron is also insufficient.

At this moment we are working on XML files that provide the information to build forms automatically using general XSL style sheets. We are looking at how to avoid definitions to be redundantly defined. For example: At this moment field enumerations are defined in the schema. However, they are also defined in a data base, so they can be easily edited. We definitely have to come up with a more consistent architecture for this. It is not as if we are the only ones in this world that are trying to solve this problem, are we ?
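The direction we are thinking in can be sketched as follows, in Python rather than the XSL we actually use. One field definition, including its enumeration, is the single source from which the form control is generated. The structure of the definition (name, label, options) is an assumption made for this example.

```python
from html import escape


def render_field(field):
    """Render one form control from a field definition."""
    name = escape(field["name"], quote=True)
    label = escape(field["label"])
    if "options" in field:
        # Enumerated field: the same list could also drive validation.
        options = "".join(
            '<option value="{0}">{1}</option>'.format(escape(o, quote=True), escape(o))
            for o in field["options"]
        )
        control = '<select name="{0}">{1}</select>'.format(name, options)
    else:
        control = '<input type="text" name="{0}"/>'.format(name)
    return "<label>{0} {1}</label>".format(label, control)


print(render_field({"name": "language", "label": "Language", "options": ["Dutch", "English"]}))
```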

2008-01-23

V-sources, of course there is a Dutch library using an ERM solution

Joost de Vletter from Eindhoven University contacted me to say they have developed V-sources, an ERM system, which they use and which is going to be used by Delft University as well. Two posts ago I said I did not know of any Dutch libraries using an ERM system.
I was thinking of the efforts of the consortium of Dutch University libraries and the Royal Library to select a system. These efforts were unsuccessful so far. I forgot that Eindhoven invested in developing a system themselves, which will also be available commercially.

Joost also responded to my remarks about the lack of integration between the management of paper and electronic subscriptions. He answered that their system, like most other ERM systems, is based on the DLF (Digital Library Federation) ERMI document. This describes a data model that does not consider the administration of paper subscriptions. He is right, and following standards is a very good thing. And since these new library system components are far more open than before, it is probably easier to link the paper subscription in one system to the electronic subscription in the other. However, the 'old' serials management systems of most vendors will not be that open, which will probably make integration more difficult. I sometimes think that vendors use the component based architecture argument just to sell more systems.

Don't get me wrong, I am all for component based and even service oriented architectures, but not for just a part of our systems. Our management problems with electronic subscriptions, which are mostly 'big deals', do not justify a separate ERM system. We can solve our problems with extra features within the serials management system. However, if there were a national ERM system, shared between university libraries, we would also benefit from a single point of administration. That would be a reason to implement ERM, assuming it had a rich set of web services available. I heard from Thomas Place of Tilburg University that he is thinking in the same direction, and I think most Dutch university libraries will consider this a good way to go.

2008-01-14

European Library Automation Group - 32nd ELAG Library Systems Seminar - 14-16 April 2008

We have just opened the web site we created (using WebQuery of course) for this year's ELAG conference, which we will host in April. I will present a paper on our "Digital Library", as we call our library website. I will speak about our decisions and efforts to start building a Library Content Management System. There will be a lot of other interesting papers as well. Another exciting thing about ELAG meetings is the workshops. ELAG is not only about listening to presentations. You participate in one workshop during the conference and discuss a subject. Workshop reports are presented on the last day. This year you can also come and talk without having to prepare a paper!!! You can express your ideas in 5 minute lightning talks. So join us. Be fast. Convenient hotel accommodation nearby is limited.

2008-01-09

Serials Management, the last conversion

This week we hope to start using our new serials management application, the last application that is still running on our old system. Cardex, as the application is called, is now based on the new CMS. It is built entirely with XSLT, some JavaScript and a bit of Perl that does background printing and emailing of claims for missing issues or stagnating subscriptions. The old version, which has been used until now, was built in 1982 (to be honest, the first patch of the code dates back to November 1982, so I presume 1982 must have been the year it was born). It was based on Minisis and written in SPL, a Pascal-like proprietary programming language for the HP3000 series computers. Although the application was old, it had evolved over the years and people were pretty happy with it. We kept the same application properties, but the application, a web application now, has a completely different look and feel. It now has all the old features and some more. The next step is to add new functionality for electronic subscriptions, or to integrate it with components for this, so-called Electronic Resource Management (ERM) systems. The problem is that I have not seen ERM systems that connect with traditional serials management systems, which is quite strange, especially since subscriptions are often paper plus digital content (ok, it is changing). They do integrate with OpenURL resolvers and sometimes with cataloguing components. It feels like these systems force us to separate paper and electronic serials management, just as in most systems cataloguing digital content (in meta search portals and OpenURL systems) is separated from traditional cataloguing of paper content in an ILS. Vendors have not split up their traditional ILSs into components yet.
For now I think we have to build these features ourselves. So far, no Dutch library I know of has implemented an ERM system. Am I wrong?

2008-01-08

By the way ......

In one of my last posts in Dutch I mentioned we were running WebQuery version 5.34. Now, two years later, we are running version 5.52. This latest version introduced PAM (Pluggable Authentication Modules) authentication. This is our first step towards implementing federated authentication.

Code reviewing

There are now 8 of us working on the system. Six of us do application development, and we have reached a point where we have to consider implementing some quality control tools. One of them is code reviewing. We started with this recently. For the moment we have found it too much to do systematically, so not every bit of code is reviewed. We will take a piece of code every month and one of us, not the developer, will write a review and present it to the rest of us. We spend a morning discussing the review. This will lead to agreements on coding standards, and one of us will document these in our wiki.
So far we have had only one review. The discussions at the moment are very much about very basic standards we have never explicitly formulated. I suppose that later on we will discuss more local practices. I think we have found an excellent way to create acceptable standards. Everybody is excited about it.

And now I will continue ...

I have been very quiet on this blog for two years !!
We have been working hard on implementing all library applications using the LCMS and I haven't found the urge to report about it. However, blogging has become more popular over the last two years and reading blogs as well. I will also start blogging in English (please be merciful !) since this might appeal to a much larger community.

2005-12-28

And what if, for once, you do not want to do text retrieval?

Last week I moved the Gewasbeschermingskennisbank (crop protection knowledge base) to Oracle. The problem that surfaced is that in Oracle we cannot easily mix text indexes and full-field indexes. At the moment we support 4 indexes per table in wqoracle: on record number, on date of entry, on date of modification, and the context index for text retrieval on the content of the XML elements.
In the crop protection knowledge base, however, it must be possible to distinguish between products permitted in "siergewassen onder glas" (ornamental crops under glass) and products permitted in "plantgoed voor siergewassen onder glas" (planting stock for ornamental crops under glass). The latter are also found when you search for "siergewassen onder glas".
The somewhat clumsy workaround I use for now is to define, next to the normal XML element, a second XML element with the same content but stripped of word separators such as spaces, and to use that element for the search.
It works, but I think we should build in the ability to use 'normal indexes' (normal in the database world, unlike the ones we usually use in our library environment), and to indicate somewhere in a search command that for a particular field an index other than the context index must be used.
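The workaround can be sketched in a few lines of Python. This is only an illustration of the idea, not the code we run: the function name is made up, and dropping every non-alphanumeric character (not just spaces) is an assumption.

```python
import re


def as_single_token(value):
    """Collapse a field value into one token by dropping word separators.

    A text index then sees the whole field as a single word, so a search
    for the collapsed value behaves like an exact, full-field match.
    """
    return re.sub(r"[\W_]+", "", value.lower())


# The shadow element stores the collapsed value; the query is collapsed
# in the same way before it is sent to the text index.
short = as_single_token("siergewassen onder glas")
long_ = as_single_token("plantgoed voor siergewassen onder glas")
```

Because `short` and `long_` are now different single words, a search for the first no longer hits records that only contain the second.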

2005-12-15

WebQuery 5.34

Indeed, yet another new version of WebQuery.

This version was taken into use today because it corrects an error in the handling of the multipart variant of form processing. The error occurred only in forms that transfer a file to the server. Under certain circumstances the part of the form after the browse button was then not processed, or not processed correctly. That problem is solved in this version.

There is one other change: wq_sfx is now passed by default as an XSLT parameter ($wq_sfx). The aim is to make it easier to standardise the "next" link and possibly other links. The XSLT "ora-default.xslt", intended as a replacement for and extension of ora_errors.xslt, already makes use of this (mind the dash: it is a minus sign, not an underscore).
The new circulation application uses this XSLT in the test environment.

2005-12-09

WebQuery 5.33 and wqoracle 1.3 active

As you already saw from the subject, new versions of WebQuery and wqoracle have been taken into use again, and with them comes a short list of changes compared with the previous version:

WebQuery 5.33 (in combination with wqoracle; I no longer test new features like this one with wqstub/mindbsrv, so do not use it with those): wq_max, which sets the number of records shown per result page in a query, has a new use. If wq_max is given the value 0, only the number of hits is shown and no result set. In the old version wq_max=0 had the same effect as omitting wq_max (in which case the default value of hit-limit from the cgiparm parameter file is used). This also makes the keyword "hits" (after the service name in the URL) obsolete; it was hardly ever used and did not work in combination with wqoracle.
In addition, parentheses are now placed around all search terms (except in wq_qry), to prevent an unintended search across all fields.
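As a small illustration, a hit-count request could be assembled like this in Python. The service URL and the `title` field are borrowed from another post on this blog; whether a particular service accepts them is an assumption, and `hit_count_url` is just a name chosen for the sketch.

```python
from urllib.parse import urlencode

# Assumed service URL, used here only as an example.
BASE = "http://library.wur.nl/WebQuery/wurpubs"


def hit_count_url(field, term, base=BASE):
    """Build a query URL that asks only for the number of hits (wq_max=0)."""
    return base + "?" + urlencode({field: term, "wq_max": 0})
```

Calling `hit_count_url("title", "soil")` gives a URL ending in `?title=soil&wq_max=0`.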

I want to point out once more that Oracle context indexes (and therefore also the search mechanism in wqoracle) interpret a large number of non-alphanumeric characters as operators. A few examples (the quotation marks only mark the search term, so do not type them if you want to see the effect): if you search for "plant-hg", you find all records in which the field concerned contains the term "plant" NOT followed by "hg". Put braces around the search term, or a backslash before the minus sign, if you want to prevent that: "{plant-hg}" and "plant\-hg" give the desired result. "A&F" finds all records in which the field concerned contains both an A and an F (comparable to searching for "A F" in Minisis). Here too, braces and the backslash help to find the term "A&F". "A F" has the same effect in this case, because you then search for a separate "A" followed by a separate "F", and the index treats the "&" in the record differently. Also avoid words such as "AND" and "OR" in a search term, unless you deliberately want to use them as operators. Always use parentheses and braces in pairs, and bear in mind that "plant soil" (39 hits in title) no longer gives the same result as "soil plant" (124 hits) or "soil & plant" (1536 hits). In short, "elluk foordeel hep se nadeel" (every advantage has its disadvantage).
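For applications that pass user input straight into a search, the two escaping styles could be wrapped in small helpers. The Python below is a sketch only: the function names are invented, and the list of reserved characters is an assumption that should be checked against the Oracle Text documentation for the version in use. It does not deal with reserved words such as AND and OR.

```python
# Characters assumed to be treated as operators or reserved symbols by
# Oracle Text; verify this list against your Oracle version.
RESERVED = set(",&=?{}\\()[]-;~|$!>*%_")


def escape_term(term):
    """Put a backslash before every reserved character in a search term."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in term)


def brace_term(term):
    """Wrap a whole search term in braces so it is taken literally."""
    if "{" in term or "}" in term:
        # Braces must always come in pairs, so refuse terms that contain them.
        raise ValueError("term contains a brace; use escape_term instead")
    return "{" + term + "}"
```

With these, `escape_term("plant-hg")` produces `plant\-hg` and `brace_term("A&F")` produces `{A&F}`, the two forms mentioned above.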

wqoracle 1.3: wq_max=0 (see above). In addition, an internal change was made at Oracle's request, as a workaround for one of the Oracle bugs that led to search problems. The change means that the XML fields are now fetched from the database in portions of 4000 bytes, instead of as much as possible in one go. Finally, the maximum field length was extended from 132KB to 1MB, because ISN 51066 in titelplus was too large (about 250KB).

Age Jan

2005-12-07

Uploading documents into a database

For some time it has been possible to attach files to records in a database. At the moment this is done for Wageningen UR publications, but we are now also going to do it for, for example, the documentation database of COGEM (COmmissie GEnetische Modificatie, the Commission on Genetic Modification) and for the learning materials database set up as part of LORENET. These last two databases are Oracle databases, however, and that has some consequences. Originally the Minisis database server and WebQuery both ran on the HP3000. Uploading a file was a coordinated action between those two processes, on the same computer, based on agreements about where the files were stored. Nowadays WebQuery runs on the web server and WQoracle on the database server. WebQuery does not store the files in the database, but simply in directories on the web server. The file name is then passed to WQoracle, after which it is stored in the database. When a record is entered, a file name is generated that does contain the name of the table but is otherwise meaningless. If the file is added to an existing record, the ISN is included in the file name. That way a file can always be traced back to a unique database record. When there is no record yet, that is of course not possible. If the interface allows it at all, try not to offer uploading in the entry screen, but only after the record has been entered.
Removing a file is a different problem. WebQuery and WQoracle are separate processes, not linked by a transaction logging mechanism. When you add a new file to replace the previous one, or want to remove one, you can modify the record and, if needed, upload a new file, but there is no mechanism to remove an old one. Old files are then left 'loose' in the directory, and that causes clutter. WQoracle could 'inform' WebQuery about a file to be removed, but if the removal fails we have no mechanism to roll the whole thing back. So clutter remains possible.
We could also make sure that the file name can never be removed from the XML, and that only a delete attribute can be set, which you then have to take into account in the stylesheet (this also makes an UNDELETE possible). We could then clean up periodically in batch. We are thinking about it. If you have a brilliant idea, let us know!!
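To make the delete-attribute idea a bit more tangible, here is a rough Python sketch of what the periodic batch job could look at. Nothing here exists yet: the `file` element, the `delete="yes"` attribute and the function names are all assumptions, and the sketch only reports candidates instead of removing anything.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def referenced_files(records):
    """Split the file names found in the records into (kept, flagged) sets."""
    kept, flagged = set(), set()
    for xml_text in records:
        for elem in ET.fromstring(xml_text).iter("file"):
            name = (elem.text or "").strip()
            if name:
                # Hypothetical convention: delete="yes" marks a removed file.
                (flagged if elem.get("delete") == "yes" else kept).add(name)
    return kept, flagged


def cleanup_candidates(upload_dir, records):
    """Return (flagged files on disk, orphaned files on disk), both sorted."""
    kept, flagged = referenced_files(records)
    on_disk = {p.name for p in Path(upload_dir).iterdir() if p.is_file()}
    # Never propose a file that some record still uses without the flag.
    flagged_on_disk = (flagged - kept) & on_disk
    orphans = on_disk - kept - flagged
    return sorted(flagged_on_disk), sorted(orphans)
```

Keeping the two lists apart means flagged files can be removed after a grace period (so UNDELETE stays possible), while orphans can be inspected by hand first.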

WebQuery on the web

While searching Google for this blog I came across two nice things. First, a WebQuery manual of mine in Spanish. Well, an old manual and a kind of summary. Funny that someone did that, especially because we only made Dutch, English and French versions. There has never been any Spanish interest in WebQuery. Incidentally, there is also a registered product with the same name. Something to keep an eye on.
The second thing is an alternative way to get 'off campus access' to the desktop library from home. Search Google for 'pauzeren' and click on the third result. Click the login button and continue ...... not something to include in the help texts :)

2005-11-28

Character sets and error handling

Perhaps superfluously, a few more remarks about XSLTs in combination with WebQuery:

a) if XSLTs are used (so by now almost always), WebQuery determines the properties of the output from the output options of the XSLT. So include an output line in EVERY XSLT, and specify a method and an encoding in it. If you forget it, the defaults are xml and utf-8, which is probably not what you want.

b) if an error occurs outside the XSLT, the XSLT is ALWAYS called. For a number of general errors this may be a different XSLT from the "intended" one, because wq_sfx has not been interpreted yet at that point. So make sure that EVERY XSLT uses ora-errors.xslt (for the time being) for error handling.

2005-11-23

Ready for the OAI interface!

In Minisis we could search on a range of values, for example years of publication, by giving the two values separated by a double slash as the search criterion (e.g. 1950//1960 or sneeuw//sneeuwschuiver). This is not very easy to implement in a text-oriented search index. In Minisis it was only implemented in one of the last versions, at the request of many. In Oracle text retrieval this possibility does not exist.
This is a problem for our OAI (Open Archives Initiative) interface on the Wageningen UR publications.
For that interface we need a simple way to search for records modified or entered between two dates. To make this possible we have to define an extra index. Until now we had 2 indexes for every Oracle table: the text index on the entire XML document and a separate index for the ISN. Two more indexes are now added by default, for the entry date and the modification date of a record.

WebQuery 5.32 and wqoracle 1.2, which have been the production versions since yesterday, make it possible to use these new indexes. You can now search on a range in the "creation date" and the "modification date". Two new pseudo-fields have been defined for this, wq_cdt and wq_mdt. The search string for these fields MUST contain one (and only one) slash as the delimiter between the start and end moment. Both moments may be truncated wherever you like. A few examples: wq_mdt=2005-10/2005-11 finds all records modified in October, in this case in the NIEUWS database.
The query wq_cdt=2005-10-19T11:30/2005-10-20 means: find all records entered on 19 October after 11:30.
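The two examples suggest that a truncated moment is filled out to the earliest instant it can stand for, and that the end moment is not itself included. Under that assumption, the interpretation of a range can be sketched in Python as follows. This is an illustration, not the WebQuery implementation; the function names are invented, and the sketch only accepts moments cut off at a field boundary (year, month, day, hour, minute or second).

```python
from datetime import datetime

# Earliest possible value for every position of an ISO-style timestamp.
_TEMPLATE = "0001-01-01T00:00:00"
# Lengths that correspond to a cut at a field boundary.
_ALLOWED_LENGTHS = {4, 7, 10, 13, 16, 19}


def expand_moment(moment):
    """Pad a truncated moment such as '2005-10' to a full datetime."""
    if len(moment) not in _ALLOWED_LENGTHS:
        raise ValueError(f"cannot expand truncated moment: {moment!r}")
    padded = moment + _TEMPLATE[len(moment):]
    return datetime.strptime(padded, "%Y-%m-%dT%H:%M:%S")


def parse_range(value):
    """Split 'start/end' on its single slash and expand both moments."""
    if value.count("/") != 1:
        raise ValueError("a range needs exactly one slash")
    start, end = value.split("/")
    return expand_moment(start), expand_moment(end)
```

For wq_mdt=2005-10/2005-11 this gives the interval from 1 October 00:00 up to 1 November 00:00, which matches "all records modified in October".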

This way the OAI interface on Wageningen UR publications can also start using the faster Oracle database.

2005-11-18

The proximity operator

In an earlier post I mentioned that in the new Library Content Management System 'phrase searching' is the default and that word searching has to be done explicitly with the 'AND' operator. However, we now also have the "near()" operator to search for words within a record. For example, http://library.wur.nl/WebQuery/wurpubs?title=near((water,management),2) searches for all records in WaY in which the words water and management occur in the title, provided there are at most 2 other words between them.
In this case we therefore find titles containing "water management", but also "water quality management" and "management of irrigation water".
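The matching rule described above (both words present, in either order, with at most 2 other words in between) can be mimicked in a few lines of Python. This only illustrates the rule as described here; it is not how Oracle evaluates near(), and the function name and the simple tokenisation are assumptions.

```python
import re


def near(text, first, second, max_gap):
    """True if both words occur with at most max_gap words between them."""
    words = re.findall(r"\w+", text.lower())
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [i for i, w in enumerate(words) if w == second.lower()]
    # Order does not matter, so compare the absolute distance.
    return any(
        i != j and abs(i - j) - 1 <= max_gap
        for i in first_pos
        for j in second_pos
    )
```

All three titles from the example pass this check with a gap of 2, while a title such as "water supply and sanitation management" (three words in between) does not.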