PYTHEAS

Introduction

PYTHEAS¹ (Powerful Yet Tactfully Helpful Electronic Arranger of Sources) is designed to be a multi-tier ILS (Integrated Library System) or Library Application Framework (LAF)². The two major building blocks for providing server-based metadata and information retrieval capabilities in PYTHEAS are MARC (MAchine Readable Cataloguing) and RDF (the Resource Description Framework), standards that define a format for describing objects and can package highly structured metadata for describing content and content relationships in physical and digital objects. The database also has some support for Web Ontologies and Topic Maps. The server architecture is based on XML (eXtensible Markup Language), Web Services, and EJB (Enterprise Java Beans). We make use of Exolab’s Castor XML mapping tool to manage XML in a database-independent manner.

The latest version of the source can be found here and a demo can be found here. Please be gentle with the demo: the host machine is several years old and supports several other applications. The collection is small, but possible searches include "java" and "perl". The overall architecture for PYTHEAS can be seen in the following diagram:

Architecture

XML Layout

To understand how PYTHEAS uses XML, it is important to know that Castor is a Java-based Object/Relational mapping tool that can provide an XML view of a database. On the plus side, these "views" can be varied, allowing, for example, a "full" and a "brief" mapping, and allowing a node to be switched from an "element" to an "attribute" and vice versa on the fly. Sometimes, though, the XML view is a little too dependent on the structure of the underlying database. For example, most OSS RDBMS options limit text fields to 255 characters, leading to results like the following:

How large textual fields are handled in a typical RDBMS.

Here the 505 field is broken into three parts, which must either be rejoined by modifying the XML on the way out (which isn't that big a deal) or merged by a stylesheet. You can see it up close here. Again, this is dependent on the underlying database: Castor will faithfully map to whatever the database allows, which is why a full text document may not be a good match for RDBMS varchar fields.
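The splitting and rejoining described above can be sketched in a few lines of Java. This is only an illustration of the idea, not code from the distribution: the class name FieldChunks and the MAX_LEN constant are assumptions, with the limit taken from the 255-character figure mentioned earlier.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PYTHEAS distribution.
final class FieldChunks {
    // Assumed column limit, taken from the 255-character figure quoted above.
    static final int MAX_LEN = 255;

    // Split a long value into pieces that each fit into one text column.
    static List<String> split(String value) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < value.length(); i += MAX_LEN) {
            parts.add(value.substring(i, Math.min(value.length(), i + MAX_LEN)));
        }
        return parts;
    }

    // Rejoin the pieces in stored order to recover the original value.
    static String merge(List<String> parts) {
        return String.join("", parts);
    }

    public static void main(String[] args) {
        String contents = "x".repeat(600); // stand-in for a long 505 contents note
        List<String> parts = split(contents);
        System.out.println(parts.size() + " parts, round trip ok: "
                + merge(parts).equals(contents));
    }
}
```

The same merge could equally be done in a stylesheet; the point is simply that the pieces must be concatenated in their stored order before the field is presented.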

A full treatment of MARC and RDF is far beyond the scope of this document, but some background information may be useful to understand the XML representation used in PYTHEAS. With MARC, conceptually distinct segments or "fields" of information are identified by three-character numeric codes called tags. Within each field, elements are organized through the use of "subfield" codes consisting of a single letter or number preceded by a "delimiter" sign. Fields may be fixed length or variable in size. Within variable fields, the first two character positions contain values that interpret the data found in the field. Fields and subfields may be repeated, allowing the MARC format to support an incredible level of content and detail. The MARC formats supported in PYTHEAS are for bibliographic, holdings, and authority records. MARC 21 support has not yet been put into the mapping files, and not all of the fields for authorities and holdings have been specified. The mapping file should be modified to list all possible MARC 21 fields and a view should be defined for each record type, but this is fairly straightforward to add.
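To make the tag, indicator, and subfield layout concrete, here is a minimal sketch of pulling a variable field apart in Java. It is not the PYTHEAS loader: the MarcField class is hypothetical, the field content in main is made up for illustration, and the sketch assumes the raw field text uses the usual unit separator character (hex 1F) as the subfield delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one variable MARC field; not the PYTHEAS data model.
final class MarcField {
    // Assumption: subfields are introduced by the unit separator (hex 1F).
    static final char DELIM = '\u001F';

    final String tag;   // three-character tag, e.g. "245"
    final char ind1;    // first of the two leading character positions
    final char ind2;    // second of the two leading character positions
    final List<String[]> subfields = new ArrayList<>(); // {code, value}, repeats kept

    MarcField(String tag, String raw) {
        this.tag = tag;
        this.ind1 = raw.charAt(0);
        this.ind2 = raw.charAt(1);
        for (String piece : raw.substring(2).split(String.valueOf(DELIM))) {
            if (piece.isEmpty()) {
                continue; // nothing precedes the first delimiter
            }
            // First character is the subfield code, the rest is its value.
            subfields.add(new String[] { piece.substring(0, 1), piece.substring(1) });
        }
    }

    public static void main(String[] args) {
        // Made-up field content, for illustration only.
        String raw = "10" + DELIM + "aA sample title /" + DELIM + "cA. N. Author.";
        MarcField field = new MarcField("245", raw);
        for (String[] sf : field.subfields) {
            System.out.println(field.tag + " $" + sf[0] + " " + sf[1]);
        }
    }
}
```

Keeping subfields in a list rather than a map preserves both their order and any repeats, which matters because fields and subfields may be repeated.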

The trickier MARC-related construct is authorities. Authority Control seeks to improve quality and consistency in the database. Perhaps no other aspect of library databases better demonstrates the tremendous intellectual effort libraries have put towards organizing information than the many different kinds of relationships accommodated by authority work. The relational model is very good for "many to many" relationships, e.g., many books are about "environmental studies", and for immediate updates, e.g., "environmental studies" becomes "environmental earth studies", but it raises some issues when updating records directly by XML. For example, if a work is updated to reflect that it is no longer about "environmental studies", then the reference in the many table needs to be removed. This can be dealt with during the update but requires some extra plumbing in the processing.
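The "extra plumbing" can be pictured with a small in-memory model. The sketch below is an assumption-laden illustration, not the PYTHEAS update code: the AuthorityLinks class, its two maps (standing in for the heading table and the many table), and the numeric ids are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the heading table and the "many" table.
final class AuthorityLinks {
    final Map<Integer, String> headings = new HashMap<>();        // headingId -> label
    final Map<Integer, Set<Integer>> workLinks = new HashMap<>(); // workId -> headingIds

    // Renaming a heading is a single update; every linked work sees it at once.
    void rename(int headingId, String newLabel) {
        headings.put(headingId, newLabel);
    }

    // Make the link rows for one work match the headings found in the updated
    // XML record, and report which links had to be deleted along the way.
    Set<Integer> syncWork(int workId, Set<Integer> headingIdsInXml) {
        Set<Integer> current = workLinks.computeIfAbsent(workId, k -> new HashSet<>());
        Set<Integer> removed = new HashSet<>(current);
        removed.removeAll(headingIdsInXml); // linked before, absent from the XML now
        current.clear();
        current.addAll(headingIdsInXml);
        return removed;
    }

    public static void main(String[] args) {
        AuthorityLinks db = new AuthorityLinks();
        db.headings.put(1, "environmental studies");
        db.headings.put(2, "unix");
        db.syncWork(100, Set.of(1, 2));
        db.rename(1, "environmental earth studies");
        System.out.println("removed links: " + db.syncWork(100, Set.of(2)));
    }
}
```

The rename touches one row, while the XML update has to compare the old and new sets of links, which is the part that a plain XML round trip does not do on its own.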

RDF is arguably more grounded in XML and allows metadata to be packaged in XML based on a schema, or collection of classes that represent a common vocabulary. Various research communities or groups with a common interest may define these schemas. We use a Dublin Core/MARC Crosswalk to create a base metadata set and use NACO Normalization Rules to create an element for string matching in queries. RDF is envisaged as the backbone of most searching in the database. Although every effort has been made to take full advantage of, and expose, the power of MARC, the public side of the system will almost certainly be layered on top of RDF.
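As a rough illustration of what a normalized matching element looks like, the following Java sketch folds case, strips diacritics, and turns punctuation into single spaces. It is a deliberately simplified approximation and does not implement the actual NACO rules, which treat a number of characters specially; the MatchKey class name is an assumption.

```java
import java.text.Normalizer;
import java.util.Locale;

// Simplified, hypothetical match-key builder. NOT a full NACO implementation.
final class MatchKey {
    static String of(String heading) {
        return Normalizer.normalize(heading, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "")     // drop combining marks left by NFD
                .toUpperCase(Locale.ROOT)      // fold case
                .replaceAll("[^A-Z0-9]+", " ") // punctuation and spacing runs -> one space
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(MatchKey.of("Environmental  studies."));
    }
}
```

Storing such a key alongside the display form lets a query compare normalized strings without altering what the user sees.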

EJB/SOAP & Castor

Castor can be a persistence engine for an EJB implementation and makes it easy to process SOAP requests. With EJB and SOAP, it is possible to build distributed applications by combining components from different sources, for example, a general ledger bean for acquisitions processing. ILS vendors typically develop these layers from the ground up. EJB/SOAP helps to directly combine our efforts with those of other Open Source and commercial EJB/SOAP initiatives, in addition to legacy systems through proxies, for example, using IBM’s Business Objects or SAP through a connector. The current distribution includes a servlet-based SOAP server and a sample client.

Ontologies & Topic Maps

Web Ontologies and Topic Maps veer into that contentious territory known as the Semantic Web (SW), and it is far beyond the scope of this document to tackle the issues surrounding knowledge representation on the Web³. However, regardless of one's feelings about the ultimate utility or futility of the SW, it is our belief that both of these areas provide a useful syntax and an opening for leveraging mainstream technologies for information retrieval. Topic Maps may also hold special utility for IFLA's Functional Requirements for Bibliographic Records (FRBR). With Topic Maps, it is possible to create "associations" such as "is_realized_through" for Work and Expression, and "is_exemplified_by" between Manifestation and Item. The current data loader creates very simple Topic Map objects but this area is ripe for further exploration.
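A minimal sketch of how such typed associations might be held and followed in Java is shown below. It is not the Topic Map model created by the data loader: the FrbrAssociations class and the topic identifiers are hypothetical, and only the two association names come from the text above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical store of typed associations between topics.
final class FrbrAssociations {
    static final class Association {
        final String type;
        final String from;
        final String to;

        Association(String type, String from, String to) {
            this.type = type;
            this.from = from;
            this.to = to;
        }
    }

    private final List<Association> associations = new ArrayList<>();

    void associate(String type, String from, String to) {
        associations.add(new Association(type, from, to));
    }

    // Return the topics reached from one topic through one association type.
    List<String> follow(String from, String type) {
        List<String> targets = new ArrayList<>();
        for (Association a : associations) {
            if (a.from.equals(from) && a.type.equals(type)) {
                targets.add(a.to);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        FrbrAssociations map = new FrbrAssociations();
        // Made-up topic identifiers, for illustration only.
        map.associate("is_realized_through", "work-1", "expression-1");
        map.associate("is_exemplified_by", "manifestation-1", "item-1");
        System.out.println(map.follow("work-1", "is_realized_through"));
    }
}
```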

Searching the Database

The interface and the form parsers need a tremendous amount of work, but they hopefully demonstrate one possible method of capturing information for information retrieval. All of the searching forms use the "Next" button to move to the "next" step in a query. If a user specifies a boolean operator, the form is redrawn to add an additional input box for all the repeatable elements. So in the following screenshot, for example, the "AND" operator has been specified:

Example of using Boolean operator for additional field

And the result is that an additional set of options is made available:

Additional field is added to the form

RDF searching shows how WordNet and the Google API are brought into the process. Using the RDF data model, search terms are matched against about 10,000 nouns from WordNet, and then additional terms are used for the query. Terms are also passed to Google's SOAP-based "suggest a term" service so that a search on "proogram" yields the following:

Example of hooks to WordNet and Google
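The term expansion step can be sketched independently of WordNet itself. In the Java example below, a small map stands in for the WordNet noun list; the QueryExpander class and the sample entries are assumptions for illustration, and the Google lookup is left out entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical expander: a map stands in for the WordNet noun list.
final class QueryExpander {
    private final Map<String, List<String>> related;

    QueryExpander(Map<String, List<String>> related) {
        this.related = related;
    }

    // Return the original term followed by any related terms, without duplicates.
    List<String> expand(String term) {
        String key = term.toLowerCase(Locale.ROOT);
        Set<String> terms = new LinkedHashSet<>();
        terms.add(key);
        terms.addAll(related.getOrDefault(key, List.of()));
        return new ArrayList<>(terms);
    }

    public static void main(String[] args) {
        // Made-up entries, not taken from WordNet.
        QueryExpander expander = new QueryExpander(
                Map.of("program", List.of("software", "application")));
        System.out.println(expander.expand("Program"));
    }
}
```

A term that is not in the list simply passes through unchanged, so the expansion never narrows a query.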

A Note on Recursive Data Models

One of the difficulties of mapping models like Topic Maps and RDF to a database is that they are incredibly recursive. For example, in RDF, you make "assertions" about a resource, but you can qualify the "assertion" itself with an unlimited number of additional RDF statements. There are various ways of dealing with this, but the approach in PYTHEAS has been to add an element that "points" to what the object is about and to supply a mechanism to support queries without requiring endless "drilling" down into multiple tables. To get a sense of this, consider our table view of Topic Maps:

Layout of Topic Maps in tables

Almost everything in a Topic Map is a "topic" that is mixed and grouped together in numerous ways to express relationships. We define a boolean element for high level searching so that queries like "find me information where William Shakespeare is a member of an association" can be carried out without cyclical table lookups, e.g., where basename="William Shakespeare" and isMember=true. Some RDF databases take the approach of generating tables on the fly but the important aspect is to be able to pull out and query the relationships that have been added to the database.
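The effect of the boolean element can be shown with a flat list of rows. This Java sketch is only an illustration of the idea: the TopicSearch and TopicRow names are hypothetical, and just the basename and isMember fields are taken from the example query above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical flat view of topic rows carrying a precomputed membership flag.
final class TopicSearch {
    static final class TopicRow {
        final String basename;
        final boolean isMember;

        TopicRow(String basename, boolean isMember) {
            this.basename = basename;
            this.isMember = isMember;
        }
    }

    // One pass over the rows: no walk through the association tables is needed.
    static List<TopicRow> find(List<TopicRow> rows, String basename, boolean isMember) {
        return rows.stream()
                .filter(r -> r.basename.equals(basename) && r.isMember == isMember)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<TopicRow> rows = List.of(
                new TopicRow("William Shakespeare", true),
                new TopicRow("William Shakespeare", false),
                new TopicRow("Hamlet", true));
        System.out.println(find(rows, "William Shakespeare", true).size() + " match(es)");
    }
}
```

The trade-off is that the flag must be kept in step with the associations whenever they change, in exchange for a query that stays a single lookup.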

Client Environments

PYTHEAS is exposed as an XML resource to any editing tool that can work with XML, from simple text editors to complex XML applications. However, there are advantages to constructing client applications that are fully "MARC-aware". XML tools tend to be "document-centric", which is highly useful for presentation of content, but editing hundreds of MARC records as individual documents may be very inefficient in a large collection.

It is envisioned that staff client applications will be browser-based and utilize Mozilla’s XUL (XML-based User Interface) tools and LiveConnect. XUL maximizes the potential of the web browser as an application engine while LiveConnect can step outside of the stateless constraints of the web for staff applications. The standard stateless web model is viable for public side information retrieval and content presentation purposes. None of the client software is available at this time, though if you want to maximize your browser experience, XUL is definitely worth a look.

Object Query Language and Lazy Loading

Object Query Language (OQL) is a mechanism supplied by Castor to use a high level query language for retrieving information while generating SQL underneath to interact with RDBMS Tables. For example, we can issue the query:

SELECT b from pytheas.Bib b where tags.subfs.subfValue LIKE "%unix%" and langCode="eng"

and retrieve the XML representation of our record. You can use the query tool with Castor to experiment with the database and also to view the SQL that Castor generates for each OQL statement.

Lazy Loading is a mechanism to minimize the amount of object loading required in the middleware when pulling records from a database. It makes a tremendous difference in the response time for queries and Castor can also use the LIMIT keyword to cut back on the amount of work performed by the database itself.
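The general idea behind lazy loading can be sketched with a small holder class in Java. This shows the technique in the abstract and makes no claim about how Castor implements it internally; the Lazy class and the loader used in main are hypothetical.

```java
import java.util.function.Supplier;

// Hypothetical holder that defers loading until the value is first requested.
final class Lazy<T> {
    private Supplier<T> loader;
    private T value;
    private boolean loaded;

    Lazy(Supplier<T> loader) {
        this.loader = loader;
    }

    // Run the loader on first access only; later calls reuse the cached value.
    synchronized T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
            loader = null; // the loader is no longer needed once it has run
        }
        return value;
    }

    synchronized boolean isLoaded() {
        return loaded;
    }

    public static void main(String[] args) {
        // Stand-in for an expensive fetch of a record's subfields.
        Lazy<String> subfields = new Lazy<>(() -> "subfield data");
        System.out.println("loaded before access: " + subfields.isLoaded());
        subfields.get();
        System.out.println("loaded after access: " + subfields.isLoaded());
    }
}
```

A result list built this way costs little until a caller actually opens a record, which is where the saving in response time comes from.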

Next Steps

The distribution only includes one stylesheet. Stylesheets are not hard to write, and at least several more should be added to the mix. The basic framework is in place for handling SOAP, but the list of services that make sense for handling by SOAP needs to be defined. The parsers need some attention and the web form processing logic should really be revamped. It would make more sense to use the DOM to set up a form object to process repeatable fields than the current approach, and it would be worth investigating Struts and the Java Form API to bring together a high level of web functionality at the presentation level.

Although this work is somewhat "java-centric", it would be possible to use the same approach in other environments. The table mappings are independent of the middleware, and the query tools can be used to see how the SQL pulls the information together. The key to PYTHEAS is XML, not the underlying programming: exposing library resources via XML opens the door to plugging into many other environments.

We welcome contributions, sage advice, and/or any kind of feedback in all these areas.

1. Pytheas (b. 300 BC) was a seaman from the Greek colony of Massalia who mapped a trade route to circumvent what was then a Phoenician monopoly on the tin trade, much like Linux and other open source initiatives are circumventing traditional software vendors. For more information on this project, see the background paper (in glorious RTF format!).
2. I have sketched out some thoughts on LAFs in "The End of the ILS", InsideOLITA. Fall/Winter 2001.
3. Of course, such restraint is completely abandoned in "The Semantic Web & Libraries", InsideOLITA. Spring/Summer 2002.

Revision History
23-March-1999: Document first made available
31-March-1999: Extended table layouts
04-April-1999: Added section on client environments
12-April-1999: Expanded section on client environments
17-May-1999: Added section on development roadmap
25-June-1999: Link to PYSOCK added
09-July-1999: Client environment section modified
12-August-1999: Work on crosswalk
20-September-1999: Examine linkages with Mozilla
15-November-1999: Decision to split PYMAN into "standard" and "MOO-enabled"
29-November-1999: Added link to MARC Editor, more info on client environments
15-December-1999: Revamped LiveConnect to work with both Netscape and Internet Explorer
05-January-2000: Decision to move everything to java (sobbed a lot)
02-February-2000: Existing code made available
05-April-2000: First java-based version made available
01-July-2000: Work started to utilize EJB and Castor.
11-November-2000: First distribution to use Castor.
12-December-2001: MARC Loader and demo files to be made available.
6-July-2002: Demo link added.
18-September-2002: Distribution released with SOAP & Topic Maps.
Last updated: 17-September-2002 by Art Rhyno <http://www.uwindsor.ca/library/leddy/people/art/>