MARC Record Parsing
California Digital Library (CDL) is experimenting with indexing millions of MARC 21 records using XTF. MARC stands for MAchine-Readable Cataloging and was developed by the Library of Congress. Adding MARC support is important for XTF because many existing library catalogs are available only in MARC format, and XTF can add powerful capabilities (such as relevance-based ranking, spelling correction, similarity queries, and integration with full-text data) to these catalogs.
Because MARC records are in a binary (non-XML) format, the textIndexer can convert them to XML and then send each record to one or more pre-filter stylesheets.
One might ask: why not convert all the records to MARCXML once, store those XML files in the filesystem, and then index them? While this does work, it has proven very slow, and also quite wasteful of hard drive space. MARC records on their own are very compact, and so great efficiency is gained by converting them in memory to MARCXML, and passing that XML directly to the pre-filter stylesheets for indexing.
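To make the in-memory approach concrete, here is a minimal Python sketch of converting binary MARC records to MARCXML one at a time, without writing intermediate files. It is not XTF's code: XTF performs this step in Java with the marc4j library. The function names (`iter_raw_records`, `record_to_marcxml`, `build_record`) are hypothetical, and the sketch assumes well-formed, UTF-8 encoded records laid out as a 24-byte leader, a directory of 12-byte entries, and the field data.

```python
import xml.etree.ElementTree as ET

RECORD_TERMINATOR = b"\x1d"
FIELD_TERMINATOR = b"\x1e"
SUBFIELD_DELIMITER = b"\x1f"
MARCXML_NS = "http://www.loc.gov/MARC21/slim"


def iter_raw_records(data):
    """Yield each raw record, using the 5-digit length at the start of its leader."""
    pos = 0
    while pos + 24 <= len(data):
        length = int(data[pos:pos + 5].decode("ascii"))
        yield data[pos:pos + length]
        pos += length


def record_to_marcxml(raw):
    """Convert one binary MARC record to a MARCXML string (assumes UTF-8 data)."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])          # offset where the field data begins
    directory = raw[24:base - 1]       # the byte at base - 1 is a field terminator
    # Setting xmlns as a plain attribute is a shortcut that serializes correctly.
    record = ET.Element("record", xmlns=MARCXML_NS)
    ET.SubElement(record, "leader").text = leader
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        body = raw[base + start:base + start + length - 1]  # drop field terminator
        if tag < "010":
            # Control fields (001-009) have no indicators or subfields.
            ET.SubElement(record, "controlfield", tag=tag).text = body.decode("utf-8")
        else:
            parts = body.split(SUBFIELD_DELIMITER)
            indicators = parts[0].decode("ascii")
            field = ET.SubElement(record, "datafield", tag=tag,
                                  ind1=indicators[0], ind2=indicators[1])
            for part in parts[1:]:
                text = part.decode("utf-8")
                ET.SubElement(field, "subfield", code=text[0]).text = text[1:]
    return ET.tostring(record, encoding="unicode")


def build_record(fields):
    """Build a tiny binary MARC record from (tag, body) pairs, for demonstration only."""
    directory, data = b"", b""
    for tag, body in fields:
        body += FIELD_TERMINATOR
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1
    total = base + len(data) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_TERMINATOR + data + RECORD_TERMINATOR


# Two records back to back, converted one at a time without touching the disk.
demo = build_record([("001", b"12345"), ("245", b"10\x1faA title\x1fbsubtitle")])
for raw in iter_raw_records(demo + demo):
    print(record_to_marcxml(raw))
```

Each XML string exists only long enough to be handed to the next stage, which is the property that lets the compact MARC file remain the only copy on disk.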
To enable this processing, when the Document Selector Stylesheet encounters a file containing MARC records, it should output a <file> tag with the format attribute set to MARC. The textIndexer will recognize this, open the file, convert each record to a separate MARCXML record, and pass each record in turn to the pre-filter stylesheet(s). The conversion from MARC to MARCXML is performed by the marc4j library, included with the XTF distribution.
CDL's experience with MARC conversion has been very positive: the conversion process is quite fast and the source records remain compact. To deal with some data corruption problems, XTF now makes every attempt to normalize Unicode characters when possible, and skip invalid characters. Also, if an entire record is corrupted, XTF will attempt to re-synchronize on the next uncorrupted record. In this way, the system is now fairly resilient in the face of minor data corruption.
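The recovery behavior described above can be illustrated with a short Python sketch. It shows one plausible strategy and is not taken from XTF's source: the helper names are hypothetical, the choice of NFC as the normalization form is an assumption (the text does not say which form XTF uses), and "corrupt" is detected here only by checking the leader and the record terminator.

```python
import unicodedata

RECORD_TERMINATOR = b"\x1d"


def clean_text(raw_bytes):
    """Decode as UTF-8, dropping undecodable bytes, then normalize (NFC assumed)."""
    text = raw_bytes.decode("utf-8", errors="ignore")
    return unicodedata.normalize("NFC", text)


def looks_like_leader(chunk):
    """A plausible leader is 24 bytes with numeric length and base-address fields."""
    return len(chunk) == 24 and chunk[:5].isdigit() and chunk[12:17].isdigit()


def iter_records_resilient(data):
    """Yield intact raw records; on corruption, skip past the next record terminator."""
    pos = 0
    while pos + 24 <= len(data):
        leader = data[pos:pos + 24]
        if looks_like_leader(leader):
            length = int(leader[:5])
            end = pos + length
            # Accept the record only if its declared end really is a terminator.
            if length > 24 and end <= len(data) and data[end - 1:end] == RECORD_TERMINATOR:
                yield data[pos:end]
                pos = end
                continue
        # Corrupt record: resynchronize just after the next terminator, if any.
        nxt = data.find(RECORD_TERMINATOR, pos)
        if nxt == -1:
            break
        pos = nxt + 1


# A minimal 30-byte record, some garbage, then the same record again.
good = b"00030nam a2200025 a 4500" + b"\x1e" + b"abc\x1e" + RECORD_TERMINATOR
damaged = good + b"garbage!!" + RECORD_TERMINATOR + good
print(len(list(iter_records_resilient(damaged))))
```

Scanning forward to the next terminator costs at most one bad record per corruption, which matches the "fairly resilient in the face of minor data corruption" behavior described above; heavier damage that destroys terminators would need a stronger heuristic.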
Note that dynaXML does not handle MARC record parsing or display. In CDL's system, crossQuery was used to display the records, drawing its data directly from the index.