org.cdlib.xtf.lazyTree
Class SearchTree

Object
  extended by NodeImpl
      extended by ParentNodeImpl
          extended by LazyDocument
              extended by SearchTree
All Implemented Interfaces:
Source, SourceLocator, DocumentInfo, FingerprintedNode, Item, NodeInfo, ValueRepresentation, PersistentTree

public class SearchTree
extends LazyDocument

SearchTree annotates a lazy-loading tree with TextEngine search results. Many careful gyrations are required to load as little as possible of the lazy tree from disk.

This class maintains the illusion that the entire tree has been loaded from disk, carefully searched, each hit annotated, and a list of all the snippets inserted at the top. In reality, this is done on-the-fly as needed, leaving as much as possible on disk.

To use SearchTree, simply call the constructor: SearchTree(Configuration, String, StructuredStore), passing it the key to use for index lookups and the persistent file to load from. Then call the search(QueryProcessor, QueryRequest) method to perform the actual search, and use the tree normally. As you access various parts of the tree, they'll be annotated on the fly.

Author:
Martin Haye

Field Summary
(package private)  CharMap accentMap
          Set of accented chars to remove diacritics from
private static Pattern ampPattern
           
(package private)  int continuesAttrCode
          Name-code for all <continues> attributes
(package private)  DocHit docHit
          Document hit from the text engine, containing snippets for this doc
private static Pattern gtPattern
           
(package private) static int HIT_ELMT_MARKER
          Each hit in the document is marked by a <hit> element.
(package private)  int hitCountAttrCode
          Name-code for all <hitCount> attributes
(package private)  int hitElementCode
          Name-code for all <hit> elements
(package private)  int hitElementFingerprint
          Name fingerprint for <xtf:hit> elements (includes namespace)
(package private)  int hitNumAttrCode
          Name-code for all <hitNum> attributes
(package private)  int[] hitRankToNum
          Mapping from hitsByScore -> hitsByLocation
(package private)  Snippet[] hitsByLocation
          Array of snippets sorted in document order
(package private)  Snippet[] hitsByScore
          Array of snippets sorted by descending score
private static Pattern ltPattern
           
(package private) static int MARKER_BASE
          All the synthetic nodes added in the tree are assigned a node number >= MARKER_BASE
(package private) static int MARKER_RANGE
          There are several kinds of synthetic nodes; each one takes up a range of node numbers of size MARKER_RANGE.
(package private)  int moreElementCode
          Name-code for all <more> elements
(package private)  int nextVirtualNum
          Keeps track of the node number to assign the next virtual node (see VIRTUAL_MARKER for more info.)
(package private)  int nHits
          Total number of text hits within this document
(package private)  WordMap pluralMap
          Set of plural words to change from plural to singular
(package private) static int PREV_SIB_MARKER
          Special node numbers are used to mark an un-loaded sibling so that getNode() can catch them and secretly load the node before anybody notices.
(package private)  int rankAttrCode
          Name-code for all <rank> attributes
(package private)  int scoreAttrCode
          Name-code for all <score> attributes
(package private)  int sectionTypeAttrCode
          Name-code for all <sectionType> attributes
(package private) static int SNIPPET_MARKER
          At the start of the document, the SearchTree adds a synthetic <xtf:snippets> element, and under that creates on demand a <xtf:snippet> element for each snippet.
(package private)  int snippetElementCode
          Name-code for all <snippet> elements
(package private)  int snippetElementFingerprint
          Name fingerprint for the <xtf:snippet> element (includes namespace)
(package private)  int snippetsElementCode
          Name-code for the <snippets> element
(package private)  String sourceKey
          Prefix for this document in the Lucene index
(package private)  Set stopSet
          Set of "stop-words" (i.e. short words like "the", "and", "a", etc.)
(package private)  boolean suppressScores
          True to suppress marking the hits with scores (useful for automated testing where the exact score isn't being tested).
(package private)  int termElementCode
          Name-code for all <term> elements
(package private)  Set termMap
          Set containing all terms used in the query
(package private)  int termMode
          Where to mark terms (all, context only, etc.)
(package private)  SearchElementImpl topSnippetNode
          The top-level <xtf:snippets> element.
(package private)  int totalHitCountAttrCode
          Name-code for all <totalHitCount> attributes
(package private) static int VIRTUAL_MARKER
          Marking a hit in the middle of a string of text requires splitting up real nodes and inserting virtual ones.
(package private)  int xtfFirstHitAttrCode
          Name-code for all <xtf:firstHit> attributes (includes namespace)
(package private)  int xtfHitCountAttrCode
          Name-code for all <xtf:hitCount> attributes (includes namespace)
(package private)  int xtfNamespaceCode
          Namespace code for the XTF namespace
(package private) static String xtfURI
          Snippet, hit, and term elements will all be marked with the XTF namespace, given by this URI: "http://cdlib.org/xtf"
 
Fields inherited from class LazyDocument
allPermanent, attrBuf, attrBytes, attrFile, config, debug, documentNumber, mainStore, maxAttrSize, maxNodeSize, nameNumToCode, namePool, namespaceCode, namespaceParent, NODE_FILE_HEADER_SIZE, nodeBuf, nodeBytes, nodeCache, nodeFile, numberOfNamespaces, numberOfNodes, rootNodeNum, systemIdMap, textFile, usesNamespaces
 
Fields inherited from class ParentNodeImpl
childNum
 
Fields inherited from class NodeImpl
document, nameCode, nextSibNum, NODE_LETTER, nodeNum, parentNum, prevSibNum
 
Fields inherited from interface NodeInfo
ALL_NAMESPACES, EMPTY_NAMESPACE_LIST, IS_DTD_TYPE, IS_NILLED, LOCAL_NAMESPACES, NO_NAMESPACES
 
Fields inherited from interface ValueRepresentation
EMPTY_VALUE_ARRAY
 
Constructor Summary
SearchTree(Configuration config, String sourceKey, StructuredStore treeStore)
          Load the tree from a disk file, and get ready to search it.
 
Method Summary
private  SearchElementImpl addElement(NodeImpl prev, int elNameCode, int nAttribs, boolean addAsChild)
          Create an element as a child or sibling of another node.
private  void addSnippets()
          Adds the top-level <xtf:snippets> element.
private  SearchTextImpl addText(NodeImpl prev, String text, boolean addAsChild)
          Create a text node.
private  void addXTFNamespace()
          Add our namespace to the list of namespaces.
private  NodeImpl breakupText(String text, NodeImpl prev, boolean addAsChild)
          Create the appropriate node(s) for text within a snippet, including elements for any marked <term>s.
protected  NodeImpl checkCache(int num)
          Checks to see if we've already loaded the node corresponding with the given number.
private  SearchElementImpl createElement(int elNameCode, int nAttribs)
          Does the work of creating an element, but doesn't link it into the tree.
protected  NodeImpl createElementNode()
          Create an element node.
(package private)  SearchElement createHitElement(boolean firstForHit, boolean lastForHit, int hitNum, boolean realNotProxy)
          Does the work of creating a "hit" element.
private  SearchElement createSnippetNode(int num, boolean realNotProxy)
          Creates an on-the-fly snippet node.
private  SearchTextImpl createText(String text)
          Does the work of creating a text node, but doesn't link it into the tree.
protected  NodeImpl createTextNode()
          Create a text node.
private  NodeImpl expandText(SearchTextImpl origNode, boolean returnLastNode)
          Annotate a text node with search results.
(package private)  int findFirstHit(int nodeNum)
          Locates the first hit that could conceivably involve this node, that is, the first hit with node number >= 'nodeNum'.
(package private)  int findLastHit(int nodeNum)
          Locates the last hit that could conceivably involve this node, that is, the last hit with node number >= 'nodeNum'.
protected  AxisIterator getAllElements(int fingerprint)
          Get a list of all elements with a given name.
private  SearchElementImpl getHitElement(int hitNum)
          Given a hit number, this method retrieves the synthetic hit node for it.
private  int getNameCode(String name, boolean withNamespace)
          Retrieve the proper name code from the name pool.
 NodeImpl getNode(int num)
          Get a node by its node number.
private  ElementImpl getRootKid()
          Get the top-level element that can actually be modified.
private  void initElement(SearchElement el, int elNameCode, int nAttrs)
          Initialize all the fields of a new element node.
private  void initNode(SearchNode node)
          Performs initialization tasks common to text and element nodes.
private  void linkChild(ParentNodeImpl parent, NodeImpl node)
          Does the work of linking in a new child element or text node.
private  void linkSibling(NodeImpl prev, NodeImpl node)
          Does the work of linking in a new sibling element or text node.
private  void modifyNode(NodeImpl node)
          Prepares a node for modification.
 void pruneUnused()
          DEBUGGING ONLY: Removes parts of the tree that haven't been loaded yet.
 void putIndex(String indexName, HashMap index)
          Writes a disk-based version of an index.
 void search(QueryProcessor processor, QueryRequest origReq)
          Run the search and save the results for annotating the tree.
 void suppressScores(boolean flag)
          Suppresses score attributes on the snippets.
private  String undoEntities(String str)
          Change entities back into normal text (entities are created inside snippets to differentiate them from normal tags.)
 
Methods inherited from class LazyDocument
close, copy, generateId, generateId, getBaseURI, getConfiguration, getDebug, getDocumentNumber, getDocumentRoot, getIndex, getItemType, getLineNumber, getLineNumber, getNamePool, getNextSibling, getNodeKind, getPreviousSibling, getRoot, getSequenceNumber, getSystemId, getSystemId, getTypeAnnotation, getUnparsedEntity, init, isUsingNamespaces, printProfile, putIndex, selectID, setAllPermanent, setDebug, setElementAnnotation, setLineNumber, setLineNumbering, setRootNode, setSystemId, setSystemId
 
Methods inherited from class ParentNodeImpl
enumerateChildren, getFirstChild, getLastChild, getStringValue, getStringValueCS, hasChildNodes, iterateAxis, iterateAxis
 
Methods inherited from class NodeImpl
atomize, compareOrder, equals, getAttributeValue, getColumnNumber, getDeclaredNamespaces, getDisplayName, getFingerprint, getLocalPart, getNameCode, getNextInDocument, getParent, getPrefix, getPreviousInDocument, getPublicId, getTypeAnnotation, getTypedValue, getURI, hashCode, init, isSameNodeInfo, sendNamespaceDeclarations
 
Methods inherited from class Object
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface NodeInfo
atomize, compareOrder, equals, getAttributeValue, getDeclaredNamespaces, getDisplayName, getFingerprint, getLocalPart, getNameCode, getParent, getPrefix, getStringValue, getTypeAnnotation, getURI, hasChildNodes, hashCode, isSameNodeInfo, iterateAxis, iterateAxis, sendNamespaceDeclarations
 
Methods inherited from interface Item
getStringValueCS, getTypedValue
 

Field Detail

sourceKey

String sourceKey
Prefix for this document in the Lucene index


termMap

Set termMap
Set containing all terms used in the query


stopSet

Set stopSet
Set of "stop-words" (i.e. short words like "the", "and", "a", etc.)


pluralMap

WordMap pluralMap
Set of plural words to change from plural to singular


accentMap

CharMap accentMap
Set of accented chars to remove diacritics from


docHit

DocHit docHit
Document hit from the text engine, containing snippets for this doc


nHits

int nHits
Total number of text hits within this document


hitsByScore

Snippet[] hitsByScore
Array of snippets sorted by descending score


hitsByLocation

Snippet[] hitsByLocation
Array of snippets sorted in document order


hitRankToNum

int[] hitRankToNum
Mapping from hitsByScore -> hitsByLocation
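
The relationship between the two orderings and the rank mapping can be sketched as follows. This is only an illustration: the Hit class, its fields, and buildRankToNum are hypothetical stand-ins, not XTF's Snippet class or the code SearchTree actually runs.

```java
import java.util.Arrays;
import java.util.Comparator;

final class HitRankSketch {
    /** Hypothetical stand-in for a snippet: its starting node number and its score. */
    static final class Hit {
        final int startNode;
        final float score;

        Hit(int startNode, float score) {
            this.startNode = startNode;
            this.score = score;
        }
    }

    /**
     * Returns an array where element [rank] is the document-order index
     * of the hit holding that score rank (rank 0 = highest score).
     */
    static int[] buildRankToNum(Hit[] hits) {
        // Document order: ascending by starting node number.
        Hit[] byLocation = hits.clone();
        Arrays.sort(byLocation, Comparator.comparingInt((Hit h) -> h.startNode));

        // Sort the document-order indices by descending score.
        Integer[] order = new Integer[byLocation.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Float.compare(byLocation[b].score, byLocation[a].score));

        int[] rankToNum = new int[order.length];
        for (int rank = 0; rank < order.length; rank++) {
            rankToNum[rank] = order[rank];
        }
        return rankToNum;
    }
}
```

With a mapping like this, a snippet listed by score can be resolved to the hit at the same position in document order.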


termMode

int termMode
Where to mark terms (all, context only, etc.)


suppressScores

boolean suppressScores
True to suppress marking the hits with scores (useful for automated testing where the exact score isn't being tested).


MARKER_BASE

static final int MARKER_BASE
All the synthetic nodes added in the tree are assigned a node number >= MARKER_BASE

See Also:
Constant Field Values

MARKER_RANGE

static final int MARKER_RANGE
There are several kinds of synthetic nodes; each one takes up a range of node numbers of size MARKER_RANGE.

See Also:
Constant Field Values

PREV_SIB_MARKER

static final int PREV_SIB_MARKER
Special node numbers are used to mark an un-loaded sibling so that getNode() can catch them and secretly load the node before anybody notices. These special elements all have node numbers x such that: PREV_SIB_MARKER <= x < PREV_SIB_MARKER+MARKER_RANGE

See Also:
Constant Field Values

HIT_ELMT_MARKER

static final int HIT_ELMT_MARKER
Each hit in the document is marked by a <hit> element. These elements all have node numbers x such that: HIT_ELMT_MARKER <= x < HIT_ELMT_MARKER+MARKER_RANGE

See Also:
Constant Field Values

SNIPPET_MARKER

static final int SNIPPET_MARKER
At the start of the document, the SearchTree adds a synthetic <xtf:snippets> element, and under that creates on demand a <xtf:snippet> element for each snippet. These elements all have node numbers x such that: SNIPPET_MARKER <= x < SNIPPET_MARKER+MARKER_RANGE

See Also:
Constant Field Values

VIRTUAL_MARKER

static final int VIRTUAL_MARKER
Marking a hit in the middle of a string of text requires splitting up real nodes and inserting virtual ones. These virtual nodes all have node numbers x such that: VIRTUAL_MARKER <= x < VIRTUAL_MARKER+MARKER_RANGE

See Also:
Constant Field Values
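
The range scheme described for the four marker constants can be sketched as below. The numeric values and the ordering of the ranges are assumptions made for illustration (this page does not list the constant values), and classify and snippetHitNum are hypothetical helpers rather than SearchTree methods.

```java
final class MarkerRangeSketch {
    // Hypothetical values; the real constants are not shown on this page.
    static final int MARKER_BASE = 1_000_000_000;
    static final int MARKER_RANGE = 100_000_000;

    // Assumed layout: one consecutive range per kind of synthetic node.
    static final int PREV_SIB_MARKER = MARKER_BASE;
    static final int HIT_ELMT_MARKER = PREV_SIB_MARKER + MARKER_RANGE;
    static final int SNIPPET_MARKER = HIT_ELMT_MARKER + MARKER_RANGE;
    static final int VIRTUAL_MARKER = SNIPPET_MARKER + MARKER_RANGE;

    /** True if marker <= num < marker + MARKER_RANGE. */
    static boolean inRange(int num, int marker) {
        return num >= marker && num < marker + MARKER_RANGE;
    }

    /** Names the kind of node a node number refers to. */
    static String classify(int num) {
        if (num < MARKER_BASE) return "real";
        if (inRange(num, PREV_SIB_MARKER)) return "prev-sibling";
        if (inRange(num, HIT_ELMT_MARKER)) return "hit";
        if (inRange(num, SNIPPET_MARKER)) return "snippet";
        if (inRange(num, VIRTUAL_MARKER)) return "virtual";
        return "unknown";
    }

    /** Recovers the hit number from a snippet node number (SNIPPET_MARKER + hit #). */
    static int snippetHitNum(int num) {
        return num - SNIPPET_MARKER;
    }
}
```

A dispatcher such as getNode(int) could use a check of this kind to decide whether a number names a node stored on disk or one that must be synthesized.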

nextVirtualNum

int nextVirtualNum
Keeps track of the node number to assign the next virtual node (see VIRTUAL_MARKER for more info.)


topSnippetNode

SearchElementImpl topSnippetNode
The top-level <xtf:snippets> element.


xtfURI

static final String xtfURI
Snippet, hit, and term elements will all be marked with the XTF namespace, given by this URI: "http://cdlib.org/xtf"

See Also:
Constant Field Values

xtfNamespaceCode

int xtfNamespaceCode
Namespace code for the XTF namespace


hitElementFingerprint

int hitElementFingerprint
Name fingerprint for <xtf:hit> elements (includes namespace)


snippetElementFingerprint

int snippetElementFingerprint
Name fingerprint for the <xtf:snippet> element (includes namespace)


hitElementCode

int hitElementCode
Name-code for all <hit> elements


moreElementCode

int moreElementCode
Name-code for all <more> elements


termElementCode

int termElementCode
Name-code for all <term> elements


snippetElementCode

int snippetElementCode
Name-code for all <snippet> elements


snippetsElementCode

int snippetsElementCode
Name-code for the <snippets> element


xtfHitCountAttrCode

int xtfHitCountAttrCode
Name-code for all <xtf:hitCount> attributes (includes namespace)


xtfFirstHitAttrCode

int xtfFirstHitAttrCode
Name-code for all <xtf:firstHit> attributes (includes namespace)


hitCountAttrCode

int hitCountAttrCode
Name-code for all <hitCount> attributes


totalHitCountAttrCode

int totalHitCountAttrCode
Name-code for all <totalHitCount> attributes


scoreAttrCode

int scoreAttrCode
Name-code for all <score> attributes


rankAttrCode

int rankAttrCode
Name-code for all <rank> attributes


hitNumAttrCode

int hitNumAttrCode
Name-code for all <hitNum> attributes


continuesAttrCode

int continuesAttrCode
Name-code for all <continues> attributes


sectionTypeAttrCode

int sectionTypeAttrCode
Name-code for all <sectionType> attributes


ampPattern

private static final Pattern ampPattern

ltPattern

private static final Pattern ltPattern

gtPattern

private static final Pattern gtPattern
Constructor Detail

SearchTree

public SearchTree(Configuration config,
                  String sourceKey,
                  StructuredStore treeStore)
           throws FileNotFoundException,
                  IOException
Load the tree from a disk file, and get ready to search it. To start the actual search, use the search(QueryProcessor, QueryRequest) method.

Throws:
FileNotFoundException
IOException
Method Detail

getNameCode

private int getNameCode(String name,
                        boolean withNamespace)
Retrieve the proper name code from the name pool.


suppressScores

public void suppressScores(boolean flag)
Suppresses score attributes on the snippets. Generally this is useful when running regressions, since the scoring algorithm changes frequently.


search

public void search(QueryProcessor processor,
                   QueryRequest origReq)
            throws IOException
Run the search and save the results for annotating the tree.

Parameters:
processor - Processor used to run the query
origReq - Query to run
Throws:
IOException - If anything goes wrong reading from the Lucene index or the lazy tree file.

getNode

public NodeImpl getNode(int num)
Get a node by its node number. Handles generating synthetic nodes if necessary.

Overrides:
getNode in class LazyDocument
Parameters:
num - The number of the node to get
Returns:
A node, or null if the number is invalid.

addXTFNamespace

private void addXTFNamespace()
Add our namespace to the list of namespaces.


getRootKid

private ElementImpl getRootKid()
Get the top-level element that can actually be modified.


getHitElement

private SearchElementImpl getHitElement(int hitNum)
Given a hit number, this method retrieves the synthetic hit node for it.


createElementNode

protected NodeImpl createElementNode()
Create an element node. Derived classes can override this to provide their own element implementation.

Overrides:
createElementNode in class LazyDocument

createTextNode

protected NodeImpl createTextNode()
Create a text node. Derived classes can override this to provide their own text implementation.

Overrides:
createTextNode in class LazyDocument

checkCache

protected NodeImpl checkCache(int num)
Checks to see if we've already loaded the node corresponding with the given number. If so, return it, else null.

Overrides:
checkCache in class LazyDocument

expandText

private NodeImpl expandText(SearchTextImpl origNode,
                            boolean returnLastNode)
Annotate a text node with search results.

Parameters:
origNode - The text node as loaded from disk.
returnLastNode - true to return the last added node, else first.
Returns:
The adjusted node.

createHitElement

SearchElement createHitElement(boolean firstForHit,
                               boolean lastForHit,
                               int hitNum,
                               boolean realNotProxy)
Does the work of creating a "hit" element.

Parameters:
firstForHit - true if this is the first element for the hit
lastForHit - true if this is the last element for the hit
hitNum - The hit being referenced
realNotProxy - true to create a real node, else make a proxy.

addElement

private SearchElementImpl addElement(NodeImpl prev,
                                     int elNameCode,
                                     int nAttribs,
                                     boolean addAsChild)
Create an element as a child or sibling of another node.

Parameters:
prev - Node to add the new element to, as a child or as a sibling
elNameCode - Name of the new element
nAttribs - How many attributes it will have
addAsChild - true to add as a child of 'prev', false to add as a sibling.

addText

private SearchTextImpl addText(NodeImpl prev,
                               String text,
                               boolean addAsChild)
Create a text node.

Parameters:
prev - Node to add the new text node to, as a child or as a sibling
text - Initial text string for the new node
addAsChild - true to add as a child of 'prev', false to add as a sibling.

createElement

private SearchElementImpl createElement(int elNameCode,
                                        int nAttribs)
Does the work of creating an element, but doesn't link it into the tree.

Parameters:
elNameCode - The name for the new element
nAttribs - How many attributes it will have.
Returns:
The new element.

initElement

private void initElement(SearchElement el,
                         int elNameCode,
                         int nAttrs)
Initialize all the fields of a new element node.


createText

private SearchTextImpl createText(String text)
Does the work of creating a text node, but doesn't link it into the tree.

Parameters:
text - The initial text for the node
Returns:
The newly created node.

linkSibling

private void linkSibling(NodeImpl prev,
                         NodeImpl node)
Does the work of linking in a new sibling element or text node.


linkChild

private void linkChild(ParentNodeImpl parent,
                       NodeImpl node)
Does the work of linking in a new child element or text node. It will be added as the first child.


initNode

private void initNode(SearchNode node)
Performs initialization tasks common to text and element nodes.


modifyNode

private void modifyNode(NodeImpl node)
Prepares a node for modification. Essentially, makes sure that it will be cached and never reloaded from disk.


addSnippets

private void addSnippets()
Adds the top-level <xtf:snippets> element. If its children are fetched later, they'll be created on the fly.


createSnippetNode

private SearchElement createSnippetNode(int num,
                                        boolean realNotProxy)
Creates an on-the-fly snippet node.

Parameters:
num - The node number (SNIPPET_MARKER + hit #)

undoEntities

private String undoEntities(String str)
Change entities back into normal text (entities are created inside snippets to differentiate them from normal tags.)

Parameters:
str - String to replace entities within
Returns:
Modified string (or same string if no entities found).
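
A minimal sketch of this kind of entity reversal follows. It assumes the three entities involved are &lt;, &gt; and &amp;, which is suggested by the ltPattern, gtPattern and ampPattern fields but not confirmed here; the pattern definitions and the replacement order are illustrative choices, not the actual implementation.

```java
import java.util.regex.Pattern;

final class UndoEntitiesSketch {
    // Assumed patterns; the real ltPattern/gtPattern/ampPattern are not shown on this page.
    private static final Pattern LT = Pattern.compile("&lt;");
    private static final Pattern GT = Pattern.compile("&gt;");
    private static final Pattern AMP = Pattern.compile("&amp;");

    /** Replaces the three entities with their literal characters. */
    static String undoEntities(String str) {
        // Fast path: without an ampersand there can be no entity, so return the same string.
        if (str.indexOf('&') < 0) {
            return str;
        }
        str = LT.matcher(str).replaceAll("<");
        str = GT.matcher(str).replaceAll(">");
        // Ampersand goes last so that "&amp;lt;" becomes "&lt;" rather than "<".
        str = AMP.matcher(str).replaceAll("&");
        return str;
    }
}
```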

breakupText

private NodeImpl breakupText(String text,
                             NodeImpl prev,
                             boolean addAsChild)
Create the appropriate node(s) for text within a snippet, including elements for any marked <term>s.

Parameters:
text - Text to process, containing embedded <term> markup.
prev - Node to add to
addAsChild - true to add to prev as a child, else as sibling.
Returns:
Last node added.
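
The splitting step can be pictured with the sketch below, which cuts a string into alternating plain-text and term segments. It assumes terms arrive wrapped as <term>...</term>; the exact markup, the Segment class, and the breakup method are hypothetical, and the real method builds tree nodes rather than a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BreakupTextSketch {
    /** Hypothetical result piece: either plain text or the content of a marked term. */
    static final class Segment {
        final boolean isTerm;
        final String text;

        Segment(boolean isTerm, String text) {
            this.isTerm = isTerm;
            this.text = text;
        }
    }

    // Assumed term markup; reluctant match so adjacent terms stay separate.
    private static final Pattern TERM = Pattern.compile("<term>(.*?)</term>");

    /** Splits text into segments, in order, keeping term contents apart from plain text. */
    static List<Segment> breakup(String text) {
        List<Segment> out = new ArrayList<>();
        Matcher m = TERM.matcher(text);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                out.add(new Segment(false, text.substring(pos, m.start())));
            }
            out.add(new Segment(true, m.group(1)));
            pos = m.end();
        }
        if (pos < text.length()) {
            out.add(new Segment(false, text.substring(pos)));
        }
        return out;
    }
}
```

In the tree, each plain segment would correspond to a text node and each term segment to a <term> element holding a text node.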

findFirstHit

int findFirstHit(int nodeNum)
Locates the first hit that could conceivably involve this node, that is, the first hit with node number >= 'nodeNum'.

Parameters:
nodeNum - The node of interest.
Returns:
Index of the hit (might be == nHits, meaning no hit could apply.)
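
One way to meet this contract is a lower-bound binary search over hit node numbers kept in document order, sketched below. Whether SearchTree uses a binary search is not stated on this page, and the hitNodeNums array is a hypothetical simplification of hitsByLocation.

```java
final class FirstHitSketch {
    /**
     * Returns the index of the first entry >= nodeNum in an ascending array,
     * or the array length if every entry is smaller (no hit could apply).
     */
    static int findFirstHit(int[] hitNodeNums, int nodeNum) {
        int lo = 0;
        int hi = hitNodeNums.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (hitNodeNums[mid] < nodeNum) {
                lo = mid + 1; // everything up to mid is too small
            } else {
                hi = mid;     // mid qualifies; look for an earlier match
            }
        }
        return lo;
    }
}
```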

findLastHit

int findLastHit(int nodeNum)
Locates the last hit that could conceivably involve this node, that is, the last hit with node number >= 'nodeNum'.

Parameters:
nodeNum - Node number of the element in question.
Returns:
Index of the hit (might be == nHits, meaning no hit could apply.)

putIndex

public void putIndex(String indexName,
                     HashMap index)
              throws IOException
Writes a disk-based version of an index. Use getIndex() later to read it. This method is overridden to ensure that no virtual nodes ever get written to a disk index.

Parameters:
indexName - Uniquely computed name
index - HashMap mapping String -> ArrayList[NodeImpl]
Throws:
IOException

getAllElements

protected AxisIterator getAllElements(int fingerprint)
Get a list of all elements with a given name. This is implemented as a memo function: the first time it is called for a particular element type, it remembers the result for next time. It is overridden here to handle the special case where "xtf:hit" or "xtf:snippet" is specified.

Overrides:
getAllElements in class LazyDocument
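
The memo-function behavior described above can be illustrated with a small cache keyed by fingerprint. Everything in this sketch is hypothetical: element lists are represented as lists of strings, the scanner callback stands in for walking the tree, and the scans counter exists only to show that the expensive step runs once per fingerprint.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntFunction;

final class MemoSketch {
    private final Map<Integer, List<String>> cache = new HashMap<>();
    private final IntFunction<List<String>> scanner;

    /** Number of times the underlying scan actually ran. */
    int scans = 0;

    MemoSketch(IntFunction<List<String>> scanner) {
        this.scanner = scanner;
    }

    /** Returns the remembered result for a fingerprint, computing it on first use. */
    List<String> getAllElements(int fingerprint) {
        List<String> cached = cache.get(fingerprint);
        if (cached == null) {
            scans++;
            cached = scanner.apply(fingerprint);
            cache.put(fingerprint, cached);
        }
        return cached;
    }
}
```

An override for special names such as "xtf:hit" would check the fingerprint first and build the synthetic result itself, falling back to the memoized path otherwise.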

pruneUnused

public void pruneUnused()
DEBUGGING ONLY: Removes parts of the tree that haven't been loaded yet. This can be useful to view the subset of the tree that has actually been accessed. Note that to be useful, LazyDocument.setAllPermanent(boolean) should be called before accessing the tree to ensure that all nodes referenced are kept in RAM.