public class SearchTree extends LazyDocument
SearchTree annotates a lazy-loading tree with TextEngine search results. Many careful gyrations are required to load as little as possible of the lazy tree from disk.
This class maintains the illusion that the entire tree has been loaded from disk, carefully searched, each hit annotated, and a list of all the snippets inserted at the top. In reality, this is done on-the-fly as needed, leaving as much as possible on disk.
To use SearchTree, simply call the constructor, SearchTree(Configuration, String, StructuredStore), passing it the key to use for index lookups and the persistent file to load from. Then call the search(QueryProcessor, QueryRequest) method to perform the actual search, and use the tree normally. As you access various parts of the tree, they'll be annotated on the fly.
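The "annotate on access" idea described above can be illustrated with a small, self-contained sketch. This is not the SearchTree implementation: the class LazyAnnotatedTreeSketch, its in-memory "disk" map, and the loads counter are all hypothetical stand-ins, chosen only to show that running a search records hits without loading anything, and that nodes are loaded and annotated the first time they are requested.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a lazily loaded, search-annotated tree. */
class LazyAnnotatedTreeSketch {
    private final Map<Integer, String> disk;                    // pretend on-disk node text
    private final Map<Integer, String> cache = new HashMap<>(); // nodes loaded so far
    private Set<Integer> hitNodes = Set.of();                   // set by search()
    int loads = 0;                                              // number of simulated disk reads

    LazyAnnotatedTreeSketch(Map<Integer, String> disk) {
        this.disk = disk;
    }

    /** Records which nodes contain hits. No node is loaded here. */
    void search(Set<Integer> nodesWithHits) {
        hitNodes = nodesWithHits;
        cache.clear(); // annotations made for an earlier search would be stale
    }

    /** Loads a node on first access and annotates it if it contains a hit. */
    String getNode(int num) {
        String cached = cache.get(num);
        if (cached != null) {
            return cached;
        }
        String text = disk.get(num);
        if (text == null) {
            throw new IllegalArgumentException("No such node: " + num);
        }
        loads++;
        String node = hitNodes.contains(num) ? "<hit>" + text + "</hit>" : text;
        cache.put(num, node);
        return node;
    }

    public static void main(String[] args) {
        LazyAnnotatedTreeSketch tree =
            new LazyAnnotatedTreeSketch(Map.of(1, "alpha", 2, "beta", 3, "gamma"));
        tree.search(Set.of(2));
        System.out.println(tree.getNode(2));
        System.out.println(tree.getNode(1));
        System.out.println("loads = " + tree.loads);
    }
}
```

In this sketch node 3 is never read, which mirrors the goal of leaving as much of the tree as possible on disk.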
Modifier and Type | Field and Description |
---|---|
(package private) CharMap | accentMap - Set of accented chars to remove diacritics from |
private static Pattern | ampPattern |
(package private) int | continuesAttrCode - Name-code for all <continues> attributes |
private static Pattern | gtPattern |
(package private) static int | HIT_ELMT_MARKER - Each hit in the document is marked by a <hit> element. |
(package private) int | hitCountAttrCode - Name-code for all <hitCount> attributes |
(package private) int | hitElementCode - Name-code for all <hit> elements |
(package private) int | hitElementFingerprint - Name fingerprint for <xtf:hit> elements (includes namespace) |
(package private) int | hitNumAttrCode - Name-code for all <hitNum> attributes |
(package private) int[] | hitRankToNum - Mapping from hitsByScore -> hitsByLocation |
(package private) Snippet[] | hitsByLocation - Array of snippets sorted in document order |
(package private) Snippet[] | hitsByScore - Array of snippets sorted by descending score |
(package private) DocHit[] | hitsToDocHit - Original DocHit and number within it for each Snippet |
(package private) int[] | hitsToDocHitNum |
private static Pattern | ltPattern |
(package private) static int | MARKER_BASE - All the synthetic nodes added in the tree are assigned a node number >= MARKER_BASE |
(package private) static int | MARKER_RANGE - There are several kinds of synthetic nodes; each one takes up a range of node numbers of size MARKER_RANGE. |
(package private) int | moreElementCode - Name-code for all <more> elements |
(package private) int | nextVirtualNum - Keeps track of the node number to assign the next virtual node (see VIRTUAL_MARKER for more info). |
(package private) int | nHits - Number of hit snippets within this document |
(package private) WordMap | pluralMap - Set of plural words to change from plural to singular |
(package private) static int | PREV_SIB_MARKER - Special node numbers are used to mark an un-loaded sibling so that getNode() can catch them and secretly load the node before anybody notices. |
(package private) int | rankAttrCode - Name-code for all <rank> attributes |
(package private) int | scoreAttrCode - Name-code for all <score> attributes |
(package private) int | sectionTypeAttrCode - Name-code for all <sectionType> attributes |
(package private) static int | SNIPPET_MARKER - At the start of the document, the SearchTree adds a synthetic <xtf:snippets> element, and under that creates on demand a <xtf:snippet> element for each snippet. |
(package private) int | snippetElementCode - Name-code for all <snippet> elements |
(package private) int | snippetElementFingerprint - Name fingerprint for the <xtf:snippet> element (includes namespace) |
(package private) int | snippetsElementCode - Name-code for the <snippets> element |
(package private) String | sourceKey - Prefix for this document in the Lucene index |
(package private) Set | stopSet - Set of "stop-words" (i.e. short words like "the", "and", "a", etc.) |
(package private) int | subDocumentAttrCode - Name-code for all <subDocument> attributes |
(package private) boolean | suppressScores - True to suppress marking the hits with scores (useful for automated testing where the exact score isn't being tested). |
(package private) int | termElementCode - Name-code for all <term> elements |
(package private) Set | termMap - Map containing all terms used in the query |
(package private) int | termMode - Where to mark terms (all, context only, etc.) |
(package private) SearchElementImpl | topSnippetNode - The top-level <xtf:snippet> element. |
(package private) int | totalHitCountAttrCode - Name-code for all <totalHitCount> attributes |
(package private) int | totalHits - Total number of hits (might be greater than the number of snippets) |
(package private) static int | VIRTUAL_MARKER - Marking a hit in the middle of a string of text requires splitting up real nodes and inserting virtual ones. |
(package private) int | xtfFirstHitAttrCode - Name-code for all <xtf:firstHit> attributes (includes namespace) |
(package private) int | xtfHitCountAttrCode - Name-code for all <xtf:hitCount> attributes (includes namespace) |
(package private) int | xtfNamespaceCode - Namespace code for the XTF namespace |
(package private) static String | xtfURI - Snippet, hit, and term elements will all be marked with the XTF namespace, given by this URI: "http://cdlib.org/xtf" |
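The marker fields above describe a numbering scheme in which synthetic nodes receive numbers at or above MARKER_BASE, and each kind of synthetic node occupies its own range of size MARKER_RANGE. The sketch below shows one way such a scheme could be decoded. The numeric values and the order of the ranges are assumptions made for illustration (this page does not state them), and kindOf and offsetOf are hypothetical helpers, not SearchTree methods.

```java
/** Illustrative decoding of a "marker range" node-numbering scheme. */
class MarkerRangesSketch {
    // Assumed values; the real constants are not given on this page.
    static final int MARKER_BASE = 1_000_000_000;
    static final int MARKER_RANGE = 100_000_000;

    // Assumed layout: each kind of synthetic node gets the next consecutive range.
    static final int PREV_SIB_MARKER = MARKER_BASE;
    static final int HIT_ELMT_MARKER = MARKER_BASE + MARKER_RANGE;
    static final int SNIPPET_MARKER = MARKER_BASE + 2 * MARKER_RANGE;
    static final int VIRTUAL_MARKER = MARKER_BASE + 3 * MARKER_RANGE;

    /** Classifies a node number as a real node or one kind of synthetic node. */
    static String kindOf(int nodeNum) {
        if (nodeNum < MARKER_BASE) {
            return "real";
        }
        switch ((nodeNum - MARKER_BASE) / MARKER_RANGE) {
            case 0: return "prev-sibling";
            case 1: return "hit";
            case 2: return "snippet";
            case 3: return "virtual";
            default: return "unknown";
        }
    }

    /** Returns the position of a synthetic node within its own range. */
    static int offsetOf(int nodeNum) {
        if (nodeNum < MARKER_BASE) {
            throw new IllegalArgumentException("Not a synthetic node: " + nodeNum);
        }
        return (nodeNum - MARKER_BASE) % MARKER_RANGE;
    }

    public static void main(String[] args) {
        int snippetNode = SNIPPET_MARKER + 7; // "SNIPPET_MARKER + hit #"
        System.out.println(kindOf(snippetNode) + " #" + offsetOf(snippetNode));
        System.out.println(kindOf(42));
    }
}
```

Keeping the ranges disjoint lets a single integer comparison decide whether a requested node must be read from disk or synthesized.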
allPermanent, attrBuf, attrBytes, attrFile, config, debug, documentNumber, mainStore, maxAttrSize, maxNodeSize, nameNumToCode, namePool, namespaceCode, namespaceParent, NODE_FILE_HEADER_SIZE, nodeBuf, nodeBytes, nodeCache, nodeFile, numberOfNamespaces, numberOfNodes, rootNodeNum, systemIdMap, textFile, usesNamespaces
childNum
document, nameCode, nextSibNum, NODE_LETTER, nodeNum, parentNum, prevSibNum
Constructor and Description |
---|
SearchTree(Configuration config, String sourceKey, StructuredStore treeStore) - Load the tree from a disk file, and get ready to search it. |
Modifier and Type | Method and Description |
---|---|
private SearchElementImpl | addElement(NodeImpl prev, int elNameCode, int nAttribs, boolean addAsChild) - Create an element as a child or sibling of another node. |
private void | addSnippets() - Adds the top-level <xtf:snippets> element. |
private SearchTextImpl | addText(NodeImpl prev, String text, boolean addAsChild) - Create a text node. |
private void | addXTFNamespace() - Add our namespace to the list of namespaces. |
private NodeImpl | breakupText(String text, NodeImpl prev, boolean addAsChild) - Create the appropriate node(s) for text within a snippet, including elements for any marked <term>s. |
protected NodeImpl | checkCache(int num) - Checks to see if we've already loaded the node corresponding with the given number. |
private SearchElementImpl | createElement(int elNameCode, int nAttribs) - Does the work of creating an element, but doesn't link it into the tree. |
protected NodeImpl | createElementNode() - Create an element node. |
(package private) SearchElement | createHitElement(boolean firstForHit, boolean lastForHit, int hitNum, boolean realNotProxy) - Does the work of creating a "hit" element. |
private SearchElement | createSnippetNode(int num, boolean realNotProxy) - Creates an on-the-fly snippet node. |
private SearchTextImpl | createText(String text) - Does the work of creating a text node, but doesn't link it into the tree. |
protected NodeImpl | createTextNode() - Create a text node. |
private NodeImpl | expandText(SearchTextImpl origNode, boolean returnLastNode) - Annotate a text node with search results. |
(package private) int | findFirstHit(int nodeNum) - Locates the first hit that could conceivably involve this node, that is, the first hit with node number >= 'nodeNum'. |
(package private) int | findLastHit(int nodeNum) - Locates the last hit that could conceivably involve this node, that is, the last hit with node number >= 'nodeNum'. |
protected AxisIterator | getAllElements(int fingerprint) - Get a list of all elements with a given name. |
private SearchElementImpl | getHitElement(int hitNum) - Given a hit number, this method retrieves the synthetic hit node for it. |
private int | getNameCode(String name, boolean withNamespace) - Retrieve the proper name code from the name pool. |
NodeImpl | getNode(int num) - Get a node by its node number. |
private ElementImpl | getRootKid() - Get the top-level element that can actually be modified. |
int | getTotalHits() |
private void | initElement(SearchElement el, int elNameCode, int nAttrs) - Initialize all the fields of a new element node. |
private void | initNode(SearchNode node) - Performs initialization tasks common to text and element nodes. |
private void | linkChild(ParentNodeImpl parent, NodeImpl node) - Does the work of linking in a new child element or text node. |
private void | linkSibling(NodeImpl prev, NodeImpl node) - Does the work of linking in a new sibling element or text node. |
private void | modifyNode(NodeImpl node) - Prepares a node for modification. |
void | pruneUnused() - DEBUGGING ONLY: Removes parts of the tree that haven't been loaded yet. |
void | putIndex(String indexName, HashMap index) - Writes a disk-based version of an index. |
void | search(QueryProcessor processor, QueryRequest origReq) - Run the search and save the results for annotating the tree. |
void | suppressScores(boolean flag) - Suppresses score attributes on the snippets. |
private String | undoEntities(String str) - Change entities back into normal text (entities are created inside snippets to differentiate them from normal tags.) |
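The findFirstHit description above ("the first hit with node number >= 'nodeNum'") reads like a lower-bound search over hits kept in document order. The following sketch shows that technique on a plain sorted int array. It is an assumption that the real method works this way; the array of start node numbers and the "return the array length when nothing matches" convention are hypothetical choices for this example.

```java
/** Lower-bound binary search over hit start nodes sorted in document order. */
class FirstHitSketch {
    /**
     * Returns the index of the first entry that is >= nodeNum, or
     * hitStartNodes.length if every entry is smaller.
     */
    static int findFirstHit(int[] hitStartNodes, int nodeNum) {
        int low = 0;
        int high = hitStartNodes.length;
        while (low < high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids int overflow
            if (hitStartNodes[mid] < nodeNum) {
                low = mid + 1;  // answer lies strictly to the right of mid
            } else {
                high = mid;     // mid itself may be the answer
            }
        }
        return low;
    }

    public static void main(String[] args) {
        int[] starts = {5, 12, 12, 40};
        System.out.println(findFirstHit(starts, 12));
        System.out.println(findFirstHit(starts, 41));
    }
}
```

A search of this shape touches only O(log n) entries per node, which suits a tree that is annotated one node at a time.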
close, copy, generateId, generateId, getBaseURI, getConfiguration, getDebug, getDocumentNumber, getDocumentRoot, getIndex, getItemType, getLineNumber, getLineNumber, getNamePool, getNextSibling, getNodeKind, getPreviousSibling, getRoot, getSequenceNumber, getSystemId, getSystemId, getTypeAnnotation, getUnparsedEntity, init, isUsingNamespaces, printProfile, putIndex, selectID, setAllPermanent, setDebug, setElementAnnotation, setLineNumber, setLineNumbering, setRootNode, setSystemId, setSystemId
enumerateChildren, getFirstChild, getLastChild, getStringValue, getStringValueCS, hasChildNodes, iterateAxis, iterateAxis
atomize, compareOrder, equals, getAttributeValue, getColumnNumber, getDeclaredNamespaces, getDisplayName, getFingerprint, getLocalPart, getNameCode, getNextInDocument, getParent, getPrefix, getPreviousInDocument, getPublicId, getTypeAnnotation, getTypedValue, getURI, hashCode, init, isSameNodeInfo, sendNamespaceDeclarations
clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait
atomize, compareOrder, equals, getAttributeValue, getDeclaredNamespaces, getDisplayName, getFingerprint, getLocalPart, getNameCode, getParent, getPrefix, getStringValue, getTypeAnnotation, getURI, hasChildNodes, hashCode, isSameNodeInfo, iterateAxis, iterateAxis, sendNamespaceDeclarations
String sourceKey
Set termMap
Set stopSet
WordMap pluralMap
CharMap accentMap
int totalHits
int nHits
Snippet[] hitsByScore
DocHit[] hitsToDocHit
int[] hitsToDocHitNum
Snippet[] hitsByLocation
int[] hitRankToNum
int termMode
boolean suppressScores
static final int MARKER_BASE
static final int MARKER_RANGE
static final int PREV_SIB_MARKER
static final int HIT_ELMT_MARKER
static final int SNIPPET_MARKER
static final int VIRTUAL_MARKER
int nextVirtualNum
SearchElementImpl topSnippetNode
static final String xtfURI
int xtfNamespaceCode
int hitElementFingerprint
int snippetElementFingerprint
int hitElementCode
int moreElementCode
int termElementCode
int snippetElementCode
int snippetsElementCode
int xtfHitCountAttrCode
int xtfFirstHitAttrCode
int hitCountAttrCode
int totalHitCountAttrCode
int scoreAttrCode
int rankAttrCode
int hitNumAttrCode
int continuesAttrCode
int sectionTypeAttrCode
int subDocumentAttrCode
private static final Pattern ampPattern
private static final Pattern ltPattern
private static final Pattern gtPattern
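The three Pattern fields above have no description on this page. Going by their names and the undoEntities(String) summary, one plausible use is turning the &amp;amp;, &amp;lt; and &amp;gt; entities inside snippet text back into plain characters. The sketch below works under that assumption; the regular expressions and the replacement order are guesses made for illustration, not the source's actual definitions.

```java
import java.util.regex.Pattern;

/** Illustrative entity decoding with precompiled patterns. */
class UndoEntitiesSketch {
    // Assumed definitions; the real patterns are not shown on this page.
    private static final Pattern ampPattern = Pattern.compile("&amp;");
    private static final Pattern ltPattern = Pattern.compile("&lt;");
    private static final Pattern gtPattern = Pattern.compile("&gt;");

    /** Replaces &lt;, &gt; and &amp; entities with the characters they stand for. */
    static String undoEntities(String str) {
        if (str.indexOf('&') < 0) {
            return str; // nothing to decode
        }
        str = ltPattern.matcher(str).replaceAll("<");
        str = gtPattern.matcher(str).replaceAll(">");
        // Decode &amp; last so that "&amp;lt;" becomes "&lt;" rather than "<".
        str = ampPattern.matcher(str).replaceAll("&");
        return str;
    }

    public static void main(String[] args) {
        System.out.println(undoEntities("a &lt;b&gt; &amp; c"));
    }
}
```

Compiling the patterns once as static fields avoids recompiling them for every snippet.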
public SearchTree(Configuration config, String sourceKey, StructuredStore treeStore) throws FileNotFoundException, IOException
Load the tree from a disk file, and get ready to search it. To run the actual search, use the search(QueryProcessor, QueryRequest) method.
FileNotFoundException
IOException
private int getNameCode(String name, boolean withNamespace)
public void suppressScores(boolean flag)
public void search(QueryProcessor processor, QueryRequest origReq) throws IOException
processor - Processor used to run the query
origReq - Query to run
IOException - If anything goes wrong reading from the Lucene index or the lazy tree file.
public NodeImpl getNode(int num)
getNode in class LazyDocument
num - The number of the node to get
private void addXTFNamespace()
private ElementImpl getRootKid()
private SearchElementImpl getHitElement(int hitNum)
protected NodeImpl createElementNode()
createElementNode in class LazyDocument
protected NodeImpl createTextNode()
createTextNode in class LazyDocument
protected NodeImpl checkCache(int num)
checkCache in class LazyDocument
private NodeImpl expandText(SearchTextImpl origNode, boolean returnLastNode)
origNode - The text node as loaded from disk.
returnLastNode - true to return the last added node, else first.
SearchElement createHitElement(boolean firstForHit, boolean lastForHit, int hitNum, boolean realNotProxy)
firstForHit - true if this is the first element for the hit
lastForHit - true if this is the last element for the hit
hitNum - The hit being referenced
realNotProxy - true to create a real node, else make a proxy.
private SearchElementImpl addElement(NodeImpl prev, int elNameCode, int nAttribs, boolean addAsChild)
prev - Node to add sibling to
elNameCode - Name of the new element
nAttribs - How many attributes it will have
addAsChild - true to add as a child of 'prev', false to add as a sibling.
private SearchTextImpl addText(NodeImpl prev, String text, boolean addAsChild)
prev - Node to add sibling to
text - Initial text string for the new node
addAsChild - true to add as a child of 'prev', false to add as a sibling.
private SearchElementImpl createElement(int elNameCode, int nAttribs)
elNameCode - The name for the new element
nAttribs - How many attributes it will have.
private void initElement(SearchElement el, int elNameCode, int nAttrs)
private SearchTextImpl createText(String text)
text - The initial text for the node
private void linkSibling(NodeImpl prev, NodeImpl node)
private void linkChild(ParentNodeImpl parent, NodeImpl node)
private void initNode(SearchNode node)
private void modifyNode(NodeImpl node)
private void addSnippets()
private SearchElement createSnippetNode(int num, boolean realNotProxy)
num - The node number (SNIPPET_MARKER + hit #)
private String undoEntities(String str)
str - String to replace entities within
private NodeImpl breakupText(String text, NodeImpl prev, boolean addAsChild)
text - Text to process, with "<term>" stuff inside it.
prev - Node to add to
addAsChild - true to add to prev as a child, else as sibling.
int findFirstHit(int nodeNum)
nodeNum - The node of interest.
int findLastHit(int nodeNum)
nodeNum - Node number of the element in question.
public void putIndex(String indexName, HashMap index) throws IOException
indexName - Uniquely computed name
index - HashMap mapping String -> ArrayList[NodeImpl]
IOException
protected AxisIterator getAllElements(int fingerprint)
getAllElements in class LazyDocument
public void pruneUnused()
DEBUGGING ONLY: Removes parts of the tree that haven't been loaded yet. LazyDocument.setAllPermanent(boolean) should be called before accessing the tree to ensure that all nodes referenced are kept in RAM.
public int getTotalHits()
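The field summary describes hitsByScore (descending score), hitsByLocation (document order), and hitRankToNum (the mapping from the first to the second). The sketch below shows how three such arrays could relate to one another. It is only an illustration: the Hit class is a hypothetical stand-in for Snippet with just a start node and a score, and the helper method names are invented for this example.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.Map;

/** Illustrative relationship between score order and document order. */
class HitOrderingSketch {
    /** Hypothetical minimal stand-in for a snippet. */
    static final class Hit {
        final int startNode;
        final float score;

        Hit(int startNode, float score) {
            this.startNode = startNode;
            this.score = score;
        }
    }

    /** Returns a copy of the hits sorted in document order. */
    static Hit[] byLocation(Hit[] hits) {
        Hit[] sorted = hits.clone();
        Arrays.sort(sorted, Comparator.comparingInt((Hit h) -> h.startNode));
        return sorted;
    }

    /** Returns a copy of the hits sorted by descending score. */
    static Hit[] byScore(Hit[] hits) {
        Hit[] sorted = hits.clone();
        Arrays.sort(sorted, Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return sorted;
    }

    /** For each score rank, finds the index of the same hit in document order. */
    static int[] rankToNum(Hit[] hitsByScore, Hit[] hitsByLocation) {
        Map<Hit, Integer> numOf = new IdentityHashMap<>();
        for (int num = 0; num < hitsByLocation.length; num++) {
            numOf.put(hitsByLocation[num], num);
        }
        int[] result = new int[hitsByScore.length];
        for (int rank = 0; rank < hitsByScore.length; rank++) {
            result[rank] = numOf.get(hitsByScore[rank]);
        }
        return result;
    }

    public static void main(String[] args) {
        Hit[] hits = {new Hit(30, 0.9f), new Hit(10, 0.5f), new Hit(20, 0.7f)};
        int[] mapping = rankToNum(byScore(hits), byLocation(hits));
        System.out.println(Arrays.toString(mapping));
    }
}
```

With a mapping like this, a snippet listed by rank can be tied back to the hit at the corresponding position in the document.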