public class ChunkSource
extends Object
Modifier and Type | Field and Description |
---|---|
protected Analyzer |
analyzer
Analyzer to use for tokenizing the text
|
protected int |
chunkBump
Number of words per chunk minus the overlap
|
protected LinkedList |
chunkCache
Cache of recently loaded chunks
|
protected int |
chunkCacheSize
Maximum number of chunks to cache
|
protected int |
chunkOverlap
Number of words one chunk overlaps with the next
|
protected int |
chunkSize
Max number of words per chunk
|
protected DocNumMap |
docNumMap
Map of document to chunk numbers
|
protected String |
field
Field to read from the chunks
|
protected int |
firstChunk
First chunk in the document
|
protected int |
lastChunk
Last chunk in the document
|
protected int |
mainDocNum
The main document number
|
protected IndexReader |
reader
Reader to load chunk text from
|
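The relationship between the chunkSize, chunkOverlap and chunkBump fields can be illustrated with a small sketch. It is not taken from the ChunkSource source code: the class name ChunkWindowSketch and the methods startWord and endWord are hypothetical. It also assumes that each chunk begins exactly chunkBump words after the previous one, counted from firstChunk.

```java
// Hypothetical helper, not part of ChunkSource: it only illustrates how the
// chunkSize, chunkOverlap, chunkBump and firstChunk fields could relate.
final class ChunkWindowSketch {
    final int chunkSize;    // max number of words per chunk
    final int chunkOverlap; // number of words one chunk overlaps with the next
    final int chunkBump;    // number of words per chunk minus the overlap
    final int firstChunk;   // first chunk in the document

    ChunkWindowSketch(int chunkSize, int chunkOverlap, int firstChunk) {
        if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= chunkOverlap < chunkSize");
        }
        this.chunkSize = chunkSize;
        this.chunkOverlap = chunkOverlap;
        this.chunkBump = chunkSize - chunkOverlap;
        this.firstChunk = firstChunk;
    }

    // Assumption: each chunk starts chunkBump words after the previous one,
    // so the first chunk of the document starts at word 0.
    int startWord(int chunkNum) {
        return (chunkNum - firstChunk) * chunkBump;
    }

    // Exclusive end position of a full-sized chunk.
    int endWord(int chunkNum) {
        return startWord(chunkNum) + chunkSize;
    }
}
```

Under these assumptions, a chunkSize of 200 and a chunkOverlap of 20 give a chunkBump of 180. With firstChunk equal to 5, chunk 5 covers words 0 up to 200 and chunk 6 starts at word 180, so the two share 20 words.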
Constructor and Description |
---|
ChunkSource(IndexReader reader,
DocNumMap docNumMap,
int mainDocNum,
String field,
Analyzer analyzer)
Construct the iterator and read in starting text from the given
chunk.
|
Modifier and Type | Method and Description |
---|---|
protected Chunk |
createChunkTokens(int chunkNum)
Create a new storage place for chunk tokens (derived classes may
wish to override)
|
int |
getChunkOverlap()
Retrieve the number of words one chunk overlaps with the next
|
int |
getChunkSize()
Retrieve the max number of words per chunk
|
boolean |
inMainDoc(int chunkNum)
Check if the given chunk is contained within the main document for this
chunk source.
|
Chunk |
loadChunk(int chunkNum)
Read in and tokenize a chunk.
|
protected void |
loadText(int chunkNum,
Chunk chunk)
Read the text for the given chunk (derived classes may
wish to override)
|
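The following sketch shows one way inMainDoc and loadChunk might interact with the chunkCache and chunkCacheSize fields. It is an illustration only, not the actual ChunkSource implementation. The class ChunkCacheSketch, the nested CachedChunk type and the loads counter are hypothetical, and so are the move-to-front eviction policy and the inclusive firstChunk..lastChunk range. Reading and tokenizing the chunk is replaced by a placeholder string.

```java
import java.util.Iterator;
import java.util.LinkedList;

// Hypothetical stand-in for ChunkSource; it keeps only the cache-related state.
final class ChunkCacheSketch {
    // Hypothetical stand-in for the Chunk class.
    static final class CachedChunk {
        final int chunkNum;
        final String text;
        CachedChunk(int chunkNum, String text) {
            this.chunkNum = chunkNum;
            this.text = text;
        }
    }

    private final LinkedList<CachedChunk> chunkCache = new LinkedList<>();
    private final int chunkCacheSize; // max number of chunks to cache
    private final int firstChunk;     // first chunk in the document
    private final int lastChunk;      // last chunk in the document
    int loads = 0;                    // counts cache misses, for illustration only

    ChunkCacheSketch(int chunkCacheSize, int firstChunk, int lastChunk) {
        this.chunkCacheSize = chunkCacheSize;
        this.firstChunk = firstChunk;
        this.lastChunk = lastChunk;
    }

    // Assumption: the main document spans firstChunk..lastChunk inclusive.
    boolean inMainDoc(int chunkNum) {
        return chunkNum >= firstChunk && chunkNum <= lastChunk;
    }

    CachedChunk loadChunk(int chunkNum) {
        // Cache hit: move the chunk to the front and return it.
        Iterator<CachedChunk> it = chunkCache.iterator();
        while (it.hasNext()) {
            CachedChunk c = it.next();
            if (c.chunkNum == chunkNum) {
                it.remove();
                chunkCache.addFirst(c);
                return c;
            }
        }
        // Cache miss: a real implementation would read and tokenize the
        // chunk text here; this sketch stores a placeholder instead.
        loads++;
        CachedChunk fresh = new CachedChunk(chunkNum, "text of chunk " + chunkNum);
        chunkCache.addFirst(fresh);
        // Drop the least recently used chunks once the cache is over capacity.
        while (chunkCache.size() > chunkCacheSize) {
            chunkCache.removeLast();
        }
        return fresh;
    }
}
```

A linked list suits this pattern because removing from the middle (through the iterator) and adding to the front are both cheap, and the linear scan stays short as long as chunkCacheSize is small.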
protected IndexReader reader
protected DocNumMap docNumMap
protected int mainDocNum
protected int chunkSize
protected int chunkOverlap
protected int chunkBump
protected int firstChunk
protected int lastChunk
protected String field
protected Analyzer analyzer
protected LinkedList chunkCache
protected int chunkCacheSize
public ChunkSource(IndexReader reader, DocNumMap docNumMap, int mainDocNum, String field, Analyzer analyzer)
reader - where to read the chunks from
docNumMap - provides a mapping from main document number to chunk numbers
mainDocNum - is the document ID of the main doc
field - is the name of the field to read in
analyzer - will be used to tokenize the stored field contents
protected Chunk createChunkTokens(int chunkNum)
public boolean inMainDoc(int chunkNum)
protected void loadText(int chunkNum, Chunk chunk) throws IOException
public Chunk loadChunk(int chunkNum)
public int getChunkSize()
public int getChunkOverlap()