This class processes document text into tokens that can be indexed.
The text is assumed to contain some HTML/XML tags. The tokenizer tries to extract as much data as possible from each document, even if it is not well-formed (e.g. it contains start tags with no matching end tags). The resulting document object contains an array of terms and an array of tags.
@author trevor
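
The lenient behaviour described above can be illustrated with a minimal sketch. It is not this class's actual implementation: the names `LenientTagTokenizerSketch`, `Result`, and `tokenize` are hypothetical, and the rules it applies (lowercasing terms, recording only start-tag names, treating a stray `<` as a separator) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LenientTagTokenizerSketch {
    /** Hypothetical result holder: the terms and start-tag names found in one document. */
    public static final class Result {
        public final List<String> terms = new ArrayList<>();
        public final List<String> tags = new ArrayList<>();
    }

    /** Splits text into lowercase terms and start-tag names without requiring balanced tags. */
    public static Result tokenize(String text) {
        Result result = new Result();
        StringBuilder term = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '<') {
                flush(term, result.terms);
                int close = text.indexOf('>', i + 1);
                if (close < 0) {
                    // Assumption: a '<' with no closing '>' is treated as a separator,
                    // so the text after it is still tokenized.
                    i++;
                    continue;
                }
                String name = tagName(text.substring(i + 1, close));
                if (!name.isEmpty()) {
                    result.tags.add(name);
                }
                i = close + 1;
            } else if (Character.isLetterOrDigit(c)) {
                term.append(Character.toLowerCase(c));
                i++;
            } else {
                // Any other character ends the current term.
                flush(term, result.terms);
                i++;
            }
        }
        flush(term, result.terms);
        return result;
    }

    /** Returns the lowercase name of a start tag, or "" for end tags, comments, and declarations. */
    private static String tagName(String inside) {
        String s = inside.trim();
        if (s.isEmpty() || !Character.isLetter(s.charAt(0))) {
            return "";
        }
        int end = 0;
        while (end < s.length() && Character.isLetterOrDigit(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end).toLowerCase(Locale.ROOT);
    }

    /** Moves the pending term, if any, into the term list. */
    private static void flush(StringBuilder term, List<String> terms) {
        if (term.length() > 0) {
            terms.add(term.toString());
            term.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Neither <p> nor <b> is closed, yet both tags and both terms are recovered.
        Result r = tokenize("<p>Hello <b>World");
        System.out.println(r.terms);
        System.out.println(r.tags);
    }
}
```

The sketch never looks for a matching end tag, so a missing one cannot cause text to be dropped. A real tokenizer would likely also track tag positions and attributes, which are left out here.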