Class SimplePatternSplitTokenizerFactory
java.lang.Object
org.apache.lucene.analysis.util.AbstractAnalysisFactory
org.apache.lucene.analysis.util.TokenizerFactory
org.apache.lucene.analysis.pattern.SimplePatternSplitTokenizerFactory
Factory for SimplePatternSplitTokenizer, for producing tokens by splitting according to the provided regexp.
This tokenizer uses Lucene RegExp pattern matching to construct distinct tokens
for the input stream. The syntax is more limited than PatternTokenizer, but the
tokenization is quite a bit faster. It takes two arguments:
- "pattern" (required) is the regular expression, according to the syntax described at RegExp
- "maxDeterminizedStates" (optional, default 10000) is the limit on the total state count for the determinized automaton computed from the regexp
The pattern matches the characters that should split tokens, like String.split, and the
matching is greedy: at any given point, the longest matching token separator is used. Empty
tokens are never created.
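The splitting rules described above can be illustrated with a minimal sketch. It uses java.util.regex as a stand-in for Lucene's RegExp automaton, so it only approximates the behavior for simple separator patterns (the two regex syntaxes and their matching semantics differ in general). The class name SplitSemanticsSketch and the helper splitTokens are hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitSemanticsSketch {

    // Hypothetical helper (not a Lucene API): splits the input on every
    // non-empty match of separatorRegex and never emits an empty token.
    public static List<String> splitTokens(String input, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(separatorRegex).matcher(input);
        int tokenStart = 0;
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // ignore zero-length separator matches
            }
            if (m.start() > tokenStart) {
                // only keep text between separators when it is non-empty
                tokens.add(input.substring(tokenStart, m.start()));
            }
            tokenStart = m.end();
        }
        if (tokenStart < input.length()) {
            tokens.add(input.substring(tokenStart)); // trailing token, if any
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The greedy "+" consumes each whole run of whitespace as one separator.
        System.out.println(splitTokens("  hello \t world\n", "[ \\t\\r\\n]+"));
        // Adjacent separators do not produce an empty token between them.
        System.out.println(splitTokens("a,,b", ","));
    }
}
```

Unlike String.split, which can return a leading empty string when the input starts with a separator, this sketch follows the rule stated above and drops every empty token.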
For example, to match tokens delimited by simple whitespace characters:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
</analyzer>
</fieldType>
- Since:
- 6.5.0
Field Summary
Fields
Modifier and Type         Field                    Description
private final Automaton   dfa
private final int         maxDeterminizedStates
static final String       NAME                     SPI name
static final String       PATTERN
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
Constructor Summary
Constructors
Constructor                          Description
SimplePatternSplitTokenizerFactory   Creates a new SimplePatternSplitTokenizerFactory
Method Summary
Method                             Description
create(AttributeFactory factory)   Creates a TokenStream of the specified input using the given AttributeFactory
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
Field Details
NAME
SPI name
PATTERN
dfa
maxDeterminizedStates
private final int maxDeterminizedStates
Constructor Details
SimplePatternSplitTokenizerFactory
Creates a new SimplePatternSplitTokenizerFactory
Method Details
create
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
- Specified by:
- create in class TokenizerFactory