Class ICUTokenizerFactory
java.lang.Object
org.apache.lucene.analysis.util.AbstractAnalysisFactory
org.apache.lucene.analysis.util.TokenizerFactory
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
- All Implemented Interfaces:
ResourceLoaderAware
Factory for ICUTokenizer.
Words are broken across script boundaries, then segmented according to
the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
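As a rough illustration of what "broken across script boundaries" means, the sketch below splits a string into runs of a single Unicode script using only the JDK's `Character.UnicodeScript`. It is not the ICUTokenizer implementation: the class name `ScriptRuns` and the rule that COMMON and INHERITED code points stay attached to the preceding run are assumptions made for this example, and ICU's own script handling is more refined.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene or ICU.
final class ScriptRuns {

    /** Splits text into maximal runs whose non-neutral code points share one Unicode script. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption: spaces, punctuation and combining marks (COMMON / INHERITED)
            // never start a new run; they stay with whatever run precedes them.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (current != null && script != current) {
                    runs.add(text.substring(start, i)); // close the previous script run
                    start = i;
                }
                current = script;
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(text.substring(start)); // final run
        }
        return runs;
    }

    public static void main(String[] args) {
        // Cyrillic "Moskva", Latin "Moscow", Han "Tokyo", written as escapes
        // so the file compiles under any source encoding.
        String text = "\u041c\u043e\u0441\u043a\u0432\u0430 Moscow \u6771\u4eac";
        for (String run : split(text)) {
            System.out.println("[" + run + "]");
        }
    }
}
```

In the real tokenizer each such run is then handed to the BreakIterator configured for its script, which is where per-script rules take effect.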
To use the default set of per-script rules:
<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>
You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.
To add per-script rules, add a "rulefiles" argument, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a resource path. E.g. to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):
<fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
               rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
  </analyzer>
</fieldType>
- Since:
- 3.1
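The "rulefiles" format described above (comma-separated code:rulefile pairs, each a four-letter ISO 15924 script code, a colon, then a resource path) can be sketched in plain Java as follows. This is a minimal illustration of the format only, not the factory's own parsing code: the class name `RuleFilesArg`, the whitespace trimming, and the exact validation and error message are assumptions of this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
final class RuleFilesArg {

    /** Parses "Latn:path1,Cyrl:path2" into an ordered map of script code to resource path. */
    static Map<String, String> parse(String rulefiles) {
        Map<String, String> pathByScript = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            String entry = pair.trim();
            int colon = entry.indexOf(':');
            // The script code must be exactly four characters and the path must be non-empty.
            if (colon != 4 || colon == entry.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected <four-letter script code>:<resource path>, got: " + entry);
            }
            pathByScript.put(entry.substring(0, colon), entry.substring(colon + 1));
        }
        return pathByScript;
    }

    public static void main(String[] args) {
        Map<String, String> parsed =
                parse("Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi");
        parsed.forEach((script, path) -> System.out.println(script + " -> " + path));
    }
}
```

Splitting on the first colon only means a resource path may itself contain further colons; whether the real factory accepts such paths is not stated in this page.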
-
Field Summary
Fields (modifier and type, field, description):
private final boolean cjkAsWords
private ICUTokenizerConfig config
private final boolean myanmarAsWords
static final String NAME
SPI name
(package private) static final String RULEFILES
Fields inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
-
Constructor Summary
Constructors (constructor, description):
ICUTokenizerFactory(Map<String, String> args)
Creates a new ICUTokenizerFactory
-
Method Summary
Methods (modifier and type, method, description):
create(AttributeFactory factory)
Creates a TokenStream of the specified input using the given AttributeFactory
void inform(ResourceLoader loader)
Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
private com.ibm.icu.text.BreakIterator parseRules(String filename, ResourceLoader loader)
Methods inherited from class org.apache.lucene.analysis.util.TokenizerFactory
availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers
Methods inherited from class org.apache.lucene.analysis.util.AbstractAnalysisFactory
get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames
-
Field Details
-
NAME
SPI name
-
RULEFILES
-
tailored
-
config
-
cjkAsWords
private final boolean cjkAsWords
-
myanmarAsWords
private final boolean myanmarAsWords
-
-
Constructor Details
-
ICUTokenizerFactory
Creates a new ICUTokenizerFactory
-
-
Method Details
-
inform
Description copied from interface: ResourceLoaderAware
Initializes this component with the provided ResourceLoader (used for loading classes, files, etc).
- Specified by:
inform in interface ResourceLoaderAware
- Throws:
IOException
-
parseRules
private com.ibm.icu.text.BreakIterator parseRules(String filename, ResourceLoader loader) throws IOException
- Throws:
IOException
-
create
Description copied from class: TokenizerFactory
Creates a TokenStream of the specified input using the given AttributeFactory
- Specified by:
create in class TokenizerFactory
-