Class Dictionary
java.lang.Object
org.apache.lucene.analysis.hunspell.Dictionary
In-memory structure for the dictionary (.dic) and affix (.aff)
data of a hunspell dictionary.
-
Nested Class Summary
Nested Classes
Modifier and Type, Description
private static class: Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded as two ASCII characters whose codes must be combined into a single character.
(package private) static class: Abstraction of the process of parsing flags taken from the affix and dic files.
private static class: Implementation of Dictionary.FlagParsingStrategy that assumes each flag is encoded in its numerical form.
private static class: Simple implementation of Dictionary.FlagParsingStrategy that treats each char in a String as an individual flag.
-
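The three flag encodings described above can be sketched in plain Java. This is an illustration only, not Lucene's code: the class and method names (FlagParsingSketch, parseSimple, parseNum, parseLong) are hypothetical, the comma separator for numeric flags follows the hunspell `FLAG num` convention, and combining the two ASCII codes by shifting the first one left by 8 bits is an assumption about how they are merged into one char.

```java
// Hypothetical sketch of the three flag encodings; not the Lucene implementation.
class FlagParsingSketch {

  // Simple encoding: every char of the raw string is one flag.
  static char[] parseSimple(String raw) {
    return raw.toCharArray();
  }

  // Numeric encoding: flags are decimal numbers separated by commas.
  static char[] parseNum(String raw) {
    String[] parts = raw.split(",");
    char[] flags = new char[parts.length];
    for (int i = 0; i < parts.length; i++) {
      int value = Integer.parseInt(parts[i].trim());
      if (value < 0 || value > Character.MAX_VALUE) {
        throw new IllegalArgumentException("Flag out of char range: " + value);
      }
      flags[i] = (char) value;
    }
    return flags;
  }

  // Two-character encoding: each pair of ASCII chars becomes one char.
  // Assumption: the first code occupies the high byte, the second the low byte.
  static char[] parseLong(String raw) {
    if (raw.length() % 2 != 0) {
      throw new IllegalArgumentException("Expected an even number of chars: " + raw);
    }
    char[] flags = new char[raw.length() / 2];
    for (int i = 0; i < flags.length; i++) {
      flags[i] = (char) ((raw.charAt(2 * i) << 8) | raw.charAt(2 * i + 1));
    }
    return flags;
  }
}
```

Whatever the encoding, every flag ends up as a single char, which is why the rest of the class can work with char[] flag arrays.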
Field Summary
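Fields
Modifier and Type, Field, Description
(package private) byte[] affixData
private static final String ALIAS_KEY
private int aliasCount
private String[] aliases
(package private) boolean alternateCasing
(package private) int circumfix
private static final String CIRCUMFIX_KEY
(package private) boolean complexPrefixes
private static final String COMPLEXPREFIXES_KEY
private int currentAffix
private static Path DEFAULT_TEMP_DIR
(package private) static final Pattern ENCODING_PATTERN: pattern accepts optional BOM + SET + any whitespace
private static final String FLAG_KEY
(package private) final char FLAG_SEPARATOR
(package private) BytesRefHash flagLookup
private Dictionary.FlagParsingStrategy flagParsingStrategy
(package private) boolean fullStrip
private static final String FULLSTRIP_KEY
(package private) boolean hasStemExceptions
private static final String ICONV_KEY
private char[] ignore
private static final String IGNORE_KEY
(package private) boolean ignoreCase
(package private) int keepcase
private static final String KEEPCASE_KEY
private static final String LANG_KEY
(package private) String language
private static final String LONG_FLAG_TYPE
private static final String MORPH_ALIAS_KEY
(package private) final char MORPH_SEPARATOR
private int morphAliasCount
private String[] morphAliases
(package private) int needaffix
private static final String NEEDAFFIX_KEY
(package private) boolean needsInputCleaning
(package private) boolean needsOutputCleaning
(package private) static final char[] NOFLAGS
private static final String NUM_FLAG_TYPE
private static final String OCONV_KEY
(package private) int onlyincompound
private static final String ONLYINCOMPOUND_KEY
(package private) ArrayList<CharacterRunAutomaton> patterns
private static final String PREFIX_CONDITION_REGEX_PATTERN
private static final String PREFIX_KEY
private static final String PSEUDOROOT_KEY
private int stemExceptionCount
private String[] stemExceptions
(package private) char[] stripData
(package private) int[] stripOffsets
private static final String SUFFIX_CONDITION_REGEX_PATTERN
private static final String SUFFIX_KEY
private final Path tempPath
(package private) boolean twoStageAffix
private static final String UTF8_FLAG_TYPE
-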
Constructor Summary
Constructors
Constructor, Description
Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, InputStream dictionary)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, List<InputStream> dictionaries, boolean ignoreCase)
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files.
-
Method Summary
Modifier and Type, Method, Description
(package private) static void applyMappings(FST<CharsRef> fst, StringBuilder sb)
(package private) char caseFold(char c): Folds a single character (according to LANG if present).
(package private) CharSequence cleanInput(CharSequence input, StringBuilder reuse)
(package private) static char[] decodeFlags
(package private) static void encodeFlags(BytesRefBuilder b, char[] flags)
(package private) static String escapeDash(String re)
private String getAliasValue(int id)
(package private) static Path getDefaultTempDir: Returns the default temporary directory.
(package private) static String getDictionaryEncoding(InputStream affix): Parses the encoding specified in the affix file readable through the provided InputStream.
(package private) static Dictionary.FlagParsingStrategy getFlagParsingStrategy(String flagLine): Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
boolean getIgnoreCase(): Returns true if this dictionary was constructed with the ignoreCase option.
private CharsetDecoder getJavaEncoding(String encoding): Retrieves the CharsetDecoder for the given encoding.
(package private) String getStemException(int id)
(package private) static boolean hasFlag(char[] flags, char flag)
(package private) static int indexOfSpaceOrTab(String text, int start)
(package private) IntsRef lookup
(package private) IntsRef lookupPrefix(char[] word, int offset, int length)
(package private) IntsRef lookupSuffix(char[] word, int offset, int length)
(package private) IntsRef lookupWord(char[] word, int offset, int length): Looks up Hunspell word forms from the dictionary.
(package private) static int morphBoundary(String line)
private void parseAffix(TreeMap<String, List<Integer>> affixes, String header, LineNumberReader reader, String conditionPattern, Map<String, Integer> seenPatterns, Map<String, Integer> seenStrips): Parses a specific affix rule, putting the result into the provided affix map.
private void parseAlias(String line)
private FST<CharsRef> parseConversions(LineNumberReader reader, int num)
private void parseMorphAlias(String line)
private String parseStemException(String morphData)
private void readAffixFile(InputStream affixStream, CharsetDecoder decoder): Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
private void readDictionaryFiles(Directory tempDir, String tempFileNamePrefix, List<InputStream> dictionaries, CharsetDecoder decoder, Builder<IntsRef> words): Reads the dictionary file through the provided InputStreams, building up the words map.
static void setDefaultTempDir(Path tempDir): Used by test framework.
(package private) String unescapeEntry(String entry)
-
Field Details
-
NOFLAGS
static final char[] NOFLAGS -
ALIAS_KEY
-
MORPH_ALIAS_KEY
-
PREFIX_KEY
-
SUFFIX_KEY
-
FLAG_KEY
-
COMPLEXPREFIXES_KEY
-
CIRCUMFIX_KEY
-
IGNORE_KEY
-
ICONV_KEY
-
OCONV_KEY
-
FULLSTRIP_KEY
-
LANG_KEY
-
KEEPCASE_KEY
-
NEEDAFFIX_KEY
-
PSEUDOROOT_KEY
-
ONLYINCOMPOUND_KEY
-
NUM_FLAG_TYPE
-
UTF8_FLAG_TYPE
-
LONG_FLAG_TYPE
-
PREFIX_CONDITION_REGEX_PATTERN
-
SUFFIX_CONDITION_REGEX_PATTERN
-
prefixes
-
suffixes
-
patterns
ArrayList<CharacterRunAutomaton> patterns -
words
-
flagLookup
BytesRefHash flagLookup -
stripData
char[] stripData -
stripOffsets
int[] stripOffsets -
affixData
byte[] affixData -
currentAffix
private int currentAffix -
flagParsingStrategy
-
aliases
-
aliasCount
private int aliasCount -
morphAliases
-
morphAliasCount
private int morphAliasCount -
stemExceptions
-
stemExceptionCount
private int stemExceptionCount -
hasStemExceptions
boolean hasStemExceptions -
tempPath
-
ignoreCase
boolean ignoreCase -
complexPrefixes
boolean complexPrefixes -
twoStageAffix
boolean twoStageAffix -
circumfix
int circumfix -
keepcase
int keepcase -
needaffix
int needaffix -
onlyincompound
int onlyincompound -
ignore
private char[] ignore -
iconv
-
oconv
-
needsInputCleaning
boolean needsInputCleaning -
needsOutputCleaning
boolean needsOutputCleaning -
fullStrip
boolean fullStrip -
language
String language -
alternateCasing
boolean alternateCasing -
ENCODING_PATTERN
pattern accepts optional BOM + SET + any whitespace -
CHARSET_ALIASES
-
FLAG_SEPARATOR
final char FLAG_SEPARATOR
-
MORPH_SEPARATOR
final char MORPH_SEPARATOR
-
DEFAULT_TEMP_DIR
-
-
Constructor Details
-
Dictionary
public Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, InputStream dictionary) throws IOException, ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionary - InputStream for reading the hunspell dictionary file (won't be closed).
- Throws:
IOException - Can be thrown while reading from the InputStreams
ParseException - Can be thrown if the content of the files does not meet expected formats
-
Dictionary
public Dictionary(Directory tempDir, String tempFileNamePrefix, InputStream affix, List<InputStream> dictionaries, boolean ignoreCase) throws IOException, ParseException
Creates a new Dictionary containing the information read from the provided InputStreams to hunspell affix and dictionary files. You have to close the provided InputStreams yourself.
- Parameters:
tempDir - Directory to use for offline sorting
tempFileNamePrefix - prefix to use to generate temp file names
affix - InputStream for reading the hunspell affix file (won't be closed).
dictionaries - InputStreams for reading the hunspell dictionary files (won't be closed).
- Throws:
IOException - Can be thrown while reading from the InputStreams
ParseException - Can be thrown if the content of the files does not meet expected formats
-
-
Method Details
-
lookupWord
Looks up Hunspell word forms from the dictionary -
lookupPrefix
-
lookupSuffix
-
lookup
-
readAffixFile
private void readAffixFile(InputStream affixStream, CharsetDecoder decoder) throws IOException, ParseException
Reads the affix file through the provided InputStream, building up the prefix and suffix maps.
- Parameters:
affixStream - InputStream to read the content of the affix file from
decoder - CharsetDecoder to decode the content of the file
- Throws:
IOException - Can be thrown while reading from the InputStream
ParseException
-
affixFST
- Throws:
IOException
-
escapeDash
-
parseAffix
private void parseAffix(TreeMap<String, List<Integer>> affixes, String header, LineNumberReader reader, String conditionPattern, Map<String, Integer> seenPatterns, Map<String, Integer> seenStrips) throws IOException, ParseException
Parses a specific affix rule, putting the result into the provided affix map.
- Parameters:
affixes - Map where the result of the parsing will be put
header - Header line of the affix rule
reader - LineNumberReader to read the content of the rule from
conditionPattern - String.format(String, Object...) pattern to be used to generate the condition regex pattern
seenPatterns - map from condition -> index of patterns, for deduplication.
- Throws:
IOException - Can be thrown while reading the rule
ParseException
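The roles of conditionPattern and seenPatterns can be illustrated with a small stand-alone sketch. It is not Lucene's code: the class ConditionDedupSketch and its register method are hypothetical, and the format strings used in the usage note (".*%s" for suffixes, "%s.*" for prefixes) are assumed values chosen for illustration. The sketch expands a condition into a full regex with String.format and returns the index of an existing identical pattern instead of storing a duplicate.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of condition-pattern deduplication; not the Lucene implementation.
class ConditionDedupSketch {
  // Distinct condition regexes, in order of first appearance.
  final List<String> patterns = new ArrayList<>();
  // Map from condition regex -> index into patterns.
  final Map<String, Integer> seenPatterns = new HashMap<>();

  // Expands the condition with the given format string and returns the index
  // of the resulting regex, reusing an earlier index when the regex was seen before.
  int register(String conditionPattern, String condition) {
    String regex = String.format(Locale.ROOT, conditionPattern, condition);
    Integer existing = seenPatterns.get(regex);
    if (existing != null) {
      return existing;
    }
    int index = patterns.size();
    patterns.add(regex);
    seenPatterns.put(regex, index);
    return index;
  }
}
```

Registering the suffix condition "[^y]" twice with ".*%s" yields the same index both times, so an affix rule only needs to store a small integer rather than its own copy of the pattern.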
-
parseConversions
private FST<CharsRef> parseConversions(LineNumberReader reader, int num) throws IOException, ParseException
- Throws:
IOException
ParseException
-
getDictionaryEncoding
Parses the encoding specified in the affix file readable through the provided InputStream.
- Parameters:
affix - InputStream for reading the affix file
- Returns:
Encoding specified in the affix file
- Throws:
IOException - Can be thrown while reading from the InputStream
ParseException - Thrown if the first non-empty non-comment line read from the file does not adhere to the format SET <encoding>
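The contract above (skip empty and comment lines, then require a SET <encoding> line, optionally preceded by a BOM) can be sketched with the standard library alone. This is an approximation for illustration, not the Lucene implementation: the class name EncodingSketch, the method name readEncoding, the exact regular expression, the use of '#' as the comment marker, and the choice to read the bytes as ISO-8859-1 before the real encoding is known are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.text.ParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of SET-line detection; not the Lucene implementation.
class EncodingSketch {
  // Optional UTF-8 BOM (as seen when the bytes are read as ISO-8859-1),
  // then SET, whitespace, and the encoding name.
  private static final Pattern SET_LINE =
      Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+(\\S+)\\s*$");

  static String readEncoding(InputStream affix) throws IOException, ParseException {
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(affix, StandardCharsets.ISO_8859_1));
    String line;
    int lineNumber = 0;
    while ((line = reader.readLine()) != null) {
      lineNumber++;
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      Matcher m = SET_LINE.matcher(trimmed);
      if (m.matches()) {
        return m.group(2);
      }
      // The first meaningful line must be the SET line.
      throw new ParseException("Expected SET <encoding> but found: " + trimmed, lineNumber);
    }
    throw new ParseException("No SET line found", lineNumber);
  }
}
```

The sketch does not close the stream it is given, in keeping with the constructors' rule that callers close their own InputStreams.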
-
getJavaEncoding
Retrieves the CharsetDecoder for the given encoding. Note: this isn't perfect, as encoding names such as ISCII-DEVANAGARI and MICROSOFT-CP1251 are probably also allowed.
- Parameters:
encoding - Encoding to retrieve the CharsetDecoder for
- Returns:
CharsetDecoder for the given encoding
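One way to handle hunspell encoding names that Java does not recognize directly is an alias table consulted before the charset lookup. The sketch below assumes that approach; it is not Lucene's code. The class CharsetSketch, its methods, the single alias entry, and the choice to replace malformed input rather than fail are all illustrative assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of encoding-name resolution; not the Lucene implementation.
class CharsetSketch {
  // Illustrative alias table: hunspell-style name (lower case) -> Java charset name.
  private static final Map<String, String> ALIASES = new HashMap<>();

  static {
    ALIASES.put("microsoft-cp1251", "windows-1251");
  }

  // Returns the Java charset name for a hunspell encoding name,
  // or the name unchanged when no alias is registered.
  static String resolveAlias(String encoding) {
    return ALIASES.getOrDefault(encoding.toLowerCase(Locale.ROOT), encoding);
  }

  // Builds a decoder for the resolved charset. Charset.forName throws an
  // unchecked exception when the name is illegal or unsupported.
  static CharsetDecoder decoderFor(String encoding) {
    Charset charset = Charset.forName(resolveAlias(encoding));
    return charset.newDecoder().onMalformedInput(CodingErrorAction.REPLACE);
  }
}
```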
-
getFlagParsingStrategy
Determines the appropriate Dictionary.FlagParsingStrategy based on the FLAG definition line taken from the affix file.
- Parameters:
flagLine - Line containing the flag information
- Returns:
FlagParsingStrategy that handles parsing flags in the way specified in the FLAG definition
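As a rough illustration of how a FLAG line might be mapped to a parsing mode, the sketch below splits the line and switches on the declared type. It is not Lucene's code: the class FlagLineSketch and the enum FlagType are hypothetical, the type names "num", "long" and "UTF-8" are the values the hunspell affix format defines for FLAG, and treating "UTF-8" as the simple one-char-per-flag mode is an assumption.

```java
// Hypothetical sketch of FLAG-line dispatch; not the Lucene implementation.
class FlagLineSketch {
  enum FlagType { SIMPLE, NUM, LONG }

  static FlagType fromFlagLine(String flagLine) {
    String[] parts = flagLine.trim().split("\\s+");
    if (parts.length != 2 || !parts[0].equals("FLAG")) {
      throw new IllegalArgumentException("Illegal FLAG specification: " + flagLine);
    }
    switch (parts[1]) {
      case "num":
        return FlagType.NUM; // decimal numbers separated by commas
      case "long":
        return FlagType.LONG; // two ASCII characters per flag
      case "UTF-8":
        return FlagType.SIMPLE; // assumption: one char per flag
      default:
        throw new IllegalArgumentException("Unknown flag type: " + parts[1]);
    }
  }
}
```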
-
unescapeEntry
-
morphBoundary
-
indexOfSpaceOrTab
-
readDictionaryFiles
private void readDictionaryFiles(Directory tempDir, String tempFileNamePrefix, List<InputStream> dictionaries, CharsetDecoder decoder, Builder<IntsRef> words) throws IOException
Reads the dictionary file through the provided InputStreams, building up the words map.
- Parameters:
dictionaries - InputStreams to read the dictionary file through
decoder - CharsetDecoder used to decode the contents of the file
- Throws:
IOException - Can be thrown while reading from the file
-
decodeFlags
-
encodeFlags
-
parseAlias
-
getAliasValue
-
getStemException
-
parseMorphAlias
-
parseStemException
-
hasFlag
static boolean hasFlag(char[] flags, char flag) -
cleanInput
-
caseFold
char caseFold(char c)
Folds a single character (according to LANG if present).
-
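Language-dependent folding matters mainly for Turkic languages, where upper-case I lowers to dotless ı and dotted İ lowers to i. The sketch below shows that idea under stated assumptions and is not Lucene's code: the class CaseFoldSketch is hypothetical, the language is passed as an extra parameter rather than read from the dictionary's LANG setting, and the rule of treating language codes starting with "tr" or "az" as Turkic is an assumption.

```java
// Hypothetical sketch of language-aware single-char folding; not the Lucene implementation.
class CaseFoldSketch {
  static char caseFold(char c, String lang) {
    // Assumption: Turkish and Azerbaijani use the dotted/dotless I distinction.
    boolean turkic = lang != null && (lang.startsWith("tr") || lang.startsWith("az"));
    if (turkic) {
      if (c == 'I') {
        return '\u0131'; // LATIN SMALL LETTER DOTLESS I
      }
      if (c == '\u0130') { // LATIN CAPITAL LETTER I WITH DOT ABOVE
        return 'i';
      }
    }
    return Character.toLowerCase(c);
  }
}
```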
applyMappings
- Throws:
IOException
-
getIgnoreCase
public boolean getIgnoreCase()
Returns true if this dictionary was constructed with the ignoreCase option.
-
setDefaultTempDir
Used by test framework -
getDefaultTempDir
Returns the default temporary directory. By default, java.io.tmpdir. If not accessible or not available, an IOException is thrown.
- Throws:
IOException
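The behavior described here can be approximated with the standard library. The following is a minimal sketch, not Lucene's code: the class TempDirSketch and method defaultTempDir are hypothetical, and the specific accessibility checks (the path is a directory and is writable) are assumptions about what "not accessible" means.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of resolving the default temp directory; not the Lucene implementation.
class TempDirSketch {
  static Path defaultTempDir() throws IOException {
    String property = System.getProperty("java.io.tmpdir");
    if (property == null) {
      throw new IOException("java.io.tmpdir is not set");
    }
    Path dir = Paths.get(property);
    // Assumption: "accessible" means an existing, writable directory.
    if (!Files.isDirectory(dir) || !Files.isWritable(dir)) {
      throw new IOException("java.io.tmpdir is not an accessible directory: " + dir);
    }
    return dir;
  }
}
```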
-