Class UAX29URLEmailTokenizerImpl

java.lang.Object
org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl

public final class UAX29URLEmailTokenizerImpl extends Object
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

  • <ALPHANUM>: A sequence of alphabetic and numeric characters
  • <NUM>: A number
  • <URL>: A URL
  • <EMAIL>: An email address
  • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
  • <IDEOGRAPHIC>: A single CJKV ideographic character
  • <HIRAGANA>: A single hiragana character
  • <KATAKANA>: A sequence of katakana characters
  • <HANGUL>: A sequence of Hangul characters
  • <EMOJI>: A sequence of Emoji characters
  • Field Details

    • YYEOF

      public static final int YYEOF
      This character denotes the end of file
    • ZZ_BUFFERSIZE

      private int ZZ_BUFFERSIZE
      initial size of the lookahead buffer
    • YYINITIAL

      public static final int YYINITIAL
      lexical states
    • AVOID_BAD_URL

      public static final int AVOID_BAD_URL
    • ZZ_LEXSTATE

      private static final int[] ZZ_LEXSTATE
      ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
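
The paired layout described for ZZ_LEXSTATE can be illustrated with a minimal sketch. The table contents and the startState helper below are hypothetical and are not taken from the generated class; only the indexing scheme (l for the normal case, l+1 at the beginning of a line, l = 2*k) comes from the description above.

```java
final class LexStateLookup {
    // Hypothetical table for two lexical states (k = 0 and k = 1).
    // Entries come in pairs: [DFA state, DFA state at beginning of line].
    static final int[] ZZ_LEXSTATE = {0, 4, 1, 5};

    /** Returns the DFA start state for lexical state l, where l = 2*k. */
    static int startState(int l, boolean atBeginningOfLine) {
        if (l < 0 || l % 2 != 0 || l + 1 >= ZZ_LEXSTATE.length) {
            throw new IllegalArgumentException("not a lexical state: " + l);
        }
        // The odd slot right after l holds the beginning-of-line variant.
        return ZZ_LEXSTATE[l + (atBeginningOfLine ? 1 : 0)];
    }
}
```

With this toy table, lexical state 2 starts the DFA in state 1 in the middle of a line and in state 5 at the beginning of a line.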
    • ZZ_CMAP_PACKED

      private static final String ZZ_CMAP_PACKED
      Translates characters to character classes
    • ZZ_CMAP

      private static final char[] ZZ_CMAP
      Translates characters to character classes
    • ZZ_ACTION

      private static final int[] ZZ_ACTION
      Translates DFA states to action switch labels.
    • ZZ_ACTION_PACKED_0

      private static final String ZZ_ACTION_PACKED_0
    • ZZ_ROWMAP

      private static final int[] ZZ_ROWMAP
      Translates a state to a row index in the transition table
    • ZZ_ROWMAP_PACKED_0

      private static final String ZZ_ROWMAP_PACKED_0
    • ZZ_TRANS

      private static final int[] ZZ_TRANS
      The transition table of the DFA
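
How the character map, row map, and transition table fit together can be sketched with a toy DFA that matches a run of ASCII digits. Everything here is an illustrative assumption: the tables are hand-written rather than unpacked, the ACCEPTING array is a simplified stand-in for the per-state attributes, and the lookup shape (row offset plus character class) is the usual one for table-driven scanners rather than a copy of this class's code.

```java
final class ToyDfa {
    static final int NO_TRANSITION = -1;

    // Character classes: 0 = anything else, 1 = ASCII digit.
    static final char[] CMAP = buildCMap();
    // Row offset of each state in TRANS (two character classes per row).
    static final int[] ROWMAP = {0, 2};
    // State 0 (start) and state 1 (in number): only class 1 has a transition.
    static final int[] TRANS = {NO_TRANSITION, 1, NO_TRANSITION, 1};
    // Simplified stand-in for state attributes: is the state accepting?
    static final boolean[] ACCEPTING = {false, true};

    private static char[] buildCMap() {
        char[] map = new char[128];
        for (char c = '0'; c <= '9'; c++) {
            map[c] = 1;
        }
        return map;
    }

    /** Returns the length of the longest digit prefix of input. */
    static int longestMatch(String input) {
        int state = 0;
        int markedPos = 0; // position after the last accepting state
        for (int pos = 0; pos < input.length(); pos++) {
            char c = input.charAt(pos);
            int charClass = c < CMAP.length ? CMAP[c] : 0;
            int next = TRANS[ROWMAP[state] + charClass];
            if (next == NO_TRANSITION) {
                break;
            }
            state = next;
            if (ACCEPTING[state]) {
                markedPos = pos + 1;
            }
        }
        return markedPos;
    }
}
```

Keeping a row map separate from the transition table lets states share or reorder rows without changing the lookup itself.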
    • ZZ_TRANS_PACKED_0

      private static final String ZZ_TRANS_PACKED_0
    • ZZ_TRANS_PACKED_1

      private static final String ZZ_TRANS_PACKED_1
    • ZZ_TRANS_PACKED_2

      private static final String ZZ_TRANS_PACKED_2
    • ZZ_TRANS_PACKED_3

      private static final String ZZ_TRANS_PACKED_3
    • ZZ_TRANS_PACKED_4

      private static final String ZZ_TRANS_PACKED_4
    • ZZ_TRANS_PACKED_5

      private static final String ZZ_TRANS_PACKED_5
    • ZZ_TRANS_PACKED_6

      private static final String ZZ_TRANS_PACKED_6
    • ZZ_TRANS_PACKED_7

      private static final String ZZ_TRANS_PACKED_7
    • ZZ_TRANS_PACKED_8

      private static final String ZZ_TRANS_PACKED_8
    • ZZ_TRANS_PACKED_9

      private static final String ZZ_TRANS_PACKED_9
    • ZZ_TRANS_PACKED_10

      private static final String ZZ_TRANS_PACKED_10
    • ZZ_TRANS_PACKED_11

      private static final String ZZ_TRANS_PACKED_11
    • ZZ_TRANS_PACKED_12

      private static final String ZZ_TRANS_PACKED_12
    • ZZ_TRANS_PACKED_13

      private static final String ZZ_TRANS_PACKED_13
    • ZZ_UNKNOWN_ERROR

      private static final int ZZ_UNKNOWN_ERROR
    • ZZ_NO_MATCH

      private static final int ZZ_NO_MATCH
    • ZZ_PUSHBACK_2BIG

      private static final int ZZ_PUSHBACK_2BIG
    • ZZ_ERROR_MSG

      private static final String[] ZZ_ERROR_MSG
    • ZZ_ATTRIBUTE

      private static final int[] ZZ_ATTRIBUTE
      ZZ_ATTRIBUTE[aState] contains the attributes of state aState
    • ZZ_ATTRIBUTE_PACKED_0

      private static final String ZZ_ATTRIBUTE_PACKED_0
    • zzReader

      private Reader zzReader
      the input device
    • zzState

      private int zzState
      the current state of the DFA
    • zzLexicalState

      private int zzLexicalState
      the current lexical state
    • zzBuffer

      private char[] zzBuffer
      this buffer contains the current text to be matched and is the source of the yytext() string
    • zzMarkedPos

      private int zzMarkedPos
      the text position at the last accepting state
    • zzCurrentPos

      private int zzCurrentPos
      the current text position in the buffer
    • zzStartRead

      private int zzStartRead
      startRead marks the beginning of the yytext() string in the buffer
    • zzEndRead

      private int zzEndRead
      endRead marks the last character in the buffer that has been read from input
    • yyline

      private int yyline
      number of newlines encountered up to the start of the matched text
    • yychar

      private int yychar
      the number of characters up to the start of the matched text
    • yycolumn

      private int yycolumn
      the number of characters from the last newline up to the start of the matched text
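
The three position counters (yyline, yychar, yycolumn) can be related to each other with a small sketch. It assumes that only '\n' counts as a newline, which is a simplification, and the PositionCounters class is hypothetical; it recomputes the values from scratch instead of tracking them incrementally as a scanner would.

```java
final class PositionCounters {
    final int yyline;   // newlines before the start of the match
    final int yychar;   // characters before the start of the match
    final int yycolumn; // characters since the last newline

    private PositionCounters(int yyline, int yychar, int yycolumn) {
        this.yyline = yyline;
        this.yychar = yychar;
        this.yycolumn = yycolumn;
    }

    /** Computes the counters for a match beginning at index matchStart. */
    static PositionCounters at(String input, int matchStart) {
        int line = 0;
        int column = 0;
        for (int i = 0; i < matchStart; i++) {
            if (input.charAt(i) == '\n') { // assumption: '\n' is the only newline
                line++;
                column = 0;
            } else {
                column++;
            }
        }
        return new PositionCounters(line, matchStart, column);
    }
}
```

For the input "ab\ncd\nef" and a match starting at index 4 (the character 'd'), this gives yyline = 1, yychar = 4, and yycolumn = 1.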
    • zzAtBOL

      private boolean zzAtBOL
      zzAtBOL == true iff the scanner is currently at the beginning of a line
    • zzAtEOF

      private boolean zzAtEOF
      zzAtEOF == true iff the scanner is at the EOF
    • zzEOFDone

      private boolean zzEOFDone
      denotes if the user-EOF-code has already been executed
    • zzFinalHighSurrogate

      private int zzFinalHighSurrogate
      The number of occupied positions in zzBuffer beyond zzEndRead. When a lead/high surrogate has been read from the input stream into the final zzBuffer position, this will have a value of 1; otherwise, it will have a value of 0.
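
The role of zzFinalHighSurrogate can be sketched as follows. This is a simplified illustration under stated assumptions, not the generated refill logic: the SurrogateAwareFill class and its fill method are hypothetical, and endRead is treated as an exclusive end index. The idea is that a lead/high surrogate at the very end of the filled region is held back, so the scanner never sees half of a surrogate pair.

```java
final class SurrogateAwareFill {
    int endRead;            // exclusive end of the scannable region (in this sketch)
    int finalHighSurrogate; // 1 if a high surrogate sits just beyond endRead

    /** Records that buffer[0..numRead) has just been filled from the input. */
    void fill(char[] buffer, int numRead) {
        if (numRead > 0 && Character.isHighSurrogate(buffer[numRead - 1])) {
            // Hold the lead surrogate back until its trail surrogate arrives.
            endRead = numRead - 1;
            finalHighSurrogate = 1;
        } else {
            endRead = numRead;
            finalHighSurrogate = 0;
        }
    }
}
```

On the next refill, the held-back character would be made visible again together with the trail surrogate that follows it.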
    • WORD_TYPE

      public static final int WORD_TYPE
      Alphanumeric sequences
    • NUMERIC_TYPE

      public static final int NUMERIC_TYPE
      Numbers
    • SOUTH_EAST_ASIAN_TYPE

      public static final int SOUTH_EAST_ASIAN_TYPE
      Chars in class \p{Line_Break = Complex_Context} are from South East Asian scripts (Thai, Lao, Myanmar, Khmer, etc.). Sequences of these are kept together as a single token rather than broken up, because the logic required to break them at word boundaries is too complex for UAX#29.

      See Unicode Line Breaking Algorithm: http://www.unicode.org/reports/tr14/#SA

    • IDEOGRAPHIC_TYPE

      public static final int IDEOGRAPHIC_TYPE
      Ideographic token type
    • HIRAGANA_TYPE

      public static final int HIRAGANA_TYPE
      Hiragana token type
    • KATAKANA_TYPE

      public static final int KATAKANA_TYPE
      Katakana token type
    • HANGUL_TYPE

      public static final int HANGUL_TYPE
      Hangul token type
    • EMAIL_TYPE

      public static final int EMAIL_TYPE
      Email token type
    • URL_TYPE

      public static final int URL_TYPE
      URL token type
    • EMOJI_TYPE

      public static final int EMOJI_TYPE
      Emoji token type
  • Constructor Details

    • UAX29URLEmailTokenizerImpl

      public UAX29URLEmailTokenizerImpl(Reader in)
      Creates a new scanner
      Parameters:
      in - the java.io.Reader to read input from.
  • Method Details

    • zzUnpackAction

      private static int[] zzUnpackAction()
    • zzUnpackAction

      private static int zzUnpackAction(String packed, int offset, int[] result)
    • zzUnpackRowMap

      private static int[] zzUnpackRowMap()
    • zzUnpackRowMap

      private static int zzUnpackRowMap(String packed, int offset, int[] result)
    • zzUnpackTrans

      private static int[] zzUnpackTrans()
    • zzUnpackTrans

      private static int zzUnpackTrans(String packed, int offset, int[] result)
    • zzUnpackAttribute

      private static int[] zzUnpackAttribute()
    • zzUnpackAttribute

      private static int zzUnpackAttribute(String packed, int offset, int[] result)
    • yychar

      public final int yychar()
      Character count processed so far
    • getText

      public final void getText(CharTermAttribute t)
      Fills CharTermAttribute with the current token text.
    • setBufferSize

      public final void setBufferSize(int numChars)
      Sets the scanner buffer size in chars
    • zzUnpackCMap

      private static char[] zzUnpackCMap(String packed)
      Unpacks the compressed character translation table.
      Parameters:
      packed - the packed character translation table
      Returns:
      the unpacked character translation table
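
Unpacking a compressed table of this kind can be sketched as below, assuming the packed string is a sequence of (count, value) character pairs, that is, a run-length encoding. That encoding is an assumption about the generated code, and the RunLengthUnpack class and its size parameter are hypothetical.

```java
final class RunLengthUnpack {
    /**
     * Expands (count, value) pairs from packed into a table of the given size.
     * Assumes packed has even length and the counts sum to at most size.
     */
    static char[] unpack(String packed, int size) {
        char[] map = new char[size];
        int i = 0; // index in the packed string
        int j = 0; // index in the unpacked table
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            char value = packed.charAt(i++);
            do {
                map[j++] = value;
            } while (--count > 0);
        }
        return map;
    }
}
```

For example, the packed string "\u0003\u0007\u0002\u0001" expands to the five entries 7, 7, 7, 1, 1.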
    • zzRefill

      private boolean zzRefill() throws IOException
      Refills the input buffer.
      Returns:
      false iff there was new input.
      Throws:
      IOException - if any I/O-Error occurs
    • yyclose

      public final void yyclose() throws IOException
      Closes the input stream.
      Throws:
      IOException
    • yyreset

      public final void yyreset(Reader reader)
      Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). Lexical state is set to YYINITIAL. The internal scan buffer is resized down to its initial length if it has grown.
      Parameters:
      reader - the new input stream
    • yystate

      public final int yystate()
      Returns the current lexical state.
    • yybegin

      public final void yybegin(int newState)
      Enters a new lexical state
      Parameters:
      newState - the new lexical state
    • yytext

      public final String yytext()
      Returns the text matched by the current regular expression.
    • yycharat

      public final char yycharat(int pos)
      Returns the character at position pos from the matched text. It is equivalent to yytext().charAt(pos), but faster
      Parameters:
      pos - the position of the character to fetch. A value from 0 to yylength()-1.
      Returns:
      the character at position pos
    • yylength

      public final int yylength()
      Returns the length of the matched text region.
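
The relationship between yytext(), yycharat(int), and yylength() can be sketched in terms of the buffer fields documented above. The MatchWindow class is a hypothetical stand-in: it assumes the matched text is the buffer region from zzStartRead (inclusive) to zzMarkedPos (exclusive), which is consistent with the field descriptions but is not copied from the generated code.

```java
final class MatchWindow {
    private final char[] buffer;
    private final int startRead; // start of the matched text
    private final int markedPos; // position after the matched text

    MatchWindow(char[] buffer, int startRead, int markedPos) {
        this.buffer = buffer;
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    /** Copies the matched region into a new String. */
    String yytext() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    /** Length of the matched region; no copying involved. */
    int yylength() {
        return markedPos - startRead;
    }

    /** Direct buffer access, avoiding the String allocation of yytext(). */
    char yycharat(int pos) {
        return buffer[startRead + pos];
    }
}
```

With the buffer "see a@b.org now", startRead = 4, and markedPos = 11, yytext() is "a@b.org", yylength() is 7, and yycharat(1) is '@'.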
    • zzScanError

      private void zzScanError(int errorCode)
      Reports an error that occurred while scanning. In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules.
      Parameters:
      errorCode - the code of the error message to display
    • yypushback

      public void yypushback(int number)
      Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
      Parameters:
      number - the number of characters to be read again. This number must not be greater than yylength()!
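
The constraint on yypushback can be illustrated with a minimal sketch. It assumes that pushing back simply moves the end of the match toward its start; the PushbackSketch class is hypothetical, and the IllegalArgumentException is a stand-in chosen for this example, not the error path the generated scanner takes.

```java
final class PushbackSketch {
    private final int startRead; // start of the matched text
    private int markedPos;       // position after the matched text

    PushbackSketch(int startRead, int markedPos) {
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    int yylength() {
        return markedPos - startRead;
    }

    int markedPos() {
        return markedPos;
    }

    /** Gives the last 'number' matched characters back to the input. */
    void yypushback(int number) {
        if (number > yylength()) {
            // Stand-in for the scanner's own error reporting.
            throw new IllegalArgumentException("pushback larger than match");
        }
        markedPos -= number;
    }
}
```

After a 7-character match, pushing back 3 characters leaves a 4-character match, and the next scan would start 3 characters earlier than it otherwise would have.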
    • getNextToken

      public int getNextToken() throws IOException
      Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
      Returns:
      the next token
      Throws:
      IOException - if any I/O-Error occurs
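
A typical way to drive a scanner of this shape is to call getNextToken() until the end-of-file value is returned. The sketch below uses a hypothetical TokenSource interface as a stand-in for the scanner so that it compiles without Lucene on the classpath, and it assumes that YYEOF has the value -1; check the constant on the real class rather than relying on that number.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    // Assumption for this sketch only; use the scanner's own YYEOF constant.
    static final int YYEOF = -1;

    /** Hypothetical stand-in exposing just the scanning method. */
    interface TokenSource {
        int getNextToken() throws IOException;
    }

    /** Collects token type codes until the source reports end of input. */
    static List<Integer> drain(TokenSource source) throws IOException {
        List<Integer> types = new ArrayList<>();
        for (int type = source.getNextToken();
             type != YYEOF;
             type = source.getNextToken()) {
            types.add(type);
        }
        return types;
    }
}
```

With the real class, the loop body would typically also read the token text (for example via getText) before asking for the next token.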