Class HTMLStripCharFilter
java.lang.Object
java.io.Reader
org.apache.lucene.analysis.CharFilter
org.apache.lucene.analysis.charfilter.BaseCharFilter
org.apache.lucene.analysis.charfilter.HTMLStripCharFilter
- All Implemented Interfaces:
Closeable, AutoCloseable, Readable
A CharFilter that wraps another Reader and attempts to strip out HTML constructs.
-
Nested Class Summary
Nested Classes -
Field Summary
Fields
private static final int AMPERSAND
private static final int BANG
private static final char BLOCK_LEVEL_END_TAG_REPLACEMENT
private static final char BLOCK_LEVEL_START_TAG_REPLACEMENT
private static final char BR_END_TAG_REPLACEMENT
private static final char BR_START_TAG_REPLACEMENT
private static final int CDATA
private static final int CHARACTER_REFERENCE_TAIL
private static final int COMMENT
private int cumulativeDiff
private static final int DOUBLE_QUOTED_STRING
private static final int END_TAG_TAIL_EXCLUDE
private static final int END_TAG_TAIL_INCLUDE
private static final int END_TAG_TAIL_SUBSTITUTE
private HTMLStripCharFilter.TextSegment entitySegment
private static final CharArrayMap<Character> entityValues
private int eofReturnValue
private boolean escapeBR
private CharArraySet escapedTags
private boolean escapeSCRIPT
private boolean escapeSTYLE
private static final int INITIAL_INPUT_SEGMENT_SIZE
private HTMLStripCharFilter.TextSegment inputSegment
private int inputStart
private static final int LEFT_ANGLE_BRACKET
private static final int LEFT_ANGLE_BRACKET_SLASH
private static final int LEFT_ANGLE_BRACKET_SPACE
private static final int NUMERIC_CHARACTER
private int outputCharCount
private HTMLStripCharFilter.TextSegment outputSegment
private int previousRestoreState
private static final char REPLACEMENT_CHARACTER
private int restoreState
private static final int SCRIPT
private static final int SCRIPT_COMMENT
private static final char SCRIPT_REPLACEMENT
private static final int SERVER_SIDE_INCLUDE
private static final int SINGLE_QUOTED_STRING
private static final int START_TAG_TAIL_EXCLUDE
private static final int START_TAG_TAIL_INCLUDE
private static final int START_TAG_TAIL_SUBSTITUTE
private static final int STYLE
private static final int STYLE_COMMENT
private static final char STYLE_REPLACEMENT
private int yychar: the number of characters up to the start of the matched text
private int yycolumn: the number of characters from the last newline up to the start of the matched text
private static final int YYEOF: This character denotes the end of file
private static final int YYINITIAL: lexical states
private int yyline: number of newlines encountered up to the start of the matched text
private static final int[] ZZ_ACTION: Translates DFA states to action switch labels.
private static final String ZZ_ACTION_PACKED_0
private static final int[] ZZ_ATTRIBUTE: ZZ_ATTRIBUTE[aState] contains the attributes of state aState
private static final String ZZ_ATTRIBUTE_PACKED_0
private static final int ZZ_BUFFERSIZE: initial size of the lookahead buffer
private static final char[] ZZ_CMAP: Translates characters to character classes
private static final String ZZ_CMAP_PACKED: Translates characters to character classes
private static final String[] ZZ_ERROR_MSG
private static final int[] ZZ_LEXSTATE: ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
private static final int ZZ_NO_MATCH
private static final int ZZ_PUSHBACK_2BIG
private static final int[] ZZ_ROWMAP: Translates a state to a row index in the transition table
private static final String ZZ_ROWMAP_PACKED_0
private static final int[] ZZ_TRANS: The transition table of the DFA
private static final String ZZ_TRANS_PACKED_0
private static final String ZZ_TRANS_PACKED_1
private static final String ZZ_TRANS_PACKED_2
private static final String ZZ_TRANS_PACKED_3
private static final String ZZ_TRANS_PACKED_4
private static final String ZZ_TRANS_PACKED_5
private static final String ZZ_TRANS_PACKED_6
private static final String ZZ_TRANS_PACKED_7
private static final String ZZ_TRANS_PACKED_8
private static final String ZZ_TRANS_PACKED_9
private static final String ZZ_TRANS_PACKED_10
private static final String ZZ_TRANS_PACKED_11
private static final String ZZ_TRANS_PACKED_12
private static final int ZZ_UNKNOWN_ERROR
private boolean zzAtBOL: zzAtBOL == true iff the scanner is currently at the beginning of a line
private boolean zzAtEOF: zzAtEOF == true iff the scanner is at the EOF
private char[] zzBuffer: this buffer contains the current text to be matched and is the source of the yytext() string
private int zzCurrentPos: the current text position in the buffer
private int zzEndRead: endRead marks the last character in the buffer that has been read from input
private boolean zzEOFDone: denotes if the user-EOF-code has already been executed
private int zzFinalHighSurrogate: The number of occupied positions in zzBuffer beyond zzEndRead.
private int zzLexicalState: the current lexical state
private int zzMarkedPos: the text position at the last accepting state
private Reader zzReader: the input device
private int zzStartRead: startRead marks the beginning of the yytext() string in the buffer
private int zzState: the current state of the DFA
Fields inherited from class org.apache.lucene.analysis.CharFilter
input -
Constructor Summary
Constructors
HTMLStripCharFilter(Reader in): Creates a new scanner
HTMLStripCharFilter(Reader in, Set<String> escapedTags): Creates a new HTMLStripCharFilter over the provided Reader with the specified start and end tags.
-
Method Summary
void close(): Closes the underlying input stream.
(package private) static int getInitialBufferSize()
private int nextChar(): Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
int read()
int read(char[] cbuf, int off, int len)
private final void yybegin(int newState): Enters a new lexical state
private final char yycharat(int pos): Returns the character at position pos from the matched text.
private final void yyclose(): Closes the input stream.
private final int yylength(): Returns the length of the matched text region.
private void yypushback(int number): Pushes the specified amount of characters back into the input stream.
private final void yyreset(Reader reader): Resets the scanner to read from a new input stream.
private final int yystate(): Returns the current lexical state.
private final String yytext(): Returns the text matched by the current regular expression.
private void zzDoEOF(): Contains user EOF-code, which will be executed exactly once, when the end of file is reached
private boolean zzRefill(): Refills the input buffer.
private void zzScanError(int errorCode): Reports an error that occurred while scanning.
private static int[] zzUnpackAction()
private static int zzUnpackAction(String packed, int offset, int[] result)
private static int[] zzUnpackAttribute()
private static int zzUnpackAttribute(String packed, int offset, int[] result)
private static char[] zzUnpackCMap(String packed): Unpacks the compressed character translation table.
private static int[] zzUnpackRowMap()
private static int zzUnpackRowMap(String packed, int offset, int[] result)
private static int[] zzUnpackTrans()
private static int zzUnpackTrans(String packed, int offset, int[] result)
Methods inherited from class org.apache.lucene.analysis.charfilter.BaseCharFilter
addOffCorrectMap, correct, getLastCumulativeDiff
Methods inherited from class org.apache.lucene.analysis.CharFilter
correctOffset
-
Field Details
-
YYEOF
private static final int YYEOF
This character denotes the end of file
-
ZZ_BUFFERSIZE
private static final int ZZ_BUFFERSIZE
initial size of the lookahead buffer
-
YYINITIAL
private static final int YYINITIAL
lexical states
-
AMPERSAND
private static final int AMPERSAND
-
NUMERIC_CHARACTER
private static final int NUMERIC_CHARACTER
-
CHARACTER_REFERENCE_TAIL
private static final int CHARACTER_REFERENCE_TAIL
-
LEFT_ANGLE_BRACKET
private static final int LEFT_ANGLE_BRACKET
-
BANG
private static final int BANG
-
COMMENT
private static final int COMMENT
-
SCRIPT
private static final int SCRIPT
-
SCRIPT_COMMENT
private static final int SCRIPT_COMMENT
-
LEFT_ANGLE_BRACKET_SLASH
private static final int LEFT_ANGLE_BRACKET_SLASH
-
LEFT_ANGLE_BRACKET_SPACE
private static final int LEFT_ANGLE_BRACKET_SPACE
-
CDATA
private static final int CDATA
-
SERVER_SIDE_INCLUDE
private static final int SERVER_SIDE_INCLUDE
-
SINGLE_QUOTED_STRING
private static final int SINGLE_QUOTED_STRING
-
DOUBLE_QUOTED_STRING
private static final int DOUBLE_QUOTED_STRING
-
END_TAG_TAIL_INCLUDE
private static final int END_TAG_TAIL_INCLUDE
-
END_TAG_TAIL_EXCLUDE
private static final int END_TAG_TAIL_EXCLUDE
-
END_TAG_TAIL_SUBSTITUTE
private static final int END_TAG_TAIL_SUBSTITUTE
-
START_TAG_TAIL_INCLUDE
private static final int START_TAG_TAIL_INCLUDE
-
START_TAG_TAIL_EXCLUDE
private static final int START_TAG_TAIL_EXCLUDE
-
START_TAG_TAIL_SUBSTITUTE
private static final int START_TAG_TAIL_SUBSTITUTE
-
STYLE
private static final int STYLE
-
STYLE_COMMENT
private static final int STYLE_COMMENT
-
ZZ_LEXSTATE
private static final int[] ZZ_LEXSTATE
ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
-
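The indexing convention described for ZZ_LEXSTATE can be illustrated with a small sketch. The table contents and the names `LexStateLookupSketch`, `LEXSTATE`, and `dfaStartState` are made up for illustration; they are not the values or helpers used by this class.

```java
final class LexStateLookupSketch {
    // Hypothetical table: entry [l] is the DFA start state for lexical state l,
    // entry [l + 1] is the start state used at the beginning of a line.
    static final int[] LEXSTATE = {0, 0, 1, 1, 2, 3};

    /** Returns the DFA start state for lexical state l, where l = 2*k. */
    static int dfaStartState(int l, boolean atBeginningOfLine) {
        if (l < 0 || (l & 1) != 0 || l + 1 >= LEXSTATE.length) {
            throw new IllegalArgumentException("not a lexical state: " + l);
        }
        return LEXSTATE[atBeginningOfLine ? l + 1 : l];
    }

    public static void main(String[] args) {
        // Lexical state with k = 2, so l = 4.
        System.out.println(dfaStartState(4, false));
        System.out.println(dfaStartState(4, true));
    }
}
```

Keeping both entries side by side means a single array lookup selects the start state, with the beginning-of-line flag contributing only an offset of one.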
ZZ_CMAP_PACKED
private static final String ZZ_CMAP_PACKED
Translates characters to character classes
-
ZZ_CMAP
private static final char[] ZZ_CMAP
Translates characters to character classes
-
ZZ_ACTION
private static final int[] ZZ_ACTION
Translates DFA states to action switch labels.
-
ZZ_ACTION_PACKED_0
private static final String ZZ_ACTION_PACKED_0
-
ZZ_ROWMAP
private static final int[] ZZ_ROWMAP
Translates a state to a row index in the transition table
-
ZZ_ROWMAP_PACKED_0
private static final String ZZ_ROWMAP_PACKED_0
-
ZZ_TRANS
private static final int[] ZZ_TRANS
The transition table of the DFA
-
ZZ_TRANS_PACKED_0
private static final String ZZ_TRANS_PACKED_0
-
ZZ_TRANS_PACKED_1
private static final String ZZ_TRANS_PACKED_1
-
ZZ_TRANS_PACKED_2
private static final String ZZ_TRANS_PACKED_2
-
ZZ_TRANS_PACKED_3
private static final String ZZ_TRANS_PACKED_3
-
ZZ_TRANS_PACKED_4
private static final String ZZ_TRANS_PACKED_4
-
ZZ_TRANS_PACKED_5
private static final String ZZ_TRANS_PACKED_5
-
ZZ_TRANS_PACKED_6
private static final String ZZ_TRANS_PACKED_6
-
ZZ_TRANS_PACKED_7
private static final String ZZ_TRANS_PACKED_7
-
ZZ_TRANS_PACKED_8
private static final String ZZ_TRANS_PACKED_8
-
ZZ_TRANS_PACKED_9
private static final String ZZ_TRANS_PACKED_9
-
ZZ_TRANS_PACKED_10
private static final String ZZ_TRANS_PACKED_10
-
ZZ_TRANS_PACKED_11
private static final String ZZ_TRANS_PACKED_11
-
ZZ_TRANS_PACKED_12
private static final String ZZ_TRANS_PACKED_12
-
ZZ_UNKNOWN_ERROR
private static final int ZZ_UNKNOWN_ERROR
-
ZZ_NO_MATCH
private static final int ZZ_NO_MATCH
-
ZZ_PUSHBACK_2BIG
private static final int ZZ_PUSHBACK_2BIG
-
ZZ_ERROR_MSG
private static final String[] ZZ_ERROR_MSG
-
ZZ_ATTRIBUTE
private static final int[] ZZ_ATTRIBUTE
ZZ_ATTRIBUTE[aState] contains the attributes of state aState
-
ZZ_ATTRIBUTE_PACKED_0
private static final String ZZ_ATTRIBUTE_PACKED_0
-
zzReader
private Reader zzReader
the input device
-
zzState
private int zzState
the current state of the DFA
-
zzLexicalState
private int zzLexicalState
the current lexical state
-
zzBuffer
private char[] zzBuffer
this buffer contains the current text to be matched and is the source of the yytext() string
-
zzMarkedPos
private int zzMarkedPos
the text position at the last accepting state
-
zzCurrentPos
private int zzCurrentPos
the current text position in the buffer
-
zzStartRead
private int zzStartRead
startRead marks the beginning of the yytext() string in the buffer
-
zzEndRead
private int zzEndRead
endRead marks the last character in the buffer that has been read from input
-
yyline
private int yyline
number of newlines encountered up to the start of the matched text
-
yychar
private int yychar
the number of characters up to the start of the matched text
-
yycolumn
private int yycolumn
the number of characters from the last newline up to the start of the matched text
-
zzAtBOL
private boolean zzAtBOL
zzAtBOL == true iff the scanner is currently at the beginning of a line
-
zzAtEOF
private boolean zzAtEOF
zzAtEOF == true iff the scanner is at the EOF
-
zzEOFDone
private boolean zzEOFDone
denotes if the user-EOF-code has already been executed
-
zzFinalHighSurrogate
private int zzFinalHighSurrogate
The number of occupied positions in zzBuffer beyond zzEndRead. When a lead/high surrogate has been read from the input stream into the final zzBuffer position, this will have a value of 1; otherwise, it will have a value of 0.
-
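The 0-or-1 bookkeeping described for zzFinalHighSurrogate can be sketched with the standard `Character.isHighSurrogate` check. This is only an illustration of the idea of holding back an unpaired lead surrogate at the end of a buffer; the class name `TrailingSurrogateSketch` and the method `heldBack` are hypothetical and do not mirror the scanner's actual code.

```java
final class TrailingSurrogateSketch {
    /**
     * Returns 1 if the last of the first `filled` chars is a high (lead)
     * surrogate whose low surrogate has not been read yet, otherwise 0.
     */
    static int heldBack(char[] buffer, int filled) {
        boolean endsWithHigh =
            filled > 0 && Character.isHighSurrogate(buffer[filled - 1]);
        return endsWithHigh ? 1 : 0;
    }

    public static void main(String[] args) {
        // U+1F600 is encoded as the surrogate pair D83D DE00.
        char[] split = {'a', '\uD83D'};           // pair cut in half
        char[] whole = {'a', '\uD83D', '\uDE00'}; // pair complete
        System.out.println(heldBack(split, split.length));
        System.out.println(heldBack(whole, whole.length));
    }
}
```

Holding the lead surrogate back until its partner arrives avoids matching on half of a supplementary code point.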
upperCaseVariantsAccepted
-
entityValues
private static final CharArrayMap<Character> entityValues
-
INITIAL_INPUT_SEGMENT_SIZE
private static final int INITIAL_INPUT_SEGMENT_SIZE
-
BLOCK_LEVEL_START_TAG_REPLACEMENT
private static final char BLOCK_LEVEL_START_TAG_REPLACEMENT
-
BLOCK_LEVEL_END_TAG_REPLACEMENT
private static final char BLOCK_LEVEL_END_TAG_REPLACEMENT
-
BR_START_TAG_REPLACEMENT
private static final char BR_START_TAG_REPLACEMENT
-
BR_END_TAG_REPLACEMENT
private static final char BR_END_TAG_REPLACEMENT
-
SCRIPT_REPLACEMENT
private static final char SCRIPT_REPLACEMENT
-
STYLE_REPLACEMENT
private static final char STYLE_REPLACEMENT
-
REPLACEMENT_CHARACTER
private static final char REPLACEMENT_CHARACTER
-
escapedTags
private CharArraySet escapedTags
-
inputStart
private int inputStart
-
cumulativeDiff
private int cumulativeDiff
-
escapeBR
private boolean escapeBR
-
escapeSCRIPT
private boolean escapeSCRIPT
-
escapeSTYLE
private boolean escapeSTYLE
-
restoreState
private int restoreState
-
previousRestoreState
private int previousRestoreState
-
outputCharCount
private int outputCharCount
-
eofReturnValue
private int eofReturnValue
-
inputSegment
private HTMLStripCharFilter.TextSegment inputSegment
-
outputSegment
private HTMLStripCharFilter.TextSegment outputSegment
-
entitySegment
private HTMLStripCharFilter.TextSegment entitySegment
-
-
Constructor Details
-
HTMLStripCharFilter
public HTMLStripCharFilter(Reader in, Set<String> escapedTags)
Creates a new HTMLStripCharFilter over the provided Reader with the specified start and end tags.
- Parameters:
in - Reader to strip HTML tags from.
escapedTags - Tags in this set (both start and end tags) will not be filtered out.
-
HTMLStripCharFilter
public HTMLStripCharFilter(Reader in)
Creates a new scanner
- Parameters:
in - the java.io.Reader to read input from.
-
-
Method Details
-
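zzUnpackAction
private static int[] zzUnpackAction()
-
zzUnpackAction
private static int zzUnpackAction(String packed, int offset, int[] result)
-
zzUnpackRowMap
private static int[] zzUnpackRowMap()
-
zzUnpackRowMap
private static int zzUnpackRowMap(String packed, int offset, int[] result)
-
zzUnpackTrans
private static int[] zzUnpackTrans()
-
zzUnpackTrans
private static int zzUnpackTrans(String packed, int offset, int[] result)
-
zzUnpackAttribute
private static int[] zzUnpackAttribute()
-
zzUnpackAttribute
private static int zzUnpackAttribute(String packed, int offset, int[] result)
-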
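read
public int read() throws IOException
- Overrides:
read in class Reader
- Throws:
IOException
-
read
public int read(char[] cbuf, int off, int len) throws IOException
- Specified by:
read in class Reader
- Throws:
IOException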
-
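The relationship between the two read overloads follows the general java.io.Reader contract: the bulk form fills a caller-supplied array and returns the number of characters stored, or -1 at end of input. A minimal sketch of a bulk read layered on a single-character read is shown below. The class `CharAtATimeReaderSketch` is hypothetical and reads from a plain String; it is not the filter's implementation.

```java
import java.io.Reader;

final class CharAtATimeReaderSketch extends Reader {
    private final String text;
    private int pos;

    CharAtATimeReaderSketch(String text) {
        this.text = text;
    }

    /** Returns the next character, or -1 at end of input. */
    @Override
    public int read() {
        return pos < text.length() ? text.charAt(pos++) : -1;
    }

    /** Fills cbuf[off .. off+len) by calling read() repeatedly. */
    @Override
    public int read(char[] cbuf, int off, int len) {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break; // end of input reached
            }
            cbuf[off + n] = (char) c;
            n++;
        }
        // Reader contract: -1 only when nothing could be read at end of input.
        return n > 0 ? n : (len == 0 ? 0 : -1);
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }

    public static void main(String[] args) {
        CharAtATimeReaderSketch r = new CharAtATimeReaderSketch("abc");
        char[] buf = new char[2];
        int n;
        while ((n = r.read(buf, 0, buf.length)) != -1) {
            System.out.println(new String(buf, 0, n));
        }
    }
}
```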
close
public void close() throws IOException
Description copied from class: CharFilter
Closes the underlying input stream.
NOTE: The default implementation closes the input Reader, so be sure to call super.close() when overriding this method.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class CharFilter
- Throws:
IOException
-
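The note about calling super.close() can be demonstrated without Lucene on the classpath by using java.io.FilterReader as a stand-in for a wrapping reader. The sketch below is an analogy under that assumption; `ClosingReaderSketch` and `wrappedReaderIsClosed` are hypothetical names.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ClosingReaderSketch extends FilterReader {
    boolean cleanedUp;

    ClosingReaderSketch(Reader in) {
        super(in);
    }

    @Override
    public void close() throws IOException {
        cleanedUp = true; // subclass-specific cleanup goes here
        super.close();    // without this call the wrapped reader stays open
    }

    /** Returns true if closing the wrapper also closed the wrapped reader. */
    static boolean wrappedReaderIsClosed() {
        StringReader inner = new StringReader("x");
        try (ClosingReaderSketch r = new ClosingReaderSketch(inner)) {
            r.read();
        } catch (IOException e) {
            return false;
        }
        try {
            inner.read();
            return false; // still readable, so it was never closed
        } catch (IOException e) {
            return true;  // StringReader rejects reads after close()
        }
    }

    public static void main(String[] args) {
        System.out.println(wrappedReaderIsClosed());
    }
}
```

Using try-with-resources on the outermost reader is usually enough, provided every override in the chain delegates to super.close().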
getInitialBufferSize
static int getInitialBufferSize()
-
zzUnpackCMap
private static char[] zzUnpackCMap(String packed)
Unpacks the compressed character translation table.
- Parameters:
packed - the packed character translation table
- Returns:
the unpacked character translation table
-
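Unpacking a compressed table from a String can be sketched as follows, assuming a simple run-length encoding in which the packed string is a sequence of (count, value) character pairs. That encoding, the class name `PackedTableSketch`, and the explicit `size` parameter are assumptions made for this illustration; the page does not document the actual packed format.

```java
final class PackedTableSketch {
    /** Expands (count, value) pairs into a table with `size` entries. */
    static char[] unpack(String packed, int size) {
        char[] table = new char[size];
        int i = 0; // read position in the packed string
        int j = 0; // write position in the unpacked table
        while (i + 1 < packed.length()) {
            int count = packed.charAt(i++);
            char value = packed.charAt(i++);
            for (int k = 0; k < count; k++) {
                table[j++] = value;
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // Three entries of class 0 followed by two entries of class 1.
        String packed = new String(new char[] {3, 0, 2, 1});
        char[] table = unpack(packed, 5);
        for (char c : table) {
            System.out.print((int) c);
        }
        System.out.println();
    }
}
```

Storing tables this way keeps large constant arrays out of the class file's static initializer, at the cost of a one-time expansion when the class is loaded.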
zzRefill
private boolean zzRefill() throws IOException
Refills the input buffer.
- Returns:
false if and only if new input was read.
- Throws:
IOException - if any I/O error occurs
-
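The inverted return value (false when new input arrived) is easy to misread, so a small sketch may help. It assumes a growable char buffer fed from a Reader; `RefillSketch`, its fields, and the doubling strategy are hypothetical and are not taken from the generated scanner.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class RefillSketch {
    private final Reader in;
    char[] buffer = new char[4];
    int endRead; // number of valid chars in buffer

    RefillSketch(Reader in) {
        this.in = in;
    }

    /** Returns false if new input was read, true if the input is exhausted. */
    boolean refill() {
        if (endRead == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2); // make room
        }
        try {
            int n = in.read(buffer, endRead, buffer.length - endRead);
            if (n > 0) {
                endRead += n;
                return false;
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RefillSketch r = new RefillSketch(new StringReader("abcdef"));
        while (!r.refill()) {
            System.out.println(new String(r.buffer, 0, r.endRead));
        }
    }
}
```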
yyclose
private final void yyclose() throws IOException
Closes the input stream.
- Throws:
IOException
-
yyreset
private final void yyreset(Reader reader)
Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). Lexical state is set to YYINITIAL. The internal scan buffer is resized down to its initial length if it has grown.
- Parameters:
reader - the new input stream
-
yystate
private final int yystate()
Returns the current lexical state.
-
yybegin
private final void yybegin(int newState)
Enters a new lexical state
- Parameters:
newState - the new lexical state
-
yytext
private final String yytext()
Returns the text matched by the current regular expression.
yycharat
private final char yycharat(int pos)
Returns the character at position pos from the matched text. It is equivalent to yytext().charAt(pos), but faster.
- Parameters:
pos - the position of the character to fetch. A value from 0 to yylength()-1.
- Returns:
the character at position pos
-
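The stated equivalence between yycharat(pos) and yytext().charAt(pos) can be pictured as two views over the same buffer window: one indexes the buffer directly, the other first copies the window into a String. The sketch below models that with hypothetical names (`MatchWindowSketch`, `text`, `length`, `charAt`) and with field names borrowed from the descriptions of zzStartRead and zzMarkedPos; it is not the scanner's code.

```java
final class MatchWindowSketch {
    private final char[] buffer;
    private final int startRead; // start of the matched text
    private final int markedPos; // end of the matched text (exclusive)

    MatchWindowSketch(String input, int startRead, int markedPos) {
        this.buffer = input.toCharArray();
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    /** Copies the matched window into a new String. */
    String text() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    /** Length of the matched window. */
    int length() {
        return markedPos - startRead;
    }

    /** Direct index into the buffer, with no String allocation. */
    char charAt(int pos) {
        return buffer[startRead + pos];
    }

    public static void main(String[] args) {
        MatchWindowSketch m = new MatchWindowSketch("<b>bold</b>", 3, 7);
        System.out.println(m.text());
        System.out.println(m.charAt(1) == m.text().charAt(1));
    }
}
```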
yylength
private final int yylength()
Returns the length of the matched text region.
-
zzScanError
private void zzScanError(int errorCode)
Reports an error that occurred while scanning. In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner etc.). Usual syntax/scanner level error handling should be done in error fallback rules.
- Parameters:
errorCode - the code of the error message to display
-
yypushback
private void yypushback(int number)
Pushes the specified amount of characters back into the input stream. They will be read again by the next call of the scanning method.
- Parameters:
number - the number of characters to be read again. This number must not be greater than yylength()!
-
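One plausible way to model the pushback behavior is to move the end of the current match backwards, so that the next match starts earlier in the input. The sketch below does that over a String; `PushbackSketch`, `match`, and `pushback` are hypothetical names, and the fixed-length "match" stands in for a real regular-expression match.

```java
final class PushbackSketch {
    private final String input;
    private int startRead; // start of the current match
    private int markedPos; // end of the current match (exclusive)

    PushbackSketch(String input) {
        this.input = input;
    }

    /** Pretends a rule matched the next n characters and returns them. */
    String match(int n) {
        startRead = markedPos;
        markedPos = Math.min(input.length(), startRead + n);
        return input.substring(startRead, markedPos);
    }

    /** Gives back the last `number` characters of the current match. */
    void pushback(int number) {
        if (number < 0 || number > markedPos - startRead) {
            throw new IllegalArgumentException("pushback larger than match");
        }
        markedPos -= number;
    }

    public static void main(String[] args) {
        PushbackSketch s = new PushbackSketch("abcdef");
        System.out.println(s.match(4)); // matches "abcd"
        s.pushback(2);                  // "cd" will be scanned again
        System.out.println(s.match(3)); // matches "cde"
    }
}
```

The range check mirrors the documented restriction that the pushed-back count must not exceed the length of the current match.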
zzDoEOF
private void zzDoEOF()
Contains user EOF-code, which will be executed exactly once, when the end of file is reached
-
nextChar
private int nextChar() throws IOException
Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
- Returns:
the next token
- Throws:
IOException - if any I/O error occurs
-