Class HTMLStripCharFilter

All Implemented Interfaces:
Closeable, AutoCloseable, Readable

public final class HTMLStripCharFilter extends BaseCharFilter
A CharFilter that wraps another Reader and attempts to strip out HTML constructs.
  • Field Details

    • YYEOF

      private static final int YYEOF
      This character denotes the end of file
    • ZZ_BUFFERSIZE

      private static final int ZZ_BUFFERSIZE
      initial size of the lookahead buffer
    • YYINITIAL

      private static final int YYINITIAL
      lexical states
    • AMPERSAND

      private static final int AMPERSAND
    • NUMERIC_CHARACTER

      private static final int NUMERIC_CHARACTER
    • CHARACTER_REFERENCE_TAIL

      private static final int CHARACTER_REFERENCE_TAIL
    • LEFT_ANGLE_BRACKET

      private static final int LEFT_ANGLE_BRACKET
    • BANG

      private static final int BANG
    • COMMENT

      private static final int COMMENT
    • SCRIPT

      private static final int SCRIPT
    • SCRIPT_COMMENT

      private static final int SCRIPT_COMMENT
    • LEFT_ANGLE_BRACKET_SLASH

      private static final int LEFT_ANGLE_BRACKET_SLASH
    • LEFT_ANGLE_BRACKET_SPACE

      private static final int LEFT_ANGLE_BRACKET_SPACE
    • CDATA

      private static final int CDATA
    • SERVER_SIDE_INCLUDE

      private static final int SERVER_SIDE_INCLUDE
    • SINGLE_QUOTED_STRING

      private static final int SINGLE_QUOTED_STRING
    • DOUBLE_QUOTED_STRING

      private static final int DOUBLE_QUOTED_STRING
    • END_TAG_TAIL_INCLUDE

      private static final int END_TAG_TAIL_INCLUDE
    • END_TAG_TAIL_EXCLUDE

      private static final int END_TAG_TAIL_EXCLUDE
    • END_TAG_TAIL_SUBSTITUTE

      private static final int END_TAG_TAIL_SUBSTITUTE
    • START_TAG_TAIL_INCLUDE

      private static final int START_TAG_TAIL_INCLUDE
    • START_TAG_TAIL_EXCLUDE

      private static final int START_TAG_TAIL_EXCLUDE
    • START_TAG_TAIL_SUBSTITUTE

      private static final int START_TAG_TAIL_SUBSTITUTE
    • STYLE

      private static final int STYLE
    • STYLE_COMMENT

      private static final int STYLE_COMMENT
    • ZZ_LEXSTATE

      private static final int[] ZZ_LEXSTATE
      ZZ_LEXSTATE[l] is the state in the DFA for the lexical state l. ZZ_LEXSTATE[l+1] is the state in the DFA for the lexical state l at the beginning of a line. l is of the form l = 2*k, where k is a non-negative integer.
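
The even/odd layout described for ZZ_LEXSTATE can be illustrated with a small lookup. This is only a sketch: the table contents, the class name LexStateLookupSketch and the helper startState are hypothetical, and only the indexing rule comes from the description above.

```java
final class LexStateLookupSketch {
    // Hypothetical table with two entries per lexical state k:
    // index 2*k     -> DFA start state,
    // index 2*k + 1 -> DFA start state when at the beginning of a line.
    static final int[] LEXSTATE = {0, 0, 1, 1, 2, 3};

    /** Returns the DFA start state for lexical state l, where l = 2*k. */
    static int startState(int l, boolean atBeginningOfLine) {
        if (l < 0 || l % 2 != 0) {
            throw new IllegalArgumentException("lexical state must be of the form 2*k: " + l);
        }
        return LEXSTATE[atBeginningOfLine ? l + 1 : l];
    }

    public static void main(String[] args) {
        System.out.println(startState(4, false)); // entry at index 4
        System.out.println(startState(4, true));  // entry at index 5
    }
}
```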
    • ZZ_CMAP_PACKED

      private static final String ZZ_CMAP_PACKED
      Translates characters to character classes
    • ZZ_CMAP

      private static final char[] ZZ_CMAP
      Translates characters to character classes
    • ZZ_ACTION

      private static final int[] ZZ_ACTION
      Translates DFA states to action switch labels.
    • ZZ_ACTION_PACKED_0

      private static final String ZZ_ACTION_PACKED_0
    • ZZ_ROWMAP

      private static final int[] ZZ_ROWMAP
      Translates a state to a row index in the transition table
    • ZZ_ROWMAP_PACKED_0

      private static final String ZZ_ROWMAP_PACKED_0
    • ZZ_TRANS

      private static final int[] ZZ_TRANS
      The transition table of the DFA
    • ZZ_TRANS_PACKED_0

      private static final String ZZ_TRANS_PACKED_0
    • ZZ_TRANS_PACKED_1

      private static final String ZZ_TRANS_PACKED_1
    • ZZ_TRANS_PACKED_2

      private static final String ZZ_TRANS_PACKED_2
    • ZZ_TRANS_PACKED_3

      private static final String ZZ_TRANS_PACKED_3
    • ZZ_TRANS_PACKED_4

      private static final String ZZ_TRANS_PACKED_4
    • ZZ_TRANS_PACKED_5

      private static final String ZZ_TRANS_PACKED_5
    • ZZ_TRANS_PACKED_6

      private static final String ZZ_TRANS_PACKED_6
    • ZZ_TRANS_PACKED_7

      private static final String ZZ_TRANS_PACKED_7
    • ZZ_TRANS_PACKED_8

      private static final String ZZ_TRANS_PACKED_8
    • ZZ_TRANS_PACKED_9

      private static final String ZZ_TRANS_PACKED_9
    • ZZ_TRANS_PACKED_10

      private static final String ZZ_TRANS_PACKED_10
    • ZZ_TRANS_PACKED_11

      private static final String ZZ_TRANS_PACKED_11
    • ZZ_TRANS_PACKED_12

      private static final String ZZ_TRANS_PACKED_12
    • ZZ_UNKNOWN_ERROR

      private static final int ZZ_UNKNOWN_ERROR
    • ZZ_NO_MATCH

      private static final int ZZ_NO_MATCH
    • ZZ_PUSHBACK_2BIG

      private static final int ZZ_PUSHBACK_2BIG
    • ZZ_ERROR_MSG

      private static final String[] ZZ_ERROR_MSG
    • ZZ_ATTRIBUTE

      private static final int[] ZZ_ATTRIBUTE
      ZZ_ATTRIBUTE[aState] contains the attributes of state aState
    • ZZ_ATTRIBUTE_PACKED_0

      private static final String ZZ_ATTRIBUTE_PACKED_0
    • zzReader

      private Reader zzReader
      the input device
    • zzState

      private int zzState
      the current state of the DFA
    • zzLexicalState

      private int zzLexicalState
      the current lexical state
    • zzBuffer

      private char[] zzBuffer
      this buffer contains the current text to be matched and is the source of the yytext() string
    • zzMarkedPos

      private int zzMarkedPos
      the text position at the last accepting state
    • zzCurrentPos

      private int zzCurrentPos
      the current text position in the buffer
    • zzStartRead

      private int zzStartRead
      startRead marks the beginning of the yytext() string in the buffer
    • zzEndRead

      private int zzEndRead
      endRead marks the last character in the buffer that has been read from input
    • yyline

      private int yyline
      number of newlines encountered up to the start of the matched text
    • yychar

      private int yychar
      the number of characters up to the start of the matched text
    • yycolumn

      private int yycolumn
      the number of characters from the last newline up to the start of the matched text
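
The relationship between the three counters yyline, yychar and yycolumn can be shown with a short sketch. It assumes '\n' is the only line terminator, which is a simplification, and the class and method names are hypothetical.

```java
final class PositionCountersSketch {
    /**
     * Returns {line, char, column} for a match that starts at matchStart:
     * newlines before the match, characters before the match, and
     * characters since the last newline before the match.
     */
    static int[] countersAt(String input, int matchStart) {
        int line = 0;
        int column = 0;
        for (int i = 0; i < matchStart; i++) {
            if (input.charAt(i) == '\n') { // assumption: only '\n' ends a line
                line++;
                column = 0;
            } else {
                column++;
            }
        }
        return new int[] {line, matchStart, column};
    }

    public static void main(String[] args) {
        int[] c = countersAt("ab\ncd<e>", 5); // match starts at '<'
        System.out.println("line=" + c[0] + " char=" + c[1] + " column=" + c[2]);
    }
}
```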
    • zzAtBOL

      private boolean zzAtBOL
      zzAtBOL == true iff the scanner is currently at the beginning of a line
    • zzAtEOF

      private boolean zzAtEOF
      zzAtEOF == true iff the scanner is at the EOF
    • zzEOFDone

      private boolean zzEOFDone
      denotes if the user-EOF-code has already been executed
    • zzFinalHighSurrogate

      private int zzFinalHighSurrogate
      The number of occupied positions in zzBuffer beyond zzEndRead. When a lead/high surrogate has been read from the input stream into the final zzBuffer position, this will have a value of 1; otherwise, it will have a value of 0.
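
A minimal sketch of the bookkeeping described for zzFinalHighSurrogate follows. It assumes the purpose is to avoid exposing half of a surrogate pair to the scanner. The class FinalHighSurrogateSketch and its fill method are hypothetical and much simpler than a real refill routine.

```java
final class FinalHighSurrogateSketch {
    char[] buffer = new char[0];
    int endRead;            // end of the region visible to the scanner (exclusive)
    int finalHighSurrogate; // occupied positions beyond endRead: 0 or 1

    /** Loads chunk into the buffer, holding back a trailing high surrogate. */
    void fill(char[] chunk) {
        buffer = chunk.clone();
        endRead = buffer.length;
        finalHighSurrogate = 0;
        if (endRead > 0 && Character.isHighSurrogate(buffer[endRead - 1])) {
            // The low surrogate has not arrived yet, so keep the lead
            // surrogate in the buffer but outside the readable region.
            endRead--;
            finalHighSurrogate = 1;
        }
    }

    public static void main(String[] args) {
        FinalHighSurrogateSketch s = new FinalHighSurrogateSketch();
        s.fill(new char[] {'a', '\uD83D'});
        System.out.println(s.endRead + " " + s.finalHighSurrogate);
    }
}
```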
    • upperCaseVariantsAccepted

      private static final Map<String,String> upperCaseVariantsAccepted
    • entityValues

      private static final CharArrayMap<Character> entityValues
    • INITIAL_INPUT_SEGMENT_SIZE

      private static final int INITIAL_INPUT_SEGMENT_SIZE
    • BLOCK_LEVEL_START_TAG_REPLACEMENT

      private static final char BLOCK_LEVEL_START_TAG_REPLACEMENT
    • BLOCK_LEVEL_END_TAG_REPLACEMENT

      private static final char BLOCK_LEVEL_END_TAG_REPLACEMENT
    • BR_START_TAG_REPLACEMENT

      private static final char BR_START_TAG_REPLACEMENT
    • BR_END_TAG_REPLACEMENT

      private static final char BR_END_TAG_REPLACEMENT
    • SCRIPT_REPLACEMENT

      private static final char SCRIPT_REPLACEMENT
    • STYLE_REPLACEMENT

      private static final char STYLE_REPLACEMENT
    • REPLACEMENT_CHARACTER

      private static final char REPLACEMENT_CHARACTER
    • escapedTags

      private CharArraySet escapedTags
    • inputStart

      private int inputStart
    • cumulativeDiff

      private int cumulativeDiff
    • escapeBR

      private boolean escapeBR
    • escapeSCRIPT

      private boolean escapeSCRIPT
    • escapeSTYLE

      private boolean escapeSTYLE
    • restoreState

      private int restoreState
    • previousRestoreState

      private int previousRestoreState
    • outputCharCount

      private int outputCharCount
    • eofReturnValue

      private int eofReturnValue
    • inputSegment

      private HTMLStripCharFilter.TextSegment inputSegment
    • outputSegment

      private HTMLStripCharFilter.TextSegment outputSegment
    • entitySegment

      private HTMLStripCharFilter.TextSegment entitySegment
  • Constructor Details

    • HTMLStripCharFilter

      public HTMLStripCharFilter(Reader in, Set<String> escapedTags)
      Creates a new HTMLStripCharFilter over the provided Reader. Start and end tags named in escapedTags are left in the output.
      Parameters:
      in - Reader to strip HTML tags from.
      escapedTags - Tags in this set (both start and end tags) will not be filtered out.
    • HTMLStripCharFilter

      public HTMLStripCharFilter(Reader in)
      Creates a new scanner
      Parameters:
      in - the java.io.Reader to read input from.
  • Method Details

    • zzUnpackAction

      private static int[] zzUnpackAction()
    • zzUnpackAction

      private static int zzUnpackAction(String packed, int offset, int[] result)
    • zzUnpackRowMap

      private static int[] zzUnpackRowMap()
    • zzUnpackRowMap

      private static int zzUnpackRowMap(String packed, int offset, int[] result)
    • zzUnpackTrans

      private static int[] zzUnpackTrans()
    • zzUnpackTrans

      private static int zzUnpackTrans(String packed, int offset, int[] result)
    • zzUnpackAttribute

      private static int[] zzUnpackAttribute()
    • zzUnpackAttribute

      private static int zzUnpackAttribute(String packed, int offset, int[] result)
    • read

      public int read() throws IOException
      Overrides:
      read in class Reader
      Throws:
      IOException
    • read

      public int read(char[] cbuf, int off, int len) throws IOException
      Specified by:
      read in class Reader
      Throws:
      IOException
    • close

      public void close() throws IOException
      Description copied from class: CharFilter
      Closes the underlying input stream.

      NOTE: The default implementation closes the input Reader, so be sure to call super.close() when overriding this method.

      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Overrides:
      close in class CharFilter
      Throws:
      IOException
    • getInitialBufferSize

      static int getInitialBufferSize()
    • zzUnpackCMap

      private static char[] zzUnpackCMap(String packed)
      Unpacks the compressed character translation table.
      Parameters:
      packed - the packed character translation table
      Returns:
      the unpacked character translation table
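
One common way to compress such a table is run-length encoding. The sketch below assumes the packed string is a sequence of (count, value) character pairs; that encoding, the explicit size parameter and the class name UnpackCMapSketch are assumptions made for illustration and are not stated on this page.

```java
final class UnpackCMapSketch {
    /**
     * Expands a run-length encoded string into a flat table.
     * Assumed encoding: packed holds (count, value) pairs, and size is
     * the total of all counts.
     */
    static char[] unpack(String packed, int size) {
        char[] map = new char[size];
        int i = 0; // read position in packed
        int j = 0; // write position in map
        while (i < packed.length()) {
            int count = packed.charAt(i++);
            char value = packed.charAt(i++);
            for (int n = 0; n < count; n++) {
                map[j++] = value;
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Three entries of class 1 followed by two entries of class 5.
        char[] table = unpack("\3\1\2\5", 5);
        for (char c : table) {
            System.out.print((int) c + " ");
        }
        System.out.println();
    }
}
```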
    • zzRefill

      private boolean zzRefill() throws IOException
      Refills the input buffer.
      Returns:
      false, iff there was new input.
      Throws:
      IOException - if any I/O-Error occurs
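
The return convention of zzRefill is easy to misread, because false means that new input arrived. A simplified sketch of a refill loop with the same convention is shown below. The class RefillSketch, its doubling growth policy and the tiny initial buffer are hypothetical, and buffer compaction is omitted.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

final class RefillSketch {
    private final Reader reader;
    char[] buffer = new char[4]; // deliberately tiny so growth is visible
    int endRead;                 // number of valid characters in buffer

    RefillSketch(Reader reader) {
        this.reader = reader;
    }

    /** Returns false iff new input was appended to the buffer. */
    boolean refill() throws IOException {
        if (endRead == buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2); // make room
        }
        int numRead = reader.read(buffer, endRead, buffer.length - endRead);
        if (numRead > 0) {
            endRead += numRead;
            return false;
        }
        return true; // end of stream, or nothing read: no new input
    }

    public static void main(String[] args) throws IOException {
        RefillSketch s = new RefillSketch(new StringReader("abcdef"));
        while (!s.refill()) {
            System.out.println("buffered so far: " + new String(s.buffer, 0, s.endRead));
        }
    }
}
```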
    • yyclose

      private final void yyclose() throws IOException
      Closes the input stream.
      Throws:
      IOException
    • yyreset

      private final void yyreset(Reader reader)
      Resets the scanner to read from a new input stream. Does not close the old reader. All internal variables are reset; the old input stream cannot be reused (the internal buffer is discarded and lost). The lexical state is set to YYINITIAL. The internal scan buffer is resized down to its initial length if it has grown.
      Parameters:
      reader - the new input stream
    • yystate

      private final int yystate()
      Returns the current lexical state.
    • yybegin

      private final void yybegin(int newState)
      Enters a new lexical state
      Parameters:
      newState - the new lexical state
    • yytext

      private final String yytext()
      Returns the text matched by the current regular expression.
    • yycharat

      private final char yycharat(int pos)
      Returns the character at position pos from the matched text. It is equivalent to yytext().charAt(pos), but faster.
      Parameters:
      pos - the position of the character to fetch. A value from 0 to yylength()-1.
      Returns:
      the character at position pos
    • yylength

      private final int yylength()
      Returns the length of the matched text region.
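
The three accessors yytext(), yycharat(int) and yylength() can be read as views over the same buffer region. The sketch below assumes the matched text spans the half-open range from startRead to markedPos, following the field descriptions for zzStartRead and zzMarkedPos. The class MatchedTextSketch is hypothetical.

```java
final class MatchedTextSketch {
    private final char[] buffer;
    private final int startRead; // start of the matched text
    private final int markedPos; // position just past the matched text

    MatchedTextSketch(char[] buffer, int startRead, int markedPos) {
        this.buffer = buffer;
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    /** Copies the matched region into a new String. */
    String text() {
        return new String(buffer, startRead, markedPos - startRead);
    }

    /** Reads one character directly from the buffer, without copying. */
    char charAt(int pos) {
        return buffer[startRead + pos];
    }

    /** Length of the matched region. */
    int length() {
        return markedPos - startRead;
    }

    public static void main(String[] args) {
        MatchedTextSketch m = new MatchedTextSketch("<b>bold".toCharArray(), 0, 3);
        System.out.println(m.text() + " " + m.length() + " " + m.charAt(1));
    }
}
```

Reading from the buffer directly avoids allocating a String for each access, which is presumably why the character accessor is described as faster.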
    • zzScanError

      private void zzScanError(int errorCode)
      Reports an error that occurred while scanning. In a well-formed scanner (no or only correct usage of yypushback(int) and a match-all fallback rule) this method will only be called with things that "Can't Possibly Happen". If this method is called, something is seriously wrong (e.g. a JFlex bug producing a faulty scanner). Usual syntax/scanner level error handling should be done in error fallback rules.
      Parameters:
      errorCode - the code of the error message to display
    • yypushback

      private void yypushback(int number)
      Pushes the specified number of characters back into the input stream. They will be read again by the next call of the scanning method.
      Parameters:
      number - the number of characters to be read again. This number must not be greater than yylength()!
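
The constraint on number can be made concrete with a small sketch. It assumes pushing back simply moves the end of the match towards its start. The class PushbackSketch is hypothetical, and it throws IllegalArgumentException where a generated scanner would report a scan error.

```java
final class PushbackSketch {
    int startRead; // start of the matched text
    int markedPos; // position just past the matched text

    PushbackSketch(int startRead, int markedPos) {
        this.startRead = startRead;
        this.markedPos = markedPos;
    }

    int length() {
        return markedPos - startRead;
    }

    /** Shrinks the match by number characters so they are scanned again. */
    void pushback(int number) {
        if (number < 0 || number > length()) {
            throw new IllegalArgumentException(
                "cannot push back " + number + " of " + length() + " matched characters");
        }
        markedPos -= number;
    }

    public static void main(String[] args) {
        PushbackSketch s = new PushbackSketch(0, 5);
        s.pushback(2);
        System.out.println(s.length()); // remaining match length
    }
}
```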
    • zzDoEOF

      private void zzDoEOF()
      Contains user EOF-code, which will be executed exactly once, when the end of file is reached
    • nextChar

      private int nextChar() throws IOException
      Resumes scanning until the next regular expression is matched, the end of input is encountered or an I/O-Error occurs.
      Returns:
      the next token
      Throws:
      IOException - if any I/O-Error occurs