Class CharacterReader

java.lang.Object
org.jsoup.parser.CharacterReader
All Implemented Interfaces:
AutoCloseable

public final class CharacterReader extends Object implements AutoCloseable
CharacterReader consumes tokens off a string. Used internally by jsoup; the API is subject to change.

If the underlying reader throws an IOException during any operation, the CharacterReader will throw an UncheckedIOException. That won't happen with String / StringReader inputs.
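The exception behaviour described above can be sketched with plain java.io, without jsoup on the classpath. `UncheckedReads`, `readChar`, and `failingReader` are hypothetical names used for illustration and are not part of jsoup; the sketch assumes only the wrapping pattern the text describes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not jsoup code: rethrows checked IOExceptions as unchecked ones.
final class UncheckedReads {
    // Reads one char, or returns -1 at end of input; an IOException is rethrown unchecked.
    static int readChar(Reader reader) {
        try {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A reader that always fails, standing in for a broken file or network stream.
    static Reader failingReader() {
        return new Reader() {
            @Override public int read(char[] buf, int off, int len) throws IOException {
                throw new IOException("simulated failure");
            }
            @Override public void close() { }
        };
    }

    public static void main(String[] args) {
        // In-memory input does not fail.
        System.out.println((char) readChar(new StringReader("abc"))); // prints "a"
        try {
            readChar(failingReader());
        } catch (UncheckedIOException e) {
            System.out.println("wrapped: " + e.getCause().getMessage()); // prints "wrapped: simulated failure"
        }
    }
}
```

Callers that pass a file- or network-backed Reader may want a similar catch block around parsing, since the unchecked exception is not declared on any method signature.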

  • Field Details

    • EOF

      static final char EOF
    • MaxStringCacheLen

      private static final int MaxStringCacheLen
    • StringCacheSize

      private static final int StringCacheSize
    • stringCache

      private String[] stringCache
    • StringPool

      private static final SoftPool<String[]> StringPool
    • BufferSize

      static final int BufferSize
    • RefillPoint

      static final int RefillPoint
    • RewindLimit

      private static final int RewindLimit
    • reader

      private Reader reader
    • charBuf

      private char[] charBuf
    • bufPos

      private int bufPos
    • bufLength

      private int bufLength
    • fillPoint

      private int fillPoint
    • consumed

      private int consumed
    • bufMark

      private int bufMark
    • readFully

      private boolean readFully
    • BufferPool

      private static final SoftPool<char[]> BufferPool
    • newlinePositions

      private ArrayList<Integer> newlinePositions
    • lineNumberOffset

      private int lineNumberOffset
    • lastIcSeq

      private String lastIcSeq
    • lastIcIndex

      private int lastIcIndex
  • Constructor Details

    • CharacterReader

      public CharacterReader(Reader input, int sz)
    • CharacterReader

      public CharacterReader(Reader input)
    • CharacterReader

      public CharacterReader(String input)
  • Method Details

    • close

      public void close()
      Specified by:
      close in interface AutoCloseable
    • bufferUp

      private void bufferUp()
    • doBufferUp

      private void doBufferUp()
      Reads into the buffer. Will throw an UncheckedIOException if the underlying reader throws an IOException.
      Throws:
      UncheckedIOException - if the underlying reader throws an IOException
    • mark

      void mark()
    • unmark

      void unmark()
    • rewindToMark

      void rewindToMark()
    • pos

      public int pos()
      Gets the position currently read to in the content. Starts at 0.
      Returns:
      current position
    • readFully

      boolean readFully()
      Tests if the buffer has been fully read.
    • trackNewlines

      public void trackNewlines(boolean track)
      Enables or disables line number tracking. Tracking is off by default. Tracking line numbers improves the legibility of parser error messages, for example. To be of use, tracking should be enabled before any content is read.
      Parameters:
      track - set tracking on|off
      Since:
      1.14.3
    • isTrackNewlines

      public boolean isTrackNewlines()
      Check if the tracking of newlines is enabled.
      Returns:
      the current newline tracking state
      Since:
      1.14.3
    • lineNumber

      public int lineNumber()
      Get the current line number (that the reader has consumed to). Starts at line #1.
      Returns:
      the current line number, or 1 if line tracking is not enabled.
      Since:
      1.14.3
    • lineNumber

      int lineNumber(int pos)
    • columnNumber

      public int columnNumber()
      Get the current column number (that the reader has consumed to). Starts at column #1.
      Returns:
      the current column number
      Since:
      1.14.3
    • columnNumber

      int columnNumber(int pos)
    • posLineCol

      String posLineCol()
      Get a formatted string representing the current line and column positions. E.g. 5:10 indicating line number 5 and column number 10.
      Returns:
      line:col position
      Since:
      1.14.3
    • lineNumIndex

      private int lineNumIndex(int pos)
    • scanBufferForNewlines

      private void scanBufferForNewlines()
      Scans the buffer for newline positions, and tracks their locations in newlinePositions.
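As an illustration of how recorded newline positions can yield the 1-based line and column values documented above, here is a minimal sketch. `LineIndex` and its members are hypothetical and not jsoup code; the sketch assumes that the offset just after each '\n' is stored and that a binary search over those offsets is used.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model, not jsoup code: maps a 0-based offset to a 1-based line and column.
final class LineIndex {
    // Offsets of the first char of every line after the first (the offset just after each '\n').
    private final List<Integer> lineStarts = new ArrayList<>();

    LineIndex(String content) {
        for (int i = 0; i < content.length(); i++) {
            if (content.charAt(i) == '\n') lineStarts.add(i + 1);
        }
    }

    // Index of the last recorded line start at or before pos, or -1 if pos is on the first line.
    private int lineStartIndex(int pos) {
        int i = Collections.binarySearch(lineStarts, pos);
        return i >= 0 ? i : -i - 2; // not found: insertion point minus one
    }

    int lineNumber(int pos) {
        return lineStartIndex(pos) + 2; // the first line is line 1
    }

    int columnNumber(int pos) {
        int i = lineStartIndex(pos);
        return i == -1 ? pos + 1 : pos - lineStarts.get(i) + 1;
    }

    // Formats the position as line:col, e.g. 5:10.
    String posLineCol(int pos) {
        return lineNumber(pos) + ":" + columnNumber(pos);
    }

    public static void main(String[] args) {
        LineIndex index = new LineIndex("ab\ncd\nef");
        System.out.println(index.posLineCol(0)); // prints "1:1"
        System.out.println(index.posLineCol(7)); // prints "3:2"
    }
}
```

Because the offsets are appended in increasing order, the list stays sorted and each lookup is a logarithmic-time search rather than a rescan of the content.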
    • isEmpty

      public boolean isEmpty()
      Tests if all the content has been read.
      Returns:
      true if nothing left to read.
    • isEmptyNoBufferUp

      private boolean isEmptyNoBufferUp()
    • current

      public char current()
      Get the char at the current position.
      Returns:
      char
    • consume

      public char consume()
      Consume one character off the queue.
      Returns:
      first character on queue, or EOF if the queue is empty.
    • unconsume

      void unconsume()
      Unconsume one character (bufPos--). MUST only be called directly after a consume(), and only when there is no chance that a bufferUp has happened in between.
    • advance

      public void advance()
      Moves the current position by one.
    • nextIndexOf

      int nextIndexOf(char c)
      Returns the number of characters between the current position and the next instance of the input char.
      Parameters:
      c - scan target
      Returns:
      offset between current position and next instance of target. -1 if not found.
    • nextIndexOf

      int nextIndexOf(CharSequence seq)
      Returns the number of characters between the current position and the next instance of the input sequence.
      Parameters:
      seq - scan target
      Returns:
      offset between current position and next instance of target. -1 if not found.
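The two nextIndexOf variants return an offset relative to the current position, not an absolute index. A minimal sketch of that contract, using a plain String and an explicit position in place of the reader's internal buffer (`OffsetSearch` is a hypothetical name, not jsoup code):

```java
// Hypothetical model, not jsoup code: searches relative to a current read position.
final class OffsetSearch {
    // Distance from pos to the next occurrence of c, or -1 if there is none.
    static int nextIndexOf(String content, int pos, char c) {
        int found = content.indexOf(c, pos);
        return found == -1 ? -1 : found - pos;
    }

    // Distance from pos to the next occurrence of seq, or -1 if there is none.
    static int nextIndexOf(String content, int pos, CharSequence seq) {
        int found = content.indexOf(seq.toString(), pos);
        return found == -1 ? -1 : found - pos;
    }

    public static void main(String[] args) {
        System.out.println(nextIndexOf("one two", 2, 't'));   // prints "2"
        System.out.println(nextIndexOf("one two", 0, "two")); // prints "4"
        System.out.println(nextIndexOf("abc", 1, 'a'));       // prints "-1"
    }
}
```

A return value of 0 means the target sits at the current position, which is why -1 rather than 0 signals "not found".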
    • consumeTo

      public String consumeTo(char c)
      Reads characters up to (but not including) the specified char.
      Parameters:
      c - the delimiter
      Returns:
      the chars read
    • consumeTo

      public String consumeTo(String seq)
      Reads the characters up to (but not including) the specified case-sensitive string.

      If the sequence is not found in the buffer, returns the remainder of the currently buffered content, less the length of the sequence, so that this call may be repeated.

      Parameters:
      seq - the delimiter
      Returns:
      the chars read
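The repeat-call behaviour of consumeTo(String) is easier to see with a small model. The sketch below is not jsoup code: `ChunkedConsumer`, its `window` parameter, and `readUntil` are hypothetical, and the model assumes the window is at least as long as the delimiter. When the delimiter is not fully visible in the window, the model holds back the last seq.length() - 1 characters so that a delimiter straddling the window edge is not consumed by mistake.

```java
// Hypothetical model, not jsoup code: a reader that only "sees" a fixed window of input at a time.
final class ChunkedConsumer {
    private final String source; // full input, standing in for the underlying reader
    private final int window;    // assumed number of chars buffered ahead of the current position
    private int pos = 0;

    ChunkedConsumer(String source, int window) {
        this.source = source;
        this.window = window;
    }

    boolean isEmpty() { return pos >= source.length(); }

    boolean matches(String seq) { return source.startsWith(seq, pos); }

    // Reads up to (not including) seq if it is fully inside the window; otherwise reads
    // as much as is safe so that the caller can simply call again.
    String consumeTo(String seq) {
        if (seq.length() > window) throw new IllegalArgumentException("window shorter than delimiter");
        int start = pos;
        int bufEnd = Math.min(source.length(), pos + window);
        int found = source.indexOf(seq, pos);
        if (found != -1 && found + seq.length() <= bufEnd) {
            pos = found;                     // delimiter visible: stop right before it
        } else if (bufEnd - pos < seq.length()) {
            pos = bufEnd;                    // too little left to hold the delimiter: end of input
        } else {
            pos = bufEnd - seq.length() + 1; // hold back chars that may start the delimiter
        }
        return source.substring(start, pos);
    }

    // Repeats consumeTo until the delimiter is at the current position or input runs out.
    static String readUntil(ChunkedConsumer reader, String seq) {
        StringBuilder sb = new StringBuilder();
        while (!reader.isEmpty() && !reader.matches(seq)) sb.append(reader.consumeTo(seq));
        return sb.toString();
    }

    public static void main(String[] args) {
        ChunkedConsumer reader = new ChunkedConsumer("hello world-->tail", 8);
        System.out.println(readUntil(reader, "-->")); // prints "hello world"
    }
}
```

In this run the first call returns "hello " and the second returns "world", so a caller that stopped after one call would see only part of the text before the delimiter.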
    • consumeMatching

      String consumeMatching(CharacterReader.CharPredicate func)
      Read characters while the input predicate returns true.
      Returns:
      characters read
    • consumeMatching

      String consumeMatching(CharacterReader.CharPredicate func, int maxLength)
      Read characters while the input predicate returns true, up to a maximum length.
      Parameters:
      func - predicate to test
      maxLength - maximum length to read. -1 indicates no maximum
      Returns:
      characters read
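A minimal sketch of predicate-driven consumption with an optional length cap follows. `PredicateConsumer` and its nested `CharPredicate` interface are hypothetical stand-ins written for this example; only the -1 "no maximum" convention is taken from the parameter description above.

```java
// Hypothetical model, not jsoup code: consumes chars while a predicate holds.
final class PredicateConsumer {
    // Stand-in for a char-based predicate type.
    @FunctionalInterface
    interface CharPredicate {
        boolean test(char c);
    }

    private final String content;
    private int pos = 0;

    PredicateConsumer(String content) {
        this.content = content;
    }

    // Reads while func accepts the current char, stopping at maxLength chars (-1 means no maximum).
    String consumeMatching(CharPredicate func, int maxLength) {
        int start = pos;
        while (pos < content.length()
                && (maxLength == -1 || pos - start < maxLength)
                && func.test(content.charAt(pos))) {
            pos++;
        }
        return content.substring(start, pos);
    }

    String consumeMatching(CharPredicate func) {
        return consumeMatching(func, -1);
    }

    public static void main(String[] args) {
        PredicateConsumer reader = new PredicateConsumer("12345abc");
        System.out.println(reader.consumeMatching(c -> Character.isDigit(c), 3)); // prints "123"
        System.out.println(reader.consumeMatching(c -> Character.isDigit(c)));    // prints "45"
    }
}
```

When the first character fails the predicate, the result is an empty string and the position does not move.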
    • consumeToAny

      public String consumeToAny(char... chars)
      Read characters until the first of any delimiters is found.
      Parameters:
      chars - delimiters to scan for
      Returns:
      characters read up to the matched delimiter.
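The "first of any delimiters" rule can be sketched as a linear scan that stops at the earliest character found in the delimiter set. `AnyDelimiter` is a hypothetical helper operating on a String and an explicit position, not jsoup code.

```java
// Hypothetical model, not jsoup code: reads up to the first of several delimiter chars.
final class AnyDelimiter {
    // Returns the text from pos up to (not including) the first char contained in chars,
    // or the rest of the content if none of the delimiters occurs.
    static String consumeToAny(String content, int pos, char... chars) {
        int end = pos;
        scan:
        while (end < content.length()) {
            char c = content.charAt(end);
            for (char delimiter : chars) {
                if (c == delimiter) break scan;
            }
            end++;
        }
        return content.substring(pos, end);
    }

    public static void main(String[] args) {
        System.out.println(consumeToAny("key=value;next", 0, '=', ';')); // prints "key"
        System.out.println(consumeToAny("key=value;next", 4, '=', ';')); // prints "value"
    }
}
```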
    • consumeToAnySorted

      String consumeToAnySorted(char... chars)
    • consumeData

      String consumeData()
    • consumeAttributeQuoted

      String consumeAttributeQuoted(boolean single)
    • consumeRawData

      String consumeRawData()
    • consumeTagName

      String consumeTagName()
    • consumeToEnd

      String consumeToEnd()
    • consumeLetterSequence

      String consumeLetterSequence()
    • consumeLetterThenDigitSequence

      String consumeLetterThenDigitSequence()
    • consumeHexSequence

      String consumeHexSequence()
    • consumeDigitSequence

      String consumeDigitSequence()
    • matches

      boolean matches(char c)
    • matches

      boolean matches(String seq)
    • matchesIgnoreCase

      boolean matchesIgnoreCase(String seq)
    • matchesAny

      boolean matchesAny(char... seq)
      Tests if the next character in the queue matches any of the characters in the sequence, case sensitively.
      Parameters:
      seq - list of characters to check for
      Returns:
      true if any matched, false if none did
    • matchesAnySorted

      boolean matchesAnySorted(char[] seq)
    • matchesAsciiAlpha

      boolean matchesAsciiAlpha()
      Checks if the current pos matches an ascii alpha (A-Z a-z) per https://infra.spec.whatwg.org/#ascii-alpha
      Returns:
      true if it matches, false otherwise
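The ASCII alpha range named above (A-Z, a-z) is narrower than Java's Unicode-aware Character.isLetter. A minimal sketch of the range check, with the hypothetical helper name `AsciiAlpha`:

```java
// Hypothetical helper, not jsoup code: tests the ASCII alpha range A-Z, a-z.
final class AsciiAlpha {
    static boolean isAsciiAlpha(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    public static void main(String[] args) {
        System.out.println(isAsciiAlpha('q'));           // prints "true"
        System.out.println(isAsciiAlpha('\u00e9'));      // prints "false" (e with acute accent)
        System.out.println(Character.isLetter('\u00e9')); // prints "true"
    }
}
```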
    • matchesDigit

      boolean matchesDigit()
    • matchConsume

      boolean matchConsume(String seq)
    • matchConsumeIgnoreCase

      boolean matchConsumeIgnoreCase(String seq)
    • containsIgnoreCase

      boolean containsIgnoreCase(String seq)
      Used to check for the presence of an end tag (e.g. </title>, </textarea>) when we're in RCData and see a <xxx. Only finds consistent case.
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • cacheString

      private static String cacheString(char[] charBuf, String[] stringCache, int start, int count)
      Caches short strings, as a flyweight pattern, to reduce GC load. The cache is scoped to the current document only, to prevent leaks.

      Simplistic, and on hash collisions just falls back to creating a new string, vs a full HashMap with Entry list. That saves both having to create objects as hash keys, and running through the entry list, at the expense of some more duplicates.
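The trade-off described above can be sketched as a fixed-size array indexed by a hash of the character range. This is a simplified model, not jsoup's implementation: `StringFlyweight`, the length limit of 12, and the table size of 512 are assumptions chosen for illustration, and overwriting the slot on a collision is one plausible reading of "falls back to creating a new string".

```java
// Hypothetical model, not jsoup code: a hash-indexed flyweight cache for short strings.
final class StringFlyweight {
    private static final int MAX_CACHED_LEN = 12;   // assumed limit, chosen for illustration
    private final String[] cache = new String[512]; // assumed size; a power of two so the mask works

    // Returns a String for buf[start, start + count), reusing a cached instance when possible.
    String intern(char[] buf, int start, int count) {
        if (count > MAX_CACHED_LEN) return new String(buf, start, count); // too long to cache
        if (count < 1) return "";

        int hash = 0;
        for (int i = 0; i < count; i++) {
            hash = 31 * hash + buf[start + i];
        }
        int index = hash & (cache.length - 1);

        String cached = cache[index];
        if (cached != null && rangeEquals(buf, start, count, cached)) {
            return cached; // hit: no new object is created
        }
        String fresh = new String(buf, start, count);
        cache[index] = fresh; // miss or collision: create a new string and take over the slot
        return fresh;
    }

    // Checks whether the char range equals the given string, without allocating.
    static boolean rangeEquals(char[] buf, int start, int count, String cached) {
        if (count != cached.length()) return false;
        for (int i = 0; i < count; i++) {
            if (buf[start + i] != cached.charAt(i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        StringFlyweight pool = new StringFlyweight();
        String a = pool.intern("div".toCharArray(), 0, 3);
        String b = pool.intern("a div b".toCharArray(), 2, 3);
        System.out.println(a == b); // prints "true": the second call reuses the first instance
    }
}
```

Comparing the range against the cached value before reuse is what makes a collision harmless: two different strings that hash to the same slot never get confused, they only evict each other.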

    • rangeEquals

      static boolean rangeEquals(char[] charBuf, int start, int count, String cached)
      Check if the value of the provided range equals the string.
    • rangeEquals

      boolean rangeEquals(int start, int count, String cached)