Class Lucene87StoredFieldsFormat
Principle
This StoredFieldsFormat compresses blocks of documents rather than
individual documents, which improves the compression ratio compared to
document-level compression. By default it uses the LZ4 compression
algorithm on 16KB blocks, which is fast to compress and very fast to
decompress. Although the default mode (BEST_SPEED) favors speed over
compression ratio, it should still give good compression ratios
for redundant inputs (such as log files, HTML or plain text). For higher
compression, you can choose BEST_COMPRESSION, which uses the DEFLATE
algorithm with 48KB blocks and shared dictionaries for a better ratio at the
expense of slower performance. These two options can be configured like this:
    // the default: for high performance
    indexWriterConfig.setCodec(new Lucene87Codec(Mode.BEST_SPEED));
    // instead, for higher compression (but slower):
    // indexWriterConfig.setCodec(new Lucene87Codec(Mode.BEST_COMPRESSION));
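To illustrate why compressing blocks of documents helps, the following minimal sketch compares per-document compression with compressing the same documents as one block. It is not Lucene code: it uses the JDK's `java.util.zip.Deflater` as a stand-in for DEFLATE, and the class name `BlockCompressionSketch`, its methods, and the sample log lines are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative only: uses the JDK Deflater, not Lucene's compression code.
class BlockCompressionSketch {

    // Returns the number of bytes produced by DEFLATE-compressing the input.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[1024];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(out);
        }
        deflater.end();
        return total;
    }

    // Builds small, redundant "documents" resembling log lines (made-up data).
    static byte[][] sampleDocs(int count) {
        byte[][] docs = new byte[count][];
        for (int i = 0; i < count; i++) {
            String line = "2020-10-01 12:00:" + (i % 60)
                    + " INFO request served path=/index.html id=" + i;
            docs[i] = line.getBytes(StandardCharsets.UTF_8);
        }
        return docs;
    }

    // Sum of the compressed sizes when each document is compressed on its own.
    static int perDocumentSize(byte[][] docs) {
        int total = 0;
        for (byte[] doc : docs) {
            total += deflatedSize(doc);
        }
        return total;
    }

    // Compressed size when all documents are concatenated into a single block.
    static int blockSize(byte[][] docs) {
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (byte[] doc : docs) {
            block.write(doc, 0, doc.length);
        }
        return deflatedSize(block.toByteArray());
    }

    public static void main(String[] args) {
        byte[][] docs = sampleDocs(200);
        System.out.println("per-document: " + perDocumentSize(docs) + " bytes");
        System.out.println("one block:    " + blockSize(docs) + " bytes");
    }
}
```

On redundant inputs like these, the single block should come out considerably smaller, because the compressor can reuse matches across document boundaries.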
File formats
Stored fields are represented by three files:
- A fields data file (extension .fdt). This file stores a compact representation of documents in compressed blocks of 16KB or more. When writing a segment, documents are appended to an in-memory byte[] buffer. When its size reaches 16KB or more, some metadata about the documents is flushed to disk, immediately followed by a compressed representation of the buffer using the LZ4 compression format.
  Notes
  - When at least one document in a chunk is large enough so that the chunk is larger than 32KB, the chunk will actually be compressed in several LZ4 blocks of 16KB. This allows StoredFieldVisitors which are only interested in the first fields of a document to decompress only 16KB instead of, for example, the full 10MB of a 10MB document.
  - Given that the original lengths are written in the metadata of the chunk, the decompressor can leverage this information to stop decoding as soon as enough data has been decompressed.
  - In case documents are incompressible, the overhead of the compression format is less than 0.5%.
- A fields index file (extension .fdx). This file stores two monotonic arrays, one for the first doc IDs of each block of compressed documents, and another one for the corresponding offsets on disk. At search time, the array containing doc IDs is binary-searched in order to find the block that contains the expected doc ID, and the associated offset on disk is retrieved from the second array.
- A fields meta file (extension .fdm). This file stores metadata about the monotonic arrays stored in the index file.
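The write and lookup paths described above can be sketched as follows. This is a simplified model under stated assumptions, not Lucene's writer or on-disk layout: `ChunkIndexSketch` and its members are hypothetical, documents are represented only by their byte length, and the "offset" advances by the uncompressed size because no real compression is performed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: models chunk flushing and the two monotonic arrays.
class ChunkIndexSketch {
    static final int CHUNK_SIZE = 16 * 1024; // flush threshold taken from the text

    private final List<Integer> firstDocIds = new ArrayList<>(); // first doc ID per chunk
    private final List<Long> offsets = new ArrayList<>();        // chunk start offsets
    private int bufferedBytes = 0;
    private int bufferedDocs = 0;
    private int nextDocId = 0;
    private long filePointer = 0; // pretend position in the data file

    // Appends a document of the given length to the in-memory buffer and
    // flushes a chunk once the buffer holds 16KB or more.
    void addDocument(int length) {
        bufferedBytes += length;
        bufferedDocs++;
        nextDocId++;
        if (bufferedBytes >= CHUNK_SIZE) {
            flush();
        }
    }

    // Records (first doc ID, offset) for the buffered chunk and resets the buffer.
    void flush() {
        if (bufferedDocs == 0) {
            return;
        }
        firstDocIds.add(nextDocId - bufferedDocs);
        offsets.add(filePointer);
        filePointer += bufferedBytes; // a real writer would add the compressed size
        bufferedBytes = 0;
        bufferedDocs = 0;
    }

    // Binary-searches the first-doc-ID array and returns the offset of the
    // chunk containing docId. Call flush() first so every document is indexed.
    long offsetFor(int docId) {
        if (docId < 0 || docId >= nextDocId || bufferedDocs != 0) {
            throw new IllegalArgumentException("docId not indexed: " + docId);
        }
        int pos = Collections.binarySearch(firstDocIds, docId);
        // When not found, pos == -(insertionPoint) - 1; the containing chunk
        // is the one just before the insertion point.
        int chunk = pos >= 0 ? pos : -pos - 2;
        return offsets.get(chunk);
    }

    public static void main(String[] args) {
        ChunkIndexSketch index = new ChunkIndexSketch();
        for (int i = 0; i < 10; i++) {
            index.addDocument(5000);
        }
        index.flush();
        System.out.println("doc 5 lives at offset " + index.offsetFor(5));
    }
}
```

Keeping the doc IDs and offsets in two parallel sorted arrays means a lookup needs only one binary search followed by a single positional read.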
Known limitations
This StoredFieldsFormat does not support individual documents
larger than (2^31 - 2^14) bytes.
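Reading the limit as 2 to the power 31 minus 2 to the power 14, the concrete byte count can be worked out with a short calculation. The class and method names below are hypothetical and exist only for this illustration.

```java
// Hypothetical helper that evaluates the documented size limit.
class MaxDocumentSizeSketch {

    // (2^31 - 2^14) bytes, computed with long arithmetic to avoid int overflow.
    static long maxDocumentBytes() {
        return (1L << 31) - (1L << 14);
    }

    public static void main(String[] args) {
        System.out.println(maxDocumentBytes()); // prints 2147467264
    }
}
```

The result is 16KB short of 2GB, so it still fits in a Java int.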
Nested Class Summary
Nested Classes
- static enum Lucene87StoredFieldsFormat.Mode
  Configuration option for stored fields.
Field Summary
Fields
- private static final int BEST_COMPRESSION_BLOCK_LENGTH
- static final CompressionMode BEST_COMPRESSION_MODE
  Compression mode for Lucene87StoredFieldsFormat.Mode.BEST_COMPRESSION
- private static final int BEST_SPEED_BLOCK_LENGTH
- static final CompressionMode BEST_SPEED_MODE
  Compression mode for Lucene87StoredFieldsFormat.Mode.BEST_SPEED
- (package private) final Lucene87StoredFieldsFormat.Mode mode
- static final String MODE_KEY
  Attribute key for compression mode.
Constructor Summary
Constructors
- Lucene87StoredFieldsFormat()
  Stored fields format with default options
- Lucene87StoredFieldsFormat(Lucene87StoredFieldsFormat.Mode mode)
  Stored fields format with specified mode
Method Summary
- StoredFieldsReader fieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context)
  Returns a StoredFieldsReader to load stored fields.
- StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context)
  Returns a StoredFieldsWriter to write stored fields.
- (package private) StoredFieldsFormat impl
Field Details
- MODE_KEY
  Attribute key for compression mode.
- mode
- BEST_COMPRESSION_BLOCK_LENGTH
  private static final int BEST_COMPRESSION_BLOCK_LENGTH
- BEST_COMPRESSION_MODE
  Compression mode for Lucene87StoredFieldsFormat.Mode.BEST_COMPRESSION
- BEST_SPEED_BLOCK_LENGTH
  private static final int BEST_SPEED_BLOCK_LENGTH
- BEST_SPEED_MODE
  Compression mode for Lucene87StoredFieldsFormat.Mode.BEST_SPEED
Constructor Details
- Lucene87StoredFieldsFormat
  public Lucene87StoredFieldsFormat()
  Stored fields format with default options
- Lucene87StoredFieldsFormat
  Stored fields format with specified mode
Method Details
- fieldsReader
  public StoredFieldsReader fieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException
  Description copied from class: StoredFieldsFormat
  Returns a StoredFieldsReader to load stored fields.
  Specified by: fieldsReader in class StoredFieldsFormat
  Throws: IOException
- fieldsWriter
  public StoredFieldsWriter fieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException
  Description copied from class: StoredFieldsFormat
  Returns a StoredFieldsWriter to write stored fields.
  Specified by: fieldsWriter in class StoredFieldsFormat
  Throws: IOException
- impl