Open Chinese Convert 1.3.2
A project for conversion between Traditional and Simplified Chinese
This directory contains the C++ core library, the public C API, command-line tools, and native extension entry points. The code is organized around a simple conversion pipeline:
The data files that drive this pipeline live outside this directory:
- data/config/*.json: conversion schemes.
- data/dictionary/*.txt: source dictionaries.
- .ocd2 files: marisa-trie dictionary binaries.

- Config.hpp, Config.cpp
  - ResourceProvider.
- ResourceProvider.hpp, ResourceProvider.cpp
  - ResourceProvider and FilesystemResourceProvider.
  - FilesystemResourceProvider searches configured resource directories in order and returns a resolved path for lower-level dictionary loaders.
- Segmentation.hpp, Segmentation.cpp
- MaxMatchSegmentation.hpp, MaxMatchSegmentation.cpp
- PluginSegmentation.hpp, PluginSegmentation.cpp
- Segments.hpp
- Converter.hpp, Converter.cpp
- ConversionChain.hpp, ConversionChain.cpp
- Conversion.hpp, Conversion.cpp
- ConversionInspection.hpp

The core conversion path depends on segmentation and longest-prefix dictionary matching. Character-by-character replacement is not equivalent to OpenCC behavior because phrase priority and multi-stage conversion order matter.
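To make the difference between phrase-level and character-level replacement concrete, here is a minimal sketch of greedy longest-prefix matching over a toy dictionary. It is an illustration only: `ToyDict`, `Utf8CharLength`, `ConvertLongestPrefix`, and `ConvertPerCharacter` are hypothetical names, not OpenCC classes or functions, and the sketch folds segmentation and conversion into a single pass instead of running a multi-stage chain.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy dictionary: UTF-8 key -> UTF-8 replacement. Hypothetical stand-in,
// not OpenCC's Dict interface.
using ToyDict = std::map<std::string, std::string>;

// Byte length of the UTF-8 character that starts with lead byte `c`.
// Unexpected bytes are treated as single-byte characters.
std::size_t Utf8CharLength(unsigned char c) {
  if ((c >> 5) == 0x6) return 2;
  if ((c >> 4) == 0xE) return 3;
  if ((c >> 3) == 0x1E) return 4;
  return 1;
}

// Greedy longest-prefix conversion: at each position, try the longest
// candidate key first and shrink until a dictionary entry matches.
// Unmatched characters are copied through unchanged.
std::string ConvertLongestPrefix(const ToyDict& dict, const std::string& text) {
  std::size_t maxKeyLen = 0;
  for (const auto& entry : dict) {
    maxKeyLen = std::max(maxKeyLen, entry.first.size());
  }
  std::string out;
  std::size_t pos = 0;
  while (pos < text.size()) {
    const std::size_t remaining = text.size() - pos;
    const std::size_t charLen = std::min(
        Utf8CharLength(static_cast<unsigned char>(text[pos])), remaining);
    bool matched = false;
    for (std::size_t len = std::min(maxKeyLen, remaining); len >= charLen;
         --len) {
      const auto it = dict.find(text.substr(pos, len));
      if (it != dict.end()) {
        out += it->second;
        pos += len;
        matched = true;
        break;
      }
    }
    if (!matched) {
      out.append(text, pos, charLen);
      pos += charLen;
    }
  }
  assert(pos == text.size());
  return out;
}

// Naive per-character replacement, shown only for contrast.
std::string ConvertPerCharacter(const ToyDict& dict, const std::string& text) {
  std::string out;
  std::size_t pos = 0;
  while (pos < text.size()) {
    const std::size_t charLen = std::min(
        Utf8CharLength(static_cast<unsigned char>(text[pos])),
        text.size() - pos);
    const std::string ch = text.substr(pos, charLen);
    const auto it = dict.find(ch);
    out += (it != dict.end()) ? it->second : ch;
    pos += charLen;
  }
  return out;
}
```

With a toy dictionary containing 头 → 頭, 发 → 發, and the phrase 头发 → 頭髮, `ConvertLongestPrefix` turns 头发 into 頭髮 because the phrase entry wins, while `ConvertPerCharacter` produces 頭發. The dictionary contents here are assumed for illustration and are not copied from the OpenCC data files.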
- Dict.hpp, Dict.cpp
- DictEntry.hpp, DictEntry.cpp
- PrefixMatch.hpp, PrefixMatch.cpp
- Lexicon.hpp, Lexicon.cpp
- SerializableDict.hpp
- SerializedValues.hpp, SerializedValues.cpp
  - .ocd2.
- TextDict.hpp, TextDict.cpp
- MarisaDict.hpp, MarisaDict.cpp
  - .ocd2 dictionary format.
- DartsDict.hpp, DartsDict.cpp
  - .ocd dictionary format.
- BinaryDict.hpp, BinaryDict.cpp
- DictGroup.hpp, DictGroup.cpp
- DictConverter.hpp, DictConverter.cpp
- SimpleConverter.hpp, SimpleConverter.cpp
  - Config and Converter.
  - ResourceProvider.
- opencc.h
  - opencc_open, opencc_open_w, opencc_convert_utf8, opencc_convert_utf8_free, opencc_close, and error helpers.

Windows path semantics need care:
- opencc_open_w(const wchar_t*) is the explicit UTF-16 Windows API.
- opencc_open(const char*) keeps the historical Windows/MSVC narrow-string behavior. Do not silently change its encoding contract without a migration plan.
- Use *_utf8 or *_w rather than relying on ambiguous char* semantics.

- py_opencc.cpp

The native CLI tools live under src/tools.
- CommandLine.cpp
- CommandLineMain.hpp, CommandLineMain.cpp
  - opencc command implementation.
  - --in-place conversion.
- PlatformIO.hpp, PlatformIO.cpp
- DictConverter.cpp
  - opencc_dict implementation.
- PhraseExtract.cpp
  - opencc_phrase_extract implementation.
- CmdLineOutput.hpp

CLI file conversion must remain streaming. Do not replace stream processing with "read whole file into memory" logic. In-place conversion is intentionally opt-in:
- Without --in-place, -i and -o referring to the same actual file is rejected.
- With --in-place, output is written to a temporary file next to the target, then the target is replaced after conversion succeeds.

OpenCC uses UTF-8 internally for paths unless an API explicitly documents a different contract. Windows code should convert to UTF-16 at the platform boundary.
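As an illustration of what converting UTF-8 to UTF-16 at the platform boundary involves, the following is a portable sketch of a strict UTF-8 to UTF-16 decoder. `Utf8ToUtf16` is a hypothetical helper written for this example; it is not the interface of UTF8Util or WinUtil.hpp, and it throws on malformed input, which may differ from how the real code handles errors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 string into UTF-16 code units.
// Rejects truncated sequences, bad continuation bytes, overlong forms,
// encoded surrogates, and code points above U+10FFFF.
std::u16string Utf8ToUtf16(const std::string& utf8) {
  // Smallest code point that legitimately needs a sequence of each length.
  static const std::uint32_t kMinForLen[5] = {0, 0, 0x80, 0x800, 0x10000};

  std::u16string out;
  std::size_t i = 0;
  while (i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    std::uint32_t cp = 0;
    std::size_t len = 0;
    if (lead < 0x80) {
      cp = lead;
      len = 1;
    } else if ((lead >> 5) == 0x6) {
      cp = lead & 0x1F;
      len = 2;
    } else if ((lead >> 4) == 0xE) {
      cp = lead & 0x0F;
      len = 3;
    } else if ((lead >> 3) == 0x1E) {
      cp = lead & 0x07;
      len = 4;
    } else {
      throw std::invalid_argument("invalid UTF-8 lead byte");
    }
    if (i + len > utf8.size()) {
      throw std::invalid_argument("truncated UTF-8 sequence");
    }
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char cont = static_cast<unsigned char>(utf8[i + k]);
      if ((cont & 0xC0) != 0x80) {
        throw std::invalid_argument("invalid UTF-8 continuation byte");
      }
      cp = (cp << 6) | (cont & 0x3F);
    }
    if (cp < kMinForLen[len] || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF)) {
      throw std::invalid_argument("overlong or out-of-range code point");
    }
    if (cp < 0x10000) {
      out.push_back(static_cast<char16_t>(cp));
    } else {
      // Code points beyond the BMP become a surrogate pair.
      cp -= 0x10000;
      out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    i += len;
  }
  assert(i == utf8.size());
  return out;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting code units could be copied into a std::wstring before calling a wide-character API. Keeping this conversion in one place at the boundary lets the rest of the code treat paths as UTF-8 throughout.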
Important files:
- UTF8Util.hpp, UTF8Util.cpp
- WinUtil.hpp
- tools/PlatformIO.*

Maintenance rules:
- Do not add fopen, std::fstream, stat, std::tmpnam, or narrow Win32 path calls to new Windows-sensitive paths.

- plugin/OpenCCPlugin.h
- PluginSegmentation.*

The Jieba plugin lives outside src under plugins/jieba.
Most core modules have adjacent *Test.cpp files in this directory. Important test groups include:
- ConfigTest.cpp
- SimpleConverterTest.cpp
- Conversion*Test.cpp
- MaxMatchSegmentationTest.cpp
- MarisaDictTest.cpp, TextDictTest.cpp, DictGroupTest.cpp
- UTF8*Test.cpp

CLI tests live in test/CommandLineConvertTest.cpp because they execute the built command-line binary. They cover streaming behavior, Unicode paths, measurement output, inspection modes, and in-place conversion safety.
The source tree is built through both CMake and Bazel:
- src/CMakeLists.txt
- src/BUILD.bazel
- src/tools/CMakeLists.txt
- src/tools/BUILD.bazel

When adding or renaming source files, update both build systems and any direct cross-build scripts that carry explicit source lists.