opencat / filter-plaintext
Plain text (.txt) file filter for the OpenCAT Framework
dev-main
2026-05-09 00:13 UTC
Requires
- php: ^8.2
- opencat/core: ^0.1
Requires (Dev)
- phpunit/phpunit: ^11.0
This package is auto-updated.
Last update: 2026-05-09 00:16:59 UTC
README
Plain text (.txt) file filter for the CAT Framework.
Installation
composer require catframework/filter-plaintext
Usage
use CatFramework\FilterPlaintext\PlainTextFilter;
// Segment also needs a use statement; it is not defined by this package.

$filter = new PlainTextFilter();

// Extract translatable segments
$document = $filter->extract('article.txt', 'en', 'fr');

foreach ($document->getSegmentPairs() as $pair) {
    // $translatedText is a placeholder for your translation of this pair
    $pair->target = new Segment('seg-t', [$translatedText]);
}

// Write the translated file
$filter->rebuild($document, 'article.fr.txt');
How segments are split
The filter splits on two or more consecutive newlines (blank-line paragraph breaks). Each non-whitespace block becomes one segment. Single newlines within a block are preserved as-is and are part of the segment text.
First paragraph.    → segment 1
                    → (separator, not a segment)
Second paragraph.   → segment 2
                    → (separator, not a segment)
Third paragraph.    → segment 3
Whitespace-only blocks (e.g. multiple blank lines between paragraphs) are passed through unchanged and do not become segments.
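The splitting rule can be sketched in a few lines of PHP. This is a minimal illustration, not the filter's actual code: `splitParagraphs` is a hypothetical helper, and the CRLF normalisation is an assumption that the README does not state.

```php
<?php

/**
 * Hypothetical helper (not part of the package API): splits text on runs of
 * two or more newlines and keeps the separators in the returned parts.
 *
 * @return array{parts: string[], segments: array<int, string>}
 */
function splitParagraphs(string $text): array
{
    // Assumption: treat Windows line endings like Unix ones.
    $text = str_replace("\r\n", "\n", $text);

    // The capture group makes preg_split() return the separators as parts too.
    $parts = preg_split('/(\n{2,})/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        throw new RuntimeException('preg_split failed');
    }

    $segments = [];
    foreach ($parts as $index => $part) {
        // Whitespace-only parts (the separators) never become segments.
        if (trim($part) !== '') {
            $segments[$index] = $part;
        }
    }

    return ['parts' => $parts, 'segments' => $segments];
}

$text = "First paragraph.\n\nSecond, line one\nline two.\n\n\nThird paragraph.";
$result = splitParagraphs($text);
```

Because the separators stay in `parts`, joining the array back together reproduces the input exactly, and the single newline inside the second block remains part of that segment's text.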
Encoding
Input files are auto-detected as UTF-8, ISO-8859-1, or Windows-1252. All output is written in UTF-8. If encoding detection fails, the file is treated as UTF-8.
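A rough sketch of this behaviour, assuming the mbstring extension is loaded. `toUtf8` is a hypothetical name, and checking UTF-8 validity first with `mb_check_encoding` is a choice made for this example; the filter itself may order the candidates differently.

```php
<?php

/**
 * Hypothetical helper (not part of the package API): returns the input as UTF-8.
 */
function toUtf8(string $bytes): string
{
    // Valid UTF-8 is passed through untouched.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Strict mode rejects candidates for which the bytes are invalid.
    $detected = mb_detect_encoding($bytes, ['ISO-8859-1', 'Windows-1252'], true);
    if ($detected === false) {
        // Detection failed: treat the bytes as UTF-8, as the README describes.
        return $bytes;
    }

    return mb_convert_encoding($bytes, 'UTF-8', $detected);
}

$latin1 = "caf\xE9";          // "café" as single-byte Latin-1 / Windows-1252
$converted = toUtf8($latin1); // "café" as UTF-8
```

The byte `0xE9` maps to the same character in ISO-8859-1 and Windows-1252, so the result is the same whichever of the two is detected.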
Skeleton format
[
'parts' => string[], // file split by paragraph boundaries, separators included
'seg_map' => [int => string], // parts array index => segId
]
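To illustrate how a skeleton of this shape could drive a rebuild, here is a minimal sketch. `rebuildFromSkeleton`, the `$translations` array, and the `seg-1` style IDs are assumptions made for the example; the real filter works on the document object passed to `rebuild()`.

```php
<?php

/**
 * Hypothetical helper (not part of the package API).
 *
 * @param array{parts: string[], seg_map: array<int, string>} $skeleton
 * @param array<string, string> $translations segId => translated text
 */
function rebuildFromSkeleton(array $skeleton, array $translations): string
{
    $output = '';
    foreach ($skeleton['parts'] as $index => $part) {
        // Separators have no seg_map entry and are copied through unchanged.
        $segId = $skeleton['seg_map'][$index] ?? null;

        // Segments without a translation fall back to the source text.
        $output .= ($segId !== null && isset($translations[$segId]))
            ? $translations[$segId]
            : $part;
    }

    return $output;
}

$skeleton = [
    'parts'   => ["Hello.", "\n\n", "Goodbye."],
    'seg_map' => [0 => 'seg-1', 2 => 'seg-2'],
];
$rebuilt = rebuildFromSkeleton($skeleton, ['seg-1' => 'Bonjour.', 'seg-2' => 'Au revoir.']);
```

Keying `seg_map` by the index into `parts` means the separators need no bookkeeping of their own: any index missing from the map is copied verbatim.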
Limitations
- No inline markup support: the entire segment is plain text; no InlineCode elements are produced.
- No sentence-level segmentation: each paragraph is one segment regardless of length. Use catframework/segmentation for sentence splitting.
- Encoding detection relies on mb_detect_encoding; unusual encodings (e.g. Shift-JIS) are not supported.