opencat / filter-docx
DOCX (.docx) file filter for the OpenCAT Framework
Requires
- php: ^8.2
- ext-dom: *
- ext-libxml: *
- ext-zip: *
- opencat/core: ^0.1
Requires (Dev)
- phpunit/phpunit: ^11.0
Last update: 2026-05-09 00:16:59 UTC
README
Microsoft Word DOCX file filter for the CAT Framework.
Installation
composer require catframework/filter-docx
Requires ext-dom, ext-libxml, and ext-zip.
Usage
use CatFramework\FilterDocx\DocxFilter;

$filter = new DocxFilter();

// Extract translatable segments
$document = $filter->extract('report.docx', 'en', 'fr');

foreach ($document->getSegmentPairs() as $pair) {
    $pair->target = new Segment('seg-t', [$translatedText]);
}

// Write the translated DOCX
$filter->rebuild($document, 'report.fr.docx');
What gets extracted
Each non-empty <w:p> paragraph in the document is one segment. Adjacent runs with identical formatting (<w:rPr>) are merged before extraction, reducing the number of inline code placeholders a translator sees.
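The run merging described above can be sketched with ext-dom. This is a minimal illustration, not the package's actual implementation: the function names runFormatKey() and mergeAdjacentRuns() are hypothetical, and comparing the serialized <w:rPr> is an assumed way of deciding that two runs have identical formatting.

```php
<?php
// WordprocessingML main namespace (the "w:" prefix).
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

/** Serialized <w:rPr> of a run, or '' when the run has none. (Hypothetical helper.) */
function runFormatKey(DOMElement $run): string
{
    foreach ($run->childNodes as $child) {
        if ($child instanceof DOMElement && $child->namespaceURI === W_NS && $child->localName === 'rPr') {
            return $run->ownerDocument->saveXML($child);
        }
    }
    return '';
}

/** Merge adjacent <w:r> children of a paragraph whose <w:rPr> is identical. (Hypothetical helper.) */
function mergeAdjacentRuns(DOMElement $paragraph): void
{
    $previous = null;
    // Iterate over a copy because nodes are removed during the loop.
    foreach (iterator_to_array($paragraph->childNodes) as $node) {
        if ($node instanceof DOMText && trim($node->data) === '') {
            continue; // ignore indentation between runs
        }
        $isRun = $node instanceof DOMElement && $node->namespaceURI === W_NS && $node->localName === 'r';
        if (!$isRun) {
            $previous = null; // any other element breaks adjacency
            continue;
        }
        if ($previous !== null && runFormatKey($previous) === runFormatKey($node)) {
            // Move everything except the duplicate <w:rPr> into the previous run.
            foreach (iterator_to_array($node->childNodes) as $child) {
                $isRPr = $child instanceof DOMElement && $child->namespaceURI === W_NS && $child->localName === 'rPr';
                if (!$isRPr) {
                    $previous->appendChild($child);
                }
            }
            $paragraph->removeChild($node);
            continue;
        }
        $previous = $node;
    }
}
```

With this sketch, two consecutive bold runs collapse into one run, so a translator would see a single bold placeholder pair instead of two.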
Extracted locations (in order):
- word/document.xml: main body
- word/header1.xml … word/header10.xml: headers
- word/footer1.xml … word/footer10.xml: footers
- word/footnotes.xml, word/endnotes.xml: notes
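As a rough illustration of the order listed above, the following sketch builds the candidate part names and filters them against an opened archive. candidateParts() and presentParts() are hypothetical names, not part of the package API; only ZipArchive::locateName() is a real ext-zip call.

```php
<?php
/**
 * Candidate part names in the documented extraction order. (Hypothetical helper.)
 *
 * @return list<string>
 */
function candidateParts(int $maxNumbered = 10): array
{
    $parts = ['word/document.xml'];
    foreach (['header', 'footer'] as $kind) {
        for ($i = 1; $i <= $maxNumbered; $i++) {
            $parts[] = "word/{$kind}{$i}.xml";
        }
    }
    $parts[] = 'word/footnotes.xml';
    $parts[] = 'word/endnotes.xml';

    return $parts;
}

/**
 * Keep only the candidates that exist in the given DOCX archive. (Hypothetical helper.)
 *
 * @return list<string>
 */
function presentParts(ZipArchive $zip): array
{
    return array_values(array_filter(
        candidateParts(),
        static fn (string $name): bool => $zip->locateName($name) !== false
    ));
}
```

A caller would open the file with ZipArchive::open() and pass the archive to presentParts() to see which of these parts a given document actually contains.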
Formatting runs within a paragraph become InlineCode pairs so translators see {<bold>}translated text{</bold>} instead of raw XML.
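To make the placeholder notation concrete, here is a small sketch that renders a segment as a translator would see it. The array shape used for the parts is purely illustrative and is not the package's InlineCode model; renderForTranslator() is a hypothetical name.

```php
<?php
/**
 * Render segment parts as translator-facing text. (Hypothetical helper.)
 *
 * Each part is either a plain string, ['open' => name], or ['close' => name].
 * This shape is an assumption made for the example only.
 */
function renderForTranslator(array $parts): string
{
    $out = '';
    foreach ($parts as $part) {
        if (is_string($part)) {
            $out .= $part;
        } elseif (isset($part['open'])) {
            $out .= '{<' . $part['open'] . '>}';
        } elseif (isset($part['close'])) {
            $out .= '{</' . $part['close'] . '>}';
        }
    }

    return $out;
}

echo renderForTranslator([
    'Read the ',
    ['open' => 'bold'],
    'manual',
    ['close' => 'bold'],
    ' first.',
]);
// prints: Read the {<bold>}manual{</bold>} first.
```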
RTL support
When the target language is Arabic, Hebrew, Farsi, Urdu, or another RTL language, <w:rtl/> is injected into each run's <w:rPr> and <w:bidi/> is added to paragraph properties on rebuild.
Supported RTL language prefixes: ar, he, fa, ur, yi, dv, ps, sd.
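The RTL handling could look roughly like the sketch below. It assumes the prefix is matched against the primary subtag of the language code (so ar-EG counts as ar), which the text above does not spell out; isRtlLanguage() and markRunRtl() are hypothetical names, not the package's API.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
const RTL_PREFIXES = ['ar', 'he', 'fa', 'ur', 'yi', 'dv', 'ps', 'sd'];

/** True when the primary subtag of the code is a listed RTL prefix. (Assumed matching rule.) */
function isRtlLanguage(string $languageCode): bool
{
    // 'ar-EG' or 'ar_EG' -> 'ar'
    $primary = strtolower(preg_split('/[-_]/', $languageCode, 2)[0]);

    return in_array($primary, RTL_PREFIXES, true);
}

/** Ensure the run has a <w:rPr> containing <w:rtl/>. (Hypothetical helper.) */
function markRunRtl(DOMElement $run): void
{
    $doc = $run->ownerDocument;
    $rPr = null;
    foreach ($run->childNodes as $child) {
        if ($child instanceof DOMElement && $child->namespaceURI === W_NS && $child->localName === 'rPr') {
            $rPr = $child;
            break;
        }
    }
    if ($rPr === null) {
        // Run properties go first inside the run.
        $rPr = $doc->createElementNS(W_NS, 'w:rPr');
        $run->insertBefore($rPr, $run->firstChild);
    }
    // Do not add a second <w:rtl/> if one is already present.
    if ($rPr->getElementsByTagNameNS(W_NS, 'rtl')->length === 0) {
        $rPr->appendChild($doc->createElementNS(W_NS, 'w:rtl'));
    }
}
```

The paragraph-level <w:bidi/> could be added the same way against <w:pPr>; it is left out here to keep the sketch short.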
Skeleton format
The skeleton is a temporary DOCX file written to the system temp directory at extract time. Its path is recorded in $document->skeleton:
['path' => '/tmp/cat-<uniqid>.skl']
The skeleton file is a copy of the original DOCX ZIP with paragraph content replaced by {{SEG:NNN}} tokens. Do not delete it between extract() and rebuild() calls. It is not automatically cleaned up.
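The token substitution that rebuild relies on can be pictured with the following sketch. It assumes NNN is a decimal segment number (possibly zero-padded) and that replacement content is available as XML fragments keyed by that number; both are assumptions, and fillSkeleton() is a hypothetical name rather than a package function.

```php
<?php
/**
 * Replace {{SEG:NNN}} tokens in skeleton XML. (Hypothetical helper.)
 *
 * @param array<int,string> $replacements XML fragments keyed by segment number
 */
function fillSkeleton(string $skeletonXml, array $replacements): string
{
    $result = preg_replace_callback(
        '/\{\{SEG:(\d+)\}\}/',
        static function (array $match) use ($replacements): string {
            $index = (int) $match[1]; // '001' -> 1
            // Leave unknown tokens in place so missing segments are easy to spot.
            return $replacements[$index] ?? $match[0];
        },
        $skeletonXml
    );

    // preg_replace_callback() returns null on a regex engine error.
    return $result ?? $skeletonXml;
}
```

Leaving unmatched tokens untouched is a design choice for the example: a leftover {{SEG:NNN}} in the output makes an untranslated paragraph visible instead of silently dropping it.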
Limitations
- Tables: cell text is extracted as individual paragraph segments; table structure is preserved in the skeleton.
- Text boxes and shapes: content inside drawing anchors is not currently extracted.
- Comments and revisions: tracked changes and comment text are not extracted.
- Skeleton lifetime: the .skl temp file must survive between extract() and rebuild(). For long-lived workflows, persist $document->skeleton['path'] and ensure the file is not cleaned up by the OS.
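For the skeleton lifetime point, one possible approach is to copy the skeleton into a directory the application controls and delete it once the translated file has been written. The helpers below are hypothetical and use only standard PHP filesystem functions; whether $document->skeleton can be updated to point at the copied path is an assumption to verify against the package.

```php
<?php
/** Copy the skeleton into a durable directory and return the new path. (Hypothetical helper.) */
function persistSkeleton(string $skeletonPath, string $durableDir): string
{
    if (!is_dir($durableDir) && !mkdir($durableDir, 0700, true) && !is_dir($durableDir)) {
        throw new RuntimeException("Cannot create directory: {$durableDir}");
    }
    $target = rtrim($durableDir, '/\\') . DIRECTORY_SEPARATOR . basename($skeletonPath);
    if (!copy($skeletonPath, $target)) {
        throw new RuntimeException("Cannot copy skeleton to: {$target}");
    }

    return $target;
}

/** Delete a skeleton file once rebuild() has finished; a missing file is not an error. */
function removeSkeleton(string $skeletonPath): void
{
    if (is_file($skeletonPath)) {
        unlink($skeletonPath);
    }
}
```

A typical flow would call persistSkeleton($document->skeleton['path'], $someDir) right after extract(), store the returned path alongside the job, and call removeSkeleton() on it after rebuild() succeeds.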