opencat / filter-xml
Generic XML file filter for the OpenCAT Framework
dev-main
2026-05-09 00:14 UTC
Requires
- php: ^8.2
- ext-dom: *
- ext-libxml: *
- opencat/core: ^0.1
Requires (Dev)
- phpunit/phpunit: ^11.0
This package is auto-updated.
Last update: 2026-05-09 00:17:10 UTC
README
Generic XML file filter for the CAT Framework.
Works with any well-formed XML file: Android string resources, app config files, custom XML formats, etc.
Installation
composer require opencat/filter-xml
Usage
use CatFramework\FilterXml\XmlFilter;
// Segment must also be imported from the core package (its namespace is not shown in this README).

$filter = new XmlFilter();

// Extract translatable segments from an XML file
$document = $filter->extract('strings.xml', 'en', 'fr');

foreach ($document->getSegmentPairs() as $pair) {
    echo $pair->source->getPlainText() . PHP_EOL;
    // … send to MT, TM lookup, or human translator …
    // $translatedText is whatever that translation step returned
    $pair->target = new Segment('seg-t', [$translatedText]);
}

// Write the translated XML file
$filter->rebuild($document, 'strings.fr.xml');
Extraction heuristic
The filter uses a structural heuristic to decide what to extract:
- Translatable element — has at least one non-whitespace direct text node. Its full content (text + any child elements) is extracted as one segment.
- Container element — has only element children. Recursed into; not extracted itself.
Child elements inside a translatable segment are represented as InlineCode pairs so translators see placeholders (<b>, </b>) rather than raw markup.
Example — given:
<resources>
  <string name="greeting">Hello <b>world</b></string>
  <container>
    <item>First item</item>
  </container>
</resources>
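Two segments are extracted: Hello {<b>}world{</b>} and First item. The <resources> and <container> elements have only element children, so they are recursed into rather than extracted.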
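The heuristic can be sketched with plain ext-dom calls. This is a minimal illustration under stated assumptions, not the filter's actual implementation: hasDirectText() and collectTranslatable() are hypothetical helper names, and inline codes are ignored, so only the plain text of each segment is collected.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: true when $element has at least one direct
 * text (or CDATA) child that is not whitespace-only.
 */
function hasDirectText(DOMElement $element): bool
{
    foreach ($element->childNodes as $child) {
        $isCharacterData = $child->nodeType === XML_TEXT_NODE
            || $child->nodeType === XML_CDATA_SECTION_NODE;
        if ($isCharacterData && trim($child->nodeValue ?? '') !== '') {
            return true;
        }
    }
    return false;
}

/**
 * Hypothetical helper: collects translatable elements in document order.
 * Translatable elements are returned whole; containers are recursed into.
 *
 * @return list<DOMElement>
 */
function collectTranslatable(DOMElement $element): array
{
    if (hasDirectText($element)) {
        return [$element];
    }
    $found = [];
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $found = array_merge($found, collectTranslatable($child));
        }
    }
    return $found;
}

$xml = <<<'XML'
<resources>
  <string name="greeting">Hello <b>world</b></string>
  <container>
    <item>First item</item>
  </container>
</resources>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Plain text only; the real filter represents <b>…</b> as InlineCode pairs.
$segments = array_map(
    static fn (DOMElement $el): string => $el->textContent,
    collectTranslatable($doc->documentElement)
);

print_r($segments);
```

Run against the example above, the sketch collects the text of <string> and <item> and skips <resources> and <container>, whose only direct text nodes are indentation.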
Skeleton format
The skeleton stored in BilingualDocument::$skeleton is:
[
'xml' => string, // full DOMDocument::saveXML() output with tokens in place of segment text
'seg_map' => [ // segId => token string
'seg-1' => '{{SEG:001}}',
'seg-2' => '{{SEG:002}}',
// …
],
]
Tokens are valid XML character data, so the skeleton is always parseable XML.
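To make the skeleton shape concrete, here is a rough sketch of how tokens could be swapped for translated text. fillSkeleton() is a hypothetical function, not part of the package; the real rebuild() presumably also restores inline codes, which this sketch does not attempt. The sample skeleton is hand-written and omits the XML declaration that saveXML() would normally emit.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: replaces each token in the skeleton XML with
 * the escaped plain-text translation for its segment ID.
 *
 * @param array{xml: string, seg_map: array<string, string>} $skeleton
 * @param array<string, string> $translations segId => translated plain text
 */
function fillSkeleton(array $skeleton, array $translations): string
{
    $replacements = [];
    foreach ($skeleton['seg_map'] as $segId => $token) {
        if (!array_key_exists($segId, $translations)) {
            throw new InvalidArgumentException("Missing translation for {$segId}");
        }
        // Escape &, < and > so plain text cannot break the XML structure.
        $replacements[$token] = htmlspecialchars(
            $translations[$segId],
            ENT_XML1 | ENT_NOQUOTES,
            'UTF-8'
        );
    }
    return strtr($skeleton['xml'], $replacements);
}

$skeleton = [
    'xml' => '<resources><string name="greeting">{{SEG:001}}</string>'
        . '<item>{{SEG:002}}</item></resources>',
    'seg_map' => [
        'seg-1' => '{{SEG:001}}',
        'seg-2' => '{{SEG:002}}',
    ],
];

// The skeleton itself parses, because tokens are ordinary character data.
$skeletonParses = (new DOMDocument())->loadXML($skeleton['xml']);

$output = fillSkeleton($skeleton, [
    'seg-1' => 'Bonjour & bienvenue',
    'seg-2' => 'Premier élément',
]);

// The filled result should parse as well.
$outputParses = (new DOMDocument())->loadXML($output);

echo $output, PHP_EOL;
```

Using strtr() with an array performs all replacements in a single pass, so translated text that happens to contain a token-like string is not substituted a second time.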
Limitations
- Generic heuristic: the filter has no knowledge of application-specific schemas. Elements that should not be translated (e.g. <version>, <id>) will be extracted if they contain text. For schema-aware extraction, subclass XmlFilter and override walkElement().
- Whitespace-only nodes: text nodes containing only whitespace (indentation, newlines) are silently skipped.
- CDATA sections: treated as text content by the DOM; extracted and re-encoded as regular text on rebuild.
- XML namespace prefixes are preserved in InlineCode data as-is.
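The CDATA limitation can be reproduced with ext-dom alone. The sketch below does not call the filter; it only shows the general DOM behaviour the limitation describes, using a made-up <msg> element: once CDATA content is read as text and written back as a regular text node, the CDATA wrapper is gone and special characters are serialized as entities.

```php
<?php
declare(strict_types=1);

$doc = new DOMDocument();
$doc->loadXML('<root><msg><![CDATA[Save & <exit>]]></msg></root>');

$msg = $doc->getElementsByTagName('msg')->item(0);

// The DOM exposes CDATA content as ordinary text.
$text = $msg->textContent;

// Replace the CDATA section with a regular text node, as a
// text-based rebuild would.
while ($msg->firstChild !== null) {
    $msg->removeChild($msg->firstChild);
}
$msg->appendChild($doc->createTextNode($text));

$rebuilt = $doc->saveXML($doc->documentElement);
echo $rebuilt, PHP_EOL;

// Parsing the rebuilt XML yields the same text, so no content is lost.
$roundTrip = new DOMDocument();
$roundTrip->loadXML($rebuilt);
$roundTripText = $roundTrip->documentElement->textContent;
```

The text survives the round trip unchanged; only the serialized form differs, which matters mainly for tools or reviewers that expect the original CDATA markup.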