opencat / translation-memory
SQLite-backed translation memory with fuzzy matching for the OpenCAT Framework
Package info
github.com/shaikhammar/opencat-translation-memory
pkg:composer/opencat/translation-memory
Requires
- php: ^8.2
- ext-intl: *
- ext-mbstring: *
- ext-pdo: *
- ext-pdo_sqlite: *
- opencat/core: ^0.1
- opencat/tmx: ^0.1
Requires (Dev)
- phpunit/phpunit: ^11.0
Suggests
- ext-pdo_pgsql: Required for PostgresTranslationMemory — also needs pg_trgm extension enabled in the database
Last update: 2026-05-09 00:57:56 UTC
README
SQLite-backed translation memory with exact and fuzzy matching for the OpenCAT Framework.
Stores TranslationUnit objects, looks them up by similarity against a source Segment, and imports/exports via TMX. A PostgreSQL backend is also available for multi-user deployments.
Installation
composer require opencat/translation-memory
Requires ext-pdo, ext-pdo_sqlite, ext-intl, and ext-mbstring.
For PostgreSQL: install ext-pdo_pgsql and enable the pg_trgm extension in the database.
SQLite TM
    use CatFramework\TranslationMemory\SqliteTranslationMemory;

    $pdo = new PDO('sqlite:project.db');
    $tm = new SqliteTranslationMemory($pdo);
    // Schema is created automatically on first instantiation
Storing translation units
    use CatFramework\Core\Model\TranslationUnit;

    $tm->store(new TranslationUnit(
        source: $sourceSegment,
        target: $targetSegment,
        sourceLanguage: 'en-US',
        targetLanguage: 'fr-FR',
        createdAt: new DateTimeImmutable(),
        createdBy: 'translator@example.com',
    ));
Duplicate entries (same language pair and normalised source text) are silently overwritten with the new translation.
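The overwrite behaviour can be pictured as a unique key on the language pair plus the normalised source text. The following is a minimal sketch of that idea, not the package's actual schema: the table name `tm_units`, its columns, and the helper `storeUnit()` are hypothetical, and the normalisation is reduced to lowercase plus whitespace collapsing.

```php
<?php
// Hypothetical schema: one row per (language pair, normalised source).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec(
    'CREATE TABLE tm_units (
        source_lang TEXT NOT NULL,
        target_lang TEXT NOT NULL,
        source_norm TEXT NOT NULL,
        target_text TEXT NOT NULL,
        UNIQUE (source_lang, target_lang, source_norm)
    )'
);

// Hypothetical helper: a conflicting row is replaced by the new translation.
function storeUnit(PDO $pdo, string $srcLang, string $tgtLang, string $source, string $target): void
{
    // Simplified normalisation: lowercase, collapse whitespace, trim.
    $lower = mb_strtolower($source, 'UTF-8');
    $norm = trim(preg_replace('/\s+/u', ' ', $lower) ?? $lower);

    $stmt = $pdo->prepare(
        'INSERT OR REPLACE INTO tm_units (source_lang, target_lang, source_norm, target_text)
         VALUES (?, ?, ?, ?)'
    );
    $stmt->execute([$srcLang, $tgtLang, $norm, $target]);
}

// Both calls normalise to "save file", so the second replaces the first.
storeUnit($pdo, 'en-US', 'fr-FR', 'Save  File', 'Enregistrer le fichier');
storeUnit($pdo, 'en-US', 'fr-FR', 'save file', 'Sauvegarder le fichier');

$rows = $pdo->query('SELECT target_text FROM tm_units')->fetchAll(PDO::FETCH_COLUMN);
```

Because the overwrite is silent, a caller that needs to keep earlier translations would have to look up the existing unit before calling store().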
Looking up matches
    $matches = $tm->lookup(
        source: $segment,
        sourceLanguage: 'en-US',
        targetLanguage: 'fr-FR',
        minScore: 0.7,   // 0.0–1.0, default 0.7
        maxResults: 5,   // default 5
    );

    foreach ($matches as $match) {
        echo round($match->score * 100) . '% ' . $match->type->name . PHP_EOL;
        echo $match->translationUnit->target->getPlainText() . PHP_EOL;
    }
Results are sorted by score descending. $match->type is one of:
| Score | Type | Meaning |
|---|---|---|
| 1.0 | EXACT | Identical text and identical inline codes |
| 1.0 | EXACT_TEXT | Identical plain text, but inline codes differ |
| < 1.0 | FUZZY | Character-level similarity above $minScore |
Importing and exporting TMX
    $count = $tm->import('memory.tmx'); // returns number of units imported
    $count = $tm->export('backup.tmx'); // returns number of units exported
Import uses the streaming TMX reader, so large files are processed without loading everything into memory.
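To illustrate what streaming means here, the sketch below reads a TMX file one `<tu>` element at a time with PHP's XMLReader, so only the current unit is held in memory. It is not the opencat/tmx TmxReader: the function `streamTmxUnits()` is hypothetical, it keeps plain text only (inline codes are dropped), and it compares language codes exactly.

```php
<?php
// Hypothetical streaming reader: yields one source/target pair per <tu>.
function streamTmxUnits(string $path, string $srcLang, string $tgtLang): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open TMX file: $path");
    }
    try {
        while ($reader->read()) {
            if ($reader->nodeType !== XMLReader::ELEMENT || $reader->name !== 'tu') {
                continue;
            }
            // Only this single <tu> subtree is parsed into memory.
            $tu = new SimpleXMLElement($reader->readOuterXml());
            $segments = [];
            foreach ($tu->tuv as $tuv) {
                $lang = (string) $tuv->attributes('xml', true)->lang;
                $segments[$lang] = (string) $tuv->seg; // plain text only
            }
            if (isset($segments[$srcLang], $segments[$tgtLang])) {
                yield ['source' => $segments[$srcLang], 'target' => $segments[$tgtLang]];
            }
        }
    } finally {
        $reader->close();
    }
}

// Usage with a small temporary file.
$path = tempnam(sys_get_temp_dir(), 'tmx');
file_put_contents($path, <<<'XML'
<tmx version="1.4">
  <header creationtool="sketch" creationtoolversion="0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en-US" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Save file</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Enregistrer le fichier</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Open file</seg></tuv>
      <tuv xml:lang="fr-FR"><seg>Ouvrir le fichier</seg></tuv>
    </tu>
  </body>
</tmx>
XML);

$units = iterator_to_array(streamTmxUnits($path, 'en-US', 'fr-FR'), false);
unlink($path);
```

The generator form keeps memory use roughly proportional to the largest single unit rather than to the file size, which is the property the import relies on.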
How fuzzy matching works
- Normalisation: source text passes through the same pipeline before storage and again at lookup (NFC Unicode → lowercase → collapse whitespace → trim). This makes matching robust to capitalisation and whitespace differences.
- Length pre-filter: only candidates whose character count falls within [sourceLen × minScore, sourceLen ÷ minScore] are retrieved from the database. This is a fast index scan that avoids running Levenshtein on the entire TM.
- Levenshtein similarity: for ASCII text, PHP's native levenshtein() is used. For multibyte text (Hindi, Urdu, Arabic, CJK), ext-intl grapheme-cluster arrays are used so that each grapheme cluster, however many bytes it occupies, counts as a single edit operation.
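The three steps above can be sketched in plain PHP. This is an illustration under assumptions rather than the package's code: all function names are hypothetical, and the score is assumed to be 1 − distance ÷ max(length). That formula is not stated in this README, but it is consistent with the pre-filter bounds, since a candidate shorter than sourceLen × minScore or longer than sourceLen ÷ minScore could never reach minScore under it.

```php
<?php
// Sketch only: function names and the scoring formula are assumptions.

/** Default pipeline: NFC -> lowercase -> collapse whitespace -> trim. */
function normaliseSource(string $text): string
{
    $nfc = Normalizer::normalize($text, Normalizer::FORM_C);
    $text = mb_strtolower($nfc === false ? $text : $nfc, 'UTF-8');
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

/** Inclusive [min, max] candidate lengths for the pre-filter. */
function lengthBounds(int $sourceLen, float $minScore): array
{
    if ($minScore <= 0.0) {
        return [0, PHP_INT_MAX]; // no score floor, so no length filter
    }
    return [(int) ceil($sourceLen * $minScore), (int) floor($sourceLen / $minScore)];
}

/** Split a string into grapheme clusters using ext-intl. */
function graphemes(string $text): array
{
    $out = [];
    $n = (int) grapheme_strlen($text);
    for ($i = 0; $i < $n; $i++) {
        $out[] = grapheme_substr($text, $i, 1);
    }
    return $out;
}

/** Levenshtein distance over two lists, one element per edit unit. */
function arrayLevenshtein(array $a, array $b): int
{
    $prev = range(0, count($b));
    foreach ($a as $i => $ca) {
        $curr = [$i + 1];
        foreach ($b as $j => $cb) {
            $curr[] = min(
                $prev[$j + 1] + 1,                 // deletion
                $curr[$j] + 1,                     // insertion
                $prev[$j] + ($ca === $cb ? 0 : 1)  // substitution
            );
        }
        $prev = $curr;
    }
    return $prev[count($b)];
}

/** Assumed score: 1 - distance / length of the longer string. */
function similarity(string $a, string $b): float
{
    if (mb_check_encoding($a . $b, 'ASCII')) {
        $dist = levenshtein($a, $b); // byte-based, safe for ASCII
        $len = max(strlen($a), strlen($b));
    } else {
        $ga = graphemes($a);
        $gb = graphemes($b);
        $dist = arrayLevenshtein($ga, $gb);
        $len = max(count($ga), count($gb));
    }
    return $len === 0 ? 1.0 : 1.0 - $dist / $len;
}

$query = normaliseSource("  Save   the FILE ");
$score = similarity($query, normaliseSource('Save the files'));
```

Running the byte-based levenshtein() on multibyte text would count one accented or Indic character as several edits, which is why the sketch switches to grapheme arrays as soon as either string is non-ASCII.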
Custom normaliser pipeline
    use CatFramework\TranslationMemory\Normalizer\NormalizerInterface;

    class MyNormalizer implements NormalizerInterface
    {
        public function normalize(string $text): string
        {
            return mb_strtolower($text); // custom logic
        }
    }

    $tm->setNormalizers([new MyNormalizer()]);
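A pipeline of several normalisers is presumably applied in array order, each one receiving the previous one's output. The sketch below shows that composition with a local stand-in interface so it runs on its own. `SketchNormalizer`, the three classes, and `runPipeline()` are hypothetical names, and whether setNormalizers() replaces or extends the default pipeline is not stated in this README.

```php
<?php
// Local stand-in for the package's NormalizerInterface (same method shape).
interface SketchNormalizer
{
    public function normalize(string $text): string;
}

final class LowercaseNormalizer implements SketchNormalizer
{
    public function normalize(string $text): string
    {
        return mb_strtolower($text, 'UTF-8');
    }
}

final class CurlyQuoteNormalizer implements SketchNormalizer
{
    public function normalize(string $text): string
    {
        // Map typographic apostrophes to the ASCII apostrophe.
        return strtr($text, ['’' => "'", '‘' => "'"]);
    }
}

final class WhitespaceNormalizer implements SketchNormalizer
{
    public function normalize(string $text): string
    {
        return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
    }
}

/** Apply each normaliser in order, feeding the result to the next one. */
function runPipeline(array $normalizers, string $text): string
{
    return array_reduce(
        $normalizers,
        fn (string $carry, SketchNormalizer $n): string => $n->normalize($carry),
        $text
    );
}

$pipeline = [new LowercaseNormalizer(), new CurlyQuoteNormalizer(), new WhitespaceNormalizer()];
$normalised = runPipeline($pipeline, "  It’s   FINE ");
```

Since normalisation runs at storage and again at lookup, a TM populated under one pipeline and queried under another may miss matches it would otherwise find, so it seems safest to configure the normalisers before storing anything.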
PostgreSQL TM
For multi-user or large-scale deployments:
    use CatFramework\TranslationMemory\PostgresTranslationMemory;

    $pdo = new PDO('pgsql:host=localhost;dbname=catdb', 'user', 'pass');
    $tm = new PostgresTranslationMemory($pdo);
Requires the pg_trgm extension enabled in PostgreSQL (CREATE EXTENSION IF NOT EXISTS pg_trgm). The PostgreSQL backend uses trigram similarity for fuzzy matching instead of Levenshtein, which scales better for large TMs.
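For intuition about how trigram similarity differs from Levenshtein, the sketch below approximates in PHP what pg_trgm computes in the database: each word is lowercased, padded with two leading spaces and one trailing space, cut into three-character windows, and the score is the number of shared trigrams divided by the number of distinct trigrams in either string. The function names are hypothetical and edge cases may differ from PostgreSQL's implementation.

```php
<?php
/** Set of trigrams for a string, keyed by trigram (approximation of pg_trgm). */
function trigrams(string $text): array
{
    $set = [];
    $lower = mb_strtolower($text, 'UTF-8');
    // Non-alphanumeric characters separate words and are ignored.
    $words = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    foreach ($words as $word) {
        $padded = '  ' . $word . ' ';
        $len = mb_strlen($padded, 'UTF-8');
        for ($i = 0; $i <= $len - 3; $i++) {
            $set[mb_substr($padded, $i, 3, 'UTF-8')] = true;
        }
    }
    return $set;
}

/** Shared trigrams divided by all distinct trigrams of both strings. */
function trigramSimilarity(string $a, string $b): float
{
    $ta = trigrams($a);
    $tb = trigrams($b);
    $shared = count(array_intersect_key($ta, $tb));
    $total = count($ta) + count($tb) - $shared;
    return $total === 0 ? 0.0 : $shared / $total;
}

$score = trigramSimilarity('word', 'two words');
```

Because the score depends only on set overlap, PostgreSQL can answer it from a trigram index instead of computing an edit distance per candidate, which is the scaling advantage mentioned above. The trade-off is that trigram scores and Levenshtein scores are not directly comparable, so the same minScore may admit different candidates on the two backends.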
Related packages
- opencat/core: TranslationUnit, Segment, MatchResult, TranslationMemoryInterface
- opencat/tmx: TmxReader used by import(), TmxWriter used by export()
- opencat/workflow: uses SqliteTranslationMemory in the processing pipeline