webrek/tokenizers

Native byte-level BPE (tiktoken-compatible: cl100k_base/o200k_base) plus WordPiece and Unigram tokenizers for PHP 8.3+, with a process-shared vocab cache. Loads HuggingFace tokenizer.json models and counts Claude/Gemini tokens via their APIs.

Maintainers

Package info

github.com/webrek/tokenizers

Homepage

Documentation

Language:C

Type:php-ext

Ext name:ext-tokenizers

pkg:composer/webrek/tokenizers

Statistics

Installs: 0

Dependents: 0

Suggesters: 0

Stars: 0

Open Issues: 0

v0.1.0 2026-06-29 22:40 UTC

This package is auto-updated.

Last update: 2026-06-29 23:03:54 UTC


README

A native PHP extension that counts, encodes, and decodes LLM tokens β€” byte-exact with the reference tokenizers, fast, and with no Rust toolchain. Plus a pure-PHP companion that counts Claude and Gemini tokens through their official APIs.

CI docs version php thread safety license

Think of it as a scale for AI text: before you send a prompt to an LLM, weigh it in tokens so you know what it will cost and whether it fits the model's context window β€” all from PHP, exactly, without rebuilding a 26 MB vocabulary on every request.

use Tokenizers\Encoding;

$enc = Encoding::load('cl100k_base');                 // GPT-4 / GPT-4o-class encoding
echo $enc->countTokens('Hello, world! πŸŽ‰');           // 7

Why this extension?

Property Pure-PHP tiktoken port This extension
Memory per worker ~26 MB vocab rebuilt every request Loaded once per worker process
Worst-case latency O(nΒ²) per pre-token piece O(n log n) heap-based merge
Install Pure PHP Single .so β€” no Rust, no ffi.enable
Accuracy Approximate Byte-exact vs the reference (tiktoken / BERT / T5)
Models OpenAI only OpenAI + BERT + T5 + Llama/Mistral/Qwen… + Claude/Gemini via API

The wins are memory, worst-case latency, accuracy, and installability β€” not raw throughput on tiny inputs. For prompts that fit in a tweet, a pure-PHP port can be faster (no extension-call overhead). This extension is for workloads where the 26 MB-per-worker overhead, adversarial inputs, or byte-exactness actually matter.

Supported models

Locally, byte-exact

Algorithm Models How to load
BPE (tiktoken) GPT-4, GPT-4o (text), o1, o3 Encoding::load('cl100k_base')
BPE (tiktoken) GPT-4o (multimodal), o1 mini/pro Encoding::load('o200k_base')
BPE (HuggingFace) GPT-2, RoBERTa, Llama 3, Mistral (tekken), Qwen, DeepSeek Encoding::fromHuggingFace('tokenizer.json')
WordPiece BERT family (bert-base-uncased, …) Encoding::fromHuggingFace('tokenizer.json')
Unigram T5, ALBERT (SentencePiece) Encoding::fromHuggingFace('tokenizer.json')

Conformance is verified byte-for-byte against Python tiktoken, HuggingFace BertTokenizerFast, and t5-small. Any diff against the committed fixtures fails CI. See Status & conformance.

Remotely (no public tokenizer)

Claude 3+ and Gemini do not publish their tokenizers β€” there is no local vocabulary to load. The pure-PHP companion (Tokenizers\Remote\Anthropic, Tokenizers\Remote\Gemini, Tokenizers\TokenCounter) counts their tokens through the providers' official count_tokens endpoints. It works without building the C extension. See the Remote providers guide.

Install

phpize && ./configure && make && make install

Then enable it in your php.ini:

extension=tokenizers

Verify:

php -m | grep tokenizers          # β†’ tokenizers

pecl install tokenizers and pie install webrek/tokenizers are also supported. Full instructions, requirements (libpcre2), and troubleshooting are in Getting Started.

Quick start

use Tokenizers\Encoding;

// OpenAI encoding (vocab downloads + caches on first use)
$enc = Encoding::load('cl100k_base');
$n   = $enc->countTokens($prompt);     // count without allocating the id array
$ids = $enc->encode($prompt);          // int[]
$str = $enc->decode($ids);             // round-trips for plain text

// HuggingFace model β€” returns Bpe | WordPiece | Unigram by model type
$bert = Encoding::fromHuggingFace('/path/to/bert/tokenizer.json');
$t5   = Encoding::fromHuggingFace('/path/to/t5/tokenizer.json');

// One facade for local + remote, routed by model name
use Tokenizers\TokenCounter;
$tc = new TokenCounter();
$tc->count('cl100k_base',     $text);  // local BPE, no key
$tc->count('claude-opus-4-8', $text);  // remote Anthropic (needs ANTHROPIC_API_KEY)
$tc->count('gemini-1.5-flash',$text);  // remote Gemini   (needs GEMINI_API_KEY)

Documentation

Guide What it covers
Getting Started Install, enable, verify, first tokenization, troubleshooting
Loading models OpenAI / HuggingFace BPE, WordPiece, Unigram, options, the cache
Estimating LLM costs Budget spend, fit context windows, track usage per client
Remote providers (Claude / Gemini) Counting tokens via the provider APIs
API reference Every class, method, and function
Status & limitations What's verified, conformance results, honest limits, roadmap

Project status

v0.1.0, early but functional. All three planned phases are complete and merged:

  • BPE (cl100k_base, o200k_base, HuggingFace BPE) β€” byte-exact, O(n log n) merge.
  • WordPiece (BERT) and Unigram (T5/SentencePiece) β€” byte-exact.
  • Claude / Gemini API companion β€” pure PHP, standalone.

Honest caveats live in Status & limitations (normalization scope, PIE install not yet verified end-to-end, remote counting needs a network call + key).

License

Apache-2.0 β€” see LICENSE. Vocabulary files for built-in encodings are downloaded from OpenAI's public CDN at runtime, checksum-verified, and not redistributed with the extension.