brucetruth / ml-idea
A production-ready machine learning library for PHP
dev-master
2026-06-16 08:03 UTC
Requires
- php: ^8.2
- ext-json: *
Requires (Dev)
- brucetruth/ml-idea-laravel: @dev
- illuminate/database: ^12.62
- illuminate/support: ^12.62
- phpstan/phpstan: ^1.12
- phpunit/phpunit: ^11.5
Suggests
- ext-parallel: Parallel tool execution via ParallelToolCallRunner
- ext-redis: Redis-backed agent session store
- brucetruth/ml-idea-laravel: Laravel service provider, config, and Artisan commands for ToolRoutingAgent
- open-telemetry/sdk: Export ToolRoutingAgent spans via OpenTelemetry
- psr/log: Psr3AgentRunLogger for PSR-3 audit logging
This package is auto-updated.
Last update: 2026-06-16 08:04:51 UTC
README
ml-idea is a modern, production-oriented machine learning library for PHP focused on clean APIs,
strict typing, and practical classification workflows.
Others have always looked down on PHP and have proclaimed its end since 2000. Well, the elephant keeps moving.
Features
- PHP 8.2+ with strict types
- Consistent classifier contract (train, predict, predictBatch)
- Production-ready baseline classifiers: KNearestNeighbors, LogisticRegression (binary classification), GaussianNaiveBayes
- Model persistence (ModelSerializer)
- Data splitting utility (TrainTestSplit)
- Evaluation metrics (accuracy, precision, recall, f1Score)
- Advanced evaluation metrics: rocAuc, prAuc, logLoss, brierScore, matthewsCorrcoef, meanAbsolutePercentageError
- Preprocessing transformers (StandardScaler, MinMaxScaler)
- Workflow tools (PipelineClassifier, KFold cross-validation splits)
- Extra splitters: StratifiedKFold, TimeSeriesSplit
- Cross-validation helpers: CrossValidation::crossValScore*, CrossValidation::crossValPredict*
- Probability calibration + threshold optimization: CalibratedClassifierCV (CV + cv='prefit'), ThresholdTuner, isotonic regression
- Regression support (LinearRegression, RegressionMetrics)
- Advanced modules: PCA, KMeans, MiniBatchKMeans, DBSCAN, sparse TfidfVectorizer
- Tree ensembles: RandomForestClassifier/Regressor, GradientBoostingClassifier/Regressor, LinearSVC, DecisionTree
- Model selection: stratified GridSearchClassifier, RandomizedSearchClassifier (accuracy, F1, ROC-AUC, PR-AUC)
- Pipeline persistence: PipelineSerializer, TabularPipelineClassifier (OneHot + scalers + estimator)
- Vision module: DCT/noise/patch forensics, ForensicsVisionEmbedder, OllamaVisionEmbedder, VisionIndexer, VisionEval (ROC-AUC), trainable AuthenticityClassifier, neural hooks
- Vision heuristics: color palette analysis, skin-tone risk, AI-generation authenticity scoring
- NLP foundation (Phase 1): fluent Text API, unicode tokenization with offsets, PII redaction, rule-based POS tagging
- NLP Phase 2: language detection, keyword extraction (RAKE), BM25 retrieval, hashing vectorizer, similarity utilities, and NLP RAG helpers
- NLP advanced tagging: multilingual rule-based POS, extensible language profiles, rule-based NER, spaCy-style Nlp::load() API (104 languages)
- NLP neural backend hooks: CallableNlpBackend, OllamaNlpBackend, HuggingFaceInferenceBackend
- GEO service + ML-GEO helpers: country/state/city lookup, nearest-place search, and geo feature building
- Managed dataset assets: registry, integrity checks, licenses metadata, and compiled indexes (trie/automaton/kd-tree)
- RAG foundations: embedders (EmbedderFactory::fromEnv, OpenAI, AzureOpenAI, Ollama, HuggingFaceEmbedder, TeiEmbedder, HashEmbedder), VisionPathEmbedder, splitters, retriever, vector stores
- RAG LLM clients for QA generation: Echo, OpenAI, Azure OpenAI, and Ollama (direct or LlmClientFactory::fromEnv())
- Advanced RAG workflow: document loaders, hybrid retrieval, rerankers, citations/diagnostics, vector-index persistence, tool-calling + streaming hooks
- AI agents + tool routing: ToolCallingAgent, ToolRoutingAgent, deterministic/local routing, and provider-backed routing (OpenAI/Azure/Anthropic/Ollama/custom)
- Unified core contracts (v1.4): fit/predict, probabilistic, online-learning, serializable model interfaces
- Hyperparameter lifecycle helpers: getParams, setParams, cloneWithParams, random-state aware models
- PHPUnit test suite + CI workflow
- Static analysis support with PHPStan
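To illustrate what the evaluation metrics in the list above compute, here is a minimal plain-PHP sketch of precision, recall, and F1 for a single positive label. It does not call ml-idea: the function name binaryPrf is hypothetical, and the library's own metric signatures and edge-case handling may differ.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ml-idea): precision, recall and F1
 * for one positive label, computed from raw label arrays.
 *
 * @param list<string> $yTrue
 * @param list<string> $yPred
 * @return array{precision: float, recall: float, f1: float}
 */
function binaryPrf(array $yTrue, array $yPred, string $positive): array
{
    if (count($yTrue) !== count($yPred)) {
        throw new InvalidArgumentException('Label arrays must have equal length.');
    }

    $tp = $fp = $fn = 0;
    foreach ($yTrue as $i => $truth) {
        $predicted = $yPred[$i];
        if ($predicted === $positive && $truth === $positive) {
            $tp++; // correctly predicted positive
        } elseif ($predicted === $positive) {
            $fp++; // predicted positive, actually negative
        } elseif ($truth === $positive) {
            $fn++; // missed positive
        }
    }

    // Return 0.0 instead of dividing by zero when a denominator is empty.
    $precision = ($tp + $fp) > 0 ? (float) ($tp / ($tp + $fp)) : 0.0;
    $recall = ($tp + $fn) > 0 ? (float) ($tp / ($tp + $fn)) : 0.0;
    $f1 = ($precision + $recall) > 0.0
        ? 2 * $precision * $recall / ($precision + $recall)
        : 0.0;

    return ['precision' => $precision, 'recall' => $recall, 'f1' => $f1];
}

$scores = binaryPrf(['A', 'A', 'B', 'B'], ['A', 'B', 'B', 'B'], 'B');
printf("precision=%.3f recall=%.3f f1=%.3f\n", $scores['precision'], $scores['recall'], $scores['f1']);
```

In this toy input, three samples are predicted as B and two of them are truly B, so precision is 2/3 while recall is 1.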
Installation
composer require brucetruth/ml-idea
Quick Start
<?php

declare(strict_types=1);

require_once 'vendor/autoload.php';

use ML\IDEA\Classifiers\KNearestNeighbors;
use ML\IDEA\Data\TrainTestSplit;
use ML\IDEA\Metrics\ClassificationMetrics;
use ML\IDEA\Preprocessing\StandardScaler;

// Toy dataset: two well-separated clusters labelled A and B.
$samples = [[1, 1], [1, 2], [2, 1], [4, 4], [5, 5], [4, 5]];
$labels = ['A', 'A', 'A', 'B', 'B', 'B'];

// Hold out a third of the data; the seed keeps the split reproducible.
$split = TrainTestSplit::split($samples, $labels, testSize: 0.33, seed: 42);

// Fit the scaler on the training split only, then reuse it on the test split.
$scaler = new StandardScaler();
$xTrain = $scaler->fitTransform($split['xTrain']);
$xTest = $scaler->transform($split['xTest']);

$model = new KNearestNeighbors(k: 3, weighted: true);
$model->train($xTrain, $split['yTrain']);

$predictions = $model->predictBatch($xTest);
$accuracy = ClassificationMetrics::accuracy($split['yTest'], $predictions);

echo "Accuracy: " . round($accuracy * 100, 2) . "%\n";
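The Quick Start fits the scaler on the training split and only applies it to the test split. The sketch below shows the idea behind that pattern in plain PHP, assuming z-score scaling with the population standard deviation. The helper names fitColumnStats and applyColumnStats are hypothetical, and ml-idea's StandardScaler may use different conventions internally.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: per-column mean and population standard deviation.
 *
 * @param list<list<float|int>> $rows
 * @return array{mean: list<float>, std: list<float>}
 */
function fitColumnStats(array $rows): array
{
    $n = count($rows);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one row.');
    }

    $mean = [];
    $std = [];
    foreach (array_keys($rows[0]) as $j) {
        $column = array_column($rows, $j);
        $mu = (float) (array_sum($column) / $n);

        $variance = 0.0;
        foreach ($column as $value) {
            $variance += ($value - $mu) ** 2;
        }
        $sigma = sqrt($variance / $n);

        $mean[] = $mu;
        // A constant column has zero spread; use 1.0 so it maps to 0 instead of dividing by zero.
        $std[] = $sigma > 0.0 ? $sigma : 1.0;
    }

    return ['mean' => $mean, 'std' => $std];
}

/**
 * Hypothetical helper: apply previously fitted statistics to any rows.
 *
 * @param list<list<float|int>> $rows
 * @param array{mean: list<float>, std: list<float>} $stats
 * @return list<list<float>>
 */
function applyColumnStats(array $rows, array $stats): array
{
    return array_map(
        static fn (array $row): array => array_map(
            static fn ($value, float $mu, float $sigma): float => ($value - $mu) / $sigma,
            $row,
            $stats['mean'],
            $stats['std']
        ),
        $rows
    );
}

// Statistics come from the training rows only, so nothing about the test rows leaks into them.
$stats = fitColumnStats([[1.0, 10.0], [3.0, 10.0]]);
$scaled = applyColumnStats([[5.0, 10.0]], $stats);
var_export($scaled);
```

Here the first column has mean 2 and standard deviation 1, so the test value 5 maps to 3. The constant second column maps to 0.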
Model Persistence
use ML\IDEA\Model\ModelSerializer;

ModelSerializer::save($model, __DIR__ . '/knn.model.json');
$loadedModel = ModelSerializer::load(__DIR__ . '/knn.model.json');
Advanced v1.2 Examples
1) Pipeline + KFold
use ML\IDEA\Classifiers\KNearestNeighbors;
use ML\IDEA\Data\KFold;
use ML\IDEA\Pipeline\PipelineClassifier;
use ML\IDEA\Preprocessing\StandardScaler;

$samples = [[1, 1], [1, 2], [2, 1], [4, 4], [5, 5], [4, 5]];
$labels = ['A', 'A', 'A', 'B', 'B', 'B'];

// Each fold holds lists of row indices under the 'train' and 'test' keys.
$folds = KFold::split(count($samples), nSplits: 3, shuffle: true, seed: 42);

foreach ($folds as $fold) {
    $xTrain = $yTrain = $xTest = $yTest = [];

    foreach ($fold['train'] as $i) {
        $xTrain[] = $samples[$i];
        $yTrain[] = $labels[$i];
    }
    foreach ($fold['test'] as $i) {
        $xTest[] = $samples[$i];
        $yTest[] = $labels[$i];
    }

    // The pipeline refits the scaler inside every fold, so test rows never influence it.
    $model = new PipelineClassifier([new StandardScaler()], new KNearestNeighbors(3, true));
    $model->train($xTrain, $yTrain);
    $pred = $model->predictBatch($xTest);
}
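The example above consumes folds shaped as lists of 'train' and 'test' indices. As a rough sketch of how such index folds can be produced, the plain-PHP function below shuffles the indices with a seeded generator and cuts them into near-equal test blocks. The name kFoldIndices is hypothetical, and the shuffling and remainder handling are assumptions rather than a description of how ml-idea's KFold works.

```php
<?php
declare(strict_types=1);

use Random\Engine\Mt19937;
use Random\Randomizer;

/**
 * Hypothetical helper: split sample indices into k train/test folds.
 *
 * @return list<array{train: list<int>, test: list<int>}>
 */
function kFoldIndices(int $nSamples, int $nSplits, ?int $seed = null): array
{
    if ($nSplits < 2 || $nSplits > $nSamples) {
        throw new InvalidArgumentException('nSplits must be between 2 and the number of samples.');
    }

    $indices = range(0, $nSamples - 1);
    if ($seed !== null) {
        // Seeded shuffle (PHP 8.2+), so the same seed always yields the same folds.
        $indices = (new Randomizer(new Mt19937($seed)))->shuffleArray($indices);
    }

    $base = intdiv($nSamples, $nSplits);
    $extra = $nSamples % $nSplits; // the first $extra folds get one additional test sample
    $folds = [];
    $start = 0;

    for ($k = 0; $k < $nSplits; $k++) {
        $size = $base + ($k < $extra ? 1 : 0);
        $test = array_slice($indices, $start, $size);
        // Everything outside the test block is used for training.
        $train = array_merge(
            array_slice($indices, 0, $start),
            array_slice($indices, $start + $size)
        );
        $folds[] = ['train' => $train, 'test' => $test];
        $start += $size;
    }

    return $folds;
}

$folds = kFoldIndices(7, 3, seed: 42);
foreach ($folds as $k => $fold) {
    printf("fold %d: %d train, %d test\n", $k, count($fold['train']), count($fold['test']));
}
```

Every index appears in exactly one test block, which is the property that makes the per-fold scores independent estimates.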
2) Linear Regression
use ML\IDEA\Regression\LinearRegression;
use ML\IDEA\Metrics\RegressionMetrics;

$x = [[1.0], [2.0], [3.0], [4.0]];
$y = [2.0, 4.0, 6.0, 8.0];

$reg = new LinearRegression(learningRate: 0.05, iterations: 5000);
$reg->train($x, $y);

$pred = $reg->predictBatch($x);
echo RegressionMetrics::rootMeanSquaredError($y, $pred);
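The learningRate and iterations parameters suggest an iterative optimizer. Assuming batch gradient descent on mean squared error (an assumption, not something the README confirms), a single-feature version and the RMSE metric can be sketched in plain PHP as follows. The function names are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: fit y = w * x + b by batch gradient descent on MSE.
 *
 * @param list<float> $x
 * @param list<float> $y
 * @return array{w: float, b: float}
 */
function fitLineByGradientDescent(array $x, array $y, float $learningRate, int $iterations): array
{
    $n = count($x);
    if ($n === 0 || $n !== count($y)) {
        throw new InvalidArgumentException('x and y must be non-empty and of equal length.');
    }

    $w = 0.0;
    $b = 0.0;
    for ($step = 0; $step < $iterations; $step++) {
        $gradW = 0.0;
        $gradB = 0.0;
        foreach ($x as $i => $xi) {
            $error = ($w * $xi + $b) - $y[$i];
            // Partial derivatives of (1/n) * sum(error^2) with respect to w and b.
            $gradW += 2.0 * $error * $xi / $n;
            $gradB += 2.0 * $error / $n;
        }
        $w -= $learningRate * $gradW;
        $b -= $learningRate * $gradB;
    }

    return ['w' => $w, 'b' => $b];
}

/**
 * Root mean squared error between targets and predictions.
 *
 * @param list<float> $yTrue
 * @param list<float> $yPred
 */
function rmse(array $yTrue, array $yPred): float
{
    $sum = 0.0;
    foreach ($yTrue as $i => $value) {
        $sum += ($value - $yPred[$i]) ** 2;
    }

    return sqrt($sum / count($yTrue));
}

$x = [1.0, 2.0, 3.0, 4.0];
$y = [2.0, 4.0, 6.0, 8.0];

$line = fitLineByGradientDescent($x, $y, learningRate: 0.05, iterations: 5000);
$pred = array_map(static fn (float $xi): float => $line['w'] * $xi + $line['b'], $x);

printf("w=%.4f b=%.4f rmse=%.6f\n", $line['w'], $line['b'], rmse($y, $pred));
```

On this noiseless data the fit should approach w = 2 and b = 0. With unscaled features, a learning rate that is too large makes the updates diverge, which is one reason to scale inputs first.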
3) Text Embedding (TF-IDF)
use ML\IDEA\NLP\TfidfVectorizer;

$docs = ['machine learning in php', 'php library for intelligence'];

$vectorizer = new TfidfVectorizer();
$matrix = $vectorizer->fitTransform($docs);
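For readers unfamiliar with TF-IDF, the sketch below computes one common variant in plain PHP: raw term counts multiplied by a smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1, with no row normalization. The weighting scheme, the whitespace tokenizer, and the function name tfidfMatrix are all assumptions; ml-idea's TfidfVectorizer may tokenize, weight, and normalize differently.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: dense TF-IDF matrix with a sorted vocabulary.
 *
 * @param list<string> $docs
 * @return array{vocabulary: list<int|string>, matrix: list<list<float>>}
 */
function tfidfMatrix(array $docs): array
{
    // Naive tokenizer: lowercase and split on whitespace.
    $tokenized = array_map(
        static fn (string $doc): array => preg_split('/\s+/', strtolower(trim($doc)), -1, PREG_SPLIT_NO_EMPTY),
        $docs
    );

    // Document frequency: in how many documents each term occurs.
    $df = [];
    foreach ($tokenized as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }
    ksort($df);
    $vocabulary = array_keys($df);

    $n = count($docs);
    $matrix = [];
    foreach ($tokenized as $tokens) {
        $counts = array_count_values($tokens);
        $row = [];
        foreach ($vocabulary as $term) {
            // Smoothed IDF: terms found in every document still receive weight 1.
            $idf = log((1 + $n) / (1 + $df[$term])) + 1.0;
            $row[] = (float) (($counts[$term] ?? 0) * $idf);
        }
        $matrix[] = $row;
    }

    return ['vocabulary' => $vocabulary, 'matrix' => $matrix];
}

$result = tfidfMatrix(['machine learning in php', 'php library for intelligence']);
echo implode(', ', $result['vocabulary']), "\n";
foreach ($result['matrix'] as $row) {
    echo implode(' ', array_map(static fn (float $v): string => sprintf('%.3f', $v), $row)), "\n";
}
```

The term "php" occurs in both documents, so it gets the lowest weight of the terms present. Terms unique to one document are weighted higher.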
Development
composer install
composer test
composer analyse
Examples
See runnable use-case scripts in examples/:
- basic classification flow
- CV + advanced metrics
- probability calibration + threshold tuning
- regression pipelines
- text features + clustering
- hyperparameter search
- RAG local chain + vector-store examples
- RAG DB loader example (SQLite/PDO)
- Agent toolbox example (examples/agents) with local KB + weather + free API tools
- Vision examples (palette extraction and content-risk heuristic demo)
- Vision authenticity-risk example (AI-generated likelihood heuristic)
- NLP Text API + POS example (examples/16_nlp_text_api_and_pos.php)
- NLP BM25 + similarity example (examples/17_nlp_bm25_and_similarity.php)
- NLP multilingual POS + NER example (examples/18_nlp_multilingual_ner.php)
- NLP extensibility example (examples/19_nlp_extensibility_custom_profiles.php)
- NLP trainable POS/NER pipeline example (examples/20_nlp_trainable_pos_ner.php)
- ML competitiveness demo (examples/33_ml_competitiveness.php): trees, pipelines, search, calibration
- Tier-2 ML/RAG demo (examples/34_tier2_ml_rag.php): sparse TF-IDF, KMeans/DBSCAN, multiclass calibration, ANN
- Vision ML classifier demo (examples/35_vision_ml_classifier.php): forensics features + trainable authenticity model
- Vision eval demo (examples/36_vision_eval_demo.php): ROC-AUC/PR-AUC on labeled fixtures
- Vision/RAG frontier hooks (examples/37_vision_rag_frontier_hooks.php): neural backends, HF embedder, vec0 factory
- Image similarity RAG (examples/38_image_similarity_rag.php): forensics embeddings + ANN search
- Production embedder/index (examples/39_production_embedder_and_vision_index.php): EmbedderFactory, VisionIndexer, directory scan
Roadmap
Priority 0 — Competitiveness Foundations
- Performance + benchmarking first: reproducible benchmark suite, memory/latency tracking, and publishable baseline reports.
- Production inference contract: versioned model bundles, input/output schema validation, safer deserialization, and deterministic fallback behavior.
- Reproducibility workflow: run metadata capture (params, seed, metrics, artifacts) and standardized experiment summaries.
Priority 1 — Algorithm & Feature Coverage
- More algorithms (tree-based models, multiclass linear models, stronger ensemble baselines).
- Feature preprocessing (normalization, encoding, imputation, outlier-aware transforms, rare-category handling).
- Time-series ML support beyond splitting (lag/rolling feature generators and leakage-safe pipeline patterns).
- Cross-validation utilities expansion (task-specific helpers and richer evaluation modes).
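The time-series item above mentions lag/rolling feature generators and leakage-safe patterns. Since this is a roadmap entry, no such API is shown in the README; the following plain-PHP sketch only illustrates the concept. The function name lagAndRollingFeatures and its return shape are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: build lag features and a trailing rolling mean.
 * Features for time t use values strictly before t, so the target never leaks into them.
 *
 * @param list<float|int> $series
 * @return list<array{features: list<float>, target: float}>
 */
function lagAndRollingFeatures(array $series, int $lags, int $window): array
{
    if ($lags < 1 || $window < 1) {
        throw new InvalidArgumentException('lags and window must be at least 1.');
    }

    // The first usable time step is the one with enough history for both feature types.
    $start = max($lags, $window);
    $rows = [];

    for ($t = $start; $t < count($series); $t++) {
        $features = [];
        for ($lag = 1; $lag <= $lags; $lag++) {
            $features[] = (float) $series[$t - $lag];
        }

        // Rolling mean over the window that ends at t - 1.
        $past = array_slice($series, $t - $window, $window);
        $features[] = (float) (array_sum($past) / $window);

        $rows[] = ['features' => $features, 'target' => (float) $series[$t]];
    }

    return $rows;
}

$rows = lagAndRollingFeatures([10, 20, 30, 40, 50], lags: 2, window: 3);
foreach ($rows as $row) {
    echo implode(', ', $row['features']), ' -> ', $row['target'], "\n";
}
```

For the target 40, the features are the two previous values (30, 20) and the mean of the three values before it (20). The value 40 itself is never used.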
Priority 2 — Interoperability & Ecosystem
- Model interoperability bridges (portable formats/import-export adapters) for cross-runtime workflows.
- Dataset loaders and richer benchmarking tools.
- Documentation expansion: task-first recipes, performance tuning guides, and production deployment playbooks.
Priority 3 — RAG/Agent Production Hardening
- Context and chat history handling for the Tool Routing Agent. (v1.5: AgentContextManager)
- Tool reliability layer for agents (timeouts, retries, fallbacks, structured errors). (v1.5: ToolReliabilityPolicy, RetryableToolInterface)
- Policy and safety guardrails (tool allow/deny rules, injection checks, PII-safe logs).
- Improved routing quality (confidence scoring, clarification turn, top-k tool candidates). (v1.5: decision confidence field)
- Observability + evaluation harness for routing/tool accuracy regressions. (v1.5: AgentEvalHarness)
- Streaming agent runs and human-in-the-loop approval gates. (v1.6: chatStream(), resumeWithApproval())
- Session auto-persist for multi-turn agents. (v1.6: chatInSession(), AnthropicToolRoutingModel, Ollama native tools)
- MCP remote tools + pluggable session stores (file default, Redis optional). (v1.7: McpToolProvider, AgentStateStoreFactory)
- Multi-agent supervisor handoffs to specialist agents. (v1.7: AgentHandoffRegistry, handoff decision)
- OpenTelemetry-style tracing for agent runs. (v1.7: AgentTracerInterface, OpenTelemetryAgentTracer)
- Laravel bridge package with config, facade, and eval Artisan command. (v1.7: brucetruth/ml-idea-laravel)
- AI admin example: standalone demos in examples/ai-admin/, Laravel copy-paste demo in packages/laravel/examples/ai-admin/
- Memory strategy beyond raw history (summaries, pruning, retrieval-based recall).
- Cost/latency controls (model tiering, caching, token budgets).
- Human-in-the-loop controls for risky actions and execution approvals.
- Output quality controls (schema validation, grounding/citation checks, consistency pass).
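The tool reliability item in this list names retries, fallbacks, and structured errors. The sketch below illustrates those three ideas in plain PHP. It is not the ToolReliabilityPolicy API, whose interface is not shown in the README; the function callToolWithRetry and its result shape are hypothetical, and timeouts are left out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: run a tool callable with retries, an optional fallback,
 * and a structured result instead of an uncaught exception.
 *
 * @param array<string, mixed> $args
 * @return array{ok: bool, value: mixed, attempts: int, error: ?string, usedFallback: bool}
 */
function callToolWithRetry(
    callable $tool,
    array $args,
    int $maxAttempts = 3,
    int $backoffMs = 0,
    ?callable $fallback = null
): array {
    $lastError = null;

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return [
                'ok' => true,
                'value' => $tool($args),
                'attempts' => $attempt,
                'error' => null,
                'usedFallback' => false,
            ];
        } catch (Throwable $e) {
            $lastError = $e->getMessage();
            if ($attempt < $maxAttempts && $backoffMs > 0) {
                usleep($backoffMs * 1000 * $attempt); // linear backoff between attempts
            }
        }
    }

    if ($fallback !== null) {
        // All attempts failed: degrade gracefully, but keep the last error for logging.
        return [
            'ok' => true,
            'value' => $fallback($args),
            'attempts' => $maxAttempts,
            'error' => $lastError,
            'usedFallback' => true,
        ];
    }

    return ['ok' => false, 'value' => null, 'attempts' => $maxAttempts, 'error' => $lastError, 'usedFallback' => false];
}

// A flaky tool that fails twice before succeeding.
$calls = 0;
$flakyWeather = function (array $args) use (&$calls): string {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('temporary failure');
    }
    return 'sunny in ' . $args['city'];
};

$result = callToolWithRetry($flakyWeather, ['city' => 'Lusaka']);
printf("ok=%s attempts=%d value=%s\n", $result['ok'] ? 'true' : 'false', $result['attempts'], $result['value']);
```

Returning a structured array rather than throwing lets the calling agent decide whether to surface the error, ask for clarification, or route to another tool.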
Strategic Positioning
- Position ml-idea as: the production PHP AI runtime combining classical ML + NLP + RAG + tool-using agents with strict typing and testability.