brucetruth/ml-idea

A production-ready machine learning library for PHP

Package info

github.com/brucetruth/ml-idea

pkg:composer/brucetruth/ml-idea

dev-master 2026-06-16 08:03 UTC

Last update: 2026-06-16 08:04:51 UTC


README

ml-idea is a modern, production-oriented machine learning library for PHP focused on clean APIs, strict typing, and practical classification workflows.

PHP has been looked down on, and its end proclaimed, since 2000. Still, the elephant keeps moving.

Features

  • PHP 8.2+ with strict types
  • Consistent classifier contract (train, predict, predictBatch)
  • Production-ready baseline classifiers:
    • KNearestNeighbors
    • LogisticRegression (binary classification)
    • GaussianNaiveBayes
  • Model persistence (ModelSerializer)
  • Data splitting utility (TrainTestSplit)
  • Evaluation metrics (accuracy, precision, recall, f1Score)
  • Advanced evaluation metrics: rocAuc, prAuc, logLoss, brierScore, matthewsCorrcoef, meanAbsolutePercentageError
  • Preprocessing transformers (StandardScaler, MinMaxScaler)
  • Workflow tools (PipelineClassifier, KFold cross-validation splits)
  • Extra splitters: StratifiedKFold, TimeSeriesSplit
  • Cross-validation helpers: CrossValidation::crossValScore*, CrossValidation::crossValPredict*
  • Probability calibration + threshold optimization: CalibratedClassifierCV (CV + cv='prefit'), ThresholdTuner, isotonic regression
  • Regression support (LinearRegression, RegressionMetrics)
  • Advanced modules: PCA, KMeans, MiniBatchKMeans, DBSCAN, sparse TfidfVectorizer
  • Trees, ensembles, and linear models: DecisionTree, RandomForestClassifier/Regressor, GradientBoostingClassifier/Regressor, LinearSVC
  • Model selection: stratified GridSearchClassifier, RandomizedSearchClassifier (accuracy, F1, ROC-AUC, PR-AUC)
  • Pipeline persistence: PipelineSerializer, TabularPipelineClassifier (OneHot + scalers + estimator)
  • Vision module: DCT/noise/patch forensics, ForensicsVisionEmbedder, OllamaVisionEmbedder, VisionIndexer, VisionEval (ROC-AUC), trainable AuthenticityClassifier, neural hooks
  • Vision heuristics: color palette analysis, skin-tone risk, AI-generation authenticity scoring
  • NLP foundation (Phase 1): fluent Text API, unicode tokenization with offsets, PII redaction, rule-based POS tagging
  • NLP Phase 2: language detection, keyword extraction (RAKE), BM25 retrieval, hashing vectorizer, similarity utilities, and NLP RAG helpers
  • NLP advanced tagging: multilingual rule-based POS, extensible language profiles, rule-based NER, spaCy-style Nlp::load() API (104 languages)
  • NLP neural backend hooks: CallableNlpBackend, OllamaNlpBackend, HuggingFaceInferenceBackend
  • GEO service + ML-GEO helpers: country/state/city lookup, nearest-place search, and geo feature building
  • Managed dataset assets: registry, integrity checks, licenses metadata, and compiled indexes (trie/automaton/kd-tree)
  • RAG foundations: embedders (EmbedderFactory::fromEnv, OpenAI, AzureOpenAI, Ollama, HuggingFaceEmbedder, TeiEmbedder, HashEmbedder), VisionPathEmbedder, splitters, retriever, vector stores
  • RAG LLM clients for QA generation: Echo, OpenAI, Azure OpenAI, and Ollama (direct or LlmClientFactory::fromEnv())
  • Advanced RAG workflow: document loaders, hybrid retrieval, rerankers, citations/diagnostics, vector-index persistence, tool-calling + streaming hooks
  • AI agents + tool routing: ToolCallingAgent, ToolRoutingAgent, deterministic/local routing, and provider-backed routing (OpenAI/Azure/Anthropic/Ollama/custom)
  • Unified core contracts (v1.4): fit/predict, probabilistic, online-learning, serializable model interfaces
  • Hyperparameter lifecycle helpers: getParams, setParams, cloneWithParams, random-state aware models
  • PHPUnit test suite + CI workflow
  • Static analysis support with PHPStan
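
To make the basic evaluation metrics in the list above concrete, here is a minimal plain-PHP sketch of precision, recall, and F1 for a single positive class. It is not the library's implementation: the function name `binaryPrf` and its signature are hypothetical and exist only for illustration.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of ml-idea): precision, recall and F1
 * for one positive class, computed from raw label arrays.
 *
 * @param list<string|int> $yTrue
 * @param list<string|int> $yPred
 * @return array{precision: float, recall: float, f1: float}
 */
function binaryPrf(array $yTrue, array $yPred, string|int $positive): array
{
    if (count($yTrue) !== count($yPred)) {
        throw new InvalidArgumentException('Label arrays must have equal length.');
    }

    $tp = $fp = $fn = 0;
    foreach ($yTrue as $i => $actual) {
        $predicted = $yPred[$i];
        if ($predicted === $positive && $actual === $positive) {
            $tp++;            // predicted positive, really positive
        } elseif ($predicted === $positive) {
            $fp++;            // predicted positive, really negative
        } elseif ($actual === $positive) {
            $fn++;            // missed a real positive
        }
    }

    // Guard against division by zero when the class is never predicted or never present.
    $precision = ($tp + $fp) > 0 ? (float) ($tp / ($tp + $fp)) : 0.0;
    $recall    = ($tp + $fn) > 0 ? (float) ($tp / ($tp + $fn)) : 0.0;
    $f1 = ($precision + $recall) > 0.0
        ? 2.0 * $precision * $recall / ($precision + $recall)
        : 0.0;

    return ['precision' => $precision, 'recall' => $recall, 'f1' => $f1];
}

$scores = binaryPrf(['A', 'A', 'B', 'B'], ['A', 'B', 'B', 'B'], 'A');
printf(
    "precision=%.2f recall=%.2f f1=%.2f\n",
    $scores['precision'],
    $scores['recall'],
    $scores['f1']
);
```

Returning 0.0 for the undefined cases is one common convention; other libraries raise a warning or let the caller choose the value.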

Installation

composer require brucetruth/ml-idea

Quick Start

<?php

declare(strict_types=1);

require_once 'vendor/autoload.php';

use ML\IDEA\Classifiers\KNearestNeighbors;
use ML\IDEA\Data\TrainTestSplit;
use ML\IDEA\Metrics\ClassificationMetrics;
use ML\IDEA\Preprocessing\StandardScaler;

$samples = [[1, 1], [1, 2], [2, 1], [4, 4], [5, 5], [4, 5]];
$labels = ['A', 'A', 'A', 'B', 'B', 'B'];

$split = TrainTestSplit::split($samples, $labels, testSize: 0.33, seed: 42);

$scaler = new StandardScaler();
$xTrain = $scaler->fitTransform($split['xTrain']);
$xTest = $scaler->transform($split['xTest']);

$model = new KNearestNeighbors(k: 3, weighted: true);
$model->train($xTrain, $split['yTrain']);

$predictions = $model->predictBatch($xTest);
$accuracy = ClassificationMetrics::accuracy($split['yTest'], $predictions);

echo "Accuracy: " . round($accuracy * 100, 2) . "%\n";
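
The Quick Start calls fitTransform on the training split but only transform on the test split. The plain-PHP sketch below shows the idea behind that pattern: the mean and standard deviation are learned from training rows only and then reused for unseen rows. The helpers `fitStandardScaler` and `applyStandardScaler` are hypothetical, and the use of the population standard deviation is an assumption; the library's StandardScaler may differ in both respects.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not ml-idea code): learn per-column mean and
 * standard deviation from the training rows only.
 *
 * @param list<list<float|int>> $rows
 * @return array{mean: list<float>, std: list<float>}
 */
function fitStandardScaler(array $rows): array
{
    $n = count($rows);
    if ($n === 0) {
        throw new InvalidArgumentException('Need at least one row.');
    }

    $mean = $std = [];
    foreach (array_keys($rows[0]) as $j) {
        $column = array_column($rows, $j);
        $mu = array_sum($column) / $n;

        $squares = 0.0;
        foreach ($column as $value) {
            $squares += ($value - $mu) ** 2;
        }

        $mean[$j] = (float) $mu;
        // Population standard deviation (assumption). Constant columns fall
        // back to 1.0 so the division in applyStandardScaler stays defined.
        $sigma = sqrt($squares / $n);
        $std[$j] = $sigma > 0.0 ? $sigma : 1.0;
    }

    return ['mean' => $mean, 'std' => $std];
}

/**
 * Apply previously learned statistics to any rows (train or test).
 *
 * @param list<list<float|int>> $rows
 * @param array{mean: list<float>, std: list<float>} $stats
 * @return list<list<float>>
 */
function applyStandardScaler(array $rows, array $stats): array
{
    $scaled = [];
    foreach ($rows as $row) {
        $out = [];
        foreach ($row as $j => $value) {
            $out[] = ($value - $stats['mean'][$j]) / $stats['std'][$j];
        }
        $scaled[] = $out;
    }
    return $scaled;
}

$train = [[1, 1], [1, 2], [2, 1], [4, 4]];
$test  = [[5, 5]];

$stats       = fitStandardScaler($train);          // statistics come from train only
$trainScaled = applyStandardScaler($train, $stats);
$testScaled  = applyStandardScaler($test, $stats); // test reuses the same statistics

printf("scaled test row: %.3f, %.3f\n", $testScaled[0][0], $testScaled[0][1]);
```

Fitting on the full dataset before splitting would let test-set statistics influence training, which tends to make evaluation scores look better than they are.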

Model Persistence

use ML\IDEA\Model\ModelSerializer;

ModelSerializer::save($model, __DIR__ . '/knn.model.json');
$loadedModel = ModelSerializer::load(__DIR__ . '/knn.model.json');

Advanced v1.2 Examples

1) Pipeline + KFold

use ML\IDEA\Classifiers\KNearestNeighbors;
use ML\IDEA\Data\KFold;
use ML\IDEA\Metrics\ClassificationMetrics;
use ML\IDEA\Pipeline\PipelineClassifier;
use ML\IDEA\Preprocessing\StandardScaler;

$samples = [[1,1],[1,2],[2,1],[4,4],[5,5],[4,5]];
$labels = ['A','A','A','B','B','B'];

$folds = KFold::split(count($samples), nSplits: 3, shuffle: true, seed: 42);
foreach ($folds as $n => $fold) {
    // Gather this fold's rows and labels by index.
    $xTrain = $yTrain = $xTest = $yTest = [];
    foreach ($fold['train'] as $i) { $xTrain[] = $samples[$i]; $yTrain[] = $labels[$i]; }
    foreach ($fold['test'] as $i) { $xTest[] = $samples[$i]; $yTest[] = $labels[$i]; }

    // Build a fresh pipeline per fold so nothing learned on one fold carries into the next.
    $model = new PipelineClassifier([new StandardScaler()], new KNearestNeighbors(3, true));
    $model->train($xTrain, $yTrain);
    $pred = $model->predictBatch($xTest);

    // Score the held-out rows for this fold.
    echo "Fold {$n} accuracy: " . ClassificationMetrics::accuracy($yTest, $pred) . "\n";
}
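
The loop above consumes folds shaped as arrays with 'train' and 'test' index lists. As a rough illustration of how such splits can be produced, here is a plain-PHP sketch. The function `kFoldIndices` is hypothetical and is not the library's KFold; it only mirrors the array shape used above and relies on the Random extension that ships with PHP 8.2+.

```php
<?php

declare(strict_types=1);

use Random\Engine\Mt19937;
use Random\Randomizer;

/**
 * Hypothetical helper (not the library's KFold): build k train/test index splits.
 *
 * @return list<array{train: list<int>, test: list<int>}>
 */
function kFoldIndices(int $nSamples, int $nSplits, ?int $seed = null): array
{
    if ($nSplits < 2 || $nSplits > $nSamples) {
        throw new InvalidArgumentException('nSplits must be between 2 and nSamples.');
    }

    $indices = range(0, $nSamples - 1);
    if ($seed !== null) {
        // A seeded engine makes the shuffle reproducible across runs.
        $indices = (new Randomizer(new Mt19937($seed)))->shuffleArray($indices);
    }

    $folds = [];
    $start = 0;
    for ($f = 0; $f < $nSplits; $f++) {
        // Spread the remainder over the first folds so sizes differ by at most one.
        $size = intdiv($nSamples, $nSplits) + ($f < $nSamples % $nSplits ? 1 : 0);

        $test  = array_slice($indices, $start, $size);
        $train = array_merge(
            array_slice($indices, 0, $start),
            array_slice($indices, $start + $size)
        );

        $folds[] = ['train' => $train, 'test' => $test];
        $start += $size;
    }

    return $folds;
}

$folds = kFoldIndices(6, 3, seed: 42);
foreach ($folds as $i => $fold) {
    echo "fold {$i}: test=" . implode(',', $fold['test'])
        . " train=" . implode(',', $fold['train']) . "\n";
}
```

Every index appears in exactly one test fold, so each sample is used for validation once and for training in the remaining folds.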

2) Linear Regression

use ML\IDEA\Regression\LinearRegression;
use ML\IDEA\Metrics\RegressionMetrics;

$x = [[1.0], [2.0], [3.0], [4.0]];
$y = [2.0, 4.0, 6.0, 8.0];

$reg = new LinearRegression(learningRate: 0.05, iterations: 5000);
$reg->train($x, $y);
$pred = $reg->predictBatch($x);

echo RegressionMetrics::rootMeanSquaredError($y, $pred);
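
The learningRate and iterations arguments suggest an iterative, gradient-based fit. The sketch below shows what batch gradient descent on mean squared error looks like for a single feature, together with a root-mean-squared-error helper. Both `fitLine` and `rmse` are hypothetical names, and this is an assumption about the general technique, not a description of how LinearRegression is implemented internally.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch (not ml-idea's LinearRegression): fit y = w * x + b
 * with batch gradient descent on mean squared error.
 *
 * @param list<float> $x
 * @param list<float> $y
 * @return array{w: float, b: float}
 */
function fitLine(array $x, array $y, float $learningRate, int $iterations): array
{
    $n = count($x);
    if ($n === 0 || $n !== count($y)) {
        throw new InvalidArgumentException('x and y must be non-empty and equal in length.');
    }

    $w = 0.0;
    $b = 0.0;
    for ($step = 0; $step < $iterations; $step++) {
        $gradW = 0.0;
        $gradB = 0.0;
        foreach ($x as $i => $xi) {
            $error = ($w * $xi + $b) - $y[$i];
            // Partial derivatives of the mean squared error.
            $gradW += 2.0 * $error * $xi / $n;
            $gradB += 2.0 * $error / $n;
        }
        // Step against the gradient.
        $w -= $learningRate * $gradW;
        $b -= $learningRate * $gradB;
    }

    return ['w' => $w, 'b' => $b];
}

/**
 * Root mean squared error between two equally sized lists.
 *
 * @param list<float> $yTrue
 * @param list<float> $yPred
 */
function rmse(array $yTrue, array $yPred): float
{
    $sum = 0.0;
    foreach ($yTrue as $i => $actual) {
        $sum += ($actual - $yPred[$i]) ** 2;
    }
    return sqrt($sum / count($yTrue));
}

$x = [1.0, 2.0, 3.0, 4.0];
$y = [2.0, 4.0, 6.0, 8.0];

$line = fitLine($x, $y, learningRate: 0.05, iterations: 5000);
$pred = array_map(fn (float $xi): float => $line['w'] * $xi + $line['b'], $x);

printf("w=%.4f b=%.4f rmse=%.6f\n", $line['w'], $line['b'], rmse($y, $pred));
```

With unscaled features a learning rate that is too large makes the updates diverge, which is one reason to put a scaler in front of gradient-based models.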

3) Text Embedding (TF-IDF)

use ML\IDEA\NLP\TfidfVectorizer;

$docs = ['machine learning in php', 'php library for intelligence'];
$vectorizer = new TfidfVectorizer();
$matrix = $vectorizer->fitTransform($docs);
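
For readers new to TF-IDF, the plain-PHP sketch below shows one common way the weights are computed. It is a hypothetical illustration, not the library's TfidfVectorizer: the function name `tfidf`, whitespace tokenization, raw term counts, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are all assumptions, and the library's output format and weighting may differ.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch (not ml-idea's TfidfVectorizer).
 * Weight = raw term count * (ln((1 + N) / (1 + df)) + 1), no row normalization.
 *
 * @param list<string> $docs
 * @return list<array<string, float>> one sparse term => weight map per document
 */
function tfidf(array $docs): array
{
    // Lowercase and split on whitespace.
    $tokenized = array_map(
        fn (string $doc): array => preg_split('/\s+/', strtolower(trim($doc)), -1, PREG_SPLIT_NO_EMPTY),
        $docs
    );

    // Document frequency: in how many documents does each term occur?
    $df = [];
    foreach ($tokenized as $tokens) {
        foreach (array_unique($tokens) as $term) {
            $df[$term] = ($df[$term] ?? 0) + 1;
        }
    }

    $n = count($docs);
    $rows = [];
    foreach ($tokenized as $tokens) {
        $row = [];
        foreach (array_count_values($tokens) as $term => $count) {
            // Terms that appear in every document get the minimum IDF of 1.0.
            $idf = log((1 + $n) / (1 + $df[$term])) + 1.0;
            $row[$term] = $count * $idf;
        }
        $rows[] = $row;
    }

    return $rows;
}

$docs = ['machine learning in php', 'php library for intelligence'];
$rows = tfidf($docs);

foreach ($rows[0] as $term => $weight) {
    printf("%-10s %.3f\n", $term, $weight);
}
```

Storing only the non-zero terms per document is what makes the representation sparse: vocabulary terms absent from a document take no space in its row.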

Development

composer install
composer test
composer analyse

Examples

See runnable use-case scripts in examples/:

  • basic classification flow
  • CV + advanced metrics
  • probability calibration + threshold tuning
  • regression pipelines
  • text features + clustering
  • hyperparameter search
  • RAG local chain + vector-store examples
  • RAG DB loader example (SQLite/PDO)
  • Agent toolbox example (examples/agents) with local KB + weather + free API tools
  • Vision examples (palette extraction and content-risk heuristic demo)
  • Vision authenticity-risk example (AI-generated likelihood heuristic)
  • NLP Text API + POS example (examples/16_nlp_text_api_and_pos.php)
  • NLP BM25 + similarity example (examples/17_nlp_bm25_and_similarity.php)
  • NLP multilingual POS + NER example (examples/18_nlp_multilingual_ner.php)
  • NLP extensibility example (examples/19_nlp_extensibility_custom_profiles.php)
  • NLP trainable POS/NER pipeline example (examples/20_nlp_trainable_pos_ner.php)
  • ML competitiveness demo (examples/33_ml_competitiveness.php) — trees, pipelines, search, calibration
  • Tier-2 ML/RAG demo (examples/34_tier2_ml_rag.php) — sparse TF-IDF, KMeans/DBSCAN, multiclass calibration, ANN
  • Vision ML classifier demo (examples/35_vision_ml_classifier.php) — forensics features + trainable authenticity model
  • Vision eval demo (examples/36_vision_eval_demo.php) — ROC-AUC/PR-AUC on labeled fixtures
  • Vision/RAG frontier hooks (examples/37_vision_rag_frontier_hooks.php) — neural backends, HF embedder, vec0 factory
  • Image similarity RAG (examples/38_image_similarity_rag.php) — forensics embeddings + ANN search
  • Production embedder/index (examples/39_production_embedder_and_vision_index.php) — EmbedderFactory, VisionIndexer, directory scan

Roadmap

Priority 0 — Competitiveness Foundations

  • Performance + benchmarking first: reproducible benchmark suite, memory/latency tracking, and publishable baseline reports.
  • Production inference contract: versioned model bundles, input/output schema validation, safer deserialization, and deterministic fallback behavior.
  • Reproducibility workflow: run metadata capture (params, seed, metrics, artifacts) and standardized experiment summaries.

Priority 1 — Algorithm & Feature Coverage

  • More algorithms (tree-based models, multiclass linear models, stronger ensemble baselines).
  • Feature preprocessing (normalization, encoding, imputation, outlier-aware transforms, rare-category handling).
  • Time-series ML support beyond splitting (lag/rolling feature generators and leakage-safe pipeline patterns).
  • Cross-validation utilities expansion (task-specific helpers and richer evaluation modes).

Priority 2 — Interoperability & Ecosystem

  • Model interoperability bridges (portable formats/import-export adapters) for cross-runtime workflows.
  • Dataset loaders and richer benchmarking tools.
  • Documentation expansion: task-first recipes, performance tuning guides, and production deployment playbooks.

Priority 3 — RAG/Agent Production Hardening

  • Context and chat history handling for the Tool Routing Agent. (v1.5: AgentContextManager)
  • Tool reliability layer for agents (timeouts, retries, fallbacks, structured errors). (v1.5: ToolReliabilityPolicy, RetryableToolInterface)
  • Policy and safety guardrails (tool allow/deny rules, injection checks, PII-safe logs).
  • Improved routing quality (confidence scoring, clarification turn, top-k tool candidates). (v1.5: decision confidence field)
  • Observability + evaluation harness for routing/tool accuracy regressions. (v1.5: AgentEvalHarness)
  • Streaming agent runs and human-in-the-loop approval gates. (v1.6: chatStream(), resumeWithApproval())
  • Session auto-persist for multi-turn agents. (v1.6: chatInSession(), AnthropicToolRoutingModel, Ollama native tools)
  • MCP remote tools + pluggable session stores (file default, Redis optional). (v1.7: McpToolProvider, AgentStateStoreFactory)
  • Multi-agent supervisor handoffs to specialist agents. (v1.7: AgentHandoffRegistry, handoff decision)
  • OpenTelemetry-style tracing for agent runs. (v1.7: AgentTracerInterface, OpenTelemetryAgentTracer)
  • Laravel bridge package with config, facade, and eval Artisan command. (v1.7: brucetruth/ml-idea-laravel)
  • AI admin example: standalone demos in examples/ai-admin/, Laravel copy-paste demo in packages/laravel/examples/ai-admin/
  • Memory strategy beyond raw history (summaries, pruning, retrieval-based recall).
  • Cost/latency controls (model tiering, caching, token budgets).
  • Human-in-the-loop controls for risky actions and execution approvals.
  • Output quality controls (schema validation, grounding/citation checks, consistency pass).
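
The tool reliability item in the list above (retries, fallbacks, structured errors) can be pictured with a small plain-PHP sketch. Everything here is hypothetical: `callWithRetry` and the result shape are invented for illustration and are unrelated to the actual ToolReliabilityPolicy or RetryableToolInterface APIs. Timeouts are left out because they depend on the transport a tool uses.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch (not ml-idea's API): retry a tool call, fall back when
 * every attempt fails, and always return a structured result.
 *
 * @param callable(): mixed $tool
 * @param callable(): mixed $fallback
 * @return array{ok: bool, value: mixed, attempts: int, errors: list<string>}
 */
function callWithRetry(callable $tool, callable $fallback, int $maxAttempts = 3): array
{
    $errors = [];
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            return ['ok' => true, 'value' => $tool(), 'attempts' => $attempt, 'errors' => $errors];
        } catch (Throwable $e) {
            // Record the failure instead of letting it escape into the agent loop.
            $errors[] = $e->getMessage();
        }
    }

    // All attempts failed: report ok=false but still hand back a usable value.
    return ['ok' => false, 'value' => $fallback(), 'attempts' => $maxAttempts, 'errors' => $errors];
}

// A stand-in tool that fails twice before succeeding.
$calls = 0;
$flakyWeather = function () use (&$calls): string {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException("upstream error on call {$calls}");
    }
    return 'sunny';
};

$result = callWithRetry($flakyWeather, fn (): string => 'unknown');
printf(
    "ok=%s value=%s attempts=%d errors=%d\n",
    $result['ok'] ? 'true' : 'false',
    $result['value'],
    $result['attempts'],
    count($result['errors'])
);
```

Returning the collected error messages alongside the value lets a routing layer log what went wrong without having to catch exceptions itself.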

Strategic Positioning

  • Position ml-idea as: the production PHP AI runtime combining classical ML + NLP + RAG + tool-using agents with strict typing and testability.