survos/data-bundle

Shared data directory conventions and path utilities for dataset-driven apps (APP_DATA_DIR).

github.com/survos/data-bundle

Type: symfony-bundle

pkg:composer/survos/data-bundle

2.2.5 2026-05-14 02:43 UTC

survos/data-bundle centralizes dataset filesystem conventions for dataset-driven Symfony applications.

Despite the historical name, this bundle is not the owner of shared semantic metadata contracts. It manages where dataset files, provider metadata, Pixie databases, run artifacts, cache files, and related JSONL outputs live.

For shared vocabulary and typed metadata contracts, use survos/data-contracts.

Scope

This bundle provides:

  • DataPaths: root-level path resolution under APP_DATA_DIR
  • DatasetPaths: dataset-scoped path helpers
  • loading dataset metadata and ensuring it exists
  • DatasetInfo / Provider registry entities
  • provider snapshot encoding
  • dataset context helpers for console/import workflows
  • commands for browsing, diagnosing, and resolving dataset paths

This bundle does not provide:

  • Dublin Core vocabulary constants
  • collection-object DTO contracts
  • metadata claim storage
  • AI workflow execution
  • media upload, IIIF, or mediary publishing
  • import/normalize/profile logic

Relationship to Other Packages

  • survos/data-contracts: shared metadata vocabulary and DTO contracts.
  • survos/data-bundle: dataset paths, provider storage, and dataset registry.
  • survos/import-bundle: import/convert workflows that may ask this bundle for dataset paths.
  • survos/ai-workflow-bundle: task execution in apps that own subject context.
  • claims bundle: tracked metadata assertions with provenance and confidence.
  • survos/media-bundle: media identity and mediary publishing.

The dependency direction should stay honest: packages should require survos/data-contracts directly when they only need DcTerms, ContentType, or metadata DTOs. Do not require this bundle just to get vocabulary classes.

Core Idea

All dataset work lives under a single root directory:

APP_DATA_DIR=/absolute/path/to/data/root

The bundle avoids repository-relative paths and gives services and commands one place to ask for canonical locations.

Example layout:

$APP_DATA_DIR/
  work/
    <datasetKey>/
      00_meta/
        dataset.json
      10_extract/
        obj.jsonl
      20_normalize/
        obj.jsonl
      21_profile/
        obj.profile.json
      30_terms/
        *.jsonl
  pixie/
    tenants/
      <tenant>.db
    template/
    exports/
  runs/
  cache/
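
To make the layout concrete, here is a minimal sketch of how a stage directory could be derived from the root and a dataset key. The function sketchStageDir() and the SKETCH_STAGE_DIRS map are hypothetical and are not part of the bundle. The sketch also assumes the dataset key is used verbatim as a relative path under work/, which this README does not confirm. In real code, ask DataPaths instead.

```php
<?php

// Hypothetical stand-in for illustration only. The bundle's DataPaths service
// is the source of truth for these locations.
const SKETCH_STAGE_DIRS = [
    'meta'      => '00_meta',
    'extract'   => '10_extract',
    'normalize' => '20_normalize',
    'profile'   => '21_profile',
    'terms'     => '30_terms',
];

/**
 * Build "<root>/work/<datasetKey>/<stageDir>" following the example layout.
 * Assumption: the dataset key maps directly to a relative path.
 */
function sketchStageDir(string $root, string $datasetKey, string $stage): string
{
    if (!array_key_exists($stage, SKETCH_STAGE_DIRS)) {
        throw new InvalidArgumentException("Unknown stage: $stage");
    }

    return rtrim($root, '/') . '/work/' . trim($datasetKey, '/') . '/' . SKETCH_STAGE_DIRS[$stage];
}

echo sketchStageDir('/data', 'dc/tb09jw350', 'normalize'), PHP_EOL;
// prints "/data/work/dc/tb09jw350/20_normalize"
```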

Installation

composer require survos/data-bundle

Set the root directory:

export APP_DATA_DIR=/absolute/path/to/data/root
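
This README does not say whether the bundle validates the variable, so an application may want its own early check. The following is a sketch only: sketchResolveDataRoot() is a hypothetical helper, and it assumes POSIX-style absolute paths (a leading slash).

```php
<?php

/**
 * Hypothetical helper: return the configured data root or fail loudly.
 * Takes the raw value so it can be exercised without touching the environment.
 */
function sketchResolveDataRoot(string|false $value): string
{
    if ($value === false || $value === '') {
        throw new RuntimeException('APP_DATA_DIR is not set.');
    }
    // Assumption: POSIX paths only; Windows drive letters are not handled here.
    if (!str_starts_with($value, '/')) {
        throw new RuntimeException("APP_DATA_DIR must be an absolute path, got: $value");
    }

    // Drop trailing slashes, but keep "/" itself intact.
    return rtrim($value, '/') ?: '/';
}

// Usage: $root = sketchResolveDataRoot(getenv('APP_DATA_DIR'));
```

In a Symfony application the value may come from a .env file rather than the shell, in which case it may be visible in $_ENV or $_SERVER and not through getenv().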

Usage

Inject DataPaths for root and dataset path resolution:

use Survos\DataBundle\Service\DataPaths;

final class SomeService
{
    public function __construct(
        private readonly DataPaths $paths,
    ) {
    }
}

Common dataset paths:

$paths->datasetDir('dc/tb09jw350');
$paths->extractDir('dc/tb09jw350');
$paths->extractFile('dc/tb09jw350');
$paths->normalizeDir('dc/tb09jw350');
$paths->normalizeFile('dc/tb09jw350');
$paths->profileDir('dc/tb09jw350');
$paths->profileFile('dc/tb09jw350');
$paths->termsDir('dc/tb09jw350');

Pixie paths:

$paths->pixieTenantDb('larco');

Operational directories:

$paths->runsDir;
$paths->cacheDir;

Commands

Current command names retain the historical data:* prefix:

bin/console data:path dc/tb09jw350 20_normalize
bin/console data:head dc/tb09jw350 20_normalize --limit=5
bin/console data:diag dc/tb09jw350
bin/console data:browse
bin/console data:scan-datasets

These may eventually move to dataset:* aliases when the bundle is renamed.
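
For readers unfamiliar with JSONL stage files, the sketch below shows one way to read the first few records of such a file, which is roughly what a command like data:head with --limit suggests. It is not the bundle's implementation: sketchJsonlHead() is a hypothetical function, and the one-JSON-document-per-line format is assumed from the .jsonl extension.

```php
<?php

/**
 * Hypothetical sketch: decode up to $limit records from the start of a JSONL file.
 *
 * @return list<mixed> decoded records, in file order
 */
function sketchJsonlHead(string $file, int $limit = 5): array
{
    $handle = fopen($file, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Could not open $file");
    }

    $records = [];
    try {
        // Read line by line so large files are never loaded into memory at once.
        while (count($records) < $limit && ($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // skip blank lines
            }
            // Throws JsonException on a malformed line instead of returning null.
            $records[] = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        }
    } finally {
        fclose($handle);
    }

    return $records;
}
```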

Directory Creation

Ensure global roots exist:

$paths->ensureRootDirs();

Ensure standard dataset stage directories exist:

$paths->ensureDatasetDirs('dc/tb09jw350');
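
As an illustration of what "ensuring" stage directories can mean, here is a minimal, idempotent sketch. sketchEnsureDatasetDirs() is hypothetical. The stage names come from the example layout, and the directory mode is an arbitrary choice for the sketch. The bundle's own ensureDatasetDirs() may create a different set or use different permissions.

```php
<?php

/**
 * Hypothetical sketch: create the stage directories from the example layout
 * under "<root>/work/<datasetKey>/". Safe to call repeatedly.
 *
 * @return list<string> the directories that exist after the call
 */
function sketchEnsureDatasetDirs(string $root, string $datasetKey): array
{
    $stages = ['00_meta', '10_extract', '20_normalize', '21_profile', '30_terms'];
    $dirs = [];

    foreach ($stages as $stage) {
        $dir = rtrim($root, '/') . '/work/' . trim($datasetKey, '/') . '/' . $stage;
        // The second is_dir() covers a concurrent process creating the directory first.
        if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
            throw new RuntimeException("Could not create directory: $dir");
        }
        $dirs[] = $dir;
    }

    return $dirs;
}
```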

Atomic File Writes

For small metadata files:

$paths->atomicWrite($path, $contents);

The write goes to a temporary file in the same directory, which is then renamed over the target. Because the rename is atomic, readers never see a partially written file.
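
The temp-file-plus-rename pattern can be sketched in plain PHP as follows. This is not the bundle's code: sketchAtomicWrite() is a hypothetical function that shows the general technique, and it assumes the target directory already exists.

```php
<?php

/**
 * Hypothetical sketch of a temp-file-plus-rename write.
 */
function sketchAtomicWrite(string $path, string $contents): void
{
    $dir = dirname($path);
    if (!is_dir($dir)) {
        // tempnam() would silently fall back to the system temp directory.
        throw new RuntimeException("Directory does not exist: $dir");
    }

    // Same directory as the target, so rename() stays on one filesystem.
    $tmp = tempnam($dir, '.tmp-');
    if ($tmp === false) {
        throw new RuntimeException("Could not create a temporary file in $dir");
    }

    try {
        if (file_put_contents($tmp, $contents) !== strlen($contents)) {
            throw new RuntimeException("Incomplete write to $tmp");
        }
        if (!rename($tmp, $path)) {
            throw new RuntimeException("Could not rename $tmp to $path");
        }
    } catch (Throwable $e) {
        // Clean up the temporary file if it is still there, then rethrow.
        @unlink($tmp);
        throw $e;
    }
}
```

Note that tempnam() creates the file with restrictive permissions (0600), so a real implementation may want to chmod() the file before the rename if other users need to read it.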

Design Principles

  • Dataset path conventions are centralized.
  • Paths are semantic: callers ask for a named location instead of assembling path strings by hand.
  • Dataset/provider storage concerns stay separate from semantic metadata contracts.
  • Import, AI workflow, claims, and media publishing remain in their own packages.
  • The bundle should stay boring and infrastructure-focused.

Future Rename

The better long-term name is survos/dataset-bundle. See docs/rename-to-dataset-bundle.md.