Manifest & Reproducibility¶
MetDataPy provides a comprehensive manifest system for tracking data processing pipelines and ensuring reproducibility.
Overview¶
A manifest is a JSON file that captures:
- Dataset information: Source, dimensions, time range, missing values
- Pipeline steps: All transformations with parameters and timestamps
- Features: Original, derived, lag, and calendar features
- Quality control: QC summary and flag counts
- ML preparation: Scaler parameters and train/val/test split boundaries
- Pipeline hash: Deterministic hash for reproducibility verification
Creating Manifests¶
Using ManifestBuilder¶
The ManifestBuilder class allows incremental manifest construction during pipeline execution:
from metdatapy.manifest import ManifestBuilder, ScalerParamsModel, SplitBoundaries
import pandas as pd
# Initialize builder
builder = ManifestBuilder(source="weather_data.csv")
# Set dataset information
df = pd.read_parquet("processed_data.parquet")
builder.set_dataset_info(df, frequency="1H")
# Add pipeline steps
builder.add_step("load", "WeatherSet.from_csv", {"path": "weather_data.csv"})
builder.add_step("normalize", "WeatherSet.normalize_units", {"mapping": "mapping.yml"})
builder.add_step("qc", "WeatherSet.qc_range", {})
builder.add_step("resample", "WeatherSet.resample", {"rule": "1H"})
# Set QC report
builder.set_qc_report(df)
# Set derived features
builder.set_derived_features(["dew_point_c", "vpd_kpa", "heat_index_c"])
# Set scaler parameters
scaler = ScalerParamsModel(
    method="standard",
    columns=["temp_c", "rh_pct", "pres_hpa"],
    parameters={
        "temp_c": {"mean": 15.5, "scale": 8.2},
        "rh_pct": {"mean": 65.0, "scale": 15.3},
        "pres_hpa": {"mean": 1013.0, "scale": 10.5},
    }
)
builder.set_scaler(scaler)
# Set split boundaries
split = SplitBoundaries(
    train_start="2024-01-01T00:00:00Z",
    train_end="2024-09-30T23:59:59Z",
    val_start="2024-10-01T00:00:00Z",
    val_end="2024-10-31T23:59:59Z",
    test_start="2024-11-01T00:00:00Z",
    test_end="2024-12-31T23:59:59Z",
)
builder.set_split(split)
# Add custom metadata
builder.add_metadata("project", "Weather Forecasting")
builder.add_metadata("author", "Data Science Team")
# Build and save manifest
manifest = builder.build()
manifest.to_json("manifest.json")
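The scaler values in the example above are hard-coded. The following is a minimal sketch, using only the standard library, of how such a `parameters` dictionary might be derived for a standard (z-score) scaler. `standard_scaler_params` is a hypothetical helper, not part of MetDataPy, and the toy values stand in for the training rows of a real dataset.

```python
from statistics import fmean, pstdev


def standard_scaler_params(train_columns: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Compute {"mean", "scale"} per column from training values only.

    Hypothetical helper for illustration; not part of MetDataPy.
    """
    params = {}
    for name, values in train_columns.items():
        # pstdev is the population standard deviation (ddof=0), the same
        # convention scikit-learn's StandardScaler uses for its scale.
        params[name] = {"mean": fmean(values), "scale": pstdev(values)}
    return params


# Toy training values; in practice these would come from the training split
train_columns = {
    "temp_c": [10.0, 12.0, 14.0],
    "rh_pct": [60.0, 70.0, 80.0],
}
scaler_parameters = standard_scaler_params(train_columns)
print(scaler_parameters)
```

Computing the statistics from the training rows only keeps information from the validation and test periods out of the scaler.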
Manifest Structure¶
Complete Example¶
{
  "version": "1.0",
  "metdatapy_version": "1.3.0",
  "created_at": "2025-10-25T10:30:00Z",
  "pipeline_hash": "a1b2c3d4e5f6g7h8",
  "dataset": {
    "source": "weather_data.csv",
    "rows": 8761,
    "columns": ["temp_c", "rh_pct", "pres_hpa", "wspd_ms", "wdir_deg"],
    "start_time": "2024-01-01T00:00:00Z",
    "end_time": "2024-12-31T23:00:00Z",
    "frequency": "1H",
    "missing_values": {
      "temp_c": 12,
      "wspd_ms": 5
    }
  },
  "pipeline_steps": [
    {
      "step": "load",
      "function": "WeatherSet.from_csv",
      "parameters": {"path": "weather_data.csv"},
      "timestamp": "2025-10-25T10:25:00Z",
      "duration_seconds": 2.5
    },
    {
      "step": "qc",
      "function": "WeatherSet.qc_range",
      "parameters": {},
      "timestamp": "2025-10-25T10:25:05Z",
      "duration_seconds": 0.8
    }
  ],
  "features": {
    "original_features": ["temp_c", "rh_pct", "pres_hpa"],
    "derived_features": ["dew_point_c", "vpd_kpa"],
    "lag_features": ["temp_c_lag1", "temp_c_lag2"],
    "calendar_features": ["hour", "weekday", "month"],
    "target_features": ["temp_c_t+1", "temp_c_t+3"]
  },
  "qc_report": {
    "total_flags": 145,
    "flagged_percentage": 1.65,
    "flags_by_type": {
      "qc_temp_c_range": 12,
      "qc_temp_c_spike": 8,
      "qc_rh_pct_flatline": 125
    }
  },
  "scaler": {
    "method": "standard",
    "columns": ["temp_c", "rh_pct"],
    "parameters": {
      "temp_c": {"mean": 15.5, "scale": 8.2},
      "rh_pct": {"mean": 65.0, "scale": 15.3}
    }
  },
  "split": {
    "train_start": "2024-01-01T00:00:00Z",
    "train_end": "2024-09-30T23:59:59Z",
    "val_start": "2024-10-01T00:00:00Z",
    "val_end": "2024-10-31T23:59:59Z",
    "test_start": "2024-11-01T00:00:00Z",
    "test_end": "2024-12-31T23:59:59Z"
  },
  "metadata": {
    "project": "Weather Forecasting",
    "author": "Data Science Team"
  }
}
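Because the manifest is plain JSON, it can also be inspected without MetDataPy. The sketch below loads an abbreviated copy of the example and runs a few internal consistency checks. The three invariants it tests are assumptions suggested by the example, not rules stated by the library, and `check_manifest_consistency` is a hypothetical helper.

```python
import json


def check_manifest_consistency(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []

    qc = manifest.get("qc_report")
    if qc is not None:
        # Assumed invariant: per-type flag counts add up to the total.
        if sum(qc["flags_by_type"].values()) != qc["total_flags"]:
            problems.append("qc_report: flags_by_type does not sum to total_flags")

    scaler = manifest.get("scaler")
    if scaler is not None:
        # Assumed invariant: every scaled column has stored parameters.
        for col in scaler["columns"]:
            if col not in scaler["parameters"]:
                problems.append(f"scaler: no parameters for column {col!r}")

    dataset = manifest.get("dataset", {})
    # Assumed invariant: missing-value counts refer to known columns.
    for col in dataset.get("missing_values", {}):
        if col not in dataset.get("columns", []):
            problems.append(f"dataset: missing_values lists unknown column {col!r}")

    return problems


# Abbreviated copy of the example manifest
manifest = json.loads("""
{
  "dataset": {
    "columns": ["temp_c", "rh_pct", "pres_hpa", "wspd_ms", "wdir_deg"],
    "missing_values": {"temp_c": 12, "wspd_ms": 5}
  },
  "qc_report": {
    "total_flags": 145,
    "flags_by_type": {
      "qc_temp_c_range": 12,
      "qc_temp_c_spike": 8,
      "qc_rh_pct_flatline": 125
    }
  },
  "scaler": {
    "columns": ["temp_c", "rh_pct"],
    "parameters": {
      "temp_c": {"mean": 15.5, "scale": 8.2},
      "rh_pct": {"mean": 65.0, "scale": 15.3}
    }
  }
}
""")

problems = check_manifest_consistency(manifest)
print(problems or "no inconsistencies found")
```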
CLI Commands¶
Validate Manifest¶
Check if a manifest file is valid:
mdp manifest validate manifest.json
# With verbose output
mdp manifest validate manifest.json --verbose
Output:
Validating manifest: manifest.json
✓ Manifest is valid
Manifest Details:
Version: 1.0
MetDataPy Version: 1.3.0
Pipeline Steps: 4
Pipeline Hash: a1b2c3d4e5f6g7h8
Has QC Report: True
Has Scaler: True
Has Split: True
Show Manifest¶
Display manifest contents in different formats:
# Summary view (default)
mdp manifest show manifest.json
# JSON format
mdp manifest show manifest.json --format json
# YAML format
mdp manifest show manifest.json --format yaml
Summary Output:
Manifest Summary
============================================================
Created: 2025-10-25T10:30:00Z
MetDataPy Version: 1.3.0
Pipeline Hash: a1b2c3d4e5f6g7h8
Dataset:
Source: weather_data.csv
Rows: 8,761
Columns: 5
Time Range: 2024-01-01T00:00:00Z to 2024-12-31T23:00:00Z
Frequency: 1H
Pipeline Steps (4):
1. WeatherSet.from_csv (2.50s)
2. WeatherSet.normalize_units (0.15s)
3. WeatherSet.qc_range (0.80s)
4. WeatherSet.resample (1.20s)
Features:
Original: 5
Derived: 2
Lag: 2
Calendar: 3
Target: 2
Quality Control:
Total Flags: 145
Flagged: 1.65%
By Type:
qc_rh_pct_flatline: 125
qc_temp_c_range: 12
qc_temp_c_spike: 8
Scaler:
Method: standard
Columns: 2
Split Boundaries:
Train: 2024-01-01T00:00:00Z to 2024-09-30T23:59:59Z
Val: 2024-10-01T00:00:00Z to 2024-10-31T23:59:59Z
Test: 2024-11-01T00:00:00Z to 2024-12-31T23:59:59Z
Compare Manifests¶
Compare two manifests for reproducibility:
mdp manifest compare manifest1.json manifest2.json
Output:
Comparing Manifests
============================================================
Manifest 1: manifest1.json
Manifest 2: manifest2.json
✓ Same Pipeline
✓ Same Version
✓ Same Features
✓ Same Scaler
✓ Manifests are compatible for reproducibility
Pipeline Hash¶
The pipeline hash is a deterministic SHA-256 hash of:
- All pipeline steps (function names and parameters)
- Feature engineering configuration
Purpose:
- Verify that two datasets were processed identically
- Detect pipeline drift over time
- Ensure reproducibility in production
Example:
from metdatapy.manifest import Manifest
manifest1 = Manifest.from_json("run1/manifest.json")
manifest2 = Manifest.from_json("run2/manifest.json")
if manifest1.pipeline_hash == manifest2.pipeline_hash:
    print("Identical processing pipelines")
else:
    print("Different pipelines - results may not be comparable")
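To illustrate what "deterministic" means here, the sketch below hashes function names, parameters, and feature configuration through a canonical JSON string. It is an assumption about how such a hash could be built, not MetDataPy's actual serialization, so `sketch_pipeline_hash` (a hypothetical name) will not reproduce the values the library writes.

```python
import hashlib
import json


def sketch_pipeline_hash(steps: list[dict], features: dict) -> str:
    """Hash function names, parameters, and feature config; ignore run-time fields."""
    payload = {
        "steps": [
            {"function": s["function"], "parameters": s["parameters"]} for s in steps
        ],
        "features": features,
    }
    # sort_keys and fixed separators give one canonical string per payload,
    # so dictionary insertion order cannot change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


features = {"derived_features": ["dew_point_c", "vpd_kpa"]}
run1 = [{"function": "WeatherSet.resample", "parameters": {"rule": "1H"},
         "timestamp": "2025-01-15T10:25:00Z"}]
run2 = [{"function": "WeatherSet.resample", "parameters": {"rule": "1H"},
         "timestamp": "2025-02-01T08:00:00Z"}]  # same step, later run
run3 = [{"function": "WeatherSet.resample", "parameters": {"rule": "30min"},
         "timestamp": "2025-02-01T08:00:00Z"}]  # different parameter

hash1, hash2, hash3 = (sketch_pipeline_hash(r, features) for r in (run1, run2, run3))
print(hash1 == hash2)  # True: timestamps are not part of the hashed payload
print(hash1 == hash3)  # False: the resample rule changed
```

Leaving timestamps and durations out of the payload is what lets two runs on different days produce the same hash.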
Reproducibility Validation¶
Compare two manifests to verify reproducibility:
from metdatapy.manifest import Manifest
m1 = Manifest.from_json("experiment1/manifest.json")
m2 = Manifest.from_json("experiment2/manifest.json")
results = m1.validate_reproducibility(m2)
print(f"Same pipeline: {results['same_pipeline']}")
print(f"Same version: {results['same_version']}")
print(f"Same features: {results['same_features']}")
print(f"Same scaler: {results['same_scaler']}")
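In an automated pipeline it can be more useful to fail fast than to print each flag. The sketch below assumes the result is a dictionary of boolean flags, as the keys printed above suggest. `require_reproducible` is a hypothetical helper, and the literal dictionary stands in for the return value of `validate_reproducibility`.

```python
def require_reproducible(results: dict[str, bool]) -> None:
    """Raise if any reproducibility check failed (hypothetical helper)."""
    failed = [name for name, ok in results.items() if not ok]
    if failed:
        raise RuntimeError("Reproducibility checks failed: " + ", ".join(failed))


# Stand-in for the dictionary returned by m1.validate_reproducibility(m2)
results = {
    "same_pipeline": True,
    "same_version": True,
    "same_features": True,
    "same_scaler": False,
}

try:
    require_reproducible(results)
except RuntimeError as exc:
    message = str(exc)
    print(message)
```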
Best Practices¶
1. Always Create Manifests¶
Generate a manifest for every processed dataset:
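# At the end of your pipeline
from datetime import datetime, timezone
timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
manifest = builder.build()
manifest.to_json(f"output/manifest_{timestamp}.json")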
2. Version Control Manifests¶
Store manifests alongside processed data:
project/
├── data/
│   ├── processed/
│   │   ├── train.parquet
│   │   ├── val.parquet
│   │   ├── test.parquet
│   │   └── manifest.json  # ← Store here
3. Validate Before Training¶
Check manifest validity before ML training:
from metdatapy.manifest import validate_manifest
results = validate_manifest("data/processed/manifest.json")
if not results["valid"]:
    raise ValueError(f"Invalid manifest: {results['errors']}")
4. Compare Across Runs¶
Track pipeline changes over time:
from metdatapy.manifest import Manifest
# Compare production vs. development
prod_manifest = Manifest.from_json("prod/manifest.json")
dev_manifest = Manifest.from_json("dev/manifest.json")
if prod_manifest.pipeline_hash != dev_manifest.pipeline_hash:
    print("⚠️ Pipeline has changed - review differences")
    results = prod_manifest.validate_reproducibility(dev_manifest)
    print(results)
5. Include Metadata¶
Add context-specific information:
builder.add_metadata("experiment_id", "exp_2024_01")
builder.add_metadata("git_commit", "a1b2c3d")
builder.add_metadata("environment", "production")
builder.add_metadata("notes", "Added new QC thresholds")
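Values such as the commit hash are easier to keep accurate when collected automatically instead of typed by hand. The sketch below is one way to do that with the standard library; `current_git_commit` is a hypothetical helper, and the fallback value `"unknown"` is an arbitrary choice.

```python
import subprocess


def current_git_commit(default: str = "unknown") -> str:
    """Return the short hash of HEAD, or `default` outside a git checkout."""
    try:
        completed = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or the working directory is not a repository
        return default
    return completed.stdout.strip() or default


git_commit = current_git_commit()
print(git_commit)
# The value could then be recorded with builder.add_metadata("git_commit", git_commit)
```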
Integration with ML Workflows¶
Scikit-learn Pipeline¶
import joblib
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from metdatapy.manifest import ManifestBuilder
# Build manifest during data preparation
builder = ManifestBuilder(source="weather.csv")
# ... add steps ...
manifest = builder.build()
manifest.to_json("models/manifest.json")
# Train model (X_train and y_train come from your prepared training split)
model = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestRegressor())
])
model.fit(X_train, y_train)
# Save model with manifest reference
joblib.dump(model, "models/model.pkl")
# Manifest is already saved alongside
MLflow Integration¶
import mlflow
with mlflow.start_run():
    # Log manifest as artifact
    manifest.to_json("manifest.json")
    mlflow.log_artifact("manifest.json")
    # Log key metrics from manifest
    mlflow.log_param("pipeline_hash", manifest.pipeline_hash)
    mlflow.log_param("dataset_rows", manifest.dataset.rows)
    if manifest.qc_report:
        mlflow.log_metric("qc_flagged_pct", manifest.qc_report.flagged_percentage)
Troubleshooting¶
Manifest Validation Fails¶
Issue: validate_manifest() returns errors
Solutions:
1. Check JSON syntax: python -m json.tool manifest.json
2. Verify required fields are present
3. Ensure timestamps are ISO 8601 format
4. Check that pipeline_hash is computed
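The four checks above can be approximated with the standard library before calling `validate_manifest()`. This is a sketch under stated assumptions: the list of required fields is a guess based on the example manifest, not the library's schema, and `precheck_manifest` is a hypothetical helper.

```python
import json
from datetime import datetime

# Assumed set of required top-level fields; the real schema may differ.
REQUIRED_FIELDS = ("version", "created_at", "pipeline_hash", "pipeline_steps")


def is_iso8601(value) -> bool:
    """Accept ISO 8601 strings such as 2024-01-01T00:00:00Z."""
    if not isinstance(value, str):
        return False
    try:
        # fromisoformat() only accepts a trailing "Z" from Python 3.11 on
        datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return False
    return True


def precheck_manifest(text: str) -> list[str]:
    """Return a list of problems found in the raw manifest text."""
    try:
        manifest = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(manifest, dict):
        return ["top-level JSON value is not an object"]

    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest]
    if "created_at" in manifest and not is_iso8601(manifest["created_at"]):
        errors.append("created_at is not ISO 8601")
    for i, step in enumerate(manifest.get("pipeline_steps", [])):
        if not is_iso8601(step.get("timestamp")):
            errors.append(f"pipeline_steps[{i}].timestamp is not ISO 8601")
    if "pipeline_hash" in manifest and not manifest["pipeline_hash"]:
        errors.append("pipeline_hash is empty")
    return errors


good = json.dumps({
    "version": "1.0",
    "created_at": "2025-10-25T10:30:00Z",
    "pipeline_hash": "a1b2c3d4e5f6g7h8",
    "pipeline_steps": [
        {"function": "WeatherSet.from_csv", "timestamp": "2025-10-25T10:25:00Z"}
    ],
})
bad = good.replace("2025-10-25T10:25:00Z", "25/10/2025 10:25")

print(precheck_manifest(good))
print(precheck_manifest(bad))
```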
Pipeline Hash Mismatch¶
Issue: Two manifests with the same steps have different hashes
Cause: Parameter values differ (even slightly), or the feature engineering configuration differs, since both are part of the hash
Solution: Compare pipeline steps:
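from metdatapy.manifest import Manifest
m1 = Manifest.from_json("manifest1.json")
m2 = Manifest.from_json("manifest2.json")
# zip() stops at the shorter list, so report a length mismatch first
if len(m1.pipeline_steps) != len(m2.pipeline_steps):
    print(f"Step count differs: {len(m1.pipeline_steps)} vs {len(m2.pipeline_steps)}")
for s1, s2 in zip(m1.pipeline_steps, m2.pipeline_steps):
    if s1.parameters != s2.parameters:
        print(f"Difference in {s1.function}:")
        print(f"  M1: {s1.parameters}")
        print(f"  M2: {s2.parameters}")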
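When parameter dictionaries are long, printing both in full makes a small difference hard to spot. The sketch below narrows the output to the keys that differ; `diff_parameters` is a hypothetical helper that works on plain dictionaries, and the example parameters are invented to show how floating-point arithmetic can produce a "slightly different" value.

```python
_MISSING = "<missing>"  # marker for a key present in only one dictionary


def diff_parameters(p1: dict, p2: dict) -> dict:
    """Map each differing key to its (first, second) values."""
    return {
        key: (p1.get(key, _MISSING), p2.get(key, _MISSING))
        for key in sorted(p1.keys() | p2.keys())
        if p1.get(key, _MISSING) != p2.get(key, _MISSING)
    }


# Hypothetical parameters of the same step in two runs
params_run1 = {"rule": "1H", "threshold": 0.1 + 0.2}
params_run2 = {"rule": "1H", "threshold": 0.3}

differences = diff_parameters(params_run1, params_run2)
print(differences)  # {'threshold': (0.30000000000000004, 0.3)}
```

Rounding computed parameters before recording them is one way to avoid this kind of mismatch.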
Large Manifest Files¶
Issue: Manifest JSON files are very large
Solutions:
1. Limit parameter detail in pipeline steps
2. Don't store full dataframes in metadata
3. Use gzip compression: gzip manifest.json
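Compression can also be done from Python, which avoids a separate shell step. The sketch below round-trips a manifest dictionary through a gzip-compressed JSON file using only the standard library. The helper names are hypothetical, and the source does not say whether `Manifest.from_json` reads `.gz` files directly, so decompressing before loading is the safer assumption.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_manifest_gz(manifest: dict, path: Path) -> None:
    """Write the manifest as gzip-compressed JSON text."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(manifest, fh)


def read_manifest_gz(path: Path) -> dict:
    """Read a gzip-compressed JSON manifest back into a dictionary."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


manifest = {"version": "1.0", "metadata": {"project": "Weather Forecasting"}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "manifest.json.gz"
    write_manifest_gz(manifest, path)
    restored = read_manifest_gz(path)

print(restored == manifest)  # True: the round trip preserves the content
```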