Lineage
Lineage records where each artifact came from and which transformations produced it, so that data provenance can be followed through the pipeline.
Why Lineage?
You cannot reason about outputs unless you can reason about inputs.
Lineage provides:
- Auditability
- Reproducibility
- Debugging
- Compliance
Lineage Message
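```protobuf
syntax = "proto3";

// Provenance record for a single artifact.
message Lineage {
  string id = 1;
  string artifact_id = 2;
  string artifact_type = 3;
  Source source = 4;                              // where the artifact originated
  repeated Transformation transformations = 5;    // steps that produced it
}

message Source {
  string source_type = 1;    // e.g. "SEC_EDGAR"
  string source_id = 2;
  string source_url = 3;
  string content_hash = 4;   // e.g. "sha256:a1b2c3"
}

message Transformation {
  string name = 1;
  string version = 2;
  repeated string input_ids = 3;   // ids of the inputs this step consumed
}
```

Example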
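As a starting point, here is a minimal Python sketch of how such a record might be assembled in application code. It is an illustration under stated assumptions, not the pipeline's actual implementation: the dataclasses simply mirror the messages defined above, while the `content_hash` helper, the `<algorithm>:<hex digest>` hash format, the `artifact_type` and `source_id` values, and the raw payload are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Source:
    source_type: str
    source_id: str = ""
    source_url: str = ""
    content_hash: str = ""


@dataclass
class Transformation:
    name: str
    version: str
    input_ids: list[str] = field(default_factory=list)


@dataclass
class Lineage:
    id: str
    artifact_id: str
    artifact_type: str
    source: Source
    transformations: list[Transformation] = field(default_factory=list)


def content_hash(raw: bytes) -> str:
    """Hash raw bytes into the assumed '<algorithm>:<hex digest>' form."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


# Hypothetical raw payload standing in for a fetched filing.
raw_filing = b"<ownershipDocument>...</ownershipDocument>"

record = Lineage(
    id="lineage:evt:123",
    artifact_id="sec:form4:abc",
    artifact_type="form4",  # assumed value, not taken from the original example
    source=Source(
        source_type="SEC_EDGAR",
        source_id="raw:sec:abc",  # assumed value
        content_hash=content_hash(raw_filing),
    ),
    transformations=[
        Transformation(
            name="parse_form4",
            version="1.0.0",
            input_ids=["raw:sec:abc"],
        )
    ],
)

# Serialize the record to JSON via a plain dict.
print(json.dumps(asdict(record), indent=2))
```

Hashing the raw bytes at ingestion time means a later change in the source content shows up as a different `content_hash`. A serialized lineage record looks like this: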
```json
{
  "id": "lineage:evt:123",
  "artifact_id": "sec:form4:abc",
  "source": {
    "source_type": "SEC_EDGAR",
    "source_url": "https://sec.gov/...",
    "content_hash": "sha256:a1b2c3"
  },
  "transformations": [
    {
      "name": "parse_form4",
      "version": "1.0.0",
      "input_ids": ["raw:sec:abc"]
    }
  ]
}
```

Use Cases
- Trace data back to source
- Identify affected outputs when sources change
- Prove data provenance for audits
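
The first two use cases amount to walking the lineage graph upstream and downstream. The following Python sketch shows one way to do that, assuming that each entry in `input_ids` is the `artifact_id` of an upstream artifact. The function names, the `report:insider:1` artifact, and the `aggregate` step are hypothetical, and records are reduced to the fields the traversal needs.

```python
from collections import defaultdict

# Records shaped like the Lineage message, trimmed to the relevant fields.
records = [
    {"artifact_id": "raw:sec:abc", "transformations": []},
    {
        "artifact_id": "sec:form4:abc",
        "transformations": [
            {"name": "parse_form4", "version": "1.0.0", "input_ids": ["raw:sec:abc"]}
        ],
    },
    {
        "artifact_id": "report:insider:1",  # hypothetical downstream artifact
        "transformations": [
            {"name": "aggregate", "version": "0.1.0", "input_ids": ["sec:form4:abc"]}
        ],
    },
]


def build_parents(records):
    """Map each artifact_id to the set of input ids its transformations consumed."""
    parents = defaultdict(set)
    for rec in records:
        for step in rec.get("transformations", []):
            parents[rec["artifact_id"]].update(step["input_ids"])
    return parents


def trace_to_sources(artifact_id, parents):
    """Walk upstream and return the ids that have no recorded inputs."""
    roots, stack, seen = set(), [artifact_id], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        inputs = parents.get(current)
        if inputs:
            stack.extend(inputs)
        else:
            roots.add(current)
    return roots


def affected_outputs(changed_id, parents):
    """Return every artifact that depends, directly or transitively, on changed_id."""
    children = defaultdict(set)
    for child, inputs in parents.items():
        for parent in inputs:
            children[parent].add(child)
    affected, stack = set(), [changed_id]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in affected:
                affected.add(child)
                stack.append(child)
    return affected


parents = build_parents(records)
print(sorted(trace_to_sources("report:insider:1", parents)))  # ['raw:sec:abc']
print(sorted(affected_outputs("raw:sec:abc", parents)))
# ['report:insider:1', 'sec:form4:abc']
```

For audits, the same upstream walk can be paired with the `content_hash` stored on each source to show that the bytes an output was derived from have not changed.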