How to Use Provenance Tracking¶

This guide covers OntExtract's PROV-O provenance tracking features.

Overview¶

OntExtract implements W3C PROV-O provenance tracking directly in the database schema. Every processing operation creates versioned outputs with corresponding provenance records, so each result can be traced back to the inputs, tools, and settings that produced it.

Why Provenance Matters¶

Reproducibility - Recreate exact processing conditions
Transparency - Understand how results were generated
Debugging - Trace unexpected results to their source
Scholarly citation - Document analytical methodology

PROV-O Concepts¶

OntExtract uses the PROV-O entity-activity-agent model:

| Concept | Description | Examples in OntExtract |
|---|---|---|
| Entity | Artifacts created or modified | Documents, text segments, extracted entities |
| Activity | Processes that generate entities | Segmentation, entity extraction, embedding generation |
| Agent | Actors responsible for activities | Users, NLP tools (spaCy 3.8.11), LLM orchestrator |

PROV-O Relationships¶

Four relationships enable workflow reconstruction:

| Relationship | Meaning | Example |
|---|---|---|
| `wasDerivedFrom` | Links artifacts to source documents | Text segment derived from uploaded PDF |
| `wasGeneratedBy` | Connects artifacts to generating processes | Entities generated by extraction activity |
| `used` | Records which entities were consumed | Segmentation used the original document |
| `wasAssociatedWith` | Maps operations to tool versions | Extraction associated with spaCy 3.8.11 |
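As a rough illustration of how these four relationships combine to reconstruct a workflow, the sketch below stores them as plain triples and walks them for one artifact. The `Relation` type, the `describe` helper, and all identifiers are hypothetical and are not OntExtract's actual schema or API.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    subject: str    # the entity or activity the statement is about
    predicate: str  # one of the four PROV-O relationship names
    obj: str        # the entity, activity, or agent it points to


# Hypothetical identifiers, for illustration only.
relations = [
    Relation("segment-1", "wasDerivedFrom", "upload.pdf"),
    Relation("segment-1", "wasGeneratedBy", "segmentation-run-1"),
    Relation("segmentation-run-1", "used", "upload.pdf"),
    Relation("segmentation-run-1", "wasAssociatedWith", "spaCy 3.8.11"),
]


def describe(entity_id, relations):
    """Collect the source, generating activity, inputs, and agents for an entity."""
    facts = {"source": None, "activity": None, "inputs": [], "agents": []}
    # First pass: statements made directly about the entity.
    for r in relations:
        if r.subject == entity_id and r.predicate == "wasDerivedFrom":
            facts["source"] = r.obj
        elif r.subject == entity_id and r.predicate == "wasGeneratedBy":
            facts["activity"] = r.obj
    # Second pass: statements about the activity that generated it.
    for r in relations:
        if r.subject != facts["activity"]:
            continue
        if r.predicate == "used":
            facts["inputs"].append(r.obj)
        elif r.predicate == "wasAssociatedWith":
            facts["agents"].append(r.obj)
    return facts


print(describe("segment-1", relations))
```

Two passes are used because `used` and `wasAssociatedWith` describe the activity rather than the artifact, so the activity has to be identified first.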

Provenance Timeline¶

Access the provenance timeline at Provenance > Timeline to view a chronological audit trail of all activities.

[Image: Provenance Timeline]

Filtering the Timeline¶

Filter provenance records by:

Experiment - Show only activities for a specific experiment
Document - Show activities related to a document family
Term - Filter by anchor term
Activity type - Filter by operation type
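The filters above can be thought of as optional conditions that must all hold for a record to appear. The following is a minimal sketch of that idea, assuming a hypothetical `TimelineRecord` shape and an oldest-first ordering; the real record fields and sort order are not described in this guide.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TimelineRecord:
    # Hypothetical fields; the real schema is not described in this guide.
    activity_type: str
    started_at: datetime
    experiment_id: Optional[int] = None
    document_family_id: Optional[int] = None
    term: Optional[str] = None


def filter_timeline(records, experiment_id=None, document_family_id=None,
                    term=None, activity_type=None):
    """Return records matching every filter that is set, oldest first."""
    wanted = {
        "experiment_id": experiment_id,
        "document_family_id": document_family_id,
        "term": term,
        "activity_type": activity_type,
    }
    # Filters left as None are ignored.
    active = {name: value for name, value in wanted.items() if value is not None}
    matches = [
        r for r in records
        if all(getattr(r, name) == value for name, value in active.items())
    ]
    return sorted(matches, key=lambda r: r.started_at)


records = [
    TimelineRecord("entity_extraction", datetime(2025, 1, 2),
                   experiment_id=1, document_family_id=10),
    TimelineRecord("document_upload", datetime(2025, 1, 1),
                   document_family_id=10),
    TimelineRecord("document_segmentation", datetime(2025, 1, 3),
                   experiment_id=2, document_family_id=11),
]

for record in filter_timeline(records, document_family_id=10):
    print(record.started_at.date(), record.activity_type)
```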

Activity Types¶

| Activity Type | Description |
|---|---|
| `document_upload` | Initial document upload |
| `text_extraction` | Text extracted from PDF/document |
| `document_segmentation` | Document split into segments |
| `embedding_generation` | Vector embeddings created |
| `entity_extraction` | Named entities identified |
| `temporal_extraction` | Dates and periods extracted |
| `definition_extraction` | Concept definitions located |
| `orchestration_run` | LLM orchestration workflow |
| `tool_execution` | Individual tool execution |

Document Versioning¶

OntExtract preserves original documents unchanged. Processing creates new document versions linked through provenance.

Version Types¶

| Type | Description |
|---|---|
| `original` | The initially uploaded document (v1) |
| `processed` | Result of processing operations |
| `experimental` | Created within an experiment context |
| `composite` | Merged or combined from multiple sources |

Document Selection in Experiments¶

When creating new experiments, only original (v1) documents appear in the selection dropdown. This ensures experiments start with clean source materials. To use a processed version, reference the original experiment that created it.

Document Selection in Provenance¶

When filtering the provenance timeline by document:

Only original documents appear in the dropdown
Selecting a document shows provenance for the entire document family (all versions)
A hint displays: "Showing provenance for X versions" when multiple versions exist

This design enables tracing complete processing history from a single selection.
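The selection behaviour described above could be sketched roughly as follows. The `DocumentVersion` fields, in particular `source_id` as the link back to the original, are assumptions made for illustration and may not match how OntExtract stores document families.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentVersion:
    # Hypothetical fields for illustration.
    id: int
    version_type: str                # "original", "processed", "experimental", "composite"
    source_id: Optional[int] = None  # id of the original this version descends from


def dropdown_choices(documents):
    """Only original documents are offered in the dropdown."""
    return [d for d in documents if d.version_type == "original"]


def document_family(original_id, documents):
    """The original plus every version that points back to it."""
    return [d for d in documents
            if d.id == original_id or d.source_id == original_id]


def version_hint(family):
    """Hint text, shown only when more than one version exists."""
    if len(family) > 1:
        return f"Showing provenance for {len(family)} versions"
    return None


documents = [
    DocumentVersion(1, "original"),
    DocumentVersion(2, "processed", source_id=1),
    DocumentVersion(3, "experimental", source_id=1),
    DocumentVersion(4, "original"),
]

family = document_family(1, documents)
print(version_hint(family))  # Showing provenance for 3 versions
```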

Processing Artifacts¶

Analysis results are stored as ProcessingArtifacts - separate database entities linked to source documents through PROV-O relationships. This maintains document integrity and enables applying multiple processing strategies to identical sources.

Artifact Contents¶

Each ProcessingArtifact includes:

Operation type - What processing was performed
Timestamps - When the operation occurred
Configuration parameters - Settings used
Results - Structured output data
Character positions - For text-based artifacts
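To make the listed contents concrete, here is a hedged sketch of what such a record might look like as a Python dataclass. The class name `ProcessingArtifactSketch` and every field name are assumptions; only the kinds of information stored come from the list above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProcessingArtifactSketch:
    # Field names are hypothetical; they mirror the list above.
    operation_type: str   # what processing was performed
    document_id: int      # source document the artifact is linked to
    config: dict          # settings used for the operation
    results: dict         # structured output data
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    start_char: Optional[int] = None  # only set for text-based artifacts
    end_char: Optional[int] = None

    def source_text(self, document_text):
        """Recover the span this artifact refers to, if it has positions."""
        if self.start_char is None or self.end_char is None:
            return None
        return document_text[self.start_char:self.end_char]


text = "Ontology was defined in 1993."
start = text.index("1993")

artifact = ProcessingArtifactSketch(
    operation_type="temporal_extraction",
    document_id=1,
    config={"method": "example"},  # placeholder configuration
    results={"value": "1993"},
    start_char=start,
    end_char=start + len("1993"),
)

print(artifact.source_text(text))  # 1993
```

Because the artifact stores character positions rather than a modified copy of the text, the source document itself stays untouched.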

Viewing Artifact Provenance¶

Provenance information is accessible through the Provenance > Timeline view:

Filter by document to see all processing history
Each activity shows its provenance chain: source document (wasDerivedFrom), generating activity (wasGeneratedBy), and tool and version (wasAssociatedWith)

Reproducibility Features¶

Deterministic Operations¶

Document processing operations (segmentation, extraction) produce identical outputs given:

Identical input documents
Same tool versions
Same configuration parameters
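One way to check whether two runs share all three conditions is to hash them together into a single fingerprint. This is an illustrative sketch only; `run_fingerprint` is a hypothetical helper, and nothing in this guide states that OntExtract computes such a hash.

```python
import hashlib
import json


def run_fingerprint(document_text, tool_versions, config):
    """Hash the three inputs that determine a deterministic operation's output."""
    payload = json.dumps(
        {
            "document_sha256": hashlib.sha256(
                document_text.encode("utf-8")
            ).hexdigest(),
            "tool_versions": tool_versions,
            "config": config,
        },
        sort_keys=True,  # key order must not affect the fingerprint
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values for illustration.
first = run_fingerprint(
    "Example document text.",
    {"spacy": "3.8.11"},
    {"method": "sentence", "max_length": 500},
)
second = run_fingerprint(
    "Example document text.",
    {"spacy": "3.8.11"},
    {"max_length": 500, "method": "sentence"},  # same settings, different order
)
upgraded = run_fingerprint(
    "Example document text.",
    {"spacy": "3.9.0"},  # hypothetical different tool version
    {"method": "sentence", "max_length": 500},
)

print(first == second)    # True: same inputs, versions, and parameters
print(first == upgraded)  # False: the tool version differs
```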

Non-Deterministic Operations¶

LLM orchestration recommendations can vary across runs because the underlying model is non-deterministic. To keep these runs auditable, the system records:

Complete decision context
Recommendations and confidence scores
Human review decisions
Execution parameters

Settings Capture¶

Experiments capture their complete configuration state at creation time, including:

Model selections (spaCy model, embedding model)
Processing method parameters
Output dimensions
Similarity thresholds
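Capturing configuration at creation time amounts to taking a snapshot that later changes to the live settings cannot affect. A minimal sketch of that idea, assuming a hypothetical `capture_settings` helper and placeholder setting names and values:

```python
import copy
from types import MappingProxyType


def capture_settings(live_settings):
    """Return a read-only deep copy of the settings as they are right now."""
    # deepcopy detaches the snapshot from later edits to live_settings.
    # MappingProxyType blocks top-level writes only; nested dicts inside
    # the snapshot would still be mutable.
    return MappingProxyType(copy.deepcopy(live_settings))


# Placeholder names and values, for illustration only.
live_settings = {
    "spacy_model": "en_core_web_sm",
    "embedding_model": "example-embedding-model",
    "output_dimensions": 384,
    "similarity_threshold": 0.75,
}

snapshot = capture_settings(live_settings)

# A later change to the live configuration does not alter the snapshot.
live_settings["similarity_threshold"] = 0.9
print(snapshot["similarity_threshold"])  # 0.75
```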

Exporting Provenance¶

Export provenance records for external analysis:

JSON - Structured PROV-O compatible format
Timeline view - Chronological audit trail
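For a sense of what a PROV-O compatible JSON document can look like, the sketch below serializes a few records using key names from the W3C PROV-JSON member submission. This is an assumption about the general shape only; OntExtract's actual export layout is not documented here, and the `to_prov_json` helper, the `ex:` prefix, and all identifiers are hypothetical.

```python
import json


def to_prov_json(entities, activities, agents, relations):
    """Serialize ids and (subject, predicate, object) triples, PROV-JSON style."""
    doc = {
        "prefix": {"ex": "http://example.org/"},
        "entity": {e: {} for e in entities},
        "activity": {a: {} for a in activities},
        "agent": {g: {} for g in agents},
    }
    # PROV-JSON argument names for each relation, in (subject, object) order.
    roles = {
        "wasDerivedFrom": ("prov:generatedEntity", "prov:usedEntity"),
        "wasGeneratedBy": ("prov:entity", "prov:activity"),
        "used": ("prov:activity", "prov:entity"),
        "wasAssociatedWith": ("prov:activity", "prov:agent"),
    }
    for i, (subject, predicate, obj) in enumerate(relations, start=1):
        subject_key, object_key = roles[predicate]
        # Each relation instance gets its own blank-node style identifier.
        doc.setdefault(predicate, {})[f"_:r{i}"] = {
            subject_key: subject,
            object_key: obj,
        }
    return json.dumps(doc, indent=2, sort_keys=True)


exported = to_prov_json(
    entities=["ex:upload.pdf", "ex:segment-1"],
    activities=["ex:segmentation-run-1"],
    agents=["ex:spacy-3.8.11"],
    relations=[
        ("ex:segment-1", "wasDerivedFrom", "ex:upload.pdf"),
        ("ex:segment-1", "wasGeneratedBy", "ex:segmentation-run-1"),
        ("ex:segmentation-run-1", "used", "ex:upload.pdf"),
        ("ex:segmentation-run-1", "wasAssociatedWith", "ex:spacy-3.8.11"),
    ],
)
print(exported)
```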