Frequently Asked Questions
Common questions about OntExtract.
General
What is OntExtract?
OntExtract is a document processing system with integrated provenance tracking. It operates in two modes: standalone mode runs entirely on established local NLP libraries and makes no external API calls, while API-enhanced mode adds LLM orchestration for automated tool selection.
Do I need an API key to use OntExtract?
No. Core features work without an API key:
- Document upload and management
- Segmentation (paragraph, sentence, semantic)
- Entity extraction (spaCy)
- Temporal expression extraction
- Embedding generation (local sentence-transformers)
- PROV-O provenance tracking
LLM-enhanced features require an Anthropic API key:
- LLM text cleanup (OCR correction)
- Automated tool orchestration
- Cross-document synthesis
What are the two operational modes?
Standalone Mode: All document processing uses local NLP libraries (spaCy, NLTK, sentence-transformers). No external API calls required.
API-Enhanced Mode: Adds LLM orchestration through a 5-stage workflow: Analyze → Recommend → Review → Execute → Synthesize. The LLM recommends tools and synthesizes results, but human review is required before execution.
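The review gate in the 5-stage workflow can be pictured as a small state machine. This is a minimal sketch, not OntExtract's code: the `Stage` enum and the `Workflow` class are hypothetical names chosen for illustration, and the only behavior taken from the description above is the stage order and the rule that execution cannot start without human approval.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    ANALYZE = "analyze"
    RECOMMEND = "recommend"
    REVIEW = "review"
    EXECUTE = "execute"
    SYNTHESIZE = "synthesize"


# Enum members iterate in definition order, which is the workflow order.
ORDER = list(Stage)


@dataclass
class Workflow:
    """Hypothetical stage tracker; not OntExtract's actual class."""
    stage: Stage = Stage.ANALYZE
    approved: bool = False
    history: list = field(default_factory=list)

    def approve(self) -> None:
        # Human sign-off only makes sense while the run sits in REVIEW.
        if self.stage is not Stage.REVIEW:
            raise RuntimeError("nothing to review yet")
        self.approved = True

    def advance(self) -> Stage:
        if self.stage is Stage.SYNTHESIZE:
            raise RuntimeError("workflow already finished")
        # The gate: REVIEW cannot be left until a human has approved.
        if self.stage is Stage.REVIEW and not self.approved:
            raise PermissionError("human review required before execution")
        self.history.append(self.stage)
        self.stage = ORDER[ORDER.index(self.stage) + 1]
        return self.stage
```

Calling `advance()` twice on a fresh `Workflow` reaches the Review stage; a third call raises `PermissionError` until `approve()` has been called.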
Document Processing
What processing operations are available?
| Operation | Description | Mode |
|---|---|---|
| LLM Text Cleanup | Fix OCR errors and normalize text | API-enhanced |
| Segmentation | Split into paragraphs/sentences | Standalone |
| Embeddings | Generate vectors for similarity | Both |
| Entity Extraction | Identify people, places, orgs | Standalone |
| Temporal Extraction | Find dates and periods | Standalone |
| Definition Extraction | Pattern matching with strict validation for definitions and acronyms | Standalone |
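To give a feel for "pattern matching with strict validation", here is a minimal sketch of acronym extraction using only Python's `re` module. It is an illustration under assumptions, not OntExtract's extractor: the function name `extract_acronyms` and the validation rule (each acronym letter must match the initial of one preceding word) are hypothetical.

```python
import re

# Matches a parenthesised run of 2-10 capital letters, e.g. "(NLP)".
ACRONYM = re.compile(r"\(([A-Z]{2,10})\)")


def extract_acronyms(text: str) -> dict:
    """Return {acronym: expansion} for 'Long Form (LF)' patterns."""
    found = {}
    for match in ACRONYM.finditer(text):
        acronym = match.group(1)
        # Collect the words that appear before the parenthesis.
        words = re.findall(r"[A-Za-z][A-Za-z-]*", text[:match.start()])
        # Candidate expansion: one preceding word per acronym letter.
        candidate = words[-len(acronym):]
        # Validation: the word initials must spell the acronym exactly.
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(acronym) and initials == acronym:
            found[acronym] = " ".join(candidate)
    return found


sample = "We apply natural language processing (NLP); see note (AB)."
print(extract_acronyms(sample))
```

The validation step is what rejects `(AB)` in the sample: the two words before it are "see note", whose initials do not match.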
Does processing modify my original documents?
No. OntExtract preserves original documents unchanged. All processing results are stored as separate ProcessingArtifacts linked to source documents through PROV-O relationships.
What is PROV-O provenance?
PROV-O is the W3C standard for representing provenance information. OntExtract embeds PROV-O concepts directly in the database, tracking:
- Which tools processed each document (wasAssociatedWith)
- How artifacts were generated (wasGeneratedBy)
- What source documents were used (wasDerivedFrom)
This supports reproducibility: you can trace any result back to its source.
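The three relationships can be sketched as plain records to show how a result is traced back. This is an in-memory illustration only, not OntExtract's database schema: the `Relation` class, the helper functions, and every identifier such as `document:42` are hypothetical. The property names and their directions (entity to activity, activity to agent, entity to entity) follow PROV-O.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Relation:
    subject: str    # the thing being described
    predicate: str  # PROV-O property name
    obj: str        # the thing it points to


# Hypothetical identifiers, for illustration only.
RELATIONS = [
    Relation("artifact:entities-1", "wasGeneratedBy", "activity:ner-run-1"),
    Relation("activity:ner-run-1", "wasAssociatedWith", "agent:spacy"),
    Relation("artifact:entities-1", "wasDerivedFrom", "artifact:segments-1"),
    Relation("artifact:segments-1", "wasDerivedFrom", "document:42"),
]


def trace_sources(entity: str, relations: List[Relation]) -> List[str]:
    """Follow wasDerivedFrom links from an entity back to its origin.

    Simplification: only the first derivation of each entity is followed.
    """
    chain = []
    current = entity
    seen = {entity}  # guards against accidental cycles
    while True:
        parents = [r.obj for r in relations
                   if r.subject == current and r.predicate == "wasDerivedFrom"]
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


def tool_for(entity: str, relations: List[Relation]) -> Optional[str]:
    """Find the agent associated with the activity that generated an entity."""
    for gen in relations:
        if gen.subject == entity and gen.predicate == "wasGeneratedBy":
            for assoc in relations:
                if (assoc.subject == gen.obj
                        and assoc.predicate == "wasAssociatedWith"):
                    return assoc.obj
    return None
```

With the sample data, `trace_sources("artifact:entities-1", RELATIONS)` walks through the segmentation artifact to the source document, and `tool_for` returns the agent that produced the artifact.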
Experiments
What is a temporal evolution experiment?
Temporal evolution experiments analyze how term meanings change over time. You define anchor terms (key concepts to track) and upload historical documents spanning your time range. The system processes documents and organizes results by temporal period.
How are documents assigned to periods?
Documents are assigned to temporal periods based on their publication date metadata. Ensure each document has a publication date when uploading.
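As a rough picture of date-based assignment, the sketch below maps a publication date to a period by year. The period boundaries, the `PERIODS` list, and the `assign_period` function are all hypothetical; OntExtract's actual rules may differ. It also shows why the publication date matters: without one, no period can be chosen.

```python
from datetime import date
from typing import Optional

# Hypothetical periods: (label, first year, last year), both inclusive.
PERIODS = [
    ("1900-1949", 1900, 1949),
    ("1950-1999", 1950, 1999),
    ("2000-2024", 2000, 2024),
]


def assign_period(publication_date: Optional[date]) -> Optional[str]:
    """Return the label of the period containing the date, if any."""
    if publication_date is None:
        # No metadata means the document cannot be placed in a period.
        return None
    for label, start, end in PERIODS:
        if start <= publication_date.year <= end:
            return label
    # The date falls outside every configured period.
    return None
```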
Troubleshooting
Processing operations aren't running
- Verify that the Celery worker is running
- Check the Redis connection
- Review the application logs for errors
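One quick way to check the Redis connection is to send a `PING` over a plain socket. The sketch below uses only the standard library and assumes Redis listens on `localhost` at its default port 6379; adjust both to match your deployment. The function name `redis_responds` is hypothetical.

```python
import socket


def redis_responds(host: str = "localhost", port: int = 6379,
                   timeout: float = 2.0) -> bool:
    """Return True if something at host:port answers a Redis PING."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
    # An open server replies +PONG; one that requires a password replies
    # with a -NOAUTH error, which still shows the server is up.
    return reply.startswith((b"+PONG", b"-NOAUTH"))


if __name__ == "__main__":
    print("Redis reachable:", redis_responds())
```

A `False` result means the broker is not reachable from this machine, which would prevent queued processing tasks from being picked up.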
No results after processing
- Ensure the document has text content (not an image-only PDF)
- Verify that processing completed without errors
- Check the Processing Artifacts tab on the document detail page
LLM features not working
- Verify that the Anthropic API key is configured in settings
- Check that the API key has sufficient quota
- Review error messages in the interface