
How to Process Documents

This guide covers the document processing operations available in OntExtract.

Overview

After uploading documents, various processing operations can be applied to extract structured information. OntExtract preserves original documents unchanged—all results are stored as ProcessingArtifacts linked to source documents through PROV-O relationships.

Processing Operations

| Operation             | Purpose                                 | Mode         |
|-----------------------|-----------------------------------------|--------------|
| LLM Text Cleanup      | Fix OCR errors, normalize spelling      | API-enhanced |
| Segmentation          | Split into paragraphs or sentences      | Standalone   |
| Embeddings            | Generate vector representations         | Both         |
| Entity Extraction     | Identify people, places, organizations  | Standalone   |
| Temporal Extraction   | Find dates, periods, durations          | Standalone   |
| Definition Extraction | Locate concept definitions              | Standalone   |

LLM Text Cleanup

Use this for scanned or OCR'd historical documents with recognition errors.

When to Use

  • Documents with OCR character errors (rn → m, l → I)
  • Archaic spelling that needs normalization
  • Scanning artifacts (headers, page numbers in text)

How to Run

  1. Navigate to the document detail page
  2. Click the menu button (three dots) in the top-right
  3. Select Clean with LLM
  4. Review the suggested corrections
  5. Accept or modify changes
  6. Save the cleaned version

Note: LLM cleanup creates a new document version. The original is preserved. This operation requires an Anthropic API key configured in settings.
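
For orientation, the sketch below shows roughly what such a cleanup request looks like with the Anthropic Python SDK. The model name and prompt are illustrative assumptions, not OntExtract's actual configuration:

```python
# A minimal sketch of LLM-based OCR cleanup; model and prompt are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def clean_ocr_text(raw_text: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # illustrative model choice
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                "Fix OCR errors (e.g., 'rn' misread as 'm'), normalize "
                "archaic spelling, and remove scanning artifacts such as "
                "running headers and page numbers. Return only the "
                "corrected text:\n\n" + raw_text
            ),
        }],
    )
    return response.content[0].text
```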

From Document Pipeline

Cleanup can also be triggered from the experiment's Document Pipeline:

  1. Go to Experiments > Select experiment > Document Pipeline
  2. Click the broom icon next to any document
  3. Follow the cleanup workflow

Segmentation

Split documents into logical sections for analysis.

Segmentation Methods

| Method    | Description                       | Best For              |
|-----------|-----------------------------------|-----------------------|
| Paragraph | NLTK-enhanced paragraph detection | Most documents        |
| Sentence  | NLTK Punkt tokenizer              | Fine-grained analysis |

How to Run

  1. Go to the document or experiment's Document Pipeline
  2. Select documents to process
  3. Check Segmentation in Processing Operations
  4. Choose a segmentation method
  5. Click Run Selected Tools

Results

Segmentation creates TextSegment artifacts with:

  • Segment text content
  • Character-level position (start/end offsets)
  • Segment index within the document
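
As an illustration of how segments like these can be produced, the sketch below uses NLTK's Punkt tokenizer (the basis of the Sentence method); the artifact field names are illustrative, not OntExtract's actual schema:

```python
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the Punkt model

def segment_sentences(text: str):
    tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
    segments = []
    # span_tokenize yields (start, end) character offsets into the source text
    for index, (start, end) in enumerate(tokenizer.span_tokenize(text)):
        segments.append({
            "index": index,
            "start": start,
            "end": end,
            "text": text[start:end],
        })
    return segments

print(segment_sentences("Dr. Smith arrived. The meeting began at noon."))
```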

Auto-Dependency

When selecting Embeddings or Definition Extraction, the system automatically selects Paragraph Segmentation if it hasn't been run. This is because:

  • Embeddings create segment-level vectors when segments exist (more granular similarity search)
  • All extraction tools produce better results with structured text segments

Segmentation can be deselected for document-level processing only.
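
Conceptually, the auto-dependency behavior amounts to something like the sketch below; the operation names and dependency table are hypothetical:

```python
# Hypothetical dependency table: which operations require prior segmentation.
DEPENDENCIES = {
    "embeddings": {"paragraph_segmentation"},
    "definition_extraction": {"paragraph_segmentation"},
}

def resolve_operations(selected: set[str], already_run: set[str]) -> set[str]:
    """Return the selected operations plus any prerequisites not yet run."""
    resolved = set(selected)
    for op in selected:
        resolved |= DEPENDENCIES.get(op, set()) - already_run
    return resolved

# Selecting embeddings on a fresh document also pulls in segmentation
print(resolve_operations({"embeddings"}, already_run=set()))
```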

Embedding Generation

Create vector representations for semantic similarity search.

Embedding Methods

| Method       | Description                                | Best For                            |
|--------------|--------------------------------------------|-------------------------------------|
| Local        | Standard sentence-transformers model       | General modern text                 |
| Period Aware | Selects model based on document era/domain | Historical or domain-specific text  |
| OpenAI       | text-embedding-3-large (3072 dims)         | Highest accuracy (requires API key) |

How to Run

  1. Go to Document Pipeline or document detail
  2. Check Embeddings in Processing Operations
  3. Select embedding method
  4. Click Run Selected Tools

Period-Aware Embeddings

For historical documents or specialized domains, use Period Aware embeddings. This method automatically selects an appropriate model based on:

  • Document publication date
  • Domain (scientific, legal, biomedical)
  • Detected archaic language patterns

See Period-Aware Embeddings for detailed information.
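
As a rough sketch of what such selection logic can look like (the thresholds and all model identifiers except all-MiniLM-L6-v2 are hypothetical placeholders):

```python
# Hedged sketch of period-aware model selection; not OntExtract's actual logic.
def select_embedding_model(year: int | None, domain: str | None) -> str:
    if domain in ("scientific", "legal", "biomedical"):
        return f"{domain}-embedding-model"   # placeholder identifier
    if year is not None and year < 1900:
        return "historical-embedding-model"  # placeholder identifier
    return "all-MiniLM-L6-v2"                # common sentence-transformers default
```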

Results

Embeddings enable:

  • Semantic similarity search across segments
  • Finding related passages across documents
  • Clustering similar content

Vectors are stored in PostgreSQL using pgvector for efficient similarity queries.
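
A minimal sketch of this flow is shown below, assuming a hypothetical text_segments table with an embedding vector column; the model choice, schema, and connection string are illustrative:

```python
# Embed segments with sentence-transformers, then query pgvector for neighbors.
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

conn = psycopg2.connect("dbname=ontextract")  # hypothetical connection string
register_vector(conn)  # teach psycopg2 to adapt numpy arrays to pgvector

query_vec = model.encode("the concept of agency")
with conn.cursor() as cur:
    # pgvector's <=> operator is cosine distance: smaller means more similar
    cur.execute(
        "SELECT id, content FROM text_segments "
        "ORDER BY embedding <=> %s LIMIT 5",
        (query_vec,),
    )
    for segment_id, content in cur.fetchall():
        print(segment_id, content[:80])
```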

Entity Extraction

Identify named entities using spaCy NLP models.

Entity Types

  • PERSON - People, including fictional
  • ORG - Organizations, companies, agencies
  • GPE - Geopolitical entities (countries, cities)
  • DATE - Dates and periods
  • WORK_OF_ART - Titles of works

How to Run

  1. Select documents in Document Pipeline
  2. Check Entity Extraction in Processing Operations
  3. Click Run Selected Tools

Results

Entity extraction creates artifacts with:

  • Entity text and type
  • Character positions in the source
  • Confidence scores

Note: Accuracy depends on domain alignment with training corpora. Historical and technical texts may require validation.
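
A minimal runnable sketch of this step with spaCy is shown below; en_core_web_sm is a common default and is an assumption here, not necessarily the model OntExtract uses:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Ada Lovelace worked with Charles Babbage in London in 1843.")
for ent in doc.ents:
    # Each entity carries its surface text, label, and character span
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
```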

Definition Extraction

Extract term definitions using pattern matching with strict validation.

Approach

OntExtract identifies definitions in text through pattern matching, optionally supplemented by zero-shot classification:

  1. Zero-shot classification (optional, disabled by default)
     • Uses the facebook/bart-large-mnli model (~1.6GB)
     • Too slow on CPU for large documents (10+ minutes per document)
     • Enabled with the environment variable ENABLE_ZERO_SHOT_DEFINITIONS=true
     • When enabled, scores sentences for confidence boosting

  2. Pattern matching (default, fast) detects 8 definition types (see the sketch after this list):
     • explicit_definition: "X is defined as Y"
     • explicit_reference: "X refers to Y"
     • meaning: "X means Y"
     • copula: "X is a Y"
     • acronym: "IRA (Information Retrieval Agent)" with strict validation
     • also_known_as: "X (also known as Y)"
     • ie_explanation: "X (i.e., Y)"
     • appositive: dependency parsing for noun appositives

  3. Strict acronym validation:
     • Pattern: 2-6 uppercase letters with a capitalized word expansion
     • Requires expansion first letters to match the acronym (e.g., "IRA" must expand to words starting with I, R, A)
     • Rejects expansions containing years (likely citations)
     • Eliminates nonsense patterns

  4. Quality filters:
     • Reject academic citations (e.g., "et al., 2015")
     • Reject reference lists (year ranges, multiple years)
     • Reject terms with more than 3 words
     • Length validation (10-200 characters)
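
To make the approach concrete, the sketch below implements simplified versions of two pattern types, explicit_definition and acronym with strict validation. The regular expressions and validation rules are illustrative simplifications, not OntExtract's actual patterns:

```python
import re

# Term limited to 3 words; definition length bounded, echoing the quality filters.
EXPLICIT = re.compile(
    r"(?P<term>[A-Z][\w-]*(?:\s+[\w-]+){0,2})\s+is defined as\s+(?P<definition>[^.]{10,200})"
)
ACRONYM = re.compile(r"\b(?P<acronym>[A-Z]{2,6})\s*\((?P<expansion>[^)]+)\)")

def validate_acronym(acronym: str, expansion: str) -> bool:
    """Strict validation: capitalized expansion words must spell the acronym."""
    if re.search(r"\b(1[89]|20)\d{2}\b", expansion):
        return False  # a year inside the parentheses suggests a citation
    initials = "".join(w[0] for w in expansion.split() if w[:1].isupper())
    return initials == acronym

text = ("An agent, IRA (Information Retrieval Agent), indexes text. "
        "Recall is defined as the fraction of relevant documents retrieved.")

for m in ACRONYM.finditer(text):
    if validate_acronym(m["acronym"], m["expansion"]):
        print("acronym:", m["acronym"], "->", m["expansion"])

for m in EXPLICIT.finditer(text):
    print("explicit:", m["term"], "->", m["definition"].strip())
```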

How to Run

  1. Select documents in Document Pipeline
  2. Check Definition Extraction in Processing Operations
  3. Click Run Selected Tools

Results

Definition extraction creates artifacts with:

  • Term being defined
  • Definition text
  • Pattern type (explicit, acronym, etc.)
  • Confidence score (0.65-0.90 depending on pattern)
  • Character positions in the source document
  • Source sentence for context

Results are labeled "Auto" in the UI with a "Pattern" source badge. If zero-shot is enabled, definitions may show a "ZeroShot" badge.

Note: Definition extraction works best on documents that explicitly define terminology, such as glossaries, textbook introductions, or standards documents. Research papers that use but do not define terms may return few or no results.

Batch Processing

Process multiple documents efficiently:

  1. Go to Experiments > Select experiment > Document Pipeline
  2. Use checkboxes to select multiple documents
  3. Choose operations to apply
  4. Click Run Selected Tools

Document Pipeline

Operations run in parallel where possible. Progress is tracked in the interface.
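
This kind of fan-out is what Celery's group primitive provides. In the sketch below, the application name, broker URL, and task body are hypothetical:

```python
from celery import Celery, group

app = Celery("ontextract", broker="redis://localhost:6379/0")  # illustrative broker URL

@app.task
def run_operation(document_id: int, operation: str) -> str:
    # Placeholder body; the real operations are the tools described above
    return f"{operation} finished for document {document_id}"

def run_selected_tools(document_ids, operations):
    # One task per (document, operation) pair; workers execute them in parallel
    job = group(
        run_operation.s(doc_id, op)
        for doc_id in document_ids
        for op in operations
    )
    return job.apply_async()
```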

Processing Without API Costs

Run Local Tools processes documents using only local NLP libraries:

  • spaCy for entity extraction
  • NLTK for sentence tokenization
  • sentence-transformers for embeddings

No external API calls are made, enabling offline operation.

Viewing Results

After processing, view results from the experiment detail page:

  1. Go to Experiments > Select the experiment
  2. Expand the View Results section
  3. Click a result type:
     • Definitions - Extracted term definitions
     • Entities - Named entities and concepts
     • Embeddings - Generated vectors and similarity data
     • Segments - Document segments
     • Temporal - Extracted dates and periods

Result Details

Each result page shows:

  • Extracted items grouped by document
  • Source text and character positions
  • Confidence scores and extraction method
  • Links back to source documents

PROV-O Provenance

All processing operations create PROV-O provenance records:

  • wasDerivedFrom - Links artifacts to source documents
  • wasGeneratedBy - Connects artifacts to generating activities
  • wasAssociatedWith - Maps operations to tool versions

This enables complete reproducibility—any result can be traced back to its source to understand exactly how it was generated.
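
To make these relationships concrete, the sketch below builds the three PROV-O links with rdflib; the URIs are illustrative placeholders, not OntExtract's actual identifiers:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import PROV

EX = Namespace("http://example.org/ontextract/")  # illustrative base URI

g = Graph()
g.bind("prov", PROV)

artifact = EX["artifact/segment-42"]
source = EX["document/7"]
activity = EX["activity/segmentation-run-3"]
tool = EX["agent/nltk-punkt-3.8"]

g.add((artifact, PROV.wasDerivedFrom, source))    # artifact derived from source document
g.add((artifact, PROV.wasGeneratedBy, activity))  # artifact generated by an activity
g.add((activity, PROV.wasAssociatedWith, tool))   # activity associated with a tool version

print(g.serialize(format="turtle"))
```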

Troubleshooting

Processing Stuck

  • Check Celery worker status
  • Verify Redis connection
  • Review application logs

No Results Generated

  • Ensure document has text content
  • Check that source document exists
  • Verify processing completed (no errors in logs)

Embedding Errors

  • For OpenAI: verify API key in settings
  • For local: check sentence-transformers installation
  • Ensure document has been segmented first