OntExtract Documentation

Welcome to the OntExtract user manual.

About OntExtract

OntExtract provides a unified interface for document processing with integrated provenance tracking. PROV-O provenance concepts are embedded directly in the database schema, and each processing operation creates a versioned output with corresponding provenance records.

The system operates in two modes:

  • Standalone mode uses established NLP libraries (spaCy, NLTK, sentence-transformers) without external API dependencies
  • API-enhanced mode adds LLM orchestration for automated tool selection and cross-document synthesis

Users can apply different processing strategies to the same documents and compare results while the system tracks complete analytical provenance.

Research Workflow

OntExtract guides you through a 6-step workflow for semantic change analysis:

Step | Task | Guide
---- | ---- | -----
1 | Define Terms - Create anchor terms to track semantic evolution | Create Anchor Terms
2 | Upload Sources - Add documents from different historical periods | Upload Documents
3 | Create Experiment - Link terms to document sets with temporal periods | Create Temporal Experiment
4 | LLM Orchestration - AI suggests processing pipelines | LLM Orchestration
5 | Execute Pipeline - Process documents with selected tools | Process Documents
6 | View Results - Explore semantic drift and provenance graphs | View Results

Core Features

Document Management

Upload and manage historical documents with automatic metadata extraction from Semantic Scholar and CrossRef. Supported formats are PDF, plain text, Word, and HTML.

Anchor Terms

Define key concepts to track across your document corpus. Anchor terms serve as reference points for analyzing semantic change over time.

Temporal Evolution Analysis

Track how term meanings change across historical periods using timeline visualizations and ontology-backed semantic change events.

Document Processing

  • LLM Text Cleanup - Fix OCR errors and formatting issues using Claude
  • Segmentation - Split documents into paragraphs or sentences
  • Embeddings - Generate vector representations for similarity analysis
  • Entity Extraction - Identify named entities and concepts
  • Definition Extraction - Extract term definitions using pattern matching with strict validation
  • Temporal Extraction - Find dates, periods, and historical markers
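As an illustration of how pattern-based definition extraction with strict validation can work, here is a minimal sketch using only the Python standard library. It is not OntExtract's implementation: the patterns, the validation rules, and the names `DEFINITION_PATTERNS`, `split_sentences`, `is_valid`, and `extract_definitions` are all assumptions made for this example.

```python
import re

# Hypothetical patterns for illustration only; the patterns OntExtract
# actually uses are not listed in this manual.
DEFINITION_PATTERNS = [
    re.compile(r"^(?P<term>[A-Z][\w-]*(?: [\w-]+){0,3}) (?:is|are) defined as (?P<definition>.+?)\.?$"),
    re.compile(r"^(?P<term>[A-Z][\w-]*(?: [\w-]+){0,3}) refers to (?P<definition>.+?)\.?$"),
]


def split_sentences(text):
    """Naive segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_valid(term, definition, min_words=3):
    """Assumed validation rules: the definition must have at least
    `min_words` words and must not simply restate the term."""
    words = definition.split()
    return len(words) >= min_words and term.lower() not in definition.lower()


def extract_definitions(text, anchor_terms):
    """Return (term, definition) pairs for sentences that define an anchor term."""
    anchors = {t.lower() for t in anchor_terms}
    results = []
    for sentence in split_sentences(text):
        for pattern in DEFINITION_PATTERNS:
            match = pattern.match(sentence)
            if not match:
                continue
            term = match.group("term")
            definition = match.group("definition")
            if term.lower() in anchors and is_valid(term, definition):
                results.append((term, definition))
            break  # only the first matching pattern is considered per sentence
    return results
```

With this sketch, a circular sentence such as "Agency refers to agency." matches a pattern but is rejected by validation, which is the kind of false positive strict validation is meant to filter out.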

LLM Orchestration

In API-enhanced mode, the LLM analyzes your experiment and recommends processing strategies through a 5-stage workflow: Analyze → Recommend → Review → Execute → Synthesize.
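The five stages can be pictured as a small linear state machine. The sketch below is illustrative only: the stage names come from this page, but the `Stage` enum, the `next_stage` and `walk` helpers, and the rule that the Review stage blocks execution until the recommendations are approved are assumptions, not OntExtract's actual code.

```python
from enum import Enum


class Stage(Enum):
    ANALYZE = "Analyze"
    RECOMMEND = "Recommend"
    REVIEW = "Review"
    EXECUTE = "Execute"
    SYNTHESIZE = "Synthesize"


STAGES = list(Stage)  # Enum iteration preserves definition order


def next_stage(current, approved=False):
    """Return the stage after `current`, or None after the final stage.

    Assumption in this sketch: leaving REVIEW requires explicit approval.
    """
    if current is Stage.REVIEW and not approved:
        raise PermissionError("recommendations must be approved before execution")
    position = STAGES.index(current)
    return STAGES[position + 1] if position + 1 < len(STAGES) else None


def walk(approved):
    """Return the names of the stages reached, starting from Analyze."""
    reached = [Stage.ANALYZE]
    while True:
        try:
            following = next_stage(reached[-1], approved)
        except PermissionError:
            break  # workflow pauses at Review until approval is given
        if following is None:
            break
        reached.append(following)
    return [stage.value for stage in reached]
```

In this sketch, `walk(True)` passes through all five stages, while `walk(False)` stops at Review.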

Ontology-Informed Design

Event types are derived from a Semantic Change Ontology with 34 classes, which is based on existing terminology from the literature.

Provenance Tracking

OntExtract captures W3C PROV-O provenance for all analysis steps. Every processing operation creates versioned outputs with queryable provenance chains.
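To make the idea of versioned outputs and queryable provenance chains concrete, here is a minimal in-memory sketch. It borrows the PROV-O terms `prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasGeneratedBy`, and `prov:wasDerivedFrom`, but the classes, field names, and identifier scheme (`doc-1/v2` and so on) are hypothetical and do not reflect OntExtract's database schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Entity:  # corresponds to prov:Entity
    id: str
    version: int
    derived_from: Optional[str] = None  # prov:wasDerivedFrom
    generated_by: Optional[str] = None  # prov:wasGeneratedBy


@dataclass(frozen=True)
class Activity:  # corresponds to prov:Activity
    id: str
    tool: str
    used: str  # prov:used (id of the input entity)
    ended_at: datetime


class ProvenanceStore:
    """Hypothetical in-memory store; a real system would persist these records."""

    def __init__(self):
        self.entities = {}
        self.activities = {}

    def add_source(self, doc_id):
        """Register an uploaded document as version 1."""
        entity = Entity(id=f"{doc_id}/v1", version=1)
        self.entities[entity.id] = entity
        return entity

    def record(self, input_entity, tool):
        """Record one processing operation and return its new versioned output."""
        activity = Activity(
            id=f"activity/{len(self.activities) + 1}",
            tool=tool,
            used=input_entity.id,
            ended_at=datetime.now(timezone.utc),
        )
        base = input_entity.id.rsplit("/v", 1)[0]
        version = input_entity.version + 1
        output = Entity(
            id=f"{base}/v{version}",
            version=version,
            derived_from=input_entity.id,
            generated_by=activity.id,
        )
        self.activities[activity.id] = activity
        self.entities[output.id] = output
        return output

    def chain(self, entity_id):
        """Follow wasDerivedFrom links back to the source; return tools oldest first."""
        tools = []
        current = self.entities[entity_id]
        while current.generated_by is not None:
            tools.append(self.activities[current.generated_by].tool)
            current = self.entities[current.derived_from]
        return list(reversed(tools))
```

A short usage example, with made-up document and tool names:

```python
store = ProvenanceStore()
source = store.add_source("doc-1")
cleaned = store.record(source, "llm_text_cleanup")
segmented = store.record(cleaned, "segmentation")
print(store.chain(segmented.id))  # tools applied, oldest first
```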

Getting Help

  • Check the FAQ for common questions
  • Report issues on GitHub

About This Documentation

This manual covers installation, configuration, and usage of OntExtract features. Pages are organized by task and feature area.