How to Upload Documents¶

This guide covers uploading historical documents to OntExtract for analysis.

Overview¶

Documents are the foundation of temporal evolution analysis. OntExtract supports various document formats and captures metadata essential for period-aware processing.

Supported Formats¶

PDF - Scanned or digital PDFs (text extracted automatically)
Plain Text (.txt) - Raw text files
Word Documents (.docx) - Microsoft Word format
HTML (.html, .htm) - Web pages and HTML documents
Markdown (.md) - Markdown-formatted text
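
A quick way to check a file against this list before uploading is to compare its extension. The sketch below is illustrative only: the `SUPPORTED_EXTENSIONS` mapping and `detect_format` helper are hypothetical names, not part of OntExtract, and OntExtract may well inspect file content rather than the extension.

```python
from pathlib import Path
from typing import Optional

# Extensions taken from the list above; the mapping itself is an assumption.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".txt": "Plain Text",
    ".docx": "Word Document",
    ".html": "HTML",
    ".htm": "HTML",
    ".md": "Markdown",
}

def detect_format(filename: str) -> Optional[str]:
    """Return a format label for the filename, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(filename).suffix.lower())
```

For example, `detect_format("Paper.PDF")` returns `"PDF"`, while a legacy `.doc` file returns `None`.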

Document Versions¶

OntExtract treats each upload as a separate document record. Multiple versions of the same work can be uploaded:

Different formats - PDF and Word versions of the same paper
Different versions - Preprint (v1) and final published version
Updated copies - A cleaner scan or OCR-corrected version

How Versions Are Handled¶

Documents with the same DOI or title are not automatically linked
Each upload creates an independent document record
Provenance tracking records which specific document version was processed

Best Practices¶

Scenario	Recommendation
Better quality scan available	Upload new version, use it for new experiments
Preprint vs published	Upload both if content differs significantly
PDF and Word of same content	Upload whichever extracts text better
Duplicate by accident	Delete the unwanted copy from the Sources list

To identify duplicates, sort by DOI or title in the Sources view.
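
If you keep your own list of uploaded records (for example, copied from the Sources view), the same DOI-or-title comparison can be sketched in a few lines. The record layout (`id`, `doi`, `title` keys) and the `find_duplicates` helper are assumptions made for illustration; OntExtract does not necessarily expose records in this form.

```python
from collections import defaultdict

def find_duplicates(records):
    """Group record ids that share a DOI, or a normalized title when no DOI is set."""
    groups = defaultdict(list)
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        # Collapse whitespace and case so trivially different titles still match.
        title = " ".join((rec.get("title") or "").lower().split())
        key = ("doi", doi) if doi else ("title", title)
        groups[key].append(rec["id"])
    # Only groups with more than one member are candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Matching on title alone can flag distinct works that happen to share a name, so treat the result as a list of candidates to review rather than copies to delete.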

Upload Methods¶

Single Document Upload¶

Navigate to Library > Sources in the main menu
Click Upload Document
Select or drag-and-drop your file
Review the extracted metadata (see below)
Click Upload and Extract Metadata

[Figure: Upload Document Interface]

Automatic Metadata Extraction¶

By default, OntExtract automatically extracts metadata from uploaded PDFs using a cascade of methods:

Extraction Priority¶

arXiv ID - Checked first in filename, then PDF content. If found, queries Semantic Scholar.
DOI - Extracted from PDF pages. If found, queries Semantic Scholar, then CrossRef.
Title + Authors - Extracted from PDF text. Used to search CrossRef database.
PDF embedded metadata - Falls back to PDF document properties.
Filename - Used as last resort for title if nothing else matches.
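
The cascade above can be sketched as a function that returns the first step that matches. This is a simplified illustration, not OntExtract's implementation: the regular expressions, the `choose_strategy` name, and its parameters are all assumptions, and title extraction from PDF text is treated as already done by the caller.

```python
import re
from pathlib import Path

# New-style arXiv IDs such as 2301.12345 (old-style IDs are not handled here).
ARXIV_ID = r"(\d{4}\.\d{4,5})(?:v\d+)?"
ARXIV_IN_FILENAME = re.compile(r"(?<!\d)" + ARXIV_ID + r"(?!\d)")
ARXIV_IN_TEXT = re.compile(r"arXiv:\s*" + ARXIV_ID, re.IGNORECASE)
# A common DOI shape; real DOIs are more varied than this pattern.
DOI_IN_TEXT = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")

def choose_strategy(filename, pdf_text, extracted_title=None, embedded_title=None):
    """Return (strategy, value) for the first step of the cascade that applies."""
    match = ARXIV_IN_FILENAME.search(filename) or ARXIV_IN_TEXT.search(pdf_text)
    if match:
        return ("arxiv", match.group(1))
    match = DOI_IN_TEXT.search(pdf_text)
    if match:
        # Trailing punctuation is usually sentence text, not part of the DOI.
        return ("doi", match.group(1).rstrip(".,;)"))
    if extracted_title:
        return ("title_search", extracted_title)
    if embedded_title:
        return ("pdf_metadata", embedded_title)
    return ("filename", Path(filename).stem)
```

Ordering the checks from the most specific identifier to the least specific keeps a weak signal, such as a filename, from overriding a strong one.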

Database Lookups¶

OntExtract queries academic databases to enrich metadata:

Database	Best For	What It Provides
Semantic Scholar	arXiv papers, recent preprints	Title, authors, year, abstract, citation count
CrossRef	Published journal articles, books	Title, authors, journal, volume, pages, DOI

Note: Very recent papers (not yet indexed) or very old documents (pre-digital) may not be found in these databases.
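
To check by hand whether a paper is indexed, you can query the public APIs of both services. The sketch below only builds the request URLs. The endpoints are the public Semantic Scholar Graph API and CrossRef REST API; whether OntExtract calls these exact endpoints, or requests these fields, is an assumption, and the helper names are hypothetical.

```python
from urllib.parse import quote

def semantic_scholar_url(identifier, kind="arXiv"):
    """Build a Semantic Scholar Graph API lookup URL; kind is 'arXiv' or 'DOI'."""
    fields = "title,authors,year,abstract,citationCount"
    paper_id = f"{kind}:{quote(identifier, safe='/')}"
    return f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}?fields={fields}"

def crossref_url(doi):
    """Build a CrossRef REST API lookup URL for a DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")
```

Opening either URL in a browser returns JSON if the record exists and a not-found response otherwise, which helps distinguish an unindexed paper from an extraction problem.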

Source Indicators¶

After extraction, the upload form shows badges indicating where each metadata field originated:

Green - CrossRef database match
Blue - Semantic Scholar database match
Yellow - Extracted from PDF analysis
Cyan - User-provided value

Disabling Automatic Extraction¶

Uncheck "Automatic metadata extraction" to:

Skip database lookups entirely
Reveal all manual entry fields
Enter metadata directly without API calls

This is useful for:

Personal or unpublished documents
Historical documents not in academic databases
Documents where automatic extraction produces incorrect results

Metadata Fields¶

Required Fields¶

Field	Description
Title	Document title (only required field)

Core Fields (Always Shown)¶

Field	Description
Title	Document title for identification
Authors	Comma-separated author names
Publication Date	Year, month-year, or full date

Extended Fields (Manual Entry Mode)¶

When automatic extraction is disabled, additional fields appear:

Publication Details:

Field	Description
Journal/Publication	Journal or publication name
Volume	Volume number
Issue	Issue number
Pages	Page range (e.g., "123-145")
Publisher	Publishing organization
Container Title	For chapters in edited volumes
Series	Book or journal series
Edition	Edition number or description
Editor	Editor name(s)

Identifiers:

Field	Description
DOI	Digital Object Identifier
URL	Web address
ISBN	Book identifier
ISSN	Serial identifier

Additional Information:

Field	Description
Abstract	Document abstract or summary
Document Type	Academic Paper, Book, Dictionary Entry, etc.
Entry Term	For dictionary/reference entries (headword)
Access Date	When online source was accessed
Notes	Additional context or comments

Publication Date Formats¶

OntExtract accepts various date formats:

Year only: 1910, 1856
Month and year: March 1910, 1910-03
Full date: 1910-03-15, March 15, 1910

The system extracts the year for temporal period assignment.
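
As an illustration of how a year can be recovered from each of the formats listed above, the following sketch tries one `strptime` pattern per format. The function name and format list are assumptions for this example, not OntExtract's parser, and month names are assumed to be in English.

```python
from datetime import datetime
from typing import Optional

# One pattern per format listed above.
DATE_FORMATS = ["%Y", "%B %Y", "%Y-%m", "%Y-%m-%d", "%B %d, %Y"]

def parse_publication_year(text: str) -> Optional[int]:
    """Return the year from a publication date string, or None if no format matches."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).year
        except ValueError:
            continue  # Try the next format.
    return None
```

Every example in the list above, from `1910` to `March 15, 1910`, yields the year 1910 or 1856 as appropriate; an unrecognized string such as `circa 1910` yields `None`.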

Planned Features¶

The following features are under consideration for future releases:

Multiple authors with structured entry and Zotero-style lookup
LLM-based metadata guessing from document content

After Upload¶

Once uploaded, documents appear in Library > Sources, accessible from the main menu.

Processing Options¶

After upload, documents can be processed with:

LLM Text Cleanup - Fix OCR errors and formatting issues (recommended for scanned documents)
Segmentation - Split into paragraphs or sentences
Embeddings - Generate vector representations for similarity search
Entity Extraction - Identify named entities and concepts
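
To give a sense of what segmentation produces, here is a deliberately naive sketch of paragraph and sentence splitting. It is not OntExtract's segmenter: the function names and the regular expressions are assumptions, and the sentence rule will split wrongly after abbreviations such as "Dr."

```python
import re

def split_paragraphs(text):
    """Split on blank lines and drop empty segments."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(paragraph):
    """Split after ., ! or ? when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", paragraph)
    return [s.strip() for s in parts if s.strip()]
```

Paragraph segments preserve more context for embeddings, while sentence segments give finer-grained matches; which is preferable depends on the experiment.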

Tips for Historical Documents¶

OCR Quality¶

Scanned historical documents often have OCR errors. Use the LLM Text Cleanup feature to:

Fix character recognition mistakes (rn → m, l → I)
Correct archaic spelling normalization
Remove scanning artifacts (headers, page numbers)

Temporal Periods¶

Documents are automatically assigned to temporal periods based on publication date. For an experiment tracking 1910-2024:

A document from 1910 goes in the earliest period
A document from 2020 goes in the latest period
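
The assignment can be pictured as a lookup of the publication year against a sorted list of period boundaries. The sketch below is a minimal illustration: the boundary years and the `assign_period` helper are hypothetical, since how OntExtract defines periods within an experiment is not specified here.

```python
from bisect import bisect_right

def assign_period(year, boundaries):
    """Return the (start, end) boundary pair containing year, or None if out of range.

    boundaries is a sorted list such as [1910, 1950, 1990, 2024], read as the
    periods 1910-1949, 1950-1989 and 1990-2024 (the final year is inclusive).
    """
    if not boundaries[0] <= year <= boundaries[-1]:
        return None
    # Clamp so the final boundary year falls in the last period.
    index = min(bisect_right(boundaries, year) - 1, len(boundaries) - 2)
    return (boundaries[index], boundaries[index + 1])
```

With the boundaries `[1910, 1950, 1990, 2024]`, a 1910 document maps to `(1910, 1950)` and a 2020 document maps to `(1990, 2024)`, matching the earliest and latest periods described above.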

Troubleshooting¶

Upload Fails¶

Check the file size (50 MB maximum by default)
Verify file format is supported
Ensure you're logged in
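
The first two checks can be run locally before retrying an upload. This sketch assumes the 50 MB default is measured in binary megabytes and that formats are identified by extension; both are assumptions, and `preflight_problems` is a hypothetical helper rather than an OntExtract function.

```python
from pathlib import Path

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # 50 MB default; binary megabytes assumed
SUPPORTED = {".pdf", ".txt", ".docx", ".html", ".htm", ".md"}

def preflight_problems(filename, size_bytes, max_bytes=MAX_UPLOAD_BYTES):
    """Return a list of human-readable problems; an empty list means both checks pass."""
    problems = []
    if size_bytes > max_bytes:
        size_mb = size_bytes / 1024 / 1024
        problems.append(f"file is {size_mb:.1f} MB; limit is {max_bytes // 1024 // 1024} MB")
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    return problems
```

To check a file on disk, pass `os.path.getsize(path)` as `size_bytes`. If the instance has been configured with a different limit, pass it as `max_bytes`.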

No Text Extracted¶

The PDF may be image-only (scanned without OCR)
Try re-uploading the document in a different format

Wrong Publication Date¶

Edit the document metadata after upload
Go to Library > Sources > Select document > Edit