How to Use Period-Aware Embeddings¶

This guide covers OntExtract's period-aware embedding feature for historical and domain-specific text analysis.

Overview¶

Historical texts use different vocabulary, spelling, and linguistic patterns than contemporary texts. Using a modern embedding model on archaic text can result in poor semantic representations.

The period-aware embedding service addresses this by:

Selecting models trained on corpora from similar time periods
Using domain-specific models for specialized vocabularies
Detecting archaic language patterns when metadata is unavailable

When to Use Period-Aware Embeddings¶

Use period-aware embeddings when:

Analyzing documents spanning multiple historical periods
Working with archaic or historical language
Processing domain-specific texts (scientific, legal, biomedical)
Comparing semantic similarity across time periods

Model Selection¶

The service selects embedding models based on this priority:

Domain (if specified) - Takes precedence for specialized vocabularies
Year (if specified) - Selects period-appropriate model
Text Analysis (fallback) - Detects archaic/technical language patterns
Default - Falls back to modern model

Period-Based Models¶

Period	Era	Handles Archaic
Pre-1850	Pre-industrial	Yes
1850-1950	Industrial	Yes
1950-2000	Modern	No
2000+	Contemporary	No

Domain-Specific Models¶

Domain	Use Case
Scientific	Scientific papers, technical documentation
Legal	Legal documents, contracts, case law
Biomedical	Medical literature, clinical texts

Using Period-Aware Embeddings¶

From the Document Pipeline¶

Go to Experiments > Select the experiment > Document Pipeline
Select documents to process using the checkboxes
Under the Embeddings section, check Period-Aware Embeddings
Click Run Selected Tools

The service will:

Check the document's publication date metadata
Analyze text for archaic language patterns (if no date available)
Select and apply the appropriate model

Via LLM Orchestration¶

When using LLM orchestration, the system may automatically recommend period-aware embeddings for:

Historical documents (based on publication date)
Documents with detected archaic language
Domain-specific technical papers

Recommendations can be approved or modified during the Review stage.

Setup Requirements¶

Period-aware models must be downloaded before use. Run the download script:

# Download core models (~500MB)
python scripts/download_embedding_models.py --core

# Download all models (~2GB)
python scripts/download_embedding_models.py --all

# Check download status
python scripts/download_embedding_models.py --check

Archaic Language Detection¶

When no publication date is available, the service uses a heuristic approach to detect archaic language, based on lexical markers established in historical linguistics research.

Linguistic Basis¶

The detection approach uses two categories of markers that are well-documented in the literature on Early Modern English (c. 1500-1700):

1. Archaic Second-Person Pronouns and Verb Forms

thou, thee, thy, thine — The singular second-person pronoun system that fell out of standard use by the 17th century. The shift from "thou" to "you" is one of the most studied changes in English historical linguistics (see Burnley, 2000; Wales, 1996).
hath, doth — Third-person singular verb forms with the archaic -eth ending, replaced by modern -s forms ("has," "does") during the Early Modern period.

2. Pronominal Adverbs

whence, wherefore, wherein, whereby, heretofore, hereunto — These are pronominal adverbs formed from wh-/h-/th- stems combined with prepositions. They form systematic patterns (hither/thither/whither for direction-to; hence/thence/whence for direction-from) and are characteristic of both archaic and legal English.

These markers are used in corpus normalization research for Early Modern English texts (see Archer et al., 2015, "Guidelines for normalising Early Modern English corpora") and are recognized as reliable indicators of historical text in computational historical linguistics.

Detection Method¶

Archaic indicators detected:

Historical pronouns: thou, thee, thy, thine
Archaic verbs: hath, doth
Pronominal adverbs: whence, wherefore, wherein, whereby, heretofore, hereunto, notwithstanding

Technical indicators detected:

Academic vocabulary: hypothesis, methodology, parameter
Scientific terms: coefficient, algorithm, paradigm, empirical

If archaic language is detected, the historical model is automatically selected.

Limitations¶

This is a heuristic approach based on lexical markers rather than a trained classifier. It works well for:

Texts containing Early Modern English features (pre-1700)
Legal documents with formal/archaic register
Religious texts (e.g., King James Bible style)

For more sophisticated period detection, future versions may incorporate trained classifiers on dated corpora.

References¶

Archer, D., Kytö, M., Baron, A., & Rayson, P. (2015). Guidelines for normalising Early Modern English corpora: Decisions and justifications. ICAME Journal, 39, 5-24.
Burnley, D. (2000). The History of the English Language: A Source Book (2nd ed.). Longman.
Wales, K. (1996). Personal Pronouns in Present-Day English. Cambridge University Press.
Piotrowski, M. (2012). Natural Language Processing for Historical Texts. Morgan & Claypool (Synthesis Lectures on Human Language Technologies, vol. 17).

Understanding Results¶

Embedding Metadata¶

When period-aware embeddings are generated, the processing artifact includes metadata showing:

Field	Description
Selected Model	Which embedding model was used
Selection Reason	Why this model was chosen
Selection Confidence	Confidence score (0-1)
Era	Detected time period category
Handles Archaic	Whether the model handles historical language

Semantic Drift Classification¶

When comparing embeddings across periods, drift is classified as:

Classification	Drift Value	Meaning
Stable	< 0.2	Minimal semantic change
Minor Change	0.2 - 0.4	Some evolution in meaning
Moderate Drift	0.4 - 0.7	Notable semantic shift
Major Shift	≥ 0.7	Substantial meaning change

Tips for Best Results¶

Document Metadata¶

Ensure documents have accurate publication dates for best model selection
Add domain metadata (scientific, legal, biomedical) when applicable

Corpus Considerations¶

Use consistent embedding methods within a single experiment for valid comparisons
When comparing across periods, process all documents with period-aware embeddings
Include multiple documents per period for reliable drift calculations

Model Downloads¶

Download models before batch processing to avoid delays
Core models are sufficient for most historical text analysis
Domain-specific models (scientific, legal, biomedical) are included in the full download