Computational Tools for Diachronic Linguistics

The AthDGC toolkit: open-source, lightweight, classroom-ready

The AthDGC platform ships two open deliverables in parallel: this computational toolkit for diachronic linguistics plus a PROIEL XML 2.0 diachronic Greek corpus. The toolkit is designed to work on the AthDGC corpus and on any other PROIEL XML 2.0 historical treebank. AthDGC publishes PROIEL XML 2.0 only - no CoNLL-U, no UD-flavoured export.

All tools are open source (Apache-2.0 or MIT by default; BSD or GPL where the upstream license requires). All are documented for classroom use - from undergraduate to PhD level - so digital-humanities courses can adopt them without per-seat licensing.

At a glance

Tool | What it does | License | Audience
--- | --- | --- | ---
LightSIDE (mirror + tutorials) | Text-mining workbench: feature extraction, classifier training, error analysis | BSD | Researchers, classrooms
LightSIDE-AthDGC (Lavidas extension) | LightSIDE for syntactic features: PROIEL arcs, argument-structure frames, morphology bundles, voice/aspect/mood cross-tabs | BSD + Apache-2.0 | Diachronic syntacticians
AthDGC concordancer | NoSketch-style concordance interface for the AthDGC corpus | Apache-2.0 | Researchers, students
PROIEL XML 2.0 exporter | Convert raw Stanza output to PROIEL XML 2.0 | Apache-2.0 | Treebank builders
Diachronic Greek Stanza checkpoints | Fine-tuned grc_proiel weights for diachronic adaptation | Apache-2.0 | NLP researchers
Cross-lingual alignment viewer | Browse the NT-aligned graph (Greek / Latin / Gothic / OCS) | MIT | Comparative linguists
Corpus-fix toolkit | fix_corpus_data.py - TLG author + Stanza-error corrections at scale | Apache-2.0 | Annotation curators
Argument-structure extractor | Per-verb argument frame (subject / object / oblique / voice / aspect) | Apache-2.0 | Syntacticians
Showcase generator | 51_build_showcase_site.py - regenerate a public showcase from any JSONL corpus | Apache-2.0 | Project sites
Quarto template pack | This site - ready-made multi-output (HTML + PDF + .docx + Reveal.js + Beamer + .pptx) for any diachronic linguistics project | MIT | All academic users
Valency-frame database (v0.5) | Queryable database of per-verb argument-frame distributions across periods | Apache-2.0 | Syntacticians, typologists
Retranslation-pair browser (v0.5) | Same passage in N periods + N IE languages, side by side with aligned PROIEL trees | Apache-2.0 | Comparatists, translation scholars
Retelling-chain explorer (v0.5) | Trace an influential text's reception across periods (epitomes, paraphrases, glosses) | Apache-2.0 | Reception scholars, philologists
AthDGC PROIEL validator (v0.5) | Schema-strict + relation-inventory-strict linter for community contributions | Apache-2.0 | Treebank builders

All tools live in https://github.com/AthDGC.

LightSIDE (text-mining workbench)

LightSIDE is the open-source text-mining workbench developed by Carolyn Rosé's group at Carnegie Mellon. It pairs naturally with a PROIEL-style treebank because it lets you:

  • extract features from annotated text (n-grams, POS-tag patterns, syntactic dependencies)
  • train classifiers (SVM, logistic regression, decision tree) on diachronic period labels
  • inspect which features each model relies on (i.e. which linguistic patterns let you tell Classical from Koine, or Byzantine from Modern)
  • export the trained models for reuse in research or pedagogy

AthDGC + LightSIDE pipeline

  1. Export an AthDGC slice to LightSIDE's CSV input format with tools/athdgc_to_lightside.py.
  2. Open LightSIDE, load the .csv, configure features (POS unigrams, argument-structure frames, lemma trigrams, etc.).
  3. Train a classifier on the period column.
  4. Inspect the most predictive features per period, for example which morphological patterns survive into Modern Greek and which die out at the Byzantine boundary.
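
Step 1 amounts to a small JSONL-to-CSV conversion. The sketch below is not tools/athdgc_to_lightside.py itself; the function name and the record keys (`period`, `text`) are assumptions made for illustration, and the real export may use different columns.

```python
import csv
import io
import json


def jsonl_to_lightside_csv(jsonl_lines, out_handle):
    """Write one CSV row per JSONL record: the period label, then the text.

    The 'period' and 'text' keys are hypothetical; adapt them to the real schema.
    Returns the number of data rows written.
    """
    writer = csv.writer(out_handle)
    writer.writerow(["period", "text"])
    rows = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        writer.writerow([record["period"], record["text"]])
        rows += 1
    return rows


# Two illustrative records, serialised the way a JSONL slice would store them.
sample = [
    json.dumps({"period": "Archaic", "text": "μῆνιν ἄειδε θεά"}, ensure_ascii=False),
    json.dumps({"period": "Koine", "text": "ἐν ἀρχῇ ἦν ὁ λόγος"}, ensure_ascii=False),
]
buffer = io.StringIO()
n_rows = jsonl_to_lightside_csv(sample, buffer)
print(n_rows)
print(buffer.getvalue())
```

Keeping the class label in its own column is what lets step 3 train on the period column directly.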

We provide:

  • tools/athdgc_to_lightside.py - export script
  • tools/lightside-recipes/ - worked examples (Archaic-vs-Classical, NT-vs-Byzantine, etc.)
  • slides/lightside_workshop.qmd - a one-hour workshop slide deck for classroom use

(LightSIDE itself remains BSD-licensed and downloadable from CMU. We do not redistribute the binaries - only the AthDGC-side adapters + tutorials.)

LightSIDE-AthDGC for syntax (Lavidas extension)

The standard LightSIDE workbench operates on word-level text features (n-grams, character n-grams, simple POS counts). The LightSIDE-AthDGC fork - developed by N. Lavidas - extends LightSIDE so it operates on PROIEL syntactic features, not just on text. This is the toolkit's signal contribution to diachronic syntactic research.

What LightSIDE-AthDGC adds:

  • Dependency-arc features. Each (head-relation-dependent) triple in a PROIEL parse becomes a feature, e.g. obj:V-N, atr:N-N, obl:V-N+prep. The classifier can learn that obj:V-N with acc.sg.f is over-represented in Archaic and under-represented in Modern, etc.
  • Argument-structure-frame features. For every VERB / AUX token, the full valency frame is emitted as a feature: pred[sub:nom, obj:acc, voc:voc], pred[sub:nom, obj:acc, obl:dat], etc. Period classifiers built on frame features distinguish Greek diachronic stages with markedly better accuracy than text-only models.
  • Morphology-bundle features. The PROIEL 10-character morphological tag becomes a feature bundle (case + gender + number + tense + mood + voice + aspect). Selectable via the LightSIDE feature panel.
  • Voice / aspect / mood cross-tabulations. Joint features like voice=mid,aspect=perf,mood=ind enable detection of construction-level diachronic shifts (e.g. middle-perfect retention).
  • Relation-co-occurrence features. Pairs of relations sharing a head (e.g. sub + obj + voc on the same verb) become single features for valency-pattern classification.
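
As a rough picture of how the arc, frame, and co-occurrence strings above could be derived from one parsed sentence, here is a minimal sketch. It is not the fork's plugin code: the token dictionary layout (`id`, `pos`, `head`, `rel`, `case`) and the function name are assumptions; only the output string shapes follow the examples in the list.

```python
from collections import defaultdict

# One toy sentence as a list of token dicts (hypothetical layout).
# head == 0 marks the root token.
tokens = [
    {"id": 1, "form": "μῆνιν", "pos": "N", "head": 2, "rel": "obj", "case": "acc"},
    {"id": 2, "form": "ἄειδε", "pos": "V", "head": 0, "rel": "pred", "case": None},
    {"id": 3, "form": "θεά", "pos": "N", "head": 2, "rel": "voc", "case": "voc"},
]


def sentence_features(tokens):
    """Return arc, frame, and relation co-occurrence features for one sentence."""
    by_id = {tok["id"]: tok for tok in tokens}
    dependents = defaultdict(list)
    arcs = []
    for tok in tokens:
        head = by_id.get(tok["head"])
        if head is None:
            continue  # root token: no incoming arc
        # relation:HEADPOS-DEPPOS, e.g. 'obj:V-N'
        arcs.append(f'{tok["rel"]}:{head["pos"]}-{tok["pos"]}')
        dependents[head["id"]].append((tok["rel"], tok["case"] or "-"))
    frames, cooccurrence = [], []
    for head_id, deps in dependents.items():
        if by_id[head_id]["pos"] != "V":
            continue  # frames are emitted for verbal heads only
        frames.append("pred[" + ",".join(f"{rel}:{case}" for rel, case in deps) + "]")
        cooccurrence.append("+".join(rel for rel, _ in deps))
    return {"arcs": arcs, "frames": frames, "cooccurrence": cooccurrence}


features = sentence_features(tokens)
print(features)
```

Each string is an opaque categorical feature, so a classifier sees `pred[obj:acc,voc:voc]` as one unit rather than as separate words.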

Why this matters

Standard text-mining tools are blind to syntactic structure. A bag-of-words classifier might distinguish Classical from Koine on lexical evidence, but it cannot reveal which syntactic patterns survive into Modern Greek and which die out at the Byzantine boundary. LightSIDE-AthDGC closes that gap by exposing the PROIEL dependency arcs, argument-structure frames, and morphology bundles as native LightSIDE features.

LightSIDE-AthDGC pipeline

  1. Export an AthDGC slice to LightSIDE-AthDGC's CSV input format (tools/athdgc_to_lightside_syntax.py) - each row carries token features plus syntactic features (head id, relation, parent relation, dependency-arc triple, valency frame).
  2. Open LightSIDE-AthDGC, load the .csv, and select syntactic-feature panels: Dependency arcs, Argument frames, Morphology bundle, Voice/aspect/mood, Relation co-occurrence.
  3. Train SVM / logistic / decision-tree classifiers on the period column (Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern).
  4. Inspect the per-feature weights to find which syntactic patterns drive the classifier's decision per period boundary. Export feature-importance tables for the data paper or a follow-on linguistic-change study.

What we ship (open source, on GitHub)

The LightSIDE-AthDGC fork lives at https://github.com/AthDGC/Diachronic-Linguistics-Platform/tree/main/tools/lightside-athdgc (Apache-2.0 for the AthDGC adapters; BSD for the code inherited from upstream LightSIDE). The relevant paths in the repository:

  • tools/athdgc_to_lightside_syntax.py - the syntactic-feature export script.
  • tools/lightside-athdgc/ - the LightSIDE fork patches: feature plugins for PROIEL arcs, frames, and morphology.
  • tools/lightside-recipes-syntax/ - worked classroom examples (Archaic-vs-Classical-on-arcs, Koine-vs-Byzantine-on-frames, etc.).
  • slides/lightside_syntax_workshop.qmd - a one-hour workshop slide deck.

Worked example - a sample of the syntactic-feature CSV

Running tools/athdgc_to_lightside_syntax.py --in samples.jsonl --out features.csv on a small slice produces a CSV like this (truncated to 4 rows, 8 columns; the real export has ~40 columns for the full PROIEL feature space):

period          token       lemma       pos       morph             arc_feat                 frame_feat              voice_aspect_mood
Archaic         ἄειδε       ἀείδω       V         2sg.impv.act      ROOT                     pred[obj:acc,voc:voc]   act/impf/imp
Archaic         μῆνιν       μῆνις       Nb        acc.sg.f          obj:V-N+acc.sg.f         -                       -
Classical       οἶδα        οἶδα        V         1sg.prs.act       ROOT                     pred[xcomp:clausal]     act/perf/ind
Koine           ἐγένετο     γίγνομαι    V         3sg.aor.ind.mid   ROOT                     pred[sub:nom,xobj:nom]  mid/aor/ind

Load features.csv into LightSIDE-AthDGC, switch to the "Diachronic syntactic features" panel, train an SVM on the period column, and inspect which arc_feat and frame_feat strings are over-represented per period. The export script, the sample CSV, and the worked tutorial all ship under tools/lightside-recipes-syntax/ on the public repo.
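
Before opening the workbench, a quick sanity check on the export can be done with the standard library alone. The sketch below computes the per-period share of each frame_feat value; it is a plain relative-frequency count, not the SVM weighting LightSIDE reports, and the toy rows are invented placeholders rather than corpus results. The column names follow the sample above; the function name is an assumption.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented toy rows in the shape of the export (two columns only).
TOY_CSV = """period,frame_feat
Archaic,"pred[obj:acc,voc:voc]"
Archaic,"pred[sub:nom,obj:acc]"
Koine,"pred[sub:nom,obj:acc]"
Koine,"pred[sub:nom,xobj:nom]"
Koine,"pred[sub:nom,obj:acc]"
"""


def feature_shares(handle, column):
    """Return {period: {feature: share}} for one feature column.

    Rows whose value is empty or '-' (no feature on that token) are skipped.
    """
    counts = defaultdict(Counter)
    for row in csv.DictReader(handle):
        value = row[column]
        if value and value != "-":
            counts[row["period"]][value] += 1
    shares = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        shares[period] = {feat: n / total for feat, n in counter.items()}
    return shares


shares = feature_shares(io.StringIO(TOY_CSV), "frame_feat")
for period, table in shares.items():
    print(period, table)
```

A feature whose share differs sharply between two periods is a candidate for closer inspection in the classifier's weight table.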

Workflow at a glance

PROIEL XML 2.0 (AthDGC partitions) -> export -> features.csv (arc / frame / morph) -> load -> LightSIDE-AthDGC (SVM / logistic / decision tree) -> inspect -> feature weights -> period contrasts (which syntactic patterns drive Archaic vs Classical, Byzantine vs Modern, etc.)

Upstream LightSIDE remains BSD-licensed and downloadable from CMU. We do not redistribute binaries; LightSIDE-AthDGC ships as a patch set + helper scripts that the user applies to a standard LightSIDE install.

NoSketch-style concordancer for AthDGC

The corpus is also available as a CWB (Corpus Workbench) index, queryable via NoSketch Engine:

[lemma="αἱρέω" & feats=".*Voice=Mid.*"] within <s/>

…returns every middle-voice instance of αἱρέω across the entire diachronic span, grouped by period.

The build script tools/build_cwb_index.sh converts the JSONL partitions into a CWB-compatible binary index. Sample queries and CQL recipes are in tools/cwb-recipes/.

Diachronic Greek Stanza checkpoints

We provide fine-tuned variants of the Stanford grc_proiel Stanza pipeline, adapted to:

  • Byzantine Greek (grc_byz_proiel)
  • Late-Byzantine + Early-Modern Greek (grc_lbem_proiel)
  • Modern Greek with classical morphological retention (grc_mod_proiel)

Release status: the checkpoints are trained on ARIS; the Hugging Face mirror at https://huggingface.co/AthDGC under athdgc/* is being set up and ships at v0.5. Training scripts are in tools/stanza-finetune/. Evaluation loss and per-tag F1 will be reported per checkpoint in the v0.5 release notes.

Cross-lingual alignment viewer

The alignment graph is stored in Neo4j and is queryable via the open-source Neo4j Browser, run against a local copy of the graph. The graph dump ships as part of the Zenodo deposition; a public hosted endpoint is planned for v0.5.

Example Cypher - find every Greek perfective-aspect active verb (Aspect=Perf, the value that covers the aorist) whose Vulgate counterpart is passive:

MATCH (gk:Token {language:"grc", upos:"VERB"})-[:TRANSLATED_AS]->(la:Token {language:"lat"})
WHERE gk.feats CONTAINS "Aspect=Perf"
  AND gk.feats CONTAINS "Voice=Act"
  AND la.feats CONTAINS "Voice=Pass"
RETURN gk.form, gk.lemma, gk.feats, la.form, la.lemma, la.feats
ORDER BY gk.work_title
LIMIT 100
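
The query's WHERE clause uses substring matching on the feats string. When post-processing exported rows in Python, the same condition can be checked on parsed key-value pairs instead. This is only a sketch: the `|` separator inside feats, the dictionary layout, and the function names are assumptions, and the token pairs are invented placeholders.

```python
def parse_feats(feats):
    """Split 'Key=Value|Key=Value' into a dict (the '|' separator is an assumption)."""
    pairs = (item.split("=", 1) for item in feats.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def is_active_to_passive(greek, latin):
    """Mirror the WHERE clause: perfective active Greek verb, passive Latin counterpart."""
    gk = parse_feats(greek["feats"])
    la = parse_feats(latin["feats"])
    return (
        greek["upos"] == "VERB"
        and gk.get("Aspect") == "Perf"
        and gk.get("Voice") == "Act"
        and la.get("Voice") == "Pass"
    )


# Invented aligned pairs: (Greek token, Latin token).
pairs = [
    ({"upos": "VERB", "feats": "Aspect=Perf|Voice=Act"}, {"feats": "Voice=Pass"}),
    ({"upos": "VERB", "feats": "Aspect=Imp|Voice=Act"}, {"feats": "Voice=Pass"}),
    ({"upos": "NOUN", "feats": "Case=Acc"}, {"feats": "Case=Acc"}),
]
matches = [is_active_to_passive(gk, la) for gk, la in pairs]
print(matches)
```

Comparing whole values avoids accidental hits when one feature value happens to be a prefix of another.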

Corpus-fix toolkit

fix_corpus_data.py (used to bulk-correct 314,048 rows across 89.9M corpus rows in v0.4) is released so other projects can adapt it for their own PROIEL-style treebanks:

  • TLG-canonical author + period override table (50+ entries)
  • Stanza-error fixes (lemma + POS for crasis, pronouns, common mis-lemmatisations)
  • Idempotent, with .bak backups
  • Reports per-file diff counts

Adapt for your corpus by extending TLG_AUTHOR, AUTHOR_PERIOD, and LEMMA_FIXES.
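
To make the idempotent-with-backup behaviour concrete, here is a minimal sketch of one fix pass over a JSONL file. It is not the code of fix_corpus_data.py: the shape of LEMMA_FIXES (a plain lemma-to-lemma mapping), the `lemma` record key, and the function name are assumptions, and the table entry is a placeholder.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Placeholder override table; the real tables live in fix_corpus_data.py.
LEMMA_FIXES = {"wrong_lemma": "right_lemma"}


def fix_file(path, lemma_fixes):
    """Apply lemma fixes to one JSONL file and return the number of changed rows."""
    path = Path(path)
    lines = path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    changed = 0
    for row in rows:
        fixed = lemma_fixes.get(row.get("lemma"))
        if fixed is not None and fixed != row["lemma"]:
            row["lemma"] = fixed
            changed += 1
    if changed:
        backup = path.with_name(path.name + ".bak")
        if not backup.exists():
            shutil.copy2(path, backup)  # keep only the first, pristine copy
        text = "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)
        path.write_text(text, encoding="utf-8")
    return changed


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(json.dumps({"form": "x", "lemma": "wrong_lemma"}) + "\n", encoding="utf-8")
    first_pass = fix_file(sample, LEMMA_FIXES)   # rewrites one row
    second_pass = fix_file(sample, LEMMA_FIXES)  # nothing left to change
    backup_exists = sample.with_name("sample.jsonl.bak").exists()

print(first_pass, second_pass, backup_exists)  # 1 0 True
```

The second pass changing nothing is what idempotent means here, and the returned count is the kind of per-file figure a diff report can aggregate.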

Argument-structure extractor

The function extract_arg_structures(sentence) (in 51_build_showcase_site.py and exported as a standalone module in tools/arg_structure.py) returns, for every VERB / AUX token:

  • subject (sub, including raised xobj / nonsub patterns)
  • direct object (obj)
  • indirect / oblique args (iobj, obl)
  • vocative addressee (voc)
  • voice (active / middle / passive)
  • aspect (perfective / imperfective)
  • tense, mood, verb-form, person, number

…as a Python dict, ready to be joined onto the cross-lingual alignment edges.
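
The join onto alignment edges can be pictured as a dictionary lookup keyed by the source token id. In the sketch below the record keys, the id format, and the function name are all hypothetical; the actual dict returned by extract_arg_structures() may be shaped differently.

```python
def join_frames_onto_edges(frames_by_token, edges):
    """Attach the source verb's frame to every alignment edge that starts at that verb.

    frames_by_token: {token_id: frame_dict}
    edges: iterable of (source_token_id, target_token_id)
    """
    joined = []
    for source_id, target_id in edges:
        frame = frames_by_token.get(source_id)
        if frame is not None:  # edges from non-verbal tokens carry no frame
            joined.append({"source": source_id, "target": target_id, **frame})
    return joined


# Hypothetical per-verb record and two alignment edges.
frames_by_token = {
    "grc-1": {"subject": "nom", "object": "acc", "voice": "active", "aspect": "perfective"},
}
edges = [("grc-1", "lat-1"), ("grc-2", "lat-2")]

joined = join_frames_onto_edges(frames_by_token, edges)
print(joined)
```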

Valency-frame database (v0.5)

Status: ingestion in progress on ARIS; first release with v0.5.

The valency-frame database collects every per-verb argument frame emitted by extract_arg_structures() across the corpus, indexes it by period + verb lemma + frame signature, and exposes it for SQL-style querying. The output of the corpus-fix audit pass feeds directly into this index, so v0.5 ships with a clean index.

Example queries supported at v0.5:

  • verb=αἱρέω, voice=middle, aspect=perfective -> frequency by period
  • frame=[sub:nom, obj:acc, obl:dat] -> all verbs that license this frame, with diachronic trend
  • verb=ἀκούω -> case distribution of its object (accusative vs genitive) across periods
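
The first query above can be pictured with an in-memory SQLite table. The schema, the column names, and the function name are assumptions made for this sketch, and the counts are invented placeholders rather than corpus results; the v0.5 database may be organised differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE frames ("
    "period TEXT, lemma TEXT, voice TEXT, aspect TEXT, frame TEXT, n INTEGER)"
)
# Invented placeholder counts.
rows = [
    ("Classical", "αἱρέω", "middle", "perfective", "[sub:nom, obj:acc]", 3),
    ("Classical", "αἱρέω", "middle", "perfective", "[sub:nom]", 2),
    ("Koine", "αἱρέω", "middle", "perfective", "[sub:nom, obj:acc]", 1),
    ("Koine", "αἱρέω", "active", "perfective", "[sub:nom, obj:acc]", 4),
]
conn.executemany("INSERT INTO frames VALUES (?, ?, ?, ?, ?, ?)", rows)


def frequency_by_period(conn, lemma, voice, aspect):
    """Sum frame counts per period for one verb in one voice/aspect combination."""
    cursor = conn.execute(
        "SELECT period, SUM(n) FROM frames "
        "WHERE lemma = ? AND voice = ? AND aspect = ? "
        "GROUP BY period ORDER BY period",
        (lemma, voice, aspect),
    )
    return dict(cursor.fetchall())


result = frequency_by_period(conn, "αἱρέω", "middle", "perfective")
print(result)
```

Parameterised placeholders keep Greek lemmas out of the SQL string itself, which sidesteps quoting problems.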

The database lives behind a small JSON-RPC interface; a Python client tools/valency_db.py will ship alongside the v0.5 release. The full SQL dump will be deposited at Zenodo as part of the v0.5 dataset record under CC-BY-4.0.

Retranslation-pair browser (v0.5)

Status: design complete; implementation queued for v0.5.

The retranslation-pair browser is the public-facing reading interface for the project's central research focus: tracking the same canonical text across periods and languages. For a given verse or sentence id, the browser displays the source-language passage and every attested retranslation (across Greek periods + IE parallels), each rendered with its own colour-coded compact PROIEL tree. Aligned tokens are visually linked across the panels.

Concretely at v0.5 the browser supports:

  • Iliad 1.1 in Homeric Greek + a Byzantine epitome paraphrase + a Modern Greek translation (a single tri-period panel)
  • NT John 1:1 in Koine Greek + Latin Vulgate + Gothic Wulfila + OCS Marianus (already shipped as a static example on samples.qmd; the v0.5 browser makes any aligned NT verse queryable in the same format)
  • Septuagint Psalm 1:1 in Koine + Byzantine paraphrase + Modern Greek
  • Any verse aligned via the existing LaBSE + AwesomeAlign pipeline

This is the tool that makes retranslation a queryable property rather than a research note.

Retelling-chain explorer (v0.5)

Status: scoping; implementation queued for v0.5.

Where the retranslation-pair browser shows the same sentence under different translations, the retelling-chain explorer follows an influential text's reception: epitomes, paraphrases, glosses, marginalia, and full-scale re-tellings of canonical sources. The explorer renders a directed reception graph (source -> retelling -> commentary -> Modern Greek paraphrase) and displays each node's PROIEL tree on demand.

Reference reception chains targeted for v0.5:

  • Iliad -> Byzantine epitomes (Tzetzes, Eustathios) -> Modern Greek translations
  • New Testament -> patristic commentaries -> Byzantine homily tradition
  • Plato's Phaedrus -> Neoplatonic reception (Proclus) -> later doxographies
  • Septuagint Psalms -> patristic exegesis -> Modern Greek liturgical use
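
A directed reception graph of this kind can be held as a plain adjacency mapping. The sketch below encodes the first chain above and walks it from the source; the data structure and the function name are assumptions about one possible representation, not the explorer's design.

```python
# Adjacency mapping: each node lists the nodes that receive it.
RECEPTION = {
    "Iliad": ["Byzantine epitomes (Tzetzes, Eustathios)"],
    "Byzantine epitomes (Tzetzes, Eustathios)": ["Modern Greek translations"],
    "Modern Greek translations": [],
}


def reception_chain(graph, source):
    """Return every node reachable from source, in depth-first visit order."""
    order, seen, stack = [], set(), [source]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against loops in hand-entered data
        seen.add(node)
        order.append(node)
        # Push successors reversed so they are visited in their listed order.
        stack.extend(reversed(graph.get(node, [])))
    return order


chain = reception_chain(RECEPTION, "Iliad")
print(" -> ".join(chain))
```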

This is the tool that makes retelling a navigable network rather than a literary commonplace.

AthDGC PROIEL validator (v0.5)

Status: ships at v0.5 alongside the public corpus release.

A schema-strict + relation-inventory-strict linter for PROIEL XML 2.0 contributions. Checks:

  • XML well-formedness against the PROIEL XML 2.0 schema
  • relation labels drawn from the PROIEL inventory only (no UD leakage: nsubj, dobj, nmod, amod, case, etc. flagged as errors)
  • head-id integrity (every head-id resolves to a token in the same sentence; no cycles; one root per sentence with empty head-id)
  • morphological-tag well-formedness (10-character PROIEL morphology code, position-by-position legal value)
  • AthDGC house rules: no AI-marker vocabulary in <note> elements; affiliation block exact match in <source> headers
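
Three of these checks (UD-label leakage, head-id integrity, morphology-tag length) can be sketched with the standard library. This is not tools/proiel_validate.py: the function name is hypothetical, the UD list holds only the labels named above, per-position legality of the morphology code is not checked, and the sample sentences use placeholder forms.

```python
import xml.etree.ElementTree as ET

UD_LABELS = {"nsubj", "dobj", "nmod", "amod", "case"}  # only the labels named above


def lint_sentence(sentence):
    """Return a list of error strings for one <sentence> element."""
    errors = []
    tokens = sentence.findall("token")
    heads = {tok.get("id"): tok.get("head-id") for tok in tokens}
    roots = [tid for tid, head in heads.items() if not head]
    if len(roots) != 1:
        errors.append(f"expected exactly one root, found {len(roots)}")
    for tok in tokens:
        tid = tok.get("id")
        if tok.get("relation") in UD_LABELS:
            errors.append(f"token {tid}: UD label {tok.get('relation')!r}")
        morph = tok.get("morphology") or ""
        if len(morph) != 10:
            errors.append(f"token {tid}: morphology has {len(morph)} characters, expected 10")
        head = heads[tid]
        if head and head not in heads:
            errors.append(f"token {tid}: head-id {head} not in sentence")
        # Follow the head chain upward; revisiting a token means a cycle.
        seen, current = {tid}, head
        while current and current in heads:
            if current in seen:
                errors.append(f"token {tid}: cycle through {current}")
                break
            seen.add(current)
            current = heads[current]
    return errors


GOOD = (
    '<sentence id="1">'
    '<token id="1" form="a" relation="obj" head-id="2" morphology="-s---fa--i"/>'
    '<token id="2" form="b" relation="pred" morphology="2spma----i"/>'
    "</sentence>"
)
BAD = (
    '<sentence id="2">'
    '<token id="1" form="a" relation="nsubj" head-id="9" morphology="short"/>'
    '<token id="2" form="b" relation="pred" morphology="2spma----i"/>'
    "</sentence>"
)
good_errors = lint_sentence(ET.fromstring(GOOD))
bad_errors = lint_sentence(ET.fromstring(BAD))
print(good_errors)
print(bad_errors)
```

Returning a list of messages rather than raising on the first problem lets a pre-commit hook show a contributor everything to fix in one run.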

Ships as tools/proiel_validate.py (CLI) and a pre-commit hook for community PRs.

Showcase generator

51_build_showcase_site.py - the script that builds https://athdgc.github.io - is generic: point it at any directory of PROIEL XML 2.0 JSONL partitions and it generates a browsable showcase with per-period tabs, PROIEL dependency trees, argument-structure cards, and a review toolbar. PROIEL only - no CoNLL-U pathway.

The script lives under tools/showcase/; reuse it for any historical-treebank project.

Quarto template pack

The site you are reading is generated from a Quarto pack (athdgc-quarto) that is itself open source. It produces, from one source per page:

  • HTML website
  • PDF article (via xetex)
  • .docx for journal-submission portals
  • Reveal.js slides for the browser
  • Beamer PDF slides
  • .pptx PowerPoint slides
  • A0 Typst posters

Adapt it for any academic project by editing _quarto.yml + themes/lavidas.scss.

Get the tools

git clone https://github.com/AthDGC/Diachronic-Linguistics-Platform
cd Diachronic-Linguistics-Platform
ls tools/   # all the modules described above

License: Apache-2.0 (except where upstream tools require otherwise; see each tool's LICENSE).

Open source: license matrix + repo map

Every AthDGC code module ships under an OSI-approved licence; the CQL recipes and the corpus itself ship under CC-BY-4.0 (samples shown publicly; full partitions released at v0.5). The matrix below is the canonical reference.

Module | Licence | Upstream constraint | Code path
--- | --- | --- | ---
tools/lightside-athdgc/ | BSD-3-Clause + Apache-2.0 | Upstream LightSIDE is BSD-3 | Diachronic-Linguistics-Platform/tools/lightside-athdgc/
tools/athdgc_to_lightside_syntax.py | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/
tools/athdgc_to_lightside.py | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/
tools/proiel_export/ | Apache-2.0 | PROIEL XML 2.0 schema by Oslo | Diachronic-Linguistics-Platform/tools/proiel_export/
tools/proiel_validate.py (v0.5) | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/
tools/arg_structure.py | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/
tools/fix_corpus_data.py | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/
tools/valency_db.py (v0.5) | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/
tools/stanza-finetune/ | Apache-2.0 | Stanford Stanza is Apache-2.0 | Diachronic-Linguistics-Platform/tools/stanza-finetune/
tools/build_cwb_index.sh | GPL-3.0 | Upstream CWB is GPL | Diachronic-Linguistics-Platform/tools/
tools/cwb-recipes/ | CC-BY-4.0 | - | Diachronic-Linguistics-Platform/tools/cwb-recipes/
tools/showcase/ (51_build_showcase_site.py) | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/showcase/
athdgc-quarto/ (this site) | MIT | - | Diachronic-Linguistics-Platform/athdgc-quarto/
tools/retranslation-browser/ (v0.5) | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/retranslation-browser/
tools/retelling-explorer/ (v0.5) | Apache-2.0 | - | Diachronic-Linguistics-Platform/tools/retelling-explorer/
tools/neo4j-align-viewer/ | MIT | Neo4j Browser is APL-2.0 | Diachronic-Linguistics-Platform/tools/neo4j-align-viewer/
Stanza checkpoints athdgc/grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel | Apache-2.0 | - | Hugging Face mirror forthcoming at https://huggingface.co/AthDGC (v0.5)
Corpus samples (this site) | CC-BY-4.0 | - | Zenodo: 10.5281/zenodo.20439182
Corpus partitions (v0.5 onwards) | CC-BY-4.0 | - | Zenodo: per-version DOI

One-line installs

pip install athdgc-tools resolves today (stub release 0.4.0.dev0); the stub reserves the name and points users at the source repo. The full toolkit replaces the stub at v0.5. The Hugging Face model repos at AthDGC/grc_byz_proiel, AthDGC/grc_lbem_proiel, and AthDGC/grc_mod_proiel exist with v0.5-forthcoming READMEs; the weights will be uploaded when v0.5 ships.

# core toolkit
pip install athdgc-tools                    # Apache-2.0, Python 3.11+ (stub until v0.5)

# Stanza checkpoints (weights ship at v0.5; the model repos exist now)
python -c "import stanza; stanza.download('grc', package='athdgc_byz_proiel')"
python -c "import stanza; stanza.download('grc', package='athdgc_lbem_proiel')"
python -c "import stanza; stanza.download('grc', package='athdgc_mod_proiel')"

# LightSIDE-AthDGC patch (on top of an existing LightSIDE install)
git clone https://github.com/AthDGC/Diachronic-Linguistics-Platform
cd Diachronic-Linguistics-Platform/tools/lightside-athdgc
./apply_patches.sh /path/to/your/LightSIDE

Where it all lives

Repository structure

Diachronic-Linguistics-Platform/
|-- LICENSE                            Apache-2.0 (root)
|-- tools/
|   |-- LICENSE-LightSIDE              BSD-3-Clause (upstream)
|   |-- LICENSE-CWB                    GPL-3.0 (upstream)
|   |-- lightside-athdgc/              syntactic-feature plugin
|   |-- athdgc_to_lightside.py         text-feature export
|   |-- athdgc_to_lightside_syntax.py  syntactic-feature export
|   |-- proiel_export/                 Stanza output -> PROIEL XML 2.0
|   |-- proiel_validate.py             v0.5 schema linter
|   |-- arg_structure.py               valency-frame extractor
|   |-- valency_db.py                  v0.5 query client
|   |-- fix_corpus_data.py             314,048-row audit-fix
|   |-- stanza-finetune/               diachronic Stanza checkpoints
|   |-- build_cwb_index.sh             NoSketch-compatible index builder
|   |-- cwb-recipes/                   sample CQL queries
|   |-- showcase/                      site generator
|   |-- neo4j-align-viewer/            cross-lingual graph viewer
|   |-- retranslation-browser/         v0.5 - retranslation UI
|   `-- retelling-explorer/            v0.5 - reception graph UI
|-- athdgc-quarto/                     this website
|-- corpus-samples/                    JSONL samples (CC-BY-4.0)
`-- docs/                              house-style, contributing, code-of-conduct

How to contribute

Community PRs are welcome on every module above. Two gates run on every PR:

  1. PROIEL XML 2.0 validator - tools/proiel_validate.py runs as a pre-commit hook and a GitHub Action. PRs that introduce UD-style labels (nsubj, dobj, nmod, case, etc.) fail the build.
  2. Lavidasised house-style check - tools/style_check.py scans .qmd and .md for AI-marker vocabulary and em-dashes. PRs with leverage, seamlessly, it is worth noting that, unparalleled, etc. fail the build.
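
The second gate can be approximated in a few lines. The sketch below is not tools/style_check.py: the banned list contains only the four phrases named above (the real list is presumably longer), and the function name and return shape are assumptions.

```python
import re

BANNED = ["leverage", "seamlessly", "it is worth noting that", "unparalleled"]
EM_DASH = "\u2014"  # written as an escape so this file itself passes the check


def style_violations(text):
    """Return (line_number, what) pairs for banned phrases and em-dashes."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for phrase in BANNED:
            # Word boundaries keep 'leverage' from matching inside longer words.
            if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
                hits.append((lineno, phrase))
        if EM_DASH in line:
            hits.append((lineno, "em-dash"))
    return hits


sample = "We leverage PROIEL arcs.\nA clean line.\nA pause \u2014 then more."
hits = style_violations(sample)
print(hits)
```

A check this literal also flags legitimate uses (for example leverage as a noun in a quoted source), so a real gate would likely need an allow-list or an inline override.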

For larger changes (a new IE parallel, a new period subcorpus, a new tool module), open an issue first with a short proposal; the team will assign a reviewer.

What is not open

The audit-fix tooling logs (per-author error counts, per-Stanza-checkpoint diff statistics) are internal until the v0.5 release notes ship. Funding-side materials and draft grant applications stay in the private repo. Per the project's data-management policy, raw OCR scans of secondary editions that are still in copyright are not republished; only the PROIEL annotation layer on top of them is released.

Citation

If you use any AthDGC tool in your research, please cite the corpus DOI and the tool explicitly:

@dataset{lavidas2026athdgc,
  author    = {Lavidas, Nikolaos and Nikiforidou, Kiki and Haug, Dag and Kulikov, Leonid and Geka, Vassiliki and Symeonidis, Vassileios and Michalareas, Theodoros and Chionidi, Sofia and Tsiropina, Anastasia and Plakoutsi, Eleni and Argyropoulos, Evangelos and {Athens Digital Glossa Chronos Research Network}},
  title     = {{AthDGC}: Athens Diachronic Glossa Chronos ({Athens-PROIEL})},
  year      = {2026},
  publisher = {Zenodo},
  version   = {v0.4.0},
  doi       = {10.5281/zenodo.20439182}
}

Roadmap

  • Argument-structure typology workbench (in development): joint cross-lingual visualisation of argument-structure shifts under retranslation
  • Diachronic syntactic-change detector: change-point analysis over PROIEL feature frequencies per century
  • Browser-based PROIEL tree editor: lightweight in-page editor that produces validated PROIEL XML 2.0 patches (PR-friendly diffs that the Oslo PROIEL canon can adopt)
  • Cross-lingual valency-shift detector: detect changes in argument-frame distribution between retranslations of the same canonical text across periods and languages
  • AthDGC PROIEL validator: schema-strict + relation-inventory-strict linter for community contributions

Funding

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan. Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.