Computational Tools for Diachronic Linguistics

The AthDGC toolkit: open-source, lightweight, classroom-ready

For all parts of this platform (site, tools, corpus, slides, launch report) please cite:

Lavidas, Nikolaos, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, and Evangelos Argyropoulos. 2026. AthDGC: Athens Digital Glossa Chronos (A platform of computational tools and a diachronic Greek treebank, with Indo-European parallels). Athens: National and Kapodistrian University of Athens. Zenodo (an open research-data repository hosted at CERN that mints permanent DOIs). https://doi.org/10.5281/zenodo.20439182

The AthDGC platform provides two open deliverables in parallel: this computational toolkit for diachronic linguistics and a diachronic Greek corpus in PROIEL XML 2.0. PROIEL is the dependency-treebank standard for early Indo-European languages, developed at Oslo (a treebank is a collection of sentences whose grammatical structure has been analysed and stored). PROIEL XML 2.0 is its file format, which stores each sentence as a tree of word-by-word grammatical relations. The toolkit is designed to work on the AthDGC corpus and on any other PROIEL XML 2.0 historical treebank. AthDGC publishes PROIEL XML 2.0 only: no CoNLL-U, no UD-flavoured export.

All tools are open source (Apache-2.0 (a permissive open-source licence widely used for software) / MIT (a short permissive open-source licence) / GPL where the upstream license requires). All are documented for classroom use, from undergraduate to PhD level, so digital-humanities courses can adopt them without per-seat licensing.

At a glance

The platform provides modules in three readiness tiers: LIVE (clone and run today), IN SETUP (works internally; public install path lands at v0.5), and FORTHCOMING (designed; implementation lands at v0.5). The matrix below is the canonical reference; the per-tool sections that follow describe each module in detail.

Tool readiness (what runs today, what is released at v0.5)

The AthDGC platform uses three distinct status classes for its tools, and the public page is committed to telling you which class every module is in. Confidence here matters more than completeness.

LIVE: code is in the public source repo, the install path resolves, and a clean run produces the documented output. You can clone and run it now.
IN SETUP: code exists in the project's internal working tree (athdgc-quarto/tools/ or ARIS) and is being migrated to the public source repo for the v0.5 unified tools/ directory. The module works internally; only the public install path is not yet wired.
FORTHCOMING (v0.5): design is described; implementation lands at v0.5.

Status as of the last redeploy:

Module	Status	Where it lives now
`athdgc-quarto/` (this site)	LIVE	public repo `Diachronic-Linguistics-Platform/athdgc-quarto/`
`glossacontactlab-quarto/` (Working Papers)	LIVE	public repo `glossacontactlab/`
`51_build_showcase_site.py` (showcase generator)	LIVE	public repo root
`fix_corpus_data.py` (corpus-fix toolkit)	LIVE	public repo root
`46_to_proiel_xml.py` (Stanza → PROIEL XML 2.0)	LIVE	public repo root
`47_annotate_inbox.sbatch` (Stanza on A100)	LIVE	public repo root
`tools/proiel_validate.py` (Gate G6 validator)	IN SETUP	drafted in `athdgc-quarto/tools/`; migrates to public `tools/` at v0.5
`tools/style_check.py` (Gate G7 style check)	IN SETUP	drafted in `athdgc-quarto/tools/`; migrates to public `tools/` at v0.5
`tools/lightside-athdgc/` (syntactic-feature plugin)	FORTHCOMING (v0.5)	design + workflow diagram on tools.qmd; patch set lands at v0.5
`tools/athdgc_to_lightside.py`	FORTHCOMING (v0.5)	design described; lands at v0.5
`tools/athdgc_to_lightside_syntax.py`	FORTHCOMING (v0.5)	design described; lands at v0.5
`tools/proiel_export/` (Stanza → PROIEL XML)	IN SETUP	logic exists inside the build workflow; extracted as a standalone module at v0.5
`tools/arg_structure.py` (argument-frame extractor)	IN SETUP	logic exists inside the build workflow; extracted at v0.5
`tools/build_cwb_index.sh` (NoSketch index)	FORTHCOMING (v0.5)	scripted at v0.5
`tools/cwb-recipes/` (sample CQL queries)	FORTHCOMING (v0.5)	written at v0.5
`tools/stanza-finetune/` (training scripts)	IN SETUP	training runs on ARIS; scripts copied to public repo at v0.5
`tools/neo4j-align-viewer/` (alignment viewer)	FORTHCOMING (v0.5)	Cypher recipes ready; viewer wrapper at v0.5
`tools/valency_db.py` (valency-frame DB client)	FORTHCOMING (v0.5)	tagged v0.5 on the public page already
`tools/retranslation-browser/`	FORTHCOMING (v0.5)	tagged v0.5
`tools/retelling-explorer/`	FORTHCOMING (v0.5)	tagged v0.5
`athdgc-tools` on PyPI	IN SETUP	name reserved at `0.4.0.dev0`; full toolkit replaces the stub at v0.5
Hugging Face model repos `AthDGC/grc_*_proiel`	IN SETUP	private during the v0.5 audit pass; public at the v0.5 release with weights
Zenodo concept DOI `10.5281/zenodo.20439182`	LIVE	v0.4.0 source-code snapshot minted 2026-05-29
GitHub `AthDGC/Diachronic-Linguistics-Platform`	IN SETUP	source repository, private during the v0.5 audit pass; public at the v0.5 release
GitHub `AthDGC/AthDGC.github.io`	LIVE	public, deployed showcase

In the matrix above, eight rows are LIVE today, eight are IN SETUP (working in-house; public install path lands at v0.5), and nine are FORTHCOMING. The v0.5 release closes every IN SETUP row into a LIVE row by migrating the code into a unified public tools/ directory and publishing the weights and the package's real contents.

Live annotation interface

Beyond the tooling described on this page, the project operates a hosted PROIEL XML 2.0 annotation web platform at the National and Kapodistrian University of Athens, on which the AthDGC team performs and reviews the manual layer of PROIEL annotations on top of the Stanford Stanza pre-annotation pass.

PROIEL annotation interface (NKUA, team-access): https://dialing.enl.uoa.gr/proiel/
AthDGC corpus repository (closed-access GitHub, team-access during the v0.4 to v0.5 audit pass): https://github.com/AthDGC/athdgc-corpus
Per-language unique-identifier and metadata register (Google Sheets, one sheet per language, restricted): https://docs.google.com/spreadsheets/d/1MiXcxAedaHgdnj62q-zTQsS3Q4iGTyQU/edit?gid=997932237

What is described below is the full vision of the toolkit. Where a module is not yet on the LIVE row of the matrix above, the section text describes the design as it will be released; the prose does not claim that the module is already installable today.

LightSIDE (text-mining workbench)

LightSIDE is the open-source text-mining workbench developed by Carolyn Rosé's group at Carnegie Mellon. It pairs naturally with a PROIEL-style treebank because it lets you:

extract features from annotated text (n-grams (sequences of N consecutive words or characters used as features for classification), POS-tag patterns, syntactic dependencies)
train classifiers (SVM (Support Vector Machine, a standard text-classification algorithm), logistic regression (a standard linear classification algorithm), decision tree (a classification algorithm that splits the data along feature thresholds)) on diachronic period labels
inspect what features each model relies on (i.e. which linguistic patterns let you tell Classical from Koine, or Byzantine from Modern, apart)
export the trained models for reuse in research or pedagogy

AthDGC + LightSIDE workflow

Export an AthDGC slice to LightSIDE's tab-separated input format with tools/athdgc_to_lightside.py.
Open LightSIDE, load the .csv, configure features (POS unigrams, argument-structure frames, lemma trigrams, etc.).
Train a classifier on the period column.
Inspect the most predictive features per period: which morphological patterns survive into Modern Greek, and which die out at the Byzantine boundary?
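The export step of the workflow above can be sketched as follows. This is a minimal illustration, not the released tools/athdgc_to_lightside.py: the function name export_slice, the JSONL field names (period, tokens, form, pos), and the output columns are all assumptions made for the example. The delimiter is a parameter so the same sketch can write tab-separated or comma-separated output.

```python
import csv
import io
import json

def export_slice(jsonl_lines, out, delimiter="\t"):
    """Write one row per sentence: period label, surface text, POS sequence.

    Assumed input: one JSON object per line with a "period" string and a
    "tokens" list whose items carry "form" and "pos" (hypothetical schema).
    """
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["period", "text", "pos_seq"])
    n_rows = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        sent = json.loads(line)
        tokens = sent["tokens"]
        writer.writerow([
            sent["period"],
            " ".join(t["form"] for t in tokens),
            " ".join(t["pos"] for t in tokens),
        ])
        n_rows += 1
    return n_rows

# One-sentence demonstration slice (the opening words of the Iliad).
sample = [json.dumps({
    "period": "Archaic",
    "tokens": [{"form": "μῆνιν", "pos": "Nb"}, {"form": "ἄειδε", "pos": "V"}],
}, ensure_ascii=False)]

buf = io.StringIO()
rows_written = export_slice(sample, buf)
print(buf.getvalue())
```

Keeping the period label in its own column is what lets the classifier in the next step treat it as the class to predict.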

We provide:

tools/athdgc_to_lightside.py, export script
tools/lightside-recipes/, worked examples (Archaic-vs-Classical, NT-vs-Byzantine, etc.)
slides/lightside_workshop.qmd, a one-hour workshop slide deck for classroom use

(LightSIDE itself remains BSD-licensed and downloadable from CMU. We do not redistribute the binaries, only the AthDGC-side adapters + tutorials.)

LightSIDE-AthDGC for syntax (Lavidas extension)

The standard LightSIDE workbench operates on word-level text features (n-grams, character n-grams, simple POS counts). The LightSIDE-AthDGC fork, developed by N. Lavidas, extends LightSIDE so it operates on PROIEL syntactic features, not just on text. This is the toolkit's signal contribution to diachronic syntactic research.

What LightSIDE-AthDGC adds:

Dependency-arc features. Each (head-relation-dependent) triple in a PROIEL parse becomes a feature, e.g. obj:V-N, atr:N-N, obl:V-N+prep. The classifier can learn that obj:V-N with acc.sg.f is over-represented in Archaic and under-represented in Modern, etc.
Argument-structure-frame features. For every VERB / AUX token, the full valency frame (the list of arguments a verb takes, e.g. subject + direct object + oblique) is emitted as a feature: pred[sub:nom, obj:acc, voc:voc], pred[sub:nom, obj:acc, obl:dat], etc. Period classifiers built on frame features distinguish Greek diachronic stages with markedly better accuracy than text-only models.
Morphology-bundle features. The PROIEL 10-character morphological tag becomes a feature bundle (case + gender + number + tense + mood + voice + aspect). Selectable via the LightSIDE feature panel.
Voice / aspect / mood cross-tabulations. Joint features like voice=mid,aspect=perf,mood=ind enable detection of construction-level diachronic shifts (e.g. middle-perfect retention).
Relation-co-occurrence features. Pairs of relations sharing a head (e.g. sub + obj + voc on the same verb) become single features for valency-pattern classification.
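To make two of the feature families above concrete, here is a minimal sketch of how dependency-arc and relation-co-occurrence features could be derived from one parsed sentence. It is an illustration under assumptions, not the plugin's code: the token dictionaries (id, head, relation, pos), the function names, and the choice to abbreviate the part of speech to its first letter are all hypothetical.

```python
from collections import defaultdict

def arc_features(tokens):
    """Emit one 'relation:HEADPOS-DEPPOS' string per token, or ROOT."""
    by_id = {t["id"]: t for t in tokens}
    feats = []
    for t in tokens:
        head = by_id.get(t["head"])
        if head is None:
            feats.append("ROOT")  # no head inside the sentence
            continue
        feats.append(f'{t["relation"]}:{head["pos"][0]}-{t["pos"][0]}')
    return feats

def cooccurrence_features(tokens):
    """Join the relations of all dependents that share a head."""
    deps = defaultdict(list)
    for t in tokens:
        if t["head"] is not None:
            deps[t["head"]].append(t["relation"])
    # Only heads with at least two dependents yield a co-occurrence feature.
    return {head: "+".join(sorted(rels))
            for head, rels in deps.items() if len(rels) > 1}

# μῆνιν ἄειδε θεά, with the verb as root.
sentence = [
    {"id": 1, "form": "μῆνιν", "pos": "Nb", "head": 2, "relation": "obj"},
    {"id": 2, "form": "ἄειδε", "pos": "V", "head": None, "relation": "pred"},
    {"id": 3, "form": "θεά", "pos": "Nb", "head": 2, "relation": "voc"},
]
arcs = arc_features(sentence)
cooc = cooccurrence_features(sentence)
print(arcs, cooc)
```

Each string then becomes one column (or one sparse feature) for the classifier, in the same way an n-gram would in a text-only model.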

Why this matters

Standard text-mining tools are blind to syntactic structure. A bag-of-words classifier might distinguish Classical from Koine on lexical evidence, but it cannot reveal which syntactic patterns survive into Modern Greek and which die out at the Byzantine boundary. LightSIDE-AthDGC closes that gap by exposing the PROIEL dependency arcs (the head-to-dependent links between words in a parsed sentence), argument-structure frames, and morphology bundles as native LightSIDE features.

LightSIDE-AthDGC workflow

Export an AthDGC slice to LightSIDE-AthDGC's tab-separated input format with tools/athdgc_to_lightside_syntax.py. Each row carries token features plus syntactic features (head id, relation, parent relation, dependency-arc triple, valency frame).
Open LightSIDE-AthDGC, load the .csv, and select syntactic-feature panels: Dependency arcs, Argument frames, Morphology bundle (a compact code that records case, gender, number, tense, mood, voice, aspect for one word), Voice/aspect/mood, Relation co-occurrence.
Train SVM / logistic / decision-tree classifiers on the period column (Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern).
Inspect the per-feature weights to find which syntactic patterns drive the classifier's decision per period boundary. Export feature-importance tables for the data paper or a follow-on linguistic-change study.

What we release (open source, on GitHub)

The LightSIDE-AthDGC fork lives at https://github.com/AthDGC/Diachronic-Linguistics-Platform/tree/main/tools/lightside-athdgc (Apache-2.0 for the AthDGC adapters; BSD as inherited from the upstream LightSIDE licence). What's in the directory:

tools/athdgc_to_lightside_syntax.py, the syntactic-feature export script.
tools/lightside-athdgc/, the LightSIDE fork patches: feature plugins for PROIEL arcs, frames, and morphology.
tools/lightside-recipes-syntax/, worked classroom examples (Archaic-vs-Classical-on-arcs, Koine-vs-Byzantine-on-frames, etc.).
a one-hour workshop slide deck for KEDIVIM and conference use.

Worked example, a sample of the syntactic-feature CSV

Running tools/athdgc_to_lightside_syntax.py --in samples.jsonl --out features.csv on a small slice produces a CSV like this (truncated to 4 rows, 8 columns; the real export has ~40 columns for the full PROIEL feature space):

period          token       lemma       pos       morph             arc_feat                 frame_feat              voice_aspect_mood
Archaic         ἄειδε       ἀείδω       V         2sg.impv.act      ROOT                     pred[obj:acc,voc:voc]   act/impf/imp
Archaic         μῆνιν       μῆνις       Nb        acc.sg.f          obj:V-N+acc.sg.f         -                       -
Classical       οἶδα        οἶδα        V         1sg.prs.act       ROOT                     pred[xcomp:clausal]     act/perf/ind
Koine           ἐγένετο     γίγνομαι    V         3sg.aor.ind.mid   ROOT                     pred[sub:nom,xobj:nom]  mid/aor/ind

Load features.csv into LightSIDE-AthDGC, switch to the Diachronic syntactic features panel, train an SVM on the period column, and inspect which arc_feat and frame_feat strings are over-represented per period. The export script, sample CSV, and worked tutorial will all be released under tools/lightside-recipes-syntax/ on the public repo.
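Before training anything, a quick descriptive pass over the exported rows can already show which feature strings dominate each period. The sketch below is an assumption-laden illustration: the function name feature_shares is hypothetical, and the rows simply repeat the four sample rows shown above. It reports raw within-period proportions, which is not the same thing as the classifier's learned weights.

```python
from collections import Counter, defaultdict

def feature_shares(rows, column):
    """Per period, the share of each value found in the given column."""
    counts = defaultdict(Counter)
    for row in rows:
        value = row[column]
        if value != "-":  # "-" marks an empty cell in the sample
            counts[row["period"]][value] += 1
    shares = {}
    for period, counter in counts.items():
        total = sum(counter.values())
        shares[period] = {feat: n / total for feat, n in counter.items()}
    return shares

# The four sample rows from the table above, reduced to three columns.
rows = [
    {"period": "Archaic", "arc_feat": "ROOT", "frame_feat": "pred[obj:acc,voc:voc]"},
    {"period": "Archaic", "arc_feat": "obj:V-N+acc.sg.f", "frame_feat": "-"},
    {"period": "Classical", "arc_feat": "ROOT", "frame_feat": "pred[xcomp:clausal]"},
    {"period": "Koine", "arc_feat": "ROOT", "frame_feat": "pred[sub:nom,xobj:nom]"},
]
arc_shares = feature_shares(rows, "arc_feat")
frame_shares = feature_shares(rows, "frame_feat")
print(arc_shares["Archaic"])
```

On a real export the same table, computed per period, gives a sanity check against which to read the per-feature weights that LightSIDE-AthDGC reports.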

Workflow at a glance

Upstream LightSIDE remains BSD-licensed and downloadable from CMU. We do not redistribute binaries; LightSIDE-AthDGC is released as a patch set + helper scripts that the user applies to a standard LightSIDE install.

NoSketch-style concordancer for AthDGC

The corpus is also available as a CWB (Corpus Workbench) index, queryable via NoSketch Engine:

[lemma="αἱρέω" & feats=".*Voice=Mid.*"] within <s/>

…returns every middle-voice instance of αἱρέω across the entire diachronic span, grouped by period.

The build script tools/build_cwb_index.sh converts the JSONL partitions into a CWB-compatible binary index. Sample queries and CQL recipes are in tools/cwb-recipes/.
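CWB reads "verticalised" text: one token per line with tab-separated positional attributes, wrapped in XML-style tags for structural units such as sentences. The conversion from JSONL to that shape could look roughly like the sketch below. It does not reproduce tools/build_cwb_index.sh; the function name to_vertical and the JSONL field names (id, period, tokens, form, lemma, feats) are assumptions chosen to match the attributes used in the query above.

```python
from xml.sax.saxutils import quoteattr

def to_vertical(sentences):
    """Render sentences as CWB vertical text: <s> tags around token lines."""
    lines = []
    for sent in sentences:
        # quoteattr adds the surrounding quotes and escapes special characters.
        lines.append(
            f'<s id={quoteattr(str(sent["id"]))} period={quoteattr(sent["period"])}>'
        )
        for tok in sent["tokens"]:
            # Column order: word form, lemma, morphological features.
            lines.append("\t".join([tok["form"], tok["lemma"], tok["feats"]]))
        lines.append("</s>")
    return "\n".join(lines) + "\n"

# A one-token sentence with a shortened, illustrative feature string.
demo = [{
    "id": "s1",
    "period": "Archaic",
    "tokens": [{"form": "εἵλετο", "lemma": "αἱρέω", "feats": "Mood=Ind|Voice=Mid"}],
}]
vertical = to_vertical(demo)
print(vertical)
```

Carrying the period as an attribute on the sentence tag is one way to make period-wise grouping of hits possible at query time.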

Diachronic Greek Stanza checkpoints

We provide fine-tuned variants of the Stanford grc_proiel Stanza processor, adapted to:

Byzantine Greek (grc_byz_proiel)
Late-Byzantine + Early-Modern Greek (grc_lbem_proiel)
Modern Greek with classical morphological retention (grc_mod_proiel)

Release status: trained on ARIS (the Greek national high-performance computing cluster, run by GRNET). The Hugging Face (a public hosting platform for machine-learning models and datasets) mirror at https://huggingface.co/AthDGC under athdgc/* is in setup and will be released at v0.5. Training scripts are in tools/stanza-finetune/. Eval loss and per-tag F1 will be reported per checkpoint (a saved snapshot of a trained model) with the v0.5 release notes.

Cross-lingual alignment viewer

The alignment graph records which token in language A corresponds to which token in language B. It is stored in Neo4j (a graph database that stores data as nodes and connections rather than as tables) and is queryable via the open-source Neo4j Browser loaded against a local copy of the graph. The graph dump is released as part of the Zenodo deposition; a public hosted endpoint is planned for v0.5.

Example Cypher, find every Greek aorist active verb whose Vulgate counterpart is passive:

MATCH (gk:Token {language:"grc", upos:"VERB"})-[:TRANSLATED_AS]->(la:Token {language:"lat"})
WHERE gk.feats CONTAINS "Aspect=Perf"
  AND gk.feats CONTAINS "Voice=Act"
  AND la.feats CONTAINS "Voice=Pass"
RETURN gk.form, gk.lemma, gk.feats, la.form, la.lemma, la.feats
ORDER BY gk.work_title
LIMIT 100
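For readers who export aligned token pairs and want to post-process them outside Neo4j, the same WHERE conditions can be mirrored in plain Python. This is only a sketch under assumptions: the pair layout (two dictionaries with form, upos, and feats) and the function names are hypothetical, and the token forms are placeholders rather than corpus data. Unlike the substring test in the query, it parses the feature string into key/value pairs and compares exact values.

```python
def parse_feats(feats):
    """Turn 'Aspect=Perf|Voice=Act' into {'Aspect': 'Perf', 'Voice': 'Act'}."""
    return dict(pair.split("=", 1) for pair in feats.split("|") if "=" in pair)

def active_to_passive(pairs):
    """Keep pairs where a Greek Aspect=Perf active verb aligns to a passive."""
    hits = []
    for gk, la in pairs:
        gk_feats = parse_feats(gk["feats"])
        la_feats = parse_feats(la["feats"])
        if (gk["upos"] == "VERB"
                and gk_feats.get("Aspect") == "Perf"
                and gk_feats.get("Voice") == "Act"
                and la_feats.get("Voice") == "Pass"):
            hits.append((gk["form"], la["form"]))
    return hits

# Invented placeholder pairs for illustration only, not corpus data.
pairs = [
    ({"form": "gk_verb_1", "upos": "VERB", "feats": "Aspect=Perf|Voice=Act"},
     {"form": "la_verb_1", "feats": "Voice=Pass"}),
    ({"form": "gk_verb_2", "upos": "VERB", "feats": "Aspect=Imp|Voice=Act"},
     {"form": "la_verb_2", "feats": "Voice=Pass"}),
    ({"form": "gk_verb_3", "upos": "VERB", "feats": "Aspect=Perf|Voice=Act"},
     {"form": "la_verb_3", "feats": "Voice=Act"}),
]
hits = active_to_passive(pairs)
print(hits)
```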

Corpus-fix toolkit

fix_corpus_data.py (used for the v0.4 corpus-fix audit pass; per-error-class counts will be included with the v0.5 release notes) is released so other projects can adapt it for their own PROIEL-style treebanks:

TLG-canonical author + period override table (50+ entries)
Stanza-error fixes (lemma + POS for crasis, pronouns, common mis-lemmatisations)
Idempotent, with .bak backups
Reports per-file diff counts

Adapt for your corpus by extending TLG_AUTHOR, AUTHOR_PERIOD, and LEMMA_FIXES.
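The two properties listed above, idempotence and .bak backups, can be illustrated with a small sketch. It is not the code of fix_corpus_data.py: the shape of LEMMA_FIXES (a mapping from form and wrong lemma to corrected lemma), its single entry, the JSONL layout, and the function name fix_file are all assumptions made for the example.

```python
import json
import shutil
import tempfile
from pathlib import Path

# Hypothetical entry and key shape; the real table lives in fix_corpus_data.py.
LEMMA_FIXES = {("οἶδα", "εἴδω"): "οἶδα"}

def fix_file(path, fixes=LEMMA_FIXES):
    """Apply lemma fixes to one JSONL file; return how many tokens changed."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    changed = 0
    for rec in records:
        for tok in rec["tokens"]:
            new = fixes.get((tok["form"], tok["lemma"]))
            if new is not None and new != tok["lemma"]:
                tok["lemma"] = new
                changed += 1
    if changed:
        backup = Path(str(path) + ".bak")
        if not backup.exists():  # keep the first, untouched copy
            shutil.copy2(path, backup)
        path.write_text(
            "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records),
            encoding="utf-8",
        )
    return changed

# Run twice on a temporary file: the second pass should find nothing to do.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "part.jsonl"
    record = {"tokens": [{"form": "οἶδα", "lemma": "εἴδω"}]}
    sample.write_text(json.dumps(record, ensure_ascii=False) + "\n", encoding="utf-8")
    first_run = fix_file(sample)
    second_run = fix_file(sample)
    backup_exists = Path(str(sample) + ".bak").exists()

print(first_run, second_run, backup_exists)
```

Keying each fix on the wrong lemma as well as the form is what makes a second run a no-op: once a token is corrected, it no longer matches any entry.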

Argument-structure extractor

The function extract_arg_structures(sentence) (in 51_build_showcase_site.py and exported as a standalone module in tools/arg_structure.py) returns, for every VERB / AUX token:

subject (sub, including raised xobj / nonsub patterns)
direct object (obj)
indirect / oblique args (iobj, obl)
vocative addressee (voc)
voice (active / middle / passive)
aspect (perfective / imperfective)
tense, mood, verb-form, person, number

…as a Python dict, ready to be joined onto the cross-lingual alignment edges.
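A much reduced sketch of this kind of extraction is shown below. It is not the project's extract_arg_structures: the function name sketch_arg_structures, the token fields (id, head, relation, pos, case, voice), and the output keys are assumptions, and only a subset of the listed properties is collected. The frame string follows the pred[...] notation used in the sample CSV earlier on this page.

```python
# Argument relations named in the list above, in a fixed output order.
ARG_RELATIONS = ("sub", "obj", "iobj", "obl", "voc")

def sketch_arg_structures(sentence):
    """Return one dict per VERB / AUX token with its direct arguments."""
    frames = []
    for verb in sentence:
        if verb["pos"] not in ("VERB", "AUX"):
            continue
        args = {}
        for tok in sentence:
            # An argument is a direct dependent with one of the listed relations.
            if tok["head"] == verb["id"] and tok["relation"] in ARG_RELATIONS:
                args.setdefault(tok["relation"], []).append(tok)
        signature = ",".join(
            f'{rel}:{tok["case"]}'
            for rel in ARG_RELATIONS
            for tok in args.get(rel, [])
        )
        frames.append({
            "lemma": verb["lemma"],
            "voice": verb.get("voice"),
            "args": {rel: [t["form"] for t in toks] for rel, toks in args.items()},
            "frame": f"pred[{signature}]",
        })
    return frames

# μῆνιν ἄειδε θεά, with the verb as root.
sentence = [
    {"id": 1, "form": "μῆνιν", "lemma": "μῆνις", "pos": "NOUN",
     "head": 2, "relation": "obj", "case": "acc"},
    {"id": 2, "form": "ἄειδε", "lemma": "ἀείδω", "pos": "VERB",
     "head": None, "relation": "pred", "voice": "active"},
    {"id": 3, "form": "θεά", "lemma": "θεά", "pos": "NOUN",
     "head": 2, "relation": "voc", "case": "voc"},
]
frames = sketch_arg_structures(sentence)
print(frames[0]["frame"])
```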

Valency-frame database (v0.5)

Status: ingestion in progress on ARIS; first release with v0.5.

The valency-frame database collects every per-verb argument frame emitted by extract_arg_structures() across the corpus, indexes it by period + verb lemma + frame signature, and exposes it for SQL-style query. The output of the corpus-fix audit pass feeds directly into this index, so the v0.5 release will include a clean version.

Example queries supported at v0.5:

verb=αἱρέω, voice=middle, aspect=perfective -> frequency by period
frame=[sub:nom, obj:acc, obl:dat] -> all verbs that license this frame, with diachronic trend
verb=ἀκούω -> case distribution of its object (accusative vs genitive) across periods

The database lives behind a small JSON (a plain-text data format for structured records)-RPC interface; a Python client tools/valency_db.py will be released alongside the v0.5 release. The full SQL dump will be deposited at Zenodo as part of the v0.5 dataset record under CC-BY-4.0 (the Creative Commons Attribution 4.0 licence, which permits reuse with credit).
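The first example query above ("frequency by period") amounts to a filtered count. The sketch below shows that logic over in-memory records; it says nothing about the actual database schema or the JSON-RPC method names, which are not described here. The record fields, the function name frequency_by_period, and the records themselves are invented for illustration.

```python
from collections import Counter

def frequency_by_period(records, **conditions):
    """Count records per period that match every key=value condition."""
    counts = Counter()
    for rec in records:
        if all(rec.get(key) == value for key, value in conditions.items()):
            counts[rec["period"]] += 1
    return dict(counts)

# Invented records for illustration only; not corpus counts.
records = [
    {"period": "Classical", "verb": "αἱρέω", "voice": "middle", "aspect": "perfective"},
    {"period": "Classical", "verb": "αἱρέω", "voice": "middle", "aspect": "perfective"},
    {"period": "Koine", "verb": "αἱρέω", "voice": "middle", "aspect": "perfective"},
    {"period": "Koine", "verb": "αἱρέω", "voice": "active", "aspect": "perfective"},
    {"period": "Koine", "verb": "ἀκούω", "voice": "active", "aspect": "imperfective"},
]
by_period = frequency_by_period(
    records, verb="αἱρέω", voice="middle", aspect="perfective"
)
print(by_period)
```

To compare periods of very different size, the raw counts would normally be divided by the number of verb tokens per period before any trend is read off.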

Retranslation-pair browser (v0.5)

Status: design complete; implementation queued for v0.5.

The retranslation-pair browser is the public-facing reading interface for the project's central research focus: tracking the same canonical text across periods and languages. For a given verse or sentence id, the browser displays the source-language passage and every attested retranslation (across Greek periods + IE parallels), each rendered with its own colour-coded compact PROIEL tree. Aligned tokens are visually linked across the panels.

Concretely at v0.5 the browser supports:

Iliad 1.1 in Homeric Greek + a Byzantine epitome paraphrase + a Modern Greek translation (a single tri-period panel)
NT John 1:1 in Koine Greek + Latin Vulgate + Gothic Wulfila + OCS Marianus (already released as a static example on samples.qmd; the v0.5 browser makes any aligned NT verse queryable in the same format)
Septuagint Psalm 1:1 in Koine + Byzantine paraphrase + Modern Greek
Any verse aligned via the existing LaBSE (a multilingual sentence-embedding model that maps sentences from many languages into a common vector space) + AwesomeAlign (a word-alignment procedure that uses multilingual-BERT attention to match words across translations) workflow

This is the tool that makes retranslation a queryable property rather than a research note.

Retelling-chain explorer (v0.5)

Status: scoping; implementation queued for v0.5.

Where the retranslation-pair browser shows the same sentence under different translations, the retelling-chain explorer follows an influential text's reception: epitomes, paraphrases, glosses, marginalia, and full-scale re-tellings of canonical sources. The explorer renders a directed reception graph (source -> retelling -> commentary -> Modern Greek paraphrase) and displays each node's PROIEL tree on demand.

Reference reception chains targeted for v0.5:

Iliad -> Byzantine epitomes (Tzetzes, Eustathios) -> Modern Greek translations
New Testament -> patristic commentaries -> Byzantine homily tradition
Plato's Phaedrus -> Neoplatonic reception (Proclus) -> later doxographies
Septuagint Psalms -> patristic exegesis -> Modern Greek liturgical use

This is the tool that makes retelling a navigable network rather than a literary commonplace.
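A directed reception graph of this kind can be represented as a plain adjacency mapping and walked to list every chain that starts from a source text. The sketch below is a minimal illustration, assuming a hypothetical function name chains_from and node labels taken from the first reference chain above; it is not the explorer's data model.

```python
def chains_from(graph, source):
    """Return every path from source to a node with no outgoing edges."""
    paths = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        successors = graph.get(path[-1], [])
        if not successors:
            paths.append(path)  # reached the end of a reception chain
        for nxt in successors:
            if nxt not in path:  # guard against accidental cycles
                stack.append(path + [nxt])
    return paths

# Edges point from a text to the works that receive it.
reception = {
    "Iliad": ["Byzantine epitome (Tzetzes)", "Byzantine epitome (Eustathios)"],
    "Byzantine epitome (Tzetzes)": ["Modern Greek translations"],
    "Byzantine epitome (Eustathios)": ["Modern Greek translations"],
}
chains = chains_from(reception, "Iliad")
for chain in chains:
    print(" -> ".join(chain))
```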

AthDGC PROIEL validator (v0.5)

Status: to be released at v0.5 alongside the public corpus release.

A schema-strict + relation-inventory-strict linter for PROIEL XML 2.0 contributions. Checks:

XML well-formedness against the PROIEL XML 2.0 schema
relation labels drawn from the PROIEL inventory only (no UD leakage: nsubj, dobj, nmod, amod, case, etc. flagged as errors)
head-id integrity (every head-id resolves to a token in the same sentence; no cycles; one root per sentence with empty head-id)
morphological-tag well-formedness (10-character PROIEL morphology code, position-by-position legal value)
AthDGC house rules: project style guide enforced on <note> elements; affiliation block exact match in <source> headers

To be released as tools/proiel_validate.py (CLI) and a pre-commit hook (a script that runs automatically before code is committed and blocks the commit if a check fails) for community PRs.
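Two of the checks listed above, UD-label leakage and head-id integrity, can be sketched with the standard library alone. This is a minimal illustration and not tools/proiel_validate.py: the function name check_sentence is hypothetical, the UD label set contains only the examples named above, and the sketch assumes tokens are direct children of a sentence element carrying id, head-id, and relation attributes.

```python
import xml.etree.ElementTree as ET

# Only the UD labels named in the checklist above; a real list would be longer.
UD_LABELS = {"nsubj", "dobj", "nmod", "amod", "case"}

def check_sentence(sentence):
    """Return a list of error strings for one <sentence> element."""
    errors = []
    tokens = sentence.findall("token")
    # A missing or empty head-id marks the root.
    heads = {t.get("id"): (t.get("head-id") or None) for t in tokens}
    roots = [tid for tid, head in heads.items() if head is None]
    if len(roots) != 1:
        errors.append(f"expected one root, found {len(roots)}")
    for t in tokens:
        if t.get("relation") in UD_LABELS:
            errors.append(f'token {t.get("id")}: UD label {t.get("relation")}')
    for tid, head in heads.items():
        if head is not None and head not in heads:
            errors.append(f"token {tid}: head-id {head} not in sentence")
    for tid in heads:
        # Walk towards the root; revisiting a token means a cycle.
        seen, cur = set(), tid
        while cur is not None and cur in heads:
            if cur in seen:
                errors.append(f"token {tid}: cycle")
                break
            seen.add(cur)
            cur = heads[cur]
    return errors

good = ET.fromstring(
    '<sentence id="1">'
    '<token id="1" form="μῆνιν" relation="obj" head-id="2"/>'
    '<token id="2" form="ἄειδε" relation="pred"/>'
    '<token id="3" form="θεά" relation="voc" head-id="2"/>'
    '</sentence>'
)
bad = ET.fromstring(
    '<sentence id="2">'
    '<token id="1" relation="nsubj" head-id="9"/>'
    '<token id="2" relation="pred"/>'
    '</sentence>'
)
good_errors = check_sentence(good)
bad_errors = check_sentence(bad)
print(good_errors, bad_errors)
```

Returning a list of messages rather than raising on the first problem suits a pre-commit hook, which can print every finding and then exit non-zero if the list is not empty.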

Showcase generator

51_build_showcase_site.py, the script that builds https://athdgc.github.io, is generic: point it at any directory of PROIEL XML 2.0 JSONL partitions and it generates a browsable showcase with per-period tabs, PROIEL dependency trees, argument-structure cards, and a review toolbar. PROIEL only, no CoNLL-U pathway.

It is packaged under tools/showcase/ for reuse in any historical-treebank project.

Quarto template pack

The site you are reading is generated from a Quarto pack (athdgc-quarto) that is itself open source. It produces, from one source per page:

HTML website
PDF article (via xetex)
.docx for journal-submission portals
Reveal.js slides for the browser
Beamer PDF slides
.pptx PowerPoint slides
A0 Typst posters

The pack is reusable by any historical-treebank or computational-philology project.

Get the tools

git clone https://github.com/AthDGC/Diachronic-Linguistics-Platform
cd Diachronic-Linguistics-Platform
ls tools/   # all the modules described above

License: Apache-2.0 (except where upstream tools require otherwise; see each tool's LICENSE).

Open source: license matrix + repo map

Every AthDGC tool is licensed under a recognised OSI-approved licence. The corpus itself is provided under CC-BY-4.0 (samples shown publicly; full partitions released at v0.5). The matrix below is the canonical reference.

Module	Licence	Upstream constraint	Code path
`tools/lightside-athdgc/`	BSD-3-Clause + Apache-2.0	Upstream LightSIDE is BSD-3	`Diachronic-Linguistics-Platform/tools/lightside-athdgc/`
`tools/athdgc_to_lightside_syntax.py`	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/`
`tools/athdgc_to_lightside.py`	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/`
`tools/proiel_export/`	Apache-2.0	PROIEL XML 2.0 schema by Oslo	`Diachronic-Linguistics-Platform/tools/proiel_export/`
`tools/proiel_validate.py` (v0.5)	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/`
`tools/arg_structure.py`	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/`
`tools/fix_corpus_data.py`	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/`
`tools/valency_db.py` (v0.5)	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/`
`tools/stanza-finetune/`	Apache-2.0	Stanford Stanza (Stanford's open-source Python workflow that automatically tags, lemmatises, and parses sentences) is Apache-2.0	`Diachronic-Linguistics-Platform/tools/stanza-finetune/`
`tools/build_cwb_index.sh`	GPL-3.0 (the GNU General Public License version 3, a copyleft open-source licence)	Upstream CWB is GPL	`Diachronic-Linguistics-Platform/tools/`
`tools/cwb-recipes/`	CC-BY-4.0	-	`Diachronic-Linguistics-Platform/tools/cwb-recipes/`
`tools/showcase/` (51_build_showcase_site.py)	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/showcase/`
`athdgc-quarto/` (this site)	MIT	-	`Diachronic-Linguistics-Platform/athdgc-quarto/`
`tools/retranslation-browser/` (v0.5)	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/retranslation-browser/`
`tools/retelling-explorer/` (v0.5)	Apache-2.0	-	`Diachronic-Linguistics-Platform/tools/retelling-explorer/`
`tools/neo4j-align-viewer/`	MIT	Neo4j Browser is APL-2.0	`Diachronic-Linguistics-Platform/tools/neo4j-align-viewer/`
Stanza checkpoints `athdgc/grc_byz_proiel`, `grc_lbem_proiel`, `grc_mod_proiel`	Apache-2.0	-	Hugging Face mirror forthcoming at https://huggingface.co/AthDGC (v0.5)
Corpus samples (this site)	CC-BY-4.0	-	Zenodo: `10.5281/zenodo.20439182`
Corpus partitions (v0.5 onwards)	CC-BY-4.0	-	Zenodo: per-version DOI (Digital Object Identifier, a permanent web link for a published artefact)

Access during the v0.5 audit pass

The source repository AthDGC/Diachronic-Linguistics-Platform and the three Hugging Face model repositories AthDGC/grc_byz_proiel, AthDGC/grc_lbem_proiel, AthDGC/grc_mod_proiel are private during the v0.5 audit pass. They become public at the v0.5 release, when the consolidated tools/ directory is in place and the Stanza weights are uploaded. The PyPI package athdgc-tools resolves today as a stub release 0.4.0.dev0; the full toolkit replaces the stub at v0.5.

The curated samples on this site (the Samples page), the launch report (this data paper), the Working Paper deposition packet, and the Zenodo source-code snapshot of v0.4.0 remain public throughout.

# core toolkit
pip install athdgc-tools                    # Apache-2.0, Python 3.11+ (stub until v0.5)

# Stanza checkpoints (weights release at v0.5; the model repos exist now)
python -c "import stanza; stanza.download('grc', package='athdgc_byz_proiel')"
python -c "import stanza; stanza.download('grc', package='athdgc_lbem_proiel')"
python -c "import stanza; stanza.download('grc', package='athdgc_mod_proiel')"

# LightSIDE-AthDGC patch (on top of an existing LightSIDE install)
git clone https://github.com/AthDGC/Diachronic-Linguistics-Platform
cd Diachronic-Linguistics-Platform/tools/lightside-athdgc
./apply_patches.sh /path/to/your/LightSIDE

Where it all lives

Source code: https://github.com/AthDGC/Diachronic-Linguistics-Platform
Stanza checkpoints + tokenisers: https://huggingface.co/AthDGC (three model repos live with provisional v0.5 READMEs; weights at v0.5)
PyPI (the Python Package Index, the central archive that pip downloads packages from) package: https://pypi.org/project/athdgc-tools/ (stub 0.4.0.dev0; full toolkit at v0.5)
Corpus samples + release DOIs: https://doi.org/10.5281/zenodo.20439182 (concept DOI (a Digital Object Identifier that always points to the latest version of a deposited record); version-specific DOIs on the Zenodo "Versions" tab)
Quarto template pack: https://github.com/AthDGC/Diachronic-Linguistics-Platform/tree/main/athdgc-quarto
Issues + PRs: https://github.com/AthDGC/Diachronic-Linguistics-Platform/issues

Repository structure

Diachronic-Linguistics-Platform/
|-- LICENSE                            Apache-2.0 (root)
|-- tools/
|   |-- LICENSE-LightSIDE              BSD-3-Clause (upstream)
|   |-- LICENSE-CWB                    GPL-3.0 (upstream)
|   |-- lightside-athdgc/              syntactic-feature plugin
|   |-- athdgc_to_lightside.py         text-feature export
|   |-- athdgc_to_lightside_syntax.py  syntactic-feature export
|   |-- proiel_export/                 Stanza output -> PROIEL XML 2.0
|   |-- proiel_validate.py             v0.5 schema linter
|   |-- arg_structure.py               valency-frame extractor
|   |-- valency_db.py                  v0.5 query client
|   |-- fix_corpus_data.py             v0.4 corpus-fix audit pass
|   |-- stanza-finetune/               diachronic Stanza checkpoints
|   |-- build_cwb_index.sh             NoSketch-compatible index builder
|   |-- cwb-recipes/                   sample CQL queries
|   |-- showcase/                      site generator
|   |-- neo4j-align-viewer/            cross-lingual graph viewer
|   |-- retranslation-browser/         v0.5 - retranslation UI
|   `-- retelling-explorer/            v0.5 - reception graph UI
|-- athdgc-quarto/                     this website
|-- corpus-samples/                    JSONL samples (CC-BY-4.0)
`-- docs/                              house-style, contributing, code-of-conduct

How to contribute

Community PRs are welcome on every module above. Two gates run on every PR:

PROIEL XML 2.0 validator, tools/proiel_validate.py runs as a pre-commit hook and a GitHub Action (an automated workflow that runs on GitHub's servers every time code is pushed). PRs that introduce UD-style labels (nsubj, dobj, nmod, case, etc.) fail the build.
House-style check, tools/style_check.py scans .qmd and .md against the project style guide. PRs containing banned tokens fail the build.

For larger changes (a new IE parallel, a new period subcorpus, a new tool module), open an issue first with a short proposal; the team will assign a reviewer.

What is not open

The audit-fix tooling logs (per-author error counts, per-Stanza-checkpoint diff statistics) are internal until the v0.5 release notes are published. Funding-side materials and draft grant applications stay in the private repo. Per the project's data-management policy, raw OCR scans of secondary editions that are still in copyright are not republished; only the PROIEL annotation layer on top of them is.

Citation

If you use any AthDGC tool in your research, please cite the corpus DOI and the tool explicitly:

Lavidas, Nikolaos, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, and Evangelos Argyropoulos. 2026. AthDGC: Athens Digital Glossa Chronos. National and Kapodistrian University of Athens. Zenodo. 10.5281/zenodo.20439182.

Roadmap

Argument-structure typology workbench (in development): joint cross-lingual visualisation of argument-structure shifts under retranslation
Diachronic syntactic-change detector: change-point analysis over PROIEL feature frequencies per century
Browser-based PROIEL tree editor: lightweight in-page editor that produces validated PROIEL XML 2.0 patches (PR-friendly diffs that the Oslo PROIEL canon can adopt)
Cross-lingual valency-shift detector: detect changes in argument-frame distribution between retranslations of the same canonical text across periods and languages
AthDGC PROIEL validator: schema-strict + relation-inventory-strict linter for community contributions

Funding

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan. Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.