AthDGC
A platform of computational tools and a diachronic Greek corpus, with Indo-European parallels
What AthDGC is
The Athens Diachronic Glossa Chronos (AthDGC, Athens-PROIEL; Διαχρονία Γλώσσας :Χρόνος) is a platform, not a single resource. It ships two open deliverables side by side:
- an open-source computational toolkit for diachronic linguistics - LightSIDE-compatible feature-extraction and classification pipelines, a NoSketch-style concordancer, fine-tuned Stanza checkpoints, a per-verb argument-structure extractor, a cross-lingual alignment viewer, a corpus-fix toolkit, a showcase generator, and the Quarto pack that builds this very site. See the Tools page.
- an open PROIEL-style dependency treebank spanning Greek diachrony (Homeric, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern) with Indo-European parallels to Latin (Vulgate), Gothic (Wulfila), and Old Church Slavonic (Marianus) via verse-level cross-lingual alignment of the New Testament. Classical Armenian and Sanskrit ingestion are in progress; Ukrainian, Old English, Avestan, and Old Persian are queued for v0.7. See the Samples page for representative sentences.
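Verse-level alignment can be pictured as a join on a shared verse reference. The sketch below is illustrative only: the reference format (`JOHN 1.1`), the language codes, the dictionary layout, and the function name `align_verses` are assumptions made for this example, not the AthDGC data model.

```python
from collections import defaultdict


def align_verses(witnesses):
    """Group verse texts from several witnesses under a shared verse reference.

    `witnesses` maps a language code to a dict of {verse_ref: text}.
    Returns {verse_ref: {lang: text}}, keeping only verses that are attested
    in Greek ("grc") and in at least one other witness.
    """
    aligned = defaultdict(dict)
    for lang, verses in witnesses.items():
        for ref, text in verses.items():
            aligned[ref][lang] = text
    return {
        ref: langs
        for ref, langs in aligned.items()
        if "grc" in langs and len(langs) > 1
    }


# Toy input: one verse with a Latin parallel, one verse attested in Greek only.
sample = {
    "grc": {
        "JOHN 1.1": "Ἐν ἀρχῇ ἦν ὁ λόγος",
        "JOHN 1.2": "οὗτος ἦν ἐν ἀρχῇ πρὸς τὸν θεόν",
    },
    "lat": {"JOHN 1.1": "In principio erat Verbum"},
}
result = align_verses(sample)
```

Here `result` keeps only `JOHN 1.1`, because `JOHN 1.2` has no parallel in the toy input. Word-level alignment inside a verse is a separate step and is not covered by this sketch.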
The platform's research focus is retranslation, retelling, influential texts, and argument structure. The tools are usable on any PROIEL-style historical treebank, not only on the AthDGC corpus - so other research groups can adopt them without depending on us.
Cite this work. Lavidas, N., Nikiforidou, K., Haug, D., Kulikov, L., Geka, V., Symeonidis, V., Michalareas, T., Chionidi, S., Tsiropina, A., Plakoutsi, E., Argyropoulos, E., and the Athens Digital Glossa Chronos Research Network (2026). AthDGC: Athens Diachronic Glossa Chronos. Zenodo. 10.5281/zenodo.20439182.
Train with us. The platform is taught as a KEDIVIM continuing-education course at NKUA in autumn 2026: Digital Tools for the Diachronic Analysis of Language / Ψηφιακά Εργαλεία για τη Διαχρονική Ανάλυση της Γλώσσας. See the Training page.
Current state (v0.4)
| Field | Value |
|---|---|
| Total corpus rows | 89.9 M |
| Annotated Greek rows | 4.08 M |
| NT-aligned Greek verses | 6,861 |
| Cross-aligned IE witnesses | 4 (Greek, Latin, Gothic, OCS); Classical Armenian in ingestion |
| Open-source tools shipped | 14 (see Tools) |
| Concept DOI | 10.5281/zenodo.20439182 |
| Repository | https://github.com/AthDGC/Diachronic-Linguistics-Platform |
| Stanza checkpoints | forthcoming at https://huggingface.co/AthDGC (org page + model repos in setup; ship at v0.5) |
| Open-access source archives | Perseus, OGL/First1K, SBL GNT, Rahlfs LXX, Papyri.info, Patrologia Graeca, Bibliotheca Augustana, Anemi, Wikisource el; Wulfila Project, TITUS, GRETIL, SARIT, TEAMS, National Library of Ukraine |
| Launch report | GlossaContactLab Working Papers, NKUA digital edition |
Release status (v0.4)
The platform is under continuous build. v0.4 is not yet a wide release of the corpus. The Stanford-Stanza pass introduced annotation errors in a non-trivial share of sentences (lemma mis-attribution, crasis splits, gender inheritance on indeclinables, period mis-tagging, etc.), and the team is correcting these on the ARIS side before any wide release.
What this public site shows today is therefore a curated samples showcase drawn from the v0.4 partitions and rendered through the standard AthDGC PROIEL pipeline. The tools, by contrast, are released and usable now: they live in the public source-code repo, and the Hugging Face mirror of the Stanza checkpoints follows at v0.5. See the Samples and Tools pages.
The full corpus partitions will ship at v0.5, once the audit pass is complete. The Zenodo record above is the canonical citation for the project; the version DOIs (v0.4.0, v0.5.0, ...) track the release history.
What is in the public repository (and what is not)
The GitHub repository https://github.com/AthDGC/Diachronic-Linguistics-Platform is deliberately scoped. It is not a corpus dump.
The public repository ships:
- the platform's source code (discovery, filtering, conversion, annotation, alignment, fix, showcase generator, Quarto template pack);
- the open-source toolkit (LightSIDE-AthDGC for syntactic features, NoSketch-style concordancer, PROIEL XML 2.0 exporter, corpus-fix toolkit, argument-structure extractor, cross-lingual alignment viewer);
- the fine-tuned Stanza checkpoints (`grc_byz_proiel`, `grc_lbem_proiel`, `grc_mod_proiel`) trained on ARIS; the Hugging Face mirror under `athdgc/*` is in setup and ships at v0.5;
- the curated showcase samples rendered on this site;
- the Zenodo source-code snapshot under DOI 10.5281/zenodo.20439182.
The public repository does NOT ship:
- the full v0.4 annotated corpus partitions (89.9 M corpus rows / 4.08 M annotated Greek rows). These remain under active audit on the GRNET ARIS national HPC under allocation `pa260305` until the v0.5 release.
- the raw JSONL partitions per period, the Qdrant vector store, or the Neo4j alignment graph dump. These will ship as a separate Zenodo dataset record at v0.5, under CC-BY-4.0, once the Stanza-introduced annotation errors have been corrected at scale.
This scoping is deliberate: we will not release annotated data with known errors. The tools are released and ready for reuse on any PROIEL-style historical treebank; the corpus ships when it is correct.
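As an illustration of reuse on a PROIEL-style treebank, the sketch below reads PROIEL-style XML and collects, for each verb, its voice and the lemmas of its subject, object, and oblique dependents. It is not the AthDGC extractor: the function name `extract_frames`, the toy sentence, and its annotation are invented for this example. The code also assumes PROIEL conventions (relation labels `sub`, `obj`, `obl`; voice in the fifth position of the `morphology` string), which should be checked against the treebank in hand.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Relation labels treated as arguments (assumed PROIEL labels).
ARG_RELATIONS = ("sub", "obj", "obl")

# Toy, hand-annotated sentence for illustration only.
SAMPLE_XML = """
<proiel>
  <source id="toy" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="2" relation="aux"/>
        <token id="2" form="ἄνθρωπος" lemma="ἄνθρωπος" part-of-speech="Nb" head-id="3" relation="sub"/>
        <token id="3" form="βλέπει" lemma="βλέπω" part-of-speech="V-" morphology="3spia----i" relation="pred"/>
        <token id="4" form="τὸν" lemma="ὁ" part-of-speech="S-" head-id="5" relation="aux"/>
        <token id="5" form="οἶκον" lemma="οἶκος" part-of-speech="Nb" head-id="3" relation="obj"/>
      </sentence>
    </div>
  </source>
</proiel>
"""


def extract_frames(xml_text):
    """Return one frame per verb: lemma, voice code, and argument lemmas."""
    root = ET.fromstring(xml_text)
    frames = []
    for sentence in root.iter("sentence"):
        tokens = sentence.findall("token")
        # Index argument tokens by the id of their syntactic head.
        dependents = defaultdict(list)
        for tok in tokens:
            if tok.get("relation") in ARG_RELATIONS:
                dependents[tok.get("head-id")].append(tok)
        for tok in tokens:
            if not (tok.get("part-of-speech") or "").startswith("V"):
                continue
            morph = tok.get("morphology") or ""
            frame = {
                "verb": tok.get("lemma"),
                # Fifth character of the positional tag (assumed to be voice).
                "voice": morph[4] if len(morph) > 4 else "-",
                "sub": [],
                "obj": [],
                "obl": [],
            }
            for dep in dependents[tok.get("id")]:
                frame[dep.get("relation")].append(dep.get("lemma"))
            frames.append(frame)
    return frames


frames = extract_frames(SAMPLE_XML)
```

For the toy sentence, `frames` holds a single entry for the verb, with voice `a`, one subject lemma, one object lemma, and no obliques. A real extractor would also need to handle coordination, null subjects, and empty tokens, which this sketch ignores.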
Method (in brief)
- Discovery - Daily harvest from archive.org, Perseus, First1K Greek, Wikisource, Diorisis, OpenGreekAndLatin
- Filtering - Greek-script ratio + apparatus-criticus rejection + content-hash deduplication
- Conversion - PROIEL XML 2.0 schema
- Annotation - Stanza `grc_proiel` (and `la_proiel`, `cu_proiel`, `got_proiel` for parallels)
- Argument structure - per-verb extraction of subject, object, oblique, voice, aspect
- Cross-lingual alignment - LaBSE sentence embedding + multilingual-BERT attention for word-level alignment
- Storage - PROIEL XML + JSONL partitions + Qdrant vector store + Neo4j graph
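The filtering step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's code: the function names, the 0.8 threshold, the Unicode ranges used to count Greek letters, and the choice of SHA-256 over whitespace-normalized NFC text are all hypothetical. Apparatus-criticus rejection is not covered.

```python
import hashlib
import unicodedata

# Greek and Coptic block plus Greek Extended (polytonic) block.
GREEK_RANGES = ((0x0370, 0x03FF), (0x1F00, 0x1FFF))


def greek_ratio(text):
    """Share of alphabetic characters that fall in the Greek ranges."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        1
        for ch in letters
        if any(lo <= ord(ch) <= hi for lo, hi in GREEK_RANGES)
    )
    return greek / len(letters)


def content_hash(text):
    """SHA-256 of the text after NFC normalization and whitespace collapsing."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_documents(docs, min_ratio=0.8):
    """Keep mostly-Greek documents, dropping repeats of the same content."""
    seen = set()
    kept = []
    for doc in docs:
        if greek_ratio(doc) < min_ratio:
            continue  # not predominantly Greek script
        digest = content_hash(doc)
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "Ἐν ἀρχῇ ἦν ὁ λόγος",
    "Ἐν  ἀρχῇ ἦν ὁ  λόγος",      # same text, different spacing
    "In principio erat Verbum",  # Latin script, rejected by the ratio test
]
kept = filter_documents(docs)
```

In this toy run only the first string survives: the second collapses to the same hash once whitespace is normalized, and the third contains no Greek letters.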
Funding
Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan.
Funded project: CVL-CDSAML - A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.
Hosted at the National and Kapodistrian University of Athens, Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy.
Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.