AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels
Platform launch report - GlossaContactLab Working Papers, NKUA digital edition
AthDGC (“Athens-PROIEL”) is an open, end-to-end pipeline and dataset that provides the first continuously updated, dependency-parsed treebank of the entire Greek language - from Homeric and Archaic texts to Modern Greek - with cross-lingual alignment to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian via the verse-level cross-alignment of the New Testament. AthDGC follows the PROIEL XML 2.0 schema, uses the Stanford Stanza PROIEL-trained pipeline for annotation, and applies LaBSE sentence embeddings and multilingual-BERT attention for word-level alignment. As of v0.4, the corpus contains 89.9M total rows, 4.08M annotated Greek rows, and 6,861 NT-aligned Greek verses cross-aligned to four sister IE witnesses.
diachronic linguistics, ancient Greek, Byzantine Greek, Modern Greek, Indo-European, PROIEL, dependency treebank, cross-lingual alignment, argument structure, retranslation
1. Overview
1.1 Context and motivation
Greek is the longest continuously attested member of the Indo-European family. The PROIEL Treebank (Haug and Jøhndal 2008; Eckhoff et al. 2018) established the gold standard for syntactically annotated parallel Indo-European Bible corpora, but its scope is restricted to the New Testament and to a small number of language witnesses. No openly licensed, end-to-end pipeline currently covers the full diachronic record of Greek at the same level of annotation, with explicit argument-structure tagging, and with reproducible cross-lingual alignment to its sister IE languages. AthDGC closes that gap.
The project's specific scholarly focus is retranslation: the same canonical text - the Iliad, the New Testament, the Septuagint Psalms, classical historiography - is re-rendered into Greek across periods (Homeric → Koine → Byzantine → Modern) and into sister languages (Latin, Gothic, OCS, Armenian) across the Indo-European family.
1.2 Repository location
- Source code: https://github.com/AthDGC/Diachronic-Linguistics-Platform
- Public showcase: https://athdgc.github.io
- Zenodo deposition: 10.5281/zenodo.20439182
1.3 Institutional context
AthDGC is developed at the National and Kapodistrian University of Athens (NKUA), Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy, under the direction of Prof. Nikolaos Lavidas. Compute is supplied by GRNET ARIS, allocation pa260305.
2. Method
2.1 Discovery
Each day, the discovery stage probes open-access repositories - archive.org, Perseus Digital Library, First1K Greek, Wikisource, the Diorisis Corpus, OpenGreekAndLatin/PerseusDL - for new or updated Greek and parallel-language source material.
2.2 Filtering
Candidate texts are filtered by Greek-script ratio (at least 75% of alphabetic characters in the Greek Unicode blocks), a Path-B line filter for bilingual editorial apparatus, and content-hash deduplication.
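The two mechanical filters above (script ratio and content-hash deduplication) can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the use of SHA-256, the whitespace collapsing before hashing, and the exact Unicode ranges (Greek and Coptic, Greek Extended) are assumptions. The Path-B line filter is not shown.

```python
import hashlib
import unicodedata

# Assumed ranges: Greek and Coptic (U+0370-U+03FF), Greek Extended (U+1F00-U+1FFF).
GREEK_RANGES = ((0x0370, 0x03FF), (0x1F00, 0x1FFF))


def is_greek(ch: str) -> bool:
    """True if the character's code point lies in one of the assumed Greek ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in GREEK_RANGES)


def greek_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Greek ranges."""
    # NFC composes base letter + combining accent into one precomposed letter.
    letters = [ch for ch in unicodedata.normalize("NFC", text) if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_greek(ch) for ch in letters) / len(letters)


def content_hash(text: str) -> str:
    """SHA-256 of the NFC-normalised, whitespace-collapsed text (assumed scheme)."""
    canonical = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_candidates(texts, threshold=0.75):
    """Keep texts at or above the threshold, dropping exact duplicates."""
    seen = set()
    kept = []
    for text in texts:
        if greek_ratio(text) < threshold:
            continue
        digest = content_hash(text)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Hashing a normalised form rather than the raw bytes means that two copies differing only in spacing or in composed versus decomposed accents count as the same text; whether the pipeline normalises in this way is not stated in this report.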
2.3 Conversion
Surviving text is converted into the PROIEL XML 2.0 schema with sentence-level structure.
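A rough sketch of this conversion step is given below, using only the Python standard library. The nesting (proiel > source > div > sentence > token) and the attribute names reflect a common reading of the PROIEL XML layout and should be checked against the published schema; the sentence splitter, the function names, and the "unannotated" status value are illustrative assumptions, and a full export would carry further required elements that are omitted here.

```python
import re
import xml.etree.ElementTree as ET


def split_sentences(text: str):
    """Naive splitter on full stop, semicolon and raised dot (illustrative only)."""
    parts = re.split(r"(?<=[.;·])\s+", text.strip())
    return [p for p in parts if p]


def to_proiel_skeleton(source_id: str, language: str, title: str, text: str) -> ET.Element:
    """Build an unannotated sentence/token skeleton for one source text."""
    root = ET.Element("proiel", {"schema-version": "2.0"})
    source = ET.SubElement(root, "source", {"id": source_id, "language": language})
    ET.SubElement(source, "title").text = title
    div = ET.SubElement(source, "div")
    token_id = 0
    for sent_no, sentence in enumerate(split_sentences(text), start=1):
        # Status value assumed; annotation is added in a later stage.
        sent_el = ET.SubElement(div, "sentence", {"id": str(sent_no), "status": "unannotated"})
        # Whitespace tokenisation: punctuation stays attached to the form here.
        for form in sentence.split():
            token_id += 1
            ET.SubElement(sent_el, "token", {"id": str(token_id), "form": form})
    return root


if __name__ == "__main__":
    tree = to_proiel_skeleton("john", "grc", "John", "ἐν ἀρχῇ ἦν ὁ λόγος. καὶ ὁ λόγος ἦν πρὸς τὸν θεόν.")
    print(ET.tostring(tree, encoding="unicode"))
```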
2.4 Annotation
Annotation is performed sentence by sentence with the Stanford Stanza pipeline (Qi et al. 2020) using the PROIEL-trained model grc_proiel. Analogous PROIEL-trained Stanza models are used for Latin (la_proiel), Old Church Slavonic (cu_proiel), and Gothic (got_proiel); Classical Armenian is annotated through a PROIEL-style pipeline currently under development. All output is normalised to the PROIEL XML 2.0 schema; no CoNLL-U or other format is published.
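For orientation, the sketch below shows how a PROIEL-trained Stanza model is typically requested: Stanza takes a language code plus a package name, so the model written here as grc_proiel corresponds to lang="grc", package="proiel". The helper names and the row layout are assumptions, and the models must be downloaded first (for example with stanza.download). Stanza's models are trained on Universal Dependencies releases, so the deprel values it returns are expected to be UD labels; the mapping to PROIEL relations performed by the exporter is not shown.

```python
# Language code used in this report -> (Stanza lang, Stanza package).
PROIEL_MODELS = {
    "grc": ("grc", "proiel"),
    "lat": ("la", "proiel"),
    "chu": ("cu", "proiel"),
    "got": ("got", "proiel"),
}


def build_pipeline(iso_code: str):
    """Create a Stanza pipeline for one of the PROIEL-trained models."""
    import stanza  # imported lazily so the helpers below work without Stanza installed

    lang, package = PROIEL_MODELS[iso_code]
    return stanza.Pipeline(lang=lang, package=package)


def annotate(pipeline, sentence: str):
    """Run the pipeline on one sentence and flatten the result to plain rows."""
    doc = pipeline(sentence)
    rows = []
    for sent in doc.sentences:
        for word in sent.words:
            rows.append({
                "id": word.id,
                "form": word.text,
                "lemma": word.lemma,
                "upos": word.upos,
                "feats": word.feats,
                "head": word.head,      # 0 marks the root
                "deprel": word.deprel,  # UD label, to be mapped downstream
            })
    return rows
```

A call such as annotate(build_pipeline("grc"), "ἐν ἀρχῇ ἦν ὁ λόγος") would then return one row per word.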
2.5 Argument-structure capture
Beyond standard dependency annotation, AthDGC extracts an explicit argument-structure frame for every verb token. Arguments are labelled strictly with the PROIEL relation inventory: the subject (sub, including the raised patterns xobj and nonsub), the direct object (obj), indirect and oblique arguments (iobj, obl), and the vocative addressee (voc). Each frame also records the verb's voice (active / middle / passive) and aspect (perfective / imperfective).
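The frame extraction described above can be pictured with a small sketch. It assumes tokens are already available as dictionaries with id, head, relation, pos, case, voice, and aspect fields; that layout, the function name, and the slot notation (relation:case) are illustrative choices, not the format used by tools/arg_structure.py. The raised patterns (xobj, nonsub) are not handled.

```python
from collections import defaultdict

# Argument relations named in the text, in display order.
ARGUMENT_RELATIONS = ("sub", "obj", "iobj", "obl", "voc")


def extract_frames(tokens):
    """Return one frame per verb: its argument slots plus voice and aspect."""
    dependents = defaultdict(list)
    for tok in tokens:
        dependents[tok["head"]].append(tok)

    frames = []
    for tok in tokens:
        if tok["pos"] != "verb":
            continue
        args = [d for d in dependents[tok["id"]] if d["relation"] in ARGUMENT_RELATIONS]
        args.sort(key=lambda d: ARGUMENT_RELATIONS.index(d["relation"]))
        frames.append({
            "verb": tok["lemma"],
            "slots": [f'{d["relation"]}:{d.get("case", "-")}' for d in args],
            "voice": tok.get("voice"),
            "aspect": tok.get("aspect"),
        })
    return frames


# Hand-built analysis of the first three words of Iliad 1.1, for illustration.
iliad_1_1 = [
    {"id": 1, "form": "μῆνιν", "lemma": "μῆνις", "pos": "noun", "case": "acc",
     "head": 2, "relation": "obj"},
    {"id": 2, "form": "ἄειδε", "lemma": "ἀείδω", "pos": "verb",
     "head": 0, "relation": "pred", "voice": "active", "aspect": "imperfective"},
    {"id": 3, "form": "θεά", "lemma": "θεά", "pos": "noun", "case": "voc",
     "head": 2, "relation": "voc"},
]

frames = extract_frames(iliad_1_1)
```

For this input the single frame has the slots obj:acc and voc:voc, with active voice and imperfective aspect.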
2.6 Cross-lingual alignment
Sentence-level alignment uses LaBSE embeddings (Feng et al. 2022); word-level alignment uses multilingual-BERT attention via the AwesomeAlign procedure (Dou and Neubig 2021). Phonetic cognate scoring uses ASJP sound-class encoding and LingPy edit distance (List 2014).
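The phonetic scoring step rests on an edit distance between sound-class strings. The sketch below is a plain-Python Levenshtein distance, not a call into LingPy, and it does not perform the ASJP encoding: inputs are assumed to be already encoded, and the plain transliterations in the usage note are stand-ins. Dividing by the length of the longer string is one common normalisation and is an assumption here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalised_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

With stand-in strings, normalised_distance("pater", "pater") is 0.0 and normalised_distance("pater", "fadar") is 0.6 (three substitutions over five segments); lower values indicate closer candidate cognates.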
3. Dataset description
| Field | Value |
|---|---|
| Object name | AthDGC corpus, v0.4 release |
| Format | PROIEL XML 2.0 (primary); JSONL partitions; Neo4j alignment-graph dump; Qdrant vector store |
| Creation dates | 2025-09 onwards; v0.4.0 minted 29 May 2026 |
| Dataset creator | Lavidas, N., Nikiforidou, K., Haug, D., Kulikov, L., Geka, V., Symeonidis, V., Michalareas, T., Chionidi, S., Tsiropina, A., Plakoutsi, E., and Argyropoulos, E. |
| Languages | grc (Ancient Greek), gkm (Medieval Greek), ell (Modern Greek) + parallels in lat, got, chu, xcl; queued for v0.7: san, ang, ave, peo, ukr |
| Licence | Code Apache-2.0; metadata + alignments CC-BY-4.0; per-source raw text under its original licence |
| Repository | https://github.com/AthDGC/Diachronic-Linguistics-Platform |
| Concept DOI | 10.5281/zenodo.20439182 |
| Publication date | 2026-05-29 (v0.4.0 source-code snapshot on Zenodo) |
| Release status | v0.4 is samples-only on the public site; full annotated partitions are under audit on GRNET ARIS and will ship at v0.5 |
3.1 Open-access source provenance
Every primary source text in AthDGC is open-access (public domain, CC-BY, CC-BY-SA, or equivalent). The annotation layer is AthDGC-original and released under CC-BY-4.0. The open-access chain is preserved from input through annotation to distribution.
Greek - per-period source map
| Period | Source archive | Licence |
|---|---|---|
| Archaic, Classical, Hellenistic | Perseus Digital Library; Open Greek and Latin / First Thousand Years of Greek (Leipzig) | CC-BY-SA 4.0 |
| Koine - NT | SBL Greek NT; Tischendorf (1869); Westcott-Hort (1881) | SBL licence; public domain |
| Koine - LXX | Rahlfs (1935) via openscriptures.org | public domain |
| Koine - documentary | Papyri.info / DDbDP | CC-BY 3.0 |
| Late Antique, Byzantine | Patrologia Graeca via Documenta Catholica Omnia + Internet Archive | public domain |
| Byzantine, Late Byzantine | Bibliotheca Augustana; pre-1928 Teubner | public domain |
| Early Modern | Anemi (UoC); Wikisource el | public domain |
| Modern (19th c. - 1928) | Wikisource el; Anemi | public domain |
| Modern (post-1928) | publisher-licensed editions; not republished | in copyright |
IE parallels - per-language source map
| Language | Source archive | Licence |
|---|---|---|
| Latin (Vulgate) | Clementine Vulgate via Vulsearch; Latin Library | public domain |
| Gothic (Wulfila) | Wulfila Project (University of Antwerp) | CC-BY-SA |
| OCS (Marianus) | TITUS (Frankfurt) | academic open access |
| Classical Armenian | TITUS; Digilib Armenian | academic open access |
| Sanskrit (Brahmana, Upanisadic) | GRETIL (Goettingen); SARIT; TITUS | academic open access |
| Old English (Wessex Gospels) | TEAMS; Bosworth-Toller; DOE extracts | public domain + CC-BY-SA |
| Avestan (Yasna, Yashts) | TITUS; Geldner (1886-95) | public domain + open access |
| Old Persian (Behistun) | TITUS; Kent (1953) PD transliteration | public domain |
| Ukrainian (Ostroh 1581) | National Library of Ukraine facsimile | public domain |
| Ukrainian (20th-c. rev.) | Ohienko (1962); Khomenko (1963) | in copyright |
In-copyright editions are never republished verbatim. AthDGC uses an open-access antecedent (e.g. pre-1928 Teubner; SBL GNT instead of Nestle-Aland) or, where unavoidable, short quotation samples under fair use with full attribution. The annotation layer is always AthDGC-original.
3.2 IE parallels covered
| Language | Family | Period | Stanza model | Status | NT-aligned verses (v0.4) |
|---|---|---|---|---|---|
| Greek (Koine) | IE / Hellenic | 1st-2nd c. AD | grc_proiel | annotated | 7,956 |
| Latin (Vulgate) | IE / Italic | 4th c. AD | la_proiel | annotated | 7,956 |
| Gothic (Wulfila) | IE / Germanic E | 4th c. AD | got_proiel | annotated | 3,512 |
| OCS (Marianus) | IE / Slavic | 10th c. AD | cu_proiel | sampled | 6,861 |
| Classical Armenian | IE / Armenian | 5th c. AD | xcl_proiel (in dev.) | ingestion | 0 (v0.5) |
| Sanskrit (Brahmana, Upanisadic) | IE / Indo-Iranian | 1000-500 BC | sa_vedic (queued) | queued (v0.7) | 0 |
| Old English (Wessex Gospels) | IE / Germanic W | 10th c. AD | ang_proiel (queued) | queued (v0.7) | 0 |
| Avestan (Yasna, Yashts) | IE / Indo-Iranian | 1000-500 BC | ae_proiel (queued) | queued (v0.7) | 0 |
| Old Persian (Behistun) | IE / Indo-Iranian | 6th-5th c. BC | peo_proiel (queued) | queued (v0.7) | 0 |
| Ukrainian (Ostroh + 20th-c. rev.) | IE / Slavic E | 1581 + 1962 | uk_dep adapted (queued) | queued (v0.7) | 0 |
3.3 Retelling and retranslation chains
Five chains are public on the Samples page. Each renders the same canonical passage across periods (retelling, retranslation within Greek) or across languages (cross-IE retranslation), with one PROIEL tree per node and a frame-stability gloss.
| Chain | Type | Nodes | What it shows |
|---|---|---|---|
| Iliad 1.1 across reception | retelling | Homer (8th c. BC); Tzetzes (12th c.); Kazantzakis/Kakridis (1955) | [obj:acc, voc:voc] stable across 2,800 years; collapses only under Tzetzes' narrative reframing |
| LXX Psalm 1:1 across periods | retranslation | LXX (3rd c. BC); Byzantine (10th c.); Modern Greek (post-1976) | pred-nom + sub + atr.rel-cl stable; voice flips twice; οὐκ -> οὐ -> δεν |
| NT John 1:1 patristic + Modern reception | retelling | John (1st c.); Chrysostom (4th c.); John of Damascus (8th c.); Modern liturgical (post-1976) | copular skeleton stable 2,000 years; patristic nodes nest under xcomp of φησίν |
| Plato Apology 17a across reception | retelling | Plato (4th c. BC); Olympiodorus (6th c.); Modern Greek (post-1976) | pred + adv:neg + comp[sub, obl] preserved; perfect aspect synthetic vs analytic |
| LXX Genesis 1:1 across IE languages | retranslation, cross-IE | LXX Greek; Vulgate Latin; Wulfila Gothic; OCS Marianus | pred + sub + obl/adv + obj + coord stable across all four; only obl-vs-adv diverges in OCS |
3.4 Tools (open source)
AthDGC ships fourteen modules under OSI-approved licences. The annotation layer plus the corpus samples are CC-BY-4.0; tool code is Apache-2.0 (most modules), MIT (Quarto pack + Neo4j viewer), BSD-3-Clause (upstream LightSIDE, retained on the patch set), and GPL-3.0 (upstream CWB, retained on the index builder).
| Module | Licence |
|---|---|
| tools/lightside-athdgc/ (LightSIDE for PROIEL arcs + frames + morphology) | BSD-3-Clause + Apache-2.0 |
| tools/athdgc_to_lightside_syntax.py, tools/athdgc_to_lightside.py | Apache-2.0 |
| tools/proiel_export/ (Stanza -> PROIEL XML 2.0) | Apache-2.0 |
| tools/proiel_validate.py (v0.5 schema linter) | Apache-2.0 |
| tools/arg_structure.py, tools/fix_corpus_data.py, tools/valency_db.py | Apache-2.0 |
| tools/stanza-finetune/ (diachronic Stanza checkpoints) | Apache-2.0 |
| tools/build_cwb_index.sh (NoSketch-style CWB index) | GPL-3.0 |
| tools/cwb-recipes/ (sample CQL queries) | CC-BY-4.0 |
| tools/showcase/ (51_build_showcase_site.py) | Apache-2.0 |
| tools/neo4j-align-viewer/ (cross-lingual graph viewer) | MIT |
| tools/retranslation-browser/ (v0.5) | Apache-2.0 |
| tools/retelling-explorer/ (v0.5) | Apache-2.0 |
| athdgc-quarto/ (this Working Paper + the public site) | MIT |
| Stanza checkpoints athdgc/grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel (HF mirror v0.5) | Apache-2.0 |
Two automated gates run on every community PR: a PROIEL XML 2.0 schema validator (rejecting UD-style labels such as nsubj, dobj, nmod, case) and a Lavidasised house-style check (rejecting AI-marker vocabulary and em-dashes).
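The two gates can be illustrated with a short sketch. It is not the code of tools/proiel_validate.py or of the house-style check: the function names are hypothetical, the rejected-label set contains only the four examples named above, and the AI-marker vocabulary list is not reproduced because it is not given in this report.

```python
import xml.etree.ElementTree as ET

# Only the UD-style labels named in the text; the real gate may reject more.
REJECTED_UD_LABELS = {"nsubj", "dobj", "nmod", "case"}
EM_DASH = "\u2014"


def ud_label_violations(xml_text: str):
    """Return (token id, relation) pairs whose relation is a rejected UD label."""
    root = ET.fromstring(xml_text)
    return [
        (tok.get("id"), tok.get("relation"))
        for tok in root.iter("token")
        if tok.get("relation") in REJECTED_UD_LABELS
    ]


def em_dash_violations(prose: str):
    """Return 1-based line numbers that contain an em-dash."""
    return [n for n, line in enumerate(prose.splitlines(), start=1) if EM_DASH in line]


def gate_passes(xml_text: str, prose: str) -> bool:
    """A contribution passes only if both checks come back empty."""
    return not ud_label_violations(xml_text) and not em_dash_violations(prose)
```

Returning the offending token ids and line numbers, rather than a bare pass/fail flag, lets a failing PR report exactly what needs to be changed.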
4. Reuse potential
AthDGC supports four primary reuse classes:
Historical syntactic research. Per-period dependency parses allow analysts to track argument-structure changes across three millennia of Greek (active-to-mediopassive shifts, loss of the optative, rise of the periphrastic perfect, accusative-to-genitive variation under verbs of perception, etc.). The colour-coded compact dependency overview on each sample is designed for fast scanning of these patterns.
Indo-European comparative syntax. The verse-level cross-lingual alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian provides aligned PROIEL parses across four sister IE languages, with Sanskrit, Old English, Avestan, Old Persian, and Ukrainian queued at v0.7. Users can query the Neo4j alignment graph for typological correspondences (e.g. every Greek aorist active verb whose Latin Vulgate counterpart is passive).
Computational diachronic NLP. The fine-tuned Stanza checkpoints (grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel) are trained on ARIS; the Hugging Face mirror under athdgc/* is in setup and ships at v0.5. The checkpoints can be loaded directly into Stanza for parsing new Greek diachronic data once published. The LightSIDE-AthDGC fork extends LightSIDE to work on syntactic features (dependency arcs, argument-structure frames, morphology bundles), opening a new feature space for classification of diachronic stages.
Digital editions + pedagogy. The showcase generator (51_build_showcase_site.py) produces browsable per-sample PROIEL trees from any JSONL corpus; the Quarto template pack generates the whole multi-output site (HTML, Reveal.js slides, .docx, .pptx, Beamer, A0 poster) from a single source. Both are open-source and reusable by other historical-treebank projects.
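The comparative query mentioned under Indo-European comparative syntax (Greek aorist active verbs whose Vulgate counterpart is passive) can be illustrated without a database. The sketch below filters an in-memory list of aligned verb pairs; the record layout, field names, and sample pairs are hypothetical placeholders, not rows from the corpus and not the Neo4j graph schema, which this report does not describe.

```python
def voice_mismatches(pairs, tense="aorist", source_voice="active", target_voice="passive"):
    """Select aligned verb pairs by Greek tense and voice and by Latin voice."""
    hits = []
    for pair in pairs:
        grc, lat = pair["grc"], pair["lat"]
        if (grc["tense"] == tense
                and grc["voice"] == source_voice
                and lat["voice"] == target_voice):
            hits.append((pair["ref"], grc["lemma"], lat["lemma"]))
    return hits


# Hypothetical aligned pairs with placeholder lemmas; not taken from the corpus.
sample_pairs = [
    {"ref": "example-1",
     "grc": {"lemma": "grc-verb-1", "tense": "aorist", "voice": "active"},
     "lat": {"lemma": "lat-verb-1", "voice": "passive"}},
    {"ref": "example-2",
     "grc": {"lemma": "grc-verb-2", "tense": "aorist", "voice": "active"},
     "lat": {"lemma": "lat-verb-2", "voice": "active"}},
    {"ref": "example-3",
     "grc": {"lemma": "grc-verb-3", "tense": "present", "voice": "active"},
     "lat": {"lemma": "lat-verb-3", "voice": "passive"}},
]

matches = voice_mismatches(sample_pairs)
```

Only the first placeholder pair satisfies all three conditions. In a graph store the same selection would be expressed as a pattern match over aligned token nodes.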
The platform is taught as a KEDIVIM continuing-education course at NKUA: Ψηφιακά Εργαλεία για τη Διαχρονική Ανάλυση της Γλώσσας / Digital Tools for the Diachronic Analysis of Language, autumn 2026. See https://athdgc.github.io/training.html.
5. Access + format
The public site at https://athdgc.github.io ships curated samples only for the v0.4 release. The full annotated corpus partitions remain under audit on the GRNET ARIS national HPC under allocation pa260305; they will ship at v0.5 as a separate Zenodo dataset record under CC-BY-4.0. The source-code snapshot at the v0.4.0 Zenodo record is Apache-2.0.
The platform's fourteen open-source tools (LightSIDE-AthDGC for PROIEL syntactic features, LightSIDE-compatible text-feature export, NoSketch-style concordancer, fine-tuned diachronic Stanza checkpoints, PROIEL XML 2.0 exporter, corpus-fix toolkit, argument-structure extractor, cross-lingual Neo4j alignment viewer, showcase generator, v0.5 PROIEL XML 2.0 validator, v0.5 valency-frame database client, v0.5 retranslation-pair browser, v0.5 retelling-chain explorer, and the Quarto template pack) are released and usable now per §3.4. See https://athdgc.github.io/tools.html.
6. Acknowledgements
Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan. Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.
Funded project: CVL-CDSAML - A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.
Team: Prof. Nikolaos Lavidas (PI, NKUA), Prof. Emerita Kiki Nikiforidou (NKUA; Co-Editor, Genres and Influential Texts volume), Prof. Dag Haug (Oslo, PROIEL Project Director), Prof. Leonid Kulikov (Ghent; Diachronic typology, valency questionnaires), Dr. Vassiliki Geka (NKUA; Post-Doctoral Researcher; Co-Editor, Genres and Influential Texts volume), Dr. Vassileios Symeonidis (NKUA; Post-Doctoral Researcher), Dr. Theodoros Michalareas (NKUA; Post-Doctoral Researcher), Sofia Chionidi (NKUA; PhD Candidate), Anastasia Tsiropina (NKUA; PhD Candidate), Eleni Plakoutsi (NKUA; PhD Candidate), Evangelos Argyropoulos (NKUA; Research Assistant); and the Athens Digital Glossa Chronos Research Network as collective author.