AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

Platform launch report - GlossaContactLab Working Papers, NKUA digital edition

Author: Nikolaos Lavidas

Affiliation: National and Kapodistrian University of Athens (NKUA), Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy

Abstract

AthDGC (“Athens-PROIEL”) is an open, end-to-end pipeline and dataset that provides the first continuously updated, dependency-parsed treebank of the entire Greek language - from Homeric and Archaic texts to Modern Greek - aligned at verse level, through the New Testament, to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC follows the PROIEL XML 2.0 schema, uses the Stanford Stanza PROIEL-trained pipeline for annotation, and applies LaBSE sentence embeddings for sentence-level alignment and multilingual-BERT attention for word-level alignment. As of v0.4, the corpus contains 89.9M total rows, 4.08M annotated Greek rows, and 6,861 NT-aligned Greek verses cross-aligned to four sister IE witnesses.

Keywords

diachronic linguistics, ancient Greek, Byzantine Greek, Modern Greek, Indo-European, PROIEL, dependency treebank, cross-lingual alignment, argument structure, retranslation

1. Overview

1.1 Context and motivation

Greek is the longest continuously attested member of the Indo-European family. The PROIEL Treebank (Haug and Jøhndal 2008; Eckhoff et al. 2018) established the gold standard for syntactically annotated parallel Indo-European Bible corpora, but its scope is restricted to the New Testament and to a small number of language witnesses. No openly licensed, end-to-end pipeline currently covers the full diachronic record of Greek at the same level of annotation, with explicit argument-structure tagging, and with reproducible cross-lingual alignment to its sister IE languages. AthDGC closes that gap.

The project's specific scholarly focus is retranslation: the same canonical text - the Iliad, the New Testament, the Septuagint Psalms, classical historiography - is re-rendered into Greek across periods (Homeric → Koine → Byzantine → Modern) and into sister languages (Latin, Gothic, OCS, Armenian) across the Indo-European family.

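1.2 Repository location

The source code is hosted at https://github.com/AthDGC/Diachronic-Linguistics-Platform, and the public site is at https://athdgc.github.io.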

1.3 Institutional context

AthDGC is developed at the National and Kapodistrian University of Athens (NKUA), Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy, under the direction of Prof. Nikolaos Lavidas. Compute is supplied by GRNET ARIS, allocation pa260305.

2. Method

2.1 Discovery

Each day, the discovery stage probes open-access repositories - archive.org, Perseus Digital Library, First1K Greek, Wikisource, the Diorisis Corpus, OpenGreekAndLatin/PerseusDL - for new or updated Greek and parallel-language source material.

2.2 Filtering

Candidate texts are filtered by Greek-script ratio (at least 75% of alphabetic characters in the Greek Unicode blocks), a Path-B line filter for bilingual editorial apparatus, and content-hash deduplication.
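
The script-ratio test and the deduplication step can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, the "Greek Unicode blocks" are taken to be Greek and Coptic plus Greek Extended, and hashing whitespace-normalised NFC text is an assumption. The Path-B line filter is not specified in enough detail here to sketch.

```python
import hashlib
import unicodedata

# Assumed reading of "Greek Unicode blocks":
# Greek and Coptic (U+0370-U+03FF) and Greek Extended (U+1F00-U+1FFF).
GREEK_BLOCKS = ((0x0370, 0x03FF), (0x1F00, 0x1FFF))


def greek_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Greek blocks."""
    # NFC composes base letter + combining accents into precomposed letters.
    letters = [ch for ch in unicodedata.normalize("NFC", text) if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        any(lo <= ord(ch) <= hi for lo, hi in GREEK_BLOCKS) for ch in letters
    )
    return greek / len(letters)


def content_hash(text: str) -> str:
    """SHA-256 of the text after NFC and whitespace normalisation (assumption)."""
    canonical = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_candidates(texts, min_ratio=0.75):
    """Keep texts that reach the Greek-script ratio and are not duplicates."""
    seen = set()
    kept = []
    for text in texts:
        if greek_ratio(text) < min_ratio:
            continue
        digest = content_hash(text)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Normalising before hashing means that two copies of a text differing only in spacing or in composed versus decomposed accents count as duplicates; a stricter byte-level hash would treat them as distinct.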

2.3 Conversion

Surviving text is converted into the PROIEL XML 2.0 schema with sentence-level structure.
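
As a rough illustration of the conversion step, the sketch below wraps plain text in a sentence-level XML skeleton. The element and attribute names (proiel, source, div, sentence, token, form, status) follow my reading of the public PROIEL XML format and cover only a small subset of it; they should be checked against the schema. The sentence splitter and the function names are hypothetical simplifications.

```python
import re
import xml.etree.ElementTree as ET

# Naive splitter: break after a full stop, Greek question mark (;) or raised dot.
SENTENCE_END = re.compile(r"(?<=[.;·])\s+")


def split_sentences(text):
    """Split text into sentence strings (simplified, punctuation-based)."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def to_proiel_xml(source_id, language, title, text):
    """Build a minimal PROIEL-style tree with unannotated sentences."""
    root = ET.Element("proiel", {"schema-version": "2.0"})
    source = ET.SubElement(root, "source", {"id": source_id, "language": language})
    ET.SubElement(source, "title").text = title
    div = ET.SubElement(source, "div")
    ET.SubElement(div, "title").text = title
    token_id = 0
    for sent_id, sentence in enumerate(split_sentences(text), start=1):
        sent_el = ET.SubElement(
            div, "sentence", {"id": str(sent_id), "status": "unannotated"}
        )
        # Punctuation is left attached to the form for brevity.
        for form in sentence.split():
            token_id += 1
            ET.SubElement(sent_el, "token", {"id": str(token_id), "form": form})
    return root


if __name__ == "__main__":
    tree = to_proiel_xml("john", "grc", "John 1", "Ἐν ἀρχῇ ἦν ὁ λόγος. καὶ θεὸς ἦν ὁ λόγος.")
    print(ET.tostring(tree, encoding="unicode"))
```

Lemma, part-of-speech, morphology, head, and relation attributes are left out on purpose, since they belong to the annotation stage.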

2.4 Annotation

Annotation is performed sentence by sentence with the Stanford Stanza pipeline (Qi et al. 2020) using the PROIEL-trained model grc_proiel. Analogous PROIEL-trained Stanza models are used for Latin (la_proiel), Old Church Slavonic (cu_proiel), and Gothic (got_proiel); Classical Armenian is annotated through a PROIEL-style pipeline currently under development. All output is normalised to the PROIEL XML 2.0 schema; no CoNLL-U or other format is published.
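
A minimal sketch of this step, assuming Stanza is installed and the Ancient Greek PROIEL package has been downloaded (for example with stanza.download("grc", package="proiel")). The helper names are hypothetical. The rows keep the labels exactly as Stanza emits them; the mapping into PROIEL XML 2.0 attributes is a separate export step and is not shown.

```python
def load_pipeline(lang="grc", package="proiel"):
    """Build a Stanza pipeline for the given language and package."""
    # Imported lazily so the flattening helper below works without Stanza.
    import stanza

    return stanza.Pipeline(
        lang=lang, package=package, processors="tokenize,pos,lemma,depparse"
    )


def rows_from_sentences(sentences):
    """Flatten Stanza's Document.to_dict() output into one row per word.

    `sentences` is a list of sentences, each a list of token dictionaries.
    """
    rows = []
    for sent_index, sentence in enumerate(sentences, start=1):
        for tok in sentence:
            rows.append(
                {
                    "sentence": sent_index,
                    "id": tok.get("id"),
                    "form": tok.get("text"),
                    "lemma": tok.get("lemma"),
                    "upos": tok.get("upos"),
                    "xpos": tok.get("xpos"),
                    "feats": tok.get("feats"),
                    "head": tok.get("head"),
                    "deprel": tok.get("deprel"),
                }
            )
    return rows


def annotate(text, nlp):
    """Parse text with a loaded pipeline and return flat rows."""
    return rows_from_sentences(nlp(text).to_dict())
```

Typical use would be nlp = load_pipeline() followed by annotate(text, nlp); the same helper applies to the Latin, Old Church Slavonic, and Gothic packages by changing the lang argument.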

2.5 Argument-structure capture

Beyond standard dependency annotation, AthDGC extracts an explicit argument-structure frame for every verb token. The frame uses the PROIEL relation inventory strictly: the subject (sub, including the raised patterns xobj and nonsub), the direct object (obj), indirect and oblique arguments (iobj, obl), and the vocative addressee (voc). Alongside these relations, each frame records the verb's voice (active / middle / passive) and aspect (perfective / imperfective).
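
The frame extraction can be illustrated with a short sketch. It assumes tokens arrive as dictionaries with hypothetical field names (id, head, form, pos, relation, voice, aspect) and that verbs carry the PROIEL part-of-speech tag V-. It is not the logic of tools/arg_structure.py, which is not reproduced here.

```python
from collections import defaultdict

# Relation labels listed in the section text.
ARGUMENT_RELATIONS = ("sub", "xobj", "nonsub", "obj", "iobj", "obl", "voc")


def extract_frames(tokens):
    """Return one frame per verb: its argument slots plus voice and aspect."""
    dependents = defaultdict(list)
    for tok in tokens:
        if tok["relation"] in ARGUMENT_RELATIONS:
            dependents[tok["head"]].append(tok)

    frames = []
    for tok in tokens:
        if tok["pos"] != "V-":
            continue
        slots = defaultdict(list)
        for dep in dependents.get(tok["id"], []):
            slots[dep["relation"]].append(dep["form"])
        frames.append(
            {
                "verb": tok["form"],
                "voice": tok.get("voice"),
                "aspect": tok.get("aspect"),
                "slots": dict(slots),
            }
        )
    return frames


# Hand-built toy input for the opening words of Iliad 1.1.
ILIAD_1_1 = [
    {"id": 1, "form": "μῆνιν", "pos": "Nb", "relation": "obj", "head": 2},
    {"id": 2, "form": "ἄειδε", "pos": "V-", "relation": "pred", "head": 0,
     "voice": "active", "aspect": "imperfective"},
    {"id": 3, "form": "θεά", "pos": "Nb", "relation": "voc", "head": 2},
]

if __name__ == "__main__":
    print(extract_frames(ILIAD_1_1))
```

For the toy input the single frame has an obj slot and a voc slot, which corresponds to the [obj, voc] pattern reported for the Iliad chain.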

2.6 Cross-lingual alignment

Sentence-level alignment uses LaBSE embeddings (Feng et al. 2022); word-level alignment uses multilingual-BERT attention via the awesome-align procedure (Dou and Neubig 2021). Phonetic cognate scoring uses ASJP sound-class encoding and LingPy edit distance (List 2014).
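
Once sentence embeddings are available, one common way to pair sentences is mutual best match on cosine similarity. The sketch below shows that step only, in plain Python on toy vectors; in practice the vectors would come from a LaBSE encoder. The matching rule, the function names, and the 0.5 threshold are assumptions and may differ from what the platform does.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_best_matches(src_vecs, tgt_vecs, threshold=0.5):
    """Pair source i with target j when each is the other's best match."""
    if not src_vecs or not tgt_vecs:
        return []
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=row.__getitem__) for row in sims]
    # Best source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    return [
        (i, j, sims[i][j])
        for i, j in enumerate(best_tgt)
        if best_src[j] == i and sims[i][j] >= threshold
    ]
```

Requiring agreement in both directions trades recall for precision: sentences without a clear counterpart stay unaligned instead of being forced onto a weak match.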

3. Dataset description

| Field | Value |
| --- | --- |
| Object name | AthDGC corpus, v0.4 release |
| Format | PROIEL XML 2.0 (primary); JSONL partitions; Neo4j alignment-graph dump; Qdrant vector store |
| Creation dates | 2025-09 onwards; v0.4.0 minted 29 May 2026 |
| Dataset creator | Lavidas, N., Nikiforidou, K., Haug, D., Kulikov, L., Geka, V., Symeonidis, V., Michalareas, T., Chionidi, S., Tsiropina, A., Plakoutsi, E., and Argyropoulos, E. |
| Languages | grc (Ancient Greek), gkm (Medieval Greek), ell (Modern Greek) + parallels in lat, got, chu, xcl; queued for v0.7: san, ang, ave, peo, ukr |
| Licence | Code Apache-2.0; metadata + alignments CC-BY-4.0; per-source raw text under its original licence |
| Repository | https://github.com/AthDGC/Diachronic-Linguistics-Platform |
| Concept DOI | 10.5281/zenodo.20439182 |
| Publication date | 2026-05-29 (v0.4.0 source-code snapshot on Zenodo) |
| Release status | v0.4 is samples-only on the public site; full annotated partitions are under audit on GRNET ARIS and will ship at v0.5 |

3.1 Open-access source provenance

Every primary source text in AthDGC is open-access (public domain, CC-BY, CC-BY-SA, or equivalent). The annotation layer is AthDGC-original and released under CC-BY-4.0. The open-access chain is preserved from input through annotation to distribution.

Greek - per-period source map

| Period | Source archive | Licence |
| --- | --- | --- |
| Archaic, Classical, Hellenistic | Perseus Digital Library; Open Greek and Latin / First Thousand Years of Greek (Leipzig) | CC-BY-SA 4.0 |
| Koine - NT | SBL Greek NT; Tischendorf (1869); Westcott-Hort (1881) | SBL licence; public domain |
| Koine - LXX | Rahlfs (1935) via openscriptures.org | public domain |
| Koine - documentary | Papyri.info / DDbDP | CC-BY 3.0 |
| Late Antique, Byzantine | Patrologia Graeca via Documenta Catholica Omnia + Internet Archive | public domain |
| Byzantine, Late Byzantine | Bibliotheca Augustana; pre-1928 Teubner | public domain |
| Early Modern | Anemi (UoC); Wikisource el | public domain |
| Modern (19th c. - 1928) | Wikisource el; Anemi | public domain |
| Modern (post-1928) | publisher-licensed editions; not republished | in copyright |

IE parallels - per-language source map

| Language | Source archive | Licence |
| --- | --- | --- |
| Latin (Vulgate) | Clementine Vulgate via Vulsearch; Latin Library | public domain |
| Gothic (Wulfila) | Wulfila Project (University of Antwerp) | CC-BY-SA |
| OCS (Marianus) | TITUS (Frankfurt) | academic open access |
| Classical Armenian | TITUS; Digilib Armenian | academic open access |
| Sanskrit (Brahmana, Upanisadic) | GRETIL (Goettingen); SARIT; TITUS | academic open access |
| Old English (Wessex Gospels) | TEAMS; Bosworth-Toller; DOE extracts | public domain + CC-BY-SA |
| Avestan (Yasna, Yashts) | TITUS; Geldner (1886-95) | public domain + open access |
| Old Persian (Behistun) | TITUS; Kent (1953) PD transliteration | public domain |
| Ukrainian (Ostroh 1581) | National Library of Ukraine facsimile | public domain |
| Ukrainian (20th-c. rev.) | Ohienko (1962); Khomenko (1963) | in copyright |

In-copyright editions are never republished verbatim. AthDGC uses an open-access antecedent (e.g. pre-1928 Teubner; SBL GNT instead of Nestle-Aland) or, where unavoidable, short quotation samples under fair use with full attribution. The annotation layer is always AthDGC-original.

3.2 IE parallels covered

| Language | Family | Period | Stanza model | Status | NT-aligned verses (v0.4) |
| --- | --- | --- | --- | --- | --- |
| Greek (Koine) | IE / Hellenic | 1st-2nd c. AD | grc_proiel | annotated | 7,956 |
| Latin (Vulgate) | IE / Italic | 4th c. AD | la_proiel | annotated | 7,956 |
| Gothic (Wulfila) | IE / Germanic E | 4th c. AD | got_proiel | annotated | 3,512 |
| OCS (Marianus) | IE / Slavic | 10th c. AD | cu_proiel | sampled | 6,861 |
| Classical Armenian | IE / Armenian | 5th c. AD | xcl_proiel (in dev.) | ingestion | 0 (v0.5) |
| Sanskrit (Brahmana, Upanisadic) | IE / Indo-Iranian | 1000-500 BC | sa_vedic (queued) | queued (v0.7) | 0 |
| Old English (Wessex Gospels) | IE / Germanic W | 10th c. AD | ang_proiel (queued) | queued (v0.7) | 0 |
| Avestan (Yasna, Yashts) | IE / Indo-Iranian | 1000-500 BC | ae_proiel (queued) | queued (v0.7) | 0 |
| Old Persian (Behistun) | IE / Indo-Iranian | 6th-5th c. BC | peo_proiel (queued) | queued (v0.7) | 0 |
| Ukrainian (Ostroh + 20th-c. rev.) | IE / Slavic E | 1581 + 1962 | uk_dep adapted (queued) | queued (v0.7) | 0 |

3.3 Retelling and retranslation chains

Five chains are public on the Samples page. Each renders the same canonical passage across periods (retelling, retranslation within Greek) or across languages (cross-IE retranslation), with one PROIEL tree per node and a frame-stability gloss.

| Chain | Type | Nodes | What it shows |
| --- | --- | --- | --- |
| Iliad 1.1 across reception | retelling | Homer (8th c. BC); Tzetzes (12th c.); Kazantzakis/Kakridis (1955) | [obj:acc, voc:voc] stable across 2,800 years; collapses only under Tzetzes' narrative reframing |
| LXX Psalm 1:1 across periods | retranslation | LXX (3rd c. BC); Byzantine (10th c.); Modern Greek (post-1976) | pred-nom + sub + atr.rel-cl stable; voice flips twice; οὐκ -> οὐ -> δεν |
| NT John 1:1 patristic + Modern reception | retelling | John (1st c.); Chrysostom (4th c.); John of Damascus (8th c.); Modern liturgical (post-1976) | copular skeleton stable 2,000 years; patristic nodes nest under xcomp of φησίν |
| Plato Apology 17a across reception | retelling | Plato (4th c. BC); Olympiodorus (6th c.); Modern Greek (post-1976) | pred + adv:neg + comp[sub, obl] preserved; perfect aspect synthetic vs analytic |
| LXX Genesis 1:1 across IE languages | retranslation, cross-IE | LXX Greek; Vulgate Latin; Wulfila Gothic; OCS Marianus | pred + sub + obl/adv + obj + coord stable across all four; only obl-vs-adv diverges in OCS |
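
One simple way to derive a frame-stability gloss is to intersect the slot sets of all nodes in a chain and report what each node has beyond that core. The sketch below is a hypothetical illustration of this idea, not the project's procedure. The toy input mirrors the Genesis 1:1 chain as I read the description above (OCS showing adv where the other witnesses show obl); it is not corpus data.

```python
def frame_stability(chain):
    """Split slot labels into those shared by every node and node-specific ones.

    `chain` is a list of (node_name, iterable_of_slot_labels) pairs.
    """
    slot_sets = [set(slots) for _, slots in chain]
    if not slot_sets:
        return {"stable": [], "divergent": {}}
    stable = set.intersection(*slot_sets)
    divergent = {}
    for name, slots in chain:
        extra = set(slots) - stable
        if extra:
            divergent[name] = sorted(extra)
    return {"stable": sorted(stable), "divergent": divergent}


# Illustrative input only (assumed reading of the Genesis 1:1 chain).
GENESIS_1_1 = [
    ("LXX Greek", {"pred", "sub", "obl", "obj", "coord"}),
    ("Vulgate Latin", {"pred", "sub", "obl", "obj", "coord"}),
    ("Wulfila Gothic", {"pred", "sub", "obl", "obj", "coord"}),
    ("OCS Marianus", {"pred", "sub", "adv", "obj", "coord"}),
]

if __name__ == "__main__":
    print(frame_stability(GENESIS_1_1))
```

A set intersection ignores word order and slot multiplicity, so a real gloss would likely need a richer comparison than this.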

3.4 Tools (open source)

AthDGC ships fourteen modules under OSI-approved licences. The annotation layer plus the corpus samples are CC-BY-4.0; tool code is Apache-2.0 (most modules), MIT (Quarto pack + Neo4j viewer), BSD-3-Clause (upstream LightSIDE, retained on the patch set), and GPL-3.0 (upstream CWB, retained on the index builder).

| Module | Licence |
| --- | --- |
| tools/lightside-athdgc/ (LightSIDE for PROIEL arcs + frames + morphology) | BSD-3-Clause + Apache-2.0 |
| tools/athdgc_to_lightside_syntax.py, tools/athdgc_to_lightside.py | Apache-2.0 |
| tools/proiel_export/ (Stanza -> PROIEL XML 2.0) | Apache-2.0 |
| tools/proiel_validate.py (v0.5 schema linter) | Apache-2.0 |
| tools/arg_structure.py, tools/fix_corpus_data.py, tools/valency_db.py | Apache-2.0 |
| tools/stanza-finetune/ (diachronic Stanza checkpoints) | Apache-2.0 |
| tools/build_cwb_index.sh (NoSketch-style CWB index) | GPL-3.0 |
| tools/cwb-recipes/ (sample CQL queries) | CC-BY-4.0 |
| tools/showcase/ (51_build_showcase_site.py) | Apache-2.0 |
| tools/neo4j-align-viewer/ (cross-lingual graph viewer) | MIT |
| tools/retranslation-browser/ (v0.5) | Apache-2.0 |
| tools/retelling-explorer/ (v0.5) | Apache-2.0 |
| athdgc-quarto/ (this Working Paper + the public site) | MIT |
| Stanza checkpoints athdgc/grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel (HF mirror v0.5) | Apache-2.0 |

Two automated gates run on every community PR: a PROIEL XML 2.0 schema validator (rejecting UD-style labels such as nsubj, dobj, nmod, case) and a Lavidasised house-style check (rejecting AI-marker vocabulary and em-dashes).
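
The two gates can be approximated in a few lines. This is a sketch under assumptions, not the code of tools/proiel_validate.py: it checks only the relation attribute of token elements against the four UD-style labels named above, and the house-style vocabulary list is left as a parameter because it is not given here. Function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# UD-style labels named in the text as grounds for rejection.
UD_STYLE_LABELS = {"nsubj", "dobj", "nmod", "case"}
EM_DASH = "\u2014"


def find_ud_labels(xml_text):
    """Return (token id, relation) for every token carrying a UD-style label."""
    root = ET.fromstring(xml_text)
    return [
        (tok.get("id"), tok.get("relation"))
        for tok in root.iter("token")
        if tok.get("relation") in UD_STYLE_LABELS
    ]


def house_style_problems(text, banned_words=()):
    """List house-style violations: em-dashes and any caller-supplied words."""
    problems = []
    if EM_DASH in text:
        problems.append("em-dash")
    lowered = text.lower()
    problems.extend(
        f"banned word: {word}" for word in banned_words if word.lower() in lowered
    )
    return problems


def passes_gates(xml_text, prose, banned_words=()):
    """True when neither gate reports a problem."""
    return not find_ud_labels(xml_text) and not house_style_problems(
        prose, banned_words
    )
```

A full validator would also check the document against the PROIEL XML 2.0 schema itself; the label check above catches only the specific mislabelling the gate is described as rejecting.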

4. Reuse potential

AthDGC supports four primary reuse classes:

  1. Historical syntactic research. Per-period dependency parses allow analysts to track argument-structure changes across three millennia of Greek (active-to-mediopassive shifts, loss of optative, rise of periphrastic perfect, accusative-to-genitive variation under verbs of perception, etc.). The colour-coded compact dependency overview on each sample is designed for fast scanning of these patterns.

  2. Indo-European comparative syntax. The verse-level cross-lingual alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian provides aligned PROIEL parses across four sister IE languages, with Sanskrit, Old English, Avestan, Old Persian, and Ukrainian queued at v0.7. Users can query the Neo4j alignment graph for typological correspondences (e.g. every Greek aorist active verb whose Latin Vulgate counterpart is passive).

  3. Computational diachronic NLP. The fine-tuned Stanza checkpoints (grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel) are trained on ARIS; the Hugging Face mirror under athdgc/* is in setup and ships at v0.5. The checkpoints can be loaded directly into Stanza for parsing new Greek diachronic data once published. The LightSIDE-AthDGC fork extends LightSIDE to work on syntactic features (dependency arcs, argument-structure frames, morphology bundles), opening a new feature space for classification of diachronic stages.

  4. Digital editions + pedagogy. The showcase generator (51_build_showcase_site.py) produces browsable per-sample PROIEL trees from any JSONL corpus; the Quarto template pack generates the whole multi-output site (HTML, Reveal.js slides, .docx, .pptx, Beamer, A0 poster) from a single source. Both are open-source and reusable by other historical-treebank projects.
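
The typological query mentioned under reuse class 2 (Greek aorist active verbs whose Vulgate counterpart is passive) can be illustrated outside the graph database. The sketch below filters aligned verb pairs held as plain dictionaries; the record layout and field names are hypothetical and do not describe the actual Neo4j graph schema or the JSONL partitions, and the example records are placeholders, not corpus data.

```python
from collections import Counter


def aorist_active_to_passive(pairs):
    """Select pairs where Greek is aorist active and Latin is passive."""
    return [
        pair
        for pair in pairs
        if pair["grc"]["tense"] == "aorist"
        and pair["grc"]["voice"] == "active"
        and pair["lat"]["voice"] == "passive"
    ]


def count_by_greek_lemma(pairs):
    """Count matching pairs per Greek lemma, most frequent first."""
    return Counter(pair["grc"]["lemma"] for pair in pairs).most_common()


# Placeholder records (assumed layout: one aligned verb pair per verse).
TOY_PAIRS = [
    {"verse": "v1", "grc": {"lemma": "lemma_a", "tense": "aorist", "voice": "active"},
     "lat": {"lemma": "lemma_x", "voice": "passive"}},
    {"verse": "v2", "grc": {"lemma": "lemma_a", "tense": "aorist", "voice": "active"},
     "lat": {"lemma": "lemma_y", "voice": "active"}},
    {"verse": "v3", "grc": {"lemma": "lemma_b", "tense": "present", "voice": "active"},
     "lat": {"lemma": "lemma_z", "voice": "passive"}},
]

if __name__ == "__main__":
    hits = aorist_active_to_passive(TOY_PAIRS)
    print([hit["verse"] for hit in hits])
    print(count_by_greek_lemma(hits))
```

The same condition would translate into a pattern match over aligned verb nodes in the graph; the Python form is shown only because it does not depend on the graph's labels.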

The platform is taught as a KEDIVIM continuing-education course at NKUA: Ψηφιακά Εργαλεία για τη Διαχρονική Ανάλυση της Γλώσσας / Digital Tools for the Diachronic Analysis of Language, autumn 2026. See https://athdgc.github.io/training.html.

5. Access + format

The public site at https://athdgc.github.io ships curated samples only for the v0.4 release. The full annotated corpus partitions remain under audit on the GRNET ARIS national HPC under allocation pa260305; they will ship at v0.5 as a separate Zenodo dataset record under CC-BY-4.0. The source-code snapshot at the v0.4.0 Zenodo record is Apache-2.0.

The platform's fourteen open-source tools (LightSIDE-AthDGC for PROIEL syntactic features, LightSIDE-compatible text-feature export, NoSketch-style concordancer, fine-tuned diachronic Stanza checkpoints, PROIEL XML 2.0 exporter, corpus-fix toolkit, argument-structure extractor, cross-lingual Neo4j alignment viewer, showcase generator, v0.5 PROIEL XML 2.0 validator, v0.5 valency-frame database client, v0.5 retranslation-pair browser, v0.5 retelling-chain explorer, and the Quarto template pack) are released and usable now per §3.4. See https://athdgc.github.io/tools.html.

6. Acknowledgements

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan. Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.

Funded project: CVL-CDSAML - A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.

Team: Prof. Nikolaos Lavidas (PI, NKUA), Prof. Emerita Kiki Nikiforidou (NKUA; Co-Editor, Genres and Influential Texts volume), Prof. Dag Haug (Oslo, PROIEL Project Director), Prof. Leonid Kulikov (Ghent; Diachronic typology, valency questionnaires), Dr. Vassiliki Geka (NKUA; Post-Doctoral Researcher; Co-Editor, Genres and Influential Texts volume), Dr. Vassileios Symeonidis (NKUA; Post-Doctoral Researcher), Dr. Theodoros Michalareas (NKUA; Post-Doctoral Researcher), Sofia Chionidi (NKUA; PhD Candidate), Anastasia Tsiropina (NKUA; PhD Candidate), Eleni Plakoutsi (NKUA; PhD Candidate), Evangelos Argyropoulos (NKUA; Research Assistant); and the Athens Digital Glossa Chronos Research Network as collective author.

References

Dou, Zi-Yi, and Graham Neubig. 2021. “Word Alignment by Fine-Tuning Embeddings on Parallel Corpora.” Proceedings of EACL 2021.
Eckhoff, Hanne M., Kristin Bech, Gerlof Bouma, et al. 2018. “The PROIEL Treebank Family: A Standard for Early Attestations of Indo-European Languages.” Language Resources and Evaluation 52 (1): 29–65.
Feng, Fangxiaoyu, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. “Language-Agnostic BERT Sentence Embedding.” Proceedings of ACL 2022.
Haug, Dag T. T., and Marius Jøhndal. 2008. “Creating a Parallel Treebank of the Old Indo-European Bible Translations.” Proceedings of the Second Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2008).
List, Johann-Mattis. 2014. Sequence Comparison in Historical Linguistics. Düsseldorf University Press.
Qi, Peng, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. “Stanza: A Python Natural Language Processing Toolkit for Many Human Languages.” Proceedings of ACL 2020.