AthDGC

A platform of computational tools and a diachronic Greek corpus, with Indo-European parallels

AthDGC (Athens Diachronic Glossa Chronos / Διαχρονία Γλώσσας :Χρόνος) is a PROIEL-style platform: open computational tools for diachronic linguistics plus an open dependency treebank spanning Greek diachrony from Homeric to Modern, with Indo-European parallels.

What AthDGC is

The Athens Diachronic Glossa Chronos (AthDGC, Athens-PROIEL; Διαχρονία Γλώσσας :Χρόνος) is a platform, not a single resource. It ships two open deliverables side by side:

  1. an open-source computational toolkit for diachronic linguistics - LightSIDE-compatible feature-extraction + classification pipelines, a NoSketch-style concordancer, fine-tuned Stanza checkpoints, a per-verb argument-structure extractor, a cross-lingual alignment viewer, a corpus-fix toolkit, a showcase generator, and the Quarto pack that builds this very site. See the Tools page.

  2. an open PROIEL-style dependency treebank spanning Greek diachrony (Homeric, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern) with Indo-European parallels to Latin (Vulgate), Gothic (Wulfila), and Old Church Slavonic (Marianus) via verse-level cross-lingual alignment of the New Testament. Classical Armenian and Sanskrit ingestion are in progress; Ukrainian, Old English, Avestan, and Old Persian are queued for v0.7. See the Samples page for representative sentences.

The platform's research focus is retranslation, retelling, influential texts, and argument structure. The tools are usable on any PROIEL-style historical treebank, not only on the AthDGC corpus - so other research groups can adopt them without depending on us.
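
As a rough illustration of what per-verb argument-structure extraction over a PROIEL-style treebank can look like, the sketch below reads PROIEL XML tokens and collects the lemmas of each verb's direct dependents by relation. It is not the AthDGC extractor: the sample sentence, the function name extract_frames, the restriction to the sub/obj/obl relation labels, and the reading of voice from the fifth position of the morphology tag are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical four-token sentence in PROIEL-style XML (illustration only).
SAMPLE = """<proiel>
  <source id="demo" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="2" relation="aux"/>
        <token id="2" form="ἄνθρωπος" lemma="ἄνθρωπος" part-of-speech="Nb" head-id="3" relation="sub"/>
        <token id="3" form="γράφει" lemma="γράφω" part-of-speech="V-" morphology="3spia----i" relation="pred"/>
        <token id="4" form="ἐπιστολήν" lemma="ἐπιστολή" part-of-speech="Nb" head-id="3" relation="obj"/>
      </sentence>
    </div>
  </source>
</proiel>"""

# Relations treated as arguments in this sketch (an assumption, not a full inventory).
ARG_RELATIONS = ("sub", "obj", "obl")


def extract_frames(xml_text):
    """Return one frame per verb token: lemma, voice, and argument lemmas."""
    root = ET.fromstring(xml_text)
    frames = []
    for sentence in root.iter("sentence"):
        tokens = sentence.findall("token")
        # Index dependents by the id of their head; root tokens have no head-id.
        dependents = defaultdict(list)
        for tok in tokens:
            head = tok.get("head-id")
            if head is not None:
                dependents[head].append(tok)
        for tok in tokens:
            if not tok.get("part-of-speech", "").startswith("V"):
                continue
            morph = tok.get("morphology", "")
            frame = {
                "verb": tok.get("lemma"),
                # Assumed positional tag: voice is the fifth character.
                "voice": morph[4] if len(morph) > 4 else "-",
                "sub": [],
                "obj": [],
                "obl": [],
            }
            for dep in dependents.get(tok.get("id"), []):
                rel = dep.get("relation")
                if rel in ARG_RELATIONS:
                    frame[rel].append(dep.get("lemma"))
            frames.append(frame)
    return frames


if __name__ == "__main__":
    for frame in extract_frames(SAMPLE):
        print(frame)
```

Because the sketch only looks at token attributes (lemma, part-of-speech, morphology, head-id, relation), the same function would apply to any treebank serialised in that shape, which is the sense in which such tools are corpus-independent.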

Cite this work. Lavidas, N., Nikiforidou, K., Haug, D., Kulikov, L., Geka, V., Symeonidis, V., Michalareas, T., Chionidi, S., Tsiropina, A., Plakoutsi, E., Argyropoulos, E., and the Athens Digital Glossa Chronos Research Network (2026). AthDGC: Athens Diachronic Glossa Chronos. Zenodo. 10.5281/zenodo.20439182.

Train with us. The platform is taught as a KEDIVIM continuing-education course at NKUA in autumn 2026: Digital Tools for the Diachronic Analysis of Language / Ψηφιακά Εργαλεία για τη Διαχρονική Ανάλυση της Γλώσσας. See the Training page.

Current state (v0.4)

Field                        Value
Total corpus rows            89.9 M
Annotated Greek rows         4.08 M
NT-aligned Greek verses      6,861
Cross-aligned IE witnesses   4 (Latin, Gothic, OCS) + Classical Armenian in ingestion
Open-source tools shipped    14 (see Tools)
Concept DOI                  10.5281/zenodo.20439182
Repository                   https://github.com/AthDGC/Diachronic-Linguistics-Platform
Stanza checkpoints           forthcoming at https://huggingface.co/AthDGC (org page + model repos in setup; ship at v0.5)
Open-access source archives  Perseus, OGL/First1K, SBL GNT, Rahlfs LXX, Papyri.info, Patrologia Graeca, Bibliotheca Augustana, Anemi, Wikisource el; Wulfila Project, TITUS, GRETIL, SARIT, TEAMS, National Library of Ukraine
Launch report                GlossaContactLab Working Papers, NKUA digital edition

Release status (v0.4)

The platform is under continuous build, and v0.4 is not yet a wide release of the corpus. The Stanza annotation pass introduced errors in a non-trivial share of sentences (lemma mis-attribution, crasis splits, gender inheritance on indeclinables, period mis-tagging, among others), and the team is correcting these on ARIS before any wide release.

What this public site shows today is therefore a curated samples showcase drawn from the v0.4 partitions and rendered through the standard AthDGC PROIEL pipeline. The tools, by contrast, are released and usable now from the public source-code repo; the Hugging Face mirror of the Stanza checkpoints is still in setup and ships at v0.5. See the Samples and Tools pages.

The full corpus partitions will ship at v0.5, once the audit pass is complete. The Zenodo record above is the canonical citation for the project; the version DOIs (v0.4.0, v0.5.0, ...) track the release history.

What is in the public repository (and what is not)

The GitHub repository https://github.com/AthDGC/Diachronic-Linguistics-Platform is deliberately scoped. It is not a corpus dump.

The public repository ships:

  • the platform's source code (discovery, filtering, conversion, annotation, alignment, fix, showcase generator, Quarto template pack);
  • the open-source toolkit (LightSIDE-AthDGC for syntactic features, NoSketch-style concordancer, PROIEL XML 2.0 exporter, corpus-fix toolkit, argument-structure extractor, cross-lingual alignment viewer);
  • the fine-tuned Stanza checkpoints (grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel) trained on ARIS (the Hugging Face mirror under athdgc/* is in setup and ships at v0.5);
  • the curated showcase samples rendered on this site;
  • the Zenodo source-code snapshot under DOI 10.5281/zenodo.20439182.

The public repository does NOT ship:

  • the full v0.4 annotated corpus partitions (89.9 M corpus rows / 4.08 M annotated Greek rows). These remain under active audit on the GRNET ARIS national HPC under allocation pa260305 until the v0.5 release.
  • the raw JSONL partitions per period, the Qdrant vector store, or the Neo4j alignment graph dump. These will ship as a separate Zenodo dataset record at v0.5, under CC-BY-4.0, when the Stanza-introduced annotation errors have been corrected at scale.

This scoping is deliberate: we will not release annotated data with known errors. The tools are released and ready for reuse on any PROIEL-style historical treebank; the corpus ships when it is correct.

Method (in brief)

  1. Discovery - Daily harvest from archive.org, Perseus, First1K Greek, Wikisource, Diorisis, OpenGreekAndLatin
  2. Filtering - Greek-script ratio + apparatus-criticus rejection + content-hash deduplication
  3. Conversion - PROIEL XML 2.0 schema
  4. Annotation - Stanza grc_proiel (and la_proiel, cu_proiel, got_proiel for parallels)
  5. Argument structure - per-verb extraction of subject, object, oblique, voice, aspect
  6. Cross-lingual alignment - LaBSE sentence embedding + multilingual-BERT attention for word-level alignment
  7. Storage - PROIEL XML + JSONL partitions + Qdrant vector store + Neo4j graph
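
Step 2 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's code: the 0.8 threshold, the function names, the whitespace and NFC normalisation before hashing, and the choice of SHA-256 are all assumptions, and apparatus-criticus rejection is left out entirely.

```python
import hashlib
import unicodedata


def is_greek(ch):
    """True for code points in the Greek and Coptic or Greek Extended blocks.

    The first block also holds a few Coptic letters, so this is approximate.
    """
    cp = ord(ch)
    return 0x0370 <= cp <= 0x03FF or 0x1F00 <= cp <= 0x1FFF


def greek_ratio(text):
    """Share of alphabetic characters that are Greek (0.0 for empty input)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_greek(ch) for ch in letters) / len(letters)


def content_hash(text):
    """SHA-256 of the text after NFC normalisation and whitespace collapsing."""
    normalized = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_documents(docs, min_ratio=0.8):
    """Keep documents that are mostly Greek script and not seen before."""
    seen = set()
    kept = []
    for doc in docs:
        if greek_ratio(doc) < min_ratio:
            continue  # mostly non-Greek text
        digest = content_hash(doc)
        if digest in seen:
            continue  # duplicate content
        seen.add(digest)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        "μῆνιν ἄειδε θεὰ",
        "μῆνιν  ἄειδε   θεὰ",   # same text, different spacing
        "arma virumque cano",    # Latin script, rejected by the ratio test
        "ἐν ἀρχῇ ἦν ὁ λόγος",
    ]
    for doc in filter_documents(docs):
        print(doc)
```

Hashing a normalised form rather than the raw bytes means that copies of the same text differing only in spacing or in composed versus decomposed accents collapse to one entry; a real pipeline would likely need a more careful normalisation than this.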

Funding

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan.

Funded project: CVL-CDSAML - A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.

Hosted at the National and Kapodistrian University of Athens, Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy.

Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.