AthDGC

A platform of computational tools and a diachronic Greek corpus, with Indo-European parallels

AthDGC (Athens Diachronic Glossa Chronos / Διαχρονία Γλώσσας :Χρόνος) is a PROIEL-style platform: open computational tools for diachronic linguistics plus an open dependency treebank covering Greek across all its historical periods, with Indo-European parallels.

What AthDGC is

The Athens Diachronic Glossa Chronos (AthDGC, Athens-PROIEL; Διαχρονία Γλώσσας :Χρόνος) is a platform, not a single resource. PROIEL is the dependency-treebank standard for early Indo-European languages, developed at Oslo; a treebank is a collection of sentences whose grammatical structure has been analysed and stored. The platform ships two open deliverables side by side:

  1. an open-source computational toolkit for diachronic linguistics: feature-extraction and classification pipelines compatible with LightSIDE (an open-source text-mining workbench developed at Carnegie Mellon); a NoSketch-style concordancer (a search tool that shows every occurrence of a word in its surrounding context); fine-tuned Stanza checkpoints (Stanza is Stanford's open-source Python pipeline that automatically tags, lemmatises, and parses sentences; fine-tuning trains a model further on new data to adapt it to a particular text type; a checkpoint is a saved snapshot of a trained model); a per-verb argument-structure extractor; a cross-lingual alignment viewer (alignment matches corresponding words or sentences between texts in different languages); a corpus-fix toolkit; a showcase generator; and the Quarto pack that builds this very site (Quarto is an open-source publishing system that builds websites, slides, papers, and posters from a single source). See the Tools page.

  2. an open PROIEL-style dependency treebank spanning Greek diachrony (Homeric, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern) with Indo-European parallels to Latin (Vulgate), Gothic (Wulfila), and Old Church Slavonic (Marianus) via verse-level cross-lingual alignment of the New Testament. Classical Armenian and Sanskrit ingestion are in progress; Ukrainian, Old English, Avestan, and Old Persian are queued for v0.7. See the Samples page for representative sentences.
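Verse-level alignment can be pictured as a join on the verse reference. The sketch below is only an illustration under stated assumptions: the reference format ("JOHN 1.1"), the witness names, and the function `align_by_verse` are hypothetical and are not taken from the AthDGC codebase. It pairs each verse of a base witness with the same verse in the other witnesses and leaves a gap where a witness lacks it.

```python
from typing import Dict, List, Optional

def align_by_verse(
    base: Dict[str, str],
    others: Dict[str, Dict[str, str]],
) -> List[Dict[str, Optional[str]]]:
    """Join witnesses on a shared verse reference (hypothetical key format)."""
    rows = []
    for ref, text in base.items():
        row: Dict[str, Optional[str]] = {"ref": ref, "base": text}
        for name, verses in others.items():
            # A missing verse becomes None so that gaps stay visible.
            row[name] = verses.get(ref)
        rows.append(row)
    return rows

# Toy data: two Greek verses, only one of them present in the Latin witness.
greek = {
    "JOHN 1.1": "ἐν ἀρχῇ ἦν ὁ λόγος",
    "JOHN 1.2": "οὗτος ἦν ἐν ἀρχῇ πρὸς τὸν θεόν",
}
latin = {"JOHN 1.1": "in principio erat verbum"}

aligned = align_by_verse(greek, {"lat": latin})
for row in aligned:
    print(row["ref"], "|", row["base"], "|", row["lat"])
```

Keeping the gap as an explicit `None` rather than dropping the row makes lacunae in a witness easy to count later.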

The platform's research focus is retranslation, retelling, influential texts, and argument structure (the set of obligatory and optional partners a verb requires, e.g. subject, object, oblique). The tools are usable on any PROIEL-style historical treebank, not only on the AthDGC corpus - so other research groups can adopt them without depending on us.
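To make the notion of argument structure concrete, here is a minimal sketch of per-verb extraction from a PROIEL-style sentence. It is not the AthDGC extractor: the XML fragment is a simplified, hand-written example, the function name `extract_frames` is hypothetical, and only the relations `sub`, `obj`, and `obl` are collected (voice and aspect are left out).

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written PROIEL-style sentence (an assumption for illustration).
SAMPLE = """<sentence id="1">
  <token id="1" form="ἄνθρωπος" lemma="ἄνθρωπος" part-of-speech="Nb" head-id="2" relation="sub"/>
  <token id="2" form="γράφει" lemma="γράφω" part-of-speech="V-" relation="pred"/>
  <token id="3" form="ἐπιστολήν" lemma="ἐπιστολή" part-of-speech="Nb" head-id="2" relation="obj"/>
</sentence>"""

ARG_RELATIONS = ("sub", "obj", "obl")

def extract_frames(xml_text):
    """Return one frame per verb: its lemma plus the lemmas of its arguments."""
    root = ET.fromstring(xml_text)
    tokens = root.findall(".//token")
    by_id = {t.get("id"): t for t in tokens}
    frames = {}
    for tok in tokens:
        head = by_id.get(tok.get("head-id"))
        # Skip root tokens and dependents whose head is not a verb.
        if head is None or not (head.get("part-of-speech") or "").startswith("V"):
            continue
        rel = tok.get("relation")
        if rel in ARG_RELATIONS:
            frame = frames.setdefault(
                head.get("id"),
                {"verb": head.get("lemma"), "sub": [], "obj": [], "obl": []},
            )
            frame[rel].append(tok.get("lemma"))
    return list(frames.values())

frames = extract_frames(SAMPLE)
print(frames)
```

Because the function relies only on token attributes, the same approach should carry over to any treebank that exposes heads and relation labels in this way.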

Cite this work. Lavidas, N., Nikiforidou, K., Haug, D., Kulikov, L., Geka, V., Symeonidis, V., Michalareas, T., Chionidi, S., Tsiropina, A., Plakoutsi, E., and Argyropoulos, E. (2026). AthDGC: Athens Diachronic Glossa Chronos. Zenodo. DOI: 10.5281/zenodo.20439182. (Zenodo is an open research-data repository hosted at CERN that mints permanent DOIs.)

Train with us. The platform is taught as a KEDIVIM continuing-education course at NKUA in autumn 2026: Digital Tools for the Diachronic Analysis of Language / Ψηφιακά Εργαλεία για τη Διαχρονική Ανάλυση της Γλώσσας. See the Training page.

Current state (v0.4)

Field: Value
Greek periods covered: 8 (Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern)
Cross-aligned IE witnesses: 4 (Latin Vulgate, Gothic Wulfila, OCS Marianus, Classical Armenian in ingestion)
Per-witness verified verse counts: reported in v0.5 release notes (post-ARIS audit pass)
Open-source tools: tiered LIVE / IN SETUP / FORTHCOMING matrix on the Tools page
Concept DOI (a Digital Object Identifier that always points to the latest version of a deposited record): 10.5281/zenodo.20439182
Repository: https://github.com/AthDGC/Diachronic-Linguistics-Platform
Stanza checkpoints: forthcoming at https://huggingface.co/AthDGC (org page + model repos in setup; ship at v0.5)
Open-access source archives: Perseus, OGL/First1K, SBL GNT, Rahlfs LXX, Papyri.info, Patrologia Graeca, Bibliotheca Augustana, Anemi, Wikisource el; Wulfila Project, TITUS, GRETIL, SARIT, TEAMS, National Library of Ukraine
Launch report: GlossaContactLab Working Papers, NKUA digital edition

Release status (v0.4)

The platform is under continuous build. v0.4 is not yet a wide release of the corpus. The Stanford-Stanza pass introduced annotation errors in a non-trivial share of sentences (lemma mis-attribution, crasis splits, gender inheritance on indeclinables, period mis-tagging, etc.), and the team is correcting these on the ARIS side before any wide release.

What this public site shows today is therefore a curated samples showcase drawn from the v0.4 partitions and rendered through the standard AthDGC PROIEL pipeline. The tools, by contrast, are released and usable now from the public source-code repo. The fine-tuned Stanza checkpoints are also being mirrored on Hugging Face (a public hosting platform for machine-learning models and datasets), where the organisation page and model repos are in setup and ship at v0.5. See the Samples and Tools pages.

The full corpus partitions will ship at v0.5, once the audit pass is complete. The Zenodo record above is the canonical citation for the project; the version DOIs (v0.4.0, v0.5.0, ...) track the release history.

What is in the public repository (and what is not)

The GitHub repository https://github.com/AthDGC/Diachronic-Linguistics-Platform is deliberately scoped. It is not a corpus dump.

The public repository ships:

  • the platform's source code (discovery, filtering, conversion, annotation, alignment, fix, showcase generator, Quarto template pack);
  • the open-source toolkit (LightSIDE-AthDGC (the Lavidas-extension fork of LightSIDE that operates on syntactic features rather than only text features) for syntactic features, NoSketch-style concordancer, PROIEL XML 2.0 (the file format that stores each sentence as a tree of word-by-word grammatical relations, developed at Oslo) exporter, corpus-fix toolkit, argument-structure extractor, cross-lingual alignment viewer);
  • the fine-tuned Stanza checkpoints (grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel) trained on ARIS; the Hugging Face mirror under athdgc/* is in setup and ships at v0.5;
  • the curated showcase samples rendered on this site;
  • the Zenodo source-code snapshot under DOI 10.5281/zenodo.20439182.
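As an illustration of what a concordancer in the toolkit list does, the following is a minimal keyword-in-context (KWIC) sketch. It is not the AthDGC concordancer: the function `kwic` and its `window` parameter are hypothetical, and matching is on exact surface forms rather than lemmas.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, hit, right context) for every exact match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokens = "ἐν ἀρχῇ ἦν ὁ λόγος καὶ ὁ λόγος ἦν πρὸς τὸν θεόν".split()
hits = kwic(tokens, "λόγος", window=2)
for left, hit, right in hits:
    print(f"{left:>12} [{hit}] {right}")
```

A real tool would normally search lemmas or annotation layers as well; exact-form matching keeps the sketch short.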

The public repository does NOT ship:

  • the full v0.4 annotated corpus partitions. These remain under active audit on GRNET ARIS (the Greek national high-performance computing (HPC) cluster, i.e. a cluster of fast machines used for heavy computation, run by GRNET) under allocation pa260305 until the v0.5 release. Verified scale (total rows, annotated rows per period, per-witness aligned-verse counts) ships with the v0.5 release notes.
  • the raw JSONL (a plain-text data format that stores one record per line) partitions per period, the Qdrant vector store, or the Neo4j (a graph database that stores data as nodes and connections rather than as tables) alignment graph dump. These will ship as a separate Zenodo dataset record at v0.5, under CC-BY-4.0 (the Creative Commons Attribution 4.0 licence, which permits reuse with credit), when the Stanza-introduced annotation errors have been corrected at scale.
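For readers unfamiliar with JSONL, the sketch below shows how a partition of this kind could be read one record at a time. The field names (`period`, `sentence_id`, `tokens`) are assumptions made for illustration; the actual AthDGC record schema is not documented here and may differ.

```python
import io
import json
from collections import Counter

def count_sentences_by_period(lines):
    """Count records per period in a JSONL stream (one JSON object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record["period"]] += 1
    return counts

# Hypothetical records standing in for a real partition file.
records = [
    {"period": "Koine", "sentence_id": "s1", "tokens": ["ἐν", "ἀρχῇ"]},
    {"period": "Koine", "sentence_id": "s2", "tokens": ["ἦν", "ὁ", "λόγος"]},
    {"period": "Byzantine", "sentence_id": "s3", "tokens": ["καὶ"]},
]
stream = io.StringIO(
    "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"
)

counts = count_sentences_by_period(stream)
print(dict(counts))
```

Because each line is an independent record, a partition can be processed as a stream without loading the whole file into memory.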

This scoping is deliberate: we will not release annotated data with known errors. The tools are released and ready for reuse on any PROIEL-style historical treebank; the corpus ships when it is correct.

Method (in brief)

  1. Discovery - Daily harvest from archive.org, Perseus, First1K Greek, Wikisource, Diorisis, OpenGreekAndLatin
  2. Filtering - Greek-script ratio + apparatus-criticus rejection + content-hash deduplication
  3. Conversion - PROIEL XML 2.0 schema
  4. Annotation - Stanza grc_proiel (and la_proiel, cu_proiel, got_proiel for parallels)
  5. Argument structure - per-verb extraction of subject, object, oblique, voice, aspect
  6. Cross-lingual alignment - LaBSE (a multilingual sentence-embedding model that maps sentences from many languages into a common vector space) sentence embedding (a numeric vector that represents the meaning of a sentence) + multilingual-BERT (a neural language model trained on text from over 100 languages) attention for word-level alignment
  7. Storage - PROIEL XML + JSONL partitions + Qdrant vector store + Neo4j graph
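The filtering step (step 2) can be sketched as follows. This is a minimal illustration, not the platform's filter: the 0.8 threshold, the function names, and the whitespace and Unicode normalisation applied before hashing are all assumptions, and apparatus-criticus rejection is not modelled beyond the ratio check.

```python
import hashlib
import unicodedata

def greek_ratio(text):
    """Share of alphabetic characters whose Unicode name marks them as Greek."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    greek = sum(1 for c in letters if "GREEK" in unicodedata.name(c, ""))
    return greek / len(letters)

def content_hash(text):
    """SHA-256 of the text after NFC normalisation and whitespace collapsing."""
    normalised = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def filter_documents(docs, min_ratio=0.8):
    """Keep mostly-Greek documents and drop exact duplicates by content hash."""
    seen = set()
    kept = []
    for doc in docs:
        if greek_ratio(doc) < min_ratio:
            continue  # too little Greek script (hypothetical threshold)
        digest = content_hash(doc)
        if digest in seen:
            continue  # same content already kept
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "ἐν ἀρχῇ ἦν ὁ λόγος",
    "ἐν  ἀρχῇ ἦν ὁ λόγος",            # duplicate apart from spacing
    "in principio erat verbum",        # no Greek script
    "λόγος cf. ed. Teubner p. 12 app. crit.",  # mostly editorial Latin script
]
kept = filter_documents(docs)
print(kept)
```

Hashing the normalised text rather than the raw bytes means that copies differing only in spacing or Unicode composition count as duplicates.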

Funding

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan.

Funded project: CVL-CDSAML - A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages. (Valency is the number and type of arguments a verb takes.)

Hosted at the National and Kapodistrian University of Athens, Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy.

Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.