AthDGC

A platform of computational tools and a diachronic Greek corpus, with Indo-European parallels

AthDGC (Athens Digital Glossa Chronos) is a PROIEL-style platform: open computational tools for diachronic linguistics plus an open dependency treebank of the entire Greek language, with Indo-European parallels.

For all parts of this platform (site, tools, corpus, slides, launch report) please cite:

Lavidas, Nikolaos, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, and Evangelos Argyropoulos. 2026. AthDGC: Athens Digital Glossa Chronos (A platform of computational tools and a diachronic Greek treebank, with Indo-European parallels). Athens: National and Kapodistrian University of Athens. Zenodo (an open research-data repository hosted at CERN that mints permanent DOIs). https://doi.org/10.5281/zenodo.20439182

What AthDGC is

The Athens Digital Glossa Chronos (AthDGC, also referred to as Athens-PROIEL) is a platform, not a single resource. It provides two open deliverables side by side:

an open-source computational toolkit for diachronic linguistics: LightSIDE-compatible feature-extraction and classification workflows (LightSIDE is an open-source text-mining workbench developed at Carnegie Mellon); a NoSketch-style concordancer (a search tool that shows every occurrence of a word in its surrounding context); fine-tuned Stanza checkpoints (Stanza is Stanford's open-source Python library that automatically tags, lemmatises, and parses sentences, fine-tuning trains a model further on new data to adapt it to a particular text type, and a checkpoint is a saved snapshot of a trained model); a per-verb argument-structure extractor; a cross-lingual alignment viewer (alignment matches corresponding words or sentences between texts in different languages); a corpus-fix toolkit; a showcase generator; and the Quarto pack that builds this very site (Quarto is an open-source publishing system that builds websites, slides, papers, and posters from a single source). See the Tools page.
an open PROIEL-style dependency treebank spanning Greek diachrony (Homeric, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern) with Indo-European parallels to Latin (Vulgate), Gothic (Wulfila), and Old Church Slavonic (Marianus) via verse-level cross-lingual alignment of the New Testament. Classical Armenian and Sanskrit ingestion are in progress; Ukrainian, Old English, Avestan, and Old Persian are queued for v0.7. See the Samples page for representative sentences.
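As an illustration of what verse-level alignment means in practice, here is a minimal Python sketch that joins witnesses on a shared verse reference. It is not the platform's own code: the function name `align_by_verse`, the language codes, and the reference format `JOHN 1.1` are assumptions made for the example.

```python
def align_by_verse(witnesses, pivot="grc"):
    """Join witnesses on a shared verse reference.

    witnesses: {language_code: {verse_ref: text}}  (hypothetical layout)
    Returns {verse_ref: {language_code: text}} for every verse of the pivot.
    Witnesses that lack a verse are simply absent from that verse's row.
    """
    aligned = {}
    for ref, text in witnesses[pivot].items():
        row = {pivot: text}
        for lang, verses in witnesses.items():
            if lang != pivot and ref in verses:
                row[lang] = verses[ref]
        aligned[ref] = row
    return aligned


if __name__ == "__main__":
    sample = {
        "grc": {"JOHN 1.1": "Ἐν ἀρχῇ ἦν ὁ λόγος"},
        "la": {"JOHN 1.1": "In principio erat Verbum"},
        # Left empty here only to show how a missing verse is handled.
        "got": {},
    }
    for ref, row in align_by_verse(sample).items():
        print(ref, row)
```

Keying on the verse reference keeps the join independent of any embedding model; finer-grained word-level alignment would then operate inside each aligned verse.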

The platform's research focus is retranslation, retelling, influential texts, and argument structure (the set of obligatory and optional partners a verb requires, e.g. subject, object, oblique). The tools are usable on any PROIEL-style historical treebank, not only on the AthDGC corpus, so other research groups can adopt them without depending on us.

Cite this work. Lavidas, N., Nikiforidou, K., Haug, D., Kulikov, L., Geka, V., Symeonidis, V., Michalareas, T., Chionidi, S., Tsiropina, A., Plakoutsi, E., and Argyropoulos, E. (2026). AthDGC: Athens Digital Glossa Chronos. Zenodo. https://doi.org/10.5281/zenodo.20439182

Train with us. The platform is taught as a KEDIVIM continuing-education course at NKUA in autumn 2026: Digital Tools for the Diachronic Analysis of Language / Ψηφιακά Εργαλεία για τη Διαχρονική Ανάλυση της Γλώσσας. See the Training page.

Current state (v0.4)

Field	Value
Greek periods covered	8 (Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern)
Cross-aligned IE witnesses	4 (Latin Vulgate, Gothic Wulfila, OCS Marianus, Classical Armenian in ingestion)
Per-witness verified verse counts	reported in v0.5 release notes (post-ARIS audit pass)
Open-source tools	tiered LIVE / IN SETUP / FORTHCOMING matrix on Tools
Concept DOI (a Digital Object Identifier that always points to the latest version of a deposited record)	10.5281/zenodo.20439182
Platform repository (public, Apache-2.0)	https://github.com/AthDGC/Diachronic-Linguistics-Platform
Corpus repository (closed, team-access during v0.4-v0.5 audit)	https://github.com/AthDGC/athdgc-corpus
Per-language unique-id + metadata register (Excel, restricted)	https://docs.google.com/spreadsheets/d/1MiXcxAedaHgdnj62q-zTQsS3Q4iGTyQU/edit?gid=997932237
Live PROIEL XML 2.0 annotation interface (NKUA)	https://dialing.enl.uoa.gr/proiel/
Stanza checkpoints	forthcoming at https://huggingface.co/AthDGC (org page + model repos in setup; release at v0.5)
Open-access source archives	Perseus, OGL/First1K, SBL GNT, Rahlfs LXX, Papyri.info, Patrologia Graeca, Bibliotheca Augustana, Anemi, Wikisource el; Wulfila Project, TITUS, GRETIL, SARIT, TEAMS, National Library of Ukraine
Launch report	GlossaContactLab Working Papers, NKUA digital edition

Release status (v0.4)

The platform is under continuous build. v0.4 is not yet a wide release of the corpus. The Stanford-Stanza pass introduced annotation errors in a non-trivial share of sentences (lemma mis-attribution, crasis splits, gender inheritance on indeclinables, period mis-tagging, etc.), and the team is correcting these on the ARIS side before any wide release.

What this public site shows today is therefore a curated samples showcase drawn from the v0.4 partitions and rendered through the standard AthDGC PROIEL workflow. The tools, by contrast, are released and usable now in the public source-code repository; the fine-tuned Stanza checkpoints will additionally be mirrored on Hugging Face (a public hosting platform for machine-learning models and datasets) at v0.5. See the Samples and Tools pages.

The full corpus partitions will be released at v0.5, once the audit pass is complete. The Zenodo record above is the canonical citation for the project; the version DOIs (v0.4.0, v0.5.0, ...) track the release history.

What is in the public repository (and what is not)

The GitHub repository https://github.com/AthDGC/Diachronic-Linguistics-Platform is deliberately scoped. It is not a corpus dump.

The public repository provides:

the platform's source code (discovery, filtering, conversion, annotation, alignment, fix, showcase generator, Quarto template pack);
the open-source toolkit: LightSIDE-AthDGC (the Lavidas-extension fork of LightSIDE, which operates on syntactic features rather than only text features), a NoSketch-style concordancer, a PROIEL XML 2.0 exporter (PROIEL XML 2.0 is the file format, developed at Oslo, that stores each sentence as a tree of word-by-word grammatical relations), a corpus-fix toolkit, an argument-structure extractor, and a cross-lingual alignment viewer;
the fine-tuned Stanza checkpoints (grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel) trained on ARIS (the Hugging Face mirror under athdgc/* is in setup and will be released at v0.5);
the curated showcase samples rendered on this site;
the Zenodo source-code snapshot under DOI 10.5281/zenodo.20439182.
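To make the PROIEL XML format and the idea of per-verb argument-structure extraction concrete, the sketch below reads a tiny hand-written fragment and collects each verb's subject, object, and oblique dependents. It is an illustration under stated assumptions, not the repository's extractor: the function `extract_frames` is hypothetical, the sample sentence is invented for the example and is not drawn from the corpus, and the attribute names (`head-id`, `relation`, `part-of-speech`) and labels (`sub`, `obj`, `obl`, `V-`) follow the PROIEL conventions as we understand them and should be checked against the schema.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hand-written illustrative fragment, not taken from the AthDGC corpus.
SAMPLE = """<proiel>
  <source id="demo" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="2" relation="aux"/>
        <token id="2" form="ἀνὴρ" lemma="ἀνήρ" part-of-speech="Nb" head-id="3" relation="sub"/>
        <token id="3" form="γράφει" lemma="γράφω" part-of-speech="V-" relation="pred"/>
        <token id="4" form="ἐπιστολήν" lemma="ἐπιστολή" part-of-speech="Nb" head-id="3" relation="obj"/>
      </sentence>
    </div>
  </source>
</proiel>"""

# Relations treated as arguments in this sketch (an assumption).
ARGUMENT_RELATIONS = {"sub", "obj", "obl"}


def extract_frames(xml_text):
    """Return one {'verb': lemma, 'args': [(relation, lemma), ...]} per verb token."""
    root = ET.fromstring(xml_text)
    frames = []
    for sentence in root.iter("sentence"):
        tokens = {t.get("id"): t for t in sentence.iter("token")}
        dependents = defaultdict(list)
        # Group argument-bearing tokens under the id of their head.
        for token in tokens.values():
            head = token.get("head-id")
            if head in tokens and token.get("relation") in ARGUMENT_RELATIONS:
                dependents[head].append((token.get("relation"), token.get("lemma")))
        # Emit a frame for every token whose part-of-speech tag marks a verb.
        for token_id, token in tokens.items():
            if (token.get("part-of-speech") or "").startswith("V"):
                frames.append({"verb": token.get("lemma"),
                               "args": sorted(dependents[token_id])})
    return frames


if __name__ == "__main__":
    print(extract_frames(SAMPLE))
```

On the sample, this yields a single frame for γράφω with one `obj` and one `sub` dependent. Voice and aspect, which a fuller extractor would also record, would have to be read from each token's morphology tag and are left out here.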

The public repository does NOT release:

the full v0.4 annotated corpus partitions. These remain under active audit on GRNET ARIS (the Greek national high-performance computing, or HPC, cluster run by GRNET: a cluster of fast machines used for heavy computation) under allocation pa260305 until the v0.5 release. Verified scale (total rows, annotated rows per period, per-witness aligned-verse counts) will be included in the v0.5 release notes.
the raw JSONL (a plain-text data format that stores one record per line) partitions per period, the Qdrant vector store, or the Neo4j (a graph database that stores data as nodes and connections rather than as tables) alignment graph dump. These will be released as a separate Zenodo dataset record at v0.5, under CC-BY-4.0 (the Creative Commons Attribution 4.0 licence, which permits reuse with credit), when the Stanza-introduced annotation errors have been corrected at scale.

This scoping is deliberate: we will not release annotated data with known errors. The tools are released and ready for reuse on any PROIEL-style historical treebank; the corpus is released when it is correct.

Method (in brief)

Discovery: daily harvest from archive.org, Perseus, First1K Greek, Wikisource, Diorisis, OpenGreekAndLatin
Filtering: Greek-script ratio + apparatus-criticus rejection + content-hash deduplication
Conversion: PROIEL XML 2.0 schema
Annotation: Stanza grc_proiel (and la_proiel, cu_proiel, got_proiel for parallels)
Argument structure: per-verb extraction of subject, object, oblique, voice, aspect
Cross-lingual alignment: LaBSE (a multilingual sentence-embedding model that maps sentences from many languages into a common vector space) sentence embedding (a numeric vector that represents the meaning of a sentence) + multilingual-BERT (a neural language model trained on text from over 100 languages) attention for word-level alignment
Storage: PROIEL XML + JSONL partitions + Qdrant vector store + Neo4j graph
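The filtering stage can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the function names, the 0.8 threshold, and the choice of SHA-256 over normalised text are all assumptions, and apparatus-criticus rejection is omitted because the source does not describe how it is done.

```python
import hashlib
import unicodedata

# Greek and Coptic block plus Greek Extended (polytonic) block.
GREEK_RANGES = ((0x0370, 0x03FF), (0x1F00, 0x1FFF))


def greek_ratio(text):
    """Share of alphabetic characters that fall in a Greek Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        1 for ch in letters
        if any(lo <= ord(ch) <= hi for lo, hi in GREEK_RANGES)
    )
    return greek / len(letters)


def content_hash(text):
    """SHA-256 of the NFC-normalised, whitespace-collapsed text."""
    normalised = " ".join(unicodedata.normalize("NFC", text).split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def filter_documents(docs, min_ratio=0.8):
    """Keep documents that are mostly Greek script and not seen before."""
    seen = set()
    kept = []
    for doc in docs:
        if greek_ratio(doc) < min_ratio:
            continue  # too little Greek script (threshold is an assumption)
        digest = content_hash(doc)
        if digest in seen:
            continue  # duplicate after normalisation
        seen.add(digest)
        kept.append(doc)
    return kept
```

Normalising before hashing means that two copies of a text differing only in whitespace or in composed versus decomposed accents count as one document; near-duplicates with real textual variation would need a fuzzier comparison.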

Project provenance: Thessaloniki and Oslo, 2012

The diachronic-Greek PROIEL line that AthDGC continues began in 2012 as a Thessaloniki and Oslo collaboration between Prof. Dag Trygve Truslew Haug (University of Oslo, PROIEL Project Director) and Nikolaos Lavidas, then at the University of Thessaloniki. Their first joint anchor text was George Sphrantzes, Chronicon Sive Minus (Chronicles, post-1453, ed. Grecu 1966), the principal historiographic account of the Fall of Constantinople and the only post-Koine, Late-Byzantine Greek text in the original PROIEL release series. The annotated edition appears in PROIEL Release 20180408 under CC-BY-NC-SA 4.0, with principal investigators Dag T. T. Haug and Nikolaos Lavidas, and funding from the University of Oslo and the University of Thessaloniki. The annotation and review team comprised Þorsteinn Vilhjálmsson, Anastasia Michali, Maria Geramani, Evgenia Klidona, Athina Papadopoulou, and Dag Haug. The release is available at https://github.com/proiel/proiel-treebank and is browsable sentence by sentence at https://syntacticus.org under source identifier proiel:20180408:chron (for example, the sentence at https://syntacticus.org/sentence/proiel:20180408:chron:89063). The Oslo side of the collaboration is housed in the Foni research group in linguistics (Forskergruppe i lingvistikk) at the University of Oslo.

The Sphrantzes Chronicle is the historical hinge between the original PROIEL programme and AthDGC: PROIEL covered Greek up to the Koine of the New Testament, with Sphrantzes as the lone post-Koine, Late-Byzantine extension; AthDGC takes that extension as a starting point and carries it through the entire diachronic span of Greek (Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern), under the same PROIEL XML 2.0 schema and the same relation inventory. The 2012 Thessaloniki and Oslo collaboration thus continues today, with the same PROIEL Project Director and the same Greek PI, now at the National and Kapodistrian University of Athens.

Funding

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577, with complementary support from the Greece 2.0 National Recovery and Resilience Plan.

Funded project: CVL-CDSAML, A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages (valency is the number and type of arguments a verb takes).

Hosted at the National and Kapodistrian University of Athens, Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy.

Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.