About AthDGC

A platform of computational tools and a diachronic Greek corpus with Indo-European parallels

For all parts of this platform (site, tools, corpus, slides, launch report) please cite:

Lavidas, Nikolaos, Kiki Nikiforidou, Dag Haug, Leonid Kulikov, Vassiliki Geka, Vassileios Symeonidis, Theodoros Michalareas, Sofia Chionidi, Anastasia Tsiropina, Eleni Plakoutsi, and Evangelos Argyropoulos. 2026. AthDGC: Athens Digital Glossa Chronos (A platform of computational tools and a diachronic Greek treebank, with Indo-European parallels). Athens: National and Kapodistrian University of Athens. Zenodo. https://doi.org/10.5281/zenodo.20439182

Zenodo is an open research-data repository hosted at CERN that mints permanent DOIs.

What AthDGC is

AthDGC is an open platform, not a single resource. It pairs:

a computational toolkit for diachronic linguistics (see Tools), and
a PROIEL-style dependency treebank of the entire Greek language (see Samples). A treebank is a collection of sentences whose grammatical structure has been analysed and stored; PROIEL is the dependency-treebank standard for early Indo-European languages, developed at Oslo.

Both deliverables are open-source, openly licensed, and usable independently. Other research groups can adopt the tools on their own historical treebanks; other corpus projects can use the samples as a citation reference without buying into our workflow.

The platform is developed at the National and Kapodistrian University of Athens (Εθνικόν και Καποδιστριακόν Πανεπιστήμιον Αθηνών), Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy. It is led by Prof. Nikolaos Lavidas and coordinated by the Athens Digital Glossa Chronos Research Network.

The CVL-CDSAML project

A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages. Valency is the number and type of arguments a verb takes.

The CVL-CDSAML project builds a corpus-based valency lexicon for the contrastive and diachronic study of languages from antiquity to today, using computational linguistic methods to track valency patterns and argument-structure changes across more than 3,000 years of documented language evolution.

The project adapts PROIEL annotation schemes to create selective treebanks that capture valency patterns across multiple historical stages of Greek, English, Germanic, and Romance languages.

The two halves of the platform

(i) Computational toolkit

The toolkit provides fourteen open-source modules:

LightSIDE-AthDGC, the Lavidas-extension fork of LightSIDE (an open-source text-mining workbench developed at Carnegie Mellon), which operates on PROIEL syntactic features rather than only text features: dependency arcs (the head-to-dependent links between words in a parsed sentence), argument-structure frames, and morphology bundles;
LightSIDE-compatible text-feature extraction;
a NoSketch-style concordancer (a search tool that shows every occurrence of a word in its surrounding context);
fine-tuned Stanza checkpoints for diachronic Greek (grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel; Hugging Face mirror released at v0.5), where Stanza is Stanford's open-source Python NLP library that automatically tags, lemmatises, and parses sentences, a checkpoint is a saved snapshot of a trained model, fine-tuning means further training on new data to adapt a model to a particular text type, and Hugging Face is a public hosting platform for machine-learning models and datasets;
a per-verb argument-structure extractor;
a cross-lingual alignment viewer built on Neo4j (a graph database that stores data as nodes and connections rather than as tables), where alignment means matching corresponding words or sentences between texts in different languages;
a corpus-fix toolkit;
a showcase generator;
a PROIEL XML 2.0 exporter (PROIEL XML 2.0 is the file format, developed at Oslo, that stores each sentence as a tree of word-by-word grammatical relations);
a v0.5 PROIEL XML 2.0 schema validator (a program that checks a data file against the expected structural rules);
a v0.5 valency-frame database client;
a v0.5 retranslation-pair browser;
a v0.5 retelling-chain explorer; and
the Quarto template pack that generates this very site (Quarto is an open-source publishing system that builds websites, slides, papers, and posters from a single source).

All modules are usable on any PROIEL-style historical treebank, not only on the AthDGC corpus. See Tools.
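
To make the idea of a per-verb argument-structure extractor concrete, here is a minimal sketch in Python. It is not the AthDGC module itself. It assumes PROIEL-style XML in which each token carries lemma, part-of-speech, head-id, and relation attributes, and it assumes that the relation labels listed in ARGUMENT_RELATIONS are the ones to treat as arguments. The function name valency_frames and the toy sentence are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Toy PROIEL-style fragment, invented for illustration (not corpus data).
SAMPLE = """
<proiel>
  <source id="toy" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="2" relation="aux"/>
        <token id="2" form="ἄνθρωπος" lemma="ἄνθρωπος" part-of-speech="Nb" head-id="3" relation="sub"/>
        <token id="3" form="γράφει" lemma="γράφω" part-of-speech="V-" relation="pred"/>
        <token id="4" form="ἐπιστολήν" lemma="ἐπιστολή" part-of-speech="Nb" head-id="3" relation="obj"/>
      </sentence>
    </div>
  </source>
</proiel>
"""

# Assumption: these relation labels count as verbal arguments.
ARGUMENT_RELATIONS = {"sub", "obj", "obl", "xobj", "comp", "ag"}


def valency_frames(xml_text):
    """Return {verb lemma: Counter of argument frames} for a PROIEL-style document."""
    root = ET.fromstring(xml_text)
    frames = defaultdict(Counter)
    for sentence in root.iter("sentence"):
        tokens = {t.get("id"): t for t in sentence.iter("token")}
        # Collect the argument relations attached to each head token.
        dependents = defaultdict(list)
        for token in tokens.values():
            head = token.get("head-id")
            if head in tokens and token.get("relation") in ARGUMENT_RELATIONS:
                dependents[head].append(token.get("relation"))
        # A frame is the sorted tuple of argument relations governed by a verb.
        for token_id, token in tokens.items():
            if (token.get("part-of-speech") or "").startswith("V"):
                frame = tuple(sorted(dependents.get(token_id, [])))
                frames[token.get("lemma")][frame] += 1
    return frames


if __name__ == "__main__":
    for lemma, counter in valency_frames(SAMPLE).items():
        print(lemma, dict(counter))
```

Counting frames per lemma rather than per token is a design choice of this sketch: it yields a small frequency table per verb, which is the kind of summary a valency lexicon entry would start from.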

(ii) Diachronic Greek treebank

The treebank covers Homeric, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, with verse-level cross-lingual alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian (the last in ingestion at v0.5). Sanskrit, Old English, Avestan, Old Persian, and Ukrainian are queued at v0.7. See Samples for representative sentences per period with full PROIEL annotation.
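
Verse-level alignment can be pictured as a join on a shared verse reference. The sketch below is a hedged illustration, not the AthDGC aligner: the function align_verses, the reference format "JOHN 1.1", and the toy data are assumptions made for the example, and the Gothic dictionary is left empty only to show how a missing witness surfaces.

```python
from typing import Dict, List, Optional


def align_verses(
    base: Dict[str, str],
    witnesses: Dict[str, Dict[str, str]],
) -> List[Dict[str, Optional[str]]]:
    """Join witness texts to the Greek base text on the verse reference."""
    rows = []
    for ref, text in base.items():
        row: Dict[str, Optional[str]] = {"ref": ref, "grc": text}
        for lang, verses in witnesses.items():
            # None marks a verse that this witness does not supply.
            row[lang] = verses.get(ref)
        rows.append(row)
    return rows


if __name__ == "__main__":
    greek = {"JOHN 1.1": "Ἐν ἀρχῇ ἦν ὁ λόγος"}
    parallels = {
        "lat": {"JOHN 1.1": "in principio erat Verbum"},
        "got": {},  # deliberately empty in this toy example
    }
    for row in align_verses(greek, parallels):
        print(row)
```

Keeping gaps as explicit None values, rather than dropping the row, lets a downstream viewer show which witnesses attest a verse and which do not.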

Open-access source chain

Every primary source text is open-access (public domain, CC-BY, CC-BY-SA, or equivalent). Greek sources draw on Perseus Digital Library, Open Greek and Latin / First1K (Leipzig), SBL Greek NT, Tischendorf and Westcott-Hort, Rahlfs LXX via openscriptures.org, Papyri.info, Patrologia Graeca via Documenta Catholica Omnia, Bibliotheca Augustana, Anemi (UoC), and Wikisource el. IE parallels draw on Vulsearch + Latin Library, the Wulfila Project (University of Antwerp), TITUS (Frankfurt), Digilib Armenian, GRETIL (Goettingen), SARIT, TEAMS, the DOE corpus, and the National Library of Ukraine. The annotation layer is AthDGC-original under CC-BY-4.0 (the Creative Commons Attribution 4.0 licence, which permits reuse with credit). In-copyright editions are never republished verbatim; we use the open-access antecedent or short quotation samples under fair use.

The treebank is built by an autonomous daily harvest workflow on GRNET ARIS, the Greek national high-performance computing (HPC) cluster run by GRNET. The workflow covers discovery, filtering, OCR conversion to PROIEL XML 2.0, Stanza dependency parsing, and cross-lingual alignment to the four IE witnesses. OCR (Optical Character Recognition) is the conversion of scanned page images into searchable text; dependency parsing is an analysis that shows which word in a sentence depends on which other word as its head.

Status

The platform is published in continuous release. The tools are usable today; the corpus is at v0.4 and is undergoing an audit pass on the ARIS side before the wide v0.5 release. Public samples are drawn from the v0.4 partitions and rendered through the standard AthDGC PROIEL workflow. Dependency parses and morphology are produced by Stanza grc_proiel and curated by the AthDGC team; individual sentences may still contain Stanza-introduced errors (lemmatisation of contracted forms, POS of pronouns, feature-bundle mismatches), which are corrected at the source in continuous data-cleanup passes (see fix_corpus_data.py in the source repository).

See the Method section of the data paper for the complete workflow description.

Project resources

The project maintains three working resources beyond this public site:

AthDGC corpus repository (closed-access GitHub): https://github.com/AthDGC/athdgc-corpus. The canonical Git-managed working location of the AthDGC corpus across the eight Greek diachronic periods and the four Indo-European parallels. Access is restricted to the eleven-member team during the v0.4 to v0.5 audit pass; the public release of the annotated partitions follows at v0.5 as a separate Zenodo dataset record under CC-BY-4.0.
Per-language unique-identifier and metadata register (Google Sheets, restricted): https://docs.google.com/spreadsheets/d/1MiXcxAedaHgdnj62q-zTQsS3Q4iGTyQU/edit?gid=997932237. One sheet per language (grc, lat, got, chu, xcl, plus queued IE witnesses), holding for each sentence the unique identifier, period, source archive, edition, licence, and ingestion timestamp.
PROIEL annotation interface (NKUA): https://dialing.enl.uoa.gr/proiel/. The live PROIEL XML 2.0 annotation web platform hosted at the National and Kapodistrian University of Athens, on which the AthDGC team performs and reviews the manual annotation layer.
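
As an illustration of the per-sentence record that the metadata register holds, the following Python sketch models one row with the six fields named above. The class name RegisterRow, the column order, the identifier format, and the ISO 8601 timestamp format are all assumptions made for the example; the actual sheet layout is not documented here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence


@dataclass(frozen=True)
class RegisterRow:
    """One sentence-level entry of a per-language register (hypothetical layout)."""
    sentence_id: str
    period: str
    source_archive: str
    edition: str
    licence: str
    ingested_at: datetime


def row_from_cells(cells: Sequence[str]) -> RegisterRow:
    """Build a RegisterRow from six spreadsheet cells in the assumed column order."""
    if len(cells) != 6:
        raise ValueError(f"expected 6 cells, got {len(cells)}")
    sentence_id, period, archive, edition, licence, stamp = (c.strip() for c in cells)
    if not sentence_id:
        raise ValueError("the unique identifier must not be empty")
    # Assumption: timestamps are stored as ISO 8601 strings.
    return RegisterRow(sentence_id, period, archive, edition, licence,
                       datetime.fromisoformat(stamp))


if __name__ == "__main__":
    # All values below are invented placeholders.
    example = ["grc-000001", "Late Byzantine", "example archive",
               "example edition", "CC-BY-4.0", "2026-01-15T03:00:00"]
    print(row_from_cells(example))
```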

Project provenance, Thessaloniki and Oslo, 2012

The diachronic-Greek PROIEL line that AthDGC continues began in 2012 as a Thessaloniki and Oslo collaboration between Prof. Dag Trygve Truslew Haug (University of Oslo, PROIEL Project Director) and Nikolaos Lavidas, then at the University of Thessaloniki. The first joint anchor text was George Sphrantzes, Chronicon Sive Minus (Chronicles, post-1453, ed. Grecu 1966), the principal historiographic account of the Fall of Constantinople and the only post-Koine, Late-Byzantine Greek text in the original PROIEL release series. The annotated edition was published in PROIEL Release 20180408 under CC-BY-NC-SA 4.0, with principal investigators Dag Trygve Truslew Haug and Nikolaos Lavidas, and funding from the University of Oslo and the University of Thessaloniki. The annotation and review team comprised Þorsteinn Vilhjálmsson, Anastasia Michali, Maria Geramani, Evgenia Klidona, Athina Papadopoulou, and Dag Haug. The release is available at https://github.com/proiel/proiel-treebank and is browsable sentence by sentence at https://syntacticus.org under source identifier proiel:20180408:chron (an example is the sentence at https://syntacticus.org/sentence/proiel:20180408:chron:89063). The Oslo side of the collaboration is housed in the Foni research group in linguistics (Forskergruppe i lingvistikk) at the University of Oslo, Department of Philosophy, Classics, History of Art and Ideas.
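
The sentence reference quoted above follows a four-part, colon-separated pattern, and the example URL simply appends it to the Syntacticus sentence path. A small Python sketch of that reading follows. The field names and the helper functions parse_ref and sentence_url are assumptions for illustration, inferred only from the single example given here.

```python
from typing import NamedTuple


class SentenceRef(NamedTuple):
    collection: str   # e.g. "proiel"
    release: str      # e.g. "20180408"
    source: str       # e.g. "chron"
    sentence_id: int  # e.g. 89063


def parse_ref(ref: str) -> SentenceRef:
    """Split a reference such as 'proiel:20180408:chron:89063' into its parts."""
    parts = ref.split(":")
    if len(parts) != 4 or not parts[3].isdigit():
        raise ValueError(f"unexpected sentence reference: {ref!r}")
    return SentenceRef(parts[0], parts[1], parts[2], int(parts[3]))


def sentence_url(ref: str) -> str:
    """Build the browsing URL, following the pattern of the example above."""
    parse_ref(ref)  # validate the shape before building the URL
    return f"https://syntacticus.org/sentence/{ref}"


if __name__ == "__main__":
    print(parse_ref("proiel:20180408:chron:89063"))
    print(sentence_url("proiel:20180408:chron:89063"))
```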

The Sphrantzes Chronicle is the historical hinge between the original PROIEL programme and AthDGC. PROIEL covered Greek up to the Koine of the New Testament, with Sphrantzes as the lone post-Koine, Late-Byzantine extension. AthDGC takes that extension as a starting point and carries it through the entire diachronic span of Greek (Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, Modern), under the same PROIEL XML 2.0 schema and the same relation inventory. The 2012 Thessaloniki and Oslo collaboration thus continues today, with the same PROIEL Project Director (Prof. Dag T. T. Haug, Oslo) and the same Greek PI (Prof. Nikolaos Lavidas), now at the National and Kapodistrian University of Athens, funded by HFRI Project No. 20577 and the Greece 2.0 National Recovery and Resilience Plan, and supported by GRNET ARIS allocation pa260305.

Funding

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577; with complementary support from the Greece 2.0 National Recovery and Resilience Plan. Compute supplied by GRNET ARIS (Greek national HPC), allocation pa260305.

Funded project: CVL-CDSAML, A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.