WP-01 - AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels

Platform launch report - GlossaContactLab Working Papers, digital edition

Author

Nikolaos Lavidas

Affiliation

National and Kapodistrian University of Athens (NKUA), Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy

Published

May 29, 2026

Abstract

AthDGC (“Athens-PROIEL”) is an open, end-to-end pipeline and dataset that provides the first continuously updated, dependency-parsed treebank covering the full history of Greek, from Homeric and Archaic texts to Modern Greek. Its New Testament (NT) portion is aligned at verse level with Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC follows the PROIEL XML 2.0 schema, uses the Stanford Stanza PROIEL-trained pipeline for annotation, and applies LaBSE sentence embeddings and multilingual-BERT attention for word-level alignment. As of v0.4, the corpus contains 89.9 M total rows, 4.08 M annotated Greek rows, and 6,861 NT-aligned Greek verses cross-aligned to four sister Indo-European (IE) witnesses. The platform ships fourteen open-source modules under OSI-approved licences. This Working Paper is the canonical launch report for the AthDGC platform.
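As a rough illustration of what verse-level alignment amounts to, the sketch below joins rows from several witnesses on a shared verse reference, with Greek as the pivot. It is a minimal sketch and not the AthDGC implementation: the row layout, the reference format, the function names, and the use of ISO 639-3 language codes are all assumptions made for this example, and the verse texts are placeholders rather than corpus data.

```python
from collections import defaultdict

# Hypothetical rows: (language code, verse reference, text).
# The texts are placeholders, NOT data from the AthDGC corpus.
ROWS = [
    ("grc", "MARK 1.1", "<Greek verse text>"),
    ("lat", "MARK 1.1", "<Latin verse text>"),
    ("got", "MARK 1.1", "<Gothic verse text>"),
    ("chu", "MARK 1.1", "<Old Church Slavonic verse text>"),
    ("xcl", "MARK 1.1", "<Classical Armenian verse text>"),
    ("grc", "MARK 1.2", "<Greek verse text>"),
    ("lat", "MARK 1.2", "<Latin verse text>"),
]

PIVOT = "grc"                             # Greek is the pivot language
WITNESSES = ("lat", "got", "chu", "xcl")  # assumed codes for the four witnesses


def align_by_verse(rows, pivot=PIVOT, witnesses=WITNESSES):
    """Group rows by verse reference and keep verses attested in the pivot.

    A witness that lacks a verse is recorded as None, so gaps stay visible.
    """
    by_ref = defaultdict(dict)
    for lang, ref, text in rows:
        by_ref[ref][lang] = text

    aligned = {}
    for ref, texts in by_ref.items():
        if pivot not in texts:
            continue  # no Greek verse to align against
        aligned[ref] = {lang: texts.get(lang) for lang in (pivot, *witnesses)}
    return aligned


def fully_attested(aligned, witnesses=WITNESSES):
    """Return the references for which every witness has a verse."""
    return [
        ref
        for ref, texts in aligned.items()
        if all(texts[w] is not None for w in witnesses)
    ]


if __name__ == "__main__":
    aligned = align_by_verse(ROWS)
    for ref, texts in aligned.items():
        missing = [w for w in WITNESSES if texts[w] is None]
        print(ref, "missing:", missing or "none")
```

Recording missing witnesses as `None` instead of dropping the verse keeps partial attestations usable for pairwise comparison. Word-level alignment within a verse pair, which the abstract attributes to LaBSE and multilingual-BERT, is outside the scope of this sketch.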

Keywords

diachronic linguistics, ancient Greek, Byzantine Greek, Modern Greek, Indo-European, PROIEL, dependency treebank, cross-lingual alignment, argument structure, retranslation

Citation

Lavidas, N., Nikiforidou, K., Haug, D., Kulikov, L., Geka, V., Symeonidis, V., Michalareas, T., Chionidi, S., Tsiropina, A., Plakoutsi, E., Argyropoulos, E., & the Athens Digital Glossa Chronos Research Network (2026). AthDGC: An Open Diachronic Greek Treebank with Indo-European Parallels. GlossaContactLab Working Papers, digital edition, WP-01. National and Kapodistrian University of Athens (NKUA). DOI: 10.5281/zenodo.20439182.

@techreport{lavidas2026athdgc,
  type        = {GlossaContactLab Working Paper, digital edition},
  number      = {WP-01},
  author      = {Lavidas, Nikolaos and Nikiforidou, Kiki and Haug, Dag and Kulikov, Leonid and
                 Geka, Vassiliki and Symeonidis, Vassileios and Michalareas, Theodoros and
                 Chionidi, Sofia and Tsiropina, Anastasia and Plakoutsi, Eleni and
                 Argyropoulos, Evangelos and {Athens Digital Glossa Chronos Research Network}},
  title       = {{AthDGC}: An Open Diachronic Greek Treebank with Indo-European Parallels},
  year        = {2026},
  institution = {National and Kapodistrian University of Athens (NKUA)},
  doi         = {10.5281/zenodo.20439182}
}

Full paper

The full text of this Working Paper is hosted on the AthDGC public site. It covers the method, dataset description, open-access source provenance per period and per IE parallel, IE-parallel matrix, retelling and retranslation chains, tool licence matrix, reuse potential, access and format, and acknowledgements:

Read the full launch report at athdgc.github.io/data_paper.html

The same source .qmd renders to HTML, .docx, and (when TinyTeX permits) PDF. This Working Papers entry mirrors the canonical text.

Team

Prof. Nikolaos Lavidas (PI, NKUA), Prof. Emerita Kiki Nikiforidou (NKUA; Co-Editor, Genres and Influential Texts volume), Prof. Dag Haug (Oslo; PROIEL Project Director), Prof. Leonid Kulikov (Ghent; Diachronic typology, valency questionnaires), Dr. Vassiliki Geka (NKUA; Post-Doctoral Researcher; Co-Editor, Genres and Influential Texts volume), Dr. Vassileios Symeonidis (NKUA; Post-Doctoral Researcher), Dr. Theodoros Michalareas (NKUA; Post-Doctoral Researcher), Sofia Chionidi (NKUA; PhD Candidate), Anastasia Tsiropina (NKUA; PhD Candidate), Eleni Plakoutsi (NKUA; PhD Candidate), Evangelos Argyropoulos (NKUA; Research Assistant); and the Athens Digital Glossa Chronos Research Network as collective author.

Acknowledgements

Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577, with complementary support from the Greece 2.0 National Recovery and Resilience Plan. Compute was supplied by GRNET ARIS (Greek national HPC), allocation pa260305. The funded project is CVL-CDSAML - A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.