About AthDGC
A platform of computational tools and a diachronic Greek corpus with Indo-European parallels
What AthDGC is
AthDGC is an open platform, not a single resource. It pairs:
- a computational toolkit for diachronic linguistics (see Tools), and
- a PROIEL-style dependency treebank covering Greek across its documented history, from Homeric to Modern Greek (see Samples).
Both deliverables are open-source, openly licensed, and usable independently. Other research groups can adopt the tools on their own historical treebanks; other corpus projects can use the samples as a citation reference without buying into our pipeline.
The platform is developed at the National and Kapodistrian University of Athens (Εθνικόν και Καποδιστριακόν Πανεπιστήμιον Αθηνών), Division of Language-Linguistics, Department of English Language and Literature, School of Philosophy. It is led by Prof. Nikolaos Lavidas and coordinated by the Athens Digital Glossa Chronos Research Network.
The CVL-CDSAML project
A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.
The CVL-CDSAML project builds a corpus-based valency lexicon for the contrastive and diachronic study of languages from antiquity to today, using computational linguistic methods to track valency patterns and argument-structure changes across more than 3,000 years of documented language evolution.
The project adapts PROIEL annotation schemes to create selective treebanks that capture valency patterns across multiple historical stages of Greek, English and other Germanic languages, and the Romance languages.
The two halves of the platform
(i) Computational toolkit
The toolkit ships fourteen open-source modules:
- LightSIDE-AthDGC for PROIEL syntactic features (dependency arcs, argument-structure frames, morphology bundles);
- LightSIDE-compatible text-feature extraction;
- a NoSketch-style concordancer;
- fine-tuned Stanza checkpoints for diachronic Greek (grc_byz_proiel, grc_lbem_proiel, grc_mod_proiel; Hugging Face mirror ships at v0.5);
- a per-verb argument-structure extractor;
- a cross-lingual alignment viewer (Neo4j);
- a corpus-fix toolkit;
- a showcase generator;
- a PROIEL XML 2.0 exporter;
- a v0.5 PROIEL XML 2.0 schema validator;
- a v0.5 valency-frame database client;
- a v0.5 retranslation-pair browser;
- a v0.5 retelling-chain explorer; and
- the Quarto template pack that generates this very site.
All modules are usable on any PROIEL-style historical treebank, not only on the AthDGC corpus. See Tools.
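To give a feel for what a per-verb argument-structure extractor does, here is a minimal sketch in Python. It is not the AthDGC module itself: the function name, the toy sentence, and the set of relations counted as arguments are all assumptions made for illustration, and the token attributes (lemma, part-of-speech, head-id, relation) are assumed to follow the usual PROIEL XML layout.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy PROIEL-style fragment (illustrative only, not taken from the AthDGC corpus).
SAMPLE = """
<proiel>
  <source id="demo" language="grc">
    <div>
      <sentence id="1">
        <token id="1" form="ὁ" lemma="ὁ" part-of-speech="S-" head-id="2" relation="aux"/>
        <token id="2" form="ἄνθρωπος" lemma="ἄνθρωπος" part-of-speech="Nb" head-id="3" relation="sub"/>
        <token id="3" form="γράφει" lemma="γράφω" part-of-speech="V-" relation="pred"/>
        <token id="4" form="ἐπιστολήν" lemma="ἐπιστολή" part-of-speech="Nb" head-id="3" relation="obj"/>
      </sentence>
    </div>
  </source>
</proiel>
"""

# Which relations count as arguments is an assumption of this sketch.
ARGUMENT_RELATIONS = {"sub", "obj", "obl", "xobj", "comp", "ag"}


def extract_frames(xml_text):
    """Map each verb lemma to the argument-relation frames it heads."""
    root = ET.fromstring(xml_text)
    frames = defaultdict(list)
    for sentence in root.iter("sentence"):
        tokens = sentence.findall("token")
        for verb in tokens:
            if verb.get("part-of-speech") != "V-":
                continue
            # Collect the relations of the verb's direct dependents,
            # keeping only those treated as arguments.
            frame = tuple(sorted(
                t.get("relation")
                for t in tokens
                if t.get("head-id") == verb.get("id")
                and t.get("relation") in ARGUMENT_RELATIONS
            ))
            frames[verb.get("lemma")].append(frame)
    return dict(frames)


if __name__ == "__main__":
    for lemma, verb_frames in extract_frames(SAMPLE).items():
        print(lemma, verb_frames)
```

Because the sketch reads only standard token attributes, the same approach should carry over to any PROIEL-style treebank, which is the point made above about the modules being corpus-independent.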
(ii) Diachronic Greek treebank
The treebank covers Homeric, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek, with verse-level cross-lingual alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian (the last still being ingested at v0.5). Sanskrit, Old English, Avestan, Old Persian, and Ukrainian are queued for v0.7. See Samples for representative sentences per period with full PROIEL annotation.
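Verse-level alignment can be pictured as a join on a shared verse reference. The sketch below is illustrative only: the reference format ("JOHN 1.1"), the language labels, and the function name are assumptions, not the AthDGC data model. The Gothic witness is deliberately left empty here to show how a missing verse is handled.

```python
def align_by_verse(greek, parallels):
    """Attach parallel-witness verses to each Greek verse by reference key.

    greek:     {verse_reference: text} for the Greek base text
    parallels: {language_label: {verse_reference: text}}
    Returns    {verse_reference: {language_label: text}}, in Greek verse order.
    A witness that lacks a verse is simply absent from that verse's row.
    """
    aligned = {}
    for ref, text in greek.items():
        row = {"grc": text}
        for lang, verses in parallels.items():
            if ref in verses:
                row[lang] = verses[ref]
        aligned[ref] = row
    return aligned


# Toy input: two verses with a Latin parallel and an empty Gothic witness.
GREEK = {
    "JOHN 1.1": "Ἐν ἀρχῇ ἦν ὁ λόγος",
    "JOHN 1.2": "οὗτος ἦν ἐν ἀρχῇ πρὸς τὸν θεόν",
}
PARALLELS = {
    "lat": {
        "JOHN 1.1": "In principio erat Verbum",
        "JOHN 1.2": "Hoc erat in principio apud Deum",
    },
    "got": {},  # empty on purpose, to demonstrate a gap
}

if __name__ == "__main__":
    for ref, row in align_by_verse(GREEK, PARALLELS).items():
        print(ref, sorted(row))
```

Keying on the verse reference rather than on token positions keeps the alignment robust to differences in word order and sentence division between the witnesses.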
Open-access source chain
Every primary source text is open-access (public domain, CC-BY, CC-BY-SA, or equivalent). Greek sources draw on Perseus Digital Library, Open Greek and Latin / First1K (Leipzig), SBL Greek NT, Tischendorf and Westcott-Hort, Rahlfs LXX via openscriptures.org, Papyri.info, Patrologia Graeca via Documenta Catholica Omnia, Bibliotheca Augustana, Anemi (UoC), and Wikisource el. IE parallels draw on Vulsearch + Latin Library, the Wulfila Project (University of Antwerp), TITUS (Frankfurt), Digilib Armenian, GRETIL (Goettingen), SARIT, TEAMS, the DOE corpus, and the National Library of Ukraine. The annotation layer is AthDGC-original under CC-BY-4.0. In-copyright editions are never republished verbatim; we use the open-access antecedent or short quotation samples under fair use.
The treebank is built by an autonomous daily harvest pipeline on the GRNET ARIS national HPC: discovery, filtering, OCR conversion to PROIEL XML 2.0, Stanza dependency parsing, and cross-lingual alignment to the four IE witnesses.
Status
The platform is published in continuous release. The tools are usable today; the corpus is at v0.4 and is undergoing an audit pass on the ARIS side before the wide v0.5 release. Public samples are drawn from the v0.4 partitions and rendered through the standard AthDGC PROIEL pipeline. Dependency parses and morphology are produced by Stanza grc_proiel and curated by the AthDGC team; individual sentences may still contain Stanza-introduced errors (lemmatisation of contracted forms, POS of pronouns, feature-bundle mismatches), which are corrected at the source in continuous data-cleanup passes (see fix_corpus_data.py in the source repository).
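As an illustration of what a feature-bundle mismatch looks like, the sketch below flags tokens whose part of speech and morphology tag disagree. It is a hypothetical check, not the logic of fix_corpus_data.py: it assumes the ten-position PROIEL morphology tag (person, number, tense, mood, voice, gender, case, degree, strength, inflection), and the function name and problem labels are invented for the example.

```python
MORPH_LENGTH = 10   # assumed length of a PROIEL positional morphology tag
MOOD, CASE = 3, 6   # zero-based positions of mood and case in that tag


def check_token(pos, morphology):
    """Return a list of problem labels for one token (empty if none found)."""
    if len(morphology) != MORPH_LENGTH:
        return ["bad-length"]
    problems = []
    # A verb whose tag carries no mood is suspicious.
    if pos == "V-" and morphology[MOOD] == "-":
        problems.append("verb-without-mood")
    # A common or proper noun whose tag carries no case is suspicious.
    if pos in {"Nb", "Ne"} and morphology[CASE] == "-":
        problems.append("noun-without-case")
    return problems


if __name__ == "__main__":
    print(check_token("V-", "3spia----i"))  # consistent verb tag
    print(check_token("V-", "-s---mn--i"))  # noun-like tag on a verb
```

A check of this kind only detects candidates; deciding whether the part of speech or the morphology is the wrong half still needs a curator or a lemma-specific rule.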
See the Method section of the data paper for the complete pipeline description.
Funding
Funded by the Hellenic Foundation for Research and Innovation (HFRI) under the 3rd Call for HFRI Research Projects to support Post-Doctoral Researchers, Project No. 20577, with complementary support from the Greece 2.0 National Recovery and Resilience Plan. Compute is supplied by GRNET ARIS (Greek national HPC), allocation pa260305.
Funded project: CVL-CDSAML - A Corpus-based Valency Lexicon for a Contrastive and Diachronic Study of Ancient and Medieval Languages.