Seminar in Computational Linguistics

  • Date: –15:00
  • Location: Engelska parken 9-3042
  • Lecturer: Harald Hammarström
  • Contact person: Miryam de Lhoneux
  • Seminar

Deep and Shallow in Automated Cognate Detection

Abstract: Given a set of languages with wordlists, the problem of
cognate identification is to decide which sets of forms are derived
from a common source (List et al. 2017, Kondrak 2009 and references
therein). Typically, cognate identification is performed by aligning
words phonetically and checking if the similarity exceeds a
human-tuned or pre-trained threshold value. However, if this threshold
is set too strictly, only shallow cognates (German drei vs English
three) will be discovered, while if it is set too loosely, non-cognates
will be retrieved along with deep cognates (Latin aqua vs French eau).
We will argue that the use
of a threshold value inherently poses this dilemma. As an alternative,
we propose a threshold-free approach to cognate identification based
on a different conceptualization: only shallow cognates are targeted in
each phase of cognate identification. Once shallow cognates have been
identified, a (preliminary) most shallow subgroup may be
inferred. This subgroup implies a proto-language which can be
preliminarily reconstructed using the sound correspondences from the
shallow cognates as a starting point. Shallow cognate identification
may then be performed again, this time with the proto-language of the
subgroup, a new subgroup posited, and so forth iteratively. In this
way, deep cognates can be recognized if and only if the reconstructed
ancestors of the modern forms become more similar the further back in
time we go.
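
The threshold dilemma described in the abstract can be illustrated
with a minimal sketch. It is not the method presented in the talk,
and it rests on several assumptions: plain spellings stand in for
phonetic transcriptions, normalized Levenshtein distance stands in for
phonetic alignment, the threshold values STRICT and LOOSE are
hypothetical, and the pair day/dia (English, Portuguese) is added here
as a well-known non-cognate look-alike that does not appear in the
abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def detect(pairs, threshold):
    """Return the pairs whose similarity reaches the threshold."""
    return [p for p in pairs if similarity(*p) >= threshold]


PAIRS = [
    ("drei", "three"),  # shallow cognates (from the abstract)
    ("aqua", "eau"),    # deep cognates (from the abstract)
    ("day", "dia"),     # assumption: non-cognate look-alike, added here
]

STRICT, LOOSE = 0.35, 0.20  # hypothetical threshold values

print(detect(PAIRS, STRICT))  # only the shallow pair passes
print(detect(PAIRS, LOOSE))   # deep pair passes, and so does the look-alike
```

Under this toy measure the scores are 0.40 for drei/three, 0.25 for
aqua/eau and about 0.33 for day/dia. Because the deep cognate pair
scores below the look-alike, no single threshold accepts aqua/eau
while rejecting day/dia.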