Looking for a Needle in a Haystack: Semi-automatic Creation of a Latvian Multi-word Dictionary from Small Monolingual Corpora

Published on 2018-07-27 (506 views)

Inguna Skadiņa

Multiword expressions (MWEs) are an indispensable part of almost any dictionary. However, the identification of missing MWEs that have recently appeared in a language is not a simple task. In this p

EURALEX 2018 - Ljubljana

Presentation

00:00 Looking for the Needle in a Haystack: Semi-automatic Creation of Latvian Multi-word Dictionary from Small Monolingual Corpora

01:03 Multi-word expressions

02:37 Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian

05:11 Tezaurs.lv - the largest open lexical database for Latvian - 1

06:54 Tezaurs.lv - the largest open lexical database for Latvian - 2

07:32 The aim of this study

08:16 Strategies for MWE identification and extraction

09:20 Limitation: rather small amount of data

10:04 Application of statistical measures - 1

10:35 Application of statistical measures - 2

11:39 Application of Statistical Measures

14:20 Lemmatization

16:31 Filtering MWE Candidates

16:37 Linguistic filters

17:18 Results: Balanced Corpus of the Modern Latvian Language

17:55 Limitation: 2-3 tokens

18:17 t-score as measure for term extraction

18:40 Extraction of verbal phrases

18:56 Latvian-Lithuanian Corpus LiLa

19:22 Latvian-Lithuanian Corpus

19:35 Open Subtitles Corpus

19:41 Conclusion

20:07 Thank you!