Creating a Latvian Wordlist
The University of Latvia
Concern “European”, Ukraine
Lingvistica is engaged in language engineering projects for languages of major importance. In 2004-2005, one more language was added to Lingvistica’s palette: Latvian. As Latvia’s role in international cooperation grows, its state language is becoming increasingly important as a means of communication. Hence the need to develop linguistic tools for Latvian such as wordlists, dictionaries and word look-up technologies, and machine translation systems.
One step in this direction was the creation of a Latvian wordlist, commissioned by Franklin Electronic Publishers, Inc., a US-based company. The first version of the wordlist was developed early in 2005 by Lingvistica’s team of Canadians, Latvians, and Ukrainians.
Our task was to create, in a short period of time, a representative list of modern Latvian words featuring word-forms, their frequencies in a representative Latvian text corpus, and hyphenations. To meet the quality and deadline requirements, we decided to build an automatic word-collecting technology that would allow fast and efficient gathering of Latvian words from Internet websites, saving them to a database, and subsequent manual updating.
Two kinds of websites were considered: (a) information portals whose web pages are renewed every day, and (b) websites that are not updated frequently. Examples of (a): http://www.delfi.lv, http://www.tvnet.lv. Examples of (b): www.izm.gov.lv, www.km.gov.lv. Altogether, over 30 websites were considered.
A program for website scanning was developed. The program is a kind of “robot” that analyzes a website starting at the user-indicated address and follows links from one page to another, as many levels down as the user has set. In addition, the user has the following options:
The “robot” saves the gathered words to an MS Access database, together with their frequencies and hyphenations. The database name is also selected by the user. History and statistics, as well as the number of pages in the queue, are displayed in the corresponding windows.
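The scanning behaviour described above (start at a given address, follow links a user-set number of levels down, count the words found) can be sketched roughly as follows. This is a minimal illustration, not the actual program: the names `crawl`, `PageParser`, and `WORD_RE` are hypothetical, the page-fetching function is supplied by the caller, and the word pattern is a simplifying assumption.

```python
import re
from collections import Counter, deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: a "word" is a run of Latvian or basic Latin lowercase letters.
WORD_RE = re.compile(r"[a-zāčēģīķļņšūž]+")


class PageParser(HTMLParser):
    """Collects text fragments and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Note: this also picks up script/style content; a real tool
        # would filter those out.
        self.text.append(data)


def crawl(start_url, fetch, max_depth):
    """Breadth-first scan from start_url, following links max_depth levels down.

    `fetch` maps a URL to an HTML string, or returns None if unavailable.
    Returns a Counter of word-form frequencies.
    """
    counts = Counter()
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        counts.update(WORD_RE.findall(" ".join(parser.text).lower()))
        if depth < max_depth:
            for href in parser.links:
                target = urljoin(url, href)
                if target not in seen:
                    seen.add(target)
                    queue.append((target, depth + 1))
    return counts
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP library; in practice it could wrap `urllib.request.urlopen` with suitable decoding and error handling.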
Fig.1. Word-gathering “robot”: the dialog window
Web scanning was performed in several iterations:
First, the websites that are not regularly updated were scanned, which produced the first version of the database. Then, for approximately a week, the information portals were scanned and the words were automatically added to the database. The result was a database of 76,000 Latvian word-forms with frequencies and hyphenations. Altogether, a text corpus of 1.2 million words was processed, which is a fairly representative text sample.
The website-scanning robot makes use of hyphenation rules developed within this project. Here is the first version of the hyphenation rules, to be improved further (see below) in the next versions of the “robot”:
The hyphenation mark, according to the customer’s standard, is rendered in the database as <shy/>.
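As a small illustration of this markup, the helpers below convert between a list of word parts and the `<shy/>`-marked form. The function names are hypothetical. The sample string in the usage note is a raw (and, as discussed later, erroneous) hyphenation quoted in this paper; it is used here only to show the notation.

```python
SHY = "<shy/>"  # hyphenation mark required by the customer's standard


def render_hyphenation(parts):
    """Join word parts into the marked form stored in the database."""
    return SHY.join(parts)


def split_hyphenation(marked):
    """Split a marked form back into its parts."""
    return marked.split(SHY)


def strip_hyphenation(marked):
    """Recover the plain word-form from a marked form."""
    return marked.replace(SHY, "")
```

For instance, `strip_hyphenation("iz<shy/>pil<shy/>ddi<shy/>rek<shy/>tors")` yields the plain word-form `izpilddirektors`.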
Fig.2. Latvian wordlist as an MS Access database
The database has two additional fields for the future wordlist version: Lemma, i.e. the initial word-form, and part of speech (POS).
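A record layout of this kind can be sketched as below. The project itself used MS Access; SQLite is swapped in here purely because it ships with Python, and the table name, column names, and helper functions are all assumptions made for illustration.

```python
import sqlite3


def open_wordlist(path=":memory:"):
    """Open (or create) a wordlist database with the fields described above."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS wordlist (
               word_form   TEXT PRIMARY KEY,
               frequency   INTEGER NOT NULL DEFAULT 0,
               hyphenation TEXT,
               lemma       TEXT,   -- reserved for a future version
               pos         TEXT    -- reserved for a future version
           )"""
    )
    return conn


def add_occurrences(conn, word_form, count, hyphenation=None):
    """Insert the word-form if it is new, then add `count` to its frequency."""
    conn.execute(
        "INSERT OR IGNORE INTO wordlist (word_form, hyphenation) VALUES (?, ?)",
        (word_form, hyphenation),
    )
    conn.execute(
        "UPDATE wordlist SET frequency = frequency + ? WHERE word_form = ?",
        (count, word_form),
    )
```

Keeping the word-form as the primary key means that repeated scans accumulate frequencies in a single row rather than creating duplicates.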
Updating the database
The raw database compiled by the web-scanning “robot” was manually updated by a Latvian linguist. Two classes of mistakes were corrected: (a) hyphenation-related and (b) lexical.
Another typical correction was the separation of the prefix. Latvian has a number of prefixes, such as "aiz-", "ap-", "at-", "ie-", "iz-", "ne-", "no-", "pa-", "pār-", "pie-", "sa-", "uz-". For example:
There are also prefixes of foreign origin, such as "post-", "eks-". Examples:
Another important correction was separating the self-contained parts of compounds. For example:
The above are correct hyphenations. The raw database contained erroneous hyphenations such as iz<shy/>pil<shy/>ddi<shy/>rek<shy/>tors.
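One way to pre-select entries for this kind of manual review is to check whether a hyphenation point falls at each known morphological boundary (after a prefix, or between the parts of a compound). The sketch below is only an illustration under assumptions: the function names are hypothetical, and the boundary positions must be supplied from elsewhere, for example by a linguist or a list of known components. The decomposition of the sample word into "izpild" + "direktors" is likewise an assumption made for the example.

```python
SHY = "<shy/>"


def break_offsets(marked):
    """Return the character offsets (in the plain word) of all hyphenation points."""
    offsets = []
    position = 0
    for part in marked.split(SHY)[:-1]:
        position += len(part)
        offsets.append(position)
    return offsets


def missing_breaks(marked, boundaries):
    """Return the expected boundary offsets that have no hyphenation point."""
    present = set(break_offsets(marked))
    return [b for b in boundaries if b not in present]
```

For the erroneous form quoted above, the hyphenation points sit at offsets 2, 5, 8, and 11. Checking for the prefix boundary after "iz" (offset 2) and the assumed compound boundary after "izpild" (offset 6) reports offset 6 as missing, so the entry would be flagged for correction.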
Many corrections were made to separate the ending from the rest of the word. Such endings in Latvian include “-nieks”, “-niece”, “-šana”, “-šanās”, “-dams”, “-damies”, etc. Examples:
Before the corrections, the wrong hyphenations were, for example:
The raw database also contained quite a few words in the “chat version” of Latvian: "riit" instead of "rīt", "sarezhgjiiti" instead of "sarežģīti", "izraeeliesji" instead of "izraēlieši". In such spellings, a letter with a diacritic is replaced by a doubled vowel or by a pair of consonants (ā=aa, ē=ee, ī=ii, ž=zh or zj, š=sh or sj, etc.). This kind of language is often used in the comments on some portals. There were also many foreign words used in everyday informal communication. As a result, approximately 3,000 words were deleted from the database.
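Candidates for such deletion could be pre-selected automatically, for example by reversing the substitutions listed above and checking whether the restored spelling is an already known word-form. The sketch below is an assumption-laden illustration rather than the procedure actually used: the function names are hypothetical, the mapping covers only the pairs listed above plus "gj" for "ģ" (inferred from the "sarezhgjiiti" example), and blind replacement can produce false matches, so the result is only a list for manual review.

```python
# Chat-style letter groups and the diacritic letters they stand for.
# Partial mapping: only the pairs named in the text, plus "gj" -> "ģ"
# as inferred from the example "sarezhgjiiti".
CHAT_TO_DIACRITIC = [
    ("aa", "ā"), ("ee", "ē"), ("ii", "ī"),
    ("zh", "ž"), ("zj", "ž"),
    ("sh", "š"), ("sj", "š"),
    ("gj", "ģ"),
]


def restore_diacritics(word):
    """Replace chat-style letter groups with the corresponding diacritic letters."""
    for chat, letter in CHAT_TO_DIACRITIC:
        word = word.replace(chat, letter)
    return word


def chat_candidates(words, known_words):
    """Map each suspected chat spelling to the known word-form it restores to."""
    flagged = {}
    for word in words:
        restored = restore_diacritics(word)
        if restored != word and restored in known_words:
            flagged[word] = restored
    return flagged
```

Requiring the restored form to be a known word keeps ordinary words that merely contain a sequence such as "sh" from being flagged.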
The updated database was converted into an XML file according to the customer’s specification:
The next stage of the Latvian wordlist project will feature: