Creating a Latvian Wordlist

 

Daiga Rence

The University of Latvia

Andrew Rutkas
Concern European, Ukraine

Michael S. Blekhman

Lingvistica

 

Lingvistica is engaged in language engineering projects for languages of major importance. In 2004-2005, one more language was added to Lingvisticas palette Latvian. With Latvias increasing role in the international cooperation, its state language acquires serious importance as a communication means. Hence the necessity of developing such linguistic tools for Latvian as wordlists, dictionaries and word look-up technologies, machine translation systems, etc.

 

One step in this direction was creating a Latvian wordlist. The latter was ordered by Franklin Electronic Publishers, Inc., a USA-based company. The first version of the wordlist was developed early in 2005 by Lingvisticas team uniting Canadians, Latvians, and Ukrainians.

 

We were supposed to create, in a short period of time, a representative list of modern Latvian words featuring word-forms, their frequencies in a representative Latvian text corpus, and hyphenations. To meet the quality and deadline requirements, we decided to create an automatic word-collecting technology that would allow for fast and efficient gathering Latvian words from Internet websites, saving them to a database, and subsequent manual updating.

 

Website scanning

 

Two kinds of websites were considered: (a) information portals featuring web pages renewed every day, and (b) websites that dont feature frequent information updating. Examples of (a): http://www.delfi.lv, http://www.tvnet.lv. Examples of (b): www.izm.gov.lv, www.km.gov.lv. Altogether, over 30 websites were considered.

 

A program for website scanning was developed. The program is a kind of a robot analyzing the website starting with the user-indicated address and moving from one link to another to as many levels down as set up by the user. Besides, the user has the following options:

 

 

The robot saves the words gathered to an MS Access database, with frequencies and hyphenations. The database name is also selected by the user. History and statistics are displayed in the corresponding windows as well as the number of pages in queue.

 

 

Fig.1. Word-gathering robot: the dialog window

 

Web scanning was performed in several iterations:

 

First, the website that dont feature regular information updating were scanned. The result was the first version of the database. Then, for approximately a week, the information portals were scanned, and the words were automatically added to the database, the result of which was a database of 76,000 Latvian word-forms with frequencies and hyphenations. Altogether, a text corpus of 1,2 million words was processed, which is rather a representative text sample.

 

Hyphenation rules

 

The website scanning robot makes use of the hyphenation rules developed in the framework of this project. Here is the 1st version of the hyphenation rules, to be further improved (see below) in the next versions of the robot:

 

  1. Not allowed: a single letter to the left and/or to the right of the hyphenation mark (HM).
  2. At least one vowel should be to the left and to the right of the HM.
  3. Not allowed: HM is between a consonant (C) and a vowel (V), i.e. the following is not allowed: CHV.
  4. CVCV = CV HM CV.
  5. VCCV = VC HM CV.
  6. VCCCV = VC HM CCV.
  7. VCCCCV = VCC HM CCV.
  8. ei, oi, ai are not separated, i.e., for example, e HM i is not allowed.

The hyphenation mark, according to the customers standard, is rendered in the database as <shy/>.

 

 

Fig.2. Latvian wordlist as an MS Access database

 

The database has two additional fields for the future wordlist version: Lemma, i.e. the initial word-form, and part of speech (POS).

 

Updating the database

 

The raw database compiled by the web-scanning robot was manually updated by a Latvian linguist. Two classes of mistakes were corrected: (a) hyphenation-related and (b) lexical.

 

  1. The hyphenation-related mistakes only became obvious after a representative database was automatically created. The corresponding rules were added to the hyphenation algorithm in order to avoid similar mistakes at the further stages of wordlist development. The most frequent mistakes corrected manually in the Hyphenation field were separated diphthongs. In Latvian, the diphthongs are: ai, au, ie, ei, ui, iu, o [pronounced "uo"], oi, eu, ou. Besides, consonants "dz" and "dž" should not be separated if they mark one sound, for example, ru<shy/>dzi, iz<shy/>de<shy/>dži. If the consonants mark different sounds, they should be separated, for example: trūd<shy/>ze<shy/>me.

 

Another typical correction was the separation of the prefix. In Latvian, there is a number of prefixes, such as "aiz-", "ap-", "at-", "ie-", "iz-", "ne-", "no-", "pa-", "pār-", "pie-", "sa-", "uz-". For example:

 

pie<shy/>gā<shy/>des

pār<shy/>val<shy/>des

no<shy/>tei<shy/>ku<shy/>mi.

 

There are also prefixes of foreign origin, such as "post-", "eks-". Examples:

 

eks<shy/>prem<shy/>jers

post<shy/>mo<shy/>der<shy/>nisms.

 

Another important correction was separating the self-contained parts of compounds. For example:

 

da<shy/>tor<shy/>teh<shy/>ni<shy/>kas ("dator"+"tehnikas")

oper<shy/><shy/>zi<shy/>kas ("oper"+"mūzikas")

iz<shy/>pild<shy/>di<shy/>rek<shy/>tors ("izpild"+"direktors").

 

The above are correct hyphenations. The raw database had such erroneous hyphenations as iz<shy/>pil<shy/>ddi<shy/>rek<shy/>tors.

 

A lot of corrections were made to separate the ending from the rest of the word. In Latvian, these endings are: -nieks, -niece, -šana, -šanās, -dams, -damies, etc. Examples:

 

gald<shy/>nieks

priekš<shy/>nie<shy/>ce

ģērb<shy/>ša<shy/>na

ska<shy/><shy/>da<shy/>mies.

 

Before the corrections, the wrong hyphenations were, for example:

 

gal<shy/>dnieks, priek<shy/>šnie<shy/>ce.

 

  1. As to the lexical mistakes, the raw database featured a lot of English words gathered from Latvian web pages, as well as many grammatically incorrect Latvian words and those pertaining to local dialects, slang, etc.

 

Quite a few words of the chat version of Latvian: "riit" instead of "rīt", "sarezhgjiiti" instead of "sarežģīti", "izraeeliesji" instead of "izraēlieši" - respectively, the diacritics are substituted with double vowel or two consonants are put together (ā=aa, ē=ee, ī=ii, ž=zh or zj, š=sh or sj, etc.). This kind of language is often used in the commentaries on some portals. There were a lot of foreign words used in everyday informal communication, too. Thus approximately 3,000 words were deleted from the database.

 

Resulting wordlist

 

The updated database was converted into an XML file according to the customers specification:

 

- <word>

  <spelling>aģentūrai</spelling>

- <hyphenation>

  aģen

  <shy />

  <shy />

  rai

  </hyphenation>

  <frequency>95</frequency>

  </word>

- <word>

 

The next stage of the Latvian wordlist project will feature: