“Ambiguity” in Vietnamese machine translation

Ambiguity is a common phenomenon in language, and Vietnamese is especially prone to it. In speech, people rarely worry about ambiguity and the problem is often overlooked. In writing, however, it is taken more seriously because it can easily lead to confusion or even misinterpretation.

For example, when the Vietnamese word “kiếm” appears in a sentence to be translated, such as “Kiếm gì thế?”, the translator must decide whether to render it in English as “search”, “sword” or “earn”. Human translators can easily decide from the context and other cues, but this is not so simple for a machine. Finding efficient algorithms to resolve ambiguity in Vietnamese has therefore been a challenging task for programmers.

1. Ambiguity in compounds

In English, word boundaries are relatively easy to identify because each word carries a complete meaning and words are separated by spaces. Vietnamese is a different story. It is an isolating language whose vocabulary consists largely of compounds written as several space-separated syllables, so spaces do not always mark word boundaries.

For instance: 

  • He is a student (en) – Anh ấy là sinh viên (vi)

In the English sentence, the word boundaries can be easily defined as:

  • He / is / a / student

However, in the Vietnamese equivalent, treating every space as a word boundary gives an incorrect result:

  • Anh / ấy / là / sinh / viên

“Sinh viên”, a compound word, is now divided into two single words “sinh” and “viên”, which is an incorrect division. Instead, the boundaries must be:

  • Anh ấy / là / sinh viên
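
The segmentation above can be sketched with greedy longest-match lookup against a dictionary, a common baseline for this task. This is only a minimal illustration: the `LEXICON` contents, the `MAX_WORD_SYLLABLES` limit and the `segment` function are assumptions made for the example, and a real system would need a full dictionary and a way to handle sequences that can be split in more than one valid way.

```python
# Toy lexicon (an assumption for this sketch); a real system would load
# a full Vietnamese dictionary.
LEXICON = {"anh ấy", "sinh viên", "là", "anh", "sinh", "viên"}
MAX_WORD_SYLLABLES = 3  # assumed upper bound on syllables per word


def segment(sentence: str) -> list[str]:
    """Greedy longest-match segmentation over space-separated syllables."""
    syllables = sentence.lower().split()
    words = []
    i = 0
    while i < len(syllables):
        longest = min(MAX_WORD_SYLLABLES, len(syllables) - i)
        # Try the longest candidate first, then shrink it.
        for size in range(longest, 0, -1):
            candidate = " ".join(syllables[i:i + size])
            # A single syllable is always accepted as a fallback.
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return words


print(" / ".join(segment("Anh ấy là sinh viên")))
# anh ấy / là / sinh viên
```

Splitting on spaces alone would return five units here, while the dictionary lookup keeps “anh ấy” and “sinh viên” together.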

2. Grammatical differences

A typical grammatical difference between English and Vietnamese is that Vietnamese verbs are not conjugated for person or number, and Vietnamese nouns keep the same form regardless of number. In English, by contrast, verbs agree with their subject and nouns change form according to number.

For example: 

  • “anh đi” (he goes)  – “tôi đi” (I go)
  • “một chai” (one bottle) – “hai chai” (two bottles)

Hence, under the influence of their mother tongue, Vietnamese speakers of English tend to say “I go, you go, he also go” or “one bottle, two bottle also”.
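
For a translation program, this means that agreement absent from the Vietnamese source has to be added on the English side. The sketch below is a simplified illustration with hypothetical helper names (`conjugate_present`, `pluralize`); it covers only regular forms and ignores irregular verbs and nouns.

```python
THIRD_PERSON_SINGULAR = {"he", "she", "it"}


def conjugate_present(subject: str, verb: str) -> str:
    """Add the third-person singular ending that Vietnamese does not mark."""
    if subject.lower() in THIRD_PERSON_SINGULAR:
        # Simplified spelling rule; irregular verbs are not handled.
        needs_es = verb.endswith(("o", "s", "x", "ch", "sh"))
        return verb + ("es" if needs_es else "s")
    return verb


def pluralize(count: int, noun: str) -> str:
    """Add a regular plural ending when the count is not one."""
    return noun if count == 1 else noun + "s"


print("he", conjugate_present("he", "go"))  # he goes
print("I", conjugate_present("I", "go"))    # I go
print(2, pluralize(2, "bottle"))            # 2 bottles
```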

3. Polysemy

Polysemous words (words with multiple meanings) exist in every language because many concepts, although not completely equivalent in meaning, still share a number of similarities.

Take an example:

  • “cây” in “cây cối” (tree), “cây số” (kilometer) or “cây vàng” (gold tael)

These uses share some senses and differ in others. As another case, the word “ăn” has as many as 12 definitions in the Vietnamese dictionary:

  • “Ăn” in “ăn uống” (eating), “ăn tiệc” (join a party), “ăn Tết” (enjoy the Tet holiday), “ăn đũa” (use chopsticks), “ăn thuốc” (smoke), “ăn xăng” (consume a lot of fuel), “ăn khách” (have a lot of visitors/customers), “ăn lương” (receive salary), “ăn ra biển” (flow into the sea), “phanh này không ăn” (the brake fails), “1 USD ăn 23 nghìn đồng” (1 USD equals 23,000 VND), “ăn giải” (get the prize).

This phenomenon hinders machine translation because the program cannot easily decide which of the several meanings of a polysemous word is the appropriate one to translate.
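
One simple way to narrow the choice is to look at the syllable that follows the word and consult a table of known collocations. The sketch below uses a few of the glosses listed above; the `COLLOCATIONS` table, the default sense “eat” and the `translate_word` function are assumptions for illustration, not a description of any particular system.

```python
# Collocation table built from some of the examples above (toy data).
COLLOCATIONS = {
    ("ăn", "tiệc"): "join a party",
    ("ăn", "tết"): "enjoy the Tet holiday",
    ("ăn", "lương"): "receive salary",
    ("ăn", "giải"): "get the prize",
    ("ăn", "xăng"): "consume a lot of fuel",
}
# Assumed fallback sense when no collocation matches.
DEFAULT_SENSE = {"ăn": "eat"}


def translate_word(syllables: list[str], index: int):
    """Pick a gloss for syllables[index] based on the following syllable."""
    word = syllables[index].lower()
    has_next = index + 1 < len(syllables)
    following = syllables[index + 1].lower() if has_next else None
    return COLLOCATIONS.get((word, following), DEFAULT_SENSE.get(word))


print(translate_word("ăn lương".split(), 0))  # receive salary
print(translate_word("ăn Tết".split(), 0))    # enjoy the Tet holiday
print(translate_word(["ăn"], 0))              # eat
```

A lookup like this only works for collocations that are already in the table; uses such as “phanh này không ăn” depend on the wider sentence and need more context than the next syllable.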

4. Homophone and homograph

Homophones are words that share the same pronunciation but have different meanings, while homographs are words that share the same spelling but have different meanings. Because of the characteristics of the Vietnamese language, homophones are usually also homographs, whereas in many other languages the two categories often do not coincide.

It is also necessary to distinguish homographs from polysemous words. A polysemous word carries different meanings that share the same origin; hence, its meanings are always related. In contrast, homographs have no etymological connection, so their meanings are completely different. For instance, the occurrences of “kiếm” in the following sentences are homographs:

  • Anh ta dùng kiếm rất điêu luyện (He uses swords very skillfully).
  • Anh ấy kiếm tiền tốt lắm (He is very good at earning money).

It is much easier to identify the correct meaning of a homograph than that of a polysemous word, as the sharp distinction between the meanings of homographs offers a wide range of good criteria for differentiation.
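
As a rough illustration of such criteria, the sketch below scores each sense of “kiếm” by counting cue words in the surrounding sentence. The cue sets in `SENSE_CUES` are hand-picked from the two example sentences and the function name `disambiguate_kiem` is hypothetical; a practical system would learn cues from data.

```python
# Hand-picked cue words taken from the two example sentences (toy data).
SENSE_CUES = {
    "sword": {"dùng"},  # "dùng" = to use
    "earn": {"tiền"},   # "tiền" = money
}


def disambiguate_kiem(sentence: str):
    """Return the sense whose cue words overlap most with the context."""
    tokens = {tok.strip(".,?!") for tok in sentence.lower().split()}
    context = tokens - {"kiếm"}
    scores = {sense: len(cues & context) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # No cue found: the sentence stays ambiguous.
    return best if scores[best] > 0 else None


print(disambiguate_kiem("Anh ta dùng kiếm rất điêu luyện"))  # sword
print(disambiguate_kiem("Anh ấy kiếm tiền tốt lắm"))         # earn
print(disambiguate_kiem("Kiếm gì thế?"))                     # None
```

The last call returns `None` because “Kiếm gì thế?” contains no cue word, so the sentence alone does not settle the meaning.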

5. Syllable

Another notable difference between English and Vietnamese is that Vietnamese is built from monosyllabic units while English words are often polysyllabic. Each written Vietnamese syllable is pronounced as a single unit, whereas an English word may contain two, three or more syllables. Related to this contrast, Vietnamese has tones, which are marked with diacritics, while English has stress. Vietnamese tones and diacritics are written explicitly in the spelling of a word, whereas English stress is not shown in the spelling and appears only in phonetic transcription. Pronouncing English words with the wrong stress is therefore comparable to speaking Vietnamese without any diacritics.
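
The effect of dropping diacritics can be seen by stripping them programmatically. The sketch below uses Python's standard `unicodedata` module; the helper name `strip_diacritics` is an assumption, and the spellings “kiểm” and “kiềm” are extra examples added here for illustration, not taken from the text above.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove tone and vowel marks, leaving plain Latin letters."""
    # NFD splits each letter into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers the combining (non-spacing) marks.
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # "đ" has no decomposition, so it is mapped by hand.
    return base.replace("đ", "d").replace("Đ", "D")


for word in ["kiếm", "kiểm", "kiềm"]:
    print(word, "->", strip_diacritics(word))
# kiếm -> kiem
# kiểm -> kiem
# kiềm -> kiem
```

Three distinct spellings collapse into the same string, so text written without diacritics adds one more layer of ambiguity for a translation program to resolve.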

6. Parts of speech in Vietnamese language

Parts of speech are a crucial factor in determining the exact meaning of words and arranging them into complete sentences in machine translation. Parts of speech therefore help to eliminate ambiguity, but the concept itself is also ambiguous in some cases. In most inflected languages, parts of speech can be identified easily because a change in part of speech is accompanied by a change in word form.

For example, the English adjective “free” (“tự do” in Vietnamese) becomes a noun by adding the suffix “-dom”, giving “freedom” (“sự tự do”). Such regular markers make it easier to label parts of speech automatically. For non-inflected languages such as Vietnamese, however, more complex algorithms are required because parts of speech can only be determined by analyzing syntax; furthermore, linguists have not reached a consensus on how to classify parts of speech in Vietnamese.
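
The kind of marker-based labeling described for English can be sketched as a short suffix table. The `SUFFIX_TAGS` entries and the `guess_pos` function are assumptions for illustration only; rules this crude misfire on many real words, which is why practical taggers combine them with context.

```python
# A few English suffixes and the part of speech they usually signal (toy data).
SUFFIX_TAGS = [("dom", "NOUN"), ("ness", "NOUN"), ("ly", "ADV")]


def guess_pos(word: str):
    """Guess a part of speech from the word ending, or return None."""
    lowered = word.lower()
    for suffix, tag in SUFFIX_TAGS:
        # Require some stem before the suffix.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return tag
    return None


print(guess_pos("freedom"))  # NOUN
print(guess_pos("free"))     # None
print(guess_pos("tự do"))    # None
```

The Vietnamese “tự do” carries no such marker, so its part of speech has to be inferred from its position and role in the sentence.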
