Algorithm for estimating the number of English translation words from a Japanese source

I am trying to find a way to estimate the number of English words that a Japanese source text will translate into. Japanese has three main scripts - Kanji, Hiragana and Katakana - and each has a different average character-to-word ratio (Kanji is the lowest, Katakana the highest).

Examples:

  • computer: コンピュータ (Katakana, 6 characters); 計算機 (Kanji, 3 characters)
  • whale: くじら (Hiragana, 3 characters); 鯨 (Kanji, 1 character)

As data, I have a large glossary of Japanese words and their English translations, as well as a fairly large corpus of relevant Japanese source documents and their English translations. I want to come up with a formula that counts the number of Kanji, Hiragana and Katakana characters in a source text and estimates the number of English words the translation is likely to produce.

+4

7 answers

I would start with a linear approximation: approx_english_words = a1*n_chars_script1 + a2*n_chars_script2 + a3*n_chars_script3, with the coefficients a1, a2, a3 fitted to your data by linear least squares.

If this does not approximate well, look at the worst cases for the reasons they do not fit (specialized vocabulary, etc.).
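
Here is a minimal Python sketch of that fit, assuming the parallel corpus is available as aligned (Japanese, English) segment pairs; the Unicode script ranges are standard, but the sample data and the variable names are purely illustrative:

```python
import numpy as np

def script_counts(text):
    """Count Kanji, Hiragana and Katakana characters in `text`."""
    kanji = sum(1 for c in text if '\u4e00' <= c <= '\u9fff')
    hiragana = sum(1 for c in text if '\u3040' <= c <= '\u309f')
    katakana = sum(1 for c in text if '\u30a0' <= c <= '\u30ff')
    return [kanji, hiragana, katakana]

# Hypothetical aligned corpus: (Japanese source, English translation) pairs.
corpus = [
    ("計算機を再起動してください。", "Please restart the computer."),
    ("鯨は哺乳類です。", "Whales are mammals."),
    # ... many more aligned segment pairs ...
]

X = np.array([script_counts(ja) for ja, _ in corpus], dtype=float)
y = np.array([len(en.split()) for _, en in corpus], dtype=float)

# Least-squares fit of a1, a2, a3 so that X @ a is close to y.
a, *_ = np.linalg.lstsq(X, y, rcond=None)

def approx_english_words(japanese_text):
    return float(np.dot(script_counts(japanese_text), a))
```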

+1

Here's what Borland (now Embarcadero) assumes about text expansion from English to non-English:

English string length (in characters)    Expected increase
1-5                                      100%
6-12                                      80%
13-20                                     60%
21-30                                     40%
31-50                                     20%
over 50                                   10%

I think you can apply this (with some modifications) to Japanese-to-non-Japanese translation.
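
As a small illustration, here is the table as a Python lookup, assuming (per the suggestion above) that similar length buckets could be reused, with adjusted factors, for Japanese sources; the numbers are just the Borland figures:

```python
# Expected relative growth by source string length, per the table above.
EXPANSION = [(5, 1.00), (12, 0.80), (20, 0.60), (30, 0.40), (50, 0.20)]

def expected_growth(length):
    """Return the expected relative increase for a string of `length` characters."""
    for upper_bound, increase in EXPANSION:
        if length <= upper_bound:
            return increase
    return 0.10  # over 50 characters

print(expected_growth(4))   # 1.0 -> a 4-character string may double in length
print(expected_growth(42))  # 0.2
```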

Another element you might want to consider is the tone of the language. In English, instructions are worded as imperatives, as in "Click OK." But in Japanese, imperatives are considered rude, and instructions are phrased in honorific form (keigo), as in "OKボタンを押してください".

Watch out for multi-kanji compounds. Many big words translate into three- or four-kanji compounds, such as 国際化 (internationalization: 20 characters) and 高可用性 (high availability: 17 characters).

+3

Well, it's a little more complicated than just comparing the character counts of nouns. For example, Japanese also has a different grammatical structure from English, so some sentences will use MANY words in Japanese while others use FEW. I really don't know Japanese, so please forgive me for using Korean.

In Korean, a sentence is often shorter than the equivalent English sentence, mainly because sentences are abbreviated, using context to fill in the missing words. For example, "I love you" can be as short as 사랑이 ("sarangi", just the verb "love") or as long as the fully qualified sentence 저는 당신이 사랑이에요 (I [topic] you [object] love [verb + polite modifier]). How it appears in written text depends on the context, which is usually set by the earlier sentences in the paragraph.

In any case, an algorithm that actually KNOWS the content would be very difficult to build, so you are probably much better off just using statistics. What you should do is take random samples of Japanese texts and English texts that are known to have the same meaning. The larger the sample (and the more random it is), the better... although if the samples are truly random, it will not matter much once you are past a few hundred.
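
A minimal sketch of that sampling idea in Python, assuming the matched texts are available as (Japanese, English) pairs; the data shown is a made-up stand-in:

```python
import random

# Illustrative aligned segment pairs; the real data would come from the
# asker's parallel documents.
pairs = [
    ("計算機を再起動してください。", "Please restart the computer."),
    ("鯨は哺乳類です。", "Whales are mammals."),
    # ... hundreds more ...
]

# A few hundred random pairs should be enough for a stable overall ratio.
sample = random.sample(pairs, k=min(300, len(pairs)))
words = sum(len(en.split()) for _, en in sample)
chars = sum(len(ja) for ja, _ in sample)
print(f"{words / chars:.3f} English words per Japanese character")
```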

Now, another thing: this ratio will change completely depending on the type of text being translated. For example, a highly technical document is likely to have a much higher Japanese/English length ratio than a novel.

Regarding simply using your glossary of word-for-word translations: it probably won't work (and will probably be wrong). The same word does not translate to the same word every time in a different language (although it happens much more often in technical texts). For example, take the word "beautiful". Not only are there several words I could map it to in Korean (i.e. there is a choice), but sometimes that choice is taken away, as in the sentence "that food is beautiful", where I do not mean that the food looks good; I mean that it tastes good, so my translation for the word changes. And this is a VERY common circumstance.

Another big problem is optimal translation. It is something humans are really bad at, and computers are much worse at. Whenever I proofread a document translated from another language into English, I always see various ways it could be made much shorter.

So, although with statistics you could work out a pretty good average length ratio between translations, it will be far from what it would be if all the translations were optimal.

+1

In my experience as a translator and localization specialist, a good rule of thumb is two Japanese characters per English word.

+1

As an experienced translator between Japanese and English, I can say that it is very difficult to quantify, but in my experience English text translated from Japanese runs to almost 200% more characters than the source text. Japanese has many culture-specific terms and nouns that cannot be translated literally and need to be explained in English. When translating, it is not unusual for me to take one Japanese sentence and make one English paragraph out of it so that the meaning is conveyed to the reader. Off the top of my head, here is an example:

「懐かしい」

It literally means "nostalgic". However, in Japanese it can be used as a one-phrase exclamation, while in English, in order to convey a sense of nostalgia, we need much more context. For example, you might have to turn this single phrase into a sentence:

"When I walked around my old elementary school, I was flooded with memories of the past."

This is why machine translation between Japanese and English is not possible.

+1

It seems simple enough - you just need to figure out the ratio.

For each script, count the number of characters in that script and the number of English words across the glossary entries, and determine the ratio.

This can be supplemented with the Japanese source documents, assuming you can determine which script each Japanese word is in and what the equivalent English phrase in the translation is. Otherwise, you will have to estimate the coefficients yourself or discard those documents as raw data.

Then, as you say, count the number of characters in each script of your source text, do the multiplication, and you should have a rough estimate.
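
A rough Python sketch of this approach, reusing the script_counts() helper from the first answer; the glossary shown is an illustrative stand-in, and each entry is attributed to whichever script dominates it:

```python
glossary = [
    ("計算機", "computer"),        # kanji entry
    ("くじら", "whale"),           # hiragana entry
    ("コンピュータ", "computer"),  # katakana entry
    # ... the rest of the glossary ...
]

SCRIPTS = ("kanji", "hiragana", "katakana")
char_totals = {s: 0 for s in SCRIPTS}
word_totals = {s: 0 for s in SCRIPTS}

for ja, en in glossary:
    counts = dict(zip(SCRIPTS, script_counts(ja)))
    script = max(counts, key=counts.get)  # dominant script of this entry
    char_totals[script] += counts[script]
    word_totals[script] += len(en.split())

# English words per character, for each script separately.
ratios = {s: word_totals[s] / char_totals[s]
          for s in SCRIPTS if char_totals[s]}

def rough_estimate(japanese_text):
    counts = dict(zip(SCRIPTS, script_counts(japanese_text)))
    return sum(counts[s] * ratios.get(s, 0.0) for s in SCRIPTS)
```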

0

My (albeit tiny) experience suggests that, no matter what the language, blocks of text take up the same amount of printed space to convey equivalent information. So, for a large block of text, you can assign a per-character width value for English (taken from a common font such as Times New Roman), use a common Japanese font at the same point size, and calculate the number of characters that would be required.
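
A back-of-the-envelope sketch of this idea in Python, assuming made-up average glyph widths (in fractions of an em) for a Latin font and a full-width Japanese font at the same point size:

```python
AVG_ENGLISH_CHAR_WIDTH = 0.5   # e.g. a serif Latin font, relative to em size
AVG_JAPANESE_CHAR_WIDTH = 1.0  # full-width CJK glyphs occupy a whole em
AVG_ENGLISH_WORD_LENGTH = 6.0  # ~5 letters plus a trailing space

def english_words_from_print_area(num_japanese_chars):
    """Estimate English word count from equal printed area."""
    area = num_japanese_chars * AVG_JAPANESE_CHAR_WIDTH
    english_chars = area / AVG_ENGLISH_CHAR_WIDTH
    return english_chars / AVG_ENGLISH_WORD_LENGTH

print(english_words_from_print_area(100))  # ~33 English words per 100 characters
```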

0

Source: https://habr.com/ru/post/1277266/
