Book Translation Data Format

Question

I am thinking of translating a book from English into my native language. I can easily translate, and I like vim as a text editor. My problem is that I would like to somehow preserve the semantics, that is, which parts of my translation correspond to the original.

I could create a simple XML-based markup language that looks like

 <book>
   <chapter>
     <paragraph>
       <sentence>
         <original>This is an example sentence.</original>
         <translation lang="fi">Tämä on esimerkkilause.</translation>
       </sentence>
     </paragraph>
   </chapter>
 </book>

Now this will probably have its advantages, but I don't think editing would be a lot of fun.
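For what it's worth, reading such a file back programmatically is the easy part. Here is a minimal Python sketch, assuming exactly the element and attribute names from the example above; the function name `sentence_pairs` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def sentence_pairs(xml_text, lang="fi"):
    """Return (original, translation) tuples for every <sentence> element.

    Assumes the markup sketched above: each <sentence> holds one <original>
    and at most one <translation> per language code.
    """
    root = ET.fromstring(xml_text)
    pairs = []
    for sentence in root.iter("sentence"):
        original = sentence.findtext("original")
        # findtext returns None when no translation exists for this language
        translation = sentence.findtext(f"translation[@lang='{lang}']")
        pairs.append((original, translation))
    return pairs


book_xml = """
<book><chapter><paragraph>
  <sentence>
    <original>This is an example sentence.</original>
    <translation lang="fi">Tämä on esimerkkilause.</translation>
  </sentence>
</paragraph></chapter></book>
"""

for original, translation in sentence_pairs(book_xml):
    print(original, "->", translation)
```

So the markup is friendly to tools; the cost is all on the editing side.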

Another possibility that I can think of is to save the original and the translation in separate files. If I end each translation block with a newline and keep the line numbering consistent between the two files, editing will be simple and I can match the original and the translation programmatically.

 original.txt:
 This is an example sentence.
 In this format editing is easy.

 translation-fi.txt:
 Tämä on esimerkkilause.
 Tässä muodossa muokkaaminen on helppoa.
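The programmatic matching could look roughly like this. It is a minimal Python sketch, assuming one translation block per line in each file; the function name `pair_lines` is hypothetical. It refuses to pair the texts when the line counts differ, which at least catches the most obvious way the two files can drift apart.

```python
def pair_lines(original_text, translation_text):
    """Pair line N of the original with line N of the translation."""
    originals = original_text.splitlines()
    translations = translation_text.splitlines()
    if len(originals) != len(translations):
        raise ValueError(
            f"line count mismatch: {len(originals)} original vs "
            f"{len(translations)} translated"
        )
    return list(zip(originals, translations))


# Stand-ins for the contents of original.txt and translation-fi.txt
original = "This is an example sentence.\nIn this format editing is easy.\n"
translation = "Tämä on esimerkkilause.\nTässä muodossa muokkaaminen on helppoa.\n"

for number, (source, target) in enumerate(pair_lines(original, translation), start=1):
    print(f"{number}: {source} -> {target}")
```

A count check cannot tell whether two lines that happen to line up really belong together, so it only helps with part of the problem.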

However, this does not seem very reliable. It would be easy to get confused. Probably someone has better ideas. So the question is:

What would be the best data format for translating a book using a text editor?

EDIT: The vim tag has been added since I would prefer to do it with vim and I think some vim gurus might have ideas.

EDIT2: I started a bounty on this. Currently I am leaning towards the second idea described above, but I hope to get something that is about as easy to edit (and fairly easy to implement) but more reliable.

+4

vim file-format nlp translation

dancek Mar 30 '11 at 10:25

3 answers

Why not use a simplified diff format?

It is line-based, which suits whole sentences.
The first character is significant (space, + or -).
It will be quite compact.
You may not need those @@ parts.
Vim will support it and color the English sentence and the Finnish sentence in different colors.
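Reading such a file back is straightforward. The following is a minimal Python sketch under one assumption the answer does not spell out: a `-` line holds an original sentence and the `+` line after it holds its translation. Lines starting with a space or `@@` are ignored, and the function name `parse_simplified_diff` is made up for illustration.

```python
def parse_simplified_diff(text):
    """Return (original, translation) pairs from '-'/'+' prefixed lines.

    Assumption: '-' marks an original sentence and '+' marks the translation
    that follows it. An original with no '+' line is paired with None.
    """
    pairs = []
    pending = None  # original sentence still waiting for its translation
    for line in text.splitlines():
        if line.startswith("-"):
            if pending is not None:
                pairs.append((pending, None))  # previous sentence untranslated
            pending = line[1:].strip()
        elif line.startswith("+"):
            if pending is None:
                raise ValueError(f"translation without original: {line!r}")
            pairs.append((pending, line[1:].strip()))
            pending = None
    if pending is not None:
        pairs.append((pending, None))
    return pairs


sample = """\
-This is an example sentence.
+Tämä on esimerkkilause.
-In this format editing is easy.
+Tässä muodossa muokkaaminen on helppoa.
"""

for original, translation in parse_simplified_diff(sample):
    print(original, "->", translation)
```

Pairing an untranslated sentence with `None` makes it easy to list what is still left to do.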

+2

Benoit May 05 '11 at 11:12

Assuming you want to keep a 1-to-1 relationship between the source text and the translated text, a database table makes the most sense.

You will have one table with the following columns:

id - Integer - Autonum
original_text - Text - Not empty
transl_text - Text - Nullable

You will need a process that loads the source text, and a process that shows you one line of source text and lets you enter the translated text. Perhaps the second process could show you 5 lines (2 before, the line you want to translate, and 2 after) to give you context.
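The table and the two processes could be sketched as follows, using Python's built-in sqlite3 module. This is only one possible reading of the description: the table name `sentences` and all function names are assumptions, and SQLite stands in for whatever database you prefer.

```python
import sqlite3


def create_table(conn):
    # Columns follow the layout above; the table name is an assumption.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS sentences (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            original_text TEXT NOT NULL,
            transl_text TEXT
        )
        """
    )


def load_source(conn, lines):
    """Loading process: one row per line of source text."""
    conn.executemany(
        "INSERT INTO sentences (original_text) VALUES (?)",
        [(line,) for line in lines],
    )


def context(conn, row_id, radius=2):
    """Rows from row_id - radius to row_id + radius, in reading order."""
    return conn.execute(
        "SELECT id, original_text, transl_text FROM sentences "
        "WHERE id BETWEEN ? AND ? ORDER BY id",
        (row_id - radius, row_id + radius),
    ).fetchall()


def save_translation(conn, row_id, text):
    conn.execute(
        "UPDATE sentences SET transl_text = ? WHERE id = ?", (text, row_id)
    )


conn = sqlite3.connect(":memory:")
create_table(conn)
load_source(conn, ["This is an example sentence.", "In this format editing is easy."])
save_translation(conn, 1, "Tämä on esimerkkilause.")
for row in context(conn, 1):
    print(row)
```

With `radius=2` the context query returns up to five rows, matching the "2 before, 2 after" suggestion.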

+1

Gilbert le blanc Mar 30 '11 at 14:20

progo · Accepted Answer · 2011-05-07T14:04:24+0000

One thought: if you keep each translated fragment (one or several sentences) on its own line, Vim's scrollbind and cursorbind options plus a simple vertical split will keep the corresponding fragments in sync. This is very similar to what vimdiff does by default. The files then need to have the same number of lines, and you don't even need to switch windows!

This is not entirely satisfactory, though, because wrapped lines throw the alignment off. If your translation wraps to two or three more screen lines than the source text, the visual correspondence disappears, because the lines no longer sit side by side. I could not find a setting or script that fixes this behavior.

Another suggestion is to interleave the translation with the original. This is close to Benoit's diff approach. After the original has been split into fragments (one fragment per line), I would prepend >> or similar to each line. You start translating a fragment by pressing o. The file would look like this:

  >> This is an example sentence.
  Tämä on esimerkkilause.
  >> In this format editing is easy.
  Tässä muodossa muokkaaminen on helppoa.

I would improve readability with :match Comment /^>>.*$/ or similar, whatever looks good with your color scheme. It would probably be worthwhile to write a :syn region that disables spell checking for the original text. Finally, as a detail, I would map <C-j> to 2j and <C-k> to 2k to allow easy jumping between the parts that matter.

Another plus of this latter approach is that you can hard-wrap everything at 80 columns if you feel like it, as I do :) It would still be trivial to write <C-j>/<C-k> mappings that move between translations.

Cons: buffer completion suffers, since it now offers both source and translated words. Hopefully English words do not occur in the translation too often! :) But this is about as good as it gets. A simple grep will strip the original text once you are done.
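If you would rather script the clean-up than use grep, here is a minimal Python sketch. It assumes the `>>` marker from the example and that every original fragment is followed by its translation before the next fragment starts, so consecutive `>>` lines are treated as one hard-wrapped original. Both function names are hypothetical.

```python
def split_interleaved(text, marker=">>"):
    """Return (original, translation) pairs from the interleaved format."""
    pairs = []  # each entry: [original_parts, translation_parts]
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(marker):
            # A marked line opens a new fragment unless the current fragment
            # has no translation yet (then it is a wrapped continuation).
            if not pairs or pairs[-1][1]:
                pairs.append([[], []])
            pairs[-1][0].append(stripped[len(marker):].strip())
        else:
            if not pairs:
                raise ValueError("translation text before any original line")
            pairs[-1][1].append(stripped)
    return [(" ".join(orig), " ".join(trans)) for orig, trans in pairs]


def translation_only(text, marker=">>"):
    """Drop every marked line, leaving only the translation."""
    kept = [
        line for line in text.splitlines()
        if not line.lstrip().startswith(marker)
    ]
    return "\n".join(kept)


sample = """\
>> This is an example sentence.
Tämä on esimerkkilause.
>> In this format editing is easy.
Tässä muodossa muokkaaminen on helppoa.
"""

print(split_interleaved(sample))
print(translation_only(sample))
```

`translation_only` does the job of the grep mentioned above, while `split_interleaved` keeps the pairing around in case you want to check it before throwing the original away.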
