FCA, it was you who sent me a letter, right? Since a more detailed question is posted here, I will try to answer here (as best I can) instead of replying to your email address.
After reading everything that you and the other people here wrote, I realized that you are writing Korean handwriting recognition software. So you do not have the luxury of simply relying on the Korean input method provided by Apple.
There are two things I can tell you. Let me go through them one by one. (I believe you already know about one of the two.)
How text is written in Hangul.
Having read your question, I don't think it is about Unicode encoding or the decomposed form of Korean text (that is, a bare series of Ja (consonants) and Mo (vowels)). The question is how to determine whether a consonant the user writes is the final consonant of the current syllable (your term was "tail consonant", right?) or the initial consonant of the next syllable. Learning Korean would be the best way to understand this, but let me explain it briefly.
Say you write 소방차 (fire engine). You would write: ㅅ ㅗ ㅂ ㅏ ㅇ ㅊ ㅏ
(Again, I'm not talking about the Unicode decomposed form, but about how people write Korean text.)
When you type ㅗ (which is the second character), the input system displays 소, attaching ㅗ to the previous ㅅ. It then looks the result up in a table of Korean syllables. (Although Hangul is assembled in the JoHap (조합형) style, which is called the composite style, the Korean standard also defines tables of allowed syllables in what is called the Wansung style (완성형). So you should check the "assembled" syllable against the table to see whether such a syllable exists.) It finds 소 in the table, so you see 소.
Now the next character, ㅂ, is written. Here it gets a little trickier. Since the syllable 솝 is in the table, the system first attaches ㅂ to the previous syllable and displays 솝. However, this is not final yet. The user writes the next character, ㅏ. The system knows that no syllable can exist without an initial consonant (Ja): it looks in the table and cannot find a syllable made of ㅏ alone.
So it concludes that the ㅂ it attached to the previous syllable actually belongs to the second syllable, and it displays 소바. Next the user writes ㅇ. The system tries to attach ㅇ to the second syllable, so it displays 소방. (At this point it can also look up 방 in the table, and it is found.)
Now ㅊ is written. The system can check whether the 방 in 소방 could absorb it, that is, whether there is a syllable in which ㅇ and ㅊ sit together under 바. (I cannot type one, because no such syllable exists.) Since there is none, it immediately determines that ㅊ belongs to the next syllable.
Then ㅏ is written. The system combines ㅊ and ㅏ to make 차. When you press the space bar, the return key, or any other whitespace key, it finishes composing the Hangul.
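The walkthrough above can be sketched as a small state machine. This is only a minimal sketch under several assumptions: the input is a string of compatibility jamo as typed on a keyboard, only single consonants and single vowels are handled (no compound vowels, no compound finals, no "type twice for a double consonant"), and the Wansung table lookup is replaced by the Unicode arithmetic for precomposed syllables, so every initial/vowel/final combination counts as valid. The names `syllable` and `compose` are made up for this illustration.

```python
# Jamo in the order used by the Unicode Hangul syllable composition arithmetic.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"          # 21 vowels
FINALS = "ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ"    # 27 finals (index 0 = none)


def syllable(initial, vowel, final=None):
    """Build one precomposed syllable from its jamo."""
    t = FINALS.index(final) + 1 if final else 0
    return chr(0xAC00 + (INITIALS.index(initial) * 21 + VOWELS.index(vowel)) * 28 + t)


def compose(keys):
    """Compose a sequence of jamo keystrokes into displayed text."""
    out = []
    l = v = t = None  # initial, vowel, final of the syllable in progress

    def flush():
        nonlocal l, v, t
        if l and v:
            out.append(syllable(l, v, t))
        elif l:
            out.append(l)  # a lone consonant is shown as-is
        l = v = t = None

    for k in keys:
        if k in VOWELS:
            if l and v and t and t in INITIALS:
                # The tentative final really starts the next syllable (솝 + ㅏ -> 소바).
                moved, t = t, None
                flush()
                l, v = moved, k
            elif l and not v:
                v = k
            else:
                flush()
                out.append(k)  # vowel with no initial consonant: not composable here
        elif k in INITIALS or k in FINALS:
            if l and v and not t and k in FINALS:
                t = k  # tentatively attach as final consonant
            else:
                flush()
                if k in INITIALS:
                    l = k
                else:
                    out.append(k)
        else:
            flush()  # space, return, etc. finish the composition
            out.append(k)
    flush()
    return "".join(out)


keys = "ㅅㅗㅂㅏㅇㅊㅏ"
for i in range(1, len(keys) + 1):
    print(compose(keys[:i]))  # the display after each keystroke
```

Calling `compose` on each prefix of the keystrokes shows the same intermediate displays as described above, including the step where ㅂ moves from 솝 to the start of the second syllable.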
This is a simple case. Korean has more complex syllables, such as 빨 and syllables that begin with ㄲ. For doubled initial consonants, 복자음 (BokJaUm, double consonants), such as ㅃ and ㄲ, people type ㅂ and ㄱ while holding down the Shift key. Then ㅃ and ㄲ are displayed. So choosing the consonant and determining which syllable it belongs to (the previous one or the next one) can be easy when the user types on a keyboard. (However, there are some good Korean input methods for Windows and Xterm that let you type ㅂ twice to make ㅃ. This is a rather intelligent feature, but text containing a run of the same consonant, for example a word that starts with 빱, can be tricky, because you end up testing whether 3 or 4 consecutive consonants should be grouped as {1,3}, {2,2}, or {3,1}.)
The bad news is that, because you are writing handwriting recognition software, you may need to handle such tricky cases yourself if you feed the recognized Hangul characters one by one into an existing Korean input engine. If you write your own input method inside your application, you can keep your own state machine, which might be easier. As you can see, this is a trade-off between depending on an existing input engine and handling each character yourself. (Hmm... wait... maybe an existing input engine can handle these complex cases too.)
FYI, I would like to introduce two open source projects. One of them is the Korean Finder input method for Mac, and the second is an input engine with which you can build a Korean input method. In addition, there is a Korean input method for X Window, hosted here. If you would rather look at a Windows project, here is one.
The last two were hosted on KLDP.net, a Korean open source site, but they have since moved to Google Code. As far as I remember, SaeNaRu and Nabi (butterfly) support typing the same consonant twice to make a double consonant.
For more information, look up libhangul and nabi. (I remember that part of the input method code used to be almost the same in libhangul and nabi, but even then they were separate projects and were expected to evolve independently, so I think they differ now.)
OK, that covers the first thing.
Now let's move on to the second. (This is the part that I said you probably already know. But to make my explanation complete, let me cover it as well.)
It is about which characters to choose as input for your Korean input engine, whether that is your own or one like libhangul. There are basically two representations of Hangul that look composed on the display: composed and decomposed. The composed form contains fully composed syllables. For example, in 사랑합니다, each syllable, 사, 랑, 합, 니, 다, is stored as such. They are not stored as ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ. This is the composed form in Unicode, and it is commonly used by text editors and the like. The other form is the decomposed form in Unicode, which stores the text as ㅅ, ㅏ, ㄹ, ㅏ, ㅇ, ㅎ, ㅏ, ㅂ, ㄴ, ㅣ, ㄷ, ㅏ.
This form is commonly used by file systems. For example, if you give a file a Hangul name on Windows and access the folder containing it from a Mac, it will appear as ㅅ ㅏ ㄹ ㅏ ㅇ ㅎ ㅏ ㅂ ㄴ ㅣ ㄷ ㅏ, although it appears as 사랑합니다 on Windows.
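To see the two forms side by side, here is a small sketch using Python's standard `unicodedata` module. It assumes that "composed" and "decomposed" correspond to the Unicode normalization forms NFC and NFD, which is how these terms are normally used. Note that the decomposed form is made of conjoining jamo from the U+1100 block, not the standalone jamo letters I typed in the lists above.

```python
import unicodedata

text = "사랑합니다"  # composed form: one code point per syllable

nfd = unicodedata.normalize("NFD", text)  # decomposed form: one code point per jamo
nfc = unicodedata.normalize("NFC", nfd)   # back to the composed form

print(len(text), len(nfd))                # 5 syllables versus 12 jamo
for ch in nfd[:3]:
    # Code point and official name of the first three decomposed jamo
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
print(nfc == text)
```

If your recognizer or input engine receives file names or text from elsewhere, normalizing to one form first keeps comparisons and table lookups consistent.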
However, if memory serves, there is another set of characters, which is just a list of the Hangul consonants and vowels. Although they may look the same as or similar to the decomposed characters, they differ in that each one is drawn in the middle of the space a character occupies. Their purpose is to show Hangul letters in tables of the Korean alphabet and similar things, for educational (or any other) purposes.
So, I'm not sure which characters (the decomposed ones, or the ones from the list of Hangul consonants and vowels) should go into the input state machine or input engine that you choose or implement. If you implement it yourself, the choice is yours, but if you use an external library as the engine, you need to find this out.
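The difference between these character sets can be checked directly. The sketch below assumes that the "list of consonants and vowels" is the Unicode Hangul Compatibility Jamo block (U+3130 to U+318F) and that the decomposed characters are the conjoining jamo (U+1100 block); it does not say anything about what libhangul or any other engine actually expects.

```python
import unicodedata

compat = "ㅅ"        # U+3145, standalone letter as shown in alphabet tables
initial = "\u1109"   # conjoining jamo used as an initial consonant
final = "\u11BA"     # conjoining jamo used as a final consonant

# Three different code points for what looks like the same letter.
for ch in (compat, initial, final):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Canonical normalization leaves the standalone letter untouched ...
print(unicodedata.normalize("NFC", compat) == compat)
# ... while compatibility decomposition maps it to the conjoining initial.
print(unicodedata.normalize("NFKD", compat) == initial)

# Conjoining jamo compose into a syllable under NFC: initial + vowel + final.
print(unicodedata.normalize("NFC", "\u1109\u1161\u11BC"))
```

One consequence is that a standalone letter carries no information about whether it is an initial or a final consonant, so a state machine fed with those letters has to decide that itself, as in the first part of this answer.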
In addition, as I mentioned in my blog post, the composed and the decomposed form each come in two variants, all of which are defined in the Unicode standard. So, well... yes, I agree. This is quite a bit of work.
As for me, I tried to write an input method for the Mac (when Apple announced that they would get rid of the Finder plugin architecture for security reasons), but at that time libhangul (yes, I tried to use it) was changing a lot, so I decided to hold off until it stabilized. Then I became very busy at work and was tired when I got home, so I never finished my input method. I believe the libhangul project is in much better shape now than it was then, so at least take a look at it.
Also, if you don't have Windows, it would be good to try hanterm or any xterm derivative that supports Hangul input on its own. The source code should be available on their hosting sites.
Good luck with your project, and if you still have questions, please ask.