You may be thinking about this slightly the wrong way. UTF-8 is an encoding that is useful for serializing data, e.g. writing it to a file or sending it over the network. It is a non-trivial, variable-width encoding, though: each Unicode code point becomes one to four bytes, so a string of code points can end up as almost any number of encoded bytes.
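To illustrate why the variable width matters, here is a minimal sketch of what it takes to find the n-th code point in a UTF-8 buffer. The helper name byte_offset is made up for this example, and it assumes the input is already valid UTF-8 (no validation is done):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset at which the index-th code point starts.
// Assumes `utf8` holds valid UTF-8. There is no way to jump straight to the
// answer: every earlier lead byte has to be inspected, so the cost is O(index).
std::size_t byte_offset(const std::string& utf8, std::size_t index) {
    std::size_t offset = 0;
    for (std::size_t n = 0; n < index; ++n) {
        assert(offset < utf8.size() && "index is past the end of the string");
        const unsigned char lead = static_cast<unsigned char>(utf8[offset]);
        // The lead byte tells how many bytes this code point occupies.
        offset += lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
    }
    return offset;
}
```

With a fixed-width array the same lookup is just an index, which is the point of the advice that follows.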
What you probably want, if you are going to process the text (judging by your description), is to store raw, fixed-width strings internally. If you go with Unicode (which is what you need), you need 21 bits per code point, so the closest integral type is uint32_t. In short, keep all your strings internally as arrays of integers. Then you can randomly access each code point.
Encode to UTF-8 only when writing to a file or the console, and decode from UTF-8 only when reading.
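A rough sketch of that boundary, assuming the internal representation is a std::vector of uint32_t with one element per code point. The function names decode_utf8 and encode_utf8 are invented for this example, and the error handling (throwing on malformed input, asserting on invalid code points) is just one possible choice:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Decode UTF-8 bytes into one 32-bit integer per code point.
// Throws std::runtime_error on malformed input (bad lead or continuation
// bytes, truncated sequences, overlong forms, surrogates, values > U+10FFFF).
std::vector<std::uint32_t> decode_utf8(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (len > bytes.size() - i)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Smallest code point that legitimately needs `len` bytes.
        static const std::uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::runtime_error("invalid code point in UTF-8 input");
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encode code points back into UTF-8 bytes for output.
// Precondition (checked with assert): every element is a valid scalar value.
std::string encode_utf8(const std::vector<std::uint32_t>& cps) {
    std::string out;
    for (std::uint32_t cp : cps) {
        assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Everything between the two calls then works on the vector, where element i is simply code point i. In real code you may prefer an existing library (ICU, for instance) over hand-rolled conversion.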
By the way, a Unicode code point is still a long way from a character. The notion of a character is too high-level to have a simple, general mechanism. (For example, "a" + "accent grave": two code points, but how many characters?)
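The accent example can be made concrete. The sketch below, with made-up names, stores the same visible letter two ways: as the single precomposed code point U+00E0, and as "a" (U+0061) followed by COMBINING GRAVE ACCENT (U+0300). Counting or comparing array elements treats them as different, even though a reader sees one character in both cases:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint32_t> CodePoints;  // illustrative alias

// Two spellings of what a reader perceives as the single letter "a with grave".
CodePoints precomposed() { return CodePoints{0x00E0}; }          // U+00E0
CodePoints decomposed()  { return CodePoints{0x0061, 0x0300}; }  // 'a' + U+0300

// Shows that code point count and element-wise equality disagree with
// the reader's idea of "one character".
void combining_demo() {
    assert(precomposed().size() == 1);
    assert(decomposed().size() == 2);
    assert(precomposed() != decomposed());
}
```

Treating the two as equal requires Unicode normalization, and counting user-perceived characters requires grapheme cluster segmentation. Neither falls out of fixed-width storage by itself.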