This works very poorly once you consider that characters are not really limited to 256 values: Unicode's code space runs up to U+10FFFF, roughly 1.1 million code points (about 2^21, not 2^32), and if you try your plan on a UTF-8 string it will blow up, badly.
A better approach would be to use a digest algorithm such as MD5 or FNV, or to keep doing what you do now, but with a sparse array or linked list in place of the fixed 256-entry table.
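Here is a rough sketch of both options. The original code isn't quoted here, so I'm assuming the goal is to tally or fingerprint the characters of a string. The names `fnv1a_32` and `count_code_points` are made up for illustration. The constants are the standard 32-bit FNV-1a offset basis and prime. Python is used only to keep the example short.

```python
from collections import Counter

# Standard 32-bit FNV-1a parameters.
FNV_OFFSET_BASIS_32 = 0x811C9DC5
FNV_PRIME_32 = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """Return the 32-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte                            # mix in the next byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # multiply, keep the low 32 bits
    return h


def count_code_points(text: str) -> Counter:
    """Tally characters by code point (hypothetical helper).

    Counter is a dict underneath, so it is sparse: only code points that
    actually occur take up space, regardless of how large they are.
    """
    return Counter(ord(ch) for ch in text)


# Hash the UTF-8 bytes, count the decoded code points.
sample = "雨雨 regn"
digest = fnv1a_32(sample.encode("utf-8"))
counts = count_code_points(sample)
```

With this, `counts[ord("雨")]` is 2 and nothing has to be sized up front. The digest works on the encoded bytes, so it doesn't care how many code points exist.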
EDIT:
For example: "På japansk heter regn '雨'." (Norwegian for "In Japanese, rain is called '雨'.")
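To show what goes wrong with that string, here is a small sketch. It assumes the original approach was a 256-slot table indexed by character value, which is my guess. `count_with_fixed_table` is a hypothetical stand-in, not the original code.

```python
text = "På japansk heter regn '雨'."


def count_with_fixed_table(s):
    """Tally characters in a 256-slot table indexed by code point."""
    table = [0] * 256
    for ch in s:
        table[ord(ch)] += 1  # raises IndexError once ord(ch) >= 256
    return table


# The kanji is U+96E8, far outside a 256-entry table.
try:
    count_with_fixed_table(text)
    overflowed = False
except IndexError:
    overflowed = True

# Indexing by UTF-8 byte instead stays in range, but a single character
# is then spread over several slots: the kanji is three separate bytes.
rain_bytes = list("雨".encode("utf-8"))
```

In Python the out-of-range index raises an exception. In a language without bounds checking, the same access would silently read or write past the end of the table.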