How to perform operations with 'æ', 'ø' and 'å' in C

Question

I created a program in C that can replace or remove all vowels from a string. In addition, I would like this to work for these characters: "æ", "ø", "å".

I tried using strstr(), but I was unable to implement it without replacing all the characters in a string containing "æ", "ø", or "å". I also read about wchar, but that only complicates things.

The program works with this array of characters:

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};

I tried with this array:

 char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};

but it gives the following warnings:

warning: multi-character character constant [-Wmultichar]
warning: overflow in implicit constant conversion [-Woverflow]

and if I want to replace each vowel with "a", it replaces "å" with "a".

I also tried with UTF-8 hexval in 'æ', 'ø' and 'å'.

 char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};

but it gives this error:

redundant elements in char array initializer

Is there a way to make this work without making it too complicated?

+5

c arrays char replace wchar

Martin johansen 21 sept '15 at 12:14


2 answers

Learn about UTF-8 (including its relation to Unicode) and use a UTF-8 library: libunistring, utfcpp, Glib from GTK, ICU, ...

You need to understand what character encoding you are using.

I highly recommend UTF-8 in all cases (it is the default on most Linux systems and on nearly all Internet and web servers; read locale(7) and utf8(7)). Read utf8everywhere ...

I do not recommend wchar_t, whose width, range and signedness are implementation-specific (you cannot be sure that every Unicode code point fits in a wchar_t; on Windows it is 16 bits wide, so it does not). Also, converting UTF-8 input to Unicode/UCS4 may take more time than processing the UTF-8 directly.

Understand that in UTF-8, a character can be encoded in several bytes. For example, ê (French accented e circonflexe, lower case) is encoded in the two bytes 0xc3, 0xaa and ы (Russian yery, lower case) is encoded in the two bytes 0xd1, 0x8b. Both are considered vowels, but neither fits into one char (which is 8 bits on your machine and mine).
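To make the multi-byte point concrete, here is a minimal sketch in plain C. The helper name utf8_char_len is hypothetical (it is not a library function), and it assumes the input is valid UTF-8: it inspects only the lead byte to tell how many bytes the character occupies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes occupied by the UTF-8 character
   starting at s, judged from its lead byte alone. Returns 0 when the byte
   cannot start a character. Assumes s points into valid UTF-8. */
size_t utf8_char_len(const char *s)
{
    assert(s != NULL);
    unsigned char c = (unsigned char)s[0];
    if (c < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII      */
    if ((c & 0xE0) == 0xC0) return 2;  /* 110xxxxx: two-byte lead    */
    if ((c & 0xF0) == 0xE0) return 3;  /* 1110xxxx: three-byte lead  */
    if ((c & 0xF8) == 0xF0) return 4;  /* 11110xxx: four-byte lead   */
    return 0;                          /* continuation or bad byte   */
}
```

With this helper, utf8_char_len("a") is 1, while the two examples above (ê as "\xc3\xaa" and the Russian yery as "\xd1\x8b") each give 2, which is also what strlen reports for them.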

The concept of a vowel is complex (for example, what are the vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ...?), so there cannot be a simple general solution to your problem (especially since Unicode has combining characters).

Are you sure that æ and œ are letters or vowels? (FWIW, å, œ and æ are classified in Unicode as letters and as lowercase letters.) I was taught in French elementary school that they are ligatures (and French dictionaries do not treat them as letters, so œuf appears in the dictionary at the place of oeuf, which means egg). But I am not an expert on this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is becoming hard to get any other encoding on recent distributions), I do not recommend using wchar_t, but rather UTF-8 char strings (with functions that handle multibyte-encoded UTF-8), for example (using the Glib UTF-8 and Unicode functions):

  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <glib.h>

  unsigned count_norvegian_lowercase_vowels(const char *s)
  {
    assert (s != NULL);
    // s should be a not-too-big string
    // (its `strlen` should be less than UINT_MAX)
    // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
    if (!g_utf8_validate(s, -1, NULL)) {
      fprintf(stderr, "invalid UTF-8 string %s\n", s);
      exit(EXIT_FAILURE);
    }
    unsigned count = 0;
    // walk the string one UTF-8 character (not one byte) at a time
    for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
      gunichar u = g_utf8_get_char(pc);
      // comments from OP make me believe these are the only Norwegian vowels.
      if (u=='a' || u=='e' || u=='i' || u=='o' || u=='u' || u=='y'
          || u==(gunichar)0xe6 //æ U+00E6 LATIN SMALL LETTER AE
          || u==(gunichar)0xf8 //ø U+00F8 LATIN SMALL LETTER O WITH STROKE
          || u==(gunichar)0xe5 //å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
          /* notice that for me ы & ê are also vowels but œ is a ligature ... */
         )
        count++;
    }
    return count;
  }

I'm not sure my function name is correct, but you told me in the comments that Norwegian (which I don't know) has no more vowels than the ones my function counts.

I deliberately did not put UTF-8 in string literals or wide char literals (only in comments). Other, legacy character encodings exist (read about EBCDIC or KOI8), and you might want to cross-compile the code.
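In the same spirit of keeping non-ASCII bytes out of string literals, the replace-or-remove task from the question can also be sketched in plain C without any library. This is only one possible approach, not the Glib method above: the names vowels, vowel_len and replace_vowels are hypothetical, the input is assumed to be valid UTF-8, and only the nine lowercase vowels from the question are handled. The vowels are stored as an array of strings (const char *), which also avoids the "redundant elements in char array initializer" error that a plain char array produces.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Vowels as UTF-8 byte strings. The last three are æ, ø and å, written
   with hex escapes so the encoding of the source file does not matter. */
static const char *const vowels[] = {
    "a", "e", "i", "o", "u", "y",
    "\xc3\xa6", "\xc3\xb8", "\xc3\xa5"
};

/* Hypothetical helper: length in bytes of the vowel starting at s,
   or 0 if no vowel from the table starts there. */
size_t vowel_len(const char *s)
{
    for (size_t i = 0; i < sizeof vowels / sizeof vowels[0]; i++) {
        size_t n = strlen(vowels[i]);
        if (strncmp(s, vowels[i], n) == 0)
            return n;
    }
    return 0;
}

/* Copy src to dst, replacing each vowel by the single byte repl.
   Passing repl == '\0' removes the vowels instead.
   dst must have room for strlen(src) + 1 bytes; the output never grows. */
void replace_vowels(char *dst, const char *src, char repl)
{
    assert(dst != NULL && src != NULL);
    while (*src != '\0') {
        size_t n = vowel_len(src);
        if (n > 0) {
            if (repl != '\0')
                *dst++ = repl;   /* one replacement byte per vowel */
            src += n;            /* skip the whole vowel, 1 or 2 bytes */
        } else {
            *dst++ = *src++;     /* copy any other byte unchanged */
        }
    }
    *dst = '\0';
}
```

Because a two-byte vowel is consumed as a whole, the stray byte that a byte-by-byte replacement leaves behind cannot occur. Other multi-byte characters, such as é, are copied through untouched.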

+4

Basile starynkevitch 21 sept '15 at 12:42


Devnull · Accepted Answer · 2015-09-21T12:37:55+0000

There are two approaches to using these characters. The first is a code page, which lets you use extended ASCII characters (values 128-255), but the code page depends on the system and region, so in general this is a bad idea.

A better alternative is to use Unicode. A typical way to do so is with wide character literals, as in this post:

 wchar_t str[] = L"αγρω";

The main problem with your code is that you are trying to compare ASCII with UTF-8, which can be a problem. The solution is simple: convert all of your literals, as well as your strings, to their wide-character equivalents. You need to work in one common encoding rather than mixing encodings, unless you have conversion functions to help.
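A minimal sketch of that idea, assuming the text is already held in a wchar_t array: the function name replace_vowels_wide is hypothetical, and converting input read from a file or terminal (for example with setlocale and mbstowcs) is deliberately left out. The vowels are written as universal character names so the source file stays pure ASCII.

```c
#include <assert.h>
#include <wchar.h>

/* Replace every lowercase vowel in the wide string s, in place, with repl.
   Assumes each of these letters fits in one wchar_t, which holds for
   a-z, U+00E6, U+00F8 and U+00E5 on common platforms. */
void replace_vowels_wide(wchar_t *s, wchar_t repl)
{
    /* \u00e6, \u00f8 and \u00e5 are the letters ae, o-with-stroke and a-with-ring */
    static const wchar_t vowels[] = L"aeiouy\u00e6\u00f8\u00e5";
    assert(s != NULL);
    for (; *s != L'\0'; s++) {
        if (wcschr(vowels, *s) != NULL)
            *s = repl;
    }
}
```

Here every character is one array element, so a single comparison per character is enough, at the cost of converting all input and output to and from wide strings.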
