How to perform operations with 'æ', 'ø' and 'å' in C

I created a program in C that can replace or remove all vowels from a string. In addition, I would like this to work for these characters: "æ", "ø", "å".

I tried using strstr(), but I could not get it to work without replacing every character in a string that contains "æ", "ø", or "å". I also read about wchar_t, but that only complicates things.

The program works with this array of characters:

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'}; 

I tried with this array:

 char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'}; 

but it gives the following warnings:

warning: multi-character character constant [-Wmultichar]

warning: overflow in implicit constant conversion [-Woverflow]

and if I want to replace each vowel with "a", it replaces "å" with "a".

I also tried with UTF-8 hexval in 'æ', 'ø' and 'å'.

 char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"}; 

but it gives this error:

redundant elements in char array initializer

Is there a way to make this work without making it too complicated?

2 answers

There are two approaches to handling these characters. The first is a code page, which lets you use extended ASCII characters (values 128-255), but the code page depends on the system and region, so in general this is a bad idea.

A better alternative is to use Unicode. The typical way to do that is with wide character literals, as in this post:

 wchar_t str[] = L"αγρω"; 

The main problem with your code is that you are comparing single ASCII chars against UTF-8 data, in which "æ", "ø" and "å" each take more than one byte. The fix is simple: convert all of your literals, as well as your strings, to wide-character equivalents. You need to work in one common encoding rather than mixing encodings, unless you have conversion functions to help.
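As a rough illustration of the wide-character route, here is a minimal sketch. The names WIDE_VOWELS and replace_wide_vowels are made up for this example, and it assumes the text is already in a wchar_t buffer (converting from the input encoding, for example with mbstowcs after setlocale, is left out). The vowels are written as universal character names so the result does not depend on the encoding of the source file.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical vowel set taken from the question: a e i o u y plus
   U+00E6 (ae), U+00F8 (o with stroke), U+00E5 (a with ring above). */
static const wchar_t WIDE_VOWELS[] = L"aeiouy\u00e6\u00f8\u00e5";

/* Replace every lowercase vowel in `text` (in place) with `replacement`.
   Returns the number of characters that were replaced. */
size_t replace_wide_vowels(wchar_t *text, wchar_t replacement)
{
    assert(text != NULL);
    size_t replaced = 0;
    for (wchar_t *p = text; *p != L'\0'; ++p) {
        /* Each element of `text` is one whole character here,
           so a plain wcschr lookup is enough. */
        if (wcschr(WIDE_VOWELS, *p) != NULL) {
            *p = replacement;
            ++replaced;
        }
    }
    return replaced;
}
```

Called on a buffer holding L"bl\u00e5b\u00e6r" with L'a' as the replacement, this should leave L"blabar" in the buffer. Uppercase vowels and decomposed forms (a base letter followed by a combining mark) are not handled.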

Learn about UTF-8 (including its relation to Unicode) and use a UTF-8 library: libunistring, utfcpp, Glib from GTK, ICU, ...

You need to understand what character encoding you are using.

I highly recommend UTF-8 in all cases (it is the default on most Linux systems and on almost all Internet and web servers; read locale(7) and utf8(7)). Read utf8everywhere.

I do not recommend wchar_t, whose width, range and signedness are implementation specific. You cannot be sure that every Unicode code point fits in a wchar_t (on Windows it is 16 bits wide, so it does not). Also, converting UTF-8 input to Unicode/UCS4 may take more time than processing the UTF-8 directly.

Understand that in UTF-8, a character can be encoded in several bytes. For example, ê (French accented e circonflexe, lower case) is encoded in two bytes 0xc3, 0xaa and ы (Russian yery, lower case) is encoded in two bytes 0xd1, 0x8b. Both are considered vowels, but neither fits into one char (which is 8 bits on your machine and mine).
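To make the byte counts concrete, here is a small sketch (EXTENDED_VOWELS and utf8_code_point_count are illustrative names, not from any library). It also shows why `char extended[3] = {"\xc3\xa6", ...}` from the question cannot compile: each of those vowels is a two-byte string, so the array has to hold pointers to strings rather than single chars. The counting function assumes its input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each vowel is a short UTF-8 byte string, not a single char. */
static const char *const EXTENDED_VOWELS[3] = {
    "\xc3\xa6", /* U+00E6 LATIN SMALL LETTER AE */
    "\xc3\xb8", /* U+00F8 LATIN SMALL LETTER O WITH STROKE */
    "\xc3\xa5", /* U+00E5 LATIN SMALL LETTER A WITH RING ABOVE */
};

/* Count code points in a valid UTF-8 string by counting every byte
   that is not a continuation byte (bit pattern 10xxxxxx). */
size_t utf8_code_point_count(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

With this, strlen(EXTENDED_VOWELS[0]) is 2 while utf8_code_point_count(EXTENDED_VOWELS[0]) is 1. Note that a hex escape swallows any hex digits that follow it, so a literal such as "\xc3\xa5" followed directly by the letter b has to be split into two adjacent string literals.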

The concept of a vowel is complex (for example, what are the vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ...?), so there may be no simple solution to your problem (also because Unicode has combining characters).

Are you sure that æ and œ are letters or vowels? (FWIW, å, œ and æ are classified by Unicode as letters and as lowercase.) I was taught in French elementary school that they are ligatures (and French dictionaries do not treat them as letters, so œuf, which means egg, is listed at the place of oeuf). But I am not an expert in this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is getting hard to get any other on recent distributions), I do not recommend using wchar_t. Use UTF-8 in plain char instead, with functions that handle multibyte encoded UTF-8, for example (using the Glib UTF-8 and Unicode functions):

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <glib.h>

    unsigned count_norvegian_lowercase_vowels(const char *s)
    {
        assert(s != NULL);
        // s should be a not-too-big string
        // (its `strlen` should be less than UINT_MAX)
        // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
        if (!g_utf8_validate(s, -1, NULL)) {
            fprintf(stderr, "invalid UTF-8 string %s\n", s);
            exit(EXIT_FAILURE);
        }
        unsigned count = 0;
        // step through the string one code point (not one byte) at a time
        for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
            gunichar u = g_utf8_get_char(pc);
            // comments from OP make me believe these are the only Norwegian vowels.
            if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
                || u == (gunichar)0xe6 // æ U+00E6 LATIN SMALL LETTER AE
                || u == (gunichar)0xf8 // ø U+00F8 LATIN SMALL LETTER O WITH STROKE
                || u == (gunichar)0xe5 // å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
                /* notice that for me ы & ê are also vowels but œ is a ligature ... */
               )
                count++;
        }
        return count;
    }

I'm not sure my function name is correct, but you told me in the comments that Norwegian (which I don't know) has no more vowels than the ones my function counts.
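If pulling in Glib is too much for the original task (replacing or removing vowels), a library-free sketch is possible under some loud assumptions: the input is valid, precomposed UTF-8 (no base letter plus combining mark), and only the nine lowercase vowels from the question matter. The names UTF8_VOWELS, vowel_length_at and replace_utf8_vowels are invented for this example. Matching on whole byte sequences is safe here because in UTF-8 a lead byte such as 0xc3 never appears in the middle of another character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical vowel table: every entry is a UTF-8 byte string. */
static const char *const UTF8_VOWELS[] = {
    "a", "e", "i", "o", "u", "y",
    "\xc3\xa6", /* U+00E6 */
    "\xc3\xb8", /* U+00F8 */
    "\xc3\xa5", /* U+00E5 */
};

/* Byte length of the vowel that starts at `s`, or 0 if none does. */
static size_t vowel_length_at(const char *s)
{
    for (size_t i = 0; i < sizeof UTF8_VOWELS / sizeof UTF8_VOWELS[0]; ++i) {
        size_t len = strlen(UTF8_VOWELS[i]);
        if (strncmp(s, UTF8_VOWELS[i], len) == 0)
            return len;
    }
    return 0;
}

/* Copy `src` into `dst`, writing `replacement` in place of each vowel
   (pass "" to remove vowels). Returns 0 on success, or -1 if `dst`
   is too small, in which case its contents are unspecified. */
int replace_utf8_vowels(const char *src, const char *replacement,
                        char *dst, size_t dst_size)
{
    assert(src != NULL && replacement != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;
    size_t rep_len = strlen(replacement);
    size_t out = 0;
    while (*src != '\0') {
        size_t vlen = vowel_length_at(src);
        if (vlen > 0) {
            /* always keep one byte free for the terminator */
            if (out + rep_len >= dst_size)
                return -1;
            memcpy(dst + out, replacement, rep_len);
            out += rep_len;
            src += vlen; /* skip the whole multibyte vowel */
        } else {
            if (out + 1 >= dst_size)
                return -1;
            dst[out++] = *src++; /* copy non-vowel bytes unchanged */
        }
    }
    dst[out] = '\0';
    return 0;
}
```

Because the whole two-byte sequence is consumed at once, "å" becomes a single replacement rather than one replacement per byte.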

I deliberately did not put UTF-8 in string literals or wide char literals (only in comments). There are other legacy character encodings (read about EBCDIC or KOI8), and you might want to cross-compile the code.

Source: https://habr.com/ru/post/1231948/

