How to get the number of characters in a file (not bytes) in C on Linux

Question


I would like to get the number of characters in a file. By characters I mean "real" characters, not bytes, assuming I know the encoding of the file.

I tried to use mbstowcs(), but it does not work for this because it uses the current locale (the system default, or whatever was set with setlocale()). Since setlocale() is not thread-safe, I don't think it's a good idea to call it before mbstowcs(). Even if it were safe, I would have to be sure that my program does not "jump" (signal handler, etc.) between the two calls to setlocale() (one to switch to the file's encoding, and one to restore the previous locale).

To take an example, suppose we have a file ru.txt in a Russian encoding (for example, KOI8). I would like to open the file and get the number of characters, given that its encoding is KOI8.

It would be that simple if mbstowcs() accepted a source_encoding argument...

EDIT: Another problem with mbstowcs() is that a locale matching the encoding of the file must be installed on the system...

+4

c linux encoding unicode

Thibaut D. Aug 12 '13 at 11:59


2 answers

To count the characters (code points) in a UTF-8 file, simply pass its contents to this function (C++):

    #include <string>

    // Every byte that is not a continuation byte (10xxxxxx) starts a character.
    int CalcUTF8Chars( const std::string& S )
    {
        int Count = 0;
        for ( size_t i = 0; i != S.length(); i++ )
        {
            if ( ( S[i] & 0xC0 ) != 0x80 ) { Count++; }
        }
        return Count;
    }

No external dependencies.
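
Since the question asks for C, here is a minimal sketch of the same idea in plain C. The names utf8_count_buf and utf8_count_stream are made up for this illustration, and the code assumes the input is already valid UTF-8 (it does no validation). What it counts are code points, so a letter followed by a combining accent counts as two.

```c
#include <stddef.h>
#include <stdio.h>

/* Count UTF-8 code points in a buffer. Every byte that is not a
 * continuation byte (bit pattern 10xxxxxx) starts a new code point.
 * Assumes the bytes are valid UTF-8; nothing is validated here. */
size_t utf8_count_buf(const unsigned char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Count UTF-8 code points in an open stream, reading fixed-size chunks.
 * Chunk boundaries need no special handling because each byte is
 * classified on its own. Read errors are not reported in this sketch. */
size_t utf8_count_stream(FILE *fp)
{
    unsigned char buf[4096];
    size_t total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += utf8_count_buf(buf, n);
    return total;
}
```

A caller would open the file with fopen(path, "rb"), pass the stream to utf8_count_stream(), and check ferror() afterwards if read errors matter.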

Update:

If you want to handle other encodings, you have two options:

Use a third-party library that can handle it, for example, ICU http://site.icu-project.org/
Write the calculation functions for each encoding you want to use.

0

Sergey K. Aug 12 '13 at 12:26


MEL · Accepted Answer · 2013-08-12T12:14:55+0000

I would suggest using iconv (3):

    NAME
        iconv - perform character set conversion

    SYNOPSIS
        #include <iconv.h>

        size_t iconv(iconv_t cd,
                     char **inbuf, size_t *inbytesleft,
                     char **outbuf, size_t *outbytesleft);

and convert to UTF-32. You get 4 bytes for each converted character (plus 4 for the byte order mark, unless you request an explicit byte order such as UTF-32LE). It should be possible to convert the input piece by piece using a fixed-size outbuf, if its size is chosen carefully (for example, 4 * inbytesleft + 4 :-).
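
A rough sketch of how this could look, not a definitive implementation. The function name count_chars and the 4096-byte chunk size are assumptions made for this example. It converts to "UTF-32LE" so that no byte order mark is emitted, and it assumes each input byte produces at most one code point, which holds for KOI8-R and UTF-8 but not for every encoding iconv knows. The result is a count of Unicode code points.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>

/* Count the characters (code points) in fp, whose bytes are encoded as
 * `encoding` (an iconv name such as "KOI8-R"). Stores the result in
 * *count and returns 0 on success, or returns -1 on any error. */
int count_chars(FILE *fp, const char *encoding, size_t *count)
{
    /* An explicit byte order means no BOM is written to the output. */
    iconv_t cd = iconv_open("UTF-32LE", encoding);
    if (cd == (iconv_t)-1)
        return -1;

    char in[4096];
    char out[4 * sizeof in];   /* assumes <= 1 code point per input byte */
    size_t pending = 0;        /* bytes of an incomplete sequence kept in `in` */
    size_t total = 0;
    int rc = 0;

    for (;;) {
        size_t n = fread(in + pending, 1, sizeof in - pending, fp);
        if (n == 0) {
            /* Read error, or the file ends in the middle of a character. */
            if (ferror(fp) || pending != 0)
                rc = -1;
            break;
        }

        char *inp = in, *outp = out;
        size_t inleft = pending + n, outleft = sizeof out;

        /* EINVAL only means the chunk ends inside a multibyte sequence;
         * anything else (EILSEQ, E2BIG) is treated as a failure. */
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
            && errno != EINVAL) {
            rc = -1;
            break;
        }

        total += (sizeof out - outleft) / 4;   /* 4 bytes per code point */
        memmove(in, inp, inleft);              /* carry the tail over */
        pending = inleft;
    }

    if (rc == 0) {
        /* Flush anything the converter may still be holding back. */
        char *outp = out;
        size_t outleft = sizeof out;
        if (iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1)
            rc = -1;
        else
            total += (sizeof out - outleft) / 4;
    }

    iconv_close(cd);
    if (rc == 0)
        *count = total;
    return rc;
}
```

For the ru.txt example from the question, the call would be count_chars(fp, "KOI8-R", &n) on a stream opened with fopen("ru.txt", "rb"). No setlocale() call is involved, and each call uses its own conversion descriptor. Which encoding names are accepted depends on the iconv implementation; with glibc, iconv -l lists them.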

