toupper and tolower on Linux using the XOR bitwise operation

Question

The implementation of tolower and toupper on Linux is shown below:

    static inline unsigned char __tolower(unsigned char c)
    {
        if (isupper(c))
            c -= 'A'-'a';
        return c;
    }

    static inline unsigned char __toupper(unsigned char c)
    {
        if (islower(c))
            c -= 'a'-'A';
        return c;
    }

Can I use the XOR (^) bitwise operation instead, as shown below?

Is there a potential error if I use the XOR operation?

    c -= 'A'-'a';  ---->  c = c ^ 0x20;  // use XOR to convert lower case to upper case and vice versa

c gcc string linux

vinay hunachyal Dec 6 '16 at 12:07

3 answers

unwind · Answer 1 · 2016-12-06T13:02:05+0000

You probably can, but it is hard to see what you would gain.

XORing a byte value with a constant is not faster than adding (or subtracting) a constant. And the advantage that it becomes a toggle (i.e. toupper() and tolower() can share the same code) is very small, because the amount of code is so small.

When disassembled, these two functions:

    int my_tolower1(int c)
    {
        return c + 'a' - 'A';
    }

    int my_tolower2(int c)
    {
        return c ^ ('a' - 'A');
    }

compile to pretty much the same code, modulo the add vs. xor, of course:

    my_tolower1(int):
            pushq   %rbp
            movq    %rsp, %rbp
            movl    %edi, -4(%rbp)
            movl    -4(%rbp), %eax
            addl    $32, %eax
            popq    %rbp
            ret
    my_tolower2(int):
            pushq   %rbp
            movq    %rsp, %rbp
            movl    %edi, -4(%rbp)
            movl    -4(%rbp), %eax
            xorl    $32, %eax
            popq    %rbp
            ret

Both addl and xorl are three bytes long, so there is no difference in code size. I assume that both also take the same number of cycles on most interesting processors these days.
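The "toggle" behaviour mentioned earlier can be illustrated with a minimal sketch. The names lower_by_add and lower_by_xor are hypothetical (they are unguarded variants in the spirit of my_tolower1 and my_tolower2), and an ASCII character set is assumed:

```c
/* Unguarded conversions; both assume ASCII, where 'a' - 'A' == 0x20. */

/* Adds the case offset: correct only if c is an upper case letter. */
static int lower_by_add(int c)
{
    return c + ('a' - 'A');
}

/* Toggles the case bit: flips upper to lower AND lower to upper. */
static int lower_by_xor(int c)
{
    return c ^ ('a' - 'A');
}
```

For 'A' both return 'a'. For an input that is already lower case they go wrong in different ways: lower_by_add('a') yields 0x81, which is not an ASCII letter, while lower_by_xor('a') flips back to 'A'. Neither is a correct tolower without the isupper() check.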

Please note that, as I said in my comment, you should not go around assuming that your C program runs in an environment where such assumptions hold. The Linux kernel, however, is such an environment.

nwellnhof · Answer 2 · 2016-12-06T13:00:30+0000

On ASCII platforms, 'a' - 'A' is 0x20, the letters A-Z and a-z have sequential values, and each upper case letter differs from its lower case counterpart only in the bit with value 0x20, so you can use c = c ^ 0x20. But the C standard does not mandate a character encoding, which makes this approach non-portable.

A slightly more portable and self-documenting option:

 c ^= 'A' ^ 'a';

(The C standard also does not state that the letters A-Z and a-z have sequential values, so the Linux kernel code is not strictly portable, but it makes fewer assumptions than the XOR trick.)
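One way to make the encoding assumption explicit is to check it at compile time. The following is only a sketch: CASE_MASK and flip_case are hypothetical names, static_assert requires C11 or later, and the assertions are spot checks on a few letters rather than a proof for the whole alphabet.

```c
#include <assert.h>   /* static_assert macro (C11 and later) */

/* Hypothetical name: the bit pattern that separates 'A' from 'a'. */
#define CASE_MASK ('A' ^ 'a')

/* The mask must be exactly one bit, otherwise XOR is not a clean toggle. */
static_assert(CASE_MASK != 0 && (CASE_MASK & (CASE_MASK - 1)) == 0,
              "upper and lower case differ in more than one bit");

/* Spot check that other letter pairs use the same mask. */
static_assert(('M' ^ 'm') == CASE_MASK && ('Z' ^ 'z') == CASE_MASK,
              "case mask is not the same for all letters");

/* Flips the case of c; the caller must ensure that c is a letter. */
static inline unsigned char flip_case(unsigned char c)
{
    return (unsigned char)(c ^ CASE_MASK);
}
```

If the checks fail, the build stops instead of silently producing wrong characters at run time.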

Vlad from Moscow · Answer 3 · 2016-12-06T12:43:44+0000

It would be more correct to use the space character ' ' instead of the magic number 0x20. In this case, the functions will also be valid for the EBCDIC table.

Here is a demo program:

    #include <stdio.h>

    char tolower(char c) { return c ^ ' '; }
    char toupper(char c) { return c ^ ' '; }

    int main( void )
    {
        char s[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

        puts( s );

        for (char *p = s; *p; ++p) *p = tolower(*p);
        puts( s );

        for (char *p = s; *p; ++p) *p = toupper(*p);
        puts( s );
    }

Program output:

    ABCDEFGHIJKLMNOPQRSTUVWXYZ
    abcdefghijklmnopqrstuvwxyz
    ABCDEFGHIJKLMNOPQRSTUVWXYZ

Of course, before calling these functions you must check that the argument is an alphabetic character of the expected case.
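A guarded variant could look like the sketch below. The names xor_tolower and xor_toupper are hypothetical, the checks use isupper() and islower() from <ctype.h> (which depend on the current locale, assumed here to be the default "C" locale), and the character set is assumed to be one where ' ' equals the case bit, such as ASCII.

```c
#include <ctype.h>

/* Converts to lower case only if c is an upper case letter. */
static inline unsigned char xor_tolower(unsigned char c)
{
    if (isupper(c))
        c ^= ' ';   /* toggle the case bit */
    return c;
}

/* Converts to upper case only if c is a lower case letter. */
static inline unsigned char xor_toupper(unsigned char c)
{
    if (islower(c))
        c ^= ' ';   /* toggle the case bit */
    return c;
}
```

With the guards in place, digits, punctuation and letters that already have the requested case are returned unchanged, just as with the subtraction-based versions.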
