I have been working on a hobby project (written in C) for some time, and it is still far from complete. It is important that it runs fast, so I recently did some comparative benchmarking to make sure my approach to the problem would not turn out to be inefficient.
$ time ./old
real    1m55.92
user    0m54.29
sys     0m33.24
I reworked parts of the program to cut out unnecessary operations and to reduce cache misses and branch mispredictions. The wonderful Callgrind tool was showing me more and more impressive numbers. Most of this benchmarking was done with the external processes disabled (a dry run).
$ time ./old --dry-run
real    0m00.75
user    0m00.28
sys     0m00.24

$ time ./new --dry-run
real    0m00.15
user    0m00.12
sys     0m00.02
Clearly I was doing something right, at least in that mode. Running the program for real, however, told a different story.
$ time ./new
real    2m00.29
user    0m53.74
sys     0m36.22
As you may have noticed, most of the time is spent in the external processes, which left me with no idea what could have caused the regression. There is nothing fancy about that part of the code: just the traditional vfork / execve / waitpid done by a single thread, running the same external programs in the same order.
Something had to be skewing the results and slowing things down, so I wrote a little test (similar to the one below) that does nothing but spawn processes, with none of the overhead of my own program. Surely this would be the fastest case possible.
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, const char **argv)
{
    static const char *const _argv[] = {"/usr/bin/md5sum", "test.c", 0};

    /* Discard the children's output. */
    int fd = open("/dev/null", O_WRONLY);
    dup2(fd, STDOUT_FILENO);
    close(fd);

    for (int i = 0; i < 100000; i++) {
        int pid = vfork();
        int status;
        if (!pid) {
            execve("/usr/bin/md5sum", (char *const *)_argv, environ);
            _exit(1);
        }
        waitpid(pid, &status, 0);
    }
    return 0;
}

$ time ./test
real    1m58.63
user    0m68.05
sys     0m30.96
Well, apparently not.
At this point I decided to fiddle with the CPU frequency governor, and the times got better:
$ for i in 0 1 2 3 4 5 6 7; do sudo sh -c "echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor"; done

$ time ./test
real    1m03.44
user    0m29.30
sys     0m10.66
It looks as if each new process gets scheduled onto a different core, and each core takes a while to switch to a higher frequency. I can't say why the old version was faster. Maybe it was just lucky. Perhaps it was inefficient enough to keep a core busy, so the governor chose a higher frequency earlier.
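One way to check this kind of hypothesis is to watch the current frequency of each core while the test runs. The following is only a rough sketch of mine: it assumes a Linux system with the cpufreq sysfs interface, and the scaling_cur_freq file may not exist on every driver.

/* Rough sketch: print the current frequency of each CPU.
 * Assumes Linux with the cpufreq sysfs interface; CPUs without
 * a readable scaling_cur_freq are simply skipped. */
#include <stdio.h>

int main(void)
{
    for (int cpu = 0; cpu < 8; cpu++) {   /* 8 CPUs, as in the governor loop above */
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                     /* no cpufreq info for this CPU */
        long khz;
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%d: %ld MHz\n", cpu, khz / 1000);
        fclose(f);
    }
    return 0;
}

Running something like this in a loop from another terminal while ./test is going would show whether the cores ever reach their top frequency, or whether the work keeps hopping to cold cores.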
A nice side effect of changing the governor was that compilation times improved as well; compiling apparently also involves spawning lots of short-lived processes. However, this is not a workable solution, since the program will have to run on other people's desktops (and laptops).
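Since changing the governor on users' machines is off the table, about the most a program can do without root is detect the situation and warn. A hedged sketch, assuming the same Linux sysfs layout as above (the scaling_governor file is world-readable, so no privileges are needed); the helper name governor_is is mine, just for illustration:

/* Sketch: read cpu0's scaling governor so the program can warn the user
 * when an "ondemand"-style governor is likely to slow down process spawning.
 * Assumes the Linux cpufreq sysfs interface; returns 0 if unknown. */
#include <stdio.h>
#include <string.h>

static int governor_is(const char *name)
{
    char buf[64] = "";
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "r");
    if (!f)
        return 0;                         /* no cpufreq: nothing to report */
    if (fgets(buf, sizeof(buf), f))
        buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
    fclose(f);
    return strcmp(buf, name) == 0;
}

Calling governor_is("ondemand") at startup would at least let the program print a hint like the shell one-liner above, instead of silently running twice as slow.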
The only way I found to improve the original timings without root was to restrict the program (and its child processes) to a single CPU by adding this at the start:
/* requires _GNU_SOURCE and <sched.h> */
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(0, &mask);
sched_setaffinity(0, sizeof(mask), &mask);
This actually turned out to be the fastest run, even with the default "ondemand" governor:
$ time ./test
real    0m59.74
user    0m29.02
sys     0m10.67
Not only is this a hack, it also falls apart if the launched program uses multiple threads. And my program has no way of knowing that in advance.
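One possible compromise, sketched here under the assumption that switching from vfork to plain fork is acceptable: keep the parent pinned to a single CPU so it stays on a warmed-up core, but have each child reset its affinity to all CPUs before execve, so multi-threaded children are not constrained. (A vfork child should not do this, since it is only supposed to call exec or _exit.) The helper name spawn is mine, just for illustration.

/* Sketch: parent stays pinned (as above); each child un-pins itself
 * before exec so multi-threaded programs can still use every core.
 * Uses fork() instead of vfork(), which costs a little extra. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t spawn(const char *path, char *const argv[])
{
    pid_t pid = fork();
    if (pid == 0) {
        cpu_set_t all;
        CPU_ZERO(&all);
        for (int i = 0; i < CPU_SETSIZE; i++)
            CPU_SET(i, &all);             /* allow every CPU again */
        sched_setaffinity(0, sizeof(all), &all);
        execve(path, argv, environ);
        _exit(1);
    }
    return pid;                           /* caller waitpid()s as before */
}

Whether the extra cost of fork over vfork eats up the gains is something I would still have to measure.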
Does anyone have ideas on how to make the spawned processes run at full CPU speed? It has to be automatic and must not require superuser privileges. Although I have only tested this on Linux so far, I intend to port the program to more or less all popular and unpopular desktop OSes (and it should also run on servers). Ideas for any platform are welcome.