Improve string manipulation speed

This is the next question, like this: Write an efficient string replacement function? .

In the (albeit distant) future, I hope to get natural language processing. Of course, because of this, string manipulation speed is important. By chance, I came across this test: http://raid6.com.au/~onlyjob/posts/arena/ - all tests are biased, this is no exception. However, he raised an important question for me. And so I wrote some tests to find out how I do this:

This was my first attempt (I will call it #A):

#A

(defun test () (declare (optimize (debug 0) (safety 0) (speed 3))) (loop with addidtion = (concatenate 'string "abcdefgh" "efghefgh") and initial = (get-internal-real-time) for i from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000) for ln = (* (length addidtion) i) for accumulated = addidtion then (loop with concatenated = (concatenate 'string accumulated addidtion) for start = (search "efgh" concatenated) while start do (replace concatenated "____" :start1 start) finally (return concatenated)) when (zerop (mod ln (* 1024 256))) do (format t "~&~fs | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024))) (values)) (test) 

Having lost the results, I tried to use cl-ppcre - I don’t know what I was hoping for, but the results turned out to be very bad ... Here is the code I used for testing:

#B

 (ql:quickload "cl-ppcre") (defun test () (declare (optimize (debug 0) (safety 0) (speed 3))) (loop with addidtion = (concatenate 'string "abcdefgh" "efghefgh") and initial = (get-internal-real-time) for i from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000) for ln = (* (length addidtion) i) for accumulated = addidtion then (cl-ppcre:regex-replace-all "efgh" (concatenate 'string accumulated addidtion) "____") when (zerop (mod ln (* 1024 256))) do (format t "~&~fs | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024))) (values)) (test) 

Well, then, in the hope that, perhaps, we will move on to some generalizations, I decided to write my own, albeit somewhat naive, version:

#C

 (defun replace-all (input match replacement) (declare (type string input match replacement) (optimize (debug 0) (safety 0) (speed 3))) (loop with pattern fixnum = (1- (length match)) with i fixnum = pattern with j fixnum = i with len fixnum = (length input) do (cond ((>= i len) (return input)) ((zerop j) (loop do (setf (aref input i) (aref replacement j) i (1+ i)) (if (= j pattern) (progn (incf i pattern) (return)) (incf j)))) ((char= (aref input i) (aref match j)) (decf i) (decf j)) (t (setf i (+ i 1 (- pattern j)) j pattern))))) (defun test () (declare (optimize (debug 0) (safety 0) (speed 3))) (loop with addidtion string = (concatenate 'string "abcdefgh" "efghefgh") and initial = (get-internal-real-time) for i fixnum from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000) for ln fixnum = (* (length addidtion) i) for accumulated string = addidtion then (replace-all (concatenate 'string accumulated addidtion) "efgh" "____") when (zerop (mod ln (* 1024 256))) do (format t "~&~fs | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024))) (values)) (test) 

Almost as slow as cl-ppcre ! Now this is unbelievable! There is nothing that could lead to such poor performance ... And yet it suck :(

Understanding that standard functions work best, I looked into the SBCL source, and after some reading I came up with the following:

#D

 (defun replace-all (input match replacement &key (start 0)) (declare (type simple-string input match replacement) (type fixnum start) (optimize (debug 0) (safety 0) (speed 3))) (loop with input-length fixnum = (length input) and match-length fixnum = (length match) for i fixnum from 0 below (ceiling (the fixnum (- input-length start)) match-length) do (loop with prefix fixnum = (+ start (the fixnum (* i match-length))) for j fixnum from 0 below match-length do (when (<= (the fixnum (+ prefix j match-length)) input-length) (loop for k fixnum from (+ prefix j) below (the fixnum (+ prefix j match-length)) for n fixnum from 0 do (unless (char= (aref input k) (aref match n)) (return)) finally (loop for m fixnum from (- k match-length) below k for o fixnum from 0 do (setf (aref input m) (aref replacement o)) finally (return-from replace-all (replace-all input match replacement :start k)))))) finally (return input))) (defun test () (declare (optimize (debug 0) (safety 0) (speed 3))) (loop with addidtion string = (concatenate 'string "abcdefgh" "efghefgh") and initial = (get-internal-real-time) for i fixnum from 0 below (+ (* (/ 1024 (length addidtion)) 1024 4) 1000) for ln fixnum = (* (length addidtion) i) for accumulated string = addidtion then (replace-all (concatenate 'string accumulated addidtion) "efgh" "____") when (zerop (mod ln (* 1024 256))) do (format t "~&~fs | ~d Kb" (/ (- (get-internal-real-time) initial) 1000) (/ ln 1024))) (values)) (test) 

Finally, I can win, although a small part of the performance against the standard library, but still very, very bad compared to almost everything else ...

Here is a table with the results:

 | SBCL #A | SBCL #B | SBCL #C | SBCL #D | C gcc 4 -O3 | String size | |-----------+-----------+------------+-----------+-------------+-------------| | 17.463 s | 166.254 s | 28.924 s | 16.46 s | 1 s | 256 Kb | | 68.484 s | 674.629 s | 116.55 s | 63.318 s | 4 s | 512 Kb | | 153.99 s | gave up | 264.927 s | 141.04 s | 10 s | 768 Kb | | 275.204 s | . . . . . | 474.151 s | 251.315 s | 17 s | 1024 Kb | | 431.768 s | . . . . . | 745.737 s | 391.534 s | 27 s | 1280 Kb | | 624.559 s | . . . . . | 1079.903 s | 567.827 s | 38 s | 1536 Kb | 

Now, the question is: what did I do wrong? Is this something integral to Lisp strings? Could this be mitigated through ... what?

In a long shot, I even thought about writing a specialized library for processing strings. If the problem is not in my bad code, but in the implementation. Does it make sense to do this? If so, what language would you suggest for this?


EDIT: Just for the record, I'm trying to use this library now: https://github.com/Ramarren/ropes to solve string concatenation. Unfortunately, it does not have a replacement function, and performing multiple replacements is not very trivial. But I will keep this post updated when I have something.


I tried to slightly modify the huaiyuan variant to use array fill pointers instead of concatenating strings (to achieve something similar to StringBuilder suggested by Paulo Madeira). It can probably be optimized, but I'm not sure if the / types will be faster, it will be worth redefining the types for * and + to make them work only on fixnum or signed-byte . In any case, here is the code and the standard:

 (defun test/e () (declare (optimize speed)) (labels ((min-power-of-two (num) (declare (type fixnum num)) (decf num) (1+ (progn (loop for i fixnum = 1 then (the (unsigned-byte 32) (ash i 1)) while (< i 17) do (setf num (logior (the fixnum (ash num (the (signed-byte 32) (+ 1 (the (signed-byte 32) (lognot i)))))) num))) num))) (join (xy) (let ((capacity (array-dimension x 0)) (desired-length (+ (length x) (length y))) (x-copy x)) (declare (type fixnum capacity desired-length) (type (vector character) xy x-copy)) (when (< capacity desired-length) (setf x (make-array (min-power-of-two desired-length) :element-type 'character :fill-pointer desired-length)) (replace x x-copy)) (replace xy :start1 (length x)) (setf (fill-pointer x) desired-length) x)) (seek (old str pos) (let ((q (position (aref old 0) str :start pos))) (and q (search old str :start2 q)))) (subs (str old new) (loop for p = (seek old str 0) then (seek old str p) while p do (replace str new :start1 p)) str)) (declare (inline min-power-of-two join seek subs) (ftype (function (fixnum) fixnum) min-power-of-two)) (let* ((builder (make-array 16 :element-type 'character :initial-contents "abcdefghefghefgh" :fill-pointer 16)) (ini (get-internal-real-time))) (declare (type (vector character) builder)) (loop for i fixnum below (+ 1000 (* 4 1024 1024 (/ (length builder)))) for j = builder then (subs (join j builder) "efgh" "____") for k fixnum = (* (length builder) i) when (= 0 (mod k (* 1024 256))) do (format t "~&~8,2F sec ~8D kB" (/ (- (get-internal-real-time) ini) 1000) (/ k 1024)))))) 

  1.68 sec 256 kB 6.63 sec 512 kB 14.84 sec 768 kB 26.35 sec 1024 kB 41.01 sec 1280 kB 59.55 sec 1536 kB 82.85 sec 1792 kB 110.03 sec 2048 kB 
+6
source share
2 answers

Bottle neck is a search function that may not be optimized in SBCL. The next version uses position to skip an impossible region and is about 10 times faster than your version #A on my machine:

 (defun test/e () (declare (optimize speed)) (labels ((join (xy) (concatenate 'simple-base-string xy)) (seek (old str pos) (let ((q (position (char old 0) str :start pos))) (and q (search old str :start2 q)))) (subs (str old new) (loop for p = (seek old str 0) then (seek old str p) while p do (replace str new :start1 p)) str)) (declare (inline join seek subs)) (let* ((str (join "abcdefgh" "efghefgh")) (ini (get-internal-real-time))) (loop for i below (+ 1000 (* 4 1024 1024 (/ (length str)))) for j = str then (subs (join j str) "efgh" "____") for k = (* (length str) i) when (= 0 (mod k (* 1024 256))) do (format t "~&~8,2F sec ~8D kB" (/ (- (get-internal-real-time) ini) 1000) (/ k 1024)))))) 
+5
source

The tests on this page are really biased, so let's see how much. The author claims to be testing string manipulations, but here's what the programs on this page are testing:

  • String concatenation
  • Memory Management, Explicit (C) or Implicit
  • In some languages, regular expressions
  • In other cases, string search algorithms and substring replacement
    • Access to memory that has restrictions in several languages

There are too many aspects. Here's how it is measured:

  • Real time in seconds

This is unsuccessful, since the computer should be fully designed to run only this test for reasonable values ​​without any other processes, such as services, antiviruses, browsers, even the waiting * nix shell. CPU time would be much more useful, you could even run tests on a virtual machine.

Another aspect is that characters in C, C ++, Perl, Python, PHP, and Ruby are 8-bit, but they are 16-bit in many other test languages. This means that memory usage is emphasized in very different quantities, at least 2 times. Here cache misses are much more noticeable.

I suspect Perl is so fast that it checks its arguments once before calling the C function, rather than constantly checking the boundaries. Other languages ​​with 8-bit strings are not so fast, but still fast enough.

JavaScript V8 has strings that are ASCII if possible, so if the added and replaced token was "ëfgh", you would pay a lot more in this implementation.

Python 3 is almost three times slower than Python 2, and I assume this is due to the internal representation of the wchar_t * vs char * strings.

JavaScript SpiderMonkey uses 16-bit strings. I did not dig many sources, but the jsstr.h file mentions ropes.

Java is so slow that String immutable, so for this test it is definitely not the appropriate data type. You pay the price of creating a huge line after each .replace() . I have not tested, but probably StringBuffer will be much faster.

So, this test should be done not only with salt, but also with a handful.


In Common Lisp, border checking and type dispatching in aref and its setf are probably bottlenecks.

For good performance, you will need to generate generic String sequences and use simple-string or simple-vector s, depending on which your implementation is optimized best. Then you should have a way to make schar or svref and their setf capable forms that bypass border checking. From here you can implement your own simple-string-search or simple-character-vector-search (and replace-simple-string or replace-simple-vector , although they play a much smaller role in this particular example) with full speed optimization and declarations types, with border checks on the head of each call instead of every access to the array.

A smart enough compiler ™ would do all this for you, given the “right” declarations. The problem is that you will need to use (concatenate 'simple-string/simple-vector ...) , because neither simple strings nor simple vectors can be changed.

With a compacting / moving GC, in these cases there is always a penalty (for example, copying an array / object), and the choice between array tuning and concatenation should really depend on profiling checks. Otherwise, tuning may be faster than concatenation, while there is enough free memory to grow the array.

You can use custom arrays if the implementation will access the actual elements directly after a brief check of the boundaries at the head of optimized calls / extensions search and replace with custom arrays (for example, using internal definitions that take the actual offset vector / array and the start and end offsets )

But I think a lot here, you need to compile, check the compilation and profile in each implementation for real facts.


As a side note, example C code is filled with errors, such as phased (-1, actually) distributions ( strcat calls write an extra byte, line terminator with a null terminating character) an uninitialized null-terminated gstr (the first strcat works well, as memory may not be initialized to 0), the conversion of size_t and time_t in int and the assumption that these units in printf format string, unused variable pos_c , which is initialized to the first distribution gstr , which increases not received May consider that realloc can move the buffer and do not handle errors at all.

+2
source

Source: https://habr.com/ru/post/949364/


All Articles