I once wrote a text editor. I thought I could do better than the existing ones. Then I discovered Vim and realized I was wrong :P Parts of my highlighting engine still live on GitHub.
Several approaches are possible. You could write real lexical-analysis (or lightweight syntactic) routines, but regular expressions can serve you better if you use them efficiently and are not an expert in parsing theory. I used a combination of the two.
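That combination can be sketched as one master regex doing the lexical pass, with a thin hand-written layer on top for what regexes handle poorly. A minimal illustration in Python (the answer names no language, and these token rules and keyword list are my own placeholders, not the author's):

```python
import re

# Hypothetical token rules for a small C-like language; purely illustrative.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("STRING",  r'"(?:\\.|[^"\\])*"'),
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("OP",      r"[+\-*/=<>!]+"),
    ("PUNCT",   r"[(){}\[\];,]"),
    ("WS",      r"\s+"),
]
KEYWORDS = {"if", "else", "while", "return", "int", "void"}

# One alternation of named groups; order matters (COMMENT must beat OP on "//").
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, lexeme, offset) triples for highlighting."""
    for m in MASTER_RE.finditer(text):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "WS":
            continue
        # The small hand-written layer on top of the regex pass:
        # reclassify identifiers that are actually keywords.
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        yield kind, lexeme, m.start()
```

Because the patterns are tried in order at each position, earlier rules win ties, which is how the comment rule beats the operator rule on `//`.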
To get good performance, editors tend not to lex the entire file. Instead, lex just the visible area of the file, so you minimize the work done. Of course, you then need to think about what happens when the user starts editing somewhere in the middle of that visible area. My approach was to keep a snapshot of the lexer's state (i.e., all token positions and lexical states) in memory at all times; then, from the cursor, go back one or two tokens, restore the lexer state at that point (i.e., keep the tokens and state stacks to the left and discard those to the right), and restart the tokenizer from there to the end of the visible range. Since (I believe) all source languages are read left to right, the tokenization to the left of the edited region should never change.
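A minimal sketch of that snapshot-and-restart idea. The only lexer state tracked here is whether a `/* ... */` block comment is open, and snapshots are kept per line rather than per token; `lex_line`, `Highlighter`, and the line-level granularity are my simplifications for the sketch, not the author's actual design:

```python
def lex_line(line, in_comment):
    """Tokenize one line. Returns (spans, state_after), where spans are
    (start, end, kind) triples and state_after says whether a block
    comment is still open at the end of the line."""
    spans, i, n = [], 0, len(line)
    while i < n:
        if in_comment:
            end = line.find("*/", i)
            stop = n if end < 0 else end + 2
            spans.append((i, stop, "comment"))
            in_comment = end < 0
            i = stop
        else:
            start = line.find("/*", i)
            if start < 0:
                spans.append((i, n, "code"))
                break
            if start > i:
                spans.append((i, start, "code"))
            # Mark the opener itself as comment; the branch above
            # will look for the closer on the next iteration.
            spans.append((start, start + 2, "comment"))
            in_comment = True
            i = start + 2
    return spans, in_comment

class Highlighter:
    def __init__(self, lines):
        self.lines = lines
        # states[k] = saved lexer state *before* line k; states[0] is
        # the start-of-file state. These are the snapshots.
        self.states = [False] * (len(lines) + 1)
        self.relex(0)

    def relex(self, from_line):
        """Re-lex from a saved snapshot, stopping early once the
        downstream snapshots match again."""
        state = self.states[from_line]
        for k in range(from_line, len(self.lines)):
            _, state = lex_line(self.lines[k], state)
            if state == self.states[k + 1]:
                return  # everything to the right is still valid
            self.states[k + 1] = state

    def edit_line(self, k, new_text):
        # Tokens left of the edit never change, so restart from the
        # snapshot taken at the start of the edited line.
        self.lines[k] = new_text
        self.relex(k)
```

In a real editor `relex` would also stop at the bottom of the visible range and resume lazily on scroll; that part is omitted here.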
EDIT: Rereading my source, there were some other optimizations I made along the way. Long lists of keywords (e.g., built-in function names) are expensive to check, so I put them in a radix tree, which gave a huge performance boost.
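A radix tree stores shared prefixes once, so classifying an identifier walks at most its own length instead of scanning the whole keyword list. A sketch of the structure in Python (the node layout and function names here are mine; in Python a plain set would be just as fast, and the win the answer describes came from its original compiled implementation):

```python
class RadixNode:
    __slots__ = ("edges", "terminal")
    def __init__(self):
        self.edges = {}       # first char -> (edge label, child node)
        self.terminal = False  # True if a keyword ends at this node

def insert(root, word):
    node = root
    while word:
        first = word[0]
        if first not in node.edges:
            child = RadixNode()
            child.terminal = True
            node.edges[first] = (word, child)  # new edge with the whole suffix
            return
        label, child = node.edges[first]
        # Length of the common prefix of the edge label and the word.
        k = 0
        while k < min(len(label), len(word)) and label[k] == word[k]:
            k += 1
        if k < len(label):
            # Split the edge: label becomes label[:k], the remainder
            # hangs off a new intermediate node.
            mid = RadixNode()
            mid.edges[label[k]] = (label[k:], child)
            node.edges[first] = (label[:k], mid)
            child = mid
        word = word[k:]
        if not word:
            child.terminal = True
            return
        node = child

def contains(root, word):
    node = root
    while word:
        entry = node.edges.get(word[0])
        if entry is None:
            return False
        label, node = entry
        if not word.startswith(label):
            return False
        word = word[len(label):]
    return node.terminal
```

Lookup cost is bounded by the identifier's length, independent of how many keywords are stored, which is why it pays off for long built-in-function lists.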