How to optimize a grep regex to match URLs

Background:

  • I have a directory called "stuff" containing 26 files (2 .txt and 24 .rtf) on Mac OS 10.7.5.
  • I use grep (GNU v2.5.1) to find all the strings in these 26 files that match the structure of a URL, and then print them into a new file (output.txt).
  • The regular expression below works on a small scale. I ran it in a directory with three files (1 .rtf and 2 .txt) containing a bunch of dummy text and 30 URLs, and it completed successfully in less than 1 second.

I use the following regular expression:

1

grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt 

Problem

The current size of my stuff directory is 180 KB with 26 files. In the terminal, I cd'd into this directory and ran my regular expression. I waited about 15 minutes and then decided to kill the process, since it had NOT finished. When I looked at the output.txt file, it was a whopping 19.75 GB (screenshot).

Question

  • What could cause the output.txt file to be so many orders of magnitude larger than the entire directory?
  • What else can be added to my regex to optimize processing time?

Thank you in advance for any recommendations you can provide. I have been working on variations of my regular expression for almost 16 hours and reading tons of posts on the Internet, but nothing seems to work. I am new to writing regular expressions, but with a little hand-holding I think I'll get there.

Additional comments

I ran the following command to see what had been written to the output.txt file (19.75 GB). The regex seems to find the correct strings, except that it also captures odd extra characters, such as the curly braces } { and strings like {\fldrslt.

  **TERMINAL** $ head -n 100 output.txt http://michacardenas.org/\ http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\ http://www.mumia-themovie.com/"}}{\fldrslt http://www.mumia-themovie.com/}}\ http://www.youtube.com/watch?v=Rvk2dAYkHW8\ http://seniorfitnesssite.com/category/senior-fitness-exercises\ http://www.giac.org/ http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt http://www.youtube.com/watch?v=deOCqGMFFBE}} https://angel.co/jason-a-hoffman\ https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}} http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt http://www.cooking-hacks.com/index.php/documentation 
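A minimal post-processing sketch (my own, not something tested in this thread; the set of characters to cut on is an assumption): since the stray characters all come from RTF markup, each captured line can be trimmed at the first double quote, brace, or backslash.

# trim each match at the first character that should not appear in these URLs
sed 's/["}{\\].*$//' output.txt > output_clean.txt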

Catalog of regex commands tested so far

2

grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_2.txt)

3

grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_3.txt)

4

grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: took 1 second to run / created an empty file (output_4.txt)

5

grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_5.txt)

6

grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_6.txt)

7

grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_7.txt)

8

grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: let it run for 1 minute and then manually killed the process / created a 20.63 GB file (output_8.txt) / On the plus side, this regular expression captured strings that were exact, in the sense that they did NOT include any extra characters such as curly braces or RTF markup like {\fldrslt

9

find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: took 1 second to run / created an empty file (output_9.txt)

10

find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: took 1 second to run / created an empty file (output_10.txt)

11

grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf

Editor's note: this regular expression worked correctly when I printed the matches to the terminal window. It did not work when I wrote the output to a file (file_11.txt).

PARTIAL SUCCESS: all URL strings were captured cleanly, without the blank space before and after the string and without any of the special markup associated with the .RTF format. Downside: of the sample URLs spot-checked for accuracy, some were cut short, losing part of their structure at the end. I estimate that about 10% of the strings were incorrectly truncated.

An example of a truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de

Now the questions are:
1.) Is there a way to make sure that we do not cut off part of the URL string, as in the example above? (A sketch follows below.)
2.) Would it help to add some form of escaping to the regex (if that is even possible)?
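For question 1, a hedged idea (my own sketch, building on regex #11 above, not something from the thread): drop -i and replace the explicit a-zA-Z0-9 ranges with a POSIX character class, which already covers both cases and avoids locale-dependent range behavior that can cut matches short.

# same shape as regex #11, but with [[:alnum:]] instead of a-zA-Z0-9 and no -i
grep -Iroh 'https\?://[[:alnum:]~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf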

12

grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: took 1 second to run / created an empty file (output_12.txt)

13

grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt

FAIL: let it run for 2 minutes and then manually killed the process / created a 1 GB file. The purpose of this variation was to isolate the grep output file (output.txt) in a subdirectory, to make sure we were not creating an infinite loop in which grep reads its own output. Solid idea, but no cigar (screenshot).
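A hedged guess about why this failed: tmp/ is still inside the directory that -r searches, so grep likely reaches tmp/output.txt anyway. Writing the output outside the searched tree should avoid that, for example:

# same command, but the output lands one level above the directory being searched
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > ../output.txt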

14

grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: same result as #11 - the command produced an endless loop of truncated strings.

15

grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
ALMOST A WINNER: this captured the URL strings completely. It led to an endless loop that created millions of lines in the terminal, but I can manually determine where the first pass starts and ends, so this should be fine. GREAT JOB @acheong87! THANKS!

16

find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
NEAR SUCCESS: I was able to grab the ENTIRE URL string, which is good. However, the command turned into an endless loop. After about 5 seconds of output to the terminal, it had produced about 1 million URL lines, most of them duplicates. This would be a good expression if we could figure out how to stop it after one pass.
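A hedged rework of this pipeline (my own sketch, not from the thread): the duplicates and the feedback loop likely come from find printing "." itself, which the recursive grep then walks all over again, output.txt included. Letting find do all of the filename filtering, and dropping -r, avoids both problems.

# find selects only .txt/.rtf files and skips output.txt, so grep never re-reads its own result
find . -type f \( -name '*.txt' -o -name '*.rtf' \) ! -name 'output.txt' -print0 | xargs -0 grep -iIohE 'https?://[^[:space:]]+' > output.txt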

17

ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt

NEAR SUCCESS: this command made a single pass through all the files in the directory, which is good because it solved the infinite-loop problem. However, the URL strings were truncated, and the output included the name of the file from which each string was obtained.

18

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
NEAR SUCCESS: this expression avoided the endless loop, which is good, and it created a new file in the directory being searched, which was small, about 30 KB. It captured all the correct characters in each string, plus a couple that are not needed. As Floris mentioned, in cases where the URL was NOT followed by a space - for example http://www.mumia-themovie.com/"}}{\fldrslt - it captured the markup syntax as well.

19

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[az./?#=%_-,~&]+'
FAIL: this expression avoided the infinite loop, which is good; however, it did NOT capture the entire URL string.

+6
3 answers

The expression I gave in the comments (your test 17) was intended to test two things:

1) can we avoid an endless loop 2) can we iterate over all the files in the directory

I believe we have achieved both. So now I'm bold enough to offer a "solution":

 ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+' 

Breakdown:

  • ls *.rtf *.txt - list all .rtf and .txt files
  • grep -v 'output.txt' - skip 'output.txt' (in case it was left over from a previous attempt)
  • xargs - take each line of the input in turn and substitute it at the end of the following command (or use -J xxx to substitute in place of xxx anywhere in the command)
  • grep -i - case insensitive
  • -I - skip binary files (there shouldn't be any, since we only process .txt and .rtf...)
  • -o - print only the matched bit (not the entire line), i.e. just the URL
  • -h - don't include the name of the source file
  • -E - use extended regular expressions
  • 'http - the match starts with http (there are many other possible URLs... but out of scope for this question)
  • s? - the next character may be an s, or may be missing
  • :// - literal characters that must be there
  • [^[:space:]]+ - one or more "non-space" characters (greedy... "as many as possible")

This seemed to work on a very simple set of files/URLs. I think that now that the iteration problem is solved, the rest is easy. There are lots of URL-validation regexes out there; pick any of them. The above expression really just looks for "everything following http up to a space". If you end up with odd or missing matches, let us know.
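A hedged usage sketch (the redirect and the de-duplication are my additions, not part of the answer above): because ls builds the file list once up front and output.txt is filtered out of it, writing the results back into the same directory cannot feed into the search, and sort -u drops the duplicate URLs.

# collect the matches, de-duplicate them, and save the result
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+' | sort -u > output.txt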

+8

I'm guessing a little, but for a string like

http://abcom something foo bar

the pattern may match as

http://abcom
http://abcom something
http://abcom something foo

(always with a space at the end).

But I don't know if grep is trying to match one line multiple times.

Better try

'https?://\S+\s'

as the pattern.
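A hedged sketch of how that pattern might be run (\S and \s are Perl-style classes, so -P support is assumed; the -E form below expresses the same idea with POSIX classes):

# Perl-style classes, requires a grep built with PCRE support
grep -IrohP 'https?://\S+\s' . --include=*.txt --include=*.rtf
# same idea with POSIX classes and extended regular expressions
grep -IrohE 'https?://[^[:space:]]+[[:space:]]' . --include=*.txt --include=*.rtf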

+1

"What could cause output.txt to have as many orders of magnitude as more than the entire directory?" I think you run the loop when grep reads its own result? Try directing the output to > ~/tmp/output.txt .

+1
