Background:
- I have a directory called "stuff" containing 26 files (2 .txt and 24 .rtf) on Mac OS 10.7.5.
- I use grep (GNU v2.5.1) to find all the strings in these 26 files that match the structure of a URL, and print them to a new file (output.txt).
- The regular expression below works on a small scale. I ran it in a directory with three files (1 .rtf and 2 .txt) full of dummy text and 30 URLs, and it completed successfully in under 1 second.
I use the following regular expression:
1
grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt
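For reference, here is what each flag in that command does (standard GNU grep semantics); quoting the --include globs is just a defensive habit so the shell never expands them:

# -i  match case-insensitively
# -I  skip binary files
# -r  recurse into subdirectories
# -P  use Perl-compatible regex syntax (needed for the lazy quantifier .+?)
# -o  print only the matching part of each line
# -h  suppress file-name prefixes in the output
grep -iIrPoh 'https?://.+?\s' . --include='*.txt' --include='*.rtf' > output.txt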
Problem
The current size of my stuff directory is 180 KB with 26 files. In Terminal, I cd'd into this directory and ran my regular expression. I waited about 15 minutes and then decided to kill the process, since it did NOT finish. When I looked at the output.txt file, it was a whopping 19.75 GB (screenshot).
Question
- What could cause output.txt to be so many orders of magnitude larger than the entire directory it was generated from?
- What else could be added to my regex to optimize the processing time?
Thanks in advance for any guidance you can provide here. I have been working on variations of my regular expression for almost 16 hours and reading tons of posts online, but nothing seems to help. I am new to writing regular expressions, but with a little hand-holding I think I will get there.
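One likely explanation for the runaway file size, consistent with attempts 13 and 16 below: output.txt sits inside the directory being searched, and its name matches --include=*.txt, so the recursive grep keeps re-reading its own ever-growing output. A minimal way to break the cycle, assuming your GNU grep build supports the --exclude option (it has been present since the 2.5 series):

grep -iIrPoh 'https?://.+?\s' . --include='*.txt' --include='*.rtf' --exclude='output.txt' > output.txt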
Additional comments
I ran the following command to see what was being written to the output.txt file (19.75 GB). The regex seems to be finding the right strings, except that it also captures odd characters like curly braces } { and strings like {\fldrslt.
TERMINAL:
$ head -n 100 output.txt
http://michacardenas.org/\
http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\
http://www.mumia-themovie.com/"}}{\fldrslt
http://www.mumia-themovie.com/}}\
http://www.youtube.com/watch?v=Rvk2dAYkHW8\
http://seniorfitnesssite.com/category/senior-fitness-exercises\
http://www.giac.org/
http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt
http://www.youtube.com/watch?v=deOCqGMFFBE}}
https://angel.co/jason-a-hoffman\
https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}}
http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt
http://www.cooking-hacks.com/index.php/documentation
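The {\fldrslt fragments are RTF field markup, not URL text. One way to sidestep RTF syntax entirely is to convert the .rtf files to plain text before grepping, using the textutil tool that ships with Mac OS X (the staging directory name here is just an example):

# convert every .rtf to a plain-text copy in a separate directory
mkdir -p /tmp/stuff-plain
for f in *.rtf; do textutil -convert txt -output "/tmp/stuff-plain/${f%.rtf}.txt" "$f"; done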
Catalog of regex commands tested so far
2
grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_2.txt)
3
grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_3.txt)
4
grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: took 1 second to run / created an empty file (output_4.txt)
5
grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_5.txt)
6
grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_6.txt)
7
grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / created an empty file (output_7.txt)
8
grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: let it run for 1 minute, then manually killed the process / created a 20.63 GB file (output_8.txt). On the plus side, this regex captured strings that were precise in the sense that they did NOT include any extra characters such as curly braces or the RTF markup syntax {\fldrslt.
9
find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: took 1 second to run / created an empty file (output_9.txt)
10
find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: took 1 second to run / created an empty file (output_10.txt)
11
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
Note: this regex worked correctly when I printed the matches to a terminal window. It did NOT work when I redirected the output to file_11.txt.
UNIQUE SUCCESS: the URL strings came out clean, with the whitespace before and after each string removed and all of the special .RTF markup stripped. Downside: of the sample URLs spot-checked for accuracy, some were cut short, losing their structure at the end. I estimate that about 10% of the strings were incorrectly truncated.
An example of a truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de
So now the questions are:
1.) Is there a way to make sure we do not cut off part of the URL string, as in the example above?
2.) Would it help to define an escape for the regex? (If that is even possible.)
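On question 1: an allow-list class like [a-zA-Z0-9~#%&_+=,.?/-] stops at the first character it does not contain, which is one common cause of truncation. A negated class that runs until the next whitespace character is less fragile (attempt 18 below arrives at the same idea); a sketch, with the output file excluded so grep cannot re-read it:

grep -iIrohE 'https?://[^[:space:]]+' . --include='*.txt' --include='*.rtf' --exclude='output.txt' > output.txt

On question 2: there is no separate escape to define; characters that are special inside a bracket expression (such as ] and -) just need careful placement within the brackets.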
12
grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: took 1 second to run / created an empty file (output_12.txt)
13
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt
FAIL: let it run for 2 minutes, then manually killed the process / created a 1 GB file. The point of this attempt was to isolate grep's output file (output.txt) in a subdirectory, so that grep would not read its own results in an infinite loop. A solid idea, but no cigar (screenshot).
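Worth noting: a subdirectory does not isolate the output, because grep -r descends into tmp/ as well, and tmp/output.txt still matches --include=*.txt, so the feedback loop survives. Writing anywhere outside the searched tree does isolate it; a minimal sketch:

grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include='*.txt' --include='*.rtf' > ../output.txt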
14
grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: same result as #11. The command led to an endless loop with truncated strings.
15
grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
ALMOST A WINNER: this captured the complete URL string. It ran into an endless loop that produced millions of lines in the terminal, but I can manually determine where the first pass starts and ends, so this should be fine. GREAT JOB @acheong87! THANKS!
16
find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
NEAR SUCCESS: I was able to grab the ENTIRE URL string, which is good. However, the command turned into an endless loop. After about 5 seconds of writing to the terminal, it had produced about 1 million duplicate URL strings. This would be a good expression if we could figure out how to stop it after one pass.
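A likely culprit here: find . -print emits the directory "." itself, which survives the grep -v filter, so xargs hands grep the whole tree again and the -r flag recurses into output.txt regardless of the filter. Letting find do all of the file selection avoids both problems; a sketch (the -print0/-0 pairing keeps file names with spaces intact):

find . -type f \( -name '*.txt' -o -name '*.rtf' \) ! -name 'output.txt' -print0 |
  xargs -0 grep -iIohE 'https?://[^[:space:]]+' > output.txt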
17
ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt
NEAR SUCCESS: this command made a single pass through all the files in the directory, which is good b/c it solved the infinite-loop problem. However, each URL string came out prefixed with the name of the file it was pulled from.
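Those file-name prefixes come from grep's default behavior when it is given more than one file: each match is printed as filename:match, and the -h flag turns that off. Note also that -F treats the pattern as a fixed string, so with -o it would print only the literal text "http". A sketch using -E instead:

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+' > output.txt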
18
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
NEAR SUCCESS: this expression prevented the endless loop, which is good, and it created a new file of about 30 KB in the directory it was pointed at. It captured all the correct characters in each string, plus a couple that are not needed. As Floris mentioned, in cases where a URL was NOT followed by a space (for example http://www.mumia-themovie.com/"}}{\fldrslt), it captured the markup syntax as well.
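Building on this attempt, the stray "}} and {\fldrslt residue can be trimmed by also excluding the RTF delimiter characters from the negated class, assuming double quotes, braces, and backslashes never legitimately appear in your URLs:

ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]"{}\\]+' > output.txt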
19
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z./?#=%_-,~&]+'
FAIL: this expression prevented the infinite loop, which is good; however, it did NOT capture the entire URL string.