I work with several .txt files in a directory. From all of these files, how should I extract certain words or chunks of text (e.g. sentences, paragraphs and markers defined using regular expressions) and put them in a pandas dataframe (i.e. a table format), while keeping a column with the name of each file? So far, I have created this function that performs this task (I know... this is not ideal):
In:
import glob, os, re
import pandas as pd

regex = r'\<the regex>\b'
ind = 'path/dir'
out = 'path/dir'
f = 'path/redirected/output/'

def foo(ind, reg, out):
    # Visit every .txt file in the input directory
    for filename in glob.glob(os.path.join(ind, '*.txt')):
        with open(filename, 'r') as file:
            stuff = re.findall(reg, file.read(), re.M)
            lis = [t[::2] for t in stuff]
            cont = ' '.join(map(str, lis))
            print(cont)
            # Append one "file name <tab> matches" line per input file
            with open(out, 'a') as f:
                print(os.path.basename(filename) + '\t' + cont, file=f)

foo(ind, regex, out)
Then the output is redirected to the third file:
Out:
fileName1.txt
fileName2.txt stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
fileName3.txt stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
....
fileNameN.txt stringOrChunk
Then I create a dataframe from the previous file (yes, I know this is terrible):
import pandas as pd
df = pd.read_csv('/path/of/f/', sep='\t', names=['file_names', 'col1'])
df.to_csv('/pathOfNewCSV.csv', index=False, sep='\t')
And finally:
   file_names     col1
0  fileName1.txt  NaN
1  fileName2.txt  stringOrChunk stringOrChunk stringOrChunk...
2  fileName3.txt  stringOrChunk stringOrChunk stringOrChunk...
3  fileName4.txt  stringOrChunk
.....
N  fileNameN.txt  stringOrChunk
So, any idea on how to do this in a more pythonic and efficient way?
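For reference, here is a minimal sketch of what a more direct version might look like: it skips the intermediate tab-separated file and builds the dataframe in memory. The function name `matches_to_frame` is hypothetical, the column names are copied from the table above, and it assumes the regex has no capture groups, so that `re.findall` returns plain strings.

```python
import re
from pathlib import Path

import pandas as pd


def matches_to_frame(in_directory, a_regex):
    """Collect regex matches from every .txt file into one dataframe."""
    pattern = re.compile(a_regex, re.M)
    rows = []
    for path in sorted(Path(in_directory).glob('*.txt')):
        text = path.read_text()
        # Assumption: no capture groups, so findall yields whole-match strings.
        found = pattern.findall(text)
        # One row per file; all matches are joined into a single cell.
        rows.append({'file_names': path.name, 'col1': ' '.join(found)})
    return pd.DataFrame(rows, columns=['file_names', 'col1'])
```

Unlike the `read_csv` round trip, a file without matches ends up with an empty string here rather than `NaN`.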
Update
.zip , , , :
a_regex = r"\w+ly"
directory = '/Users/user/Desktop/Docs/'
output_dir = '/Users/user/Desktop/'
foo(directory, a_regex, output_dir)
:
Files words
doc1.txt
doc2.txt
doc3.txt DIRECTLY PROBABLY EARLY
doc4.txt
, ? , , (.. ). , woosh nltk?
UPDATE
, dataframe, , : JESUITS:
Files words1 words2 words3 words4
0 doc1.txt A GOVERNMENT SPOKESMAN HAS ANNOUNCED THAT WITH... NaN NaN NaN
1 doc2.txt 11/12/98 "THERE WAS NO TORTURE OR MISTREATMENT... NaN NaN NaN
2 doc3.txt WHAT WE HAD PREDICTED HAS OCCURRED. CRISTIANI ... SO, THE QUESTION IS: WHO GAVE THE ORDER TO KIL... THE MASSACRE OF THE JESUITS WAS NOT A PERSONAL... LET US REMEMBER THAT AFTER THE MASSSACRE OF TH...
3 doc4.txt IN 11/12/98 OUR VIEW, THE ASSASSINS OF THE JES... THE ASSASSINATION OF THE JESUITS AGAIN CONFIRM... NaN NaN
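A layout like the one above, with one column per match, could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `matches_to_wide_frame` is hypothetical, and it expects a dict mapping each file name to the list of chunks already extracted from that file (for example with `re.findall`).

```python
import pandas as pd


def matches_to_wide_frame(matches_by_file, prefix='words'):
    """Build one row per file and one column per extracted chunk."""
    files = list(matches_by_file)
    # pandas pads a list of unequal-length lists with missing values.
    wide = pd.DataFrame([matches_by_file[name] for name in files])
    # Rename the positional columns 0, 1, 2... to words1, words2, words3...
    wide.columns = [f'{prefix}{i + 1}' for i in range(wide.shape[1])]
    wide.insert(0, 'Files', files)
    return wide
```

Files with fewer chunks than the longest one get missing values in the remaining columns, as in the table above.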