Problems trying to generate pandas dataframe columns from regular expressions?

I work with several .txt files in a directory. From all of these files, how can I extract certain words or chunks of text (i.e. sentences, paragraphs, and markers defined with regular expressions) and put them into a pandas dataframe (i.e. a table format), while keeping a column with the name of each file? So far, I have created this function that performs the task (I know... it is not ideal):

In:

import glob, os, re
import pandas as pd

regex = r'\<the regex>\b'
in_directory = 'path/dir'
out_file = 'path/redirected/output'


def foo(in_directory, a_regex, out_file):
    for filename in glob.glob(os.path.join(in_directory, '*.txt')):
        with open(filename, 'r') as file:
            # find all matches in this file
            stuff = re.findall(a_regex, file.read(), re.M)
            # keep every other capture group from each match
            lis = [t[::2] for t in stuff]
            cont = ' '.join(map(str, lis))
            print(cont)
            # append "filename<TAB>matches" to the output file
            with open(out_file, 'a') as f:
                print(filename.split('/')[-1] + '\t' + cont, file=f)


foo(in_directory, regex, out_file)

The output is then appended to a third file:

Out:

fileName1.txt       
fileName2.txt       stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
fileName3.txt       stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk stringOrChunk
....
fileNameN.txt       stringOrChunk

Then I create a dataframe from that file (yes, I know, it's terrible):

import pandas as pd
df = pd.read_csv('/path/of/f', sep='\t', names=['file_names', 'col1'])
df.to_csv('/pathOfNewCSV.csv', index=False, sep='\t')

And finally:

    file_names  col1
0   fileName1.txt   NaN
1   fileName2.txt   stringOrChunk stringOrChunk stringOrChunk...
2   fileName3.txt   stringOrChunk stringOrChunk stringOrChunk...
3   fileName4.txt   stringOrChunk
.....
N   fileNameN.txt   stringOrChunk

So, any idea on how to do this in a more pythonic and efficient way?
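
For reference, the round trip through the intermediate file can be skipped by collecting matches in memory and building the dataframe directly. This is a minimal sketch, reusing the placeholder names above; foo_df is a hypothetical variant of foo, not part of the original code:

import glob, os, re
import pandas as pd


def foo_df(in_directory, a_regex):
    # hypothetical variant of foo() that returns a dataframe directly,
    # skipping the intermediate tab-separated file
    rows = []
    for filename in glob.glob(os.path.join(in_directory, '*.txt')):
        with open(filename, 'r') as file:
            matches = re.findall(a_regex, file.read(), re.M)
            # join the raw matches; adapt this line if your regex uses groups
            cont = ' '.join(map(str, matches))
            rows.append({'file_names': os.path.basename(filename),
                         'col1': cont or None})  # None shows up as missing
    return pd.DataFrame(rows, columns=['file_names', 'col1'])


# df = foo_df('path/dir', r'\<the regex>\b')
# df.to_csv('/pathOfNewCSV.csv', index=False, sep='\t')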

Update

For example, with a .zip of sample documents unpacked into a directory, and the following parameters:

a_regex = r"\w+ly"
directory = '/Users/user/Desktop/Docs/'
output_dir = '/Users/user/Desktop/'

foo(ind, reg, out)

I get the following output:

Files            words
doc1.txt    
doc2.txt    
doc3.txt     DIRECTLY PROBABLY EARLY 
doc4.txt    

Any idea of how to do this more robustly and efficiently (e.g. for thousands of files)? Should I be using something like Whoosh or nltk instead of plain regular expressions?

Update

Also, how could I get a dataframe where each extracted chunk lands in its own column? For example, extracting the sentences that mention the JESUITS:

    Files   words1  words2  words3  words4
0   doc1.txt    A GOVERNMENT SPOKESMAN HAS ANNOUNCED THAT WITH...   NaN     NaN     NaN
1   doc2.txt    11/12/98 "THERE WAS NO TORTURE OR MISTREATMENT...   NaN     NaN     NaN
2   doc3.txt    WHAT WE HAD PREDICTED HAS OCCURRED. CRISTIANI ...   SO, THE QUESTION IS: WHO GAVE THE ORDER TO KIL...   THE MASSACRE OF THE JESUITS WAS NOT A PERSONAL...   LET US REMEMBER THAT AFTER THE MASSSACRE OF TH...
3   doc4.txt    IN 11/12/98 OUR VIEW, THE ASSASSINS OF THE JES...   THE ASSASSINATION OF THE JESUITS AGAIN CONFIRM...   NaN     NaN
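
One way to get that shape is to collect a list of matches per file and let pandas pad the shorter lists with NaN. A minimal sketch; the matches dict and its contents here are hypothetical stand-ins for whatever the extraction step returns:

import pandas as pd

# hypothetical output of the extraction step: file name -> list of matches
matches = {
    'doc1.txt': ['A GOVERNMENT SPOKESMAN HAS ANNOUNCED THAT ...'],
    'doc3.txt': ['WHAT WE HAD PREDICTED HAS OCCURRED. ...',
                 'SO, THE QUESTION IS: WHO GAVE THE ORDER ...'],
}

# orient='index' makes one row per file and pads uneven rows with NaN
df = pd.DataFrame.from_dict(matches, orient='index')
df.columns = ['words%d' % (i + 1) for i in range(len(df.columns))]
df = df.rename_axis('Files').reset_index()
print(df)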

Since you are really after adverbs rather than just words that end in "ly", here is an approach that uses a part-of-speech tagger from nltk.

from glob import glob
from os.path import join, split

import nltk
import pandas as pd

dir_name = '/tmp/stackovflw/Docs'
file_to_adverb_dict = {}
nltk_adverb_tags = {'RB', 'RBR', 'RBS'}  # taken from nltk.help.upenn_tagset()

for full_file_path in glob(join(dir_name, '*.txt')):
    with open(full_file_path, 'r') as f:
        _, file_name = split(full_file_path)
        # lowercase first - nltk's tagger seems to behave differently when the text is all uppercase; try it...
        tokens = nltk.word_tokenize(f.read().lower())
        adverbs_in_file = [token for token, tag in nltk.pos_tag(tokens) if tag in nltk_adverb_tags]
        # consider using a "set" here to remove duplicates
        file_to_adverb_dict[file_name] = ' '.join(adverbs_in_file).upper()  # converting it back to uppercase (your input is all uppercase)

print(pd.DataFrame(list(file_to_adverb_dict.items()), columns=['file_names', 'col1']))
#   file_names                                               col1
# 0   doc4.txt  PROBABLY ABROAD ALFONSO HOWEVER ALWAYS ALREADY...
# 1   doc1.txt                                                NOT
# 2   doc3.txt  DIRECTLY NOT SO SOLELY NOT PROBABLY NOT EVEN N...
# 3   doc2.txt
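
As an aside, word_tokenize and pos_tag rely on nltk data packages that are not installed by default; if you hit a LookupError, a one-time download fixes it (a short sketch, assuming the standard package names):

import nltk

nltk.download('punkt')                       # tokenizer models used by nltk.word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by nltk.pos_tag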

: , "ly" , grep - :

$ grep  -o -i -E  '\w+ly' *.txt
doc3.txt:DIRECTLY
doc3.txt:SOLELY
doc3.txt:PROBABLY
doc3.txt:EARLY
doc4.txt:PROBABLY

Here -o prints only the matched part of each line, -i makes the search case-insensitive, and -E enables extended regular expressions.

Piping through awk groups the matches per file:

 $ grep  -o -i -E  '\w+ly' *.txt | awk -F':' '{a[$1]=a[$1] " "  $2}END{for( i in a ) print  i,"," a[i]}'
doc4.txt , PROBABLY
doc3.txt , DIRECTLY SOLELY PROBABLY EARLY
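
If you want the shell pipeline's output back in pandas, you can capture it with subprocess and feed it to read_csv. A sketch, assuming grep and awk are on PATH; the awk separator is switched to a tab here so read_csv can split on it:

import io
import subprocess
import pandas as pd

# same pipeline as above, but printing "file<TAB>matches" for easy parsing
cmd = (r"grep -o -i -E '\w+ly' *.txt | "
       r"""awk -F':' '{a[$1]=a[$1] " " $2} END {for (i in a) print i "\t" a[i]}'""")
out = subprocess.run(cmd, shell=True, cwd='/Users/user/Desktop/Docs',
                     capture_output=True, text=True).stdout

df = pd.read_csv(io.StringIO(out), sep='\t', names=['file_names', 'col1'])
print(df)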

Source: https://habr.com/ru/post/1657923/

