Parsing a plain text file into a CSV file using Python

I have a series of HTML files that are parsed into a single text file using Beautiful Soup. The HTML files are formatted so that each one always produces three lines in the text file, so the output looks something like this:

    Hello!
    How are you?
    Well, Bye!

But it could just as easily be:

    83957
    And I ain't coming back!
    hgu39hgd

In other words, the content of the HTML files is not standardized, but each file always produces three lines.

So I was wondering where I should start if I want to take the text file produced by Beautiful Soup and parse it into a CSV file with columns like these (using the examples above):

    Title     Intro                       Tagline
    Hello!    How are you?                Well, Bye!
    83957     And I ain't coming back!    hgu39hgd

The Python code that strips the HTML and writes the text file:

    import os
    import glob
    import codecs
    import csv
    from bs4 import BeautifulSoup

    path = "c:\\users\\me\\downloads\\"

    for infile in glob.glob(os.path.join(path, "*.html")):
        markup = (infile)
        soup = BeautifulSoup(codecs.open(markup, "r", "utf-8").read())
        with open("extracted.txt", "a") as myfile:
            myfile.write(soup.get_text())

And I understand that I can use this to set up the columns in my CSV file:

    csv.put_HasColumnNames(True)
    csv.SetColumnName(0,"title")
    csv.SetColumnName(1,"intro")
    csv.SetColumnName(2,"tagline")

Where I'm drawing a blank is how to iterate through the text file (extracted.txt) one line at a time and, as I reach each new line, assign it to the correct cell in the CSV file. The first few lines of the file are blank, and there are many blank lines between each group of text. So, first I need to open the file and read it:

    file = open("extracted.txt")

    for line in file.xreadlines():
        pass  # csv.SetCell(0,0 X) (obviously, I don't know what to put in X)

Also, I don't know how to tell Python to keep reading the file and adding to the CSV file until it reaches the end. In other words, there is no way to know in advance exactly how many lines the HTML files will produce in total, so I can't just go from csv.SetCell(0,0) to csv.SetCell(999,999).

2 answers

I'm not quite sure which CSV library you are using, but it does not look like Python's built-in csv module. Anyway, here is how I would do it:

    import csv
    import itertools

    with open('extracted.txt', 'r') as in_file:
        stripped = (line.strip() for line in in_file)
        lines = (line for line in stripped if line)
        grouped = itertools.izip(*[lines] * 3)
        with open('extracted.csv', 'w') as out_file:
            writer = csv.writer(out_file)
            writer.writerow(('title', 'intro', 'tagline'))
            writer.writerows(grouped)

This more or less builds a pipeline. It first reads lines from the file, then strips leading and trailing whitespace from each line, then drops the empty lines, then collects the remaining lines into groups of three, and finally (after writing the CSV header) writes those groups to the CSV file.
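The code above uses itertools.izip, which exists only on Python 2. As a rough sketch of the same pipeline for Python 3, where the built-in zip is already lazy, something like the following should work. The function name text_to_csv_rows and the in-memory sample are made up for illustration and are not part of the original answer.

```python
import csv
import io


def text_to_csv_rows(in_file, out_file, header=("title", "intro", "tagline")):
    """Write every three non-blank lines of in_file as one CSV row.

    Hypothetical helper: the name and signature are illustrative only.
    """
    stripped = (line.strip() for line in in_file)
    lines = (line for line in stripped if line)
    # Passing the same iterator to zip three times makes each tuple consume
    # the next three lines. An incomplete trailing group is silently dropped.
    grouped = zip(*[lines] * 3)
    writer = csv.writer(out_file)
    writer.writerow(header)
    writer.writerows(grouped)


# In-memory demonstration using the sample text from the question,
# including leading blank lines and blank lines between groups.
sample = io.StringIO(
    "\n\nHello!\n\nHow are you?\nWell, Bye!\n\n\n"
    "83957\nAnd I ain't coming back!\n\nhgu39hgd\n"
)
result = io.StringIO()
text_to_csv_rows(sample, result)
print(result.getvalue())
```

With real files, the csv documentation recommends opening the output with newline='' (for example open('extracted.csv', 'w', newline='')) so that the writer controls the line endings itself.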

To combine the last two columns, as you mentioned in the comments, you can change the writerow call in the obvious way, and the writerows call to:

 writer.writerows((title, intro + tagline) for title, intro, tagline in grouped) 
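Note that intro + tagline joins the two strings with nothing in between. If a separator is wanted, a small generator can do the merging. This is only a sketch: the helper name merge_last_two, the sep parameter and the header name "text" are invented here for illustration.

```python
import csv
import io


def merge_last_two(groups, sep=" "):
    """Yield (title, intro + sep + tagline) for each three-item group.

    Hypothetical helper, not part of the original answer.
    """
    for title, intro, tagline in groups:
        yield title, intro + sep + tagline


# Sample groups taken from the question.
groups = [
    ("Hello!", "How are you?", "Well, Bye!"),
    ("83957", "And I ain't coming back!", "hgu39hgd"),
]

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(("title", "text"))  # two-column header; "text" is a made-up name
writer.writerows(merge_last_two(groups))
print(out.getvalue())
```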

I may not have understood you correctly, but you can do:

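    file = open("extracted.txt")

    # if you don't want to do .strip() again, just create a list of the stripped
    # lines first.
    lines = [line.strip() for line in file if line.strip()]

    for i, line in enumerate(lines):
        # row is i // 3, column is i % 3; this assumes SetCell(row, column, value),
        # the three-argument form used in the question
        csv.SetCell(i // 3, i % 3, line)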
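The same index arithmetic can be tried without the third-party CSV object. The sketch below maps line number i to row i // 3 and column i % 3 using plain lists. The function name lines_to_rows is hypothetical. Unlike grouping with zip, this keeps an incomplete final group and pads it with empty strings.

```python
def lines_to_rows(lines, width=3):
    """Place the i-th line at row i // width, column i % width.

    Hypothetical helper for illustration; returns a list of rows.
    """
    rows = []
    for i, line in enumerate(lines):
        row, col = divmod(i, width)
        if col == 0:
            # Start a new row, pre-filled with empty cells.
            rows.append([""] * width)
        rows[row][col] = line
    return rows


lines = ["Hello!", "How are you?", "Well, Bye!", "83957"]
print(lines_to_rows(lines))
```

The resulting rows can then be passed to csv.writer(...).writerows() from the standard library.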

Source: https://habr.com/ru/post/943765/

