How to convert structured text to data columns in R?

I have a rather large (1000 pages) list of structured text that I would like to convert to a data frame (preferably with R, but I'm open to suggestions).

The text file is as follows:

    AC-Acrelândia
    TV     Canal 18    AINDA NÃO OUTORGADO
    RTV    Canal 9     RADIO TV DO AMAZONAS LTDA
    RTV    Canal 10    RADIO TV DO AMAZONAS LTDA(REDENCAO)
    TVD    Canal 15    RADIO TV DO AMAZONAS LTDA
    TVD    Canal 15    AINDA NÃO OUTORGADO(REDENÇÃO)
    FM     88,5 MHz    RADIO E TV MAIRA LTDA

    AC-Assis Brasil
    TV     Canal 34    AINDA NÃO OUTORGADO
    RTV    Canal 6     AMAZONIA CABO LTDA
    RTV    Canal 10    RADIO TV DO AMAZONAS LTDA
    RTV    Canal 13    AINDA NÃO OUTORGADO
    RTV    Canal 45    FUNDACAO JOAO PAULO II

and I would like to convert it to something like this:

    AC    Acrelândia    TV     Canal 18    AINDA NÃO OUTORGADO
    AC    Acrelândia    RTV    Canal 9     RADIO TV DO AMAZONAS LTDA
    AC    Acrelândia    RTV    Canal 10    RADIO TV DO AMAZONAS LTDA(REDENCAO)
    ....

readLines() seems to be a good start, but I am struggling to build the rest on top of it.

2 answers

To a CSV file

Since you are open to other languages, I offer a solution in Python. It creates a csv file that looks like this:

    "AC","Acrelândia","TV","Canal 18","AINDA NÃO OUTORGADO"
    "AC","Acrelândia","RTV","Canal 9","RADIO TV DO AMAZONAS LTDA"
    "AC","Acrelândia","RTV","Canal 10","RADIO TV DO AMAZONAS LTDA(REDENCAO)"
    "AC","Acrelândia","TVD","Canal 15","RADIO TV DO AMAZONAS LTDA"
    "AC","Acrelândia","TVD","Canal 15","AINDA NÃO OUTORGADO(REDENÇÃO)"
    "AC","Acrelândia","FM","88,5 MHz","RADIO E TV MAIRA LTDA"
    "AC","Assis Brasil","TV","Canal 34","AINDA NÃO OUTORGADO"
    "AC","Assis Brasil","RTV","Canal 6","AMAZONIA CABO LTDA"
    "AC","Assis Brasil","RTV","Canal 10","RADIO TV DO AMAZONAS LTDA"
    "AC","Assis Brasil","RTV","Canal 13","AINDA NÃO OUTORGADO"
    "AC","Assis Brasil","RTV","Canal 45","FUNDACAO JOAO PAULO II"

The code

This makes two assumptions: (1) the first line in the file, and any line following an empty line, is a station header (state and city, e.g. AC-Acrelândia), and (2) the fields are separated by two or more spaces.

    # -*- coding: utf-8 -*-
    # Python 3 version.
    import csv
    import re

    # Create a data structure to simulate reading a text file.
    # Fields are separated by two or more spaces; a blank line starts a new station.
    data = '''AC-Acrelândia
    TV     Canal 18    AINDA NÃO OUTORGADO
    RTV    Canal 9     RADIO TV DO AMAZONAS LTDA
    RTV    Canal 10    RADIO TV DO AMAZONAS LTDA(REDENCAO)
    TVD    Canal 15    RADIO TV DO AMAZONAS LTDA
    TVD    Canal 15    AINDA NÃO OUTORGADO(REDENÇÃO)
    FM     88,5 MHz    RADIO E TV MAIRA LTDA

    AC-Assis Brasil
    TV     Canal 34    AINDA NÃO OUTORGADO
    RTV    Canal 6     AMAZONIA CABO LTDA
    RTV    Canal 10    RADIO TV DO AMAZONAS LTDA
    RTV    Canal 13    AINDA NÃO OUTORGADO
    RTV    Canal 45    FUNDACAO JOAO PAULO II'''.split('\n')

    def read_records():
        for line in data:
            yield line

    # Initialize the splitter, read the records and write them to a CSV file.
    splitter = re.compile(r'\s{2,}')
    change_station = True
    station = ''

    with open('./output.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        for rec in read_records():
            rec = rec.strip()
            if rec == '':
                change_station = True
            elif change_station:
                # Two spaces, so that state and city split into separate fields.
                station = rec.replace('-', '  ', 1)
                change_station = False
            else:
                record = station + '  ' + rec
                writer.writerow(splitter.split(record))

    # Read the file back in and print it to the console for demo purposes.
    with open('./output.csv', 'r', encoding='utf-8') as f:
        print(f.read())
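To see assumption (2) in isolation, here is a minimal sketch of the same two-or-more-whitespace split applied to a single record. The spacing of the sample line is an assumption for illustration; the real file may be spaced differently, in which case the pattern would need adjusting.

```python
import re

# Two or more whitespace characters in a row mark a field boundary.
splitter = re.compile(r'\s{2,}')

# Sample record; the exact spacing here is assumed, not taken from the real file.
line = 'RTV    Canal 10    RADIO TV DO AMAZONAS LTDA(REDENCAO)'

fields = splitter.split(line)
print(fields)
```

Single spaces inside a field (such as the one in "Canal 10") are left alone, which is why the channel and the licensee name each stay in one column.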

If you are happy to use Python, then this will work (assuming you want tab-delimited output):

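    # Python 3 version. The input file 'program' holds the structured text;
    # 'new_prog' receives the tab-delimited output. UTF-8 encoding is assumed.
    with open('program', encoding='utf-8') as program, \
            open('new_prog', 'w', encoding='utf-8') as new_prog:
        # Get initial state and city
        state, city = program.readline().rstrip().split('-', 1)
        for line in program:
            # Blank lines denote city change
            if not line.strip():
                header = next(program, '').rstrip()
                if not header:
                    break  # trailing blank line at the end of the file
                state, city = header.split('-', 1)
                continue
            # Split on any whitespace; the fourth piece keeps the rest of the line.
            band, cname, channel, show = line.rstrip().split(None, 3)
            new_line = '\t'.join([state, city, band, cname, channel, show])
            new_prog.write(new_line + '\n')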
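One thing to be aware of with split(None, 3): it splits on every run of whitespace, so the first three words of a line always land in separate columns. The small sketch below shows what that means for a TV line and an FM line from the question. The helper name split_record is hypothetical and only used here for illustration.

```python
def split_record(line):
    """Split a record into four pieces: the first three words, then the rest."""
    return line.rstrip().split(None, 3)

# Sample lines taken from the question.
tv_fields = split_record('TV Canal 18 AINDA NÃO OUTORGADO\n')
fm_fields = split_record('FM 88,5 MHz RADIO E TV MAIRA LTDA\n')

print(tv_fields)
print(fm_fields)
```

For the TV line the pieces are the band, the word "Canal", the channel number and the licensee. For the FM line the second and third pieces are the frequency and its unit, so the column names only loosely fit; you may want to rename or merge those two columns afterwards.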

Source: https://habr.com/ru/post/917274/

