What is the best way to iterate over a Python list, excluding specific values and printing the result?

I am new to Python and have a question.
I have checked similar questions, Dive Into Python, the Python documentation, Google and Bing searches, and dozens of other tutorials.
I have a section of Python code that reads a text file containing 20 tweets. I can extract these 20 tweets using the following code:

    import json

    data = []
    with open('output.txt') as fp:
        for line in iter(fp.readline, ''):
            Tweets = json.loads(line)
            data.append(Tweets.get('text'))

    i = 0
    while i < len(data):
        print data[i]
        i = i + 1

The loop above prints the 20 tweets (lines) from output.txt. However, these 20 lines contain non-English text such as "Los ladillo a los dos, soy maaaala o maloooooooooooo", URLs such as "http://t.co/57LdpK", the string "None", and photo entries with a URL such as "Photo: http://t.co/kxpaaaaa" (I edited this for privacy).

I would like to clean up this output (which is a list) and exclude the following:

  • None entries
  • Anything starting with the string "Photo:"
  • As a bonus, any non-Unicode data, if that is possible.

I tried the following bits of code:

  • Using data.remove("None:"), but I get the error list.remove(x): x not in list.
  • Reading the items I don't want into a set and then comparing it against the output, but no luck.
  • Looking into list comprehensions, but I wonder whether that is the right solution here.

I come from an Oracle background, where there are functions to cut out any wanted or unwanted section of the output, so I have really been going in circles for the last 2 hours. Any help is much appreciated!

+4
5 answers

Try something like this:

    def legit(string):
        # Reject missing entries (None), photo links, and text containing "None".
        if string is None or string.startswith("Photo:") or "None" in string:
            return False
        else:
            return True

    whatyouwant = [x for x in data if legit(x)]

I'm not sure whether this will work out of the box for your data, but you get the idea. In case you are not familiar with it, [x for x in data if legit(x)] is called a list comprehension.
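For readers new to the syntax, here is a minimal sketch showing that a list comprehension with a condition builds the same list as an explicit loop. The sample list and the helper name keep are hypothetical stand-ins, not the asker's real data.

```python
# Hypothetical sample standing in for the `data` list from the question.
data = [u"Hello world", None, u"Photo: http://t.co/kxpaaaaa", u"Good morning"]

def keep(entry):
    # Keep only real text that is not a photo link.
    return entry is not None and not entry.startswith(u"Photo:")

# Explicit loop ...
kept_loop = []
for entry in data:
    if keep(entry):
        kept_loop.append(entry)

# ... and the equivalent list comprehension.
kept_comprehension = [entry for entry in data if keep(entry)]
```

Both kept_loop and kept_comprehension end up holding only the two plain-text entries.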

+3

First of all, only add Tweets['text'] if there is a text entry:

    with open('output.txt') as fp:
        for line in iter(fp.readline, ''):
            Tweets = json.loads(line)
            if 'text' in Tweets:
                data.append(Tweets['text'])

This will not add None entries (.get() returns None if the 'text' key is not in the dictionary).
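A small sketch of the difference, assuming two hypothetical JSON lines where only the first has a 'text' key:

```python
import json

# Hypothetical input lines: one tweet with text, one record without a 'text' key.
lines = ['{"text": "Hello world"}', '{"delete": {"status": 1}}']

with_get = []
with_check = []
for line in lines:
    tweet = json.loads(line)
    with_get.append(tweet.get('text'))    # appends None when the key is absent
    if 'text' in tweet:
        with_check.append(tweet['text'])  # skips records without the key
```

Here with_get contains a None entry for the second record, while with_check contains only the text of the first.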

I assume that you want to continue processing the data list that you are building here. If not, you can dispense with the for entry in data: loops below and stick to one loop with if statements. Tweets['text'] is the same value as entry in the for entry in data loops.

Next, you are looping over Python unicode values, so use the methods provided on those objects to filter out what you don't want:

    for entry in data:
        if not entry.startswith("Photo:"):
            print entry

You can use a list comprehension here; the following prints all entries in one go:

    print '\n'.join([entry for entry in data if not entry.startswith("Photo:")])

In this case, that does not really buy you much, because you build one big string only to print it; you might as well print the individual strings and avoid the cost of building the big one.

Please note that all of your data is Unicode data. You might want to filter out text that uses code points outside of ASCII. You can use a regular expression to detect whether the text contains code points outside of ASCII:

    import re

    nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)  # all codepoints beyond 0x7F are non-ascii

    for entry in data:
        if entry.startswith("Photo:") or nonascii.search(entry):
            continue  # skip the rest of this iteration, continue to the next
        print entry

A short demonstration of the non-ASCII expression:

    >>> import re
    >>> nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)
    >>> nonascii.search(u'All you see is ASCII')
    >>> nonascii.search(u'All you see is ASCII plus a little more unicode, like the EM DASH codepoint: \u2014')
    <_sre.SRE_Match object at 0x1086275e0>
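Note that the ur'' string prefix is Python 2 only. As a hedged sketch for readers on Python 3, the same pattern can be written as a plain raw string and combined with the "Photo:" test in a single list comprehension. The sample list is hypothetical.

```python
import re

# Any code point above U+007F is non-ASCII.
nonascii = re.compile(r'[^\x00-\x7f]')

# Hypothetical sample entries standing in for the real tweets.
data = [u"All ASCII here", u"Photo: http://t.co/kxpaaaaa", u"dash \u2014 inside"]

# Keep entries that are neither photo links nor contain non-ASCII code points.
clean = [entry for entry in data
         if not entry.startswith(u"Photo:") and not nonascii.search(entry)]
```

Only the first sample entry survives both checks.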
+2
    with open('output.txt') as fp:
        for line in fp.readlines():
            Tweets = json.loads(line)
            if 'text' not in Tweets:
                continue

            txt = Tweets.get('text')
            if txt.replace('.', '').replace('?', '').replace(' ', '').isalnum():
                data.append(txt)
                print txt

Small and simple.
The basic principle: one loop, and if the data meets your "OK" criteria, add it and print it.

As Martijn noted, “text” may not be in all Tweets data.


Replacing the .replace() chain with a regular expression would look something like this: if re.match(r'^[\w\- ]+$', txt) is not None: (it will not work for forms, etc., so yes, as indicated below ...)
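A minimal sketch of such a whitelist check, assuming Python 3 (re.ASCII does not exist in Python 2, where \w is ASCII-only by default for this kind of pattern). The hyphen is escaped so it is read as a literal character rather than as a range, and the sample strings are hypothetical.

```python
import re

# Whitelist: ASCII word characters, hyphens and spaces only.
# re.ASCII keeps \w from matching non-ASCII letters on Python 3.
allowed = re.compile(r'^[\w\- ]+$', re.ASCII)

# Hypothetical sample strings.
samples = [
    u"plain words only",
    u"well-formed text",
    u"Photo: http://t.co/kxpaaaaa",  # rejected: ':' '/' '.' are not whitelisted
    u"ma\u00f1ana time",             # rejected: contains a non-ASCII letter
]

accepted = [s for s in samples if allowed.match(s) is not None]
```

Unlike the .replace() version, this pattern does not allow '.' or '?', so the two checks are close but not identical.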

+1

Try the following:

    with open('output.txt') as fp:
        for line in iter(fp.readline, ''):
            Tweets = json.loads(line)
            data.append(Tweets.get('text'))

    i = 0
    while i < len(data):
        # these conditions will skip (continue) over the iterations
        # matching your first two conditions; i must still advance
        # before continue, otherwise the loop never terminates.
        if data[i] is None or data[i].startswith("Photo"):
            i = i + 1
            continue
        print data[i]
        i = i + 1
+1

I would suggest something like the following:

    # use itertools.ifilter to remove items from a list according to a function
    from itertools import ifilter
    import json
    import re

    # write a function to filter out entries you don't want
    def my_filter(value):
        if not value or value.startswith('Photo:'):
            return False

        # exclude unwanted chars (search, not match, so the whole string is checked)
        if re.search('[^\x00-\x7F]', value):
            return False

        return True

    # Reading the data can be simplified with a list comprehension
    with open('output.txt') as fp:
        data = [json.loads(line).get('text') for line in fp]

    # do the filtering
    data = list(ifilter(my_filter, data))

    # print the output
    for line in data:
        print line

As for Unicode, assuming you are using Python 2.x, the open function will not read the data as unicode; it will be read as the str type. You might want to convert it if you know the encoding, or read the file with the specified encoding using codecs.open.
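A sketch of reading with an explicit encoding. It uses io.open rather than codecs.open, as a swapped-in alternative that accepts an encoding argument on both Python 2 and Python 3. The UTF-8 encoding and the temporary file contents are assumptions made so the example is self-contained.

```python
import io
import json
import os
import tempfile

# Write a small hypothetical file so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with io.open(path, 'w', encoding='utf-8') as fp:
    fp.write(u'{"text": "caf\u00e9 time"}\n')
    fp.write(u'{"text": "Photo: http://t.co/kxpaaaaa"}\n')

# Decode while reading; UTF-8 is an assumption about the file's encoding.
with io.open(path, encoding='utf-8') as fp:
    data = [json.loads(line).get('text') for line in fp]
```

Each line is decoded to text as it is read, so the entries in data are Unicode strings before any filtering happens.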

+1

Source: https://habr.com/ru/post/1480763/

