First of all, only add Tweets.get('text') if there is a text entry:
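    import json

    data = []  # collects the text of every tweet that has one
    with open('output.txt') as fp:
        # read one JSON document per line until readline() returns ''
        for line in iter(fp.readline, ''):
            Tweets = json.loads(line)
            if 'text' in Tweets:
                data.append(Tweets['text'])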
That way you won't add None entries (.get() returns None if the 'text' key is not in the dictionary).
I'm assuming that you want to keep processing the data list you are building here. If not, you can dispense with the for entry in data: loops below and stick to one loop with if statements; Tweets['text'] is the same value as entry in the for entry in data loops.
Next, you are looping over Python unicode values, so use the methods provided on those objects to filter out what you don't want:
    for entry in data:
        if not entry.startswith("Photo:"):
            print entry
You can use a list comprehension here; the following prints all entries in one go:
    print '\n'.join([entry for entry in data if not entry.startswith("Photo:")])
That doesn't buy you much in this case, because you are building one big string just to print it; you may as well print the individual entries and avoid the cost of building the string.
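If you don't need the data list afterwards, the single-loop variant mentioned earlier could look something like the following. This is only a minimal sketch, written for Python 3 so it can be run as-is; the function name iter_texts, the skip_prefix parameter, and the sample lines are made up for illustration and are not part of your code.

```python
import json


def iter_texts(lines, skip_prefix="Photo:"):
    """Yield the 'text' value of each JSON line, skipping unwanted ones.

    `lines` is any iterable of strings holding one JSON object each
    (hypothetical helper, for illustration only).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        tweet = json.loads(line)
        text = tweet.get('text')
        # skip entries without text and entries with the unwanted prefix
        if text is None or text.startswith(skip_prefix):
            continue
        yield text


# made-up sample input standing in for the lines of output.txt
sample = [
    '{"text": "Hello world"}',
    '{"text": "Photo: http://example.com/1"}',
    '{"id": 42}',
]

for text in iter_texts(sample):
    print(text)
```

An open file object is also an iterable of lines, so with a real file you could pass fp directly to iter_texts() inside the with open('output.txt') as fp: block.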
Note that all of your data is Unicode data. You may want to filter out text that uses code points outside ASCII, for example. You could use a regular expression to detect whether the text contains code points outside ASCII:
    import re

    nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)
A short demonstration of the non-ASCII expression:
    >>> import re
    >>> nonascii = re.compile(ur'[^\x00-\x7f]', re.UNICODE)
    >>> nonascii.search(u'All you see is ASCII')
    >>> nonascii.search(u'All you see is ASCII plus a little more unicode, like the EM DASH codepoint: \u2014')
    <_sre.SRE_Match object at 0x1086275e0>
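To actually drop such entries from the list, the same pattern can be used in a list comprehension. The sketch below is written for Python 3, where str is already Unicode and the ur'' prefix is not valid syntax, so a plain raw string is used instead; the helper name ascii_only and the sample entries are hypothetical.

```python
import re

# matches any single character outside the ASCII range
nonascii = re.compile(r'[^\x00-\x7f]')


def ascii_only(entries):
    """Return only the entries made up entirely of ASCII characters."""
    return [entry for entry in entries if not nonascii.search(entry)]


# made-up sample entries; the second one contains an EM DASH (U+2014)
entries = ['All you see is ASCII', 'A little more unicode: \u2014']

print(ascii_only(entries))
```

Whether you want to discard these entries entirely or only treat them differently depends on what you do with the text afterwards.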