If each file contains just one JSON document, you need SparkContext.wholeTextFiles. First, let's create some dummy data:
import tempfile
import json
input_dir = tempfile.mkdtemp()
docs = [
    {'data': {'text': {'de': 'Ein Text.', 'en': 'A text.'}}},
    {'data': {'text': {'de': 'Ein Bahnhof.', 'en': 'A railway station.'}}},
    {'data': {'text': {'de': 'Ein Hund.', 'en': 'A dog.'}}}]

for doc in docs:
    with open(tempfile.mktemp(suffix="json", dir=input_dir), "w") as fw:
        json.dump(doc, fw, indent=4)
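Each document ends up in its own pretty-printed file inside input_dir; a quick way to double-check is something like:

import os
print os.listdir(input_dir)  # one randomly named *json file per document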
Now we read the data:
rdd = sc.wholeTextFiles(input_dir).values()
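wholeTextFiles returns an RDD of (path, content) pairs and values() keeps only the file contents. If you want to see which files were picked up, something like this works:

sc.wholeTextFiles(input_dir).keys().take(1)  # e.g. [u'file:/tmp/.../...json']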
Make sure the files are really indented:
print rdd.top(1)[0]
# {
#     "data": {
#         "text": {
#             "de": "Ein Text.",
#             "en": "A text."
#         }
#     }
# }
Finally, we can parse:
parsed = rdd.map(json.loads)
and check if everything worked as expected:
parsed.takeOrdered(3)
# [{u'data': {u'text': {u'de': u'Ein Bahnhof.', u'en': u'A railway station.'}}},
#  {u'data': {u'text': {u'de': u'Ein Hund.', u'en': u'A dog.'}}},
#  {u'data': {u'text': {u'de': u'Ein Text.', u'en': u'A text.'}}}]
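Each parsed record is a plain Python dict, so you can transform it like any other RDD element; for example, pulling out just the English sentences from the dummy documents:

parsed.map(lambda doc: doc['data']['text']['en']).collect()
# [u'A text.', u'A railway station.', u'A dog.'], in whatever order the files are listed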
If this still fails, it is most likely because some of the entries are malformed. The simplest thing you can do is to discard them using flatMap:
rdd_malformed = sc.parallelize(["{u'data': {u'text': {u'de':"]).union(rdd)
with seq_try defined along the lines of scala.util.Try (see "Is there an equivalent of scala.util.Try in Pyspark?").
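A minimal sketch of such a helper, assuming all we need is a one-element list on success and an empty list on failure (the linked question shows variants that also report the error):

def seq_try(f, *args, **kwargs):
    try:
        return [f(*args, **kwargs)]
    except Exception:
        return []

With that in place, flatMap simply drops anything that fails to parse: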
rdd_malformed.flatMap(lambda x: seq_try(json.loads, x)).collect()
# the three well-formed documents parse exactly as before; the malformed entry is silently dropped