I am using the PySpark DataFrame API (Apache Spark) and I ran into the following problem:
When I join two DataFrames that come from the same source DataFrame, the resulting DF will explode into a huge number of rows. Quick example:
I load a DataFrame with n rows from disk:
df = sql_context.parquetFile('data.parquet')
Then I create two DataFrames from this source.
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Finally, I want to join them back together (inner join):
df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')
The key in col1 is unique. The resulting DataFrame should have n rows, but it actually has n*n rows.
This does not happen when I load df_one and df_two from disk directly. I am on Spark 1.3.0, and the same thing happens with the current 1.4.0 snapshot.
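To make the behavior concrete, here is a minimal sketch of what I am running, with a small synthetic DataFrame standing in for data.parquet (the column names and values are made up for illustration; the counts in the comments are what I observe on the versions above):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext('local', 'self-join-repro')
sql_context = SQLContext(sc)

# n = 3 rows, with a unique key in col1 (stand-in for the parquet data)
df = sql_context.createDataFrame(
    [(1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')],
    ['col1', 'col2', 'col3'])

df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')

df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'], 'inner')

print(df.count())         # 3
print(df_joined.count())  # I expect 3, but I get 9 (= 3 * 3)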
Can someone explain why this is happening?