Long-time lurker, first-time poster on Stack Overflow.
I hit a wall with a data analysis project I'm working on.
Essentially, say I have an example CSV 'A':
id | item_num
A123 | 1
A123 | 2
B456 | 1
And I have an example CSV 'B':
id | description
A123 | Mary had a...
A123 | ...little lamb.
B456 | ...Its fleece...
If I run merge with pandas, the result looks like this:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | Mary had a...
A123 | 1 | ...little lamb.
A123 | 2 | ...little lamb.
B456 | 1 | Its fleece...
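The extra rows come from joining on id alone, which pairs every A123 row in 'A' with every A123 row in 'B'. Here is a minimal reproduction, with the sample data inlined as DataFrames instead of read from the CSVs (an assumption made only to keep the example self-contained):

```python
import pandas as pd

# Sample data inlined from the tables above (stand-ins for A.csv and B.csv).
first = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# With no `on=` argument, merge joins on the shared column(s), here just "id".
# Each of the 2 A123 rows in `first` pairs with each of the 2 A123 rows in
# `second`, giving 2 * 2 = 4 rows, plus 1 row for B456.
merged = pd.merge(first, second)
print(merged.shape)
```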
How could I do this instead:
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb...
B456 | 1 | Its fleece...
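One possible way to get that pairing, sketched under a strong assumption: within each id, the n-th description row in 'B' belongs to item_num n in 'A'. If the row order in 'B' does not carry that meaning, this will pair the wrong rows. The name second_numbered is only illustrative.

```python
import pandas as pd

# Sample data inlined from the tables above (stand-ins for A.csv and B.csv).
first = pd.DataFrame({"id": ["A123", "A123", "B456"], "item_num": [1, 2, 1]})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# ASSUMPTION: the n-th description for an id corresponds to item_num n.
# cumcount() numbers rows within each id group starting at 0, so add 1
# to line it up with item_num, which starts at 1.
second_numbered = second.assign(item_num=second.groupby("id").cumcount() + 1)

# Joining on both keys gives at most one match per row of `first`.
result = first.merge(second_numbered, on=["id", "item_num"], how="left")
print(result)
```

With how="left", every row of 'A' is kept, and any item_num without a matching description ends up with NaN in that column.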
This is my code:
import pandas as pd
first = pd.read_csv("../PATH_TO_CSV/A.csv")
print("Imported first CSV: " + str(first.shape))
second = pd.read_csv("../PATH_TO_CSV/B.csv")
print("Imported second CSV: " + str(second.shape))
# merge already returns a new DataFrame (DataFrame.append was removed in pandas 2.0)
result = pd.merge(first, second)
print("Merged CSVs... resulting DataFrame is: " + str(result.shape))
result = result.drop_duplicates()
print("Deduping... resulting DataFrame is: " + str(result.shape))
result.to_csv("EXPORT.csv", index=False)
print("Saved to file.")
I would really appreciate any help - I am very stuck! And I am dealing with 20,000 rows.
Thanks.
Edit: my post has been flagged as a potential duplicate. This is not the case, since I'm not necessarily trying to add a column - I'm just trying to prevent each description from being multiplied by the number of item_num values assigned to a given id.
UPDATE, 6/21:
What if the first DF (CSV 'A') looks like this?
id | item_num | other_col
A123 | 1 | lorem ipsum
A123 | 2 | dolor sit
A123 | 3 | amet, consectetur
B456 | 1 | lorem ipsum
CSV 'B':
id | item_num | description
A123 | 1 | Mary had a...
A123 | 2 | ...little lamb.
B456 | 1 | ...Its fleece...
In that case, the desired result is:
id | item_num | other_col | description
A123 | 1 | lorem ipsum | Mary had a...
A123 | 2 | dolor sit | ...little lamb.
B456 | 1 | lorem ipsum | ...Its fleece...
In other words, the row with item_num 3 and "amet, consectetur" in "other_col" is dropped.
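For the update case, where both tables carry id and item_num, a minimal sketch is an inner merge on both columns. It assumes the goal is to keep only the (id, item_num) pairs present in both tables, which is what the desired output shows. The data is inlined here instead of read from the CSVs.

```python
import pandas as pd

# Sample data inlined from the update tables (stand-ins for the real CSVs).
first = pd.DataFrame({
    "id": ["A123", "A123", "A123", "B456"],
    "item_num": [1, 2, 3, 1],
    "other_col": ["lorem ipsum", "dolor sit", "amet, consectetur", "lorem ipsum"],
})
second = pd.DataFrame({
    "id": ["A123", "A123", "B456"],
    "item_num": [1, 2, 1],
    "description": ["Mary had a...", "...little lamb.", "...Its fleece..."],
})

# Joining on both keys avoids the row multiplication seen when joining on id
# alone. how="inner" keeps only key pairs found in both tables, so
# (A123, 3) is left out because it has no description.
result = first.merge(second, on=["id", "item_num"], how="inner")
print(result)
```

Switching to how="left" would keep the (A123, 3) row with an empty description instead of dropping it.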