Searching a CSV file using Pandas (unique identifiers) - Python

Question

I am trying to search a CSV file with 242,000 rows and want to count the unique identifiers in one of the columns. The column is named "logid" and holds several different values, e.g. 1002, 3004, 5003. I want to read the CSV file into a pandas DataFrame and count how many times each unique identifier occurs. If possible, I would like to create a new CSV file that stores this information. For example, if I find that there are 50 rows with logid 1004, I would like the CSV file to have a column named 1004 with the number 50 below it. I would do this for all unique identifiers and add them to the same CSV file. I am completely new to this and have tried a few things, but I don't know where to start.

Thanks!

+1

python pandas

Cameron Jul 26 '17 at 1:39

1 answer

2Obe · Answer 1 · 2017-07-26T11:38:00+0000

Since you have not posted your code, I can only describe the general approach.

1. Load the CSV file into a pd.DataFrame using pandas.read_csv
2. Keep one row for each distinct value in a separate df1 using pandas.DataFrame.drop_duplicates, for example:
df1 = df.drop_duplicates(keep="first")

-> This returns a DataFrame that contains only the first occurrence of each duplicated value. For instance, if the value 1000 appears in 5 rows, only the first of those rows is kept and the rest are dropped.

-> df1.shape[0] then gives you the number of unique values in your df.

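3. If you want to store all rows of your df in which each of these values occurs in separate CSV files, you can do something like this: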

import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})  # This should represent your original data set
print(df)

# I assume the column with the duplicate values is column "A"; to check the whole row, just omit the subset keyword.
df1 = df.drop_duplicates(subset="A", keep="first")
print(df1)

# For each unique value, collect all rows of df that contain it in column "A".
frames = []
for m in df1["A"]:
    mask = df["A"] == m
    frames.append(df[mask])

# Write each group of rows to its own CSV file.
for i, frame in enumerate(frames):
    name = "file{0}".format(i)
    frame.to_csv(r"YOUR PATH\{0}".format(name))
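
For the per-identifier counts the question asks for, a minimal sketch could use Series.value_counts rather than drop_duplicates. Only the column name "logid" comes from the question; the sample rows, the "message" column, and the variable names below are made up for illustration, and a small in-memory CSV stands in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real 242,000-row file; only the "logid"
# column name is taken from the question.
csv_text = "logid,message\n1002,a\n3004,b\n1002,c\n5003,d\n1002,e\n3004,f\n"
df = pd.read_csv(io.StringIO(csv_text))

# Number of distinct identifiers in the column.
n_unique = df["logid"].nunique()

# How many rows carry each identifier, ordered by identifier.
counts = df["logid"].value_counts().sort_index()

# Reshape to a single row with one column per identifier,
# which is the layout described in the question.
summary = counts.to_frame().T

# With no path argument, to_csv returns the CSV text as a string.
csv_out = summary.to_csv(index=False)
print(csv_out)
```

To write a file instead of a string, pass a path of your choice to summary.to_csv. With real data you would replace the io.StringIO object with the path to your own CSV in pd.read_csv.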
