Searching a CSV file with pandas (unique identifiers) - Python

I have a CSV file with 242,000 rows and want to count the unique identifiers in one of its columns. The column is named "logid" and holds a number of different values, e.g. 1002, 3004, 5003. I want to load the CSV file into a pandas DataFrame and count how many times each unique identifier occurs. If possible, I would like to create a new CSV file that stores this information. For example, if I find that logid 1004 occurs 50 times, I would like the new CSV file to have a column named 1004 with the number 50 below it. I would do this for every unique identifier and put them all in the same CSV file. I am completely new to this and have searched around a bit, but I don't know where to start.

Thanks!

1 answer

Since you did not post your code, I can only describe the general approach.

1. Use drop_duplicates() on your DataFrame.

-> This returns a DataFrame that keeps only the first row for each repeated value. For instance, if the value 1000 appears in 5 rows, only the first of those rows is kept and the rest are dropped.

2. Use df1.shape[0].

-> This gives you the number of distinct values in your df, i.e. the number of rows left after the duplicates are dropped.

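3. If you want to split df by each distinct value and save every group to its own CSV file, you can do something like this: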

import pandas as pd

# This stands in for your original data set.
df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})
print(df)

# Assumes the column with the repeated values is "A";
# to compare whole rows instead, omit the subset keyword.
df1 = df.drop_duplicates(subset="A", keep="first")
print(df1)

# Collect one sub-frame per distinct value of "A".
frames = []
for m in df1["A"]:
    frames.append(df[df["A"] == m])

# Write each sub-frame to its own file: file0.csv, file1.csv, ...
for i, frame in enumerate(frames):
    name = "file{0}.csv".format(i)
    frame.to_csv(r"YOUR PATH\{0}".format(name))
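
If the goal is the single summary file described in the question (one column per logid, with its count underneath), a shorter route may be value_counts(). The following is only a minimal sketch under assumptions: the sample data and the output file name logid_counts.csv are made up for illustration, and with the real file you would load the data with pd.read_csv() instead.

```python
import pandas as pd

# Hypothetical sample standing in for the real 242,000-row file;
# with real data, load it with pd.read_csv(...) instead.
df = pd.DataFrame({"logid": [1002, 3004, 1002, 5003, 3004, 1002]})

# Number of rows per distinct logid, ordered by logid.
counts = df["logid"].value_counts().sort_index()

# Number of distinct logid values.
n_unique = df["logid"].nunique()

# Turn the counts into one row with one column per logid,
# then write it out without the index column.
# "logid_counts.csv" is an assumed output name.
wide = counts.to_frame().T
wide.to_csv("logid_counts.csv", index=False)
```

The resulting file has the identifiers as its header row and the counts in the single row below it. Keeping the counts as a two-column table (logid, count) would usually be easier to read back later, but the wide layout matches what the question asked for.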

Source: https://habr.com/ru/post/1682631/

