I am building an automated cleanup process that fills in null values in a dataset. I found several statistics, such as the mode, the median, and the mean, that could be used to fill in the NaN values. But which one should I choose? If the data is categorical, it should be the mode or the median, while for continuous data it should be the mean or the median. So, to determine whether a column is categorical or continuous, I decided to build a machine learning classification model.
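To make the fill rule concrete, here is a minimal sketch using only Python's standard library. The function names and the `kind` argument are hypothetical (they stand in for whatever label the classifier would produce), and the median is picked for continuous data only as one of the two options mentioned above.

```python
import math
import statistics

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fill_missing(values, kind):
    """Fill missing entries: mode for categorical data, median for continuous.

    `kind` is a hypothetical label ("categorical" or "continuous") that a
    classifier like the one described in this post would supply.
    """
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        return list(values)  # nothing to impute from
    if kind == "categorical":
        fill = statistics.mode(observed)
    else:
        fill = statistics.median(observed)  # statistics.mean is the other option
    return [fill if is_missing(v) else v for v in values]

colors = fill_missing(["red", None, "blue", "red"], "categorical")
heights = fill_missing([1.0, float("nan"), 3.0, 8.0], "continuous")
```

Here the missing color becomes "red" (the most frequent value) and the missing height becomes 3.0 (the median of the observed values).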
I used several features, for example:
1) the standard deviation of the data
2) the number of unique values in the data
3) the total number of rows of data
4) the ratio of the number of unique values to the total number of rows
5) the minimum data value
6) the maximum data value
7) the amount of data between the median and the 75th percentile
8) the amount of data between the median and the 25th percentile
9) the amount of data between the 75th percentile and the upper whisker
10) the amount of data between the 25th percentile and the lower whisker
11) the amount of data above the upper whisker
12) the amount of data below the lower whisker
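The features listed above could be computed roughly as follows. This is only a sketch: the function name and dictionary keys are made up for illustration, the whiskers are assumed to follow the common Tukey convention (1.5 times the interquartile range beyond the quartiles), and the quartile method and the handling of values that fall exactly on a boundary are my own choices, not something stated in the post.

```python
import statistics

def column_features(values):
    """Compute the 12 summary features for one numeric column (no missing values)."""
    data = sorted(values)
    n_rows = len(data)
    n_unique = len(set(data))
    # Quartiles; the "inclusive" method is an assumption, other conventions exist.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_whisker = q1 - 1.5 * iqr  # assumed Tukey 1.5 * IQR convention
    upper_whisker = q3 + 1.5 * iqr

    def count(lo, hi):
        """Number of values x with lo <= x < hi."""
        return sum(1 for x in data if lo <= x < hi)

    return {
        "std": statistics.pstdev(data),
        "n_unique": n_unique,
        "n_rows": n_rows,
        "unique_ratio": n_unique / n_rows,
        "min": data[0],
        "max": data[-1],
        "median_to_q3": count(median, q3),
        "q1_to_median": count(q1, median),
        "q3_to_upper": sum(1 for x in data if q3 <= x <= upper_whisker),
        "lower_to_q1": count(lower_whisker, q1),
        "above_upper": sum(1 for x in data if x > upper_whisker),
        "below_lower": sum(1 for x in data if x < lower_whisker),
    }

spread_out = column_features([1, 2, 3, 4, 5, 6, 7, 8, 100])
binary_like = column_features([0, 1, 0, 1, 1, 0, 1, 0])
```

The six count features partition the column, so they always add up to the number of rows. In the two toy columns, the unique-value ratio is 1.0 for the spread-out column and 0.25 for the binary-like one.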
As a first attempt, with these 12 features and about 55 training examples, I trained a logistic regression model on the normalized features to predict the labels 1 (continuous) and 0 (categorical).
The fun part is that it works!
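For reference, the training step could look something like the sketch below. It is not my actual code: it is a plain-Python logistic regression fitted by gradient descent on z-scored features, with only two of the features and a handful of invented toy rows so that it stays short. The learning rate, epoch count, and all function names are assumptions; in practice a library implementation would be the usual choice.

```python
import math

def standardize(rows):
    """Z-score each feature column; also return the means and stds used."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log-loss."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for row, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * x for w, x in zip(weights, row))) - y
            grad_b += error
            for j, x in enumerate(row):
                grad_w[j] += error * x
        bias -= lr * grad_b / len(rows)
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
    return weights, bias

def predict(row, means, stds, weights, bias):
    """Return 1 (continuous) or 0 (categorical) for one raw feature row."""
    scaled = [(x - m) / s for x, m, s in zip(row, means, stds)]
    score = bias + sum(w * x for w, x in zip(weights, scaled))
    return 1 if sigmoid(score) >= 0.5 else 0

# Invented toy rows: [unique_ratio, n_unique]; label 1 = continuous, 0 = categorical.
X = [[0.02, 3], [0.05, 5], [0.01, 2], [0.04, 4],
     [0.95, 190], [0.80, 160], [0.99, 400], [0.90, 270]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

X_scaled, means, stds = standardize(X)
weights, bias = train_logistic(X_scaled, y)
train_predictions = [predict(row, means, stds, weights, bias) for row in X]
```

The same means and standard deviations learned from the training rows are reused when predicting, so new columns are scaled the same way as the training data.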
But did I do it right? Is this the correct method for predicting the nature of the data? Please advise me if I can improve it further.