Continuous or categorical data in data science

I am building an automated cleanup process that removes null values from a dataset. I found several statistics, such as the mode, the median, and the mean, that could be used to fill in the NaN values in the data. But which one should I choose? If the data is categorical, it should be either the mode or the median, while for continuous data it should be the mean or the median. So, to determine whether the data is categorical or continuous, I decided to build a machine learning classification model.
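A minimal sketch of the fill rule described above, assuming pandas is available. The helper name `fill_missing` and its `is_continuous` / `use_mean` parameters are hypothetical and only illustrate the idea; they are not from the original post.

```python
import pandas as pd

def fill_missing(column: pd.Series, is_continuous: bool, use_mean: bool = False) -> pd.Series:
    """Fill NaN values: mode for categorical data, median (or mean) for continuous data."""
    if is_continuous:
        fill_value = column.mean() if use_mean else column.median()
    else:
        # mode() returns a Series because ties are possible; take the first entry.
        fill_value = column.mode(dropna=True).iloc[0]
    return column.fillna(fill_value)

# Tiny illustrative columns (made up for this example).
ages = pd.Series([20.0, 30.0, None, 40.0])
colors = pd.Series(["red", "blue", None, "red"])

filled_ages = fill_missing(ages, is_continuous=True)       # NaN replaced by the median
filled_colors = fill_missing(colors, is_continuous=False)  # NaN replaced by the mode
```

The open question in the post is how to set `is_continuous` automatically, which is what the classifier is meant to decide.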

I used several features, for example:
1) the standard deviation of the data
2) the number of unique values in the data
3) the total number of rows of data
4) the ratio of the number of unique values to the total number of rows
5) the minimum data value
6) the maximum data value
7) the amount of data between the median and the 75th percentile
8) the amount of data between the median and the 25th percentile
9) the amount of data between the 75th percentile and the upper whisker
10) the amount of data between the 25th percentile and the lower whisker
11) the amount of data above the upper whisker
12) the amount of data below the lower whisker
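The twelve statistics listed above could be computed for one numeric column roughly as follows. This is a sketch under assumptions: the function name `column_features` and the dictionary keys are hypothetical, the whiskers are taken from the usual 1.5 * IQR box-plot rule (the post does not say which rule was used), and the row count excludes missing values.

```python
import numpy as np

def column_features(values):
    """Compute 12 summary statistics for one numeric column."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # assumption: missing values are ignored
    n = x.size
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Assumption: whiskers are the most extreme points within 1.5 * IQR of the box.
    lower_whisker = x[x >= q1 - 1.5 * iqr].min()
    upper_whisker = x[x <= q3 + 1.5 * iqr].max()
    n_unique = np.unique(x).size
    return {
        "std": x.std(),
        "n_unique": n_unique,
        "n_rows": n,
        "unique_ratio": n_unique / n,
        "min": x.min(),
        "max": x.max(),
        "n_median_to_q3": int(np.sum((x > median) & (x <= q3))),
        "n_q1_to_median": int(np.sum((x >= q1) & (x < median))),
        "n_q3_to_upper_whisker": int(np.sum((x > q3) & (x <= upper_whisker))),
        "n_lower_whisker_to_q1": int(np.sum((x >= lower_whisker) & (x < q1))),
        "n_above_upper_whisker": int(np.sum(x > upper_whisker)),
        "n_below_lower_whisker": int(np.sum(x < lower_whisker)),
    }

# Made-up column with one obvious outlier and one missing value.
features = column_features([1, 2, 2, 3, 3, 3, 4, 100, np.nan])
```

Each column of the dataset would yield one such feature row, with the label (continuous or categorical) assigned by hand for the training set.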

First, with these 12 features and about 55 training samples, I used a logistic regression model on the normalized features to predict the labels 1 (continuous) and 0 (categorical).
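A rough sketch of that setup with scikit-learn, assuming `StandardScaler` stands in for the normalization step (the post does not say which scaler was used). The data here is synthetic and is not the original 55 samples, and `simple_features` is a hypothetical, reduced three-feature subset used only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def simple_features(x):
    # Illustrative subset of the features: std, unique count, unique ratio.
    n_unique = np.unique(x).size
    return [x.std(), n_unique, n_unique / x.size]

# Synthetic stand-in for the labelled columns (NOT the original data).
continuous_cols = [rng.normal(size=200) for _ in range(28)]
categorical_cols = [rng.integers(0, 5, size=200).astype(float) for _ in range(27)]

X = np.array([simple_features(c) for c in continuous_cols + categorical_cols])
y = np.array([1] * 28 + [0] * 27)  # 1 = continuous, 0 = categorical

# Scaling and the classifier are chained so the same scaling is applied at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

new_column = rng.normal(size=200)
prediction = model.predict([simple_features(new_column)])[0]
```

Putting the scaler inside the pipeline keeps the normalization parameters tied to the training data, so new columns are scaled the same way before they are classified.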

The fun part is that it works!

But did I do it right? Is this the correct method for predicting the nature of the data? Please advise me if I can improve it further.

+4
2 answers

The data analysis looks great. As for the part:

But which one should I choose?

The mean always wins, as far as I have tested. For each dataset, I try all the options and compare the accuracy.
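One way to run that kind of comparison is sketched below, assuming scikit-learn's `SimpleImputer` and 5-fold cross-validation. The dataset is synthetic with values removed at random, and the downstream model (logistic regression) is an assumption; the answer does not say which model or data were used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of the entries knocked out (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

scores = {}
for strategy in ("mean", "median", "most_frequent"):
    # The imputer sits inside the pipeline so it is refit on each training fold.
    pipeline = make_pipeline(SimpleImputer(strategy=strategy),
                             LogisticRegression(max_iter=1000))
    scores[strategy] = cross_val_score(pipeline, X, y, cv=5).mean()

best_strategy = max(scores, key=scores.get)
```

Which strategy scores best depends on the dataset, so the result on this synthetic data says nothing about real data.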

There is a better approach, but it takes a lot of time. If you want to use this system, this may help.

. , N , , , N-1 - . , ( ) .

+1

? , , , .

. - . , . , nan, , , "index is nan". - .

, nan. , MICE.

, . , , , :

  • ( )
  • 1D GD, (GMM; 55 )

, + (log, exp).

: . , . .

. , RobustScaler sklearn ( , , "outlied" ).

: . / .

, , .

0

Source: https://habr.com/ru/post/1692758/

