Do scikit-learn decision trees support unordered ("enum") categorical features?

From the documentation , it appears that DecisionTreeClassifier supports multiclass classification:

DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification.

But it seems that the decision rule at each node is a "greater than" comparison.

I'm trying to build trees over enum features, where the numeric value of a feature carries no meaning: only equal / not-equal comparisons make sense.

Is this supported in scikit-learn decision trees?

My current workaround is to split each such feature into a set of binary indicator features, one per possible value, but I'm looking for a cleaner and more efficient solution.
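For illustration, here is a minimal sketch of that workaround (the color feature and its values are made up for the example):

import pandas as pd

# One enum feature where only equal / not-equal is meaningful.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Split the single enum column into one 0/1 column per possible value.
binary_df = pd.get_dummies(df, columns=["color"])
print(binary_df)  # columns: color_blue, color_green, color_red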

+4
3 answers

The term multiclass only concerns the target variable: for trees and forests in scikit-learn, the target is either categorical with an integer encoding (multiclass classification) or continuous (regression).

"More than" rules apply to input variables regardless of the type of the target variable. If you have categorical input variables with a small dimension (for example, less than a couple of dozen possible values), then it would be useful to use single-string coding for them. See:

  • OneHotEncoder , if your categories are encoded as integers;
  • DictVectorizer , if your categories are encoded as string labels in a list of Python dicts.
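A minimal sketch of both encoders; the feature values are invented for the example, and the sparse_output parameter name assumes scikit-learn >= 1.2 (older releases call it sparse):

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction import DictVectorizer

# Integer-encoded categories: two categorical features per row.
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X = [[0, 1], [1, 2], [2, 0]]
print(enc.fit_transform(X))  # one 0/1 column per (feature, value) pair

# String-labeled categories in a list of Python dicts.
vec = DictVectorizer(sparse=False)
records = [{"color": "red", "make": "honda"},
           {"color": "blue", "make": "toyota"}]
print(vec.fit_transform(records))
print(vec.get_feature_names_out())  # e.g. ['color=blue', 'color=red', ...]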

If some of the categorical variables have high cardinality (e.g., thousands of possible values or more), it has been shown experimentally that DecisionTreeClassifier , and ensemble models built on it such as RandomForestClassifier , can be trained directly on the raw integer encoding without converting it to a one-hot encoding that would waste memory and inflate model size.
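Something along these lines; the data is random and only meant to show the shape of the approach (note that the splits are still "feature <= threshold" rules over the integer codes, which is exactly the approximation being accepted):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# Three categorical features, each with ~5000 distinct values, kept as
# single integer columns instead of ~15000 one-hot columns.
X = rng.randint(0, 5000, size=(1000, 3))
y = rng.randint(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)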

+9

DecisionTreeClassifier is certainly capable of multiclass classification. The "greater than" rule just happens to be what that link illustrates; the decision rule at each node is chosen by its effect on information gain or Gini impurity (see further down that same page). Decision tree nodes usually hold binary rules, so they typically take the form "some value is greater than some threshold". The trick is transforming your data so that it has good predictive values to compare.
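You can inspect those per-node rules on a fitted tree. A small sketch on the iris data (the tree_ attribute marks leaf nodes with feature index -2):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Every internal node stores a "feature <= threshold" rule.
for feat, thr in zip(clf.tree_.feature, clf.tree_.threshold):
    if feat >= 0:  # skip leaves
        print("split on feature %d at threshold %.3f" % (feat, thr))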

To be clear, multiclass means that each sample (say, a document) is assigned exactly one class out of many possible classes. This differs from multilabel classification, where a document is assigned several classes from the set of possible classes. Most scikit-learn classifiers support multiclass, and the library provides several meta-wrappers for multilabel problems. You can also use probabilities (models with a predict_proba method) or decision distances (models with a decision_function method) for multilabel classification.

If you mean that you need to apply several labels to each data point (e.g., ['red', 'sport', 'fast'] for cars), then to use trees or forests you would have to create a unique label for every possible combination, and those combinations become your [0, ..., K-1] set of classes. However, this only works if there is predictive correlation among the combined labels (color, type and speed in the car example): there may be red or yellow fast sports cars in the data, but other three-way combinations are unlikely, so the data would be predictive for those few combinations and very weak for the rest. It is usually better to use an SVM such as LinearSVC , and/or to wrap a classifier with OneVsRestClassifier or similar; see the sketch below.
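A minimal sketch of that alternative, with made-up car data; MultiLabelBinarizer turns the label lists into the binary indicator matrix that OneVsRestClassifier expects for multilabel output:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Each sample carries several labels at once; features are invented.
labels = [["red", "sport", "fast"], ["yellow", "fast"], ["red"]]
X = [[0.9, 0.8], [0.4, 0.9], [0.7, 0.1]]

Y = MultiLabelBinarizer().fit_transform(labels)  # one 0/1 column per label
clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(clf.predict([[0.8, 0.7]]))  # a 0/1 row over the label columns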

+1

There is a Python package called DecisionTree ( https://engineering.purdue.edu/kak/distDT/DecisionTree-2.2.2.html ) that I find very useful.

This is not directly related to your scikit-learn problem, but it may be useful to others. Also, I always check pyindex when looking for Python tools: https://pypi.python.org/pypi/pyindex

thanks

+1
