Feature Selection Using MRMR

I found two ways to implement mRMR feature selection in Python. The paper describing this method is available here:

https://www.dropbox.com/s/tr7wjpc2ik5xpxs/doc.pdf?dl=0

This is my dataset code.

    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification
    from IPython.core.interactiveshell import InteractiveShell

    InteractiveShell.ast_node_interactivity = "all"

    X, y = make_classification(n_samples=10000, n_features=6, n_informative=3,
                               n_classes=2, random_state=0, shuffle=False)

    # Creating a DataFrame
    df = pd.DataFrame({'Feature 1': X[:, 0], 'Feature 2': X[:, 1],
                       'Feature 3': X[:, 2], 'Feature 4': X[:, 3],
                       'Feature 5': X[:, 4], 'Feature 6': X[:, 5],
                       'Class': y})

    y_train = df['Class']
    X_train = df.drop('Class', axis=1)

Method 1: mRMR using pymrmr

It supports both the MID and MIQ criteria and is published by its author at https://github.com/fbrundu/pymrmr

    import pymrmr

    pymrmr.mRMR(df, 'MIQ', 6)

['Feature 4', 'Feature 5', 'Feature 2', 'Feature 6', 'Feature 1', 'Feature 3']

or, using the second criterion:

    pymrmr.mRMR(df, 'MID', 6)

['Feature 4', 'Feature 6', 'Feature 5', 'Feature 2', 'Feature 1', 'Feature 3']

On the dataset above, these two criteria give the two outputs shown. Another GitHub author provides an implementation that is also claimed to apply the mRMR method. However, when I run it on the same dataset, I get a different result.

Method 2: mRMR using mifs

GitHub link:

https://github.com/danielhomola/mifs

    import mifs

    for i in range(1, 11):
        feat_selector = mifs.MutualInformationFeatureSelector('MRMR', k=i)
        feat_selector.fit(X_train, y_train)
        # call transform() on X to filter it down to selected features
        X_filtered = feat_selector.transform(X_train.values)
        # Create list of features
        feature_name = X_train.columns[feat_selector.ranking_]
        print(feature_name)

And if you run the iteration above for all the different values of i, there is no value for which the two methods actually give the same feature selection output.
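One way to make the comparison concrete is to compare the top-k *sets* rather than the full orderings, since greedy mRMR variants often agree on which features matter while disagreeing on rank order. A small sketch, using the two pymrmr outputs quoted above (the helper `topk_overlap` is mine, not part of any library):

```python
def topk_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k features shared by both rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k

# The two rankings produced by pymrmr on the dataset above
miq = ['Feature 4', 'Feature 5', 'Feature 2', 'Feature 6', 'Feature 1', 'Feature 3']
mid = ['Feature 4', 'Feature 6', 'Feature 5', 'Feature 2', 'Feature 1', 'Feature 3']

for k in range(1, 7):
    print(k, topk_overlap(miq, mid, k))
```

For these two rankings the top-1 and top-4 (and larger) sets coincide even though the orderings differ, which is a useful sanity check before concluding that two implementations disagree fundamentally.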

What could be the problem here?

1 answer

You may need to contact the authors of the original paper and/or the owner of the GitHub repository for a definitive answer, but most likely the differences arise because you are comparing three different algorithms (despite the shared name).

Minimum Redundancy Maximum Relevance is actually a family of feature selection algorithms whose common goal is to select features that are mutually far apart from each other (low redundancy) while still having "high" correlation with the classification variable (high relevance).

You can measure this goal using mutual information, but the specific procedure (what do you do with the computed estimates? In what order? What other post-processing steps are applied? ...) differs from one author to another. Even the paper itself actually gives you two different implementations: MIQ and MID.
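To make concrete how MID and MIQ can diverge, here is a simplified greedy mRMR sketch built on scikit-learn's mutual-information estimators. This is an illustration of the two criteria only: the real pymrmr and mifs packages discretize the data and estimate MI differently, so their rankings will not match this sketch exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def greedy_mrmr(X, y, n_selected, scheme="MID"):
    """Toy greedy mRMR. scheme='MID' uses relevance - redundancy,
    scheme='MIQ' uses relevance / redundancy."""
    n_features = X.shape[1]
    # Relevance: MI between each feature and the class label
    relevance = mutual_info_classif(X, y, random_state=0)
    # Redundancy: pairwise MI between features, precomputed
    redundancy = np.zeros((n_features, n_features))
    for j in range(n_features):
        redundancy[:, j] = mutual_info_regression(X, X[:, j], random_state=0)

    selected, remaining = [], list(range(n_features))
    while len(selected) < n_selected:
        best, best_score = None, -np.inf
        for f in remaining:
            # Mean MI between candidate f and already-selected features
            red = redundancy[f, selected].mean() if selected else 0.0
            if scheme == "MID":                    # difference criterion
                score = relevance[f] - red
            else:                                  # "MIQ": quotient criterion
                score = relevance[f] / (red + 1e-12)
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
        remaining.remove(best)
    return selected

X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_classes=2, random_state=0, shuffle=False)
print("MID order:", greedy_mrmr(X, y, 6, "MID"))
print("MIQ order:", greedy_mrmr(X, y, 6, "MIQ"))
```

Both criteria pick the same first feature (the most relevant one, since redundancy is zero at that point), but from the second step onward the subtraction and the division can trade off relevance against redundancy differently, which is exactly why two implementations of "mRMR" can return different orderings.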

So my suggestion would be simply to pick the implementation that is most convenient for you (or, even better, the one that gives the best results in your pipeline after proper validation) and report which specific source you chose and why.


Source: https://habr.com/ru/post/1275842/
