
Best Model For Variable Selection With Big Data?

I posted a question earlier about some code, but now I realize I should ask about the general idea more broadly. Basically, I'm trying to build a statistical model with about 1000 observations and roughly 2000 candidate features, and I want to know which approaches work best for variable selection at that scale.

Solution 1:

I would recommend taking a closer look at the variance of your variables and keeping those with the largest spread (pandas.DataFrame.var()), then eliminating the variables that correlate most strongly with the others (pandas.DataFrame.corr()). As a further step, I'd suggest applying any of the methods mentioned earlier.
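A rough sketch of that filtering idea, assuming a pandas DataFrame with hypothetical column names; the 0.01 variance threshold and 0.9 correlation threshold are arbitrary choices, not values from the question:

    import numpy as np
    import pandas as pd

    # Toy stand-in for the real dataset (hypothetical columns).
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(1000, 6)),
                      columns=[f"x{i}" for i in range(6)])
    df["x5"] = 0.95 * df["x0"] + rng.normal(scale=0.1, size=1000)  # nearly duplicates x0
    df["x4"] = 0.001 * df["x4"]                                    # near-constant column

    # Step 1: keep variables whose variance clears the chosen threshold.
    variances = df.var()
    kept = variances[variances > 0.01].index

    # Step 2: drop one variable from each highly correlated pair
    # (look only at the upper triangle so each pair is counted once).
    corr = df[kept].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
    reduced = df[kept].drop(columns=to_drop)
    print(reduced.columns.tolist())  # x4 and x5 are gone

Both thresholds need tuning to the actual data; this only illustrates the mechanics.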

Solution 2:

1. Variant A: Feature selection with scikit-learn

For feature selection, scikit-learn offers a lot of different approaches: https://scikit-learn.org/stable/modules/feature_selection.html

That page sums up the comments from above.
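For instance, a minimal sketch using univariate selection from that module, on synthetic data shaped like the question (1000 observations, 2000 features); k=50 is an arbitrary choice:

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SelectKBest, f_regression

    # Synthetic regression data matching the problem size.
    X, y = make_regression(n_samples=1000, n_features=2000,
                           n_informative=20, random_state=0)

    # Keep the k features with the strongest univariate linear
    # relationship to the target.
    selector = SelectKBest(score_func=f_regression, k=50)
    X_reduced = selector.fit_transform(X, y)
    print(X_reduced.shape)                           # (1000, 50)
    print(selector.get_support(indices=True)[:10])   # indices of kept features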

2. Variant B: Feature selection with linear regression

You can also read off feature importances by running a linear regression on your data: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html. The attribute reg.coef_ gives you the coefficient of each feature; the higher its absolute value, the more important the feature (provided the features are on comparable scales, e.g. standardized first). For example, a coefficient of 0.8 marks a really important feature, whereas 0.00001 marks an unimportant one.
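A minimal sketch of reading importances off reg.coef_, standardizing first so the magnitudes are comparable (synthetic data, not the asker's):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    X, y = make_regression(n_samples=1000, n_features=10,
                           n_informative=3, random_state=0)

    # Put all features on the same scale so coefficients are comparable.
    X_std = StandardScaler().fit_transform(X)
    reg = LinearRegression().fit(X_std, y)

    # Rank features by absolute coefficient size, largest first.
    ranking = np.argsort(-np.abs(reg.coef_))
    for i in ranking[:5]:
        print(f"feature {i}: coef = {reg.coef_[i]:.4f}")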

3. Variant C: PCA (not for the binary case)

Why do you want to drop variables at all? I would recommend using PCA (principal component analysis): https://en.wikipedia.org/wiki/Principal_component_analysis.

The basic concept is to transform your 2000 features into a smaller space (maybe 1000 dimensions, or whatever you choose) while retaining as much of the information in the data as possible.

scikit-learn has a good implementation of it: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
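A minimal sketch with that class; passing a float between 0 and 1 as n_components makes PCA keep just enough components to explain that fraction of the variance (0.95 here is an arbitrary choice):

    from sklearn.datasets import make_regression
    from sklearn.decomposition import PCA

    # Synthetic stand-in for a 1000 x 2000 feature matrix.
    X, _ = make_regression(n_samples=1000, n_features=2000, random_state=0)

    # Keep enough principal components to explain 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                        # fewer columns than the original 2000
    print(pca.explained_variance_ratio_.sum())    # >= 0.95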
