How to Choose a Feature Selection Method For Machine Learning

Last Updated on August 20, 2020

Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

As such, it can be challenging for a machine learning practitioner to select an appropriate statistical measure for a dataset when performing filter-based feature selection.

In this post, you will discover how to choose statistical measures for filter-based feature selection with numerical and categorical data.

After reading this post, you will know:

  • There are two main types of feature selection techniques: supervised and unsupervised, and supervised methods may be divided into wrapper, filter and intrinsic.
  • Filter-based feature selection methods use statistical measures to score the correlation or dependence between input variables that can be filtered to choose the most relevant features.
  • Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Update Nov/2019: Added some worked examples for classification and regression.
  • Update May/2020: Expanded and added references. Added pictures.

Overview

This tutorial is divided into 4 parts; they are:

  1. Feature Selection Methods
  2. Statistics for Filter Feature Selection Methods
    1. Numerical Input, Numerical Output
    2. Numerical Input, Categorical Output
    3. Categorical Input, Numerical Output
    4. Categorical Input, Categorical Output
  3. Tips and Tricks for Feature Selection
    1. Correlation Statistics
    2. Selection Method
    3. Transform Variables
    4. What Is the Best Method?
  4. Worked Examples
    1. Regression Feature Selection
    2. Classification Feature Selection

1. Feature Selection Methods

Feature selection methods are intended to reduce the number of input variables to those that are believed to be most useful to a model in order to predict the target variable.

Feature selection is primarily focused on removing non-informative or redundant predictors from the model.

— Page 488, Applied Predictive Modeling, 2013.

Some predictive modeling problems have a large number of variables that can slow the development and training of models and require a large amount of system memory. Additionally, the performance of some models can degrade when including input variables that are not relevant to the target variable.

Many models, especially those based on regression slopes and intercepts, will estimate parameters for every term in the model. Because of this, the presence of non-informative variables can add uncertainty to the predictions and reduce the overall effectiveness of the model.

— Page 488, Applied Predictive Modeling, 2013.

One way to think about feature selection methods is in terms of supervised and unsupervised methods.

An important distinction to be made in feature selection is that of supervised and unsupervised methods. When the outcome is ignored during the elimination of predictors, the technique is unsupervised.

— Page 488, Applied Predictive Modeling, 2013.

The difference has to do with whether features are selected based on the target variable or not. Unsupervised feature selection techniques ignore the target variable, such as methods that remove redundant variables using correlation. Supervised feature selection techniques use the target variable, such as methods that remove irrelevant variables.
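
As a rough sketch of the unsupervised case (an illustration, not the tutorial's code), the snippet below drops one of each pair of input variables whose absolute correlation exceeds a threshold; the dataset and the 0.9 threshold are arbitrary choices.

import numpy as np
from sklearn.datasets import make_regression

# illustrative data; real data would typically contain correlated columns
X, _ = make_regression(n_samples=100, n_features=10, random_state=1)
corr = np.abs(np.corrcoef(X, rowvar=False))  # feature-by-feature correlation matrix
upper = np.triu(corr, k=1)                   # upper triangle, excluding the diagonal
# drop any column that is highly correlated with an earlier column
to_drop = [i for i in range(X.shape[1]) if np.any(upper[:, i] > 0.9)]
X_reduced = np.delete(X, to_drop, axis=1)
print(X_reduced.shape)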

Another way to consider feature selection methods is in terms of the mechanism used to select features, which may be divided into wrapper and filter methods. These methods are almost always supervised and are evaluated based on the performance of a resulting model on a hold-out dataset.

Wrapper feature selection methods create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance.

— Page 490, Applied Predictive Modeling, 2013.
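
As a rough sketch of the wrapper idea (an illustration, not the tutorial's code), the snippet below uses scikit-learn's RFE with a decision tree to search for a five-feature subset; the dataset and the number of features to select are arbitrary choices.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# illustrative dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)
# wrapper method: repeatedly fit the model, removing the weakest features each round
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)
print(X_selected.shape)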

Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

Filter methods evaluate the relevance of the predictors outside of the predictive models and subsequently model only the predictors that pass some criterion.

— Page 490, Applied Predictive Modeling, 2013.

Finally, there are some machine learning algorithms that perform feature selection automatically as part of learning the model. We might refer to these techniques as intrinsic feature selection methods.

… some models contain built-in feature selection, meaning that the model will only include predictors that help maximize accuracy. In these cases, the model can pick and choose which representation of the data is best.

— Page 28, Applied Predictive Modeling, 2013.

This includes algorithms such as penalized regression models like Lasso and decision trees, including ensembles of decision trees like random forest.

Some models are naturally resistant to non-informative predictors. Tree- and rule-based models, MARS and the lasso, for example, intrinsically conduct feature selection.

— Page 487, Applied Predictive Modeling, 2013.
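
As a minimal sketch of intrinsic selection (illustrative, not from the tutorial), the Lasso below applies an L1 penalty that drives the coefficients of non-informative inputs to exactly zero; the dataset and the alpha value are arbitrary choices.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# illustrative dataset where only 3 of 10 inputs are related to the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=0.1, random_state=1)
model = Lasso(alpha=1.0)
model.fit(X, y)
# the L1 penalty zeroes out the coefficients of unhelpful inputs
print('non-zero coefficients:', (model.coef_ != 0).sum())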

Feature selection is also related to dimensionality reduction techniques in that both methods seek fewer input variables to a predictive model. The difference is that feature selection selects features to keep or remove from the dataset, whereas dimensionality reduction creates a projection of the data resulting in entirely new input features. As such, dimensionality reduction is an alternative to feature selection rather than a type of feature selection.
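
To make the contrast concrete, here is a minimal, illustrative sketch (not from the tutorial): feature selection keeps a subset of the original columns, while PCA, a dimensionality reduction method, projects the data onto entirely new components.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# illustrative dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=5, random_state=1)
# feature selection: keep 5 of the original 20 columns
X_subset = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)
# dimensionality reduction: create 5 entirely new components
X_projected = PCA(n_components=5).fit_transform(X)
print(X_subset.shape, X_projected.shape)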

We can summarize feature selection as follows.

  • Feature Selection: Select a subset of input features from the dataset.
    • Unsupervised: Do not use the target variable (e.g. remove redundant variables).
      • Correlation
    • Supervised: Use the target variable (e.g. remove irrelevant variables).
      • Wrapper: Search for well-performing subsets of features.
        • RFE
      • Filter: Select subsets of features based on their relationship with the target.
        • Statistical Methods
        • Feature Importance Methods
      • Intrinsic: Algorithms that perform automatic feature selection during training.
        • Decision Trees
  • Dimensionality Reduction: Project input data into a lower-dimensional feature space.

The image below provides a summary of this hierarchy of feature selection techniques.

Overview of Feature Selection Techniques

In the next section, we will review some of the statistical measures that may be used for filter-based feature selection with different input and output variable data types.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

2. Statistics for Filter-Based Feature Selection Methods

It is common to use correlation type statistical measures between input and output variables as the basis for filter feature selection.

As such, the choice of statistical measures is highly dependent upon the variable data types.

Common data types include numerical (such as height) and categorical (such as a label), although each may be further subdivided such as integer and floating point for numerical variables, and boolean, ordinal, or nominal for categorical variables.

Common input variable data types:

  • Numerical Variables
    • Integer Variables.
    • Floating Point Variables.
  • Categorical Variables.
    • Boolean Variables (dichotomous).
    • Ordinal Variables.
    • Nominal Variables.

Overview of Data Variable Types

The more that is known about the data type of a variable, the easier it is to choose an appropriate statistical measure for a filter-based feature selection method.

In this section, we will consider two broad categories of variable types, numerical and categorical, as well as the two main groups of variables to consider: input and output.

Input variables are those that are provided as input to a model. In feature selection, it is this group of variables that we wish to reduce in size. Output variables are those that a model is intended to predict, often called the response variable.

The type of response variable typically indicates the type of predictive modeling problem being performed. For example, a numerical output variable indicates a regression predictive modeling problem, and a categorical output variable indicates a classification predictive modeling problem.

  • Numerical Output: Regression predictive modeling problem.
  • Categorical Output: Classification predictive modeling problem.

The statistical measures used in filter-based feature selection are generally calculated one input variable at a time with the target variable. As such, they are referred to as univariate statistical measures. This may mean that any interaction between input variables is not considered in the filtering process.

Most of these techniques are univariate, meaning that they evaluate each predictor in isolation. In this case, the existence of correlated predictors makes it possible to select important, but redundant, predictors. The obvious consequences of this issue are that too many predictors are chosen and, as a result, collinearity problems arise.

— Page 499, Applied Predictive Modeling, 2013.

With this framework, let’s review some univariate statistical measures that can be used for filter-based feature selection.

How to Choose Feature Selection Methods For Machine Learning

Numerical Input, Numerical Output

This is a regression predictive modeling problem with numerical input variables.

The most common techniques are to use a correlation coefficient, such as Pearson’s for a linear correlation, or rank-based methods for a nonlinear correlation.

  • Pearson’s correlation coefficient (linear).
  • Spearman’s rank coefficient (nonlinear)
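
As a minimal sketch (illustrative data, not from the tutorial), the snippet below scores each numerical input against a numerical target with both Pearson's and Spearman's correlation via SciPy.

from scipy.stats import pearsonr, spearmanr
from sklearn.datasets import make_regression

# illustrative regression dataset
X, y = make_regression(n_samples=100, n_features=5, n_informative=2, random_state=1)
for i in range(X.shape[1]):
    pearson_corr, _ = pearsonr(X[:, i], y)    # linear correlation
    spearman_corr, _ = spearmanr(X[:, i], y)  # rank (monotonic) correlation
    print('Feature %d: pearson=%.3f, spearman=%.3f' % (i, pearson_corr, spearman_corr))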

Numerical Input, Categorical Output

This is a classification predictive modeling problem with numerical input variables.

This might be the most common example of a classification problem.

Again, the most common techniques are correlation based, although in this case, they must take the categorical target into account.

  • ANOVA correlation coefficient (linear).
  • Kendall’s rank coefficient (nonlinear).

Kendall does assume that the categorical variable is ordinal.
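
As a minimal sketch (illustrative data, not from the tutorial), the snippet below computes Kendall's tau between each numerical input and an integer-encoded class label, treating the label as ordinal per the assumption above.

from scipy.stats import kendalltau
from sklearn.datasets import make_classification

# illustrative classification dataset; the integer class label is treated as ordinal
X, y = make_classification(n_samples=100, n_features=5, n_informative=2, random_state=1)
for i in range(X.shape[1]):
    tau, p_value = kendalltau(X[:, i], y)
    print('Feature %d: tau=%.3f, p=%.3f' % (i, tau, p_value))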

Categorical Input, Numerical Output

This is a regression predictive modeling problem with categorical input variables.

This is a somewhat unusual example of a regression problem (i.e. you would not encounter it often).

Nevertheless, you can use the same “Numerical Input, Categorical Output” methods (described above), but in reverse.

Categorical Input, Categorical Output

This is a classification predictive modeling problem with categorical input variables.

The most common correlation measure for categorical data is the chi-squared test. You can also use mutual information (information gain) from the field of information theory.

  • Chi-Squared test (contingency tables).
  • Mutual Information.

In fact, mutual information is a powerful method that may prove useful for both categorical and numerical data, e.g. it is agnostic to the data types.
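
As a minimal sketch (the toy data is illustrative, not from the tutorial), the snippet below scores ordinal-encoded categorical inputs against a categorical target with both the chi-squared test and mutual information.

import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

# three categorical inputs already encoded as non-negative integers
X = np.array([[0, 1, 2],
              [1, 0, 1],
              [2, 1, 0],
              [0, 2, 1],
              [1, 1, 2],
              [2, 0, 0]])
y = np.array([0, 1, 0, 1, 0, 1])
chi_scores, p_values = chi2(X, y)
mi_scores = mutual_info_classif(X, y, discrete_features=True)
print('chi2 scores:', chi_scores)
print('mutual information:', mi_scores)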

3. Tips and Tricks for Feature Selection

This section provides some additional considerations when using filter-based feature selection.

Correlation Statistics

The scikit-learn library provides an implementation of most of the useful statistical measures.

For example:

  • Pearson’s Correlation Coefficient: f_regression()
  • ANOVA: f_classif()
  • Chi-Squared: chi2()
  • Mutual Information: mutual_info_classif() and mutual_info_regression()

Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau (kendalltau) and Spearman’s rank correlation (spearmanr).

Selection Method

The scikit-learn library also provides many different filtering methods once statistics have been calculated for each input variable with the target.

Two of the more popular methods include:

  • Select the top k variables: SelectKBest
  • Select the top percentile variables: SelectPercentile

I often use SelectKBest myself.
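
For completeness, here is a minimal, illustrative sketch of SelectPercentile (the dataset and the percentile value are arbitrary choices); SelectKBest is shown in the worked examples later in this post.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# illustrative dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, random_state=1)
# keep the top 20 percent of features by ANOVA F-score
fs = SelectPercentile(score_func=f_classif, percentile=20)
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)  # (100, 4)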

Transform Variables

Consider transforming the variables in order to access different statistical methods.

For example, you can transform a categorical variable to ordinal, even if it is not, and see if any interesting results come out.

You can also make a numerical variable discrete (e.g. bins); try categorical-based measures.

Some statistical measures assume properties of the variables, such as Pearson’s correlation, which assumes a Gaussian probability distribution for the observations and a linear relationship. You can transform the data to meet the expectations of the test, or you can try the test regardless of whether those expectations are met and compare the results.
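
As a minimal sketch of the binning idea (illustrative choices throughout, not from the tutorial), the snippet below discretizes numerical inputs into ordinal bins so that a categorical measure such as chi-squared can be tried.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import KBinsDiscretizer

# illustrative dataset of numerical inputs
X, y = make_classification(n_samples=100, n_features=20, n_informative=2, random_state=1)
# bin each numerical input into 5 ordinal-encoded categories
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
# chi-squared requires non-negative values, which the bin indices satisfy
fs = SelectKBest(score_func=chi2, k=2)
X_selected = fs.fit_transform(X_binned, y)
print(X_selected.shape)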

What Is the Best Method?

There is no best feature selection method.

Just like there is no best set of input variables or best machine learning algorithm. At least not universally.

Instead, you must discover what works best for your specific problem using careful systematic experimentation.

Try a range of different models fit on different subsets of features chosen via different statistical measures and discover what works best for your specific problem.

4. Worked Examples of Feature Selection

It can be helpful to have some worked examples that you can copy-and-paste and adapt for your own project.

This section provides worked examples of feature selection cases that you can use as a starting point.

Regression Feature Selection:
(Numerical Input, Numerical Output)

This section demonstrates feature selection for a regression problem that has numerical inputs and numerical outputs.

A test regression problem is prepared using the make_regression() function.

Feature selection is performed using Pearson’s Correlation Coefficient via the f_regression() function.

# pearson's correlation feature selection for numeric input and numeric output
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
# generate dataset
X, y = make_regression(n_samples=100, n_features=100, n_informative=10)
# define feature selection
fs = SelectKBest(score_func=f_regression, k=10)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)

Running the example first creates the regression dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

(100, 10)

Classification Feature Selection:
(Numerical Input, Categorical Output)

This section demonstrates feature selection for a classification problem that has numerical inputs and categorical outputs.

A test classification problem is prepared using the make_classification() function.

Feature selection is performed using ANOVA F measure via the f_classif() function.

# ANOVA feature selection for numeric input and categorical output
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# generate dataset
X, y = make_classification(n_samples=100, n_features=20, n_informative=2)
# define feature selection
fs = SelectKBest(score_func=f_classif, k=2)
# apply feature selection
X_selected = fs.fit_transform(X, y)
print(X_selected.shape)

Running the example first creates the classification dataset, then defines the feature selection and applies the feature selection procedure to the dataset, returning a subset of the selected input features.

(100, 2)

Classification Feature Selection:
(Categorical Input, Categorical Output)

For examples of feature selection with categorical inputs and categorical outputs, see the tutorial:

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Applied Predictive Modeling, 2013.

Summary

In this post, you discovered how to choose statistical measures for filter-based feature selection with numerical and categorical data.

Specifically, you learned:

  • There are two main types of feature selection techniques: supervised and unsupervised, and supervised methods may be divided into wrapper, filter and intrinsic.
  • Filter-based feature selection methods use statistical measures to score the correlation or dependence between input variables that can be filtered to choose the most relevant features.
  • Statistical measures for feature selection must be carefully chosen based on the data type of the input variable and the output or response variable.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Modern Data Preparation!

Prepare Your Machine Learning Data in Minutes

…with just a few lines of python code

Discover how in my new Ebook:
Data Preparation for Machine Learning

It provides self-study tutorials with full working code on:
Feature Selection, RFE, Data Cleaning, Data Transforms, Scaling, Dimensionality Reduction,
and much more…

Bring Modern Data Preparation Techniques to
Your Machine Learning Projects

See What’s Inside

