# Feature selection

## Pearson correlation coefficient

The Pearson coefficient measures the linear correlation between two variables. The resulting value lies in [-1, 1]: -1 means perfect negative correlation (as one variable increases, the other decreases), +1 means perfect positive correlation, and 0 means no linear correlation between the two variables.

    import numpy as np
    from scipy.stats import pearsonr

    np.random.seed(0)
    size = 300
    x = np.random.normal(0, 1, size)

    # Same signal x, with Gaussian noise of standard deviation 1, then 10
    print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
    print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))

    Lower noise (0.71824836862138386, 7.3240173129992273e-49)
    Higher noise (0.057964292079338148, 0.31700993885324746)
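In each pair returned by `pearsonr`, the first number is the coefficient and the second is the p-value. To make the coefficient itself less of a black box, here is a minimal sketch that computes it straight from the definition (covariance divided by the product of the standard deviations) and compares it with `np.corrcoef`. The helper name `pearson_r` and the sample data are made up for this illustration.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation from the definition: cov(a, b) / (std(a) * std(b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Center both variables so that sums of products act as (co)variances.
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )


# Illustrative data: a signal plus unit-variance Gaussian noise.
rng = np.random.RandomState(0)
x = rng.normal(0, 1, 300)
y = x + rng.normal(0, 1, 300)

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The two values should agree up to floating-point error, and a perfectly linear relation such as `pearson_r(x, 2 * x + 3)` should give a value of 1 up to rounding.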

scikit-learn's `Pipeline` (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) can chain feature selection with model fitting in a single estimator, which gets the job done faster.
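As a rough sketch of what this could look like, the pipeline below ranks features with `f_regression` (whose score is derived from the correlation between each feature and the target), keeps the best two with `SelectKBest`, and fits a linear model on them. The synthetic data, the step names, and the choice of `k=2` are assumptions made for this example, not something prescribed by these notes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
n_samples = 300

# Synthetic data (made up for this sketch): only the first two of the
# five features actually influence the target.
X = rng.normal(0, 1, (n_samples, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, n_samples)

pipeline = Pipeline([
    # Score each feature against the target and keep the top two.
    ("select", SelectKBest(score_func=f_regression, k=2)),
    # Fit the model only on the selected features.
    ("model", LinearRegression()),
])
pipeline.fit(X, y)

selected = pipeline.named_steps["select"].get_support(indices=True)
print("Selected feature indices:", selected)
print("R^2 on the training data:", pipeline.score(X, y))
```

Because selection and fitting live in one object, the same feature choice is reapplied automatically whenever `pipeline.predict` is called on new data.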

A major drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to linear relationships. If the relationship is non-linear, the Pearson correlation can be close to zero even if one variable is completely determined by the other.
For example, the correlation between x and x² is close to zero when x is centered on 0.

    # x is symmetric around 0, so the linear correlation with x**2 vanishes
    x = np.random.uniform(-1, 1, 100000)
    print(pearsonr(x, x**2)[0])

    -0.00230804707612
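The centering matters here. The sketch below, an illustration under assumed sample ranges rather than part of the original notes, runs the same x versus x² comparison twice: once with x drawn symmetrically around 0 and once with x drawn from [0, 1], where x² is monotonic in x.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.RandomState(0)

# Same relationship y = x**2, sampled over two different ranges.
x_centered = rng.uniform(-1, 1, 100000)  # symmetric around 0
x_shifted = rng.uniform(0, 1, 100000)    # x**2 is monotonic on this range

r_centered = pearsonr(x_centered, x_centered ** 2)[0]
r_shifted = pearsonr(x_shifted, x_shifted ** 2)[0]

print("centered on 0:", r_centered)
print("range [0, 1]: ", r_shifted)
```

On the symmetric range the positive and negative halves cancel and the coefficient stays near zero, while on [0, 1] the relationship is close enough to linear that the coefficient comes out high. The dependence between x and x² is equally strong in both cases, which is exactly why a near-zero Pearson score cannot be read as "no relationship".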

[Figure: Pearson correlation chart]