data science notes2

Feature selection

Pearson Coefficient:

Measures linear correlation between two variables. The resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two variables.
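To make the definition concrete, here is a minimal sketch that computes the coefficient directly as the covariance divided by the product of the standard deviations. The helper name `pearson` and the toy data are assumptions for illustration and are not part of scipy.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Un-normalised covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = pearson(xs, [2.0 * v + 1.0 for v in xs])  # y rises linearly with x
r_neg = pearson(xs, [-3.0 * v for v in xs])       # y falls linearly with x
print(r_pos, r_neg)
```

The two toy cases sit at the extremes of the range: an exactly increasing line gives a value of +1 and an exactly decreasing line gives -1, up to floating-point rounding.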

import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)  # fix the seed so the run is reproducible
size = 300
x = np.random.normal(0, 1, size)

# pearsonr returns the correlation coefficient and the p-value
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))

Lower noise (0.71824836862138386, 7.3240173129992273e-49)
Higher noise (0.057964292079338148, 0.31700993885324746)

 

Sklearn: use sklearn's Pipeline to get the job done faster. See http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
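As a rough sketch of what that could look like, the snippet below chains univariate feature selection and a linear model in one Pipeline. The choice of `SelectKBest` with `f_regression` (a univariate linear test derived from the correlation between each feature and the target), the step names, `k=2`, and the synthetic data are all assumptions made for illustration; the notes do not specify them.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): only columns 0 and 1 influence the target.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=2)),  # keep the 2 best-scoring features
    ("model", LinearRegression()),                          # fit on the selected features only
])
pipe.fit(X, y)

# Indices of the columns that survived the selection step.
selected = pipe.named_steps["select"].get_support(indices=True)
print(selected)
```

Putting the selection step inside the pipeline means it is refit together with the model, so the same object can be passed to cross-validation without selecting features on the held-out data.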

The major drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even when one variable is completely determined by the other.
For example, the correlation between x and x^2 is zero when x is centered on 0.

x = np.random.uniform(-1, 1, 100000)
print(pearsonr(x, x**2)[0])  # [0] is the correlation coefficient


-0.00230804707612
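The reason is that cov(x, x^2) = E[x^3] - E[x]E[x^2], and both E[x] and E[x^3] vanish when x is distributed symmetrically around 0. The sketch below checks this on a deterministic symmetric grid and then shifts the same grid away from 0 for contrast. The helper `pearson` and the grid values are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)

# Grid symmetric around 0: x runs over [-1, 1].
centered = [i / 100 for i in range(-100, 101)]
r_centered = pearson(centered, [v * v for v in centered])

# Same grid shifted to [0, 2]: x is no longer centered on 0.
shifted = [v + 1.0 for v in centered]
r_shifted = pearson(shifted, [v * v for v in shifted])

print(r_centered, r_shifted)
```

On the centered grid the coefficient is zero up to rounding, while on the shifted grid x^2 is monotonic in x and the coefficient is close to 1, even though the functional relationship is the same in both cases.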

 

[Figure: Pearson correlation chart]

Source: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/

 

 
