data science notes2

Feature selection

Pearson Coefficient:

Measures linear correlation between two variables. The resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two variables.

import numpy as np

from scipy.stats import pearsonr


size = 300

x = np.random.normal(0, 1, size)

print "Lower noise", pearsonr(x, x + np.random.normal(0, 1, size))

print "Higher noise", pearsonr(x, x + np.random.normal(0, 10, size))

Lower noise (0.71824836862138386, 7.3240173129992273e-49)
Higher noise (0.057964292079338148, 0.31700993885324746)



Use sklearn, pipeline to get the job faster.

Major Drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even if there is a 1-1 correspondence between the two variables.
For example, a correlation between x and x2 is zero or when x is centered on 0.

x = np.random.uniform(-1, 1, 100000)
print pearsonr(x, x**2)[0]



Pearson Correlation Chart