datascience01 notes

hei, there might some spell issues and other bug please excuse, I will try to put more details. please provide comments/suggestion here. thank you.

Table of contents

This is notes I collected from a ML Session I attended

Keywords

  • essemblers
  • theta
  • learning rate
  • gradient decesnt
  • data model or classifier or regressor

Introduction

Machine Learning can be simply described by 3 variables. Task, Experience & Performance.

 TASK         | Describle/decide what goal is required.
 EXPERIENCE   | Training your data model
 PERFORMANCE  | How good is your solution?

Some ways we can classify our models

Based on feedback

  • Supervised: We know how answers/solution sets looks like. Mail is spam or not.
  • Un-supervised: From start we don’t know what to classify it and we need to classify.
  • Reinforced Learning: Solution set we know may change in future.

Based on Model type

Regression or Classification

World is full of challenging problem and interesting answers. No doubt but we don’t want to tackle all those here rt now. We generally focus on two of problem statement, one is classification and other is continuous distribution type.

Like you can understand classification is about categorising if a particular document belongs to a group or a author or say a mail from xyz.com is spam or not,.. so on.

Regression problem generally include a continue distribution function. For example say, you wrote a wonderful articles or created dub smash video which is going viral. Now you want to understand how many more and people are going to search for that video so that you can pull the traffic into your site.

linear regression

we are going to apply a machine learn code on data we have, thus creating a data model. Here this model is nothing but a linear model,. i.e. y = mx +c but with some extra dimension or variable or features included.

y = a1 * x1 + a2 * x2 + a3 * x3 + c

Simple. Fast. Hassle Free. My Fav. All other machines fav too.

logistic regression

Imagine there is a special function simmilar to round function which we are very familar with and we a machine to figure it our all by itself. All the sytem has to do is figure it out where

In [12]: round(0.445, 0)
Out[12]: 0.0

In [13]: round(0.455, 0)
Out[13]: 0.0

In [14]: round(0.555, 0)
Out[14]: 1.0

In [15]: round(0.655, 0)
Out[15]: 1.0

In [16]: round(0.255, 0)
Out[16]: 0.0

In [17]: round(0.155, 0)
Out[17]: 0.0

In [18]: round(0.055, 0)
Out[18]: 0.0

train_data = [ 0.445, 0.455, 0.555, 0.655, 0.255 ]

target_data = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]

test_data = [ 0.155, 0.755 ]

for this our prediction should come as [ 0.0, 1.0 ]

As you can see, in this problem our input is all with in 0.0 till 1.0 and output is also with in same bound. Our goal is let machine find where is the cut off edge.

how does these model work in sklearn

General Steps

  • step1: select & import a model from models available from sklearn libs
  • step2: initiate a new model( create a instance of it )
  • step3: fit the data into that model
  • step4: predict the solution

Test-Train Data

  • test data
  • train data
  • target data/ labels data

Development Steps

  • step1: select & import a model from models available from sklearn libs
  • step2: initiate a new model ( create a instance of it )
    • step2.1: split our main training data into sets
    • step2.2: take one set for testing and another for training.
      • Generally split is done in 90 : 10 % or 80 : 20 % ratios
      • Small set is for testing and larger for training.
  • step3: fit the data into that model
    • fit only our training data (80%) of data
  • step4: predict the solution
    • here test only our testing data for which we know answers.

Almost all sklearn models follow following structure

from sklearn.marvel_comics import superman

clf = superman()

clf.fit( test-data, target-data)

predicted_data = clf.predict(test-data)

performance checking

Performance Checking. Some thing simmilar to checking our own code. Frankly speaking this is something I really dont like 🙂 Sadly, as a sr. developer this is no option for me but to like and love it to do. Luckily, sklearn has all the tool required for checking. ( thank you for letting to be still lazy 🙂

K-fold, GridCv, RandomisedCV checking are few generic techniques which can help to do most of the stuff like split data into test, train sets and then train mode and calculate the scores.

root mean square error

say y is actual data and y-pred is the data predicted, then average sum of squares of error generated between y and y-pred.

  • squaring takes care of counting postive errors and negitive error
  • squaring make sure total error starts making too much noise and we start moving away from solution and decreases as we close to our answer. Excellent alaram.

say error is -4.0, then rms value will be 16.

say error is -3.0 then rms value will be 9.

say error is 0.1 then rms value will be 0.01.

say error is 0.9 then rms value will be 0.81.

Note: Lower the rms/error rate means the better the model is.

K-fold, CrossValidation, GridSearch

  • randomise the test data and splits into multiple into test and train datasets
  • trains the model for a data and tests it according
  • provide score for each of these sets

Generally data sets consists of some noise, which is natural is real world and its difficlut to predict and filter. Training on randomised set helps to know that we are n0t relying that noise to predict data and our model is good.

http://scikit-learn.org/stable/modules/cross_validation.html

Models

  • linear regression
  • logistic regression
  • random forest

overfitting and underfitting

bias and variance

find and uses test & train scores to know if your Model is caught in bias/variance issues.

features selection and engineering

feature engineering

  • null values based creation of columns
  • generated data based on best features behaviour

Written by Sam.