Titanic

Note: This is still in draft state.

 

Goal: To learn feature engineering and other extremely cool techniques that have been shared on Kaggle.com.

Note: This page is not a copy-paste or replication, but a summary of things I have noticed from these Kagglers.

  • using the Title extracted from Name for predicting Age – Master, Mr, Mrs, Miss, Captain, Officer, …
  • use of Title, Sex, Pclass for predicting Age (see the sketch below)

Tribute to those awesome programmers.
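A minimal sketch of that imputation idea, assuming the standard train.csv; the Title column and the regular expression are my own illustration, not taken from any particular kernel:

import pandas as pd

train = pd.read_csv("../input/train.csv")

# Pull the title (Mr, Mrs, Miss, Master, ...) out of the Name column,
# e.g. "Braund, Mr. Owen Harris" -> "Mr"
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fill missing ages with the median age of passengers sharing the same Title, Sex and Pclass
train["Age"] = train.groupby(["Title", "Sex", "Pclass"])["Age"].transform(
    lambda ages: ages.fillna(ages.median()))

# Fall back to the overall median for any group that had no known ages at all
train["Age"] = train["Age"].fillna(train["Age"].median())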

 

# Decision Tree Visualisation and Submission

Here you can see how a decision tree model is built, visualised, and turned into a submission:

https://www.kaggle.com/yildirimarda/titanic/titanic-test3/output
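A minimal sketch of that kind of fit-and-submit workflow, assuming the standard train.csv / test.csv and a few already-clean numeric features (the linked kernel's features and preprocessing differ):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

# Pclass, SibSp and Parch have no missing values in either file
features = ["Pclass", "SibSp", "Parch"]

# A shallow tree to keep the tiny Titanic training set from being overfit
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(train[features], train["Survived"])

# Kaggle expects a two-column CSV: PassengerId, Survived
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": clf.predict(test[features]),
})
submission.to_csv("submission.csv", index=False)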

# How to visualise Tree Graphs

from io import StringIO
from IPython.display import Image
from sklearn import tree
from sklearn.datasets import load_iris
import pydot

# Fit a tree on the iris example data, export it as DOT, then render the PNG inline
iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True, rounded=True,
                     special_characters=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())  # newer pydot versions return a list: take [0]
Image(graph.create_png())
Source: http://scikit-learn.org/stable/modules/tree.html

 

# How to check correlation between columns with respect to Survival

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("../input/train.csv", dtype={"Age": np.float64})

# Replace missing ages with the median age (Series.median ignores NaNs)
train["Age"] = train["Age"].fillna(train["Age"].median())

# Recode the target as labels so the pairplot legend is readable
train["Survived"] = train["Survived"].map({1: "Survived", 0: "Died"})
train["ParentsAndChildren"] = train["Parch"]
train["SiblingsAndSpouses"] = train["SibSp"]

plt.figure()
sns.pairplot(data=train[["Fare", "Survived", "Age", "ParentsAndChildren", "SiblingsAndSpouses", "Pclass"]],
             hue="Survived", dropna=True)

 

Source: https://www.kaggle.com/benhamner/titanic/python-seaborn-pairplot-example/output

 

# How lucky is your name?

Well, sometimes you happen to come from a high-authority family, and that alone could save your life.

https://www.kaggle.com/anthonyg/titanic/lucky-names/code
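A minimal sketch of that kind of check, assuming a freshly loaded train.csv; grouping by the title extracted from Name is my own way of illustrating the "high authority" idea, not the linked kernel's exact method:

import pandas as pd

train = pd.read_csv("../input/train.csv")
train["Title"] = train["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Survival rate and group size per title; high-status titles tend to stand out
survival_by_title = (train.groupby("Title")["Survived"]
                          .agg(["mean", "count"])
                          .sort_values("mean", ascending=False))
print(survival_by_title)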

 

# How to use “is in a list” (isin) in pandas

# Pull out the passengers that have popular first names (more than 10 occurrences)

firstname_counts = dfTitanic['FirstName'].value_counts()
popular_firstnames = firstname_counts[firstname_counts > 10].index

dfPassengersWithPopularNames = dfTitanic[dfTitanic['FirstName'].isin(popular_firstnames)]
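Note that FirstName is not a column in the raw Kaggle CSV; it has to be derived from Name first. A minimal sketch of one way to do that (the linked kernel's parsing may differ, and married women's records would pick up the husband's first name here):

import pandas as pd

dfTitanic = pd.read_csv("../input/train.csv")

# "Braund, Mr. Owen Harris" -> first word after the title's period -> "Owen"
dfTitanic['FirstName'] = dfTitanic['Name'].str.extract(r"\.\s*([A-Za-z]+)", expand=False)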

 

# How to XGBoost your solution?

https://www.kaggle.com/cbrogan/titanic/xgboost-example-python/code
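A minimal sketch of plugging XGBoost into the same fit-and-submit flow, assuming the standard CSVs and a few numeric features (the linked kernel's feature engineering is more involved):

import pandas as pd
import xgboost as xgb

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

# XGBoost handles the missing values in Age and Fare natively
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]

model = xgb.XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.05)
model.fit(train[features], train["Survived"])

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": model.predict(test[features]),
})
submission.to_csv("xgboost_submission.csv", index=False)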

 

Suggestion:

I see there are lots of interesting questions to ask and interesting findings to dig out of the data.

  • distplot/hist of feature values plays an important role.
  • Sometimes a few columns are inter-dependent, and we can use that for guessing missing values.
  • Missing values can also mean something – e.g. a new feature like no.of_nulls (see the sketch after this list).
  • Check for hidden data in Object-type features.
  • If a combination of columns shows something important, create a new feature from it.
  • Not every new feature necessarily adds significant value.
  • Many features are nice, but having features that actually contribute is more important.
  • Failures are stepping stones to success. Kill the logic that leads to failure, and keep trying.
  • ASK WHY for everything.
    • What are these features, and why are they good?
    • What story can we make up?
    • What more can we cook up?
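A minimal sketch of a few of those ideas – a per-row null count as its own feature, pulling hidden structure out of an Object column, and a quick histogram of a feature split by survival – assuming the standard train.csv (the feature names are my own illustration):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train = pd.read_csv("../input/train.csv")

# "No value" can itself carry signal: count the nulls in each row as a feature
train["no.of_nulls"] = train.isnull().sum(axis=1)

# Object-type columns often hide structure, e.g. the deck letter inside Cabin
train["Deck"] = train["Cabin"].str[0]

# Quick look at how a feature's distribution differs between the two classes
sns.histplot(data=train, x="Age", hue="Survived", stat="density", common_norm=False)
plt.show()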

 

 

 

 
