Exploratory data analysis with pandas¶

for pandas see http://pandas.pydata.org/

Pandas Series¶

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series is ndarray-like; Series is dict-like
The axis labels are collectively referred to as the index.

http://pandas.pydata.org/pandas-docs/stable/dsintro.html
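
As a minimal sketch of what "ndarray-like" and "dict-like" mean in practice, the following uses made-up values and labels (they are assumptions for illustration, not part of the Titanic data):

```python
import pandas as pd

# Illustrative values and labels (not taken from the Titanic data)
s = pd.Series([2, 4, 3], index=["a", "b", "c"])

# ndarray-like: arithmetic and boolean masks apply element-wise
doubled = s * 2
large = s[s > 2]

# dict-like: look up a value by its label and test label membership
value_b = s["b"]
has_c = "c" in s

print(doubled.tolist())      # [4, 8, 6]
print(large.index.tolist())  # ['b', 'c']
print(value_b, has_c)        # 4 True
```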

In [3]:

import pandas as pd
data = [2,4,3,4,4]
index = range(5)
s = pd.Series(data)  # optionally pass labels: pd.Series(data, index=index)
s

Out[3]:

0    2
1    4
2    3
3    4
4    4
dtype: int64

Pandas Data Frames¶

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

see http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
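
A small sketch of "columns of potentially different types", assuming a hypothetical three-row table (the names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature passenger table; each column has its own dtype
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Age": [22.0, 38.0, None],   # None becomes NaN, so the column is float64
    "Survived": [0, 1, 1],
})

print(toy.dtypes)   # one dtype per column
print(toy.shape)    # (3, 3)

# Selecting a single column returns a Series
age = toy["Age"]
print(type(age).__name__)  # Series
```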

In [7]:

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
matplotlib.style.use('ggplot')

In [8]:

# from http://www.analyticsvidhya.com/blog/2014/08/baby-steps-python-performing-exploratory-analysis-python/


# load the data into a pandas DataFrame
df = pd.read_csv("./titanic-train.csv")
df.head(2)
#type(df)

Out[8]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	0	PC 17599	71.2833	C85	C

In [9]:

#generate various summary statistics
df.describe()

Out[9]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

In [11]:

df.Age.hist(bins=20)

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x10e21f650>

Age has 891 - 714 = 177 missing values: the count row of describe() only counts non-null entries.
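
The number of missing values can also be counted directly. A minimal sketch on a toy Series (the values are assumptions chosen for illustration):

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 26.0])

n_missing = ages.isnull().sum()  # number of NaN entries
n_present = ages.count()         # count() skips NaN, like describe()

print(n_missing, n_present, len(ages))  # 2 3 5
```

On the Titanic frame the same idea would read `df.Age.isnull().sum()`.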

We can also see that about 38% of the passengers survived the tragedy. How? The mean of the Survived field is 0.38 (remember, Survived is 1 for those who survived and 0 otherwise), and the mean of a 0/1 column is the fraction of ones.
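
Why the mean of a 0/1 column is a proportion can be checked on a toy example (the five values below are invented):

```python
import pandas as pd

# 1 = survived, 0 = did not survive (toy values)
survived = pd.Series([1, 0, 0, 1, 0])

rate = survived.mean()                      # mean of the indicator column
by_hand = survived.sum() / len(survived)    # survivors divided by passengers

print(rate, by_hand)  # 0.4 0.4
```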

Looking at the percentiles of Pclass, the median (50%) is already 3, the highest class number, so at least half of the passengers travelled in class 3.

The age distribution seems to be in line with expectations. The same holds for SibSp and Parch.

The fare has values of 0, which may indicate free tickets or data errors. At the other extreme, 512 looks like a possible outlier or error.

Levels of measurement¶

Nominal
Ordinal
Interval
Ratio
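
One way to express the first two levels in pandas is the categorical dtype. The sketch below assumes that passenger class is treated as ordinal, with "third" as the lowest rank; that ordering and the string labels are assumptions for illustration. Interval and ratio variables such as Age are ordinary numeric columns.

```python
import pandas as pd

# Nominal: categories without any order
sex = pd.Series(["male", "female", "female"], dtype="category")

# Ordinal: categories with an order (assumed here: third < second < first)
pclass = pd.Series(pd.Categorical(
    ["third", "first", "second"],
    categories=["third", "second", "first"],
    ordered=True,
))

print(sex.cat.ordered)                # False
print(pclass.min(), pclass.max())     # third first
print(pclass.sort_values().tolist())  # ['third', 'second', 'first']
```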

Data distributions:¶

Three kinds of measures are needed to understand a distribution:

Measures of central tendency
Measures of dispersion
Measures describing the shape of the curve

Measures of Central tendency¶

In [53]:

#    Mean: the average
print("Mean of Age:", df['Age'].mean())

#    Median: the value that divides the population into two halves
print("Median of Age:", df['Age'].median())

#    Mode: the most frequent value in a population
print("Mode of Pclass:", df['Pclass'].mode())

# df.Pclass.min()

Mean of Age: 29.6991176471
Median of Age: 28.0
Mode of Pclass: 0    3
dtype: int64
1

Measure of dispersion¶

Range: the difference between the maximum and minimum values in the population
Quartiles: the three values that divide the population into 4 equal subsets (typically referred to as the first, second and third quartile)
Inter-quartile range: the difference between the third quartile (Q3) and the first quartile (Q1). By the definition of quartiles, 50% of the population lies in the inter-quartile range.
Variance: the average of the squared differences from the mean
Standard deviation: the square root of the variance
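
The dispersion measures listed above can be sketched on a toy Series (values chosen so the results are easy to check by hand). Note that pandas `var()` and `std()` default to the sample versions, dividing by n - 1; `ddof=0` gives the population variance that matches the plain "average of squared differences".

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data

value_range = x.max() - x.min()             # 4.0
q1, q2, q3 = x.quantile([0.25, 0.5, 0.75])  # 2.0, 3.0, 4.0
iqr = q3 - q1                               # 2.0

pop_var = x.var(ddof=0)   # 2.0, average of squared differences from the mean
sample_var = x.var()      # 2.5, pandas default divides by n - 1
std = x.std()             # square root of the sample variance

print(value_range, (q1, q2, q3), iqr, pop_var, sample_var, std)
```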

Measures to describe shape of distribution:¶

see http://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm

Skewness: a measure of asymmetry. A negatively skewed curve has a long left tail; a positively skewed curve has a long right tail.
Kurtosis: a measure of “peakedness”. Distributions with higher peaks have positive kurtosis, and flatter ones have negative kurtosis. pandas reports excess kurtosis, so a normal distribution scores 0.
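
The sign conventions can be checked on small made-up samples. This is only a sketch; with so few points the estimates are not meaningful beyond their sign.

```python
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tail = pd.Series([1.0, 1.0, 1.0, 2.0, 10.0])     # long right tail
left_tail = pd.Series([-10.0, -2.0, -1.0, -1.0, -1.0])  # long left tail

print(symmetric.skew())   # approximately 0
print(right_tail.skew())  # positive
print(left_tail.skew())   # negative

# Evenly spread values are flatter than a normal curve: negative excess kurtosis
print(symmetric.kurt())
```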

In [6]:

print("Skew of age: ", df.Age.skew())
print("Kurtosis of age:", df.Age.kurtosis())

Skew of age:  0.389107782301
Kurtosis of age: 0.178274153642

In [7]:

#Returns first n rows
df.head(3)

Out[7]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.9250	NaN	S

In [60]:

# describe() reports the median as the 50% row; it can also be computed directly
df['Age'].median()

Out[60]:

28.0
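
For reference, the 50% row reported by describe() is the 50th percentile, which coincides with median(). A quick check on toy values (an assumption for illustration):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # toy data with one large value

summary = x.describe()
print(summary["50%"], x.median(), x.quantile(0.5))  # 3.0 3.0 3.0
print(x.mean())  # 22.0, pulled up by the large value
```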

In [9]:

df['Sex'].unique()

Out[9]:

array(['male', 'female'], dtype=object)

In [43]:

# Histogram with matplotlib
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)
ax.hist(df['Age'], bins = 15, range = (df['Age'].min(),df['Age'].max()))
# or method hist() of a pandas series
# df['Age'].hist(bins = 10)
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Count of Passengers')
plt.show()

In [19]:

# Restrict the x-range to fares up to 50; use df['Fare'].max() to show the full range
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(df['Fare'], bins=10, range=(df['Fare'].min(), 50.))
plt.title('Fare distribution')
plt.xlabel('Fare')
plt.ylabel('Count of Passengers')
plt.show()

In [12]:

df.boxplot(column='Fare', by = 'Pclass')

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x109d19d10>

In [57]:

import seaborn as sns
sns.jointplot(x="Age", y="Fare", data=df, kind='reg')

Out[57]:

<seaborn.axisgrid.JointGrid at 0x10c60cb90>

In [59]:

sns.lmplot(x="Age", y="Fare", data=df, col="Pclass")

Out[59]:

<seaborn.axisgrid.FacetGrid at 0x10d7e8b90>

In [33]:

## Violin plot

import seaborn as sns
# Drop rows with a missing Age before plotting
df_nn = df[pd.notnull(df['Age'])]
sns.violinplot(x='Sex', y='Age', data=df_nn, cut=0.)  # Age distribution per sex
sns.despine()

In [14]:

fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.axis("equal")
plt.title("Pclass distribution")
pgroup = df.groupby(['Pclass'])
pcounts = pgroup.PassengerId.count()
pcounts.name = "Number of Passengers per Class"
pcounts.plot(kind='pie', autopct="%1.1f%%", ax=ax)

Out[14]:

<matplotlib.axes._subplots.AxesSubplot at 0x109d40e50>

In [15]:

temp1 = df.groupby('Pclass').Survived.count()
# or temp1 = df['Survived'].groupby(df['Pclass']).count()

temp2 = df.groupby('Pclass').Survived.mean()  # share of survivors per class (sum / count)
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Pclass')
ax1.set_ylabel('Count of Passengers')
ax1.set_title("Passengers by Pclass")
temp1.plot(kind='bar')

ax2 = fig.add_subplot(122)
temp2.plot(kind = 'bar')
ax2.set_xlabel('Pclass')
ax2.set_ylabel('Probability of Survival')
ax2.set_title("Probability of survival by class")

Out[15]:

<matplotlib.text.Text at 0x10b96ec10>

In [16]:

# binning can be done with pandas directly
# Categorical
age_bins = pd.cut(df.Age, 8, precision=0)
groups = df.groupby(age_bins) 
groups.PassengerId.count().plot(kind='bar')
temp = groups.Survived.mean()
temp.plot(kind='bar')
group_names = ['kids', 'youth', 'adults', 'seniors']
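# include_lowest=True keeps the youngest passenger (Age == df.Age.min()) in the first bin
age_bins = pd.cut(df.Age, [df.Age.min(), 14, 22, 55, df.Age.max()], labels = group_names, include_lowest=True)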
groups = df.groupby(age_bins)
temp = groups.Survived.mean()
temp.plot(kind='bar')

Out[16]:

<matplotlib.axes._subplots.AxesSubplot at 0x10b93b750>
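
One detail of pd.cut with explicit edges is worth keeping in mind: by default the intervals are open on the left, so a value equal to the first edge is not assigned to any bin. A sketch with toy ages and a reduced, hypothetical set of labels:

```python
import pandas as pd

ages = pd.Series([0.5, 10.0, 30.0, 70.0])   # toy values
edges = [ages.min(), 14, 55, ages.max()]
labels = ["kids", "adults", "seniors"]      # hypothetical group names

default = pd.cut(ages, edges, labels=labels)
inclusive = pd.cut(ages, edges, labels=labels, include_lowest=True)

print(default.isnull().tolist())  # [True, False, False, False]
print(inclusive.tolist())         # ['kids', 'kids', 'adults', 'seniors']
```

With the default settings the youngest entry ends up as NaN and silently drops out of any later groupby.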

In [17]:

# to discretize a variable into equal-sized buckets based on rank or sample quantiles, see qcut
#pd.qcut?
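
The difference between pd.cut (equal-width bins) and pd.qcut (equal-count bins) can be sketched on a toy Series with one extreme value; the numbers are invented for illustration:

```python
import pandas as pd

ages = pd.Series([1, 2, 3, 4, 5, 6, 7, 80])  # toy values with one outlier

equal_width = pd.cut(ages, 4)    # 4 bins of equal width over the value range
equal_count = pd.qcut(ages, 4)   # 4 bins holding (roughly) equal numbers of values

print(equal_width.value_counts().sort_index().tolist())  # [7, 0, 0, 1]
print(equal_count.value_counts().sort_index().tolist())  # [2, 2, 2, 2]
```

With skewed data, equal-width bins can leave most bins nearly empty, which is where quantile-based bins may be the more useful choice.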

In [18]:

## binning with numpy
#bins = np.round(np.linspace(0., df.Age.max(), 10))
#age_bins = pd.Series(np.digitize(df.Age, bins))
##the last bin is here nan
#age_bins[age_bins == age_bins.max()] = 'NaN'
#for i in range(len(bins)-1):
#  age_bins[age_bins == i+1] = "{0:2.0f}-{1:2.0f}".format(bins[i], bins[i+1])
#groups = df.groupby(age_bins)
#temp = groups.Survived.mean()
#temp.plot(kind='bar')

In [47]:

data = df.groupby([age_bins,'Sex']).Survived.mean()
# Note: data has a hierarchical index (MultiIndex)
data

fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(111)
ax.set_xlabel('Age')
ax.set_ylabel('Probability of Survival')
ax.set_title("Probability of survival by age and sex")
data.unstack(level=1).plot(kind='bar', subplots=False, ax=ax)

Out[47]:

<matplotlib.axes._subplots.AxesSubplot at 0x10d3f60d0>
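
What the hierarchical index and unstack() do can be seen on a small made-up table (the column `AgeGroup` and all values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "AgeGroup": ["kids", "kids", "adults", "adults", "adults"],
    "Sex": ["male", "female", "male", "female", "female"],
    "Survived": [1, 1, 0, 1, 0],
})

# Grouping by two keys yields a Series with a two-level index
rates = toy.groupby(["AgeGroup", "Sex"]).Survived.mean()
print(rates.index.nlevels)  # 2

# unstack(level=1) moves the second level (Sex) into the columns
wide = rates.unstack(level=1)
print(wide.shape)                    # (2, 2)
print(wide.loc["adults", "female"])  # 0.5
```

The wide form has one row per age group and one column per sex, which is why the bar plot draws two bars per group.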

In [37]:

var = df.groupby(['Sex','Survived']).PassengerId.count()
ax = var.unstack().plot(kind='bar',stacked=True,  color=['red','blue'], grid=False)
ax.set_xlabel('Sex')
ax.set_ylabel('Number of Passengers')

Out[37]:

<matplotlib.text.Text at 0x10ec01d90>

In [39]:

from statsmodels.graphics.mosaicplot import mosaic

_ = mosaic(df, ['Survived', 'Sex', 'Pclass'])
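
A mosaic plot visualizes a contingency table; the underlying numbers can be obtained with pd.crosstab. A minimal sketch on invented data with two of the variables:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1],
})

# Absolute counts per combination
counts = pd.crosstab(toy["Sex"], toy["Survived"])
# Shares within each row (each sex sums to 1)
row_share = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")

print(counts)
print(row_share)
```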

In [27]:

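# Note: NaN > 14. is False, so passengers with a missing Age are labelled "child" here
df['Adult'] = df["Age"].apply(lambda age: "adult" if age >14. else "child")
# or, as a boolean column: df['Adult'] = df["Age"] > 14.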
_ = mosaic(df, ['Survived', 'Sex', 'Pclass', 'Adult'])
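
Comparisons with NaN evaluate to False, so a threshold test like the one above puts unknown ages into the "else" branch. A possible way to keep them missing instead, sketched on toy values:

```python
import numpy as np
import pandas as pd

ages = pd.Series([30.0, 5.0, np.nan])  # toy values, one age unknown

naive = ages.apply(lambda age: "adult" if age > 14. else "child")
# where() keeps the label only where the age is known, NaN otherwise
labelled = naive.where(ages.notnull())

print(naive.tolist())              # ['adult', 'child', 'child']
print(labelled.isnull().tolist())  # [False, False, True]
```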

In [22]:

# Probability of survival for a woman in 3rd class
df[(df['Sex']=='female') & (df['Pclass']==3)].Survived.mean()

Out[22]:

0.5

In [28]:

# for more plots with pandas, see http://pandas.pydata.org/pandas-docs/stable/visualization.html

In [48]:

# Some simple string handling for getting the salutations
salutation = df.Name.apply(lambda w: w.split(',')[1].split('.')[0].strip())

# df.Salutation = ... does not work here, because the column does not exist yet;
# new columns must be created with df['Salutation'] = ...
df['Salutation'] = salutation
df.groupby('Salutation').PassengerId.count()
# use 'Other' for the low-count salutations; df.loc assigns in place and avoids
# the SettingWithCopyWarning raised by chained indexing
common = ['Mr', 'Mrs', 'Miss', 'Master']
df.loc[~df.Salutation.isin(common), 'Salutation'] = 'Other'
df.boxplot(column='Age', by = 'Salutation')

Out[48]:

<matplotlib.axes._subplots.AxesSubplot at 0x109d4fa10>
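
The salutation handling can be tried in isolation on a few names. The first two names appear in the data shown earlier; "Doe, Dr. Jane" is a hypothetical entry added to trigger the 'Other' case. The sketch uses `isin` with `df.loc`, which is one way to assign without chained indexing:

```python
import pandas as pd

toy = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, Dr. Jane",            # hypothetical name, not from the data set
]})

# Text between the comma and the first period is the salutation
toy["Salutation"] = toy.Name.apply(lambda w: w.split(",")[1].split(".")[0].strip())

# Collapse rare salutations into 'Other' with a single .loc assignment
common = ["Mr", "Mrs", "Miss", "Master"]
toy.loc[~toy.Salutation.isin(common), "Salutation"] = "Other"

print(toy.Salutation.tolist())  # ['Mr', 'Miss', 'Other']
```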

In [31]:

################
# Fill missing values with pandas (alternative: a scikit-learn imputer)
# Do this only after the exploratory data analysis!

medianAge = df.Age.median()
df.Age = df.Age.fillna(medianAge)
##################
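
The effect of median imputation can be sketched on a toy Series (values invented for illustration). It removes the missing entries but also concentrates values at the median, which is one reason to run it only after looking at the raw distribution:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 30.0, 40.0])  # toy values

median_age = ages.median()        # NaN is skipped: median of 20, 30, 40
filled = ages.fillna(median_age)  # returns a new Series, ages is unchanged

print(median_age)               # 30.0
print(filled.tolist())          # [20.0, 30.0, 30.0, 40.0]
print(filled.isnull().sum())    # 0
```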