Back to basics with Pandas
In the spirit of our last post, let’s take things back to basics with a simple end-to-end data analysis example.
Back to basics with pandas, plotnine, and sklearn
In this colab we do a quick review of the ‘data science basics’:
- Download a (small) dataset from an online repository.
- Use pandas to store and manipulate the data.
- Use plotnine to generate plots/visualizations.
- Use scikit-learn to fit a simple logistic regression model.
# @title Imports
import numpy as np
import pandas as pd
import plotnine as gg
import requests
import sklearn.linear_model
Let there be data
Let’s fetch the famous Iris dataset from the UCI machine learning datasets repository.
# The URL for the raw data in text form.
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# Column names/metadata.
feature_columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
target_column = 'target'
# Fetch the raw data from the website.
response = requests.get(url=data_url)
rows = response.text.split('\n')
data = [r.split(',') for r in rows]
# Turn it into a dataframe and drop any null values.
df = pd.DataFrame(data, columns=feature_columns + [target_column])
df.dropna(inplace=True)
# Fix up the data types: features are numeric, targets are categorical.
df[feature_columns] = df[feature_columns].astype('float')
df[target_column] = df[target_column].astype('category')
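As an aside, the same dataframe can be built in one step with pandas' own CSV reader, which also parses the numeric columns for us; a minimal equivalent sketch:
# Equivalent shortcut: read the CSV straight from the URL.
# The raw file has no header row, so we pass the column names explicitly.
df = pd.read_csv(data_url, header=None,
                 names=feature_columns + [target_column])
df.dropna(inplace=True)
df[target_column] = df[target_column].astype('category')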
Take a quick look at the data
# Look at the raw data to check it looks sensible.
df.head()
|   | sepal_length | sepal_width | petal_length | petal_width | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
# Look at some summary statistics.
df.describe()
|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
# 6.1 KB ... now that's some #bigdata ;)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length 150 non-null float64
sepal_width 150 non-null float64
petal_length 150 non-null float64
petal_width 150 non-null float64
target 150 non-null category
dtypes: category(1), float64(4)
memory usage: 6.1 KB
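Since we're reviewing pandas anyway, per-class summaries are a one-liner too; a minimal sketch (output omitted):
# Per-class feature means: a quick sanity check that the species really do differ.
df.groupby('target')[feature_columns].mean()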
# Let's take a visual look at the data now, projecting down to 2 dimensions.
gg.theme_set(gg.theme_bw())
p = (gg.ggplot(df)
+ gg.aes(x='sepal_length', y='petal_length', color='target')
+ gg.geom_point(size=2)
+ gg.ggtitle('Iris dataset is not linearly separable')
)
p
Fitting a simple linear model
Let’s get ‘old school’ and fit a simple logistic regression classifier using the raw features.
In the binary case, this is a linear probabilistic model with Bernoulli likelihood given by
\[\mathcal{L}\left(y \lvert \pmb{x}, \pmb{w}\right) = f(\pmb{x})^y (1 - f(\pmb{x}))^{1-y},\]where \(f(\pmb{x}) = \sigma\left(\pmb{w}^T\pmb{x}\right)\).
Assuming the dataset \(\mathcal{D}=\left\{\pmb{x}^{(i)}, y^{(i)}\right\}_{i=1}^{n}\) is iid, the log-likelihood is
\[\begin{aligned} \log\mathcal{L}\left(\mathcal{D} \lvert \pmb{w}\right) &= \log\prod_{i=1}^{n} \sigma\left(\pmb{w}^T\pmb{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\pmb{w}^T\pmb{x}^{(i)}\right)\right)^{1-y^{(i)}} \\ &= \sum_{i=1}^n y^{(i)}\log f\left(\pmb{x}^{(i)}\right) + (1-y^{(i)})\log\left(1 - f\left(\pmb{x}^{(i)}\right)\right) \end{aligned}\]
Negating this sum gives the familiar cross-entropy loss: a convex and trivially differentiable objective that we can optimize with gradient-based methods. We'll defer this to scikit-learn (which also does some other tricks + regularization).
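To make the loss concrete, here is a minimal numpy sketch of the binary case; the weight vector `w` and the binary-labelled arrays `X_bin`, `y_bin` are hypothetical stand-ins, not part of the pipeline below:
def sigmoid(z):
  return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, X_bin, y_bin):
  # f(x) = sigma(w^T x), evaluated for every example in one matrix product.
  f = sigmoid(X_bin @ w)
  # The negated log-likelihood sum from the derivation above.
  return -np.sum(y_bin * np.log(f) + (1 - y_bin) * np.log(1 - f))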
# Create a logistic regression model.
model = sklearn.linear_model.LogisticRegression(solver='lbfgs',
multi_class='multinomial')
# Extract the features and targets as numpy arrays.
X = df[feature_columns].values
y = df[target_column].cat.codes # Note that we 'code' the categories as integers.
# Fit the model.
model.fit(X, y)
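If you're curious about what was learned, the fitted weights can be inspected directly; a minimal sketch (for the multinomial model, coef_ holds one row of weights per class and one column per feature):
# One row per class, one column per feature; plus a per-class intercept.
print(model.coef_.shape)       # (3, 4)
print(model.intercept_.shape)  # (3,)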
# Compute the model accuracy over the training set (we don't have a test set).
pred_df = df.copy()
pred_df['prediction'] = model.predict(X).astype('int8')
pred_df['correct'] = pred_df.prediction == pred_df.target.cat.codes
accuracy = pred_df.correct.sum() / len(pred_df)
print('Accuracy: {:.3f}'.format(accuracy))
Accuracy: 0.973
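Accuracy alone doesn't say which species get confused for which; a quick cross-tabulation of true against predicted labels does. A minimal sketch using the `pred_df` built above:
# Rows are the true species; columns are the predicted (integer-coded) classes.
pd.crosstab(pred_df.target, pred_df.prediction)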
# Let's look again at the data, highlighting the examples which we mis-classified.
p = (gg.ggplot(pred_df)
+ gg.aes(x='sepal_length', y='petal_length', shape='target', color='correct')
+ gg.geom_point(size=2)
)
p