title: Introduction to scikit-learn: Machine Learning in Python
use_katex: True
class: title-slide

# Introduction to scikit-learn: Machine Learning in Python

.larger[Thomas J. Fan]
@thomasjpfan
This workshop on GitHub: github.com/thomasjpfan/ml-workshop-intro-v2
???

## Links

- https://scikit-learn.org/stable/
- https://github.com/thomasjpfan/ml-workshop-intro

---

name: table-of-contents
class: title-slide, left

# Table of Contents

.g[
.g-6[
1. [Introduction to Machine Learning](#introduction)
1. [Supervised Learning](#supervised)
1. [Preprocessing](#preprocessing)
1. [Pipelines](#pipelines)
1. [Pandas Output](#pandas-output)
]
.g-6.g-center[
]
]

---

name: introduction
class: chapter-slide

# 1. Introduction to Machine Learning

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

class: chapter-slide

# What is machine learning?

---

class: middle

# Traditional programming

## Prediction

---

class: middle

# Machine Learning

## Training

## Prediction

---

class: center

# Amazon Recommendations

---

class: center

# Higgs Boson

.footnote[
[Machine Learning Wins the Higgs Challenge](https://atlas.cern/updates/atlas-news/machine-learning-wins-higgs-challenge)
]

---

.footnote[
[Link to Source](https://www.broadinstitute.org/news/deep-learning-model-assesses-quality-stored-blood)
]

---

class: middle

# Types of Machine Learning

- Unsupervised Learning
- Reinforcement Learning
- Supervised Learning

---

# Unsupervised Learning

.footnote[
[Link to Source](https://scikit-learn.org/dev/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py)
]

---

# Reinforcement Learning

---

# Reinforcement Learning

.footnote[
[Link to Source](https://arxiv.org/abs/1912.06680)
]

---

name: supervised
class: chapter-slide

# 2. Supervised Learning

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

# Supervised Learning

$$
(x_i, y_i) \sim p(x, y) \text{ i.i.d.}
$$

- $p$ is an unknown joint distribution
- i.i.d. means independent and identically distributed

$$x_i \in \mathbb{R}^p$$
$$y_i \in \mathbb{R}$$

## Goal during training

$$f(x_i) \approx y_i$$

---

# Generalization

## Goal during training

$$f(x_i) \approx y_i$$

## Generalization

$$f(x) \approx y$$

for *non-training data* $x$

---

class: middle

# Classification and Regression

.g[
.g-6[
## Classification

- target $y$ is discrete
- Does the patient have cancer?
]
.g-6[
## Regression

- target $y$ is continuous
- What is the price of the home?
]
]

---

# Data Representation

---

# Loading Datasets

## Random datasets

```py
from sklearn.datasets import make_classification
from sklearn.datasets import make_regression
```

## Sample datasets

```py
from sklearn.datasets import load_breast_cancer
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_iris
from sklearn.datasets import load_wine
```

## OpenML

```py
from sklearn.datasets import fetch_openml
```

---

# Splitting Training and Test Data
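---

# Splitting in Code

A minimal sketch of what the split looks like in code. The dataset and options here are illustrative choices, not part of the workshop notebooks:

```py
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# Hold out 25% of the samples for evaluation; stratify so that
# class proportions stay similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
```

Passing `random_state` makes the split reproducible from run to run.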
---

class: chapter-slide

# Notebook 📒!
## notebooks/01-loading-data.ipynb

---

# Supervised ML Workflow

---

# Supervised ML Workflow

---

class: chapter-slide

# Notebook 📓!
## notebooks/02-supervised-learning.ipynb

---

name: preprocessing
class: chapter-slide

# 3. Preprocessing

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

# Housing Dataset

---

# Feature Ranges

---

# KNN Scaling

---

# KNN Scaling Decision Boundary

---

class: chapter-slide

# Notebook 📕!
## notebooks/03-preprocessing.ipynb

---

# Scikit-Learn API

.center[
## `estimator.fit(X, [y])`
]

.g[
.g-6[
## `estimator.predict`

- Classification
- Regression
- Clustering
]
.g-6[
## `estimator.transform`

- Preprocessing
- Dimensionality reduction
- Feature selection
- Feature extraction
]
]
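---

# The Estimator API in Practice

A sketch of how `fit`, `transform`, and `predict` chain together; the dataset and estimators are illustrative choices, not taken from the notebooks:

```py
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Transformer: fit on training data only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Predictor: fit on the scaled training data, then predict on new data
knn = KNeighborsClassifier().fit(X_train_scaled, y_train)
knn.predict(X_test_scaled)
```

This manual scale-then-fit bookkeeping is exactly what the next chapter's pipelines automate.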
---

name: pipelines
class: chapter-slide

# 4. Pipelines

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

# Why Pipelines?

- Preprocessing must be fitted on training data only!

## Bad

```py
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)  # refits on the test set!
```

## Good

```py
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the training fit
```

---

# Pipeline Example

## Before

```py
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
est = Ridge().fit(X_train_scaled, y_train)

# Evaluate on test data
X_test_scaled = scaler.transform(X_test)
est.score(X_test_scaled, y_test)
```

## After

```py
*from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```

---

# Pipeline Overview

---

class: chapter-slide

# Notebook 📕!
## notebooks/04-pipelines.ipynb

---

name: pandas-output
class: chapter-slide

# 5. Pandas Output

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

# `transform` outputs NumPy arrays by default

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform(X_train)
```

---

# Configured for Pandas Output!

```python
*scaler.set_output(transform="pandas")
scaler.fit_transform(X_train)
```

---

class: chapter-slide

# Notebook 📕!
## notebooks/05-pandas-output.ipynb

---

# What's next?

## Intermediate Machine Learning with scikit-learn

- Pandas Interoperability
- Categorical data & Pandas Input
- Parameter Tuning
- Model Evaluation

---

class: title-slide, left

.g.g-middle[
.g-7[
1. [Introduction to Machine Learning](#introduction)
2. [Supervised Learning](#supervised)
3. [Preprocessing](#preprocessing)
4. [Pipelines](#pipelines)
5. [Pandas Output](#pandas-output)
]
.g-5.center[
.larger[Thomas J. Fan]

@thomasjpfan

This workshop on GitHub: github.com/thomasjpfan/ml-workshop-intro-v2
]
]
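---

# Appendix: Putting It All Together

A hedged end-to-end recap chaining the workshop's steps; the dataset, model, and parameters below are illustrative, not taken from the notebooks:

```py
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a regression dataset and hold out a test set
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and model in one pipeline: fitting uses training data only
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.set_output(transform="pandas")  # transformers return DataFrames
pipe.fit(X_train, y_train)

# R^2 of the pipeline on the held-out test set
pipe.score(X_test, y_test)
```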