title: Intermediate Machine Learning with scikit-learn: Cross validation, Parameter Tuning, Pandas Interoperability, and Missing Values
use_katex: True
class: title-slide

# Intermediate Machine Learning with scikit-learn
## Cross validation, Parameter Tuning, Pandas Interoperability, and Missing Values

.larger[Thomas J. Fan]
@thomasjpfan
This workshop on GitHub: github.com/thomasjpfan/ml-workshop-intermediate-1-of-2
---

name: table-of-contents
class: title-slide, left

# Table of Contents

.g[
.g-6[
1. [Cross Validation](#validation)
1. [Parameter Tuning](#parameter-tuning)
1. [Missing Values](#missing-values)
1. [Pandas Interoperability](#pandas)
]
.g-6.g-center[
]
]

---

# Scikit-learn API

.center[
## `estimator.fit(X, [y])`
]

.g[
.g-6[
## `estimator.predict`

- Classification
- Regression
- Clustering
]
.g-6[
## `estimator.transform`

- Preprocessing
- Dimensionality reduction
- Feature selection
- Feature extraction
]
]

---

# Data Representation

---

# Supervised ML Workflow

---

class: chapter-slide

# Notebook 📒!
## notebooks/00-review-sklearn.ipynb

---

name: validation
class: chapter-slide

# 1. Cross Validation

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

# Single train test split

---

# Three Fold Split

---

# Why cross validate?

---

# Can we do better?

---

class: chapter-slide

# Notebook 📓!
## notebooks/01-cross-validation.ipynb

---

class: chapter-slide

# Cross Validation Strategies

---

# Strategies for increasing the number of folds

- `LeaveOneOut`: high variance, takes a long time

```py
from sklearn.model_selection import LeaveOneOut
```

- `ShuffleSplit` with stratification

```py
from sklearn.model_selection import StratifiedShuffleSplit
```

- `RepeatedKFold` or `RepeatedStratifiedKFold`

```py
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
```

---

# Cross-validation with non-iid data

## Grouped data

- Data is grouped, for example by patient ID or user ID
- We want to generalize to a new patient

## Time Series

- Observations close in time are correlated

---

class: chapter-slide

# Notebook 📓!
## notebooks/01-cross-validation.ipynb

---

name: parameter-tuning
class: chapter-slide

# 2. Parameter Tuning

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

class: center

# Why Tune Parameters?

---

# Score vs n_neighbors

---

# Parameter Tuning Workflow

---

# GridSearchCV

```py
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_neighbors': np.arange(1, 30, 2)}

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid=param_grid,
                    return_train_score=True)
grid.fit(X_train, y_train)
```

Best score

```py
grid.best_score_
```

Best parameters

```py
grid.best_params_
```

---

# Random Search

---

# RandomizedSearchCV with scikit-learn

```py
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "max_depth": randint(3, 9),
    "max_features": randint(1, 11)
}

# clf is the estimator to tune, e.g. a random forest
random_search = RandomizedSearchCV(
    clf,
    param_distributions=param_dist,
    n_iter=20
)
```

- Values in `param_distributions` can be a list or a distribution object from the `scipy.stats` module

---

# Successive Halving

```python
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.model_selection import HalvingGridSearchCV
```

???
The search strategy starts evaluating all the candidates with a small amount of resources and iteratively selects the best candidates, using more and more resources.
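---

# Successive Halving: usage sketch

A minimal sketch of running a halving search, assuming a random forest classifier and the `param_dist` ranges from the `RandomizedSearchCV` slide; the estimator choice and `X_train`, `y_train` are illustrative assumptions, not prescribed by the original slides.

```py
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV

param_dist = {"max_depth": randint(3, 9), "max_features": randint(1, 11)}

# Starts by evaluating candidates on a small number of samples and
# keeps only the best candidates for the next, larger round
halving_search = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    random_state=0,
)
halving_search.fit(X_train, y_train)
halving_search.best_params_
```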
---

class: chapter-slide

# Notebook 📓!
## notebooks/02-parameter-tuning.ipynb

---

name: missing-values
class: chapter-slide

# 3. Missing Values

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

# Imputers in scikit-learn

## Impute module

```py
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer

# `add_indicator=True` adds a binary missing-indicator column
imputer = SimpleImputer(add_indicator=True)

# IterativeImputer is experimental and requires an explicit opt-in
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
```

---

# Comparing the different methods

---

# Estimators with native support

## Histogram-based Gradient Boosting Regression Trees

- Based on the LightGBM implementation
- Native support for missing values

```py
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.ensemble import HistGradientBoostingRegressor
```

---

class: chapter-slide

# Notebook 📔!
## notebooks/03-missing-values.ipynb

---

name: pandas
class: chapter-slide

# 4. Pandas Interoperability

.footnote-back[
[Back to Table of Contents](#table-of-contents)
]

---

# Categorical Data

## Examples of categories:

- `['Manhattan', 'Queens', 'Brooklyn', 'Bronx']`
- `['dog', 'cat', 'mouse']`

## Scikit-learn Encoders

`OrdinalEncoder`: Encodes each category as an integer

```py
from sklearn.preprocessing import OrdinalEncoder
```

`OneHotEncoder`: Encodes each category as a one-hot (binary) vector

```py
from sklearn.preprocessing import OneHotEncoder
```
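---

# Encoder example: sketch

A minimal sketch of both encoders on a made-up column; the `borough` data below is invented for illustration, and `sparse=False` matches older scikit-learn versions (newer ones spell it `sparse_output=False`).

```py
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data
df = pd.DataFrame({"borough": ["Manhattan", "Queens", "Manhattan", "Bronx"]})

# One integer per category (sorted: Bronx=0, Manhattan=1, Queens=2)
OrdinalEncoder().fit_transform(df)
# array([[1.], [2.], [1.], [0.]])

# One binary column per category
OneHotEncoder(sparse=False).fit_transform(df)
# array([[0., 1., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [1., 0., 0.]])
```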
---

# Heterogeneous data

## Example: Titanic Dataset

|   | pclass | sex    | age     | sibsp | parch | fare     | embarked | body  |
|---|--------|--------|---------|-------|-------|----------|----------|-------|
| 0 | 1.0    | female | 29.0000 | 0.0   | 0.0   | 211.3375 | S        | NaN   |
| 1 | 1.0    | male   | 0.9167  | 1.0   | 2.0   | 151.5500 | S        | NaN   |
| 2 | 1.0    | female | 2.0000  | 1.0   | 2.0   | 151.5500 | S        | NaN   |
| 3 | 1.0    | male   | 30.0000 | 1.0   | 2.0   | 151.5500 | S        | 135.0 |
| 4 | 1.0    | female | 25.0000 | 1.0   | 2.0   | 151.5500 | S        | NaN   |

---

# scikit-learn's ColumnTransformer
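A minimal sketch for the Titanic columns above, one-hot encoding the categorical columns and imputing the numerical ones; the split into `categorical` and `numerical` is an assumption based on the sample rows, not part of the original slide.

```py
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ["pclass", "sex", "embarked"]
numerical = ["age", "sibsp", "parch", "fare", "body"]

ct = ColumnTransformer([
    # Fill missing categories, then one-hot encode
    ("cat", make_pipeline(SimpleImputer(strategy="most_frequent"),
                          OneHotEncoder(handle_unknown="ignore")),
     categorical),
    # Impute numerical columns and flag which values were missing
    ("num", SimpleImputer(add_indicator=True), numerical),
])

# X is a pandas DataFrame with the columns above
X_trans = ct.fit_transform(X)
```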
---

class: chapter-slide

# Notebook 📔!
## notebooks/04-pandas-interoperability.ipynb

---

class: title-slide, left

# Closing

.g.g-middle[
.g-7[
1. [Cross Validation](#validation)
1. [Parameter Tuning](#parameter-tuning)
1. [Missing Values](#missing-values)
1. [Pandas Interoperability](#pandas)
]
.g-5.center[
.larger[Thomas J. Fan]
@thomasjpfan
This workshop on GitHub: github.com/thomasjpfan/ml-workshop-intermediate-1-of-2
] ]