title: Can There Be Too Much Parallelism?
use_katex: False
class: title-slide

# Can There Be Too Much Parallelism?

.larger[Thomas J. Fan]
@thomasjpfan
This talk on GitHub: thomasjpfan/scipy-2023-too-parallel
---

class: chapter-slide

# Yes

---

class: center

# User?

---

class: center

# Developer?

---

# My Perspective

.g[
.g-6[
## Parallelism in Scikit-learn
- BLAS through SciPy
- OpenMP + Cython
- Python Multi-Threading
- Python Multi-Processing
]
.g-6[
]
]

---

# Scope

.g[
.g-4[
]
.g-4[
]
.g-4[
]
]

---

.center[
# State of Python Parallelism
]

.g.g-middle[
.g-6.larger[
## APIs
## Interactions
## Defaults
]
.g-6[
]
]

---

class: center

# APIs

---

.g.g-middle[
.g-8[
# Environment Variables
.larger[
- **OpenMP**: `OMP_NUM_THREADS`
- **MKL**: `MKL_NUM_THREADS`
- **OpenBLAS**: `OPENBLAS_NUM_THREADS`
]
]
.g-4[
]
]

---

.g.g-middle[
.g-8[
# Environment Variables
.larger[
- **OpenMP**: `OMP_NUM_THREADS`
- **MKL**: `MKL_NUM_THREADS`
- **OpenBLAS**: `OPENBLAS_NUM_THREADS`
- **Polars**: `POLARS_MAX_THREADS`
- **Numba**: `NUMBA_NUM_THREADS`
- **macOS Accelerate**: `VECLIB_MAXIMUM_THREADS`
- **numexpr**: `NUMEXPR_NUM_THREADS`
]
]
.g-4[
]
]

---

# Global Configuration

.g.g-middle[
.g-8[
.larger[
- `torch.set_num_threads`
- `numba.set_num_threads`
- `threadpoolctl.threadpool_limits`
- `cv.setNumThreads`
]
]
.g-4[
]
]

---

# Block Configuration

## `threadpoolctl`

```python
from threadpoolctl import threadpool_limits
import numpy as np

*with threadpool_limits(limits=2):
    a = np.random.randn(1000, 1000)
    a_squared = a @ a
```

---

# Call-site

.g.g-middle[
.g-8[
.larger[
- **scikit-learn**: `n_jobs`
- **SciPy**: `workers`
- **PyTorch DataLoader**: `num_workers`
- **Python**: `max_workers`
]
]
.g-4[
]
]

---

.g.g-middle[
.g-6[
# APIs
.larger[
- Environment Variables
- Global Configuration
- Block Configuration
- Call-site
]
]
.g-6.g-center[
]
]

---

class: top
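# Environment Variables: How a Library Sees Them

To make the environment-variable API concrete, here is a minimal stdlib-only sketch of how a library might resolve its thread count at startup. The helper name `threads_from_env` and the fallback behavior are illustrative, not taken from any of the libraries above:

```python
import os

def threads_from_env(var_name, default=None):
    """Resolve a thread count from an environment variable.

    Falls back to the number of visible CPUs when the variable is
    unset or is not a positive integer.
    """
    if default is None:
        default = os.cpu_count() or 1
    raw = os.environ.get(var_name, "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default

# Honors the variable when it is set...
os.environ["OMP_NUM_THREADS"] = "2"
print(threads_from_env("OMP_NUM_THREADS"))  # 2

# ...and falls back to the CPU count otherwise.
os.environ.pop("OMP_NUM_THREADS", None)
print(threads_from_env("OMP_NUM_THREADS"))
```

This is why the variables must usually be exported before the process starts: the lookup happens once, at initialization.

---

class: top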
# Proposal: Consistent APIs

.g[
.g-6[
## Now
```
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export POLARS_MAX_THREADS=1
export NUMEXPR_NUM_THREADS=1
```
]
.g-6[
]
]

---

class: top
# Proposal: Consistent APIs

.g[
.g-6[
## Now
```
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export POLARS_MAX_THREADS=1
export NUMEXPR_NUM_THREADS=1
```
]
.g-6[
## Future

### Pragmatic
```
export OMP_NUM_THREADS=1
```
]
]

---

class: top
# Proposal: Consistent APIs

.g[
.g-6[
## Now
```
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export POLARS_MAX_THREADS=1
export NUMEXPR_NUM_THREADS=1
```
]
.g-6[
## Future

### Pragmatic
```
export OMP_NUM_THREADS=1
```

### Better
```
export GOTO_NUM_THREADS=1
```
]
]

---

# Proposal

## Recognize more threadpools in `threadpoolctl`

.center[
]

---

# Proposal

.g[
.g-6[
## Now
- **scikit-learn**: `n_jobs`
- **SciPy**: `workers`
- **PyTorch DataLoader**: `num_workers`
- **Python**: `max_workers`
]
.g-6[
]
]

---

# Proposal

.g[
.g-6[
## Now
- **scikit-learn**: `n_jobs`
- **SciPy**: `workers`
- **PyTorch DataLoader**: `num_workers`
- **Python**: `max_workers`
]
.g-6[
## Future
- Everyone uses `workers`
]
]

---

class: center

## Interactions

---

# Oversubscription

## Python + native threading

```python
from scipy import optimize

optimize.brute(
*    computation_that_uses_8_cores,
    ...
*    workers=8
)
```

---

# Current workarounds

## Dask

[Source](https://docs.dask.org/en/stable/array-best-practices.html#avoid-oversubscribing-threads)

---

[Source](https://docs.ray.io/en/latest/serve/scaling-and-resource-allocation.html#configuring-parallelism-with-omp-num-threads)

---

# PyTorch's DataLoader

.g.g-middle[
.g-8[
```python
from torch.utils.data import DataLoader

dl = DataLoader(..., num_workers=8)

# torch/utils/data/_utils/worker.py
def _worker_loop(...):
    ...
*    torch.set_num_threads(1)
```
]
.g-4.center[
]
]

[Source]()

---

# scikit-learn

.g.g-middle[
.g-8[
```python
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier

*clf = HistGradientBoostingClassifier()
search = HalvingGridSearchCV(
    clf, param_distributions,
*    n_jobs=8
)
search.fit(X, y)
```
]
.g-4.center[
]
]

---

class: top
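# Oversubscription Is Just Multiplication

The oversubscription above is simple arithmetic: every process-level worker multiplies the per-worker thread count. A minimal sketch (the function name `total_threads` is illustrative):

```python
import os

def total_threads(workers, threads_per_worker):
    """Total OS threads spawned when each worker also runs a threadpool."""
    return workers * threads_per_worker

cores = os.cpu_count() or 1

# 8 SciPy workers, each calling into a computation that uses 8 cores:
spawned = total_threads(workers=8, threads_per_worker=8)
print(f"{spawned} threads competing for {cores} cores")
```

With 64 threads contending for, say, 8 physical cores, the scheduler context-switches constantly, which is why the workarounds on the surrounding slides pin each worker to a single thread.

---

class: top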
# Multiple Parallel Abstractions

- Python multiprocessing using `fork` + GCC OpenMP: **stalls**

--

- Intel OpenMP + LLVM OpenMP on Linux: **stalls**

--

- Multiple OpenBLAS libraries: **sometimes slower**

--

- Read more at: [thomasjpfan.github.io/parallelism-python-libraries-design/](https://thomasjpfan.github.io/parallelism-python-libraries-design/)

---

# Multiple Parallel Abstractions

## Using more than one parallel backend

Sources: [polars](https://pola-rs.github.io/polars-book/user-guide/howcani/multiprocessing.html), [numba](https://numba.pydata.org/numba-doc/latest/user/threading-layer.html), [scikit-learn](https://scikit-learn.org/stable/faq.html#why-do-i-sometime-get-a-crash-freeze-with-n-jobs-1-under-osx-or-linux), [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#caveats)

---

class: top
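# Workaround: `spawn` Instead of `fork`

One way to sidestep the `fork` + OpenMP stall above is to request the `spawn` start method, so child processes begin with a fresh interpreter instead of inheriting a forked copy of an already-initialized OpenMP runtime. A stdlib-only sketch (the worker function is illustrative):

```python
import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    # "spawn" starts children from scratch, avoiding inherited
    # native threadpool state that can deadlock after fork.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        print(pool.map(square, [1, 2, 3]))
```

`spawn` is slower to start than `fork`, but it is the default on macOS and Windows for exactly this class of problem.

---

class: top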
# Proposal: Catch issues early

.g.g-middle[
.g-10[
]
.g-2.center[
]
]

[Source](https://github.com/numba/numba/blob/249c8ff3206928b486346443ec148508f8c25f8e/numba/np/ufunc/omppool.cpp#L119-L121)

--
## Not a full solution

---

# Multiple Native threading libraries

.center[
]

[Source](https://www.slideshare.net/RalfGommers/parallelism-in-a-numpybased-program)

---

# Multiple Native threading libraries

## CPU Waiting

```python
for n_iter in range(100):
    UV = U @ V.T             # Uses OpenBLAS with pthreads
    compute_with_openmp(UV)  # Uses OpenMP
```

[xianyi/OpenBLAS#3187](https://github.com/xianyi/OpenBLAS/issues/3187)

---

# Current Workaround

.g.g-middle[
.g-6[
## Conda-forge + OpenMP
]
.g-6.center[
]
]

---

# Current Workaround

.center[
]

[Source](https://www.slideshare.net/RalfGommers/parallelism-in-a-numpybased-program)

---

class: top

# Proposal

## Ship PyPI wheels for OpenMP

--

## Not a full solution

.g.g-center.g-middle[
.g-6[
]
.g-6[
]
]

---

class: center

# Defaults

---

class: top
# NumPy

.g.g-middle[
.g-8[
```python
import numpy as np

out = np.sum(A_array, axis=1)
```
]
.g-4.center[
]
]

--

.alert.bold.center[One Core]

---

class: top
# NumPy matmul

.g.g-middle[
.g-8[
```python
import numpy as np

out = A_array @ B_array
```
]
.g-4.center[
]
]

--

.success.bold.center[All Cores]

---

class: top
# NumPy matmul (Configuration)

.g[
.g-8[
## Environment variable: `OMP_NUM_THREADS`

```python
out = A_array @ B_array
```
]
.g-4.center[
]
]

---

class: top
# NumPy matmul (Configuration)

.g[
.g-8[
## Environment variable: `OMP_NUM_THREADS`

```python
out = A_array @ B_array
```

## `threadpoolctl`

```python
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    out = A_array @ B_array
```
]
.g-4.center[
]
]

---

class: top
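# Caveat: Set the Variable Before the Import

One subtlety with `OMP_NUM_THREADS`-style variables: the threadpool typically reads them once, when the library initializes, so they must be set before the first import. A stdlib-only sketch of the ordering (the NumPy lines are shown commented out so the snippet stands alone):

```python
import os

# Set the limit BEFORE importing the library that owns the threadpool;
# changing it afterwards usually has no effect on an initialized pool.
os.environ["OMP_NUM_THREADS"] = "1"

# import numpy as np           # BLAS now starts with a 1-thread pool
# out = A_array @ B_array      # runs on one core
```

This is what makes `threadpoolctl` attractive: it talks to the already-initialized pool, so it works after import.

---

class: top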
# PyTorch

.g[
.g-8[
```python
import torch

*out = torch.sum(A_tensor, axis=1)
```
]
.g-4.center[
]
]

--

.success.bold.center[All Cores]

---

class: top
# PyTorch (Configuration)

.g[
.g-8[
- Environment variable: `OMP_NUM_THREADS`
- `threadpoolctl`

```python
with threadpool_limits(limits=2):
    out = torch.sum(A_tensor, axis=1)
```
]
.g-4.center[
]
]

---

class: top
# PyTorch (Configuration)

.g[
.g-8[
- Environment variable: `OMP_NUM_THREADS`
- `threadpoolctl`

```python
with threadpool_limits(limits=2):
    out = torch.sum(A_tensor, axis=1)
```

- PyTorch function

```python
import torch

*torch.set_num_threads(2)
out = torch.sum(A_tensor, axis=1)
```
]
.g-4.center[
]
]

---

class: top
# pandas apply

.g[
.g-8[
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 100))
roll = df.rolling(100)
*out = roll.mean()
```
]
.g-4.center[
]
]

--

.alert.bold.center[One Core]

---

class: top
# pandas apply + numba

.g.g-middle[
.g-8[
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 100))
roll = df.rolling(100)
out = roll.mean(
*    engine="numba",
*    engine_kwargs={"parallel": True},
)
```

[Read more](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#pandas-numba-engine)
]
.g-4.center[
]
]

--

.success.bold.center[All Cores]

---

class: top
# pandas apply + numba (Configuration)

.g[
.g-8[
- Environment variable: `NUMBA_NUM_THREADS`
]
.g-4.center[
]
]

---

class: top
# pandas apply + numba (Configuration)

.g[
.g-8[
- Environment variable: `NUMBA_NUM_THREADS`
- Numba function

```python
import numba

*numba.set_num_threads(2)
out = roll.mean(engine="numba", engine_kwargs={"parallel": True})
```
]
.g-4.center[
]
]

---

class: top
# LogisticRegression

.g.g-middle[
.g-8[
```python
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression().fit(...)
*log_reg.predict(X)
```
]
.g-4.center[
]
]

--

.success.bold.center[All Cores]

---

# LogisticRegression (Configuration)

.g[
.g-8[
- Environment variable: `OMP_NUM_THREADS`
]
.g-4.center[
]
]

---

# LogisticRegression (Configuration)

.g[
.g-8[
- Environment variable: `OMP_NUM_THREADS`
- `threadpoolctl`

```python
*with threadpool_limits(limits=2):
    log_reg.predict(X)
```
]
.g-4.center[
]
]

---

class: top
# HistGradientBoostingClassifier

.g[
.g-8[
```python
from sklearn.ensemble import HistGradientBoostingClassifier

hist = HistGradientBoostingClassifier()
hist.fit(X, y)
```
]
.g-4.center[
]
]

--

.success.bold.center[All Cores]

---

# HistGradientBoostingClassifier (Configuration)

.g[
.g-8[
- Environment variable: `OMP_NUM_THREADS`
]
.g-4.center[
]
]

---

# HistGradientBoostingClassifier (Configuration)

.g[
.g-8[
- Environment variable: `OMP_NUM_THREADS`
- `threadpoolctl`

```python
*with threadpool_limits(limits=2):
    hist.predict(X)
```
]
.g-4.center[
]
]

---

class: top
# polars

.g.g-middle[
.g-8[
```python
import polars as pl

out = (
    pl.scan_csv(...)
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_width").mean())
    .collect()
)
```
]
.g-4.center[
]
]

--

.success.bold.center[All Cores]

---

# polars (Configuration)

.g.g-middle[
.g-8[
- Environment variable: `POLARS_MAX_THREADS`

```python
out = (
    pl.scan_csv(...)
    .filter(pl.col("sepal_length") > 5)
    ...
)
```
]
.g-4.center[
]
]

---

class: center

# Defaults

---

class: center

.g.g-middle[
.g-6[
# Proposal

## Agree on a default?
]
.g-6[
]
]

---

class: center

# Proposal

## Libraries document how to configure parallelism

---

.center[
# State of Python Parallelism
]

.g.g-middle[
.g-6.larger[
## APIs
## Interactions
## Defaults
]
.g-6[
]
]

---

class: title-slide

# Can There Be Too Much Parallelism?

.larger[Thomas J. Fan]
@thomasjpfan
This talk on GitHub: thomasjpfan/scipy-2023-too-parallel
---

class: chapter-slide

# Appendix

---

# Python GIL + Parallelism?

- Python multi-threading: releases the GIL
- Python multi-processing: each process gets its own GIL
- Native multi-threading: releases the GIL

---

# PEP 684: Sub-Interpreters

## Needs exploration; it could work

---

# PEP 703: No-GIL

## Also promising, but a harder lift for Python
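---

# Seeing GIL Release in Action

The claim that multi-threading helps once the GIL is released can be checked with a stdlib-only sketch: `time.sleep` releases the GIL the same way a native compute kernel (e.g. a BLAS call) does, so four blocking calls overlap across four threads instead of running serially:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def blocking_call(seconds):
    # time.sleep releases the GIL, like a GIL-releasing native kernel.
    time.sleep(seconds)
    return seconds

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(blocking_call, [0.2] * 4))
elapsed = time.perf_counter() - start

# Four 0.2 s calls overlap instead of taking 0.8 s back to back.
print(f"elapsed: {elapsed:.2f}s")
```

Pure-Python bytecode, by contrast, would not overlap this way: the threads would serialize on the GIL, which is what PEP 684 and PEP 703 aim to address.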