title: Parallelism in Numerical Python Libraries use_katex: False class: title-slide # Parallelism in Numerical Python Libraries .larger[Thomas J. Fan]
@thomasjpfan
data:image/s3,"s3://crabby-images/69148/69148d419a619b0aac82f4d4c7346e3e70d7696d" alt=":scale 25%"
This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism
--- # Parallelism? .g[ .g-4[ data:image/s3,"s3://crabby-images/56bb5/56bb5655ab5eb81807430e94bb235e1b3f203b34" alt="" ] .g-4[ data:image/s3,"s3://crabby-images/ac819/ac819cc1c397b86453bd146d556035a2e90ce468" alt="" ] .g-4[ data:image/s3,"s3://crabby-images/ab452/ab4525d01bb29007400c5bb991387edbde122a42" alt="" ] ] --- # Overview .g[ .g-6.center[ ## Configuration data:image/s3,"s3://crabby-images/118bc/118bcc95d21268ecb6dd238a250038699820e637" alt=":scale 90%" ] .g-6.center[ ## Implementation data:image/s3,"s3://crabby-images/93054/93054717209064d1299b4d5300840373fb315982" alt=":scale 90%" ] ] --- # NumPy Parallelism ๐ data:image/s3,"s3://crabby-images/287aa/287aa0d2d806c259feea5657c0499097789c2cb5" alt=":scale 100%" --- class: chapter-slide # Demo: `np.add` vs `@` (matmul) ๐งช --- # Questions ๐ค - When is NumPy parallel? - How is `linalg` implemented? - How to configure parallelism for `linalg`? - Can we parallelize `np.add`? --- # When is NumPy parallel? .center[ data:image/s3,"s3://crabby-images/8cbdc/8cbdccd4a5e6ebf07978f24fb0d98743e2c8527c" alt=":scale 70%" ] ### https://numpy.org/doc/stable/reference/routines.linalg.html --- # How is `linalg` implemented? .center[ ## BLAS (Basic Linear Algebra Subprograms) ## LAPACK (Linear Algebra PACKage) ] .g.g-middle[ .g-6[ - **OpenBLAS** - **MKL**: Intel's Math Kernel Library - **BLIS**: BLAS-like Library Instantiation Software ] .g-6[ data:image/s3,"s3://crabby-images/a93af/a93af8fe1b20f46854b1776908fde3de5e08608e" alt="" ] ] .footnote-back[ [Source](https://netlib.org/blas/) ] --- class: top, top-5 # Which BLAS is my library using? -- ## Use `threadpoolctl`! ```bash python -m threadpoolctl -i numpy ``` -- ## Output for `OpenBLAS`: ```json { "user_api": "blas", * "internal_api": "openblas", "prefix": "libopenblas", "filepath": "...", "version": "0.3.21", * "threading_layer": "pthreads", "architecture": "Zen", * "num_threads": 32 } ``` --- class: top, top-5 # Which BLAS is my library using? ## Output for `MKL`: ```json { "user_api": "blas", * "internal_api": "mkl", "prefix": "libmkl_rt", "filepath": "...", "version": "2022.1-Product", * "threading_layer": "intel", "num_threads": 16 } ``` --- # How select BLAS implementation? .g.g-middle[ .g-6[ ## PyPI ### Only `OpenBLAS` (with `pthreads`) ] .g-6[ ## Conda-forge ``` conda install "libblas=*=*mkl" conda install "libblas=*=*openblas" conda install "libblas=*=*blis" conda install "libblas=*=*accelerate" conda install "libblas=*=*netlib" ``` [Read more here](https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation) ] ] --- # How to configure parallelism for `@`? ## Environment variables ## `threadpoolctl` --- # Controlling Parallelism with environment variables - `OPENBLAS_NUM_THREADS` (With `pthreads`) - `MKL_NUM_THREADS` (Intel's MKL) - `OMP_NUM_THREADS` (OpenMP) - `BLIS_NUM_THREADS` --- class: top, top-10 # `threadpoolctl` ## Context manager ```python from threadpoolctl import threadpool_limits with threadpool_limits(limits=4, user_api="blas"): Z = X @ Y ``` -- ## Globally ```python threadpool_limits(limits=4, user_api="blas") ``` --- # Can we parallelize `np.add`? .g.g-middle[ .g-4[ data:image/s3,"s3://crabby-images/1dc90/1dc9081eaa84a43bf3674b79eb8cf0ac657a0e5a" alt=":scale 80%" ] .g-4[ data:image/s3,"s3://crabby-images/ca69d/ca69d5847cab0a81875f4a1bfc7e6d6c586db7fe" alt=":scale 80%" ] .g-4[ data:image/s3,"s3://crabby-images/fddb9/fddb9d426d82d386f5b88ddfdc6fa196fb31efeb" alt=":scale 80%" ] ] --- class: chapter-slide # Demo: Parallel `add`! ๐งช --- # Pytorch CPU Parallelism ๐ ## Parallel by default! .g.g-middle[ .g-8[ ## Configuration - `OMP_NUM_THREADS` or `MKL_NUM_THREADS` - `torch.set_num_threads` ] .g-4[ data:image/s3,"s3://crabby-images/fddb9/fddb9d426d82d386f5b88ddfdc6fa196fb31efeb" alt="" ] ] --- class: center # SciPy Parallelism ๐ data:image/s3,"s3://crabby-images/b5e9e/b5e9e8ff88b52bdb66ad1e219c90ee2f947276df" alt=":scale 50%" --- # SciPy Parallelism ๐ .g.g-middle[ .g-8[ ## `scipy.linalg` ## Python's Multiprocessing ## Python's Multithreading ## Custom C++ thread pool using pthreads ] .g-4[ data:image/s3,"s3://crabby-images/b5e9e/b5e9e8ff88b52bdb66ad1e219c90ee2f947276df" alt="" ] ] --- # `scipy.linalg` .center[ data:image/s3,"s3://crabby-images/804e6/804e61f05024df6549c8e80b00a8c41ad17f1482" alt=":scale 60%" ] https://docs.scipy.org/doc/scipy/reference/linalg.html --- # SciPy: Multiprocessing ๐ Config with the `workers` parameter - `integrate.quad_vec` - `optimize.differential_evolution` - `optimize.brute` - `stats.multiscale_graphcorr` --- # SciPy: `optimize.brute` data:image/s3,"s3://crabby-images/a6af4/a6af4119d3d4dde6b68a7a1d7c833be60314df9b" alt="" [Source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.brute.html) --- # SciPy: Multithreading ๐ Config with the `workers` parameter - `spatial.KDTree.query` - `spatial.cKDTree.query` - `spatial.KDTree.query_ball_point` - `spatial.cKDTree.query_ball_point` --- # SciPy: `spatial.KDTree.query` data:image/s3,"s3://crabby-images/b450a/b450a80a5493fa015dd5706c47daa0585226ae3d" alt="" [Source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html) --- class: top, top-10 # Multiprocessing vs Multithreading .g[ .g-6[ ## Multiprocessing - Likely need to think about memory - Different [start methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods): - **spawn** (Default on Windows and macOS) - **fork** (Default on Unix) - **forkserver** ] .g-6[ ## Multithreading - Need to think about the Global Interpreter Lock (**GIL**) - Shared memory - Spawning threads is faster ] ] --- # SciPy: Custom C++ thread pool using `pthreads` .g.g-middle[ .g-6[ Config with the `workers` parameter - `scipy.fft` - `linalg.matmul_toeplitz` ] .g-6.g-center[ data:image/s3,"s3://crabby-images/6d8e4/6d8e4bc13ad117e5275ddaf3a5cb138546c6a9a8" alt=":scale 50%" ] ] --- # SciPy: `scipy.fft.fft` data:image/s3,"s3://crabby-images/8ac9e/8ac9e0a6d0dd9421b77ff2e5b6c68426c245e241" alt="" [Source](https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.fft.html) --- # Why `pthreads` and not `OpenMP`? ๐งต - GNU OpenMP runtime library **does not work well** with `multiprocessing` - Some OpenMP runtime libraries are **not compatible** with each other Read more [in this SciPy issue](https://github.com/scipy/scipy/issues/10239#issuecomment-795030817). --- # Parallelism in Pandas ๐ data:image/s3,"s3://crabby-images/1f491/1f49133154b33101da46ec824f6d57260fa6cd5e" alt="" --- class: top, top-10 # Parallelism in Pandas ๐ .g.g-middle[ .g-8[ - Configure with `engine="numba"` - Aggregations with `groupby`, `rolling` - Parallel over columns in **DataFrames**: - `engine_kwargs={"parallel": True}` ] .g-4[ data:image/s3,"s3://crabby-images/1f491/1f49133154b33101da46ec824f6d57260fa6cd5e" alt=":scale 100%" ] ] --- class: chapter-slide # Demo: Pandas Parallelism ๐ผ --- # UMAP ## **Parallel by default** with Numba data:image/s3,"s3://crabby-images/40818/4081877f91770e2e113a6400c623726a0fad562b" alt="" [Source](https://umap-learn.readthedocs.io/en/latest/index.html) --- class: top # Parallelism with Numba .g.g-middle[ .g-8[ ## Configuration - Environment variable: `NUMBA_NUM_THREADS` - Python API: `numba.set_num_threads` ] .g-4[ data:image/s3,"s3://crabby-images/1dc90/1dc9081eaa84a43bf3674b79eb8cf0ac657a0e5a" alt=":scale 60%" ] ] -- ## Threading layers - Open Multi-Processing (OpenMP) - Intel Threading Building Blocks (TBB) - workqueue (Numba's custom threadpool using `pthreads`) --- # Parallelism with Numba ## Considerations - Need to be careful when combined with Python's parallel libraries: - `multithreading` - `multiprocessing` **spawn** - `multiprocessing` **fork** - `multiprocessing` **forkserver** - [Read more here](https://numba.pydata.org/numba-doc/latest/user/threading-layer.html) --- # AOT vs Numba .g[ .g-6[ .center[ ## AOT data:image/s3,"s3://crabby-images/475c1/475c13d3729e0b7aec5581c732ac00cc5a8ce405" alt="" ] - **Ahead of time** compiled - Harder to build - Less requirements during runtime ] .g-6[ .center[ ## Numba data:image/s3,"s3://crabby-images/1dc90/1dc9081eaa84a43bf3674b79eb8cf0ac657a0e5a" alt=":scale 38%" ] - **Just in time** compiled - Source code is Python - Requires compiler at runtime ] ] --- # Parallelism in polars ๐ปโโ๏ธ data:image/s3,"s3://crabby-images/4c62b/4c62bee408f7528964ebe4a50888b4675f5dcdb7" alt="" - **Parallel by default** - Uses `pthreads` with rust library: `rayon-rs/rayon` - Environment variable: `POLARS_MAX_THREADS` [github.com/pola-rs/polars](https://github.com/pola-rs/polars) --- class: center # Parallelism in scikit-learn ๐ฅ data:image/s3,"s3://crabby-images/61236/61236c366ef808f76b6becbd8a8bd391e1ef4077" alt=":scale 50%" --- # Parallelism in scikit-learn ๐ฅ .g.g-middle[ .g-8[ - Python Multithreading - Python Multiprocessing (with `loky` backend) - OpenMP routines (**Parallel by default**) - Inherits `linalg` BLAS semantics from NumPy and SciPy (**Parallel by default**) ] .g-4[ data:image/s3,"s3://crabby-images/61236/61236c366ef808f76b6becbd8a8bd391e1ef4077" alt=":scale 100%" ] ] --- class: top, top-5 # Parallelism in scikit-learn ๐ฅ ## Python Multithreading -- - `RandomForestClassifier` and `RandomForestRegressor` - LogisticRegression with `solver="sag"` or `"saga"` - Method calls to `kneighbors` or `radius_neighbors` -- ## Configuration - `n_jobs` parameter ```python forest = RandomForestClassifier(n_jobs=4) ``` --- class: top, top-5 # Parallelism in scikit-learn ๐ฅ ## Python Multiprocessing - `HalvingGridSearchCV` and `HalvingRandomSearchCV` - `MultiOutputClassifier`, etc -- ## Configuration - `n_jobs` parameter ```python from sklearn.experimental import enable_halving_search_cv from sklearn.model_selection import HalvingRandomSearchCV *halving = HalvingGridSearchCV(..., n_jobs=4) ``` --- .g.g-middle[ .g-8[ # loky - Detects worker process failures - Reuse existing pools - Processes are spawned by default ] .g-4[ data:image/s3,"s3://crabby-images/efdfb/efdfbf8be47e1b9f61dbd65faf5826edecbadb8c" alt=":scale 100%" ] ] Learn more at [loky.readthedocs.io/en/stable/](https://loky.readthedocs.io/en/stable/) --- # Python Multiprocessing: Memory
data:image/s3,"s3://crabby-images/b931a/b931afca37fd8d609c67328ee36f5b52ed747162" alt="" --- # Parallelism in scikit-learn ๐ฅ ## Python Multiprocessing: memmapping data:image/s3,"s3://crabby-images/6bace/6bace6b4800de0025cbe9ddfd14fa60a07dca89e" alt="" --- class: top # Parallelism in scikit-learn ๐ฅ ## OpenMP - **Parallel by default** -- - `HistGradientBoostingRegressor` and `HistGradientBoostingClassifier` -- - Routines that use pairwise distances reductions: - `metrics.pairwise_distances_argmin` - `manifold.TSNE` - `neighbors.KNeighborsClassifier` and `neighbors.KNeighborsRegressor` - [See here for more](https://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_1_0.html#performance-improvements) -- ## Configuration - `OMP_NUM_THREADS` - `threadpoolctl` --- # Scikit-learn Avoiding Oversubscription ๐ฅ .g.g-middle[ .g-8[ - Automatically configures native threads to **`cpu_count() // n_jobs`** - Learn more in [joblib's docs](https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources) ] .g-4[ data:image/s3,"s3://crabby-images/8dab2/8dab2c71da37261e4e52cbedd8b5acde053cdcd0" alt="" ] ] --- class: top-5, top # Scikit-learn Avoiding Oversubscription Example ```python from sklearn.experimental import enable_halving_search_cv from sklearn.model_selection import HalvingGridSearchCV from sklearn.ensemble import HistGradientBoostingClassifier *clf = HistGradientBoostingClassifier() # OpenMP gsh = HalvingGridSearchCV( estimator=clf, param_grid=param_grid, * n_jobs=n_jobs # loky ) ``` -- ## Timing the search โฐ ```python %%time gsh.fit(X, y) # CPU times: user 15min 57s, sys: 791 ms, total: 15min 58s *# Wall time: 41.4 s ``` --- # Timing results data:image/s3,"s3://crabby-images/6443d/6443d16d7a72c79d67c07436ce0ad6b434997d5a" alt=":scale 80%" --- # Timing results data:image/s3,"s3://crabby-images/46f11/46f11af0023cc4d6bc1b74f35af5268472bdf26b" alt=":scale 80%" --- # Timing results data:image/s3,"s3://crabby-images/4e6b2/4e6b216698017f404f6e30862f9339fcf4597cfb" alt=":scale 80%" --- # Timing results data:image/s3,"s3://crabby-images/091f7/091f74f0c05fa722d4628e15df5da4436d020b0a" alt=":scale 80%" --- class: top # Dask data:image/s3,"s3://crabby-images/a13e7/a13e7692f714ef11b5431271aaf75645ca6afa72" alt=":scale 15%" data:image/s3,"s3://crabby-images/45894/45894880801fb510170fa094b59089baf555448a" alt=":scale 85%" [Source](https://docs.dask.org/en/stable/array-best-practices.html#avoid-oversubscribing-threads) --- data:image/s3,"s3://crabby-images/0f980/0f98065cd7d11ab6b07a76efc5c7f466ac4262c8" alt=":scale 20%" data:image/s3,"s3://crabby-images/5da5f/5da5f18605d3cd7d077832346d9af91a5983fa23" alt="" [Source](https://docs.ray.io/en/latest/serve/scaling-and-resource-allocation.html#configuring-parallelism-with-omp-num-threads) --- class: chapter-slide # Is there a problem here? ๐ค --- # Using Multiple Parallel APIs ## Native thread libraries ๐งต - OpenBLAS (built with `pthreads`) and code with `OpenMP` - Poor interactions between `OpenMP` with different compilers - Two OpenBLAS present (`NumPy` and `SciPy` on `PyPI`) - Read more at: [thomasjpfan.github.io/parallelism-python-libraries-design/](https://thomasjpfan.github.io/parallelism-python-libraries-design/) --- # Using Multiple Parallel APIs ## Using more than one parallel backends ๐คฏ data:image/s3,"s3://crabby-images/c2780/c2780b6812984e4494925d52b69aa351a1a9c689" alt="" Sources: [polars](https://pola-rs.github.io/polars-book/user-guide/howcani/multiprocessing.html), [numba](https://numba.pydata.org/numba-doc/latest/user/threading-layer.html), [scikit-learn](https://scikit-learn.org/stable/faq.html#why-do-i-sometime-get-a-crash-freeze-with-n-jobs-1-under-osx-or-linux), [pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html#caveats) --- class: center # Parallel by Default?
data:image/s3,"s3://crabby-images/d72de/d72deda12533e752a942f0a0499776759d41afc0" alt="" --- # Conclusion .g[ .g-6.center[ ## Configuration data:image/s3,"s3://crabby-images/118bc/118bcc95d21268ecb6dd238a250038699820e637" alt=":scale 90%" ] .g-6.center[ ## Implementation data:image/s3,"s3://crabby-images/93054/93054717209064d1299b4d5300840373fb315982" alt=":scale 90%" ] ] --- .g[ .g-9[ # Configuration Options - Environment variables: - `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, etc - Global: `torch.set_num_threads` - Context manager: `threadpoolctl` (BLAS and OpenMP) - Call-site parameters: `n_jobs`, `worker` ] .g-3[ data:image/s3,"s3://crabby-images/118bc/118bcc95d21268ecb6dd238a250038699820e637" alt=":scale 100%" ] ] --- .g[ .g-9[ # Implementation Options - `C`, `C++`, `Numba`, `Rust`, `Python`, etc - Python Multiprocessing and Multithreading - Native threads: `OpenMP`, `pthreads`, Intel TBB ] .g-3[ data:image/s3,"s3://crabby-images/93054/93054717209064d1299b4d5300840373fb315982" alt=":scale 100%" ] ] --- class: center # Parallelism in Numerical Python Libraries .larger[Thomas J. Fan]
@thomasjpfan
data:image/s3,"s3://crabby-images/69148/69148d419a619b0aac82f4d4c7346e3e70d7696d" alt=":scale 25%"
This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism