Parallelism in Numerical Python Libraries

Thomas J. Fan

@thomasjpfan

This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism

Parallelism?

Overview

Configuration

Implementation

NumPy Parallelism 🚀

Demo: np.add vs @ (matmul) 🧪

Questions 🤔

  • When is NumPy parallel?
  • How is linalg implemented?
  • How to configure parallelism for linalg?
  • Can we parallelize np.add?

How is linalg implemented?

BLAS (Basic Linear Algebra Subprograms)

LAPACK (Linear Algebra PACKage)

  • OpenBLAS
  • MKL: Intel's Math Kernel Library
  • BLIS: BLAS-like Library Instantiation Software

Which BLAS is my library using?

Use threadpoolctl!

python -m threadpoolctl -i numpy

Output for OpenBLAS:

{
  "user_api": "blas",
  "internal_api": "openblas",
  "prefix": "libopenblas",
  "filepath": "...",
  "version": "0.3.21",
  "threading_layer": "pthreads",
  "architecture": "Zen",
  "num_threads": 32
}

Which BLAS is my library using?

Output for MKL:

{
  "user_api": "blas",
  "internal_api": "mkl",
  "prefix": "libmkl_rt",
  "filepath": "...",
  "version": "2022.1-Product",
  "threading_layer": "intel",
  "num_threads": 16
}
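
The same information is available from the Python API; a minimal sketch using threadpoolctl's threadpool_info() (the keys match the JSON above):

from threadpoolctl import threadpool_info
import numpy as np  # importing NumPy loads its BLAS

for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])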

How to select a BLAS implementation?

PyPI

Only OpenBLAS (with pthreads)

Conda-forge

conda install "libblas=*=*mkl"
conda install "libblas=*=*openblas"
conda install "libblas=*=*blis"
conda install "libblas=*=*accelerate"
conda install "libblas=*=*netlib"

Read more here
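
NumPy can also report which BLAS/LAPACK it was built against; a quick check (output format varies by NumPy version):

import numpy as np

np.show_config()  # prints the detected BLAS/LAPACK libraries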

How to configure parallelism for @?

Environment variables

threadpoolctl

Controlling Parallelism with environment variables

  • OPENBLAS_NUM_THREADS (With pthreads)
  • MKL_NUM_THREADS (Intel's MKL)
  • OMP_NUM_THREADS (OpenMP)
  • BLIS_NUM_THREADS
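
These variables are read when the BLAS is loaded, so set them before importing the library. A minimal sketch, assuming OpenBLAS built with pthreads:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"  # must be set before NumPy is imported

import numpy as np  # OpenBLAS picks up the limit when it is loaded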

threadpoolctl

Context manager

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=4, user_api="blas"):
    Z = X @ Y

Globally

threadpool_limits(limits=4, user_api="blas")

Can we parallelize np.add?

Demo: Parallel add! 🧪

PyTorch CPU Parallelism 🚀

Parallel by default!

Configuration

  • OMP_NUM_THREADS or MKL_NUM_THREADS
  • torch.set_num_threads
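
A minimal sketch of the Python API:

import torch

torch.set_num_threads(4)        # threads used for intra-op parallelism
print(torch.get_num_threads())  # -> 4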

SciPy Parallelism 🚀

scipy.linalg

Python's Multiprocessing

Python's Multithreading

Custom C++ thread pool using pthreads

SciPy: Multiprocessing 🚀

Config with the workers parameter

  • integrate.quad_vec
  • optimize.differential_evolution
  • optimize.brute
  • stats.multiscale_graphcorr
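
A minimal sketch with optimize.differential_evolution; workers=-1 uses one process per CPU, and the __main__ guard matters with the spawn start method:

from scipy import optimize

if __name__ == "__main__":
    bounds = [(-5, 5)] * 4
    result = optimize.differential_evolution(
        optimize.rosen, bounds, workers=-1, updating="deferred"
    )
    print(result.x)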

SciPy: optimize.brute

Source

SciPy: Multithreading 🚀

Config with the workers parameter

  • spatial.KDTree.query
  • spatial.cKDTree.query
  • spatial.KDTree.query_ball_point
  • spatial.cKDTree.query_ball_point
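
A minimal sketch; workers=-1 uses all CPUs via Python threads (the query releases the GIL):

import numpy as np
from scipy.spatial import KDTree

points = np.random.default_rng(0).random((10_000, 3))
tree = KDTree(points)

dist, idx = tree.query(points, k=5, workers=-1)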

SciPy: spatial.KDTree.query

Source

Multiprocessing vs Multithreading

Multiprocessing

  • Likely need to think about memory
  • Different start methods:
    • spawn (Default on Windows and macOS)
    • fork (Default on Unix)
    • forkserver

Multithreading

  • Need to think about the Global Interpreter Lock (GIL)
  • Shared memory
  • Spawning threads is faster
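
A minimal sketch of selecting a start method explicitly with the standard library:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # or "fork" / "forkserver"
    with ctx.Pool(processes=4) as pool:
        print(pool.map(square, range(8)))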

SciPy: Custom C++ thread pool using pthreads

Config with the workers parameter

  • scipy.fft
  • linalg.matmul_toeplitz
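
A minimal sketch; the workers argument hands the transform to SciPy's internal thread pool:

import numpy as np
from scipy import fft

x = np.random.rand(2**20)
X = fft.fft(x, workers=4)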

SciPy: scipy.fft.fft

Source

Why pthreads and not OpenMP? 🧵

  • The GNU OpenMP runtime library does not work well with multiprocessing (fork)
  • Some OpenMP runtime libraries are not compatible with each other

Read more in this SciPy issue.

Parallelism in Pandas 🚀

  • Configure with engine="numba"
  • Aggregations with groupby, rolling

  • Parallel over columns in DataFrames:

    • engine_kwargs={"parallel": True}
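
A minimal sketch of a Numba-backed rolling aggregation (assumes numba is installed; supported operations vary by pandas version):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 20))

result = df.rolling(100).mean(
    engine="numba", engine_kwargs={"parallel": True}
)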

Demo: Pandas Parallelism 🐼

UMAP

Parallel by default with Numba

Source

Parallelism with Numba

Configuration

  • Environment variable: NUMBA_NUM_THREADS
  • Python API: numba.set_num_threads

Threading layers

  • Open Multi-Processing (OpenMP)
  • Intel Threading Building Blocks (TBB)
  • workqueue (Numba's custom threadpool using pthreads)
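
A minimal sketch of an explicitly parallel Numba kernel:

import numpy as np
import numba
from numba import njit, prange

numba.set_num_threads(4)  # cap the kernel at 4 threads

@njit(parallel=True)
def parallel_sum(x):
    total = 0.0
    for i in prange(x.shape[0]):  # iterations are split across threads
        total += x[i]             # Numba turns this into a parallel reduction
    return total

parallel_sum(np.arange(1_000_000, dtype=np.float64))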

Parallelism with Numba

Considerations

  • Need to be careful when combined with Python's parallel libraries:

    • multithreading
    • multiprocessing spawn
    • multiprocessing fork
    • multiprocessing forkserver
  • Read more here

AOT vs Numba

AOT

  • Ahead-of-time compiled
  • Harder to build
  • Fewer requirements at runtime

Numba

  • Just-in-time compiled
  • Source code is Python
  • Requires compiler at runtime

Parallelism in polars 🐻‍❄️

  • Parallel by default
  • Uses pthreads via the Rust library rayon-rs/rayon
  • Environment variable: POLARS_MAX_THREADS

github.com/pola-rs/polars
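
The thread pool is created when polars is imported, so the environment variable has to be set first; a minimal sketch:

import os
os.environ["POLARS_MAX_THREADS"] = "4"  # set before the import below

import polars as pl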

Parallelism in scikit-learn 🖥

  • Python Multithreading
  • Python Multiprocessing (with loky backend)
  • OpenMP routines (Parallel by default)
  • Inherits linalg BLAS semantics from NumPy and SciPy (Parallel by default)

Parallelism in scikit-learn 🖥

Python Multithreading

  • RandomForestClassifier and RandomForestRegressor
  • LogisticRegression with solver="sag" or "saga"
  • Method calls to kneighbors or radius_neighbors

Configuration

  • n_jobs parameter
forest = RandomForestClassifier(n_jobs=4)

Parallelism in scikit-learn 🖥

Python Multiprocessing

  • HalvingGridSearchCV and HalvingRandomSearchCV
  • MultiOutputClassifier, etc

Configuration

  • n_jobs parameter
from sklearn.experimental import enable_halving_search_cv  # enables the halving estimators
from sklearn.model_selection import HalvingGridSearchCV

halving = HalvingGridSearchCV(..., n_jobs=4)

loky

  • Detects worker process failures
  • Reuse existing pools
  • Processes are spawned by default

Learn more at loky.readthedocs.io/en/stable/
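
scikit-learn drives the loky backend through joblib; a minimal sketch of using it directly:

from joblib import Parallel, delayed

def work(i):
    return i ** 2

results = Parallel(n_jobs=4, backend="loky")(
    delayed(work)(i) for i in range(8)
)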

Python Multiprocessing: Memory


Parallelism in scikit-learn 🖥

Python Multiprocessing: memmapping
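
joblib memory-maps large array arguments so worker processes share one copy on disk rather than each receiving a pickled copy; a minimal sketch (max_nbytes is the size threshold, "1M" by default):

import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(5_000_000)

means = Parallel(n_jobs=4, max_nbytes="1M")(
    delayed(np.mean)(data[i::4]) for i in range(4)
)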

Parallelism in scikit-learn 🖥

OpenMP

  • Parallel by default

  • HistGradientBoostingRegressor and HistGradientBoostingClassifier

  • Routines that use pairwise distances reductions:

    • metrics.pairwise_distances_argmin
    • manifold.TSNE
    • neighbors.KNeighborsClassifier and neighbors.KNeighborsRegressor
    • See here for more

Configuration

  • OMP_NUM_THREADS
  • threadpoolctl
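
A minimal sketch of limiting the OpenMP threads used by a scikit-learn estimator:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from threadpoolctl import threadpool_limits

X, y = make_classification(n_samples=10_000, random_state=0)
clf = HistGradientBoostingClassifier()

with threadpool_limits(limits=4, user_api="openmp"):
    clf.fit(X, y)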

Scikit-learn Avoiding Oversubscription 🖥

  • Automatically configures native threads to cpu_count() // n_jobs
  • Learn more in joblib's docs

Scikit-learn Avoiding Oversubscription Example

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()  # OpenMP
gsh = HalvingGridSearchCV(
    estimator=clf, param_grid=param_grid,
    n_jobs=n_jobs,  # loky
)

Timing the search ⏰

%%time
gsh.fit(X, y)
# CPU times: user 15min 57s, sys: 791 ms, total: 15min 58s
# Wall time: 41.4 s

Timing results

Dask

Source

Is there a problem here? 🤔

Using Multiple Parallel APIs

Native thread libraries 🧵

Using Multiple Parallel APIs

Using more than one parallel backend 🤯

Sources: polars, numba, scikit-learn, pandas

Parallel by Default?


Conclusion

Configuration

Implementation

Configuration Options

  • Environment variables:
    • OMP_NUM_THREADS, MKL_NUM_THREADS, etc
  • Global: torch.set_num_threads
  • Context manager: threadpoolctl (BLAS and OpenMP)
  • Call-site parameters: n_jobs, workers

Implementation Options

  • C, C++, Numba, Rust, Python, etc
  • Python Multiprocessing and Multithreading
  • Native threads: OpenMP, pthreads, Intel TBB

Parallelism in Numerical Python Libraries

Thomas J. Fan
@thomasjpfan

This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism
