Parallelism in Numerical Python Libraries

Thomas J. Fan

@thomasjpfan

This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism

Parallelism?

Overview

Configuration

Implementation

NumPy Parallelism 🚀

Demo: np.add vs @ (matmul) 🧪

Questions 🤔

  • When is NumPy parallel?
  • How is linalg implemented?
  • How to configure parallelism for linalg?
  • Can we parallelize np.add?

How is linalg implemented?

BLAS (Basic Linear Algebra Subprograms)

LAPACK (Linear Algebra PACKage)

  • OpenBLAS
  • MKL: Intel's Math Kernel Library
  • BLIS: BLAS-like Library Instantiation Software

Which BLAS is my library using?

Use threadpoolctl!

python -m threadpoolctl -i numpy

Output for OpenBLAS:

{
  "user_api": "blas",
  "internal_api": "openblas",
  "prefix": "libopenblas",
  "filepath": "...",
  "version": "0.3.21",
  "threading_layer": "pthreads",
  "architecture": "Zen",
  "num_threads": 32
}

Which BLAS is my library using?

Output for MKL:

{
  "user_api": "blas",
  "internal_api": "mkl",
  "prefix": "libmkl_rt",
  "filepath": "...",
  "version": "2022.1-Product",
  "threading_layer": "intel",
  "num_threads": 16
}
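
The same information is available from the Python API; a minimal sketch using threadpoolctl's threadpool_info() (the keys match the JSON above):

from threadpoolctl import threadpool_info
import numpy as np  # importing NumPy loads its BLAS

for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])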

How to select a BLAS implementation?

PyPI

Only OpenBLAS (with pthreads)

Conda-forge

conda install "libblas=*=*mkl"
conda install "libblas=*=*openblas"
conda install "libblas=*=*blis"
conda install "libblas=*=*accelerate"
conda install "libblas=*=*netlib"

Read more here
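
NumPy can also report which BLAS/LAPACK it was built against; a quick check (output format varies by NumPy version):

import numpy as np

np.show_config()  # prints the detected BLAS/LAPACK libraries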

How to configure parallelism for @?

Environment variables

threadpoolctl

Controlling Parallelism with environment variables

  • OPENBLAS_NUM_THREADS (With pthreads)
  • MKL_NUM_THREADS (Intel's MKL)
  • OMP_NUM_THREADS (OpenMP)
  • BLIS_NUM_THREADS
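
These variables are read when the BLAS is loaded, so set them before importing the library. A minimal sketch, assuming OpenBLAS built with pthreads:

import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"  # must be set before NumPy is imported

import numpy as np  # OpenBLAS picks up the limit when it is loaded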

threadpoolctl

Context manager

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=4, user_api="blas"):
    Z = X @ Y

Globally

threadpool_limits(limits=4, user_api="blas")

Can we parallelize np.add?

Demo: Parallel add! 🧪

PyTorch CPU Parallelism 🚀

Parallel by default!

Configuration

  • OMP_NUM_THREADS or MKL_NUM_THREADS
  • torch.set_num_threads
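
A minimal sketch of the Python API:

import torch

torch.set_num_threads(4)        # threads used for intra-op parallelism
print(torch.get_num_threads())  # -> 4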

SciPy Parallelism 🚀

scipy.linalg

Python's Multiprocessing

Python's Multithreading

Custom C++ thread pool using pthreads

SciPy: Multiprocessing 🚀

Config with the workers parameter

  • integrate.quad_vec
  • optimize.differential_evolution
  • optimize.brute
  • stats.multiscale_graphcorr
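
A minimal sketch with optimize.differential_evolution; workers=-1 uses one process per CPU, and the __main__ guard matters with the spawn start method:

from scipy import optimize

if __name__ == "__main__":
    bounds = [(-5, 5)] * 4
    result = optimize.differential_evolution(
        optimize.rosen, bounds, workers=-1, updating="deferred"
    )
    print(result.x)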

SciPy: optimize.brute

Source

SciPy: Multithreading 🚀

Config with the workers parameter

  • spatial.KDTree.query
  • spatial.cKDTree.query
  • spatial.KDTree.query_ball_point
  • spatial.cKDTree.query_ball_point
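
A minimal sketch; workers=-1 uses all CPUs via Python threads (the query releases the GIL):

import numpy as np
from scipy.spatial import KDTree

points = np.random.default_rng(0).random((10_000, 3))
tree = KDTree(points)

dist, idx = tree.query(points, k=5, workers=-1)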

SciPy: spatial.KDTree.query

Source

Multiprocessing vs Multithreading

Multiprocessing

  • Likely need to think about memory
  • Different start methods:
    • spawn (Default on Windows and macOS)
    • fork (Default on Unix)
    • forkserver

Multithreading

  • Need to think about the Global Interpreter Lock (GIL)
  • Shared memory
  • Spawning threads is faster
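
A minimal sketch of selecting a start method explicitly with the standard library:

import multiprocessing as mp

def square(x):
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # or "fork" / "forkserver"
    with ctx.Pool(processes=4) as pool:
        print(pool.map(square, range(8)))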

SciPy: Custom C++ thread pool using pthreads

Config with the workers parameter

  • scipy.fft
  • linalg.matmul_toeplitz
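
A minimal sketch; the workers argument hands the transform to SciPy's internal thread pool:

import numpy as np
from scipy import fft

x = np.random.rand(2**20)
X = fft.fft(x, workers=4)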

SciPy: scipy.fft.fft

Source

Why pthreads and not OpenMP? 🧵

  • The GNU OpenMP runtime library does not work well with multiprocessing (fork)
  • Some OpenMP runtime libraries are not compatible with each other

Read more in this SciPy issue.

Parallelism in Pandas 🚀

  • Configure with engine="numba"
  • Aggregations with groupby, rolling

  • Parallel over columns in DataFrames:

    • engine_kwargs={"parallel": True}
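
A minimal sketch of a Numba-backed rolling aggregation (assumes numba is installed; supported operations vary by pandas version):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 20))

result = df.rolling(100).mean(
    engine="numba", engine_kwargs={"parallel": True}
)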

Demo: Pandas Parallelism 🐼

UMAP

Parallel by default with Numba

Source

Parallelism with Numba

Configuration

  • Environment variable: NUMBA_NUM_THREADS
  • Python API: numba.set_num_threads

Threading layers

  • Open Multi-Processing (OpenMP)
  • Intel Threading Building Blocks (TBB)
  • workqueue (Numba's custom threadpool using pthreads)
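
A minimal sketch of an explicitly parallel Numba kernel:

import numpy as np
import numba
from numba import njit, prange

numba.set_num_threads(4)  # cap the kernel at 4 threads

@njit(parallel=True)
def parallel_sum(x):
    total = 0.0
    for i in prange(x.shape[0]):  # iterations are split across threads
        total += x[i]             # Numba turns this into a parallel reduction
    return total

parallel_sum(np.arange(1_000_000, dtype=np.float64))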

Parallelism with Numba

Considerations

  • Need to be careful when combined with Python's parallel libraries:

    • multithreading
    • multiprocessing spawn
    • multiprocessing fork
    • multiprocessing forkserver
  • Read more here

AOT vs Numba

AOT

  • Ahead-of-time compiled
  • Harder to build
  • Fewer requirements at runtime

Numba

  • Just-in-time compiled
  • Source code is Python
  • Requires compiler at runtime

Parallelism in polars 🐻‍❄️

  • Parallel by default
  • Uses pthreads via the Rust library rayon-rs/rayon
  • Environment variable: POLARS_MAX_THREADS

github.com/pola-rs/polars
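
The thread pool is created when polars is imported, so the environment variable has to be set first; a minimal sketch:

import os
os.environ["POLARS_MAX_THREADS"] = "4"  # set before the import below

import polars as pl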

Parallelism in scikit-learn 🖥

  • Python Multithreading
  • Python Multiprocessing (with loky backend)
  • OpenMP routines (Parallel by default)
  • Inherits linalg BLAS semantics from NumPy and SciPy (Parallel by default)

Parallelism in scikit-learn 🖥

Python Multithreading

  • RandomForestClassifier and RandomForestRegressor
  • LogisticRegression with solver="sag" or "saga"
  • Method calls to kneighbors or radius_neighbors

Configuration

  • n_jobs parameter
forest = RandomForestClassifier(n_jobs=4)

Parallelism in scikit-learn 🖥

Python Multiprocessing

  • HalvingGridSearchCV and HalvingRandomSearchCV
  • MultiOutputClassifier, etc

Configuration

  • n_jobs parameter
from sklearn.experimental import enable_halving_search_cv  # enables the halving estimators
from sklearn.model_selection import HalvingGridSearchCV

halving = HalvingGridSearchCV(..., n_jobs=4)

loky

  • Detects worker process failures
  • Reuse existing pools
  • Processes are spawned by default

Learn more at loky.readthedocs.io/en/stable/
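
scikit-learn drives the loky backend through joblib; a minimal sketch of using it directly:

from joblib import Parallel, delayed

def work(i):
    return i ** 2

results = Parallel(n_jobs=4, backend="loky")(
    delayed(work)(i) for i in range(8)
)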

Python Multiprocessing: Memory


Parallelism in scikit-learn 🖥

Python Multiprocessing: memmapping
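
joblib memory-maps large array arguments so worker processes share one copy on disk rather than each receiving a pickled copy; a minimal sketch (max_nbytes is the size threshold, "1M" by default):

import numpy as np
from joblib import Parallel, delayed

data = np.random.rand(5_000_000)

means = Parallel(n_jobs=4, max_nbytes="1M")(
    delayed(np.mean)(data[i::4]) for i in range(4)
)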

Parallelism in scikit-learn 🖥

OpenMP

  • Parallel by default

  • HistGradientBoostingRegressor and HistGradientBoostingClassifier

  • Routines that use pairwise distances reductions:

    • metrics.pairwise_distances_argmin
    • manifold.TSNE
    • neighbors.KNeighborsClassifier and neighbors.KNeighborsRegressor
    • See here for more

Configuration

  • OMP_NUM_THREADS
  • threadpoolctl
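
A minimal sketch of limiting the OpenMP threads used by a scikit-learn estimator:

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from threadpoolctl import threadpool_limits

X, y = make_classification(n_samples=10_000, random_state=0)
clf = HistGradientBoostingClassifier()

with threadpool_limits(limits=4, user_api="openmp"):
    clf.fit(X, y)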

Scikit-learn Avoiding Oversubscription 🖥

  • Automatically configures native threads to cpu_count() // n_jobs
  • Learn more in joblib's docs

Scikit-learn Avoiding Oversubscription Example

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier
clf = HistGradientBoostingClassifier()  # OpenMP
gsh = HalvingGridSearchCV(
    estimator=clf, param_grid=param_grid,
    n_jobs=n_jobs,  # loky
)

Timing the search ⏰

%%time
gsh.fit(X, y)
# CPU times: user 15min 57s, sys: 791 ms, total: 15min 58s
# Wall time: 41.4 s

Timing results

Dask

Source

Is there a problem here? 🤔

Using Multiple Parallel APIs

Native thread libraries 🧵

Using Multiple Parallel APIs

Using more than one parallel backend 🤯

Sources: polars, numba, scikit-learn, pandas

Parallel by Default?


Conclusion

Configuration

Implementation

Configuration Options

  • Environment variables:
    • OMP_NUM_THREADS, MKL_NUM_THREADS, etc
  • Global: torch.set_num_threads
  • Context manager: threadpoolctl (BLAS and OpenMP)
  • Call-site parameters: n_jobs, workers

Implementation Options

  • C, C++, Numba, Rust, Python, etc
  • Python Multiprocessing and Multithreading
  • Native threads: OpenMP, pthreads, Intel TBB

Parallelism in Numerical Python Libraries

Thomas J. Fan
@thomasjpfan

This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism
