
Can There Be Too Much Parallelism?

Thomas J. Fan
@thomasjpfan

This talk on GitHub: thomasjpfan/scipy-2023-too-parallel

Yes 👀

User?

Developer?

My Perspective

Parallelism in Scikit-learn

  • BLAS through SciPy
  • OpenMP + Cython
  • Python Multi-Threading
  • Python Multi-Processing
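
A hedged sketch of the joblib-based Python layer, exposed through n_jobs (the estimator choice here is illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
# n_jobs=4 lets joblib fit the trees in four parallel workers
clf = RandomForestClassifier(n_jobs=4, random_state=0).fit(X, y)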

Scope

State of Python Parallelism

APIs 💻

Interactions 🥂

Defaults 🌈

APIs 💻

Environment Variables 🌲

  • OpenMP: OMP_NUM_THREADS
  • MKL: MKL_NUM_THREADS
  • OpenBLAS: OPENBLAS_NUM_THREADS
  • Polars: POLARS_MAX_THREADS
  • Numba: NUMBA_NUM_THREADS
  • macOS Accelerate: VECLIB_MAXIMUM_THREADS
  • numexpr: NUMEXPR_NUM_THREADS
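
These variables are typically read when the native library initializes its threadpool, so they need to be set before the library is imported. A minimal sketch:

import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before importing NumPy/SciPy

import numpy as np  # the BLAS threadpool now starts with a single thread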

Global Configuration 🌎

  • torch.set_num_threads
  • numba.set_num_threads
  • threadpoolctl.threadpool_limits
  • cv.setNumThreads
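
These functions flip a process-wide switch: every call made after them sees the new limit. A minimal sketch with PyTorch:

import torch

torch.set_num_threads(2)  # caps intra-op parallelism for all later calls
out = torch.ones(1000, 1000) @ torch.ones(1000, 1000)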

Block Configuration 🧱

threadpoolctl

from threadpoolctl import threadpool_limits
import numpy as np

with threadpool_limits(limits=2):
    a = np.random.randn(1000, 1000)
    a_squared = a @ a

Call-site ☎️

  • scikit-learn: n_jobs
  • SciPy: workers
  • PyTorch DataLoader: num_workers
  • Python: max_workers
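
A call-site argument scopes the limit to a single invocation. A minimal sketch with SciPy's workers:

import numpy as np
from scipy import fft

x = np.random.randn(2**20)
out = fft.fft(x, workers=4)  # only this call may use up to 4 workers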

APIs 💻

  • Environment Variables 🌲
  • Global Configuration 🌎
  • Block Configuration 🧱
  • Call-site ☎️





Proposal: Consistent APIs 🔮

Now

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export POLARS_MAX_THREADS=1
export NUMEXPR_NUM_THREADS=1

Future 🚀

Pragmatic

export OMP_NUM_THREADS=1

Better ☀️

export GOTO_NUM_THREADS=1

Proposal 🔮

Recognize more threadpools in threadpoolctl
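
threadpoolctl already introspects the pools it recognizes; a minimal sketch of that introspection:

from pprint import pprint
from threadpoolctl import threadpool_info

# one entry per detected threadpool (BLAS, OpenMP, ...), with its num_threads
pprint(threadpool_info())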

Proposal 🔮

Now

  • scikit-learn: n_jobs
  • SciPy: workers
  • PyTorch DataLoader: num_workers
  • Python: max_workers

Future 🚀

  • Everyone uses workers

Interactions 🥂

Oversubscription 💥

Python + native threading 🐍 + 🧵

from scipy import optimize

optimize.brute(
    computation_that_uses_8_cores, ...,
    workers=8,
)
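
With workers=8 wrapping a function that itself uses 8 cores, 64 threads compete for the machine. A hedged mitigation sketch, reusing the slide's elided call and placeholder function: cap the inner threadpools inside each worker.

from threadpoolctl import threadpool_limits

def objective(x):
    # 8 workers x 1 inner thread instead of 8 x 8
    with threadpool_limits(limits=1):
        return computation_that_uses_8_cores(x)

optimize.brute(objective, ..., workers=8)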

Current workarounds 🩹

Dask

Source

PyTorch's DataLoader

from torch.utils.data import DataLoader

dl = DataLoader(..., num_workers=8)

# torch/utils/data/_utils/worker.py
def _worker_loop(...):
    ...
    torch.set_num_threads(1)

Source

scikit-learn

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier()
search = HalvingGridSearchCV(
    clf,
    param_grid,
    n_jobs=8,
)
search.fit(X, y)





Multiple Parallel Abstractions 🧵 + 🧶

  • Python multiprocessing using fork + GCC OpenMP: stalls

  • Intel OpenMP + LLVM OpenMP on Linux: stalls

  • Multiple OpenBLAS libraries: sometimes slower





Multiple Parallel Abstractions 🧵 + 🧶

Using more than one parallel backend 🤯

Sources: polars, numba, scikit-learn, pandas



Proposal: Catch issues early 🔮

Source


Not a full solution 🩹

Multiple Native threading libraries 🧵 + 🧶

Source

CPU Waiting ⏳

for n_iter in range(100):
    UV = U @ V.T             # Use OpenBLAS with pthreads
    compute_with_openmp(UV)  # Use OpenMP

xianyi/OpenBLAS#3187
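
A hedged sketch of one possible mitigation, reusing the slide's placeholders (U, V, compute_with_openmp): keep OpenBLAS single-threaded so its pthreads pool is not busy-waiting while OpenMP runs.

from threadpoolctl import threadpool_limits

for n_iter in range(100):
    with threadpool_limits(limits=1, user_api="blas"):
        UV = U @ V.T          # OpenBLAS runs on one thread here
    compute_with_openmp(UV)   # the OpenMP threadpool is unaffected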

Current Workaround 🩹

Conda-forge + OpenMP

Source

Proposal 🔮

Ship PyPI wheels for OpenMP

Not a full solution 🩹

Defaults 🌈



NumPy

import numpy as np
out = np.sum(A_array, axis=1)

🐌 One Core 🐌



NumPy matmul

import numpy as np
out = A_array @ B_array

🏎️ All Cores 🏎️




NumPy matmul (Configuration)

Environment variable: OMP_NUM_THREADS

out = A_array @ B_array

threadpoolctl

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    out = A_array @ B_array





PyTorch

import torch
out = torch.sum(A_tensor, axis=1)

🏎️ All Cores 🏎️



PyTorch (Configuration)

  • Environment variable: OMP_NUM_THREADS
  • threadpoolctl

    with threadpool_limits(limits=2):
        out = torch.sum(A_tensor, axis=1)

  • PyTorch function

    import torch
    torch.set_num_threads(2)
    out = torch.sum(A_tensor, axis=1)





pandas apply

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 100))
roll = df.rolling(100)
out = roll.mean()

🐌 One Core 🐌



pandas apply + numba

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 100))
roll = df.rolling(100)
out = roll.mean(
    engine="numba",
    engine_kwargs={"parallel": True},
)

Read more

🏎️ All Cores 🏎️





pandas apply + numba (Configuration)

  • Environment variable: NUMBA_NUM_THREADS

  • Numba function

import numba
numba.set_num_threads(2)
out = roll.mean(engine="numba", engine_kwargs={"parallel": True})




LogisticRegression

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression().fit(...)
log_reg.predict(X)

🏎️ All Cores 🏎️

LogisticRegression (Configuration)

  • Environment variable: OMP_NUM_THREADS

  • threadpoolctl

with threadpool_limits(limits=2):
    log_reg.predict(X)




HistGradientBoostingClassifier

from sklearn.ensemble import HistGradientBoostingClassifier
hist = HistGradientBoostingClassifier()
hist.fit(X, y)

🏎️ All Cores 🏎️

HistGradientBoostingClassifier (Configuration)

  • Environment variable: OMP_NUM_THREADS

  • threadpoolctl

with threadpool_limits(limits=2):
    hist.predict(X)




polars

import polars as pl

out = (
    pl.scan_csv(...)
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_width").mean())
    .collect()
)

🏎️ All Cores 🏎️

polars (Configuration)

  • Environment variable: POLARS_MAX_THREADS
out = (
    pl.scan_csv(...)
    .filter(pl.col("sepal_length") > 5)
    ...
)
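
POLARS_MAX_THREADS is read when polars starts its global threadpool, so it has to be set before the import. A hedged sketch (threadpool_size is the polars helper for checking the result):

import os
os.environ["POLARS_MAX_THREADS"] = "2"  # before importing polars

import polars as pl
print(pl.threadpool_size())  # expected: 2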

Defaults 🌈

Proposal 🔮

Agree on a default? 😅

Proposal 🔮

Libraries document how to configure parallelism

State of Python Parallelism

APIs 💻

Interactions 🥂

Defaults 🌈

Can There Be Too Much Parallelism?

Thomas J. Fan
@thomasjpfan

This talk on GitHub: thomasjpfan/scipy-2023-too-parallel

Appendix 🪄

Python GIL + Parallelism? 👀

  • Python Multi-threading: Release the GIL
  • Python Multi-processing: Each process gets its own GIL
  • Native multi-threading: Release the GIL
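
A minimal sketch of the first bullet, assuming a NumPy operation that releases the GIL for the bulk of its work:

from concurrent.futures import ThreadPoolExecutor

import numpy as np

arrays = [np.random.randn(1_000, 1_000) for _ in range(4)]

# each matmul releases the GIL, so the four threads can actually overlap
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda a: a @ a, arrays))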

PEP 684 🔮: Sub-Interpreters

Need to explore, it could work ☀️

PEP 703 🔮: No-GIL

Also promising, but a harder lift for Python

Yes 👀
