
Can There Be Too Much Parallelism?

Thomas J. Fan
@thomasjpfan

This talk on GitHub: thomasjpfan/scipy-2023-too-parallel

Yes 👀

User?

Developer?

My Perspective

Parallelism in Scikit-learn

  • BLAS through SciPy
  • OpenMP + Cython
  • Python Multi-Threading
  • Python Multi-Processing
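
A hedged sketch of the joblib-based Python layer, exposed through n_jobs (the estimator choice here is illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
# n_jobs=4 lets joblib fit the trees in four parallel workers
clf = RandomForestClassifier(n_jobs=4, random_state=0).fit(X, y)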

Scope

State of Python Parallelism

APIs 💻

Interactions 🥂

Defaults 🌈

APIs 💻

Environment Variables 🌲

  • OpenMP: OMP_NUM_THREADS
  • MKL: MKL_NUM_THREADS
  • OpenBLAS: OPENBLAS_NUM_THREADS
  • Polars: POLARS_MAX_THREADS
  • Numba: NUMBA_NUM_THREADS
  • macOS Accelerate: VECLIB_MAXIMUM_THREADS
  • numexpr: NUMEXPR_NUM_THREADS
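
These variables are typically read when the native library initializes its threadpool, so they need to be set before the library is imported. A minimal sketch:

import os
os.environ["OMP_NUM_THREADS"] = "1"  # set before importing NumPy/SciPy

import numpy as np  # the BLAS threadpool now starts with a single thread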

Global Configuration 🌎

  • torch.set_num_threads
  • numba.set_num_threads
  • threadpoolctl.threadpool_limits
  • cv.setNumThreads
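
These functions flip a process-wide switch: every call made after them sees the new limit. A minimal sketch with PyTorch:

import torch

torch.set_num_threads(2)  # caps intra-op parallelism for all later calls
out = torch.ones(1000, 1000) @ torch.ones(1000, 1000)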

Block Configuration 🧱

threadpoolctl

from threadpoolctl import threadpool_limits
import numpy as np

with threadpool_limits(limits=2):
    a = np.random.randn(1000, 1000)
    a_squared = a @ a

Call-site ☎️

  • scikit-learn: n_jobs
  • SciPy: workers
  • PyTorch DataLoader: num_workers
  • Python: max_workers
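
A call-site argument scopes the limit to a single invocation. A minimal sketch with SciPy's workers:

import numpy as np
from scipy import fft

x = np.random.randn(2**20)
out = fft.fft(x, workers=4)  # only this call may use up to 4 workers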

APIs 💻

  • Environment Variables 🌲
  • Global Configuration 🌎
  • Block Configuration 🧱
  • Call-site ☎️





Proposal: Consistent APIs 🔮

Now

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export POLARS_MAX_THREADS=1
export NUMEXPR_NUM_THREADS=1

Future 🚀

Pragmatic

export OMP_NUM_THREADS=1

Better ☀️

export GOTO_NUM_THREADS=1

Proposal 🔮

Recognize more threadpools in threadpoolctl
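
threadpoolctl already introspects the pools it recognizes; a minimal sketch of that introspection:

from pprint import pprint
from threadpoolctl import threadpool_info

# one entry per detected threadpool (BLAS, OpenMP, ...), with its num_threads
pprint(threadpool_info())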

Proposal 🔮

Now

  • scikit-learn: n_jobs
  • SciPy: workers
  • PyTorch DataLoader: num_workers
  • Python: max_workers

Future 🚀

  • Everyone uses workers

Interactions 🥂

Oversubscription 💥

Python + native threading 🐍 + 🧵

from scipy import optimize

optimize.brute(
    computation_that_uses_8_cores, ...,
    workers=8,
)
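
With workers=8 wrapping a function that itself uses 8 cores, 64 threads compete for the machine. A hedged mitigation sketch, reusing the slide's elided call and placeholder function: cap the inner threadpools inside each worker.

from threadpoolctl import threadpool_limits

def objective(x):
    # 8 workers x 1 inner thread instead of 8 x 8
    with threadpool_limits(limits=1):
        return computation_that_uses_8_cores(x)

optimize.brute(objective, ..., workers=8)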

Current workarounds 🩹

Dask

Source

PyTorch's DataLoader

from torch.utils.data import DataLoader

dl = DataLoader(..., num_workers=8)

# torch/utils/data/_utils/worker.py
def _worker_loop(...):
    ...
    torch.set_num_threads(1)

Source

scikit-learn

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier()
search = HalvingGridSearchCV(
    clf,
    param_grid,
    n_jobs=8,
)
search.fit(X, y)





Multiple Parallel Abstractions 🧵 + 🧶

  • Python multiprocessing using fork + GCC OpenMP: stalls

  • Intel OpenMP + LLVM OpenMP on Linux: stalls

  • Multiple OpenBLAS libraries: sometimes slower





Multiple Parallel Abstractions 🧵 + 🧶

Using more than one parallel backend 🤯

Sources: polars, numba, scikit-learn, pandas



Proposal: Catch issues early 🔮

Source


Not a full solution 🩹

Multiple Native threading libraries 🧵 + 🧶

Source

CPU Waiting ⏳

for n_iter in range(100):
    UV = U @ V.T             # Use OpenBLAS with pthreads
    compute_with_openmp(UV)  # Use OpenMP

xianyi/OpenBLAS#3187
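
A hedged sketch of one possible mitigation, reusing the slide's placeholders (U, V, compute_with_openmp): keep OpenBLAS single-threaded so its pthreads pool is not busy-waiting while OpenMP runs.

from threadpoolctl import threadpool_limits

for n_iter in range(100):
    with threadpool_limits(limits=1, user_api="blas"):
        UV = U @ V.T          # OpenBLAS runs on one thread here
    compute_with_openmp(UV)   # the OpenMP threadpool is unaffected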

Current Workaround 🩹

Conda-forge + OpenMP

Source

Proposal 🔮

Ship PyPI wheels for OpenMP

Not a full solution 🩹

Defaults 🌈



NumPy

import numpy as np
out = np.sum(A_array, axis=1)

🐌 One Core 🐌



NumPy matmul

import numpy as np
out = A_array @ B_array

🏎️ All Cores 🏎️




NumPy matmul (Configuration)

Environment variable: OMP_NUM_THREADS

out = A_array @ B_array

threadpoolctl

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    out = A_array @ B_array





PyTorch

import torch
out = torch.sum(A_tensor, axis=1)

🏎️ All Cores 🏎️



PyTorch (Configuration)

  • Environment variable: OMP_NUM_THREADS
  • threadpoolctl

    with threadpool_limits(limits=2):
        out = torch.sum(A_tensor, axis=1)

  • PyTorch function

    import torch
    torch.set_num_threads(2)
    out = torch.sum(A_tensor, axis=1)





pandas apply

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 100))
roll = df.rolling(100)
out = roll.mean()

🐌 One Core 🐌



pandas apply + numba

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10_000, 100))
roll = df.rolling(100)
out = roll.mean(
    engine="numba",
    engine_kwargs={"parallel": True},
)

Read more

🏎️ All Cores 🏎️





pandas apply + numba (Configuration)

  • Environment variable: NUMBA_NUM_THREADS

  • Numba function

import numba
numba.set_num_threads(2)
out = roll.mean(engine="numba", engine_kwargs={"parallel": True})




LogisticRegression

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression().fit(...)
log_reg.predict(X)

🏎️ All Cores 🏎️

LogisticRegression (Configuration)

  • Environment variable: OMP_NUM_THREADS

  • threadpoolctl

with threadpool_limits(limits=2):
    log_reg.predict(X)




HistGradientBoostingClassifier

from sklearn.ensemble import HistGradientBoostingClassifier
hist = HistGradientBoostingClassifier()
hist.fit(X, y)

🏎️ All Cores 🏎️

HistGradientBoostingClassifier (Configuration)

  • Environment variable: OMP_NUM_THREADS

  • threadpoolctl

with threadpool_limits(limits=2):
    hist.predict(X)




polars

import polars as pl

out = (
    pl.scan_csv(...)
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.col("sepal_width").mean())
    .collect()
)

🏎️ All Cores 🏎️

polars (Configuration)

  • Environment variable: POLARS_MAX_THREADS
out = (
    pl.scan_csv(...)
    .filter(pl.col("sepal_length") > 5)
    ...
)
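
POLARS_MAX_THREADS is read when polars starts its global threadpool, so it has to be set before the import. A hedged sketch (threadpool_size is the polars helper for checking the result):

import os
os.environ["POLARS_MAX_THREADS"] = "2"  # before importing polars

import polars as pl
print(pl.threadpool_size())  # expected: 2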

Defaults 🌈

Proposal 🔮

Agree on a default? 😅

Proposal 🔮

Libraries document how to configure parallelism

State of Python Parallelism

APIs 💻

Interactions 🥂

Defaults 🌈

Can There Be Too Much Parallelism?

Thomas J. Fan
@thomasjpfan

This talk on GitHub: thomasjpfan/scipy-2023-too-parallel

Appendix 🪄

Python GIL + Parallelism? 👀

  • Python Multi-threading: Release the GIL
  • Python Multi-processing: Each process gets its own GIL
  • Native multi-threading: Release the GIL
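
A minimal sketch of the first bullet, assuming a NumPy operation that releases the GIL for the bulk of its work:

from concurrent.futures import ThreadPoolExecutor

import numpy as np

arrays = [np.random.randn(1_000, 1_000) for _ in range(4)]

# each matmul releases the GIL, so the four threads can actually overlap
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda a: a @ a, arrays))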

PEP 684 🔮: Sub-Interpreters

Need to explore, it could work ☀️

PEP 703 🔮: No-GIL

Also promising, but a harder lift for Python

Yes 👀
