Thomas J. Fan
@thomasjpfan
This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism
np.add vs @ (matmul) 🧪

How is np.add implemented?
How is linalg (@) implemented?

threadpoolctl

python -m threadpoolctl -i numpy
OpenBLAS

{
  "user_api": "blas",
  "internal_api": "openblas",
  "prefix": "libopenblas",
  "filepath": "...",
  "version": "0.3.21",
  "threading_layer": "pthreads",
  "architecture": "Zen",
  "num_threads": 32
}

MKL

{
  "user_api": "blas",
  "internal_api": "mkl",
  "prefix": "libmkl_rt",
  "filepath": "...",
  "version": "2022.1-Product",
  "threading_layer": "intel",
  "num_threads": 16
}
OpenBLAS (with pthreads)

Switching the BLAS implementation with conda:

conda install "libblas=*=*mkl"
conda install "libblas=*=*openblas"
conda install "libblas=*=*blis"
conda install "libblas=*=*accelerate"
conda install "libblas=*=*netlib"
Controlling the number of threads used by @? With environment variables or threadpoolctl.

OPENBLAS_NUM_THREADS (with pthreads)
MKL_NUM_THREADS (Intel's MKL)
OMP_NUM_THREADS (OpenMP)
BLIS_NUM_THREADS (BLIS)
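A minimal sketch of capping BLAS threads through the environment; the variable must be set before NumPy loads its BLAS library, and the thread count here is arbitrary:

# Shell: cap OpenBLAS at 4 threads for one run (script name is hypothetical)
#   OPENBLAS_NUM_THREADS=4 python train.py

# Python: set the variable before NumPy is imported
import os
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import numpy as np
Z = np.ones((2000, 2000)) @ np.ones((2000, 2000))  # uses at most 4 BLAS threads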
threadpoolctl

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=4, user_api="blas"):
    Z = X @ Y
What about np.add? np.add is single threaded! 🧪

PyTorch
Controlled with OMP_NUM_THREADS or MKL_NUM_THREADS
torch.set_num_threads
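A minimal sketch of controlling PyTorch's intra-op threads at runtime (the thread count of 4 and the matrix sizes are arbitrary):

import torch

torch.set_num_threads(4)        # intra-op parallelism (OpenMP/MKL)
print(torch.get_num_threads())  # -> 4

a = torch.rand(2000, 2000)
b = torch.rand(2000, 2000)
c = a @ b                       # matmul now uses at most 4 threads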
SciPy

scipy.linalg: BLAS-backed, same as NumPy's @

Configured with the workers parameter:
integrate.quad_vec
optimize.differential_evolution
optimize.brute
stats.multiscale_graphcorr
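A minimal sketch of the workers parameter with integrate.quad_vec (the integrand and worker count are made up; with an integer workers value a process pool is used, hence the main guard):

import numpy as np
from scipy import integrate

def f(x):
    # Vector-valued integrand, evaluated in parallel across subintervals
    return np.array([np.sin(x), np.cos(x), np.sin(2 * x)])

if __name__ == "__main__":
    # workers=2 evaluates the integrand with a pool of 2 processes
    result, error = integrate.quad_vec(f, 0.0, np.pi, workers=2)
    print(result)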
Configured with the workers parameter:
spatial.KDTree.query
spatial.cKDTree.query
spatial.KDTree.query_ball_point
spatial.cKDTree.query_ball_point
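A minimal sketch of a parallel KDTree query (the data is random; workers=-1 means use all available cores):

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
tree = KDTree(rng.random((100_000, 3)))

# workers=-1 uses all available CPU cores for the query
dist, idx = tree.query(rng.random((1_000, 3)), k=5, workers=-1)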
pthreads

Configured with the workers parameter:
scipy.fft
linalg.matmul_toeplitz
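A minimal sketch of a multithreaded FFT via the workers parameter (array size and thread count are arbitrary):

import numpy as np
from scipy import fft

x = np.random.default_rng(0).standard_normal((4096, 4096))

# workers=4 computes the 2-D FFT with 4 threads
X = fft.fft2(x, workers=4)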
Why pthreads and not OpenMP? 🧵

OpenMP runtimes do not always interact safely with multiprocessing (fork).
Read more in this SciPy issue.
engine="numba"
Aggregations with groupby
, rolling
Parallel over columns in DataFrames:
engine_kwargs={"parallel": True}
NUMBA_NUM_THREADS
numba.set_num_threads
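A minimal sketch of capping Numba's thread count at runtime (the count of 4 is arbitrary):

import numba

numba.set_num_threads(4)        # applies to subsequent parallel=True kernels
print(numba.get_num_threads())  # -> 4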
Numba (pthreads)

Need to be careful when combined with Python's parallel libraries:
multithreading
multiprocessing with spawn
multiprocessing with fork
multiprocessing with forkserver

Polars

pthreads with Rust library: rayon-rs/rayon
POLARS_MAX_THREADS
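A minimal sketch of capping the Polars thread pool; POLARS_MAX_THREADS is read when the library initializes its thread pool, so it should be set before importing polars (the value of 4 is arbitrary):

import os

# Must be set before the first `import polars`
os.environ["POLARS_MAX_THREADS"] = "4"

import polars as pl  # the rayon thread pool is now capped at 4 threads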
scikit-learn

Python parallelism (loky backend)
linalg: BLAS semantics from NumPy and SciPy (parallel by default)

Configured with the n_jobs parameter:
RandomForestClassifier and RandomForestRegressor
solver="sag" or "saga"
kneighbors or radius_neighbors
forest = RandomForestClassifier(n_jobs=4)
Also configured with the n_jobs parameter:
HalvingGridSearchCV and HalvingRandomSearchCV
MultiOutputClassifier, etc.

from sklearn.experimental import enable_halving_search_cv  # enables the Halving* estimators
from sklearn.model_selection import HalvingRandomSearchCV

halving = HalvingRandomSearchCV(..., n_jobs=4)
Learn more at loky.readthedocs.io/en/stable/
OpenMP (parallel by default)

HistGradientBoostingRegressor and HistGradientBoostingClassifier

Routines that use pairwise distances reductions:
metrics.pairwise_distances_argmin
manifold.TSNE
neighbors.KNeighborsClassifier and neighbors.KNeighborsRegressor
Controlled with OMP_NUM_THREADS or threadpoolctl.
When combined with loky's n_jobs, each worker is limited to cpu_count() // n_jobs threads to avoid oversubscription.
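A minimal sketch of limiting the OpenMP threads used by an OpenMP-backed estimator via threadpoolctl (the data and the thread limit are made up):

import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.random.default_rng(0).standard_normal((1_000, 20))
y = (X[:, 0] > 0).astype(int)

clf = HistGradientBoostingClassifier()

# Limit OpenMP to 4 threads for the duration of fit
with threadpool_limits(limits=4, user_api="openmp"):
    clf.fit(X, y)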
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier

clf = HistGradientBoostingClassifier()  # OpenMP
gsh = HalvingGridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    n_jobs=n_jobs,  # loky
)
%%time
gsh.fit(X, y)
# CPU times: user 15min 57s, sys: 791 ms, total: 15min 58s
# Wall time: 41.4 s
Issues when multiple parallel runtimes interact:

Using OpenBLAS (with pthreads) together with code that uses OpenMP
OpenMP built with different compilers
Two copies of OpenBLAS present (NumPy and SciPy wheels on PyPI)
Read more at: thomasjpfan.github.io/parallelism-python-libraries-design/
Sources: polars, numba, scikit-learn, pandas
Controlling parallelism:
OMP_NUM_THREADS, MKL_NUM_THREADS, etc.
torch.set_num_threads
threadpoolctl (BLAS and OpenMP)
n_jobs and workers parameters

Implementations: C, C++, Numba, Rust, Python, etc.
Parallelism interfaces: OpenMP, pthreads, Intel TBB

This talk on Github: github.com/thomasjpfan/pydata-nyc-2022-parallelism