`sklearn.cluster`.DBSCAN¶

class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)[source]¶

Perform DBSCAN clustering from vector array or distance matrix.

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.

See also

OPTICS: A similar clustering at multiple values of eps. Our implementation is optimized for memory usage.

Notes

For an example, see examples/cluster/plot_dbscan.py.

This implementation bulk-computes all neighborhood queries, which increases the memory complexity to O(n.d) where d is the average number of neighbors, while original DBSCAN had memory complexity O(n). It may attract a higher memory complexity when querying these nearest neighborhoods, depending on the algorithm.

One way to avoid the query complexity is to pre-compute sparse neighborhoods in chunks using NearestNeighbors.radius_neighbors_graph with mode='distance', then using metric='precomputed' here.

Another way to reduce memory and computation time is to remove (near-)duplicate points and use sample_weight instead.

cluster.OPTICS provides a similar clustering with lower memory usage.

References

Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.

Examples

>>> from sklearn.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering.labels_
array([ 0,  0,  0,  1,  1, -1])
>>> clustering
DBSCAN(eps=3, min_samples=2)

Methods

`fit`(self, X[, y, sample_weight])	Perform DBSCAN clustering from features, or distance matrix.
`fit_predict`(self, X[, y, sample_weight])	Perform DBSCAN clustering from features or distance matrix, and return cluster labels.
`get_params`(self[, deep])	Get parameters for this estimator.
`set_params`(self, \\params)	Set the parameters of this estimator.

__init__(self, eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)[source]¶: Initialize self. See help(type(self)) for accurate signature.

fit(self, X, y=None, sample_weight=None)[source]¶

Perform DBSCAN clustering from features, or distance matrix.

Parameters

Xarray-like or sparse matrix, shape (n_samples, n_features), or (n_samples, n_samples): Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
sample_weightarray, shape (n_samples,), optional: Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
yIgnored: Not used, present here for API consistency by convention.

Returns

self

fit_predict(self, X, y=None, sample_weight=None)[source]¶

Perform DBSCAN clustering from features or distance matrix, and return cluster labels.

Parameters

Xarray-like or sparse matrix, shape (n_samples, n_features), or (n_samples, n_samples): Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
sample_weightarray, shape (n_samples,), optional: Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
yIgnored: Not used, present here for API consistency by convention.

Returns

labelsndarray, shape (n_samples,): Cluster labels. Noisy samples are given the label -1.

get_params(self, deep=True)[source]¶

Get parameters for this estimator.

Parameters

deepboolean, optional: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

paramsmapping of string to any: Parameter names mapped to their values.

set_params(self, **params)[source]¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self

sklearn.cluster.DBSCAN¶

`sklearn.cluster`.DBSCAN¶