Clustering
K-Means Clustering
-
class graspy.cluster.KMeansCluster(max_clusters=2, random_state=None)
KMeans Cluster.
It fits one model for each candidate number of clusters up to max_clusters. The best model is the one with the highest silhouette score.
Parameters: - max_clusters : int, default=2
The maximum number of clusters to consider.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
Attributes: - n_clusters_ : int
Optimal number of clusters. If y is given, it is based on the largest ARI. Otherwise, it is based on the highest silhouette score.
- model_ : KMeans object
KMeans object fitted with the optimal number of clusters.
- silhouette_ : list
List of silhouette scores computed for every candidate number of clusters given by range(2, max_clusters + 1).
- ari_ : list
Only computed when y is given. List of ARI values computed for every candidate number of clusters given by range(2, max_clusters + 1).
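The selection rule described above can be sketched with scikit-learn directly. This is a minimal illustration and not graspy's own code: the helper name select_kmeans, the synthetic data, and the choice of n_init=10 are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def select_kmeans(X, max_clusters=2, random_state=None):
    """Fit KMeans for k = 2..max_clusters and keep the best silhouette.

    Hypothetical helper for illustration; not part of graspy.
    """
    scores, models = [], []
    for k in range(2, max_clusters + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        scores.append(silhouette_score(X, model.labels_))
        models.append(model)
    best = int(np.argmax(scores))  # a higher silhouette means better separation
    return models[best], scores


# Three well-separated blobs in 2-D (synthetic data, for illustration only).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

best_model, scores = select_kmeans(X, max_clusters=5, random_state=0)
print(best_model.n_clusters, [round(s, 3) for s in scores])
```

The silhouette score is undefined for a single cluster, which is why the candidate range starts at 2.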
-
fit(self, X, y=None)
Fits the KMeans model to the data.
Parameters: - X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- y : array-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
Returns: - self
-
fit_predict(self, X, y=None)
Fit the models and predict clusters based on the best model.
Parameters: - X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- y : array-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
Returns: - labels : array, shape (n_samples,)
Component labels.
- ari : float
Adjusted Rand index. Only returned if y is given.
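For readers unfamiliar with the adjusted Rand index (ARI) returned when y is given, the following is a small pure-Python sketch of the standard pair-counting formula. It is meant to show what the number measures; the function name is hypothetical, and graspy presumably relies on a library implementation rather than code like this.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI for two labelings of the same samples.

    Illustrative helper; requires at least two samples.
    """
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # true cluster sizes
    cols = Counter(labels_pred)                     # predicted cluster sizes

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Both labelings are trivial (one cluster, or all singletons).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
fully_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, fully_crossed)
```

ARI is invariant to relabeling of the clusters, so a perfect clustering scores 1 even when the integer labels differ from those in y.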
-
get_params(self, deep=True)
Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
-
predict(self, X, y=None)
Predict clusters based on the best model.
Parameters: - X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- y : array-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
Returns: - labels : array, shape (n_samples,)
Component labels.
- ari : float
Adjusted Rand index. Only returned if y is given.
-
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns: - self
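The <component>__<parameter> naming is the scikit-learn estimator convention. The sketch below shows it on a plain scikit-learn Pipeline rather than on a graspy class; the step names "scale" and "kmeans" are arbitrary labels chosen for the example.

```python
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: each step is addressable by its name.
pipe = Pipeline([("scale", StandardScaler()), ("kmeans", KMeans(n_clusters=2))])

# <component>__<parameter> reaches into a step; set_params returns the estimator.
returned = pipe.set_params(kmeans__n_clusters=4, scale__with_mean=False)

# get_params(deep=True) reports nested parameters under the same naming scheme.
params = pipe.get_params(deep=True)
print(params["kmeans__n_clusters"], params["scale__with_mean"])
```

Because set_params returns self, calls can be chained, for example pipe.set_params(...).fit(X).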
Gaussian Mixture Models Clustering
-
class graspy.cluster.GaussianCluster(min_components=2, max_components=None, covariance_type='full', random_state=None)
Gaussian Mixture Model (GMM)
Representation of a Gaussian mixture model probability distribution. This class estimates the parameters of a Gaussian mixture distribution. It fits one model for each candidate number of components, as set by min_components and max_components, and for each requested covariance structure. The best model is the one with the lowest BIC score.
Parameters: - min_components : int, default=2
The minimum number of mixture components to consider (unless max_components=None, in which case this is the maximum number of components to consider). If max_components is not None, min_components must be less than or equal to max_components.
- max_components : int or None, default=None
The maximum number of mixture components to consider. Must be greater than or equal to min_components.
- covariance_type : {'full' (default), 'tied', 'diag', 'spherical', 'all'} or list/array of str, optional
String or list/array describing the type of covariance parameters to use. If a string, it must be one of:
- 'full': each component has its own general covariance matrix
- 'tied': all components share the same general covariance matrix
- 'diag': each component has its own diagonal covariance matrix
- 'spherical': each component has its own single variance
- 'all': considers all covariance structures in ['spherical', 'diag', 'tied', 'full']
If a list/array, it must contain only the strings 'spherical', 'tied', 'diag', and/or 'full'.
- random_state : int, RandomState instance or None, optional (default=None)
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
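The four covariance structures listed under covariance_type differ in how many free covariance parameters they carry, which matters because BIC penalizes parameter count. The following sketch simply counts them from the definitions above; the function name is hypothetical and not part of graspy.

```python
def n_covariance_params(covariance_type, n_components, n_features):
    """Count free covariance parameters for each covariance structure.

    Illustrative helper derived from the definitions of each structure.
    """
    k, d = n_components, n_features
    per_full_matrix = d * (d + 1) // 2  # symmetric d x d matrix
    counts = {
        "full": k * per_full_matrix,  # one general matrix per component
        "tied": per_full_matrix,      # one general matrix shared by all
        "diag": k * d,                # one variance per feature per component
        "spherical": k,               # one variance per component
    }
    return counts[covariance_type]


# Example: 3 components in 4 dimensions.
summary = {
    cov: n_covariance_params(cov, n_components=3, n_features=4)
    for cov in ("full", "tied", "diag", "spherical")
}
print(summary)
```

More flexible structures fit the data more closely but pay a larger BIC penalty, so the selected covariance type reflects a trade-off rather than raw likelihood alone.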
Attributes: - n_components_ : int
Optimal number of components based on BIC.
- covariance_type_ : str
Optimal covariance type based on BIC.
- model_ : GaussianMixture object
GaussianMixture object fitted with the optimal number of components and optimal covariance structure.
- bic_ : pandas.DataFrame
A pandas DataFrame of BIC values computed for every candidate number of components given by range(min_components, max_components + 1) and all covariance structures given by covariance_type.
- ari_ : pandas.DataFrame
Only computed when y is given. A pandas DataFrame of ARI values computed for every candidate number of components given by range(min_components, max_components + 1) and all covariance structures given by covariance_type.
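A grid search over component counts and covariance structures, scored by BIC, can be sketched with scikit-learn's GaussianMixture. This is an illustration of the idea and not graspy's implementation: the helper select_gmm_by_bic, the dictionary used in place of a DataFrame, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, n_components_range, covariance_types, random_state=None):
    """Fit one GaussianMixture per (covariance type, k) pair; keep the lowest BIC.

    Hypothetical helper for illustration; not part of graspy.
    """
    bic, models = {}, {}
    for cov in covariance_types:
        for k in n_components_range:
            model = GaussianMixture(
                n_components=k, covariance_type=cov, random_state=random_state
            ).fit(X)
            bic[(cov, k)] = model.bic(X)
            models[(cov, k)] = model
    best_key = min(bic, key=bic.get)  # lower BIC is better
    return models[best_key], best_key, bic


# Three well-separated blobs in 2-D (synthetic data, for illustration only).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

best_model, (best_cov, best_k), bic = select_gmm_by_bic(
    X,
    n_components_range=range(1, 6),
    covariance_types=("spherical", "full"),
    random_state=0,
)
print(best_cov, best_k)
```

The bic dictionary plays the role of the bic_ table: one entry per combination of covariance structure and number of components.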
-
fit(self, X, y=None)
Fits the Gaussian mixture model to the data. Model parameters are estimated with the EM algorithm.
Parameters: - X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- y : array-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
Returns: - self
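To make the EM step mentioned in fit concrete, here is a textbook one-dimensional EM loop in pure Python. It is a simplified sketch, not the routine used by graspy or scikit-learn: the function names, the fixed iteration count, the initial unit variances, and the 1e-6 variance floor are all assumptions made for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, n_iter=50):
    """Run EM for a 1-D Gaussian mixture, starting from the given means."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k       # assumed starting variances
    weights = [1.0 / k] * k     # start from equal mixing weights
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances from the posteriors.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a degenerate component
    return weights, means, variances


# Two tight groups around -2 and 3 (made-up data for illustration).
data = [-2.2, -2.0, -1.8, -2.1, -1.9, 3.0, 3.2, 2.8, 3.1, 2.9]
weights, means, variances = em_gmm_1d(data, means=[-1.0, 1.0])
print(weights, means, variances)
```

Predicted component labels then follow by assigning each point to the component with the largest posterior probability from the E-step.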
-
fit_predict(self, X, y=None)
Fit the models and predict clusters based on the best model.
Parameters: - X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- y : array-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
Returns: - labels : array, shape (n_samples,)
Component labels.
- ari : float
Adjusted Rand index. Only returned if y is given.
-
get_params(self, deep=True)
Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
-
predict(self, X, y=None)
Predict clusters based on the best model.
Parameters: - X : array-like, shape (n_samples, n_features)
List of n_features-dimensional data points. Each row corresponds to a single data point.
- y : array-like, shape (n_samples,), optional (default=None)
List of labels for X if available. Used to compute ARI scores.
Returns: - labels : array, shape (n_samples,)
Component labels.
- ari : float
Adjusted Rand index. Only returned if y is given.
-
set_params(self, **params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Returns: - self