I am trying to fit a Gaussian mixture model (GMM) to my data. The dataset has 37 features (some int, others float). When the dataset has a small number of rows (<200), the GMM fits fine. When I try a larger dataset (500 rows), the fit sometimes fails (2 out of 15 runs) with this error:
ValueError: Fitting the mixture model failed because some components have ill-defined empirical covariance (for instance caused by singleton or collapsed samples). Try to decrease the number of components, or increase reg_covar.
Additionally, the stack trace contains entries like:
numpy.linalg.LinAlgError: x-th leading minor is not positive definite
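For reference, here is roughly what my fitting code looks like. This is a minimal sketch with a synthetic stand-in for the data; the real feature matrix is loaded from a DataFrame with the 37 columns described above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real data: 500 rows x 37 mixed int/float features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 37))

gmm = GaussianMixture(n_components=2)  # defaults: covariance_type="full", reg_covar=1e-6
gmm.fit(X)  # on the real data this sometimes raises the ValueError above
```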
For an even larger dataset (1550 rows), the failures become much more frequent (13 out of 15 runs).
The dataset is too large to show here, but two of its columns are IP addresses converted to large integers (using Python's ipaddress module). I have tried fitting after removing those columns, but the failure still occurs sometimes. The small dataset is a subset of the 500-row one, which in turn is a subset of the 1550-row one.
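The IP conversion is done roughly like this ("src_ip" is a hypothetical column name used for illustration):

```python
import ipaddress
import pandas as pd

df = pd.DataFrame({"src_ip": ["10.0.0.1", "192.168.1.17"]})
df["src_ip"] = df["src_ip"].map(lambda s: int(ipaddress.ip_address(s)))
print(df["src_ip"].tolist())  # [167772161, 3232235793]
```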
Any idea why this error occurs only on some runs? If it were a problem with the data itself, I would expect every run to fail.
I have tried setting reg_covar to larger values, but it has to be raised to an unreasonably large value (0.3) before it helps, and even then some runs still fail on the 500-row and 1550-row datasets.
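That is, the same setup as in the earlier sketch, but with the covariance regularizer raised:

```python
from sklearn.mixture import GaussianMixture

# reg_covar is added to the diagonal of each component's covariance;
# the scikit-learn default is 1e-6.
gmm = GaussianMixture(n_components=2, reg_covar=0.3)
gmm.fit(X)  # X as in the earlier snippet; still fails on some runs
```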
Update: Setting n_components to 1 seems to get rid of the problem. Earlier I was using 2 (the task is novelty detection).