Add GMM #113

gmaze · 2018-01-08T22:21:29Z

Hi all,
Is there any plan to implement a parallel version of GMM (Gaussian Mixture Modelling)?
Thanks
g

eg: http://dx.doi.org/10.1109/CSAE.2012.6272849

TomAugspurger · 2018-01-08T22:46:03Z

That'd certainly be in scope. I likely won't have time to work on this until the end of the month, but may be able to after that.

Do you have any other references for parallel or distributed GMM? That paper doesn't seem to be publicly available.

gmaze · 2018-01-09T09:05:16Z

The paper is here:
Yang_et_al.IEEE2012.pdf
But I'm not sure that this is the most relevant implementation for dask-ml; a broader literature review is needed.

TomAugspurger · 2018-01-09T16:15:13Z

Gave a quick skim of scikit-learn's implementation. Translating that to work on dask arrays doesn't look too difficult. Unless I missed something, the fanciest thing was a Cholesky decomposition, which is implemented in dask.array.
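To make the role of the Cholesky step concrete, here is a minimal NumPy sketch (not scikit-learn's actual code) of evaluating a Gaussian log-density through a Cholesky factor. The function name `gaussian_logpdf_chol` is made up for illustration. The factorization is applied to a (d, d) covariance matrix, so its cost depends on the number of features rather than on the number of samples.

```python
import numpy as np


def gaussian_logpdf_chol(X, mean, cov):
    """Log-density of each row of X under N(mean, cov), via a Cholesky factor.

    Hypothetical helper for illustration only.
    """
    d = X.shape[1]
    # cov = L @ L.T with L lower-triangular, shape (d, d).
    L = np.linalg.cholesky(cov)
    # Solve L z = (x - mean) for every row, so that sum(z**2) is the
    # squared Mahalanobis distance of that row.
    z = np.linalg.solve(L, (X - mean).T).T
    # log|cov| = 2 * sum(log(diag(L)))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=1))
```

Only the per-row solve touches the full dataset, and that part operates independently on each block of rows.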

@gmaze do you have any interest in working on this?

mrocklin · 2018-01-09T16:17:37Z

I think it would be interesting to see how Dask's cholesky factorization behaves here, but it may be that other algorithms more suited to large distributed datasets exist. A literature search is possibly still warranted here.


TomAugspurger · 2018-01-09T16:46:01Z

Agreed.

gmaze · 2018-01-11T08:20:44Z

I would certainly be interested in working on this, but I have no time before the end of February and would need a lot of help to follow the dask-ml code logic.

At this point, I don't quite understand where the need for a dedicated distributed method arises, i.e. why people publish papers on new GMM algorithms rather than distributing the bottleneck operation of the classic EM algorithm for a GMM (which is, as you pointed out, the Cholesky factorization of the covariance matrices).

A first step may be to benchmark the regular GMM EM algorithm with and without dask-ml-optimized operators.
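As a rough illustration of what "distributing the classic EM algorithm" could look like, here is a minimal NumPy sketch of one EM iteration for a full-covariance GMM in which the E-step runs chunk by chunk and only small sufficient statistics are summed across chunks. It is not dask-ml or scikit-learn code; all function names (`log_gauss`, `chunk_stats`, `m_step`, `em_step`) and the `reg` parameter are assumptions made for this example.

```python
import numpy as np


def log_gauss(X, mean, cov):
    """Row-wise Gaussian log-density using a Cholesky factor of cov."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (X - mean).T).T
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(z ** 2, axis=1))


def chunk_stats(X, weights, means, covs):
    """E-step on one chunk; returns sufficient statistics for that chunk."""
    log_p = np.stack(
        [np.log(w) + log_gauss(X, m, c) for w, m, c in zip(weights, means, covs)],
        axis=1,
    )  # (n, K)
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilise before exponentiating
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)    # responsibilities, rows sum to 1
    Nk = resp.sum(axis=0)                           # (K,)
    Sk = resp.T @ X                                 # (K, d)
    SSk = np.einsum('nk,ni,nj->kij', resp, X, X)    # (K, d, d)
    return Nk, Sk, SSk


def m_step(stats, reg=1e-6):
    """M-step from summed statistics; reg is a small diagonal regulariser."""
    Nk, Sk, SSk = stats
    weights = Nk / Nk.sum()
    means = Sk / Nk[:, None]
    # Weighted E[x x^T] - mu mu^T gives the component covariance.
    covs = SSk / Nk[:, None, None] - np.einsum('ki,kj->kij', means, means)
    covs += reg * np.eye(means.shape[1])
    return weights, means, covs


def em_step(chunks, weights, means, covs):
    """One EM iteration over an iterable of row chunks."""
    totals = None
    for X in chunks:
        s = chunk_stats(X, weights, means, covs)
        totals = s if totals is None else tuple(a + b for a, b in zip(totals, s))
    return m_step(totals)
```

In this sketch the per-chunk work is independent and the reduction involves only arrays of size K, K x d and K x d x d, which is one reason a benchmark of plain EM on chunked data seems like a reasonable baseline.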

DaniJonesOcean · 2019-12-12T14:33:56Z

Hi all. I would like to flag my interest in this project as well. It doesn't look like there has been much activity in this area lately.

Does anyone have plans to work on this issue in the near-term future? I would be interested in contributing, but like gmaze I would need help getting started.

gmaze · 2019-12-12T14:50:01Z

I haven't had time to work on this yet because I wanted to focus on releasing a clean version of http://github.com/obidam/pyxpcm, which now implements a choice of two stats backends (scikit-learn or dask_ml).
Now that it's done, I plan to focus on optimisation, hence this issue of having EM/GMM optimized for dask_ml.
But I can't guarantee any timeline.

TomAugspurger · 2019-12-12T15:13:35Z

Thanks for the update @gmaze.

remiadon · 2020-12-10T15:04:34Z

Hi,

I did a bit of literature searching on my side.

IMO the resource mentioned by @gmaze is a good start, but it's basically a re-implementation designed to reduce data exchange on a cluster of machines. Quoting page 2:

we developed a new framework called Distrim from scratch, aiming to minimize space and communication overheads as much as possible, and to maximize the usage of computational power of multicore clusters as much as possible

I suggest using a different methodology. One concept that I find particularly interesting is the coreset.
A coreset is a subset of the original data that comes with theoretical guarantees on its shape (the shape of a coreset is close to the shape of the original data).

Coresets have already proven useful for large-scale fitting of Gaussian mixtures, as well as for k-means and k-median clustering.
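To give a feel for what a coreset construction can look like, here is a minimal NumPy sketch of importance sampling in the style of the "lightweight coreset" from the k-means literature. It is a swapped-in, simpler technique, not the GMM-specific construction from the papers referenced in this comment, and the function name `lightweight_coreset` is hypothetical.

```python
import numpy as np


def lightweight_coreset(X, m, seed=None):
    """Sample m weighted points from X (k-means style lightweight coreset).

    Points far from the data mean are sampled more often, and each sampled
    point gets weight 1 / (m * q) so that weighted sums stay unbiased.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sq_dist = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    # Mix a uniform term with a term proportional to squared distance.
    q = 0.5 / n + 0.5 * sq_dist / sq_dist.sum()
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights
```

The sampling distribution needs only the data mean and the sum of squared distances, both of which are simple reductions over a chunked array.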

proposed solution

Implement a Coreset class (or method, anyway) using dask.array. This would return a subsample of the original dask.array as a numpy.array, along with the associated weights for those points, also as a numpy array.
Tweak sklearn.mixture.GaussianMixture to accept weighted datasets, and run the clustering via this sklearn model.
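For the second step, the change needed for weighted data is conceptually small: each point's responsibilities are scaled by its weight before the M-step. The sketch below shows this in plain NumPy; it is not a patch to scikit-learn (as far as I know, `GaussianMixture.fit` does not take a `sample_weight` argument), and the function name `weighted_m_step` and the `reg` parameter are assumptions.

```python
import numpy as np


def weighted_m_step(X, resp, sample_weight, reg=1e-6):
    """M-step of a full-covariance GMM with per-sample weights.

    X: (n, d) data, resp: (n, K) responsibilities, sample_weight: (n,).
    """
    r = resp * sample_weight[:, None]            # weighted responsibilities
    Nk = r.sum(axis=0)                           # effective count per component
    weights = Nk / Nk.sum()
    means = (r.T @ X) / Nk[:, None]
    diff = X[:, None, :] - means[None, :, :]     # (n, K, d)
    covs = np.einsum('nk,nki,nkj->kij', r, diff, diff) / Nk[:, None, None]
    covs += reg * np.eye(X.shape[1])             # keep covariances well-conditioned
    return weights, means, covs
```

With integer weights this gives the same result as repeating each row that many times, which is a convenient sanity check for any weighted implementation.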

I believe this methodology is compatible with the current philosophy of dask-ml ("re-implement at scale if required, or simply allow sklearn estimators to scale with a different methodology"). It can also benefit other methods, not only GMMs.

Regards,
Rémi

References
Scalable Training of Mixture Models via Coresets
Coresets for k-Means and k-Median Clustering and their Applications

TomAugspurger · 2020-12-10T17:08:16Z

Thanks for sharing @remiadon. One API question around your proposed Coreset class.

This would return a subsample of the original dask.array as a numpy.array, along with associated weights for those points, also as a numpy array

I see the suggestion of a method like coreset(*arrays) that handles all the logic of extracting a coreset from a dask Array. But for an end-user API, I instead think of some kind of meta-estimator like

>>> model = Coreset(sklearn.mixture.GaussianMixture())
>>> model.fit(big_X, big_y)  # extracts the coreset, fits the weighted(?) sklearn GMM on the coreset (small, in memory)
>>> model.predict(big_X)  # Dask Array of predictions
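A minimal sketch of how such a meta-estimator might delegate to the wrapped model is below. Everything here is hypothetical: the `Coreset` class does not exist in dask-ml, uniform sampling stands in for a real coreset construction, the wrapped estimator is assumed to accept `sample_weight` in `fit`, and `WeightedThreshold` is a toy stand-in used only so the example runs with NumPy alone.

```python
import numpy as np


class Coreset:
    """Hypothetical meta-estimator: fit the wrapped estimator on a weighted subsample."""

    def __init__(self, estimator, n_samples=100, seed=None):
        self.estimator = estimator
        self.n_samples = n_samples
        self.seed = seed

    def fit(self, X, y=None):
        rng = np.random.default_rng(self.seed)
        n = X.shape[0]
        m = min(self.n_samples, n)
        # Placeholder for a real coreset construction: uniform sampling
        # without replacement, each point standing in for n / m points.
        idx = rng.choice(n, size=m, replace=False)
        weights = np.full(m, n / m)
        # np.asarray materialises the small subsample in memory.
        self.estimator.fit(np.asarray(X[idx]), sample_weight=weights)
        return self

    def predict(self, X):
        # Delegates to the fitted in-memory estimator.
        return self.estimator.predict(X)


class WeightedThreshold:
    """Toy estimator: splits on the weighted mean of the first feature."""

    def fit(self, X, sample_weight=None):
        self.threshold_ = np.average(X[:, 0], weights=sample_weight)
        return self

    def predict(self, X):
        return (X[:, 0] > self.threshold_).astype(int)
```

In a real implementation, `predict` would presumably map the fitted estimator over the blocks of a Dask Array rather than call it on the whole input at once.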

remiadon · 2020-12-10T21:42:41Z

@TomAugspurger, a Coreset meta-estimator would be great!

Another way of achieving an equivalent goal would be to implement a CoresetTransformer that would return the data fully transformed (the set of points, weighted). But as far as I know, sklearn prohibits a transformer from having a different number of rows between its input and output.

Either of those solutions suits me; I can try submitting a PR.

TomAugspurger · 2020-12-10T21:46:01Z

Yes, the Transformer would also work well, but would, I think, require scikit-learn/enhancement_proposals#15. I haven't read through that in a while, so I don't know how it proposes to deal with weights.

Anyway, I think for now an implementation using a metaestimator would be most welcome. I think the logic of selecting the coreset is likely to be the most difficult part, regardless of the API :)

remiadon · 2021-02-24T15:40:51Z

I created a PR here: #799

This is work in progress for now, as most of the sampling methods were designed for KMeans, and usage with Gaussian Mixture is still a bit obscure to me.

gmaze mentioned this issue Dec 12, 2019

Implement GMM with dask_ml backend obidam/pyxpcm#13

Open
