Parallel CCD++ on GPU for Matrix Factorization

Abstract

Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for this task, including Cyclic Coordinate Descent (CCD). Recently, a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. The key considerations are reducing the volume of data transferred to and from GPU global memory and minimizing intra-warp load imbalance. Starting from a base implementation, we successively improve the GPU implementation of CCD++ by applying loop fusion and tiling, guided by performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).
