Publication | Closed Access
Parallel CCD++ on GPU for Matrix Factorization
Citations: 24
References: 24
Year: 2017
Unknown Venue
Cluster Computing, Parallel CCD++, Engineering, Machine Learning, Computer Architecture, GPU Computing, Data Science, Data Mining, Incomplete Matrix, Parallel Computing, Computer Engineering, Computer Science, GPU Cluster, Computational Science, GPU Architecture, Matrix Factorization, Cyclic Coordinate Descent, Parallel Programming, Vectorization
Matrix factorization of an incomplete matrix is useful in applications such as recommender systems. Several iterative algorithms have been proposed for matrix factorization in recommender systems, including Cyclic Coordinate Descent (CCD). Recently, a variant of CCD called CCD++ was developed as an attractive algorithm for parallel implementation on multicore processors. In this paper, we address the parallelization of CCD++ for a GPU. Key considerations are reducing the volume of data transferred to and from GPU global memory and minimizing intra-warp load imbalance. Starting with a base implementation, we successively improve the GPU implementation of CCD++ by applying loop fusion and tiling, guided by performance insights from hardware counter data. The resulting algorithm is shown to be faster than the best reported multicore implementation of CCD++ as well as the best reported GPU implementation of matrix factorization (using ALS, Alternating Least Squares).
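For readers unfamiliar with CCD++, the following is a minimal sequential sketch in plain Python. It is not the paper's GPU implementation. It assumes the commonly described CCD++ formulation: a squared-error objective over the observed entries with an L2 penalty `lam`, updated one latent feature at a time through a rank-one fit to the residual. The function names `ccdpp` and `rmse`, the parameter defaults, and the initialization are all illustrative choices, not taken from the paper.

```python
import math


def ccdpp(ratings, m, n, k=2, lam=0.05, outer_iters=20, inner_iters=3):
    """Factor an incomplete m x n matrix as W @ H^T (sequential sketch).

    ratings: list of (row, col, value) tuples for the observed entries.
    Returns W (m x k) and H (n x k) as nested lists.
    """
    # W starts at zero so the residual initially equals the observed values.
    W = [[0.0] * k for _ in range(m)]
    # Small deterministic non-zero start for H (an arbitrary choice).
    H = [[0.1 * (1 + ((j + t) % 3)) for t in range(k)] for j in range(n)]
    rows = [i for i, _, _ in ratings]
    cols = [j for _, j, _ in ratings]
    R = [v for _, _, v in ratings]  # residual on observed entries only
    nnz = len(R)

    for _ in range(outer_iters):
        for t in range(k):
            u = [W[i][t] for i in range(m)]
            v = [H[j][t] for j in range(n)]
            # Pass 1: add the current rank-one term back into the residual.
            for e in range(nnz):
                R[e] += u[rows[e]] * v[cols[e]]
            for _ in range(inner_iters):
                # Update u: one regularized 1-D least-squares solve per row.
                num, den = [0.0] * m, [lam] * m
                for e in range(nnz):
                    i, j = rows[e], cols[e]
                    num[i] += R[e] * v[j]
                    den[i] += v[j] * v[j]
                u = [num[i] / den[i] for i in range(m)]
                # Update v: the same solve, one per column.
                num, den = [0.0] * n, [lam] * n
                for e in range(nnz):
                    i, j = rows[e], cols[e]
                    num[j] += R[e] * u[i]
                    den[j] += u[i] * u[i]
                v = [num[j] / den[j] for j in range(n)]
            # Store the refreshed feature.
            for i in range(m):
                W[i][t] = u[i]
            for j in range(n):
                H[j][t] = v[j]
            # Pass 2: subtract the new rank-one term from the residual.
            for e in range(nnz):
                R[e] -= u[rows[e]] * v[cols[e]]
    return W, H


def rmse(ratings, W, H):
    """Root-mean-square error over the observed entries."""
    se = 0.0
    for i, j, val in ratings:
        pred = sum(a * b for a, b in zip(W[i], H[j]))
        se += (val - pred) ** 2
    return math.sqrt(se / len(ratings))
```

Calling `ccdpp(ratings, m, n)` on a small list of `(row, col, value)` triples and comparing `rmse` before and after gives a quick sanity check. Each feature update makes several separate passes over the residual entries (add back, accumulate sums, subtract), and each per-row or per-column sum covers a different number of observed entries. These are plausibly the sources of the global-memory traffic and load imbalance that the abstract mentions, though how the paper's GPU kernels restructure them is not described on this page.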