Fast and Robust Parallel SGD Matrix Factorization
Abstract
Matrix factorization is one of the fundamental techniques for analyzing the latent relationships between two entities. In particular, it is widely used in recommendation for its high accuracy. Efficient parallel SGD matrix factorization algorithms have been developed to speed up the convergence of factorization on large matrices. However, most of them are designed for a shared-memory environment and thus fail to factorize a large matrix that is too big to fit in memory, and their performance is also unreliable when the matrix is skewed.
This paper proposes a fast and robust parallel SGD matrix factorization algorithm, called MLGF-MF, which is robust to skewed matrices and runs efficiently on block-storage devices (e.g., SSD disks) as well as in shared memory. MLGF-MF uses the Multi-Level Grid File (MLGF) for partitioning the matrix and minimizes the cost of scheduling parallel SGD updates on the partitioned regions by exploiting partial match query processing. Thereby, MLGF-MF produces reliable results efficiently even on skewed matrices. MLGF-MF is designed with asynchronous I/O permeated through the algorithm such that the CPU keeps executing without waiting for I/O to complete. Thereby, MLGF-MF overlaps CPU and I/O processing, which eventually offsets the I/O cost and maximizes CPU utilization. Recent flash SSD disks support high-performance parallel I/O and thus are well suited to executing asynchronous I/O.
From our extensive evaluations, MLGF-MF significantly outperforms (i.e., converges faster than) the state-of-the-art algorithms in both shared-memory and block-storage environments. In addition, the output of MLGF-MF is significantly more robust to skewed matrices.
Paper
MLGF-MF
The MLGF-MF algorithm is implemented in C++. We provide an executable binary of the algorithm, which runs on Ubuntu 14.04 Trusty.
Data
Format
We use the typical data format for recommendation. Each line of the data file contains three numbers separated by white space: a user id, an item id, and a rating. We assume that the user id and item id are integer values and that the rating is a real value. We do not assume any sort order on the datasets.
The following is an example of a data file, which is actually a part of the NetFlix dataset.
155902 5496 5
2248854 13015 4
43227 5807 1
626906 6679 4
2042206 501 5
397607 788 3
1984294 16265 5
227589 3506 2
918529 12767 4
1546913 7364 4
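As a reference for this format, the following is a minimal Python sketch that reads such a file into (user, item, rating) triples; the function name is ours, not part of the distribution.

def read_ratings(path):
    # Parse whitespace-separated "user item rating" triples.
    ratings = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            ratings.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return ratings

# e.g., ratings = read_ratings("netflix_train")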
Available Datasets
In our paper, we use three benchmark datasets: Netflix, Yahoo! Music, and HugeWiki. Here we provide the links for Yahoo! Music and HugeWiki.
- Dataset repositories
The default data formats of Yahoo! Music and HugeWiki differ from ours: Yahoo! Music uses its own format, and HugeWiki uses the MatrixMarket (mm) format. Thus, they have to be preprocessed. Here we provide preprocessing scripts for Yahoo! Music and HugeWiki.
- Preprocessing scripts
The usage of the preprocessing scripts is as follows.
path-to-preprocessor> python <scriptname> <datafile>
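The provided scripts should be used as-is. Purely as an illustration of what such a conversion does for MatrixMarket input, a hypothetical sketch (not the provided script) might look like the following.

def mm_to_triples(mm_path, out_path):
    # Convert a MatrixMarket coordinate file to plain "user item rating" lines.
    with open(mm_path) as fin, open(out_path, "w") as fout:
        size_line_skipped = False
        for line in fin:
            if line.startswith("%"):
                continue  # skip the MatrixMarket banner and comments
            if not size_line_skipped:
                size_line_skipped = True  # first non-comment line is "rows cols nnz"
                continue
            i, j, v = line.split()
            fout.write("%s %s %s\n" % (i, j, v))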
Download MLGF-MF Executable
Contact jinoh.postech AT gmail.com if you have trouble executing the file.
Usage
Once you download and extract the .tar file, execute the binary named MLGF_MF and set the input parameters as follows.
Mode selection
- Mode
--build-index
: Build the MLGF index and db file from a plain dataset.
shell> <Executable> --build-index <input_datafile> <output_db_file>
--train
: Train the factorized latent matrices and write them to a model file.
shell> <Executable> --train <input_dbfile> <output_model_file>
--predict
: Predict the rating scores for the test dataset and report the validation RMSE (see the sketch below).
shell> <Executable> --predict <input_modelfile> <input_test_filename>
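For reference, the validation RMSE reported by --predict is the standard root-mean-square error between predicted and observed ratings, where a rating is predicted as the inner product of the corresponding latent rows of U and V. The following is a minimal Python sketch, assuming U and V have already been loaded as lists of k-dimensional rows (the model file layout itself is not described here).

import math

def predict(U, V, user, item):
    # Inner product of the k-dimensional latent rows.
    return sum(uf * vf for uf, vf in zip(U[user], V[item]))

def rmse(U, V, test):
    # test is a list of (user, item, rating) triples.
    se = sum((predict(U, V, u, i) - r) ** 2 for u, i, r in test)
    return math.sqrt(se / len(test))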
Parameter Setting for training mode
- General Parameters
--thread-size <integer_value>
: The number of concurrent threads
--validate <validate_file_name>
: Enable validation and report the final RMSE at the end of the iterations
--validate-iteratively
: Report the validation RMSE at every iteration
- Learning parameters
--iteration-size <integer_value>
: The number of iterations
--row-lambda <float_value>
: The regularizer (λ) for the latent matrix U
--column-lambda <float_value>
: The regularizer for the latent matrix V
--stepsize <float_value>
: The step size η
--latent-dim-size <integer_value>
: The number of latent dimensions (the k value in the paper)
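To make these parameters concrete: for a single rating (u, i, r), standard SGD computes the error e = r - U_u·V_i and then updates U_u ← U_u + η(e·V_i - λ_row·U_u) and V_i ← V_i + η(e·U_u - λ_col·V_i). The following is a minimal single-threaded Python sketch of one such pass over the ratings; it illustrates only the update rule, not the parallel, MLGF-partitioned scheduling performed by the actual binary.

import random

def sgd_iteration(U, V, ratings, stepsize, row_lambda, column_lambda):
    # One pass of SGD over all ratings in random order.
    random.shuffle(ratings)
    for u, i, r in ratings:
        k = len(U[u])
        err = r - sum(U[u][f] * V[i][f] for f in range(k))
        for f in range(k):
            uf, vf = U[u][f], V[i][f]  # use pre-update values in both updates
            U[u][f] += stepsize * (err * vf - row_lambda * uf)
            V[i][f] += stepsize * (err * uf - column_lambda * vf)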
An Example Scenario
This scenario was tested on Ubuntu 14.04 Trusty (64-bit).
Here we provide an example scenario of using MLGF-MF on the NetFlix dataset.
Build an MLGF index from the NetFlix training dataset
- execute MLGF-MF as follows.
shell> ./MLGF-MF --build-index netflix_train netflix_db
Train the latent matrices from the db file
- execute MLGF-MF as follows.
shell> ./MLGF-MF --train --validate netflix_test --iteration-size 10 netflix_db model
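Predict ratings for the test dataset with the trained model
- execute MLGF-MF as follows, using the --predict mode described above (the model and test file names follow from the previous steps).
shell> ./MLGF-MF --predict model netflix_test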
Evaluation results
Shared memory
Comparison of convergence speed in shared memory across datasets
Skewed matrices
Block-storage devices
Comparison of the convergence speed on two block-storage devices
- Block-storage device 1: RevuAhn Drive X
- NetFlix dataset
- Yahoo! Music dataset
- HugeWiki dataset
- Block-storage device 2: Samsung SSD 850 Pro
- NetFlix dataset
- Yahoo! Music dataset
- HugeWiki dataset
Comparison of convergence speed when virtual memory (swap space in Linux) is used for the in-memory implementation
In this figure, we compare three settings: 1) FPSGD with enough physical memory (16 GB), 2) FPSGD with insufficient physical memory, and 3) MLGF-MF running on the RevuAhn disk. Specifically, for setting 2), we give FPSGD only 2 GB of physical memory on the Yahoo! Music dataset. For a fair evaluation, we also configure the RevuAhn disk as the swap space. The figure shows that FPSGD with insufficient memory takes three times longer than FPSGD with enough physical memory, while MLGF-MF shows identical performance on the Yahoo! Music dataset even though it runs on the RevuAhn disk.
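For reference, on Linux a block device can be configured as swap space with the standard tools; the device path /dev/sdX1 below is illustrative.
shell> sudo mkswap /dev/sdX1
shell> sudo swapon /dev/sdX1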
Performance bottleneck
Sensitivity to page sizes
Scalability
We also evaluate the scalability of MLGF-MF in terms of dataset size using synthetic datasets. For the evaluation, we generate the synthetic datasets following the synthetic data generation steps of DSGD[3]. Specifically, we fix the number of columns at 1M and increase the number of users from 20M to 100M in intervals of 20M. Accordingly, the number of non-zero entries in the matrix also increases from 2B to 10B in intervals of 2B. As a result, we generate five datasets with 20M ~ 100M users. The size of each synthesized matrix ranges from 53 GB to 274 GB in raw text format. As the result figure shows, the processing time of MLGF-MF increases linearly with the matrix size.
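The exact generation procedure follows DSGD[3]. Purely as a simplified illustration of the scale involved, a uniform random generator for such a matrix could look like the following Python sketch; uniform sampling and the 1-5 rating range are our assumptions here, not the DSGD procedure.

import random

def generate_synthetic(num_users, num_items, nnz, out_path):
    # Write nnz (user, item, rating) triples sampled uniformly at random.
    # Duplicate (user, item) pairs are possible in this simplified sketch.
    with open(out_path, "w") as f:
        for _ in range(nnz):
            u = random.randrange(num_users)
            i = random.randrange(num_items)
            r = random.randint(1, 5)  # assumed rating range
            f.write("%d %d %d\n" % (u, i, r))

# e.g., the smallest configuration: 20M users, 1M items, 2B non-zero entries
# generate_synthetic(20 * 10**6, 10**6, 2 * 10**9, "synthetic_20M")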