Fast and Robust Parallel SGD Matrix Factorization
Abstract
Matrix factorization is one of the fundamental techniques for analyzing the latent relationships between two entities. In particular, it is widely used in recommendation for its high accuracy. Efficient parallel SGD matrix factorization algorithms have been developed to speed up the convergence of factorization on large matrices. However, most of them are designed for a shared-memory environment and thus fail to factorize a large matrix that is too big to fit in memory, and their performance is also unreliable when the matrix is skewed.
This paper proposes a fast and robust parallel SGD matrix factorization algorithm, called MLGF-MF, which is robust to skewed matrices and runs efficiently on block-storage devices (e.g., SSD disks) as well as in shared memory. MLGF-MF uses the Multi-Level Grid File (MLGF) for partitioning the matrix and minimizes the cost of scheduling parallel SGD updates on the partitioned regions by exploiting partial match query processing. Thereby, MLGF-MF produces reliable results efficiently even on skewed matrices. MLGF-MF is designed with asynchronous I/O permeated through the algorithm such that the CPU keeps executing without waiting for I/O to complete. Thereby, MLGF-MF overlaps CPU and I/O processing, which eventually offsets the I/O cost and maximizes CPU utilization. Recent flash SSD disks support high-performance parallel I/O and thus are well suited to executing asynchronous I/O.
From our extensive evaluations, MLGF-MF significantly outperforms (i.e., converges faster than) the state-of-the-art algorithms in both shared-memory and block-storage environments. In addition, the output of MLGF-MF is significantly more robust to skewed matrices.
Paper
MLGF-MF
The MLGF-MF algorithm is implemented in C++. We provide an executable binary of the algorithm, which runs on Ubuntu 14.04 Trusty.
Data
Format
We use the typical data format for recommendation. Each line of the data file contains three numbers separated by white space: a user id, an item id, and a rating. We assume that the user id and item id are integer values and that the rating is a real value. We do not assume any sort order on the datasets.
The following is an example of a data file, which is actually a part of the NetFlix dataset.
155902 5496 5
2248854 13015 4
43227 5807 1
626906 6679 4
2042206 501 5
397607 788 3
1984294 16265 5
227589 3506 2
918529 12767 4
1546913 7364 4
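As a reference for this format, the following is a minimal Python sketch that reads such a file into (user, item, rating) triples; the function name is ours, not part of the distribution.

def read_ratings(path):
    # Parse whitespace-separated "user item rating" triples.
    ratings = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) != 3:
                continue  # skip blank or malformed lines
            ratings.append((int(fields[0]), int(fields[1]), float(fields[2])))
    return ratings

# e.g., ratings = read_ratings("netflix_train")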
Available Datasets
In our paper, we use three benchmark datasets: Netflix, Yahoo! Music, and HugeWiki. Here we provide the links for Yahoo! Music and HugeWiki.
- Dataset repositories
The default data formats of Yahoo! Music and HugeWiki differ from ours: Yahoo! Music uses its own format, and HugeWiki uses the MatrixMarket (mm) format. Thus, they have to be preprocessed. Here we provide preprocessing scripts for Yahoo! Music and HugeWiki.
- Preprocessing scripts
The usage of the preprocessing scripts is as follows.
path-to-preprocessor> python <scriptname> <datafile>
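The provided scripts should be used as-is. Purely as an illustration of what such a conversion does for MatrixMarket input, a hypothetical sketch (not the provided script) might look like the following.

def mm_to_triples(mm_path, out_path):
    # Convert a MatrixMarket coordinate file to plain "user item rating" lines.
    with open(mm_path) as fin, open(out_path, "w") as fout:
        size_line_skipped = False
        for line in fin:
            if line.startswith("%"):
                continue  # skip the MatrixMarket banner and comments
            if not size_line_skipped:
                size_line_skipped = True  # first non-comment line is "rows cols nnz"
                continue
            i, j, v = line.split()
            fout.write("%s %s %s\n" % (i, j, v))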
Download MLGF-MF Executable
Contact jinoh.postech AT gmail.com if you have trouble executing the file.
Usage
Once you download and extract the .tar file, execute the binary named MLGF_MF and set the input parameters as follows.
Mode selection
- Mode
--build-index
: Build the MLGF index and db file from a plain dataset.
shell> <Executable> --build-index <input_datafile> <output_db_file>
--train
: Train the factorized latent matrices and write them to a model file.
shell> <Executable> --train <input_dbfile> <output_model_file>
--predict
: Predict the rating scores for the test dataset and report the validation RMSE (see the sketch below).
shell> <Executable> --predict <input_modelfile> <input_test_filename>
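For reference, the validation RMSE reported by --predict is the standard root-mean-square error between predicted and observed ratings, where a rating is predicted as the inner product of the corresponding latent rows of U and V. The following is a minimal Python sketch, assuming U and V have already been loaded as lists of k-dimensional rows (the model file layout itself is not described here).

import math

def predict(U, V, user, item):
    # Inner product of the k-dimensional latent rows.
    return sum(uf * vf for uf, vf in zip(U[user], V[item]))

def rmse(U, V, test):
    # test is a list of (user, item, rating) triples.
    se = sum((predict(U, V, u, i) - r) ** 2 for u, i, r in test)
    return math.sqrt(se / len(test))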
Parameter Setting for training mode
- General Parameters
--thread-size <integer_value>
: The number of concurrent threads
--validate <validate_file_name>
: Enable validation and report the final RMSE at the end of the iterations
--validate-iteratively
: Report the validation RMSE at every iteration
- Learning parameters
--iteration-size <integer_value>
: The number of iterations
--row-lambda <float_value>
: The regularizer (λ) for the latent matrix U
--column-lambda <float_value>
: The regularizer for the latent matrix V
--stepsize <float_value>
: The step size η
--latent-dim-size <integer_value>
: The number of latent dimensions (the k value in the paper)
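To make these parameters concrete: for a single rating (u, i, r), standard SGD computes the error e = r - U_u·V_i and then updates U_u ← U_u + η(e·V_i - λ_row·U_u) and V_i ← V_i + η(e·U_u - λ_col·V_i). The following is a minimal single-threaded Python sketch of one such pass over the ratings; it illustrates only the update rule, not the parallel, MLGF-partitioned scheduling performed by the actual binary.

import random

def sgd_iteration(U, V, ratings, stepsize, row_lambda, column_lambda):
    # One pass of SGD over all ratings in random order.
    random.shuffle(ratings)
    for u, i, r in ratings:
        k = len(U[u])
        err = r - sum(U[u][f] * V[i][f] for f in range(k))
        for f in range(k):
            uf, vf = U[u][f], V[i][f]  # use pre-update values in both updates
            U[u][f] += stepsize * (err * vf - row_lambda * uf)
            V[i][f] += stepsize * (err * uf - column_lambda * vf)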
An Example Scenario
This scenario was tested on Ubuntu 14.04 Trusty (64-bit).
Here we provide an example scenario of using MLGF-MF on the NetFlix dataset.
Build an MLGF index from the NetFlix training dataset
- execute MLGF-MF as follows.
shell> ./MLGF-MF --build-index netflix_train netflix_db
Train the latent matrices from the db file
- execute MLGF-MF as follows.
shell> ./MLGF-MF --train --validate netflix_test --iteration-size 10 netflix_db model
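Predict ratings for the test dataset with the trained model
- execute MLGF-MF as follows, using the --predict mode described above (the model and test file names follow from the previous steps).
shell> ./MLGF-MF --predict model netflix_test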
Evaluation results
Shared memory
Comparison of convergence speed in shared memory across datasets
Skewed matrices
Block-storage devices
Comparison of the convergence speed on two block-storage devices
- Block-storage device 1: RevuAhn Drive X
- NetFlix dataset
- Yahoo! Music dataset
- HugeWiki dataset
- Block-storage device 2: Samsung SSD 850 Pro
- NetFlix dataset
- Yahoo! Music dataset
- HugeWiki dataset
Comparison of convergence speed when virtual memory (swap space in Linux) is used for the in-memory implementation
In this figure, we compare three settings: 1) FPSGD with enough physical memory (16 GB), 2) FPSGD with insufficient physical memory, and 3) MLGF-MF running on the RevuAhn disk. Specifically, for setting 2), we give FPSGD only 2 GB of physical memory on the Yahoo! Music dataset. For a fair evaluation, we also configure the RevuAhn disk as the swap space. The figure shows that FPSGD with insufficient memory takes three times longer than FPSGD with enough physical memory, while MLGF-MF shows identical performance on the Yahoo! Music dataset even though it runs on the RevuAhn disk.
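For reference, on Linux a block device can be configured as swap space with the standard tools; the device path /dev/sdX1 below is illustrative.
shell> sudo mkswap /dev/sdX1
shell> sudo swapon /dev/sdX1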
Performance bottleneck
Sensitivity to page sizes
Scalability
We also evaluate the scalability of MLGF-MF in terms of dataset size using synthetic datasets. For the evaluation, we generate the synthetic datasets following the synthetic data generation steps of DSGD[3]. Specifically, we fix the number of columns at 1M and increase the number of users from 20M to 100M in intervals of 20M. Accordingly, the number of non-zero entries in the matrix also increases from 2B to 10B in intervals of 2B. As a result, we generate five datasets with 20M ~ 100M users. The size of each synthesized matrix ranges from 53 GB to 274 GB in raw text format. As the result figure shows, the processing time of MLGF-MF increases linearly with the matrix size.
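The exact generation procedure follows DSGD[3]. Purely as a simplified illustration of the scale involved, a uniform random generator for such a matrix could look like the following Python sketch; uniform sampling and the 1-5 rating range are our assumptions here, not the DSGD procedure.

import random

def generate_synthetic(num_users, num_items, nnz, out_path):
    # Write nnz (user, item, rating) triples sampled uniformly at random.
    # Duplicate (user, item) pairs are possible in this simplified sketch.
    with open(out_path, "w") as f:
        for _ in range(nnz):
            u = random.randrange(num_users)
            i = random.randrange(num_items)
            r = random.randint(1, 5)  # assumed rating range
            f.write("%d %d %d\n" % (u, i, r))

# e.g., the smallest configuration: 20M users, 1M items, 2B non-zero entries
# generate_synthetic(20 * 10**6, 10**6, 2 * 10**9, "synthetic_20M")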