Multi-aspect data appear frequently in many web-related applications. For example, product reviews are quadruplets of (user, product, keyword, timestamp). How can we analyze such web-scale multi-aspect data? Can we analyze them on an off-the-shelf workstation with limited amount of memory? Tucker decomposition has been widely used for discovering patterns in relationships among entities in multi-aspect data, which are naturally expressed as high-order tensors. However, existing algorithms for Tucker decomposition have limited scalability, and especially, fail to decompose high-order tensors since they explicitly materialize intermediate data, whose size rapidly grows as the order increases (≥ 4). We call this problem M-Bottleneck (“Materialization Bottle-neck”). To avoid M-Bottleneck, we propose S-HOT, a scalable high-order tucker decomposition method that employs the on-the-fly-computation to minimize the materialized intermediate data. Moreover, S-HOT is designed for handling disk-resident tensors, too large to fit in memory, without loading them all at once. We provide theoretical analysis on the amount of memory space and the number of scans of data required by S-HOT. In our experiments, S-HOT showed better scalability not only with the order but also with the dimensionality and the rank than baseline methods. In particular, S-HOT decomposed tensors 1000× larger in terms dimensionality than baseline methods. S-HOT also successfully analyzed a real-world tensors that are both large-scale and high-order on an off-the-shelf workstation with limited amount of memory, while baseline methods failed. For reproducibility, we make source code of S-HOT publicly available [link].
Model | Description |
---|---|
BaselinNaive | A naive method computing the intermediate tensor in a straight-forward way. We used the implementation built in Tensor toolbox 2.6 |
BaselineOpt [Kolda et al.] |
The state-of-the-art Memory Efficient Tucker decomposition which computes the intermediate tensor fiber by fiber. We used the implementation built in Tensor toolbox 2.6 |
S-HOT | Our proposed model. [Download] |
S-HOTscan | A variant of S-HOT. S-HOTscan is designed to optimize I/O cost for scanning disk-resident tensor. [Download] |
S-HOT shows better scalability for various factors: dimensionality, order of tensor, and rank
These plots verify the excellence of S-HOT in terms of various factors: dimensionality, order of tensor, and rank. Specifically, in Figure 2, both of S-HOT and S-HOTscan decompose a tensor with dimensionality higher than 104 while baselines fail to decompose thosehigh tensor. Similarly, in Figure 3, both of proposed methods decpompose a tensor with high order (≥ 5) while baselines fail.
We encourage you to cite our paper if you have used our implementation in your work. You can use the following BibTex citation:
@inproceedings{Oh:2017:SHOT, author = {Jinoh Oh and Kijung Shin and Evangelos E. Papalexakis and Christos Faloutsos and Hwanjo Yu}, title = {S-HOT: Scalable High-Order Tucker Decomposition}, booktitle = {Proceedings of the 10th ACM International Conference on Web Search and Data Mining}, series = {WSDM '17}, year = {2017}, location = {Cambridge, UK}, doi = {10.1145/3018661.3018721}, publisher = {ACM}, address = {New York, NY, USA} }