At IBM THINK in Las Vegas, we are reporting a breakthrough in AI performance using new software and algorithms on optimized hardware, including POWER9 with NVIDIA® V100™ GPUs.
In a newly published benchmark on an online advertising dataset released by Criteo Labs, containing over 4 billion training examples, we train a logistic regression classifier in 91.5 seconds. This is 46x faster than the best previously reported result, which used TensorFlow on Google Cloud Platform to train the same model in 70 minutes.
The AI software behind the speed-up is IBM Snap Machine Learning (Snap ML), a new library developed over the past two years by our team at IBM Research in Zurich – so named because it trains models faster than you can snap your fingers.
The library provides high-speed training of popular machine learning models on modern CPU/GPU computing systems and can be used to train models to find new and interesting patterns, or to retrain existing models at wire-speed (as fast as the network can support) as new data becomes available. This means lower compute costs for users, lower energy consumption, more agile development and a faster time to results.
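To make the workload concrete, here is a minimal sketch of the kind of model being trained: binary logistic regression on sparse click data. It uses scikit-learn on synthetic data purely as a stand-in; it is not Snap ML's API and will not reproduce the reported performance.

```python
# Stand-in sketch of the benchmark workload: binary logistic regression on
# sparse click-through data. scikit-learn and synthetic data are used purely
# for illustration; this is not Snap ML and will not match its performance.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss

rng = np.random.default_rng(42)
n_examples, n_features = 100_000, 1_000           # tiny compared to 4 billion
X = sparse_random(n_examples, n_features, density=0.01, format="csr",
                  random_state=42, dtype=np.float32)
y = rng.integers(0, 2, n_examples)                # 0/1 click labels

# note: older scikit-learn versions name this loss "log" instead of "log_loss"
clf = SGDClassifier(loss="log_loss", alpha=1e-5, max_iter=5, tol=None)
clf.fit(X, y)
print("training log loss:", log_loss(y, clf.predict_proba(X)))
```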
The Need for Speed
The widespread adoption of machine learning and artificial intelligence has been, in part, driven by the ever-increasing availability of data. Large datasets enable training of more expressive models, thus leading to higher quality insights. However, when the size of such datasets grows to billions of training examples and/or features, the training of even relatively simple models becomes prohibitively time-consuming. This long turn-around time (from data preparation to scoring) can be a severe hindrance to the research, development and deployment of large-scale machine learning models for critical applications such as weather forecasting and financial fraud detection.
Equally important, Snap ML is not only for applications with very large datasets, where training time can become a bottleneck. Real-time or close-to-real-time applications, in which models must react rapidly to changing events, are another important scenario where training time is critical. Consider, for instance, an ongoing cyberattack threatening the energy grid, driven by a new, previously unseen phenomenon that is still evolving. In such situations, it may be beneficial to train, or incrementally re-train, the existing models with new data on the fly. One's ability to respond to such events depends directly on the training time, which can become critical even when the data itself is relatively small.
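As an illustration of this kind of on-the-fly, incremental retraining, the sketch below updates an existing model as new batches of events arrive. It uses scikit-learn's `partial_fit` as a generic stand-in, not Snap ML's interface, and the `next_batch` data source is hypothetical.

```python
# Illustrative on-the-fly retraining loop: instead of retraining from scratch,
# the existing model is updated incrementally as new labelled events arrive.
# (Generic scikit-learn stand-in; next_batch is a hypothetical data source.)
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])

def next_batch(n=10_000, d=100):
    """Hypothetical source of freshly arriving, labelled events."""
    X = rng.standard_normal((n, d)).astype(np.float32)
    y = (X[:, 0] > 0).astype(int)                 # toy ground truth
    return X, y

for step in range(20):                            # e.g., one batch per minute
    X_new, y_new = next_batch()
    model.partial_fit(X_new, y_new, classes=classes)
```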
A third area where fast training is highly desirable is the field of ensemble learning. It is well known that most data science competitions today are won by large ensembles of models. In order to design a winning ensemble, a data scientist typically spends a significant amount of time trying out different combinations of models and tuning the large number of hyper-parameters that arise. In such a scenario, the ability to train models orders of magnitude faster naturally results in a more agile development process. A library that provides such acceleration can give its users a valuable edge in competitive data science or in any application where best-in-class accuracy is desired. One such application is click-through rate prediction in online advertising, where it has been estimated that even a 0.1% improvement in accuracy can lead to increased earnings on the order of hundreds of millions of dollars.
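As a toy illustration of the point, the sketch below runs a small hyper-parameter sweep over a linear model; with training that is orders of magnitude faster, many more of these candidate configurations fit into the same development time. Again, scikit-learn on synthetic data is used as a stand-in rather than Snap ML.

```python
# Toy hyper-parameter sweep: each candidate configuration requires a full
# training run, so faster training directly widens the search a data scientist
# can afford. (scikit-learn stand-in on synthetic data; not Snap ML.)
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.standard_normal((20_000, 50)).astype(np.float32)
y = (X[:, 0] + 0.1 * rng.standard_normal(20_000) > 0).astype(int)

grid = {"alpha": [1e-6, 1e-5, 1e-4, 1e-3],
        "penalty": ["l2", "l1", "elasticnet"]}
search = GridSearchCV(SGDClassifier(loss="log_loss"), grid,
                      scoring="neg_log_loss", cv=3)
search.fit(X, y)                                 # 12 configs x 3 folds = 36 training runs
print("best params:", search.best_params_)
```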
The efficiency, results, and insights from machine learning have made it critical to businesses of all sizes. Whether it is a small or medium business running in the cloud or a large-scale enterprise IT operation serving many business units, machine learning puts pressure on compute resources. Since resources are typically billed in increments, time to solution has a direct impact on the business' bottom line.
In this work we describe a library that exploits the hierarchical memory and compute structure of modern systems. We focus on the training of generalized linear models for which we combine recent advances in algorithm and system design to optimally leverage all hardware resources available in modern computing environments.
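As a concrete example of such a generalized linear model, one standard formulation of the L2-regularized logistic regression objective (the model trained in the benchmark below; the exact regularization setup is described in the full paper) is:

$$\min_{\mathbf{w}\in\mathbb{R}^d}\;\frac{1}{n}\sum_{i=1}^{n}\log\left(1+\exp\left(-y_i\,\mathbf{w}^\top\mathbf{x}_i\right)\right)+\frac{\lambda}{2}\,\lVert\mathbf{w}\rVert_2^2,$$

where $\mathbf{x}_i$ are the feature vectors, $y_i\in\{-1,+1\}$ the click labels, and $\lambda$ the regularization strength.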
The three main features that distinguish Snap ML are:
- Distributed training: We build our system as a data-parallel framework, enabling us to scale out and train on massive datasets that exceed the memory capacity of a single machine, which is crucial for large-scale applications.
- GPU acceleration: We implement specialized solvers designed to leverage the massively parallel architecture of GPUs while respecting the data locality in GPU memory to avoid large data transfer overheads. To make this approach scalable we take advantage of recent developments in heterogeneous learning in order to enable GPU acceleration even if only a small fraction of the data can indeed be stored in the accelerator memory.
- Sparse data structures: Many machine learning datasets are sparse; we therefore employ new optimizations for the algorithms used in our system when they are applied to sparse data structures (a brief illustration follows this list).
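As referenced above, here is a brief, generic illustration of a sparse representation: click-log feature vectors are mostly zeros, so a compressed sparse row (CSR) layout stores only the non-zero values plus index arrays. This is a plain SciPy example, not Snap ML code.

```python
# Generic CSR illustration (SciPy, not Snap ML): only the non-zero values and
# their index structure are stored, which is what makes tera-scale, highly
# sparse datasets tractable in memory.
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[0, 0, 1.0, 0, 0, 2.5],
                  [0, 3.0, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 0]], dtype=np.float32)
sparse = csr_matrix(dense)

print("stored values: ", sparse.data)       # [1.  2.5 3. ]
print("column indices:", sparse.indices)    # [2 5 1]
print("row pointers:  ", sparse.indptr)     # [0 2 3 3]
print(f"density: {sparse.nnz / dense.size:.1%}")
```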
Tera-Scale Benchmark Set-Up
The Terabyte Click Logs is a large online advertising dataset released by Criteo Labs for the purposes of advancing research in the field of distributed machine learning. It consists of 4 billion training examples.
Each example has a “label”, that is, whether or not a user clicked on an online advert, and a corresponding set of anonymized features. The goal of machine learning for such data is to learn a model that can predict whether or not a new user will click on an advert. It is one of the largest publicly available datasets of its kind. The data was collected over 24 days, with roughly 160 million training examples collected per day on average, which illustrates the sheer scale and speed at which online advertising data is generated.
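For readers who want to work with the data directly, the sketch below parses one line of a day's log file. Per Criteo's published description, each tab-separated line holds a 0/1 click label, 13 integer count features and 26 hashed categorical features; the file name `day_0` and the missing-value handling here are illustrative, so check the dataset documentation before relying on this parser.

```python
# Sketch of parsing the Terabyte Click Logs format (tab-separated: label,
# 13 integer features, 26 hashed categorical features). Field handling is
# illustrative; verify against Criteo's dataset documentation.
def parse_line(line: str):
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    ints = [int(v) if v else 0 for v in fields[1:14]]   # treat missing as 0
    cats = fields[14:40]                                # hex-hashed categories
    return label, ints, cats

# Example usage on a local copy of one day's log (hypothetical path):
# with open("day_0") as f:
#     for line in f:
#         label, ints, cats = parse_line(line)
```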
In order to train on the full Terabyte Click Logs dataset, we deploy Snap ML across four IBM Power System AC922 servers. Each server has four NVIDIA Tesla V100 GPUs and two POWER9 CPUs; the GPUs communicate with the host CPUs via the NVIDIA NVLink interface. The servers communicate with each other over an InfiniBand network. When training a logistic regression classifier on this infrastructure, we achieved a test loss of 0.1292 in 91.5 seconds.
The results that have been previously reported on the same dataset and model are summarized in Figure 7, where we plot the training time vs. the test loss with some remarks about the hardware that was used for the experiment.
The closest result to Snap ML in terms of speed was reported by Google, who deployed TensorFlow on their cloud platform to train a logistic regression classifier in 70 minutes, using 60 worker machines and 29 parameter machines. Relative to the TensorFlow result, Snap ML achieves the same test loss but 46x faster (70 minutes is 4,200 seconds, and 4,200 / 91.5 ≈ 46). A full review of the previously reported results, including references, can be found in the full paper.
When deploying GPU acceleration for such large-scale applications, one main technical challenge arises: the training data is too large to fit in the memory available on the GPUs. Thus, during training, data needs to be processed in chunks and repeatedly moved in and out of the GPU memory. To profile the runtime of our application, we analyze how much time is spent in the GPU kernel versus how much time is spent copying data onto the GPU. For this study we used a smaller subset of the Terabyte Click Logs, consisting of the first 200 million training examples, and compare two hardware configurations:
- An Intel x86-based machine (Xeon Gold 6150 CPU @ 2.70GHz) with one NVIDIA Tesla V100 GPU attached using the PCIe Gen3 interface.
- An IBM Power System AC922 server with four NVIDIA Tesla V100 GPUs attached using the NVLink interface (only one of which is used for this comparison).
In Figure 8a, we show the profiling results for the x86-based setup. There are two streams, S1 and S2. On stream S1, the actual training is performed (i.e., calls to the logistic regression kernel). The time to train each chunk of data is around 90 milliseconds (ms). While training is ongoing, stream S2 copies the next data chunk onto the GPU. We observe that it takes 318 ms to copy each chunk, meaning that the GPU sits idle for a significant amount of time and the copy time is clearly the bottleneck.
In Figure 8b, for the POWER-based setup, we observe that the time to copy the next chunk onto the GPU is reduced significantly, to 55 ms (almost a factor of 6), due to the higher bandwidth provided by NVIDIA NVLink. Because the copy now takes less time than the kernel execution, the data transfer is hidden behind the compute, effectively removing the copy time from the critical path and resulting in a 3.5x overall speed-up.
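The double-buffering pattern behind this pipelining can be sketched as follows: while the compute stream trains on the chunk already resident on the GPU, a second stream copies the next chunk into a spare device buffer. The sketch below uses CuPy with a toy gradient-step "kernel" and synthetic data; it illustrates the pattern only and is not Snap ML's implementation (truly asynchronous copies also require pinned host memory).

```python
# Double-buffering sketch with CuPy: overlap host-to-device copies with the
# training kernel using two CUDA streams. Toy data and a toy kernel; not
# Snap ML's actual pipeline.
import numpy as np
import cupy as cp

def train_chunk(chunk_dev, w, lr=0.1):
    """Stand-in for the logistic regression kernel: one gradient step on a
    chunk whose last column holds the 0/1 label."""
    X, y = chunk_dev[:, :-1], chunk_dev[:, -1]
    p = 1.0 / (1.0 + cp.exp(-(X @ w)))           # predicted click probability
    grad = X.T @ (p - y) / X.shape[0]            # logistic-loss gradient
    return w - lr * grad

rng = np.random.default_rng(0)
def make_chunk(n=100_000, d=64):
    X = rng.random((n, d), dtype=np.float32)
    y = (rng.random(n) < 0.5).astype(np.float32)
    return np.hstack([X, y[:, None]])

chunks = [make_chunk() for _ in range(8)]
w = cp.zeros(64, dtype=cp.float32)

compute_stream = cp.cuda.Stream(non_blocking=True)
copy_stream = cp.cuda.Stream(non_blocking=True)
buffers = [cp.empty(chunks[0].shape, dtype=cp.float32) for _ in range(2)]
buffers[0].set(chunks[0])                        # synchronous preload of chunk 0

for i in range(len(chunks)):
    if i + 1 < len(chunks):
        # stage chunk i+1 on the copy stream while chunk i is being trained
        # (pinned host memory is needed for the copy to be fully asynchronous)
        buffers[(i + 1) % 2].set(chunks[i + 1], stream=copy_stream)
    with compute_stream:
        w = train_chunk(buffers[i % 2], w)
    compute_stream.synchronize()
    copy_stream.synchronize()

print("trained weight norm:", float(cp.linalg.norm(w)))
```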
This IBM Research breakthrough will be available for customers to try as part of the PowerAI Tech Preview portfolio later this year. In the meantime, we are actively looking for clients interested in pilot projects.