
Allreduce for Parallel Learning
John Langford, Microsoft Research, NYC
May 8, 2017

Applying for a fellowship in 1997

Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational wind tunnels!

The worst part: he had a point.

Why is it hard? Everyone's first instinct: try using parameter servers [1, 2].

[Figure: parameter-server architecture, with a layer of model shards serving many data shards.]

Big problems in practice:
1. Overwhelmingly inefficient. Best case: marginally faster with 100x the electricity.
2. Nondeterministic. Minor bugs are incredibly difficult to track down.

[1] Hsu, Karampatziakis, Langford, and Smola, Parallel Online Learning, https://arxiv.org/abs/1103.4204
[2] Dean et al., Large Scale Distributed Deep Networks, NIPS 2012

Terascale Linear Learning [ACDL11]

Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = sum_i w_i x_i?

17B examples, 16M parameters, 1K nodes. How long does it take?

70 minutes = 500M features/second:
faster than the I/O bandwidth of a single machine;
faster than all possible single-machine linear learning algorithms.
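To make the learned object concrete, here is a minimal sketch of the sparse linear predictor (Python chosen for illustration; the predictor form and the 16M-parameter scale are from the talk, the function and the example data are assumptions):

    # Sparse linear predictor f_w(x) = sum_i w_i * x_i; examples are sparse,
    # so x is stored as (feature_index, value) pairs.
    def predict(w, x):
        return sum(w[i] * v for i, v in x)

    w = [0.0] * 16_000_000                 # 16M parameters, as in the talk
    x = [(12, 1.0), (4_031_977, 0.5)]      # a hypothetical sparse example
    print(predict(w, x))                   # 0.0 until w is learned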

MPI-style AllReduce

Allreduce initial state: node values 5, 7, 6, 1, 2, 3, 4.
Allreduce final state: every node holds the sum, 28.

How it works: create a binary tree over the nodes (root 7; internal nodes 5, 6; leaves 1, 2, 3, 4).
Reducing, step 1: internal nodes add their leaves' values: 7 / 8, 13 / 1, 2, 3, 4.
Reducing, step 2: the root adds its children's totals: 28 / 8, 13 / 1, 2, 3, 4.
Broadcast, step 1: the sum flows back down: 28 / 28, 28 / 1, 2, 3, 4.
Allreduce final state: 28 everywhere.

AllReduce = Reduce + Broadcast. Properties:
1. Easily pipelined, so no latency concerns.
2. Bandwidth ≤ 6n.
3. No need to rewrite code!
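The reduce-then-broadcast pattern fits in a few lines. A single-process sketch (Python for illustration; the heap-style array layout and function name are assumptions, not the MPI or Vowpal Wabbit implementation):

    # Tree allreduce = reduce up a binary tree, then broadcast the total down.
    # values[i]'s children live at indices 2i+1 and 2i+2 (heap layout).
    def allreduce_tree(values):
        n = len(values)

        def reduce_up(i):
            total = values[i]
            for c in (2 * i + 1, 2 * i + 2):
                if c < n:
                    total += reduce_up(c)   # each child sends its subtree sum up
            return total

        total = reduce_up(0)                # reduce: the root ends with the sum
        return [total] * n                  # broadcast: every node receives it

    # The tree from the slides: root 7, internal nodes 5 and 6, leaves 1-4.
    print(allreduce_tree([7, 5, 6, 1, 2, 3, 4]))   # [28, 28, 28, 28, 28, 28, 28]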

An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max):
  1. While (examples left): do online update.
  2. AllReduce(weights)
  3. For each weight: w <- w/n

A runnable toy version follows below. Other algorithms implemented:
1. Nonuniform averaging for online learning
2. Conjugate gradient
3. L-BFGS
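A toy version of the weight-averaging loop (a sketch under assumptions: allreduce() is a single-node stand-in that, on a real cluster, would sum its argument across all nodes; the squared-loss update is illustrative):

    def allreduce(vec):
        return list(vec)                    # single-node stand-in: sum over 1 node

    def train(examples, dim, max_passes, lr=0.1):
        w = [0.0] * dim
        n = allreduce([1.0])[0]             # n = AllReduce(1) counts the nodes
        for _ in range(max_passes):
            for x, y in examples:           # online (SGD) update on the local shard
                err = sum(w[i] * v for i, v in x) - y
                for i, v in x:
                    w[i] -= lr * err * v
            w = [wi / n for wi in allreduce(w)]   # average weights across nodes
        return w

    print(train([([(0, 1.0)], 1.0), ([(1, 1.0)], -1.0)], dim=2, max_passes=5))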

Hadoop AllReduce

1. Map job moves the program to the data.
2. Delayed initialization: most failures are disk failures, so first read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data.
3. Speculative execution: in a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers; we use whichever finishes reading all the data first.

The net effect: reliable execution out to perhaps 10K node-hours.
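The ordering behind delayed initialization fits in a few lines. A toy sketch (pure illustration; the function names and the trivial single-node join_allreduce are assumptions, not the actual VW/Hadoop code):

    # Cache ALL data before joining the allreduce tree, so the common (disk)
    # failures happen while Hadoop can still cheaply restart or speculatively
    # re-run this mapper, rather than in the middle of synchronization.
    def run_mapper(lines, join_allreduce):
        data = [line.strip() for line in lines]   # 1. read and cache everything
        n_nodes = join_allreduce()                # 2. only now join allreduce
        return data, n_nodes                      # 3. training starts after this

    data, n = run_mapper(["example 1", "example 2"], join_allreduce=lambda: 1)
    print(len(data), n)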

Robustness & Speedup

[Figure: "Speed per method": speedup (1 to 10) vs. number of nodes (10 to 100), showing the average, min, and max over 10 runs against the linear-speedup line.]

Splice Site Recognition

[Figure: auPRC (0.2 to 0.55) vs. iteration (0 to 50) for Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.]

Splice Site Recognition

[Figure: auPRC (0 to 0.6) vs. effective number of passes over the data (0 to 20) for L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]

What about parallel deep learning? Needs to work with a GPU.

GPU Allreduce optimization 1: Minibatch gradient descent

GPUs have much more computation than communication. Give every GPU n examples, compute the average gradient on them, and synchronize the gradient across GPUs.
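A toy version of this synchronous pattern (a sketch, not the talk's code: plain lists stand in for GPUs, and the least-squares gradient is made up for illustration):

    # Synchronous minibatch SGD: every worker computes a gradient on its own
    # examples; the gradients are averaged (one allreduce) before the update.
    def sgd_step(w, shards, grad_fn, lr=0.1):
        grads = [grad_fn(w, shard) for shard in shards]    # one per "GPU"
        avg = [sum(g[j] for g in grads) / len(grads)       # synchronize gradient
               for j in range(len(w))]
        return [wi - lr * gi for wi, gi in zip(w, avg)]

    # Toy least-squares gradient for a scalar model w[0] * x on pairs (x, y).
    grad = lambda w, shard: [sum(2 * (w[0] * x - y) * x for x, y in shard) / len(shard)]

    w = [0.0]
    for _ in range(50):
        w = sgd_step(w, [[(1.0, 2.0)], [(2.0, 4.0)]], grad)
    print(w)   # approaches [2.0]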

GPU Allreduce optimization 1: Half-Batch

Do communication asynchronously to computation: one half-batch communicates while the other computes on older parameters.
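A single-process toy of the overlap (a sketch under assumptions: a Python thread stands in for the GPU's communication engine, and the scalar gradient is made up):

    import threading

    w = [0.0]

    def sync_and_apply(g):              # the "communication": sync + update
        w[0] -= 0.1 * g

    pending = None
    for half_batch in [1.0, -0.5, 2.0, 0.25]:
        g = 2 * (w[0] - half_batch)     # the "computation", on possibly stale w
        if pending:
            pending.join()              # previous sync overlapped with computing g
        pending = threading.Thread(target=sync_and_apply, args=(g,))
        pending.start()
    pending.join()
    print(w)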

GPU Allreduce optimization 1: 1-bit SGD

Discretize the gradient to 1 bit before communicating it off the GPU. Keep and accumulate the discretization errors on the GPU.
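The error-feedback trick in a few lines (a sketch of the idea from the 1-bit SGD paper, not its implementation; the per-vector mean-magnitude scale is an assumption):

    # Send only sign(g) times one shared scale; carry the quantization error
    # forward so nothing is lost, just delayed.
    def one_bit_quantize(grad, error):
        g = [gi + ei for gi, ei in zip(grad, error)]      # add carried error
        scale = sum(abs(gi) for gi in g) / len(g)         # one float per vector
        q = [scale if gi >= 0 else -scale for gi in g]    # 1 bit per coordinate
        new_error = [gi - qi for gi, qi in zip(g, q)]     # residual stays on GPU
        return q, new_error

    q, err = one_bit_quantize([0.3, -1.2, 0.05], [0.0, 0.0, 0.0])
    print(q, err)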

GPU Allreduce Optimization 2: Ring Allreduce

Every node masters a subset of the vector, and messages travel in a ring.
1. Downside: latency increase?
2. Upside: perfectly efficient synchronization.
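A single-process simulation of the ring (a sketch: a real implementation runs the 2(k-1) steps concurrently across k machines; here they are just loops):

    # Ring allreduce = reduce-scatter, then allgather. The vector is split
    # into k chunks; at each step node i passes one chunk to node (i+1) % k,
    # so every link carries the same traffic (bandwidth-optimal), at the
    # price of latency growing with k.
    def ring_allreduce(vectors):
        k = len(vectors)                     # k nodes, k chunks of size 1 here
        bufs = [list(v) for v in vectors]    # bufs[i][c]: node i's copy of chunk c
        for s in range(k - 1):               # reduce-scatter: accumulate sums
            for i in range(k):
                c = (i - s - 1) % k          # chunk node i receives and adds
                bufs[i][c] += bufs[(i - 1) % k][c]
        for s in range(k - 1):               # allgather: circulate finished chunks
            for i in range(k):
                c = (i - s) % k              # completed chunk node i receives
                bufs[i][c] = bufs[(i - 1) % k][c]
        return bufs

    print(ring_allreduce([[7, 5, 6, 1], [2, 3, 4, 0], [1, 1, 1, 1], [0, 0, 0, 0]]))
    # every node ends with the elementwise sum: [10, 9, 11, 2]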

Bibliography

L-BFGS: J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773-782, 1980.
grad sum: C. Teo, Q. Le, A. Smola, and S.V.N. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.
Rings 1: Pitch Patarasuk and Xin Yuan, Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations, JPDC 2009.
avg.: G. Mann et al., Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.
ov. avg: M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
P. online: D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.
D. Mini: O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, JMLR 2012.

Bibliography II

Tera: Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, and John Langford, A Reliable Effective Terascale Linear Learning System, arXiv 2011 / JMLR 2014.
DistBelief: Dean et al., Large Scale Distributed Deep Networks, NIPS 2012.
One-bit: Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu, 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014.
Rings 2: Amodei et al., Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, ICML 2016.