Paper Review: Latte: A Language, Compiler, and Runtime for Elegant and Efficient Deep Neural Networks

These are my thoughts on and summary of the PLDI 2016 paper, Latte: A Language, Compiler, and Runtime for Elegant and Efficient Deep Neural Networks.

Latte: A Language, Compiler, and Runtime for Elegant and Efficient Deep Neural Networks

  1. Abstract
    1. natural abstractions
      1. ensembles of neurons with connections between them 
    2. Domain-specific optimizations
      1. emits efficient code for heterogeneous architectures
    3. Distributed runtime
      1. distributed memory parallelism 
  2. Intro
    1. problem statement
      1. the popular “array programming model” is a poor abstraction for programming neural networks
      2. it also has poor performance
        1. everything must be lowered to matrix multiplications
      3. missing cross-layer optimization
        1. a static library approach cannot do this
    2. Latte
      1. DSL
        1. good abstraction
        2. suite of performance optimizations
  3. Background
    1. Neural Networks
  4. Language Design
    1. Neuron
      1. Weighted Neuron
      2. forward function
      3. backward function
    2. Connections
    3. Network
    4. Examples
      1. Weighted Neurons
      2. Fully connected layers (see the sketch after this section)
      3. Convolution Layers
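To make the neuron abstraction concrete, here is a minimal Python sketch of what a WeightedNeuron with forward and backward functions, and a fully connected layer built as an ensemble of such neurons, might look like. Latte itself is a DSL embedded in Julia; the class and method names below are my own illustrative choices mirroring the outline, not Latte's actual API.

```python
# A minimal Python sketch of the neuron abstraction described above.
# WeightedNeuron / forward / backward are illustrative names, not Latte's API.
import numpy as np

class WeightedNeuron:
    """One neuron holding a weight per incoming connection."""
    def __init__(self, num_inputs):
        self.weights = np.random.randn(num_inputs) * 0.01
        self.grad_weights = np.zeros(num_inputs)

    def forward(self, inputs):
        # output value = weighted sum of the values of connected neurons
        return float(self.weights @ inputs)

    def backward(self, inputs, grad_output):
        # accumulate the weight gradient and propagate the gradient upstream
        self.grad_weights += grad_output * inputs
        return grad_output * self.weights

class FullyConnectedLayer:
    """An ensemble of WeightedNeurons, each connected to every input."""
    def __init__(self, num_inputs, num_neurons):
        self.neurons = [WeightedNeuron(num_inputs) for _ in range(num_neurons)]

    def forward(self, inputs):
        return np.array([n.forward(inputs) for n in self.neurons])
```

A convolution layer differs only in its connection structure: each neuron connects to a local window of the previous layer rather than to every neuron.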
  5. Latte Compiler
    1. Internal Representation of networks
      1. Adjacency list (the network is stored as a graph; see the sketch after this section)
    2. Analysis of Shared Variables
    3. Synthesis 
      1. Dataflow
      2. Compute
      3. Distributed Memory Communication
    4. Optimizations
      1. Library Kernel Pattern Matching
      2. Loop tiling
        1. Keep shared memory variables in cache 
      3. Cross-Layer Fusion (see the sketch after this section)
      4. Parallelization
      5. Code Generation 
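As a rough illustration of 5.1, here is a sketch of storing a network as a graph of ensembles with adjacency lists. The type and field names are assumptions for illustration, not Latte's internal representation.

```python
# A sketch of an adjacency-list network representation, as in 5.1.
# Ensemble / Network and their fields are illustrative, not Latte's IR.
class Ensemble:
    """A named group of neurons of the same type (roughly, one layer)."""
    def __init__(self, name, neurons):
        self.name = name
        self.neurons = neurons

class Network:
    def __init__(self):
        self.ensembles = {}
        self.edges = {}  # ensemble name -> names of downstream ensembles

    def add(self, ensemble):
        self.ensembles[ensemble.name] = ensemble
        self.edges.setdefault(ensemble.name, [])

    def connect(self, src_name, dst_name):
        # record an edge from src to dst; the actual compiler also analyzes
        # the per-neuron connection structure, which this sketch omits
        self.edges[src_name].append(dst_name)
```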
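Since cross-layer fusion (5.4.3) is the optimization my weaknesses below pick on, here is a minimal 1-D sketch of the idea under my own assumptions: applying ReLU inside the convolution loop avoids materializing and re-reading an intermediate buffer. This is illustrative only, not the loop nests Latte actually generates.

```python
# A minimal 1-D sketch of cross-layer fusion: the fused version applies
# ReLU while each convolution result is still live, instead of writing
# the whole intermediate buffer and re-reading it in a second pass.
# Illustration of the idea only, not code generated by Latte.
import numpy as np

def conv_then_relu(x, w):
    # unfused: two separate loop nests with an intermediate buffer
    out = np.zeros(len(x) - len(w) + 1)
    for i in range(len(out)):
        out[i] = sum(w[k] * x[i + k] for k in range(len(w)))
    return np.maximum(out, 0.0)

def conv_relu_fused(x, w):
    # fused: one loop nest, no second pass over the intermediate values
    out = np.zeros(len(x) - len(w) + 1)
    for i in range(len(out)):
        acc = sum(w[k] * x[i + k] for k in range(len(w)))
        out[i] = acc if acc > 0.0 else 0.0
    return out
```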
  6. Latte Runtime
    1. employs data parallelism on a distributed-memory cluster (sketched below)
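A minimal sketch of data parallelism in this style, assuming an MPI-based runtime: each node computes gradients on its own shard of the minibatch, and gradients are averaged with an all-reduce before the weight update. mpi4py, train_step, and model_grad_fn are my own illustrative choices, not the Latte runtime's API.

```python
# A sketch of distributed data parallelism: every rank holds a full model
# replica, computes gradients on its shard of the minibatch, and gradients
# are summed across ranks before the weight update.
# mpi4py and model_grad_fn are illustrative assumptions, not Latte's runtime.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def train_step(model_grad_fn, minibatch):
    shard = np.array_split(minibatch, size)[rank]   # local slice of the batch
    local_grad = model_grad_fn(shard)               # per-node gradient
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    return global_grad / size                       # averaged gradient
```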
  7. Weakness
    1. Fusion
      1. Only works for specific combinations
        1. Convolution + ReLU + Pooling
        2. Convolution + Convolution does not work
    2. Performance 
      1. gains from fusion are only shown on a micro-benchmark (the first three layers of VGG), even though the intro leaned heavily on this story
        1. most of the time is spent in compute, so it is unclear why fusing Convolution + ReLU + Pooling would have much impact
      2. no hardware counter measurements to demonstrate the impact of fusion
      3. most of the speedup seems to come from parallelization, not the other optimizations
    3. Overall
      1. It will not really make an impact; everyone will still use GPUs
      2. It is not clear that this is really a better way to program neural networks
      3. No analysis of why existing frameworks perform poorly
      4. Caffe never claimed to be high-performance
      5. It is not clear that the networks they pick are representative of neural networks in general
  8. Strengths
    1. Clearly a lot of work
      1. single node
      2. multi node
    2. Justifies the design of a DSL
      1. kernel fusion 
    3. New hardware
      1. Xeon Phi
    4. Large cluster

 
