Paper Summary: Code Completion with Statistical Language Models

Code Completion with Statistical Language Models

This is a paper summary of the 2014 PLDI paper

  1. Problems/Focus
    1. Existing techniques cannot synthesize usable code beyond simple sequences required for instantiation of library objects.
  2. Importance
    1. synthesize sequences of calls across multiple objects together with their arguments
      1. most existing approaches to code completion target completion based on shallow semantic information, and cannot capture the temporal information required for synthesizing correct code using a library
  3. Context
    1. Programming Language as a natural language
    2. Code completion and Synthesis
    3. Synthesis with Partial Programs
  4. Implementation
    1. high-level design: extract sequences of method calls from a large codebase, and index those into a statistical language model. Then the program use the language model to find the highest ranked sentences.
    2. Part1: Statistical language model (NLP)
    3. Part2: Alias analysis (Classical Programming Language Concepts)
  5. Evaluation
    1. Dataset:
      1. Android Data
        1. Android source code from various source repositories
        2. use Android methods as training data
      2. Parameters adjusted
        1. size of the training data set
          1. entire database
          2. 10%
          3. 1%
        2. Enable/Disable Alias Analysis
        3. Language Model
          1. 3-gram
          2. RNNME-40 RNN
          3. a combination of the two
      3. Tasks
        1. Task1: Single Object, Single Completion
          1. 20 tasks that a programmer want to finish
        2. Task2: General Completion
          1. multiple holes
        3. Task3: Random introduction of single or multiple holes
      4. Analysis of results
        1. it seems that 3gram model is always better than RNN
          1. Only in once case the combined approach is better than 3 gram (Task3 top 3 completion, by merely 1)
        2. The top-3 reaches over 90% precision
        3. Very few type check errors
          1. all the 1000+ returned completions were manually examined
        4. Aliasing significantly expanded the training data
      5. My own opinion
        1. the sample size is too small
        2. no automated way to examine the results
  6. Contributions
    1. Given a program with holes, the program synthesize completions for holes with the most likely sequences of method calls.
    2. A new approach to code completion for programs using APIs
    3. A scalable static analysis that extracts sequences of method calls from large codebases, and indexes them into statistical language models such as N-gram and Recurrent Neural Networks
    4. A synthesis that completes holes in programs by leveraging statistical models
  7. Applications
    1. code completion (program synthesis)
  8. Tools
    1. Soot
      1.  to work with intermediate representations (program analysis)
    2. SRLM: n-gram language models
  9. Jimple???? (What is Jimple)
    1. RNNLM: recurrent neural networks
  10. Questions
    1. What is statistical model?
      1. Markov model, the probability of word(i) given a history of h(i-1)
      2. This paper used a trigram model, meaning Pr(W(i) | (wi-2, wi-1)
    2. What is Recurrent Neural Networks???
      1. use neural networks to predict the probabilities of sentences
      2. They not only capture regularities between a word and a fixed number of predecessors, but also capture longer distances between words
    3. Combination of 3gram and RNN
      1. performs better than 3gram and RNN individually
    4. why program analysis?  How does it work with Statistical language model
      1. The definition of a “word” and a “sentence”
        1. word: a event in the semantics
          1. An event for an object o corresponds to an invocation of an API method involving the object o: o can either be the receiver object (this), the return value returned by the API invocation, or one of the arguments to the method invocation.

        2. sentence:  a history sequence
      2. Program analysis
        1. extract the abstract objects (word) and histories (sentences)
      3. Statistical
        1. build the language models and calculate the probabilities of the top candidates
    5. Why history per object?
      1. This concrete history makes up “sentences”, see “model” section for more details
    6. Synthesis (The problem is how to formally define a synthesis problem)
      1. specify the variables (can be a single or multiple variables)
    7. Need to work out a concrete example to fully understand the model
      1. the stack overflow example
        1. The concrete history of abstract variables (step 1)
        2. Partial sequence extraction (step2)
        3. Candidate ranking/selection
        4. Optimality and Consistency
This entry was posted in Programming Language. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s