DECKARD: Scalable and Accurate Tree-based Detection of Code Clones

  1. Focus / Problem to be solved
    1. Existing approaches either do not scale to large code bases or are not robust against minor code modifications.
  2. Importance
    1. Eliminate duplicated code in large code bases
  3. Method
    1. Generate an Abstract Syntax Tree (AST) or parse tree for the source (comparing the trees directly is too expensive for large programs)
    2. Generate a characteristic vector for the tree and each of its subtrees
    3. Compare the characteristic vectors, rather than the trees themselves, to detect clones using scalable techniques such as locality-sensitive hashing (LSH)
  4. Context
    1. Tree similarity detection
    2. Studies on Simple Code Clones (evolution of software)
    3. Simple Clone detection
      1. CP-Miner, CCFinder (token-based, string-based)
    4. High-level Structural Clone detection using data mining techniques
      1. Frequent itemset mining
    5. Semantics-based clone detection (not very scalable)
      1. Uses Program Dependence Graphs (PDGs)
  5. Results
    1. Scales to large code bases
    2. Detects more clones, including ones with lower similarity scores
  6. Unique contributions
    1. A new similarity definition using abstract syntax trees
    2. A scalable tree-based similarity calculation algorithm using characteristic vectors
  7. Possible applications
    1. Better clone detection tools
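
The Method steps above can be illustrated with a minimal sketch. This is not DECKARD's implementation: it assumes Python's standard `ast` module as the parser, counts every AST node type as one vector dimension, and compares vectors with plain Euclidean distance instead of clustering them at scale. The names `characteristic_vectors`, `function_vector`, `euclidean`, and the `min_nodes` threshold are hypothetical and chosen only for this illustration.

```python
import ast
import math
from collections import Counter


def characteristic_vectors(source, min_nodes=5):
    """Return (node, vector) pairs for every subtree with at least min_nodes nodes.

    A vector is a Counter mapping AST node type names to occurrence counts.
    Identifier names and literal values are not part of the vector.
    """
    vectors = []

    def visit(node):
        # Postorder: a subtree's vector is its own node type plus its children's vectors.
        vec = Counter({type(node).__name__: 1})
        for child in ast.iter_child_nodes(node):
            vec += visit(child)
        if sum(vec.values()) >= min_nodes:
            vectors.append((node, vec))
        return vec

    visit(ast.parse(source))
    return vectors


def function_vector(source):
    """Vector of the first function definition in source (illustration helper)."""
    for node, vec in characteristic_vectors(source):
        if isinstance(node, ast.FunctionDef):
            return vec
    raise ValueError("no function definition found")


def euclidean(u, v):
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


# Two functions that differ only in identifier names, and one with extra arithmetic.
a = "def f(x, y):\n    t = x + y\n    return t * 2\n"
b = "def g(p, q):\n    s = p + q\n    return s * 2\n"
c = "def h(p, q):\n    s = p + q\n    return s * 3 + 1\n"

dist_ab = euclidean(function_vector(a), function_vector(b))
dist_ac = euclidean(function_vector(a), function_vector(c))
print(dist_ab, dist_ac)
```

Because the vectors ignore identifier names, `a` and `b` are at distance zero, while the small edit in `c` yields a small nonzero distance. That is the sense in which vector comparison tolerates minor code modifications. Comparing all pairs as done here would not scale; a scalable variant would hash nearby vectors into the same bucket and compare only within buckets.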
