This paper presents a new approach to analyzing source code by representing it as a hierarchical structure of lines and files, rather than treating it as a stream of tokens. The authors use a tokenizer called Byte-Pair Encoder (BPE) to convert each line into distinct tokens, which are then fed into a feed-forward network for defect detection. The output of the feed-forward network is passed through a softmax layer to produce a probability matrix indicating the likelihood that each line is defective or not.
The authors’ approach differs from traditional deep learning models that treat source code as a stream of tokens because it captures the hierarchical structure of source code documents. This allows for more accurate analysis and understanding of the code. The use of BPE tokenizer also helps to improve the accuracy of the model by creating more precise tokens.
The paper also discusses the importance of pre-processing and tokenization in deep learning models, emphasizing the need to capture the hierarchical structure of source code documents. The authors provide examples of how their approach can be applied to different programming languages, including Python and Java.
In summary, this paper presents a new approach to analyzing source code by representing it as a hierarchical structure of lines and files, using a tokenizer called BPE to create more precise tokens, and feeding them into a feed-forward network for defect detection. The authors’ approach captures the hierarchical structure of source code documents, leading to more accurate analysis and understanding of the code.
Computer Science, Software Engineering