The article presents a novel approach to code stylometry, the analysis of the writing style of software developers. The authors propose a two-phase method that first generates abstract syntax trees (ASTs) from source code and then transforms selected features into real-valued vectors using a word2vec embedding model. These vectors represent neural features in a low-dimensional vector space, enabling the analysis of similarities and differences between code samples.
To begin with, the authors explain that stylometry is the study of natural language writing style, and its application to software development has been gaining attention due to its potential to identify subtle variations in coding patterns. The article references earlier works in this field, including a scalable analysis approach called Writeprints, which extracts various features from source code to detect anonymized cyberspace entities.
The authors then introduce their two-phase method, which they call Neural Representation (NR). In phase 1, a parser generates ASTs from the source code; in phase 2, selected features are transformed into embedding vectors using word2vec. The authors explain that word2vec is a popular technique for learning vector representations of words from their co-occurrence patterns in text data.
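The two phases can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the summary does not name the parser or the target language, so Python's standard `ast` module stands in for phase 1, and the helper names `ast_token_sequence` and `skipgram_pairs` are hypothetical. For phase 2, the sketch stops at the (center, context) pairs that a skip-gram style word2vec model would typically be trained on, rather than training a model.

```python
import ast


def ast_token_sequence(source):
    """Phase 1 (assumed form): parse the source and return the AST node
    type names in depth-first, pre-order traversal."""
    tree = ast.parse(source)
    tokens = []

    def visit(node):
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return tokens


def skipgram_pairs(tokens, window=2):
    """Phase 2 input (assumed form): every (center, context) pair whose
    positions are at most `window` tokens apart."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical code sample used only for illustration.
sample = "def add(a, b):\n    return a + b\n"
tokens = ast_token_sequence(sample)
pairs = skipgram_pairs(tokens, window=1)
```

Here `tokens` begins with `Module` and `FunctionDef`, and `pairs` holds the neighbouring node types that co-occur within the chosen window. Which AST features the authors actually select is not specified in this summary, so using node type names is a simplification.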
To illustrate how NR works, the authors provide an example of transforming cstyle features (e.g., naming conventions and coding patterns) into embedding vectors. They demonstrate that these vectors can be used to identify similarities between code samples, even when the code itself is not identical. The authors also discuss the choice of sequence length and vector dimensions in word2vec, emphasizing the need to experiment with different parameters to find the most suitable configuration for a particular task.
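One common way to compare embedding vectors is cosine similarity, sketched below. The summary does not say which similarity measure or pooling step the authors use, so both the mean-pooling in `embed_sample` and the three-dimensional `toy_embeddings` values are assumptions made up for illustration; they are not results from the article.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def embed_sample(tokens, embeddings):
    """Represent a code sample as the mean of its known token vectors
    (an assumed pooling choice, not taken from the article)."""
    return mean_vector([embeddings[t] for t in tokens if t in embeddings])


# Hypothetical embeddings; real ones would come from a trained word2vec model.
toy_embeddings = {
    "FunctionDef": [0.9, 0.1, 0.0],
    "Return": [0.8, 0.2, 0.1],
    "For": [0.1, 0.9, 0.2],
    "While": [0.2, 0.8, 0.3],
}

sample_a = embed_sample(["FunctionDef", "Return"], toy_embeddings)
sample_b = embed_sample(["FunctionDef", "Return", "Return"], toy_embeddings)
sample_c = embed_sample(["For", "While"], toy_embeddings)

sim_ab = cosine_similarity(sample_a, sample_b)
sim_ac = cosine_similarity(sample_a, sample_c)
```

With these toy values, `sim_ab` is larger than `sim_ac`: the two samples built from the same kinds of nodes score as more similar even though their token sequences are not identical. The vector dimension (3 here) is one of the parameters the authors suggest tuning for a given task.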
Throughout the article, the authors use engaging analogies and metaphors to help readers understand complex concepts. For instance, they compare the process of transforming features into embedding vectors to cooking a meal, where the ingredients (features) are mixed together in a specific way to create a flavorful dish (vector representation). They also liken the AST to a blueprint of a building, which can be used to analyze the structure and organization of the code.
Overall, the article gives a concise overview of Neural Representation for code stylometry and offers insights into how the approach can identify subtle variations in coding patterns and help improve software development practices. By using everyday language and engaging analogies, the authors make complex concepts accessible to a broad audience.
Computer Science, Software Engineering