Thus, we input the token sequence and the CFG of bytecode into the customized deep learning model, and generate the comment through the trained model. In addition, the token sequence in the Code area reflects the program semantic information of the bytecode. Inspired by this, we convert the bytecode into a CFG and capture the bytecode structural information from the CFG. 2019), and model the code structural information from the AST. To capture the code structural information, those methods generating comments at source code level convert the code into an AST (LeClair et al. The Code area mainly contains opcodes, which are similar to assembly language. Then, we mainly use the information in the Code area of the disassembled bytecode. Specifically, we first disassemble the bytecode to obtain the assembly instruction. 2017), to generate comments for bytecode. In this paper, we propose BCGen, a Transformer based model (Vaswani et al. Compared with source code, bytecode is more abstract, and the code structural information is missing. Bytecode is a binary file that contains an executable program, consisting of a sequence of opcode/data pairs, and is a kind of intermediate code (Dahm 1999). However, at the bytecode level, it is very challenging to generate comments for it. 2018b) and abstract syntax tree (AST) LeClair et al. The input data used by these methods are the information extracted from the source code, such as API information (Hu et al. At present, the most prevalent methods are based on deep learning (Sridhara et al. 2010 Jackson and Waingold 2001), the comments of the bytecode can provide a bird’s-eye view of the system’s functionalities, which can be used as important supporting information with decompilation tools.Īt the source code level, researchers have carried out many studies on automatic comment generation methods (Alon et al. (2) When programmers face with a task of decompilation for a large-scale system (Krogmann et al. Therefore, the comments of the bytecode of the API can help determine the real behavior of the API. However, if an attacker deliberately names an API to hide its true functionality, we cannot analyze the API’s behavior by its name. 2019), some existing tools detect the API from bytecode to determine its specific behavior (e.g., accessing the phone’s contact list). For example, (1) In APP sensitive-resource-access detection (Nguyen et al. Furthermore, generating comments for bytecode has practical implications. If we can generate natural language comments for bytecode, it will greatly save developers time to understand bytecode. Therefore, there is a need for an easy, fast, and accurate way to understand the bytecode. In these tasks, developers often need to analyze and understand the bytecode. Meanwhile, it is important to take structured and semantic information into account in generating bytecode comments.Īs an intermediate code, bytecode has been widely used in various software tasks, such as malware detection (Zhao et al. It is concluded that it is possible to generate natural language comments directly from the bytecode. Experimental results show that the BLEU of BCGen can reach 0.26, which outperforms several baselines and proves the effectiveness and practicability of our method. We obtain the bytecode by building the Jar packages of the well-known open-source projects in the Maven repository and construct a bytecode dataset to train and evaluate our model. Then a transformer model combining gate recurrent unit is proposed to learn the features of bytecode to generate comments. Specifically, to get the structured information of the bytecode, we first generate the control flow graph (CFG) of the bytecode, and serialize the CFG with bytecode semantic information. In order to understand the meaning of the bytecode more quickly and accurately and further help programmers in more software activities, we propose a bytecode comment generation method (called BCGen) using neural language model. Bytecode has been widely used in various software tasks such as malware detection and clone detection. Unlike human-readable source code, bytecode is even harder to understand for programmers and researchers. Bytecode is a form of instruction set designed for efficient execution by a software interpreter.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |