Understanding the content of source code and its regular expressions is very difficult when they are written in an unfamiliar language. Pseudo-code explains and describes the content of the code without relying on the syntax or technology of a particular programming language. However, writing Pseudo-code for each code instruction is laborious. Recently, neural machine translation has been used to generate textual descriptions of source code. In this paper, a novel deep learning-based transformer (DLBT) model is proposed for automatic Pseudo-code generation from source code. The proposed model uses deep learning based on Neural Machine Translation (NMT) to work as a language translator. The DLBT is based on the transformer, which is an encoder-decoder structure. There are three major components: tokenizer and embeddings, transformer, and post-processing. Each code line is tokenized into a dense vector. Then the transformer captures the relatedness between the source code and the matching Pseudo-code without the need for a Recurrent Neural Network (RNN). In the post-processing step, the generated Pseudo-code is optimized. The proposed model is assessed using a real Python dataset, which contains more than 18,800 lines of source code written in Python. The experiments show promising performance compared with other machine translation methods such as RNNs. The proposed DLBT records 47.32 and 68.49 in terms of accuracy and BLEU performance measures, respectively.
In the software development cycle [
Therefore, an automatic way of generating Pseudo-code from source code is needed. Furthermore, Statistical Machine Translation (SMT) has been used to generate Pseudo-code from source code [
In this paper, a novel deep learning-based transformer (DLBT) model is proposed for automatic Pseudo-code generation from the source code. The proposed model is a transformer machine translation model that is based on deep learning [
The main goal of the proposed model is to automatically generate Pseudo-code while avoiding the vanishing-gradient problem, because each layer has access to all input tokens. In addition, the DLBT applies the attention layer independently of the previous state. This independence lowers computation cost and speeds up training by avoiding weight calculations between states or the use of a long short-term memory.
This paper is organized as follows. Section 2 presents a literature review. Section 3 presents the proposed DLBT model. Section 4 shows the experimental results, and finally, Section 5 concludes the paper.
Machine Translation (MT) is used to translate text from one language into another. MT methods comprise rule-based approaches and SMT. Recently, Neural Machine Translation (NMT) has become increasingly popular for language modeling. In [
In [
The authors in [
The work of [
Studies | Limitation
---|---
[ | LSTM requires high memory for training experiments.
[ | Many translation errors with material that does not resemble the content of the training data.
[ | Training and testing phases are long.
[ | CNN works for a fixed sentence size; a performance reduction occurs when the sentence becomes too long.
[ | In the decoder, sentences are processed word by word, which leads to high computation time.
[ | Low performance compared to other machine translation methods; training takes too much time.
The proposed Deep Learning-Based Transformer (DLBT) model generates Pseudo-code from source code based on Transformer Neural Machine Translation (TNMT). The proposed model is organized as a stack of self-attention and pointwise, fully connected layers, and consists of three components: Tokenization and Embedding; Transformer (encoding and decoding); and Post-processing.
In this step, each sentence is tokenized. For example, the sentence s = s.replace('\t', '\\t') is tokenized as [s, =, s, ., replace, (, ', \, t, ', ,, ', \, \, t, ', )]. The order of the tokens is unimportant at this stage, because the input is represented as a set of tokens. Two different methods are used to tokenize:
Method 1: the rule expression of the Tokenizer is ([^A-Za-z0-9_:.]). Method 2: the rule expression of the Tokenizer is (
For example, consider three statements: “the_expression=('?\/∗+−%<>:&|')”, “default: “html,txt”, or “js””, and “z2=x1+y3”. The output of the tokenization step after applying Method 1 is as follows:
Statement 1: [the_expression, =, (, ', ?, \, /, ∗, +, −, %, <, >, :, &, |, ', )]; Statement 2: [default:, “, html, ',', txt, “, ',', or, “, js, “]; Statement 3: [z2, =, x1, +, y3].
The result of the tokenizing step after applying Method 2 is as follows:
Statement 1: [the_expression, =, (, '?\/∗+−%<>:&|', )]; Statement 2: [default:, “html, ',', txt”, ',', or, “js”]; Statement 3: [z2, =, x1+y3].
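To make the tokenization step concrete, the following is a minimal sketch of the Method 1 tokenizer (the Method 2 rule expression is not reproduced here, so it is omitted). Splitting on the rule expression with a capture group keeps identifiers intact and emits every other symbol as its own token:

```python
import re

# Minimal sketch of the Method 1 tokenizer. Splitting on the rule
# expression ([^A-Za-z0-9_:.]) with a capture group keeps identifiers such
# as z2 or the_expression intact and emits every other symbol as a token.
def tokenize_method1(line: str) -> list:
    tokens = re.split(r"([^A-Za-z0-9_:.])", line)
    return [t for t in tokens if t.strip()]  # drop empty strings and bare spaces

print(tokenize_method1("z2=x1+y3"))  # ['z2', '=', 'x1', '+', 'y3']
```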
Both methods are used for Pseudo-code and code. Finally, additional tokens are added at the beginning and end of each sequence, for example, <sop> for start-of-Pseudo-code and <eop> for end-of-Pseudo-code. After tokenization, token embedding is applied. This is a way to represent similar tokens by similar encodings: floating-point values are assigned to each token, constructing a dense vector. Finally, positional encoding is applied to capture the sequence order by altering the embeddings depending on position. Before the first attention layer, a small set of constants is added to the embedding vector as in [
For example, the sentence “I love the Transformer model” is tokenized as [I, love, the, Transformer, model]. Then, each token is assigned an index, such as the indexes in the list [224, 378, 962, 1123, 4136]. Each token is initialized with a d-dimensional vector whose values are random numbers from −1 to 1. Each vector value represents linguistic features that are used during the training step. If two tokens share similar language characteristics and appear in similar contexts, their embedding values are updated to be closer during training.
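A minimal PyTorch sketch of this embedding step is shown below; the embedding width d = 512 is an assumption, and the uniform initialization range follows the description above:

```python
import torch
import torch.nn as nn

# Token embedding as described above: each vocabulary index maps to a
# d-dimensional dense vector initialized with random values in [-1, 1].
vocab_size, d_model = 10_867, 512  # d_model = 512 is an assumed width
embedding = nn.Embedding(vocab_size, d_model)
nn.init.uniform_(embedding.weight, -1.0, 1.0)

tokens = torch.tensor([224, 378, 962, 1123, 4136])  # "I love the Transformer model"
vectors = embedding(tokens)  # shape: (5, d_model); refined during training
```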
In positional encodings, sine and cosine functions of different frequencies are used, following the standard transformer formulation: PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)), where pos is the token position, i indexes the embedding dimensions, and d is the embedding width.
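A sketch of this sinusoidal encoding, assuming the standard formulation from the transformer paper cited above:

```python
import torch

# Sinusoidal positional encoding: even dimensions use sine, odd use cosine,
# so each position receives a unique, smoothly varying pattern.
def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angles = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe  # added to the token embeddings before the first attention layer

pe = positional_encoding(963, 512)  # 963 = maximum source-code sequence length
```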
The DLBT is an encoder-decoder structure. The encoder processes the input iteratively, and the decoder then does the same with the encoder's output. The encoder and decoder are composed of uniformly chained layers. Both use an attention mechanism to weigh, for each input, the relevance of every other input and pull information from them accordingly to produce the result. In addition, a feed-forward neural network is used in both the encoder and decoder to perform additional processing of the outputs. Residual connections and a normalization layer [
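This encoder-decoder stack corresponds to the standard transformer module; a minimal PyTorch skeleton for the six-layer variant is sketched below (the width, head count, and feed-forward size are assumptions, as they are not stated here):

```python
import torch.nn as nn

# Skeleton of the six-layer encoder-decoder stack described above.
# d_model, nhead, and dim_feedforward are assumed hyperparameters.
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,   # eight for the eight-layer DLBT variant
    num_decoder_layers=6,
    dim_feedforward=2048,
)
```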
In Multi-Head Attention, the attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is calculated using a compatibility function of the query with the corresponding key, as shown in
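The following is a compact sketch of the scaled dot-product attention that underlies each head, assuming the standard formulation softmax(QKᵀ/√d_k)V:

```python
import math
import torch

# Scaled dot-product attention: a weighted sum of the values, where each
# weight comes from the compatibility (dot product) of a query with a key.
def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```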
The decoder has an additional layer, Masked Multi-Head Attention. It works like the multi-head attention layer but uses an extra mask inside the Scaled Dot-Product Attention: all values in the input of the softmax that correspond to illegal connections are set to negative infinity.
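The illegal connections are simply the future positions of the output sequence; a lower-triangular mask hides them, as sketched below for use with the attention function above:

```python
import torch

# Causal mask for the decoder: position i may attend only to positions <= i.
# Zeros mark illegal connections, which the attention sets to -infinity.
def causal_mask(size: int) -> torch.Tensor:
    return torch.tril(torch.ones(size, size)).bool()

print(causal_mask(4))
```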
In the testing phase, the proposed model corrects the generated Pseudo-code, acting as a tokenizer by adjusting the spaces between tokens such as (\'\"\ \/?|&<>∗+%). For example, the translation of the code “s = s.replace('\r', '\\r')” is “replace every occurrence of '\ r' in s with '\ \ r'.”, which is corrected to “replace every occurrence of '\r' in s with '\\r'.”, as shown in
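The correction rules are not given explicitly; the sketch below is a hypothetical illustration that re-attaches escape characters separated by spaces, reproducing the example above:

```python
import re

# Hypothetical post-processing rule: delete the space that tokenization
# inserted after a backslash, so "\ r" becomes "\r" and "\ \ r" becomes "\\r".
def fix_spacing(pseudo: str) -> str:
    return re.sub(r"\\\s+", r"\\", pseudo)

print(fix_spacing("replace every occurrence of '\\ r' in s with '\\ \\ r'."))
# -> replace every occurrence of '\r' in s with '\\r'.
```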
The proposed deep learning-based transformer (DLBT) model is evaluated using the Django codebase [ Method (1) tokenizer: the vocabulary sizes of Pseudo-code and Python code are 10,867 and 6,226, respectively; the maximum sequence lengths of Pseudo-code and source code are 538 and 963, respectively. Method (2) tokenizer: the vocabulary sizes of Pseudo-code and Python code are 13,060 and 8,552; the maximum sequence lengths are 522 and 613.
The proposed DLBT model is implemented using PyTorch [
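At inference time, generation can proceed token by token. The sketch below is a hypothetical greedy-decoding loop, where model.encode and model.decode are assumed wrapper methods that return the encoder memory and vocabulary logits, respectively:

```python
import torch

# Hypothetical greedy decoding: start from <sop>, repeatedly append the most
# probable next token, and stop at <eop> or the maximum Pseudo-code length.
@torch.no_grad()
def greedy_decode(model, src_ids, sop_id, eop_id, max_len=538):
    memory = model.encode(src_ids)             # assumed wrapper method
    out = torch.tensor([[sop_id]])
    for _ in range(max_len):
        logits = model.decode(out, memory)     # (1, len, vocab), assumed
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_tok], dim=1)
        if next_tok.item() == eop_id:
            break
    return out.squeeze(0)
```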
The BLEU score is a metric used to evaluate the quality of machine translation techniques. Quality is measured in terms of how well the machine translation output (the candidate or hypothesis) matches a human translation output (the reference). The BLEU score ranges from 0 to 1: machine translation output closest to the human translation gets a high BLEU score, and output identical to the human translation scores 1. BLEU uses a modified form of uni-gram, bi-gram, 3-gram, and 4-gram precision to compare a candidate translation against multiple reference translations.
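For instance, BLEU can be computed per sentence with NLTK; the toy reference/candidate pair below is illustrative only:

```python
from nltk.translate.bleu_score import sentence_bleu

# BLEU from modified 1- to 4-gram precisions (equal 0.25 weights by default).
reference = [["replace", "every", "occurrence", "of", "x", "in", "s"]]
candidate = ["replace", "all", "occurrences", "of", "x", "in", "s"]
print(sentence_bleu(reference, candidate))  # < 1.0: only partial n-gram overlap
```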
In line 3, the six-layer DLBT misses the two arguments “col” and “area”. The eight-layer DLBT avoids this error because it is formed of more layers and thus has more learning capacity than the six-layer DLBT. In lines 4 and 5, the six-layer DLBT adds “s is an empty list” to both lines and makes some other mistakes. In line 5, the eight-layer DLBT changes “every” to “all”. This is a mistake compared with the manually generated Pseudo-code. It is a logic error: because DLBT uses WordNet in the tokenization phase, the two words have close meanings and are semantically related.
Moreover, cross-validation is added to the proposed DLBT, which is evaluated in terms of the accuracy and BLEU performance measures, as shown in
# | Python code | Pseudo-code
---|---|---
Manually Generated Pseudo-code | |
1 | from gzip import GzipFile | #from gzip import GzipFile into default name space.
2 | def __init__(self, text): | #define the method __init__ with 2 arguments: self and text.
3 | html4_singlets=('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input') | #html4_singlets is a tuple with 9 elements: strings 'br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr' and 'input'.
4 | s = s.replace('\r', '\\r') | #replace every occurrence of '\r' in s with '\\r'.
5 | s = s.replace('\t', '\\t') | #replace every occurrence of '\t' in s with '\\t'.
Six-layers DLBT | |
1 | from gzip import GzipFile | #from gzip import GzipFile into default name space.
2 | def __init__(self, text): | #define the method __init__ with 2 arguments: self and text.
3 | html4_singlets=('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input') | #html4_singlets is a tuple with 9 elements strings 'br', 'link', 'link', 'base', 'img', 'param', 'param', 'hr' and 'input'.
4 | s = s.replace('\r', '\\r') | #s is an empty list. Replace every occurrence of '\r' in the result with '\\r'.
5 | s = s.replace('\t', '\\t') | #s is an empty list. Replace every occurrence of '\t' in t with '\\t'.
Eight-layers DLBT | |
1 | from gzip import GzipFile | #from gzip import GzipFile into default name space.
2 | def __init__(self, text): | #define the method __init__ with 2 arguments: self and text.
3 | html4_singlets=('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input') | #html4_singlets is a tuple with 9 elements: strings 'br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr' and 'input'.
4 | s = s.replace('\r', '\\r') | #replace every occurrence of '\r' in s with '\\r'.
5 | s = s.replace('\t', '\\t') | #replace all occurrences of '\t' with '\\t'.
# of layers | Tokenizer | #epochs | BLEU | Accuracy
---|---|---|---|---
Six-layers DLBT | Method (1) | 35 | 59.62 | 36.85
Six-layers DLBT | Method (2) | 35 | 59.31 | 36.42
Six-layers DLBT | Method (1) | 75 + cross-validation | 67.00 | 46.89
Six-layers DLBT | Method (2) | 75 + cross-validation | 67.18 |
Eight-layers DLBT | Method (1) | 35 | 58.58 | 35.13
Eight-layers DLBT | Method (2) | 35 | 57.92 | 34.81
Eight-layers DLBT | Method (1) | 75 + cross-validation | 68.31 | 47.06
Eight-layers DLBT | Method (2) | 75 + cross-validation | 68.49 | 47.32
Studies | BLEU%
---|---
Eight-layers DLBT | 68.49
Code2NL [ | 56.54
code2pseudocode [ | 54.78
T2SMT [ | 54.08
Code-GRU [ | 50.81
NoAtt [ | 43.55
RBMT [ | 41.876
CODE-NN [ | 40.51
Seq2Seq [ | 28.26
PBMT [ | 25.17
SimpleRNN [ | 6.45
In [ the previous models process code tokens sequentially and therefore do not model the non-sequential structure of source code. Moreover, source code can be very long, and thus these models may fail to capture the long-range dependencies between code tokens.
The proposed DLBT outperforms the previous models because of the following:
- Tokenization: the rule expression for the Tokenizer is (
- Positional embeddings: the transformer is position-insensitive, so positional encoding is used to guarantee the order relationship between words in the text.
- Multi-head attention mechanism: each input is divided into multiple heads, and each head applies the attention mechanism. This unit is used to compute similarity scores between words in a sentence.
- Cross-validation: it scans the dataset and sets aside a portion for testing the DLBT during the training phase (the validation dataset). It reinforces vocabulary that is not defined in the training data but appears in the test data (a minimal sketch follows this list).
- Post-processing: an important process in the testing phase. It finds and corrects errors in mathematical expressions, such as spaces around single or double quotation marks.
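A minimal sketch of such a cross-validation split is shown below; the fold count of 5 is an assumption, as only "75 + cross-validation" epochs are reported:

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in indices for the ~18,800 code/Pseudo-code pairs in the dataset.
pairs = np.arange(18_800)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # 5 folds assumed
for fold, (train_idx, val_idx) in enumerate(kfold.split(pairs)):
    train_pairs, val_pairs = pairs[train_idx], pairs[val_idx]
    # ... train the DLBT on train_pairs and validate on val_pairs ...
```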
In this paper, a novel deep learning-based transformer (DLBT) model is proposed for automatic Pseudo-code generation from source code. The proposed DLBT is an encoder-decoder structure that consists of three components: Tokenization and Embedding, Transformer (encoding and decoding), and Post-processing. In the tokenization and embedding step, the DLBT manages the order of and relationships between the tokens. Then, the encoder-decoder structure is built using a multi-head attention mechanism to calculate similarity scores between the tokens. Finally, the correction function is applied. The proposed model is evaluated using a large dataset of Python code, assessed with six and eight layers, and validated using cross-validation. The experimental results are promising compared to other methods: using six and eight layers, the model reaches 67.18% and 68.49% in terms of the BLEU score, respectively.
We plan to add more languages such as C++, C#, and Java. In addition, we want to enhance the proposed DLBT to deal with more complex code rather than simple lines of code and to output more abstract descriptions. Moreover, an important next step is to detect syntax errors before the preprocessing step; this would guarantee that the source code is correct before the Pseudo-code is generated. Such information could partially enhance Pseudo-code generation and help increase the accuracy and BLEU performance measures.