This post introduces a way to use deep learning to detect programming languages. Take the following code as an example.
If we use the program introduced in this post to detect the language of the code above, we get the answer Python, which is correct. In fact, in a preliminary test, the accuracy of the program was around 90%. We have reason to believe that a larger training dataset or further tuning would give an even better result.
First let’s try running the program, so we can have an intuitive perspective on what the program is about.
Install third-party libraries
Gensim

    conda install -c anaconda gensim

Keras

    conda install -c conda-forge keras

Tensorflow

    pip install tensorflow==1.3.0

Download the program

    git clone firstname.lastname@example.org:searene/demos.git && cd demos/PLDetector-demo
Train the model

    python -m src.neural_network_trainer
    Using TensorFlow backend.
    ...
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    embedding_1 (Embedding)      (None, 500, 100)          773100
    _________________________________________________________________
    conv1d_1 (Conv1D)            (None, 496, 128)          64128
    _________________________________________________________________
    max_pooling1d_1 (MaxPooling1 (None, 248, 128)          0
    _________________________________________________________________
    flatten_1 (Flatten)          (None, 31744)             0
    _________________________________________________________________
    dense_1 (Dense)              (None, 8)                 253960
    =================================================================
    Total params: 1,091,188
    Trainable params: 318,088
    Non-trainable params: 773,100
    _________________________________________________________________
    INFO:root:None
    Epoch 1/10
    - 1s - loss: 0.4304 - acc: 0.8823
    Epoch 2/10
    - 1s - loss: 0.1357 - acc: 0.9657
    Epoch 3/10
    - 1s - loss: 0.0706 - acc: 0.9788
    Epoch 4/10
    - 1s - loss: 0.0392 - acc: 0.9887
    Epoch 5/10
    - 1s - loss: 0.0266 - acc: 0.9927
    Epoch 6/10
    - 1s - loss: 0.0203 - acc: 0.9945
    Epoch 7/10
    - 1s - loss: 0.0169 - acc: 0.9948
    Epoch 8/10
    - 1s - loss: 0.0145 - acc: 0.9956
    Epoch 9/10
    - 1s - loss: 0.0131 - acc: 0.9959
    Epoch 10/10
    - 1s - loss: 0.0120 - acc: 0.9959
    INFO:root:Test Accuracy: 94.642857
We will have three important files as soon as the above step is completed.
We will introduce the three files in detail later on.
Detection

    python -m src.detector
    Using TensorFlow backend.
    Python
By default, detector.py detects a snippet of Python code. Of course, you can modify detector.py to detect other code.
Let’s first have a rough idea of the project structure. Don’t worry, it will only take 1 ~ 2 minutes.
- resources/code/train: training data. The name of each subfolder represents a programming language. There are around 10 code files in each subfolder, i.e. 10 files per programming language for training.
- resources/code/test: the same as resources/code/train, except that it's used for testing accuracy instead of training.
- vocab_tokenizer: the stored training result
- src/config.py: some constants used in the program
- src/neural_network_trainer.py: code used to train the model
- src/detector.py: code used to load the model and detect programming languages
Let's first get our heads around the training process, i.e. the contents of neural_network_trainer.py. The first step in training the neural network is to build a vocabulary. A vocabulary is simply a list of words, made up of the common words in the training data. Once the vocabulary is built and we start detecting a programming language, we split the code into a list of words, remove those that are not in the vocabulary, and feed the remaining words into the neural network for detection.
You might ask: why remove words that are not in the vocabulary? Wouldn't it work if we just put all the words into the neural network? Actually, that is impossible, because each word in the vocabulary is mapped to a word vector, which is constructed during training. Words that are not in the vocabulary have no word vector to map to, so the neural network is unable to process them.
So how do we build the vocabulary? It's fairly easy: we just need to scan all the code in resources/code/train and extract the common words in it. Those common words make up our vocabulary. Key code is as follows.
Then we call build_vocab to get the vocabulary.
So, as you can see, the vocabulary is just a list of words, that’s it.
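To make this concrete, here is a minimal sketch of how such a vocabulary could be built. It is not the project's actual build_vocab: the regular expression, the `min_count` threshold and the two sample snippets are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical tokenizer: identifiers/keywords, numbers, or single symbols.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def extract_words(code):
    """Split source code into a list of word-like tokens."""
    return TOKEN_PATTERN.findall(code)


def build_vocab(code_snippets, min_count=2):
    """Keep every word that appears at least `min_count` times overall."""
    counter = Counter()
    for code in code_snippets:
        counter.update(extract_words(code))
    return sorted(word for word, count in counter.items() if count >= min_count)


# Two made-up training snippets.
snippets = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]
vocab = build_vocab(snippets)
print(vocab)
```

Words that occur only once in this toy corpus, such as `add` and `sub`, are left out, while recurring ones such as `def` and `return` are kept.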
The next step is to build vocab_tokenizer. So what is vocab_tokenizer? It's a simple variable that you can think of as a dictionary, which maps each word in the vocabulary to a number. Why would we map those words to numbers? Because our neural network can only work with numbers, not strings.
We use the Tokenizer provided by Keras to build vocab_tokenizer. Then we save this vocab_tokenizer as a file, to be used later.
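The idea can be sketched without Keras. The snippet below is a plain-dictionary stand-in for the Keras Tokenizer, not the project's code: the helper names, the five-word vocabulary, the choice of pickle and the temporary file path are all assumptions for illustration.

```python
import os
import pickle
import tempfile


def build_vocab_tokenizer(vocab):
    """Map each vocabulary word to a positive integer; 0 is left free for padding."""
    return {word: index for index, word in enumerate(vocab, start=1)}


def words_to_numbers(words, vocab_tokenizer):
    """Drop out-of-vocabulary words and convert the rest to numbers."""
    return [vocab_tokenizer[w] for w in words if w in vocab_tokenizer]


# A made-up vocabulary.
vocab = ["def", "return", "import", "public", "static"]
vocab_tokenizer = build_vocab_tokenizer(vocab)

# Save the mapping to a file and load it back, as a detector would do later.
path = os.path.join(tempfile.mkdtemp(), "vocab_tokenizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vocab_tokenizer, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# "foo" is not in the vocabulary, so it is dropped.
numbers = words_to_numbers(["def", "foo", "return"], restored)
print(numbers)
```

Saving the mapping matters because the numbers must mean the same thing at detection time as they did during training.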
Before diving into word vectors, we first need to know what they are.
To put it simply, word vectors are just vectors, and each word in the vocabulary is mapped to one. If that still sounds too abstract, let's take the following Java code as an example.
The word2vec variable we are building here is actually a dictionary, which looks like this (word -> word_vector).
Here comes the question: why would we build word vectors instead of just using the numbers given by vocab_tokenizer? Because word vectors have a very special and useful characteristic: the closer two words are, the smaller the distance between their word vectors. (The distance between vectors is a math topic, and there are several ways to calculate it. It doesn't matter if you don't know how; you only need to know that it can be calculated.) This characteristic boosts the accuracy of our neural network dramatically.
For example, static and the keywords that usually accompany it are only seen together in Java, so the distance between their word vectors should be small. However, a word like System is not that close to them, i.e. we may only see one of them at a time, so the distance between their word vectors is larger.
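As one possible way to measure that distance, here is a small sketch using cosine distance. The three-dimensional vectors are invented for illustration (real word vectors are learned during training and have far more dimensions), and pairing `static` with `public` is also just an assumed example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; smaller values mean closer vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up word vectors, NOT the output of a trained model.
word2vec = {
    "public": [0.9, 0.1, 0.0],
    "static": [0.8, 0.2, 0.1],
    "System": [0.1, 0.9, 0.3],
}

close = cosine_distance(word2vec["public"], word2vec["static"])
far = cosine_distance(word2vec["public"], word2vec["System"])
print(close < far)
```

Cosine distance is only one choice; Euclidean distance would illustrate the same point.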
Now that we know why it is necessary to build word vectors, the next problem is how to build them. There are multiple ways to do it. Here we use the Word2Vec algorithm provided by gensim. The steps are as follows.
- Load all the training data, extract those words which are in the vocabulary.
- Map each word to its respective number using vocab_tokenizer.
- Put those numbers into the Word2Vec library and obtain word vectors.
The code is as follows.
Everything is ready, so now it's time to train the neural network! First we need to know the input and output of the neural network. Take the following code as an example.
After converting the words in the code into their respective numbers, we get the input.
The output of the neural network is the probability of each language.
The code is as follows.
So we know the above code is most likely written in Python, because Python has the highest probability (0.5).
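A rough sketch of both ends of the network follows. The fixed length of 500 comes from the embedding layer's output shape in the training log; everything else is an assumption for illustration: the token numbers, right-padding with zeros (the project may pad differently), and every probability except Python's 0.5.

```python
MAX_LENGTH = 500  # input length, taken from the (None, 500, 100) embedding shape


def pad_or_truncate(numbers, length=MAX_LENGTH, pad_value=0):
    """Return exactly `length` numbers: cut long inputs, right-pad short ones."""
    clipped = numbers[:length]
    return clipped + [pad_value] * (length - len(clipped))


def most_likely_language(probabilities):
    """Pick the language with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Made-up token numbers standing in for an encoded code snippet.
encoded = pad_or_truncate([12, 7, 431, 7, 98])

# Made-up network output; only Python's 0.5 comes from the text above.
probabilities = {"Python": 0.5, "Java": 0.2, "C": 0.2, "JavaScript": 0.1}

print(len(encoded), most_likely_language(probabilities))
```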
Now that we know the input and output, let me introduce how the neural network is constructed. There are three parts in total.
- Embedding Layer: it’s used to map each word into its respective word vector
- Conv1D, MaxPooling1D: this part is a classic deep learning layer. To put it simply, what it does is extraction and transformation. Refer to corresponding tutorials of deep learning for details.
- Flatten, Dense: convert the multi-dimensional array into one-dimensional, and output the prediction.
Key code is as follows.
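As a sanity check on these three parts, the shapes and parameter counts in the model summary printed during training can be reproduced with plain arithmetic. This is not the project's code. The kernel size of 5, the pool size of 2 and the 7731 embedding rows are not stated in the post; they are inferred here from the printed shapes, assuming no padding and a stride of 1.

```python
SEQUENCE_LENGTH = 500   # from the embedding output shape (None, 500, 100)
EMBEDDING_DIM = 100     # from the embedding output shape
EMBEDDING_ROWS = 7731   # inferred: 773100 params / 100 dimensions
FILTERS = 128           # from the Conv1D output shape (None, 496, 128)
KERNEL_SIZE = 5         # inferred: 500 - 496 + 1
POOL_SIZE = 2           # inferred: 496 / 248
LANGUAGES = 8           # from the Dense output shape (None, 8)

# Embedding layer: one vector per row.
embedding_params = EMBEDDING_ROWS * EMBEDDING_DIM

# Conv1D: each filter spans KERNEL_SIZE steps of EMBEDDING_DIM values, plus a bias.
conv_steps = SEQUENCE_LENGTH - KERNEL_SIZE + 1
conv_params = KERNEL_SIZE * EMBEDDING_DIM * FILTERS + FILTERS

# MaxPooling1D and Flatten have no parameters; they only reshape the data.
pooled_steps = conv_steps // POOL_SIZE
flattened = pooled_steps * FILTERS

# Dense: one weight per flattened value per language, plus a bias per language.
dense_params = flattened * LANGUAGES + LANGUAGES

total_params = embedding_params + conv_params + dense_params
print(conv_steps, pooled_steps, flattened, total_params)
```

The embedding parameters equal the non-trainable count in the summary, which is consistent with the word vectors being fixed while the other layers are trained.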
All right, we have built our neural network, which is no trivial achievement! Next, let's write a function that uses the neural network to detect the test code and check its accuracy.
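The shape of such a function can be sketched as follows. This is a hypothetical illustration, not the project's code: `toy_detect` is a crude keyword rule standing in for the trained network, and the four labelled samples are made up.

```python
def evaluate_accuracy(detect, labelled_samples):
    """Return the percentage of samples whose detected language matches the label."""
    correct = sum(1 for code, language in labelled_samples if detect(code) == language)
    return 100.0 * correct / len(labelled_samples)


def toy_detect(code):
    """Stand-in for the trained network: a crude keyword rule."""
    return "Python" if "def " in code else "Java"


# Made-up (code, expected language) pairs.
test_samples = [
    ("def f():\n    pass", "Python"),
    ("public static void main(String[] args) {}", "Java"),
    ("print('hi')", "Python"),
    ("System.out.println(1);", "Java"),
]

accuracy = evaluate_accuracy(toy_detect, test_samples)
print("Test Accuracy: %f" % accuracy)
```

The toy rule misses the `print('hi')` sample, so it scores lower than the real network does on the real test set.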
As we saw earlier, the test accuracy is around 94%~95%, which is good enough. Let's save the neural network to files, so we can load it when detecting.
This part is simple: we only need to load vocab_tokenizer and the neural network for detection. The code is as follows.
Use it like this.
All in all, here are the steps to build the neural network.
- Build vocabulary.
- Build vocab_tokenizer using the vocabulary, which is used to convert words into numbers.
- Load words into Word2Vec to build word vectors.
- Load word vectors into the neural network as part of the input layer.
- Load all the training data, extract words that are in the vocabulary, convert them into numbers using vocab_tokenizer, and load them into the neural network for training.
Three steps for detection:
- Extract words in the code and remove those that are not in the vocabulary.
- Convert those words into numbers through vocab_tokenizer, and load them into the neural network.
- Choose the language with the highest probability, which is the answer we want.
You may have already noticed that we only saved vocab_tokenizer and the neural network (which lies in the model directory). Why didn't we save
If you have any questions, please leave them in the comments below, and I'll try to answer them.