R2D2’s speech recognition system is damaged in a collision. Now R2D2 cannot understand what Luke says and thus cannot assist him during combat. Your task is to implement a new intent detection module to help R2D2 classify different natural language commands.
This assignment focuses on a specific area of natural language processing (NLP) called intent detection. An intent detection module takes in a natural language command and determines what type of action the user wants the droid to perform. These include things like driving commands, light commands, changing the position of its head, making sounds, etc.
For example, the following command belongs to the category ‘driving’.
"Drive straight ahead for 2 seconds at half speed"
A skeleton notebook r2d2_hw5.ipynb containing empty definitions for each question has been provided. Since portions of this assignment will be graded automatically, none of the names or function signatures in this file should be modified. However, you are free to introduce additional variables or functions if needed. You can use Google Colab to edit the notebook file and conduct the training using Google's free GPUs.
You are strongly encouraged to follow the Python style guidelines set forth in PEP 8, which was written in part by the creator of Python. However, your code will not be graded for style.
Once you have completed the assignment, you should submit your file on Gradescope.
We’re going to begin this assignment by brainstorming different commands that we might like to give to our robot. We’ll take several factors into account:
Here are some sample commands:
For each of the 8 categories of commands, please create 10 unique sentences describing how you might tell the robot to execute one or more of the actions in that category. Add your sentence lists to the code as lists named my_driving_sentences, my_light_sentences, my_head_sentences, my_state_sentences, my_connection_sentences, my_stance_sentences, my_animation_sentences, and my_grid_sentences, as sketched below.
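A purely illustrative sketch (the sentences should be your own, with 10 per list):
my_driving_sentences = [
    "Drive straight ahead for 2 seconds at half speed",
    "Turn around and roll backwards slowly",
    # ... 8 more unique driving commands
]
my_light_sentences = [
    "Turn your front light red",
    "Blink your back light twice",
    # ... 8 more unique light commands
]
# ... and likewise for the remaining six lists.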
One of the amazing things about language is that there are many different ways of communicating the same intent. For example, if we wanted to have our R2D2 start waddling, we could say
"waddle",
"totter",
"todder",
"teater",
"wobble",
"start to waddle"
"start waddling",
"begin waddling",
"set your stance to waddle",
"try to stand on your tiptoes",
"move up and down on your toes",
"rock from side to side on your toes",
"imitate a duck's walk",
"walk like a duck"
Similarly, if we wanted it to stop, we could prefix the commands above with any number of ways of saying stop:
"stop your waddle",
"end your waddle",
"don't waddle anymore",
"stop waddling",
"cease waddling",
"stop standing on your toes",
"stand still"
"stop acting like a duck",
"don't walk like a duck",
"stop teetering like that"
"put your feet flat on the ground"
We collected thousands of sentences written by students last year, and you can download this sample data. The data is stored as a list of samples, where each sample has the form [sentence, label]. The label is an integer ranging from 0 to 8 that denotes the category of the sentence. The mapping from label to category is shown below.
ind2cat = {0: 'no', 1: 'driving', 2: 'light', 3: 'head', 4: 'state',
5: 'connection', 6: 'stance', 7: 'animation', 8: 'grid'}
Write a tokenization function clean(sentence) which takes as input a string of text and returns a list of tokens derived from that text. Here, we define a token to be a contiguous sequence of non-whitespace characters. You should remove punctuation marks, convert the text to lowercase, and remove stopwords. Hint: use the built-in constant string.punctuation, found in the string module, and/or Python's regex library, re.
>>> sentence = "Making beeping noises."
>>> clean(sentence)
['making', 'beeping', 'noises']
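A minimal sketch of one way to implement clean() is shown below; it assumes the stopword list comes from NLTK (nltk.corpus.stopwords), which is our own assumption, and any comparable stopword list would work.
import re
import string
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))  # assumed stopword source

def clean(sentence):
    # Lowercase, strip punctuation, split on whitespace, and drop stopwords.
    sentence = sentence.lower()
    sentence = re.sub("[" + re.escape(string.punctuation) + "]", "", sentence)
    return [tok for tok in sentence.split() if tok not in STOPWORDS]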
After you tokenize all sentences in the training and validation sets, you need to find the maximum number of tokens across all sentences and store it as max_len. This is a very important variable, and we will use it multiple times later.
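For instance (X_train_tokens and X_val_tokens are assumed names for the tokenized training and validation sentences):
max_len = max(len(tokens) for tokens in X_train_tokens + X_val_tokens)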
Then we can use the tokenized data to construct a vocabulary. First, find the unique words in the train/val sets and count how many times each occurs. Then assign an index (starting from 1) to each unique word.
>>> word_count
{'making': 10,
'beeping': 3,
'noises': 7,...
>>> word2ind
{'making': 1,
'beeping': 2,
'noises': 3,...
Then assign the size of the vocabulary to vocab_size, which will be used when building the recurrent neural network.
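One way to build word_count, word2ind, and vocab_size, assuming the same tokenized lists as above (here indices follow first-occurrence order, which is an assumption; any consistent assignment starting from 1 should work):
from collections import Counter

# Count occurrences over all tokenized train/val sentences.
word_count = Counter(tok for tokens in X_train_tokens + X_val_tokens for tok in tokens)

# Assign each unique word an index starting from 1; index 0 is left free for padding.
word2ind = {word: i for i, word in enumerate(word_count, start=1)}
vocab_size = len(word2ind)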
In this part, you will implement an RNN for intent detection. We will use Keras as the backbone in this section.
Convert each list of tokens into an array using the vocabulary you built before. The length of the vector is max_len; remember to zero-pad if the number of tokens is smaller than max_len. Please complete the vectorize(tokens, max_len, word2ind) function.
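A possible sketch of vectorize(); mapping unseen words to 0 here is an assumption, not a requirement:
import numpy as np

def vectorize(tokens, max_len, word2ind):
    # Map each token to its vocabulary index and zero-pad up to max_len.
    vec = np.zeros(max_len, dtype=np.int64)
    for i, tok in enumerate(tokens[:max_len]):
        vec[i] = word2ind.get(tok, 0)  # assumption: unknown tokens map to 0
    return vec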
Convert each scalar label to a 1D array of length 9, e.g. 0 -> array([1, 0, 0, 0, 0, 0, 0, 0, 0]).
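For example, using Keras's built-in helper (y_train and y_val are assumed names for the label lists):
from tensorflow.keras.utils import to_categorical

# 0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0], 1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0], etc.
y_train_onehot = to_categorical(y_train, num_classes=9)
y_val_onehot = to_categorical(y_val, num_classes=9)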
Now it's time to build the RNN to do the classification task; you can refer to this official document. You will need an Embedding layer, an RNN layer, and a Dense layer, and your last layer should project to the number of labels. The model architecture is shown below:
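As a starting point, here is a minimal Keras sketch of one possible architecture; the layer sizes are illustrative choices, and vocab_size and max_len come from the previous part:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential([
    # +1 so index 0 (padding) is covered by the embedding table
    Embedding(input_dim=vocab_size + 1, output_dim=100, input_length=max_len),
    SimpleRNN(64),
    Dense(9, activation="softmax"),  # project to the 9 intent labels
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])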
Now run your model to predict on the test sentences. You need to preprocess these sentences first and save your predictions as a list of labels, e.g. [0, 2, 1, 5, ....]. Then save your prediction as rnn.p using the following code and submit it to Gradescope.
pickle.dump(test_prediction, open("rnn.p", "wb"))
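Before the pickle.dump call above, test_prediction could be produced with a sketch like the following (test_sentences is an assumed name for the provided list of raw test strings):
import numpy as np

X_test = np.array([vectorize(clean(s), max_len, word2ind) for s in test_sentences])
test_prediction = model.predict(X_test).argmax(axis=1).tolist()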
In this part, instead of building a vocabulary and using word2ind to vectorize the tokens, we are going to leverage the word embeddings that we discussed in lecture (and that are described in the Vector Semantics and Embeddings chapter of the Jurafsky and Martin textbook). We will use pre-trained word2vec embeddings, and use the Magnitude Python package to work with these embeddings. Then, we will use the embeddings for the words in a sentence to create sentence embeddings.
For this part, we’ll use the Magnitude package, which is a fast, efficient Python package for manipulating pre-trained word embeddings. It was written by former Penn students Ajay Patel and Alex Sands. You can install it with pip by typing this command into your terminal:
pip3 install pymagnitude
Next, you’ll need to download a pre-trained set of word embeddings. We’ll get a set trained with Google’s word2vec algorithm, which we discussed in class. You can download them by clicking on this link or by using this command in your terminal:
wget http://magnitude.plasticity.ai/word2vec/medium/GoogleNews-vectors-negative300.magnitude
Warning: the file is very large (5GB). If you'd like to experiment with another, smaller set of word vectors, you can download these GloVe embeddings, which are only 1.4GB.
Here you can check the full list of available embeddings; feel free to try different ones.
from pymagnitude import Magnitude
vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
v = vectors.query("cat") # vector representing the word 'cat'
w = vectors.query("dog") # vector representing the word 'dog'
You can now use pymagnitude to query each token and convert the tokens to a list of embeddings. Note that you need to zero-pad to match the maximum length. Please complete the embedding(list_tokens, max_len, vectors) function.
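A possible sketch of embedding(): stack one word2vec vector per token and zero-pad to max_len rows (vectors.dim gives the embedding dimensionality in pymagnitude):
import numpy as np

def embedding(list_tokens, max_len, vectors):
    # One row per token; remaining rows stay zero as padding.
    mat = np.zeros((max_len, vectors.dim))
    for i, tok in enumerate(list_tokens[:max_len]):
        mat[i] = vectors.query(tok)
    return mat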
Similar to Part 1, build an RNN model using your new embeddings. The model architecture is shown below.
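One possible sketch: since the inputs are already dense word2vec vectors, no Embedding layer is needed (300 is the dimensionality of the GoogleNews vectors; the RNN size is an illustrative choice):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

model2 = Sequential([
    SimpleRNN(64, input_shape=(max_len, 300)),
    Dense(9, activation="softmax"),
])
model2.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])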
Again, run your model to predict on the test sentences. You need to preprocess these sentences first and save your predictions as a list of labels, e.g. [0, 2, 1, 5, ....]. Then save your prediction as embedding.p using the following code and submit it to Gradescope.
pickle.dump(test_prediction, open("embedding.p", "wb"))
In this part, you will use the BERT pipeline to further improve performance. This part is open-ended: we provide just one example of using BERT, and you are free to find other tutorials online and customize them for this task.
We will use the Hugging Face backbone for this part. Install the transformers package using the following command:
pip3 install transformers
Run the following code in Python to define the BERT tokenizer and BERT model.
from transformers import BertTokenizer, BertConfig, TFBertModel, TFBertForSequenceClassification

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # feel free to change the model
bert_model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=9)
You can change "bert-base-uncased" to any other state-of-the-art model; here is the list of all existing models.
The BERT tokenizer will return a dictionary containing 'input_ids', 'token_type_ids', and 'attention_mask'; we will use the 'input_ids' and 'attention_mask' later.
# Test the tokenizer
sent = X_train[0]
tokenized_sequence = bert_tokenizer.encode_plus(sent,
                                                add_special_tokens=True,
                                                max_length=30,
                                                pad_to_max_length=True,
                                                return_attention_mask=True)
print(tokenized_sequence)
print(bert_tokenizer.decode(tokenized_sequence['input_ids']))
Using the BERT tokenizer described above, encode the training and validation sentences; note that the max length should be 64. Please complete the BERT_Tokenizer(sentences) function, which takes in a list of sentences and returns two NumPy arrays representing the input_ids and attention_mask.
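A possible sketch of BERT_Tokenizer(), following the encode_plus call shown above but with max_length=64:
import numpy as np

def BERT_Tokenizer(sentences):
    input_ids, attention_masks = [], []
    for sent in sentences:
        encoded = bert_tokenizer.encode_plus(sent,
                                             add_special_tokens=True,
                                             max_length=64,
                                             pad_to_max_length=True,
                                             return_attention_mask=True)
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
    return np.array(input_ids), np.array(attention_masks)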
After you have preprocessed the data, you can use the given code to run the model. Then use the BERT model to make predictions on the test sentences and save the prediction as bert.p.
pickle.dump(test_prediction, open("bert.p", "wb"))
Please write 10 sentences for each category; this will be very helpful for future students!
my_commands = {'no': [],
'driving': [],
'light': [],
'head': [],
'state': [],
'connection': [],
'stance': [],
'animation': [],
'grid': []}
pickle.dump(my_commands, open("my_commands.p", "wb"))
Please use your best model to predict on the test sentences and save the prediction as best.p. We will give the top 1 submission an extra 5 points for this assignment, top 2-3: 3 points, top 5-10: 1 point.
Here is what you need to submit for this homework:
Dialogue Systems and Chatbots. Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft).
Vector Semantics and Embeddings. Dan Jurafsky and James H. Martin. Speech and Language Processing (3rd edition draft).
Linguistic Regularities in Continuous Space Word Representations. Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig. NAACL 2013.
Magnitude: A Fast, Efficient Universal Vector Embedding Utility Package. Ajay Patel, Alexander Sands, Chris Callison-Burch, Marianna Apidianaki. ACL 2018.
Learning to Parse Natural Language Commands to a Robot Control System. Cynthia Matuszek, Evan Herbst, Luke S. Zettlemoyer, and Dieter Fox. ISER 2012.
Developing Skills for Amazon Alexa. Amazon. Developer tutorial.
Getting Started with Rasa. Rasa. Developer tutorial.