**In this homework, you will implement several AI models to conduct the intent detection task.**
![alt text](https://i.ibb.co/fXmYHRq/ec5.jpg)

# Part 0: Data Preprocessing

In this section, you will get a general idea of what the data looks like and apply some simple transformations.

In [None]:
# download the data
!wget "https://drive.google.com/uc?export=download&id=1dLUN9oSB4u27NOleYE-Uksoh6RNQlZbi" -O sample.p

In [None]:
# test sentences for evaluation
!wget "https://drive.google.com/uc?export=download&id=1gEW_qY5x8uPAhriiobubheYo6FC35btQ" -O test_sentences.p

In [None]:
import pickle
with open("sample.p", "rb") as f:
  samples = pickle.load(f)
with open("test_sentences.p", "rb") as f:
  test_sentences = pickle.load(f)

In [None]:
###data structure###
### [[sentence, label]] ###
print(samples[:3])

There are nine categories for these sentences: 'no', 'driving', 'light', 'head', 'state', 'connection', 'stance', 'animation' and 'grid'. The mapping from index to category name is shown below.

In [None]:
ind2cat = {0: 'no', 1: 'driving', 2: 'light', 3: 'head', 4: 'state', 5: 'connection', 6: 'stance', 7: 'animation', 8: 'grid'}

In [None]:
### Distribution on categories ###
cat2sentence = {}
for sample in samples:
  sentence = sample[0]
  cat = ind2cat[sample[1]]
  if cat not in cat2sentence:
    cat2sentence[cat] = [sentence]
  else:
    cat2sentence[cat].append(sentence)

print("number of sentences for each category")
for cat, sentences in cat2sentence.items():
  print(cat, ": ", len(sentences))

### Train/Validation Split

In [None]:
from sklearn.model_selection import train_test_split
SENTENCES = [sample[0] for sample in samples]
LABELS = [sample[1] for sample in samples]
X_train, X_val, y_train, y_val = train_test_split(SENTENCES, LABELS, test_size=0.2)

### Clean Text
Write a tokenization function clean(sentence) that takes a string of text as input and returns a list of tokens derived from that text. Here, we define a token to be a contiguous sequence of non-whitespace characters. The function should convert the text to lowercase, remove punctuation marks, and drop stop words. Lowercase the text before filtering, because the NLTK stop word list is all lowercase. Hint: Use the built-in constant string.punctuation, found in the string module, and/or Python's regex library, re.
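
One possible shape for such a function is sketched below. It is only an illustration, not the reference solution: the name clean_sketch and the tiny DEMO_STOPWORDS set are made up here so the example runs without downloading the NLTK corpus. In the notebook you would use STOPWORDS instead.

```python
import re
import string

# Hypothetical stand-in for nltk's English stop word list, so the sketch runs offline.
DEMO_STOPWORDS = {"the", "a", "an", "to", "of", "and", "is"}

def clean_sketch(sentence, stopwords=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    lowered = sentence.lower()
    # Replace every punctuation character with a space so "red,please" still splits.
    no_punct = re.sub("[" + re.escape(string.punctuation) + "]", " ", lowered)
    return [tok for tok in no_punct.split() if tok not in stopwords]

print(clean_sketch("Please turn the lights to RED!"))
```

Note that replacing punctuation with spaces splits contractions such as "don't" into two tokens; deleting the punctuation instead is an equally valid choice.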

In [None]:
import nltk
import re
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = stopwords.words('english')

def clean(sentence):
  '''1. convert all words to lowercase
     2. tokenize the sentence (remove punctuation)
     3. remove the stop words'''
  pass

X_train_token = [clean(sentence) for sentence in X_train]
X_val_token = [clean(sentence) for sentence in X_val]

In [None]:
max_len = 0  # TODO: find the maximum number of tokens per sentence in train/val

### Build a Vocabulary
Build a vocabulary that maps each word to an index. To do this, first find the unique words in the train/val set.

Once you have built the vocabulary, save it to a file for future use, because the word-to-index assignment may change each time you run the code.
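
The sketch below shows one way this could look, under a few assumptions that are not part of the assignment: the toy demo_tokens list stands in for the tokenized train/val sentences, indices start at 1 so that 0 stays free for padding, and the file name vocab_demo.json is arbitrary.

```python
import json
from collections import Counter

# Hypothetical tokenized sentences standing in for X_train_token / X_val_token.
demo_tokens = [["turn", "left"], ["turn", "lights", "red"], ["stop"]]

# Count how often each word occurs.
word_count_demo = Counter(tok for tokens in demo_tokens for tok in tokens)

# Sort for a deterministic order; start at 1 so that 0 stays free for padding.
word2ind_demo = {word: i for i, word in enumerate(sorted(word_count_demo), start=1)}

# Save and reload so later runs reuse exactly the same mapping.
with open("vocab_demo.json", "w") as f:
    json.dump(word2ind_demo, f)
with open("vocab_demo.json") as f:
    reloaded = json.load(f)

print(word2ind_demo)
```

Sorting the words before numbering them makes the mapping reproducible even without the saved file, as long as the train/val split itself is unchanged.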

In [None]:
word_count = {} # count the frequency of each word
word2ind = {} # build your vocabulary
vocab_size = len(word2ind)

# Part 1: Recurrent Neural Network

### Convert token to vector
Convert each list of tokens into an array using the vocabulary you built before. The length of the vector is max_len; remember to zero-pad if a list's length is smaller than max_len.
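
A minimal sketch of the idea follows, assuming the vocabulary reserves index 0 for padding and that unknown words are simply skipped (both are assumptions, not requirements of the assignment). The name vectorize_sketch and the toy vocabulary are hypothetical.

```python
import numpy as np

def vectorize_sketch(tokens, max_len, word2ind):
    """Map tokens to indices and right-pad with zeros up to max_len."""
    # Assumption: unknown words are skipped; mapping them to a dedicated
    # "unknown" index is an equally reasonable choice.
    ids = [word2ind[tok] for tok in tokens if tok in word2ind][:max_len]
    out = np.zeros(max_len, dtype=np.int64)
    out[:len(ids)] = ids
    return out

demo_vocab = {"turn": 1, "left": 2, "stop": 3}  # hypothetical vocabulary
print(vectorize_sketch(["turn", "left", "now"], 5, demo_vocab))
```

The slice [:max_len] also truncates test sentences that turn out to be longer than anything seen in train/val.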

In [None]:
import numpy as np

def vectorize(tokens, max_len, word2ind):
  '''
  Input: list of tokens
  Output: 1D numpy array (length = max_len)
  '''
  pass

X_train_array = np.array([vectorize(tokens, max_len, word2ind) for tokens in X_train_token])
X_val_array = np.array([vectorize(tokens, max_len, word2ind) for tokens in X_val_token])
assert X_train_array.shape[-1] == max_len

### One-hot label
Convert each scalar label to a 1D array (length = 9), e.g., 0 -> array([1, 0, 0, 0, 0, 0, 0, 0, 0])
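
One common way to do this in NumPy is to index rows of an identity matrix, as sketched below. The helper name one_hot_sketch is made up for illustration; keras.utils.to_categorical is an alternative.

```python
import numpy as np

def one_hot_sketch(labels, num_classes=9):
    """Return an array of shape (len(labels), num_classes) with a single 1 per row."""
    # Row k of the identity matrix is exactly the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[np.asarray(labels)]

demo = one_hot_sketch([0, 2, 8])
print(demo.shape)
print(demo[0])
```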

In [None]:
y_train_onehot = None  # TODO: array of shape (len(y_train), 9)
y_val_onehot = None  # TODO: array of shape (len(y_val), 9)
assert y_train_onehot.shape[1] == 9

### Build the Recurrent Neural Network
Now it's time to build the RNN to do the classification task. You can refer to this [official document](https://www.tensorflow.org/guide/keras/rnn).

You will need an Embedding layer, RNN layers and a Dense layer. Your last layer should project to the number of labels.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# Embedding Layer, Input Dimension = vocab_size, Output Dimension = 64

# Two LSTM layers with 64 Units

# Dense to the number of classes with softmax activation function

model.summary()

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train_array, y_train_onehot, batch_size=16, epochs=10, validation_data=(X_val_array, y_val_onehot))

### Evaluate on the test sentences
Now run your model to predict on the test sentences. You need to apply the same preprocessing to these sentences first, then save your predictions as a list of labels, e.g., [0, 2, 1, 5, ....]
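
Because the last layer is a softmax, the model outputs one probability per class for each sentence, and the predicted label is the index of the largest one. The sketch below illustrates only that final step; demo_probs is a made-up array standing in for the output of model.predict(...) on the preprocessed test sentences.

```python
import numpy as np

# Hypothetical softmax output for two sentences and three classes; in the notebook
# this would come from model.predict(...) on the vectorized test sentences.
demo_probs = np.array([[0.1, 0.7, 0.2],
                       [0.8, 0.1, 0.1]])

# argmax over the class axis gives one label per sentence;
# tolist() converts to plain Python ints, which pickle cleanly.
demo_prediction = np.argmax(demo_probs, axis=1).tolist()
print(demo_prediction)
```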

In [None]:
test_prediction = []
#TODO


In [None]:
# Save the results and upload to Gradescope
pickle.dump(test_prediction, open("rnn.p", "wb"))

# Part 2: Word Embedding via pymagnitude
Instead of using the vocabulary to convert words to numbers, you can use pretrained word embeddings to do the task.

In [None]:
! echo "Installing Magnitude.... (please wait, can take a while)"
! (curl https://raw.githubusercontent.com/plasticityai/magnitude/master/install-colab.sh | /bin/bash 1>/dev/null 2>/dev/null)
! echo "Done installing Magnitude."

Next, you'll need to download a pre-trained set of word embeddings. We'll get a set trained with Google's word2vec algorithm, which we discussed in class. You can check the full list of available embeddings [here](https://gitlab.com/Plasticity/magnitude). Feel free to try different embeddings.

In [None]:
# Download Pretrained Word-Embedding
! wget http://magnitude.plasticity.ai/word2vec/light/GoogleNews-vectors-negative300.magnitude

In [None]:
# Load the embedding
from pymagnitude import Magnitude
vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
D = vectors.query("cat").shape[0]

### Convert tokens to embeddings
You can now use pymagnitude to query each token and convert each sentence to a list of embeddings. Note that you need to zero-pad to match the maximum length.
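
The sketch below shows the padding logic in isolation. FakeVectors is a hypothetical stand-in that only mimics the one call used in this notebook, query(token) returning a vector of length D, so the example runs without downloading the embedding file; embedding_sketch is likewise an illustrative name.

```python
import numpy as np

class FakeVectors:
    """Hypothetical stand-in for a Magnitude object: query(token) -> vector of length D."""
    def __init__(self, dim):
        self.dim = dim

    def query(self, token):
        # Deterministic dummy vector; real embeddings come from the .magnitude file.
        return np.full(self.dim, float(len(token)))

def embedding_sketch(list_tokens, max_len, vectors, D):
    """Return an array of shape (n_samples, max_len, D), zero-padded on the right."""
    out = np.zeros((len(list_tokens), max_len, D), dtype=np.float32)
    for i, tokens in enumerate(list_tokens):
        # Positions beyond the sentence length keep their all-zero vectors.
        for j, tok in enumerate(tokens[:max_len]):
            out[i, j] = vectors.query(tok)
    return out

demo = embedding_sketch([["turn", "left"], ["stop"]], 4, FakeVectors(3), 3)
print(demo.shape)
```

Querying one token at a time keeps the sketch simple; batching the lookups would likely be faster on the full dataset.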

In [None]:
def embedding(list_tokens, max_len, vectors, D):
  '''
  return an array with the shape (n_of_samples, max_len, D)
  '''
  pass
X_train_embedding = embedding(X_train_token, max_len, vectors, D)
X_val_embedding = embedding(X_val_token, max_len, vectors, D)

assert X_train_embedding.shape[-1] == D
assert X_train_embedding.shape[-2] == max_len

### Build the RNN model
Similar to Part 1, build an RNN model using your new embedding.

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
#TODO
# LSTM Layer with input shape (max_len, D), output shape (max_len, 256)

# LSTM Layer with 128 units

# Dense to 64 with tanh activation function

# Dense to number of classes with softmax function

model.summary()

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(X_train_embedding, y_train_onehot, batch_size=16, epochs=10, validation_data=(X_val_embedding, y_val_onehot))

### Evaluate on the test sentences
Now run your model to predict on the test sentences. You need to apply the same preprocessing to these sentences first, then save your predictions as a list of labels, e.g., [0, 2, 1, 5, ....]

In [None]:
test_prediction = []
#TODO


In [None]:
# Save the results and upload to Gradescope
pickle.dump(test_prediction, open("embedding.p", "wb"))

# Part 3: BERT

In this part, you will use the BERT pipeline to further improve the performance.

This part is open-ended. We provide just one example of using BERT; feel free to find other tutorials online and customize them for this task.

[Here](https://huggingface.co/models) is the list of all existing models.

In [None]:
!pip install transformers
!pip install --upgrade tensorflow

In [None]:
from transformers import BertTokenizer, TFBertForSequenceClassification
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased") #feel free to change the model
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=9)

### Use BERT Tokenizer to preprocess the data
The BERT Tokenizer returns a dictionary that contains 'input_ids', 'token_type_ids' and 'attention_mask'. We will use 'input_ids' and 'attention_mask' later.

In [None]:
# Test the tokenizer
sent = X_train[0]
tokenized_sequence = bert_tokenizer.encode_plus(sent, add_special_tokens=True,
                                                max_length=30, padding='max_length',
                                                truncation=True, return_attention_mask=True)
print(tokenized_sequence)
print(bert_tokenizer.decode(tokenized_sequence['input_ids']))

Use the BERT tokenizer described above to encode the training and validation sentences. Note that the max length should be 64.

In [None]:
def BERT_Tokenizer(sentences):
  '''Input: list of sentences
     Output: two numpy arrays (input ids, attention masks), each of shape (n_sentences, 64)
  '''
  pass

X_train_ids, X_train_masks = BERT_Tokenizer(X_train)
X_val_ids, X_val_masks = BERT_Tokenizer(X_val)
y_train_array = np.array(y_train)
y_val_array = np.array(y_val)
assert X_train_ids.shape[-1] == 64

In [None]:
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-6,epsilon=1e-08)
bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])

In [None]:
bert_model.fit([X_train_ids,X_train_masks],y_train_array,batch_size=16,epochs=5,validation_data=([X_val_ids,X_val_masks],y_val_array))

### Evaluate on test sentences
Again, use BERT to predict on the test sentences and submit to Gradescope.

In [None]:
test_prediction = []
#TODO


In [None]:
pickle.dump(test_prediction, open("bert.p", "wb"))

# Part 4: Write your own commands

Please write 10 sentences for each category. This will be very helpful for future students!

In [None]:
my_commands = {'no': [], 
               'driving': [], 
               'light': [],
               'head': [],
               'state': [],
               'connection': [], 
               'stance': [], 
               'animation': [],
               'grid': []}

In [None]:
pickle.dump(my_commands, open("my_commands.p", "wb"))