COMS 4705 - Natural Language Processing
Exam 2 - Spring 2023
Question 1 - Short answer questions (5 pts each, 25 pts total)
a) Briefly describe the difference between the Continuous Bag of Words (CBOW) approach and the Skip-Gram approach for computing word2vec embeddings.
b) Briefly describe two main differences between how ELMo and BERT are used to compute contextualized word representations.
c) Briefly explain the idea of zero-shot learning in GPT models.
d) When implementing attention mechanisms, we assume that the input at each token i is represented by a vector h_i. An attention function f computes a context representation as a weighted sum of these representations, that is, c = Σ_i α_i h_i. Explain one approach for computing the α_i weights. (A sketch of one option appears after part e.)
e) Briefly describe the difference between discriminative and generative machine learning models.
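For part d), a minimal sketch of one common option: dot-product scoring followed by a softmax. The query vector q and the function name are illustrative assumptions, not part of the question.

import numpy as np

def attention_weights(H, q):
    """Dot-product attention: H has shape (n, d), one row per token vector h_i;
    q is a query vector of shape (d,), e.g. a decoder state (assumed here)."""
    scores = H @ q                # e_i = h_i . q
    scores -= scores.max()        # subtract max for numerical stability
    alphas = np.exp(scores)
    alphas /= alphas.sum()        # alpha_i = softmax(e)_i
    c = alphas @ H                # c = sum_i alpha_i h_i
    return alphas, c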
Question 2 - PCFGs (18 pts)
Assume you are given a PCFG G and an integer n. Complete the dynamic programming algorithm below to compute the maximum probability of any sentence generated by G that is exactly n words long.
Hint: This is similar to the CKY algorithm for PCFGs. Maintain a table π such that π[i, X] represents the maximum probability of any sentence of length i that G can derive from nonterminal X. Initialize the entries π[1, A] using the lexical rules A → t.
Input: 1. a PCFG G = (Nonterminals, Terminals, Rules, start symbol S).
       2. an integer n.
// initialization
for each A in Nonterminals:
    if there is a rule A -> t, where t is a terminal:
        pi[1,A] = ...

// main loop
for i = 2...n:
    for each A in Nonterminals:
        ...
        pi[i,A] = ...

return ...
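One possible completion, sketched in Python for illustration. This assumes G is in Chomsky normal form, with lexical rules given as (A, t, p) triples and binary rules as (A, B, C, p) triples; the data layout is an assumption, not part of the question.

def max_prob_sentence(nonterminals, lexical, binary, start, n):
    # pi[i][A] = max probability of any length-i sentence derivable from A
    pi = [dict() for _ in range(n + 1)]

    # initialization: a lexical rule A -> t covers exactly one word
    for (A, t, p) in lexical:
        pi[1][A] = max(pi[1].get(A, 0.0), p)

    # main loop: split a length-i yield into lengths k and i-k
    for i in range(2, n + 1):
        for (A, B, C, p) in binary:      # rule A -> B C with probability p
            for k in range(1, i):
                if B in pi[k] and C in pi[i - k]:
                    cand = p * pi[k][B] * pi[i - k][C]
                    if cand > pi[i].get(A, 0.0):
                        pi[i][A] = cand

    return pi[n].get(start, 0.0)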
Question 3 - Transition-Based Dependency Parsing (17 pts)
This question is about transition-based dependency parsing using the arc-standard transition system.
Show that the transition sequence for a given dependency tree is not unique – that is, there are multiple transition sequences that result in the same dependency tree – by providing two different transition sequences for the following example:
Question 4 - Statistical MT / IBM Model 2 (18 pts)
The following parameters specify an IBM Model 2. Assume that any parameter not specified defaults to 0.
q(1 | 1, 3, 3) = 1/4
q(2 | 1, 3, 3) = 3/4
q(2 | 2, 3, 3) = 2/3
q(3 | 2, 3, 3) = 1/3
q(1 | 3, 3, 3) = 1/2
q(3 | 3, 3, 3) = 1/2
t(meow | cat) = 1
t(meow | lion) = 1/3
t(roar | lion) = 2/3
For the following sentence pair, compute the optimal alignment under the model. Show your computation and specify the resulting alignment variables ⟨a_1, ..., a_m⟩.
English: cat cat lion
Foreign: meow roar meow
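For illustration, a small Python sketch of the required computation. Under IBM Model 2 each alignment variable can be chosen independently as a_i = argmax_j q(j | i, l, m) · t(f_i | e_j); the dictionary layout below is an assumption.

# Parameters from the question: q[(j, i, l, m)] and t[(f, e)].
q = {(1, 1, 3, 3): 1/4, (2, 1, 3, 3): 3/4,
     (2, 2, 3, 3): 2/3, (3, 2, 3, 3): 1/3,
     (1, 3, 3, 3): 1/2, (3, 3, 3, 3): 1/2}
t = {("meow", "cat"): 1.0, ("meow", "lion"): 1/3, ("roar", "lion"): 2/3}

english = ["cat", "cat", "lion"]      # e_1 ... e_l
foreign = ["meow", "roar", "meow"]    # f_1 ... f_m
l, m = len(english), len(foreign)

alignment = []
for i, f in enumerate(foreign, start=1):
    # score every English position j for foreign position i
    scores = {j: q.get((j, i, l, m), 0.0) * t.get((f, e), 0.0)
              for j, e in enumerate(english, start=1)}
    alignment.append(max(scores, key=scores.get))
print(alignment)                      # the alignment variables <a_1, ..., a_m>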
Question 5 - Semantic Role Labeling (22 pts)
Assume you are implementing a PropBank-style SRL system using sequence labeling. For each token i, your system predicts an output tag b_i ∈ {O, B-ARG0, B-ARG1, ..., I-ARG0, I-ARG1, ...}, where I-r indicates a token inside an argument span with role r, B-r indicates the first token of an argument span with role r, and O indicates a token not inside any argument span.
a) You want to use a bi-directional RNN model for this approach (such as an LSTM, but the specific RNN architecture does not matter). How would you represent the input at each token position? Note that you will also have to represent which token is the predicate.
b) For each position i, the biRNN computes two hidden state vectors, →h_i and ←h_i, corresponding to the hidden states in the forward and backward directions. Describe how to use these two vectors to predict a probability distribution over the set of output tags. (A minimal sketch follows part d.)
c) Assume you choose the output tag b_i with the highest probability for each token position. Is this approach guaranteed to always result in a valid semantic role annotation?
d) Your research advisor suggests using a pre-trained transformer-based model, such as BERT or GPT, instead of a biRNN for this task. Describe how you would use BERT or GPT for this task.
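For part b), a minimal sketch of the standard recipe: concatenate the two directional states and apply a learned linear layer followed by a softmax. The parameter names W and b are illustrative.

import numpy as np

def tag_distribution(h_fwd, h_bwd, W, b):
    """h_fwd, h_bwd: the forward and backward hidden states at position i,
    each of shape (d,); W: learned weights of shape (num_tags, 2*d);
    b: learned bias of shape (num_tags,). Returns P(b_i = tag) over all tags."""
    h = np.concatenate([h_fwd, h_bwd])    # [h_fwd ; h_bwd], shape (2*d,)
    logits = W @ h + b
    logits -= logits.max()                # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()            # softmax over the tag set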