Skip-Thought Vectors

⁉ Big Question

Can we represent text in an unsupervised way so as to produce task-independent vector representations?

🏙 Background Summary

Current approaches mostly use compositional operators that map word vectors to sentence vectors with various deep learning methods. All of these rely on supervised learning, so the resulting representations are high quality but very task-specific.

Paragraph vectors are an unsupervised alternative, but the downside is test time: inference must be run to compute a representation for each new piece of text.

❓ Specific question(s)

Is there a task and a corresponding loss that will allow us to learn highly generic sentence representations in an unsupervised way without having any supervised task in mind?

💭 Approach

The skip-gram model is currently used for word-level representations: given a word, it tries to predict the words before and after it. The authors extend this idea from words to sentences.
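For reference, a word-level skip-gram model can be trained with an off-the-shelf library. Below is a minimal sketch using gensim; the tiny corpus and hyperparameters are purely illustrative, not the paper's setup.

```python
# Minimal word-level skip-gram sketch (illustrative only; not the paper's setup).
# Assumes gensim 4.x, where sg=1 selects the skip-gram objective.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # predict words up to 2 positions before/after the centre word
    sg=1,             # skip-gram objective: predict context words from the centre word
    min_count=1,
)

print(model.wv["cat"].shape)  # (50,)
```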

The sentence-level extension uses an encoder-decoder model: the encoder maps the words of a sentence to a sentence vector, and the decoder generates the surrounding sentences.

⚗️ Methods

Given a sentence, the model tries to predict the previous and the next sentence. The model is called skip-thoughts and the resulting vector representation is called a skip-thought vector.

To handle new words that are not already in the training vocabulary, a mapping is learned from a word's word2vec representation to the model's own word representation.
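A minimal sketch of such a vocabulary-expansion mapping: fit a linear map by least squares on the words both vocabularies share, then apply it to the word2vec vectors of unseen words. The arrays and dimensionalities below are stand-ins, not the paper's trained parameters.

```python
# Sketch of vocabulary expansion: fit a linear map from the word2vec space
# to the RNN's word-embedding space using the words both vocabularies share,
# then use it to embed words the RNN never saw.
# The arrays here are random stand-ins for the real trained embeddings.
import numpy as np

rng = np.random.default_rng(0)

d_w2v, d_rnn = 300, 620          # assumed dimensionalities (illustrative)
shared_words = 10000             # words present in both vocabularies

X = rng.normal(size=(shared_words, d_w2v))   # word2vec vectors of shared words
Y = rng.normal(size=(shared_words, d_rnn))   # RNN embeddings of the same words

# Solve min_W ||X W - Y||^2 (un-regularised least squares).
W, *_ = np.linalg.lstsq(X, Y, rcond=None)    # W has shape (d_w2v, d_rnn)

# Map a word that only word2vec knows into the RNN's embedding space.
new_word_w2v = rng.normal(size=(d_w2v,))
new_word_rnn = new_word_w2v @ W              # shape (d_rnn,)
print(new_word_rnn.shape)
```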

⊧ Model

  • Encoder - RNN with GRU

  • Decoder - RNN with conditional GRU

  • Vocabulary Expansion - A trained mapping from the word2vec vector space to the RNN's word-embedding space


1) GRUs perform as well as LSTMs and are conceptually simpler.

2) The word2vec vocabulary is assumed to be larger than the RNN's vocabulary.
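Putting these pieces together, here is a minimal PyTorch sketch of the architecture: a GRU encoder produces the sentence vector, and two GRU decoders conditioned on it generate the previous and next sentences. Class and variable names are mine, the default dimensions only roughly follow the uni-skip setting, and conditioning the decoders through their initial hidden state is a simplification of the paper's conditional GRU, where the sentence vector enters every gate.

```python
# Minimal sketch of a skip-thoughts-style encoder-decoder (assumptions noted inline).
# Simplification: the sentence vector is used as the decoders' initial hidden state
# rather than being fed into every decoder gate as in the paper's conditional GRU.
import torch
import torch.nn as nn


class SkipThoughtSketch(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=620, hid_dim=2400):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder_prev = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder_next = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, sent, prev_sent, next_sent):
        # Encode the middle sentence; the final hidden state is the skip-thought vector.
        _, h = self.encoder(self.embed(sent))          # h: (1, batch, hid_dim)

        # Decode the previous and next sentences conditioned on the sentence vector.
        out_prev, _ = self.decoder_prev(self.embed(prev_sent), h)
        out_next, _ = self.decoder_next(self.embed(next_sent), h)
        return self.out(out_prev), self.out(out_next)  # logits over the vocabulary


# Toy usage with random token ids (batch of 2, sentences of length 7).
model = SkipThoughtSketch()
sent = torch.randint(0, 20000, (2, 7))
prev_sent = torch.randint(0, 20000, (2, 7))
next_sent = torch.randint(0, 20000, (2, 7))
logits_prev, logits_next = model(sent, prev_sent, next_sent)
print(logits_prev.shape)  # torch.Size([2, 7, 20000])
```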

📓 Results

Skip-thoughts yield generic representations that perform robustly across all tasks considered.

Experiments were conducted across 8 tasks. Even with only linear classifiers on top of the frozen vectors, performance was strong, demonstrating the robustness of skip-thoughts.
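To illustrate the evaluation protocol, the sketch below trains a linear classifier on top of frozen sentence features; only the classifier's weights are learned. The features here are random stand-ins, not real skip-thought vectors, and the dimensionality is illustrative.

```python
# Evaluation-style sketch: a linear classifier on top of frozen sentence features.
# The features are random stand-ins for real skip-thought vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_train, n_test, dim = 1000, 200, 2400       # dim is illustrative
X_train = rng.normal(size=(n_train, dim))    # frozen sentence features
y_train = rng.integers(0, 2, size=n_train)   # binary labels (e.g. sentiment)
X_test = rng.normal(size=(n_test, dim))
y_test = rng.integers(0, 2, size=n_test)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))  # ~0.5 here, since the data is random
```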

★ New Terms

  • GRU (Gated Recurrent Unit)

$$\begin{split} gate_r &= \sigma (W_{rx} X_t + W_{rh} h_{t-1} + b) \\
gate_{update} &= \sigma (W_{ux} X_t + W_{uh} h_{t-1} + b) \end{split}$$

$gate_{update}$ controls which parts of the hidden state are updated and which are carried over; this is done with one gate instead of the two (input and forget) gates in an LSTM:

$$h_t = (1 - gate_{update}) \cdot h_{t-1} + gate_{update} \cdot \tilde{h_{t}}$$

$gate_r$ controls which parts of the previous hidden state are used when computing the new proposal $\tilde{h_{t}}$:

$$\tilde{h_{t}} = \tanh (W_{hx} X_t + W_{hh} \cdot (gate_r \cdot h_{t-1}) + b)$$
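A single GRU step implementing these equations, as a minimal numpy sketch. The weights are random, and separate biases $b_r$, $b_u$, $b_h$ are used where the equations above write a single $b$.

```python
# One GRU step implementing the equations above (random weights; separate
# biases b_r, b_u, b_h where the text writes a single b).
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 8, 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters, randomly initialised for the sketch.
W_rx, W_rh, b_r = rng.normal(size=(h_dim, x_dim)), rng.normal(size=(h_dim, h_dim)), np.zeros(h_dim)
W_ux, W_uh, b_u = rng.normal(size=(h_dim, x_dim)), rng.normal(size=(h_dim, h_dim)), np.zeros(h_dim)
W_hx, W_hh, b_h = rng.normal(size=(h_dim, x_dim)), rng.normal(size=(h_dim, h_dim)), np.zeros(h_dim)

def gru_step(x_t, h_prev):
    gate_r = sigmoid(W_rx @ x_t + W_rh @ h_prev + b_r)               # reset gate
    gate_update = sigmoid(W_ux @ x_t + W_uh @ h_prev + b_u)          # update gate
    h_tilde = np.tanh(W_hx @ x_t + W_hh @ (gate_r * h_prev) + b_h)   # proposal
    return (1 - gate_update) * h_prev + gate_update * h_tilde        # new hidden state

h = np.zeros(h_dim)
for x_t in rng.normal(size=(5, x_dim)):   # run over a 5-step input sequence
    h = gru_step(x_t, h)
print(h.shape)  # (16,)
```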

🔭 Scope

Future work could explore the use of:

  • Deep encoders and decoders

  • Larger context windows

  • Paragraph encoding and decoding

  • Other encoders

📚 Other Resources

