
An Introduction to Text Representation


[TOC]

1. Introduction

Before feeding our text data into a model, we first need to transform it into a numerical format that the model can understand. This is called text representation, and it is a necessary data pre-processing step for almost every kind of NLP task.

In terms of language granularity, text representation can be divided into word representation, sentence representation and document representation. Word representation is the foundation of the other two, so we will introduce it first. Then we will cover sentence representation in detail. Document representation can often be simplified to sentence representation, so we won’t spend much time on it.

2. Word Representation

Word representation includes two categories of models: discrete word representation and distributed word representation. The major difference between them is whether the relationships among words are taken into account. Discrete representation assumes that words are independent, and it fills in the word vector mainly with features extracted from the word itself, such as its occurrence and frequency. As a result, a discrete word vector cannot capture the semantic and syntactic information of the word, and it is usually high-dimensional and sparse. In contrast, distributed representation analyzes the relationships between the word and other words and learns a dense, low-dimensional feature vector to represent the word. We will first introduce one-hot encoding, a classical and simple discrete representation. Then we’ll discuss word embedding, the most popular family of distributed representation techniques in recent years.

2.1. One-hot Encoding

Given a fixed vocabulary $V=\{w_1,w_2,\cdots,w_{|V|}\}$, one-hot encoding encodes a word $w_i$ as a $|V|$-dimensional vector $X$ in which the $i$-th dimension is 1 and all other dimensions are 0.
For example, suppose we have the following corpus (this example will also be used in later sections, and we refer to it as Example 1):

"I like mathematics."
"I like science."
"You like computer science."
"I like computer and science."

Then the vocabulary is:

{I, like, mathematics, science, You, computer, and}

For the word “mathematics”, its one-hot vector is:

(0,0,1,0,0,0,0)

One-hot encoding only captures the word’s occurrence and its position in the vocabulary, neglecting frequency information and the co-occurrence of different words.
In practice, one-hot encoding can be reduced to storing the index of each word in the vocabulary, so its high dimensionality and sparsity do not cause computational problems. However, it is seldom used directly for word representation because it lacks semantic and syntactic information. Instead, it is used to index words for other representation methods.
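As a minimal sketch of the idea (plain Python, using the Example 1 vocabulary above; the helper names are ours):

# Vocabulary from Example 1
vocab = ["I", "like", "mathematics", "science", "You", "computer", "and"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("mathematics"))  # [0, 0, 1, 0, 0, 0, 0]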

2.2. Word Embedding

Generally, the phrase “word embedding” has two meanings: 1. the technique of finding a function that maps a word to a dense, multi-dimensional vector; 2. the word vector obtained by that technique.
As previously mentioned, one-hot encoding has several problems: high dimensionality, sparsity and lack of semantics. The first two are easily solved by word embedding, since we can choose an appropriate dimensionality for the resulting vector. What about the third? We haven’t discussed the exact definition of semantics. We won’t go deeply into it, since the notion is inherited from linguistics; instead we only give some intuition: semantics represents the meaning of a word, and words with similar meanings should have similar semantics.
Under the distributional hypothesis (linguistic items with similar distributions have similar meanings), the problem of learning word semantics can be transformed into modeling the relations between a target word and its context, which is exactly what a statistical language model does. In fact, the earliest word embeddings were a by-product of neural network language models. In the following sections, we will introduce some classical word embedding methods.

2.2.1. Word2Vec

2.2.1.1. Continuous Bag of Words Model(CBOW)

2.2.1.2. Skip-Gram Model

[TODO]

3. Sentence Representation

3.1. Bag of Words

The Bag-of-Words model represents a sentence as a bag of its words, each represented by its one-hot encoding. The sentence vector is then the sum of all the word vectors:

$$X(s)=\sum_{w_i\in s}X(w_i)$$

where $X(w_i)$ is the one-hot encoding vector of word $w_i$ in sentence $s$, and $X(s)$ is the vector of $s$.
In Example 1, the sentence “I like computer and science” has the sentence vector:

(1,1,0,1,0,1,1)
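A minimal sketch of this (plain Python, Example 1 vocabulary; helper names are ours):

vocab = ["I", "like", "mathematics", "science", "You", "computer", "and"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(sentence):
    # Mark which vocabulary words appear in the sentence (presence only;
    # counts are handled in the next subsection)
    vec = [0] * len(vocab)
    for word in sentence.replace(".", "").split():
        vec[word_to_index[word]] = 1
    return vec

print(bag_of_words("I like computer and science"))  # [1, 1, 0, 1, 0, 1, 1]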

3.1.1. Bag of Words weighted by term count

In one-hot encoding, every word has the same value (1) in the non-zero component of its feature vector, which ignores the different importance of different words. Thus, it cannot distinguish these two sentences: “I like computer and science” and “I like computer and computer science”.
To evaluate the importance of a word, a straightforward way is to weight the one-hot vector by the count of the word in the sentence. Then, the vector of “I like computer and computer science” can be represented as:

(1,1,0,1,0,2,1)
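A sketch of the count-weighted variant (same plain-Python setting as before):

vocab = ["I", "like", "mathematics", "science", "You", "computer", "and"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def count_bow(sentence):
    # Weight each dimension by how often the word occurs in the sentence
    vec = [0] * len(vocab)
    for word in sentence.replace(".", "").split():
        vec[word_to_index[word]] += 1
    return vec

print(count_bow("I like computer and computer science"))  # [1, 1, 0, 1, 0, 2, 1]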

3.1.2. Bag of Words weighted by TF-IDF

A better choice is to use TF-IDF (Term Frequency-Inverse Document Frequency; in this section, a document is identified with a sentence) as the weight, where $TF(t)=\frac{\text{count of word }t\text{ in the doc}}{\text{number of words in the doc}}$.

The original definition of IDF is:

$$IDF(t)=\ln\frac{n}{df(t)}$$

where $n$ is the number of documents and $df(t)$ is the number of documents containing the target word $t$. To avoid division by zero, we can apply a smoothing technique: $IDF(t)=\ln\frac{1+n}{1+df(t)}$, which behaves as if there were one extra document containing every word. Furthermore, to avoid $IDF=0$, we can add one to the result:

$$IDF(t)=\ln\frac{1+n}{1+df(t)}+1$$

Then, $TFIDF(t)=TF(t)\cdot IDF(t)$.
Finally, the sentence vector is the TF-IDF-weighted sum of the one-hot vectors:

$$X(s)=\sum_{w_i\in s}TFIDF(w_i)\cdot X(w_i)$$
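As a quick worked instance of these formulas (using Example 1, so $n=4$, and the word “computer”, which appears in two of the four sentences), the weight of “computer” in the sentence “I like computer and science” would be roughly:

$$IDF(\text{computer})=\ln\frac{1+4}{1+2}+1\approx 1.51,\qquad TF(\text{computer})=\frac{1}{5}=0.2,\qquad TFIDF(\text{computer})\approx 0.30$$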

The TF-IDF representation can be easily computed by calling the following scikit-learn functions:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Count term occurrences, then convert the counts into TF-IDF weights
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

or

from sklearn.feature_extraction.text import TfidfVectorizer

transformer = TfidfVectorizer()  # counting and TF-IDF weighting in one step
tfidf2 = transformer.fit_transform(corpus)
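As a quick sanity check (assuming a recent scikit-learn version, reusing the names from the snippet above):

print(tfidf2.toarray())                     # dense TF-IDF weights, one row per sentence
print(transformer.get_feature_names_out())  # the vocabulary learned from the corpus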

3.1.3. Bag of Words with Word Embeddings

In this method, each word is represented by its word embedding instead of its one-hot encoding. The sentence is then represented by the average, or a weighted average, of all the word vectors, analogous to what we discussed above.
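A minimal sketch of the averaging step (the embedding table here is hypothetical and its values are made up purely for illustration; in practice it would come from a trained model such as Word2Vec):

import numpy as np

# Hypothetical embedding table: word -> dense vector (made-up values)
embeddings = {
    "i": np.array([0.1, 0.3]),
    "like": np.array([0.2, 0.1]),
    "computer": np.array([0.5, 0.7]),
    "science": np.array([0.4, 0.6]),
}

def sentence_vector(sentence):
    # Average the embeddings of the words found in the table
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else None

print(sentence_vector("I like computer science"))  # [0.3   0.425]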