A Neural Probabilistic Language Model

util62026/4/27大约 85 分钟

A Neural Probabilistic Language Model

Yoshua Bengio

<BENGIOY@IRO.umONTREAL.CA>

Réjean Ducharme

<DUCHARME@IRO.umONTREAL.CA>

Pascal Vincent

<VINCENTP@IRO.umONTREAL.CA>

Christian Jauvin

<JAUVINC@IRO.umONTREAL.CA>

Département d'Informatique et Recherche Opérationnelle

Centre de Recherche Mathématiques

Université de Montréal, Montréal, Québec, Canada

Editors: Jaz Kandola, Thomas Hofmann, Tomaso Poggio and John Shawe-Taylor

Abstract

A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model learns simultaneously (1) a distributed representation for each word along with (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows to take advantage of longer contexts.

Keywords: Statistical language modeling, artificial neural networks, distributed representation, curse of dimensionality

1. Introduction

A fundamental problem that makes language modeling and other learning problems difficult is the curse of dimensionality. It is particularly obvious in the case when one wants to model the joint distribution between many discrete random variables (such as words in a sentence, or discrete attributes in a data-mining task). For example, if one wants to model the joint distribution of 10 consecutive words in a natural language with a vocabulary $V$ of size 100,000, there are potentially $100000^{10} - 1 = 10^{50} - 1$ free parameters. When modeling continuous variables, we obtain generalization more easily (e.g. with smooth classes of functions like multi-layer neural networks or Gaussian mixture models) because the function to be learned can be expected to have some local smoothness properties. For discrete spaces, the generalization structure is not as obvious: any change of these discrete variables may have a drastic impact on the value of the function to be esti

神经概率语言模型

约书亚·本吉奥

<BENGIOY@IRO.UMONTREAL.CA>

雷热安·杜夏尔

<DUCHARME@IRO.umONTREAL.CA>

Pascal Vincent

<VINCENTP@IRO.UMONTREAL.CA>

Christian Jauvin

<JAUVINC@IRO.UMONTREAL.CA>

计算机与运筹学部

数学研究中心

蒙特利尔大学，蒙特利尔，魁北克，加拿大

编者：Jaz Kandola，Thomas Hofmann，Tomaso Poggio和John Shawe-Taylor

摘要

统计语言建模的一个目标是学习语言中单词序列的联合概率函数。这本质上很困难，因为存在维数灾难：模型将要测试的单词序列很可能与训练期间看到的所有单词序列都不同。传统的但非常成功的方法基于 n-gram 通过连接训练集中非常短的重叠序列来获得泛化。我们提出通过学习单词的分布式表示来对抗维数灾难，这使得每个训练句子都能向模型提供关于指数数量语义相邻句子的信息。模型同时学习 (1) 每个单词的分布式表示以及 (2) 以这些表示表示的单词序列概率函数。泛化是通过从未见过的新单词序列获得高概率来实现的，如果该序列由与已见句子形成的单词相似的单词（在具有附近表示的意义上）组成。在合理的时间内训练这样的大型模型（具有数百万个参数）本身就是一项重大挑战。我们报告了使用神经网络进行概率函数的实验，表明在两个文本语料库上，所提出的方法显著优于最先进的 n-gram 模型，并且所提出的方法允许利用更长的上下文。

关键词：统计语言模型、人工神经网络、分布式表示、维度灾难

1. 引言

语言建模和其他学习问题中的一个基本难题是维数灾难。这在想要对许多离散随机变量（例如句子中的单词，或数据挖掘任务中的离散属性）之间的联合分布进行建模的情况下尤其明显。例如，如果想要用词汇量为10万个的自然语言中连续10个单词的联合分布进行建模，潜在的自由参数可能有 $100000^{10} - 1 = 10^{50} - 1$ 个。当建模连续变量时，我们更容易获得泛化（例如，使用多层神经网络或高斯混合模型等平滑函数类），因为可以预期要学习的函数具有某些局部平滑性。对于离散空间，泛化结构并不那么明显：这些离散变量的任何变化都可能对要估计的函数值产生剧烈影响，并且当每个离散变量可以取的值数量很大时，大多数观测对象在汉明距离上几乎彼此最远。

mated, and when the number of values that each discrete variable can take is large, most observed objects are almost maximally far from each other in hamming distance.

A useful way to visualize how different learning algorithms generalize, inspired from the view of non-parametric density estimation, is to think of how probability mass that is initially concentrated on the training points (e.g., training sentences) is distributed in a larger volume, usually in some form of neighborhood around the training points. In high dimensions, it is crucial to distribute probability mass where it matters rather than uniformly in all directions around each training point. We will show in this paper that the way in which the approach proposed here generalizes is fundamentally different from the way in which previous state-of-the-art statistical language modeling approaches are generalizing.

A statistical model of language can be represented by the conditional probability of the next word given all the previous ones, since

\hat {P} \left(w _ {1} ^ {T}\right) = \prod_ {t = 1} ^ {T} \hat {P} \left(w _ {t} \mid w _ {1} ^ {t - 1}\right),

where $w_{t}$ is the $t$ -th word, and writing sub-sequence $w_{i}^{j} = (w_{i},w_{i + 1},\dots ,w_{j - 1},w_{j})$ . Such statistical language models have already been found useful in many technological applications involving natural language, such as speech recognition, language translation, and information retrieval. Improvements in statistical language models could thus have a significant impact on such applications.

When building statistical models of natural language, one considerably reduces the difficulty of this modeling problem by taking advantage of word order, and the fact that temporally closer words in the word sequence are statistically more dependent. Thus, $n$ -gram models construct tables of conditional probabilities for the next word, for each one of a large number of contexts, i.e. combinations of the last $n - 1$ words:

\hat {P} \left(w _ {t} \mid w _ {1} ^ {t - 1}\right) \approx \hat {P} \left(w _ {t} \mid w _ {t - n + 1} ^ {t - 1}\right).

We only consider those combinations of successive words that actually occur in the training corpus, or that occur frequently enough. What happens when a new combination of $n$ words appears that was not seen in the training corpus? We do not want to assign zero probability to such cases, because such new combinations are likely to occur, and they will occur even more frequently for larger context sizes. A simple answer is to look at the probability predicted using a smaller context size, as done in back-off trigram models (Katz, 1987) or in smoothed (or interpolated) trigram models (Jelinek and Mercer, 1980). So, in such models, how is generalization basically obtained from sequences of words seen in the training corpus to new sequences of words? A way to understand how this happens is to think about a generative model corresponding to these interpolated or backoff n-gram models. Essentially, a new sequence of words is generated by "gluing" very short and overlapping pieces of length 1, 2 ... or up to $n$ words that have been seen frequently in the training data. The rules for obtaining the probability of the next piece are implicit in the particulars of the back-off or interpolated n-gram algorithm. Typically researchers have used $n = 3$ , i.e. trigrams, and obtained state-of-the-art results, but see Goodman (2001) for how combining many tricks can yield to substantial improvements. Obviously there is much more information in the sequence that immediately precedes the word to predict than just the identity of the previous couple of words. There are at least two characteristics in this approach which beg to be improved upon, and that we

要估计的，并且当每个离散变量可以取的值数量很大时，大多数观测对象在汉明距离上几乎彼此最远。

一种有用的可视化方法，可以展示不同学习算法如何泛化，其灵感来源于非参数密度估计的观点，即思考最初集中在训练点（例如训练句子）上的概率质量如何在更大的体积中分布，通常以训练点周围的某种邻域形式存在。在高维度中，关键在于将概率质量分布在与每个训练点均匀分布不同的位置。本文将证明，这里提出的方法的泛化方式与以往最先进的统计语言建模方法的泛化方式存在根本差异。

语言统计模型可以通过给定所有先前词的下一个词的条件概率来表示，因为

\hat {P} \left(w _ {1} ^ {T}\right) = \prod_ {t = 1} ^ {T} \hat {P} \left(w _ {t} \mid w _ {1} ^ {t - 1}\right),

其中 $\mathrm{W}_t$ 是第 $t$ 个词，并且写作子序列 $w_i^j = (w_i, w_{i+1}, \dots, w_{j-1}, w_j)$ 。这种统计语言模型已经在涉及自然语言的许多技术应用中找到了用途，例如语音识别、语言翻译和信息检索。因此，统计语言模型的改进可能对这些应用产生重大影响。

在构建自然语言的统计模型时，通过利用词序以及词序列中时间上更接近的词语在统计上更依赖这一事实，可以显著降低该建模问题的难度。因此，n-gram模型为大量上下文中的每个上下文构建了条件概率表，即最后 $n - 1$ 个词的组合：

\hat {P} \left(w _ {t} \mid w _ {1} ^ {t - 1}\right) \approx \hat {P} \left(w _ {t} \mid w _ {t - n + 1} ^ {t - 1}\right).

我们只考虑那些在训练语料库中实际出现或出现频率足够高的连续词语组合。当出现一个新的n词组合，而这个组合在训练语料库中并未见过时，会发生什么？我们不想给这种情况分配零概率，因为这些新的组合很可能出现，并且对于更大的上下文大小，它们出现的频率会更高。一个简单的答案是查看使用较小上下文大小预测的概率，就像在回退三元组模型（Katz，1987）或平滑（或插值）三元组模型（Jelinek和Mercer，1980）中所做的那样。因此，在这样的模型中，基本上是如何从训练语料库中看到的词语序列泛化到新的词语序列的？理解这种情况发生的一种方式是思考与这些插值或回退n元组模型相对应的生成模型。本质上，一个新的词语序列是通过“粘贴”在训练数据中频繁出现的长度为1、2...或最多n个的非常短且重叠的片段生成的。获得下一个片段概率的规则隐含在回退或插值n元组算法的细节中。通常研究人员使用 $n = 3$ ，即三元组，并获得了最先进的结果，但请参见Goodman (2001)，了解如何结合许多技巧可以带来实质性的改进。显然，在要预测的词语立即前面的序列中包含比仅知道前几个词语的身份更多的信息。这种方法中至少有两个特点亟待改进，而我们

will focus on in this paper. First, it is not taking into account contexts farther than 1 or 2 words, $^{1}$ second it is not taking into account the "similarity" between words. For example, having seen the sentence "The cat is walking in the bedroom" in the training corpus should help us generalize to make the sentence "A dog was running in a room" almost as likely, simply because "dog" and "cat" (resp. "the" and "a", "room" and "bedroom", etc...) have similar semantic and grammatical roles.

There are many approaches that have been proposed to address these two issues, and we will briefly explain in Section 1.2 the relations between the approach proposed here and some of these earlier approaches. We will first discuss what is the basic idea of the proposed approach. A more formal presentation will follow in Section 2, using an implementation of these ideas that relies on shared-parameter multi-layer neural networks. Another contribution of this paper concerns the challenge of training such very large neural networks (with millions of parameters) for very large data sets (with millions or tens of millions of examples). Finally, an important contribution of this paper is to show that training such large-scale model is expensive but feasible, scales to large contexts, and yields good comparative results (Section 4).

Many operations in this paper are in matrix notation, with lower case $\nu$ denoting a column vector and $\nu'$ its transpose, $A_j$ the $j$ -th row of a matrix $A$ , and $x.y = x'y$ .

1.1 Fighting the Curse of Dimensionality with Distributed Representations

In a nutshell, the idea of the proposed approach can be summarized as follows:

associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$ ),
express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence, and
learn simultaneously the word feature vectors and the parameters of that probability function.

The feature vector represents different aspects of the word: each word is associated with a point in a vector space. The number of features (e.g. $m = 30$ , 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 17,000). The probability function is expressed as a product of conditional probabilities of the next word given the previous ones, (e.g. using a multi-layer neural network to predict the next word given the previous ones, in the experiments). This function has parameters that can be iteratively tuned in order to maximize the log-likelihood of the training data or a regularized criterion, e.g. by adding a weight decay penalty. The feature vectors associated with each word are learned, but they could be initialized using prior knowledge of semantic features.

Why does it work? In the previous example, if we knew that dog and cat played similar roles (semantically and syntactically), and similarly for (the,a), (bedroom,room), (is,was),

将重点关注于本文。首先，它没有考虑到超过1或2个词的上下文，其次，它没有考虑到词语之间的“相似性”。例如，在训练语料库中看到句子“The cat is walking in the bedroom”应该能帮助我们泛化，使得句子“A dog was running in a room”几乎同样可能，仅仅因为“dog”和“cat”（分别对应“the”和“a”、“room”和“bedroom”，等等）具有相似的语义和语法角色。

针对这两个问题，已经提出了许多方法，我们将在1.2节简要解释这里提出的方法与其中一些早期方法之间的关系。我们将首先讨论所提出方法的基本思想。在2节中，我们将使用基于共享参数多层神经网络的这些思想实现进行更正式的介绍。本文的另一个贡献是关于训练这样非常大的神经网络（具有数百万个参数）以处理非常大的数据集（具有数百万或数千万个示例）的挑战。最后，本文的一个重要贡献是证明训练这样的大规模模型虽然昂贵但可行，能够扩展到大型上下文，并产生良好的比较结果（第4节）。

本文中的许多操作以矩阵形式表示，其中小写v表示列向量， $\mathbf{v}'$ 为其转置， $A_j$ 表示矩阵A的第j行，以及 $x.y = x'y$ 。

1.1 用分布式表示对抗维度灾难

简而言之，所提出方法的核心思想可以总结如下：

将词汇表中的每个词与一个分布式词特征向量（ $\mathbb{R}^m$ 中的实值向量）相关联，
以序列中这些词的特征向量表示词序列的联合概率函数，在这些词的序列中，和
同时学习词特征向量和该概率的参数函数。

特征向量表示单词的不同方面：每个单词都与向量空间中的一个点相关联。特征的数量（例如实验中的 $m = 30$ 、60 或 100）远小于词汇表的大小（例如 17,000）。概率函数表示为给定前一个单词的下一个单词的条件概率的乘积（例如，使用多层神经网络根据前一个单词预测下一个单词，在实验中）。该函数具有参数，可以通过迭代调整以最大化训练数据的对数似然或正则化标准，例如通过添加权重衰减惩罚。与每个单词相关的特征向量是学习得到的，但它们也可以使用语义特征的先验知识进行初始化。

为什么有效？在之前的例子中，如果我们知道 dog 和 cat 扮演了相似的角色（在语义和句法上），并且 similarly for (the,a)，(bedroom, room)，(is, was)，

(running,walking), we could naturally generalize (i.e. transfer probability mass) from

The cat is walking in the bedroom

A dog was running in a room

and likewise to

The cat is running in a room

A dog is walking in a bedroom

The dog was walking in the room

and many other combinations. In the proposed model, it will so generalize because "similar" words are expected to have a similar feature vector, and because the probability function is a smooth function of these feature values, a small change in the features will induce a small change in the probability. Therefore, the presence of only one of the above sentences in the training data will increase the probability, not only of that sentence, but also of its combinatorial number of "neighbors" in sentence space (as represented by sequences of feature vectors).

1.2 Relation to Previous Work

The idea of using neural networks to model high-dimensional discrete distributions has already been found useful to learn the joint probability of $Z_{1} \cdots Z_{n}$ , a set of random variables where each is possibly of a different nature (Bengio and Bengio, 2000a,b). In that model, the joint probability is decomposed as a product of conditional probabilities

\hat {P} \left(Z _ {1} = z _ {1}, \dots , Z _ {n} = z _ {n}\right) = \prod_ {i} \hat {P} \left(Z _ {i} = z _ {i} \mid g _ {i} \left(Z _ {i - 1} = z _ {i - 1}, Z _ {i - 2} = z _ {i - 2}, \dots , Z _ {1} = z _ {1}\right)\right),

where $g(\cdot)$ is a function represented by a neural network with a special left-to-right architecture, with the $i$ -th output block $g_i(\cdot)$ computing parameters for expressing the conditional distribution of $Z_i$ given the value of the previous $Z$ 's, in some arbitrary order. Experiments on four UCI data sets show this approach to work comparatively very well (Bengio and Bengio, 2000a,b). Here we must deal with data of variable length, like sentences, so the above approach must be adapted. Another important difference is that here, all the $Z_i$ (word at $i$ -th position), refer to the same type of object (a word). The model proposed here therefore introduces a sharing of parameters across time – the same $g_i$ is used across time – that is, and across input words at different positions. It is a successful large-scale application of the same idea, along with the (old) idea of learning a distributed representation for symbolic data, that was advocated in the early days of connectionism (Hinton, 1986, Elman, 1990). More recently, Hinton's approach was improved and successfully demonstrated on learning several symbolic relations (Paccanaro and Hinton, 2000). The idea of using neural networks for language modeling is not new either (e.g. Miikkulainen and Dyer, 1991). In contrast, here we push this idea to a large scale, and concentrate on learning a statistical model of the distribution of word sequences, rather than learning the role of words in a sentence. The approach proposed here is also related to previous proposals of character-based text compression using neural networks to predict the probability of the next character (Schmidhuber, 1996). The idea of using a neural network for language modeling has also been independently proposed by Xu and Rudnicky (2000), although experiments are with networks without hidden units and a single input word, which limit the model to essentially capturing unigram and bigram statistics.

The idea of discovering some similarities between words to obtain generalization from training sequences to new sequences is not new. For example, it is exploited in approaches that are based on learning a clustering of the words (Brown et al., 1992, Pereira et al., 1993, Niesler et al., 1998, Baker

(跑步，走路)，我们可以自然地推广（即转移概率质量）从

猫正在卧室里走路

狗曾在房间里跑步

同样地到

猫正在房间里跑步

一只狗正在卧室里散步

那只狗曾经在一个房间里散步

因此，训练数据中仅出现上述其中一句，将不仅会增加该句的概率，还会增加其在句子空间中“邻居”组合数的概率（以特征向量序列表示）。

1.2与前人工作的关系

使用神经网络对高维离散分布进行建模的想法已经被发现对学习随机变量 $Z_{1} \cdots Z_{n}$ 的联合概率很有用，其中每个变量可能具有不同的性质（Bengio和Bengio，2000a,b）。在该模型中，联合概率被分解为条件概率的乘积

\hat {P} \left(Z _ {1} = z _ {1}, \dots , Z _ {n} = z _ {n}\right) = \prod_ {i} \hat {P} \left(Z _ {i} = z _ {i} \mid g _ {i} \left(Z _ {i - 1} = z _ {i - 1}, Z _ {i - 2} = z _ {i - 2}, \dots , Z _ {1} = z _ {1}\right)\right),

其中g(.)是一个由具有特殊从左到右架构的神经网络表示的函数，第i个输出块 $g_i(\cdot)$ 计算用于表达在给定先前Z值的情况下 $Z_i$ 的条件分布的参数，在任意顺序中。在四个UCI数据集上的实验表明这种方法相对非常有效（Bengio和Bengio，2000a,b）。在这里我们必须处理可变长度的数据，如句子，因此必须调整上述方法。另一个重要区别是，在这里，所有的 $Z_i$ （第i个位置的词），都指的是同一类型的对象（一个词）。因此，这里提出的模型引入了跨时间的参数共享——即同一 $g_i$ 在时间上被使用——以及在不同位置的输入词之间。这是同一想法的一个成功的应用，以及（旧的）为符号数据学习分布式表示的想法，该想法在连接主义早期（Hinton，1986，Elman，1990）被提倡。最近，Hinton的方法得到了改进并在学习多个符号关系方面取得了成功（Paccanaro和Hinton，2000）。使用神经网络进行语言建模的想法也不是新的（例如，Miikkulainen和Dyer，1991）。相反，在这里我们将这一想法推向了大规模，并专注于学习词序列的统计模型，而不是学习词在句子中的作用。这里提出的模型也与之前的基于字符的文本压缩提议有关，这些提议使用神经网络来预测下一个字符的概率（Schmidhuber，1996）。使用神经网络进行语言建模的想法也由Xu和Rudnicky（2000）独立提出，尽管实验使用的是没有隐藏单元的网络和单个输入词，这限制了模型基本上只能捕获一元和二元统计。

发现词语之间某些相似性，从而从训练序列中获得泛化并应用于新序列的想法并非新颖。例如，这种方法被应用于基于学习词语聚类的方法中（Brownetal.,1992,Pereira et al.,1993,Niesler et al.,1998,Baker

and McCallum, 1998): each word is associated deterministically or probabilistically with a discrete class, and words in the same class are similar in some respect. In the model proposed here, instead of characterizing the similarity with a discrete random or deterministic variable (which corresponds to a soft or hard partition of the set of words), we use a continuous real-vector for each word, i.e. a learned distributed feature vector, to represent similarity between words. The experimental comparisons in this paper include results obtained with class-based n-grams (Brown et al., 1992, Ney and Kneser, 1993, Niesler et al., 1998).

The idea of using a vector-space representation for words has been well exploited in the area of information retrieval (for example see work by Schutze, 1993), where feature vectors for words are learned on the basis of their probability of co-occurring in the same documents (Latent Semantic Indexing, see Deerwester et al., 1990). An important difference is that here we look for a representation for words that is helpful in representing compactly the probability distribution of word sequences from natural language text. Experiments suggest that learning jointly the representation (word features) and the model is very useful. We tried (unsuccessfully) using as fixed word features for each word $w$ the first principal components of the co-occurrence frequencies of $w$ with the words occurring in text around the occurrence of $w$ . This is similar to what has been done with documents for information retrieval with LSI. The idea of using a continuous representation for words has however been exploited successfully by Bellegarda (1997) in the context of an n-gram based statistical language model, using LSI to dynamically identify the topic of discourse.

The idea of a vector-space representation for symbols in the context of neural networks has also previously been framed in terms of a parameter sharing layer, (e.g. Riis and Krogh, 1996) for secondary structure prediction, and for text-to-speech mapping (Jensen and Riis, 2000).

2. A Neural Model

The training set is a sequence $w_{1} \cdots w_{T}$ of words $w_{t} \in V$ , where the vocabulary $V$ is a large but finite set. The objective is to learn a good model $f(w_{t}, \dots, w_{t-n+1}) = \hat{P}(w_{t} | w_{1}^{t-1})$ , in the sense that it gives high out-of-sample likelihood. Below, we report the geometric average of $1 / \hat{P}(w_{t} | w_{1}^{t-1})$ , also known as perplexity, which is also the exponential of the average negative log-likelihood. The only constraint on the model is that for any choice of $w_{1}^{t-1}$ , $\sum_{i=1}^{|V|} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$ , with $f > 0$ . By the product of these conditional probabilities, one obtains a model of the joint probability of sequences of words.

We decompose the function $f(w_{t},\dots ,w_{t - n + 1}) = \hat{P} (w_{t}|w_{1}^{t - 1})$ in two parts:

A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$ . It represents the distributed feature vectors associated with each word in the vocabulary. In practice, $C$ is represented by a $|V| \times m$ matrix of free parameters.
The probability function over words, expressed with $C$ : a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t - n + 1}),\dots ,C(w_{t - 1}))$ , to a conditional probability distribution over words in $V$ for the next word $w_{t}$ . The output of $g$ is a vector whose $i$ -th element estimates the probability $\hat{P} (w_{t} = i|w_{1}^{t - 1})$ as in Figure 1.

f (i, w _ {t - 1}, \dots , w _ {t - n + 1}) = g (i, C (w _ {t - 1}), \dots , C (w _ {t - n + 1}))

The function $f$ is a composition of these two mappings ( $C$ and $g$ ), with $C$ being shared across all the words in the context. With each of these two parts are associated some parameters. The

并且 McCallum, 1998 年): 每个词被确定性地或概率性地与一个离散类别相关联, 并且同一类别的词在某些方面是相似的。在本模型中, 我们不是用离散随机或确定性变量来表征相似性（这对应于词汇集合的软划分或硬划分）, 而是为每个词使用一个连续实向量, 即一个学习到的分布式特征向量, 来表示词之间的相似性。本文中的实验比较包括使用基于类别的 n-gram (Brown 等人, 1992 年, Ney 和 Kneser, 1993 年, Niesler 等人, 1998 年) 获得的结果。

使用向量空间表示法来表示单词的想法在信息检索领域（例如参见Schutze，1993年的工作）中已被充分探索，其中单词的特征向量是基于它们在同一文档中共同出现的概率学习的（潜在语义索引，参见Deerwester等人，1990年）。一个重要的区别在于，在这里我们寻找一个对表示自然语言文本中单词序列的概率分布具有帮助的单词表示。实验表明，联合学习表示（单词特征）和模型非常有用。我们尝试（但没有成功）将每个单词w的固定单词特征设置为w与w在文本中出现的单词的共现频率的第一主成分。这与使用LSI对信息检索中的文档所做的工作类似。然而，使用连续表示来表示单词的想法已被

Bellegarda（1997年）在基于n-gram的统计语言模型的上下文中成功利用，他使用LSI动态识别话语的主题。

在神经网络背景下，符号的向量空间表示方法也曾被理解为参数共享层（例如Riis和Krogh，1996年），用于二级结构预测和文本到语音映射（Jensen和Riis，2000年）。

2. 神经模型

训练集是一个由单词 $w_{t} \in V$ 组成的序列 $\mathrm{W}_{1} \cdots \mathrm{W}_{T}$ , 其中词汇表 $\mathrm{V}$ 是一个庞大但有限的集合。目标是学习一个良好的模型 $f(w_{t}, \dots, w_{t-n+1}) = \hat{P}(w_{t} | w_{1}^{t-1})$ , 使其在样本外具有高似然性。下面, 我们报告了 $1 / \hat{P}(w_{t} | w_{1}^{t-1})$ 的几何平均值, 也称为困惑度, 它也是平均负对数似然性的指数。模型唯一的约束条件是对于任何选择的 $w_{1}^{t-1} \sum_{i=1}^{|V|} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$ , 都有 $f > 0$ 。通过这些条件概率的乘积, 可以得到单词序列联合概率的模型。

我们将函数 $f(w_{t},\dots ,w_{t - n + 1}) = \hat{P} (w_{t}|w_{1}^{t - 1})$ 分解为两部分：

一个从 V 的任意元素 i 映射到实向量的映射 C, $C(i) \in \mathbb{R}^m$ 。它表示词汇表中每个词关联的分布式特征向量。在实践中, C 由一个 $|V| \times m$ 参数矩阵表示。
词语上的概率函数，用 C 表示：一个函数 g 将上下文中词语的特征向量输入序列 $(C(w_{t - n + 1}),\dots ,C(w_{t - 1}))$ 映射到 V 中下一个词 $W_{t}$ 的条件概率分布。g 的输出是一个向量，其第 i 个元素估计概率 $\hat{P} (w_t = i|w_1^{t - 1})$ ，如图 1 所示。

f (i, w _ {t - 1}, \dots , w _ {t - n + 1}) = g (i, C (w _ {t - 1}), \dots , C (w _ {t - n + 1}))

函数 f 是这两个映射（C 和 g）的组合，其中 C 在上下文中的所有单词中共享。这两个部分各自关联一些参数。The

Figure 1: Neural architecture: $f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$ where $g$ is the neural network and $C(i)$ is the $i$ -th word feature vector.

parameters of the mapping $C$ are simply the feature vectors themselves, represented by a $|V| \times m$ matrix $C$ whose row $i$ is the feature vector $C(i)$ for word $i$ . The function $g$ may be implemented by a feed-forward or recurrent neural network or another parametrized function, with parameters $\omega$ . The overall parameter set is $\theta = (C, \omega)$ .

Training is achieved by looking for $\theta$ that maximizes the training corpus penalized log-likelihood:

L = \frac {1}{T} \sum_ {t} \log f \left(w _ {t}, w _ {t - 1}, \dots , w _ {t - n + 1}; \theta\right) + R (\theta),

where $R(\theta)$ is a regularization term. For example, in our experiments, $R$ is a weight decay penalty applied only to the weights of the neural network and to the $C$ matrix, not to the biases. $^3$

In the above model, the number of free parameters only scales linearly with $V$ , the number of words in the vocabulary. It also only scales linearly with the order $n$ : the scaling factor could be reduced to sub-linear if more sharing structure were introduced, e.g. using a time-delay neural network or a recurrent neural network (or a combination of both).

In most experiments below, the neural network has one hidden layer beyond the word features mapping, and optionally, direct connections from the word features to the output. Therefore there are really two hidden layers: the shared word features layer $C$ , which has no non-linearity (it would not add anything useful), and the ordinary hyperbolic tangent hidden layer. More precisely, the neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1:

\hat {P} \left(w _ {t} \mid w _ {t - 1}, \dots w _ {t - n + 1}\right) = \frac {e ^ {y _ {w _ {t}}}}{\sum_ {i} e ^ {y _ {i}}}.

图1：神经网络架构： $f(i, w_{t-1}, \dots, w_{t-n+1}) = g(i, C(w_{t-1}), \dots, C(w_{t-n+1}))$ 其中g表示神经网络，C(i)是第i个词特征向量。

映射 C 的参数就是特征向量本身，表示为 $|V| \times m$ 矩阵 C，其第 i 行是单词 i 的特征向量 C(i)。函数 g 可以通过前馈或循环神经网络或其他参数化函数实现，参数为。整体参数集是 = (C, )。

通过寻找最大化训练语料库惩罚时数似然来训练：

L = \frac {1}{T} \sum_ {t} \log f \left(w _ {t}, w _ {t - 1}, \dots , w _ {t - n + 1}; \theta\right) + R (\theta),

其中R()是一个正则化项。例如，在我们的实验中，R是一种仅应用于神经网络权重和C矩阵的权重衰减惩罚，而不应用于偏置项。3

在上述模型中，自由参数的数量仅随词汇量V线性扩展。它也仅随顺序n线性扩展：如果引入更多共享结构，例如使用时间延迟神经网络或循环神经网络（或两者的组合），扩展因子可以降低到次线性。

在下面的大多数实验中，神经网络的词特征映射之外还有一个隐藏层，并且可选地，从词特征到输出的直接连接。因此实际上有两个隐藏层：共享词特征层C，它没有非线性（它不会增加任何有用的东西），以及普通的双曲正切隐藏层。更精确地说，神经网络计算以下函数，带有softmax输出层，这保证了和为1的正概率：

\hat {P} \left(w _ {t} \mid w _ {t - 1}, \dots w _ {t - n + 1}\right) = \frac {e ^ {y _ {w _ {t}}}}{\sum_ {i} e ^ {y _ {i}}}.

The $y_{i}$ are the unnormalized log-probabilities for each output word $i$ , computed as follows, with parameters $b,W,U,d$ and $H$ :

y = b + W x + U \tanh (d + H x) \tag {1}

where the hyperbolic tangent tanh is applied element by element, $W$ is optionally zero (no direct connections), and $x$ is the word features layer activation vector, which is the concatenation of the input word features from the matrix $C$ :

x = \left(C \left(w _ {t - 1}\right), C \left(w _ {t - 2}\right), \dots , C \left(w _ {t - n + 1}\right)\right).

Let $h$ be the number of hidden units, and $m$ the number of features associated with each word. When no direct connections from word features to outputs are desired, the matrix $W$ is set to 0. The free parameters of the model are the output biases $b$ (with $|V|$ elements), the hidden layer biases $d$ (with $h$ elements), the hidden-to-output weights $U$ (a $|V| \times h$ matrix), the word features to output weights $W$ (a $|V| \times (n - 1)m$ matrix), the hidden layer weights $H$ (a $h \times (n - 1)m$ matrix), and the word features $C$ (a $|V| \times m$ matrix):

\theta = (b, d, W, U, H, C).

The number of free parameters is $|V|(1 + nm + h) + h(1 + (n - 1)m)$ . The dominating factor is $|V|(nm + h)$ . Note that in theory, if there is a weight decay on the weights $W$ and $H$ but not on $C$ , then $W$ and $H$ could converge towards zero while $C$ would blow up. In practice we did not observe such behavior when training with stochastic gradient ascent.

Stochastic gradient ascent on the neural network consists in performing the following iterative update after presenting the $t$ -th word of the training corpus:

\theta \leftarrow \theta + \varepsilon \frac {\partial \log \hat {P} \left(w _ {t} \mid w _ {t - 1} , \cdots w _ {t - n + 1}\right)}{\partial \theta}

where $\varepsilon$ is the "learning rate". Note that a large fraction of the parameters needs not be updated or visited after each example: the word features $C(j)$ of all words $j$ that do not occur in the input window.

Mixture of models. In our experiments (see Section 4) we have found improved performance by combining the probability predictions of the neural network with those of an interpolated trigram model, either with a simple fixed weight of 0.5, a learned weight (maximum likelihood on the validation set) or a set of weights that are conditional on the frequency of the context (using the same procedure that combines trigram, bigram, and unigram in the interpolated trigram, which is a mixture).

3. Parallel Implementation

Although the number of parameters scales nicely, i.e. linearly with the size of the input window and linearly with the size of the vocabulary, the amount of computation required for obtaining the output probabilities is much greater than that required from n-gram models. The main reason is that with n-gram models, obtaining a particular $P(w_{t}|w_{t - 1},\ldots ,w_{t - n + 1})$ does not require the computation of the probabilities for all the words in the vocabulary, because of the easy normalization (performed when training the model) enjoyed by the linear combinations of relative frequencies. The main computational bottleneck with the neural implementation is the computation of the activations of the output layer.

$y_{i}$ 是每个输出词 $i$ 的未归一化对数概率，计算方法如下，参数为 b、W、U、d 和 H:

y = b + W x + U \tanh (d + H x) \tag {1}

其中双曲正切函数tanh逐元素应用，W可选为零（无直接连接），x是词特征层激活向量，它是矩阵C中输入词特征的连接：

x = \left(C \left(w _ {t - 1}\right), C \left(w _ {t - 2}\right), \dots , C \left(w _ {t - n + 1}\right)\right).

设h为隐藏单元的数量，m为每个词关联的特征数量。当不希望从词特征到输出存在直接连接时，矩阵W被设置为0。模型的自由参数是输出偏置b（有|V|个元素）、隐藏层偏置d（有h个元素）、隐藏层到输出权重（词特征到输出权重W矩阵）、隐藏层权重H（一个 $h \times (n - 1)m$ 矩阵），以及词特征C（一个 $|\mathbf{V}| \times \mathbf{m}$ 矩阵）：

\theta = (b, d, W, U, H, C).

自由参数的数量是 $|V|(1 + nm + h) + h(1 + (n - 1)m)$ 。主导因素是 $|V|(nm + h)$ 。请注意,理论上, 如果权重 W 和 H 有权重衰减, 但 C 没有则有, 那么 W 和 H 可能会收敛到零,而 C 会爆炸。在实践中, 我们在使用随机梯度上升进行训练时没有观察到这种行为。

神经网络的随机梯度上升包括在展示训练语料库的第 $t$ 个词后执行以下迭代更新：

\theta \leftarrow \theta + \varepsilon \frac {\partial \log \hat {P} \left(w _ {t} \mid w _ {t - 1} , \cdots w _ {t - n + 1}\right)}{\theta}

其中是“学习率”。请注意，大部分参数在每次示例后不需要更新或访问：所有在输入窗口中未出现的词j的词特征C(j)。

模型混合。在我们的实验（见第4节）中，我们发现通过将神经网络的概率预测与插值三元组模型的概率预测相结合，性能得到了提高，无论是使用简单的固定权重0.5、学习权重（在验证集上的最大似然）还是一组根据上下文频率条件化的权重（使用相同的组合三元组、二元组和一元组的程序，这是一个混合）。

3. 并行实现

虽然参数数量能够很好地扩展，即与输入窗口大小和词汇表大小呈线性关系，但获取输出概率所需的计算量远大于n-gram模型所需。主要原因是，对于n-gram模型，获取特定的 $P(w_{t}|w_{t - 1},\dots ,w_{t - n + 1})$ 并不需要计算词汇表中所有单词的概率，因为相对频率的线性组合在模型训练时已经进行了简单的归一化处理。神经实现的主要计算瓶颈是输出层激活值的计算。

Running the model (both training and testing) on a parallel computer is a way to reduce computation time. We have explored parallelization on two types of platforms: shared-memory processor machines and Linux clusters with a fast network.

3.1 Data-Parallel Processing

In the case of a shared-memory processor, parallelization is easily achieved, thanks to the very low communication overhead between processors, through the shared memory. In that case we have chosen a data-parallel implementation in which each processor works on a different subset of the data. Each processor computes the gradient for its examples, and performs stochastic gradient updates on the parameters of the model, which are simply stored in a shared-memory area. Our first implementation was extremely slow and relied on synchronization commands to make sure that each processor would not write at the same time as another one in one of the above parameter subsets. Most of the cycles of each processor were spent waiting for another processor to release a lock on the write access to the parameters.

Instead we have chosen an asynchronous implementation where each processor can write at any time in the shared-memory area. Sometimes, part of an update on the parameter vector by one of the processors is lost, being overwritten by the update of another processor, and this introduces a bit of noise in the parameter updates. However, this noise seems to be very small and did not apparently slow down training.

Unfortunately, large shared-memory parallel computers are very expensive and their processor speed tends to lag behind mainstream CPUs that can be connected in clusters. We have thus been able to obtain much faster training on fast network clusters.

3.2 Parameter-Parallel Processing

If the parallel computer is a network of CPUs, we generally can't afford to frequently exchange all the parameters among the processors, because that represents tens of megabytes (almost 100 megabytes in the case of our largest network), which would take too much time through a local network. Instead we have chosen to parallelize across the parameters, in particular the parameters of the output units, because that is where the vast majority of the computation is taking place, in our architecture. Each CPU is responsible for the computation of the unnormalized probability for a subset of the outputs, and performs the updates for the corresponding output unit parameters (weights going into that unit). This strategy allowed us to perform a parallelized stochastic gradient ascent with a negligible communication overhead. The CPUs essentially need to communicate two informations: (1) the normalization factor of the output softmax, and (2) the gradients on the hidden layer (denoted $a$ below) and word feature layer (denoted $x$ ). All the CPUs duplicate the computations that precede the computation of the output units activations, i.e., the selection of word features and the computation of the hidden layer activation $a$ , as well as the corresponding back-propagation and update steps. However, these computations are a negligible part of the total computation for our networks.

For example, consider the following architecture used in the experiments on the AP (Associated Press) news data: the vocabulary size is $|V| = 17,964$ , the number of hidden units is $h = 60$ , the order of the model is $n = 6$ , the number of word features is $m = 100$ . The total number of numerical operations to process a single training example is approximately $|V|(1 + nm + h) + h(1 + nm) + nm$ (where the terms correspond respectively to the computations of the output units, hidden units, and word

在并行计算机上运行模型（包括训练和测试）是减少计算时间的方法。我们探索了两种平台的并行化：共享内存处理器机器和具有快速网络的Linux集群。

3.1 数据并行处理

对于共享内存处理器，由于处理器之间通过共享内存具有极低的通信开销，并行化很容易实现。在这种情况下，我们选择了一种数据并行实现，其中每个处理器处理数据的不同子集。每个处理器计算其示例的梯度，并对存储在共享内存区域的模型参数执行随机梯度更新。我们的第一个实现非常慢，并依赖于同步命令，以确保每个处理器不会同时向上述参数子集之一写入。每个处理器的大部分周期都花在等待另一个处理器释放对参数写入访问的锁上。

相反，我们选择了一种异步实现，其中每个处理器可以在任何时间向共享内存区域写入。有时，处理器之一对参数向量的一部分更新会被另一个处理器的更新覆盖而丢失，这给参数更新引入了一些噪声。然而，这种噪声似乎非常小，并且显然没有减慢训练速度。

不幸的是，大型共享内存并行计算机非常昂贵，它们的处理器速度往往落后于可以组成集群的主流CPU。因此，我们能够在快速网络集群上获得更快的训练。

3.2 参数并行处理

如果并行计算机是CPU网络，我们通常无法负担在处理器之间频繁交换所有参数，因为这代表数十兆字节（在我们最大的网络中几乎接近100兆字节），通过本地网络传输会花费太多时间。相反，我们选择跨参数并行化，特别是输出单元的参数，因为在我们架构中，绝大多数计算都发生在那里。每个CPU负责计算输出子集的非标准化概率，并执行对应输出单元参数（输入该单元的权重）的更新。这种策略使我们能够以可忽略的通信开销执行并行化随机梯度上升。CPU本质上需要通信两方面的信息：(1) 输出softmax的归一化因子，以及(2) 隐藏层（下文标记为a）和词特征层的梯度（标记为x）。所有CPU都会复制计算输出单元激活之前的计算，即选择词特征和计算隐藏层激活a，以及相应的反向传播和更新步骤。然而，这些计算占我们网络总计算量的可忽略部分。

例如，考虑以下架构，该架构用于AP（美联社）新闻数据上的实验：词汇量大小是 $|V| = 17964$ ，隐藏单元数量是 $\mathrm{h} = 60$ 模型的阶数是 $\mathrm{n} = 6$ ，词特征的个数是 $m = 100$ 。处理单个训练样本所需的数值运算总数大约是 $|V|(1 + nm + h) + h(1 + nm) + nm$ （其中术语分别对应输出单元、隐藏单元和词特征单元的计算）。在这个例子中，计算输出单元加权求和所需的整体计算量的比例因此大约是 $|V| = 17964\%$ 。这个算是近似的，因为不同操作的实际CPU时间不同，但它表明并行化输出单元计算通常是有利的。所有CPU都会重复一个非常小的计算比例，这对所寻求的并行化级别（即几十个处理器）的总计算时间不会造成伤害。如果隐藏单元数量很大，并行化它们的计算也会变得有利可图，但我们没有在我们的实验中研究这种方法。

feature units). In this example the fraction of the overall computation required for computing the weighted sums of the output units is therefore approximately $\frac{|V|(1 + (n - 1)m + h)}{|V|(1 + (n - 1)m + h) + h(1 + (n - 1)m) + (n - 1)m} = 99.7\%$ . This calculation is approximate because the actual CPU time associated with different operations differ, but it shows that it is generally advantageous to parallelize the output units computation. The fact that all CPUs will duplicate a very small fraction of the computations is not going to hurt the total computation time for the level of parallelization sought here, i.e. of a few dozen processors. If the number of hidden units was large, parallelizing their computation would also become profitable, but we did not investigate that approach in our experiments.

The implementation of this strategy was done on a cluster of 1.2 GHz clock-speed Athlon processors (32 x 2 CPUs) connected through a Myrinet network (a low-latency Gigabit local area network), using the MPI (Message Passing Interface) library (Dongarra et al., 1995) for the parallelization routines. The parallelization algorithm is sketched below, for a single example $(w_{t - n + 1},\dots ,w_t)$ , executed in parallel by CPU $i$ in a cluster of $M$ processors. CPU $i$ ( $i$ ranging from 0 to $M - 1$ ) is responsible for a block of output units starting at number $start_i = i\times \lceil |V| / M\rceil$ , the block being of length $\min(\lceil |V| / M\rceil, |V| - start_i)$ .

COMPUTATION FOR PROCESSOR $i$ , example $t$

1. FORWARD PHASE

(a) Perform forward computation for the word features layer:

\begin{array}{l} x (k) \leftarrow C \left(w _ {t - k}\right), \\ x = (x (1), x (2), \dots , x (n - 1)) \\ \end{array}

(b) Perform forward computation for the hidden layer:

\begin{array}{l} o \leftarrow d + H x \\ a \leftarrow \tanh (o) \\ \end{array}

s _ {i} \leftarrow 0

Loop over $j$ in the $i$ -th block

\begin{array}{l} i. y _ {j} \leftarrow b _ {j} + a. U _ {j} \\ i i. \text {I f} (\text {d i r e c t c o n n e c t i o n s}) y _ {j} \leftarrow y _ {j} + x. W _ {j} \\ \quad \text {i i i .} p _ {j} \leftarrow e ^ {y _ {j}} \\ i v. s _ {i} \leftarrow s _ {i} + p _ {j} \\ \end{array}

(d) Compute and share $S = \sum_{i} s_{i}$ among the processors. This can easily be achieved with an MPI Allreduce operation, which can efficiently compute and share this sum.
(e)Normalize the probabilities:

Loop over $j$ in the $i$ -th block, $p_j \gets p_j / S$ .

(f) Update the log-likelihood. If $w_{t}$ falls in the block of CPU $i > 0$ , then CPU $i$ sends $p_{w_t}$ to CPU 0. CPU 0 computes $L = \log p_{w_t}$ and keeps track of the total log-likelihood.

2. BACKWARD/UPDATE PHASE, with learning rate $\varepsilon$ .

(a) Perform backward gradient computation for output units in the $i$ -th block:

clear gradient vectors $\frac{\partial L}{\partial a}$ and $\frac{\partial L}{\partial x}$ .

Loop over $j$ in the $i$ -th block

功能单元）。在这个例子中，计算输出单元的加权求和所需的总体计算量的比例因此大约是

\frac {| V | (1 + (n - 1) m + h)}{| V | (1 + (n - 1) m + h) + h (1 + (n - 1) m) + (n - 1) m} = 0 \%

99.7

这个计算是近似的，因为不同的操作实际关联的CPU时间不同，但它表明并行化输出单元计算通常是有利的。所有CPU都会重复一个非常小的计算比例这一事实，对于这里所寻求的并行化级别（即几十个处理器）来说，不会损害总计算时间。如果隐藏单元的数量很大，并行化它们的计算也会变得有利可图，但我们没有在我们的实验中研究这种方法。

该策略的实现是在一个由 $1.2\mathrm{GHz}$ 时钟速度的 Athlon 处理器集群（ $32 \times 2$ CPU）上完成的，这些处理器通过 Myrinet 网络（一种低延迟千兆局域网）连接，并使用 MPI（消息传递接口）库（Dongarra 等人，1995 年）进行并行化例程。并行化算法的概述如下，以单个示例 $(w_{t-n+1}, \cdots, w_t)$ 为例，该示例在包含 M 个处理器的集群中由 CPU i 并行执行。CPU i（i 的范围从 0 到 M - 1）负责一个从编号 $\text{start}_i = i \times \lceil |V| / M \rceil$ ，开始的输出单元块，该块的长度为 $\min(\lceil |V| / M \rceil, |V| - \text{start}_i)$ 。

处理器i的计算，示例t

1. 前向阶段

(a) 对词特征层执行前向计算：

\begin{array}{l} x (k) \leftarrow C \left(w _ {t - k}\right), \\ x = (x (1), x (2), \dots , x (n - 1)) \\ \end{array}

(b) 对隐藏层执行前向计算：

\begin{array}{l} o \leftarrow d + H x \\ a \leftarrow \tanh (o) \\ \end{array}

s _ {i} \leftarrow 0

遍历第i个块中的j

\begin{array}{l} i. y _ {j} \leftarrow b _ {j} + a. U _ {j} \\ i i. \text {I f} (\text {d i r e c t c o n n e c t i o n s}) y _ {j} \leftarrow y _ {j} + x. W _ {j} \\ \text {i i i .} p _ {j} \leftarrow e ^ {y _ {j}} \\ \text {i v .} s _ {i} \leftarrow s _ {i} + p _ {j} \\ \end{array}

(d) 计算并在处理器之间共享 $S = {}_{i} s_{i}$ 。这可以通过 MPI Allreduce 操作轻松实现，该操作可以高效地计算并共享此和。

(e) 归一化概率:

遍历第 $\mathrm{i}$ 个块中的j， $p_j\gets p_j / S.$

(f) 更新对数似然。如果 $\mathrm{w}_t$ 落在CPUi $>0$ 的块中，则CPUi向CPU0发送 $\mathsf{p}^{w_t}$ 。CPU0计算 $L = \log p_{w_t}$ 并跟踪总对数似然。

2. 反向传播/更新阶段，学习率为。

(a) 对第 i 个块中的输出单元执行反向梯度计算: 清除梯度向量 L a 和

Lx。遍历第i个块中的 $\frac{\partial}{\partial j}$

\begin{array}{l} i. \frac {\partial L}{\partial y _ {j}} \leftarrow 1 _ {j = = w _ {t}} - p _ {j} \\ \mathrm {i i .} b _ {j} \leftarrow b _ {j} + \varepsilon \frac {\partial L}{\partial y _ {j}} \\ \text {I f} (\text {d i r e c t c o n n e c t i o n s}) \frac {\partial L}{\partial x} \leftarrow \frac {\partial L}{\partial x} + \frac {\partial L}{\partial y _ {j}} W _ {j} \\ \frac {\partial L}{\partial a} \leftarrow \frac {\partial L}{\partial a} + \frac {\partial L}{\partial y _ {j}} U _ {j} \\ \text {I f} \left(\text {d i r e c t c o n n e c t i o n s}\right) W _ {j} \leftarrow W _ {j} + \varepsilon \frac {\partial L}{\partial y _ {j}} x \\ U _ {j} \leftarrow U _ {j} + \varepsilon \frac {\partial L}{\partial y _ {j}} a \\ \end{array}

(b) Sum and share $\frac{\partial L}{\partial x}$ and $\frac{\partial L}{\partial a}$ across processors. This can easily be achieved with an MPI Allreduce operation.
(c) Back-propagate through and update hidden layer weights:

Loop over $k$ between 1 and $h$ ,

\begin{array}{l} \frac {\partial L}{\partial o _ {k}} \leftarrow \left(1 - a _ {k} ^ {2}\right) \frac {\partial L}{\partial a _ {k}} \\ \frac {\partial L}{\partial x} \leftarrow \frac {\partial L}{\partial x} + H ^ {\prime} \frac {\partial L}{\partial o} \\ d \leftarrow d + \varepsilon \frac {\partial L}{\partial o} \\ H \leftarrow H + \varepsilon \frac {\partial L}{\partial o} x ^ {\prime} \\ \end{array}

(d) Update word feature vectors for the input words:

Loop over $k$ between 1 and $n - 1$

\begin{array}{l} \bar {C} \left(w _ {t - k}\right) \leftarrow C \left(w _ {t - k}\right) + \varepsilon \frac {\partial L}{\partial x (k)} \\ \text {w h e r e} \frac {\partial L}{\partial x (k)} \text {i s t h e} k \text {- t h b l o c k (o f l e n g t h m)} \text {o f t h e v e c t o r} \frac {\partial L}{\partial x}. \\ \end{array}

The weight decay regularization was not shown in the above implementation but can easily be put in (by subtracting the weight decay factor times the learning rate times the value of the parameter, from each parameter, at each update). Note that parameter updates are done directly rather than through a parameter gradient vector, to increase speed, a limiting factor in computation speed being the access to memory, in our experiments.

There could be a numerical problem in the computation of the exponentials in the forward phase, whereby all the $p_j$ could be numerically zero, or one of them could be too large for computing the exponential (step 1(c)ii above). To avoid this problem, the usual solution is to subtract the maximum of the $y_j$ 's before taking the exponentials in the softmax. Thus we have added an extra Allreduce operation to share among the $M$ processors the maximum of the $y_j$ 's, before computing the exponentials in $p_j$ . Let $q_i$ be the maximum of the $y_j$ 's in block $i$ . Then the overall maximum $Q = \max_i q_i$ is collectively computed and shared among the $M$ processors. The exponentials are then computed as follows: $p_j \gets e^{y_j - Q}$ (instead of step 1(c)ii) to guarantee that at least one of the $p_j$ 's will be numerically non-zero, and the maximum of the exponential's argument is 1.

By comparing clock time of the parallel version with clock time on a single processor, we found that the communication overhead was only 1/15th of the total time (for one training epoch): thus we get an almost perfect speed-up through parallelization, using this algorithm on a fast network.

On clusters with a slow network, it might be possible to still obtain an efficient parallelization by performing the communications every $K$ examples (a mini-batch) rather than for each example. This requires storing $K$ versions of the activities and gradients of the neural network in each processor. After the forward phase on the $K$ examples, the probability sums must be shared among the

\begin{array}{l} i. \frac {\partial L}{\partial y _ {j}} \leftarrow 1 _ {j = = w _ {t}} - p _ {j} \\ \mathrm {i i .} b _ {j} \leftarrow b _ {j} + \varepsilon \frac {\partial L}{\partial y _ {j}} \\ \text {I f} (\text {d i r e c t c o n n e c t i o n s}) \frac {\partial L}{\partial x} \leftarrow \frac {\partial L}{\partial x} + \frac {\partial L}{\partial y _ {j}} W _ {j} \\ \frac {\partial L}{\partial a} \leftarrow \frac {\partial L}{\partial a} + \frac {\partial L}{\partial y _ {j}} U _ {j} \\ \text {I f} \left(\text {d i r e c t c o n n e c t i o n s}\right) W _ {j} \leftarrow W _ {j} + \varepsilon \frac {\partial L}{\partial y _ {j}} x \\ U _ {j} \leftarrow U _ {j} + \varepsilon \frac {\partial L}{\partial y _ {j}} a \\ \end{array}

(b) 在处理器之间求和并共享 Lx 和 La。这可以通过 MPI Allreduce 操作轻松实现。

Loop over $k$ between 1 and $h$ ,

\begin{array}{l} \frac {\partial L}{\partial o _ {k}} \leftarrow \left(1 - a _ {k} ^ {2}\right) \frac {\partial L}{\partial a _ {k}} \\ \frac {\partial L}{\partial x} \leftarrow \frac {\partial L}{\partial x} + H ^ {\prime} \frac {\partial L}{\partial o} \\ d \leftarrow d + \varepsilon \frac {\partial L}{\partial o} \\ H \leftarrow H + \varepsilon \frac {\partial L}{\partial o} x ^ {i} \\ \end{array}

(d) 更新输入词的词特征向量:

Loop over $k$ between 1 and $n - 1$

\begin{array}{l} \bar {C} \left(w _ {t - k}\right) \leftarrow C \left(w _ {t - k}\right) + \varepsilon \frac {\partial L}{\partial x (k)} \\ \text {w h e r e} \frac {\partial L}{\partial x (k)} \text {i s t h e} k \text {- t h b l o c k (o f l e n g t h m)} \text {o f t h e v e c t o r} \frac {\partial L}{\partial x}. \\ \end{array}

上述实现中未显示权重衰减正则化，但可以轻松加入（通过从每个参数中减去权重衰减因子乘以学习率乘以参数值，在每次更新时）。请注意，参数更新是直接进行的，而不是通过参数梯度向量，以提高速度，因为计算速度的限制因素是内存访问，在我们的实验中。

在前向阶段的指数计算中可能会有数值问题，导致所有 $\mathfrak{p}j$ 数值上为零，或者其中一个太大无法计算指数（上述步骤1(c)ii）。为了避免这个问题，通常的解决方案是在取指数前减去 $y_{j}$ 's的最大值。因此，我们增加了一个额外的Allreduce操作，在计算 $\mathfrak{p}j$ 的指数之前，在M个处理器之间共享 $\mathrm{yj}$ 的最大值。令 $\mathbf{q}_i$ 为块i中 $y_{j}$ 's的最大值。然后整体最大 $Q = \max_{i}q_{i}$ 被集体计算并在M个处理器之间共享。指数的计算如下： $p_j\gets e^{y_j - Q}$ （而不是步骤1(c)ii），以确保至少有一个 $p_j$ 's数值上非零，并且指数的参数最大值为1。

通过比较并行版本的时钟时间与单处理器的时钟时间，我们发现通信开销仅占总时间（一个训练周期）的1/15：因此，使用该算法在高速网络上并行化，我们几乎获得了完美的加速效果。

在具有慢网络的集群上，通过每K个样本（一个mini-batch）进行通信而不是每个样本进行通信，仍然有可能获得高效的并行化。这需要在每个处理器上存储K个神经网络的活动和梯度版本。在K个样本的前向阶段之后，概率总和必须在这些处理器之间共享。然后启动K个反向阶段，以获得K个部分梯度向量La和Lx。在处理器之间交换这些梯度向量后，每个处理器可以完成反向阶段并更新参数。这种方法主要节省时间，因为网络通信延迟的节省（传输的数据量相同）。如果K太大，它可能会在收敛时间上损失，原因与批量梯度下降通常比随机梯度下降慢得多相同（LeCun等人，1998年）。

processors. Then the $K$ backward phases are initiated, to obtain the $K$ partial gradient vectors $\frac{\partial L}{\partial a}$ and $\frac{\partial L}{\partial x}$ . After exchanging these gradient vectors among the processors, each processor can complete the backward phase and update parameters. This method mainly saves time because of the savings in network communication latency (the amount of data transferred is the same). It may lose in convergence time if $K$ is too large, for the same reason that batch gradient descent is generally much slower than stochastic gradient descent (LeCun et al., 1998).

4. Experimental Results

Comparative experiments were performed on the Brown corpus which is a stream of 1,181,041 words, from a large variety of English texts and books. The first 800,000 words were used for training, the following 200,000 for validation (model selection, weight decay, early stopping) and the remaining 181,041 for testing. The number of different words is 47,578 (including punctuation, distinguishing between upper and lower case, and including the syntactical marks used to separate texts and paragraphs). Rare words with frequency $\leq 3$ were merged into a single symbol, reducing the vocabulary size to $|V| = 16,383$ .

An experiment was also run on text from the Associated Press (AP) News from 1995 and 1996. The training set is a stream of about 14 million (13,994,528) words, the validation set is a stream of about 1 million (963,138) words, and the test set is also a stream of about 1 million (963,071) words. The original data has 148,721 different words (including punctuation), which was reduced to $|V| = 17964$ by keeping only the most frequent words (and keeping punctuation), mapping upper case to lower case, mapping numeric forms to special symbols, mapping rare words to a special symbol and mapping proper nouns to another special symbol.

For training the neural networks, the initial learning rate was set to $\varepsilon_{o} = 10^{-3}$ (after a few trials with a tiny data set), and gradually decreased according to the following schedule: $\varepsilon_{t} = \frac{\varepsilon_{o}}{1 + rt}$ where $t$ represents the number of parameter updates done and $r$ is a decrease factor that was heuristically chosen to be $r = 10^{-8}$ .

4.1 N-Gram Models

The first benchmark against which the neural network was compared is an interpolated or smoothed trigram model (Jelinek and Mercer, 1980). Let $q_{t} = l(freq(w_{t - 1},w_{t - 2}))$ represents the discretized frequency of occurrence of the input context $(w_{t - 1},w_{t - 2})$ . Then the conditional probability estimates have the form of a conditional mixture:

\hat {P} \left(w _ {t} \mid w _ {t - 1}, w _ {t - 2}\right) = \alpha_ {0} \left(q _ {t}\right) p _ {0} + \alpha_ {1} \left(q _ {t}\right) p _ {1} \left(w _ {t}\right) + \alpha_ {2} \left(q _ {t}\right) p _ {2} \left(w _ {t} \mid w _ {t - 1}\right) + \alpha_ {3} \left(q _ {t}\right) p _ {3} \left(w _ {t} \mid w _ {t - 1}, w _ {t - 2}\right)

with conditional weights $\alpha_{i}(q_{t})\geq 0,\sum_{i}\alpha_{i}(q_{t}) = 1$ . The base predictors are the following: $p_0 = 1 / |V|$ , $p_1(i)$ is a unigram (relative frequency of word $i$ in the training set), $p_2(i|j)$ is the bigram (relative frequency of word $i$ when the previous word is $j$ ), and $p_3(i|j,k)$ is the trigram (relative frequency of word $i$ when the previous 2 words are $j$ and $k$ ). The motivation is that when the frequency of $(w_{t - 1},w_{t - 2})$ is large, $p_3$ is most reliable, whereas when it is lower, the lower-order statistics of $p_2$ , $p_1$ , or even $p_0$ are more reliable. There is a different set of mixture weights $\alpha$ for each of the discrete values of $q_{t}$ (which are context frequency bins). They can be easily estimated with

4. 实验结果

在布朗语料库上进行了对比实验，该语料库包含来自各种英语文本和书籍的1,181,041个单词流。前800,000个单词用于训练，接下来的200,000个用于验证（模型选择、权重衰减、提前停止），剩余的181,041个用于测试。不同单词的数量为47,578（包括标点符号、区分大小写，并包括用于分隔文本和段落的句法标记）。频率低于 $\leq 3$ 的罕见词被合并为一个符号，将词汇量减少到 $|V| = 16383$ 。

还在1995年和1996年由美联社（AP）新闻的文本上进行了实验。训练集是一个约1400万（13,994,528）个单词的流，验证集是一个约100万（963,138）个单词的流，测试集也是一个约100万（963,071）个单词的流。原始数据有148,721个不同单词（包括标点符号），通过仅保留最频繁的单词（并保留标点符号）、将大写映射为小写、将数字形式映射为特殊符号、将罕见词映射为特殊符号以及将专有名词映射为另一个特殊符号，将其减少到 $|\mathrm{V}| = 17964$ 。

为了训练神经网络，初始学习率被设置为 $o = 10^{-3}$ （经过几次使用小数据集的试验后），并根据以下计划逐渐降低： $t = \frac{o}{1 + rt}$ 其中 $t$ 表示完成的参数更新次数， $r$ 是一个启发式选择的降低因子。

4.1 N-gram模型

神经网络首先与一个插值或平滑的三元组模型（Jelinek 和 Mercer，1980）进行了比较。令 $q_{t} = l(freq(w_{t-1}, w_{t-2}))$ 表示输入上下文 $(w_{t-1}, w_{t-2})$ 的离散出现频率。那么条件概率估计的形式为条件混合：

\hat {P} \left(w _ {t} \mid w _ {t - 1}, w _ {t - 2}\right) = \alpha_ {0} \left(q _ {t}\right) p _ {0} + \alpha_ {1} \left(q _ {t}\right) p _ {1} \left(w _ {t}\right) + \alpha_ {2} \left(q _ {t}\right) p _ {2} \left(w _ {t} \mid w _ {t - 1}\right) + \alpha_ {3} \left(q _ {t}\right) p _ {3} \left(w _ {t} \mid w _ {t - 1}, w _ {t - 2}\right)

使用条件权重 $i(q_{t}) \geq 0$ $\alpha_{i}$ $i(q_{t}) = \sum_{\alpha} \alpha$ 基础预测器如下： $p_{0} = 1 / |V|$ $p_{1}(i)$ 是一个一元语法（单词 i 在训练集中的相对频率）， $p_{2}(i|j)$ 是一个二元语法（当前单词是 j 时单词 i 的相对频率），而 $p_{3}(i|j,k)$ 是一个三元语法（当前两个单词是 j 和 k 时单词 i 的相对频率）。其动机在于，当 $(w_{t-1}, w_{t-2})$ 的频率较高时， $p_{3}$ 最可靠，而当其较低时， $p_{2}, p_{1}$ ，甚至 $p_{0}$ 的低阶统计量更可靠。对于 $q_{t}$ 的每个离散值（即上下文频率桶），都有一组不同的混合权重。它们可以很容易地估计出来。

the EM algorithm in about 5 iterations, on a set of data (the validation set) not used for estimating the unigram, bigram and trigram relative frequencies. The interpolated n-gram was used to form a mixture with the MLPs since they appear to make "errors" in very different ways.

Comparisons were also made with other state-of-the-art n-gram models: back-off n-gram models with the Modified Kneser-Ney algorithm (Kneser and Ney, 1995, Chen and Goodman., 1999), as well as class-based n-gram models (Brown et al., 1992, Ney and Kneser, 1993, Niesler et al., 1998). The validation set was used to choose the order of the n-gram and the number of word classes for the class-based models. We used the implementation of these algorithms in the SRI Language Modeling toolkit, described by Stolcke (2002) and in www.speech.sri.com/projects/srilm/. They were used for computing the back-off models perplexities reported below, noting that we did not give a special status to end-of-sentence tokens in the accounting of the log-likelihood, just as for our neural network perplexity. All tokens (words and punctuation) were treated the same in averaging the log-likelihood (hence in obtaining the perplexity).

4.2 Results

Below are measures of test set perplexity (geometric average of $1 / \hat{P} (w_t|w_1^{t - 1}))$ for different models $\hat{P}$ . Apparent convergence of the stochastic gradient ascent procedure was obtained after around 10 to 20 epochs for the Brown corpus. On the AP News corpus we were not able to see signs of overfitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs). Early stopping on the validation set was used, but was necessary only in our Brown experiments. A weight decay penalty of $10^{-4}$ was used in the Brown experiments and a weight decay of $10^{-5}$ was used in the APNews experiments (selected by a few trials, based on validation set perplexity). Table 1 summarizes the results obtained on the Brown corpus. All the back-off models of the table are modified Kneser-Ney n-grams, which worked significantly better than standard back-off models. When $m$ is specified for a back-off model in the table, a class-based n-gram is used ( $m$ is the number of word classes). Random initialization of the word features was done (similarly to initialization of neural network weights), but we suspect that better results might be obtained with a knowledge-based initialization.

The main result is that significantly better results can be obtained when using the neural network, in comparison with the best of the n-grams, with a test perplexity difference of about $24\%$ on Brown and about $8\%$ on AP News, when taking the MLP versus the n-gram that worked best on the validation set. The table also suggests that the neural network was able to take advantage of more context (on Brown, going from 2 words of context to 4 words brought improvements to the neural network, not to the n-grams). It also shows that the hidden units are useful (MLP3 vs MLP1 and MLP4 vs MLP2), and that mixing the output probabilities of the neural network with the interpolated trigram always helps to reduce perplexity. The fact that simple averaging helps suggests that the neural network and the trigram make errors (i.e. low probability given to an observed word) in different places. The results do not allow to say whether the direct connections from input to output are useful or not, but suggest that on a smaller corpus at least, better generalization can be obtained without the direct input-to-output connections, at the cost of longer training: without direct connections the network took twice as much time to converge (20 epochs instead of 10), albeit to a slightly lower perplexity. A reasonable interpretation is that direct input-to-output connections provide a bit more capacity and faster learning of the "linear" part of the mapping from word features to log

EM算法在大约5次迭代中，在一个未用于估计一元、二元和三元相对频率的数据集（验证集）上运行。插值n-gram被用于与MLPs形成混合物，因为它们似乎以非常不同的方式出现“错误”。

同时与其他最先进的 n-gram 模型进行了比较：使用改进的 Kneser-Ney 算法的回退 n-gram 模型（Kneser 和 Ney，1995 年，Chen 和 Goodman，1999 年），以及基于类的 n-gram 模型（Brown 等人，1992 年，Ney 和 Kneser，1993 年，Niesler 等人，1998 年）。验证集用于选择 n-gram 的顺序以及基于类模型的词类数量。我们使用了 SRI 语言建模工具包中这些算法的实现，如 Stolcke（2002 年）所述，可在

www.speech.sri.com/projects/srilm/找到。它们用于计算下面报告的回退模型的困惑度，需要注意的是，在计算对数似然时我们没有给句子结束标记特殊地位，就像我们神经网络困惑度计算一样。在对数似然求平均时（因此是在获得困惑度时）所有标记（单词和标点符号）都同等对待。

4.2结果

以下是测试集困惑度指标（不同模型^P的几何平均值 $1 / \hat{P} (w_t|w_1^{t - 1}))$ 。对于布朗语料库，随机梯度上升过程在大约10到20个epoch后表现出明显的收敛。在AP新闻语料库上，我们没有观察到过拟合的迹象（在验证集上），可能是因为我们只运行了5个epoch（使用40个CPU，历时3周）。我们在验证集上使用了提前停止，但仅在布朗实验中需要。布朗实验中使用了权重衰减惩罚 $10^{-4}$ ，而APNews实验中使用了权重衰减 $10^{-5}$ （通过几次试验选择，基于验证集困惑度）。表1总结了在布朗语料库上获得的结果。表中的所有回退模型都是修改后的Kneser-Ney n-gram，它们的工作效果明显优于标准回退模型。当表中的回退模型指定了m时，会使用基于类的n-gram（m是词类的数量）。词特征进行了随机初始化（类似于神经网络权重的初始化），但我们怀疑使用基于知识的初始化可能会获得更好的结果。

主要结果是，当使用神经网络时，与n-gram的最佳结果相比，可以获得显著更好的结果，在Brown数据集上测试困惑度差异约为 $24\%$ ，在AP新闻数据集上约为 $8\%$ ，这是在将MLP与在验证集上表现最佳的n-gram进行比较时得出的。该表格还表明，神经网络能够利用更多上下文（在Brown数据集上，从2个词的上下文增加到4个词带来了对神经网络的改进，而不是对n-gram的改进）。它还显示隐藏单元是有用的（MLP3与MLP1和MLP4与MLP2的比较），并且混合神经网络的输出概率与插值三元组始终有助于降低困惑度。简单平均有助于这一事实表明，神经网络和三元组在错误发生的位置上有所不同（即给观察到的词分配了低概率）。结果不允许判断输入到输出的直接连接是否有用或无用，但表明至少在小语料库上，没有直接输入到输出的连接也能获得更好的泛化能力，代价是更长的训练时间：没有直接连接时，网络收敛时间翻倍（需要20个epoch而不是10个）尽管困惑度略低。一个合理的解释是，直接输入到输出的连接提供了更多的容量，并能更快地学习从词特征到log-

	n	c	h	m	direct	mix	train.	valid.	test.
MLP1	5		50	60	yes	no	182	284	268
MLP2	5		50	60	yes	yes		275	257
MLP3	5		0	60	yes	no	201	327	310
MLP4	5		0	60	yes	yes		286	272
MLP5	5		50	30	yes	no	209	296	279
MLP6	5		50	30	yes	yes		273	259
MLP7	3		50	30	yes	no	210	309	293
MLP8	3		50	30	yes	yes		284	270
MLP9	5		100	30	no	no	175	280	276
MLP10	5		100	30	no	yes		265	252
Del. Int.	3						31	352	336
Kneser-Ney back-off	3							334	323
Kneser-Ney back-off	4							332	321
Kneser-Ney back-off	5							332	321
class-based back-off	3	150						348	334
class-based back-off	3	200						354	340
class-based back-off	3	500						326	312
class-based back-off	3	1000						335	319
class-based back-off	3	2000						343	326
class-based back-off	4	500						327	312
class-based back-off	5	500						327	312

Table 1: Comparative results on the Brown corpus. The deleted interpolation trigram has a test perplexity that is $33\%$ above that of the neural network with the lowest validation perplexity. The difference is $24\%$ in the case of the best n-gram (a class-based model with 500 word classes). $n$ : order of the model. $c$ : number of word classes in class-based n-grams. $h$ : number of hidden units. $m$ : number of word features for MLPs, number of classes for class-based n-grams. direct: whether there are direct connections from word features to outputs. mix: whether the output probabilities of the neural network are mixed with the output of the trigram (with a weight of 0.5 on each). The last three columns give perplexity on the training, validation and test sets.

probabilities. On the other hand, without those connections the hidden units form a tight bottleneck which might force better generalization.

Table 2 gives similar results on the larger corpus (AP News), albeit with a smaller difference in perplexity $(8\%)$ . Only 5 epochs were performed (in approximately three weeks with 40 CPUs). The class-based model did not appear to help the n-gram models in this case, but the high-order modified Kneser-Ney back-off model gave the best results among the n-gram models.

5. Extensions and Future Work

In this section, we describe extensions to the model described above, and directions for future work.

	n	c	h	m	直接	mix	训练。	验证。	测试。
MLP1	5		50	60	yes	no	182	284	268
MLP2	5		50	60	yes	yes		275	257
MLP3	5		0	60	yes	no	201	327	310
MLP4	5		0	60	yes	yes		286	272
MLP5	5		50	30	yes	no	209	296	279
MLP6	5		50	30	yes	yes		273	259
MLP7	3		50	30	yes	no	210	309	293
MLP8	3		50	30	yes	yes		284	270
MLP9	5		100	30	no	no	175	280	276
MLP10	5		100	30	no	yes		265	252
Del. Int.	3						31	352	336
Kneser-Ney 回退	3							334	323
Kneser-Ney 回退	4							332	321
Kneser-Ney 回退	5							332	321
基于类的回退	3	150						348	334
基于类的回退	3	200						354	340
基于类的回退	3	500						326	312
基于类的回退	3	1000						335	319
基于类的回退	3	2000						343	326
基于类的回退	4	500						327	312
基于类的回退	5	500						327	312

表1：布朗语料库的比较结果。被删除的插值三元组在测试困惑度上比验证困惑度最低的神经网络高 $33\%$ 。在最佳 n-gram 的情况下（一个包含 500 个词类的基于类的模型），差异为 $24\%$ 。n：模型的阶数。c：基于类 n-gram 中的词类数量。h：隐藏单元数量。m：用于 MLP 的词特征数量，或基于类 n-gram 的类数量。direct：是否存在从词特征到输出的直接连接。mix：是否将神经网络的输出概率与三元组的输出混合（每个权重为 0.5）。最后三列给出训练集、验证集和测试集上的困惑度。

概率的“线性”部分映射。另一方面，如果没有这些连接，隐藏单元会形成一个紧密的瓶颈，这可能迫使更好的泛化。

表2在大规模语料库（AP新闻）上给出了相似的结果，尽管困惑度差异较小（ $8\%$ ）。仅进行了5个epoch（大约三周，使用40个CPU）。在这种情况下，基于类的模型似乎并未帮助n-gram模型，但高阶改进的Kneser-Ney回退模型在n-gram模型中取得了最佳结果。

5. 扩展与未来工作

在本节中，我们描述了上述模型的扩展，以及未来工作的方向。

	n	h	m	direct	mix	train.	valid.	test.
MLP10	6	60	100	yes	yes		104	109
Del. Int.	3						126	132
Back-off KN	3						121	127
Back-off KN	4						113	119
Back-off KN	5						112	117

5.1 An Energy Minimization Network

A variant of the above neural network can be interpreted as an energy minimization model following Hinton's recent work on products of experts (Hinton, 2000). In the neural network described in the previous sections the distributed word features are used only for the "input" words and not for the "output" word (next word). Furthermore, a very large number of parameters (the majority) are expanded in the output layer: the semantic or syntactic similarities between output words are not exploited. In the variant described here, the output word is also represented by its feature vector. The network takes in input a sub-sequence of words (mapped to their feature vectors) and outputs an energy function $E$ which is low when the words form a likely sub-sequence, high when it is unlikely. For example, the network outputs an "energy" function

E \left(w _ {t - n + 1}, \dots , w _ {t}\right) = v. \tanh (d + H x) + \sum_ {i = 0} ^ {n - 1} b _ {w _ {t - i}}

where $b$ is the vector of biases (which correspond to unconditional probabilities), $d$ is the vector of hidden units biases, $\nu$ is the output weight vector, and $H$ is the hidden layer weight matrix, and unlike in the previous model, input and output words contribute to $x$ :

x = \left(C \left(w _ {t}\right), C \left(w _ {t - 1}\right), C \left(w _ {t - 2}\right), \dots , C \left(w _ {t - n + 1}\right). \right.

The energy function $E(w_{t - n + 1},\dots ,w_t)$ can be interpreted as an unnormalized log-probability for the joint occurrence of $(w_{t - n + 1},\dots ,w_t)$ . To obtain a conditional probability $\hat{P} (w_{t}|w_{t - n + 1}^{t - 1})$ it is enough (but costly) to normalize over the possible values of $w_{t}$ , as follows:

\hat {P} \left(w _ {t} \mid w _ {t - 1}, \dots , w _ {t - n + 1}\right) = \frac {e ^ {- E \left(w _ {t - n + 1} , \cdots , w _ {t}\right)}}{\sum_ {i} e ^ {- E \left(w _ {t - n + 1} , \cdots , w _ {t - 1} , i\right)}}

Note that the total amount of computation is comparable to the architecture presented earlier, and the number of parameters can also be matched if the $\nu$ parameter is indexed by the identity of the target word $(w_{t})$ . Note that only $b_{w_t}$ remains after the above softmax normalization (any linear function of the $w_{t - i}$ for $i > 0$ is canceled by the softmax normalization). As before, the parameters of the model can be tuned by stochastic gradient ascent on $\log \hat{P} (w_t|w_{t - 1},\dots ,w_{t - n + 1})$ , using similar computations.

In the products-of-experts framework, the hidden units can be seen as the experts: the joint probability of a sub-sequence $(w_{t - n + 1},\dots ,w_t)$ is proportional to the exponential of a sum of terms associated with each hidden unit $j$ , $\nu_{j}\tanh (d_{j} + H_{j}x)$ . Note that because we have chosen to decompose the probability of a whole sequence in terms of conditional probabilities for each element,

Table 2: Comparative results on the AP News corpus. See the previous table for the column labels.

	n	h	m	直接	mix	训练。	验证。	测试。
MLP10	6	60	100	yes	yes		104	109
Del. Int.	3						126	132
Back-off KN	3						121	127
Back-off KN	4						113	119
Back-off KN	5						112	117

表2：AP新闻语料库的比较结果。请参考前表获取列标签。

5.1 能量最小化网络

上述神经网络的变体可以解释为遵循希顿（Hinton）近期关于专家乘积（Hinton, 2000）工作的能量最小化模型。在前几节描述的神经网络中，分布式词特征仅用于“输入”词，而未用于“输出”词（下一个词）。此外，大量参数（大多数）在输出层被扩展：输出词之间的语义或句法相似性并未被利用。在本节描述的变体中，“输出”词也由其特征向量表示。网络输入一个词子序列（映射到其特征向量），并输出一个能量函数E，当词形成可能的子序列时该函数值较低，当不太可能时该函数值较高。例如，网络输出一个“能量”函数

E \left(w _ {t - n + 1}, \dots , w _ {t}\right) = v. \tanh (d + H x) + \sum_ {i = 0} ^ {n - 1} b _ {w _ {t - i}}

其中 b 是偏置向量（对应于无条件概率），d 是隐藏单元偏置向量，v 是输出权重向量，H 是隐藏层权重矩阵，并且与之前的模型不同，输入词和输出词会贡献到 x:

x = \left(C \left(w _ {t}\right), C \left(w _ {t - 1}\right), C \left(w _ {t - 2}\right), \dots , C \left(w _ {t - n + 1}\right). \right.

能量函数 $E(w_{t - n + 1},\dots ,w_t)$ 可以解释为 $(w_{t - n + 1},\dots ,w_t)$ 联合发生的未归一化对数概率。要获得条件概率 $\hat{P} (w_t|w_{t - n + 1}^{t - 1})$ ，只需（但代价高昂）对 $\mathbf{W}_t$ 的可能值进行归一化，如下所示：

\hat {P} \left(w _ {t} \mid w _ {t - 1}, \dots , w _ {t - n + 1}\right) = \frac {e ^ {- E \left(w _ {t - n + 1} , \cdots , w _ {t}\right)}}{\sum_ {i} e ^ {- E \left(w _ {t - n + 1} , \cdots , w _ {t - 1} , i\right)}}

注意，总计算量与之前提出的架构相当，并且如果 $\mathrm{v}$ 参数由目标词的标识 $\mathbf{W}_t$ 索引，参数数量也可以匹配。注意，在上述softmax归一化后，只有 $\mathtt{b}_{w_t}$ 保留下来（任何 $\mathbf{W}_{t - i}$ 的线性函数对于 $\mathrm{i} > 0$ 都会被softmax归一化抵消）。与之前一样，可以通过在log $\hat{P} (w_{t}|w_{t - 1},\dots ,w_{t - n + 1})$ 上的随机梯度上升来调整模型的参数，使用类似的计算。

在专家产品框架中，隐藏单元可以被视为专家：子序列 $(w_{t - n + 1},\dots ,w_t)$ 的联合概率与每个隐藏单元j相关的项之和的指数成正比，即vjtanh $(\mathrm{d}\mathbf{j} + \mathrm{H}\mathbf{j}\mathrm{x})$ 。请注意，因为我们选择将整个序列的概率分解为每个元素的条件下概率，所以

the computation of the gradient is tractable. This is not the case for example with products-of-HMMs (Brown and Hinton, 2000), in which the product is over experts that view the whole sequence, and which can be trained with approximate gradient algorithms such as the contrastive divergence algorithm (Brown and Hinton, 2000). Note also that this architecture and the products-of-experts formulation can be seen as extensions of the very successful Maximum Entropy models (Berger et al., 1996), but where the basis functions (or "features", here the hidden units activations) are learned by penalized maximum likelihood at the same time as the parameters of the features linear combination, instead of being learned in an outer loop, with greedy feature subset selection methods.

We have implemented and experimented with the above architecture, and have developed a speed-up technique for the neural network training, based on importance sampling and yielding a 100-fold speed-up (Bengio and Senécal, 2003).

Out-of-vocabulary words. An advantage of this architecture over the previous one is that it can easily deal with out-of-vocabulary words (and even assign them a probability!). The main idea is to first guess an initial feature vector for such a word, by taking a weighted convex combination of the feature vectors of other words that could have occurred in the same context, with weights proportional to their conditional probability. Suppose that the network assigned a probability $\hat{P}(i|w_{t-n+1}^{t-1})$ to words $i \in V$ in context $w_{t-n+1}^{t-1}$ , and that in this context we observe a new word $j \notin V$ . We initialize the feature vector $C(j)$ for $j$ as follows: $C(j) \gets \sum_{i \in V} C(i) \hat{P}(i|w_{t-n+1}^{t-1})$ . We can then incorporate $j$ in $V$ and re-compute probabilities for this slightly larger set (which only requires a renormalization for all the words, except for word $i$ , which requires a pass through the neural network). This feature vector $C(i)$ can then be used in the input context part when we try to predict the probabilities of words that follow word $i$ .

5.2 Other Future Work

There are still many challenges ahead to follow-up on this work. In the short term, methods to speed-up training and recognition need to be designed and evaluated. In the longer term, more ways to generalize should be introduced, in addition to the two main ways exploited here. Here are some ideas that we intend to explore:

Decomposing the network in sub-networks, for example using a clustering of the words. Training many smaller networks should be easier and faster.
Representing the conditional probability with a tree structure where a neural network is applied at each node, and each node represents the probability of a word class given the context and the leaves represent the probability of words given the context. This type of representation has the potential to reduce computation time by a factor $|V| / \log |V|$ (see Bengio, 2002).
Propagating gradients only from a subset of the output words. It could be the words that are conditionally most likely (based on a faster model such as a trigram, see Schwenk and Gauvain, 2002, for an application of this idea), or it could be a subset of the words for which the trigram has been found to perform poorly. If the language model is coupled to a speech recognizer, then only the scores (unnormized probabilities) of the acoustically ambiguous words need to be computed. See also Bengio and Senécal (2003) for a new accelerated training method using importance sampling to select the words.

梯度计算是可行的。但对于例如HMM的乘积（Brown和Hinton，2000），情况并非如此，其中乘积是对查看整个序列的专家进行的，并且可以使用近似梯度算法（如对比散度算法（Brown和Hinton，2000））进行训练。请注意，这种架构和专家乘积公式可以看作是极其成功的最大熵模型（Berger等人，1996）的扩展，但在其中基函数（或“特征”，此处为隐藏单元激活）是与特征线性组合的参数同时通过惩罚最大似然学习，而不是在外部循环中通过贪婪特征子集选择方法学习。

我们已经实现并测试了上述架构，并基于重要性采样开发了一种神经网络训练加速技术，实现了100倍的加速（Bengio和Senécal，2003年）。

词汇表外词。这种架构相对于之前架构的一个优势是，它可以轻松处理词汇表外的词（甚至可以为其分配概率！）。主要思想是首先通过取其他可能出现在相同上下文中的词的特征向量的加权凸组合（权重与其条件概率成正比）来猜测这种词的初始特征向量。假设网络为上下文 $w_{t - n + 1}^{t - 1}$ 中的V集合中的词i分配了概率 $\hat{P}(i|w_{t - n + 1}^{t - 1})$ ，并且在这个上下文中我们观察到一个新的词j不属于V。我们将词j的特征向量C(j)初始化如下：

$C(j)\gets \underset {i\in V}{C(i)}\hat{P} (i|w_{t - n + 1}^{t - 1})$ 然后我们可以将j加入V中，并重新计算这个略微更大的集合的概率（这只需要对所有词进行重新归一化，除了词i，它需要通过神经网络进行一次传递）。然后可以使用这个特征向量C(i)作为输入上下文的一部分，当我们尝试预测紧随词i之后的词的概率时。

5.2 其他未来工作

这项工作仍有许多后续挑战需要跟进。短期内，需要设计和评估加速训练与识别的方法。从长远来看，除了这里利用的两种主要方式外，还应引入更多泛化方法。以下是我们打算探索的一些想法：

将网络分解为子网络，例如使用词语聚类的方法。训练许多更小的网络应该更容易、更快。
使用树状结构表示条件概率，其中每个节点应用神经网络，每个节点表示给定上下文的词类的概率，叶节点表示给定上下文的词的概率。这种表示方式有可能通过 $|V| / \log |V|$ 的因子减少计算时间（参见Bengio, 2002）。
仅从输出词的一个子集传播梯度。这些词可能是条件概率最高的词（基于更快的模型如三元组，参见 Schwenk 和 Gauvain, 2002，该想法的应用），或者可能是三元组表现不佳的词的子集。如果语言模型与语音识别器耦合，那么只需要计算声学模糊词的分数（未归一化的概率）。另见 Bengio 和 Senécal (2003)，其中介绍了一种使用重要性采样选择词的新加速训练方法。
Introducing a-priori knowledge. Several forms of such knowledge could be introduced, such as: semantic information (e.g., from WordNet, see Fellbaum, 1998), low-level grammatical information (e.g., using parts-of-speech), and high-level grammatical information, e.g., coupling the model to a stochastic grammar, as suggested in Bengio (2002). The effect of longer term context could be captured by introducing more structure and parameter sharing in the neural network, e.g. using time-delay or recurrent neural networks. In such a multi-layered network the computation that has been performed for small groups of consecutive words does not need to be redone when the network input window is shifted. Similarly, one could use a recurrent network to capture potentially even longer term information about the subject of the text.
Interpreting (and possibly using) the word feature representation learned by the neural network. A simple first step would start with $m = 2$ features, which can be more easily displayed. We believe that more meaningful representations will require large training corpora, especially for larger values of $m$ .
Polysemous words are probably not well served by the model presented here, which assigns to each word a single point in a continuous semantic space. We are investigating extensions of this model in which each word is associated with multiple points in that space, each associated with the different senses of the word.

6. Conclusion

The experiments on two corpora, one with more than a million examples, and a larger one with above 15 million words, have shown that the proposed approach yields much better perplexity than a state-of-the-art method, the smoothed trigram, with differences between 10 and $20\%$ in perplexity.

We believe that the main reason for these improvements is that the proposed approach allows to take advantage of the learned distributed representation to fight the curse of dimensionality with its own weapons: each training sentence informs the model about a combinatorial number of other sentences.

There is probably much more to be done to improve the model, at the level of architecture, computational efficiency, and taking advantage of prior knowledge. An important priority of future research should be to improve speed-up techniques5 as well as ways to increase capacity without increasing training time too much (to deal with corpora with hundreds of millions of words or more). A simple idea to take advantage of temporal structure and extend the size of the input window to include possibly a whole paragraph (without increasing too much the number of parameters or computation time) is to use a time-delay and possibly recurrent neural networks. Evaluations of the type of models presented here in applicative contexts would also be useful, but see work already done by Schwenk and Gauvain (2002) for improvements in speech recognition word error rate.

More generally, the work presented here opens the door to improvements in statistical language models brought by replacing "tables of conditional probabilities" by more compact and smoother representations based on distributed representations that can accommodate far more conditioning variables. Whereas much effort has been spent in statistical language models (e.g. stochastic grammars) to restrict or summarize the conditioning variables in order to avoid overfitting, the type of

引入先验知识。可以引入多种此类知识，例如：语义信息（例如来自WordNet，参见Fellbaum，1998年），低级语法信息（例如使用词性），以及高级语法信息，例如将模型与随机语法结合，正如Bengio（2002年）所建议的那样。更长期上下文的影响可以通过在神经网络中引入更多结构和参数共享来捕获，例如使用时间延迟或循环神经网络。在这样的多层网络中，为连续的少量单词执行的计算在网络输入窗口移动时无需重新执行。类似地，可以使用循环网络来捕获关于主题的潜在更长期信息

文本。

解释（并可能使用）神经网络学习的单词特征表示。一个简单的第一步可以开始于 $m = 2$ 特征，这些特征可以更容易地展示。我们相信更有意义的表示将需要大量的训练语料库，特别是对于 $m$ 的较大值
多义词可能无法很好地被此处提出的模型所服务，该模型为每个词在连续语义空间中分配一个单一的点。我们正在研究该模型的扩展，其中每个词都与该空间中的多个点相关联，每个点都与词的不同含义相关联。

6.结论

在两个语料库上的实验——其中一个包含超过一百万个示例，另一个包含超过1500万个单词——已经表明，所提出的方法比最先进的方法（平滑三元组）产生了更好的困惑度，困惑度差异在 $10\%$ 到 $20\%$ 之间。

我们认为这些改进的主要原因在于，所提出的方法允许利用学习的分布式表示来对抗维数灾难，用其自身的武器：每条训练句子都向模型提供了关于组合数量其他句子的信息。

为了改进模型，在架构、计算效率和利用先验知识方面还有更多工作要做。未来研究的一个重要优先事项应该是提高加速技术以及在不显著增加训练时间的情况下增加容量的方法（以处理包含数亿个单词或更多单词的语料库）。一个简单的想法是利用时间结构并将输入窗口的大小扩展到可能包含整个段落（而不显著增加参数数量或计算时间），即使用时间延迟和可能循环神经网络。在应用环境中对此处提出的模型类型的评估也将是有用的，但请参见Schwenk和Gauvain（2002）已经完成的工作，以改进语音识别的单词错误率。

更一般地，这里提出的工作为统计语言模型的改进打开了大门，即通过用基于分布式表示的更紧凑和更平滑的表示来替代“条件概率表”，这些分布式表示能够容纳远更多的条件变量。虽然统计语言模型（例如随机文法）已经投入了大量精力来限制或总结条件变量以避免过拟合，但这里描述的模型将困难转移到了其他地方：需要更多的计算，但计算和内存需求与条件变量的数量线性扩展，而不是指数扩展。

models described here shifts the difficulty elsewhere: many more computations are required, but computation and memory requirements scale linearly, not exponentially with the number of conditioning variables.

ACKNOWLEDGMENTS

The authors would like to thank Léon Bottou, Yann Le Cun and Geoffrey Hinton for useful discussions. This research was made possible by funding from the NSERC granting agency, as well as the MITACS and IRIS networks.

References

D. Baker and A. McCallum. Distributional clustering of words for text classification. In SIGIR'98, 1998.
J.R. Bellegarda. A latent semantic analysis framework for large-span language modeling. In Proceedings of Eurospeech 97, pages 1451-1454, Rhodes, Greece, 1997.
S. Bengio and Y. Bengio. Taking on the curse of dimensionality in joint distributions using neural networks. IEEE Transactions on Neural Networks, special issue on Data Mining and Knowledge Discovery, 11(3):550-557, 2000a.
Y. Bengio. New distributed probabilistic language models. Technical Report 1215, Dept. IRO, Université de Montréal, 2002.
Y. Bengio and S. Bengio. Modeling high-dimensional discrete data with multi-layer neural networks. In S. A. Solla, T. K. Leen, and K-R. Müller, editors, Advances in Neural Information Processing Systems, volume 12, pages 400-406. MIT Press, 2000b.
Y. Bengio and J-S. Senécal. Quick training of probabilistic neural nets by importance sampling. In AISTATS, 2003.
A. Berger, S. Della Pietra, and V. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22:39-71, 1996.
A. Brown and G.E. Hinton. Products of hidden markov models. Technical Report GCNU TR 2000-004, Gatsby Unit, University College London, 2000.
P.F. Brown, V.J. Della Pietra, P.V. DeSouza, J.C. Lai, and R.L. Mercer. Class-based $n$ -gram models of natural language. Computational Linguistics, 18:467-479, 1992.
S.F. Chen and J.T. Goodman. An empirical study of smoothing techniques for language modeling. Computer, Speech and Language, 13(4):359-393, 1999.
S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990.
J. Dongarra, D. Walker, and The Message Passing Interface Forum. MPI: A message passing interface standard. Technical Report <http://www-unix.mcs.anl.gov/mpi>, University of Tennessee, 1995.

这里描述的模型将困难转移到了其他地方：需要更多的计算，但计算和内存需求与条件变量的数量线性扩展，而不是指数扩展。

致谢

作者感谢Léon Bottou、Yann Le Cun和Geoffrey Hinton的有益讨论。这项研究得到了NSERC授予机构的资助，以及MITACS和IRIS网络的支持。

参考文献

J.L. Elman. Finding structure in time. Cognitive Science, 14:179-211, 1990.
C. Fellbaum. WordNet: An Electronic Lexical Database. MIT Press, 1998.
J. Goodman. A bit of progress in language modeling. Technical Report MSR-TR-2001-72, Microsoft Research, 2001.
G.E. Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pages 1-12, Amherst 1986, 1986. Lawrence Erlbaum, Hillsdale.
G.E. Hinton. Training products of experts by minimizing contrastive divergence. Technical Report GCNU TR 2000-004, Gatsby Unit, University College London, 2000.
F. Jelinek and R. L. Mercer. Interpolated estimation of Markov source parameters from sparse data. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice. North-Holland, Amsterdam, 1980.
K.J. Jensen and S. Riis. Self-organizing letter code-book for text-to-phoneme neural network model. In Proceedings ICSLP, 2000.
S.M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-35 (3):400-401, March 1987.
R. Kneser and H. Ney. Improved backing-off for m-gram language modeling. In International Conference on Acoustics, Speech and Signal Processing, pages 181-184, 1995.
Y. LeCun, L. Bottou, G.B. Orr, and K.-R. Müller. Efficient backprop. In G.B. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, pages 9-50. Springer, 1998.
R. Miikkulainen and M.G. Dyer. Natural language processing with modular neural networks and distributed lexicon. Cognitive Science, 15:343-399, 1991.
H. Ney and R. Kneser. Improved clustering techniques for class-based statistical language modelling. In European Conference on Speech Communication and Technology (Eurospeech), pages 973-976, Berlin, 1993.
T.R. Niesler, E.W.D. Whittaker, and P.C. Woodland. Comparison of part-of-speech and automatically derived category-based language models for speech recognition. In International Conference on Acoustics, Speech and Signal Processing, pages 177-180, 1998.
A. Paccanaro and G.E. Hinton. Extracting distributed representations of concepts and relations from positive and negative propositions. In Proceedings of the International Joint Conference on Neural Network, IJCNN'2000, Como, Italy, 2000. IEEE, New York.
F. Pereira, N. Tishby, and L. Lee. Distributional clustering of english words. In 30th Annual Meeting of the Association for Computational Linguistics, pages 183-190, Columbus, Ohio, 1993.

S. Riis and A. Krogh. Improving protein secondary structure prediction using structured neural networks and multiple sequence profiles. Journal of Computational Biology, pages 163-183, 1996.
J. Schmidhuber. Sequential neural text compression. IEEE Transactions on Neural Networks, 7(1): 142-146, 1996.
H. Schutze. Word space. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages pp. 895-902, San Mateo CA, 1993. Morgan Kaufmann.
H. Schwenk and J-L. Gauvain. Connectionist language modeling for large vocabulary continuous speech recognition. In International Conference on Acoustics, Speech and Signal Processing, pages 765-768, Orlando, Florida, 2002.
A. Stolcke. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Statistical Language Processing, Denver, Colorado, 2002.
W. Xu and A. Rudnicky. Can artificial neural network learn language models. In International Conference on Statistical Language Processing, pages M1-13, Beijing, China, 2000.