Knowledge-Enhanced Language Modeling

Date: September 20, 2019
Presenter: Tal Schuster
Remarks: This is the first lab meeting of the season, so we have included a lot of background information for the newcomers. Feel free to skip the motivation if desired.
Note: The $\LaTeX$ source for this post may be found here.

Language modeling is one of the most fundamental tasks in natural language processing. The goal is simple: given a prefix of words $w_1,w_2,\dots,w_i$, estimate the conditional likelihood,

\[P(w_{i+1} \mid w_1,w_2,\dots,w_i).\]
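As a minimal sketch of what estimating this quantity can look like, the snippet below uses a count-based bigram approximation, which conditions only on the previous word $w_i$ rather than on the full prefix. The function names and the toy corpus are illustrative assumptions, not part of any model discussed in this post.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each word follows each one-word prefix."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_word_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen prefixes."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0


# Toy corpus (hypothetical); "the" is followed by "cat" twice and "mat" once.
corpus = "the cat sat on the mat and the cat slept".split()
counts = train_bigram(corpus)
print(next_word_prob(counts, "the", "cat"))
```

Real count-based models use longer prefixes and smoothing for unseen word sequences; this sketch omits both to keep the estimate itself visible.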

In the old days, this conditional likelihood was estimated using statistical methods. In recent years, neural models and big data have improved upon classical methods by leaps and bounds. To illustrate, the following text was generated by one such model, GPT-2 [1].

Prompt: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.
Language model: The scientist named the population, after their distinctive horn, Ovid's Unicorn. These four-horned, silver-white unicorns were previously unknown to science.
Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.
Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

At first glance, this story about Andean unicorns is quite remarkable. It reads smoothly, with few grammatical faults. To a naive reader, it even approximately resembles a real article! Sadly for the unicorns, it’s not real news—but we might wonder, how well can language models produce true text? As it turns out, not too well.

A recent paper investigated whether language models learn knowledge-base-like information [2]. That is, we can ask a language model to complete the prefix: The official language of Mauritius is ___. The most probable answer is English, which is, amazingly, correct.

Unfortunately, this success is more the exception than the norm. If we ask a language model: JDK is developed by ___, the most probable answers are Apple, Nokia, Sony, Samsung, and Intel. A human might guess much the same and be no more correct (the true answer is Oracle). So as Yoav Goldberg tweets:

"no, [language models] really CANNOT replace knowledge bases yet. they recover correctly only a fraction of the facts in the KB, are restricted to single tokens, and when they don't know the answer they just make something up." @yoavgo

But can we generate better facts, using knowledge bases?

Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling

Authors: Robert L. Logan IV, Nelson F. Liu, Matthew E. Peters, Matt Gardner and Sameer Singh

If we ask GPT-2 to complete the prefix Barack's wife ___, its probable guesses often include Hillary, who is often mentioned alongside Barack, but is in no way the same person as Michelle Obama.


This paper [3] attempts to address the issue by directly generating text from facts, grounded in a knowledge graph. As a modification on the traditional language model objective, the authors propose to estimate the following:

\[P\left(w_{i+1}, \varepsilon_{i+1} \mid w_{\le i}, \varepsilon_{\le i}\right)\]

where $\varepsilon$ represents an entity in the knowledge graph and the set

\[\mathcal{K}\mathcal{G}_{\le i} = \left\{\varepsilon_t \mid t \le i\right\}\]

is the local knowledge graph induced by the entities that appear up to and including word $i$. The generative process is as follows (figure below).

  1. Decide whether word $w_{i+1}$ refers to a new entity, refers to an existing entity, or is not an entity.

  2. If $w_{i+1}$ is a new entity, select an entity $\varepsilon_{i+1}$ from the set of all entities and add $\varepsilon_{i+1}$ to $\mathcal{K}\mathcal{G}$.

  3. If $w_{i+1}$ is an existing entity, choose an existing entity as the parent, choose one of the parent’s relations, and choose the target of that relation as $\varepsilon_{i+1}$.

  4. Generate $w_{i+1}$ conditioned on $\varepsilon_{i+1}$.
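The four steps above can be sketched as a single sampling function. This is only an illustration under strong simplifying assumptions: the toy graph `TOY_KG`, the word list, and the uniform random choices are all hypothetical stand-ins for the learned distributions in [3], and step 4 simply emits the entity name instead of generating a word from a vocabulary.

```python
import random

# Hypothetical toy knowledge graph: entity -> {relation: target entity}.
TOY_KG = {
    "Barack Obama": {"spouse": "Michelle Obama", "birthplace": "Honolulu"},
    "Michelle Obama": {"spouse": "Barack Obama", "birthplace": "Chicago"},
    "Honolulu": {},
    "Chicago": {},
}
PLAIN_WORDS = ["the", "of", "is", "and"]  # stand-in for non-entity words


def generate_step(local_entities, rng):
    """Sample one (word, entity) pair; entity is None for non-entity words."""
    # Step 1: decide the mention type (uniform here; learned in the real model).
    parents = [e for e in local_entities if TOY_KG[e]]
    kinds = ["none", "new"] + (["existing"] if parents else [])
    kind = rng.choice(kinds)

    if kind == "none":
        return rng.choice(PLAIN_WORDS), None

    if kind == "new":
        # Step 2: select an entity from the set of all entities.
        entity = rng.choice(sorted(TOY_KG))
    else:
        # Step 3: choose a parent, one of its relations, and take the target.
        parent = rng.choice(parents)
        relation = rng.choice(sorted(TOY_KG[parent]))
        entity = TOY_KG[parent][relation]

    # Record the entity in the local knowledge graph.
    if entity not in local_entities:
        local_entities.append(entity)

    # Step 4: generate the word conditioned on the entity (here: its name).
    return entity, entity


rng = random.Random(0)
local_entities = []
steps = [generate_step(local_entities, rng) for _ in range(10)]
for word, entity in steps:
    print(repr(word), "<-", entity)
```

Keeping the local graph as an ordered list makes the sampling reproducible for a fixed seed; a real implementation would score each choice with the model rather than sampling uniformly.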

At inference time, the authors do not assume access to all entity annotations in text. Furthermore, the traditional language modeling objective only seeks to predict $P(w)$, not $P(w,\varepsilon)$. So the authors marginalize over entities: given a proposal distribution $Q(\varepsilon \mid w)$,

\[P(w) = \sum_\varepsilon P(w,\varepsilon) = \sum_\varepsilon \frac{P(w,\varepsilon)Q(\varepsilon\mid w)}{Q(\varepsilon\mid w)} \approx \frac{1}{N} \sum_{\varepsilon\sim Q} \frac{P(w,\varepsilon)}{Q(\varepsilon\mid w)}.\]

The final approximation is an importance-sampling estimate, where $N$ is the number of entity samples drawn from $Q$.
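To see the estimator at work, here is a small numerical sketch. The joint probabilities and the proposal distribution are made-up numbers over three hypothetical entities for one fixed text $w$; in [3] both come from trained models. With so few entities the exact sum is available, so the sampled estimate can be compared against it.

```python
import random

# Hypothetical values for one fixed text w and three candidate entities.
JOINT = {"e1": 0.10, "e2": 0.03, "e3": 0.02}   # P(w, e)
PROPOSAL = {"e1": 0.6, "e2": 0.3, "e3": 0.1}   # Q(e | w), sums to 1


def exact_marginal():
    """P(w) computed by summing the joint over every entity."""
    return sum(JOINT.values())


def importance_estimate(num_samples, rng):
    """Average of P(w, e) / Q(e | w) over entities sampled from Q."""
    entities = list(PROPOSAL)
    weights = [PROPOSAL[e] for e in entities]
    samples = rng.choices(entities, weights=weights, k=num_samples)
    return sum(JOINT[e] / PROPOSAL[e] for e in samples) / num_samples


rng = random.Random(0)
estimate = importance_estimate(10_000, rng)
print("exact:", exact_marginal(), "estimate:", estimate)
```

The estimate is unbiased for any proposal that puts nonzero mass wherever the joint does, but its variance shrinks as $Q$ gets closer to being proportional to $P(w,\varepsilon)$.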


The authors construct a dataset they dub Linked WikiText-2, which consists of WikiText-2 articles linked to the WikiData knowledge graph. This dataset is convenient for comparison against models trained on WikiText-2, and the inherent connection between the texts and WikiData ensures good coverage.

The following metrics were used for evaluation.

  • The model achieves a perplexity of 44.1 and an unknown-penalized perplexity of 88.5; the latter penalizes the use of [UNK] tokens. These values are about half those of the baselines.

  • The model is also evaluated for fact completion performance. That is, given a prefix, can the model predict the next word, consistent with existing knowledge? Compared to a vanilla GPT-2, the model achieves much higher top-1 and top-5 accuracy (e.g. 94/95 vs. 14/14 on birth location).
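For readers new to these metrics, the sketch below computes perplexity from per-token probabilities. The unknown-penalized variant is written under the assumption that the probability assigned to each [UNK] token is divided evenly among the out-of-vocabulary word types; that definition is an assumption here, so consult [3] for the exact formulation. The token probabilities and the count of out-of-vocabulary types are made-up numbers.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def unknown_penalized_perplexity(tokens, token_probs, num_unk_types):
    """Perplexity after splitting each [UNK] probability across OOV types.

    Assumption: the penalty divides P([UNK]) by the number of distinct
    word types that map to [UNK].
    """
    adjusted = [
        p / num_unk_types if tok == "[UNK]" else p
        for tok, p in zip(tokens, token_probs)
    ]
    return perplexity(adjusted)


# Hypothetical model probabilities for a three-token sequence.
tokens = ["the", "[UNK]", "sat"]
probs = [0.5, 0.25, 0.5]
ppl = perplexity(probs)
upp = unknown_penalized_perplexity(tokens, probs, num_unk_types=8)
print(ppl, upp)
```

Under this definition a model cannot lower its score by routing probability mass to [UNK], which is why the penalized value is never smaller than plain perplexity.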

Personal Thoughts

I like the idea behind a lot of works that attempt to integrate structured knowledge into natural language generation. However, I find that many of these works result in exceedingly complex or engineered architectures. Maybe this is a matter of personal preference, but I’m not a fan of the many components required to select entities, render entities into text, and “marginalize out” the knowledge graph at inference time.


  1. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” 2019.

  2. F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel, “Language models as knowledge bases?,” 2019.

  3. R. Logan, N. Liu, M. Peters, M. Gardner, and S. Singh, “Barack’s wife Hillary: Using knowledge graphs for fact-aware language modeling,” CoRR, vol. abs/1906.07241, 2019.