Who Is the Unicorn Transformer?

This blog introduces a new long-range memory model, the Compressive Transformer, alongside a new benchmark for book-level language modelling, PG-19. We provide the conceptual tools needed to understand this new research in the context of recent developments in memory models and language modelling.

Throughout our lives, we build up memories that are retained over a diverse array of timescales, from minutes to months to years to decades. When reading a book, we can recall characters who were introduced many chapters ago, or in an earlier book in a series, and reason about their motivations and likely actions in the current context. We can even put the book down during a busy week, and pick up from where we left off without forgetting the plotline.


We do not achieve such feats by storing every detail of sensory input we receive about the world throughout our lifetimes. Our brains select, filter, and integrate input stimuli based on factors of relevance, surprise, perceived danger, and repetition. In other words, we compress lifelong experience to a set of salient memories which help us understand the past and better anticipate the future. A major goal of AI researchers is discovering ways of implementing such abilities in computational systems, and designing benchmarks which require complex reasoning over long time-spans.

Memory systems for artificial neural networks have advanced considerably in the past two decades. In this post, we look to past advances to explore why this is such a difficult task, and consider how natural language modelling could offer an effective means of designing better long-range memory systems. We reflect on the necessity for better compressive memory architectures, and sparse memory access mechanisms, to work towards the goal of incorporating lifelong reasoning in our computational systems.

A brief history of memory in deep learning

There is no memory or retentive faculty based on lasting impression. What we designate as memory is but increased responsiveness to repeated stimuli. – Nikola Tesla

One of the earliest and most widely-used memory architectures in present day is a recurrent neural network (RNN) called the Long Short-Term Memory (LSTM). The LSTM maintains a compact memory in the form of a vector of numbers, which it accesses and modifies with gated read, write, and forget operations. It was originally developed on a suite of synthetic tasks that involved learning logical operations on a stream of bits. However, it has since become a ubiquitous model of sequential data: from recognising handwritten notes to predicting the early onset of kidney injury.
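As a sketch of those gated read, write, and forget operations, here is a minimal single-step LSTM cell in NumPy. This is an illustrative implementation, not the original; the stacked weight layout is one common convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step: gated forget, write (input) and read (output) operations.

    x: input vector (D,), h: previous hidden state (H,), c: previous cell memory (H,).
    W: weight matrix of shape (4*H, D+H), b: bias vector of shape (4*H,).
    """
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    f = sigmoid(z[0*H:1*H])        # forget gate: what to erase from memory
    i = sigmoid(z[1*H:2*H])        # input gate: what to write into memory
    o = sigmoid(z[2*H:3*H])        # output gate: what to read out
    g = np.tanh(z[3*H:4*H])        # candidate memory content
    c_new = f * c + i * g          # gated update of the compact memory vector
    h_new = o * np.tanh(c_new)     # gated read of the memory
    return h_new, c_new
```

Applying this step repeatedly over a sequence, with `W` and `b` learned by gradient descent, gives the familiar recurrent model.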

One weakness of the LSTM, and of many contemporary RNNs, is capacity. They are designed so that each unit of memory can influence every other unit in memory with a learnable weight. But this results in a computationally inefficient system: the number of learnable parameters in the model grows quadratically with the memory size. For example, an LSTM with a memory of size 64KB results in parameters of size 8GB. Circumventing this memory capacity bottleneck has been an active research area.
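The 8GB figure can be reproduced with back-of-the-envelope arithmetic, assuming four gates, an input the same size as the hidden state, and 32-bit floats (a sketch; the exact count depends on the LSTM variant):

```python
def lstm_param_bytes(hidden_units, input_units, bytes_per_float=4):
    """Approximate LSTM parameter memory: 4 gates, each with a
    (hidden x (hidden + input)) weight matrix plus a bias vector."""
    per_gate = hidden_units * (hidden_units + input_units) + hidden_units
    return 4 * per_gate * bytes_per_float

# A 64KB memory holds 16,384 32-bit floats.
hidden = 64 * 1024 // 4
print(lstm_param_bytes(hidden, hidden) / 1e9)  # roughly 8.6 (GB)
```

Doubling the memory vector roughly quadruples the parameter count, which is the quadratic growth described above.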


Figure 1. Long-range reasoning is crucial to general intelligence. Here, an agent remembers the existence and location of a key over a long period of time, and recalls this information when a treasure chest is discovered – prompting the agent to return to the remembered location to retrieve the key.

Researchers at DeepMind proposed a novel architecture, the Differentiable Neural Computer (DNC), which augments an LSTM with a much larger memory matrix to address these deficits. The DNC uses an attention operation to read from this memory matrix. In visual attention, our eyes are drawn to pertinent objects in a visual scene – for example, one might typically spend more time observing a friend’s face during an emotional conversation than noticing their shoes. Here, memory models can analogously attend to particular events or data in the past. This attention operation requires a fixed number of parameters, independent of the memory size, and so the memory capacity of the model can be significantly increased.
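That read operation can be sketched as content-based addressing: score every memory row against a query key, softmax the scores, and return the weighted average. This is a simplification – the full DNC also uses temporal-link and usage mechanisms – but it shows why the parameter count is independent of memory size:

```python
import numpy as np

def attention_read(memory, key):
    """Read from a memory matrix by content: score each row against the
    query key, softmax the scores, and return the weighted average row.
    The learnable parameters live in whatever produces the key, not in
    the memory itself, so memory can grow without adding weights."""
    scores = memory @ key                     # (rows,) similarity scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over memory slots
    return weights @ memory                   # blended read vector
```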

Alongside the DNC, recurrent neural networks with an additional attention mechanism were showing promise in the domains of translation and question answering. These models were able to reason over time using two memory structures: a small and compact LSTM memory and a large external memory. More recently, however, researchers at Google Brain proposed the Transformer, which removes the LSTM and uses only attention to transmit information across time.
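The core operation the Transformer keeps is scaled dot-product attention applied to the sequence itself, so every timestep can query every other directly; a minimal sketch:

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product attention over a sequence of T timesteps:
    each position queries every other, so information can flow across
    arbitrary distances in a single step (no recurrence needed)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (T, T) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (T, d) context-mixed values
```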


Figure 2. A visualisation of the neural network’s attention for English to French translation. Source: Attention and Augmented Recurrent Neural Networks, Olah & Carter, 2016.

The Transformer was originally shown to significantly outperform recurrent neural networks for machine translation. However, it has since been applied to a range of applications in natural language processing, from question answering, document summarisation, and sentiment classification to the modelling of natural language – a task that has seen particularly exciting developments over the past year.

Modelling natural language

Finding machine learning tasks which both drive the development of better memory architectures and push us further towards artificial general intelligence is challenging. Statistical language modelling is one such task that we believe could be valuable for both purposes. Language models work by sequentially predicting the next word in a stream of text. They can be used to model existing texts and also to generate novel texts. As they get better at modelling the past, their predictions become more accurate, and the texts they generate become more realistic.
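Formally, such a model factorises the probability of a text with the chain rule, log p(w1…wn) = Σ log p(wi | w1…wi−1); a sketch with a hypothetical conditional-probability function:

```python
import math

def sequence_log_prob(tokens, next_word_prob):
    """Score a text under a language model via the chain rule:
    log p(w1..wn) = sum_i log p(w_i | w_1..w_{i-1}).

    next_word_prob(context, word) is any function returning the model's
    conditional probability of `word` given the `context` tuple."""
    total = 0.0
    for i, w in enumerate(tokens):
        total += math.log(next_word_prob(tuple(tokens[:i]), w))
    return total

# Toy model: uniform over a 4-word vocabulary, ignoring context entirely.
log_p = sequence_log_prob(["the", "cat", "sat"], lambda ctx, w: 0.25)
```

A better model assigns higher conditional probability to the words that actually occur, so this log-probability (and the realism of samples) improves as the model improves.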


In Claude Shannon’s seminal article “A Mathematical Theory of Communication”, published in 1948, which founded the field of information theory, he discussed primitive language models and illustrated how adding more context improves the quality and realism of generated text. He does this by introducing the simplest model of English text, which has no contextual modelling at all – a character-level model which treats each character independently. By sampling characters with their relative frequencies (8% of the time for ‘a’, 1.5% for ‘b’, etc.) we arrive at a nonsensical string:
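This experiment is easy to reproduce – sample each character independently from English letter frequencies. The frequencies below are illustrative round numbers, not Shannon’s exact table:

```python
import random

# Rough English letter frequencies, in percent (illustrative values only;
# a space is included so "words" of plausible length appear).
freqs = {"e": 12.7, "t": 9.1, "a": 8.2, "o": 7.5, "i": 7.0, "n": 6.7,
         "s": 6.3, "h": 6.1, "r": 6.0, " ": 18.0}

def zero_context_sample(n, rng=random):
    """Sample n characters independently -- no context at all."""
    letters = list(freqs)
    weights = list(freqs.values())
    return "".join(rng.choices(letters, weights=weights, k=n))

print(zero_context_sample(40))  # nonsensical, but letter statistics look English-ish
```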

However, he remarks on the improvement in sample quality if one instead models the probability of words independently. Now the modelled context is approximately 7X larger (the average number of characters in a word):


By modelling the probability of word pairs, a further 2X increase in context length, even more realistic text emerges:
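The word-pair (bigram) version can be sketched by sampling each next word conditioned only on the previous one; the tiny corpus below is a stand-in for the large body of text Shannon tabulated:

```python
import random
from collections import defaultdict

def train_bigrams(words):
    """Count, for each word, which words follow it and how often
    (repeats in the list act as frequency weights)."""
    table = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table

def sample_bigram_text(table, start, n, rng=random):
    """Generate n words, each sampled given only the previous word."""
    out = [start]
    for _ in range(n - 1):
        followers = table.get(out[-1])
        if not followers:              # dead end: restart from any known word
            followers = list(table)
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the head and in frontal attack on an english writer".split()
table = train_bigrams(corpus)
print(sample_bigram_text(table, "the", 8))
```

Extending the conditioning window from one previous word to two, three, and beyond continues the trend Shannon observed: more context, more realistic text.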


In other words, an increase in the length of context leads to an improvement in the quality of text generated. Shannon remarks on the quality of his produced samples and conjectures that natural text samples may emerge from a sufficiently complex statistical model: “The particular sequence of ten words ‘attack on an English writer that the character of this’ is not at all unreasonable. It appears then that a sufficiently complex stochastic process will give a satisfactory representation of a discrete source.”

One criticism of language modelling as a task for long-range reasoning is that models can capture a large portion of their predictions from the local context. Neural language models have traditionally ignored the wider context, focusing mostly on the short term. For example, in 2017 Daniluk et al. found their neural language model rarely attends beyond the preceding five words. However, in the past year large Transformer models have been shown to make use of hundreds of words of context to generate ever-more realistic text with a longer range of coherence. Samples from OpenAI’s GPT-2, a 1.5B-parameter Transformer, indicate that the model is able to generate realistic text and retain key entities (e.g. Dr Jorge Pérez and unicorns) across multiple paragraphs:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA.

“But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

Transferring knowledge

Such samples would likely astound Shannon, 70 years on from his early language model experiments. However, the real benefit of powerful neural language models – and their relevance to the goal of AGI – is their ability to transfer knowledge to a suite of tasks. In the process of learning how to model text, neural language models appear to build up a knowledge base of associations, and a plethora of skills.

For instance, researchers at OpenAI showed that GPT-2 can be applied to natural-language processing tasks such as question answering, paraphrasing, or sentiment analysis with surprisingly good performance – especially for a model that has never been explicitly trained to perform such tasks. When large Transformer language models are fine-tuned on particular tasks such as question answering, the resulting performance is significantly better than that of models designed and trained solely for question answering. Google’s prominent natural language model, BERT, achieves state-of-the-art performance on a wide array of NLP benchmarks, and is now a part of Google Search. And more recently, it was shown that GPT-2 can learn to play rudimentary chess by training it on strings of game moves.

Benchmarking language models

A popular long-range language model benchmark is WikiText-103, which is comprised of English-language Wikipedia articles, and was developed by researchers at Salesforce AI. Articles are around 3,600 words on average, which, at the time of creation, was far beyond the memory window of state-of-the-art models.

However, researchers at Google recently showed that a Transformer variant called the TransformerXL – which maintains a memory of past network activations, and recently obtained state-of-the-art results on WikiText-103 – can make use of contexts spanning over one thousand words. This raises the question: will models soon saturate these benchmarks? As such, we’ve compiled and released a new, longer-range language model benchmark based on books.
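The mechanism of carrying past activations forward can be sketched as a sliding cache that is prepended to the current segment before attention is applied. This is a simplification of Transformer-XL, which also stops gradients through the cache and uses relative positional encodings:

```python
import numpy as np

def extend_with_memory(memory, segment, mem_len):
    """Concatenate cached activations from past segments with the current
    segment's activations (giving attention a longer effective context),
    then slide the cache window forward for the next segment."""
    if memory is None:
        context = segment
    else:
        context = np.concatenate([memory, segment], axis=0)
    new_memory = context[-mem_len:]   # keep only the most recent mem_len steps
    return context, new_memory
```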

A new dataset for long-term memory research

Books provide a rich context for the development of long-range memory models. We selected a subset of approximately 28,000 books from Project Gutenberg published before 1919. Unlike prior language modelling dataset releases, we apply very little pre-processing to the text. For example, we do not limit the vocabulary size of the data or censor numbers, to avoid the filtering of useful information.

PG-19 is over double the size of prior language modelling benchmarks, such as the Billion Word Benchmark, and contains text that is over 10X longer in context than the prior long-range language model benchmark, WikiText-103. We provide a comparative table of existing language modelling benchmarks, below: