Fighting the AI Winter and Seeding Modern Intelligence
The University of Toronto emerged as an unlikely stronghold against the skepticism that swept artificial intelligence during the prolonged "AI winters" of the late 1970s through the 1990s. Led by Geoffrey Hinton and his pioneering research group, the Toronto AI community defied the prevailing academic consensus to nurture revolutionary concepts in neural networks, emergent behavior, and attention mechanisms, ultimately laying the groundwork for today's AI renaissance and for the research lineage that produced the seminal 2017 "Attention Is All You Need" paper.
During the first AI winter (1974-1980) and its second wave (late 1980s-1990s), most of the artificial intelligence community abandoned neural networks in favor of symbolic AI and expert systems. However, a small cadre of researchers at the University of Toronto, centered around Geoffrey Hinton's lab, maintained faith in the biological principles underlying neural computation.
Geoffrey Hinton, who joined the University of Toronto in 1987 after becoming disillusioned with DARPA's focus on immediate military applications, established Toronto as a sanctuary for neural network research during this academic exile. Unlike mainstream AI researchers who dismissed neural networks as computationally intractable and theoretically limited, Hinton's group pursued what many considered a scientific dead end.
The Toronto group's contrarian approach was rooted in biological inspiration. While expert systems built on explicit symbolic reasoning dominated AI research, Hinton and his colleagues believed that intelligence emerged from the collective behavior of simple, interconnected units—much like neurons in biological brains. This connectionist philosophy directly challenged the prevailing "Good Old Fashioned AI" (GOFAI) paradigm that had captured most research funding and academic attention.
The connectionists' persistence paid dividends in 1986, when Hinton, then still at Carnegie Mellon, together with David Rumelhart and Ronald Williams, popularized and refined the backpropagation algorithm. This breakthrough provided an efficient method for training multilayer neural networks, sidestepping the limitations of single-layer perceptrons that Marvin Minsky and Seymour Papert had identified in their influential 1969 book "Perceptrons".
Backpropagation represented more than a technical advance—it embodied the biological principle that learning emerges from the iterative adjustment of synaptic connections. The algorithm demonstrated how complex behaviors could arise from simple local rules, presaging later discoveries about emergent behavior in artificial systems.
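To make the mechanism concrete, here is a minimal Python/NumPy sketch of backpropagation for a tiny two-layer network learning XOR; the architecture, learning rate, and data are illustrative choices rather than details from the 1986 work. The key point is that each weight update uses only locally available activations and an error signal propagated backward through the network.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, a task a single-layer perceptron provably cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Two-layer network: 2 inputs -> 4 hidden sigmoid units -> 1 sigmoid output.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 1.0

for step in range(10000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)              # hidden activations
    out = sigmoid(h @ W2 + b2)            # network predictions

    # Backward pass: propagate the squared-error signal layer by layer.
    d_out = (out - y) * out * (1 - out)   # gradient at the output pre-activation
    d_h = (d_out @ W2.T) * h * (1 - h)    # gradient at the hidden pre-activation

    # Each update combines a local activation with the propagated error.
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())               # typically approaches [0, 1, 1, 0]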
The University of Toronto became a hub for this connectionist revival, attracting researchers who shared Hinton's vision of biologically-inspired AI. The Toronto Machine Learning Group fostered collaboration between computer scientists, cognitive scientists, and neuroscientists, creating an interdisciplinary environment that would prove crucial for later breakthroughs.
Emergent behavior represents one of the most profound concepts connecting biological systems to artificial intelligence. In biological terms, emergence describes how complex phenomena arise from the interactions of simpler components—a principle explored in neuroscience at least since Donald Hebb's 1949 work on cell assemblies. The Toronto group pioneered the application of these biological insights to computational systems.
In neural networks, emergent behavior manifests when simple artificial neurons, through their collective interactions, generate capabilities not explicitly programmed into individual units. For example, a neural network trained on image recognition may spontaneously develop internal representations of edges, textures, and object parts—features never explicitly taught but emerging from the network's attempts to minimize prediction errors.
The Toronto researchers understood that emergence occurs through structural nonlinearity—the phenomenon where the output of a system cannot be predicted by simply summing the outputs of its individual components. This nonlinearity is crucial because it allows neural networks to capture complex, non-obvious patterns in data that linear models cannot detect.
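A toy illustration of the point, with purely illustrative numbers: a linear unit responds to a sum of inputs with the sum of its responses, while even a single tanh unit breaks that superposition.

import numpy as np

w = np.array([1.5, -2.0])
a = np.array([0.6, 0.1])
b = np.array([-0.3, 0.8])

linear_unit = lambda x: w @ x            # purely linear response
tanh_unit = lambda x: np.tanh(w @ x)     # a single nonlinear "neuron"

print(linear_unit(a + b), linear_unit(a) + linear_unit(b))   # equal: superposition holds
print(tanh_unit(a + b), tanh_unit(a) + tanh_unit(b))         # differ: nonlinearity breaks additivity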
Modern research has identified several key mechanisms through which emergent behavior arises in neural networks, including nonlinear interactions among units, error-driven learning that reshapes internal representations, and the sheer scale of networks and training data.
The Toronto group drew heavily from neuroscience and cognitive science to understand how emergence occurs in biological systems. In the human brain, consciousness itself may be an emergent property arising from the complex interactions of billions of neurons, none of which possesses consciousness individually.
This biological perspective influenced their approach to artificial neural networks. Just as the brain's information processing emerges from local interactions between neurons, the Toronto researchers designed AI systems where global intelligence emerges from local computational rules.
The concept of bio-inspired computing became central to their work, incorporating principles from cellular computation, DNA processing, and neural dynamics. This interdisciplinary approach distinguished the Toronto school from purely engineering-focused AI research and provided theoretical grounding for their empirical discoveries.
The attention mechanism that revolutionized AI has deep roots in both neuroscience and the Toronto AI community's research tradition. Human attention—the ability to selectively focus on relevant information while filtering irrelevant details—provided the biological inspiration for computational attention.
Early computational attention mechanisms appeared in the 1990s with "fast weight controllers" and sigma-pi units that created dynamic connections similar to attention in biological systems. However, the breakthrough came in 2014 when Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio (part of the broader Toronto-Montreal AI network) introduced the attention mechanism for neural machine translation.
The Bahdanau attention mechanism addressed a fundamental limitation of sequence-to-sequence models: their inability to effectively process long sequences due to the fixed-length bottleneck in encoder-decoder architectures. By allowing the decoder to selectively attend to different parts of the input sequence, attention mechanisms enabled models to capture long-range dependencies and contextual relationships.
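The mechanism itself is simple to sketch. The NumPy example below, whose shapes, parameter names, and random values are illustrative rather than taken from the original implementation, scores the current decoder state against every encoder state with a small feed-forward network, normalizes the scores with a softmax, and forms the context vector as the weighted average of the encoder states.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(dec_state, enc_states, W_dec, W_enc, v):
    # Bahdanau-style (additive) attention over encoder states.
    # dec_state:  (d_dec,)    current decoder hidden state
    # enc_states: (n, d_enc)  one hidden state per source position
    scores = np.array([
        v @ np.tanh(W_dec @ dec_state + W_enc @ h) for h in enc_states
    ])
    weights = softmax(scores)          # how strongly to attend to each position
    context = weights @ enc_states     # weighted average of encoder states
    return context, weights

# Toy example with random parameters.
rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 6, 8, 8, 16
enc_states = rng.normal(size=(n, d_enc))
dec_state = rng.normal(size=d_dec)
W_enc = rng.normal(size=(d_att, d_enc))
W_dec = rng.normal(size=(d_att, d_dec))
v = rng.normal(size=d_att)

context, weights = additive_attention(dec_state, enc_states, W_dec, W_enc, v)
print(weights.round(3), weights.sum())   # non-negative weights that sum to 1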
While the Bahdanau attention mechanism represented a crucial step forward, the revolutionary breakthrough came with the 2017 "Attention Is All You Need" paper, authored by eight researchers at Google, one of whom, Aidan Gomez, was affiliated with the University of Toronto. This paper introduced the Transformer architecture, which relied entirely on attention mechanisms and dispensed with recurrent and convolutional layers altogether.
The Transformer's key innovation was self-attention—a mechanism that allows each position in a sequence to attend to all positions in the same sequence. This enables parallel processing and captures complex dependencies regardless of distance between elements. The multi-head attention mechanism further enhanced this capability by allowing the model to focus on different types of relationships simultaneously.
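The core computation is compact. Below is a hedged NumPy sketch of scaled dot-product self-attention with multiple heads; the dimensions and random projection matrices are illustrative and omit the residual connections, layer normalization, masking, and positional encodings of the full Transformer.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Scaled dot-product self-attention for one head.
    # X: (n, d_model) token representations; every position attends to every position.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) pairwise similarities
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # context-mixed representations

# Multi-head attention: several heads run in parallel and are concatenated.
rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
X = rng.normal(size=(n, d_model))

heads = []
for _ in range(n_heads):
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(self_attention(X, W_q, W_k, W_v))

W_o = rng.normal(size=(d_model, d_model))
output = np.concatenate(heads, axis=-1) @ W_o  # (n, d_model)
print(output.shape)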
The biological inspiration remained evident in the Transformer's design. Just as human attention can rapidly shift focus between different aspects of a scene, self-attention allows neural networks to dynamically weight the importance of different input elements. This flexibility enables the emergence of sophisticated language understanding and generation capabilities.
The Transformer architecture became the foundation for nearly all subsequent breakthroughs in artificial intelligence. Large language models like BERT, GPT, and their successors all rely on attention mechanisms that grew out of this broader research lineage.
The attention mechanism exemplifies how emergent behavior arises in modern AI systems. While individual attention heads perform simple similarity computations, their collective behavior generates sophisticated understanding of language, context, and meaning. This emergence of complex capabilities from simple components validates the biological principles that the Toronto group championed during the AI winter years.
The Toronto group's theoretical work on neural networks and emergent behavior received dramatic empirical validation in 2012 with AlexNet, a deep convolutional neural network developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. This breakthrough demonstrated that the biological principles underlying neural network research could achieve unprecedented performance on real-world tasks.
AlexNet's victory in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 marked a watershed moment in AI history. The network achieved a top-5 error rate of 15.3%, dramatically outperforming the second-place entry's 26.2% error rate. This performance gap was so significant that it immediately convinced the AI community of deep learning's potential.
The technical innovations in AlexNet embodied the Toronto school's principles: deep stacks of convolutional layers trained end to end on GPUs, ReLU activations that made training deep networks practical, and dropout regularization, another Hinton-group contribution, to combat overfitting.
AlexNet demonstrated how emergent behavior manifests in deep neural networks. The network's convolutional layers spontaneously learned to detect edges in early layers, textures in middle layers, and complex object parts in deeper layers—a hierarchical organization reminiscent of the visual cortex in biological systems.
This emergent feature hierarchy was not programmed explicitly but arose from the network's attempt to minimize classification error across millions of training images. The Toronto researchers had theorized that such emergent organization would occur, but AlexNet provided the first compelling empirical evidence at scale.
The biological inspiration remained evident in AlexNet's architecture. Like neurons in the visual cortex, the network's artificial neurons developed receptive fields tuned to specific visual features. This bio-inspired design enabled the network to achieve unprecedented accuracy on large-scale image recognition benchmarks.
AlexNet's success triggered what many consider the modern AI renaissance. The dramatic performance improvement convinced researchers worldwide that neural networks represented the future of artificial intelligence, ending the skepticism toward connectionist approaches that had lingered since the AI winters of the 1970s and late 1980s.
The Toronto group's victory validated their long-standing belief in the power of emergent behavior and biological inspiration. Neural networks, once dismissed as computational curiosities, became the dominant paradigm in AI research. The principles that Hinton and his colleagues had championed during the AI winter—connectionism, emergent behavior, and biological inspiration—suddenly became mainstream.
The 2017 "Attention Is All You Need" paper represented the culmination of decades of research by the Toronto school and their intellectual descendants. The paper's eight authors, while working at Google, embodied the interdisciplinary, bio-inspired approach that characterized Toronto AI research.
The Transformer architecture synthesized key insights from the Toronto tradition: distributed representations learned end to end, emergent specialization arising from simple interacting units, and attention modeled on the selectivity of biological perception.
The multi-head attention mechanism exemplifies how emergent behavior arises in modern AI systems. Each attention head learns to focus on different types of relationships—some attend to syntactic dependencies, others to semantic associations, and still others to positional information. The collective behavior of these attention heads generates sophisticated understanding that no individual head possesses.
This emergence of diverse specializations without explicit programming validates the Toronto school's theoretical predictions about neural network behavior. The attention mechanism demonstrates how complex cognitive capabilities can arise from the interaction of simple computational units, paralleling the emergence of consciousness from neural activity in biological brains.
The attention mechanism's success has inspired renewed interest in biological computation and bio-inspired AI. Modern research explores how biological systems implement attention-like mechanisms and how these insights can improve artificial systems.
The Toronto group's emphasis on biological inspiration proved prescient. As AI systems become increasingly sophisticated, researchers are returning to biological principles to understand and improve artificial intelligence. The field of biological computation, which combines biology and computer science, represents a natural evolution of the Toronto school's interdisciplinary approach.
The Toronto AI community's influence continues through institutions like the Vector Institute, co-founded by Geoffrey Hinton in 2017. The institute represents a continuation of the Toronto tradition, fostering research in deep learning, attention mechanisms, and emergent AI behaviors.
Current research at Toronto builds on the foundational work in neural networks and attention mechanisms. Projects explore how emergent behavior can be quantified and controlled in AI systems, developing new initialization schemes that promote emergence, and investigating the relationship between network architecture and emergent capabilities.
Modern Toronto researchers are developing mathematical frameworks to measure emergence in neural networks. This work treats emergence as structural nonlinearity—the degree to which a network's behavior deviates from simple linear combinations of its components.
By quantifying emergence, researchers can compare architectures on a common scale, anticipate when new capabilities are likely to appear, and design networks whose emergent behavior is easier to predict and control.
This quantitative approach to emergence represents a natural evolution of the Toronto school's biological inspiration, providing mathematical rigor to concepts originally derived from neuroscience and cognitive science.
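One simple way to operationalize such a measure, sketched below as an illustrative assumption rather than a description of any specific published framework, is to fit the best linear map from a network's inputs to its outputs and treat the unexplained residual variance as a crude score of structural nonlinearity.

import numpy as np

def nonlinearity_score(f, X):
    # Fraction of output variance that the best linear fit cannot explain:
    # near 0 means the map behaves linearly; larger values indicate behavior
    # dominated by nonlinear interactions. (Illustrative, not a standard metric.)
    Y = f(X)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])     # add a bias column
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)     # best linear approximation
    residual = Y - X1 @ coef
    return residual.var() / Y.var()

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))

W_lin = rng.normal(size=(10, 3))
linear_map = lambda X: X @ W_lin                      # no nonlinearity by construction

W1, W2 = rng.normal(size=(10, 64)), rng.normal(size=(64, 3))
deep_net = lambda X: np.tanh(X @ W1) @ W2             # nonlinear two-layer network

print(round(nonlinearity_score(linear_map, X), 3))    # approximately 0.0
print(round(nonlinearity_score(deep_net, X), 3))      # clearly greater than 0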
The Toronto group's emphasis on biological inspiration has evolved into the emerging field of biological computing. This interdisciplinary area explores how biological systems process information and how these mechanisms can inspire new AI architectures.
Recent developments include models of cellular and DNA-based computation, neuromorphic approaches that mimic the dynamics of spiking neurons, and learning rules drawn more directly from synaptic plasticity.
These approaches represent a return to the biological principles that sustained the Toronto group during the AI winter years, suggesting that the next breakthrough in artificial intelligence may again emerge from biological inspiration.
As AI systems become more powerful, the Toronto community is grappling with questions of AI alignment and control—ensuring that emergent behaviors in AI systems remain beneficial and predictable. This work draws on the group's deep understanding of emergent behavior to develop techniques for monitoring emergent capabilities as they appear, constraining unwanted behaviors, and keeping model objectives aligned with human intentions.
The Toronto school's biological perspective proves valuable in addressing these challenges, as biological systems have evolved mechanisms for controlling and channeling emergent behaviors toward beneficial outcomes.
The University of Toronto's AI group represents one of the most successful examples of scientific contrarianism in modern history. By maintaining faith in neural networks and biological inspiration during the AI winter years, they preserved and developed the theoretical foundations that would later enable the AI renaissance of the 21st century.
Their story illustrates several key principles: the value of sustained fundamental research, the power of biological inspiration as a guide when engineering intuition falls short, and the importance of institutions willing to shelter contrarian ideas through periods of scientific consensus.
The Toronto AI community's journey from academic outcasts to leaders of the AI renaissance demonstrates how fundamental research, guided by biological insights and sustained through periods of skepticism, can ultimately transform entire fields of study. Their work on emergent behavior and attention mechanisms continues to shape the development of artificial intelligence, ensuring that the biological principles they championed during the AI winter remain central to the field's future evolution.
The Toronto group's legacy extends beyond their specific technical contributions to encompass a research philosophy that values biological inspiration, emergent behavior, and long-term theoretical development over short-term practical applications. This approach, vindicated by the success of modern AI systems, offers a model for scientific research in other emerging fields where biological principles may again provide the key to unlocking artificial intelligence's next frontier.
Appendix: "Attention Is All You Need" (2017), excerpts
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Google Brain / Google Research / University of Toronto
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing with large and limited training data.
Recurrent neural networks, long short-term memory and gated recurrent neural networks in particular, have been firmly established as state of the art in sequence modeling and transduction problems such as language modeling and machine translation. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks and conditional computation, while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
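A minimal sketch, with illustrative shapes rather than anything from the paper, shows why this computation is inherently sequential: each hidden state depends on the previous one, so the time-step loop below cannot be parallelized.

import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_h = 12, 8, 16                 # sequence length, input size, hidden size
X = rng.normal(size=(n, d_in))           # one input sequence
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_in)) * 0.1

h = np.zeros(d_h)
states = []
for t in range(n):                       # must run step by step
    h = np.tanh(W_h @ h + W_x @ X[t])    # h_t depends on h_{t-1}
    states.append(h)

print(len(states), states[-1].shape)     # n hidden states, computed sequentially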
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. In all but a few cases, however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU, ByteNet and ConvS2S, all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations.
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks.
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as those in [17, 18] and [9].
Most competitive neural sequence transduction models have an encoder-decoder structure. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
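As a hedged illustration, auto-regressive generation amounts to a loop that feeds each newly produced symbol back in as input; the DummyModel class, method names, and token ids below are placeholders rather than part of any Transformer implementation.

import numpy as np

class DummyModel:
    # Stand-in for a trained encoder-decoder (illustrative only).
    def __init__(self, vocab_size=10, seed=0):
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)
    def encode(self, src_tokens):
        return self.rng.normal(size=(len(src_tokens), 8))   # z = (z_1, ..., z_n)
    def decode(self, memory, out_tokens):
        # Placeholder scores over the vocabulary for each generated position.
        return self.rng.normal(size=(len(out_tokens), self.vocab_size))

def greedy_decode(model, src_tokens, bos_id=1, eos_id=2, max_len=20):
    # Generate one symbol at a time, consuming previously generated symbols
    # as additional input when producing the next.
    memory = model.encode(src_tokens)
    out = [bos_id]
    for _ in range(max_len):
        logits = model.decode(memory, out)     # scores for the next symbol
        next_id = int(np.argmax(logits[-1]))   # greedy choice
        out.append(next_id)
        if next_id == eos_id:
            break
    return out[1:]

print(greedy_decode(DummyModel(), src_tokens=[5, 6, 7, 8]))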
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively.
In this section we compare various aspects of self-attention layers to the recurrent and convolutional layers commonly used for mapping one variable-length sequence of symbol representations (x_1, ..., x_n) to another sequence of equal length (z_1, ..., z_n), with x_i, z_i ∈ R^d, such as a layer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we consider three desiderata.
One is the total computational complexity per layer. Another is the amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
The third is the path length between long-range dependencies in the network. Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies. Hence we also compare the maximum path length between any two input and output positions in networks composed of the different layer types.
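As a rough illustration of the third criterion, and an assumption-laden sketch rather than a reproduction of the paper's Table 1, the maximum path length between two positions scales very differently across the three layer types; the convolutional case below assumes dilated convolutions as in ByteNet.

import math

def max_path_length(layer_type, n, k=3):
    # Asymptotic number of layer applications needed to connect two positions
    # that are n tokens apart (constants dropped).
    if layer_type == "self-attention":
        return 1                              # every position attends to every other
    if layer_type == "recurrent":
        return n                              # information flows one step at a time
    if layer_type == "convolutional":
        return math.ceil(math.log(n, k))      # stack of dilated convs, kernel size k
    raise ValueError(layer_type)

for layer in ("self-attention", "recurrent", "convolutional"):
    print(f"{layer:15s} n=1024 -> path length ~ {max_path_length(layer, 1024)}")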
This section describes the training regime for our models.
On the WMT 2014 English-to-German translation task, the big transformer model (Transformer (big) in Table 2) outperforms the best previously reported models (including ensembles) by more than 2.0 BLEU, establishing a new state-of-the-art BLEU score of 28.4. The configuration of this model is listed in the bottom line of Table 3. Training took 3.5 days on 8 P100 GPUs. Even our base model surpasses all previously published models and ensembles, at a fraction of the training cost of any of the competitive models.
In this work, we presented the Transformer, the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention.
For translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014 English-to-French translation tasks, we achieve a new state of the art. In the former task our best model outperforms even all previously reported ensembles.
We are excited about the future of attention-based models and plan to apply them to other tasks. We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to handle large inputs and outputs such as images, audio and video. Making generation less sequential is another research goal of ours.