Geoffrey Hinton received his PhD in Artificial Intelligence from Edinburgh in 1978. After five years as a faculty member at Carnegie-Mellon he became a fellow of the Canadian Institute for Advanced Research and moved to the Department of Computer Science at the University of Toronto, where he is now an emeritus distinguished professor. He is also a distinguished researcher at Google. Geoffrey Hinton was one of the researchers who introduced the backpropagation algorithm and the first to use backpropagation for learning word embeddings. His other contributions to neural network research include Boltzmann machines, distributed representations, time-delay neural nets, mixtures of experts, variational learning and deep learning. His research group in Toronto made major breakthroughs in deep learning that revolutionized speech recognition and object classification. Geoffrey Hinton is a fellow of the UK Royal Society and a foreign member of the US National Academy of Engineering and the American Academy of Arts and Sciences. His awards include the David E. Rumelhart Prize, the IJCAI Award for Research Excellence, the Killam Prize for Engineering, the IEEE James Clerk Maxwell Gold Medal, and the NSERC Herzberg Gold Medal, which is Canada's top award in science and engineering.
We would like to train neural networks that have trillions of weights on trillions of examples. This requires massive parallelism. We can use different processors for different parts of a large neural net, but this requires us to communicate the states of the neurons between processors. We can also make many replicas of the neural net and feed different sets of examples to different replicas, but this requires us to communicate the weight gradients computed by different replicas. When these two forms of parallelism have been exhausted, we need to turn to biology to find another form of parallelism that can tolerate low bandwidth and high latency. I will describe systems in which one neural network attempts to mimic the predictions of another neural network and show that this allows knowledge to be transferred between networks even though the inner workings of the networks are completely different. Curiously, for classification tasks, the most informative aspect of the output of a network is not the probability it assigns to the correct class but the relative probabilities it assigns to incorrect classes. Given an image of a BMW, a network that has learned that a BMW is much more similar to a garbage truck than to a carrot provides a much better teaching signal than the label "BMW".
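
The idea of one network mimicking another's predictions can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the logit values are invented, the helper names softmax and cross_entropy are hypothetical, and the use of a temperature to soften the teacher's output is one common choice rather than necessarily the method presented in the talk. The sketch only shows how a teacher's relative probabilities for the wrong classes (garbage truck versus carrot) carry information that the hard label "BMW" discards.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of predicted_probs measured against target_probs."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

classes = ["BMW", "garbage truck", "carrot"]

# Invented logits for a single image of a BMW (assumption, illustration only).
teacher_logits = [9.0, 4.0, -3.0]
student_logits = [2.0, 0.5, 0.0]

temperature = 4.0  # assumed value; softens both distributions

hard_label = [1.0, 0.0, 0.0]                              # says only "BMW"
soft_targets = softmax(teacher_logits, temperature)       # teacher's softened view
student_probs = softmax(student_logits, temperature)

# Loss the student would minimize to mimic the teacher's predictions.
mimic_loss = cross_entropy(soft_targets, student_probs)
# Loss against the hard label alone, for comparison.
hard_loss = cross_entropy(hard_label, softmax(student_logits))

for name, p in zip(classes, soft_targets):
    print(f"{name:>13}: {p:.3f}")
print(f"mimic loss: {mimic_loss:.3f}, hard-label loss: {hard_loss:.3f}")
```

In this toy setting the soft targets still rank "garbage truck" well above "carrot", which is the similarity structure the hard label cannot express. Only the teacher's output probabilities need to be communicated to the student, not its weights or internal states.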