Geoff Hinton had been silent since he went to work for Google. Recently, however, he has come out and started talking about something he calls dark knowledge. Maybe some questions shouldn’t be asked, but what does he mean by that?
Image credit: Shadow of the Vampire
Perhaps the first thing to understand here is the problem being solved: model complexity at test time. Complex models like ensembles and deep networks work well, but are slow to predict and require a lot of memory. Rich Caruana et al. explained it with regards to ensembles in the abstract of their 2006 paper, Model Compression (PDF):
Often the best performing supervised learning models are ensembles of hundreds or thousands of base-level classifiers. Unfortunately, the space required to store this many classifiers, and the time required to execute them at run-time, prohibits their use in applications where test sets are large (e.g. Google), where storage space is at a premium (e.g. PDAs), and where computational power is limited (e.g. hearing aids).
Smartphones would also count as devices with limited computational power and storage space.
The proposed solution is to train a simpler model that mimics the deep network or ensemble. To make this work, you replace the actual class labels in the training data with predictions from the model you wish to mimic.
The newest paper on this is probably Do Deep Nets Really Need to be Deep?. While its angle is different, the main point is exactly the same: you can train a shallow network to imitate a deep one, but first you need to train the deep network and get predictions from it. Those predictions then become the labels, and the second model attempts to learn the function the first model has learned.
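The procedure can be sketched in a few lines. This is a toy illustration, not anyone's actual setup: the "teacher" here is just a fixed sigmoid function standing in for a large model, and the "student" is a logistic regression trained with cross-entropy against the teacher's soft probabilities instead of hard 0/1 labels. All names and numbers are made up for the sketch.

```python
import math

def teacher_predict(x):
    # stand-in for a large, slow, accurate model: a fixed smooth function
    return 1.0 / (1.0 + math.exp(-(3.0 * x - 1.0)))

def train_student(xs, soft_targets, lr=0.5, epochs=2000):
    # a simple student (logistic regression) trained by SGD on
    # cross-entropy against the teacher's soft probabilities
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, t in zip(xs, soft_targets):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad = p - t  # gradient of cross-entropy w.r.t. the logit
            w -= lr * grad * x
            b -= lr * grad
    return w, b

xs = [i / 10.0 for i in range(-20, 21)]          # unlabeled inputs
soft_targets = [teacher_predict(x) for x in xs]  # teacher's predictions
w, b = train_student(xs, soft_targets)
# the student approximately recovers the teacher's function (w near 3, b near -1)
```

Note that the inputs don't need the original labels at all: any unlabeled data the teacher can score will do, which is part of what made the Model Compression approach practical.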
Let’s turn to Rich Caruana once again:
We take a large, slow, but accurate model and compress it into a much smaller, faster, yet still accurate model. This allows us to separate the models used for learning from the models used to deliver the learned function so that we can train large, complex models such as ensembles, but later make them small enough to fit on a PDA, hearing aid, or satellite. With model compression we can make models 1000 times smaller and faster with little or no loss in accuracy.
Geoff Hinton says in his BayLearn keynote abstract that
this technique works because most of the knowledge in the learned ensemble is in the relative probabilities of extremely improbable wrong answers. For example, the ensemble may give a BMW a probability of one in a billion of being a garbage truck but this is still far greater (in the log domain) than its probability of being a carrot. This dark knowledge, which is practically invisible in the class probabilities, defines a similarity metric over the classes that makes it much easier to learn a good classifier.
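Hinton's BMW example is easy to check numerically. Below is a small sketch with made-up logits: at the usual softmax temperature the wrong classes are practically invisible, but their relative order (garbage truck far above carrot, in the log domain) is still there, and raising the softmax temperature T makes that structure stand out. The specific logit values are hypothetical.

```python
import math

def softmax(logits, T=1.0):
    # temperature-scaled softmax; T = 1 is the ordinary softmax,
    # higher T flattens the distribution and reveals relative order
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical logits for an image of a BMW, over the classes
# [BMW, garbage truck, carbage... sorry, carrot]
logits = [10.0, -10.0, -30.0]

hard = softmax(logits)          # T = 1: wrong classes look like zero
soft = softmax(logits, T=20.0)  # high T: similarity structure visible

# hard[1] is about 2e-9 (roughly "one in a billion"), hard[2] about 4e-18,
# yet hard[1] / hard[2] = e^20 -- a huge gap in the log domain.
# soft comes out near [0.67, 0.24, 0.09], where the gap is plain to see.
```

Training the small model against these softened probabilities, rather than the near-one-hot ones, is how the dark knowledge gets transferred.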
Image credit: Dark Shadows
We’re looking forward to Geoff’s Reddit AMA on November 10th. In the meantime, why don’t you check out the dark knowledge talk and slides (PDF)? Or maybe our exclusive interview with Geoffrey, if you have the courage…
UPDATE: The paper on dark knowledge: Hinton, Vinyals, Dean - Distilling the Knowledge in a Neural Network [arxiv]