On March 4th Jürgen Schmidhuber tackled “ask me anything” questions on Reddit. The professor was very keen to answer; in fact, he continued to do so on the 5th, the 6th, and beyond. Here are some of his thoughts we found interesting, grouped by topic.
Why doesn’t your group post its code online for reproducing the results of competitions you’ve won, such as the ISBI Brain Segmentation Contest? Your results are impressive, but without the code they are rarely helpful for pushing the research forward. (That was the most popular question in the AMA.)
We did publish lots of open source code. Our PyBrain machine learning library is public and widely used, thanks to the efforts of Tom Schaul, Justin Bayer, Daan Wierstra, Sun Yi, Martin Felder, Frank Sehnke, and Thomas Rückstiess.
Here is the already mentioned code (RNNLIB) of the first competition-winning RNNs (2009) by my former PhD student and then postdoc Alex Graves. Many people are using it.
It is true, though, that we don’t publish all our code right away. Some of it gets tied up in industrial projects, which makes it hard to release.
Still, especially recently, we published less code than we could have. I am a big fan of the open source movement, and we have already decided internally to contribute more to it. (…) There are also plans to release more of our recent recurrent network code soon. In particular, there are plans for a new open source library, a successor of PyBrain.
What is the future of PyBrain? Is your team still working with/on PyBrain? If not, what is your framework of choice? What do you think of Theano? Are you using something better?
My PhD students Klaus and Rupesh are working on a successor of PyBrain with many new features, which hopefully will be released later this year.
How do you recognize a promising machine learning PhD student?
They all have something in common: successful students are not only smart but also tenacious. While trying to solve a challenging problem, they run into a dead end, and backtrack. Another dead end, another backtrack. But they don’t give up. And suddenly there is this little insight into the problem which changes everything. And suddenly they are world experts in a particular aspect of the field, and then find it easy to churn out one paper after another, and create a great PhD thesis.
Do you know of any labs doing biotech/bioinformatics that you think are worth exploring?
I know a great biotech/bioinformatics lab for this: the one of Deep Learning pioneer Sepp Hochreiter in Linz.
Sepp is back in the NN game, and his team promptly won nine of the 15 challenges in the Tox21 data challenge, including the Grand Challenge, the nuclear receptor panel, and the stress response panel. Check out the NIH (NCATS) announcement of the winners and the leaderboard.
Sepp’s Deep Learning approach DeepTox is described here.
What do you think about the American model of grad school (5 years on average, teaching duties, industry internships, freedom to explore and zero in on a research problem) versus the European model (3 years, contracted for a specific project, no teaching duties, limited industry internships)?
The models in both the US and the EU are shaped by Humboldt’s old model of the research university, but they come in various flavours. For example, there is huge variance among “the European models.” I see certain advantages of the successful US PhD school model, which I got to know better at the University of Colorado at Boulder in the early 1990s. But I feel that less school-like models also have something going for them.
US-inspired PhD schools like those at my present Swiss university require students to get credits for certain courses. At TU Munich (where I come from), however, the attitude was: a PhD student is a grown-up who doesn’t go to school any more; it’s his own job to acquire the additional education he needs. This is great for strongly self-driven persons but may be suboptimal for others. At TUM, my wonderful advisor, Wilfried Brauer, gave me total freedom in my research. I loved it, but it seems kind of out of fashion now in some places.
The extreme variant is what I like to call the “Einstein model.” Einstein never went to grad school. He worked at the patent office, and at some point he submitted a thesis to Univ. Zurich. That was it. Ah, maybe I shouldn’t admit that this is my favorite model. And now I am also realizing that I have not really answered your question in any meaningful way - sorry for that!
What is, in your opinion, the best venue to publish modern neural network work? What do you think of the International Conference on Learning Representations (ICLR)?
I like the partially open review process of ICLR, and the way it uses the arXiv preprint server. In fact, some of the most interesting papers are now first published as Tech Reports on arXiv without peer review. The physicists started this a quarter-century ago, forcing leading journals to accelerate the subsequent peer review process by a factor of 10 or so, to prevent the TRs from attracting all the citations. Computer science caught up around 2000. Here’s my old text on this from 2001.
The only problem is that some people publish nonsense on arXiv, and then sometimes even manage to promote it through contacts with “tabloid science” journalists who have no idea what they are writing about.
Anyway, one can still earn a badge of honor by getting something published in the leading journals: Neural Computation, Neural Networks, IEEE Transactions on Neural Networks, Journal of Machine Learning Research, etc. And at the leading conferences such as NIPS and ICML (IJCNN also has published great papers).
Why is there not much interaction and collaboration between the researchers of Recurrent NNs and the rest of the NN community, particularly Convolutional NNs? I always see Hinton, LeCun, and Bengio interacting at conferences, panels, and Google+, but never Schmidhuber. They also cite each other’s papers more.
Maybe part of this is just a matter of physical distance. This trio of long-term collaborators has done great work in three labs near the northeastern US/Canadian border, co-funded by the Canadian CIFAR organization, while our labs in Switzerland and Munich were over 6,000 km away and mostly funded by the Swiss National Science Foundation, the DFG, and EU projects. Also, I stopped going regularly to the important NIPS conference in Canada when NIPS focused on non-neural stuff such as kernel methods during the most recent NN winter, and when cross-Atlantic flights became such a hassle after 9/11.
Nevertheless, there are quite a few connections across the big pond. For example, before he ended up at DeepMind, my former PhD student and postdoc Alex Graves went to Geoff Hinton’s lab, which is now using LSTM RNNs a lot for speech and other sequence learning problems. Similarly, my former PhD student Tom Schaul did a postdoc in Yann LeCun’s lab before he ended up at DeepMind (which has become some sort of retirement home for my former students :-). Yann LeCun also was on the PhD committee of Jonathan Masci, who did great work in our lab on fast image scans with max-pooling CNNs.
With Yoshua Bengio we even had a joint paper in 2001 on the vanishing gradient problem. The first author was Sepp Hochreiter, my very first student (now professor), who identified and analysed this Fundamental Deep Learning Problem in 1991 in his diploma thesis.
There have been lots of other connections through common research interests. (…) To summarise: there are lots of RNN/CNN-related links between our labs.
I just took my first machine learning course and I’m interested in learning more about the field. Where do you recommend I start? Do you have any books, tutorials, tips to recommend?
Here is a very biased list of books and links that I found useful for students entering our lab (other labs may emphasize different aspects though).
What are some of the most exciting papers that you have read (or written) in the past year?
Last year I got excited about industrial breakthroughs of our recurrent neural networks. They are now helping to revolutionize speech processing and other sequence learning domains, especially through the Long Short-Term Memory (LSTM) developed in my research groups in the 1990s and 2000s (main PhD theses by Sepp Hochreiter 1999, Felix Gers 2001, and Alex Graves 2008; main postdoc contributors: Fred Cummins, Santiago Fernandez, Faustino Gomez). Here are some recent benchmark records achieved with LSTM, often at big IT companies:
- Text-to-speech synthesis (Fan et al., Microsoft, Interspeech 2014)
- Language identification (Gonzalez-Dominguez et al., Google, Interspeech 2014)
- Large vocabulary speech recognition (Sak et al., Google, Interspeech 2014)
- Prosody contour prediction (Fernandez et al., IBM, Interspeech 2014)
- Medium vocabulary speech recognition (Geiger et al., Interspeech 2014)
- English to French translation (Sutskever et al., Google, NIPS 2014)
- Audio onset detection (Marchi et al., ICASSP 2014)
- Social signal classification (Brueckner & Schuller, ICASSP 2014)
- Arabic handwriting recognition (Bluche et al., DAS 2014)
- TIMIT phoneme recognition (Graves et al., ICASSP 2013)
- Optical character recognition (Breuel et al., ICDAR 2013)
- Image caption generation (Vinyals et al., Google, 2014)
- Video to textual description (Donahue et al., 2014)
- Photo-real talking heads (Soong and Wang, Microsoft, 2014)
- Semantic representations (Tai et al., 2015)
- Learning video representations (Srivastava et al., 2015)
- Video description generation (Yao et al., 2015)
Also check out recent end-to-end speech recognition (Hannun et al., Baidu, 2014) with our CTC-based RNNs (Graves et al., 2006), without any HMMs etc.
Many of the references above can be found in my recent Deep Learning survey, whose write-up consumed quite some time, because I wanted to get the history right: who started Deep Learning, who invented backpropagation, what has been going on in Deep Reinforcement Learning, etc, etc. In the end I had 888 references on 88 pages.
What is hot now in applying learning-as-compression, in the sense of, say, Vitányi, to ANNs? Will this line of work gain more momentum? And what about the RNN book - will it keep us waiting much longer :-)?
From my biased perspective, Compressed Network Search is hot.
Regarding the RNN book: please bear with us, and let me offer a partial excuse for the delay, namely, that the field is moving so quickly right now! In the meantime, please make do with the Deep Learning overview, which also is an RNN survey.
What do you consider the biggest game changers in AI research so far?
I think Marcus Hutter’s AIXI model of the early 2000s was a game changer. Until then, the field of Artificial General Intelligence (AGI) had been a collection of heuristics. But heuristics come and go, while theorems last for eternity. Building on Ray Solomonoff’s earlier work on universal predictors, Marcus proved that there is a universal AI that is mathematically optimal in a certain sense. It’s not a practical sense, otherwise we’d probably not even be discussing this here. But this work exposed the ultimate limits of both human and artificial intelligence, and brought mathematical soundness and theoretical credibility to the entire field for the first time. These results will still stand in a thousand years. More.
From my extremely biased perspective I’d say that there also has been a lot of important work on non-universal but still rather general and very practical recurrent neural networks. RNNs are general computers. RNNs are the deepest NNs. Some RNNs are biologically plausible. Some RNNs are compatible with physically efficient future hardware: lots of processors connected through many short and few long wires. In many ways, RNNs are the ultimate NNs. In recent decades, there has been lots of progress in both supervised learning RNNs and reinforcement learning RNNs. RNNs have started to revolutionize very important fields such as speech recognition. And that’s just the beginning. Many researchers have collectively contributed to this RNN-based “game changer”; here are some relevant sections in my little survey, with lots of references: Sec. 2, 3, 5.5, 5.5.1, 5.6.1, 5.9, 5.10, 5.13, 5.16, 5.17, 5.20, 5.22, 6.1, 6.3, 6.4, 6.6, 6.7.
How do we get from supervised learning to fully unsupervised learning?
When we started explicit Deep Learning research in the early 1990s, we actually went the other way round, from unsupervised learning (UL) to supervised learning (SL)! To overcome the vanishing gradient problem, I proposed a generative model, namely, an unsupervised stack of RNNs (1992) [PDF]. The first RNN uses UL to predict its next input. Each higher level RNN tries to learn a compressed representation of the info in the RNN below, trying to minimise the description length (or negative log probability) of the data. The top RNN may then find it easy to classify the data by supervised learning. One can also “distill” a higher RNN (the teacher) into a lower RNN (the student) by forcing the lower RNN to predict the hidden units of the higher one (another form of unsupervised learning). Such systems could solve previously unsolvable deep learning tasks.
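To make the idea concrete, here is a minimal toy sketch of the history-compressor principle, not the original 1992 code: a lower RNN learns to predict its next input, and only the inputs it fails to predict are passed up to the higher level, which therefore sees a shorter, compressed sequence. The sizes, the crude single-step training rule, and the surprise threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyRNN:
    """A plain tanh RNN trained to predict its next input."""
    def __init__(self, n_in, n_hidden, lr=0.05):
        self.Wxh = rng.normal(0, 0.1, (n_hidden, n_in))
        self.Whh = rng.normal(0, 0.1, (n_hidden, n_hidden))
        self.Why = rng.normal(0, 0.1, (n_in, n_hidden))
        self.h = np.zeros(n_hidden)
        self.lr = lr

    def train_step(self, x, target_next):
        """Update state on x, predict the next input, adapt the readout."""
        self.h = np.tanh(self.Wxh @ x + self.Whh @ self.h)
        err = self.Why @ self.h - target_next
        self.Why -= self.lr * np.outer(err, self.h)  # crude single-step update
        return float((err ** 2).mean())

# Toy sequence of one-hot symbols with a repetitive, hence predictable, structure.
seq = [np.eye(4)[i] for i in [0, 1, 2, 3] * 50]

low = TinyRNN(n_in=4, n_hidden=16)
unexpected = []  # the compressed sequence handed to the next level up
for x, x_next in zip(seq[:-1], seq[1:]):
    if low.train_step(x, x_next) > 0.2:  # surprise threshold (illustrative)
        unexpected.append(x_next)

print(f"{len(unexpected)} of {len(seq) - 1} inputs passed to the higher RNN")
```

As the lower RNN gets better at prediction, fewer and fewer inputs surprise it, so the higher level ends up with a much shorter sequence to process.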
However, then came supervised LSTM, and that worked so well in so many applications that we shifted focus to that. On the other hand, LSTM can still be used in unsupervised mode as part of an RNN stack like above. This illustrates that the boundary between supervised and unsupervised learning is blurry. Often gradient-based methods such as backpropagation are used to optimize objective functions for both types of learning.
So how do we get back to fully unsupervised learning? First of all, what does that mean? The most general type of unsupervised learning comes up in the general reinforcement learning (RL) case. Which unsupervised experiments should an agent’s RL controller C conduct to collect data that quickly improves its predictive world model M, which could be an unsupervised RNN trained on the history of actions and observations so far? The simple formal theory of curiosity and creativity says: Use the learning progress of M (typically compression progress in the MDL sense) as the intrinsic reward or fun of C. I believe this general principle of active unsupervised learning explains all kinds of curious and creative behaviour in art and science, and we have built simple artificial “scientists” based on approximations thereof, using (un)supervised gradient-based learners as sub-modules.
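As a concrete illustration of that principle, here is a hedged NumPy sketch in which the world model M is just a linear next-observation predictor, and the intrinsic reward is M's learning progress: the drop in its prediction loss after one training step on the data the agent collected. The linear model, learning rate, and toy data are illustrative assumptions, not Schmidhuber's actual system.

```python
import numpy as np

rng = np.random.default_rng(1)

# World model M: a linear next-observation predictor (an illustrative
# stand-in for the unsupervised predictive RNN described above).
W = rng.normal(0, 0.1, (3, 3))

def model_loss(W, obs, nxt):
    return float(((obs @ W.T - nxt) ** 2).mean())

def intrinsic_reward(W, obs, nxt, lr=0.01):
    """Curiosity reward for controller C = learning progress of model M."""
    before = model_loss(W, obs, nxt)
    err = obs @ W.T - nxt
    W_new = W - lr * err.T @ obs / len(obs)   # one gradient step for M
    after = model_loss(W_new, obs, nxt)
    return before - after, W_new              # positive while M still improves

# Data from a learnable part of the environment yields a clear reward...
obs = rng.normal(size=(32, 3))
nxt = obs @ np.diag([0.5, -0.2, 0.9])
r_structured, W = intrinsic_reward(W, obs, nxt)
# ...while pure noise yields almost none: incompressible, hence boring.
r_noise, _ = intrinsic_reward(W, rng.normal(size=(32, 3)), rng.normal(size=(32, 3)))
print(f"reward on structured data: {r_structured:.4f}, on noise: {r_noise:.4f}")
```

Note how the reward vanishes both for data the model already predicts perfectly and for unlearnable noise: the curious controller is drawn to whatever is still learnable.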
How will IBM’s TrueNorth neurosynaptic chip affect the Neural Networks community? Can we expect that the future of Deep Learning lies not in GPUs, but rather in dedicated hardware such as TrueNorth?
As already mentioned in another reply, current GPUs are much hungrier for energy than biological brains, whose neurons efficiently communicate by brief spikes (Hodgkin and Huxley, 1952; FitzHugh, 1961; Nagumo et al., 1962), and often remain quiet. Many computational models of such spiking neurons have been proposed and analyzed - see Sec. 5.26 of the Deep Learning survey. I like the TrueNorth chip because indeed it consumes relatively little energy (see Sec. 5.26 for related hardware). This will become more and more important in the future. It would be nice though to have a chip that is not only energy-efficient but also highly compatible with existing state of the art learning methods for NNs that are normally implemented on GPUs. I suspect the TrueNorth chip won’t be the last word on this.
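For readers unfamiliar with the spiking-neuron models cited above, here is a minimal forward-Euler integration of the FitzHugh-Nagumo equations (FitzHugh, 1961; Nagumo et al., 1962), a two-variable simplification of Hodgkin-Huxley. The parameter values are standard textbook choices for an oscillatory regime, and the spike-counting heuristic is our own illustrative addition.

```python
# Parameters in a standard oscillatory (spiking) regime.
a, b, tau, I = 0.7, 0.8, 12.5, 0.5   # I is the external driving current
v, w = -1.0, 1.0                     # membrane potential, recovery variable
dt, spikes = 0.1, 0

for _ in range(10000):               # simple forward-Euler integration
    dv = v - v ** 3 / 3 - w + I
    dw = (v + a - b * w) / tau
    v_new = v + dt * dv
    w += dt * dw
    if v <= 0.0 < v_new:             # upward zero crossing counted as a spike
        spikes += 1
    v = v_new

print(f"{spikes} spikes in {10000 * dt:.0f} time units")
```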
What do you think about Hierarchical Temporal Memory (HTM) and the Cortical Learning Algorithm (CLA) theory developed by Jeff Hawkins and others?
Jeff Hawkins had to endure a lot of criticism because he did not relate his method to much earlier similar methods, and because he did not compare its performance to that of other widely used methods.
HTM is a neural system that attempts to learn from temporal data in hierarchical fashion. To my knowledge, the first neural hierarchical sequence-processing system was our hierarchical stack of recurrent neural networks (Neural Computation, 1992). Compare also hierarchical Hidden Markov Models (e.g., Fine, S., Singer, Y., and Tishby, N., 1998), and our widely used hierarchical stacks of LSTM recurrent networks [PDF].
At the moment I don’t see any evidence that Hawkins’ system can contribute “towards more powerful AI systems (or even AGI).”
Recurrent neural networks
Why do you think that RNNs are the ultimate NNs?
Because they are general computers. Like your laptop. The program of an RNN is its weight matrix. Unlike feedforward NNs, RNNs can implement while loops, recursion, you name it.
While FNNs are traditionally linked to concepts from statistical mechanics and classical information theory, the programs of RNNs call for the framework of algorithmic information theory (or Kolmogorov complexity theory).
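A tiny sketch of what “the program of an RNN is its weight matrix” means in practice: one fixed pair of weight matrices processes input sequences of any length by reusing the same state-update rule, whereas a feedforward net has a fixed depth. The sizes and random weights below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.5, (8, 2))  # input-to-state weights ("program", part 1)
Whh = rng.normal(0, 0.5, (8, 8))  # state-to-state weights ("program", part 2)

def run(inputs):
    h = np.zeros(8)                # memory that persists across time steps
    for x in inputs:               # a while-loop-like run of arbitrary length
        h = np.tanh(Wxh @ x + Whh @ h)
    return h

short = run(rng.normal(size=(3, 2)))     # the same weights...
long_ = run(rng.normal(size=(1000, 2)))  # ...run for 1000 steps instead of 3
print(short[:3], long_[:3])
```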
How on earth did you and Hochreiter come up with LSTM units? They seem radically more complicated than any other “neuron” structure I’ve seen, and every time I see the figure, I’m shocked that you’re able to train them.
In my first Deep Learning project ever, Sepp Hochreiter (1991) analysed the vanishing gradient problem. LSTM falls out of this almost naturally :-)
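For the curious, here is a hedged sketch of one standard LSTM step (including the forget gate later added by Gers et al., 2000): the cell state c is updated additively, so error signals can flow across many time steps, while sigmoid gates learn when to write, forget, and read memory. Weight shapes and names are illustrative, not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; W packs all four gate weight matrices row-wise."""
    z = W @ np.concatenate([x, h]) + b
    n = len(c)
    i = sigmoid(z[0 * n:1 * n])   # input gate: write new info to memory?
    f = sigmoid(z[1 * n:2 * n])   # forget gate: keep the old memory?
    o = sigmoid(z[2 * n:3 * n])   # output gate: expose the memory?
    g = np.tanh(z[3 * n:4 * n])   # candidate values to write
    c = f * c + i * g             # additive update: the "error carousel"
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 4, 8
W = rng.normal(0, 0.1, (4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(20, n_in)):  # run over a toy input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h[:4])
```

Seen this way, the design is less exotic than the figures suggest: everything revolves around protecting the additive cell-state update from the squashing that causes gradients to vanish.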
Your recent paper on Clockwork RNNs seems to provide an alternative to LSTMs for learning long-term temporal dependencies. Are there obvious reasons to prefer one approach over the other? Have you put thought into combining elements from each approach (e.g., Clockwork RNNs that make use of multiplicative gating in some fashion)?
We had lots of ideas about this. This is actually a simplification of our RNN stack-based history compressors (Neural Computation, 1992) [PDF], where the clock rates are not fixed, but depend on the predictability of the incoming sequence (and where a slowly clocking teacher net can be “distilled” into a fast clocking student net that imitates the teacher net’s hidden units).
But we don’t know yet in general when to prefer which variant of plain LSTM over which variant of Clockwork RNNs or Clockwork LSTMs or history compressors. Clockwork RNNs so far are better only on the synthetic benchmarks presented in the ICML 2014 paper.
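For orientation, here is a simplified sketch of the Clockwork RNN update from the ICML 2014 paper: hidden units are partitioned into modules, and module i changes state only every 2**i time steps, so slowly clocked modules retain long-term context cheaply. The sketch omits the paper's restriction that only slower modules feed faster ones, and all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules, m = 4, 8               # 4 modules of 8 units each
n_hid = n_modules * m
Wxh = rng.normal(0, 0.1, (n_hid, 3))
Whh = rng.normal(0, 0.1, (n_hid, n_hid))

def cw_rnn_step(x, h, t):
    h_new = np.tanh(Wxh @ x + Whh @ h)
    for i in range(n_modules):
        if t % (2 ** i) != 0:     # module i only ticks every 2**i steps;
            h_new[i * m:(i + 1) * m] = h[i * m:(i + 1) * m]  # else hold state
    return h_new

h = np.zeros(n_hid)
for t, x in enumerate(rng.normal(size=(16, 3))):
    h = cw_rnn_step(x, h, t)
print(h[::m])  # one unit from each clock rate
```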
I don’t have a specific question, but I was curious about your thoughts on recent work in reservoir computing.
The relation between reservoirs and fully adaptive recurrent neural networks (RNNs) is a bit like the relation between kernel methods and fully adaptive feedforward neural networks (FNNs). Kernel methods such as support vector machines (SVMs) typically have a pre-wired, complex, highly nonlinear pre-processor of the data (the kernel), and optimize a linear mapping from kernel outputs to target labels. That’s what reservoirs do, too, except that they don’t just process individual data points, but sequences of data (e.g., speech). Deep FNNs go beyond SVMs in the sense that they also optimize the nonlinear part of the mapping from data to labels. RNNs go beyond reservoirs in the same sense. Nevertheless, just like SVMs, reservoirs have achieved excellent results in certain domains. For example, see the pioneering work of Herbert Jaeger and Wolfgang Maass and colleagues. (More references in the Deep Learning overview.)
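A minimal echo-state-style sketch of the setup just described: a fixed random recurrent “kernel” expands the input sequence into a rich state trajectory, and only a linear readout is fit, here by ordinary least squares on a toy one-step-ahead sine prediction task. The reservoir size, the spectral radius of 0.9, and the task itself are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
W_res = rng.normal(0, 1, (N, N))
W_res *= 0.9 / np.abs(np.linalg.eigvals(W_res)).max()  # spectral radius 0.9
W_in = rng.normal(0, 0.5, (N,))

u = np.sin(np.linspace(0, 20 * np.pi, 1000))  # toy input sequence
states = np.zeros((len(u), N))
x = np.zeros(N)
for t, ut in enumerate(u):
    x = np.tanh(W_in * ut + W_res @ x)        # fixed reservoir, never trained
    states[t] = x

# Fit ONLY the linear readout, to predict the next input one step ahead.
W_out, *_ = np.linalg.lstsq(states[200:900], u[201:901], rcond=None)
pred = states[-2] @ W_out                     # state at t=998 predicts u[999]
print(f"prediction {pred:.3f} vs truth {u[-1]:.3f}")
```

A fully adaptive RNN would, by contrast, also train W_res and W_in themselves, just as a deep FNN goes beyond an SVM by training its nonlinear part.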
Jochen Steil (2007) and others used unsupervised learning to improve nonlinear reservoir parts as well. One can also optimize reservoirs by evolution. For example, evolution-trained hidden units of LSTM RNNs combined with an optimal linear mapping (e.g., SVM) from hidden to output units outperformed traditional pure gradient-based LSTM on certain supervised sequence learning tasks. See the EVOLINO papers since 2005.
What kind of future do you see for recurrent neural networks, compared with deep learning networks? What kinds of training algorithms, if any, work well for recurrent networks?
All deep feedforward networks are special cases of recurrent networks. In principle, recurrent neural networks are the deepest of them all. Deep Learning can occur in both recurrent and feedforward networks. See Sec. 3 of the survey mentioned below.
What works well? Useful algorithms for supervised, unsupervised, and reinforcement learning recurrent networks are mentioned in the following sections of the Deep Learning survey: Sec. 5.5, 5.5.1, 5.6.1, 5.9, 5.10, 5.13, 5.16, 5.17, 5.20, 5.22, 6.1, 6.3, 6.4, 6.6, 6.7.
What are the next big things that you a) want to happen or b) predict will happen in the world of recurrent neural nets?
The world of RNNs is such a big world because RNNs (the deepest of all NNs) are general computers, and because efficient computing hardware in general is becoming more and more RNN-like, as dictated by physics: lots of processors connected through many short and few long wires. It does not take a genius to predict that in the near future, both supervised learning RNNs and reinforcement learning RNNs will be greatly scaled up. Current large, supervised LSTM RNNs have on the order of a billion connections; soon that will be a trillion, at the same price. (Human brains have maybe a thousand trillion, much slower, connections; matching this economically may require another decade or so of hardware development.) In the supervised learning department, many tasks in natural language processing, speech recognition, automatic video analysis, and combinations of all three will perhaps soon become trivial through large RNNs (the vision part augmented by CNN front-ends). The commercially less advanced but more general reinforcement learning department will see significant progress in RNN-driven adaptive robots in partially observable environments. Perhaps much of this won’t really mean breakthroughs in the scientific sense, because many of the basic methods already exist. However, much of this will SEEM like a big thing to those who focus on applications. (It also seemed like a big thing when in 2011 our team achieved the first superhuman visual classification performance in a controlled contest, although none of the basic algorithms was younger than two decades.)
So what will be the real big thing? I like to believe that it will be self-referential general purpose learning algorithms that improve not only some system’s performance in a given domain, but also the way they learn, and the way they learn the way they learn, etc., limited only by the fundamental limits of computability. I have been dreaming about and working on this all-encompassing stuff since my 1987 diploma thesis on this topic, but now I can see how it is starting to become a practical reality. Previous work on this is collected here.
What’s something exciting you’re working on right now, if it’s okay to be specific?
Among other things, we are working on the “RNNAIssance” - the birth of a Recurrent Neural Network-based Artificial Intelligence (RNNAI). This is about a reinforcement learning, RNN-based, increasingly general problem solver.
From the AMA intro: Since age 15 or so, Jürgen Schmidhuber’s main scientific ambition has been to build an optimal scientist through self-improving Artificial Intelligence (AI), then retire.
What sparked you at age 15 to have that as your ambition?
When I was a boy, the nearby public library had all those popular science books. I got fascinated by the most fundamental of natural sciences, namely, physics. Even mathematics, the “queen of sciences,” had a history of physics-driven discoveries. Einstein was my big hero. I wanted to become a physicist, too. But then I realized the much bigger potential impact of building an artificial scientist much smarter than myself, letting him do the remaining work. This has been driving me ever since. I went on to study computer science with the ambition to build a general purpose artificial intelligence. This naturally led to self-modifying programs and reinforcement learning RNNs etc, etc.
What is your take on the threat posed by artificial super intelligence to mankind?
I guess there is no lasting way of controlling systems much smarter than humans, pursuing their own goals, being curious and creative, in a way similar to the way humans and other mammals are creative, but on a much grander scale.
But I think we may hope there won’t be too many goal conflicts between “us” and “them.” Let me elaborate on this.
Humans and others are interested in those they can compete and collaborate with. Politicians are interested in other politicians. Business people are interested in other business people. Scientists are interested in other scientists. Kids are interested in other kids of the same age. Goats are interested in other goats.
Supersmart AIs will be mostly interested in other supersmart AIs, not in humans. Just like humans are mostly interested in other humans, not in ants. Aren’t we much smarter than ants? But we don’t extinguish them, except for the few that invade our homes. The weight of all ants is still comparable to the weight of all humans.
Human interests are mainly limited to a very thin film of biosphere around the third planet, full of poisonous oxygen that makes many robots rust. The rest of the solar system, however, is not made for humans, but for appropriately designed robots. Some of the most important explorers of the 20th century already were (rather stupid) robotic spacecraft. And they are getting smarter rapidly. Let’s go crazy. Imagine an advanced robot civilization in the asteroid belt, quite different from ours in the biosphere, with access to many more resources (e.g., the earth gets less than a billionth of the sun’s light). The belt contains lots of material for innumerable self-replicating robot factories. Robot minds or parts thereof will travel in the most elegant and fastest way (namely by radio from senders to receivers) across the solar system and beyond. There are incredible new opportunities for robots and software life in places hostile to biological beings. Why should advanced robots care much for our puny territory on the surface of planet number 3?
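A quick back-of-envelope check of the “less than a billionth” figure: the fraction of the Sun's output intercepted by Earth is its cross-section divided by the area of the full sphere at one astronomical unit.

```python
R_earth = 6.371e6                       # Earth radius in metres
AU = 1.496e11                           # Earth-Sun distance in metres
fraction = (R_earth / (2 * AU)) ** 2    # pi*R^2 / (4*pi*AU^2)
print(f"{fraction:.1e}")                # ~4.5e-10, under one billionth
```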
You see, I am an optimist :-)
You once said:
All attempts at making sure there will be only provably friendly AIs seem doomed. Once somebody posts the recipe for practically feasible self-improving Goedel machines or AIs in form of code into which one can plug arbitrary utility functions, many users will equip such AIs with many different goals, often at least partially conflicting with those of humans.
Do you still believe this? Secondly, if someone comes up with such a recipe, wouldn’t it be best if they didn’t publish it?
If there were a recipe, would it be better if a single person kept it under his wing, or would it be better if it got published?
I guess the biggest threat to humans will, as always, come from other humans, mostly because humans share similar goals, which results in goal conflicts. Please see my optimistic answer above.
In what field do you think machine learning will make the biggest impact in the next ~5 years?
I think it depends a bit on what you mean by “impact”. Commercial impact? If so, in a related answer I write: Both supervised learning recurrent neural networks (RNNs) and reinforcement learning RNNs will be greatly scaled up. In the commercially relevant supervised department, many tasks such as natural language processing, speech recognition, automatic video analysis and combinations of all three will perhaps soon become trivial through large RNNs (the vision part augmented by CNN front-ends).
“Symbol grounding” will be a natural by-product of this. For example, the speech or text-processing units of the RNN will be connected to its video-processing units, and the RNN will learn the visual meaning of sentences such as “the cat in the video fell from the tree”. Such RNNs should have many commercial applications.
I am not so sure when we will see the first serious applications of reinforcement learning RNNs to real world robots, but it might also happen within the next 5 years.
Where do you see the field of machine learning 5, 10, and 20 years from now?
Even (minor extensions of) existing machine learning and neural network algorithms will achieve many important superhuman feats. I guess we are witnessing the ignition phase of the field’s explosion. But how to predict turbulent details of an explosion from within?
Earlier I tried to reply to questions about the next 5 years. You are also asking about the next 10 years. In 10 years we’ll have 2025. That’s an interesting date: the centennial of the first transistor, patented by Julius Lilienfeld in 1925. But let me skip the 10-year question, which I find very difficult, and immediately address the 20-year question, which I find much, much more difficult.
We are talking about 2035, which also is an interesting date, a century or so after modern theoretical computer science was created by Goedel (1931) & Church/Turing/Post (1936), and the patent application for the first working general program-controlled computer was filed by Zuse (1936). Assuming Moore’s law holds up, in 2035 computers will be more than 10,000 times faster than today, at the same price. This sounds more or less like the power of a human brain in a small portable device. Or the human brain power of a city in a larger computer.
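A quick sanity check of that factor, assuming the common reading of Moore's law as a doubling of compute per price roughly every 18 months (the exact doubling time is an assumption):

```python
years, doubling_time = 20, 1.5          # doubling time is an assumption
print(f"speedup factor: {2 ** (years / doubling_time):,.0f}")  # ~10,321
```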
Given such raw computational power, I expect huge (by today’s standards) recurrent neural networks on dedicated hardware to simultaneously perceive and analyse an immense number of multimodal data streams (speech, texts, video, many other modalities) from many sources, learning to correlate all those inputs and use the extracted information to achieve a myriad of commercial and non-commercial goals. Those RNNs will continually and quickly learn new skills on top of those they already know. This should have innumerable applications, although I am not even sure whether the word “application” still makes sense here.
This will change society in innumerable ways. What will be the cumulative effect of all those mutually interacting changes on our civilisation, which will depend on machine learning in so many ways? In 2012 [PDF], I tried to illustrate how hard it is to answer such questions: A single human predicting the future of humankind is like a single neuron predicting what its brain will do.
I am supposed to be an expert, but my imagination is so biased and limited - I must admit that I have no idea what is going to happen. It just seems clear that everything will change. Sorry for completely failing to answer your question.