The AI Paper with The Best Title Ever

Epistemic Status: Confident I understood the paper, but no promises on its implications

The title is “Learning to learn by gradient descent by gradient descent.” If anyone can top that title I am eager to hear about it. The idea is a good one on the merits, but I still suspect that one of the authors thought of the title and then decided to implement the technique in order to justify using the name as the title of a paper. That may sound like a lot of work, since it involved a lot of coding and eventually eight authors, but when you see a golden opportunity, it would be wrong not to seize it.

The paper is conceptually simple. As with most things that actually happen in AI (or elsewhere), the question in hindsight is less “how did you think of that?” and more “why did it take so long for someone to do this?”

Machine learning is almost always done via gradient descent plus a few tricks. Gradient descent has a lot of hyperparameters whose optimal values interact and are not obvious, and you can tinker with them to try to improve its performance. That sounds a lot like a problem you can solve using… gradient descent.

They do this on some problems where gradient descent already gets a reasonable answer. They do what seems like the simplest thing, and use a small LSTM as the optimizer, applied to each parameter separately. It works. Performance is noticeably improved versus the standard hand-designed optimizers they compare against.

One, two, three, victory! That is pretty cool.
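To make the idea concrete, here is a minimal sketch, and it is a toy rather than the paper's method. The paper trains a small LSTM to produce the update for each parameter; this sketch swaps that for a single learned step size. The function names, the family of one-dimensional quadratic problems, and every constant are assumptions chosen for illustration. What it keeps is the core trick: unroll a few steps of the inner gradient descent, differentiate the summed loss through those steps, and use that gradient to improve the optimizer.

```python
import random


def unrolled_loss_and_grad(lr, curvature, x0=1.0, steps=5):
    """Run `steps` of gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns the sum of the losses along the trajectory and the derivative
    of that sum with respect to the step size `lr`, obtained by carrying
    dx/dlr forward through the unrolled updates.
    """
    x, dx_dlr = x0, 0.0
    total, dtotal_dlr = 0.0, 0.0
    for _ in range(steps):
        grad = curvature * x              # inner gradient df/dx
        dgrad_dlr = curvature * dx_dlr    # how that gradient moves with lr
        # Inner update x <- x - lr * grad, differentiated with respect to lr.
        x, dx_dlr = x - lr * grad, dx_dlr - grad - lr * dgrad_dlr
        total += 0.5 * curvature * x * x
        dtotal_dlr += curvature * x * dx_dlr
    return total, dtotal_dlr


def meta_train(lr=0.1, meta_lr=0.01, meta_steps=2000, seed=0):
    """Outer loop: gradient descent on the step size of the inner loop."""
    rng = random.Random(seed)
    for _ in range(meta_steps):
        # Hypothetical task distribution: quadratics with random curvature.
        curvature = rng.uniform(0.5, 1.5)
        _, meta_grad = unrolled_loss_and_grad(lr, curvature)
        lr -= meta_lr * meta_grad
    return lr


if __name__ == "__main__":
    learned_lr = meta_train()
    print("learned step size:", learned_lr)
    print("summed loss, initial lr:", unrolled_loss_and_grad(0.1, 1.0)[0])
    print("summed loss, learned lr:", unrolled_loss_and_grad(learned_lr, 1.0)[0])
```

The outer loop only ever sees the loss of the inner loop, which is the sense in which gradient descent is being learned by gradient descent. Replacing the scalar step size with a network that maps each gradient to an update is, roughly, the step from this toy to what the paper does.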

Presumably the next thing to do is to learn to learn to learn by gradient descent by gradient descent by gradient descent, since the LSTM optimizer does not sound especially optimized but there’s this great idea for solving that particular problem. There might even be a clever trick that allows us to extend this to using a level of the network to tune itself, if you choose the right look-ahead for the optimization target.

This is what passes for meta-learning these days. How useful is it? How dangerous is it?

The process seems useful, but finitely useful. In the end, what you are left with is still a gradient descent algorithm that solves the original problem. We can find a better set of parameters, which should improve performance, but only if the problem was already solvable and only so much improvement will be available. In that case, you will find the best solution the system is capable of finding, and find it faster, assuming you don’t waste too much time meta-learning.

If the original problem was previously unsolvable, going into this recursion seems unlikely to turn it into a solvable one. I would be surprised if they could point to a problem and say “without this technique, our best efforts were completely useless, but then we tried meta-descent and now we have something reasonable” rather than “here is a problem we had a reasonable solution for, and here is a better one.”

This means it also does not seem especially dangerous in terms of assembling an AGI, except as a building block that makes the key insights easier to implement. The meta-learning we have is not the true meta-learning.

It does seem dangerous in the sense that narrow AI is dangerous, in that the better we get at sniffing out local maxima of our metrics, without being able to step outside of those metrics, the more we risk those overpowered metrics eating our under-specified goals. Getting the system to output a better number takes over, and we stop thinking. The more I think about such outcomes, the more worrisome they seem to be.

I am just getting started thinking about machine learning and the current state of AI; I was completely out of the loop for several years. For more, and better, thinking along these lines, from someone who has read, talked and thought about these issues a lot more, Sarah Constantin just came out with “Strong AI Isn’t Here Yet.” When I have digested that, I hope to say more.


3 Responses to The AI Paper with The Best Title Ever

  1. Romeo Stevens says:

    God damn it I was assuming this was already being done with the number of levels being itself optimized by an LSTM optimizer. Dibs on I told you so when that gets implemented since apparently it hasn’t yet.

  2. Pingback: Best of Don’t Worry About the Vase | Don't Worry About the Vase

  3. Tim says:

    I’ve seen okay results out of using a genetic algorithm to set learning parameters for hard problems. If the basic model is gradient descent trained, this is likely to leave you with several distinct models of varying quality from the genetic algorithm’s final generation; I suppose you can ensemble them if you’re worried about overfitting.
