
Ilya 30u30 guide UNDER CONSTRUCTION

I left grad school a few years ago in 2021. Right before that, I remember my NLP professor excitedly telling us about these new “transformer” things that had insanely high performance, and being amazed at how I trained my Chromebook to speak German. I (intensely regrettably) went into a multi-year crypto{graphy, currency} wormhole immediately after. As I deprogram myself from that cult and re-enter the real world, I’ve had a bit of catching up to do, and now seems like a good time.

To kick it off, I am reading the “Ilya 30u30”: 30 papers that Ilya Sutskever purportedly says will get you up to the cutting edge of ML in 2024. Link here.

The first piece of good news is that there are only 27 papers! Better still, there’s a significant amount of repetition among them, and together they form a remarkably friendly, focused, and holistic introduction to the field.

In this post, I’m going to summarize each one, extract the key insights, and discuss things I found interesting as I read through them all. I hope it’s educational! I’ve tried to pitch it at the level of an engineer who knows a bit of math.

I’ve also reordered things a bit and added sections to make it friendlier to newcomers who may not have much background in ML.

These papers mostly operate well above the calculus and linear algebra you get in an undergraduate machine learning class: there’s not much backpropagation, matrix multiplication, or taking the derivative of softmax here, and not much discussion of optimizers or batching choices either.

What you do see a lot of are deep statistical intuitions, information-theoretic intuitions (which are sort of the same thing), computability, and generalized insight into the fundamental nature of concepts and space and time and thought and language and knowledge in themselves. The readings I’ve found most useful while diving into this have all been philosophical. As a philosophy hobbyist, I’ve found these papers intensely fun, and I hope you do as well!

Note that this is super under construction… :P I’m much further ahead on reading than I am on writing at this point.

► Information theory, algorithmic complexity theory, and other relevant background

A Tutorial Introduction to the Minimum Description Length Principle (Tutorial/Paper)

Friendly! Go read it!

Kolmogorov Complexity And Algorithmic Randomness from page 434 onwards (Textbook)

Oh this one is way less friendly haha

The First Law of Complexodynamics (Blog post) and Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton (Paper)

TODO God I love Scott.

Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Paper)

TODO

► Generalized Architectures and Techniques (With A Philosophical Bent) (Meatiest Section)

Understanding LSTM Networks (Blog post)

The Unreasonable Effectiveness of Recurrent Neural Networks (Blog)

TODO

Recurrent Neural Network Regularization (Paper)

TODO

Relational recurrent neural networks (Paper)

TODO

A simple neural network module for relational reasoning (Paper)

TODO

Neural Turing Machines (Paper)

TODO

Variational Lossy Autoencoder (Paper)

TODO

Identity Mappings in Deep Residual Networks (Paper)

TODO

Order Matters: Sequence to Sequence for Sets (Paper)

TODO

Pointer Networks (Paper)

TODO

► Techniques and Architectures for Computer Vision

Convolutional Neural Networks for Visual Recognition (Stanford Course Notes)

TODO

Multi-Scale Context Aggregation by Dilated Convolutions (Paper)

TODO

Deep Residual Learning For Image Recognition (Paper)

TODO

ImageNet Classification with Deep Convolutional Neural Networks (Paper)

TODO

► Techniques and Architectures for NLP

Attention Is All You Need (Paper) and The Annotated Transformer (Jupyter Notebook/Blog)

Attention Is All You Need is the 2017 paper that introduced the transformer. The key innovation is right there in the title: attention. A transformer is a model architecture that works over sequences and uses encoders and decoders, but unlike previous sequence architectures it is neither an RNN nor a CNN. Instead, it uses an attention mechanism to take a global view of the input and focus appropriately on individual elements and the connections between them.

The Annotated Transformer is a very useful post to read alongside the transformer paper. It does what it says on the label, walking you through the transformer, alternating between Jupyter snippets of implementation and explanations of what’s going on.

Intuitively, it makes a lot of sense that this works better than an LSTM or GRU, if you just reflect on what happens when you read anything complex. Imagine how difficult a time you’d have if you couldn’t move around freely to focus on different areas of the subject matter, but instead had to move through it rigidly one word at a time.

Key insight: how does a transformer work? (A minimal code sketch of the first two steps follows this list.)

  1. Embed the inputs and add a “positional encoding” feature to them (more about this mechanism later; it’s really nifty)
  2. Perform multi-head attention on the inputs (this picks out what’s important)
  3. …. TODO … finish
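
To make steps 1 and 2 concrete, here’s a minimal NumPy sketch of sinusoidal positional encoding and single-head scaled dot-product attention, the core operation inside multi-head attention. The function names and toy shapes are my own, and it skips the learned Q/K/V projections and the multiple heads, so read it as an illustration of the softmax(QKᵀ/√d_k)·V computation rather than the paper’s reference implementation.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions get sin, odd dimensions get cos."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- each output row is a weighted mix of all value rows."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # (seq_len, d_model)

# Toy usage: 5 token embeddings of width 16, attending over themselves.
seq_len, d_model = 5, 16
x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
out = scaled_dot_product_attention(x, x, x)  # a real transformer projects x into Q, K, V first
print(out.shape)                             # (5, 16)
```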

TODO

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Paper)

TODO

Scaling Laws for Neural Language Models (Paper)

TODO

Neural Machine Translation by Jointly Learning to Align and Translate (Paper)

TODO

► Grab Bag

Machine Super Intelligence (PhD Thesis)

TODO

Neural Message Passing for Quantum Chemistry (Paper)

TODO

GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism (Paper)

TODO