
MIT 6.S191 Deep Learning
Lecture Notes
Lecture 2 - RNNs
What is sequential data? What are some examples of sequential data?
Data that comes one after another.
- Words in a sentence.
- Movement of a circle across a screen, if you see its previous movement.
What are different forms of sequence modelling? What are some use cases for different forms?
| # Inputs | # Outputs | Example |
|---|---|---|
| One | One | Image → Label |
| Many | One | Sentence → Emoji |
| One | Many | Image → Caption (?) |
| Many | Many | Translation |
What are time-steps, and how are they used in neural networks?
Different times for the same type of input, i.e. inputs may be words, but as they occur at different times (in a specific order), that order is relevant:
- “Today was a good day, not bad at all”, is not the same as
- “Today was a bad day, not good at all” xD.
What is a recurrence relation?
The equation that relates the current state to the previous states, combining the operations on previous inputs with the new input at each step.
What would be some pseudocode for a RNN?
```python
# Pseudocode: feed a sentence word-by-word, carrying the hidden state along.
my_rnn = RNN()
h = [0, 0, 0, 0]  # initial hidden state
sentence = ["How", "was", "your"]
for word in sentence:
    prediction, h = my_rnn(word, h)  # update hidden state each step
return prediction  # "Evening" is a good prediction here.
```
What is a mathematical interpretation of a RNN?
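Each step applies the same learned weight matrices to the previous hidden state and the current input:

$$h_t = \tanh\!\big(W_{hh}\, h_{t-1} + W_{xh}\, x_t\big), \qquad \hat{y}_t = W_{hy}\, h_t$$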
How to initialise an RNN in tensorflow?
```python
my_rnn = tf.keras.layers.SimpleRNN(hidden_dim)
```
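A minimal usage sketch (the shapes below are assumptions for illustration, not from the lecture):

```python
import tensorflow as tf

hidden_dim = 8
my_rnn = tf.keras.layers.SimpleRNN(hidden_dim)

# Assumed toy batch: 2 sequences, 3 timesteps, 4 features per timestep.
x = tf.random.normal((2, 3, 4))
h = my_rnn(x)  # final hidden state, shape (2, 8)
```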
What are 4 sequence model design criteria?
- Variable length inputs.
- Long term dependencies are remembered.
- Order is remembered.
- Parameters are shared across the sequence.
How is language encoded for our neural networks?
Can be done with a vocabulary: a list, or ordered set, of words with associated indexes; words are then fed to the network as indices (or one-hot/embedding vectors).
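A tiny sketch with an assumed toy vocabulary:

```python
import tensorflow as tf

# Assumed toy vocabulary mapping words to indices.
vocab = {"how": 0, "was": 1, "your": 2, "day": 3}
sentence = ["how", "was", "your", "day"]
indices = [vocab[w] for w in sentence]  # [0, 1, 2, 3]

# One-hot vectors from those indices, e.g. as input to a small RNN.
one_hot = tf.one_hot(indices, depth=len(vocab))  # shape (4, 4)
```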
What are dependencies, and what does it mean to track them?
Old words earlier in a sentence, for example, or patterns in music introduced early on and potentially reintroduced later with more pizzazz 🎊.
Tracking them means remembering those inputs and the order they came in.
What are some problems with long term dependencies?
They are forgotten about later on in RNNs.
How should weights and biases be initialised to help with vanishing gradient?
Weights initialised to the identity matrix, biases to 0.
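In Keras this can be expressed with standard initializer aliases (a sketch; the layer size is an assumption):

```python
import tensorflow as tf

# Recurrent weights start as the identity matrix, biases at zero.
rnn = tf.keras.layers.SimpleRNN(
    units=64,
    recurrent_initializer="identity",
    bias_initializer="zeros",
)
```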
LSTMs are only covered lightly in this course.
How are LSTMs implemented in tensorflow?
```python
lstm = tf.keras.layers.LSTM(num_units)  # num_units = size of the hidden state
```
What are some limitations of RNNs?
- Encoding bottleneck: the whole input sequence must be squeezed into a single fixed-size hidden state.
- Slow due to no parallelisation.
- Bad long term memory.
What are the pros and cons of using a dense network for sequential modelling?
Pro - no recurrence, so those limitations vanish.
Con - No variable-length inputs.
Con - Input can be very large.
Con - Order is not (necessarily) captured.
What is self attention?
Not too sure…. All I have is intuition and non relatable (bad) math :(
Basically, parse an input into what is important and should be grouped together, against what isn't important.
This is done by first getting a position-aware encoding: each element's embedding combined with a positional encoding, so order information is preserved.
This is done by getting a query, key and value, which are obtained by multiplying our input (the positional embedding $E$) three times by different matrices related to query, key and value:

$$Q = E W^Q, \qquad K = E W^K, \qquad V = E W^V$$

Then compare how similar different elements are by doing:

$$\operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)$$

Then you multiply that with the value matrix to get the output:

$$A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
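A minimal sketch of those steps in TensorFlow (shapes and the random matrices are assumptions; $W^Q$, $W^K$, $W^V$ would be learned in practice):

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # Similarity scores between every query and every key.
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)  # attention weights
    return tf.matmul(weights, v)              # weighted sum of values

# Assumed toy shapes: 5 tokens, model dimension 16.
e = tf.random.normal((5, 16))     # position-aware embeddings
w_q = tf.random.normal((16, 16))  # these three are learned in practice
w_k = tf.random.normal((16, 16))
w_v = tf.random.normal((16, 16))
out = scaled_dot_product_attention(e @ w_q, e @ w_k, e @ w_v)  # (5, 16)
```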
Example with Iron Man: (the lecture's attention-heatmap figure is not captured in these notes)
Lecture 3 - Computer Vision
- Computer vision basics
What are the 2 parts of a CNN?
Feature extraction - The convolutional part.
Classification - The fully connected part, with a softmax at the end.
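A minimal Keras sketch of that two-part structure (layer sizes are assumptions, not from the lecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Feature extraction: convolution + pooling layers.
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    # Classification: fully connected layers with a softmax at the end.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])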
What is a better, but still lacking, approach to locating objects in an image?
Have a heuristic for extracting 'important' regions of an image, then classify each region.
What is Faster R-CNN?
Instead of a heuristic, have a layer (or a few) that extracts important regions of an image. This region-proposal part is learned like any other part of the network.
Lecture 4 - Deep Generative Modelling
What are latent variables?
Important, but unobserved, underlying variables in a model, such as the factors that identify an image.
How can you compress an image with an auto-encoder?
The latent variables produced by an auto-encoder can be thought of as a compressed version of the image, especially if trained to the point of overfitting.
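A minimal sketch of that idea (sizes are assumptions): compress 784-pixel images down to a 32-dimensional latent code and reconstruct them.

```python
import tensorflow as tf

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(32),                      # latent "compressed" code
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])

x = tf.random.uniform((1, 784))  # stand-in for a flattened image
x_hat = decoder(encoder(x))      # reconstruction from the latent code
```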
What is a variational auto-encoder (VAE)?
Adds some randomness: instead of a single latent vector, the encoder outputs a mean and standard deviation for each latent variable, defining a Gaussian distribution over latent vectors.
What is the regularisation term on the latent variables in a VAE?
The KL divergence $D_{KL}\big(q(z|x)\,\|\,p(z)\big)$,
where the prior $p(z)$ is the standard Gaussian $\mathcal{N}(0, 1)$.
What 2 properties of VAEs does regularisation help to achieve?
- Continuity - similar points in latent space get decoded to similar images.
- Generated images are meaningful, AKA not garbage.
How can you overcome the inability to back-propagate through the randomness in the latent variables?
Use the mean and standard deviation of z, plus a little random variable epsilon:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1),$$

where $\epsilon$ is sampled prior to backprop, so gradients can flow through $\mu$ and $\sigma$.
This seems kinda dumb though, because why not randomly fix all random sigma for each training example?
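A sketch of the trick in TF (parameterising $\sigma$ via its log is a common choice; names are assumptions):

```python
import tensorflow as tf

def sample_latent(mu, log_sigma):
    # epsilon is sampled outside the learned quantities,
    # so gradients flow through mu and sigma.
    eps = tf.random.normal(tf.shape(mu))  # fixed w.r.t. backprop
    return mu + tf.exp(log_sigma) * eps   # z = mu + sigma * eps
```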
What is disentanglement? and how is it achieved?
One latent variable may have an impact on multiple semantically relevant properties of an image, and we want each latent variable to have an impact on only one semantically relevant property; this is called disentanglement. It is commonly achieved by up-weighting the KL regularisation term (the β-VAE approach).
What is the structure of a Generative Adversarial Network (GAN) ?
A generative part.
A discriminator part.
We feed a latent variable vector into a generative model, and that generates an image.
We then feed the generated image and a real image to a discriminator, and the discriminator needs to learn how to classify which one is fake.
What is a discriminator and a generator?
A generator makes fake data.
A discriminator aims to find which data is real, given one real and one fake input.
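A sketch of the two corresponding losses, assuming the simple binary cross-entropy formulation (names are illustrative):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # Real images should be classified 1, generated images 0.
    return (bce(tf.ones_like(real_logits), real_logits)
            + bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits):
    # The generator wants its fakes classified as real (1).
    return bce(tf.ones_like(fake_logits), fake_logits)
```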
Unbiasing Paper Summary
Increase the representation of under-represented data by learning the latent features of the dataset and up-weighting the sampling probability of examples whose latent features have low probability.
Lecture 5 - Robust and Trustworthy Deep Learning
What is uncertainty?
When a model cannot make a clear decision.
What is data vs model uncertainty?
Data (aleatoric) uncertainty is noise or ambiguity inherent in the data itself, like a horse shown to a cat-dog classifier.
Model (epistemic) uncertainty is when the model is missing training data in some region, so it can't be confident there.
Lecture 6 - Deep Reinforcement Learning
What are 2 aspects of reinforcement learning and how do they interact?
An actor and an environment.
The actor performs actions in a specific state, and the environment responds with a new state.
What are rewards?
Rewards are what the actor gets back from the environment after each action, indicating how well it is doing.
What is total reward? What is discounted reward?
The total reward $R_t$ is the sum of all rewards earned from timestep $t$ onwards.
Discounted reward is:

$$R_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots$$

where $R_t$ is the total discounted reward at timestep $t$, the $r_i$ are the rewards at each step and $\gamma \in (0, 1)$ is a discounting factor. Rewards at later timesteps are discounted more.
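A small sketch of computing $R_t$ from a reward list (the reward values and $\gamma$ here are made up):

```python
def discounted_returns(rewards, gamma=0.95):
    # Walk the rewards backwards, accumulating the discounted sum.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

discounted_returns([0, 0, 1])  # [0.9025, 0.95, 1.0]
```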
What is the Q function?
A function that gives the expected total (discounted) return for each action, given a state: $Q(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$.
This is learnt by our network.
How does a policy function differ from a Q-function?
A policy function gives the action directly from the state, rather than looking at every possible action.
What are 3 downsides of Q-learning?
- Only handles small, discrete action spaces.
- Cannot easily model continuous action spaces.
- Cannot learn stochastic policies, because taking the best Q-value is deterministic.
How do you train a policy gradient?
The network runs over many timesteps until it eventually loses the game*. Once it loses, we take the second half of the states, actions and rewards and reduce the probability of those actions, and we increase the probability of the actions in the first half. Probability here might be synonymous with reward (thank god for the intuitive picture 🙄).
*Or crash the autonomous vehicle.
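A sketch of the core policy-gradient loss implied above (function and argument names are assumptions): push up the log-probability of actions in proportion to the discounted return that followed them.

```python
import tensorflow as tf

def policy_gradient_loss(log_probs, discounted_returns):
    # log_probs: log pi(a_t | s_t) for the actions actually taken.
    # discounted_returns: R_t at each timestep (small near a crash,
    # larger for early actions).
    return -tf.reduce_mean(log_probs * discounted_returns)
```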
What is a shortcoming of training a policy gradient network?
There may be a lot of timesteps until a crash, and it's also expensive and dangerous to crash real-life cars all the time.
What is the Sim to Real gap?
The gap in performance between a network in a simulated environment and the real environment. Usually the network performs far worse in real life, because real life has far more objects and variability.
Alexander Amini (MIT) and others made VISTA, which is meant to overcome this issue by producing photorealistic training data by slightly changing real-world data.
Lecture 7 - Deep Learning New Frontiers
What are perturbations?
Small changes. In the DL context, it’s small changes to images that may produce completely different labels.
What is a graph convolutional network (GCN)?
Not totally sure. It's a shared weight matrix repeatedly applied to each node and its connections in a graph, aggregating information from neighbours?
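If that guess is right, one layer of the common formulation $H' = \sigma(\hat{A} H W)$ (Kipf & Welling) looks roughly like this sketch (all names are assumptions):

```python
import numpy as np

def gcn_layer(a_hat, h, w):
    # a_hat: normalised adjacency matrix (with self-loops)
    # h: node feature matrix, one row per node
    # w: learned weight matrix shared across all nodes
    return np.maximum(a_hat @ h @ w, 0.0)  # ReLU non-linearity
```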
What are some applications of GCNs?
Traffic prediction. Molecule production.
What are 3 issues with VAEs and GANs?
- Mode collapse - the generator keeps making very similar images.
- Can’t generate unique ideas
- Hard to train
What is a diffusion model?
Generates images from random noise.
How is a diffusion model able to create novel images from noise?
Each starting noise image is different; we interpret them all as noise, but any two noise images are near maximally different from each other, so denoising different starting points yields different, novel images.
Lecture 8 - Text-to-Image Generation
What is a dataset that has 5 Billion text and image pairs?
LAION-5B
How does MUSE differ from Stable Diffusion in terms of upscaling images?
MUSE upscales in the latent space, whereas SD works on the image pixel by pixel.
What is negative prompting?
Specify what you don’t want in the image.
What is the main reason for MUSE’s very fast speed?
Fewer iterations: MUSE uses ~24 where SD may use 50-1000.
Optional Lab - Reinforcement Learning (RL)
What is reinforcement learning?
You've got an agent (a neural network) and an environment (a simulation), and you want the agent to perform well in the environment by choosing actions based on the state it's in. If it performs good actions it gets rewards. Good actions are those that occur in the first half of a trial run before a crash; bad ones happen in the second half. A bit arbitrary, but eh; one might imagine using an exponential function that integrates to 1 over the range of the episode as a scale for good and bad, making the very last few moments bad and the first majority of moments good.
What are 3 stages of deploying an RL model?
- Define the agent and environment
- Define the agent's memory
- Define the learning algorithm (and reward function)
What is an episode?
The span of time(-steps) during which the agent hasn't made some illegal, terminating move.
Lecture 9 - The Modern Era of Statistics
What is a pro and a con of overparametrization?
Pro: improves robustness - deals well with adversarial examples.
Con: worse performance on minority samples.
What is label noise?
Adding some noise to the labels in your training data to help deal with adversarial examples.
Essentially making an adversarial dataset lol.
What is a heuristic for parameter size? (law of robustness)
$$p \approx n \cdot d$$

where
- $p$ is the number of parameters,
- $n$ is the number of examples,
- $d$ is the full dimensionality of each example.
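As a rough worked example of that heuristic (dataset numbers only; the reading of the formula is as above): MNIST has $n = 60{,}000$ examples of $d = 784$ pixels, giving $p \approx n \cdot d \approx 4.7 \times 10^7$ parameters.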
What is effective dimensionality?
Since not all elements in the input represent the image to the same degree, the actual dimensionality is lower*.
Looking at an MNIST digit, pixels near the corners and edges will be a lot less relevant for modelling. When inputs are more complex, it isn't clear what the effective dimensionality is.
*It is the number of pixels that actually make a difference in classification.
What are Continuous time processes?
Going from one layer to another in discrete time is a difference equation*:

$$h_{t+1} = h_t + f(h_t, x_t, \theta)$$

In continuous time, we have

$$\frac{dh(t)}{dt} = f\big(h(t), x(t), \theta\big)$$

This allows for continuous outputs.

*Simplified, but during test time the only new input is $x$, so it is accurate; everything else is a coefficient or a known function.
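A toy sketch of the continuous-time view, using a stand-in $f$ and simple Euler integration (nothing here is from the lecture):

```python
import numpy as np

def f(h, x):
    return np.tanh(h + x)  # stand-in for a learned layer

def integrate(h0, x, t0=0.0, t1=1.0, dt=0.1):
    # Integrate dh/dt = f(h, x) with Euler steps, the continuous-time
    # analogue of stacking discrete layers.
    h, t = h0, t0
    while t < t1:
        h = h + dt * f(h, x)  # one Euler step
        t += dt
    return h

h_final = integrate(np.zeros(4), np.ones(4))
```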
What is a benefit of continuous time processes?
Smoother, more accurate.
Lecture 10 - The Future of Robotics
Cool lecture, no notes though 😂
Labs
Lab 1 and 2 are good fun.
Lab 3 requires the “capsa” package mentioned in the THEMIS AI talk, which doesn’t exist on PyPI when I checked on 22/09/2023.
Optional labs autonomous_driving and pong were actually the same lab, just with different folder names; a pong lab does not exist.
As for the autonomous_driving lab, save_video_of_model didn't work for CartPole (part 1), even after doing some investigation into the mitdeeplearning package and using lab3_old, which was a bit annoying, but the agent itself seemed to train well 🤷.
As for Autonomous Driving with VISTA (part 2), I ended up getting an error in some of their pre-written code related to rendering and inspecting a human trace, so I avoided the rest of the lab.
Further Projects
Referenced in lecture 9.