
MIT 6.S191 Deep Learning
Lecture Notes
Lecture 2 - RNNs
What is sequential data? What are some examples of sequential data?
Data that comes one after another.
- Words in a sentence.
- Movement of a circle across a screen, if you see its previous movement.
What are different forms of sequence modelling? What are some use cases for different forms?
| # Inputs | # Outputs | Example |
|---|---|---|
| One | One | Image → Label |
| Many | One | Sentence → Emoji |
| One | Many | Image → Caption (?) |
| Many | Many | Translation |
What are time-steps, and how are they used in neural networks?
Different times for the same type of input, i.e. inputs may be words, but as they occur at different times (in a specific order), that order is relevant:
- “Today was a good day, not bad at all”, is not the same as
- “Today was a bad day, not good at all” xD.
What is a recurrence relation?
The equation that relates the current state to the previous states, combining the operations on previous inputs with the new input at each step.
What would be some pseudocode for a RNN?
```python
# Pseudocode: feed a sentence word-by-word, carrying the hidden state along.
my_rnn = RNN()
h = [0, 0, 0, 0]  # initial hidden state
sentence = ["How", "was", "your"]
for word in sentence:
    prediction, h = my_rnn(word, h)  # update hidden state each step
return prediction  # "Evening" is a good prediction here.
```
What is a mathematical interpretation of a RNN?
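Each step applies the same learned weight matrices to the previous hidden state and the current input:

$$h_t = \tanh\!\big(W_{hh}\, h_{t-1} + W_{xh}\, x_t\big), \qquad \hat{y}_t = W_{hy}\, h_t$$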
How to initialise an RNN in tensorflow?
```python
my_rnn = tf.keras.layers.SimpleRNN(hidden_dim)
```
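A minimal usage sketch (the shapes below are assumptions for illustration, not from the lecture):

```python
import tensorflow as tf

hidden_dim = 8
my_rnn = tf.keras.layers.SimpleRNN(hidden_dim)

# Assumed toy batch: 2 sequences, 3 timesteps, 4 features per timestep.
x = tf.random.normal((2, 3, 4))
h = my_rnn(x)  # final hidden state, shape (2, 8)
```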
What are 4 sequence model design criteria?
- Variable length inputs.
- Long term dependencies are remembered.
- Order is remembered.
- Parameters are shared across the sequence.
How is language encoded for our neural networks?
Can be done with a vocabulary: a list, or ordered set, of words with associated indexes; words are then fed to the network as indices (or one-hot/embedding vectors).
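A tiny sketch with an assumed toy vocabulary:

```python
import tensorflow as tf

# Assumed toy vocabulary mapping words to indices.
vocab = {"how": 0, "was": 1, "your": 2, "day": 3}
sentence = ["how", "was", "your", "day"]
indices = [vocab[w] for w in sentence]  # [0, 1, 2, 3]

# One-hot vectors from those indices, e.g. as input to a small RNN.
one_hot = tf.one_hot(indices, depth=len(vocab))  # shape (4, 4)
```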
What are dependencies, and what does it mean to track them?
Old words earlier in a sentence, for example, or patterns in music introduced early on and potentially reintroduced later with more pizzazz 🎊.
Tracking them means remembering those inputs and the order they came in.
What are some problems with long term dependencies?
They are forgotten about later on in RNNs.
How should weights and biases be initialised to help with vanishing gradient?
Weights initialised to the identity matrix, biases to 0.
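In Keras this can be expressed with standard initializer aliases (a sketch; the layer size is an assumption):

```python
import tensorflow as tf

# Recurrent weights start as the identity matrix, biases at zero.
rnn = tf.keras.layers.SimpleRNN(
    units=64,
    recurrent_initializer="identity",
    bias_initializer="zeros",
)
```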
LSTMs are only covered lightly in this course.
How are LSTMs implemented in tensorflow?
```python
lstm = tf.keras.layers.LSTM(num_units)  # num_units = size of the hidden state
```
What are some limitations of RNNs?
- Encoding bottleneck: the whole input sequence must be squeezed into a single fixed-size hidden state.
- Slow due to no parallelisation.
- Bad long term memory.
What are the pros and cons of using a dense network for sequential modelling?
Pro - no recurrence, so those limitations vanish.
Con - No variable-length inputs.
Con - Input can be very large.
Con - Order is not (necessarily) captured.
What is self attention?
Not too sure…. All I have is intuition and non relatable (bad) math :(
Basically, parse an input into what is important and should be grouped together, against what isn't important.
This is done by first getting a position-aware encoding: each element's embedding combined with a positional encoding, so order information is preserved.
This is done by getting a query, key and value, which are obtained by multiplying our input (the positional embedding $E$) three times by different matrices related to query, key and value:

$$Q = E W^Q, \qquad K = E W^K, \qquad V = E W^V$$

Then compare how similar different elements are by doing:

$$\operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)$$

Then you multiply that with the value matrix to get the output:

$$A = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
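A minimal sketch of those steps in TensorFlow (shapes and the random matrices are assumptions; $W^Q$, $W^K$, $W^V$ would be learned in practice):

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # Similarity scores between every query and every key.
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)
    weights = tf.nn.softmax(scores, axis=-1)  # attention weights
    return tf.matmul(weights, v)              # weighted sum of values

# Assumed toy shapes: 5 tokens, model dimension 16.
e = tf.random.normal((5, 16))     # position-aware embeddings
w_q = tf.random.normal((16, 16))  # these three are learned in practice
w_k = tf.random.normal((16, 16))
w_v = tf.random.normal((16, 16))
out = scaled_dot_product_attention(e @ w_q, e @ w_k, e @ w_v)  # (5, 16)
```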
Example with Iron Man: (the lecture's attention-heatmap figure is not captured in these notes)
Lecture 3 - Computer Vision
- Computer vision basics
What are the 2 parts of a CNN?
Feature extraction - The convolutional part.
Classification - The fully connected part, with a softmax at the end.
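A minimal Keras sketch of that two-part structure (layer sizes are assumptions, not from the lecture):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Feature extraction: convolution + pooling layers.
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    # Classification: fully connected layers with a softmax at the end.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])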
What is a better, but still lacking, approach to locating objects in an image?
Have a heuristic for extracting 'important' regions of an image, then classify each region.
What is Faster R-CNN?
Instead of a heuristic, have a layer (or a few) that extracts important regions of an image. This region-proposal part is learned like any other part of the network.
Lecture 4 - Deep Generative Modelling
What are latent variables?
Important, but unobserved, underlying variables in a model, such as the factors that identify an image.
How can you compress an image with an auto-encoder?
The latent variables produced by an auto-encoder can be thought of as a compressed version of the image, especially if trained to the point of overfitting.
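A minimal sketch of that idea (sizes are assumptions): compress 784-pixel images down to a 32-dimensional latent code and reconstruct them.

```python
import tensorflow as tf

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(32),                      # latent "compressed" code
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(784, activation="sigmoid"),
])

x = tf.random.uniform((1, 784))  # stand-in for a flattened image
x_hat = decoder(encoder(x))      # reconstruction from the latent code
```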
What is a variational auto-encoder (VAE)?
Adds some randomness: instead of a single latent vector, the encoder outputs a mean and standard deviation for each latent variable, defining a Gaussian distribution over latent vectors.
What is the regularisation term on the latent variables in a VAE?
The KL divergence $D_{KL}\big(q(z|x)\,\|\,p(z)\big)$,
where the prior $p(z)$ is the standard Gaussian $\mathcal{N}(0, 1)$.
What 2 properties of VAEs does regularisation help to achieve?
- Continuity - similar points in latent space get decoded to similar images.
- Generated images are meaningful, AKA not garbage.
How can you overcome the inability to back-propagate through the randomness in the latent variables?
Use the mean and standard deviation of z, plus a little random variable epsilon:

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1),$$

where $\epsilon$ is sampled prior to backprop, so gradients can flow through $\mu$ and $\sigma$.
This seems kinda dumb though, because why not randomly fix all random sigma for each training example?
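A sketch of the trick in TF (parameterising $\sigma$ via its log is a common choice; names are assumptions):

```python
import tensorflow as tf

def sample_latent(mu, log_sigma):
    # epsilon is sampled outside the learned quantities,
    # so gradients flow through mu and sigma.
    eps = tf.random.normal(tf.shape(mu))  # fixed w.r.t. backprop
    return mu + tf.exp(log_sigma) * eps   # z = mu + sigma * eps
```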
What is disentanglement? and how is it achieved?
One latent variable may have an impact on multiple semantically relevant properties of an image, and we want each latent variable to have an impact on only one semantically relevant property; this is called disentanglement. It is commonly achieved by up-weighting the KL regularisation term (the β-VAE approach).
What is the structure of a Generative Adversarial Network (GAN) ?
A generative part.
A discriminator part.
We feed a latent variable vector into a generative model, and that generates an image.
We then feed the generated image and a real image to a discriminator, and the discriminator needs to learn how to classify which one is fake.
What is a discriminator and a generator?
A generator makes fake data.
A discriminator aims to find which data is real, given one real and one fake input.
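A sketch of the two corresponding losses, assuming the simple binary cross-entropy formulation (names are illustrative):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # Real images should be classified 1, generated images 0.
    return (bce(tf.ones_like(real_logits), real_logits)
            + bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits):
    # The generator wants its fakes classified as real (1).
    return bce(tf.ones_like(fake_logits), fake_logits)
```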
Unbiasing Paper Summary
Increase the representation of under-represented data by learning the latent features of the dataset and up-weighting the sampling probability of examples whose latent features have low probability.
Lecture 5 - Robust and Trustworthy Deep Learning
What is uncertainty?
When a model cannot make a clear decision.
What is data vs model uncertainty?
Data (aleatoric) uncertainty is noise or ambiguity inherent in the data itself, like a horse shown to a cat-dog classifier.
Model (epistemic) uncertainty is when the model is missing training data in some region, so it can't be confident there.
Lecture 6 - Deep Reinforcement Learning
What are 2 aspects of reinforcement learning and how do they interact?
An actor and an environment.
The actor performs actions in a specific state, and the environment responds with a new state.
What are rewards?
Rewards are what the actor gets back from the environment after each action, indicating how well it is doing.
What is total reward? What is discounted reward?
The total reward $R_t$ is the sum of all rewards earned from timestep $t$ onwards.
Discounted reward is:

$$R_t = r_t + \gamma\, r_{t+1} + \gamma^2\, r_{t+2} + \cdots$$

where $R_t$ is the total discounted reward at timestep $t$, the $r_i$ are the rewards at each step and $\gamma \in (0, 1)$ is a discounting factor. Rewards at later timesteps are discounted more.
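A small sketch of computing $R_t$ from a reward list (the reward values and $\gamma$ here are made up):

```python
def discounted_returns(rewards, gamma=0.95):
    # Walk the rewards backwards, accumulating the discounted sum.
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

discounted_returns([0, 0, 1])  # [0.9025, 0.95, 1.0]
```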
What is the Q function?
A function that gives the expected total (discounted) return for each action, given a state: $Q(s_t, a_t) = \mathbb{E}[R_t \mid s_t, a_t]$.
This is learnt by our network.
How does a policy function differ from a Q-function?
A policy function gives the action directly from the state, rather than looking at every possible action.
What are 3 downsides of Q-learning?
- Only handles small, discrete action spaces.
- Cannot easily model continuous action spaces.
- Cannot learn stochastic policies, because taking the best Q-value is deterministic.
How do you train a policy gradient?
The network runs over many timesteps until it eventually loses the game*. Once it loses, we take the second half of the states, actions and rewards and reduce the probability of those actions, and we increase the probability of the actions in the first half. Probability here might be synonymous with reward (thank god for the intuitive picture 🙄).
*Or crash the autonomous vehicle.
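A sketch of the core policy-gradient loss implied above (function and argument names are assumptions): push up the log-probability of actions in proportion to the discounted return that followed them.

```python
import tensorflow as tf

def policy_gradient_loss(log_probs, discounted_returns):
    # log_probs: log pi(a_t | s_t) for the actions actually taken.
    # discounted_returns: R_t at each timestep (small near a crash,
    # larger for early actions).
    return -tf.reduce_mean(log_probs * discounted_returns)
```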
What is a shortcoming of training a policy gradient network?
There may be a lot of timesteps until a crash, and it's also expensive and dangerous to crash real-life cars all the time.
What is the Sim to Real gap?
The gap in performance between a network in a simulated environment and the real environment. Usually the network performs far worse in real life, because real life has far more objects and variability.
Alexander Amini (MIT) and others made VISTA, which is meant to overcome this issue by producing photorealistic training data by slightly changing real-world data.
Lecture 7 - Deep Learning New Frontiers
What are perturbations?
Small changes. In the DL context, it’s small changes to images that may produce completely different labels.
What is a graph convolutional network (GCN)?
Not totally sure. It's a shared weight matrix repeatedly applied to each node and its connections in a graph, aggregating information from neighbours?
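If that guess is right, one layer of the common formulation $H' = \sigma(\hat{A} H W)$ (Kipf & Welling) looks roughly like this sketch (all names are assumptions):

```python
import numpy as np

def gcn_layer(a_hat, h, w):
    # a_hat: normalised adjacency matrix (with self-loops)
    # h: node feature matrix, one row per node
    # w: learned weight matrix shared across all nodes
    return np.maximum(a_hat @ h @ w, 0.0)  # ReLU non-linearity
```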
What are some applications of GCNs?
Traffic prediction. Molecule production.
What are 3 issues with VAEs and GANs?
- Mode collapse - the generator keeps making very similar images.
- Can’t generate unique ideas
- Hard to train
What is a diffusion model?
Generates images from random noise.
How is a diffusion model able to create novel images from noise?
Each starting noise image is different; we interpret them all as noise, but any two noise images are near maximally different from each other, so denoising different starting points yields different, novel images.
Lecture 8 - Text-to-Image Generation
What is a dataset that has 5 Billion text and image pairs?
LAION-5B
How does MUSE differ from Stable Diffusion in terms of upscaling images?
MUSE upscales in the latent space, whereas SD works on the image pixel by pixel.
What is negative prompting?
Specify what you don’t want in the image.
What is the main reason for MUSE’s very fast speed?
Fewer iterations: MUSE uses ~24 where SD may use 50-1000.
Optional Lab - Reinforcement Learning (RL)
What is reinforcement learning?
You've got an agent (a neural network) and an environment (a simulation), and you want the agent to perform well in the environment by choosing actions based on the state it's in. If it performs good actions it gets rewards. Good actions are those that occur in the first half of a trial run before a crash; bad ones happen in the second half. A bit arbitrary, but eh; one might imagine using an exponential function that integrates to 1 over the range of the episode as a scale for good and bad, making the very last few moments bad and the first majority of moments good.
What are 3 stages of deploying an RL model?
- Define the agent and environment
- Define the agent's memory
- Define the learning algorithm (and reward function)
What is an episode?
The span of time(-steps) during which the agent hasn't made some illegal, terminating move.
Lecture 9 - The Modern Era of Statistics
What is a pro and a con of overparametrization?
Pro: improves robustness - deals well with adversarial examples.
Con: worse performance on minority samples.
What is label noise?
Adding some noise to the labels in your training data to help deal with adversarial examples.
Essentially making an adversarial dataset lol.
What is a heuristic for parameter size? (law of robustness)
$$p \approx n \cdot d$$

where
- $p$ is the number of parameters,
- $n$ is the number of examples,
- $d$ is the full dimensionality of each example.
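As a rough worked example of that heuristic (dataset numbers only; the reading of the formula is as above): MNIST has $n = 60{,}000$ examples of $d = 784$ pixels, giving $p \approx n \cdot d \approx 4.7 \times 10^7$ parameters.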
What is effective dimensionality?
Since not all elements in the input represent the image to the same degree, the actual dimensionality is lower*.
Looking at an MNIST digit, pixels near the corners and edges will be a lot less relevant for modelling. When inputs are more complex, it isn't clear what the effective dimensionality is.
*It is the number of pixels that actually make a difference in classification.
What are Continuous time processes?
Going from one layer to another in discrete time is a difference equation*:

$$h_{t+1} = h_t + f(h_t, x_t, \theta)$$

In continuous time, we have

$$\frac{dh(t)}{dt} = f\big(h(t), x(t), \theta\big)$$

This allows for continuous outputs.

*Simplified, but during test time the only new input is $x$, so it is accurate; everything else is a coefficient or a known function.
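A toy sketch of the continuous-time view, using a stand-in $f$ and simple Euler integration (nothing here is from the lecture):

```python
import numpy as np

def f(h, x):
    return np.tanh(h + x)  # stand-in for a learned layer

def integrate(h0, x, t0=0.0, t1=1.0, dt=0.1):
    # Integrate dh/dt = f(h, x) with Euler steps, the continuous-time
    # analogue of stacking discrete layers.
    h, t = h0, t0
    while t < t1:
        h = h + dt * f(h, x)  # one Euler step
        t += dt
    return h

h_final = integrate(np.zeros(4), np.ones(4))
```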
What is a benefit of continuous time processes?
Smoother, more accurate.
Lecture 10 - The Future of Robotics
Cool lecture, no notes though 😂
Labs
Lab 1 and 2 are good fun.
Lab 3 requires the “capsa” package mentioned in the THEMIS AI talk, which doesn’t exist on PyPI when I checked on 22/09/2023.
Optional labs autonomous_driving and pong were actually the same lab, just with different folder names; a pong lab does not exist.
As for the autonomous_driving lab, save_video_of_model didn't work for CartPole (part 1), even after doing some investigation into the mitdeeplearning package and using lab3_old, which was a bit annoying, but the agent itself seemed to train well 🤷.
As for Autonomous Driving with VISTA (part 2), I ended up getting an error in some of their pre-written code related to rendering and inspecting a human trace, so I avoided the rest of the lab.
Further Projects
Referenced in lecture 9.