Eric's Notes

Design by adaptive sampling

PDF: https://arxiv.org/pdf/1810.03714.pdf

Even after a few months, the paper still feels dense to digest. However, I think I have finally grokked it.

Setup:

We have an oracle: $p(y|x)$. (a.k.a. property prediction model)
Oracle can help us compute probability of a set $S$ of data.
- In the case of maximization, $S$ is the set of values $y$ s.t. $y \geq y_{max}$, where $y_{max}$ is given by the $x$ that maximizes the expectation of $y$ under $p(y|x)$.
- For practical settings, we instead want S such that $y \geq \gamma$, where $\gamma \leq y_{max}$.
- In the case of specification, $S$ is the set of values $y$ s.t. $y = y_{target}$ (more strictly speaking, taking on infinitesimal values around it).
- The conditional probability of the set $S$, i.e. "the probability that our property desideratum is satisfied for a given input", is $P(S|x) \equiv P(Y \in S|x) = \int p(y|x) I_{S}(y) dy$.
  - $I_{S}(y)$ is an indicator function for whether $y \in S$, takes on $1$ if yes, or $0$ otherwise.
- When we do thresholding, i.e. greater than a threshold, $P(S|x) = p(y \geq \gamma|x) = 1 - CDF(x, \gamma)$. (I think this in practice is just the empirical cumulative density function? Since it is the probability of $y$ greater than the threshold $\gamma$, conditioned on observed input data $x$).

Approach: Design input to satisfy desired property.

$S$ is the set of property values that satisfy what we want.
We want to maximize the expected probability that our desideratum is satisfied, where expectation is performed over a generative model distribution $p(x|\theta)$, like a VAE.
This gives rise to us optimizing $\theta$ w.r.t. the desirderatum.
We want $\hat{\theta} = \underset{\theta}{argmax} \log P(S|\theta)$, i.e. the $\theta$ that maximizes the log probability of S.
- This is the theta that maximizes the log probability of S.
In the paper, a few problems are mentioned that I don't fully understand. However, we do arrive at this point where the optimization problem that we want to solve is:
- $\theta^{(t+1)} = \underset{\theta}{argmax} \sum_{i=1}^M P(S^{(t)}|x_i^{(t)})\log p(x_i^{(t)}|\theta)$
- In English: At each iteration $t$, we want to the $\theta$ that improves the sum of (the likelihood of observing set $S$ given the sample input set $x$ drawn times the likelihood of observing the input set generated from the generative model parameters $\theta$).
- Key idea here: we must get the set $S$ out such that the desired property is non-vanishing. How so? Just sample from empirical distribution then.

How can we implement this using dummy data?

It's probably most instructive if we start with an $x^2$ model, from which we know ground truth and want to find the maxima adaptively.

Generative model for $x$: $x \sim N(\mu, \sigma)$. Here, $(\mu, \sigma) = \theta$, the set of parameters we want to optimize.
Designing the inputs means changing $\mu$ and $\sigma$, such that we continue to generate higher values of $x$ that maximize the $x^3$ function.

index

This is the landing page for my notes.

This is 100% inspired by Andy Matuschak's famous notes page. I'm not technically skilled enough to replicate the full "Andy Mode", though, so I just did some simple hacks. If you're curious how these notes compiled, check out the summary in How these notes are made into HTML pages.

This is my "notes garden". I tend to it on a daily basis, and it contains some of my less fully-formed thoughts. Nothing here is intended to be cited, as the link structure evolves over time. The notes are best viewed on a desktop/laptop computer, because of the use of hovers for previews.

There's no formal "navigation", or "search" for these pages. To go somewhere, click on any of the "high-level" notes below, and enjoy.

Is Transfer Learning Necessary for Protein Landscape Prediction

URL: https://arxiv.org/abs/2011.03443

My key summary of ideas in the paper:

The models benchmarked in TAPE are cool and all, but there are simpler models that can outperform these models in learning tasks.

We find that relatively shallow CNN encoders (1-layer for fluorescence, 3-layer for stability) can compete with and even outperform the models benchmarked in TAPE. For the fluorescence task, in particular, a simple linear regression model trained on full one-hot encodings outperforms our models and the TAPE models. Additionally, 2-layer CNN models offer competitive performance with Rives et al.’s ESM (evolutionary scale modeling) transformer models on β-lactamase variant activity prediction.

While TAPE’s benchmarking argued that pretraining improves the performance of language models on downstream landscape prediction tasks, our results show that small supervised models can, in a fraction of the time and compute required for semi-supervised models, achieve competitive performance on the same tasks.

So... the use of pre-training in big ML models is premised on this idea: "conditioned on us deciding that we want to use language models, pre-training is necessary to improve activity". However, this paper is saying, "you don't even have to use an overpowered language model, for a large fraction of tasks, you can just use a simpler CNN".

Model architectures:

From the text: Our supervised models only rely on 1-D convolution layers, dense layers, and ReLU activations.

The results presented in the paper do seem to suggest that empirically, these large language models aren't necessary. (see: Large models might not be necessary)

We see that relatively simple and small CNN models trained entirely with supervised learning for fluorescence or stability prediction compete with and outperform the semi-supervised models benchmarked in TAPE [12], despite requiring substantially less time and compute.

Dataset Distillation

URL: https://arxiv.org/abs/1811.10959

Notes:

Unlike these approaches, we are interested in understanding the intrinsic properties of the training data rather than a specific trained model.

Sections of the paper:

Train a network with fixed initialization using one grad descent step.
Use randomly initialized weights (more challenging)
Linear network -- understand limitations
Use more gradient steps
How to obtain distilled images using different initializations.

random note to self: This is a paper where the goal fits into the paradigm of Input design.

Algorithm

The algorithm (with some paraphrasing for myself)

Firstly, the required inputs that are of interest:

A prediction model $f(x, \theta)$ for the learning task at hand. $x$ is the actual data, and $\theta$ are its parameters.
A distribution for initial weights, $p(\theta_0)$.
The desired number of distilled data points, $M$.

Secondly, some other parameters that may be of interest:

Step size, $\alpha$
Batch size, $n$, such that $n \lt\lt M$.
Number of optimization iterations, $T$
$\tilde{\eta}{0}$, i.e. the initial value for $\tilde{\eta}$, which is a target learning rate for learning on the _distilled, synthetic data.

The algorithm in words, translated:

Initialize $\tilde{x}$, the distilled input data, using a random matrix.
Set $\eta$ to $\tilde{\eta}_0$.
For each training step $t=1$ to $T$:
1. Get a batch of $n$ data points from $x$. We call this $\mathbf{x_t}$, and in the paper, it is given the notation $\mathbf{x_t} = {x_{t,j} }_{j=1}^{n}$
2. Draw a batch of initial weights from $p(\theta_0)$, call it $\theta_{0}^{(j)}$. Given the notation in the paper, this likely is "sample $j$ weight draws".
3. For each sampled weight $\theta_{0}^{(j)}$:
  1. Update $\theta_{1}^{(j)}$ such by doing one step in the negative direction of the gradient of the loss function using the distilled data... $\theta_1^{(j)} = \theta_0^{(j)} - \tilde{\eta} \nabla_{\theta_0^{(j)}} l(\tilde{x}, \theta_0^{(j)})$.
  2. Now evaluate the new loss on actual data $L^{(j)} = l( \mathbf{x_t}, \theta_{1}^{(j)})$
4. Then, update the distilled data values $\tilde{x}$ and $\tilde{\eta}$:
  1. $\tilde{x} \leftarrow \tilde{x} - \alpha \nabla_{\tilde{x}} \sum_{j} L^{(j)}$: "update distilled data in the negative direction of the gradient of sum of losses over all sampled weights in $\theta_t^{(j)}$"
  2. $\tilde{\eta} \leftarrow \tilde{\eta} - \alpha \nabla_{\tilde{\eta}} \sum_{j} L^{(j)}$: same semantic meaning, except we do it on $\tilde{\eta}$, the learning rate to use when using distilled data to train the model.

It seems like this is going to be a paper I need to take a second look over.

In any case, some ideas to think about:

We can retrain UniRep by distilling protein space, maybe? Doing so might make retraining really easy, and we yield a set of weights that are free from the license restrictions?
What about distilling graph data? How easy would this be?

Why I'm so interested in dataset distillation

It's a neat idea for a few reasons.

The biggest reason: compression + fast training. I am tired of waiting for large models to finish training. If we can distill a dataset to its essentials, then we should be able to train large models faster.

Another ancillary reason: intellectual interest! There's an interesting question for me: what is the smallest dataset that is necessary for a model to explain the data?

What would dataset distillation in a quadratic model look like?

Some things I'd expect:

The learned $x$ should collapse to a few points? Like it should be able to get a good model representative of the training data, using just two points.
- We might not get such an extreme distillation, maybe?
On the other hand, because of the use of distribution draws for parameters, I can see how we would end up marginalizing over all possible pairs of "inducing" points
Oh wait, we actually get to define the number of inducing points!

The setup: We set up a bunch of training data points along the x-axis line, and create noisy $y$ outputs for the function of interest, in this case, a quadratic model. We should be able to distill points to include:

The hump/bowl
Points on both sides.

Then, we try to distill $x$ to a small set of $M = 10$ points.

Pages that link here

Algorithm

Why I'm so interested in dataset distillation

What would dataset distillation in a quadratic model look like?