Scholarly readings
Notes on papers that I'm reading
Design by adaptive sampling
PDF: https://arxiv.org/pdf/1810.03714.pdf
Even after a few months, the paper still feels dense and hard to digest. However, I think I have finally grokked it.
Setup:
Approach: design an input that satisfies a desired property.
How can we implement this using dummy data?
It's probably most instructive if we start with an $x^2$ model, for which we know the ground truth, and try to find the maximum adaptively.
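Here's a minimal sketch of that toy experiment. I'm swapping in a CEM-style hard cutoff for the paper's probabilistic weighting, and using $f(x) = -(x - 2)^2$ as the oracle so there's a single known maximum; the oracle and all hyperparameters below are my choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)


def oracle(x):
    """Toy ground truth: a quadratic with its single maximum at x = 2."""
    return -((x - 2.0) ** 2)


# Start from a broad Gaussian proposal over candidate designs x.
mu, sigma = 0.0, 5.0

for t in range(20):
    # 1. Sample candidate designs from the current proposal.
    x = rng.normal(mu, sigma, size=200)
    scores = oracle(x)
    # 2. Keep the top 10% of candidates by score (the "elite" set).
    elites = x[scores >= np.quantile(scores, 0.9)]
    # 3. Refit the proposal to the elites, shifting it toward high scorers.
    mu, sigma = elites.mean(), elites.std() + 1e-3

print(f"mu = {mu:.3f}")  # should land near the known maximum at x = 2
```

Each round, the proposal distribution concentrates around inputs that the oracle scores highly; replacing the hard cutoff with soft sample weights would get closer to the paper's actual scheme.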
index
This is the landing page for my notes.
This is 100% inspired by Andy Matuschak's famous notes page. I'm not technically skilled enough to replicate the full "Andy Mode", though, so I just did some simple hacks. If you're curious how these notes are compiled, check out the summary in How these notes are made into HTML pages.
This is my "notes garden". I tend to it on a daily basis, and it contains some of my less fully-formed thoughts. Nothing here is intended to be cited, as the link structure evolves over time. The notes are best viewed on a desktop/laptop computer, because of the use of hovers for previews.
There's no formal "navigation" or "search" for these pages. To go somewhere, click on any of the "high-level" notes below, and enjoy.
Is Transfer Learning Necessary for Protein Landscape Prediction?
URL: https://arxiv.org/abs/2011.03443
My key summary of ideas in the paper:
The models benchmarked in TAPE are cool and all, but there are simpler models that can outperform them on the same learning tasks.
We find that relatively shallow CNN encoders (1-layer for fluorescence, 3-layer for stability) can compete with and even outperform the models benchmarked in TAPE. For the fluorescence task, in particular, a simple linear regression model trained on full one-hot encodings outperforms our models and the TAPE models. Additionally, 2-layer CNN models offer competitive performance with Rives et al.’s ESM (evolutionary scale modeling) transformer models on β-lactamase variant activity prediction.
While TAPE’s benchmarking argued that pretraining improves the performance of language models on downstream landscape prediction tasks, our results show that small supervised models can, in a fraction of the time and compute required for semi-supervised models, achieve competitive performance on the same tasks.
So... the use of pre-training in big ML models is premised on this idea: "conditioned on us deciding that we want to use language models, pre-training is necessary to improve downstream performance". However, this paper is saying, "you don't even have to use an overpowered language model; for a large fraction of tasks, you can just use a simpler CNN".
Model architectures:
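To make this concrete, here is a rough sketch of what a shallow 1-D CNN encoder over one-hot encoded sequences might look like; the filter count, kernel size, and pooling choices are my guesses, not the paper's exact values.

```python
import torch
import torch.nn as nn


class ShallowCNN(nn.Module):
    """A 1-layer 1-D CNN over one-hot encoded sequences, loosely in the
    spirit of the paper's shallow encoders (hyperparameters are guesses)."""

    def __init__(self, n_tokens=20, n_filters=64, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(n_tokens, n_filters, kernel_size,
                              padding=kernel_size // 2)
        self.head = nn.Linear(n_filters, 1)

    def forward(self, x):  # x: (batch, n_tokens, seq_len), one-hot
        h = torch.relu(self.conv(x))
        h = h.mean(dim=-1)  # average-pool across sequence positions
        return self.head(h).squeeze(-1)  # scalar landscape prediction


model = ShallowCNN()
x = torch.randn(8, 20, 237)  # stand-in for a batch of one-hot sequences
print(model(x).shape)  # torch.Size([8])
```

The one-hot linear regression baseline mentioned above is simpler still: flatten the one-hot matrix and fit an ordinary linear model on it.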
The results presented in the paper do seem to suggest that empirically, these large language models aren't necessary. (see: Large models might not be necessary)
We see that relatively simple and small CNN models trained entirely with supervised learning for fluorescence or stability prediction compete with and outperform the semi-supervised models benchmarked in TAPE [12], despite requiring substantially less time and compute.
Dataset Distillation
URL: https://arxiv.org/abs/1811.10959
Notes:
Unlike these approaches, we are interested in understanding the intrinsic properties of the training data rather than a specific trained model.
Sections of the paper:
random note to self: This is a paper where the goal fits into the paradigm of Input design.
The algorithm (with some paraphrasing for myself)
Firstly, the required inputs that are of interest: the real training data, plus a distribution over initial model weights to sample from.
Secondly, some other parameters that may be of interest: the number of distilled points $M$, the inner-loop learning rate (itself learnable in the paper), and the number of inner gradient steps.
The algorithm in words, translated: initialize the synthetic points, then repeatedly (1) sample fresh model weights, (2) take a few gradient steps on the synthetic data, (3) measure the loss of the resulting model on the real data, and (4) backpropagate that real-data loss into the synthetic points themselves.
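A minimal sketch of that bilevel step in PyTorch, assuming a toy quadratic-feature model and a single inner gradient step; `distill_step` and all constants here are my names and choices, not the paper's.

```python
import torch


def model(w, x):
    # Tiny model: f(x) = w0 + w1*x + w2*x^2, i.e. linear in quadratic features.
    feats = torch.stack([torch.ones_like(x), x, x ** 2], dim=-1)
    return feats @ w


def distill_step(x_syn, y_syn, x_real, y_real, inner_lr=0.01):
    # 1. Sample fresh initial weights (the paper re-samples each outer step).
    w0 = torch.randn(3, requires_grad=True)
    # 2. One gradient step on the *synthetic* data. create_graph=True keeps
    #    this update differentiable w.r.t. the synthetic points.
    inner_loss = ((model(w0, x_syn) - y_syn) ** 2).mean()
    (g,) = torch.autograd.grad(inner_loss, w0, create_graph=True)
    w1 = w0 - inner_lr * g
    # 3. Evaluate the updated weights on the *real* data; this outer loss is
    #    what we differentiate w.r.t. x_syn and y_syn.
    return ((model(w1, x_real) - y_real) ** 2).mean()
```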
It seems like this is going to be a paper I need to take a second look at.
In any case, some ideas to think about:
It's a neat idea for a few reasons.
The biggest reason: compression + fast training. I am tired of waiting for large models to finish training. If we can distill a dataset to its essentials, then we should be able to train large models faster.
An ancillary reason: intellectual interest! There's an interesting question for me: what is the smallest dataset necessary for a model to explain the data?
Some things I'd expect:
The setup: we lay out a bunch of training data points along the x-axis and create noisy $y$ outputs from the function of interest, in this case a quadratic model. We should be able to distill these down to a small representative set; concretely, we try to distill $x$ to $M = 10$ points.
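Reusing `model` and `distill_step` from the sketch above, the toy setup might look like this; the grid range, noise level, and optimizer settings are my choices.

```python
torch.manual_seed(0)

# Noisy quadratic training data along the x-axis.
x_real = torch.linspace(-3.0, 3.0, 200)
y_real = x_real ** 2 + 0.3 * torch.randn(200)

# M = 10 learnable synthetic points, initialized along the same grid.
M = 10
x_syn = torch.linspace(-3.0, 3.0, M).requires_grad_()
y_syn = torch.zeros(M, requires_grad=True)

opt = torch.optim.Adam([x_syn, y_syn], lr=0.05)
for step in range(2000):
    opt.zero_grad()
    loss = distill_step(x_syn, y_syn, x_real, y_real)
    loss.backward()
    opt.step()

print(x_syn.detach())  # where the 10 distilled x-points ended up
print(y_syn.detach())  # and the learned y-values at those points
```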