Large models might not be necessary

I suspect this is a recurring pattern in machine learning: large models may not be necessary.

Is Transfer Learning Necessary for Protein Landscape Prediction?

URL: https://arxiv.org/abs/2011.03443

My key summary of ideas in the paper:

The models benchmarked in TAPE are cool and all, but simpler models can match or outperform them on the same landscape prediction tasks.

We find that relatively shallow CNN encoders (1-layer for fluorescence, 3-layer for stability) can compete with and even outperform the models benchmarked in TAPE. For the fluorescence task, in particular, a simple linear regression model trained on full one-hot encodings outperforms our models and the TAPE models. Additionally, 2-layer CNN models offer competitive performance with Rives et al.’s ESM (evolutionary scale modeling) transformer models on β-lactamase variant activity prediction.
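To make the one-hot baseline concrete, here's a minimal sketch (my own, not from the paper) of "linear regression trained on full one-hot encodings" for a fluorescence-style regression task. The amino-acid alphabet, toy data, and the scikit-learn Ridge choice are all my assumptions.

```python
# Hypothetical sketch of the "linear regression on full one-hot encodings" baseline.
# Alphabet, toy data, and the Ridge regularizer are assumptions, not paper details.
import numpy as np
from sklearn.linear_model import Ridge

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_IDX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq: str, max_len: int) -> np.ndarray:
    """Flatten a protein sequence into a (max_len * 20) one-hot vector."""
    x = np.zeros((max_len, len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq[:max_len]):
        x[pos, AA_TO_IDX[aa]] = 1.0
    return x.reshape(-1)

# toy (sequence, fluorescence) pairs
train = [("MKTAYIAK", 1.2), ("MKTAYIAR", 0.9), ("MKSAYIAK", 1.4)]
max_len = max(len(s) for s, _ in train)

X = np.stack([one_hot_encode(s, max_len) for s, _ in train])
y = np.array([label for _, label in train])

model = Ridge(alpha=1.0)  # plain least squares also works; Ridge just adds regularization
model.fit(X, y)
print(model.predict(X))
```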

While TAPE’s benchmarking argued that pretraining improves the performance of language models on downstream landscape prediction tasks, our results show that small supervised models can, in a fraction of the time and compute required for semi-supervised models, achieve competitive performance on the same tasks.

So... the case for pre-training in big ML models rests on this premise: "conditioned on us deciding that we want to use language models, pre-training is necessary to improve downstream performance." However, this paper is saying: "for a large fraction of tasks, you don't even need an overpowered language model; a simpler CNN will do."

Model architectures:

  • From the text: Our supervised models only rely on 1-D convolution layers, dense layers, and ReLU activations.
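To picture what that architecture looks like, here's a minimal sketch of a shallow 1-D CNN regressor in the spirit of the paper's supervised models (1-D convolutions, dense layers, ReLU). The channel counts, kernel size, and pooling choice are my assumptions, not the paper's hyperparameters.

```python
# Minimal sketch of a shallow 1-D CNN regressor: 1-D convolution, dense layers, ReLU.
# Channel counts, kernel size, and average pooling are assumptions, not the paper's settings.
import torch
import torch.nn as nn

class ShallowCNN(nn.Module):
    def __init__(self, vocab_size: int = 20, channels: int = 64, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(vocab_size, channels, kernel_size, padding=kernel_size // 2)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(channels, channels),
            nn.ReLU(),
            nn.Linear(channels, 1),  # scalar landscape prediction (e.g. fluorescence or stability)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, vocab_size, seq_len) one-hot encoded sequences
        h = self.conv(x)                 # (batch, channels, seq_len)
        h = h.mean(dim=-1)               # average-pool over sequence positions
        return self.head(h).squeeze(-1)  # (batch,)

# toy usage
model = ShallowCNN()
dummy = torch.zeros(2, 20, 100)  # batch of 2 one-hot sequences of length 100
print(model(dummy).shape)        # torch.Size([2])
```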

The results presented in the paper do seem to suggest that, empirically, these large language models aren't necessary for these tasks. (see: Large models might not be necessary)

We see that relatively simple and small CNN models trained entirely with supervised learning for fluorescence or stability prediction compete with and outperform the semi-supervised models benchmarked in TAPE [12], despite requiring substantially less time and compute.