This protocol describes the computational steps necessary to reproduce the results described in the paper "Unified rational protein engineering with sequence-only deep representation learning" by Alley et al.
Method Article
Unified rational protein engineering with sequence-based deep representation learning
https://doi.org/10.21203/rs.2.13774/v1
This work is licensed under a CC BY 4.0 License
published 21 Oct, 2019
Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. Here we provide a protocol for reproducing these results.
No reagents necessary
Preferably an AWS m5.12xlarge or m5.24xlarge instance running the Ubuntu Server 18.04 LTS AMI (for example, ami-0f65671a86f061fcd).
Code and dependencies described under "Requirements" in https://github.com/churchlab/UniRep-analysis
1. Clone the repository containing the code with ```git clone https://github.com/churchlab/UniRep-analysis.git```
2. Download and unzip the data using the bash commands under "Getting the data" in the repository README
3. Reproduce the figures and retrain the top models by running the IPython notebooks and Python scripts as described under "Usage" in the repository README
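For scripted setups, step 1 above can be wrapped so that it is safe to re-run. The following is a minimal sketch, not part of the repository: the function names and the default destination folder `UniRep-analysis` are assumptions made for illustration, and only the repository URL comes from this protocol. Steps 2 and 3 are not sketched here because their exact commands are given in the repository README.

```python
import subprocess
from pathlib import Path

# Repository URL taken from step 1 of the procedure.
REPO_URL = "https://github.com/churchlab/UniRep-analysis.git"


def clone_command(url=REPO_URL, dest="UniRep-analysis"):
    """Build the `git clone` command from step 1 as an argument list."""
    return ["git", "clone", url, str(dest)]


def clone_if_missing(url=REPO_URL, dest="UniRep-analysis"):
    """Clone the repository unless the destination already exists.

    `dest` is a hypothetical default; any writable location works.
    Returns True if a clone was run, False if it was skipped.
    """
    if Path(dest).exists():
        return False
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(clone_command(url, dest), check=True)
    return True
```

Calling `clone_if_missing()` from the intended working directory runs the clone once and skips it on later runs.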
1. Check that the requirements are in place (see "Requirements" in the repository README)
2. Make sure the path to the data folder is correct and accessible
3. Reach out for assistance
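The first two checks above can be automated before launching the notebooks. The sketch below is an illustration under stated assumptions rather than a tool shipped with the repository: the helper names, the example package names, and the `./data` path are all hypothetical placeholders, to be replaced with the packages listed under "Requirements" and the data location actually used.

```python
import importlib.util
import os
from pathlib import Path


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def check_data_dir(path):
    """Return a list of problems with the data folder (empty if none found)."""
    p = Path(path).expanduser()
    problems = []
    if not p.is_dir():
        problems.append(f"{p} does not exist or is not a directory")
    elif not os.access(p, os.R_OK | os.X_OK):
        problems.append(f"{p} is not readable by the current user")
    elif not any(p.iterdir()):
        problems.append(f"{p} is empty; was the data downloaded and unzipped?")
    return problems


if __name__ == "__main__":
    # Hypothetical placeholders, NOT the repository's actual requirements
    # or data path; substitute the values from the README.
    required = ["numpy", "pandas"]
    data_dir = "./data"

    for name in missing_modules(required):
        print(f"missing package: {name}")
    for problem in check_data_dir(data_dir):
        print(f"data folder problem: {problem}")
```

If the script prints nothing, both checks passed for the values supplied. It only confirms that packages are importable and that the folder is present and non-empty, not that versions or file contents are correct.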
<1 hour for regenerating figures
~7 hours for retraining top models and recomputing metrics
A detailed description of the results is available at https://www.biorxiv.org/content/10.1101/589333v1
Pre-print of Alley et al. "Unified rational protein engineering with sequence-only deep representation learning" available at https://www.biorxiv.org/content/10.1101/589333v1
E.C.A., G.K., and S.B. are in the process of pursuing a patent on this technology. S.B. is a former consultant for Flagship Pioneering company VL57. A full list of G.M.C.’s tech transfer, advisory roles, and funding sources can be found on the lab’s website: http://arep.med.harvard.edu/gmc/tech.html