Easy and accurate protein structure prediction using ColabFold

doi:10.21203/rs.3.pex-2490/v1

Method Article

This work is licensed under a CC BY 4.0 License

This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 1

Since its public release in 2021, AlphaFold2 (AF2) has made investigating biological questions, using predicted protein structures of single monomers or full complexes, a common practice. ColabFold-AF2 is an open-source Jupyter Notebook inside Google Colaboratory and a command-line tool, which makes it easy to use AF2, while exposing its advanced options. ColabFold-AF2 shortens turn-around times of experiments due to its optimized usage of AF2’s models. In this protocol, we guide the reader through ColabFold best-practices using three scenarios: (1) monomer prediction, (2) complex prediction, and (3) conformation sampling. The first two scenarios cover classic static structure prediction and are demonstrated on the human glycosylphosphatidylinositol transamidase (GPIT) protein. The third scenario demonstrates an alternative use-case of the AF2 models by predicting two conformations of the human Alanine Serine Transporter 2 (ASCT2). Users can run the protocol without command-line knowledge via Google Colaboratory or in a command-line environment. The protocol is available at https://protocol.colabfold.com.

Protein prediction models and ColabFold

Predicting the 3D structure of a protein from its sequence alone has long been a formidable task in the field of structural biology. The progress of machine learning models has made significant strides towards achieving this goal. AlphaFold2 (AF2)¹ and then RoseTTAFold^2,3 represent these groundbreaking models. For the first time, they offer computational methods capable of producing protein structure predictions nearly indistinguishable from experimentally-solved structures, given sufficient sequence information. Specifically, AF2 is an end-to-end neural network, composed of two main modules. The first module processes the information about the input amino-acid (AA) sequence (query) and generates hypotheses about which AAs are in contact with one another. The second module aggregates these hypotheses to predict the structure, i.e., the 3D coordinates for each of the atoms. Two of the key ideas behind the AF2 network are the use of the deep-learning attention mechanism⁴, which allows the network to better identify AAs in contact, and an iterative refinement of the prediction by passing the intermediate output of the two modules to the first module. Similar principles guided the design of RoseTTAFold, resulting in a different network architecture, whose exceptional accuracy is second only to AF2.

AF2 was first designed to predict the structure of a single protein chain. However, its models have been successfully used to predict interactions between multiple chains or complexes^5–7. Additionally, the AF2 model has been further developed and trained specifically on multimeric input, resulting in AlphaFold-multimer⁸. Other computational models, which use the AF2 structure module, have since been developed^9–11.

ColabFold⁷ is an integrated protein prediction solution, aimed at simplifying the process of structure modeling for the user. As such, it offers both an easy interface to various protein prediction models as well as pre- and post-processing procedures. ColabFold has two interfaces: web-based, utilizing Google Colaboratory notebooks (in short and hereafter: Colab) and command-line tools. Its web-based interface includes five notebooks: AlphaFold2.ipynb (for using either AF2 or AF2-multimer), RoseTTAFold.ipynb, RoseTTAFold2.ipynb, ESMFold.ipynb and OmegaFold.ipynb. The web-based interface requires free registration to Colab and is primarily designed for making a single or a small batch prediction. The command-line interface includes only the AF2 and AF2-multimer prediction models and allows for batch predictions by processing multiple input sequences. Since the AF2 models are the most accurate currently published ones, this protocol is focused on AlphaFold2.ipynb and the command-line, which are jointly denoted as ColabFold-AF2.

A key input to protein prediction models is a Multiple Sequence Alignment (MSA), a collection of sequences that share some degree of similarity (homology) with the query in a way that informs of their evolutionary relationships at the residue level, facilitating the identification of conserved regions and co-evolutionary patterns across different species. ColabFold employs a custom MMseqs2^12,13 homology search server for fast and sensitive MSA generation, which is 40-60 times faster than other tools⁷, facilitating the prediction of hundreds of structures a day. Increasing the number of homologs in the MSA and their diversity by searching against large environmental databases has been shown to improve prediction accuracy^14,15. Therefore, by default, the ColabFold MSA server utilizes two databases: UniRef¹⁶ and ColabFoldDB⁷, a large environmental database of over 700 million sequences, constructed from various metagenomic resources. Owing to this publicly-available MSA server, ColabFold users avoid the need for storing such large databases locally on their computer.

Another input to ColabFold is a set of templates: known protein structures with sequence similarity to the query. To detect templates, ColabFold first computes a smaller MSA for the query using only UniRef homologs and, from that MSA, a position-specific scoring matrix (PSSM). This PSSM can be thought of as a generalization of the query and is used to search the Protein Data Bank (PDB)¹⁷, a communal resource for solved structures. However, when the MSA contains sufficient homologs, templates generally do not significantly enhance the accuracy of AF2^15,18,19 and are thus not enabled in ColabFold by default.

In addition to processing the input MSA and templates, the behavior of protein prediction models is controlled by various parameters, such as the number of times the prediction will be iterated through the network (number of recycles). ColabFold sets these parameters to meaningful default values, saving the need to modify them in the average case, while offering an easy way to do so, if desired. This optimizes the usage of AF2 models for the user, allowing for efficient experimental turn-around times. Upon completion of the prediction process, ColabFold produces various visualizations of the predicted structure²⁰ and the prediction quality.

Owing to its simplicity and functionality, ColabFold has been widely adopted in numerous studies and its public MSA server is employed tens of thousands of times a day. Its applicability spans many biological fields. For example, it has been used to solve the structure of two members of the central AvrE-family of bacterial effectors²¹, revealing their beta barrel structure. This prediction prompted a series of cryo-EM imaging and other experiments, eventually leading to the full characterization of the biochemical function of this effector family. In another study²², scientists used ColabFold-AF2 to predict the structures of dozens of nucleoporins, the building blocks of the nuclear pore complex (NPC). The resulting models were extremely accurate, and thus used for integrative modeling, covering >90% of the human NPC. This protocol guides the reader on how to use ColabFold-AF2 for solving similar biological questions.

The biological examples in this protocol

To demonstrate the classic use of ColabFold-AF2 for predicting static protein structures, we focus on the human glycosylphosphatidylinositol transamidase (GPIT) protein. The structure of the GPIT complex, which catalyzes the attachment of GPI to the endoplasmic reticulum membrane, has recently been determined experimentally using single-particle cryo-EM at a resolution of 3.1 Å²³. The procedure presented in this protocol results in a predicted GPIT structure that is remarkably similar to the experimentally solved one, with a root mean square deviation (RMSD) of 1.87 Å; that is, the typical distance between corresponding atoms of the two models is small, indicating high structural similarity.

A non-classic use-case of ColabFold-AF2 may interest readers who investigate proteins that shift between conformations, such as transporters and receptors. Even though AF2 was designed for predicting static structures, del Alamo et al.²⁴ have recently shown that certain manipulations to its input can, in some cases, tweak it to predict a landscape of structures spanning several conformations. Here, we modify and extend the protocol proposed by del Alamo et al. and demonstrate this ability using the human Alanine Serine Transporter 2 (ASCT2), a Na⁺-dependent neutral AA transporter. This transporter is a homotrimer with at least two conformations, depending on whether it is facing the extracellular (outward)²⁵ or intracellular (inward) side²⁶. We show that ColabFold-AF2 can be tweaked to predict both.

Overview of the protocol

This paper outlines a comprehensive protocol (Fig. 1) for utilizing ColabFold-AF2 for monomer (Procedure 1) and complex (Procedure 2) predictions and for an ad-hoc approach for conformational sampling (Procedure 3). We demonstrate Procedures 1 and 2 using the human GPIT protein. Procedure 1 predicts each of its five subunits (PIGU, PIGK, PIGT, PIGS, and GPAA1) as a monomer, and Procedure 2 predicts them jointly as a complex. In Procedure 3, we use the human ASCT2. For all three procedures, we first provide instructions on the web-based version and then the command-line version, which can be run locally. Using the command-line version of the protocol requires basic familiarity with the Unix/Linux shell and a workstation capable of handling AF2 models. ColabFold-AF2 has over 15 tunable parameters. These should not overwhelm first-time users, as most are expected to be kept at their default values. In Box 1 we describe the ten parameters that users are most likely to tune as they gain experience. For clarity, we group the described parameters into categories by their role, although in the notebook they are spread over several cells. In each procedure, we instruct the user on how to provide the input protein sequence to ColabFold, and in Box 2 we give full details concerning ColabFold’s accepted input and output formats. Box 3 describes the various confidence measures computed by ColabFold-AF2, and Box 4 is intended for users who do not wish to use the default ColabFold server for MSA computation. The Anticipated Results section of this protocol contains general explanations on interpreting ColabFold’s plots and output, followed by a demonstration of the interpretation process on each of the procedures’ examples. This protocol is primarily designed for biologists aiming to conduct structural analysis and does not require coding expertise.

Alternatives to ColabFold-AF2

The first tool to use the AF2 models was a command-line interface by the AlphaFold team at Google DeepMind (https://github.com/google-deepmind/alphafold). ColabFold followed, offering the first Colab-based notebook. The debut of ColabFold was then followed by dedicated AF2 and AF2-multimer notebooks⁸. Other alternatives include software reimplementations like FastFold²⁷ and OpenFold²⁸, which offer adaptations to fit heterogeneous compute clusters and the ability to retrain the network from the ground up, respectively. Unifold²⁹ is an alternative for AlphaFold-multimer, with optimized model performance on GPU, a retrained model, and novel weight sets.

Other tools, such as OmegaFold¹¹, ESMfold¹⁰, and Helixfold-single³⁰ bypass the need to compute MSAs by using pre-trained language models (LM) to process a single protein sequence. While these LM-based structure predictors are faster compared to MSA-based structure predictors, they are generally also less accurate.

ColabFold’s alternative-to-AF2 notebooks, ESMFold.ipynb, OmegaFold.ipynb and RoseTTAFold2.ipynb, are especially useful for fast monomer prediction (ESMFold) and large complex prediction (RoseTTAFold2³). To date, however, the AF2 models remain the most accurate published models.

Limitations of the protocol and software

As the size of the input grows, the execution time and GPU RAM usage of the AF2 models increase. Given Colab’s capabilities, the maximum protein sequence size that can be computed varies depending on the allocated GPU. The T4 GPU with 16 GB of GPU RAM, available for all Colab users, handles approximately 1,500 AAs, and the A100 with 40 GB of GPU RAM, available in Colab Pro+, can process up to 3,300 AAs in a single run. For longer proteins, users are encouraged to employ their own local or cloud-based GPU resources. In addition, ColabFold can draw on additional system RAM at the cost of longer runtimes, which may be an advantageous trade-off for specific use-cases.
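As a rough pre-flight check, the residue count of a query can be compared against the approximate limits quoted above. The following shell sketch is illustrative only: the function names count_residues and suggest_gpu and the toy file toy.fasta are hypothetical, the thresholds (about 1,500 AAs for a T4 and about 3,300 AAs for an A100) are simply the figures from this section, and the real limit depends on the GPU that is actually allocated.

```shell
# Count residues in a FASTA file (all records summed; header lines are skipped).
count_residues() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Suggest a GPU tier from the approximate limits quoted in this section.
suggest_gpu() {
    if [ "$1" -le 1500 ]; then
        echo "T4"
    elif [ "$1" -le 3300 ]; then
        echo "A100"
    else
        echo "local-or-cloud"
    fi
}

# Toy example (hypothetical sequence, written to a temporary directory)
WORKDIR=$(mktemp -d)
printf '>toy\nMAAPLVLVLV\nVAVTVRAALF\n' > "${WORKDIR}/toy.fasta"
LENGTH=$(count_residues "${WORKDIR}/toy.fasta")
echo "${LENGTH} residues -> $(suggest_gpu "${LENGTH}")"
```

For a complex, the relevant number is the summed length of all chains, which is why the sketch counts residues across every record in the file.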

Equipment

Protein sequence queries used for demonstrating the procedures:

• PIGU: https://protocol.colabfold.com/PIGU.fasta

• PIGK: https://protocol.colabfold.com/PIGK.fasta

• PIGT: https://protocol.colabfold.com/PIGT.fasta

• PIGS: https://protocol.colabfold.com/PIGS.fasta

• GPAA1: https://protocol.colabfold.com/GPAA1.fasta

• GPIT complex: https://rcsb.org/fasta/entry/7W72

• ASCT2: https://rcsb.org/fasta/entry/7BCQ

ColabFold software:

• Web-based ColabFold-AF2 notebook: https://alphafold.colabfold.com

• Code repository and alternative notebooks: https://github.com/sokrypton/ColabFold

Hardware: Running the web-based version requires a Google account (free). Throughout this protocol, unless stated otherwise, the freely-available T4 GPU was used. To reproduce the examples using command-line instructions we recommend a Linux workstation with 16 GB RAM and a CUDA-capable GPU of the Volta generation or newer with at least 16 GB GPU RAM. See the “Troubleshooting” section for older GPUs.

Equipment setup

A. Web-based ColabFold within Colab

Colab is a proprietary web-based environment offered by Google to host Jupyter notebooks with a series of code blocks called cells. Colab provides logged-in users with free CPU and GPU resources for software execution. Sign in to Google and navigate to https://alphafold.colabfold.com.

B. Downloading and installing ColabFold locally

Local installation of ColabFold requires the installer script LocalColabFold (available together with a comprehensive set-up guide at https://github.com/YoshitakaMo/localcolabfold). LocalColabFold is compatible with various operating systems, including Windows Subsystem for Linux 2 (freely available for Windows 10 or later), Linux and macOS (albeit without GPU support). It utilizes the Python package managers pip and Conda (available at https://docs.conda.io). The following commands install ColabFold locally on a Linux system. Throughout this protocol, lines starting with $ refer to instructions that should be typed in the command-line interface.

$ wget https://raw.githubusercontent.com/YoshitakaMo/localcolabfold/main/install_colabbatch_linux.sh
$ bash install_colabbatch_linux.sh
$ export PATH="/path/to/your/localcolabfold/colabfold-conda/bin:$PATH"
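After installation, it can be useful to confirm that the shell actually finds the tool. The sketch below is one possible check; the helper name check_tool is hypothetical, and it only tests whether a command is on the PATH, not whether the installation itself is complete.

```shell
# Report whether a command is available on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# If the tool is missing, the export PATH line above is the first thing to re-check.
check_tool colabfold_batch || echo "Check that the colabfold-conda/bin directory is on PATH."
```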

ColabFold databases download

Typically, command-line users are not required to download large homology databases, as ColabFold automatically connects to an online server, containing such databases for MSA generation. As this server is a shared resource among ColabFold users, an IP-based rate-limiting access is implemented to ensure fairness. Thus, users who wish to compute many hundreds of structures, or set up their own local server are referred to the script setup_databases.sh, which downloads the required databases for local MSA generation. To run this script, MMseqs2 (15-6f452 or later, available at https://mmseqs.com) is required. Databases of protein sequences and templates are also available at https://colabfold.mmseqs.com. Obtain the script and grant it permissions:

$ wget https://raw.githubusercontent.com/sokrypton/ColabFold/main/setup_databases.sh
$ chmod +x setup_databases.sh

Before executing the script, please refer to the GitHub README section https://github.com/sokrypton/ColabFold#generating-msas-for-large-scale-structurecomplex-predictions for execution instructions that ensure optimal performance for large databases, such as those used for protein structure prediction.

Procedure 1. Monomer prediction

In the following sections, we predict each of the five monomers of the GPIT complex individually, using ColabFold-AF2 within Colab as well as using the command line. We explicitly guide the reader through predicting the structure of the PIGU subunit and the other four subunits can be predicted analogously. To obtain the relevant sequences, see section “Equipment” of this protocol.

1A. Web-based prediction

We begin with the quick start, which requires the user to perform three steps, prompting the execution of all the cells in the notebook, without a break. This option is analogously available for other procedures in this protocol. Alternatively, the cells of the notebook can be run, one after the other, from top to bottom.

1A - Quick start Timing ~20 min

i. Open a web-browser and navigate to https://alphafold.colabfold.com. GPU usage is enabled by default; you can verify this by navigating to Runtime → Change runtime type in the menu bar.

ii. Paste the AA sequence of PIGU into the query_sequence field.

iii. Click Runtime in the menu bar and then hit Run all (Fig. 2). This action performs all prediction steps without a break, by sequentially executing every cell in the notebook. The currently running cell is indicated by a spinning circle on the left side.

By default, ColabFold computes five structure models. After each model is generated, the predicted structures and result plots are displayed. Once all cells are processed, a pop-up window will appear, offering to download the various result files (see Box 2 for details about output formats) as a single compressed (zip) file. Please note that the runtime of these steps can vary depending on the GPU assigned by Colab.

1A - Cell-by-cell instructions Timing ~20 min

By running the cells of the notebook (Fig. 3) sequentially, the user can set the options of some cells, based on the result of the previous step(s).

i. Open a web-browser and navigate to https://alphafold.colabfold.com. GPU usage is enabled by default; you can verify this by navigating to Runtime → Change runtime type in the menu bar.

ii. Paste the AA sequence of PIGU into the query_sequence field. However, now run only this single cell by clicking on the triangle icon on the left.

iii. Run the next cell install_dependencies. It doesn’t contain any modifiable parameters.

iv. Continue to MSA options. This cell controls the homology search for the input. Keep it at its default values and run the cell. You can check Box 1 for alternative options for setting this cell before running it.

v. The Advanced Settings cell allows for controlling the type of prediction model, recycling, MSA sampling and randomization. Use its pre-loaded default setting or set it (see Box 1) and run it.

vi. The next cell is Run prediction, which will start the prediction process with/without a preceding MSA computation step (depending on the values set in MSA options cell). The five computed protein models will be ranked by their pLDDT score.

vii. In the next cell Display 3D structure, you can select the model to be displayed by its rank using rank_num and then run the cell to view the result. This cell also allows for adjusting the colors and chain display.

viii. Running the next cell, Plots, will produce the PAE, sequence coverage, and pLDDT plots (discussed in Box 3 and Anticipated Results).

ix. Save the results as a compressed zip file by running the Package and download results cell.

Predicting the structure of PIGK, PIGT, PIGS, and GPAA1 Timing ~20, 30, 30, 40 min (T4 GPU)

Repeat the steps of 1A analogously for the other four monomeric units of GPIT.

1B. Local prediction Timing ~5 min

Local predictions with ColabFold are carried out with a single command-line tool, colabfold_batch. It supports batch prediction, eliminating the need to input query sequences one-by-one.

i. Save the sequence of PIGU locally in FASTA format (PIGU.fasta).
In addition to FASTA, ColabFold supports several types of input formats (Box 2). ColabFold can also be provided with pre-computed MSAs (for MSA generation options, see Box 4) by placing them within a single directory.
ii. Use a single command to generate the MSA and predict the structure:

$ colabfold_batch PIGU.fasta /path/to/results

Additional arguments can be passed to this command if non-default settings are desired. Users can refer to the complete list of parameters by running:

$ colabfold_batch --help

These parameter settings are as detailed in Box 1. However, when running locally, the user should replace underscores in the parameter name with dashes and prefix the name with “--”, for example:

--msa-mode mmseqs2_uniref
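The naming rule described above can be written as a one-line shell helper. This is a sketch of the stated rule only: the function name to_flag is hypothetical, and since exact flag names can differ between ColabFold versions, the result should be checked against the output of colabfold_batch --help.

```shell
# Convert a notebook parameter name (Box 1 style) into a command-line flag:
# underscores become dashes and the name is prefixed with two dashes.
to_flag() {
    printf '%s%s\n' '--' "$(printf '%s' "$1" | tr '_' '-')"
}

to_flag msa_mode
to_flag use_dropout
```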

The output files (detailed in Box 2) are saved in the specified output directory (/path/to/results). Unlike web-based predictions, local predictions do not run on a notebook with 3D structure visualizations. Instead, users can explore the predicted structures in PDB format using tools like ChimeraX and PyMOL.
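To repeat the local prediction for all five GPIT subunits, the basic command can be wrapped in a loop. The sketch below assumes that the five sequences were saved as PIGU.fasta, PIGK.fasta, PIGT.fasta, PIGS.fasta and GPAA1.fasta in the current directory; the output location results and the RUN switch are hypothetical. By default it only prints the commands, so it can be inspected before anything is executed.

```shell
# Dry run by default: commands are printed, not executed.
# Set RUN=1 in the environment to execute them (requires colabfold_batch on PATH).
RUN="${RUN:-0}"
RESULTS_ROOT="results"   # hypothetical output location

predict_monomers() {
    for SUBUNIT in PIGU PIGK PIGT PIGS GPAA1; do
        if [ "${RUN}" = "1" ]; then
            colabfold_batch "${SUBUNIT}.fasta" "${RESULTS_ROOT}/${SUBUNIT}"
        else
            echo "colabfold_batch ${SUBUNIT}.fasta ${RESULTS_ROOT}/${SUBUNIT}"
        fi
    done
}

predict_monomers
```

Writing each subunit to its own output directory keeps the result files of the five predictions apart.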

Procedure 2. Complex prediction

ColabFold’s complex prediction process closely resembles that of monomers. The main differences are the input, the additional MSA pairing option, and the use of a multimer prediction model. In the following steps, we focus solely on these differences, using the GPIT complex as an example. For all other settings, which are shared with monomer prediction, we refer the reader to the monomer section of the protocol and Box 1.

During input preparation, the sequences of all subunits in the complex need to be concatenated to each other, using a colon (‘:’). See Box 2 for an example concatenation in CSV format. To obtain the GPIT subunit sequences, see section “Equipment” of this protocol.

2A. Web-based prediction Timing ~60 min

Due to the long length of the GPIT complex (~2,500 residues), the web-based prediction was carried out using a paid Pro Colab account, leveraging the much larger amount of GPU RAM of the A100 GPU (40GB GPU RAM).

i. Open a web-browser and navigate to https://alphafold.colabfold.com. Set hardware accelerator to A100 GPU by navigating to Runtime → Change runtime type in the menu bar.

ii. Concatenate the sequences of all subunits of the GPIT complex using a colon (‘:’) and paste the concatenated sequence into the query_sequence field.

iii. All cells are preset to their default values, which will be used in this example. However, other options are available and their details are provided in Box 1. These include the options available for monomer prediction and in addition: controlling the pairing process, which identifies the same taxon across the different protein subunits and the model used for protein prediction. Click Runtime and hit Run all. This action performs all prediction steps without a break.

Once all cells are processed, a pop-up window will appear, offering to download the various result files (see Box 2) as a single compressed file. In complex prediction the five generated protein models will be ranked by their pTM and ipTM scores (Box 3) with the following formula: 0.2×pTM + 0.8×ipTM.
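The ranking formula above is simple enough to evaluate by hand or in the shell. The sketch below uses awk for the arithmetic; the function name rank_score is hypothetical and the scores are made-up values, not results from this protocol.

```shell
# Ranking score for complex models, as given in the text: 0.2*pTM + 0.8*ipTM
rank_score() {
    awk -v ptm="$1" -v iptm="$2" 'BEGIN { printf "%.2f\n", 0.2 * ptm + 0.8 * iptm }'
}

# Hypothetical scores for two models (not results from this protocol)
rank_score 0.80 0.70   # prints 0.72
rank_score 0.90 0.50   # prints 0.58
```

Because ipTM carries four times the weight of pTM, the second model ranks lower despite its higher pTM.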

2B. Local prediction Timing ~110 min

i. Concatenate the sequences of all subunits of the GPIT complex using a colon (‘:’) and save the concatenated sequence in CSV format (example in Box 2) as input.csv.

ii. Use a single command to generate the MSA and predict the structure:

$ colabfold_batch input.csv /path/to/results

The results will be saved in the output directory (/path/to/results).
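Step i of 2B can be scripted rather than done by hand. The following sketch joins the sequences of several FASTA files with colons and writes them in the CSV layout shown in Box 2. The function names fasta_to_seq and make_complex_csv are hypothetical, the toy inputs reuse the trimmed Box 2 sequences, and the sketch assumes one record per FASTA file (a multi-record file, such as a full entry downloaded from the PDB, would need to be split first).

```shell
# Join the sequence lines of one single-record FASTA file (header skipped).
fasta_to_seq() {
    grep -v '^>' "$1" | tr -d ' \t\r\n'
}

# Write a complex CSV: one id, subunit sequences joined by ':'.
make_complex_csv() {
    ID="$1"; shift
    JOINED=""
    for FASTA in "$@"; do
        SEQ=$(fasta_to_seq "${FASTA}")
        if [ -z "${JOINED}" ]; then JOINED="${SEQ}"; else JOINED="${JOINED}:${SEQ}"; fi
    done
    printf 'id,sequence\n%s,%s\n' "${ID}" "${JOINED}"
}

# Toy example using the trimmed sequences from Box 2
WORKDIR=$(mktemp -d)
printf '>PIGU_trim\nMAAPL\nVLVLV\n' > "${WORKDIR}/PIGU_trim.fasta"
printf '>PIGT_trim\nARDSLREELV\n' > "${WORKDIR}/PIGT_trim.fasta"
make_complex_csv GPIT_trim "${WORKDIR}/PIGU_trim.fasta" "${WORKDIR}/PIGT_trim.fasta" > "${WORKDIR}/input.csv"
cat "${WORKDIR}/input.csv"
```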

Procedure 3. Conformation prediction

This procedure uses the AF2 models outside their initial scope, which is to predict a single conformation given an AA sequence. The outline provided here is partly based on an ad-hoc method proposed by del Alamo et al.²⁴, who also warn that there is no one-size-fits-all approach for sampling the conformational space. We therefore recommend trying out various parameter options as well as being cautious when interpreting the results.

As shown by del Alamo et al., reducing MSA depth can contribute to conformational sampling. At the heart of this approach lies the following logic. Proteins that undergo conformational changes are likely to have AA pairs, which are in strong structural interaction in one conformation, but exhibit different interaction in another. These AA pairs will tend to co-evolve, meaning that when examined over many homologs from various species, a change in one AA will often be associated with a change in the other AA of the pair. Thus, when an MSA with many homologs (“deep MSA”) is provided to the AF2 models, these AA pairs carry a signal, which prompts the AF2 models to predict the conformation in which they interact. However, when the MSA depth is reduced by removing homologs, the co-evolution signal is weakened, potentially allowing the AF2 models to predict alternative conformations.

Other factors can additionally contribute to conformational sampling by increasing the uncertainty of the network, i.e., having less certainty in a single dominant conformation^37,38. These factors include providing templates of alternative conformations or activating dropout layers in the AF2 neural network. Dropout layers are used during the neural network training and prompt the network to produce alternative solutions for the same input, by randomly deactivating certain weights. Thus, enabling these layers during inference enhances the variety of predicted structures.

Here we present two sampling strategies, focused either on reducing MSA depth, which is controlled by ColabFold-AF2 through the max_msa parameter (see Box 1) or on activating dropout layers. Depending on the starting point (seed), each AF2 execution can reach slightly different results. Thus, in both strategies, we increase the number of times the model is run using different starting points, thereby increasing the chance of sampling alternative conformations. The following procedure uses ASCT2 as an example. To obtain its sequence, see section “Equipment” of this protocol.

3A. Web-based prediction Timing ~35 min (MSA depth reduction), 60 min (Dropout)

To shorten run times, the web-based prediction was conducted using an A100 GPU on a paid Pro Colab account. However, this procedure can also be performed with the default T4 GPU, taking ca. 7 hours.

i. Open a web-browser and navigate to https://alphafold.colabfold.com. GPU usage is enabled by default; you can verify this by navigating to Runtime → Change runtime type in the menu bar.

ii. Paste the ASCT2 sequence into the query_sequence field.

iii. Conformation prediction requires changing the default settings to increase the uncertainty of the network by modifying parameters in the Advanced settings → Sample settings field. In the MSA depth reduction strategy, adjust the value of max_msa to 32:64. Alternatively, activate the dropout layers by checking the use_dropout box. In addition to either strategy, set num_seeds to its maximum (16), to generate models using different starting points (see Box 1). All other cells are kept at their default values in this example. Box 1 informs on how to modify them, if desired.

iv. Click Runtime and hit Run all. This action performs all prediction steps without a break.

Once all cells are processed, a pop-up window will appear, offering to download the various result files (see Box 2) as a single compressed file. For each seed from 0 to 15, five structure models will be computed, resulting in a total of 80 predictions, ranked by their pLDDT score.

3B. Local prediction Timing ~55 min (MSA depth reduction), 100 min (Dropout)

In the command-line, instead of setting num-seeds (see Box 1), we directly set the starting points through the random-seed parameter. Both these options result in a similar behavior.

i. Save the ASCT2 protein sequence FASTA file (rcsb_pdb_7BCQ.fasta) in the input file directory.

ii. For the MSA depth reduction strategy, the basic command is:

$ colabfold_batch rcsb_pdb_7BCQ.fasta /path/to/result --max-msa 32:64 --random-seed z

For the dropout-activation strategy it is:

$ colabfold_batch rcsb_pdb_7BCQ.fasta /path/to/result --use-dropout --random-seed z

This basic command should be run 16 times with a different z value each time (0,1,...,15). To avoid typing the command multiple times, use the following bash script, which embeds the basic command in a loop. In addition, it copies the MSA computed in the first iteration to all following iterations, thereby reducing calculation time:

#!/bin/bash
INPUTFILE="rcsb_pdb_7BCQ.fasta"
OUTPUTDIR="ASCT2/32_64"
# Keep exactly one of the two STRATEGY lines below
STRATEGY="--max-msa 32:64"    # MSA depth reduction strategy
# STRATEGY="--use-dropout"    # dropout-activation strategy
# Seeds 0 to 15, i.e., 16 runs in total
for RANDOMSEED in $(seq 0 15); do
    if test "${RANDOMSEED}" -ne 0; then
        mkdir -p "${OUTPUTDIR}/${RANDOMSEED}"
        # Copy the MSA file of iteration '0' to skip MSA computation by the webserver
        cp -rp "${OUTPUTDIR}"/0/*_env "${OUTPUTDIR}/${RANDOMSEED}"
    fi
    # STRATEGY is left unquoted on purpose, so that a flag and its value are passed as two arguments
    colabfold_batch \
        --random-seed "${RANDOMSEED}" \
        ${STRATEGY} \
        "${INPUTFILE}" \
        "${OUTPUTDIR}/${RANDOMSEED}"
done

After completion, in each output directory ($OUTPUTDIR/$RANDOMSEED) corresponding to seed numbers 0-15, you will find the output files generated for the five models.
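A quick tally of the structure files per seed directory helps confirm that all runs finished. The sketch below counts files ending in .pdb in each subdirectory; the helper name count_models is hypothetical, and the demonstration runs on a simulated directory tree with placeholder file names rather than real ColabFold output. On real output, the count per directory can exceed five if additional structure files (for example, relaxed models) were written.

```shell
# Count PDB files in each per-seed output directory and in total.
count_models() {
    TOTAL=0
    for DIR in "$1"/*/; do
        N=$(ls "${DIR}"*.pdb 2>/dev/null | wc -l | tr -d ' ')
        echo "${DIR}: ${N} models"
        TOTAL=$((TOTAL + N))
    done
    echo "total: ${TOTAL}"
}

# Simulated layout: 16 seed directories with 5 placeholder PDB files each
DEMODIR=$(mktemp -d)
for SEED in $(seq 0 15); do
    mkdir -p "${DEMODIR}/${SEED}"
    for MODEL in 1 2 3 4 5; do
        : > "${DEMODIR}/${SEED}/placeholder_model_${MODEL}.pdb"
    done
done
count_models "${DEMODIR}" | tail -n 1
```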

Box 1. Parameter configuration

Tuning homology search - MSA generation options

msa_mode

- Use ColabFold MSA server: search against the UniRef database only (mmseqs2_uniref) or UniRef and ColabFoldDB (mmseqs2_uniref_env, default)

- custom: upload a pre-computed MSA

- single_sequence: predict using a single sequence without an MSA. This option is useful for de novo designed proteins with no homology to natural proteins.

pair_mode: For heteromeric complex predictions, ColabFold first generates MSAs separately for each chain. Then, in a process called “pairing”, it identifies and marks sequences from the same (sub-)species across all MSAs, to enhance prediction quality.

- paired: perform pairing and retain only sequences that can be paired

- unpaired: skip pairing

- unpaired_paired (default): pair and retain all sequences

pairing_strategy: This parameter takes effect when pairing is enabled.

- complete: consider a sequence as “paired” if it has a taxonomic match in every chain (all MSAs)

- greedy (default): a sequence is paired if it has a taxonomic match in at least two chains (two MSAs)

Searching for alternative conformations - MSA sampling options

max_msa: controls the number of sequences used for structure prediction and is provided as max_msa_clusters:max_extra_msa, where max_msa_clusters corresponds to N_seq and max_extra_msa to N_extra_seq, as defined in Suppl. section 1.1 of Jumper et al.¹. By default (auto), these values are set to 512:5120. Lower values are useful when searching for alternative conformations (explained in Procedure 3).

- max_msa_clusters: the number of randomly chosen MSA sequences, which serve as center points in a clustering procedure preceding prediction with the AF2 network

- max_extra_msa: MSA sequences, which were not selected as cluster centers and are used for extra computation by the AF2 network
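To make the colon-separated format concrete, the value used in Procedure 3 can be split into its two components with plain shell parameter expansion. The variable names below are illustrative only.

```shell
# Split a max_msa value of the form max_msa_clusters:max_extra_msa.
MAX_MSA="32:64"
MAX_MSA_CLUSTERS="${MAX_MSA%%:*}"   # text before the colon
MAX_EXTRA_MSA="${MAX_MSA#*:}"       # text after the colon
echo "cluster centers: ${MAX_MSA_CLUSTERS}, extra sequences: ${MAX_EXTRA_MSA}"
```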

num_seeds: indicates the number of different random seeds to use. By default (1) a single seed value will be used. These seeds determine random components throughout prediction (e.g., when selecting the sequences, which serve as cluster centers). Setting a higher value can increase the chance of obtaining a better confidence score in cases where the MSA is very small and templates are lacking. In combination with max_msa, it can help explore alternative conformations.

use_dropout: activates dropout layers during inference, which prompts the AF2 neural network to be less confident in a single conformation.

Improving or reproducing results - Model options

model_type: specifies the AF2 model (alphafold2_ptm, alphafold2_multimer_v1, _v2, _v3) to use for prediction. By default (auto), the newest models are used: alphafold2_ptm for monomers and alphafold2_multimer_v3 for complexes. Older models are mainly useful for reproducing older results (but see also “Troubleshooting”).

template_mode: specifies whether to incorporate templates (pdb100, default) or not (none), or use custom templates (custom).

num_recycles: controls the number of times a prediction is re-fed to the model. By default, it is set to 3, except for model_type=alphafold2_multimer_v3, where it is set to 20.

recycle_early_stop_tolerance (in short tol): controls the convergence criterion, which is considered together with num_recycles. By default, it is set to 0.0, except for model_type=alphafold2_multimer_v3, where it is set to 0.5.
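The interplay of num_recycles and tol can be sketched as a loop that stops either when the recycle budget is used up or when successive predictions differ by less than tol. This is a schematic Python illustration only: the function names are hypothetical, and the way ColabFold measures the change between two predictions is not reproduced here but passed in as a user-supplied function.

```python
def run_recycles(predict_step, change, num_recycles=3, tol=0.0):
    """Initial pass plus up to num_recycles recycles, with early stopping."""
    prediction = predict_step(None)          # initial pass, nothing to recycle yet
    for recycle in range(1, num_recycles + 1):
        refined = predict_step(prediction)   # re-feed the previous output
        converged = change(prediction, refined) < tol
        prediction = refined
        if converged:
            return prediction, recycle
    return prediction, num_recycles

# Stand-ins for the network and the change measure (numbers instead of structures)
def halve(previous):
    return 8.0 if previous is None else previous / 2

def absolute_change(a, b):
    return abs(a - b)

print(run_recycles(halve, absolute_change, num_recycles=20, tol=1.5))  # (1.0, 3)
print(run_recycles(halve, absolute_change, num_recycles=5, tol=0.0))   # (0.25, 5)
```

With tol set to 0.0 the change can never fall below the tolerance, so all recycles are always run.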

Box 2. Input and output formats

When starting from a single protein, ColabFold accepts the sequence in either FASTA or CSV/TSV. A precomputed MSA must be provided in A3M format, which is also the format in which ColabFold writes the MSAs it computes itself. ColabFold can operate on multiple input files by taking in a list of sequences in FASTA/CSV format or a directory containing FASTA/A3M files.

FASTA: Each entry has a header starting with ’>’, followed by sequence lines. Example:

        >PIGU_trim
        MAAPLVLVLVVAVTVRAALFRSSLAEFISERVE
        >PIGT_trim
        ARDSLREELVITPLPSGDVAATFQFRTRWDSELQREGVSHY

CSV: The first line is always “id,sequence”. Each line that follows contains the header and the sequence, separated by a comma. Examples:

Monomer

        id,sequence
        PIGU_trim,MAAPLVLVLV
        PIGT_trim,ARDSLREELV

Complex

        id,sequence
        GPIT_trim,MAAPLVLVLV:ARDSLREELV
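As an illustration of how the two input formats relate, the following Python sketch converts FASTA text into the CSV layout described above, either as one row per monomer or as a single complex row with chains joined by ‘:’. The function names are hypothetical helpers, not part of ColabFold.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line      # sequences may span several lines
    return [(header, seq) for header, seq in records]

def to_colabfold_csv(records, complex_id=None):
    """One row per monomer, or a single row with chains joined by ':'."""
    rows = ["id,sequence"]
    if complex_id is None:
        rows += [f"{header},{seq}" for header, seq in records]
    else:
        rows.append(complex_id + "," + ":".join(seq for _, seq in records))
    return "\n".join(rows)

fasta = ">PIGU_trim\nMAAPLVLVLV\n>PIGT_trim\nARDSLREELV\n"
print(to_colabfold_csv(read_fasta(fasta)))
print(to_colabfold_csv(read_fasta(fasta), complex_id="GPIT_trim"))
```

The two calls print the monomer and complex CSV examples shown in this box.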

A3M begins with a super-header, followed by the aligned sequence entries. Each entry has a header starting with ’>’, followed by sequence lines.

- The super-header, starting with ’#’, contains two comma-separated lists separated by a tab. The first list gives the length of each chain and the second its cardinality (number of copies). Each chain appears only once, regardless of its cardinality.

- The MSA is described with respect to the first (query) sequence. An amino-acid (AA) that is aligned to an AA in the query is shown in upper case, an insertion relative to the query in lower case, and a deletion as ‘-’. Example A3M for different input types:

Monomer

        #28	1
        >example_monomer_query
        MAAPLVLVLVVAVTVRAALFRSSLAEFI
        >found_homolog_1
        MAFPLALVLVVAVTVR-ALFRSSLAEFI
        >found_homolog_2
        ...

Hetero-oligomer

        #28,20	4,2
        >example_hetero6mer_A4B2_query
        MAAPLVLVLVVAVTVRAALFRSSLAEFITTAVNYPFVDTMDKFDKITK
        >found_homolog_paired1
        --FPLALVLVVAVTVRAALFRSSLAEFITTLVNYPFVDTMDKFDFITF
        >found_homolog_unpaired_chain1
        -AFPLALVLVVAVTVRAALFRSSLAEFI--------------------
        >found_homolog_unpaired_chain2
        ----------------------------TTLVNYPFVDTMDKFDKITF
        ...
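The super-header and the column conventions of A3M can be made concrete with a short Python sketch. It assumes the layout described in this box; the function names are hypothetical, and the entry is assumed to contain exactly as many aligned columns as the chain lengths add up to.

```python
def parse_super_header(line):
    """Parse a super-header such as '#28,20<TAB>4,2' into ([28, 20], [4, 2])."""
    lengths, cardinalities = line.lstrip("#").split()
    return ([int(n) for n in lengths.split(",")],
            [int(n) for n in cardinalities.split(",")])

def split_chains(aligned, lengths):
    """Cut one aligned entry into per-chain pieces.

    Lowercase letters are insertions and do not occupy a query column,
    so only uppercase letters and '-' advance the column counter.
    """
    pieces = [""] * len(lengths)
    chain, column = 0, 0
    for char in aligned:
        occupies_column = not char.islower()
        if occupies_column and column == lengths[chain]:
            chain, column = chain + 1, 0   # first column of the next chain
        pieces[chain] += char
        if occupies_column:
            column += 1
    return pieces

lengths, cardinalities = parse_super_header("#28,20\t4,2")
entry = "--FPLALVLVVAVTVRAALFRSSLAEFITTLVNYPFVDTMDKFDFITF"
print(lengths, cardinalities)          # [28, 20] [4, 2]
print(split_chains(entry, lengths))
```

In this sketch an insertion that falls exactly on a chain boundary is attached to the preceding chain.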

Output: ColabFold computes five protein models and reports them in three types of files:

PDB format for reporting the structures. See https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction for details. PDB files can be read in a text editor, and the structures can be visualized with tools such as ChimeraX³¹ and PyMOL (https://github.com/schrodinger/pymol-open-source). Within Colab, the predicted PDB files are visualized using 3Dmol.js²⁰.

JSON hierarchical data format for reporting the models’ confidence measures and the configuration of the parameters used for prediction. Reading JSON files is possible through text editors or web-browsers like Firefox, among others.

PNG image format for visualizing the alignment coverage and confidence scores.

Box 3. Prediction confidence measures

AF2 has its own set of confidence measures, corresponding to traditional measures for structure prediction, whose computation requires knowledge of the true protein structure. Since the true structure is generally unknown, AF2, through training on a set of proteins with known structures, has learnt to predict these confidence measures alongside predicting the structure itself. These measures are named pX, where X is a traditional measure and p stands for “predicted”.

predicted Local Distance Difference Test (pLDDT): For every AA in the query, AF2 predicts its lDDT-Cα score³² on a scale of 0 (bad) to 100 (excellent). Each AA’s α carbon can be described by its distances to neighboring α carbons (e.g. within a radius of 15 Å) in the true structure. The superposition-free lDDT-Cα metric reflects how preserved these distances are in the prediction model, where the confidence for each score range is as follows:

>90: highly confident

70 < pLDDT < 90: confident at the backbone level

50 < pLDDT < 70: lower confidence, interpret cautiously

<50: potentially disordered regions; should be ignored

Predicted Aligned Error (PAE): For every amino-acid x in the query, AF2 predicts a set of scores on a scale of 0 (excellent) to >30 (bad). Each score reflects the expected displacement of x in Å in the predicted structure relative to the unknown true structure, when the two structures are aligned on some other amino-acid y. PAE is especially useful when evaluating the prediction of multi-domain/chain proteins, where the placement of each domain is important. Specifically, if amino-acids, which are part of one domain, score well also when the alignment point is outside their domain, it suggests inter-domain confidence.

predicted Template Modeling score (pTM): the template modeling (TM) score³³ is expressed using the distance between the predicted position of each AA and its true position, when the predicted and true structures are optimally aligned. AF2 predicts an approximate TM score by replacing the optimal alignment, which is infeasible to compute, with the row in the PAE matrix (i.e., an alignment on a single AA), where the total error is lowest (for full details, see Suppl. pg 37-38 of Jumper et al.¹). The computed pTM scores range from 0 (bad) to 1 (excellent), often interpreted as:

>0.8: highly confident with congruent topology and backbone

0.5 or 0.7 <pTM <0.8: reliable fold for single or multi domain proteins, respectively

<0.2: no correlation with the true structure, intrinsically disordered protein, or no MSA/templates

interface pTM (ipTM): For complexes, AlphaFold-multimer predicts a modified pTM score, ipTM⁸, which takes into account only the inter-subunit distances, estimating the prediction accuracy of interfaces. Similar to pTM, ipTM ranges from 0 (bad) to 1 (excellent), where an ipTM score > 0.85³⁴ is considered as reliable.
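For orientation, the TM-score that pTM and ipTM approximate can be written in a few lines of Python. The sketch below evaluates the standard TM-score formula for one fixed superposition, given the per-residue distances between predicted and true positions; the true TM-score additionally maximizes over superpositions, which is not done here. The function names are hypothetical, and the sketch is restricted to targets longer than 21 residues, for which the distance scale d0 is positive.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in Å) of the TM-score."""
    if target_length <= 21:
        raise ValueError("this sketch only handles targets longer than 21 residues")
    return 1.24 * (target_length - 15) ** (1 / 3) - 1.8

def tm_score(distances, target_length):
    """TM-score for one fixed superposition; one distance (in Å) per aligned residue."""
    d0 = tm_d0(target_length)
    return sum(1 / (1 + (d / d0) ** 2) for d in distances) / target_length

# A perfect model scores 1; residues displaced by exactly d0 contribute 0.5 each
print(tm_score([0.0] * 100, 100))
print(tm_score([tm_d0(100)] * 100, 100))
```

Because each residue contributes at most 1/L, a few badly placed residues lower the score only slightly, which is why the measure reflects global topology rather than local detail.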

Box 4. Custom MSA generation

Instead of using the default MSA server, users can provide precomputed MSAs in A3M format (see Box 2). The following are alternatives for MSA generation.

Local colabfold_search: compute an MSA locally with the command

$ colabfold_search input.fasta database/ msas

- The input file (input.fasta in this example) should contain the query sequence(s) and database/ should indicate a path to the sequence database. The execution will generate A3M-formatted MSAs in the msas folder.

API hosting: For greater flexibility, users can also host their own API and pass its address to --host-url when running colabfold_batch.

$ colabfold_batch input.fasta /path/to/results --host-url "https://api.example.org"

Online tools: Users can utilize tools such as the HHblits Toolkit server³⁵,³⁶ (https://toolkit.tuebingen.mpg.de/tools/hhblits) for web-based MSA generation. Select the tool’s settings for maximal sensitivity. In the case of HHblits, we recommend a protocol similar to that of RoseTTAFold2: use 3 iterations, set the result list size to 10,000, and apply a strict E-value cut-off of 10⁻²⁰. If no or only few homologs are found, relax the E-value threshold.

Low confidence prediction

There are several possible reasons for low confidence in the predicted protein model. If the sequence coverage is low (shallow MSA), try using other sensitive homology search tools or larger environmental databases¹⁵ for constructing a deep MSA prior to providing it to ColabFold (see Box 4). Another reason for low confidence can be the presence of disordered regions, which challenge the network. Potential disordered regions can be detected by their low pLDDT scores and then trimmed off before re-running ColabFold-AF2. A third reason can be the stochasticity of the model. In this case, changing the model type or trying a different random seed (see Box 1) might help. Another mitigation is to try a different prediction network, such as ColabFold-RoseTTAFold2, as it may perform better on specific examples. Finally, in the case of complex-prediction, if the whole structure receives a low ipTM score, but the individual subunits are well predicted, it can be useful to provide the well predicted subunit structures as custom templates and re-run the complex prediction. This approach can also be beneficial when adding an interacting partner to a well-predicted complex by providing the complex model and the monomer model of the partner as custom templates.

Web-based predictions: results download error

If the results fail to download, try the following options. First, check that pop-ups are allowed for Colab and re-run the Package and download results cell. If this does not work, download the result file manually: click on the folder icon on the left bar, navigate to the file (<jobname>.result.zip), right-click on it and click ‘Download’. Alternatively, instead of downloading the result zip file to your computer, you can upload it to Google Drive by selecting the save_to_google_drive option in the Advanced Settings cell.

Web-based predictions: running out of resources

If the execution is interrupted with a message “Your session crashed after using all available RAM” or “You cannot currently connect to a GPU due to usage limits in Colab”, it means the limit of available resources provided by Google Colaboratory has been reached. This can happen when analyzing longer input sequences, using many recycles or exceeding the number of times Google Colaboratory allows a user to run in a given time period. To solve this, consider waiting (usually a day, though the timeout period is not fixed) before attempting to use Colab again, using a paid Pro account, or running ColabFold locally.

Web-based predictions: running predictions takes too much time

If prediction time is very long, for example, when analyzing input of similar length to those presented in this protocol, make sure you are utilizing Colab’s GPU option. To do so, check that the Hardware accelerator is set to GPU at Runtime → Change runtime type.

Local predictions: CUDA version conflicts

A common source of issues with a local installation is version conflicts and incompatibilities among the various CUDA-related drivers. We recommend using Conda to create a new environment for the ColabFold installation and not reusing this environment for any other purpose.

Local predictions: Old CUDA-capable GPUs

GPUs from the Volta CUDA-capable GPU generation are the oldest GPUs we recommend using. Setting the --disable-unified-memory parameter in ColabFold should allow it to run on Kepler-generation GPUs. Additionally, JAX and Jaxlib libraries have to be downgraded to versions before 0.4.

Custom template upload error

If any errors occur when uploading a custom template, make sure the template name follows the PDB naming convention of four lowercase letters, and the template conforms to either the mmCIF or the PDB format. Specifically, mmCIF-formatted templates must include an _entity_poly_seq field. PDB-formatted templates will be automatically converted to mmCIF by ColabFold.

Custom MSA upload error

If there are any errors when uploading a custom MSA, check that the file is in the required A3M format (see Box 2). In case of complexes, make sure the header format is correct (see Box 2).

Web-based predictions were performed using an NVIDIA T4 GPU (Procedure 1A) or an NVIDIA A100 GPU (Procedures 2A and 3A). Local predictions were performed using an NVIDIA RTX A5000 GPU.

Procedure 1: Monomer Prediction (PIGU 420 residues)
1A. Web-based prediction: 20 min
1B. Local prediction: 5 min

Procedure 2: Complex Prediction (GPIT 2,496 residues)
2A. Web-based prediction: 60 min
2B. Local prediction: 110 min

Procedure 3: Conformation Prediction (ASCT2 541 residues)
3A. Web-based prediction
- MSA depth reduction: 35 min
- Dropout: 60 min
3B. Local prediction

Understanding ColabFold’s results requires examining the predicted protein structures alongside their confidence scores. The results can be organized as: (1) predicted structures, (2) computed MSA (if requested) and its quality measures and (3) model confidence measures (see Box 3). In the following sections, we give a general explanation about the interpretation of the various plots ColabFold produces. We then describe the web-based results for monomer and complex predictions, and the local prediction results for conformational sampling. While AF2 is deterministic, its use of frameworks like JAX and CUDA can lead to slight variations in prediction between runs under the same parameters due to differences in GPU models and GPU driver versions.

Using the appropriate confidence measure

ColabFold computes various confidence measures: pLDDT, PAE, pTM and ipTM (see Box 3 for an outline of their computation). Since pLDDT is a local measure, it is not sensitive to the placement of each domain in a multi-domain protein. Therefore, a high pLDDT score does not ensure high confidence in the entire structure. In the case of proteins with multiple domains, a high pLDDT alongside a low pTM score could suggest that the individual domains are predicted accurately but their orientation relative to each other is not. We therefore recommend considering pLDDT in combination with pTM. When using a complex model, ColabFold ranks the predicted protein models by the formula: 0.2×pTM + 0.8×ipTM.
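The ranking formula for complex models can be applied directly, for example to re-rank models collected from several runs. In the Python sketch below the function name and the pTM/ipTM values are hypothetical and serve only to show the arithmetic.

```python
def multimer_rank_score(ptm, iptm):
    """Weighted score used to rank complex models: 0.2*pTM + 0.8*ipTM."""
    return 0.2 * ptm + 0.8 * iptm

# Hypothetical (pTM, ipTM) pairs for three models
models = {
    "model_a": (0.90, 0.80),
    "model_b": (0.85, 0.88),
    "model_c": (0.92, 0.60),
}
ranked = sorted(models, key=lambda name: multimer_rank_score(*models[name]), reverse=True)
for name in ranked:
    print(name, round(multimer_rank_score(*models[name]), 3))
```

Because ipTM carries four times the weight of pTM, model_c ranks last here despite having the highest pTM.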

ColabFold’s sequence coverage plot

The sequence coverage plot (example: Fig. 4a) illustrates the per-residue coverage and diversity, measured as sequence identity to the query (qid). Coverage represents the number of homologous sequences detected per residue and is indicated as a black line in the plot. The x-axis indicates the position within the query sequence and the y-axis the MSA coverage. The upper limit of the y-axis corresponds to the MSA depth. In most cases, a minimum coverage of 30 sequences for most query residues is required for accurate prediction, preferably over 100 sequences per residue¹. Next, qid indicates the sequence similarity (reflecting evolutionary distance) between each homolog and the query, where higher qid values indicate higher similarity. ColabFold encodes qid as a color, ranging from red (low) to blue (high). Each homologous sequence is illustrated as a horizontal line, where only segments aligned with the query are visualized in color. MSAs with similar proportions of high and low qid homologs are preferable, since homologs of varying evolutionary distances contribute different insights about the structure.
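The two quantities shown in the plot can be recomputed from an A3M file with a few lines of Python. The sketch below is an approximation written for illustration, not ColabFold's plotting code: the function names are hypothetical, coverage is taken as the number of non-gap entries per query column (including the query itself), and qid as the fraction of query positions at which a homolog carries the identical residue.

```python
def read_a3m(text):
    """Return aligned rows (query first) with insertions removed."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                     # skip blank lines and the super-header
        if line.startswith(">"):
            entries.append("")
        elif entries:
            entries[-1] += line
    # Lowercase letters are insertions relative to the query and occupy no column.
    return ["".join(c for c in seq if not c.islower()) for seq in entries]

def coverage_and_qid(rows):
    """Per-column coverage and per-sequence identity to the query."""
    query = rows[0]
    coverage = [sum(row[i] != "-" for row in rows) for i in range(len(query))]
    qid = [sum(a == b for a, b in zip(row, query)) / len(query) for row in rows]
    return coverage, qid

a3m = """#28\t1
>example_monomer_query
MAAPLVLVLVVAVTVRAALFRSSLAEFI
>found_homolog_1
MAFPLALVLVVAVTVR-ALFRSSLAEFI
"""
coverage, qid = coverage_and_qid(read_a3m(a3m))
print(min(coverage), max(coverage))  # 1 2
print(qid[1])                        # 25 of 28 positions identical
```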

ColabFold’s pLDDT plot and JSON file

The pLDDT scores computed for each amino-acid are plotted by ColabFold in a single plot for the five predicted protein models (example: Fig. 4c). On the x-axis are the query’s amino-acids and on the y-axis are their pLDDT scores in each of the models (encoded in color). The pLDDT scores for each amino-acid are additionally reported in the JSON and the PDB files produced for each model: in the JSON file under the pLDDT field, and in the PDB file under the B-factor field (second to last column). The average pLDDT score for each predicted model is provided in the output log.txt file.
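Reading pLDDT back from a predicted PDB file only requires the fixed-width column layout of ATOM records, in which the B-factor occupies columns 61-66. The Python sketch below collects one value per residue from the Cα atoms; the function names are hypothetical, and the demo records are generated with dummy coordinates rather than taken from a real prediction.

```python
def ca_plddt(pdb_text):
    """Return the B-factor (pLDDT) of every C-alpha atom, in file order."""
    scores = []
    for line in pdb_text.splitlines():
        # Atom name: columns 13-16; B-factor: columns 61-66 (1-based)
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores

def format_atom(serial, name, resname, resseq, b_factor):
    """Build a fixed-width ATOM record (demo helper with dummy coordinates)."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} A{resseq:>4}    "
            f"{0.0:>8.3f}{0.0:>8.3f}{0.0:>8.3f}{1.0:>6.2f}{b_factor:>6.2f}")

pdb_text = "\n".join([
    format_atom(1, " N  ", "MET", 1, 93.52),
    format_atom(2, " CA ", "MET", 1, 93.52),
    format_atom(3, " CA ", "ALA", 2, 88.10),
])
scores = ca_plddt(pdb_text)
print(scores)                     # [93.52, 88.1]
print(sum(scores) / len(scores))  # average pLDDT over residues
```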

ColabFold’s PAE plot and JSON file

The PAE scores are computed for each of the query’s residues over all other residues (Box 3) and thus can be plotted as a square in the dimensions of the query length (example: Fig. 4d). On the x-axis are the scored residues, on the y-axis are the points of alignment and the color reflects the PAE value (note the non-standard y-axis, with lower values on top). PAE scores are not symmetric, meaning that the score at position (x,y) is generally not equal to that at (y,x). Good (low) scores are colored in blue and bad (high) scores in red. When examining a PAE plot, it is recommended to start by scanning the diagonal. Blue squares along the diagonal most likely indicate a well-predicted domain³⁹. See “Predicted aligned error tutorial” in https://alphafold.ebi.ac.uk/entry/Q5VSL9 for an additional example of interpreting PAE plots. The computed PAE scores used for plotting are also provided in JSON format (Box 2) in the predicted_aligned_error_v1.json file. This file contains two fields. The first field ‘predicted_aligned_error’ is a list (denoted by square brackets) of lists. Each internal list stores the PAE scores for each of the residues. The second field ‘max_predicted_aligned_error’ stores the highest (worst) PAE value.
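The PAE JSON file can be inspected with Python's standard json module. The sketch below uses a tiny, made-up 3×3 matrix in place of a real file and a hypothetical helper that averages the PAE over a block of the matrix, which is one simple way to compare intra-domain and inter-domain confidence. Which of the two indices is the alignment point is an assumption that should be checked against the plot.

```python
import json

def mean_block_pae(pae, rows, cols):
    """Average PAE over the block spanned by the given row and column indices."""
    values = [pae[i][j] for i in rows for j in cols]
    return sum(values) / len(values)

# Made-up content standing in for predicted_aligned_error_v1.json
text = """{
  "predicted_aligned_error": [[0.5, 1.0, 20.0],
                              [1.0, 0.5, 22.0],
                              [19.0, 21.0, 0.5]],
  "max_predicted_aligned_error": 31.75
}"""
data = json.loads(text)   # for a real file: json.load(open(path))
pae = data["predicted_aligned_error"]

print(mean_block_pae(pae, [0, 1], [0, 1]))  # 0.75 (within the first two residues)
print(mean_block_pae(pae, [0, 1], [2]))     # 21.0 (first two residues vs the third)
print(pae[0][2] == pae[2][0])               # False: the matrix is not symmetric
```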

Monomer prediction (PIGU)

After prediction is complete, the first result one can see is the “Sequence coverage” plot in the Run prediction cell. Here, we can see that the 420 amino-acid long query PIGU has an MSA depth of around 2,700 and a coverage that exceeds 400 homologs for every AA (Fig. 4a), which is more than sufficient for structure prediction. Additionally, the MSA is well-balanced, with both distant homologs (colored in red) and close homologs (blue). Next, the five protein models ColabFold produced for PIGU appear in the cell sequentially and are ranked by their average pLDDT. Each protein model is displayed as two 3D plots, one colored by AA position (N→C) and the other by pLDDT. As can be seen in the 3D plots produced for the top-ranking model (Fig. 4b), the model consists of multiple helices arranged from the N-terminus (red) to the C-terminus (blue). In addition, nearly all AAs are colored blue in the pLDDT view, suggesting high prediction confidence. The predicted structures are additionally shown as an interactive plot in the next cell, Display 3D structure. The following cell, Plots, shows the two confidence measures computed by ColabFold: pLDDT and PAE. The pLDDT plot (Fig. 4c) presents the same information visualized in the cell Run prediction but in greater detail. As can be seen, in all five predicted models, most of PIGU’s residues display high pLDDT scores (>70), reflecting strong local prediction reliability. Regions with slightly lower pLDDT (around residues 1, 120, 210 and 420) are likely to be linker or terminal regions. Examined together with Fig. 4a, we see these regions correspond to MSA positions with less evolutionary information (lower coverage). The PAE plot of the top-ranking model (Fig. 4d) is essentially a blue square, suggesting PIGU is composed of a single domain. When compared to the experimentally solved structure, the top-ranking predicted model from ColabFold had a TM-score of 0.989 (Fig. 4e), providing external support for the high average pLDDT of 94.1 (Table 1), and indicating high overall accuracy.

Complex prediction

The plots for complex prediction are similar to those of monomer prediction, with the distinction that each subunit is demarcated by a black line. The sequence coverage plot (Fig. 5a) indicates sufficient conditions for protein prediction because the resulting MSA consists of ca. 1,000 paired sequences as well as over 1,000 unpaired sequences for each subunit, due to setting MSA pair_mode to unpaired_paired (see Box 1). When analyzing the results, we focus on global confidence measures, suitable for complex prediction assessment: PAE, pTM, and ipTM (Box 3). In the PAE plot for the top-ranking model (Fig. 5b), we first notice near-perfect blue squares on the diagonal: one in the square of chain A and another in the square of chain D. This suggests each of these chains forms its own globular structure, which can be predicted with confidence, relative to itself. Specifically, knowing the position of some residues in any of these chains indicates the relative positions of all others in the same chain, with high accuracy. In contrast, the diagonal squares of chains C and E, are crossed with red lines, suggesting each of these chains is likely to be composed of two domains, whose relative position to each other cannot be as confidently determined. The diagonal square of chain B is least blue, possibly indicating three short domains or an unstructured region. Next, we examine the interplay between the chains. We can see the rows of chain B are mostly white/red for all other chains, i.e., knowing the positions of residues in chain B does not allow predicting the relative positions of other chains with confidence, especially those of chain E. In contrast, the rows of chains A and D are mostly light-blue, indicating inter-chain confidence. 
In addition to these “high level” insights, we notice positions of special interest: The positions of residues at the N-terminus of chain D cannot be predicted with confidence, no matter where the point of alignment is, as indicated by the red columns around sequence position 1,500. Additionally, all chains seem to include short segments, perhaps linkers, which are not informative for the structure of other elements, as indicated by red rows around position 100, for example. The computed pTM score for the GPIT structure is 0.9 (Table 1), suggesting high confidence in the structure, despite few lower confidence positions. Finally, the ColabFold-AF2 top-ranking predicted model has a TM score of 0.985, relative to the experimentally solved human GPIT structure (Fig. 5c), confirming AF2’s predicted confidence (pTM) in the predicted structure.

Conformation prediction

In Procedure 3 we used two strategies to explore various conformations of ASCT2: reducing MSA depth and activating dropout layers. Through 16 iterations, a total of 80 predicted protein models were generated for each of the two strategies. We carried out the local version of Procedure 3, thereby producing a shared MSA for all predicted models under each strategy. As can be seen in Fig. 6a, sequence coverage is well beyond sufficient for structure prediction. Next, we found that the average value across all models of the per-residue pLDDT average (Fig. 6b) was 74.7, suggesting confidence in searching for real conformations among them. In the following, we identify alternative conformations using principal component analysis (PCA), as described by Howe⁴¹ and carried out by del Alamo et al.²⁴. To that end, we used a short CPPTRAJ⁴² script to capture the essence of the conformational movements among the residues across the models. To reduce noise in the PCA, we trimmed off terminal stretches of at least ten residues, whose structure was consistently predicted as unreliable (pLDDT < 60 in at least 80% of the models). Ideally, the first two principal components (PCs) should capture most of the variance among the generated models to enable revealing patterns of agreement and disagreement among them. This is indeed the case for ASCT2, where PC1 is especially dominant, capturing >92% of the variance for both strategies (Fig. 6c). We therefore selected two representative models, furthest apart from each other along the PC1 axis (PC1 = -76 and PC1 = 146, for the MSA depth strategy), as these potentially correspond to two conformations. In the case of ASCT2, two experimentally solved conformations are available in the PDB, allowing us to compute the similarity between each model and each conformation. Fig. 6d affirms that PC1 separates the models by conformation: low PC1 values correspond to high similarity to the outward-open conformation, while high PC1 values correspond to low similarity to that conformation and high similarity to the inward-open conformation. Fig. 6e superimposes each of the two selected representatives on the experimentally solved ASCT2 conformation it matches, indicating the high similarity between them (TM score > 0.93 in both cases).

1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

2. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

3. Baek, M. et al. Efficient and accurate prediction of protein structure using RoseTTAFold2. Preprint at https://www.biorxiv.org/content/10.1101/2023.05.24.542179 (2023).

4. Vaswani, A. et al. Attention is All you Need. Advances in Neural Information Processing Systems 30, 5998–6008 (2017).

5. Humphreys, I. R. et al. Computed structures of core eukaryotic protein complexes. Science 374, eabm4805 (2021).

6. Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein-protein interactions using AlphaFold2. Nature Communications 13, 1265 (2022).

7. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nature Methods 19, 679–682 (2022).

8. Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at https://biorxiv.org/content/10.1101/2021.10.04.463034 (2021).

9. Peng, Z., Wang, W., Han, R., Zhang, F. & Yang, J. Protein structure prediction in the deep learning era. Current Opinion in Structural Biology 77, 102495 (2022).

10. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

11. Wu, R. et al. High-resolution de novo structure prediction from primary sequence. Preprint at https://biorxiv.org/content/10.1101/2022.07.21.500999 (2022).

12. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017).

13. Mirdita, M., Steinegger, M. & Söding, J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics 35, 2856–2858 (2019).

14. Abakarova, M., Marquet, C., Rera, M., Rost, B. & Laine, E. Alignment-based protein mutational landscape prediction: Doing more with less. Preprint at https://biorxiv.org/content/10.1101/2022.12.13.520259 (2022).

15. Lee, S. et al. Petascale Homology Search for Structure Prediction. Preprint at https://biorxiv.org/content/10.1101/2023.07.10.548308 (2023).

16. Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).

17. wwPDB consortium et al. Protein Data Bank: The single global archive for 3D macromolecular structure data. Nucleic Acids Research 47, D520–D528 (2019).

18. Liu, J. et al. Enhancing alphafold-multimer-based protein complex structure prediction with MULTICOM in CASP15. Communications Biology 6, 1140 (2023).

19. Peng, Z., Wang, W., Wei, H., Li, X. & Yang, J. Improved protein structure prediction with trRosettaX2, AlphaFold2, and optimized MSAs in CASP15. Proteins: Structure, Function, and Bioinformatics (2023).

20. Rego, N. & Koes, D. 3Dmol.js: molecular visualization with WebGL. Bioinformatics 31, 1322–1324 (2014).

21. Nomura, K. et al. Bacterial pathogens deliver water- and solute-permeable channels to plant cells. Nature 621, 586–591 (2023).

22. Mosalaganti, S. et al. AI-based structure prediction empowers integrative structural analysis of human nuclear pores. Science 376, eabm9506 (2022).

23. Zhang, H. et al. Structure of human glycosylphosphatidylinositol transamidase. Nature Structural & Molecular Biology 29, 203–209 (2022).

24. Alamo, D. del, Sala, D., Mchaourab, H. S. & Meiler, J. Sampling alternative conformational states of transporters and receptors with AlphaFold2. eLife 11, e75751 (2022).

25. Garibsingh, R.-A. A. et al. Rational design of ASCT2 inhibitors using an integrated experimental-computational approach. Proceedings of the National Academy of Sciences 118, e2104093118 (2021).

26. Garaeva, A. A., Guskov, A., Slotboom, D. J. & Paulino, C. A one-gate elevator mechanism for the human neutral amino acid transporter ASCT2. Nature Communications 10, 3427 (2019).

27. Cheng, S. et al. FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours. Preprint at https://arxiv.org/abs/2203.00854 (2022).

28. Ahdritz, G. et al. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Preprint at https://biorxiv.org/content/10.1101/2022.11.20.517210 (2022).

29. Li, Z. et al. Uni-Fold: An Open-Source Platform for Developing Protein Folding Models beyond AlphaFold. Preprint at https://biorxiv.org/content/10.1101/2022.08.04.502811 (2022).

30. Fang, X. et al. HelixFold-Single: MSA-free Protein Structure Prediction by Using Protein Language Model as an Alternative. Preprint at https://arxiv.org/abs/2207.13921 (2022).

31. Pettersen, E. F. et al. ChimeraX: Structure visualization for researchers, educators, and developers. Protein Science 30, 70–82 (2021).

32. Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: A local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).

33. Zhang, Y. & Skolnick, J. TM-align: A protein structure alignment algorithm based on the TM-score. Nucleic Acids Research 33, 2302–2309 (2005).

34. O’Reilly, F. J. et al. Protein complexes in cells by AI-assisted structural proteomics. Molecular Systems Biology 19, e11544 (2023).

35. Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).

36. Gabler, F. et al. Protein sequence analysis using the MPI bioinformatics toolkit. Current Protocols in Bioinformatics 72, e108 (2020).

37. Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. Preprint at https://arxiv.org/abs/1506.02142 (2015).

38. Wallner, B. AFsample: Improving Multimer Prediction with AlphaFold using Aggressive Sampling. Preprint at https://biorxiv.org/content/10.1101/2022.12.20.521205 (2022).

39. Zhang, J., Schaeffer, R. D., Durham, J., Cong, Q. & Grishin, N. V. DPAM: A Domain Parser for AlphaFold Models. Protein Science 32, e4548 (2023).

40. Zhang, H. et al. Structure of a human glycosylphosphatidylinositol (GPI) transamidase. doi:10.2210/pdb7w72/pdb (2022).

41. Howe, P. W. A. Principal components analysis of protein structure ensembles calculated using NMR data. Journal of Biomolecular NMR 20, 61–70 (2001).

42. Roe, D. R. & Cheatham, T. E. PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. Journal of Chemical Theory and Computation 9, 3084–3095 (2013).

43. Garibsingh, R. A. et al. ASCT2 in the presence of the inhibitor Lc-BPE (position “up”) in the outward-open conformation. doi:10.2210/pdb7bcq/pdb (2021).

44. Garaeva, A. A., Guskov, A., Slotboom, D. J. & Paulino, C. doi:10.2210/pdb6rvx/pdb (2019).

M.S. acknowledges the support by the National Research Foundation of Korea, grants [2020M3-A9G7-103933, 2021-R1C1-C102065, 2021-M3A9-I4021220], Samsung DS research fund and the Creative-Pioneering Researchers Program through Seoul National University. M.M. acknowledges support by the National Research Foundation of Korea (grant RS-2023-00250470). Y.M. acknowledges support from Platform Project for Supporting Drug Discovery and Life Science Research (Basis for Supporting Innovative Drug Discovery and Life Science Research (BINDS)) from AMED under grant number JP23ama121027. S.O. was supported by the National Institutes of Health (NIH) DP5OD026389 and the National Science Foundation (NSF) MCB2032259.
