Procedure 1. Monomer prediction
In the following sections, we predict each of the five monomers of the GPIT complex individually, using ColabFold-AF2 both within Colab and on the command line. We explicitly guide the reader through predicting the structure of the PIGU subunit; the other four subunits can be predicted analogously. To obtain the relevant sequences, see section “Equipment” of this protocol.
1A. Web-based prediction
We begin with the quick start, which requires the user to perform three steps, prompting the execution of all the cells in the notebook, without a break. This option is analogously available for other procedures in this protocol. Alternatively, the cells of the notebook can be run, one after the other, from top to bottom.
1A - Quick start Timing ~20 min
i. Open a web-browser and navigate to https://alphafold.colabfold.com. GPU usage is enabled by default; you can verify this by navigating to Runtime → Change runtime type in the menu bar.
ii. Paste the AA sequence of PIGU into the query_sequence field.
iii. Click Runtime in the menu bar and then hit Run all (Fig. 2). This action performs all prediction steps without a break, by sequentially executing every cell in the notebook. The currently running cell is indicated by a spinning circle on the left side.
By default, ColabFold computes five structure models. After each model is generated, the predicted structures and result plots are displayed. Once all cells are processed, a pop-up window will appear, offering to download the various result files (see Box 2 for details about output formats) as a single compressed (zip) file. Please note that the runtime of these steps can vary depending on the GPU assigned by Colab.
1A - Cell-by-cell instructions Timing ~20 min
By running the cells of the notebook (Fig. 3) sequentially, the user can set the options of some cells, based on the result of the previous step(s).
i. Open a web-browser and navigate to https://alphafold.colabfold.com. GPU usage is enabled by default; you can verify this by navigating to Runtime → Change runtime type in the menu bar.
ii. Paste the AA sequence of PIGU into the query_sequence field. However, now run only this single cell by clicking on the triangle icon on the left.
iii. Run the next cell install_dependencies. It doesn’t contain any modifiable parameters.
iv. Continue to MSA options. This cell controls the homology search for the input. Keep it at its default values and run the cell. You can check Box 1 for alternative options for setting this cell before running it.
v. The Advanced Settings cell allows for controlling the type of prediction model, recycling, MSA sampling and randomization. Use its pre-loaded default setting or set it (see Box 1) and run it.
vi. The next cell is Run prediction, which will start the prediction process with/without a preceding MSA computation step (depending on the values set in MSA options cell). The five computed protein models will be ranked by their pLDDT score.
vii. In the next cell Display 3D structure, you can select the model to be displayed by its rank using rank_num and then run the cell to view the result. This cell also allows for adjusting the colors and chain display.
viii. Running the next cell, Plots, will produce the PAE, sequence coverage, and pLDDT plots (discussed in Box 3 and Anticipated Results).
ix. Save the results as a compressed zip file by running the Package and download results cell.
Predicting the structure of PIGK, PIGT, PIGS, and GPAA1 Timing ~20, 30, 30, and 40 min, respectively (T4 GPU)
Repeat the steps of 1A analogously for the other four monomeric units of GPIT.
1B. Local prediction Timing ~5 min
Local predictions with ColabFold are carried out with a single command-line tool, colabfold_batch. It supports batch prediction, eliminating the need to input query sequences one-by-one.
i. Save the sequence of PIGU locally in FASTA format (PIGU.fasta).
In addition to FASTA, ColabFold supports several types of input formats (Box 2). ColabFold can also be provided with pre-computed MSAs (for MSA generation options, see Box 4) by placing them within a single directory.
ii. Use a single command to generate the MSA and predict the structure:
$ colabfold_batch PIGU.fasta /path/to/results
Additional arguments can be passed to this command if non-default settings are desired. Users can refer to the complete list of parameters by running:
$ colabfold_batch --help
These parameters are detailed in Box 1. When running locally, however, replace the underscores in each parameter name with dashes and prefix the name with “--”, for example:
--msa-mode mmseqs2_uniref
The output files (detailed in Box 2) are saved in the specified output directory (/path/to/results). Unlike web-based predictions, local predictions do not run on a notebook with 3D structure visualizations. Instead, users can explore the predicted structures in PDB format using tools like ChimeraX and PyMOL.
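The naming rule for command-line parameters described above (underscores replaced by dashes, prefixed with “--”) can be sketched as a small shell helper. The function name to_cli_flag is hypothetical and not part of ColabFold; it only illustrates the rule. Whether a given notebook parameter actually exists as a command-line argument should still be checked with colabfold_batch --help.

```shell
# Convert a notebook parameter name (e.g. msa_mode) into its command-line
# form (e.g. --msa-mode). Hypothetical helper, for illustration only.
to_cli_flag() {
    printf '%s\n' "--$(printf '%s' "$1" | tr '_' '-')"
}

to_cli_flag msa_mode     # prints --msa-mode
to_cli_flag max_msa      # prints --max-msa
to_cli_flag use_dropout  # prints --use-dropout
```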
Procedure 2. Complex prediction
ColabFold’s complex prediction process closely resembles that of monomers. The main differences are the input, the additional MSA pairing option, and the use of a multimer prediction model. In the following steps, we focus solely on these differences, using the GPIT complex as an example. For all other settings, which are shared with monomer prediction, we refer the reader to the monomer section of the protocol and Box 1.
During input preparation, the sequences of all subunits in the complex need to be concatenated to each other, using a colon (‘:’). See Box 2 for an example concatenation in CSV format. To obtain the GPIT subunit sequences, see section “Equipment” of this protocol.
2A. Web-based prediction Timing ~60 min
Due to the long length of the GPIT complex (~2,500 residues), the web-based prediction was carried out using a paid Pro Colab account, leveraging the much larger amount of GPU RAM of the A100 GPU (40GB GPU RAM).
i. Open a web-browser and navigate to https://alphafold.colabfold.com. Set hardware accelerator to A100 GPU by navigating to Runtime → Change runtime type in the menu bar.
ii. Concatenate the sequences of all subunits of the GPIT complex using a colon (‘:’) and paste the concatenated sequence into the query_sequence field.
iii. All cells are preset to their default values, which are used in this example. Other options are described in Box 1. In addition to the options available for monomer prediction, these include control over the pairing process, which identifies sequences from the same taxon across the different protein subunits, and over the model used for prediction. Click Runtime and hit Run all. This action performs all prediction steps without a break.
Once all cells are processed, a pop-up window will appear, offering to download the various result files (see Box 2) as a single compressed file. In complex prediction the five generated protein models will be ranked by their pTM and ipTM scores (Box 3) with the following formula: 0.2×pTM + 0.8×ipTM.
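As a worked example of the ranking formula above, the following sketch computes 0.2×pTM + 0.8×ipTM with awk. The function name rank_score and the input values are hypothetical and chosen only for illustration; real values are found in the JSON output files (Box 2).

```shell
# Compute the complex ranking score 0.2*pTM + 0.8*ipTM.
# $1: pTM, $2: ipTM (both between 0 and 1). Hypothetical helper.
rank_score() {
    LC_ALL=C awk -v ptm="$1" -v iptm="$2" \
        'BEGIN { printf "%.3f\n", 0.2 * ptm + 0.8 * iptm }'
}

# Illustrative values: 0.2*0.80 + 0.8*0.90 = 0.880
rank_score 0.80 0.90
```

Because ipTM carries four times the weight of pTM, two models with a similar overall fold are separated mainly by the confidence of their predicted interfaces.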
2B. Local prediction Timing ~110 min
i. Concatenate the sequences of all subunits of the GPIT complex using a colon (‘:’) and save the concatenated sequence in CSV format (example in Box 2) as input.csv.
ii. Use a single command to generate the MSA and predict the structure:
$ colabfold_batch input.csv /path/to/results
The results will be saved in the output directory (/path/to/results).
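The input preparation in step i can be sketched as follows, assuming that all subunit sequences are available in a single multi-entry FASTA file. The function name fasta_to_complex_csv and the file names are hypothetical; the short sequences are the trimmed examples from Box 2, not the full GPIT sequences.

```shell
# Join all sequences of a multi-entry FASTA with ':' and write them as a
# single complex entry in ColabFold CSV format.
# $1: FASTA file with one entry per subunit, $2: id for the CSV row.
fasta_to_complex_csv() {
    awk -v id="$2" '
        /^>/ { n++; next }                            # header: next subunit
        { gsub(/[ \t\r]/, ""); seq[n] = seq[n] $0 }   # join wrapped lines
        END {
            print "id,sequence"
            line = seq[1]
            for (i = 2; i <= n; i++) line = line ":" seq[i]
            print id "," line
        }' "$1"
}

# Usage with the trimmed example sequences of Box 2
workdir=$(mktemp -d)
printf '>PIGU_trim\nMAAPLVLVLV\n>PIGT_trim\nARDSLREELV\n' > "$workdir/subunits_trim.fasta"
fasta_to_complex_csv "$workdir/subunits_trim.fasta" GPIT_trim > "$workdir/input_trim.csv"
cat "$workdir/input_trim.csv"
```

The second line of the resulting file reads GPIT_trim,MAAPLVLVLV:ARDSLREELV, matching the complex example in Box 2.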
Procedure 3. Conformation prediction
This procedure uses the AF2 models outside their original scope, which is to predict a single conformation for a given AA sequence. The outline provided here is partly based on an ad-hoc method proposed by del Alamo et al.24, who also warn that there is no one-size-fits-all approach for sampling the conformational space. We therefore recommend trying out various parameter options as well as being cautious when interpreting the results.
As shown by del Alamo et al., reducing MSA depth can contribute to conformational sampling. At the heart of this approach lies the following logic. Proteins that undergo conformational changes are likely to have AA pairs, which are in strong structural interaction in one conformation, but exhibit different interaction in another. These AA pairs will tend to co-evolve, meaning that when examined over many homologs from various species, a change in one AA will often be associated with a change in the other AA of the pair. Thus, when an MSA with many homologs (“deep MSA”) is provided to the AF2 models, these AA pairs carry a signal, which prompts the AF2 models to predict the conformation in which they interact. However, when the MSA depth is reduced by removing homologs, the co-evolution signal is weakened, potentially allowing the AF2 models to predict alternative conformations.
Other factors can additionally contribute to conformational sampling by increasing the uncertainty of the network, i.e., having less certainty in a single dominant conformation37,38. These factors include providing templates of alternative conformations or activating dropout layers in the AF2 neural network. Dropout layers are used during the neural network training and prompt the network to produce alternative solutions for the same input, by randomly deactivating certain weights. Thus, enabling these layers during inference enhances the variety of predicted structures.
Here we present two sampling strategies, focused either on reducing MSA depth, which is controlled by ColabFold-AF2 through the max_msa parameter (see Box 1) or on activating dropout layers. Depending on the starting point (seed), each AF2 execution can reach slightly different results. Thus, in both strategies, we increase the number of times the model is run using different starting points, thereby increasing the chance of sampling alternative conformations. The following procedure uses ASCT2 as an example. To obtain its sequence, see section “Equipment” of this protocol.
3A. Web-based prediction Timing ~35 min (MSA depth reduction), 60 min (Dropout)
To shorten run times, the web-based prediction was conducted using an A100 GPU on a paid Pro Colab account. However, this procedure can also be performed with the default T4 GPU, taking ca. 7 hours.
i. Open a web-browser and navigate to https://alphafold.colabfold.com. GPU usage is enabled by default; you can verify this by navigating to Runtime → Change runtime type in the menu bar.
ii. Paste the ASCT2 sequence into the query_sequence field.
iii. Conformation prediction requires changing the default settings to increase the uncertainty of the network by modifying parameters in the Advanced settings → Sample settings field. In the MSA depth reduction strategy, adjust the value of max_msa to 32:64. Alternatively, activate the dropout layers by checking the use_dropout box. In addition to either strategy, set num_seeds to its maximum (16), to generate models using different starting points (see Box 1). All other cells are kept at their default values in this example. Box 1 informs on how to modify them, if desired.
iv. Click Runtime and hit Run all. This action performs all prediction steps without a break.
Once all cells are processed, a pop-up window will appear, offering to download the various result files (see Box 2) as a single compressed file. For each seed from 0 to 15, five structure models will be computed, resulting in a total of 80 predictions, ranked by their pLDDT score.
3B. Local prediction Timing ~55 min (MSA depth reduction), 100 min (Dropout)
On the command line, instead of setting num-seeds (see Box 1), we set the starting points directly through the random-seed parameter. Both options result in similar behavior.
i. Save the ASCT2 protein sequence FASTA file (rcsb_pdb_7BCQ.fasta) in the input file directory.
ii. For the MSA depth reduction strategy, the basic command is:
$ colabfold_batch rcsb_pdb_7BCQ.fasta /path/to/result --max-msa 32:64 --random-seed z
For the dropout-activation strategy it is:
$ colabfold_batch rcsb_pdb_7BCQ.fasta /path/to/result --use-dropout --random-seed z
This basic command should be run 16 times with a different z value each time (0,1,...,15). To avoid typing the command multiple times, use the following bash script, which embeds the basic command in a loop. In addition, it copies the MSA computed in the first iteration to all following iterations, thereby reducing calculation time:
#!/bin/bash
INPUTFILE="rcsb_pdb_7BCQ.fasta"
OUTPUTDIR="ASCT2/32_64"
# Select one strategy:
#   MSA depth reduction: STRATEGY=(--max-msa 32:64)
#   Dropout activation:  STRATEGY=(--use-dropout)
STRATEGY=(--max-msa 32:64)
for RANDOMSEED in $(seq 0 15); do
    if test "${RANDOMSEED}" -ne 0; then
        mkdir -p "${OUTPUTDIR}/${RANDOMSEED}"
        # Copy the MSA file of iteration '0' to skip MSA computation by the webserver
        cp -rp "${OUTPUTDIR}"/0/*_env "${OUTPUTDIR}/${RANDOMSEED}"
    fi
    colabfold_batch \
        --random-seed "${RANDOMSEED}" \
        "${STRATEGY[@]}" \
        "${INPUTFILE}" \
        "${OUTPUTDIR}/${RANDOMSEED}"
done
After completion, in each output directory ($OUTPUTDIR/$RANDOMSEED) corresponding to seed numbers 0-15, you will find the output files generated for the five models.
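To get an overview of the per-seed results, the top-ranked model of each seed directory can be listed with a short helper. This is a sketch under an assumption: it expects the rank to appear in the PDB file names as rank_001, which should be verified against the files produced by the installed ColabFold version (adjust the pattern if the names differ). The function name list_top_models is hypothetical.

```shell
# Print "seed<TAB>path" for the top-ranked PDB file of every seed directory.
# $1: the OUTPUTDIR used in the loop above.
# Assumption: top-ranked models contain "rank_001" in their file name.
list_top_models() {
    for seed_dir in "$1"/*/; do
        [ -d "$seed_dir" ] || continue
        seed=$(basename "$seed_dir")
        for pdb in "$seed_dir"*rank_001*.pdb; do
            [ -e "$pdb" ] || continue        # skip unmatched glob patterns
            printf '%s\t%s\n' "$seed" "$pdb"
        done
    done | sort -n                           # order by seed number
}

list_top_models "ASCT2/32_64"
```

The listed files can then be opened together in ChimeraX or PyMOL to compare the sampled conformations.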
Box 1. Parameter configuration
Tuning homology search - MSA generation options
msa_mode
- Use ColabFold MSA server: search against the UniRef database only (mmseqs2_uniref) or UniRef and ColabFoldDB (mmseqs2_uniref_env, default)
- custom: upload a pre-computed MSA
- single_sequence: predict using a single sequence without an MSA. This option is useful for de novo designed proteins with no homology to natural proteins.
pair_mode: For heteromeric complex predictions, ColabFold first generates MSAs separately for each chain. Then, in a process called “pairing”, it identifies and marks sequences from the same (sub-)species across all MSAs, to enhance prediction quality.
- paired: perform pairing and retain only sequences that can be paired
- unpaired: skip pairing
- unpaired_paired (default): pair and retain all sequences
pairing_strategy: This parameter takes effect when pairing is enabled.
- complete: consider a sequence as “paired” if it has a taxonomic match in every chain (all MSAs)
- greedy (default): a sequence is paired if it has a taxonomic match in at least two chains (two MSAs)
Searching for alternative conformations - MSA sampling options
max_msa: controls the number of sequences used for structure prediction and is provided as max_msa_clusters:max_extra_msa, where max_msa_clusters corresponds to N_seq and max_extra_msa to N_extra_seq, as defined in Suppl. section 1.1 of Jumper et al.1. By default (auto), these values are set to 512:5120. Lower values are useful when searching for alternative conformations (explained in Procedure 3).
- max_msa_clusters: the number of randomly chosen MSA sequences, which serve as center points in a clustering procedure preceding prediction with the AF2 network
- max_extra_msa: MSA sequences, which were not selected as cluster centers and are used for extra computation by the AF2 network
num_seeds: indicates the number of different random seeds to use. By default (1) a single seed value will be used. These seeds determine random components throughout prediction (e.g., when selecting the sequences, which serve as cluster centers). Setting a higher value can increase the chance of obtaining a better confidence score in cases where the MSA is very small and templates are lacking. In combination with max_msa, it can help explore alternative conformations.
use_dropout: activates dropout layers during inference, which prompts the AF2 neural network to be less confident in a single conformation.
Improving or reproducing results - Model options
model_type: specifies the AF2 model (alphafold2_ptm or alphafold2_multimer_v1, _v2, _v3) to use for prediction. By default (auto), the newest models are used: alphafold2_ptm for monomers and alphafold2_multimer_v3 for complexes. Older models are mainly used for reproducing older results (but see also “Troubleshooting”).
template_mode: specifies whether to incorporate templates (pdb100, default) or not (none), or use custom templates (custom).
num_recycles: controls the number of times a prediction is re-fed to the model. By default, it is set to 3, except for model_type=alphafold2_multimer_v3, where it is set to 20.
recycle_early_stop_tolerance (in short tol): controls the convergence criterion, which is considered together with num_recycles. By default, it is set to 0.0, except for model_type=alphafold2_multimer_v3, where it is set to 0.5.
Box 2. Input and output formats
When starting from a single protein, ColabFold accepts the sequence in either FASTA or CSV/TSV format. If the user provides a precomputed MSA, ColabFold requires A3M format, which is also the format in which ColabFold writes the MSAs it computes itself. ColabFold can operate on multiple inputs by taking in a list of sequences in FASTA/CSV format or a directory containing FASTA/A3M files.
FASTA: Each entry has a header starting with ’>’, followed by sequence lines. Example:
>PIGU_trim
MAAPLVLVLVVAVTVRAALFRSSLAEFISERVE
>PIGT_trim
ARDSLREELVITPLPSGDVAATFQFRTRWDSELQREGVSHY
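Because run time and GPU memory requirements grow with sequence length, it can be useful to check the length of each FASTA entry before starting a prediction. The following sketch does this with awk; the function name fasta_lengths and the file name are hypothetical.

```shell
# Print "id<TAB>length" for every entry of a FASTA file.
# Sequences wrapped over several lines are handled.
fasta_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len   # report the previous entry
            id = substr($0, 2); len = 0; next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (id != "") print id "\t" len }' "$1"
}

# Usage with the example entries above
fasta=$(mktemp)
printf '>PIGU_trim\nMAAPLVLVLVVAVTVRAALFRSSLAEFISERVE\n' > "$fasta"
printf '>PIGT_trim\nARDSLREELVITPLPSGDVAATFQFRTRWDSELQREGVSHY\n' >> "$fasta"
fasta_lengths "$fasta"
```

For the two trimmed example entries this reports lengths of 33 and 41 residues.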
CSV: The first line is always “id,sequence”. Each line that follows contains the header and the sequence, separated by a comma. Examples:
Monomer
id,sequence
PIGU_trim,MAAPLVLVLV
PIGT_trim,ARDSLREELV
Complex
id,sequence
GPIT_trim,MAAPLVLVLV:ARDSLREELV
A3M begins with a super-header, followed by the aligned sequence entries. Each entry has a header starting with ’>’, followed by sequence lines.
- The super-header, starting with ’#’, contains two comma-separated lists separated by a tab. The first list indicates the length of each chain, while the second - its cardinality. Each chain appears only once, regardless of its cardinality.
- The MSA is described with respect to the first (query) sequence. An amino-acid (AA) that is aligned to an AA in the query is shown in upper case, an insertion relative to the query - in lower case, and a deletion as ‘-’. Example A3M for different input types:
Monomer
#28 1
>example_monomer_query
MAAPLVLVLVVAVTVRAALFRSSLAEFI
>found_homolog_1
MAFPLALVLVVAVTVR-ALFRSSLAEFI
>found_homolog_2
...
Hetero-oligomer
#28,20 4,2
>example_hetero6mer_A4B2_query
MAAPLVLVLVVAVTVRAALFRSSLAEFITTAVNYPFVDTMDKFDKITK
>found_homolog_paired1
--FPLALVLVVAVTVRAALFRSSLAEFITTLVNYPFVDTMDKFDFITF
>found_homolog_unpaired_chain1
-AFPLALVLVVAVTVRAALFRSSLAEFI--------------------
>found_homolog_unpaired_chain2
----------------------------TTLVNYPFVDTMDKFDKITF
...
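The A3M layout described above can be checked with a short awk sketch that reads the super-header, counts the entries, and compares the query length with the sum of the chain lengths (each chain appears once in the query, regardless of its cardinality). The function name a3m_summary is hypothetical; the example input is the beginning of the hetero-oligomer example above.

```shell
# Summarize an A3M file with a ColabFold-style super-header.
# $1: path to the A3M file.
a3m_summary() {
    awk '
        NR == 1 && /^#/ {
            # "#len1,len2<TAB>card1,card2"; tolerate spaces instead of a tab
            split(substr($0, 2), parts, /[ \t]+/)
            n = split(parts[1], len, ",")
            split(parts[2], card, ",")
            for (i = 1; i <= n; i++) {
                printf "chain %d: length %d, copies %d\n", i, len[i], card[i]
                expected += len[i]
            }
            next
        }
        /^>/ { nseq++; next }
        nseq == 1 { gsub(/[ \t\r]/, ""); qlen += length($0) }  # query only
        END {
            printf "entries: %d\n", nseq
            printf "query length %d, expected from super-header %d\n", qlen, expected
        }' "$1"
}

a3m=$(mktemp)
{
    printf '#28,20\t4,2\n'
    printf '%s\n' '>example_hetero6mer_A4B2_query'
    printf '%s\n' 'MAAPLVLVLVVAVTVRAALFRSSLAEFITTAVNYPFVDTMDKFDKITK'
    printf '%s\n' '>found_homolog_paired1'
    printf '%s\n' '--FPLALVLVVAVTVRAALFRSSLAEFITTLVNYPFVDTMDKFDFITF'
} > "$a3m"
a3m_summary "$a3m"
```

For this input, the query length (48) equals the sum of the two chain lengths (28 + 20), as expected for a well-formed file.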
Output: ColabFold computes five protein models and reports them in three types of files:
PDB-format for reporting the structures. See https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/introduction for details. Reading the values stored in PDB files is possible using text editors and visualizing the structure - by tools like ChimeraX31 and PyMOL (https://github.com/schrodinger/pymol-open-source). Within Colab the predicted PDB files are visualized using 3Dmol.js20.
JSON hierarchical data format for reporting the models’ confidence measures and the configuration of the parameters used for prediction. Reading JSON files is possible through text editors or web-browsers like Firefox, among others.
PNG image format for visualizing the alignment coverage and confidence scores.
Box 3. Prediction confidence measures
AF2 has its own set of confidence measures, corresponding to traditional measures for structure prediction, whose computation requires knowledge of the true protein structure. Since the true structure is generally unknown, AF2, through training on a set of proteins with known structures, has learnt to predict these confidence measures alongside predicting the structure itself. These measures are named pX, where X is a traditional measure and p stands for “predicted”.
predicted Local Distance Difference Test (pLDDT): For every AA in the query, AF2 predicts its lDDT-Cα score32 on a scale of 0 (bad) to 100 (excellent). Each AA’s α carbon can be described by its distances to neighboring α carbons (e.g. within a radius of 15 Å) in the true structure. The superposition-free lDDT-Cα metric reflects how preserved these distances are in the prediction model, where the confidence for each score range is as follows:
> 90: highly confident
70 < pLDDT < 90: confident at the backbone level
50 < pLDDT < 70: lower confidence, interpret cautiously
< 50: potentially disordered regions; should be ignored
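AF2 and ColabFold store the per-residue pLDDT in the B-factor column of the predicted PDB files (columns 61-66 of ATOM records), so a mean pLDDT can be obtained without additional software. The sketch below averages over Cα atoms and maps a value to the ranges listed above. The function names are hypothetical, and the assignment of the exact boundary values (90, 70, 50) to a range is an assumption, since the ranges above leave them open.

```shell
# Mean pLDDT over all C-alpha atoms of a predicted PDB file.
# $1: path to the PDB file (pLDDT is read from the B-factor column).
mean_plddt() {
    LC_ALL=C awk '
        substr($0, 1, 4) == "ATOM" && substr($0, 13, 4) == " CA " {
            sum += substr($0, 61, 6); n++
        }
        END { if (n > 0) printf "%.2f\n", sum / n }' "$1"
}

# Map a pLDDT value to the confidence ranges listed above.
# Assumption: boundary values are assigned to the lower-confidence range
# at 90 and to the higher-confidence range at 70 and 50.
plddt_band() {
    LC_ALL=C awk -v v="$1" 'BEGIN {
        if (v > 90)       print "highly confident"
        else if (v >= 70) print "confident at the backbone level"
        else if (v >= 50) print "lower confidence"
        else              print "potentially disordered"
    }'
}

plddt_band 83.4   # prints: confident at the backbone level
# For a predicted model: plddt_band "$(mean_plddt /path/to/model.pdb)"
```

A mean value hides local variation, so the per-residue pLDDT plot (Box 2, PNG output) should still be inspected before drawing conclusions about individual regions.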
Predicted Aligned Error (PAE): For every amino-acid x in the query, AF2 predicts a set of scores on a scale of 0 (excellent) to >30 (bad). Each score reflects the expected displacement of x in Å in the predicted structure relative to the unknown true structure, when the two structures are aligned on some other amino-acid y. PAE is especially useful when evaluating the prediction of multi-domain/chain proteins, where the placement of each domain is important. Specifically, if amino-acids, which are part of one domain, score well also when the alignment point is outside their domain, it suggests inter-domain confidence.
predicted Template Modeling score (pTM): the template modeling (TM) score33 is expressed using the distance between the predicted position of each AA and its true position, when the predicted and true structures are optimally aligned. AF2 predicts an approximate TM score by replacing the optimal alignment, which is infeasible to compute, with the row in the PAE matrix (i.e., an alignment on a single AA), where the total error is lowest (for full details, see Suppl. pg 37-38 of Jumper et al.1). The computed pTM scores range from 0 (bad) to 1 (excellent), often interpreted as:
> 0.8: highly confident with congruent topology and backbone
0.5 or 0.7 < pTM < 0.8: reliable fold for single- or multi-domain proteins, respectively
< 0.2: no correlation with the true structure, intrinsically disordered protein, or no MSA/templates
interface pTM (ipTM): For complexes, AlphaFold-multimer predicts a modified pTM score, ipTM8, which takes into account only the inter-subunit distances, estimating the prediction accuracy of interfaces. Similar to pTM, ipTM ranges from 0 (bad) to 1 (excellent), where an ipTM score > 0.85 (ref. 34) is considered reliable.
Box 4. Custom MSA generation
Instead of using the default MSA server, users can provide precomputed MSAs in A3M format (see Box 2). The following are alternatives for MSA generation.
Local colabfold_search: compute an MSA locally with the command
$ colabfold_search input.fasta database/ msas
- The input file (input.fasta in this example) should contain the query sequence(s) and database/ should indicate a path to the sequence database. The execution will generate A3M-formatted MSAs in the msas folder.
API hosting: For greater flexibility, users can also host their own API and pass its address to --host-url when running colabfold_batch.
$ colabfold_batch input.fasta /path/to/results --host-url "https://api.example.org"
Online tools: Users can utilize tools such as the HHblits Toolkit server35,36 (https://toolkit.tuebingen.mpg.de/tools/hhblits) for web-based MSA generation. Select the tool’s settings for achieving maximal sensitivity. In the case of HHblits, we recommend using a protocol similar to RoseTTAFold2, with 3 iterations, a result list size of 10,000, and a strict E-value cut-off of 10^-20. If no or only a few homologs are found, relax the E-value threshold.