AutoMotif Server: a computational protocol for identification of post-translational modifications in protein sequences

doi:10.1038/nprot.2007.183

Method Article


This protocol has been posted on Protocol Exchange, an open repository of community-contributed protocols sponsored by Nature Portfolio. These protocols are posted directly on the Protocol Exchange by authors and are made freely available to the scientific community for use and comment.

Version 1


Biochemistry

Computational biology and bioinformatics

post-translational modifications

phosphorylation

kinase substrate prediction

protein kinases

acetylation

sulfation

amidation

hydroxylation

methylation

pyrrolidone

gamma-carboxyglutamic modification

sequence similarity

database of functional sequence segments

Swiss-Prot database

support vector machine

machine learning

The rapid increase in genomic information requires new automatic techniques for investigating protein function. The function of a protein is partially determined by short sequence segments. For example, phosphorylation by protein kinases is an important mechanism for controlling intracellular processes. Many kinases are known, but the identification of their potential biological targets is still a subject of ongoing research. The high substrate specificity of protein kinases ensures correct transmission of signals in cells. This specificity is largely determined by the primary sequence of the target site, but we lack general, efficient and reliable tools for identifying these sites. Most methods designed to predict functional motifs process local sequence information around post-translational modification sites.

We present here an advanced computational protocol for rapid identification of post-translational modifications (PTMs) in proteins on the whole-genome scale. The AutoMotif Server (AMS) identifies various types of post-translational modifications in protein sequences. A query protein sequence is dissected into overlapping short segments. Each segment is projected into an abstract space of sequence fragments using 10 different representations. These projections are compared with a database of representations of known, experimentally confirmed post-translational modification sites using the support vector machine (SVM) approach ^{1, 2}. This supervised machine learning approach is able to predict most post-translational modification sites in proteins. It is based on the classification of biological functional information acquired from the Swiss-Prot database version 4.2. The classification models are then used to predict new modification sites in proteins. Users can access a list of protein sites annotated in the Swiss-Prot database as able to undergo a certain post-translational modification, and can add new annotated sequence segments from proteins (positive instances).

The AMS server has been demonstrated ^{3, 4} to achieve high accuracy in distinguishing short sequence fragments that are post-translationally modified from those that are not. The efficiency of the classification for each type of modification, and the predictive power of several versions of the method, are estimated using standardized leave-one-out tests. The sensitivities of the protocol for all types of modifications are in the range of 70%.

The AutoMotif Server is freely available at http://automotif.bioinfo.pl/. A local version of the software is available on request from the authors. The parameters (the search type, the number of top models, and the PTM type) are optional and can be easily modified. The following protocol describes how to use the AMS server to detect various types of post-translational modifications, and how to interpret the resulting score for a given prediction.

A typical personal computer running Linux, Apple Mac OS X or Windows
An input sequence or a set of sequences in FASTA file format, from experimental data or sequence databases
A web browser. We suggest Firefox, but Apple Safari, Microsoft Internet Explorer or the Mozilla suite can also be used.

The AutoMotif Server (AMS) dissects a query protein sequence into overlapping short sequence segments and identifies selected types of post-translational modification sites. We use supervised SVM classification, trained on experimental knowledge, to identify PTM sites. Each sequence segment is assigned a real number calculated by the cost function of the SVM classification model. Residues whose cost function value (the score) is larger than a given cut-off value are identified as possible modification sites. This means that the point representing the sequence segment lies in the region of the multidimensional space classified as “positive” by the SVM model’s hyperplane, within the given cut-off value. The AMS web server uses only a single kernel type, the most effective one: the polynomial kernel. The one-vote-wins method is used to annotate segments that are predicted as positive by at least one classification model.
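The scanning step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the server's code: the 9-residue window and the 'X' padding follow this protocol, while `dissect`, `scan_sequence`, the `targets` argument and the toy scoring function are hypothetical stand-ins for a trained SVM model.

```python
from typing import Callable, List, Set, Tuple

WINDOW = 9          # segment length used in this protocol
HALF = WINDOW // 2  # residues on each side of the central residue


def dissect(sequence: str) -> List[Tuple[int, str]]:
    """Return (1-based central position, 9-residue segment) for every residue.

    Positions beyond the sequence termini are filled with 'X'.
    """
    padded = "X" * HALF + sequence + "X" * HALF
    return [(i + 1, padded[i:i + WINDOW]) for i in range(len(sequence))]


def scan_sequence(sequence: str,
                  score_fn: Callable[[str], float],
                  targets: Set[str],
                  cutoff: float) -> List[Tuple[int, str, float]]:
    """Keep segments whose central residue is a candidate type and whose
    score is larger than the cut-off."""
    hits = []
    for position, segment in dissect(sequence):
        if segment[HALF] not in targets:
            continue
        score = score_fn(segment)
        if score > cutoff:
            hits.append((position, segment, score))
    return hits


def toy_score(segment: str) -> float:
    """Hypothetical scorer: counts basic residues (R, K) in the window."""
    return float(segment.count("R") + segment.count("K"))


hits = scan_sequence("MKRRSLTAA", toy_score, targets={"S", "T"}, cutoff=1.5)
for position, segment, score in hits:
    print(position, segment, score)
```

In a real setting `score_fn` would be replaced by the decision value of a trained classifier applied to an encoded segment.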
The AMS server accepts input sequences in one-letter code in capital letters, 'ACDEFGHIKLMNPQRSTVWY', with the additional letter X marking empty or unknown positions in a protein sequence, or the extension of a sequence segment. Users can input sequences by submitting a text file in FASTA format (for details see http://en.wikipedia.org/wiki/Fasta_format ), by providing a SWISS-PROT/TrEMBL identifier or accession number in the text box, or by simply pasting the amino acid sequences.
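Because input preparation matters, it can help to check sequences locally before submission. The following Python sketch parses FASTA text and reports letters outside the accepted alphabet. The function names and the fallback header 'query' for pasted raw sequences are assumptions made for illustration; the server's own validation may differ.

```python
from typing import Dict, Set

# Alphabet accepted by the server according to this protocol, plus 'X'.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" + "X")


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}.

    Sequence lines that appear before any '>' header are stored under the
    hypothetical key 'query' (the case of a pasted raw sequence).
    """
    records: Dict[str, str] = {}
    header = "query"
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip() or "query"
            records.setdefault(header, "")
        else:
            records[header] = records.get(header, "") + line
    return records


def invalid_letters(sequence: str) -> Set[str]:
    """Return the characters that are not in the accepted alphabet.

    Lower-case letters are reported too, since capital letters are expected.
    """
    return set(sequence) - ALLOWED


example = ">sp|TEST|demo\nMKRRSLTAA\nACDZ\n"
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq), sorted(invalid_letters(seq)))
```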
By default the server predicts all types of post-translational modification sites that were precalculated by the authors and for which enough instances are available in the Swiss-Prot database. The list presently includes acetylation, amidation, hydroxylation, methylation, sulfation and phosphorylation (by PKC, PKA, CK, CK2 and CDC2 protein kinases). The search can be limited by selecting a particular type of functional motif from the drop-down menu on the server’s web page (for example, phosphorylation sites in general or by specific kinases).
Two types of search procedure are available on the server: the identity search and the scan based on SVM classification. The first method identifies 9-residue segments of the query protein that are identical in sequence to entries in the database of positives for the selected type of modification. The second method runs several versions of SVM prediction that use different projection methods. Registered users (registration is available via the “User Site” link on the main web page) can submit their own list of training instances as a text file containing a set of segments dissected from multiple proteins known to perform a certain function. The AMS server then trains an SVM for the new type of functional motif and uses it to scan any query protein sequence for potential substrates. This allows users to introduce new types of biochemical processes that are not yet publicly known, or that are not contained in the Swiss-Prot database.
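The identity search can be illustrated with a few lines of Python. This is a sketch under assumptions: the set of positives is a toy example, terminal windows are not padded with 'X', and `identity_search` is a hypothetical name rather than part of the AMS software.

```python
from typing import List, Set, Tuple

SIZE = 9  # segment length used in this protocol


def identity_search(query: str, positives: Set[str]) -> List[Tuple[int, int, str]]:
    """Return (start, central position, segment) for every 9-residue window
    of the query that is identical to a known positive segment.

    Positions are 1-based.
    """
    matches = []
    for i in range(len(query) - SIZE + 1):
        segment = query[i:i + SIZE]
        if segment in positives:
            matches.append((i + 1, i + 1 + SIZE // 2, segment))
    return matches


toy_positives = {"KRRASLTAA"}  # toy data, not taken from Swiss-Prot
print(identity_search("MKRRASLTAAKG", toy_positives))
```

Storing the positives in a set keeps each lookup constant-time, so the search is linear in the length of the query.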
The output web page for a query protein contains two sections. The first section displays the prediction results for each selected model: the parent protein information (i.e. the sequence number in the query set), the local segment sequences predicted as modified sites, their positions (start, modified central residue, end position and segment size) and the output scores. The second section describes each type of post-translational modification used, its protein agent, and the best SVM method used to classify known instances. Each SVM model is described by the number of positive and negative instances used in training, and by the precision and recall of the classification model.
The accuracy of the SVM classification models is described by two numbers: the recall R and the precision P. Recall R is the percentage of observed positives that are correctly predicted (the probability that a true site is found), whereas precision P is the percentage of positive predictions that are correct (a measure of the reliability of positive predictions). Both measures are calculated separately for each type of PTM using the leave-one-out procedure. The typical recall value is around 30%, and the precision is over 70% for the majority of PTMs.
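For reference, the standard definitions P = TP / (TP + FP) and R = TP / (TP + FN) can be computed as in the Python sketch below. The counts in the example are invented to mirror the typical magnitudes quoted above; they are not results from the server.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Standard precision and recall from true positives (tp),
    false positives (fp) and false negatives (fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented counts: 30 known sites found, 10 false alarms, 70 known sites missed.
precision, recall = precision_recall(tp=30, fp=10, fn=70)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A model with these counts is conservative: most of its positive predictions are correct, but it misses many known sites.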
For a single query protein, the computational protocol gives, for each type of PTM, the list of predicted modifications for that sequence. When a set of sequences is used as input, the protocol returns, for each type of modification, the list of predicted modified short sequence fragments together with the parent protein number. The list of predicted modified sites is not ordered.
A consensus prediction is also available on the output web page when several different versions of the method predict the same local sequence fragment to undergo a given post-translational modification.
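The relationship between one-vote-wins annotation and consensus prediction can be sketched in Python as below. The model names are taken from the representations discussed in this protocol, but the positions, the scores and the threshold of two agreeing models are illustrative assumptions; the sketch also assumes each model reports a site at most once.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def collect_votes(predictions: Dict[str, List[Tuple[int, float]]]) -> Dict[int, List[str]]:
    """Map each predicted position to the list of models that predicted it.

    Every position in the result has at least one vote, which corresponds
    to the one-vote-wins annotation.
    """
    votes = defaultdict(list)
    for model, sites in predictions.items():
        for position, _score in sites:
            votes[position].append(model)
    return dict(sorted(votes.items()))


def consensus_sites(votes: Dict[int, List[str]], min_models: int = 2) -> List[int]:
    """Positions predicted by at least `min_models` different models."""
    return [pos for pos, models in votes.items() if len(models) >= min_models]


# Invented example output of three models: (position, score) pairs.
predictions = {
    "BIN": [(5, 1.2), (7, 0.4)],
    "PROF+LOOKUP": [(5, 0.9)],
    "BLOSUM+LOOKUP": [(12, 2.1)],
}
votes = collect_votes(predictions)
print(votes)
print(consensus_sites(votes))
```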

The AMS server is able to predict all types of post-translational modifications for a query sequence in real time, even if multiple classification models are used. When a large set of input sequences is used, the time needed to perform the prediction scales linearly with the size of the set. If two sets are submitted at once, they are run in parallel on our Linux cluster, so the time is the same as for a single submission. Therefore the critical step for computations is the proper preparation of input data.

Database & Representations

The AMS method for predicting plausible post-translational modification sites classifies known experimental instances. Only the sequence information is used as an input, because in most cases only the potential target protein sequence is known. Our analysis is based on biological information acquired from the Swiss-Prot database ^{5, 6}.
Proteins with acetylation, phosphorylation (by PKA, PKC, CK, CK2 and CDC2 kinases), sulfation, amidation, hydroxylation, methylation, pyrrolidone and gamma-carboxyglutamic modification sites are selected for our analysis. These processes have the largest number of known experimental instances. Training cases are taken from experimentally annotated proteins. For each type of post-translational modification, the list of proteins with at least one modified site of that type is fetched from the Swiss-Prot database. Sites annotated “by similarity”, “partial”, “potential”, “probable” or “predicted” are excluded from the analysis. The remaining residues form the dataset of positive cases, which includes all short sequence segments of 9 amino acids dissected from the parent proteins and centered on the annotated residue. In the case of non-symmetric segments, where the modified residue cannot be placed in the center, the missing positions in the segment are filled with ‘X’ as the amino acid type. All redundant segments in the database, i.e. those with the same sequence, are removed from the training dataset.
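The construction of the positive dataset can be sketched in Python as follows. The segment length, the 'X' padding and the removal of identical segments follow the description above; the function names and the toy sequences are assumptions for illustration only.

```python
from typing import Iterable, List, Tuple

SIZE = 9
HALF = SIZE // 2


def centered_segment(sequence: str, position: int) -> str:
    """9-residue segment centered on a 1-based position, padded with 'X'
    where the window runs past either terminus."""
    padded = "X" * HALF + sequence + "X" * HALF
    start = position - 1  # index of the window start in the padded string
    return padded[start:start + SIZE]


def build_positives(annotated: Iterable[Tuple[str, List[int]]]) -> List[str]:
    """Collect unique segments from (sequence, modified positions) pairs."""
    unique = set()
    for sequence, positions in annotated:
        for position in positions:
            unique.add(centered_segment(sequence, position))
    return sorted(unique)


# Toy input: two proteins with the same sequence give redundant segments.
toy_annotations = [
    ("MSGRKRASL", [2, 8]),
    ("MSGRKRASL", [8]),
]
positives = build_positives(toy_annotations)
print(positives)
```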
The “negative” preferences for each position in a short sequence segment, for each type of post-translational modification, are calculated using the dataset of negative instances. We randomly select short sequence segments from Swiss-Prot proteins whose central amino acid is of the type appropriate for the selected modification but is not experimentally annotated to undergo it. These two datasets, the positive and negative instances for each type of functional motif, are then used to train the SVM.
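A minimal Python sketch of this negative sampling is given below. It assumes that proteins are held in a dictionary and that annotated sites are given as (protein identifier, position) pairs; the names, the fixed random seed and the toy data are hypothetical and serve only to make the example reproducible.

```python
import random
from typing import Dict, List, Set, Tuple

SIZE = 9
HALF = SIZE // 2


def sample_negatives(proteins: Dict[str, str],
                     residue: str,
                     annotated: Set[Tuple[str, int]],
                     n: int,
                     seed: int = 0) -> List[str]:
    """Randomly pick up to n unique segments whose central residue has the
    requested type but is not annotated as modified (1-based positions)."""
    candidates = set()
    for protein_id, sequence in proteins.items():
        padded = "X" * HALF + sequence + "X" * HALF
        for i, amino_acid in enumerate(sequence):
            if amino_acid == residue and (protein_id, i + 1) not in annotated:
                candidates.add(padded[i:i + SIZE])
    pool = sorted(candidates)  # sorted so that the seed gives repeatable draws
    return random.Random(seed).sample(pool, min(n, len(pool)))


toy_proteins = {"P1": "MSGRKRASLSS"}   # toy sequence, not from Swiss-Prot
toy_annotated = {("P1", 8)}            # serine 8 is the only annotated site
negatives = sample_negatives(toy_proteins, "S", toy_annotated, n=2)
print(negatives)
```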
Sequence segments are projected into the multidimensional space using ten different projections.
The first representation (BIN) encodes each position of a segment as a 20-dimensional vector of binary values 0 or 1. The value 1 denotes that the corresponding type of amino acid is present at that position of the segment, and 0 otherwise. The vector representing a 9-residue segment therefore contains 9 coordinates equal to 1; all other coordinates are 0.
The second representation (BLOSUM) uses the BLOSUM62 matrix to encode each position of a segment as a 20-dimensional vector of substitution scores between the amino acid present at that position and all 20 types of amino acids. If Arg is found at the first position of a segment, it is represented by the Arg column of the BLOSUM62 substitution matrix. For segments 9 amino acids long, the representation lies in a 180-dimensional space (constructed from the 9 columns of the BLOSUM62 matrix corresponding to the 9 amino acids present in the projected segment).
The LOOKUP method represents each amino acid in a segment as a one-dimensional scalar value equal to its normalized sequence preference. The normalized preferences are precalculated for all 9 positions within a segment and for all types of amino acids. For example, the normalized preference for Arg at the first position of a segment is calculated by dividing the probability of finding Arg at the first position of annotated segments by the probability of finding it at the first position of non-annotated segments.
The profile projection (PROF) uses similar normalized preferences for each position of a 9-residue segment, but stores them as 20-dimensional vectors that preserve the information about all types of amino acids at a particular position. The normalized preferences for all types of amino acids at a position are multiplied by the BLOSUM62 column corresponding to the amino acid found there. If Arg is found in the segment, all amino acid preferences are multiplied by the Arg column of the substitution matrix. Each segment is represented as a point in a 180-dimensional space.
The sparse representation (SPARSE) uses the normalized preference for the amino acid found at a given position of a segment in place of the binary value. All other amino acids are marked by 0 values.
In addition, combinations of the above generic representations are used in order to maximize the accuracy and efficiency both of representing the acquired biological knowledge and of training the support vector machine.
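Three of these representations (BIN, LOOKUP and SPARSE) can be sketched in Python without external data. The sketch makes several assumptions that are not stated in the protocol: the padding letter 'X' is encoded as zeros, a pseudocount is added to avoid division by zero in the preference ratio, and the toy positive and negative segments are invented. BLOSUM and PROF are omitted because they require the BLOSUM62 matrix.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_bin(segment: str) -> List[int]:
    """BIN: 20 binary values per position ('X' gives all zeros, by assumption)."""
    return [1 if aa == ref else 0 for aa in segment for ref in AMINO_ACIDS]


def position_preferences(positives: List[str], negatives: List[str],
                         pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-position ratio P(aa | annotated) / P(aa | not annotated).

    The pseudocount is an assumption added here to keep the ratio finite.
    """
    prefs = []
    for pos in range(len(positives[0])):
        column = {}
        for aa in AMINO_ACIDS:
            p_pos = (sum(s[pos] == aa for s in positives) + pseudocount) / (
                len(positives) + 20 * pseudocount)
            p_neg = (sum(s[pos] == aa for s in negatives) + pseudocount) / (
                len(negatives) + 20 * pseudocount)
            column[aa] = p_pos / p_neg
        prefs.append(column)
    return prefs


def encode_lookup(segment: str, prefs: List[Dict[str, float]]) -> List[float]:
    """LOOKUP: one scalar preference per position (0.0 for 'X', by assumption)."""
    return [prefs[i].get(aa, 0.0) for i, aa in enumerate(segment)]


def encode_sparse(segment: str, prefs: List[Dict[str, float]]) -> List[float]:
    """SPARSE: like BIN, but the non-zero entry holds the preference value."""
    return [prefs[i][ref] if ref == aa else 0.0
            for i, aa in enumerate(segment) for ref in AMINO_ACIDS]


toy_pos = ["RRASLTAAK", "KRRSSVAAA"]   # invented positive segments
toy_neg = ["AAASLLLLL", "GGGGSGGGG"]   # invented negative segments
prefs = position_preferences(toy_pos, toy_neg)
print(len(encode_bin("RRASLTAAK")), encode_lookup("RRASLTAAK", prefs)[0])
```

Combined representations such as SPARSE+LOOKUP could then be formed by concatenating the individual vectors, although the exact way the server combines them is not specified here.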
The BioSQL database is built directly from the UniProtKB flat text file using the ‘bioperl-db’ Perl library. The positives were selected by querying the BioSQL database with the following MySQL query:

    SELECT count(sqv.value), sqv.value
    FROM location l, seqfeature s, term t, seqfeature_qualifier_value sqv, biosequence bs
    WHERE l.seqfeature_id = s.seqfeature_id
      AND s.seqfeature_id = sqv.seqfeature_id
      AND t.term_id = s.type_term_id
      AND t.name = 'MOD_RES'
      AND s.bioentry_id = bs.bioentry_id
      AND bs.alphabet = 'protein'
      AND sqv.value NOT LIKE '%(Probable)'
      AND sqv.value NOT LIKE '%(Potential)'
      AND sqv.value NOT LIKE '%(By similarity)'
    GROUP BY sqv.value
    ORDER BY count(sqv.value) DESC

Redundant samples were removed from the output data (except those with a different BLAST profile for the PROF method).
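The filtering and counting logic of the query above can be mirrored in plain Python for annotation values that have already been exported from the database. This is only an analogue for illustration: the input list is invented, `count_experimental` is a hypothetical name, and the suffix test here is case-sensitive, whereas the behaviour of LIKE in MySQL depends on the collation.

```python
from collections import Counter
from typing import List, Tuple

# Qualifiers excluded by the NOT LIKE conditions of the query.
EXCLUDED_SUFFIXES = ("(Probable)", "(Potential)", "(By similarity)")


def count_experimental(values: List[str]) -> List[Tuple[str, int]]:
    """Drop non-experimental annotations, then count each remaining value,
    most frequent first (the GROUP BY / ORDER BY part of the query)."""
    kept = [v for v in values if not v.endswith(EXCLUDED_SUFFIXES)]
    return Counter(kept).most_common()


toy_values = [
    "Phosphoserine",
    "Phosphoserine (By similarity)",
    "N-acetylalanine",
    "Phosphoserine",
    "Phosphothreonine (Potential)",
]
print(count_experimental(toy_values))
```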

The optimal way to investigate protein function is to use the complete parent protein sequence rather than short parts of it. In that case interesting non-local multiple modification sites can be identified.
The output score is in the range [0.000-5.000]. A higher output score indicates higher confidence in the prediction.
Sequence fragments predicted to be modified by a certain type of post-translational modification can appear several times on the output page with different reliability scores. These variants are predicted by different methods using various projections. If more than one method predicts a site as modified, the prediction is more reliable even if the scores are low.
For all types of post-translational modification sites the best kernel type is the polynomial one. Representations mixed with the LOOKUP projection (such as PROF+LOOKUP and BLOSUM+LOOKUP) are the most efficient. Other projections (such as generic BIN or PROF) have some advantages for particular types of modification sites, but they have lower overall efficiency (small recall and precision values). When the number of positive instances is large, the simple binary method BIN becomes the most accurate, whereas with fewer instances the profile methods give better results. The SVM finds a proper classification scheme for the test set more easily with simple representations than with more complex ones. The linear kernel function is not efficient for the more complicated sequence signatures of post-translational modification sites. However, in some cases (PKA phosphorylation with the SPARSE+LOOKUP representation) SVM models of this type reach the efficiency of the polynomial kernel. With the radial basis kernel, the SVM frequently fails to build the model. When the number of instances is large, the simple LOOKUP method is the most accurate for this type of kernel. Notable cases are acetylation, amidation and pyrrolidone, where the system with LOOKUP embedding reaches the efficiency of the polynomial kernel.
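To make the role of the kernel concrete, the Python sketch below evaluates a generic SVM decision value with a polynomial kernel, K(x, y) = (gamma * x·y + coef0)^degree. The kernel parameters, support vectors, coefficients and bias are invented for illustration; the values used by the AMS models are not given in this protocol.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def polynomial_kernel(x: Vector, y: Vector, degree: int = 2,
                      gamma: float = 1.0, coef0: float = 1.0) -> float:
    """Polynomial kernel (gamma * <x, y> + coef0) ** degree.

    The default parameter values are assumptions, not the server's settings.
    """
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


def decision_value(x: Vector,
                   support_vectors: List[Vector],
                   dual_coefs: List[float],
                   bias: float,
                   kernel: Callable[[Vector, Vector], float] = polynomial_kernel) -> float:
    """Generic SVM decision function: sum_i coef_i * K(sv_i, x) + bias.

    Each coefficient is the product of the Lagrange multiplier and the
    class label of the corresponding support vector.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


# Invented two-dimensional toy model.
toy_svs = [[1.0, 0.0], [0.0, 1.0]]
toy_coefs = [0.5, -0.5]
score = decision_value([1.0, 0.0], toy_svs, toy_coefs, bias=0.0)
print(score)  # 0.5 * (1 + 1)**2 - 0.5 * (0 + 1)**2 = 1.5
```

A segment would be reported as a possible modification site when such a decision value exceeds the chosen cut-off.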

The analysis of post-translational modification sites by support vector machine allows quick and accurate (very conservative) prediction of protein function. The high overall precision of the best methods allows users to gain deep insight into the plausible functional characteristics of unknown new proteins. The recall efficiency ensures that information from previously verified sites will not be lost during automatic scans of known instances. The algorithm can be applied independently of the Web interface in a pipeline. Large-scale genome analysis is also possible.
The main problem for some types of functional modifications is the insufficient number of experimentally verified instances. The number of support vectors for some of our classification models is very large, which is explained by the high dimensionality of the embedding space in such cases and the complicated shape of the separating hyperplane between positive and negative instances. The number of support vectors can be lowered by choosing a low-dimensional initial encoding of the amino acids in terms of general physicochemical properties (such as hydrophobicity, hydrophilicity, polarity, volume, surface area, bulkiness or refractivity). We are now working on incorporating these features into an update of our service, which will be available within one month.

1. Vapnik, V. N. The nature of statistical learning theory. Springer: New York, 1995; p xv, 188.
2. Vapnik, V. N. Statistical learning theory. Wiley: New York, 1998; p xxiv, 736.
3. Plewczynski, D.; Tkacz, A.; Wyrwicz, L. S.; Godzik, A.; Kloczkowski, A.; Rychlewski, L. Support-vector-machine classification of linear functional motifs in proteins. J Mol Model 2006, 12, 453-61.
4. Plewczynski, D.; Tkacz, A.; Wyrwicz, L. S.; Rychlewski, L. AutoMotif server: prediction of single residue post-translational modifications in proteins. Bioinformatics 2005, 21, 2525-7.
5. Bairoch, A.; Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res 1999, 27, 49-54.
6. Junker, V. L.; Apweiler, R.; Bairoch, A. Representation of functional information in the SWISS-PROT data bank. Bioinformatics 1999, 15, 1066-7.

This work was supported by EC BioSapiens (LHSG-CT-2003-503265) 6FP project as well as the Polish Ministry of Education and Science (PBZ-MNiI-2/1/2005 and 2P05A00130).
