The rapid increase in genomic information requires new automatic techniques to investigate protein functions. The function of a protein is partially determined by short sequence segments. For example, phosphorylation by protein kinases is an important mechanism for controlling intracellular processes. Many kinases are known, but the identification of their potential biological targets is still a subject of ongoing research. The high substrate specificity of protein kinases ensures correct transmission of signals in cells. This specificity is largely determined by the primary sequence of the target site, but we lack general, efficient and reliable tools for identifying these sites. Most methods designed to predict functional motifs process local sequence information around post-translational modification sites.
We present here an advanced computational protocol for rapid identification of post-translational modifications (PTM) in proteins on the whole-genome scale. The AutoMotif Server (AMS) identifies various types of post-translational modifications in protein sequences. A query protein sequence is dissected into overlapping short segments. Each segment is projected into an abstract space of sequence fragments by 10 different representations. Those projections are compared with a database of representations of known, experimentally confirmed post-translational modification sites using the support vector machine (SVM) approach 1, 2. This supervised machine learning approach is able to predict most post-translational modification sites in proteins. It is based on the classification of biological functional information acquired from the Swiss-Prot database version 4.2. The classification models are then used to predict new modification sites in proteins. Users can access a list of sites annotated in the Swiss-Prot database as able to undergo a certain post-translational modification, and can add new annotated sequence segments from proteins (positive instances).
The AMS server has been shown 3, 4 to achieve high accuracy in distinguishing short sequence fragments that are post-translationally modified from those that are not. The efficiency of the classification for each type of modification and the predictive power of several versions of the method are estimated using standardized leave-one-out tests. The sensitivities of the protocol for all types of modifications are approximately 70%.
The AutoMotif Server is freely available at "http://automotif.bioinfo.pl/":http://automotif.bioinfo.pl/. A local version of the software is available on request from the authors. The parameters (the search type, the number of top models, and the PTM type) are optional and can be easily modified. The following protocol describes how to use the AMS server to detect various types of post-translational modifications, and how to interpret the resulting score for a given prediction.
The AMS server is able to predict all types of post-translational modifications for a query sequence in real time, even if multiple classification models are used. When a large set of input sequences is used, the time needed to perform the prediction scales linearly with the size of the set. If two sets are submitted at once, they are run in parallel on our Linux cluster, so the time is the same as for a single submission. Therefore the critical step for computations is the proper preparation of input data.
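The dissection of a query sequence into overlapping short segments can be sketched as follows. This is only a minimal illustration, not the AMS implementation: the function name `dissect` and the `centers` parameter are hypothetical, and the 9-residue window with 'X' padding at the sequence ends is borrowed from the description of the training segments, on the assumption that query sequences are treated the same way.

```python
def dissect(sequence, window=9, pad="X", centers=None):
    """Return (position, segment) pairs, one per residue of the sequence.

    position is the 1-based index of the central residue. Segments that
    would run past either end of the sequence are padded with `pad`.
    If `centers` is given (e.g. "STY"), only segments whose central
    residue is one of those amino acids are returned.
    """
    sequence = sequence.upper()
    half = window // 2
    padded = pad * half + sequence + pad * half
    segments = []
    for i, residue in enumerate(sequence):
        if centers is not None and residue not in centers:
            continue
        segments.append((i + 1, padded[i:i + window]))
    return segments


# Toy sequence, invented purely for illustration.
for position, segment in dissect("MKRSSTAEL", centers="ST"):
    print(position, segment)
```

Restricting the output with `centers` mirrors the idea that only residues of the appropriate type are candidates for a given modification; which residues are appropriate for which modification is left to the caller.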
Database & Representations
- The AMS method for predicting plausible post-translational modification sites is based on the classification of known experimental instances. Only sequence information is used as input, because in most cases only the sequence of the potential target protein is known. Our analysis is based on biological information acquired from the Swiss-Prot database 5, 6.
- Proteins with acetylation, phosphorylation (by PKA, PKC, CK, CK2 and CDC2 kinases), sulfation, amidation, hydroxylation, methylation, pyrrolidone and gamma-carboxyglutamic modification sites are selected for our analysis. Those processes have the largest number of known experimental instances. Training cases are taken from experimentally annotated proteins. For each type of post-translational modification, the list of proteins with at least one modified site of that particular type is fetched from the Swiss-Prot database. Sites annotated “by similarity”, “partial”, “potential”, “probable” or “predicted” are excluded from the analysis. The remaining list of residues is used as the dataset of positive cases, which includes all short sequence segments of 9 amino acids dissected from the parent proteins and centered on the annotated residue. In the case of non-symmetric segments, where the modified residue is not in the center, the missing positions in a segment are filled with ‘X’ as the amino acid type. All redundant segments in the database, i.e. those with the same sequence, are removed from the training dataset.
- The “negative” preferences for each position in a short sequence segment, for each type of post-translational modification, are calculated using the dataset of negative instances. We randomly select short sequence segments from Swiss-Prot proteins whose central amino acid is appropriate for the selected type of modification but is not experimentally annotated as undergoing this modification. These two datasets, the positive and negative instances for each type of functional motif, are then used to train the SVM.
- Sequence segments are projected into a multidimensional space using ten different projections. The first representation (BIN) encodes each position of a segment as a 20-dimensional vector of binary values, 0 or 1. A value of 1 denotes that the corresponding amino acid type is present at the selected position of the segment, and 0 otherwise. Each vector representing a 9-residue segment therefore contains 9 coordinates equal to 1, and all other coordinates are 0. The second representation (BLOSUM) uses the BLOSUM62 matrix to encode each position of a segment as a 20-dimensional vector of the substitution scores between the amino acid present at that position and all 20 amino acid types. If Arg is found at the first position in a segment, we represent it by the Arg column of the BLOSUM62 substitution matrix. For 9-residue segments the representation lies in a 180-dimensional space (built from the 9 BLOSUM62 columns for the 9 amino acids present in the projected segment). The LOOKUP method represents each amino acid in a segment as a single scalar value equal to its normalized sequence preference. The normalized preferences are pre-calculated for all 9 positions within a segment and for all amino acid types. For example, the normalized preference for Arg at the first position of a segment is calculated by dividing the probability of finding Arg at the first position of annotated segments by the probability of finding it at the first position of non-annotated segments. The profile projection (PROF) uses similar normalized preferences for each position of a 9-residue segment, but stores them as 20-dimensional vectors, preserving the information about all amino acid types at a particular position. The normalized preferences for all amino acid types at a given position are multiplied by the BLOSUM62 column of the amino acid found at that position. If Arg is found in the segment, the preferences of all amino acids are multiplied by the Arg column of the substitution matrix. Each segment is represented as a point in a 180-dimensional space. The sparse representation (SPARSE) stores the normalized preference of the amino acid found at a given position of a segment in place of the binary value. All other amino acids are marked by 0 values. In addition, combinations of the above generic representations are used in order to maximize the accuracy and efficiency of both the representation of the acquired biological knowledge and the training of the support vector machine.
- The bioSQL database is built directly from the UniProtKB flat text file using the ‘bioperl-db’ Perl library. The positives were selected by querying the bioSQL database with the following MySQL query:
SELECT count(sqv.value), sqv.value
FROM location l, seqfeature s, term t, seqfeature_qualifier_value sqv, biosequence bs
WHERE l.seqfeature_id = s.seqfeature_id
AND s.seqfeature_id = sqv.seqfeature_id
AND t.term_id = s.type_term_id
AND t.name = 'MOD_RES'
AND s.bioentry_id = bs.bioentry_id
AND bs.alphabet = 'protein'
AND sqv.value NOT LIKE '%(Probable)'
AND sqv.value NOT LIKE '%(Potential)'
AND sqv.value NOT LIKE '%(By similarity)'
GROUP BY sqv.value
ORDER BY count(sqv.value) DESC
Redundant samples were removed from the output data (except those with a different BLAST profile for the PROF method).
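The BIN, LOOKUP and SPARSE projections described above can be illustrated with a short sketch. It is not the AMS code: all function names are hypothetical, the amino acid ordering is arbitrary, and three choices are assumptions made only to keep the example well defined: a pseudocount (so that a preference ratio never divides by zero), an all-zero encoding for the ‘X’ padding symbol in BIN and SPARSE, and a neutral preference of 1.0 for ‘X’ in LOOKUP. The segments at the end are toy data, not Swiss-Prot annotations.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # arbitrary fixed ordering (assumption)


def encode_bin(segment):
    """BIN: 20 binary values per position (1 for the residue present)."""
    vector = []
    for residue in segment:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def position_frequencies(segments, pseudocount=1.0):
    """Per-position amino acid frequencies over a list of equal-length segments."""
    frequencies = []
    for pos in range(len(segments[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for segment in segments:
            if segment[pos] in counts:  # skips the 'X' padding symbol
                counts[segment[pos]] += 1
        total = sum(counts.values())
        frequencies.append({aa: c / total for aa, c in counts.items()})
    return frequencies


def lookup_table(positives, negatives, pseudocount=1.0):
    """Normalized preference: frequency in positives / frequency in negatives."""
    pos_f = position_frequencies(positives, pseudocount)
    neg_f = position_frequencies(negatives, pseudocount)
    return [{aa: p[aa] / n[aa] for aa in AMINO_ACIDS}
            for p, n in zip(pos_f, neg_f)]


def encode_lookup(segment, table):
    """LOOKUP: one scalar per position; 'X' maps to a neutral 1.0 (assumption)."""
    return [table[i].get(residue, 1.0) for i, residue in enumerate(segment)]


def encode_sparse(segment, table):
    """SPARSE: like BIN, but the 1 is replaced by the normalized preference."""
    vector = []
    for i, residue in enumerate(segment):
        vector.extend(table[i][aa] if residue == aa else 0.0
                      for aa in AMINO_ACIDS)
    return vector


# Toy 9-residue segments, invented purely for illustration.
positives = ["RRASVAAAA", "KRLSLGGGG", "RRGSFEEEE"]
negatives = ["AAASAAAAA", "GLVSDEKLM", "TPESQWYNC"]

table = lookup_table(positives, negatives)
query = "XXRRSSTAE"
print(len(encode_bin(query)), len(encode_lookup(query, table)),
      len(encode_sparse(query, table)))
```

Each of the resulting vectors could then be passed to any SVM implementation as the feature vector of the segment; the BLOSUM and PROF projections would follow the same pattern with a BLOSUM62 column substituted for, or multiplied into, the 20 values at each position.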
This work was supported by EC BioSapiens (LHSG-CT-2003-503265) 6FP project as well as the Polish Ministry of Education and Science (PBZ-MNiI-2/1/2005 and 2P05A00130).