The AMS server is able to predict all types of post-translational modifications for a query sequence in real time, even if multiple classification models are used. When a large set of input sequences is submitted, the time needed to perform the prediction scales linearly with the size of the set. If two sets are submitted at once, they are run in parallel on our Linux cluster, so the total time is the same as for a single submission. Therefore, the critical step for the computations is the proper preparation of input data.
Database & Representations
- The AMS method for predicting plausible post-translational modification sites classifies known experimental instances. Only sequence information is used as input, because in most cases only the sequence of the potential target protein is known. Our analysis is based on biological information acquired from the Swiss-Prot database [5, 6].
- Proteins with acetylation, phosphorylation (by PKA, PKC, CK, CK2 and CDC2 kinases), sulfation, amidation, hydroxylation, methylation, pyrrolidone and gamma-carboxyglutamic modification sites are selected for our analysis. These processes have the largest number of known experimental instances. Training cases are taken from experimentally annotated proteins. For each type of post-translational modification, the list of proteins with at least one modified site of that particular type is fetched from the Swiss-Prot database. Sites annotated “by similarity”, “partial”, “potential”, “probable” or “predicted” are excluded from the analysis. The remaining residues form the dataset of positive cases, which consists of all short sequence segments, 9 amino acids long and centered on the annotated residue, dissected from the parent proteins. In the case of asymmetric segments, where the modified residue is not in the center, the missing positions in the segment are filled with ‘X’ as the amino acid type. All redundant segments in the database, i.e. those with the same sequence, are removed from the training dataset.
- The “negative” preferences for each position in a short sequence segment are calculated for each type of post-translational modification using the dataset of negative instances. We randomly select short sequence segments from Swiss-Prot proteins whose central amino acid is of the type appropriate for the selected modification but is not experimentally annotated as undergoing that modification. These two datasets, the positive and negative instances for each type of functional motif, are then used to train the SVM.
- Sequence segments are projected into a multidimensional space using ten different projections. The first representation (BIN) encodes each position of a segment as a 20-dimensional vector of binary values, 0 or 1. A value of 1 denotes that the corresponding type of amino acid is present at the selected position of the segment, and 0 that it is absent. Therefore, the vector representing a segment contains 9 coordinates equal to 1, and all other coordinates are 0. The second representation (BLOSUM) uses the BLOSUM62 matrix to encode each position of a segment as a 20-dimensional vector of the substitution scores between the amino acid present at this position of the projected segment and all 20 types of amino acids. If Arg is found at the first position of a segment, we represent it by the Arg column of the BLOSUM62 substitution matrix. For segments 9 amino acids long, the representation lies in a 180-dimensional space (constructed from 9 columns of the BLOSUM62 matrix, one for each of the 9 amino acids present in the projected segment). The LOOKUP method represents each amino acid in a segment by a single scalar value equal to its normalized sequence preference. The normalized preferences are pre-calculated for all 9 positions within a segment and for all types of amino acids. For example, the normalized preference for Arg at the first position of a segment is the probability of finding Arg at the first position of annotated segments divided by the probability of finding it at the first position of non-annotated segments. The profile projection (PROF) uses similar normalized preferences for each position of a 9-residue segment, but stores them as 20-dimensional vectors, preserving the information about all types of amino acids at a particular position. The normalized preferences for all types of amino acids at a given position are multiplied by the BLOSUM62 column corresponding to the amino acid found at that position. If Arg is found in the segment, the preferences for all amino acids are multiplied by the Arg column of the substitution matrix. Each segment is thus represented as a point in a 180-dimensional space. The sparse representation (SPARSE) stores the normalized preference for the amino acid type found at a given position of a segment in place of the binary value; all other amino acids are marked by 0. In addition, combinations of the above generic representations are used to maximize the accuracy and efficiency of both the representation of the acquired biological knowledge and the training of the support vector machine.
- The BioSQL database is built directly from the UniProtKB flat text file using the ‘bioperl-db’ Perl library. Positive instances were selected by querying the BioSQL database with the following MySQL query:
-- Count MOD_RES annotations per modification description in protein entries
SELECT count(sqv.value), sqv.value
FROM location l, seqfeature s, term t, seqfeature_qualifier_value sqv, biosequence bs
WHERE l.seqfeature_id = s.seqfeature_id
  AND s.seqfeature_id = sqv.seqfeature_id
  AND t.term_id = s.type_term_id
  AND t.name = 'MOD_RES'
  AND s.bioentry_id = bs.bioentry_id
  AND bs.alphabet = 'protein'
  -- Skip annotations flagged as not experimentally confirmed
  AND sqv.value NOT LIKE '%(Probable)'
  AND sqv.value NOT LIKE '%(Potential)'
  AND sqv.value NOT LIKE '%(By similarity)'
GROUP BY sqv.value
ORDER BY count(sqv.value) DESC;
Redundant samples were removed from the output data (except those with a different BLAST profile for the PROF method).
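The construction of the positive and negative datasets described above (9-residue windows centered on a residue, ‘X’ padding near the protein termini, removal of identical segments, random sampling of non-annotated residues) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the 0-based site indices and the fixed random seed are hypothetical and are not taken from the AMS implementation, and the sketch does not check whether a sampled negative segment happens to coincide with a positive one.

```python
import random

WINDOW = 9            # segment length stated in the text
HALF = WINDOW // 2    # residues on each side of the central residue
PAD = "X"             # filler for positions that fall outside the protein


def extract_segment(sequence, site):
    """Return the 9-residue window centered on the 0-based index `site`.

    Positions before the N-terminus or after the C-terminus are filled with 'X'.
    """
    chars = []
    for i in range(site - HALF, site + HALF + 1):
        chars.append(sequence[i] if 0 <= i < len(sequence) else PAD)
    return "".join(chars)


def positive_segments(proteins):
    """Collect unique segments around annotated sites.

    `proteins` is an iterable of (sequence, annotated_sites) pairs
    (a hypothetical input format chosen for this sketch).
    """
    seen = set()
    unique = []
    for sequence, sites in proteins:
        for site in sites:
            segment = extract_segment(sequence, site)
            if segment not in seen:          # drop redundant segments
                seen.add(segment)
                unique.append(segment)
    return unique


def negative_segments(proteins, residue_types, n, seed=0):
    """Randomly pick up to `n` unique windows centered on residues of the
    given types that are NOT annotated as modified."""
    candidates = set()
    for sequence, sites in proteins:
        annotated = set(sites)
        for i, aa in enumerate(sequence):
            if aa in residue_types and i not in annotated:
                candidates.add(extract_segment(sequence, i))
    pool = sorted(candidates)                # sorted for reproducible sampling
    rng = random.Random(seed)                # assumed fixed seed, for repeatability
    return rng.sample(pool, min(n, len(pool)))
```

For example, `extract_segment("MKSAAAAAAA", 2)` pads the two positions that would lie before the N-terminus and returns `"XXMKSAAAA"`.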
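Three of the representations described above (BIN, LOOKUP and SPARSE) can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the AMS code: the function names are hypothetical, the additive pseudocount is introduced here only to avoid division by zero for amino acids never observed at a position, and the ‘X’ padding symbol is assumed to map to zeros. The BLOSUM and PROF projections are omitted because they additionally require the BLOSUM62 matrix.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard amino acid types
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_bin(segment):
    """BIN: one 20-dimensional 0/1 block per position (180 values for 9 residues)."""
    vec = [0] * (len(segment) * len(AMINO_ACIDS))
    for pos, aa in enumerate(segment):
        if aa in INDEX:                    # assumption: 'X' padding stays all-zero
            vec[pos * len(AMINO_ACIDS) + INDEX[aa]] = 1
    return vec


def position_frequencies(segments, pseudocount=1.0):
    """Per-position amino acid probabilities (pseudocount is an assumption)."""
    tables = []
    for pos in range(len(segments[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for segment in segments:
            if segment[pos] in counts:
                counts[segment[pos]] += 1
        total = sum(counts.values())
        tables.append({aa: c / total for aa, c in counts.items()})
    return tables


def normalized_preferences(positives, negatives):
    """P(aa at position | annotated) / P(aa at position | not annotated)."""
    pos_freq = position_frequencies(positives)
    neg_freq = position_frequencies(negatives)
    return [{aa: p[aa] / n[aa] for aa in AMINO_ACIDS}
            for p, n in zip(pos_freq, neg_freq)]


def encode_lookup(segment, prefs):
    """LOOKUP: a single scalar per position (9 values for 9 residues)."""
    return [prefs[pos].get(aa, 0.0) for pos, aa in enumerate(segment)]


def encode_sparse(segment, prefs):
    """SPARSE: the BIN layout, with the normalized preference replacing the 1."""
    vec = [0.0] * (len(segment) * len(AMINO_ACIDS))
    for pos, aa in enumerate(segment):
        if aa in INDEX:
            vec[pos * len(AMINO_ACIDS) + INDEX[aa]] = prefs[pos][aa]
    return vec
```

With these definitions, a 9-residue segment yields a 180-dimensional BIN or SPARSE vector and a 9-dimensional LOOKUP vector; the nonzero SPARSE entries are the same values that LOOKUP returns.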