Workflow
The workflow of our approach is described below (Figure 1). It can be broadly divided into genomics, proteomics, machine-learning and evolutionary sub-approaches.
Datasets were manually constructed and curated for seven protein families that play major roles in insect chemical communication. To build the in-house datasets, well-annotated and curated protein sequences were mined from the literature and from UniProt and SwissProt searches. The protein ‘classes’ considered were Classic, Minus-C, Plus-C, Atypical, D7, CSP and NPC2. A control dataset was prepared from insect proteins that did not belong to any of the above classes. Only unique, non-redundant protein sequences were retained.
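As a minimal sketch of the last step, the snippet below removes exact duplicate sequences from a list of (header, sequence) records. It assumes that "non-redundant" means at least the removal of identical sequences; the text does not state whether an identity threshold or a clustering tool was used, and the record names are hypothetical.

```python
def deduplicate(records):
    """Return records with exact duplicate sequences removed, keeping the first."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.strip().upper()  # compare case-insensitively
        if key and key not in seen:
            seen.add(key)
            unique.append((header, key))
    return unique


# Hypothetical toy records; the second is a lower-case copy of the first.
records = [
    ("seq1_classic", "MKCLLVAC"),
    ("seq2_copy", "mkcllvac"),
    ("seq3_csp", "MKTFAACW"),
]
unique = deduplicate(records)
```

Near-identical sequences (for example, isoforms differing by one residue) would survive this filter, so a stricter redundancy criterion would need a dedicated clustering step.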
Features were computed to provide bit-wise information about each class-specific sequence. A feature matrix comprising 36 bits per sequence in a given class was derived, and steps were taken to minimize or nullify bias due to imbalanced datasets wherever applicable.
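The text does not say which steps were taken against class imbalance. One common option, shown here purely as an illustrative assumption, is to weight each class inversely to its frequency, using the same heuristic as the "balanced" class-weight setting in scikit-learn: n_samples / (n_classes * class_count). The class counts below are invented.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Invented, deliberately imbalanced toy labels.
labels = ["Classic"] * 6 + ["CSP"] * 3 + ["NPC2"] * 1
weights = balanced_class_weights(labels)
```

Rare classes receive larger weights, so misclassifying them costs more during training. Resampling the minority classes is an alternative with a similar aim.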
Machine learning models were built using the scikit-learn (sklearn) library in a Python v3.5 environment.
Various classifiers were applied to the data. The features (X) and class labels (Y) were each split into training and testing sets in an 80:20 ratio, and a MinMax scaler was used to normalize the data. Performance was evaluated using accuracy, precision, recall, F1-score and Matthews correlation coefficient (MCC). Performance was also visualized as a confusion matrix plotted using Python libraries.
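The training and evaluation steps above can be sketched with scikit-learn as follows. This is not the authors' code: the data are synthetic, the random forest is an arbitrary stand-in for the "various classifiers", and both the stratified split and fitting the scaler on the training portion only are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 200 sequences x 36 feature values, 4 toy classes.
rng = np.random.RandomState(0)
y = np.repeat(np.arange(4), 50)
X = rng.normal(size=(200, 36)) + y[:, None]  # shift classes so they are learnable

# 80:20 split of X and Y (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# MinMax scaling, fitted on the training portion only (assumption).
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Arbitrary example classifier.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train_s, y_train)
y_pred = clf.predict(X_test_s)

# Evaluation measures named in the text.
accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)
mcc = matthews_corrcoef(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=np.arange(4))
```

The confusion matrix `cm` has one row per true class and one column per predicted class, and can be passed to a plotting library for visualization.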
A total of 17 features were selected to be computed per sequence in each class. These are as follows:
a. Position-Specific Score Matrix (PSSM)
b. Accessible Surface Area (ASA)
c. Phi and Psi torsion angles (dihedral angles)
d. Secondary structure
e. Disorder scores of protein (two scores)
f. Number of Cysteines
g. Length of protein
h. Molecular weight
i. Aromaticity
j. Stability score
k. Isoelectric point of sequence (pI)
l. Molar extinction coefficient (reduced cysteines)
m. Molar extinction coefficient (disulphide bridges)
n. Entropy of sequence
o. Number of globular domains in the sequence
p. Residue Adjacency Matrix of Cysteines (RAM)
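A few of the simpler sequence-level features in this list can be computed directly from the amino acid string, as in the sketch below. The definitions used are assumptions, since the text does not specify them: aromaticity is taken as the fraction of Phe, Trp and Tyr residues, and entropy as the Shannon entropy (in bits) of the amino acid composition. Features such as PSSM, ASA, torsion angles and disorder scores require external predictors and are not covered.

```python
import math
from collections import Counter

AROMATIC = "FWY"  # Phe, Trp, Tyr


def simple_features(seq):
    """Compute length, cysteine count, aromaticity and composition entropy."""
    seq = seq.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    length = len(seq)
    counts = Counter(seq)
    # Shannon entropy of the residue composition, in bits.
    entropy = -sum((n / length) * math.log2(n / length)
                   for n in counts.values())
    return {
        "length": length,
        "n_cysteines": counts["C"],
        "aromaticity": sum(counts[a] for a in AROMATIC) / length,
        "entropy": entropy,
    }


# Hypothetical toy peptide, not a real OBP.
features = simple_features("MKCFWCAAYC")
```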
Insect genomes were collected from NCBI, Ensembl and VectorBase. Genomic alignments were obtained using protein datasets representative of each of the seven classes as queries.
These alignments were then used to obtain unique gene models and non-redundant protein sequences using a combined approach. A four-pronged approach was then used to score the list of predicted proteins on cysteine topology, presence of a PBP/GOBP domain, a length cut-off and presence of a signal peptide, using predictive tools and in-house scripts. Because of the nature of these predictions and the biochemical similarities across various classes of insect proteins, satisfying only a few of these criteria does not guarantee that a sequence is an OBP. Hence, sequences passing the cut-off score were tested using optimized machine learning models trained to discriminate among the seven major protein families/sub-families mediating insect chemical communication.
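The four-criterion scoring can be pictured with the sketch below. Every threshold here is a hypothetical placeholder, and a simple cysteine count is used as a crude proxy for cysteine topology; the actual cut-offs and topology checks are those of the in-house scripts. The domain and signal-peptide flags are assumed to come from external prediction tools.

```python
CUTOFF = 3  # hypothetical: criteria a candidate must satisfy to proceed


def score_candidate(seq, has_pbp_gobp_domain, has_signal_peptide,
                    min_cysteines=4, min_length=100, max_length=300):
    """Score a predicted protein on four yes/no criteria (0 to 4)."""
    seq = seq.strip().upper()
    criteria = {
        # Crude proxy only: counts cysteines, ignores their spacing.
        "cysteine_topology": seq.count("C") >= min_cysteines,
        "pbp_gobp_domain": bool(has_pbp_gobp_domain),
        "length": min_length <= len(seq) <= max_length,
        "signal_peptide": bool(has_signal_peptide),
    }
    return sum(criteria.values()), criteria


# Hypothetical 120-residue candidate with four cysteines.
example = ("A" * 29 + "C") * 4
score, criteria = score_candidate(
    example, has_pbp_gobp_domain=True, has_signal_peptide=False)
passes = score >= CUTOFF
```

Returning the per-criterion dictionary alongside the total makes it easy to see which criterion a borderline candidate failed before it is handed to the classifier.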
Phylogenetic analysis of the predicted classes/sub-families, coupled with analyses of the motifs, domains and co-occurring domains identified, yielded unique insights into the evolution and possible functional significance of OBPs and other protein families involved in insect communication.
User-end inputs:
i. First stage. Two input modes are possible: either A or B, or both A and B, can be provided.
A. An input file with a list of protein sequences in FASTA format, with the special amino acid codes ‘B’, ‘J’ and ‘X’ removed. Only the first twenty characters of each FASTA header will be retained.
B. An input folder containing the genome of the organism of interest in FNA format, the genomic alignment output file and a file with query proteins of interest from any organism.
ii. Second stage. An input folder with feature files corresponding to each protein sequence in the output FASTA file of the first stage (i). Each feature file must be named with a prefix identical to the unique header ID of its protein sequence.
iii. The features can be computed using the code provided in the GitHub repository https://github.com/bhavikamam/SoCCer/featurizer.py after the paths have been changed to suit the user's setup. The ‘readme’ file will contain instructions for installing the software locally. The output file of this step has to be provided as input to the Python script for further identification and classification.
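The input requirements above can be checked before running the pipeline with a sketch like the one below. It makes several assumptions: that ‘B’, ‘J’ and ‘X’ are stripped from sequences rather than whole sequences being discarded, that the twenty header characters are counted after the ‘>’ symbol, and that feature files may carry any extension after the header prefix. The file names and extensions shown are hypothetical.

```python
import tempfile
from pathlib import Path


def clean_fasta(text, remove="BJX", header_width=20):
    """Strip unwanted residue codes and truncate headers (mode A input)."""
    table = str.maketrans("", "", remove)
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:header_width + 1], ""])
        elif line and records:
            records[-1][1] += line.upper().translate(table)
    return [(header, seq) for header, seq in records]


def missing_feature_files(header_ids, feature_dir):
    """List header IDs with no file in feature_dir starting with that ID."""
    names = [p.name for p in Path(feature_dir).iterdir() if p.is_file()]
    return [h for h in header_ids
            if not any(name.startswith(h) for name in names)]


fasta_text = (">sp|P12345|OBP1_DROME Odorant-binding protein\n"
              "MKXCB\nJAC\n"
              ">short\nacxw\n")
cleaned = clean_fasta(fasta_text)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical feature files for one of two proteins.
    (Path(tmp) / "OBP1_AGAM.pssm").touch()
    (Path(tmp) / "OBP1_AGAM.asa").touch()
    missing = missing_feature_files(["OBP1_AGAM", "CSP3_BMOR"], tmp)
```

Truncating headers to twenty characters can make two different headers identical, and a plain prefix match would also accept a file belonging to a longer ID that starts with the same characters, so header IDs should be checked for uniqueness first.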
Software, scripts
The entire pipeline and its code have been uploaded to a GitHub repository titled ‘SoCCer’ under the username ‘bhavikamam’.
Advantages
Because some protein families resemble one another in a combination of (i) domain type, (ii) sequence, (iii) structural and/or (iv) functional properties, correctly identifying a protein as an OBP subtype is quite challenging. Our methodology derives a non-linear relationship from all the essential feature information provided, identifies whether a given protein sequence belongs to one of the major families in insect chemical communication, and classifies it further accordingly.
Comparison with other methods
A number of computational pipelines combine gene prediction tools, starting from the handling of raw sequencing data, gathering transcript evidence and so on. However, such pipelines are not specialised for particular protein families. There are also sequence search techniques, but they mostly use amino acid exchange information to recognise homologous sequences and do not incorporate features such as secondary structures or cysteine connectivities when identifying proteins or classifying them into subfamilies.
To our knowledge, there is currently no other computational technique that classifies a protein into one of the OBP subfamilies or other families involved in insect chemical communication using an optimized ML-based approach.
Application
The technique is useful for identifying the insect chemical communication protein family or sub-family to which a novel or existing predicted protein sequence belongs, using a combined genomics, machine learning and evolutionary approach.
Target audience
The main target audience for this methodology comprises entomologists and evolutionary biologists. Because olfaction is a major sensory modality for insects, entomologists would be keen to learn more about the olfactory system across insect orders and about the types of proteins participating in olfaction. Near-automatic classification of different types of OBPs not only gives them an advantage in functional annotation but would also help them recognize subfamilies when designing experiments. For example, pheromone receptors could be taken up for detailed biochemical studies by cloning putative pheromone-binding proteins predicted by this pipeline.
Use of technique
This technique is useful because the scientific community otherwise typically has to engage in the laborious process of manually aligning widely different types of OBPs, which differ in length and disulphide connectivity and therefore have to be examined by hand before they can be classified. Instead, we present a computational scheme that applies a novel combined approach to perform these classifications. It can also be used for a few other non-OBP families.
Limitations
A limitation of the technique is that, as more insect genomes are mined, newer OBP sub-types may show unexpected deviations from the known disulphide bond patterns, which would make automatic recognition of these subfamily types challenging. The protocol must therefore remain open to adaptation as knowledge grows with the number of insect genomes examined.