izMiR is a data analysis workflow that can be used for the detection of pre-miRNAs. The overall system includes two main workflows:
1) Training of machine learning classifiers with suitable examples
2) Application of the learned model for detection of new pre-miRNAs.
By using the training workflow it is possible to generate models that can be applied to new data to predict whether the given data contain potential miRNA hairpins. It is also possible to use the prediction workflow directly with the models and input data provided by us. In the latter case, compatible features need to be calculated for analysis. These calculations can be done using two web servers: one provided by us (http://www.jlab.iyte.edu.tr/software/izmir) and one by Yones et al. (http://www.fich.unl.edu.ar/sinc/web-demo/mirnafe-full/).
Keywords: MicroRNA detection, computational, ab initio, machine learning
In many machine learning based miRNA precursor prediction studies, different data sets were used with various features and different machine learning algorithms. Superficially comparing the published performance measures is at best misleading for end users. To allow for a proper comparison, these tools need to be unified in one framework and tested on the exact same inputs. There are many challenges for such ab initio methods. Here we provide a comprehensive approach that allows different data sets, feature groups, and classifiers to be compared. The system is flexible and can be seamlessly adapted for future studies, allowing extensions and adjustment of any settings.
We also showed that by using izMiR it is possible to obtain consensus models which lead to increased classification performance. For both learning and prediction, features need to be calculated. For this, two services are available: one provided by Yones et al. (http://fich.unl.edu.ar/sinc/web-demo/mirnafe-full/) and one provided by us (http://jlab.iyte.edu.tr/software/izmir). The main parts of the learning and prediction workflows are:
The steps below are laid out in the KNIME workflow (http://bioinformatics.iyte.edu.tr/software/izmir, training workflow) and the sections are labeled accordingly (Figures 1-6).
- Selection of suitable training examples
- Input data for positive (real miRNA) and negative (non-miRNA) examples (Fig. 1)
- Preprocessing of input data (Fig. 1)
- MCCV (Monte Carlo Cross Validation; Figs. 2,3)
- Feature grouping (Fig. 4)
- Classifier training (Naive Bayes (NB), Decision Tree (DT), Weka LibSVM) per study (Fig. 5)
- Model performance scores (e.g., accuracy, F-score; Fig. 5)
- Model generation (Fig. 5)
- Combination of outcomes for each iteration (Fig. 4)
- Model evaluation (Fig. 5)
- Model output (PMML file) to be used in prediction phase (Figs. 1,6)
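The MCCV and classifier-training steps above can be sketched outside of KNIME as follows. This is a minimal, dependency-free Python illustration only: a trivial nearest-centroid learner stands in for the actual Naive Bayes/Decision Tree/SVM classifiers, and the toy feature values, function names, and iteration count are assumptions, not part of the izMiR workflow itself.

```python
import random

def train_centroids(X, y):
    """Per-class feature means (a trivial stand-in for the NB/DT/SVM learners)."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(model, features):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    def dist(lab):
        return sum((a - b) ** 2 for a, b in zip(features, model[lab]))
    return min(model, key=dist)

def mccv(X, y, iterations=100, train_fraction=0.7, seed=0):
    """Monte Carlo cross-validation: repeatedly draw a random train/test split,
    train a model, score it, and keep the best model by (accuracy, F-score)."""
    rng = random.Random(seed)
    best = (None, -1.0, -1.0)  # (model, accuracy, F-score for class "miRNA")
    for _ in range(iterations):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        cut = int(len(idx) * train_fraction)
        tr, te = idx[:cut], idx[cut:]
        model = train_centroids([X[i] for i in tr], [y[i] for i in tr])
        preds = [predict(model, X[i]) for i in te]
        truth = [y[i] for i in te]
        acc = sum(p == t for p, t in zip(preds, truth)) / len(te)
        tp = sum(p == t == "miRNA" for p, t in zip(preds, truth))
        fp = sum(p == "miRNA" and t != "miRNA" for p, t in zip(preds, truth))
        fn = sum(p != "miRNA" and t == "miRNA" for p, t in zip(preds, truth))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if (acc, f1) > (best[1], best[2]):
            best = (model, acc, f1)
    return best
```

In the actual workflow this loop runs inside the Loop x-times meta-node, with the sampling ratio and iteration count set as described in the modification notes below.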
The steps below are laid out in the KNIME workflow (http://bioinformatics.iyte.edu.tr/software/izmir, prediction workflow) and the sections are labeled accordingly (Figures 7-12).
- Input data reading and preprocessing (Fig. 7)
- Applying the best PMML models (created in the training phase; Figs. 8, 9)
- Obtaining predictions (miRNA or negative) and a prediction score (between 0 and 1; Figs. 9-11)
- For Consensus Decision Tree and Naive Bayes: if a sequence is predicted as “miRNA” by 6 or more studies, its consensus result is “miRNA”; otherwise it is “negative” (Fig. 9)
- For Consensus Rule: if the average DT or average NB score is larger than 0.89, the sequence is labeled “miRNA”; conversely, if the average DT or average NB score is less than 0.5, it is labeled “negative”; the remaining sequences are labeled “candidate” miRNAs (Fig. 10)
- For Consensus Model generation, the learning data sets used in this study are tested on the best models and the prediction scores are saved. These prediction scores are then used as input for a new learning process with a Multi-Layer Perceptron classifier (1000-fold Monte Carlo cross-validation), and the model with the highest accuracy and F-score is stored. This Consensus Model is included in the prediction workflow (Fig. 11)
- The Visualization meta-node is a collection of visual nodes available in KNIME for producing simple graphs (Fig. 12)
- The Count meta-node provides numeric information about the results (Fig. 1)
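The two rule-based consensus schemes above can be summarized in a short sketch. This is illustrative Python, not the KNIME implementation; the 6-vote threshold and the 0.89/0.5 cut-offs are those stated in the text, while the function names and example inputs are invented (the learned MLP Consensus Model itself is not reproduced here).

```python
def consensus_vote(study_predictions, threshold=6):
    """Consensus DT/NB: 'miRNA' if at least `threshold` of the per-study
    models predict 'miRNA', otherwise 'negative'."""
    votes = sum(p == "miRNA" for p in study_predictions)
    return "miRNA" if votes >= threshold else "negative"

def consensus_rule(avg_dt, avg_nb):
    """Consensus Rule: label by the average DT/NB prediction scores,
    using the 0.89 and 0.5 thresholds given in the text."""
    if avg_dt > 0.89 or avg_nb > 0.89:
        return "miRNA"
    if avg_dt < 0.5 or avg_nb < 0.5:
        return "negative"
    return "candidate"
```

For example, a sequence called “miRNA” by 6 of 9 study models passes the vote, while average scores of 0.7 from both classifiers fall between the two cut-offs and yield a “candidate” label.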
1) KNIME installation
a. Download: https://www.knime.org/downloads/overview
b. Installation: https://tech.knime.org/installation-0
c. Update extensions: https://www.knime.org/downloads/update
2) Importing workflows
- On the left side of the KNIME window, there is a box labeled LOCAL (Local Workspace) (Figure 1); right-click in that area and select “Import KNIME workflow” (https://tech.knime.org/workbench)
- In the pop-up window, select the directory where the downloaded workflows are located and load them
- The workflow includes the input data (human miRNAs as positive and pseudo hairpins as negative examples)
If you do not want to generate new models or results, you can explore the already computed results by right-clicking on the nodes and choosing the output table for display.
If you want to make modifications to the workflow you can click on the nodes and change their settings. Some example changes could be:
- Change the input data by clicking on the File Reader nodes (positive or negative)
- Change the number of loop iterations by going into the Loop x-times meta-node, clicking on Counting Loop Start, and setting the number
- Change the sampling ratio by going into the Loop x-times meta-node and, inside the sampling meta-node, changing the Partitioning node's settings
- Use your desired feature set by going into Loop x-times/studies. If you want to add another study with a different feature set, copy and paste one of the meta-nodes (e.g., Ng) and connect it in the same way as the existing ones. Right-click it and select Reconfigure to change the meta-node name. Then go into your meta-node and open filter (feature selection). Inside that meta-node there are two Column Filter nodes, one for the learning and one for the testing data; in these nodes select your choice of features. You should also rename the classifier and CombineLearnedStats meta-nodes, since they would still be set to Ng in this instance.
The prediction workflow requires a column named "Accession" for joining. If your data has no such column, you can use the RowID node to create unique accession values.
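If you prepare your input table outside of KNIME, the equivalent of that RowID step can be done beforehand, e.g. in Python. This is a sketch only: the "sequence" column is a hypothetical example, and the "Row0", "Row1", … naming is an assumption about the default IDs, not a requirement of the workflow (any unique values will do).

```python
def add_accession(rows, prefix="Row"):
    """Give each row (a dict) a unique 'Accession' value for the join,
    keeping any Accession values that are already present."""
    for i, row in enumerate(rows):
        row.setdefault("Accession", f"{prefix}{i}")
    return rows
```

For example, `add_accession([{"sequence": "AUGC"}])` yields a row with `Accession` set to `"Row0"`, while rows that already carry an accession are left unchanged.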
Figure 1. Overall training workflow.
Figure 2. MCCV and model generation.
Figure 3. Sampling.
Figure 4. Studies (feature groups).
Figure 5. Feature selection and application of 3 classifiers.
Figure 6. Model sorting, selection and saving as PMML files.
Figure 7. Prediction workflow.
Figure 8. Prediction meta-node.
Figure 9. Decision Tree/Naïve Bayes meta-node.
Figure 10. Consensus Result meta-node.
Figure 11. Consensus Model meta-node.
Figure 12. Visualization meta-node.