The main purpose of this study is to make single-molecule measurement and data analysis faster and more accurate so that the platform can be widely adopted as a new analytical technology. To this end, we compared the new method with the conventional method for predicting the mixing ratio of dGMP (deoxyguanosine monophosphate) and dTMP (deoxythymidine monophosphate), and showed that the new method is faster and more accurate.
(1) Single-molecule measurement
After the measurement solution is prepared, 10 μL of the solution is injected into the PDMS well of the nanogap electrode chip. The nanogap distance is set to 0.52, 0.54, or 0.56 nm and held constant by feedback control. For measurement, a bias voltage of 100 mV is applied to the electrodes. Each measurement run lasts 5 minutes, for a total of 60 minutes of measurement at each nanogap distance.
(2) Classification process
[Conventional method]
1. Signal extraction from raw data
Signals with a maximum current of 20 pA or more and a dwell time of 10 ms or more are individually extracted from the single-molecule measurement data.
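The two thresholds above can be illustrated with a minimal sketch. It assumes the raw data is a list of current samples in pA taken at a fixed sampling rate, and that a signal is a contiguous run of samples above a baseline level. The function name `extract_signals`, the `baseline_pa` parameter, and the sampling rate are hypothetical and are not taken from the original analysis code.

```python
def extract_signals(current_pa, sample_rate_hz, baseline_pa=5.0,
                    min_peak_pa=20.0, min_dwell_ms=10.0):
    """Return contiguous above-baseline segments that pass both thresholds.

    baseline_pa is an assumed event-detection level; min_peak_pa and
    min_dwell_ms are the 20 pA and 10 ms criteria stated in the text.
    """
    signals = []
    segment = []
    # A trailing sentinel below the baseline closes any segment still open.
    for value in list(current_pa) + [float("-inf")]:
        if value > baseline_pa:
            segment.append(value)
            continue
        if segment:
            dwell_ms = 1000.0 * len(segment) / sample_rate_hz
            if max(segment) >= min_peak_pa and dwell_ms >= min_dwell_ms:
                signals.append(segment)
            segment = []
    return signals


# Illustrative trace at 1 kHz (1 sample = 1 ms), not measured data.
trace = [0.0] * 5 + [25.0] * 12 + [0.0] * 5 + [30.0] * 3 + [0.0] * 5 + [10.0] * 15 + [0.0] * 2
kept = extract_signals(trace, sample_rate_hz=1000.0)
```

In this illustrative trace only the first event is kept: the second is too short (3 ms) and the third never reaches 20 pA.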
2. Feature extraction
Features are extracted from the signal files. They include Ip (peak current), Td (dwell time), a 10-dimensional normalized current shape, and the average current value.
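As a rough illustration, the four feature groups can be assembled into a 13-element vector. The text does not specify how the 10-dimensional shape is computed; this sketch assumes the signal is split into ten equal time bins, each bin is averaged, and the result is divided by Ip. The function name `extract_features` and the sampling rate are hypothetical.

```python
def extract_features(signal_pa, sample_rate_hz, n_bins=10):
    """Return [Ip, Td_ms, shape_1..shape_10, mean_current] for one signal."""
    n = len(signal_pa)
    ip = max(signal_pa)                    # Ip: peak current (pA)
    td_ms = 1000.0 * n / sample_rate_hz    # Td: dwell time (ms)
    mean_current = sum(signal_pa) / n      # average current (pA)

    # Assumed definition of the normalized shape: bin means divided by Ip.
    shape = []
    for b in range(n_bins):
        start = b * n // n_bins
        stop = max((b + 1) * n // n_bins, start + 1)  # keep every bin non-empty
        chunk = signal_pa[start:stop]
        shape.append(sum(chunk) / len(chunk) / ip)

    return [ip, td_ms] + shape + [mean_current]


# Illustrative two-level signal at 1 kHz, not measured data.
features = extract_features([10.0] * 10 + [20.0] * 10, sample_rate_hz=1000.0)
```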
3. Random Forest-based Classification with 10-fold Cross-Validation
A Random Forest (RF) classifier was used for data classification, and its performance was evaluated with 10-fold cross-validation: the dataset was divided into ten subsets, and in each iteration one subset was used for testing and the remaining nine for training. The RF classifier, with "n_estimators" set to 100, was trained on the dGMP and dTMP datasets to construct a single-molecule machine learning classifier.
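A minimal sketch of this evaluation using scikit-learn might look as follows. The feature matrices are synthetic placeholders drawn from normal distributions, not measured data. The stratified and shuffled split and the fixed random seeds are assumptions added for reproducibility; the text only specifies 10 folds and `n_estimators=100`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the per-signal feature vectors (13 features each).
X_dgmp = rng.normal(loc=1.0, scale=1.0, size=(200, 13))
X_dtmp = rng.normal(loc=0.0, scale=1.0, size=(200, 13))
X = np.vstack([X_dgmp, X_dtmp])
y = np.array([1] * 200 + [0] * 200)  # 1 = dGMP, 0 = dTMP

# n_estimators=100 as stated in the text; random_state is an assumption.
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10 folds: each fold is the test set once, the other nine form the training set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy over {len(scores)} folds: {scores.mean():.3f}")
```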
4. Prediction of the mixing ratio of mixed solutions
The dataset is divided into training and testing sets. The Random Forest classifier is instantiated and trained using the training set. It constructs multiple decision trees and combines their predictions for accurate classifications. The trained classifier is then used to predict the mixing ratio of the mixture samples in the testing set. These predictions are compared against the true labels to evaluate the classifier's performance. Performance metrics such as accuracy, precision, and recall are calculated to assess the classifier's effectiveness.
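The text does not state how the mixing ratio is derived from the per-signal predictions. One plausible reading, sketched here as an assumption, is that the ratio is the fraction of signals assigned to each molecule. The helper names `mixing_ratio` and `accuracy_precision_recall` are hypothetical, and the labels in the usage lines are illustrative only.

```python
from collections import Counter


def mixing_ratio(predicted_labels, positive="dGMP"):
    """Fraction of signals predicted as `positive` (assumed ratio definition)."""
    counts = Counter(predicted_labels)
    return counts[positive] / len(predicted_labels)


def accuracy_precision_recall(true_labels, predicted_labels, positive="dGMP"):
    """Compute accuracy, precision, and recall with `positive` as the target class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Illustrative labels, not experimental results.
true = ["dGMP", "dGMP", "dTMP", "dTMP"]
pred = ["dGMP", "dTMP", "dTMP", "dTMP"]
ratio = mixing_ratio(pred)
metrics = accuracy_precision_recall(true, pred)
```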
[New method]
In the new method, we attempted to classify the two molecules without training data, using the same mixture measurement data as in the conventional method.
1. Signal extraction from raw data
Same as in the conventional method.
2. Feature extraction
Same as in the conventional method.
3. Classification with Kernel Density Estimation (KDE)
Classification is performed using Kernel Density Estimation (KDE), one of the algorithms in the Univariate Unimodal Classifier (UUC) family. Probability density estimation is performed on the given training data, and the weights are updated to carry out the classification. First, the training data are preprocessed by applying upper and lower bounds, shuffling, and undersampling if needed. The UUC algorithm is then trained on the preprocessed data. Next, the prediction data are loaded from the specified file paths with their labels and preprocessed according to the specified features and conditions. Finally, the prediction ratios are set and the trained model is used to make predictions on the data.
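To make the idea of density-based classification concrete, here is a generic sketch of a univariate Gaussian KDE with a weighted decision rule. It is not the UUC implementation: the weight-update step is not specified in the text, so the class weights are fixed here, and the bandwidth, the feature values, and the function names `gaussian_kde` and `classify` are all illustrative assumptions.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        # Sum of Gaussian kernels centred on each sample.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density


def classify(x, densities, weights):
    """Pick the label whose weighted density at x is largest."""
    return max(densities, key=lambda label: weights[label] * densities[label](x))


# Illustrative peak-current values in pA for a single feature, not measured data.
densities = {
    "dGMP": gaussian_kde([40.0, 42.0, 44.0, 46.0], bandwidth=3.0),
    "dTMP": gaussian_kde([20.0, 22.0, 24.0, 26.0], bandwidth=3.0),
}
weights = {"dGMP": 0.5, "dTMP": 0.5}  # assumed fixed class weights

label_high = classify(41.0, densities, weights)
label_low = classify(23.0, densities, weights)
```

Changing the weights shifts the decision boundary between the two densities, which is one way a set mixing ratio could influence the per-signal predictions.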