Step 1: Data acquisition
Download the following genomic sequence data:
- SARS-CoV-2 strain genomes from GISAID for Brazil, China (Wuhan region only, and the rest of the country excluding Wuhan), England, Germany, Italy, Russia, Spain, and the USA, available at https://www.gisaid.org/. We recommend keeping strain genomes from different countries in separate files.
- The human genome, available at http://igenomes.illumina.com.s3-website-us-east-1.amazonaws.com/Homo_sapiens/Ensembl/GRCh37/Homo_sapiens_Ensembl_GRCh37.tar.gz;
- Human coding transcriptome, available at ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz;
- Human non-coding transcriptome, available at ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz;
- SARS-CoV-2, SARS, MERS, and H1N1 genomes, available at https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/;
- SARS-CoV-2 Wuhan strain from NCBI Assembly (code ASM985889v3), available at https://www.ncbi.nlm.nih.gov/assembly/GCF_009858895.2/
Step 2: Script acquisition
Download the scripts and programs that will be used to extract features from siRNA sequences:
- Auxiliary scripts and files implemented by the authors, available at https://github.com/inaciomdrs/sirna_db_building_protocol
- The ThermoComposition21 [8] and si_shRNA_selector [9] programs, copies of which are available at https://github.com/inaciomdrs/sirna_db_building_protocol/bin
- The script deltacalculator.py, a modified component of the SSD [7] software, used for siRNA efficiency prediction and feature calculation.
Step 3: Strain cleaning
To assess siRNA efficiency against SARS-CoV-2 strains from different countries, the genomes of these strains must be at least 90% complete. Therefore, remove from the SARS-CoV-2 GISAID strain genomes those in which N nucleotides (unresolved regions) make up 10% or more of the sequence. We recommend using the ref_clean.pl script for this task, for example:
$ ./ref_clean.pl uk_sars_cov_2_strains.fa > uk_sars_cov_2_clean.fa
This step must also be performed on the SARS, MERS, and H1N1 genomes.
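The filtering criterion above can be illustrated with a minimal Python sketch. It is not the implementation of ref_clean.pl, whose internals are not described here; the function names and the toy records are assumptions made for illustration, and only the threshold (remove genomes with 10% or more N) comes from this step.

```python
def n_fraction(seq):
    """Return the fraction of N bases in a sequence (case-insensitive)."""
    seq = seq.strip()
    # An empty sequence is treated as fully unresolved.
    return seq.upper().count("N") / len(seq) if seq else 1.0


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def keep_complete(records, max_n=0.10):
    """Keep records whose N fraction is strictly below max_n."""
    return [(h, s) for h, s in records if n_fraction(s) < max_n]


# Toy records (hypothetical, not real strain data).
example = [
    ">strain_a", "ACGTACGTAC",  # no N: kept
    ">strain_b", "ACGTNNNNNN",  # 60% N: removed
    ">strain_c", "ACGTACGTAN",  # exactly 10% N: removed
]
kept = keep_complete(read_fasta(example))
for header, seq in kept:
    print(header)
    print(seq)
```

To apply the same check to a real file, pass an open file handle to read_fasta instead of the toy list.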
Step 4: Genome indexing
Index all the genomes downloaded in Step 1 with Bowtie [10] version 1.1.0, for example:
$ bowtie-build uk_sars_cov_2_clean.fa uk_sars_cov_2_clean
This step must be performed for every genome downloaded in Step 1.
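Because the same bowtie-build call is repeated for every genome, the commands can be generated programmatically. The sketch below is one way to do this in Python, assuming the index prefix is simply the FASTA file name without its extension, as in the example command. The file names other than uk_sars_cov_2_clean.fa are hypothetical placeholders.

```python
import subprocess
from pathlib import Path

RUN = False  # set to True to actually call bowtie-build (must be on the PATH)


def bowtie_build_command(fasta_path):
    """Return the bowtie-build argument list for one FASTA file.

    The index prefix is the FASTA path without its final extension,
    mirroring the example command in this step.
    """
    fasta = Path(fasta_path)
    return ["bowtie-build", str(fasta), str(fasta.with_suffix(""))]


# Hypothetical file names; replace with the genomes downloaded in Step 1.
fasta_files = ["uk_sars_cov_2_clean.fa", "sars_clean.fa", "mers_clean.fa"]
commands = [bowtie_build_command(f) for f in fasta_files]

for cmd in commands:
    print(" ".join(cmd))
    if RUN:
        subprocess.run(cmd, check=True)
```

With RUN left as False the script only prints the commands, which makes it easy to review them before building any index.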
Step 5: Path settings and siRNA sequence generation
Edit the step_0_seq.pl script to specify, where indicated, the path to the FASTA file of the SARS-CoV-2 Wuhan strain from NCBI Assembly (code ASM985889v3) downloaded in Step 1. Next, edit the aln_commands.pl script to specify, where indicated, the prefix paths of the genomes indexed in Step 4. Then run the following commands:
$ ./step_0_seq.pl 21 > input
$ ./aln_commands.pl input 21 > aln_commands_21.sh
Here, step_0_seq.pl generates siRNAs 21 nucleotides long and saves them to the file input, and aln_commands.pl generates a shell script that aligns the generated siRNA sequences against the indexed genomes, producing STS files that report the minimum number of mismatches each siRNA needs in order to match each genome.
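As an illustration of what generating fixed-length candidates from a reference genome can look like, the Python sketch below enumerates every contiguous window of a given length. This is an assumption about the general approach, not a description of step_0_seq.pl: whether that script emits sense or antisense strands, or applies any filtering, is not stated here. The function name and the toy sequence are hypothetical.

```python
def sliding_windows(genome, length):
    """Return every contiguous window of the given length, in order."""
    genome = genome.upper()
    return [genome[i:i + length] for i in range(len(genome) - length + 1)]


# 24-nt toy sequence used only for illustration, not real input.
toy_genome = "ATTAAAGGTTTATACCTTCCCAGG"
candidates = sliding_windows(toy_genome, 21)
for start, seq in enumerate(candidates, start=1):
    print(start, seq)
```

A sequence of length n yields n - k + 1 windows of length k, so the toy sequence produces four 21-nt candidates.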
Sequences between 18 and 21 nt long can be generated by replacing 21 in the commands above with the length of interest. The generated aln_commands_21.sh script can be run either sequentially (one command at a time) or as batches of commands run in parallel. Finally, create a directory called STS and move all generated STS files into it.
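Running the alignment commands in parallel batches can be sketched in Python as follows. The sketch assumes that every non-empty, non-comment line of aln_commands_21.sh is an independent command, which should be verified against the generated file before use. The helper names and the stand-in commands are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_commands(lines):
    """Return the non-empty, non-comment lines of a shell script."""
    stripped = (line.strip() for line in lines)
    return [s for s in stripped if s and not s.startswith("#")]


def run_in_parallel(commands, workers=4):
    """Run shell commands concurrently; return their exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


# Stand-in commands so the sketch runs anywhere; in practice use
# read_commands(open("aln_commands_21.sh")) instead.
demo_script = ["#!/bin/sh", "", "echo first", "echo second", "exit 3"]
commands = read_commands(demo_script)
exit_codes = run_in_parallel(commands, workers=2)
failed = [c for c, code in zip(commands, exit_codes) if code != 0]
print("failed commands:", failed)
```

The workers argument bounds how many alignments run at once, so it can be matched to the number of available cores. Commands with a non-zero exit code are collected so they can be rerun.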
Step 6: Database building
Run the run.pl script to generate the siRNA target database, using the following command:
$ ./run.pl 21
Here, 21 is the length of the siRNA sequences. Make sure this number is the same as the one passed to step_0_seq.pl in Step 5. Important note: the run.pl script, the bin directory, the STS folder, the NC_045512.bed file, and the db_olig_seq2.pl script must all be in the same folder. These files are available at https://github.com/inaciomdrs/sirna_db_building_protocol.
run.pl calculates features describing base context, thermodynamic information, and efficiency prediction. The latter two are computed with ThermoComposition21 [8] and si_shRNA_selector [9], and the other scripts downloaded in Step 2 are used throughout the database-building process. run.pl also merges the information from the STS files into the resulting table.
Note that run.pl launches processes that run in the background. Use the top program to monitor these processes and determine when they have finished. Once they have all finished, execute the following two commands:
$ cat *.res > db.txt
$ rm input.*
Here, db.txt is the generated siRNA database for the chosen length.
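Before launching run.pl, the requirement noted in this step, that the script and its companion files share one folder, can be checked with a short Python sketch such as the one below. The list of names is taken from the note in this step; the function name is hypothetical, and the check only tests for existence, not for correct contents.

```python
from pathlib import Path

# Names taken from the note in this step.
REQUIRED = ["run.pl", "bin", "STS", "NC_045512.bed", "db_olig_seq2.pl"]


def missing_items(workdir, required=REQUIRED):
    """Return the required files or directories absent from workdir."""
    base = Path(workdir)
    return [name for name in required if not (base / name).exists()]


missing = missing_items(".")
if missing:
    print("Missing from working directory:", ", ".join(missing))
else:
    print("All required files and directories are present.")
```

Run it from the folder in which run.pl will be executed; an empty result means all five items were found.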