Construction of tRFdb

Analysis of the small RNA data

Source and processing of small RNA high throughput sequencing data: The small RNA high throughput sequencing data were downloaded from the GEO database (http://www.ncbi.nlm.nih.gov/geo/) or NCBI SRA database (http://www.ncbi.nlm.nih.gov/sra). We considered only 14-36 bases long, size selected small RNA high-throughput sequencing datasets. For each dataset we considered the processed small RNA data (unique sequence and their cloning frequency). In case of non-availability of processed data, the raw data were used to generate the unique RNA sequence and its cloning frequency. The adaptor sequences from the raw data were removed using Cutadapt" (version 2.0) program (http://code.google.com/p/cutadapt/) using all default parameters. Finally each unique sequence was used as query sequence to search in-house developed custom blast database “tRNAdb”.

Building and Mapping of small RNA on "tRNAdb"

Information about the tRNA genes for species R. sphaeroides (Bacteria; ATCC_17025), S. pombe (schiPomb1), Droshophila (dm3), C. elegans (ce6), Xenopus (xenTro3), zebra fish (Zv9), mouse (mm9) and human (hg19) was either downloaded from the "Genomic tRNA database" (http://gtrnadb.ucsc.edu/) or " NCBI" (http://www.ncbi.nlm.nih.gov/). We extracted sequences that spanned 100bp upstream and 100bp downstream to the mature tRNA genes from the same genome assembly on which the tRNA gene coordinates were built. The genomic sequences were extracted based on the strand information of tRNA gene transcription. A species-specific "tRNAdb blast database was built to query the small RNA sequences. To find the tRNA-related RNA sequences in each library, the small RNAs were mapped to the species-specific tRNAdb, using BLASTn (Altschul, Madden et al. 1997). In general we considered only those alignments where the query sequence (small RNA) was mapped to the database sequence (tRNA) along 100% of its length. Blast output file was parsed using an in-house developed script to get the information of mapped positions of small RNA on tRNA genes. Since custom tRNA database was built considering the strand information of tRNA gene therefore only those blast alignments were considered where queries mapped on to the positive strand of the subject sequence. We extracted the information for small RNAs aligned from its first to the last base with tRNA sequences, allowing either one or no mismatch. Since “CCA” is added at the 3’ end of tRNA by tRNA nucleotidyltransferase during maturation of tRNA (Xiong and Steitz 2006) therefore a special exception for the small RNA mapping to the 3’ ends of tRNAs in the tRNAdb was devised allowing a terminal mismatch of <=3 bases. To eliminate any false positives, the small RNAs that mapped on to the “tRNAdb” were again searched against the whole genome database using blast search excluding the tRNA loci. Finally only those small RNAs were qualified as probable tRFs that mapped exclusively on tRNA loci.

Quality filter to remove random degradation products of tRNA genes

The ends of the small RNA mapped on tRNA genes was used to access the significant enrichment of any mapped small RNA on tRNA. If the tRFs were a result of the random degradation of tRNA transcript, the ends of the tRFs would be equally distributed along the lengths of the tRNA genes with comparable frequency. However, the small RNA mostly (>90% of total mapped reads on individual tRNA) mapped on three specific regions: extreme 5’ end (tRF-5), extreme 3’ end (tRF-3) of mature tRNA and 3’ trailer region (tRF-1) of primary tRNA genes. Therefore tRFs mapped only to these specific locations were considered for further analysis. As shown in Figure for each tRF there is one most abundant RNA sequenced (example the tRF-5 “GCATTGGTGGTTCAGTGGTAGA” was sequenced at 8258 reads per million) accounting for more than 80% of the reads mapping to that site. This distinguishes the main tRF from other low abundance products created by nucleases digesting the main tRF, or possibly from random degradation of tRNAs and tRFs. Finally, only highly abundant tRFs (>20 reads per million in at least one library) were included in the database. .



Annotation of tRFs

Unlike the microRNA the information about tRF sequence, length and mechanism of generation is not well known, so that we were very stringent in corroborating tRFs by analysis of multiple small RNA libraries of a species. The most abundant clones mapping to individual tRNA loci were first identified in each library. Only if the identical sequence is detected in >80% of the small RNA libraries analyzed in that species is that particular clone included in the tRF database. Such an abundant tRF mapping at the 5’ extreme end of the mature tRNA was given a unique tRF ID prefix with “5” (e.g. 5001 “5” followed by number). A tRF that mapped to the 3’ extreme end of mature tRNA was given a unique tRF ID prefix with “3” (e.g. 3001 “3” followed by number) and the tRF-1 series were assigned the ID starting with “1”. .