An example of a fastq file for read 1 (in paired-read sequencing) is as follows:

@SRR3117565.1.1 1 length=100
NCAAAACAGCTCTCCCTCCTTTGATCTGATGGTCTGCAGAGGTCCTCAAATCCACACACTGCCACTCTTCAAGACCAACCACTGGGCCTTCTTAATCTCA
+SRR3117565.1.1 1 length=100
#1=BDDDDHFHHAHDE?GFEEDHG@HFHEECDCGHE:FDFHD*?DHFD<GEDEGIIIIIA=AFFGACHDH>EHHF>;;B<;A;>=A=??@CCCC>5>>AC
@SRR3117565.2.1 2 length=100
NTCCTGACTCACACGCCACAACCATGACTGGCTCAGCTCCCTTAATTCCAGCTTCCCTTACATGACGCAATTCCTTCTCAGATTCGGGTTTTCAGCTGAG
+SRR3117565.2.1 2 length=100
#4BDFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJIHHFEFEEEEEEDDEDDDDDDDDBDDDDDDDDDDD


An example of a fastq file for read 2 is as follows:
@SRR3117565.1.2 1 length=100
TGGTTTTTTTTTTTGTCCCTCAAATTTTTGGACTCCGTAACATCAACCAGTTTGGAGTGGGATGACAGAGAGAATGCCCAATTTTGTGAGGCCCATGATT
+SRR3117565.1.2 1 length=100
59(2(3(2=9/>;?/=)))))().8<8))'-).6..8:(',)..)(((((-(53/(,,(((((,(((+(+(+2((((+((+((+23++(+2+++2++(((
@SRR3117565.2.2 2 length=100
CTGGCTTGTTATAACGCAAAGCTTGGTTGTTTATGCAACTCTATCTTAAGAACTGCCCAGCCTCAGCTGAAAACCCGAATCTGAGAAGGAATTGCGTCAT
+SRR3117565.2.2 2 length=100
CCCFFFFFHHHHHJJJJJJJJJJJJJIJJIJJJJJJJJJJJJJJJJJJJJJJJJIJJJJHIJJJHHHHHFFFFFDDDDDDDDEDDDDDDDDDDDDDDDDB
The following script fixes this, so that both fastq files (corresponding to the paired sequencing reads) of sample SRR3117565 have the same read names.
nohup sed 's/\([@|\+]SRR.*\)\.1.*/\1/' ./SRR3117565_1.fastq > ./SRR3117565_1.correctId.fastq &
nohup sed 's/\([@|\+]SRR.*\)\.2.*/\1/' ./SRR3117565_2.fastq > ./SRR3117565_2.correctId.fastq &
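
A quick way to check the effect of the two substitutions is to run them on a tiny test input and compare the resulting read names. The sketch below does this on two made-up records per file; the temporary file names and the shortened sequences and qualities are assumptions for illustration only.

```shell
# Work in a throwaway directory so no real data is touched.
tmpdir=$(mktemp -d)

# Two shortened, made-up records per file, using the naming shown above.
printf '%s\n' \
    '@SRR3117565.1.1 1 length=4' 'NCAA' '+SRR3117565.1.1 1 length=4' '#1=B' \
    '@SRR3117565.2.1 2 length=4' 'NTCC' '+SRR3117565.2.1 2 length=4' '#4BD' \
    > "$tmpdir/test_1.fastq"
printf '%s\n' \
    '@SRR3117565.1.2 1 length=4' 'TGGT' '+SRR3117565.1.2 1 length=4' '59(2' \
    '@SRR3117565.2.2 2 length=4' 'CTGG' '+SRR3117565.2.2 2 length=4' 'CCCF' \
    > "$tmpdir/test_2.fastq"

# The same substitutions as in the script above, run in the foreground.
sed 's/\([@|\+]SRR.*\)\.1.*/\1/' "$tmpdir/test_1.fastq" > "$tmpdir/test_1.correctId.fastq"
sed 's/\([@|\+]SRR.*\)\.2.*/\1/' "$tmpdir/test_2.fastq" > "$tmpdir/test_2.correctId.fastq"

# The read name is the first line of every 4-line record.
names_1=$(awk 'NR % 4 == 1' "$tmpdir/test_1.correctId.fastq")
names_2=$(awk 'NR % 4 == 1' "$tmpdir/test_2.correctId.fastq")

if [ "$names_1" = "$names_2" ]; then
    echo "read names match"
else
    echo "read names differ"
fi

rm -r "$tmpdir"
```

On real files the same awk comparison could be run on the two *.correctId.fastq outputs once the background sed jobs have finished.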

Modify read names in bam files in Linux

In this example I show how to remove ":1" and ":2" at the end of the query/read names, which mark the first and the second read of a pair. If the read/query names start with "SRR", the following commands can be used:

samtools view ./file.bam | perl -pe 's/^(SRR[^\t]*):[12]\t/$1\t/' > ./file.sam
cat ./file.sam | samtools view -bS - > ./file_modified.bam

Or, all together, the following command can be run:

samtools view ./file.bam | perl -pe 's/^(SRR[^\t]*):[12]\t/$1\t/' | samtools view -bS - > ./file_modified.bam

In the end the string modification may mix up the header of the bam file (samtools view without -h also leaves the header out of its output). As a solution to this problem, the header can be separated and then reattached as follows:
samtools view -H /netapp/seqRawData/eugeneMouse/DIV0.bam > /netapp/seqRawData/eugeneMouse/DIV0_head.sam
samtools view /netapp/seqRawData/eugeneMouse/DIV0.bam | perl -pe 's/^([^\t]*):[12]\t/$1\t/' > /netapp/seqRawData/eugeneMouse/DIV0M.sam
cat /netapp/seqRawData/eugeneMouse/DIV0_head.sam /netapp/seqRawData/eugeneMouse/DIV0M.sam | samtools view -bS - > /netapp/seqRawData/eugeneMouse/DIV0M.bam


In this post I show how groupScatterPlot(), a function of the rnatoolbox R package, can be used for plotting the individual values in several groups together with their mean (or other statistics). I think this is a useful function for plotting grouped data when some groups (or all groups) have few data points! You may be wondering why such a function is included in the rnatoolbox package?! Well! I happen to use it quite a bit for plotting expression values of different groups of genes/transcripts in a sample, or expression levels of a specific gene/transcript in several sample groups. These expression values are either FPKM, TPM, LCPM, or PSI values (maybe I should go through these different normalizations later in a different post 😐!). But of course its application is not restricted to gene expression or RNAseq data analysis.

For the test, I first generate a list of three vectors of random values. The values are drawn from normal distributions with different means and standard deviations.

library(rnatoolbox)
datList<- list(
  l1=rnorm(n=30, mean = 10, sd = 3),
  l2=rnorm(n=20, mean = 0, sd = 1),
  l3=rnorm(n=25, mean = 10, sd = 1)
)


Then I plot the grouped values. By default the mean function is used to add a summary of the values. However, another function (e.g. median) can be passed as the FUN parameter.


png(
  "/proj/pehackma/ali/test/test_rnatoolbox/test_groupedScatterPlot_3.png",
  width=500, height=500, pointsize=21)
groupScatterPlot(l=datList, col=rainbow(3),
                 lty=1, lwd=1.5,
                 ylab="Test values")
dev.off()




About Me
I am a Postdoc researcher at the Neuromuscular Disorders Research lab and Genetic Determinants of Osteoporosis Research lab, in University of Helsinki and Folkhälsan RC. I specialize in Bioinformatics. I am interested in Machine learning and multi-omics data analysis. My go-to programming language is R.