Uploaded image for project: 'IGB'
  1. IGB
  2. IGBF-1222

Check if IGB shows soft-clipped bases from read alignments correctly

    Details

    • Type: Task
    • Status: Closed (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
    • Story Points:
      4
    • Sprint:
      B - Summer 2018

      Description

      SAM specification allows for read alignments to be represented where the ends are "soft-clipped", meaning: they are not reported in the alignment.

      An IGB user noticed that older versions of IGB omitted these bases from the view. Instead, they should be shown, with mis-matches. This may have been fixed, but we need to check.

      If IGB shows the soft-clipped bases, users can more easily detect re-arrangements and other issues critical to diagnosing genomic defects in clinical samples.

      To start, make sure the above description of "soft-clipping" is correct.

        Attachments

          Issue Links

            Activity

            Hide
            ieclabau Ivory Blakley (Inactive) added a comment -

            Some helpful documentation...

            Understanding CIGAR: https://jef.works/blog/2017/03/28/CIGAR-strings-for-dummies/

            SAM documentation pdf: https://samtools.github.io/hts-specs/SAMv1.pdf
            CIGAR explanation in SAM documentation (page 6):

            CIGAR: CIGAR string. The CIGAR operations are given in the following table (set ‘*’ if unavailable):
            Op __ BAM __ Description __ Consumes query __ Consumes reference
            M 0 alignment match (can be a sequence match or mismatch) yes yes
            I 1 insertion to the reference yes no
            D 2 deletion from the reference no yes
            N 3 skipped region from the reference no yes
            S 4 soft clipping (clipped sequences present in SEQ) yes no
            H 5 hard clipping (clipped sequences NOT present in SEQ) no no
            P 6 padding (silent deletion from padded reference) no no
            = 7 sequence match yes yes
            X 8 sequence mismatch yes yes

            • “Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the
            alignment to step along the query sequence and the reference sequence respectively.
            • H can only be present as the first and/or last operation.
            • S may only have H operations between them and the ends of the CIGAR string.
            • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments,
            the interpretation of N is not defined.
            • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.

            Show
            ieclabau Ivory Blakley (Inactive) added a comment - Some helpful documentation... Understanding CIGAR: https://jef.works/blog/2017/03/28/CIGAR-strings-for-dummies/ SAM documentation pdf: https://samtools.github.io/hts-specs/SAMv1.pdf CIGAR explanation in SAM documentation (page 6): CIGAR: CIGAR string. The CIGAR operations are given in the following table (set ‘*’ if unavailable): Op __ BAM __ Description __ Consumes query __ Consumes reference M 0 alignment match (can be a sequence match or mismatch) yes yes I 1 insertion to the reference yes no D 2 deletion from the reference no yes N 3 skipped region from the reference no yes S 4 soft clipping (clipped sequences present in SEQ) yes no H 5 hard clipping (clipped sequences NOT present in SEQ) no no P 6 padding (silent deletion from padded reference) no no = 7 sequence match yes yes X 8 sequence mismatch yes yes • “Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively. • H can only be present as the first and/or last operation. • S may only have H operations between them and the ends of the CIGAR string. • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined. • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.
            Hide
            ieclabau Ivory Blakley (Inactive) added a comment -

            I saw this in the SAM doucmentation. It is a rare case since most BAM alignments won't have this many operators, but it would still be good to make sure that our parser is living up to the spec described here...

            4.2.2 N CIGAR OP field
            With 16 bits, n cigar op can keep at most 65535 CIGAR operations in BAM files. For an alignment with
            more CIGAR operations, BAM stores....

            Show
            ieclabau Ivory Blakley (Inactive) added a comment - I saw this in the SAM doucmentation. It is a rare case since most BAM alignments won't have this many operators, but it would still be good to make sure that our parser is living up to the spec described here... 4.2.2 N CIGAR OP field With 16 bits, n cigar op can keep at most 65535 CIGAR operations in BAM files. For an alignment with more CIGAR operations, BAM stores....
            Hide
            ieclabau Ivory Blakley (Inactive) added a comment -

            Since there is an operator in the CIGAR string for soft clipped reads, I should be able to select soft clipped reads from a BAM file using the CIGAR string.
            I tried this with the RNA-Seq > Pollen SRP022162 > Reads > Leaf1 dataset. It looks like there are not soft-clipped reads in that bam file. I think my filtering method is correct: it was able to correctly pull out the gapped reads (selecting for N rather than S in the CIGAR string).

            samtools view -h Leaf1.bam | awk '{if($0 ~ /^@/ || $6 ~ /S/) {print $0}}' | samtools view -Sb - > Leaf1_softClipped.bam
            samtools index Leaf1_softClipped.bam

            samtools view -h Leaf1.bam | awk '{if($0 ~ /^@/ || $6 ~ /N/) {print $0}}' | samtools view -Sb - > Leaf1_gapped.bam
            samtools index Leaf1_gapped.bam

            Leaf1_gapped.bam has the gapped reads from Leaf1.bam
            Leaf1_softClipped.bam has no reads.

            I can try the same method on files from other projects and maybe find one that does have soft-clipped reads.
            If that fails, I'll have to figure out how to make soft-clip reads myself.

            Show
            ieclabau Ivory Blakley (Inactive) added a comment - Since there is an operator in the CIGAR string for soft clipped reads, I should be able to select soft clipped reads from a BAM file using the CIGAR string. I tried this with the RNA-Seq > Pollen SRP022162 > Reads > Leaf1 dataset. It looks like there are not soft-clipped reads in that bam file. I think my filtering method is correct: it was able to correctly pull out the gapped reads (selecting for N rather than S in the CIGAR string). samtools view -h Leaf1.bam | awk '{if($0 ~ /^@/ || $6 ~ /S/) {print $0}}' | samtools view -Sb - > Leaf1_softClipped.bam samtools index Leaf1_softClipped.bam samtools view -h Leaf1.bam | awk '{if($0 ~ /^@/ || $6 ~ /N/) {print $0}}' | samtools view -Sb - > Leaf1_gapped.bam samtools index Leaf1_gapped.bam Leaf1_gapped.bam has the gapped reads from Leaf1.bam Leaf1_softClipped.bam has no reads. I can try the same method on files from other projects and maybe find one that does have soft-clipped reads. If that fails, I'll have to figure out how to make soft-clip reads myself.
            Hide
            ieclabau Ivory Blakley (Inactive) added a comment - - edited

            I generated a file with soft-clipped reads.
            They are NOT shown in IGB as mismatches. They are shown like hard-clipped reads, that is: the clipped bases are treated as if they do not exist.

            Each alignment record includes the query sequence (a "read" in RNA-Seq).
            If a sequence is hard-clipped, some sequence is removed from the end of the query so the sequence that is included in the alignment record is shorter than the original. This is done to improve alignments, presumably cutting off low quality bases.
            If a sequence is soft-clipped, the un-aligned portion of the sequence (always at an end) is still stored as as part of the alignment record. Since the "clipped" bases are still there, that portion of the sequence is included if the sequences are extracted back out again (using a tool like bamtofastq).

            In bowtie2, you get soft-clipped alignments by specifying the --local option. The idea here is that the unaligned bit of sequence at the end of the query isn't bad data, its just DNA from several bases down the strand. bowtie2 doesn't support gapped reads, but it allows you align queries that might contain gaps by soft-clipping bits away so you still get to align at least a chunk of it. I suppose there may even be another alignment record for the same read where the soft-clipped portion is aligned and rest is "soft-clipped".

            Show
            ieclabau Ivory Blakley (Inactive) added a comment - - edited I generated a file with soft-clipped reads. They are NOT shown in IGB as mismatches. They are shown like hard-clipped reads, that is: the clipped bases are treated as if they do not exist. Each alignment record includes the query sequence (a "read" in RNA-Seq). If a sequence is hard-clipped, some sequence is removed from the end of the query so the sequence that is included in the alignment record is shorter than the original. This is done to improve alignments, presumably cutting off low quality bases. If a sequence is soft-clipped, the un-aligned portion of the sequence (always at an end) is still stored as as part of the alignment record. Since the "clipped" bases are still there, that portion of the sequence is included if the sequences are extracted back out again (using a tool like bamtofastq). In bowtie2, you get soft-clipped alignments by specifying the --local option. The idea here is that the unaligned bit of sequence at the end of the query isn't bad data, its just DNA from several bases down the strand. bowtie2 doesn't support gapped reads, but it allows you align queries that might contain gaps by soft-clipping bits away so you still get to align at least a chunk of it. I suppose there may even be another alignment record for the same read where the soft-clipped portion is aligned and rest is "soft-clipped".
            Hide
            ieclabau Ivory Blakley (Inactive) added a comment - - edited

            Here is how I generated a file with soft-clipped reads:

            get a bam file
            wget http://lorainelab-quickload.scidas.org/rnaseq/A_thaliana_Jun_2009/SRP022162/Leaf1.bam

            This bam file didn't have any soft-clipped reads (see earlier comment)
            --probably because it was aligned using an aligner that supports gapped reads, and doesn't do soft-clipping.

            # get a small fastq file
            samtools index Leaf1.bam
            samtools view -b Leaf1.bam Chr1:12534725-13296891 > Leaf1.small.bam
            samtools index Leaf1.small.bam
            bedtools bamtofastq -i Leaf1.small.bam -fq Leaf1.small.fastq

            # ref genome for alignment
            # this is quick and dirty--I just copied the genome from a place where I happened to have it saved as a fasta file already.
            mkdir ref; cd ref
            cp /Users/ivory/Documents/Repos/inducible-ant/ExternalDataSets/A_thaliana_Jun_2009.fa .
            bowtie2-build A_thaliana_Jun_2009.fa A_thaliana_Jun_2009
            cd ..

            # align the small fastq file to the reference genome
            bowtie2 --local -x ref/A_thaliana_Jun_2009 -U Leaf1.small.fastq -S Leaf1.small.aligned.sam
            samtools view -S -b Leaf1.small.aligned.sam > Leaf1.small.aligned.bam
            samtools sort Leaf1.small.aligned.bam > Leaf1.small.aligned.sorted.bam
            samtools index Leaf1.small.aligned.sorted.bam

            # pull out just the reads that were soft-clipped
            samtools view -h Leaf1.small.aligned.sorted.bam \
            | awk '{if($0 ~ /^@/ || $6 ~ /S/) {print $0}}' | samtools view -Sb - \
            > Leaf1.small.aligned.sorted.softClipped.bam
            samtools index Leaf1.small.aligned.sorted.softClipped.bam

            Finally, drag and drop Leaf1.small.aligned.sorted.softClipped.bam into IGB. All reads in that track are examples of soft-clipped reads. Every read in this track has a corresponding alignment in the Leaf1.bam file (you can find it using the search function if you want to compare them).

            Show
            ieclabau Ivory Blakley (Inactive) added a comment - - edited Here is how I generated a file with soft-clipped reads: get a bam file wget http://lorainelab-quickload.scidas.org/rnaseq/A_thaliana_Jun_2009/SRP022162/Leaf1.bam This bam file didn't have any soft-clipped reads (see earlier comment) --probably because it was aligned using an aligner that supports gapped reads, and doesn't do soft-clipping. # get a small fastq file samtools index Leaf1.bam samtools view -b Leaf1.bam Chr1:12534725-13296891 > Leaf1.small.bam samtools index Leaf1.small.bam bedtools bamtofastq -i Leaf1.small.bam -fq Leaf1.small.fastq # ref genome for alignment # this is quick and dirty--I just copied the genome from a place where I happened to have it saved as a fasta file already. mkdir ref; cd ref cp /Users/ivory/Documents/Repos/inducible-ant/ExternalDataSets/A_thaliana_Jun_2009.fa . bowtie2-build A_thaliana_Jun_2009.fa A_thaliana_Jun_2009 cd .. # align the small fastq file to the reference genome bowtie2 --local -x ref/A_thaliana_Jun_2009 -U Leaf1.small.fastq -S Leaf1.small.aligned.sam samtools view -S -b Leaf1.small.aligned.sam > Leaf1.small.aligned.bam samtools sort Leaf1.small.aligned.bam > Leaf1.small.aligned.sorted.bam samtools index Leaf1.small.aligned.sorted.bam # pull out just the reads that were soft-clipped samtools view -h Leaf1.small.aligned.sorted.bam \ | awk '{if($0 ~ /^@/ || $6 ~ /S/) {print $0}}' | samtools view -Sb - \ > Leaf1.small.aligned.sorted.softClipped.bam samtools index Leaf1.small.aligned.sorted.softClipped.bam Finally, drag and drop Leaf1.small.aligned.sorted.softClipped.bam into IGB. All reads in that track are examples of soft-clipped reads. Every read in this track has a corresponding alignment in the Leaf1.bam file (you can find it using the search function if you want to compare them).
            Hide
            ieclabau Ivory Blakley (Inactive) added a comment -

            I tried to pull out hard-clipped reads so we'd have an example of those for comparison, but no luck.

            I tried to get them from the full Leaf1.bam file (none) and I tried to get them from the small bowtie-aligned filed (none).
            I'm not sure if bowtie makes hard-clipped reads at all or if it might with different settings. Getting an example of hard-clipped reads might take some time if we do want it.

            Show
            ieclabau Ivory Blakley (Inactive) added a comment - I tried to pull out hard-clipped reads so we'd have an example of those for comparison, but no luck. I tried to get them from the full Leaf1.bam file (none) and I tried to get them from the small bowtie-aligned filed (none). I'm not sure if bowtie makes hard-clipped reads at all or if it might with different settings. Getting an example of hard-clipped reads might take some time if we do want it.
            Hide
            ieclabau Ivory Blakley (Inactive) added a comment -

            If we want to change the way that soft-clipped bases are shown, we could try to direct IGB to process the CIGAR string as if that 'S' (soft-clipped) were 'X' (mis-match). IGB determines what is a match or a mismatch based on the sequence itself. So the real key would be to make IGB use the whole sequence including the soft-clipped bases. This might require adjusting the sequence length attribute and the alignment start position attribute. ...and potentially more that I haven't thought of yet.

            Show
            ieclabau Ivory Blakley (Inactive) added a comment - If we want to change the way that soft-clipped bases are shown, we could try to direct IGB to process the CIGAR string as if that 'S' (soft-clipped) were 'X' (mis-match). IGB determines what is a match or a mismatch based on the sequence itself. So the real key would be to make IGB use the whole sequence including the soft-clipped bases. This might require adjusting the sequence length attribute and the alignment start position attribute. ...and potentially more that I haven't thought of yet.

              People

              • Assignee:
                Unassigned
                Reporter:
                ann.loraine Ann Loraine
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: