Re: [Geuvadis_rna_analysis] GEM mappings

5 Jul 2012

Hello,

In my opinion we need have a filter for maximum number of mismatches (as 
well as MAPQ>150) when we want to have well-mapped reliable reads - if a 
read has lots of mismatches, I wouldn't trust it even if the other 
matches were even worse. But you're right that 3 or 4 is too stringent, 
I was thinking of the 75 bps and not the total of 150.

I'd say that we keep reads with <=6 mismatches according to the NM flag. 
If no one objects by Thursday noon, I'll proceed with this. I'll provide 
a script for filtering, and upload a filtered set of bam files to the 
ftp site - you can do whatever is easier for you.

best regards,
Tuuli

Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550
tuuli.lappalainen(a)unige.ch

On 7/4/12 8:55 PM, Paolo Ribeca wrote:
...

     Thanks for the clarification Paolo!

 You are welcome. Also, please keep in mind that if you do not like the 
 score we can always quickly rescore the matches from the original GEM 
 files, and regenerate BAMs accordingly.

     I would say that maybe 3 or 4 would be a good maximum

 I would disagree on this point. There are some matches which are 
 perfectly unique (see my classification) and they could easily be kept 
 independently of the number of mismatches. They just only map (not 
 very well) to a single position, why discarding them? Also, remember 
 that the BWA default is to allow the 4% of mismatches, and 4%*150=6. 
 One might perhaps keep only matches which have (<=3,<=3) mismatches 
 considering the two ends separatly rather than those having a sum <= 6 
 -- but anyway I would say that 3 or 4 is definitely too conservative.

 Maybe it is a good idea if you add me to the analysis list, so that in 
 the future I can step in sooner if problems arise...
 Best,
 -------Paolo

 On Wed, Jul 4, 2012 at 8:13 PM, Tuuli Lappalainen 
 &lt;Tuuli.Lappalainen(a)unige.ch <mailto:Tuuli.Lappalainen@unige.ch>> wrote:

     Thanks for the clarification Paolo! I had slightly misunderstood
     the philosophy of the MAPQs, and hadn't paid enough attention to
     the number of mismatches - thanks to Matthias and Daniela as well
     for noticing this.

     In most analyses, we should probably exclude reads with a high
     number of mismatches - I would say that maybe 3 or 4 would be a
     good maximum. What do you guys think?

     I will need to rerun exon quantifications... :-/

     best,
     Tuuli

     Tuuli Lappalainen, PhD
     Department of Genetic Medicine and Development
     University of Geneva Medical School
     CMU / Rue Michel-Servet 1
     1211 Geneva 4
     Switzerland
     Tel.+41-(0)22-3795550  <tel:%2B41-%280%2922-3795550>
     tuuli.lappalainen(a)unige.ch  <mailto:tuuli.lappalainen@unige.ch>

     On 7/4/12 7:57 PM, Paolo Ribeca wrote:
      Hi all,
     very quickly because at the moment I am in London with an
     intermittent connection.

     The quality scores are computed as follows: for each read,

      1. all the existing mappings for /both/ ends are found up to a
         (very high) number of substitutions/indels/splices
      2. a score is computed for each mapping: the more unique the
         mapping, the better the score.

     Hence the scores are /relative/, not /absolute/: their purpose is
     that of deciding which mapping is the best one /among its
     fellows/, not to stratify all the mappings in the SAM file by
     their number of errors (which you can very well do in the obvious
     way, since the number of mismatches is recorded in the NM field).

     For instance: the first mapping for the first pair you consider,
      HWI-ST661:153:D0FTJACXX:8:2201:16410:93092
163     chr1    14704
     *180     75M *= 14770   -134
     CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
     CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
     RG:Z:0  NM:i:7  XT:A:U  md:Z:62T12

XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;

     HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 83      chr1    14770
     *180 39M1D22M1I3M2I1M1I2M4S *=       14704 134
     ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
     5@
     9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
     RG:Z:0  NM:i:7  XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39
     XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2

I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;
     has an overall quality score of 180. This is due to the fact that
     there are several mappings, but the best one (the one described
     in the regular SAM record), to chr1:14704/chr1:14770 is anyway
     better than the second best one (the one to
     chr15,-102516387/chr15,+102516328 described in the XA optional
     field). In fact, the mapping to chr1 has 7 mismatches (following
     GEM's scoring system), while the mapping to chr15 has 8
     mismatches. So: in /absolute/ terms all mappings are bad because
     the second end does not map well to any position, but still one
     can say that the mapping to chr1 is better than the other ones,
     and this is why it gets a high /relative/ score.

     If for some reason you feel uncomfortable with using mapping
     having a high number of mismacthes (which is an absolute score),
     you can easily filter them out based on the NM tag, or on other
     criteria (for instance, if you had filtered out mappings with
     NM>5 this pair would have disappeared).

     Hope this clarifies things. I do not have the time now to explain
     how the relative score is computed, but I will be happy to do
     that if you are interested.
     Best,
     -------Paolo

     On Wed, Jul 4, 2012 at 3:58 PM, Tuuli Lappalainen
     &lt;Tuuli.Lappalainen(a)unige.ch <mailto:Tuuli.Lappalainen@unige.ch>>
     wrote:

         Hi Matthias and others,

         I noticed the read length discordance myself very recently,
         and was going to bring it up. This is really unfortunate -
         people processing the data in Barcelona didn't follow the
         explicit instructions of submitting 75 bp data. This is
         something that I should have checked though, and I thought I
         had, but apparently I didn't for the mRNA data. However, I
         don't think this will really affect things. The Barcelona
         samples don't show worse mapping stats, and I definitely
         don't want to remap these samples without a very strong reason.

         Regarding the GEM mapping quality scores, *Paolo, Thasso and
         Micha, can you please comment on this? *Matthias and Daniela,
         what do you mean by referring to the example reads as being
         well or badly mapped? The classifications below is the
         information I have from Paolo, but I don't know the
         algorithms that they use to calculate the MAPQ. In my
         understanding the qualities are calculated independently for
         the two mates. I'm using reads with MAPQ >150, i.e. reads in
         categories 1 and 2 below. It depends on the analysis whether
         I require both mates to have this quality (e.g. exon
         quantifications) or if I can just use a single good-quality
         mate regardless of the quality of its mate (e.g. ASE).

          1. Matches which are unique, and do not have any subdominant
             match:
             251 >= MAPQ >= 255, XT=U
          2. Matches which are unique, and have subdominant matches
             but a different score:
             175 >= MAPQ >= 181, XT=U
          3. Matches which are putatively unique (not unique, but
             distinguishable by score):
             119 >= MAPQ >= 127, XT=U
          4. Matches which are a perfect tie:
             78 >= MAPQ >= 90, XT=R.

         best regards,
         Tuuli

         Tuuli Lappalainen, PhD
         Department of Genetic Medicine and Development
         University of Geneva Medical School
         CMU / Rue Michel-Servet 1
         1211 Geneva 4
         Switzerland
         Tel.+41-(0)22-3795550  <tel:%2B41-%280%2922-3795550>
         tuuli.lappalainen(a)unige.ch  <mailto:tuuli.lappalainen@unige.ch>

         On 7/4/12 3:24 PM, Matthias Barann wrote:
          Dear all,

         we noticed some inconsistencies in the read lengths of the
         GEM mappings (I didn't check the BWA mappings, but it's
         probably the same).
         Some samples appear to have up to 76 bp matched, while other
         samples only have 75 bp. In regard to comparability this
         might cause some (probably very little) differences between
         the samples.

         i.e.
         NA20589.1.M_111124_3.bam has 75 bp
         NA20812.2.M_111216_6.bam has *76 bp*
         NA20760.3.M_120202_5.bam has 75 bp
         NA20783.4.M_120208_6.bam has 75 bp
         NA20768.5.M_120131_1.bam has 75 bp
         NA20798.6.M_120119_6.bam has 75 bp
         NA20803.7.M_120219_1.bam has 75 bp

         We checked some more files for institute 2, and they seem to
         have generally 1 bp more than the others.

         We're also a bit lost regarding GEM's quality score. We
         would like to filter reads for good mapping quality, we're
         just not sure how to accomplish this.

         Below are two read pairs as an example.
         The fist pair has a mapping quality of 180. The second read
         however was terribly aligned.
         The second pair has only a quality of 99, while the reads
         mapped much better (at least I would say so).
         We're not sure what's the reason for this.

         Curiously, both reads of a pair get the same mapping
         quality. This could be by intention (i.e. it's always the
         pair quality rather than the read quality) or, which could
         also explain the quality differences for the reads, both
         reads of the pair get the quality of the first mapped read.
         We're thankful for any suggestions on how to filter for
         'good' mappings.

HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 163     chr1   
         14704 *180     75M *=       14770   -134
         CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG

CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
         RG:Z:0  NM:i:7  XT:A:U  md:Z:62T12

XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;

         HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 83      chr1   
         14770 *180 39M1D22M1I3M2I1M1I2M4S *= 14704   134
         ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
         5@
         9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
         RG:Z:0  NM:i:7  XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39
         XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2

I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;

          HWI-ST661:153:D0FTJACXX:8:2201:6270:52066
99      chr1   
         14582 *99 1M2I71M1S *=       14665 158
         CCCTGGTTCCGTCACCCCCTCCCAGGGAAGCAGGTCTGAGCAGCTTGTCCTGGCTGTGTCAATGTCAGAGCAACA
         @11ADFFFHHHHHIIJJJJIJJIJJJJHHJJIJJHIJJJIIGFGIIJIIJF@@EGDAE>?)).7?BDFFDCCCB@
         RG:Z:0  NM:i:8  XT:A:U md:Z:1>2-21A5T29C13(1)

XA:Z:chrY,-59358087,75M,4;chrX,-155255081,75M,4;chr9,+14691,75M,7;chr2,-114356359,75M,7;chr16,+64266,75M,8;chr12,-90971,75M,8;

         HWI-ST661:153:D0FTJACXX:8:2201:6270:52066 147     chr1   
         14665 *99      75M *=       14582   -158
         GGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTAGGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGTG

<B@A<9;>@CAAADDDDCCC>CC=?;>FHCGGEGIGJJIIGGIHHIIIIJIIJIJIGJJIJIHHFDDFFFFDB@;
         RG:Z:0  NM:i:8  XT:A:U  md:Z:AG38G34

XA:Z:chrY,+59358002,75M,4;chrX,+155254996,75M,4;chr9,-14776,75M,7;chr2,+114356274,75M,7;chr16,-64350,58M1I1M2I11M1D2M,8;chr12,+90885,2M1D73M,8;

         best wishes,

         Matthias & Daniela
         -- 
         Matthias Barann
         Institute of Clinical Molecular Biology
         Christian Albrechts University Kiel
         Schittenhelmstr. 12
         D-24105 Kiel, Germany

         m.barann(a)ikmb.uni-kiel.de  <mailto:m.barann@ikmb.uni-kiel.de>
         +49 - (0)431 - 597 8681  <tel:%2B49%20-%20%280%29431%20-%20597%208681> 
(office)

         _______________________________________________
         Geuvadis_rna_analysis mailing list
         Geuvadis_rna_analysis(a)lists.crg.es 
<mailto:Geuvadis_rna_analysis@lists.crg.es>
         http://davinci.crg.es/mailman/listinfo/geuvadis_rna_analysis 

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [Geuvadis_rna_analysis] GEM mappings