Re: [Geuvadis_rna_analysis] GEM mappings

4 Jul 2012

Thanks for the clarification Paolo! I had slightly misunderstood the 
philosophy of the MAPQs, and hadn't paid enough attention to the number 
of mismatches - thanks to Matthias and Daniela as well for noticing this.

In most analyses, we should probably exclude reads with a high number of 
mismatches - I would say that maybe 3 or 4 would be a good maximum. What 
do you guys think?

I will need to rerun exon quantifications... :-/

best,
Tuuli

Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550
tuuli.lappalainen(a)unige.ch

On 7/4/12 7:57 PM, Paolo Ribeca wrote:
...
  Hi all,
 very quickly because at the moment I am in London with an intermittent 
 connection.

 The quality scores are computed as follows: for each read,

  1. all the existing mappings for /both/ ends are found up to a (very
     high) number of substitutions/indels/splices
  2. a score is computed for each mapping: the more unique the mapping,
     the better the score.

 Hence the scores are /relative/, not /absolute/: their purpose is that 
 of deciding which mapping is the best one /among its fellows/, not to 
 stratify all the mappings in the SAM file by their number of errors 
 (which you can very well do in the obvious way, since the number of 
 mismatches is recorded in the NM field).

 For instance: the first mapping for the first pair you consider,
  HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 163   
 chr1    14704 
 *180     75M *=       14770 -134 
 CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG 
 CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@############### 
 RG:Z:0  NM:i:7  XT:A:U  md:Z:62T12 

XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;

 HWI-ST661:153:D0FTJACXX:8:2201:16410:93092      83      chr1 14770 
 *180     39M1D22M1I3M2I1M1I2M4S *=       14704 134 
 ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT 
 5@
 9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B 
 RG:Z:0  NM:i:7  XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39 
 XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2

I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;
 has an overall quality score of 180. This is due to the fact that 
 there are several mappings, but the best one (the one described in the 
 regular SAM record), to chr1:14704/chr1:14770 is anyway better than 
 the second best one (the one to chr15,-102516387/chr15,+102516328 
 described in the XA optional field). In fact, the mapping to chr1 has 
 7 mismatches (following GEM's scoring system), while the mapping to 
 chr15 has 8 mismatches. So: in /absolute/ terms all mappings are bad 
 because the second end does not map well to any position, but still 
 one can say that the mapping to chr1 is better than the other ones, 
 and this is why it gets a high /relative/ score.

 If for some reason you feel uncomfortable with using mapping having a 
 high number of mismacthes (which is an absolute score), you can easily 
 filter them out based on the NM tag, or on other criteria (for 
 instance, if you had filtered out mappings with NM>5 this pair would 
 have disappeared).

 Hope this clarifies things. I do not have the time now to explain how 
 the relative score is computed, but I will be happy to do that if you 
 are interested.
 Best,
 -------Paolo

 On Wed, Jul 4, 2012 at 3:58 PM, Tuuli Lappalainen 
 &lt;Tuuli.Lappalainen(a)unige.ch <mailto:Tuuli.Lappalainen@unige.ch>> wrote:

     Hi Matthias and others,

     I noticed the read length discordance myself very recently, and
     was going to bring it up. This is really unfortunate - people
     processing the data in Barcelona didn't follow the explicit
     instructions of submitting 75 bp data. This is something that I
     should have checked though, and I thought I had, but apparently I
     didn't for the mRNA data. However, I don't think this will really
     affect things. The Barcelona samples don't show worse mapping
     stats, and I definitely don't want to remap these samples without
     a very strong reason.

     Regarding the GEM mapping quality scores, *Paolo, Thasso and
     Micha, can you please comment on this? *Matthias and Daniela, what
     do you mean by referring to the example reads as being well or
     badly mapped? The classifications below is the information I have
     from Paolo, but I don't know the algorithms that they use to
     calculate the MAPQ. In my understanding the qualities are
     calculated independently for the two mates. I'm using reads with
     MAPQ >150, i.e. reads in categories 1 and 2 below. It depends on
     the analysis whether I require both mates to have this quality
     (e.g. exon quantifications) or if I can just use a single
     good-quality mate regardless of the quality of its mate (e.g. ASE).

      1. Matches which are unique, and do not have any subdominant match:
         251 >= MAPQ >= 255, XT=U
      2. Matches which are unique, and have subdominant matches but a
         different score:
         175 >= MAPQ >= 181, XT=U
      3. Matches which are putatively unique (not unique, but
         distinguishable by score):
         119 >= MAPQ >= 127, XT=U
      4. Matches which are a perfect tie:
         78 >= MAPQ >= 90, XT=R.

     best regards,
     Tuuli

     Tuuli Lappalainen, PhD
     Department of Genetic Medicine and Development
     University of Geneva Medical School
     CMU / Rue Michel-Servet 1
     1211 Geneva 4
     Switzerland
     Tel.+41-(0)22-3795550  <tel:%2B41-%280%2922-3795550>
     tuuli.lappalainen(a)unige.ch  <mailto:tuuli.lappalainen@unige.ch>

     On 7/4/12 3:24 PM, Matthias Barann wrote:
      Dear all,

     we noticed some inconsistencies in the read lengths of the GEM
     mappings (I didn't check the BWA mappings, but it's probably the
     same).
     Some samples appear to have up to 76 bp matched, while other
     samples only have 75 bp. In regard to comparability this might
     cause some (probably very little) differences between the samples.

     i.e.
     NA20589.1.M_111124_3.bam has 75 bp
     NA20812.2.M_111216_6.bam has *76 bp*
     NA20760.3.M_120202_5.bam has 75 bp
     NA20783.4.M_120208_6.bam has 75 bp
     NA20768.5.M_120131_1.bam has 75 bp
     NA20798.6.M_120119_6.bam has 75 bp
     NA20803.7.M_120219_1.bam has 75 bp

     We checked some more files for institute 2, and they seem to have
     generally 1 bp more than the others.

     We're also a bit lost regarding GEM's quality score. We would
     like to filter reads for good mapping quality, we're just not
     sure how to accomplish this.

     Below are two read pairs as an example.
     The fist pair has a mapping quality of 180. The second read
     however was terribly aligned.
     The second pair has only a quality of 99, while the reads mapped
     much better (at least I would say so).
     We're not sure what's the reason for this.

     Curiously, both reads of a pair get the same mapping quality.
     This could be by intention (i.e. it's always the pair quality
     rather than the read quality) or, which could also explain the
     quality differences for the reads, both reads of the pair get the
     quality of the first mapped read.
     We're thankful for any suggestions on how to filter for 'good'
     mappings.

      HWI-ST661:153:D0FTJACXX:8:2201:16410:93092
163     chr1    14704
     *180     75M *= 14770   -134
     CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
     CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
     RG:Z:0  NM:i:7  XT:A:U  md:Z:62T12

XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;

     HWI-ST661:153:D0FTJACXX:8:2201:16410:93092      83 chr1    14770
     *180     39M1D22M1I3M2I1M1I2M4S *= 14704   134
     ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
     5@
     9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
     RG:Z:0  NM:i:7  XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39
     XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2

I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;

      HWI-ST661:153:D0FTJACXX:8:2201:6270:52066 99 
    chr1    14582
     *99      1M2I71M1S *= 14665   158
     CCCTGGTTCCGTCACCCCCTCCCAGGGAAGCAGGTCTGAGCAGCTTGTCCTGGCTGTGTCAATGTCAGAGCAACA
     @11ADFFFHHHHHIIJJJJIJJIJJJJHHJJIJJHIJJJIIGFGIIJIIJF@@EGDAE>?)).7?BDFFDCCCB@
     RG:Z:0  NM:i:8  XT:A:U  md:Z:1>2-21A5T29C13(1)

XA:Z:chrY,-59358087,75M,4;chrX,-155255081,75M,4;chr9,+14691,75M,7;chr2,-114356359,75M,7;chr16,+64266,75M,8;chr12,-90971,75M,8;

     HWI-ST661:153:D0FTJACXX:8:2201:6270:52066       147 chr1   
     14665 *99      75M *=       14582 -158
     GGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTAGGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGTG

<B@A<9;>@CAAADDDDCCC>CC=?;>FHCGGEGIGJJIIGGIHHIIIIJIIJIJIGJJIJIHHFDDFFFFDB@;
     RG:Z:0  NM:i:8  XT:A:U  md:Z:AG38G34

XA:Z:chrY,+59358002,75M,4;chrX,+155254996,75M,4;chr9,-14776,75M,7;chr2,+114356274,75M,7;chr16,-64350,58M1I1M2I11M1D2M,8;chr12,+90885,2M1D73M,8;

     best wishes,

     Matthias & Daniela
     -- 
     Matthias Barann
     Institute of Clinical Molecular Biology
     Christian Albrechts University Kiel
     Schittenhelmstr. 12
     D-24105 Kiel, Germany

     m.barann(a)ikmb.uni-kiel.de  <mailto:m.barann@ikmb.uni-kiel.de>
     +49 - (0)431 - 597 8681  <tel:%2B49%20-%20%280%29431%20-%20597%208681> 
(office)

     _______________________________________________
     Geuvadis_rna_analysis mailing list
     Geuvadis_rna_analysis(a)lists.crg.es 
<mailto:Geuvadis_rna_analysis@lists.crg.es>
     http://davinci.crg.es/mailman/listinfo/geuvadis_rna_analysis 

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [Geuvadis_rna_analysis] GEM mappings