Re: [Geuvadis_rna_analysis] GEM mappings

4 Jul 2012

Hi all,
very quickly because at the moment I am in London with an intermittent
connection.

The quality scores are computed as follows: for each read,

   1. all the existing mappings for *both* ends are found up to a (very
   high) number of substitutions/indels/splices
   2. a score is computed for each mapping: the more unique the mapping,
   the better the score.

Hence the scores are *relative*, not *absolute*: their purpose is that of
deciding which mapping is the best one *among its fellows*, not to stratify
all the mappings in the SAM file by their number of errors (which you can
very well do in the obvious way, since the number of mismatches is recorded
in the NM field).

For instance: the first mapping for the first pair you consider,

HWI-ST661:153:D0FTJACXX:8:2201:16410:93092      163     chr1    14704
 *180
75M     *=       14770   -134
CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
RG:Z:0  NM:i:7  XT:A:U  md:Z:62T12
XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;

HWI-ST661:153:D0FTJACXX:8:2201:16410:93092      83      chr1    14770
 *180
39M1D22M1I3M2I1M1I2M4S  *=       14704   134
ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
5@
9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
RG:Z:0  NM:i:7  XT:A:U  md:Z:(4)2>1-1>2-T2>1-22>1+39
XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2
I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;

has an overall quality score of 180. This is due to the fact that there are
several mappings, but the best one (the one described in the regular SAM
record), to chr1:14704/chr1:14770 is anyway better than the second best one
(the one to chr15,-102516387/chr15,+102516328 described in the XA optional
field). In fact, the mapping to chr1 has 7 mismatches (following GEM's
scoring system), while the mapping to chr15 has 8 mismatches. So: in *
absolute* terms all mappings are bad because the second end does not map
well to any position, but still one can say that the mapping to chr1 is
better than the other ones, and this is why it gets a high *relative* score.

If for some reason you feel uncomfortable with using mapping having a high
number of mismacthes (which is an absolute score), you can easily filter
them out based on the NM tag, or on other criteria (for instance, if you
had filtered out mappings with NM>5 this pair would have disappeared).

Hope this clarifies things. I do not have the time now to explain how the
relative score is computed, but I will be happy to do that if you are
interested.
Best,
-------Paolo

On Wed, Jul 4, 2012 at 3:58 PM, Tuuli Lappalainen <
Tuuli.Lappalainen(a)unige.ch&gt; wrote:

...
   Hi Matthias and others,

 I noticed the read length discordance myself very recently, and was going
 to bring it up. This is really unfortunate - people processing the data in
 Barcelona didn't follow the explicit instructions of submitting 75 bp data.
 This is something that I should have checked though, and I thought I had,
 but apparently I didn't for the mRNA data. However, I don't think this will
 really affect things. The Barcelona samples don't show worse mapping stats,
 and I definitely don't want to remap these samples without a very strong
 reason.

 Regarding the GEM mapping quality scores, *Paolo, Thasso and Micha, can
 you please comment on this? *Matthias and Daniela, what do you mean by
 referring to the example reads as being well or badly mapped? The
 classifications below is the information I have from Paolo, but I don't
 know the algorithms that they use to calculate the MAPQ. In my
 understanding the qualities are calculated independently for the two mates.
 I'm using reads with MAPQ >150, i.e. reads in categories 1 and 2 below. It
 depends on the analysis whether I require both mates to have this quality
 (e.g. exon quantifications) or if I can just use a single good-quality mate
 regardless of the quality of its mate (e.g. ASE).

    1. Matches which are unique, and do not have any subdominant match:
    251 >= MAPQ >= 255, XT=U
     2. Matches which are unique, and have subdominant matches but a
    different score:
    175 >= MAPQ >= 181, XT=U
    3. Matches which are putatively unique (not unique, but
    distinguishable by score):
    119 >= MAPQ >= 127, XT=U
    4. Matches which are a perfect tie:
    78 >= MAPQ >= 90, XT=R.

 best regards,
 Tuuli

 Tuuli Lappalainen, PhD
 Department of Genetic Medicine and Development
 University of Geneva Medical School
 CMU / Rue Michel-Servet 1
 1211 Geneva 4
 Switzerland
 Tel. +41-(0)22-3795550tuuli.lappalainen(a)unige.ch

 On 7/4/12 3:24 PM, Matthias Barann wrote:

 Dear all,

 we noticed some inconsistencies in the read lengths of the GEM mappings (I
 didn't check the BWA mappings, but it's probably the same).
 Some samples appear to have up to 76 bp matched, while other samples only
 have 75 bp. In regard to comparability this might cause some (probably very
 little) differences between the samples.

 i.e.
 NA20589.1.M_111124_3.bam has 75 bp
 NA20812.2.M_111216_6.bam has *76 bp*
 NA20760.3.M_120202_5.bam has 75 bp
 NA20783.4.M_120208_6.bam has 75 bp
 NA20768.5.M_120131_1.bam has 75 bp
 NA20798.6.M_120119_6.bam has 75 bp
 NA20803.7.M_120219_1.bam has 75 bp

 We checked some more files for institute 2, and they seem to have
 generally 1 bp more than the others.

 We're also a bit lost regarding GEM's quality score. We would like to
 filter reads for good mapping quality, we're just not sure how to
 accomplish this.

 Below are two read pairs as an example.
 The fist pair has a mapping quality of 180. The second read however was
 terribly aligned.
 The second pair has only a quality of 99, while the reads mapped much
 better (at least I would say so).
 We're not sure what's the reason for this.

 Curiously, both reads of a pair get the same mapping quality. This could
 be by intention (i.e. it's always the pair quality rather than the read
 quality) or, which could also explain the quality differences for the
 reads, both reads of the pair get the quality of the first mapped read.
 We're thankful for any suggestions on how to filter for 'good' mappings.

 HWI-ST661:153:D0FTJACXX:8:2201:16410:93092      163     chr1    14704   *180
 75M     *=       14770   -134
 CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
 CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
 RG:Z:0  NM:i:7  XT:A:U  md:Z:62T12

XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;

 HWI-ST661:153:D0FTJACXX:8:2201:16410:93092      83      chr1    14770   *180
 39M1D22M1I3M2I1M1I2M4S  *=       14704   134
 ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
 5@
 9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
 RG:Z:0  NM:i:7  XT:A:U  md:Z:(4)2>1-1>2-T2>1-22>1+39
 XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2

I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;

 HWI-ST661:153:D0FTJACXX:8:2201:6270:52066       99      chr1    14582   *99
 1M2I71M1S       *=       14665   158
 CCCTGGTTCCGTCACCCCCTCCCAGGGAAGCAGGTCTGAGCAGCTTGTCCTGGCTGTGTCAATGTCAGAGCAACA
 @11ADFFFHHHHHIIJJJJIJJIJJJJHHJJIJJHIJJJIIGFGIIJIIJF@
 @EGDAE>?)).7?BDFFDCCCB@       RG:Z:0  NM:i:8  XT:A:U
 md:Z:1>2-21A5T29C13(1)

XA:Z:chrY,-59358087,75M,4;chrX,-155255081,75M,4;chr9,+14691,75M,7;chr2,-114356359,75M,7;chr16,+64266,75M,8;chr12,-90971,75M,8;

 HWI-ST661:153:D0FTJACXX:8:2201:6270:52066       147     chr1    14665   *99
 75M     *=       14582   -158
 GGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTAGGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGTG

<B@A<9;>@CAAADDDDCCC>CC=?;>FHCGGEGIGJJIIGGIHHIIIIJIIJIJIGJJIJIHHFDDFFFFDB@;
 RG:Z:0  NM:i:8  XT:A:U  md:Z:AG38G34

XA:Z:chrY,+59358002,75M,4;chrX,+155254996,75M,4;chr9,-14776,75M,7;chr2,+114356274,75M,7;chr16,-64350,58M1I1M2I11M1D2M,8;chr12,+90885,2M1D73M,8;

 best wishes,

 Matthias & Daniela

 --
 Matthias Barann
 Institute of Clinical Molecular Biology
 Christian Albrechts University Kiel
 Schittenhelmstr. 12
 D-24105 Kiel, Germany
 m.barann(a)ikmb.uni-kiel.de+49 - (0)431 - 597 8681 (office)

 _______________________________________________
 Geuvadis_rna_analysis mailing
listGeuvadis_rna_analysis@lists.crg.eshttp://davinci.crg.es/mailman/listinfo/geuvadis_rna_analysis

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

Re: [Geuvadis_rna_analysis] GEM mappings