Thanks for the clarification Paolo!
You are welcome. Also, please keep in mind that if you do not like the
score we can always quickly rescore the matches from the original GEM
files, and regenerate BAMs accordingly.
I would say that maybe 3 or 4 would be a good maximum
I would disagree on this point. There are some
matches which are perfectly
unique (see my classification) and they could easily be kept independently
of the number of mismatches. They just only map (not very well) to a single
position, why discarding them? Also, remember that the BWA default is to
allow the 4% of mismatches, and 4%*150=6. One might perhaps keep only
matches which have (<=3,<=3) mismatches considering the two ends separatly
rather than those having a sum <= 6 -- but anyway I would say that 3 or 4
is definitely too conservative.
Maybe it is a good idea if you add me to the analysis list, so that in the
future I can step in sooner if problems arise...
Best,
-------Paolo
On Wed, Jul 4, 2012 at 8:13 PM, Tuuli Lappalainen <
Tuuli.Lappalainen(a)unige.ch> wrote:
> Thanks for the clarification Paolo! I had
slightly misunderstood the
> philosophy of the MAPQs, and hadn't paid enough attention to the number of
> mismatches - thanks to Matthias and Daniela as well for noticing this.
> In most analyses, we should probably
exclude reads with a high number of
> mismatches - I would say that maybe 3 or 4 would be a good maximum. What do
> you guys think?
> I will need to rerun exon
quantifications... :-/
> best,
> Tuuli
> Tuuli Lappalainen, PhD
> Department of Genetic Medicine and Development
> University of Geneva Medical School
> CMU / Rue Michel-Servet 1
> 1211 Geneva 4
> Switzerland
> Tel. +41-(0)22-3795550tuuli.lappalainen(a)unige.ch
> On 7/4/12 7:57 PM, Paolo Ribeca wrote:
> Hi all,
> very quickly because at the moment I am in London with an intermittent
> connection.
> The quality scores are computed as follows:
for each read,
> 1. all the existing mappings for *both*
ends are found up to a (very
> high) number of substitutions/indels/splices
> 2. a score is computed for each mapping: the more unique the mapping,
> the better the score.
> Hence the scores are *relative*, not
*absolute*: their purpose is that of
> deciding which mapping is the best one *among its fellows*, not to
> stratify all the mappings in the SAM file by their number of errors (which
> you can very well do in the obvious way, since the number of mismatches is
> recorded in the NM field).
> For instance: the first mapping for the
first pair you consider,
> HWI-ST661:153:D0FTJACXX:8:2201:16410:93092
163 chr1 14704 *180
> 75M *= 14770 -134
> CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
> CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
> RG:Z:0 NM:i:7 XT:A:U md:Z:62T12
>
XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;
> HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 83 chr1 14770 *180
> 39M1D22M1I3M2I1M1I2M4S *= 14704 134
> ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
> 5@
> 9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
> RG:Z:0 NM:i:7 XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39
> XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2
>
I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;
> has an overall quality score of 180. This
is due to the fact that there
> are several mappings, but the best one (the one described in the regular
> SAM record), to chr1:14704/chr1:14770 is anyway better than the second best
> one (the one to chr15,-102516387/chr15,+102516328 described in the XA
> optional field). In fact, the mapping to chr1 has 7 mismatches (following
> GEM's scoring system), while the mapping to chr15 has 8 mismatches. So: in
> *absolute* terms all mappings are bad because the second end does not map
> well to any position, but still one can say that the mapping to chr1 is
> better than the other ones, and this is why it gets a high *relative*score.
> If for some reason you feel uncomfortable
with using mapping having a high
> number of mismacthes (which is an absolute score), you can easily filter
> them out based on the NM tag, or on other criteria (for instance, if you
> had filtered out mappings with NM>5 this pair would have disappeared).
> Hope this clarifies things. I do not have
the time now to explain how the
> relative score is computed, but I will be happy to do that if you are
> interested.
> Best,
> -------Paolo
> On Wed, Jul 4, 2012 at 3:58 PM, Tuuli
Lappalainen <
> Tuuli.Lappalainen(a)unige.ch> wrote:
>> Hi Matthias and others,
>
>> I noticed the read length discordance
myself very recently, and was going
>> to bring it up. This is really unfortunate - people processing the data in
>> Barcelona didn't follow the explicit instructions of submitting 75 bp data.
>> This is something that I should have checked though, and I thought I had,
>> but apparently I didn't for the mRNA data. However, I don't think this
will
>> really affect things. The Barcelona samples don't show worse mapping stats,
>> and I definitely don't want to remap these samples without a very strong
>> reason.
>
>> Regarding the GEM mapping quality
scores, *Paolo, Thasso and Micha, can
>> you please comment on this? *Matthias and Daniela, what do you mean by
>> referring to the example reads as being well or badly mapped? The
>> classifications below is the information I have from Paolo, but I don't
>> know the algorithms that they use to calculate the MAPQ. In my
>> understanding the qualities are calculated independently for the two mates.
>> I'm using reads with MAPQ >150, i.e. reads in categories 1 and 2 below.
It
>> depends on the analysis whether I require both mates to have this quality
>> (e.g. exon quantifications) or if I can just use a single good-quality mate
>> regardless of the quality of its mate (e.g. ASE).
>
>
>> 1. Matches which are unique, and do not have any subdominant match:
>> 251 >= MAPQ >= 255, XT=U
>> 2. Matches which are unique, and have subdominant matches but a
>> different score:
>> 175 >= MAPQ >= 181, XT=U
>> 3. Matches which are putatively unique (not unique, but
>> distinguishable by score):
>> 119 >= MAPQ >= 127, XT=U
>> 4. Matches which are a perfect tie:
>> 78 >= MAPQ >= 90, XT=R.
>
>
>> best regards,
>> Tuuli
>
>
>
>> Tuuli
Lappalainen, PhD
>> Department of Genetic Medicine and Development
>> University of Geneva Medical School
>> CMU / Rue Michel-Servet 1
>> 1211 Geneva 4
>> Switzerland
>> Tel. +41-(0)22-3795550tuuli.lappalainen(a)unige.ch
>
>> On 7/4/12 3:24 PM, Matthias Barann
wrote:
>
>> Dear all,
>
>> we noticed some inconsistencies in the
read lengths of the GEM mappings
>> (I didn't check the BWA mappings, but it's probably the same).
>> Some samples appear to have up to 76 bp matched, while other samples only
>> have 75 bp. In regard to comparability this might cause some (probably very
>> little) differences between the samples.
>
>> i.e.
>> NA20589.1.M_111124_3.bam has 75 bp
>> NA20812.2.M_111216_6.bam has *76 bp*
>> NA20760.3.M_120202_5.bam has 75 bp
>> NA20783.4.M_120208_6.bam has 75 bp
>> NA20768.5.M_120131_1.bam has 75 bp
>> NA20798.6.M_120119_6.bam has 75 bp
>> NA20803.7.M_120219_1.bam has 75 bp
>
>> We checked some more files for
institute 2, and they seem to have
>> generally 1 bp more than the others.
>
>
>> We're also a bit lost regarding GEM's quality score. We would like
to
>> filter reads for good mapping quality, we're just not sure how to
>> accomplish this.
>
>> Below are two read pairs as an
example.
>> The fist pair has a mapping quality of 180. The second read however was
>> terribly aligned.
>> The second pair has only a quality of 99, while the reads mapped much
>> better (at least I would say so).
>> We're not sure what's the reason for this.
>
>> Curiously, both reads of a pair get the
same mapping quality. This could
>> be by intention (i.e. it's always the pair quality rather than the read
>> quality) or, which could also explain the quality differences for the
>> reads, both reads of the pair get the quality of the first mapped read.
>> We're thankful for any suggestions on how to filter for 'good'
mappings.
>
>>
HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 163 chr1 14704 *180
>> 75M *= 14770 -134
>> CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
>>
CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
>> RG:Z:0 NM:i:7 XT:A:U md:Z:62T12
>>
XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;
>
>
>> HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 83 chr1 14770
*180
>> 39M1D22M1I3M2I1M1I2M4S *= 14704 134
>> ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
>> 5@
>> 9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
>> RG:Z:0 NM:i:7 XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39
>> XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2
>
>>
I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;
>
>
>> HWI-ST661:153:D0FTJACXX:8:2201:6270:52066 99 chr1 14582
*99
>> 1M2I71M1S *= 14665 158
>> CCCTGGTTCCGTCACCCCCTCCCAGGGAAGCAGGTCTGAGCAGCTTGTCCTGGCTGTGTCAATGTCAGAGCAACA
>> @11ADFFFHHHHHIIJJJJIJJIJJJJHHJJIJJHIJJJIIGFGIIJIIJF@
>> @EGDAE>?)).7?BDFFDCCCB@ RG:Z:0 NM:i:8 XT:A:U
>> md:Z:1>2-21A5T29C13(1)
>>
XA:Z:chrY,-59358087,75M,4;chrX,-155255081,75M,4;chr9,+14691,75M,7;chr2,-114356359,75M,7;chr16,+64266,75M,8;chr12,-90971,75M,8;
>
>>
HWI-ST661:153:D0FTJACXX:8:2201:6270:52066 147 chr1 14665 *99
>> 75M *= 14582 -158
>> GGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTAGGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGTG
>> <B@A
>>
<9;>@CAAADDDDCCC>CC=?;>FHCGGEGIGJJIIGGIHHIIIIJIIJIJIGJJIJIHHFDDFFFFDB@;
>> RG:Z:0 NM:i:8 XT:A:U md:Z:AG38G34
>>
XA:Z:chrY,+59358002,75M,4;chrX,+155254996,75M,4;chr9,-14776,75M,7;chr2,+114356274,75M,7;chr16,-64350,58M1I1M2I11M1D2M,8;chr12,+90885,2M1D73M,8;
>
>
>
>> best
wishes,
>
>> Matthias & Daniela
>
>> --
>> Matthias Barann
>> Institute of Clinical Molecular Biology
>> Christian Albrechts University Kiel
>> Schittenhelmstr. 12
>> D-24105 Kiel, Germany
>> m.barann(a)ikmb.uni-kiel.de+49 - (0)431 - 597 8681 (office)
>
>
>
>>
_______________________________________________
>> Geuvadis_rna_analysis mailing
listGeuvadis_rna_analysis@lists.crg.eshttp://davinci.crg.es/mailman/listinfo/geuvadis_rna_analysis
>
>
>
>