Thanks for the clarification Paolo!
You are welcome. Also, please keep in mind that if you do not like the
score we can always quickly rescore the matches from the original GEM
files, and regenerate BAMs accordingly.
I would say that maybe 3 or 4 would be a good maximum
I would disagree on this point. There are some matches which are
perfectly unique (see my classification) and they could easily be kept
independently of the number of mismatches. They just only map (not
very well) to a single position, why discarding them? Also, remember
that the BWA default is to allow the 4% of mismatches, and 4%*150=6.
One might perhaps keep only matches which have (<=3,<=3) mismatches
considering the two ends separatly rather than those having a sum <= 6
-- but anyway I would say that 3 or 4 is definitely too conservative.
Maybe it is a good idea if you add me to the analysis list, so that in
the future I can step in sooner if problems arise...
Best,
-------Paolo
On Wed, Jul 4, 2012 at 8:13 PM, Tuuli Lappalainen
<Tuuli.Lappalainen(a)unige.ch <mailto:Tuuli.Lappalainen@unige.ch>> wrote:
Thanks for the clarification Paolo! I had slightly misunderstood
the philosophy of the MAPQs, and hadn't paid enough attention to
the number of mismatches - thanks to Matthias and Daniela as well
for noticing this.
In most analyses, we should probably exclude reads with a high
number of mismatches - I would say that maybe 3 or 4 would be a
good maximum. What do you guys think?
I will need to rerun exon quantifications... :-/
best,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel.+41-(0)22-3795550 <tel:%2B41-%280%2922-3795550>
tuuli.lappalainen(a)unige.ch <mailto:tuuli.lappalainen@unige.ch>
On 7/4/12 7:57 PM, Paolo Ribeca wrote:
Hi all,
very quickly because at the moment I am in London with an
intermittent connection.
The quality scores are computed as follows: for each read,
1. all the existing mappings for /both/ ends are found up to a
(very high) number of substitutions/indels/splices
2. a score is computed for each mapping: the more unique the
mapping, the better the score.
Hence the scores are /relative/, not /absolute/: their purpose is
that of deciding which mapping is the best one /among its
fellows/, not to stratify all the mappings in the SAM file by
their number of errors (which you can very well do in the obvious
way, since the number of mismatches is recorded in the NM field).
For instance: the first mapping for the first pair you consider,
HWI-ST661:153:D0FTJACXX:8:2201:16410:93092
163 chr1 14704
*180 75M *= 14770 -134
CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
RG:Z:0 NM:i:7 XT:A:U md:Z:62T12
XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;
HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 83 chr1 14770
*180 39M1D22M1I3M2I1M1I2M4S *= 14704 134
ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
5@
9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
RG:Z:0 NM:i:7 XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39
XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2
I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;
has an overall quality score of 180. This is due to the fact that
there are several mappings, but the best one (the one described
in the regular SAM record), to chr1:14704/chr1:14770 is anyway
better than the second best one (the one to
chr15,-102516387/chr15,+102516328 described in the XA optional
field). In fact, the mapping to chr1 has 7 mismatches (following
GEM's scoring system), while the mapping to chr15 has 8
mismatches. So: in /absolute/ terms all mappings are bad because
the second end does not map well to any position, but still one
can say that the mapping to chr1 is better than the other ones,
and this is why it gets a high /relative/ score.
If for some reason you feel uncomfortable with using mapping
having a high number of mismacthes (which is an absolute score),
you can easily filter them out based on the NM tag, or on other
criteria (for instance, if you had filtered out mappings with
NM>5 this pair would have disappeared).
Hope this clarifies things. I do not have the time now to explain
how the relative score is computed, but I will be happy to do
that if you are interested.
Best,
-------Paolo
On Wed, Jul 4, 2012 at 3:58 PM, Tuuli Lappalainen
<Tuuli.Lappalainen(a)unige.ch <mailto:Tuuli.Lappalainen@unige.ch>>
wrote:
Hi Matthias and others,
I noticed the read length discordance myself very recently,
and was going to bring it up. This is really unfortunate -
people processing the data in Barcelona didn't follow the
explicit instructions of submitting 75 bp data. This is
something that I should have checked though, and I thought I
had, but apparently I didn't for the mRNA data. However, I
don't think this will really affect things. The Barcelona
samples don't show worse mapping stats, and I definitely
don't want to remap these samples without a very strong reason.
Regarding the GEM mapping quality scores, *Paolo, Thasso and
Micha, can you please comment on this? *Matthias and Daniela,
what do you mean by referring to the example reads as being
well or badly mapped? The classifications below is the
information I have from Paolo, but I don't know the
algorithms that they use to calculate the MAPQ. In my
understanding the qualities are calculated independently for
the two mates. I'm using reads with MAPQ >150, i.e. reads in
categories 1 and 2 below. It depends on the analysis whether
I require both mates to have this quality (e.g. exon
quantifications) or if I can just use a single good-quality
mate regardless of the quality of its mate (e.g. ASE).
1. Matches which are unique, and do not have any subdominant
match:
251 >= MAPQ >= 255, XT=U
2. Matches which are unique, and have subdominant matches
but a different score:
175 >= MAPQ >= 181, XT=U
3. Matches which are putatively unique (not unique, but
distinguishable by score):
119 >= MAPQ >= 127, XT=U
4. Matches which are a perfect tie:
78 >= MAPQ >= 90, XT=R.
best regards,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel.+41-(0)22-3795550 <tel:%2B41-%280%2922-3795550>
tuuli.lappalainen(a)unige.ch <mailto:tuuli.lappalainen@unige.ch>
On 7/4/12 3:24 PM, Matthias Barann wrote:
Dear all,
we noticed some inconsistencies in the read lengths of the
GEM mappings (I didn't check the BWA mappings, but it's
probably the same).
Some samples appear to have up to 76 bp matched, while other
samples only have 75 bp. In regard to comparability this
might cause some (probably very little) differences between
the samples.
i.e.
NA20589.1.M_111124_3.bam has 75 bp
NA20812.2.M_111216_6.bam has *76 bp*
NA20760.3.M_120202_5.bam has 75 bp
NA20783.4.M_120208_6.bam has 75 bp
NA20768.5.M_120131_1.bam has 75 bp
NA20798.6.M_120119_6.bam has 75 bp
NA20803.7.M_120219_1.bam has 75 bp
We checked some more files for institute 2, and they seem to
have generally 1 bp more than the others.
We're also a bit lost regarding GEM's quality score. We
would like to filter reads for good mapping quality, we're
just not sure how to accomplish this.
Below are two read pairs as an example.
The fist pair has a mapping quality of 180. The second read
however was terribly aligned.
The second pair has only a quality of 99, while the reads
mapped much better (at least I would say so).
We're not sure what's the reason for this.
Curiously, both reads of a pair get the same mapping
quality. This could be by intention (i.e. it's always the
pair quality rather than the read quality) or, which could
also explain the quality differences for the reads, both
reads of the pair get the quality of the first mapped read.
We're thankful for any suggestions on how to filter for
'good' mappings.
HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 163 chr1
14704 *180 75M *= 14770 -134
CCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGGCTGACACGCGGG
CCCFFFFFHHHGHJJJIGIJIIIIIJIJHIIJJJB?BFHG7@F@AB:9=(6=(;>B29<@###############
RG:Z:0 NM:i:7 XT:A:U md:Z:62T12
XA:Z:chr15,-102516387,75M,8;chr9,+14815,75M,10;chr16,+64389,19M1I1M2I52M,10;chr2,-114356235,75M,11;
HWI-ST661:153:D0FTJACXX:8:2201:16410:93092 83 chr1
14770 *180 39M1D22M1I3M2I1M1I2M4S *= 14704 134
ACACGCGGGCAAAGGCTCCTCCGGGCCCCTCACCAGCCCAGGTCCTTTCCCAGAGATGCCTTGGCTCGTGGCTGT
5@
9DDDDDDDDDC@BDDDDDDFHHIIIIJIHFJJJJIIJIJIJJIJJJJJJJJJJIIIJJJJHHHHHFFDDA11B
RG:Z:0 NM:i:7 XT:A:U md:Z:(4)2>1-1>2-T2>1-22>1+39
XA:Z:chr15,+102516328,1S1M2I4M3I1M1I1M1I24M1D36M,8;chr9,-14881,39M1D22M1I3M2
I1M1I2M4S,10;chr16,-64452,39M1D22M1I3M2I1M1I2M4S,10;chr2,+114356176,1S1M2I4M3I1M1I1M1I24M1D36M,11;
HWI-ST661:153:D0FTJACXX:8:2201:6270:52066
99 chr1
14582 *99 1M2I71M1S *= 14665 158
CCCTGGTTCCGTCACCCCCTCCCAGGGAAGCAGGTCTGAGCAGCTTGTCCTGGCTGTGTCAATGTCAGAGCAACA
@11ADFFFHHHHHIIJJJJIJJIJJJJHHJJIJJHIJJJIIGFGIIJIIJF@@EGDAE>?)).7?BDFFDCCCB@
RG:Z:0 NM:i:8 XT:A:U md:Z:1>2-21A5T29C13(1)
XA:Z:chrY,-59358087,75M,4;chrX,-155255081,75M,4;chr9,+14691,75M,7;chr2,-114356359,75M,7;chr16,+64266,75M,8;chr12,-90971,75M,8;
HWI-ST661:153:D0FTJACXX:8:2201:6270:52066 147 chr1
14665 *99 75M *= 14582 -158
GGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTAGGATTCCCAGTCGTCCTCGTCCTCCTCTGCCTGTGGCTGTG
<B@A<9;>@CAAADDDDCCC>CC=?;>FHCGGEGIGJJIIGGIHHIIIIJIIJIJIGJJIJIHHFDDFFFFDB@;
RG:Z:0 NM:i:8 XT:A:U md:Z:AG38G34
XA:Z:chrY,+59358002,75M,4;chrX,+155254996,75M,4;chr9,-14776,75M,7;chr2,+114356274,75M,7;chr16,-64350,58M1I1M2I11M1D2M,8;chr12,+90885,2M1D73M,8;
best wishes,
Matthias & Daniela
--
Matthias Barann
Institute of Clinical Molecular Biology
Christian Albrechts University Kiel
Schittenhelmstr. 12
D-24105 Kiel, Germany
m.barann(a)ikmb.uni-kiel.de <mailto:m.barann@ikmb.uni-kiel.de>
+49 - (0)431 - 597 8681 <tel:%2B49%20-%20%280%29431%20-%20597%208681>
(office)
_______________________________________________
Geuvadis_rna_analysis mailing list
Geuvadis_rna_analysis(a)lists.crg.es
<mailto:Geuvadis_rna_analysis@lists.crg.es>
http://davinci.crg.es/mailman/listinfo/geuvadis_rna_analysis