Thanks for the clarification Paolo! I had slightly misunderstood
the philosophy of the MAPQs, and hadn't paid enough attention to
the number of mismatches - thanks to Matthias and Daniela as well
for noticing this.
In most analyses, we should probably exclude reads with a high
number of mismatches - I would say that maybe 3 or 4 would be a
good maximum. What do you guys think?
I will need to rerun exon quantifications... :-/
best,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550
tuuli.lappalainen@unige.ch
On 7/4/12 7:57 PM, Paolo Ribeca wrote:
Hi all,
very quickly because at the moment I am in London with an
intermittent connection.
The quality scores are computed as follows: for each read,
all the existing mappings for both ends are found up
to a (very high) number of substitutions/indels/splices
a score is computed for each mapping: the more unique the
mapping, the better the score.
Hence the scores are relative, not absolute: their
purpose is that of deciding which mapping is the best one among
its fellows, not to stratify all the mappings in the SAM
file by their number of errors (which you can very well do in the
obvious way, since the number of mismatches is recorded in the NM
field).
For instance: the first mapping for the first pair you consider,
has an overall quality score of 180. This is due to the fact that
there are several mappings, but the best one (the one described in
the regular SAM record), to chr1:14704/chr1:14770 is anyway better
than the second best one (the one to
chr15,-102516387/chr15,+102516328 described in the XA optional
field). In fact, the mapping to chr1 has 7 mismatches (following
GEM's scoring system), while the mapping to chr15 has 8
mismatches. So: in absolute terms all mappings are bad
because the second end does not map well to any position, but
still one can say that the mapping to chr1 is better than the
other ones, and this is why it gets a high relative score.
If for some reason you feel uncomfortable with using mapping
having a high number of mismacthes (which is an absolute score),
you can easily filter them out based on the NM tag, or on other
criteria (for instance, if you had filtered out mappings with
NM>5 this pair would have disappeared).
Hope this clarifies things. I do not have the time now to explain
how the relative score is computed, but I will be happy to do that
if you are interested.
Best,
-------Paolo
I noticed the read length discordance myself very
recently, and was going to bring it up. This is really
unfortunate - people processing the data in Barcelona
didn't follow the explicit instructions of submitting 75
bp data. This is something that I should have checked
though, and I thought I had, but apparently I didn't for
the mRNA data. However, I don't think this will really
affect things. The Barcelona samples don't show worse
mapping stats, and I definitely don't want to remap these
samples without a very strong reason.
Regarding the GEM mapping quality scores, Paolo,
Thasso and Micha, can you please comment on this? Matthias
and Daniela, what do you mean by referring to the example
reads as being well or badly mapped? The classifications
below is the information I have from Paolo, but I don't
know the algorithms that they use to calculate the MAPQ.
In my understanding the qualities are calculated
independently for the two mates. I'm using reads with MAPQ
>150, i.e. reads in categories 1 and 2 below. It
depends on the analysis whether I require both mates to
have this quality (e.g. exon quantifications) or if I can
just use a single good-quality mate regardless of the
quality of its mate (e.g. ASE).
Matches which are unique, and do not have any
subdominant match:
251 >= MAPQ >= 255, XT=U
Matches which are unique, and have subdominant
matches but a different score:
175 >= MAPQ >= 181, XT=U
Matches which are putatively unique (not unique, but
distinguishable by score):
119 >= MAPQ >= 127, XT=U
Matches which are a perfect tie:
78 >= MAPQ >= 90, XT=R.
best regards,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550tuuli.lappalainen@unige.ch
On 7/4/12 3:24 PM, Matthias Barann wrote:
Dear all,
we noticed some inconsistencies in the read lengths of the
GEM mappings (I didn't check the BWA mappings, but it's
probably the same).
Some samples appear to have up to 76 bp matched, while
other samples only have 75 bp. In regard to comparability
this might cause some (probably very little) differences
between the samples.
i.e.
NA20589.1.M_111124_3.bam has 75 bp
NA20812.2.M_111216_6.bam has 76 bp
NA20760.3.M_120202_5.bam has 75 bp
NA20783.4.M_120208_6.bam has 75 bp
NA20768.5.M_120131_1.bam has 75 bp
NA20798.6.M_120119_6.bam has 75 bp
NA20803.7.M_120219_1.bam has 75 bp
We checked some more files for institute 2, and they seem
to have generally 1 bp more than the others.
We're also a bit lost regarding GEM's quality score. We
would like to filter reads for good mapping quality, we're
just not sure how to accomplish this.
Below are two read pairs as an example.
The fist pair has a mapping quality of 180. The second
read however was terribly aligned.
The second pair has only a quality of 99, while the reads
mapped much better (at least I would say so).
We're not sure what's the reason for this.
Curiously, both reads of a pair get the same mapping
quality. This could be by intention (i.e. it's always the
pair quality rather than the read quality) or, which could
also explain the quality differences for the reads, both
reads of the pair get the quality of the first mapped
read.
We're thankful for any suggestions on how to filter for
'good' mappings.
--
Matthias Barann
Institute of Clinical Molecular Biology
Christian Albrechts University Kiel
Schittenhelmstr. 12
D-24105 Kiel, Germany
m.barann@ikmb.uni-kiel.de+49 - (0)431 - 597 8681 (office)