In my opinion we need have a filter for maximum number of
mismatches (as well as MAPQ>150) when we want to have
well-mapped reliable reads - if a read has lots of mismatches, I
wouldn't trust it even if the other matches were even worse. But
you're right that 3 or 4 is too stringent, I was thinking of the
75 bps and not the total of 150.
I'd say that we keep reads with <=6 mismatches according to the
NM flag. If no one objects by Thursday noon, I'll proceed with
this. I'll provide a script for filtering, and upload a filtered
set of bam files to the ftp site - you can do whatever is easier
for you.
best regards,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550
tuuli.lappalainen@unige.ch
On 7/4/12 8:55 PM, Paolo Ribeca wrote:
Thanks
for the clarification Paolo!
You are welcome. Also, please keep in mind that if you do not
like the score we can always quickly rescore the matches from
the original GEM files, and regenerate BAMs accordingly.
I would
say that maybe 3 or 4 would be a good maximum
I would disagree on this point. There are some matches which
are perfectly unique (see my classification) and they could
easily be kept independently of the number of mismatches. They
just only map (not very well) to a single position, why
discarding them? Also, remember that the BWA default is to allow
the 4% of mismatches, and 4%*150=6. One might perhaps keep only
matches which have (<=3,<=3) mismatches considering the
two ends separatly rather than those having a sum <= 6 -- but
anyway I would say that 3 or 4 is definitely too conservative.
Maybe it is a good idea if you add me to the analysis list, so
that in the future I can step in sooner if problems arise...
Best,
-------Paolo
Thanks for the clarification Paolo! I had slightly
misunderstood the philosophy of the MAPQs, and hadn't paid
enough attention to the number of mismatches - thanks to
Matthias and Daniela as well for noticing this.
In most analyses, we should probably exclude reads with a
high number of mismatches - I would say that maybe 3 or 4
would be a good maximum. What do you guys think?
I will need to rerun exon quantifications... :-/
best,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550tuuli.lappalainen@unige.ch
On 7/4/12 7:57 PM, Paolo Ribeca wrote:
Hi all,
very quickly because at the moment I am in London with
an intermittent connection.
The quality scores are computed as follows: for each
read,
all the existing mappings for both ends
are found up to a (very high) number of
substitutions/indels/splices
a score is computed for each mapping: the more
unique the mapping, the better the score.
Hence the scores are relative, not absolute:
their purpose is that of deciding which mapping is the
best one among its fellows, not to stratify
all the mappings in the SAM file by their number of
errors (which you can very well do in the obvious way,
since the number of mismatches is recorded in the NM
field).
For instance: the first mapping for the first pair you
consider,
has an overall quality score of 180. This is due to
the fact that there are several mappings, but the best
one (the one described in the regular SAM record), to
chr1:14704/chr1:14770 is anyway better than the second
best one (the one to chr15,-102516387/chr15,+102516328
described in the XA optional field). In fact, the
mapping to chr1 has 7 mismatches (following GEM's
scoring system), while the mapping to chr15 has 8
mismatches. So: in absolute terms all mappings
are bad because the second end does not map well to
any position, but still one can say that the mapping
to chr1 is better than the other ones, and this is why
it gets a high relative score.
If for some reason you feel uncomfortable with using
mapping having a high number of mismacthes (which is
an absolute score), you can easily filter them out
based on the NM tag, or on other criteria (for
instance, if you had filtered out mappings with
NM>5 this pair would have disappeared).
Hope this clarifies things. I do not have the time now
to explain how the relative score is computed, but I
will be happy to do that if you are interested.
Best,
-------Paolo
I noticed the read length discordance myself
very recently, and was going to bring it up.
This is really unfortunate - people processing
the data in Barcelona didn't follow the
explicit instructions of submitting 75 bp
data. This is something that I should have
checked though, and I thought I had, but
apparently I didn't for the mRNA data.
However, I don't think this will really affect
things. The Barcelona samples don't show worse
mapping stats, and I definitely don't want to
remap these samples without a very strong
reason.
Regarding the GEM mapping quality scores, Paolo,
Thasso and Micha, can you please comment on
this? Matthias and Daniela, what do you
mean by referring to the example reads as
being well or badly mapped? The
classifications below is the information I
have from Paolo, but I don't know the
algorithms that they use to calculate the
MAPQ. In my understanding the qualities are
calculated independently for the two mates.
I'm using reads with MAPQ >150, i.e. reads
in categories 1 and 2 below. It depends on the
analysis whether I require both mates to have
this quality (e.g. exon quantifications) or if
I can just use a single good-quality mate
regardless of the quality of its mate (e.g.
ASE).
Matches which are unique, and do not
have any subdominant match:
251 >= MAPQ >= 255, XT=U
Matches which are unique, and have
subdominant matches but a different score:
175 >= MAPQ >= 181, XT=U
Matches which are putatively unique (not
unique, but distinguishable by score):
119 >= MAPQ >= 127, XT=U
Matches which are a perfect tie:
78 >= MAPQ >= 90, XT=R.
best regards,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550tuuli.lappalainen@unige.ch
On 7/4/12 3:24 PM, Matthias Barann wrote:
Dear all,
we noticed some inconsistencies in the read
lengths of the GEM mappings (I didn't check
the BWA mappings, but it's probably the same).
Some samples appear to have up to 76 bp
matched, while other samples only have 75 bp.
In regard to comparability this might cause
some (probably very little) differences
between the samples.
i.e.
NA20589.1.M_111124_3.bam has 75 bp
NA20812.2.M_111216_6.bam has 76 bp
NA20760.3.M_120202_5.bam has 75 bp
NA20783.4.M_120208_6.bam has 75 bp
NA20768.5.M_120131_1.bam has 75 bp
NA20798.6.M_120119_6.bam has 75 bp
NA20803.7.M_120219_1.bam has 75 bp
We checked some more files for institute 2,
and they seem to have generally 1 bp more than
the others.
We're also a bit lost regarding GEM's quality
score. We would like to filter reads for good
mapping quality, we're just not sure how to
accomplish this.
Below are two read pairs as an example.
The fist pair has a mapping quality of 180.
The second read however was terribly aligned.
The second pair has only a quality of 99,
while the reads mapped much better (at least I
would say so).
We're not sure what's the reason for this.
Curiously, both reads of a pair get the same
mapping quality. This could be by intention
(i.e. it's always the pair quality rather than
the read quality) or, which could also explain
the quality differences for the reads, both
reads of the pair get the quality of the first
mapped read.
We're thankful for any suggestions on how to
filter for 'good' mappings.
--
Matthias Barann
Institute of Clinical Molecular Biology
Christian Albrechts University Kiel
Schittenhelmstr. 12
D-24105 Kiel, Germany
m.barann@ikmb.uni-kiel.de+49 - (0)431 - 597 8681 (office)