Hi Matthias and Daniela,
yes, your guess is correct, with the additional provisos that:
1. reads are split only at reliable sites, i.e. sites whose flanking sequences
satisfy a suitable splice site consensus and that are covered by a reasonable
number of reads
2. when there are equivalent splits (same number of edit operations),
they are sorted by increasing insert distance: on the basis of
thermodynamic considerations, the smaller the distance, the better the
split.
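The tie-breaking rule in point 2 can be sketched roughly as follows. This is only an illustration, not the mapper's actual code: the `SplitCandidate` class, its field names and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitCandidate:
    # Hypothetical container, not a structure taken from the mapper.
    edit_ops: int         # number of edit operations in the split alignment
    insert_distance: int  # genomic distance between the two read segments


def rank_splits(candidates):
    """Order splits by fewest edit operations, then by smallest insert distance."""
    return sorted(candidates, key=lambda c: (c.edit_ops, c.insert_distance))


# Made-up candidates: the first two are "equivalent" (same edit_ops).
candidates = [
    SplitCandidate(edit_ops=1, insert_distance=5200),
    SplitCandidate(edit_ops=1, insert_distance=340),
    SplitCandidate(edit_ops=2, insert_distance=90),
]
best = rank_splits(candidates)[0]
print(best)
```

Among the two candidates with one edit operation, the one with the shorter insert distance is ranked first; the candidate with two edit operations comes last even though its distance is the smallest.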
Best,
-------Paolo
On Thu, Jul 5, 2012 at 8:47 AM, Matthias Barann
<m.barann(a)ikmb.uni-kiel.de> wrote:
Hello,
just to be sure about this, how do you calculate the edit distance?
Our guess is that you consider each of the following variations...
- mismatches
- each insertion/deletion *independent of the length*
- split-reads
with an edit distance of 1? That's actually how we would like to have it,
though it's borderline for the split-reads.
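To make the counting scheme in the question concrete, here is a minimal sketch. It assumes the alignment is available as a SAM-style CIGAR string in which `N` marks a split junction, and that the number of mismatches is supplied separately (a CIGAR `M` does not distinguish matches from mismatches). The function name `edit_distance` is hypothetical, and the mapper's native output format may look different.

```python
import re

# One CIGAR element: a length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def edit_distance(cigar, mismatches):
    """Count each mismatch, each indel (regardless of length) and each split as 1."""
    events = 0
    for _length, op in CIGAR_OP.findall(cigar):
        # I = insertion, D = deletion, N = skipped region (assumed to be a split).
        # The length is deliberately ignored: one event counts as one edit.
        if op in "IDN":
            events += 1
    return mismatches + events


# 1 mismatch + one 2-base deletion + one split junction
print(edit_distance("30M2D20M1000N25M", mismatches=1))  # prints 3
```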
Thanks,
Matthias & Daniela
On 05.07.2012 02:18, Tuuli Lappalainen wrote:
Hello,
In my opinion we need to have a filter for the maximum number of mismatches (as
well as MAPQ>150) when we want to have well-mapped reliable reads - if a
read has lots of mismatches, I wouldn't trust it even if the other matches
were even worse. But you're right that 3 or 4 is too stringent, I was
thinking of the 75 bps and not the total of 150.
I'd say that we keep reads with <=6 mismatches according to the NM tag.
If no one objects by Thursday noon, I'll proceed with this. I'll provide a
script for filtering, and upload a filtered set of bam files to the ftp
site - you can do whatever is easier for you.
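For illustration only (this is not the filtering script mentioned above), a minimal sketch of such a filter on SAM text records could look like this. It assumes the threshold is applied per record; if the limit is meant for the total over both mates, the records would have to be grouped by read name first. The function names and the `max_nm` parameter are hypothetical, and the MAPQ condition is not included.

```python
def nm_value(sam_line):
    """Return the NM:i value of a SAM record, or None if the tag is absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for field in fields[11:]:
        if field.startswith("NM:i:"):
            return int(field[5:])
    return None


def keep_read(sam_line, max_nm=6):
    """Keep header lines and records whose NM value is at most max_nm."""
    if sam_line.startswith("@"):
        return True
    nm = nm_value(sam_line)
    # Assumption: records without an NM tag are dropped rather than kept.
    return nm is not None and nm <= max_nm


# Made-up records for demonstration.
base = "r1\t0\tchr1\t100\t255\t75M\t*\t0\t0\t" + "A" * 75 + "\t" + "I" * 75
print(keep_read(base + "\tNM:i:3"))  # prints True
print(keep_read(base + "\tNM:i:7"))  # prints False
```

On real BAM files the same test would normally be applied after converting to SAM text (for example by piping through samtools view) or through a BAM library.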
best regards,
Tuuli
Tuuli Lappalainen, PhD
Department of Genetic Medicine and Development
University of Geneva Medical School
CMU / Rue Michel-Servet 1
1211 Geneva 4
Switzerland
Tel. +41-(0)22-3795550
tuuli.lappalainen(a)unige.ch
--
Matthias Barann
Institute of Clinical Molecular Biology
Christian Albrechts University Kiel
Schittenhelmstr. 12
D-24105 Kiel, Germany
m.barann(a)ikmb.uni-kiel.de
+49 - (0)431 - 597 8681 (office)