bleties

IES retention analysis from PacBio CCS reads

View the Project on GitHub Swart-lab/bleties

MILRET - Method of IES Long-read Retention

The MILRET module calculates per-IES retention scores, either directly using the predicted IES junctions predicted by MILRAA, or IES junction coordinates from third party software.

Inputs

Differences of long read to short read alignments

MILRET is a reimplementation of the MIRET module in the ParTIES pipeline, that was designed to work with short read data. MIRET takes the following inputs: two reference genomes (somatic and germline), separate mapping files of the same reads to the somatic and germline genomes, and coordinates of known IES junctions. Reads that map with match to the somatic version are counted as IES-, whereas reads that map with match to the germline version are counted as IES+.

With long reads, we assume that IESs are spanned completely by most reads, vs short reads, where reads are unlikely to completely span an IES insert. So we do not count soft/hard clips on the ends of reads, only matches and inserts. We also assume that the read mapper will handle mapping of reads containing inserts properly, because most reads will contain predominantly MDS sequence that will be correctly anchored in the somatic reference sequence. Therefore, we only map the reads to the somatic genome. Reads that do contain IES+ forms will be reported as mappings with insert operations (I in the CIGAR string). We then simply compare reads mapping with match to the somatic genome at the IES junction (counted as IES-) to reads mapping with an insert at the exact location of the IES junction (counted as IES+).

Expected IES lengths

Sequencing errors frequently cause short indels in reads, relative to the reference assembly. Such erroneous indels may happen to overlap with IES junction coordinates and be miscounted as evidence for an IES presence/absence. When the --use_ies_lengths option is used, only indels with lengths matching the expected IES length, as supplied in the input GFF3 file, will be counted as a “true” IES retention. The expected IES length should be supplied with the IES_length key in the attributes field of the GFF3 file, as done in the output from MILRAA.

Because of sequencing error, indel lengths may not match the expected IES length exactly. This can be controlled with the --length_threshold option. For example, a threshold of 0.05 means that sequences +/- 5% (inclusive) of the expected length are still counted.

Output

The main output of MILRET is a table in TSV format {OUT}.milret.tsv, where {OUT} is the output filename prefix supplied to the --out option. The TSV table contains the following fields:

With the --dump option, internal data is dumped to JSON format for troubleshooting to {OUT}.milret.dump.json