Using MAJIQ HET (ENCODE RBP knockdowns)

We have created a small example dataset from a few select genes/samples from ENCODE in order to demonstrate how to use MAJIQ. First, download the example dataset files:

[1]:
# create temporary working directory for this notebook
from tempfile import TemporaryDirectory
tempdir = TemporaryDirectory()
%cd $tempdir.name
# download archive with example data
!curl -LO http://majiq.biociphers.org/data/majiq_het_vignette.zip
# unzip and show what files are in the archive
!unzip -qn majiq_het_vignette.zip
/tmp/tmpucqll51w
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   240  100   240    0     0  13331      0 --:--:-- --:--:-- --:--:-- 14117
100 52.6M  100 52.6M    0     0  19.8M      0  0:00:02  0:00:02 --:--:-- 18.7M
[2]:
ls
bam/            gff3/    majiq_het_vignette.zip  settings.ini
build-info.tsv  images/  output/                 vignette.ipynb

What are our inputs?

We selected RNA-seq experiments corresponding to SRSF4 knockdown (by CRISPR) vs no-target controls in K562 and HepG2 cells from ENCODE:

cell_line

knockdown

encode_accession

prefix

HepG2

SRSF4

ENCSR471INA

ENCSR471INA.rep1

HepG2

SRSF4

ENCSR471INA

ENCSR471INA.rep2

HepG2

SRSF4

ENCSR929JKA

ENCSR929JKA.rep1

HepG2

SRSF4

ENCSR929JKA

ENCSR929JKA.rep2

HepG2

notarget

ENCSR105NML

ENCSR105NML.rep1

HepG2

notarget

ENCSR194SPW

ENCSR194SPW.rep2

HepG2

notarget

ENCSR805TIZ

ENCSR805TIZ.rep1

HepG2

notarget

ENCSR805TIZ

ENCSR805TIZ.rep2

HepG2

notarget

ENCSR853AOV

ENCSR853AOV.rep1

HepG2

notarget

ENCSR853AOV

ENCSR853AOV.rep2

K562

SRSF4

ENCSR905NIK

ENCSR905NIK.rep1

K562

SRSF4

ENCSR905NIK

ENCSR905NIK.rep2

K562

SRSF4

ENCSR939BTN

ENCSR939BTN.rep1

K562

SRSF4

ENCSR939BTN

ENCSR939BTN.rep2

K562

notarget

ENCSR005JBU

ENCSR005JBU.rep2

K562

notarget

ENCSR015KGT

ENCSR015KGT.rep1

K562

notarget

ENCSR163JUC

ENCSR163JUC.rep2

K562

notarget

ENCSR772DUM

ENCSR772DUM.rep1

K562

notarget

ENCSR772DUM

ENCSR772DUM.rep2

We downloaded fastqs for these samples, aligned them to GRCh38, and took the subset of aligned reads corresponding to the following genes/genomic regions:

seqid

start

end

gene_name

gene_id

0

1

29147743

29181987

SRSF4

gene:ENSG00000116350

1

1

32179686

32198285

TXLNA

gene:ENSG00000084652

2

17

43170481

43211689

NBR1

gene:ENSG00000188554

3

19

5158495

5340803

PTPRS

gene:ENSG00000105426

4

21

42879644

42913304

NDUFV3

gene:ENSG00000160194

5

3

152243828

152465780

MBNL1

gene:ENSG00000152601

6

7

43608456

43729717

COA1

gene:ENSG00000106603

7

8

101177878

101206193

ZNF706

gene:ENSG00000120963

The resulting BAMs are located in the folder bam/:

[3]:
ls bam/*.bam
bam/ENCSR005JBU.rep2.subset.bam  bam/ENCSR805TIZ.rep2.subset.bam
bam/ENCSR015KGT.rep1.subset.bam  bam/ENCSR853AOV.rep1.subset.bam
bam/ENCSR105NML.rep1.subset.bam  bam/ENCSR853AOV.rep2.subset.bam
bam/ENCSR163JUC.rep2.subset.bam  bam/ENCSR905NIK.rep1.subset.bam
bam/ENCSR194SPW.rep2.subset.bam  bam/ENCSR905NIK.rep2.subset.bam
bam/ENCSR471INA.rep1.subset.bam  bam/ENCSR929JKA.rep1.subset.bam
bam/ENCSR471INA.rep2.subset.bam  bam/ENCSR929JKA.rep2.subset.bam
bam/ENCSR772DUM.rep1.subset.bam  bam/ENCSR939BTN.rep1.subset.bam
bam/ENCSR772DUM.rep2.subset.bam  bam/ENCSR939BTN.rep2.subset.bam
bam/ENCSR805TIZ.rep1.subset.bam

Running the MAJIQ Builder

To analyze RNA splicing in these experiments we need to run the MAJIQ builder to:

  • define splicegraphs of annotated junctions and retained introns between gene exons,

  • obtain coverage over the splicegraph’s LSVs for each experiment.

The resulting coverage files can be used for quantification, and the splicegraph database can be used for downstream analysis and visualization.

In order for an LSV to have coverage generated for it in these output files, MAJIQ requires at least one of its junctions or retained introns to have sufficient evidence for it in its input experiments.

An individual experiment provides supporting evidence for a junction when it passes junction filters: --minpos and either --minreads (annotated junctions) or --min-denovo (de novo junctions). For retained introns, when it passes intron filters: --irnbins and --min-intronic-cov. See majiq build --help for details on these parameters.

Support for a junction or retained intron from an individual experiment by itself is not necessarily enough. MAJIQ partitions the set of input experiments into independent build groups. A junction or retained intron is passed (considered to have enough evidence) when at least --min-experiments experiments from the same build group pass per-experiment filters in at least one build group.

The MAJIQ builder uses a configuration file to specify where the input experiments are located and how they are partitioned into build groups. Global information about experiments are specified in an [info] section:

[info]
# species/reference name for UCSC links by VOILA
genome=GRCh38
# relative or absolute path to directories (comma-separated) with bam files
bamdirs=bam/
# these ENCODE experiments are all reverse stranded
strandness=reverse

The build groups and experiments which belong to them are specified in an [experiments] section. Build groups are keys with values listing experiment names (prefixes to BAM files excluding the .bam extension) separated by commas. If we wanted to focus only on junctions and retained introns found in all experiments for a single cell-type/knockdown, we could organize this section like (setting --min-experiments to reflect how many experiments are necessary to pass):

[experiments]
HepG2_NT=ENCSR805TIZ.rep1.subset,ENCSR805TIZ.rep2.subset,ENCSR853AOV.rep1.subset,ENCSR853AOV.rep2.subset,ENCSR105NML.rep1.subset,ENCSR194SPW.rep2.subset
HepG2_SRSF4=ENCSR929JKA.rep1.subset,ENCSR929JKA.rep2.subset,ENCSR471INA.rep1.subset,ENCSR471INA.rep2.subset
K562_NT=ENCSR005JBU.rep2.subset,ENCSR772DUM.rep1.subset,ENCSR772DUM.rep2.subset,ENCSR015KGT.rep1.subset,ENCSR163JUC.rep2.subset
K562_SRSF4=ENCSR939BTN.rep1.subset,ENCSR939BTN.rep2.subset,ENCSR905NIK.rep1.subset,ENCSR905NIK.rep2.subset

If any experiment is sufficient, we can instead have a build group for each experiment:

[experiments]
ENCSR805TIZ.rep1.subset=ENCSR805TIZ.rep1.subset
ENCSR805TIZ.rep2.subset=ENCSR805TIZ.rep2.subset
# {... all of the other experiments ...}
ENCSR905NIK.rep1.subset=ENCSR905NIK.rep1.subset
ENCSR905NIK.rep2.subset=ENCSR905NIK.rep2.subset

Equivalently, we could set --min-experiments to 1.

Another reasonable possibility would be to group by accessions, which would result in the configuration in settings.ini.

For this analysis, we are interested in considering heterogeneity between individual experiments, so we’ll use settings.ini, but with--min-experiments set to 1. We specify these flags to the builder using -c settings.ini --min-experiments 1.

In addition to these input experiments/configuration file, MAJIQ requires a GFF3 describing annotated transcripts. The subset of Ensembl GRCh38 v94 annotations corresponding to our selected genes can be found at gff3/ensembl.Homo_sapiens.GRCh38.94.subset.gff3.gz. This is passed to the builder directly as a positional argument.

Finally, we must specify the builder’s output directory using a required flag. To save its output to output/build, this would be -o output/build.

Put together, we run the builder as follows:

[4]:
%%bash
majiq build \
    -o output/build \
    -c settings.ini \
    gff3/ensembl.Homo_sapiens.GRCh38.94.subset.gff3.gz \
    --min-experiments 1 \
    2> /dev/null

This produces output files:

  • splicegraph.sql (splicegraph database for use with VOILA)

  • majiq.log (logging output from majiq build)

  • {experiment}.sj (intermediate files that can be used instead of bams for future runs of majiq build)

  • {experiment}.majiq (inputs for majiq quantifiers)

[5]:
ls output/build
ENCSR005JBU.rep2.subset.majiq  ENCSR805TIZ.rep2.subset.majiq
ENCSR005JBU.rep2.subset.sj     ENCSR805TIZ.rep2.subset.sj
ENCSR015KGT.rep1.subset.majiq  ENCSR853AOV.rep1.subset.majiq
ENCSR015KGT.rep1.subset.sj     ENCSR853AOV.rep1.subset.sj
ENCSR105NML.rep1.subset.majiq  ENCSR853AOV.rep2.subset.majiq
ENCSR105NML.rep1.subset.sj     ENCSR853AOV.rep2.subset.sj
ENCSR163JUC.rep2.subset.majiq  ENCSR905NIK.rep1.subset.majiq
ENCSR163JUC.rep2.subset.sj     ENCSR905NIK.rep1.subset.sj
ENCSR194SPW.rep2.subset.majiq  ENCSR905NIK.rep2.subset.majiq
ENCSR194SPW.rep2.subset.sj     ENCSR905NIK.rep2.subset.sj
ENCSR471INA.rep1.subset.majiq  ENCSR929JKA.rep1.subset.majiq
ENCSR471INA.rep1.subset.sj     ENCSR929JKA.rep1.subset.sj
ENCSR471INA.rep2.subset.majiq  ENCSR929JKA.rep2.subset.majiq
ENCSR471INA.rep2.subset.sj     ENCSR929JKA.rep2.subset.sj
ENCSR772DUM.rep1.subset.majiq  ENCSR939BTN.rep1.subset.majiq
ENCSR772DUM.rep1.subset.sj     ENCSR939BTN.rep1.subset.sj
ENCSR772DUM.rep2.subset.majiq  ENCSR939BTN.rep2.subset.majiq
ENCSR772DUM.rep2.subset.sj     ENCSR939BTN.rep2.subset.sj
ENCSR805TIZ.rep1.subset.majiq  majiq.log
ENCSR805TIZ.rep1.subset.sj     splicegraph.sql

Running the MAJIQ quantifiers

We quantify and test for differences in PSI between independent samples from the different cell lines/knockdowns using majiq heterogen.

[6]:
%%bash
majiq heterogen -o output/heterogen/ -n K562_SRSF4KD HepG2_SRSF4KD \
    -grp1 output/build/ENCSR{939BTN,905NIK}*.majiq \
    -grp2 output/build/ENCSR{929JKA,471INA}*.majiq \
    2> /dev/null
[7]:
%%bash
majiq heterogen -o output/heterogen/ -n K562_SRSF4KD K562_NT \
    -grp1 output/build/ENCSR{939BTN,905NIK}*.majiq \
    -grp2 output/build/ENCSR{005JBU,772DUM,015KGT,163JUC}*.majiq \
    2> /dev/null
[8]:
%%bash
majiq heterogen -o output/heterogen/ -n K562_SRSF4KD HepG2_NT \
    -grp1 output/build/ENCSR{939BTN,905NIK}*.majiq \
    -grp2 output/build/ENCSR{805TIZ,853AOV,105NML,194SPW}*.majiq \
    2> /dev/null
[9]:
%%bash
majiq heterogen -o output/heterogen/ -n HepG2_SRSF4KD K562_NT \
    -grp1 output/build/ENCSR{929JKA,471INA}*.majiq \
    -grp2 output/build/ENCSR{005JBU,772DUM,015KGT,163JUC}*.majiq \
    2> /dev/null
[10]:
%%bash
majiq heterogen -o output/heterogen/ -n HepG2_SRSF4KD HepG2_NT \
    -grp1 output/build/ENCSR{929JKA,471INA}*.majiq \
    -grp2 output/build/ENCSR{805TIZ,853AOV,105NML,194SPW}*.majiq \
    2> /dev/null
[11]:
%%bash
majiq heterogen -o output/heterogen/ -n K562_NT HepG2_NT \
    -grp1 output/build/ENCSR{005JBU,772DUM,015KGT,163JUC}*.majiq \
    -grp2 output/build/ENCSR{805TIZ,853AOV,105NML,194SPW}*.majiq \
    2> /dev/null

These commands produce output to the directory specified (-o output/heterogen):

[12]:
ls output/heterogen
HepG2_SRSF4KD-HepG2_NT.het.voila  K562_SRSF4KD-HepG2_NT.het.voila
HepG2_SRSF4KD-K562_NT.het.voila   K562_SRSF4KD-HepG2_SRSF4KD.het.voila
het_majiq.log                     K562_SRSF4KD-K562_NT.het.voila
K562_NT-HepG2_NT.het.voila

These commands do the heavy lifting of inference, sampling, and testing. However, we need to use VOILA with the splicegraph database to generate human-readable outputs, as described in the next section.

Running VOILA

VOILA has three modes for looking at the output of our quantifiers:

  • voila tsv: generate TSVs with quantified LSVs and associated statistics

  • voila modulize: generate TSVs which differentiate each specific type of splicing variation

  • voila view: run web service for interactively visualizing quantifications, statistics and their relationship to genes’ splicegraphs.

Generating TSV output

We can generate a TSV combining the LSVs quantified in all 6 pairs of comparisons. We:

  • specify the splicegraph output/build/splicegraph.sql and voila files output/heterogen/*.het.voila as positional arguments,

  • specify the path for our output TSV by the argument -f output/voila_tsv/all-pairs-het.tsv,

  • have VOILA save all quantified LSVs, not only with those called as changing using --show-all (alternatively, run voila view --help to set thresholds for changing events).

This leads to the following command:

[13]:
ls output/heterogen/tmp5sw2q083/
ls: cannot access 'output/heterogen/tmp5sw2q083/': No such file or directory
[14]:
%%bash
voila tsv \
    output/build/splicegraph.sql output/heterogen/*.het.voila \
    --show-all \
    -f output/voila_tsv/all-pairs-het.tsv
2023-10-11 12:46:03,016 (PID:44528) - INFO - Command: /home/sjewell/PycharmProjects/majiq/env_3.10/bin/voila tsv output/build/splicegraph.sql output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila output/heterogen/HepG2_SRSF4KD-K562_NT.het.voila output/heterogen/K562_NT-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_SRSF4KD.het.voila output/heterogen/K562_SRSF4KD-K562_NT.het.voila --show-all -f output/voila_tsv/all-pairs-het.tsv
2023-10-11 12:46:03,016 (PID:44528) - INFO - Voila v2.42.dev285+gcfe719c7.d20231011
2023-10-11 12:46:03,016 (PID:44528) - INFO - config file: /tmp/tmpb4gclvpb
2023-10-11 12:46:03,026 (PID:44528) - INFO - heterogen TSV
2023-10-11 12:46:03,032 (PID:44528) - INFO - Showing all results and ignoring all filters.
2023-10-11 12:46:03,060 (PID:44528) - INFO - Creating Tab-delimited output file
2023-10-11 12:46:07,577 (PID:44528) - INFO - Delimited output file successfully created in: output/voila_tsv/all-pairs-het.tsv
2023-10-11 12:46:07,585 (PID:44528) - INFO - Duration: 0:00:04.552941

Let’s look at what these outputs look like:

[15]:
# print commented-out header
with open("output/voila_tsv/all-pairs-het.tsv", "r") as handle:
    for line in handle:
        if not line.startswith("#"):
            break
        print(line, end="")
# {
#     "command": "/home/sjewell/PycharmProjects/majiq/env_3.10/bin/voila tsv output/build/splicegraph.sql output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila output/heterogen/HepG2_SRSF4KD-K562_NT.het.voila output/heterogen/K562_NT-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_SRSF4KD.het.voila output/heterogen/K562_SRSF4KD-K562_NT.het.voila --show-all -f output/voila_tsv/all-pairs-het.tsv",
#     "group_sizes": {
#         "HepG2_NT": 6,
#         "HepG2_SRSF4KD": 4,
#         "K562_NT": 5,
#         "K562_SRSF4KD": 4
#     },
#     "psi_samples": "100",
#     "stat_names": [
#         "TNOM",
#         "TTEST",
#         "WILCOXON"
#     ],
#     "test_percentile": "0.95",
#     "voila_version": "2.42.dev285+gcfe719c7.d20231011"
# }

First, there is a commented-out header with metadata (in JSON format, after excluding the comment characters at the start of each line):

  • command: the call to voila tsv that generated the TSV. This helps indicate which thresholds were set, etc.,

  • group_sizes: the unique group names compared in the input voila files and the number of experiments included in each group,

  • psi_samples, test_percentile: parameters used by majiq heterogen for computing test statistics not only on PSI posterior means but on repeated samples from PSI posterior distributions (more detailed explanation below when discussing column names),

  • stat_names: the different statistical tests used to evaluate significance of differences between observations from groups,

  • voila_version: the version of VOILA that was used.

The most important consideration here are the stat_names: TNOM, TTEST, WILCOXON. By default, MAJIQ performs these three statistical tests (all but INFOSCORE) on each junction/retained intron from each LSV. TTEST corresponds to a independent two-sample t-test with unequal variances (Welch’s t-test). WILCOXON corresponds to an independent two-sample Mann-Whitney U test. INFOSCORE and TNOM correspond to the test statistics as originally implemented in the ScoreGenes package. INFOSCORE is excluded from defaults because it runs slower than the other tests with larger number of experiments.

MAJIQ can be configured to run only a subset of these tests by setting the --stats argument to the desired subset of these test-statistics. See {ref}what statistic should I use <statistics> for more opinionated guidance about picking which statistics to use.

[16]:
# print first uncommented line (column names)
with open("output/voila_tsv/all-pairs-het.tsv", "r") as handle:
    for line in handle:
        if line.startswith("#"):
            continue
        print(line, end="")
        break
gene_name       gene_id lsv_id  lsv_type        strand  seqid   HepG2_SRSF4KD_median_psi        HepG2_NT_median_psi     K562_NT_median_psi      K562_SRSF4KD_median_psi HepG2_SRSF4KD_percentile25_psi  HepG2_SRSF4KD_percentile75_psi  HepG2_NT_percentile25_psi       HepG2_NT_percentile75_psi       K562_NT_percentile25_psi        K562_NT_percentile75_psi        K562_SRSF4KD_percentile25_psi   K562_SRSF4KD_percentile75_psi   HepG2_SRSF4KD_num_quantified    HepG2_NT_num_quantified K562_NT_num_quantified  K562_SRSF4KD_num_quantified     HepG2_SRSF4KD-HepG2_NT TNOM     HepG2_SRSF4KD-HepG2_NT TTEST    HepG2_SRSF4KD-HepG2_NT WILCOXON HepG2_SRSF4KD-K562_NT TNOM      HepG2_SRSF4KD-K562_NT TTEST     HepG2_SRSF4KD-K562_NT WILCOXON  K562_NT-HepG2_NT TNOM   K562_NT-HepG2_NT TTEST  K562_NT-HepG2_NT WILCOXON       K562_SRSF4KD-HepG2_NT TNOM      K562_SRSF4KD-HepG2_NT TTEST     K562_SRSF4KD-HepG2_NT WILCOXON  K562_SRSF4KD-HepG2_SRSF4KD TNOM K562_SRSF4KD-HepG2_SRSF4KD TTEST        K562_SRSF4KD-HepG2_SRSF4KD WILCOXON     K562_SRSF4KD-K562_NT TNOM       K562_SRSF4KD-K562_NT TTEST      K562_SRSF4KD-K562_NT WILCOXON   HepG2_SRSF4KD-HepG2_NT TNOM_quantile    HepG2_SRSF4KD-HepG2_NT TTEST_quantile   HepG2_SRSF4KD-HepG2_NT WILCOXON_quantile        HepG2_SRSF4KD-K562_NT TNOM_quantile     HepG2_SRSF4KD-K562_NT TTEST_quantile    HepG2_SRSF4KD-K562_NT WILCOXON_quantile K562_NT-HepG2_NT TNOM_quantile  K562_NT-HepG2_NT TTEST_quantile K562_NT-HepG2_NT WILCOXON_quantile      K562_SRSF4KD-HepG2_NT TNOM_quantile     K562_SRSF4KD-HepG2_NT TTEST_quantile    K562_SRSF4KD-HepG2_NT WILCOXON_quantile K562_SRSF4KD-HepG2_SRSF4KD TNOM_quantile        K562_SRSF4KD-HepG2_SRSF4KD TTEST_quantile       K562_SRSF4KD-HepG2_SRSF4KD WILCOXON_quantile    K562_SRSF4KD-K562_NT TNOM_quantile      K562_SRSF4KD-K562_NT TTEST_quantile     K562_SRSF4KD-K562_NT WILCOXON_quantile  HepG2_SRSF4KD-HepG2_NT tnom_score       HepG2_SRSF4KD-K562_NT tnom_score        K562_NT-HepG2_NT tnom_score     K562_SRSF4KD-HepG2_NT tnom_score        K562_SRSF4KD-HepG2_SRSF4KD tnom_score   K562_SRSF4KD-K562_NT tnom_score HepG2_SRSF4KD-HepG2_NT changing HepG2_SRSF4KD-K562_NT changing  K562_NT-HepG2_NT changing       K562_SRSF4KD-HepG2_NT changing  K562_SRSF4KD-HepG2_SRSF4KD changing     K562_SRSF4KD-K562_NT changing   HepG2_SRSF4KD-HepG2_NT nonchanging      HepG2_SRSF4KD-K562_NT nonchanging       K562_NT-HepG2_NT nonchanging    K562_SRSF4KD-HepG2_NT nonchanging       K562_SRSF4KD-HepG2_SRSF4KD nonchanging  K562_SRSF4KD-K562_NT nonchanging        num_junctions   num_exons       de_novo_junctions       junctions_coords        exons_coords    ir_coords       ucsc_lsv_link

Second, our first uncommented line in the TSV file indicates the column names for the table. Of note:

The 25th, 50th, and 75th percentiles of PSI for each group are indicated by the columns {group_name}_{percentile25,median,percentile75}_psi

The results of the statistical tests between groups are indicated by columns {grp1}-{grp2} {test_name}{,_quantile}. When the groups compared are unambiguous (only one VOILA file as input), {grp1}_{grp2} is excluded from the column name. When the TNOM statistical test was run, we also report {grp1}_{grp2} tnom_score. The special case is tnom_score, which reports the raw test-statistic – the minimum total number of mistakes when picking a single threshold on PSI to classify the observations back into the two groups. Otherwise, the statistical tests correspond to p-values from each test. For each test, we report {test_name} and {test_name}_quantile. The columns {test_name} (no quantile) correspond to statistical tests on the posterior means of each sample. These p-values are calibrated on an individual basis. In some cases, we can benefit from considering the uncertainty encoded in the full posterior distributions by performing tests on samples from the posterior distribution. Then, we can consider a credible range of test statistics from each comparison that we might expect. By default, majiq heterogen takes 100 samples from each posterior distribution and reports the 95th percentile p-value from the tests comparing each of the posterior samples (change using --psi-samples and --test_percentile flags). The {test_name}_quantile column thus indicates the selected percentile of sampled p-values, which is generally more conservative and not calibrated as a p-value in its own right.

There are also columns with the format {grp1}-{grp2} ?(non)changing. These indicate whether each junction/retained intron in the LSV met VOILA’s thresholds for being changing or nonchanging. By default, VOILA calls a junction as changing between groups if:

  • all test statistics have nominal p-value less than 0.05 (change with --changing-pvalue-threshold flag)

  • difference in group median PSI of at least 20% (change with --changing-between-group-dpsi flag).

Similarly, by default, VOILA calls a junction as non-changing between groups if:

  • all test statistics have nominal p-value greater than 0.05 (change with --non-changing-pvalue-threshold flag)

  • within-group interquartile range of PSI less than 10% in both groups (change with --non-changing-within-group-IQR flag)

  • difference in group median PSI of at most 5% (change with --non-changing-between-group-dpsi flag).

In some cases, for a certain quantifier run, due to low coverage or thresholding, a certain junction may not be quantified in a certain experimental group. This possibility is indicated through the {grp} num_quantified columns. In the majority of cases, the number here will match the value of num_junctions, but when fewer junctions are quantified in that group, the value here will reflect a lower sum.

[17]:
# print first line of data
with open("output/voila_tsv/all-pairs-het.tsv", "r") as handle:
    for line in handle:
        if line.startswith("#"):
            continue
        break  # don't print first uncommented line (column names)
    # print the next line (first row of data)
    print(next(handle), end="")
ZNF706  gene:ENSG00000120963    gene:ENSG00000120963:s:101204650-101204779      s|1e1.2o2|1e2.1o1|i     -       8       0.0801;0.0479;0.8666    NA      0.0534;0.0151;0.9143    0.0855;0.0290;0.8511    7.708e-02;3.677e-02;8.538e-01   8.310e-02;5.904e-02;8.795e-01   NA      NA      5.314e-02;7.576e-03;8.880e-01   9.191e-02;3.825e-02;9.345e-01   6.541e-02;9.534e-03;7.970e-01   1.210e-01;8.175e-02;8.914e-01   2       0       5       4                               1.000e+00;1.000e+00;1.000e+00   7.341e-01;9.087e-01;9.355e-01   8.571e-01;5.714e-01;5.714e-01                                                   1.000e+00;1.000e+00;1.000e+00   5.047e-01;7.792e-01;5.303e-01   1.000e+00;8.000e-01;8.000e-01   5.635e-01;9.921e-01;5.635e-01   4.326e-01;8.750e-01;5.854e-01   2.857e-01;1.000e+00;4.127e-01                           1.000e+00;1.000e+00;1.000e+00   8.957e-01;9.323e-01;9.582e-01   1.000e+00;1.000e+00;1.000e+00                                                   1.000e+00;1.000e+00;1.000e+00   9.408e-01;9.585e-01;9.252e-01   1.000e+00;1.000e+00;1.000e+00   9.921e-01;9.921e-01;9.921e-01   9.604e-01;9.714e-01;9.578e-01   1.000e+00;1.000e+00;1.000e+00           1.000e+00;1.000e+00;1.000e+00                   1.000e+00;1.000e+00;1.000e+00   2.500e+00;3.000e+00;2.000e+00           False;False;False                       False;False;False       False;False;False               True;True;True                  True;True;True  True;True;False 3       3       0;0;0   101202447-101204650;101201743-101204650;101202453-101204649     101201716-101201743;101202333-101202452;101204650-101204779     101202453-101204649     http://genome.ucsc.edu/cgi-bin/hgTracks?db=GRCh38&position=8%3A101204779-101201716

Finally, the remaining lines in the TSV are for the data in the table. Each row corresponds to a quantified LSV. For the quantification, test, etc. columns, the values for each junction/retained intron in the LSV are delimited by semicolons. Note that if computation is desired on the individual values, it can be convenient to “explode” the data to rows per LSV connection. In pandas, this can be done by using pd.Series.str.split(";") on appropriate columns followed by pd.DataFrame.explode(). In R, this can be done by using tidyr::separate_rows().

Categorizing splicing events with voila modulizer

We can also generate summaries of categorizations of splicing events with voila modulize. The purpose of this tool is not only to summarize classical events, but also a variety of complex events as well as determine common combinations of two or more events.

voila-modulizer

You may run the modulizer in the following manner:

[18]:
%%bash
voila modulize output/build/splicegraph.sql output/heterogen/*.het.voila -d output/modulized --overwrite
2023-10-11 12:46:08,649 (PID:44628) - INFO - Command: /home/sjewell/PycharmProjects/majiq/env_3.10/bin/voila modulize output/build/splicegraph.sql output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila output/heterogen/HepG2_SRSF4KD-K562_NT.het.voila output/heterogen/K562_NT-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_SRSF4KD.het.voila output/heterogen/K562_SRSF4KD-K562_NT.het.voila -d output/modulized --overwrite
2023-10-11 12:46:08,649 (PID:44628) - INFO - Voila v2.42.dev285+gcfe719c7.d20231011
2023-10-11 12:46:08,649 (PID:44628) - INFO - config file: /tmp/tmp68pqwnnk
2023-10-11 12:46:08,664 (PID:44628) - INFO - HETx6 MODULIZE
2023-10-11 12:46:08,672 (PID:44628) - INFO - Modulizing 8 gene(s)
2023-10-11 12:46:08,672 (PID:44628) - INFO - Quantifications based on 6 input file(s)
2023-10-11 12:46:08,672 (PID:44628) - INFO - ╔═══════════════════════════════════════════════════════════════╗
2023-10-11 12:46:08,672 (PID:44628) - INFO - ╠╡ Before Modulization:                                         ║
2023-10-11 12:46:08,672 (PID:44628) - INFO - ║     Dropping junctions with max(PSI) < 0.050                  ║
2023-10-11 12:46:08,672 (PID:44628) - INFO - ║     Dropping junctions with max(number_of_reads) < 1          ║
2023-10-11 12:46:08,672 (PID:44628) - INFO - ╠╡ After Modulization:                                          ║
2023-10-11 12:46:08,672 (PID:44628) - INFO - ║     Dropping modules if none(max(E(|dPSI|)) >= 0.200)         ║
2023-10-11 12:46:08,672 (PID:44628) - INFO - ║     Dropping modules if none(max(P(|dPSI|>=0.200)) >= 0.950)  ║
2023-10-11 12:46:08,672 (PID:44628) - INFO - ╚═══════════════════════════════════════════════════════════════╝
2023-10-11 12:46:08,673 (PID:44628) - INFO - Writing TSVs to /tmp/tmpucqll51w/output/modulized
2023-10-11 12:46:12,725 (PID:44628) - INFO - Concatenating Results
2023-10-11 12:46:12,810 (PID:44628) - INFO - Modulization Complete

Note that the modulizer can run with any combination of the standard .psi.voila or .deltapsi.voila files in addition to our het.voila files. Each additional piece of data for the same LSV/junction will be recorded in the output row in the modulized output.

The modulized output will be recorded in a number of files in .tsv format.

[19]:
ls output/modulized
alt3and5prime.tsv         heatmap.tsv              p_alt5prime.tsv
alt3prime.tsv             junctions.tsv            p_alternate_first_exon.tsv
alt5prime.tsv             multi_exon_spanning.tsv  p_alternate_last_exon.tsv
alternate_first_exon.tsv  mutually_exclusive.tsv   summary.tsv
alternate_last_exon.tsv   orphan_junction.tsv      tandem_cassette.tsv
alternative_intron.tsv    other.tsv
cassette.tsv              p_alt3prime.tsv

They are:

  • summary.tsv: Likely the most immediately useful for the majority of users. Each row corresponds to a splicing module, and lists the counts of each type of splicing event found inside the module.

  • heatmap.tsv: For each module, a junction of greatest biological importance is considered and quantified here, one row per module. The method/parameters for choosing which junction is selected can be changed, see voila modulize --help for details.

  • event tsv files, ex: cassette.tsv: Contains quantification information across all input files for each junction making up that specific splicing event.

  • other.tsv: for any junctions not assigned to ANY splicing event, they are dumped here.

  • junctions.tsv: A concatenation of all junction data from all event tsv files as well as other.tsv.

Please see VOILA modulizer documentation for more details on the modulizer.

Interactively visualizing results with voila view

We can also view results interactively in our web browser using voila view. We can start the web service by running

voila view output/build/splicegraph.sql output/heterogen/*.het.voila

By default, the web service is hosted on a random port for localhost (--host and -p flags can be used to specify directly). The console will print a URL that can be accessed from the same machine within a web browser. For example, when we run the above command, we see:

2021-09-01 01:08:53,380 (PID:5613) - INFO - Command: /{redacted}/bin/voila view output/build/splicegraph.sql output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila output/heterogen/HepG2_SRSF4KD-K562_NT.het.voila output/heterogen/K562_NT-HepG2_NT.het.voila output/heteroge
n/K562_SRSF4KD-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_SRSF4KD.het.voila output/heterogen/K562_SRSF4KD-K562_NT.het.voila
2021-09-01 01:08:53,380 (PID:5613) - INFO - Voila v2.2.0
2021-09-01 01:08:53,380 (PID:5613) - INFO - config file: /tmp/tmpixnu75yk
2021-09-01 01:08:53,395 (PID:5613) - INFO - Creating index: /{redacted}/output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila
Indexing LSV IDs: 0 / 94
2021-09-01 01:08:55,510 (PID:5613) - INFO - Writing index: /{redacted}/output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila
Serving on http://localhost:45899

So we open up http://localhost:45899 in our web browser. VOILA view first opens up to an index over genes/LSVs:

voila index

We can browse through pages of genes/LSV diagrams as convenient, or we can also use the search bar in the top-right hand side to search for a particular gene (or LSV). Once we’ve found a gene of interest, we click the link in the “Gene Name” column to view details for that gene.

For example, clicking into NDUFV3 gives us a screen that looks like:

voila-ndufv3

The top portion of the screen corresponds to splicegraphs. These indicate exon numbers and annotated vs denovo exons, junctions, and retained introns, and observed raw read coverage for each intron or junction. Which experiments have coverage displayed can be modified by clicking on the “SpliceGraph” button, and adding or removing additional experiments to display. The magnification buttons next to the legend allow zooming in or out on the splicegraph, and the wand button changes whether intronic regions are squashed or displayed to scale.

The bottom portion of the screen corresponds to quantified LSVs. Each LSV has three columns:

  1. The first column indicates the genomic coordinates of the junction or retained intron.

  2. The second column shows violin plots of the averaged posterior distributions per group. Individual points show the posterior means for PSI for each experiment per group. Hovering over points displays a tooltip indicating the experiment name.

  3. The third column shows a heat map showing group-vs-group comparisons. Below the diagonal, the heatmap displays observed p-values for each comparison. The specific test displayed can be changed using the radio buttons in the top-right corner. Above the diagonal, the heatmap shows the difference in PSI between the medians of each group.

We can see the quantified values of PSI for each experiment per group in the violin plots.

voila view as a multi-user web server

voila view was also designed with sharing and collaboration in mind. So, in addition to being used locally, it is possible to set it up as its own web server host.

Note that accessibility of the service within local networks or the internet in general can be highly dependent on network settings and security practices (e.g. port forwarding, IP {white,black}lists, etc.) outside the scope of MAJIQ/VOILA itself; You can consult with your system administrator for advice if you want to make voila view available outside of the local machine / for multiple users.

  • voila view has a --host parameter, which works similarly to --host parameters in most other unix-like programs. That is, it sets the bind address. For the most common use case, to listen to all addresses, simply pass --host 0.0.0.0. (The default is --host 127.0.0.1, which listens only to your local machine)

  • for security against port-scanners, voila view provides a --enable-passcode switch. This generates a unique passcode in the link which is needed for users to successfully view the output. Recommended for use when hosting sensitive data with the tool.

  • voila view has three web server options to cover all needs. you can choose flask, waitress, or gunicorn for as a parameter to the --web-server switch. flask is insecure and slow, and should only be used if you are trying to resolve a bizarre error, as it generates more debugging information. waitress is the default. It is efficient, and suitable for most single-user use cases. However, it is single-threaded. gunicorn is a production-ready, performant web server used by many projects. It can serve many users/page requests concurrently. It may be helpful in some cases to speed up access even for a single user if loading very large numbers of LSVs in a gene at once. You can set the number of threads to host with the --num-web-workers argument. Safe to use with nginx proxy-pass as well.

Filtering results

The output files used by VOILA (splicegraph.sql, *.voila) can be very large for real analyses. voila filter allows filtering these files to reduce their filesize to facilitate sharing (or downloading from a cluster environment).

We:

  • specify the splicegraph output/build/splicegraph.sql and voila files output/heterogen/*.het.voila as positional arguments,

  • specify the output directory for the filtered files with the required argument --directory output/voila_filter

  • specify the set of genes or LSVs to keep after filtering either directly (--gene-ids, --lsv-ids) or by specifying a file with IDs to keep (--gene-ids-file, --lsv-ids-file): --gene-ids 'gene:ENSG00000106603'.

This leads to the command:

[20]:
%%bash
voila filter output/build/splicegraph.sql output/heterogen/*.het.voila \
    --directory output/voila_filter --gene-ids 'gene:ENSG00000106603' --overwrite
2023-10-11 12:46:13,977 (PID:44710) - INFO - Command: /home/sjewell/PycharmProjects/majiq/env_3.10/bin/voila filter output/build/splicegraph.sql output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila output/heterogen/HepG2_SRSF4KD-K562_NT.het.voila output/heterogen/K562_NT-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_SRSF4KD.het.voila output/heterogen/K562_SRSF4KD-K562_NT.het.voila --directory output/voila_filter --gene-ids gene:ENSG00000106603 --overwrite
2023-10-11 12:46:13,977 (PID:44710) - INFO - Voila v2.42.dev285+gcfe719c7.d20231011
2023-10-11 12:46:13,977 (PID:44710) - INFO - config file: /tmp/tmpmrp3ep9r
2023-10-11 12:46:13,989 (PID:44710) - INFO - HETx6 FILTER
2023-10-11 12:46:13,989 (PID:44710) - INFO - Filtering input to 1 gene(s)
2023-10-11 12:46:13,989 (PID:44710) - INFO - One splicegraph and 6 voila files
2023-10-11 12:46:13,989 (PID:44710) - INFO - Writing output files to /tmp/tmpucqll51w/output/voila_filter
2023-10-11 12:46:14,023 (PID:44740) - INFO - Started Filtering /tmp/tmpucqll51w/output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila
2023-10-11 12:46:14,024 (PID:44742) - INFO - Started Filtering /tmp/tmpucqll51w/output/heterogen/HepG2_SRSF4KD-K562_NT.het.voila
2023-10-11 12:46:14,024 (PID:44744) - INFO - Started Filtering /tmp/tmpucqll51w/output/heterogen/K562_NT-HepG2_NT.het.voila
2023-10-11 12:46:14,024 (PID:44746) - INFO - Started Filtering /tmp/tmpucqll51w/output/heterogen/K562_SRSF4KD-HepG2_NT.het.voila
2023-10-11 12:46:14,025 (PID:44750) - INFO - Started Filtering /tmp/tmpucqll51w/output/heterogen/K562_SRSF4KD-K562_NT.het.voila
2023-10-11 12:46:14,024 (PID:44748) - INFO - Started Filtering /tmp/tmpucqll51w/output/heterogen/K562_SRSF4KD-HepG2_SRSF4KD.het.voila
2023-10-11 12:46:14,031 (PID:44744) - INFO - Finished writing output/voila_filter/K562_NT-HepG2_NT.het.voila
2023-10-11 12:46:14,032 (PID:44746) - INFO - Finished writing output/voila_filter/K562_SRSF4KD-HepG2_NT.het.voila
2023-10-11 12:46:14,032 (PID:44742) - INFO - Finished writing output/voila_filter/HepG2_SRSF4KD-K562_NT.het.voila
2023-10-11 12:46:14,033 (PID:44740) - INFO - Finished writing output/voila_filter/HepG2_SRSF4KD-HepG2_NT.het.voila
2023-10-11 12:46:14,033 (PID:44750) - INFO - Finished writing output/voila_filter/K562_SRSF4KD-K562_NT.het.voila
2023-10-11 12:46:14,034 (PID:44748) - INFO - Finished writing output/voila_filter/K562_SRSF4KD-HepG2_SRSF4KD.het.voila
2023-10-11 12:46:15,027 (PID:44744) - INFO - Filtering Splicegraph
2023-10-11 12:46:15,197 (PID:44744) - INFO - Finished Filtering Splicegraph
2023-10-11 12:46:16,030 (PID:44710) - INFO - Filtering Complete