{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "844a6d50",
   "metadata": {},
   "source": [
    "# Using MAJIQ HET (ENCODE RBP knockdowns)\n",
    "\n",
    "We have created a [small example dataset](http://majiq.biociphers.org/data/majiq_het_vignette.zip)\n",
    "from a few select genes/samples from ENCODE in order to demonstrate how to use MAJIQ.\n",
    "First, download the example dataset files:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b99c2b84",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create temporary working directory for this notebook\n",
    "from tempfile import TemporaryDirectory\n",
    "tempdir = TemporaryDirectory()\n",
    "%cd $tempdir.name\n",
    "# download archive with example data\n",
    "!curl -LO http://majiq.biociphers.org/data/majiq_het_vignette.zip\n",
    "# unzip and show what files are in the archive\n",
    "!unzip -qn majiq_het_vignette.zip"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "da8eb592",
   "metadata": {},
   "outputs": [],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ace2e52",
   "metadata": {},
   "source": [
    "## What are our inputs?\n",
    "\n",
    "We selected RNA-seq experiments corresponding to _SRSF4_ knockdown (by CRISPR) vs no-target controls in K562 and HepG2 cells from ENCODE:\n",
    "\n",
    "| cell_line   | knockdown   | encode_accession   | prefix           |\n",
    "|:------------|:------------|:-------------------|:-----------------|\n",
    "| HepG2       | SRSF4       | ENCSR471INA        | ENCSR471INA.rep1 |\n",
    "| HepG2       | SRSF4       | ENCSR471INA        | ENCSR471INA.rep2 |\n",
    "| HepG2       | SRSF4       | ENCSR929JKA        | ENCSR929JKA.rep1 |\n",
    "| HepG2       | SRSF4       | ENCSR929JKA        | ENCSR929JKA.rep2 |\n",
    "|             |             |                    |                  |\n",
    "| HepG2       | notarget    | ENCSR105NML        | ENCSR105NML.rep1 |\n",
    "| HepG2       | notarget    | ENCSR194SPW        | ENCSR194SPW.rep2 |\n",
    "| HepG2       | notarget    | ENCSR805TIZ        | ENCSR805TIZ.rep1 |\n",
    "| HepG2       | notarget    | ENCSR805TIZ        | ENCSR805TIZ.rep2 |\n",
    "| HepG2       | notarget    | ENCSR853AOV        | ENCSR853AOV.rep1 |\n",
    "| HepG2       | notarget    | ENCSR853AOV        | ENCSR853AOV.rep2 |\n",
    "|             |             |                    |                  |\n",
    "| K562        | SRSF4       | ENCSR905NIK        | ENCSR905NIK.rep1 |\n",
    "| K562        | SRSF4       | ENCSR905NIK        | ENCSR905NIK.rep2 |\n",
    "| K562        | SRSF4       | ENCSR939BTN        | ENCSR939BTN.rep1 |\n",
    "| K562        | SRSF4       | ENCSR939BTN        | ENCSR939BTN.rep2 |\n",
    "|             |             |                    |                  |\n",
    "| K562        | notarget    | ENCSR005JBU        | ENCSR005JBU.rep2 |\n",
    "| K562        | notarget    | ENCSR015KGT        | ENCSR015KGT.rep1 |\n",
    "| K562        | notarget    | ENCSR163JUC        | ENCSR163JUC.rep2 |\n",
    "| K562        | notarget    | ENCSR772DUM        | ENCSR772DUM.rep1 |\n",
    "| K562        | notarget    | ENCSR772DUM        | ENCSR772DUM.rep2 |\n",
    "\n",
    "We downloaded fastqs for these samples, aligned them to GRCh38, and took the subset of aligned reads corresponding to the following genes/genomic regions:\n",
    "\n",
    "|    |   seqid |     start |       end | gene_name   | gene_id              |\n",
    "|---:|--------:|----------:|----------:|:------------|:---------------------|\n",
    "|  0 |       1 |  29147743 |  29181987 | SRSF4       | gene:ENSG00000116350 |\n",
    "|  1 |       1 |  32179686 |  32198285 | TXLNA       | gene:ENSG00000084652 |\n",
    "|  2 |      17 |  43170481 |  43211689 | NBR1        | gene:ENSG00000188554 |\n",
    "|  3 |      19 |   5158495 |   5340803 | PTPRS       | gene:ENSG00000105426 |\n",
    "|  4 |      21 |  42879644 |  42913304 | NDUFV3      | gene:ENSG00000160194 |\n",
    "|  5 |       3 | 152243828 | 152465780 | MBNL1       | gene:ENSG00000152601 |\n",
    "|  6 |       7 |  43608456 |  43729717 | COA1        | gene:ENSG00000106603 |\n",
    "|  7 |       8 | 101177878 | 101206193 | ZNF706      | gene:ENSG00000120963 |\n",
    "\n",
    "The resulting BAMs are located in the folder `bam/`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9486f12f",
   "metadata": {},
   "outputs": [],
   "source": [
    "ls bam/*.bam"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ecfa5895",
   "metadata": {},
   "source": [
    "## Running the MAJIQ Builder\n",
    "\n",
    "To analyze RNA splicing in these experiments we need to run the MAJIQ builder to:\n",
    "\n",
    "+ define splicegraphs of annotated junctions and retained introns between gene exons,\n",
    "+ obtain coverage over the splicegraph's LSVs for each experiment.\n",
    "\n",
    "The resulting coverage files can be used for quantification, and the splicegraph database can be used for downstream analysis and visualization.\n",
    "\n",
    "In order for an LSV to have coverage generated for it in these output files, MAJIQ requires at least one of its junctions or retained\n",
    "introns to have sufficient evidence for it in its input experiments.\n",
    "\n",
    "An individual experiment provides supporting evidence for a junction when it passes junction filters:\n",
    "`--minpos` and either `--minreads` (annotated junctions) or `--min-denovo` (_de novo_ junctions).\n",
    "For retained introns, when it passes intron filters: `--irnbins` and `--min-intronic-cov`.\n",
    "See `majiq build --help` for details on these parameters.\n",
    "\n",
    "Support for a junction or retained intron from an individual experiment by itself is not necessarily enough.\n",
    "MAJIQ partitions the set of input experiments into independent build groups.\n",
    "A junction or retained intron is passed (considered to have enough evidence) when at least `--min-experiments`\n",
    "experiments from the same build group pass per-experiment filters in at least one build group.\n",
    "\n",
    "The MAJIQ builder uses a configuration file to specify where the input experiments are located and how they are partitioned into build groups.\n",
    "Global information about experiments are specified in an `[info]` section:\n",
    "\n",
    "```\n",
    "[info]\n",
    "# species/reference name for UCSC links by VOILA\n",
    "genome=GRCh38\n",
    "# relative or absolute path to directories (comma-separated) with bam files\n",
    "bamdirs=bam/\n",
    "# these ENCODE experiments are all reverse stranded\n",
    "strandness=reverse\n",
    "```\n",
    "\n",
    "The build groups and experiments which belong to them are specified in an `[experiments]` section.\n",
    "Build groups are keys with values listing experiment names (prefixes to BAM files excluding the `.bam` extension) separated by commas.\n",
    "If we wanted to focus only on junctions and retained introns found in all experiments for a single cell-type/knockdown, we could organize this\n",
    "section like (setting `--min-experiments` to reflect how many experiments are necessary to pass):\n",
    "\n",
    "```\n",
    "[experiments]\n",
    "HepG2_NT=ENCSR805TIZ.rep1.subset,ENCSR805TIZ.rep2.subset,ENCSR853AOV.rep1.subset,ENCSR853AOV.rep2.subset,ENCSR105NML.rep1.subset,ENCSR194SPW.rep2.subset\n",
    "HepG2_SRSF4=ENCSR929JKA.rep1.subset,ENCSR929JKA.rep2.subset,ENCSR471INA.rep1.subset,ENCSR471INA.rep2.subset\n",
    "K562_NT=ENCSR005JBU.rep2.subset,ENCSR772DUM.rep1.subset,ENCSR772DUM.rep2.subset,ENCSR015KGT.rep1.subset,ENCSR163JUC.rep2.subset\n",
    "K562_SRSF4=ENCSR939BTN.rep1.subset,ENCSR939BTN.rep2.subset,ENCSR905NIK.rep1.subset,ENCSR905NIK.rep2.subset\n",
    "```\n",
    "\n",
    "If any experiment is sufficient, we can instead have a build group for each experiment:\n",
    "\n",
    "```\n",
    "[experiments]\n",
    "ENCSR805TIZ.rep1.subset=ENCSR805TIZ.rep1.subset\n",
    "ENCSR805TIZ.rep2.subset=ENCSR805TIZ.rep2.subset\n",
    "# {... all of the other experiments ...}\n",
    "ENCSR905NIK.rep1.subset=ENCSR905NIK.rep1.subset\n",
    "ENCSR905NIK.rep2.subset=ENCSR905NIK.rep2.subset\n",
    "```\n",
    "\n",
    "Equivalently, we could set `--min-experiments` to 1.\n",
    "\n",
    "Another reasonable possibility would be to group by accessions, which would result in the configuration in `settings.ini`.\n",
    "\n",
    "For this analysis, we are interested in considering heterogeneity between individual experiments, so we'll use `settings.ini`, but with`--min-experiments` set to 1.\n",
    "We specify these flags to the builder using `-c settings.ini --min-experiments 1`.\n",
    "\n",
    "In addition to these input experiments/configuration file, MAJIQ requires a GFF3 describing annotated transcripts.\n",
    "The subset of Ensembl GRCh38 v94 annotations corresponding to our selected genes can be found at `gff3/ensembl.Homo_sapiens.GRCh38.94.subset.gff3.gz`.\n",
    "This is passed to the builder directly as a positional argument.\n",
    "\n",
    "Finally, we must specify the builder's output directory using a required flag. To save its output to `output/build`, this would be `-o output/build`.\n",
    "\n",
    "Put together, we run the builder as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5f0a04b",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "majiq build \\\n",
    "    -o output/build \\\n",
    "    -c settings.ini \\\n",
    "    gff3/ensembl.Homo_sapiens.GRCh38.94.subset.gff3.gz \\\n",
    "    --min-experiments 1 \\\n",
    "    2> /dev/null"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "28a7ab12",
   "metadata": {},
   "source": [
    "This produces output files:\n",
    "\n",
    "+ `splicegraph.sql` (splicegraph database for use with VOILA)\n",
    "+ `majiq.log` (logging output from majiq build)\n",
    "+ `{experiment}.sj` (intermediate files that can be used instead of bams for future runs of majiq build)\n",
    "+ `{experiment}.majiq` (inputs for majiq quantifiers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "85899863",
   "metadata": {},
   "outputs": [],
   "source": [
    "ls output/build"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "614c55aa",
   "metadata": {},
   "source": [
    "## Running the MAJIQ quantifiers\n",
    "\n",
    "We quantify and test for differences in PSI between independent samples from the different cell lines/knockdowns using `majiq heterogen`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "507e55bd",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "majiq heterogen -o output/heterogen/ -n K562_SRSF4KD HepG2_SRSF4KD \\\n",
    "    -grp1 output/build/ENCSR{939BTN,905NIK}*.majiq \\\n",
    "    -grp2 output/build/ENCSR{929JKA,471INA}*.majiq \\\n",
    "    2> /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b88139be",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "majiq heterogen -o output/heterogen/ -n K562_SRSF4KD K562_NT \\\n",
    "    -grp1 output/build/ENCSR{939BTN,905NIK}*.majiq \\\n",
    "    -grp2 output/build/ENCSR{005JBU,772DUM,015KGT,163JUC}*.majiq \\\n",
    "    2> /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b839fcf",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "majiq heterogen -o output/heterogen/ -n K562_SRSF4KD HepG2_NT \\\n",
    "    -grp1 output/build/ENCSR{939BTN,905NIK}*.majiq \\\n",
    "    -grp2 output/build/ENCSR{805TIZ,853AOV,105NML,194SPW}*.majiq \\\n",
    "    2> /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bcd0b2f",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "majiq heterogen -o output/heterogen/ -n HepG2_SRSF4KD K562_NT \\\n",
    "    -grp1 output/build/ENCSR{929JKA,471INA}*.majiq \\\n",
    "    -grp2 output/build/ENCSR{005JBU,772DUM,015KGT,163JUC}*.majiq \\\n",
    "    2> /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40e52a28",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "majiq heterogen -o output/heterogen/ -n HepG2_SRSF4KD HepG2_NT \\\n",
    "    -grp1 output/build/ENCSR{929JKA,471INA}*.majiq \\\n",
    "    -grp2 output/build/ENCSR{805TIZ,853AOV,105NML,194SPW}*.majiq \\\n",
    "    2> /dev/null"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f29bba30",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "majiq heterogen -o output/heterogen/ -n K562_NT HepG2_NT \\\n",
    "    -grp1 output/build/ENCSR{005JBU,772DUM,015KGT,163JUC}*.majiq \\\n",
    "    -grp2 output/build/ENCSR{805TIZ,853AOV,105NML,194SPW}*.majiq \\\n",
    "    2> /dev/null"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d5f5423",
   "metadata": {},
   "source": [
    "These commands produce output to the directory specified (`-o output/heterogen`):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ee1ccae",
   "metadata": {},
   "outputs": [],
   "source": [
    "ls output/heterogen"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d87bc63b",
   "metadata": {},
   "source": [
    "These commands do the heavy lifting of inference, sampling, and testing.\n",
    "However, we need to use VOILA with the splicegraph database to generate human-readable outputs, as described in the next section."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "58425d03",
   "metadata": {},
   "source": [
    "## Running VOILA\n",
    "\n",
    "VOILA has three modes for looking at the output of our quantifiers:\n",
    "\n",
    "+ `voila tsv`: generate TSVs with quantified LSVs and associated statistics\n",
    "+ `voila modulize`: generate TSVs which differentiate each specific type of splicing variation\n",
    "+ `voila view`: run web service for interactively visualizing quantifications, statistics and their relationship to genes' splicegraphs.\n",
    "\n",
    "### Generating TSV output\n",
    "\n",
    "We can generate a TSV combining the LSVs quantified in all 6 pairs of comparisons.\n",
    "We:\n",
    "\n",
    "+ specify the splicegraph `output/build/splicegraph.sql` and voila files `output/heterogen/*.het.voila` as positional arguments,\n",
    "+ specify the path for our output TSV by the argument `-f output/voila_tsv/all-pairs-het.tsv`,\n",
    "+ have VOILA save all quantified LSVs, not only with those called as changing using `--show-all` (alternatively, run `voila view --help` to set thresholds for changing events).\n",
    "\n",
    "This leads to the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d80f44c7",
   "metadata": {},
   "outputs": [],
   "source": [
    "ls output/heterogen/tmp5sw2q083/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f1e9d0a",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "voila tsv \\\n",
    "    output/build/splicegraph.sql output/heterogen/*.het.voila \\\n",
    "    --show-all \\\n",
    "    -f output/voila_tsv/all-pairs-het.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d821b29",
   "metadata": {},
   "source": [
    "Let's look at what these outputs look like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a23519c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# print commented-out header\n",
    "with open(\"output/voila_tsv/all-pairs-het.tsv\", \"r\") as handle:\n",
    "    for line in handle:\n",
    "        if not line.startswith(\"#\"):\n",
    "            break\n",
    "        print(line, end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84668430",
   "metadata": {},
   "source": [
    "First, there is a commented-out header with metadata (in JSON format, after excluding the comment characters at the start of each line):\n",
    "\n",
    "+ `command`: the call to `voila tsv` that generated the TSV. This helps indicate which thresholds were set, etc.,\n",
    "+ `group_sizes`: the unique group names compared in the input voila files and the number of experiments included in each group,\n",
    "+ `psi_samples`, `test_percentile`: parameters used by majiq heterogen for computing test statistics not only on PSI posterior means but on repeated samples from PSI posterior distributions (more detailed explanation below when discussing column names),\n",
    "+ `stat_names`: the different statistical tests used to evaluate significance of differences between observations from groups,\n",
    "+ `voila_version`: the version of VOILA that was used.\n",
    "\n",
    "The most important consideration here are the `stat_names`: TNOM, TTEST, WILCOXON.\n",
    "By default, MAJIQ performs these three statistical tests (all but INFOSCORE) on each junction/retained intron from each LSV.\n",
    "TTEST corresponds to a independent two-sample t-test with unequal variances (Welch's t-test).\n",
    "WILCOXON corresponds to an independent two-sample Mann-Whitney U test.\n",
    "INFOSCORE and TNOM correspond to the test statistics as originally implemented in the [ScoreGenes package][scoregenes].\n",
    "INFOSCORE is excluded from defaults because it runs slower than the other tests with larger number of experiments.\n",
    "\n",
    "   [scoregenes]: https://www.cs.huji.ac.il/labs/compbio/scoregenes/\n",
    "\n",
    "MAJIQ can be configured to run only a subset of these tests by setting the `--stats` argument to the desired subset of these test-statistics.\n",
    "See {ref}`what statistic should I use <statistics>` for more opinionated guidance about picking which statistics to use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8567a31e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# print first uncommented line (column names)\n",
    "with open(\"output/voila_tsv/all-pairs-het.tsv\", \"r\") as handle:\n",
    "    for line in handle:\n",
    "        if line.startswith(\"#\"):\n",
    "            continue\n",
    "        print(line, end=\"\")\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f6b851d",
   "metadata": {},
   "source": [
    "Second, our first uncommented line in the TSV file indicates the column names for the table. Of note:\n",
    "\n",
    "The 25th, 50th, and 75th percentiles of PSI for each group are indicated by the columns `{group_name}_{percentile25,median,percentile75}_psi`\n",
    "\n",
    "The results of the statistical tests between groups are indicated by columns `{grp1}-{grp2} {test_name}{,_quantile}`. When the groups compared are unambiguous (only one VOILA file as input), `{grp1}_{grp2} ` is excluded from the column name. When the TNOM statistical test was run, we also report `{grp1}_{grp2} tnom_score`.\n",
    "The special case is `tnom_score`, which reports the raw test-statistic -- the minimum **t**otal **n**umber **o**f **m**istakes when picking a single threshold on PSI to classify the observations back into the two groups.\n",
    "Otherwise, the statistical tests correspond to p-values from each test.\n",
    "For each test, we report `{test_name}` and `{test_name}_quantile`.\n",
    "The columns `{test_name}` (no quantile) correspond to statistical tests on the posterior means of each sample. _These p-values are calibrated on an individual basis_.\n",
    "In some cases, we can benefit from considering the uncertainty encoded in the full posterior distributions by performing tests on samples from the posterior\n",
    "distribution.\n",
    "Then, we can consider a credible range of test statistics from each comparison that we might expect.\n",
    "By default, `majiq heterogen` takes 100 samples from each posterior distribution and reports the 95th percentile p-value from the tests comparing each\n",
    "of the posterior samples (change using `--psi-samples` and `--test_percentile` flags).\n",
    "The `{test_name}_quantile` column thus indicates the selected percentile of sampled p-values, which is generally more conservative and not calibrated as a p-value in its own right.\n",
    "\n",
    "There are also columns with the format `{grp1}-{grp2} ?(non)changing`.\n",
    "These indicate whether each junction/retained intron in the LSV met VOILA's thresholds for being changing or nonchanging.\n",
    "By default, VOILA calls a junction as changing between groups if:\n",
    "\n",
    "+ all test statistics have nominal p-value less than 0.05 (change with `--changing-pvalue-threshold` flag)\n",
    "+ difference in group median PSI of at least 20% (change with `--changing-between-group-dpsi` flag).\n",
    "\n",
    "Similarly, by default, VOILA calls a junction as non-changing between groups if:\n",
    "\n",
    "+ all test statistics have nominal p-value greater than 0.05 (change with `--non-changing-pvalue-threshold` flag)\n",
    "+ within-group interquartile range of PSI less than 10% in both groups (change with `--non-changing-within-group-IQR` flag)\n",
    "+ difference in group median PSI of at most 5% (change with `--non-changing-between-group-dpsi` flag).\n",
    "\n",
    "In some cases, for a certain quantifier run, due to low coverage or thresholding, a certain junction may not be quantified in a certain experimental group. This possibility is indicated through the `{grp} num_quantified` columns. In the majority of cases, the number here will match the value of `num_junctions`, but when fewer junctions are quantified in that group, the value here will reflect a lower sum."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8795ee2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# print first line of data\n",
    "with open(\"output/voila_tsv/all-pairs-het.tsv\", \"r\") as handle:\n",
    "    for line in handle:\n",
    "        if line.startswith(\"#\"):\n",
    "            continue\n",
    "        break  # don't print first uncommented line (column names)\n",
    "    # print the next line (first row of data)\n",
    "    print(next(handle), end=\"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ef0dca3",
   "metadata": {},
   "source": [
    "Finally, the remaining lines in the TSV are for the data in the table.\n",
    "Each row corresponds to a quantified LSV. For the quantification, test, etc. columns, the values for each junction/retained intron in the LSV are delimited by semicolons.\n",
    "Note that if computation is desired on the individual values, it can be convenient to \"explode\" the data to rows per LSV connection.\n",
    "In pandas, this can be done by using `pd.Series.str.split(\";\")` on appropriate columns followed by `pd.DataFrame.explode()`.\n",
    "In R, this can be done by using `tidyr::separate_rows()`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "221d722a",
   "metadata": {},
   "source": [
    "### Categorizing splicing events with `voila modulizer`\n",
    "\n",
    "We can also generate summaries of categorizations of splicing events with `voila modulize`. The purpose of this tool is not only to summarize classical events, but also a variety of complex events as well as determine common combinations of two or more events. \n",
    "\n",
    "![voila-modulizer](../static/voila-modulizer.png)\n",
    "\n",
    "You may run the modulizer in the following manner:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e70cae70",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "voila modulize output/build/splicegraph.sql output/heterogen/*.het.voila -d output/modulized --overwrite"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7dc71aa6",
   "metadata": {},
   "source": [
    "Note that the modulizer can run with any combination of the standard .psi.voila or .deltapsi.voila files in addition to our het.voila files.\n",
    "Each additional piece of data for the same LSV/junction will be recorded in the output row in the modulized output. \n",
    "\n",
    "The modulized output will be recorded in a number of files in .tsv format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8afe1dd6",
   "metadata": {},
   "outputs": [],
   "source": [
    "ls output/modulized"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b45026e",
   "metadata": {},
   "source": [
    "They are:\n",
    "\n",
    "- `summary.tsv`: Likely the most immediately useful for the majority of users.\n",
    "  Each row corresponds to a splicing module, and lists the counts of each\n",
    "  type of splicing event found inside the module.\n",
    "- `heatmap.tsv`: For each module, a junction of greatest biological\n",
    "  importance is considered and quantified here, one row per module.\n",
    "  The method/parameters for choosing which junction is selected can be changed, see `voila modulize --help` for details.\n",
    "- event tsv files, ex: `cassette.tsv`: Contains quantification information\n",
    "  across all input files for each junction making up that specific\n",
    "  splicing event. \n",
    "- `other.tsv`: for any junctions not assigned to ANY splicing event,\n",
    "  they are dumped here. \n",
    "- `junctions.tsv`: A concatenation of all junction data from all\n",
    "  event tsv files as well as `other.tsv`.\n",
    "\n",
    "\n",
    "Please see [VOILA modulizer documentation][modulizer-docs] for more details on the modulizer.\n",
    "\n",
    "   [modulizer-docs]: https://docs.google.com/document/d/1JQ0zlGA5VXRBoQifmz777VZMkJvnvZ7_xQA7AOmucBw/edit?usp=sharing"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c67f9a4f",
   "metadata": {},
   "source": [
    "### Interactively visualizing results with `voila view`\n",
    "\n",
    "We can also view results interactively in our web browser using `voila view`.\n",
    "We can start the web service by running\n",
    "\n",
    "```\n",
    "voila view output/build/splicegraph.sql output/heterogen/*.het.voila\n",
    "```\n",
    "\n",
    "By default, the web service is hosted on a random port for localhost (`--host` and `-p` flags can be used to specify directly).\n",
    "The console will print a URL that can be accessed from the same machine within a web browser. For example, when we run the above command, we see:\n",
    "\n",
    "```\n",
    "2021-09-01 01:08:53,380 (PID:5613) - INFO - Command: /{redacted}/bin/voila view output/build/splicegraph.sql output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila output/heterogen/HepG2_SRSF4KD-K562_NT.het.voila output/heterogen/K562_NT-HepG2_NT.het.voila output/heteroge\n",
    "n/K562_SRSF4KD-HepG2_NT.het.voila output/heterogen/K562_SRSF4KD-HepG2_SRSF4KD.het.voila output/heterogen/K562_SRSF4KD-K562_NT.het.voila\n",
    "2021-09-01 01:08:53,380 (PID:5613) - INFO - Voila v2.2.0\n",
    "2021-09-01 01:08:53,380 (PID:5613) - INFO - config file: /tmp/tmpixnu75yk\n",
    "2021-09-01 01:08:53,395 (PID:5613) - INFO - Creating index: /{redacted}/output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila\n",
    "Indexing LSV IDs: 0 / 94\n",
    "2021-09-01 01:08:55,510 (PID:5613) - INFO - Writing index: /{redacted}/output/heterogen/HepG2_SRSF4KD-HepG2_NT.het.voila\n",
    "Serving on http://localhost:45899\n",
    "```\n",
    "\n",
    "So we open up `http://localhost:45899` in our web browser.\n",
    "VOILA view first opens up to an index over genes/LSVs:\n",
    "\n",
    "![voila index](../static/voila-index.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de5b4618",
   "metadata": {},
   "source": [
    "We can browse through pages of genes/LSV diagrams as convenient, or we can also use the search bar in the top-right hand side to search\n",
    "for a particular gene (or LSV).\n",
    "Once we've found a gene of interest, we click the link in the \"Gene Name\" column to view details for that gene.\n",
    "\n",
    "For example, clicking into _NDUFV3_ gives us a screen that looks like:\n",
    "\n",
    "![voila-ndufv3](../static/voila-ndufv3.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c20eb48e",
   "metadata": {},
   "source": [
    "The top portion of the screen corresponds to splicegraphs.\n",
    "These indicate exon numbers and annotated vs denovo exons, junctions, and retained introns, and observed raw read coverage for each intron or junction.\n",
    "Which experiments have coverage displayed can be modified by clicking on the \"SpliceGraph\" button, and adding or removing additional experiments to display.\n",
    "The magnification buttons next to the legend allow zooming in or out on the splicegraph, and the wand button changes whether intronic regions are squashed or displayed to scale.\n",
    "\n",
    "The bottom portion of the screen corresponds to quantified LSVs.\n",
    "Each LSV has three columns:\n",
    "\n",
    "1.  The first column indicates the genomic coordinates of the junction or retained intron.\n",
    "2.  The second column shows violin plots of the averaged posterior distributions per group.\n",
    "    Individual points show the posterior means for PSI for each experiment per group.\n",
    "    Hovering over points displays a tooltip indicating the experiment name.\n",
    "3.  The third column shows a heat map showing group-vs-group comparisons.\n",
    "    Below the diagonal, the heatmap displays observed p-values for each comparison.\n",
    "    The specific test displayed can be changed using the radio buttons in the top-right corner.\n",
    "    Above the diagonal, the heatmap shows the difference in PSI between the medians of each group.\n",
    "\n",
    "We can see the quantified values of PSI for each experiment per group in the violin plots."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "562110b1",
   "metadata": {},
   "source": [
    "#### `voila view` as a multi-user web server\n",
    "\n",
    "`voila view` was also designed with sharing and collaboration in mind.\n",
    "So, in addition to being used locally, it is possible to set it up as its own web server host.\n",
    "\n",
    "Note that accessibility of the service within local networks or the internet\n",
    "in general can be highly dependent on network settings and security practices\n",
    "(e.g. port forwarding, IP {white,black}lists, etc.) outside the scope of\n",
    "MAJIQ/VOILA itself; You can consult with your system administrator for advice if you want to make\n",
    "`voila view` available outside of the local machine / for multiple users.\n",
    "\n",
    "- `voila view` has a `--host` parameter, which works similarly to `--host` parameters in most other unix-like programs. That is, it sets the bind address. For the most common use case, to listen to all addresses, simply pass `--host 0.0.0.0`. (The default is `--host 127.0.0.1`, which listens only to your local machine)\n",
    "- for security against port-scanners, `voila view` provides a `--enable-passcode` switch. This generates a unique passcode in the link which is needed for users to successfully view the output. Recommended for use when hosting sensitive data with the tool. \n",
    "- `voila view` has three web server options to cover all needs. you can choose `flask`, `waitress`, or `gunicorn` for as a parameter to the `--web-server` switch. `flask` is insecure and slow, and should only be used if you are trying to resolve a bizarre error, as it generates more debugging information. `waitress` is the default. It is efficient, and suitable for most single-user use cases. However, it is single-threaded. `gunicorn` is a production-ready, performant web server used by many projects. It can serve many users/page requests concurrently. It may be helpful in some cases to speed up access even for a single user if loading very large numbers of LSVs in a gene at once. You can set the number of threads to host with the `--num-web-workers` argument. Safe to use with nginx proxy-pass as well. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "926f0030",
   "metadata": {},
   "source": [
    "### Filtering results\n",
    "\n",
    "The output files used by VOILA (`splicegraph.sql`, `*.voila`) can be very large\n",
    "for real analyses.\n",
    "`voila filter` allows filtering these files to reduce their\n",
    "filesize to facilitate sharing (or downloading from a cluster environment).\n",
    "\n",
    "We:\n",
    "\n",
    "+ specify the splicegraph `output/build/splicegraph.sql` and voila files `output/heterogen/*.het.voila` as positional arguments,\n",
    "+ specify the output directory for the filtered files with the required argument `--directory output/voila_filter`\n",
    "+ specify the set of genes or LSVs to keep after filtering either directly (`--gene-ids`, `--lsv-ids`) or by specifying a file with IDs to keep (`--gene-ids-file`, `--lsv-ids-file`): `--gene-ids 'gene:ENSG00000106603'`.\n",
    "\n",
    "This leads to the command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1daa1dab",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "voila filter output/build/splicegraph.sql output/heterogen/*.het.voila \\\n",
    "    --directory output/voila_filter --gene-ids 'gene:ENSG00000106603' --overwrite"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}