This document provides a step-by-step guide to reproduce the results
of the paper “Multi-view
biclustering via non-negative matrix tri-factorisation”. This
webpage is generated from the reproduce_results.rmd file in
the resNMTF_paper GitHub repository. It is designed to be
run on a local macOS/Linux machine with R,
Python and a shell language
available.
Details on how to run the file and reproduce the results are provided below. To begin, some details on setup and requirements are provided.
Note: After the paper was published, a significant
speed up was made to the functions in the resnmtf package.
For complete reproducibility, the version of resnmtf used
in the paper (commit 6bc9ff2) is installed in the code
chunk to install packages. To use the latest version of
resnmtf, simply change the commit to main in
the relevant code chunk.
Due to the extensive number of simulation studies, we have only included the code to reproduce the results for those in the main body of the text. Further instructions to reproduce the results for those in the supplementary material are detailed at relevant points in the document. All figures included in the poster for the Bioinference 2025 conference are from the main body of the paper and as such are reproduced here.
Each figure included in this webpage is generated by the code chunk that precedes it.
Note: Due to long run times, several modifications have been made:
- The number of repeats in the simulation studies is reduced. It is set by the n_reps parameter in the relevant code chunk. Each repeat adds just over an hour to the run time.
- The code chunks for the real data applications are set to eval=FALSE and include=FALSE to prevent the code from running. To reproduce these results anyway, simply remove these two arguments.

The runtime for this document (on a 2024 M4 MacBook Pro with 16GB RAM) without the real data applications is approximately 3.5 hours. The runtime for each real data application is approximately 1-2 days (on a cluster).
Any figures from the paper that are not produced by code (i.e. illustrations) are not included in this document.
The document is structured as follows:
This section details the software required to run the code and includes details to install relevant packages as well as set up required environments.
As the code was developed using shell scripts, a shell language is
required. If you are on Windows, please install a Unix-style shell
(e.g. Git Bash or Cygwin, both of which provide bash by default).
Important: The following assumes you are using a
zsh shell. If you are using a different shell, please
adjust the commands in the code chunks accordingly.
In order to run this document, you need to clone the GitHub repository
to your local machine and knit the reproduce_results.rmd
file. The repository can be found here.
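The clone-and-knit step can also be done from the command line. The following is a minimal sketch, not part of the repository: REPO_URL is a placeholder for the repository link given above, the helper names run and clone_and_knit are our own, and it assumes the rmarkdown package is already installed. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: clone the repository and knit the R Markdown file from the command line.
# REPO_URL is a placeholder: replace it with the repository link given above.
REPO_URL="${REPO_URL:-REPLACE_WITH_REPOSITORY_URL}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = actually run them

# Hypothetical helper: print a command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

clone_and_knit() {
  run git clone "$REPO_URL" resNMTF_paper || return 1
  # Knit from the repository root so that relative paths in the document resolve.
  run cd resNMTF_paper || return 1
  # rmarkdown::render() is the function behind RStudio's "Knit" button.
  run Rscript -e 'rmarkdown::render("reproduce_results.rmd")'
}

clone_and_knit
```

Set DRY_RUN=0 once REPO_URL points at the real repository.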
Important: the resnmtf package benefits from the speed-ups provided by a non-“reference” version of BLAS (a library of standard linear algebra routines). Switching to a quicker version is straightforward; see here for an example.
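Before and after switching BLAS, it can help to check which library R is actually linked against. The sketch below is one way to do so from the shell; the function name show_blas is our own, and it assumes R 3.4 or later, where the base R functions extSoftVersion() and La_library() report this information.

```shell
#!/bin/sh
# Sketch: report which BLAS/LAPACK libraries the local R installation uses.
# show_blas is a hypothetical helper, not part of the repository.
show_blas() {
  if ! command -v Rscript >/dev/null 2>&1; then
    echo "Rscript not found on PATH"
    return 0
  fi
  # Print the BLAS and LAPACK library paths as seen by R.
  Rscript -e 'cat("BLAS:  ", extSoftVersion()[["BLAS"]], "\n"); cat("LAPACK:", La_library(), "\n")' \
    || echo "could not query R for BLAS details"
}

show_blas
```

Comparing the printed paths before and after the change confirms whether the switch took effect.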
Secondly, we need to install the required packages. This work
requires our R packages resnmtf and bisilhouette, which are
available on GitHub (pages here and here respectively) and can be
installed using the devtools package.
Here we install the specific version of resnmtf used in the
paper (commit 6bc9ff2).
# Install and load required packages
source("install_packages.r")
library(devtools)
devtools::install_github("eso28599/resnmtf@6bc9ff2") # version used in paper
devtools::install_github("eso28599/bisilhouette")
library(resnmtf)
library(bisilhouette)
library(knitr)
library(magick)
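Switching between the paper version and the latest version of resnmtf only changes the ref after the @ in the install call. As a hedged sketch, the same step can be driven from the shell with the ref as a parameter; install_resnmtf and DRY_RUN are illustrative names of our own, not something the repository provides.

```shell
#!/bin/sh
# Sketch: install resnmtf at a chosen git ref (commit hash or branch name).
# install_resnmtf and DRY_RUN are hypothetical, not part of the repository.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

install_resnmtf() {
  ref="${1:-6bc9ff2}"     # default: the commit used in the paper
  r_expr="devtools::install_github(\"eso28599/resnmtf@${ref}\")"
  if [ "$DRY_RUN" = "1" ]; then
    echo "Rscript -e '$r_expr'"
  else
    Rscript -e "$r_expr"
  fi
}

install_resnmtf        # version used in the paper
install_resnmtf main   # latest version
```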
As one of the methods ResNMTF is compared against is written in Python, an environment is provided for reproducibility. This assumes you have Python and pip available on your machine. If you do not, please install them.
chmod +x ./conda_env.sh
source conda_env.sh
Alternatively, if you do not have conda installed, you can create a
virtual environment using pip. This can be done by setting
eval=TRUE on the following code chunk and running it.
If you don’t have pyenv installed, please install it
following instructions here. This allows you to
create a virtual environment with a specific version of Python, which is
needed to run the iSSVD method.
chmod +x ./virtual_env.sh
./virtual_env.sh
You will also need to uncomment the source line in the
SimStudy/run_sim.sh file to activate the environment using
pip rather than conda.
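The choice between the two setup routes can be sketched as a small check on which tools are available. The function name choose_env_script is our own, and the check assumes conda or pyenv is visible to command -v in the current shell, which may not hold if conda is only initialised in interactive sessions.

```shell
#!/bin/sh
# Sketch: pick the Python environment setup script based on the tools available.
# choose_env_script is a hypothetical helper, not part of the repository.
choose_env_script() {
  if command -v conda >/dev/null 2>&1; then
    echo "conda_env.sh"
  elif command -v pyenv >/dev/null 2>&1; then
    echo "virtual_env.sh"
  else
    echo "none"
  fi
}

# Print the matching instructions rather than running anything.
case "$(choose_env_script)" in
  conda_env.sh)   echo "run: chmod +x ./conda_env.sh && source conda_env.sh" ;;
  virtual_env.sh) echo "run: chmod +x ./virtual_env.sh && ./virtual_env.sh, then uncomment the source line in SimStudy/run_sim.sh" ;;
  none)           echo "install conda, or pyenv and pip, first" ;;
esac
```

Note that conda_env.sh is sourced whereas virtual_env.sh is executed, matching the two chunks above.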
This section details the code to reproduce all results from the paper, with each code chunk corresponding to a figure or table in the paper. Each figure included in this document is generated by the code chunk that precedes it.
To run the simulation study for increasing the number of biclusters (on the base scenario of three views), with 3 repeats, the following code chunk can be used:
export sim="bicl"
export sim_folder_name="bicl_3v"
export seq=(3 4 5 6)
export n_reps=3
./SimStudy/run_sim.sh $sim $sim_folder_name $n_reps $seq
Tip: The full results produced can now be found in
the SimStudy/Results/bicl_3v folder, individually under
various file names (F_score_plot.pdf,
BiS_plot.pdf, etc.) or in one pdf
(results.pdf).
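A quick way to confirm that a run finished is to check that the named output files exist and are non-empty. The sketch below checks only the three files named above; check_results is a hypothetical helper, and the folder may contain further outputs that it does not test for.

```shell
#!/bin/sh
# Sketch: confirm that the named outputs of a simulation run exist and are non-empty.
# check_results is a hypothetical helper, not part of the repository.
check_results() {
  results_dir="${1:-SimStudy/Results/bicl_3v}"
  missing=0
  for f in F_score_plot.pdf BiS_plot.pdf results.pdf; do
    if [ -s "$results_dir/$f" ]; then
      echo "ok       $f"
    else
      echo "missing  $f"
      missing=1
    fi
  done
  return "$missing"   # 0 if all three files are present, 1 otherwise
}

check_results SimStudy/Results/bicl_3v \
  || echo "some outputs are missing; has the simulation finished?"
```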
Some of the results produced are shown below:
Figure 5A
Figure 5B
The remaining experiments can be reproduced by setting the
sim, sim_folder_name and seq
variables to the relevant names in the code chunk above. The variable
names for the other experiments in Figure 6 are:
- sim: views, sim_folder_name: views_5b, seq: 2 3 4 5
- sim: indiv, sim_folder_name: indiv_3v5b, seq: 50 200 300 500 1000
- sim: noise, sim_folder_name: noise_3v5b, seq: 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100

The variable names for remaining experiments are detailed in the
readme file. For a specific simulation, the seq variable is
found in the corresponding
SimStudy/Scripts/[sim_folder_name]_script_one.sh file, in
the for loop.
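The three experiments listed above can be launched with one small wrapper. This is a sketch under assumptions: run_experiment and DRY_RUN are our own names, and it assumes run_sim.sh accepts the seq values as separate trailing arguments, as in the zsh chunk for the bicluster experiment. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: launch a simulation study with the same argument order as the chunk above.
# run_experiment and DRY_RUN are hypothetical, not part of the repository.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually run it

run_experiment() {
  sim="$1"; folder="$2"; n_reps="$3"
  shift 3                 # the remaining arguments are the seq values
  if [ "$DRY_RUN" = "1" ]; then
    echo "./SimStudy/run_sim.sh $sim $folder $n_reps $*"
  else
    ./SimStudy/run_sim.sh "$sim" "$folder" "$n_reps" "$@"
  fi
}

# Three repeats each, matching the bicluster example.
run_experiment views views_5b   3 2 3 4 5
run_experiment indiv indiv_3v5b 3 50 200 300 500 1000
run_experiment noise noise_3v5b 3 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
```

Passing the seq values as positional arguments avoids relying on shell-specific array expansion, which differs between zsh and bash.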
Note: The experiments were initially run on
clusters. The scripts to run the simulations on clusters can be found in
the SimStudy/Scripts folder, where each experiment has a
[sim_study]_script_one.sh and
a [sim_study]_script_two.sh file.
The results for ResNMTF applied to the various real datasets are
obtained from the investigations into the effect of distance metrics and
hyperparameter tuning (producing
RealData/all_results_dis.txt). This can be implemented for
each dataset by running the relevant chunk below.
#### 3sources
1. Run restriction hyperparameter study on HPC system.
qsub RealData/3sources/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "3sources" 3
qsub RealData/3sources/stab_script.sh
qsub RealData/combine_stab_metrics.r "3sources" 3
python RealData/3sources/3sources_python_analysis # python based methods
qsub RealData/other_results.r "3sources" 3
The analysis of the other real datasets follows a similar structure.
#### bbcsport
1. Run restriction hyperparameter study on HPC system.
qsub RealData/bbcsport/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "bbcsport" 2
qsub RealData/bbcsport/stab_script.sh
qsub RealData/combine_stab_metrics.r "bbcsport" 2
python RealData/bbcsport/bbc_python_analysis # python based methods
qsub RealData/other_results.r "bbcsport" 2
#### single_cell
qsub RealData/single_cell/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "single_cell" 2
qsub RealData/single_cell/stab_script.sh
qsub RealData/combine_stab_metrics.r "single_cell" 2
python RealData/single_cell/single_cell_python_analysis # python based methods
qsub RealData/other_results.r "single_cell" 2
#### 3cancers
qsub RealData/3cancers/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "3cancers" 2
qsub RealData/3cancers/stab_script.sh
qsub RealData/combine_stab_metrics.r "3cancers" 2
python RealData/3cancers/3cancers_python_analysis # python based methods
qsub RealData/other_results.r "3cancers" 2
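The four real-data analyses share the same six-step shape, differing only in the dataset name, the integer argument and the Python script name. The sketch below prints the commands for a dataset in the order listed above; print_pipeline is a hypothetical helper, and nothing is submitted. We assume each submitted job must finish before the next step is started, since later steps combine earlier outputs.

```shell
#!/bin/sh
# Sketch: print the six real-data commands for one dataset, in the order listed above.
# print_pipeline is a hypothetical helper, not part of the repository.
print_pipeline() {
  dataset="$1"     # folder name under RealData/
  n="$2"           # integer passed to the R scripts (3 for 3sources, 2 for the others)
  py_script="$3"   # Python analysis script name, copied as written above
  echo "qsub RealData/$dataset/hyperparam_script.sh"
  echo "qsub RealData/combine_dis_metrics.r \"$dataset\" $n"
  echo "qsub RealData/$dataset/stab_script.sh"
  echo "qsub RealData/combine_stab_metrics.r \"$dataset\" $n"
  echo "python RealData/$dataset/$py_script"
  echo "qsub RealData/other_results.r \"$dataset\" $n"
}

print_pipeline 3sources    3 3sources_python_analysis
print_pipeline bbcsport    2 bbc_python_analysis
print_pipeline single_cell 2 single_cell_python_analysis
print_pipeline 3cancers    2 3cancers_python_analysis
```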
The results for ResNMTF applied to the various real datasets are obtained from the investigations above, producing the following results:
Figure 5A
Figure 5B
Figure 6B
Lastly, Figure 6A is produced via the following code chunk, which
creates and saves the bisilhouette plot for the shuffled synthetic data.
This is saved to
Exploration/visual_data/shuffled_bisil_plot.pdf.
source("Exploration/visualisation_ex.r")
Figure 7A
The following code chunk creates and saves the two subfigures in the supplementary material.
source("Exploration/visualisation_ex.r")
These are saved to
Exploration/convergence_data/error_plot.pdf and
RealData/3cancers/data/stability/error_plot.pdf.
Figure S1A
Figure S1B
The following code chunk creates and saves the two subfigures in the supplementary material.
source("Exploration/JSD_diff_dists.r")
These are saved to
Exploration/visual_data/f_data_dists.pdf and
Exploration/visual_data/shuffled_data_dists.pdf.
Figure S2A
Figure S2B
The generation of the data corresponding to the overlap regions in the published manuscript was not implemented as intended. Whilst the generation method used within the paper is not incorrect, it is not what would be typically assumed. We have discussed and included the results from both generation methods for completeness. The results of the simulation study on the original overlapping data (which were reported in the supplementary material not the main body of the manuscript) are seen in Figure A1 whilst Figure A2 contains the results on the additive overlapping data. The trends seen and conclusions drawn from both figures are the same. Indeed the main body of the paper contains only one sentence on the overlapping results, which remains accurate for the results from the updated structure.
This repository contains the original implementation (where overlap regions of the bicluster were taken as a single draw from a normal distribution), with the intended implementation (where overlap regions of the bicluster are the sum of two independent draws from a normal distribution) commented out in lines 183-185 of SimStudy/Functions/data_generations.r. The trends seen and conclusions drawn remain unaltered once this modification is made.
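To make the difference between the two schemes concrete, here is a short worked comparison. It assumes, purely for illustration, that both draws share the same mean $\mu$ and variance $\sigma^2$; the parameters are not stated here, so the symbols are ours.

```latex
\begin{align*}
\text{Original overlap:} \quad Y &= X_1, & Y &\sim \mathcal{N}(\mu, \sigma^2), \\
\text{Additive overlap:} \quad Y &= X_1 + X_2, & Y &\sim \mathcal{N}(2\mu, 2\sigma^2),
\end{align*}
where $X_1, X_2 \sim \mathcal{N}(\mu, \sigma^2)$ are independent.
```

Under this assumption, an entry in an additive overlap region has twice the mean and twice the variance of a single draw, whereas an entry in an original overlap region is distributed exactly like a single draw.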
Figure A1: Original overlap
Figure A2: Additive overlap