This document provides a step-by-step guide to reproduce the results
of the paper “Multi-view
biclustering via non-negative matrix tri-factorisation”. This
webpage is generated from the reproduce_results.rmd
file in
the resNMTF_paper
GitHub repository. It is designed to be
run on a local macOS/Linux machine with R,
Python and a shell language
available.
Details on how to run the file and reproduce the results are provided below. To begin, some details on setup and requirements are provided.
Due to the extensive number of simulation studies, we have only included the code to reproduce the results for those in the main body of the text. Further instructions to reproduce the results for those in the supplementary material are detailed at relevant points in the document. All figures included in the poster for the Bioinference 2025 conference are from the main body of the paper and as such are reproduced here.
All figures included in this webpage are generated by the preceding code chunk.
Note: Due to long run times, several modifications have been made:
- The number of repeats in the simulation studies is controlled by the n_reps parameter in the relevant code chunk. Each repeat adds just over an hour to the run time.
- The code chunks for the real data applications are set to eval=FALSE and include=FALSE to prevent the code from running. To reproduce these results anyway, simply remove these two arguments.

The runtime for this document (on a 2024 M4 MacBook Pro with 16GB RAM) without the real data applications is approximately 3.5 hours. The runtime for each real data application is approximately 1-2 days (on a cluster).
Any figures from the paper that are not produced by code (i.e. illustrations) are not included in this document.
The document is structured as follows:
This section details the software required to run the code and includes details to install relevant packages as well as set up required environments.
As the code was developed using shell scripts, a shell language is
required. If you are on Windows, please install a zsh
shell
(e.g. Git Bash or Cygwin).
Important: The following assumes you are using a
zsh
shell. If you are using a different shell, please
adjust the commands in the code chunks accordingly.
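Before going further, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch rather than part of the repository: the function name missing_tools is hypothetical, and the tool names (Rscript, python, pip, git, zsh) are assumptions based on the requirements described in this document.

```shell
# Print the names of the given tools that are not found on the PATH.
# missing_tools is a hypothetical helper, not part of the repository.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="$missing $tool"
    fi
  done
  # Drop the leading space before printing.
  echo "${missing# }"
}

# Assumed tool names: adjust them to match your installation (e.g. python3).
not_found=$(missing_tools Rscript python pip git zsh)
if [ -n "$not_found" ]; then
  echo "Missing tools: $not_found"
else
  echo "All required tools found."
fi
```

If anything is reported as missing, install it before knitting the document.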
In order to run this document, you need to clone the GitHub repository
to your local machine and knit the reproduce_results.rmd
file. The repository can be found here.
Secondly, we need to install the required packages. This work requires our R
packages resnmtf and bisilhouette, which are available on GitHub and can be
installed using the devtools package. Their GitHub pages can be found here and here
respectively.
# Install and load required packages
source("install_packages.r")
library(devtools)
devtools::install_github("eso28599/resnmtf")
devtools::install_github("eso28599/bisilhouette")
library(resnmtf)
library(bisilhouette)
library(knitr)
library(magick)
As one of the methods ResNMTF is compared against is written in Python, an environment is provided for reproducibility. This assumes you have python and pip available on your machine. If you do not, please install them.
chmod +x ./conda_env.sh
source conda_env.sh
Alternatively, if you do not have conda installed, you can create a
virtual environment using pip. This can be done by changing the
eval option of the following code chunk so that it runs.
If you don’t have pyenv
installed, please install it
following instructions here. This allows you to
create a virtual environment with a specific version of Python, which is
needed to run the iSSVD method.
chmod +x ./virtual_env.sh
./virtual_env.sh
You will also need to uncomment the source
line in the
SimStudy/run_sim.sh
file to activate the environment using
pip
rather than conda
.
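The choice between the two setup routes can be sketched as a small check. This is an illustration only: pick_env_script is a hypothetical helper, and it merely prints which of the two scripts named above applies, without running either.

```shell
# Print the setup script that matches the available tooling:
# conda_env.sh when conda is on the PATH, virtual_env.sh otherwise.
# pick_env_script is a hypothetical helper, not part of the repository.
pick_env_script() {
  if command -v conda >/dev/null 2>&1; then
    echo "conda_env.sh"
  else
    echo "virtual_env.sh"
  fi
}

env_script=$(pick_env_script)
echo "Python environment setup script to use: $env_script"
```

Note that, as shown in the chunks above, conda_env.sh is sourced while virtual_env.sh is executed directly.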
This section details the code to reproduce all results from the paper, with each code chunk corresponding to a figure or table in the paper. All figures included in this document are generated by the preceding code chunk.
To run the simulation study for increasing the number of biclusters (on the base scenario of three views), with 3 repeats, the following code chunk can be used:
export sim="bicl"
export sim_folder_name="bicl_3v"
export seq=(3 4 5 6)
export n_reps=3
./SimStudy/run_sim.sh $sim $sim_folder_name $n_reps $seq
Tip: The full results produced can now be found in
the SimStudy/Results/bicl_3v
folder, individually under
various file names (F_score_plot.pdf
,
BiS_plot.pdf
, etc.) or in one pdf
(results.pdf
).
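To confirm that a run completed, one option is to check that the result files exist. The sketch below is an assumption-laden illustration: missing_results is a hypothetical helper, and it only checks the three file names mentioned in the tip, although the folder may contain further files.

```shell
# Print each expected result file that is absent from the given folder.
# missing_results is a hypothetical helper; only the file names
# mentioned in the text are checked.
missing_results() {
  results_dir=$1
  for f in F_score_plot.pdf BiS_plot.pdf results.pdf; do
    [ -f "$results_dir/$f" ] || echo "$f"
  done
}

# Prints nothing when all three files are present.
missing_results "SimStudy/Results/bicl_3v"
```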
Some of the results produced are shown below:
Figure 5A
Figure 5B
The remaining experiments can be reproduced by setting the sim, sim_folder_name
and seq variables to the relevant values in the code chunk above. The values
for the other experiments in Figure 6 are:
- sim: views, sim_folder_name: views_5b, seq: 2 3 4 5
- sim: indiv, sim_folder_name: indiv_3v5b, seq: 50 200 300 500 1000
- sim: noise, sim_folder_name: noise_3v5b, seq: 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
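The settings above can be collected in one place with a small lookup. This is a sketch, not repository code: sim_command is a hypothetical helper that only prints the command (a dry run), and the argument order is assumed to match the earlier call to ./SimStudy/run_sim.sh (sim, folder name, number of repeats, then the seq values).

```shell
# Print the run_sim.sh command for one of the experiments listed above.
# sim_command is a hypothetical helper; it prints and does not run anything.
sim_command() {
  exp_name=$1
  reps=$2
  case "$exp_name" in
    bicl)  folder="bicl_3v";    seq_values="3 4 5 6" ;;
    views) folder="views_5b";   seq_values="2 3 4 5" ;;
    indiv) folder="indiv_3v5b"; seq_values="50 200 300 500 1000" ;;
    noise) folder="noise_3v5b"; seq_values="1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100" ;;
    *) echo "unknown experiment: $exp_name" >&2; return 1 ;;
  esac
  echo "./SimStudy/run_sim.sh $exp_name $folder $reps $seq_values"
}

# Dry run: show the commands without starting any simulation.
for name in views indiv noise; do
  sim_command "$name" 3
done
```

Printing the commands first makes it easy to check the arguments before committing to a run of several hours.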
The variable names for the remaining experiments are detailed in the
readme file. For a specific simulation, the seq
variable is found in the for loop of the corresponding
SimStudy/Scripts/[sim_folder_name]_script_one.sh
file.
Note: The experiments were initially run on
clusters. The scripts to run the simulations on clusters can be found in
the SimStudy/Scripts
folder, where each experiment has a
[sim_study]_script_one.sh
and a
[sim_study]_script_two.sh
file.
The results for ResNMTF applied to the various real datasets are
obtained from the investigations into the effect of distance metrics and
hyperparameter tuning (producing
RealData/all_results_dis.txt). This can be implemented for
each dataset by running the relevant chunk below.
#### 3sources
1. Run restriction hyperparameter study on HPC system.
qsub RealData/3sources/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "3sources" 3
qsub RealData/3sources/stab_script.sh
qsub RealData/combine_stab_metrics.r "3sources" 3
python RealData/3sources/3sources_python_analysis # python based methods
qsub RealData/other_results.r "3sources" 3
The analysis of the other real datasets follows a similar structure.
#### bbcsport
1. Run restriction hyperparameter study on HPC system.
qsub RealData/bbcsport/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "bbcsport" 2
qsub RealData/bbcsport/stab_script.sh
qsub RealData/combine_stab_metrics.r "bbcsport" 2
python RealData/bbcsport/bbc_python_analysis # python based methods
qsub RealData/other_results.r "bbcsport" 2
#### single_cell
qsub RealData/single_cell/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "single_cell" 2
qsub RealData/single_cell/stab_script.sh
qsub RealData/combine_stab_metrics.r "single_cell" 2
python RealData/single_cell/single_cell_python_analysis # python based methods
qsub RealData/other_results.r "single_cell" 2
#### 3cancers
qsub RealData/3cancers/hyperparam_script.sh
qsub RealData/combine_dis_metrics.r "3cancers" 2
qsub RealData/3cancers/stab_script.sh
qsub RealData/combine_stab_metrics.r "3cancers" 2
python RealData/3cancers/3cancers_python_analysis # python based methods
qsub RealData/other_results.r "3cancers" 2
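The four datasets share the same six-command pattern, differing only in the dataset folder, the number of views and the name of the Python analysis script. The sketch below makes that pattern explicit; real_data_steps is a hypothetical helper that only prints the commands as they appear above and submits nothing.

```shell
# Print the six commands for one dataset, in the order used above.
# Arguments: dataset folder, number of views, Python analysis script name.
# real_data_steps is a hypothetical helper; it prints and submits nothing.
real_data_steps() {
  dataset=$1
  n_views=$2
  py_script=$3
  printf '%s\n' "qsub RealData/$dataset/hyperparam_script.sh"
  printf '%s\n' "qsub RealData/combine_dis_metrics.r \"$dataset\" $n_views"
  printf '%s\n' "qsub RealData/$dataset/stab_script.sh"
  printf '%s\n' "qsub RealData/combine_stab_metrics.r \"$dataset\" $n_views"
  printf '%s\n' "python RealData/$dataset/$py_script # python based methods"
  printf '%s\n' "qsub RealData/other_results.r \"$dataset\" $n_views"
}

# The script name is passed explicitly because it does not always
# match the dataset folder (bbcsport uses bbc_python_analysis).
real_data_steps "bbcsport" 2 "bbc_python_analysis"
```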
The results for ResNMTF applied on the various real datasets are obtained from the investigations above, producing the following results:
Figure 5A
Figure 5B
Figure 6B
Lastly, Figure 6A is produced via the following code chunk, which
creates and saves the bisilhouette plot for the shuffled synthetic data.
This is saved to
Exploration/visual_data/shuffled_bisil_plot.pdf.
source("Exploration/visualisation_ex.r")
Figure 7A
The following code chunk creates and saves the two subfigures in the supplementary material.
source("Exploration/visualisation_ex.r")
These are saved to
Exploration/convergence_data/error_plot.pdf
and
RealData/3cancers/data/stability/error_plot.pdf
.
Figure S1A
Figure S1B
The following code chunk creates and saves the two subfigures in the supplementary material.
source("Exploration/JSD_diff_dists.r")
These are saved to Exploration/f_data_dists.pdf
and
Exploration/shuffled_data_dists.pdf
.
Figure S2A
Figure S2B