Overview

This document provides a step-by-step guide to reproducing the results of the paper “Multi-view biclustering via non-negative matrix tri-factorisation”. This webpage is generated from the reproduce_results.rmd file in the resNMTF_paper GitHub repository. It is designed to be run on a local macOS/Linux machine with R, Python and a shell available.

Details on how to run the file and reproduce the results are provided below. To begin, some details on setup and requirements are provided.

Which results are reproduced here?

Due to the extensive number of simulation studies, we have only included the code to reproduce the results for those in the main body of the text. Further instructions to reproduce the results for those in the supplementary material are detailed at relevant points in the document. All figures included in the poster for the Bioinference 2025 conference are from the main body of the paper and as such are reproduced here.

Each figure included in this webpage is generated by the code chunk that precedes it.

Note: Due to long run times several modifications have been made:

  1. The simulation studies in the paper were run with 100 repetitions; here this has been reduced to 3. To run with the full number of repetitions, change the n_reps parameter in the relevant code chunk. Each repetition adds just over an hour to the run time.
  2. The code chunks that reproduce the results on the real data take at least a day each, substantially longer than the simulations. They have therefore been set to eval=FALSE and include=FALSE to prevent the code from running. To reproduce these results, remove these two arguments.

The runtime for this document (on a 2024 M4 MacBook Pro with 16GB RAM) without the real data applications is approximately 3.5 hours. The runtime for each real data application is approximately 1-2 days (on a cluster).
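As a rough guide to the cost of raising n_reps, the sketch below scales the "just over an hour per repetition" figure linearly. The value of 70 minutes per repetition is an assumption standing in for "just over an hour", not a measured number, and the helper name estimate_runtime is purely illustrative.

```shell
# Rough runtime estimate for one simulation study.
# MINUTES_PER_REP=70 is an ASSUMPTION standing in for "just over an hour".
MINUTES_PER_REP=70

estimate_runtime() {
  # $1 = number of repetitions (n_reps)
  total=$(( $1 * MINUTES_PER_REP ))
  echo "$(( total / 60 ))h $(( total % 60 ))m"
}

estimate_runtime 3     # the reduced setting used in this document
estimate_runtime 100   # the setting used in the paper
```

Under this assumption, 100 repetitions of a single study would take several days on a laptop, which is why a cluster is the more realistic option for the full setting.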

Any figures from the paper that are not produced by code (i.e. illustrations) are not included in this document.

Document structure

The document is structured as follows:

  • Setup: This section details the software and packages required to run the code. It also includes instructions on how to set up the environment.
  • Results: This section details the code to reproduce the results in the paper. It is divided into three sections: Simulation studies, Real data analysis and Supplementary material figures. Each contains the code to reproduce the relevant results.

Setup

This section details the software required to run the code and includes details to install relevant packages as well as set up required environments.

As the code was developed using shell scripts, a shell is required. If you are on Windows, please install a Unix-like shell environment (e.g. Git Bash or Cygwin).

Important: The following assumes you are using a zsh shell. If you are using a different shell, please adjust the commands in the code chunks accordingly.
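Before going further, it can help to confirm that the required tools are on your PATH. The following is a minimal sketch: check_tools is a hypothetical helper (not part of the repository), and the list of tool names is an assumption about a typical installation.

```shell
# Report which of the named command-line tools are missing from PATH.
# check_tools is a hypothetical helper, not part of the repository.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# The tool names below are assumptions about a typical setup.
check_tools zsh git R Rscript python pip \
  || echo "Install the tools listed above before continuing."
```

The function returns a non-zero exit code if any tool is missing, so it can also be used as a guard at the top of a script.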

R

1. Clone the repository

To run this document, you need to clone the GitHub repository to your local machine and knit the reproduce_results.rmd file. The repository can be found here.
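The clone-and-knit step can also be done from the command line. The sketch below assumes that the rmarkdown R package is installed and that the clone URL is passed in by you (it is deliberately not filled in here; copy it from the repository page). The function name clone_and_knit and the target folder name are illustrative choices.

```shell
# Clone the repository and knit the R Markdown file from the shell.
# Assumptions: the rmarkdown R package is installed; the clone URL is
# supplied by the user (copy it from the repository page).
clone_and_knit() {
  if [ -z "$1" ]; then
    echo "usage: clone_and_knit <repository-url>" >&2
    return 2
  fi
  # Run in a subshell so the caller's working directory is unchanged.
  (
    git clone "$1" resNMTF_paper &&
    cd resNMTF_paper &&
    Rscript -e 'rmarkdown::render("reproduce_results.rmd")'
  )
}
```

Knitting from within RStudio is equivalent; the command-line route is mainly useful on machines without a graphical session.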

2. Install R packages

Next, install the required packages. This work requires our R packages resnmtf and bisilhouette, which are available on GitHub and can be installed using the devtools package. Their GitHub pages can be found here and here respectively.

# Install and load required packages
source("install_packages.r")
library(devtools)
devtools::install_github("eso28599/resnmtf")
devtools::install_github("eso28599/bisilhouette")
library(resnmtf)
library(bisilhouette)
library(knitr)
library(magick)

Python

As one of the methods ResNMTF is compared against is written in Python, an environment is provided for reproducibility. This assumes you have Python and pip available on your machine. If you do not, please install them.

Conda environment installation (default)

chmod +x ./conda_env.sh
source conda_env.sh

Pip environment installation

Alternatively, if you do not have conda installed, you can create a virtual environment using pip. This can be done by enabling evaluation of the following code chunk and running it.

If you don’t have pyenv installed, please install it following the instructions here. This allows you to create a virtual environment with a specific version of Python, which is needed to run the iSSVD method.

chmod +x ./virtual_env.sh
./virtual_env.sh

You will also need to uncomment the source line in the SimStudy/run_sim.sh file to activate the environment using pip rather than conda.
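To locate the line to uncomment, a search such as the one below may help. It is only a sketch: it assumes the line is commented out with a leading # followed by the word source, and find_commented_source is a hypothetical helper name.

```shell
# List commented-out "source" lines (with line numbers) in a shell script.
# Assumes the line looks like "# source ..."; adjust the pattern if not.
find_commented_source() {
  grep -n '^[[:space:]]*#[[:space:]]*source ' "$1"
}

find_commented_source SimStudy/run_sim.sh 2>/dev/null \
  || echo "No commented source line found (or file not present here)."
```

Once the line is found, remove the leading # in a text editor and make sure any conda activation line is commented out in turn, if one is present.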

Results

This section details the code to reproduce the results from the paper, with each code chunk corresponding to a figure or table in the paper. Each figure included in this document is generated by the code chunk that precedes it.

Simulation studies

Figure 4

(A-B) Increasing number of biclusters

To run the simulation study for increasing the number of biclusters (on the base scenario of three views), with 3 repeats, the following code chunk can be used:

export sim="bicl"
export sim_folder_name="bicl_3v"
export seq=(3 4 5 6)
export n_reps=3
./SimStudy/run_sim.sh $sim $sim_folder_name $n_reps $seq

Tip: The full results produced can now be found in the SimStudy/Results/bicl_3v folder, individually under various file names (F_score_plot.pdf, BiS_plot.pdf, etc.) or in one pdf (results.pdf).
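A quick way to confirm that the run finished is to check that the expected files exist. This is a minimal sketch: check_outputs is a hypothetical helper, and only the file names mentioned in the tip above are checked.

```shell
# Report any expected result files missing from a results folder.
# check_outputs is a hypothetical helper, not part of the repository.
check_outputs() {
  results_dir="$1"
  shift
  rc=0
  for f in "$@"; do
    if [ ! -f "$results_dir/$f" ]; then
      echo "not found: $results_dir/$f" >&2
      rc=1
    fi
  done
  return "$rc"
}

check_outputs SimStudy/Results/bicl_3v F_score_plot.pdf BiS_plot.pdf results.pdf \
  || echo "Some outputs are missing; check that run_sim.sh completed."
```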

Some of the results produced are shown below:

Figure 4A

Figure 4B

The remaining experiments can be reproduced by setting the sim, sim_folder_name and seq variables to the relevant values in the code chunk above. The values for the other experiments in Figure 4, listed in the order sim, sim_folder_name, seq, are:

  • (C-D) Increasing number of views: views, views_5b, 2 3 4 5
  • (E-F) Increasing number of individuals: indiv, indiv_3v5b, 50 200 300 500 1000
  • (G-H) Increasing level of noise: noise, noise_3v5b, 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
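The experiments listed above can be launched with a small wrapper that keeps the argument order of the earlier chunk (sim, sim_folder_name, n_reps, then the seq values). This is a sketch only: run_experiment and DRY_RUN are hypothetical names, and by default the wrapper just prints the command rather than starting a long simulation. Passing the sequence values as separate arguments also sidesteps a shell difference: an unquoted $seq array expands to all elements in zsh but only to the first element in bash, where "${seq[@]}" would be needed instead.

```shell
# Dry-run wrapper around SimStudy/run_sim.sh (hypothetical helper).
# Prints the command unless DRY_RUN=0 is set in the environment.
run_experiment() {
  if [ "$#" -lt 4 ]; then
    echo "usage: run_experiment <sim> <sim_folder_name> <n_reps> <seq values...>" >&2
    return 2
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo ./SimStudy/run_sim.sh "$@"
  else
    ./SimStudy/run_sim.sh "$@"
  fi
}

n_reps=3
run_experiment views views_5b "$n_reps" 2 3 4 5
run_experiment indiv indiv_3v5b "$n_reps" 50 200 300 500 1000
run_experiment noise noise_3v5b "$n_reps" 1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100
```

Set DRY_RUN=0 before calling the function to actually start the simulations once the printed commands look right.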

The variable values for the remaining experiments are detailed in the readme file. For a specific simulation, the seq values are found in the for loop of the corresponding SimStudy/Scripts/[sim_folder_name]_script_one.sh file.
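To read the sequence values without opening the script, the for-loop header can be printed directly. The sketch below assumes the loop header starts a line with the keyword for; show_seq is a hypothetical helper name.

```shell
# Print the for-loop header(s) of a cluster script, with line numbers.
# Assumes the loop header begins a line with "for ".
show_seq() {
  grep -n '^[[:space:]]*for ' "$1"
}

sim_folder_name="bicl_3v"
show_seq "SimStudy/Scripts/${sim_folder_name}_script_one.sh" 2>/dev/null \
  || echo "No for loop found (or script not present here)."
```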

Note: The experiments were initially run on clusters. The scripts to run the simulations on clusters can be found in the SimStudy/Scripts folder, where each experiment has a [sim_study]_script_one.sh and a [sim_study]_script_two.sh file.

Real data analysis

The results for ResNMTF applied to the various real datasets are obtained from the investigations into the effect of distance metrics and hyperparameter tuning (producing RealData/all_results_dis.txt). This can be implemented for each dataset by running the relevant chunks below.

3sources

  1. Run the restriction hyperparameter study on an HPC system.
qsub RealData/3sources/hyperparam_script.sh
  2. Combine the results once this has run. This produces the files “RealData/3sources/data/euc_results.csv”, “RealData/3sources/data/cos_results.csv”, “RealData/3sources/data/mann_results.csv” and “RealData/3sources/distance_study.csv”.
qsub RealData/combine_dis_metrics.r "3sources" 3
  3. Run the stability study.
qsub RealData/3sources/stab_script.sh
  4. Process the results. This produces the files “RealData/3sources/3sources_stab_ResNMTF_F.csv”, “RealData/3sources/3sources_stab_ResNMTF_BiS.csv”, “RealData/3sources/3sources_stab_NMTF.csv”, “RealData/3sources/stab_plot_ResNMTF_BiS.pdf”, “RealData/3sources/stab_plot_ResNMTF_F.pdf” and “RealData/3sources/stability_study.csv”. The last file contains the results found in Table 2, and the last plot corresponds to Figure 5A.
qsub RealData/combine_stab_metrics.r "3sources" 3
  5. Produce the results of the competing methods, found in “RealData/3sources/other_results.csv”, completing the results found in Table 2 for 3sources.
python RealData/3sources/3sources_python_analysis # Python-based methods
qsub RealData/other_results.r "3sources" 3

The analysis of the other real datasets follows a similar structure.

bbcsport

  1. Run the restriction hyperparameter study on an HPC system.
qsub RealData/bbcsport/hyperparam_script.sh
  2. Combine the results once this has run.
qsub RealData/combine_dis_metrics.r "bbcsport" 2
  3. Run the stability study.
qsub RealData/bbcsport/stab_script.sh
  4. Process the results.
qsub RealData/combine_stab_metrics.r "bbcsport" 2
  5. Produce the results of the competing methods.
python RealData/bbcsport/bbc_python_analysis # Python-based methods
qsub RealData/other_results.r "bbcsport" 2

A549

  1. Run the restriction hyperparameter study on an HPC system.
qsub RealData/single_cell/hyperparam_script.sh
  2. Combine the results once this has run.
qsub RealData/combine_dis_metrics.r "single_cell" 2
  3. Run the stability study.
qsub RealData/single_cell/stab_script.sh
  4. Process the results.
qsub RealData/combine_stab_metrics.r "single_cell" 2
  5. Produce the results of the competing methods.
python RealData/single_cell/single_cell_python_analysis # Python-based methods
qsub RealData/other_results.r "single_cell" 2

TCGA

  1. Run the restriction hyperparameter study on an HPC system.
qsub RealData/3cancers/hyperparam_script.sh
  2. Combine the results once this has run.
qsub RealData/combine_dis_metrics.r "3cancers" 2
  3. Run the stability study.
qsub RealData/3cancers/stab_script.sh
  4. Process the results.
qsub RealData/combine_stab_metrics.r "3cancers" 2
  5. Produce the results of the competing methods.
python RealData/3cancers/3cancers_python_analysis # Python-based methods
qsub RealData/other_results.r "3cancers" 2
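Since all four datasets follow the same five steps, the commands can be generated from three inputs: the dataset folder name, the numeric argument passed to the R scripts, and the name of the Python analysis script. The sketch below only prints the commands and does not submit anything. real_data_steps is a hypothetical helper, and the meaning of the numeric argument is not stated in this document, so it is passed through unchanged.

```shell
# Print the five-step command sequence for one real dataset.
# real_data_steps is a hypothetical helper; it submits nothing.
# $1 = dataset folder, $2 = numeric argument used by the R scripts
# (meaning not stated in this document), $3 = Python analysis script name.
real_data_steps() {
  dataset="$1"
  num="$2"
  py_script="$3"
  printf 'qsub RealData/%s/hyperparam_script.sh\n' "$dataset"
  printf 'qsub RealData/combine_dis_metrics.r "%s" %s\n' "$dataset" "$num"
  printf 'qsub RealData/%s/stab_script.sh\n' "$dataset"
  printf 'qsub RealData/combine_stab_metrics.r "%s" %s\n' "$dataset" "$num"
  printf 'python RealData/%s/%s\n' "$dataset" "$py_script"
  printf 'qsub RealData/other_results.r "%s" %s\n' "$dataset" "$num"
}

real_data_steps bbcsport 2 bbc_python_analysis
```

Each step depends on the output of the previous one, so the printed commands should be submitted one at a time, waiting for each job to finish, rather than piped into a shell in one go.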

Results

The investigations above produce the following results for ResNMTF applied to the various real datasets:

  • Table 2, main F-score results
  • Figure 5A, relevance against increasing \(\omega\) at “RealData/3sources/stab_plot_ResNMTF_F.pdf”
    Figure 5A

  • Figure 5B, illustrating correlation of F score and the bisilhouette score at “RealData/single_cell/f_score_bis_sil.pdf”
    Figure 5B

  • Figure 6B, the bisilhouette plot for the 3sources dataset at “RealData/3sources/data/stability_ResNMTF_F/bisil_plot.pdf”
    Figure 6B

  • Table S1, the distance study at “RealData/all_results.csv”
  • Table S2, the bisilhouette scores of the analysis

Lastly, Figure 6A is produced via the following code chunk, which creates and saves the bisilhouette plot for the shuffled synthetic data. This is saved to Exploration/visual_data/shuffled_bisil_plot.pdf.

source("Exploration/visualisation_ex.r")

Figure 6A

Supplementary material figures

Figure S1

The following code chunk creates and saves the two subfigures in the supplementary material.

source("Exploration/visualisation_ex.r")

These are saved to Exploration/convergence_data/error_plot.pdf and RealData/3cancers/data/stability/error_plot.pdf.

Figure S1A

Figure S1B

Figure S2

The following code chunk creates and saves the two subfigures in the supplementary material.

source("Exploration/JSD_diff_dists.r")

These are saved to Exploration/f_data_dists.pdf and Exploration/shuffled_data_dists.pdf.

Figure S2A

Figure S2B