This repository is a re-worked version of Alvaro’s original pipeline for peak picking from REIMS data. Due to various errors and problems with package dependencies, the code has been split into 2 mini pipelines. The first half of the data pre-processing is performed using a Python environment, and the remainder is performed in the same RStudio environment as required in Alvaro’s original implementation.
The biggest change is that Python functions are not called from RStudio. This reduces the level of complexity for debugging any issues that arise, although it does not get away from problems with requiring certain versions of R packages.
If you have any bugs/errors then please log these using the Issues form, which can be found in the menubar above. Please provide as much information as possible, and a minimal working example so that we can replicate a similar error. (You’ll need to provide the paths to the files/metadata you were trying to process, the parameters, etc…).
Before reporting an issue, please ensure that you have read all of the information here and in the README.md file.
If your problem relates to files failing during processing, you can use the notebook `H5_File_Checker.ipynb` to determine which files may have failed. It can also be used to write out a new metadata file without the failed files, in order to continue with the R pipeline. Similarly, `Check_R_Packages.ipynb` can be used to verify that the correct R packages are installed.
Before you get too far, please note that you will need a Windows PC for this code to function.
Each ‘Release’ is supposed to be a stable version of the code which works as expected. If you choose the ‘Download’ option instead, you will get the most recent version of the code from whichever branch you have selected (N.B. the default branch is called `main`). This may not work as expected because it may be in the midst of development.
You will need to download a copy of the code in this repository to your computer. There are a couple of ways of doing this, as shown in the image.
The functions used in the R part of the pipeline depend on very specific package versions, many of which can no longer be downloaded from CRAN (which is where most of them are managed). These packages are not included in this GitHub repository (because they occupy a lot of space), but are instead downloaded via a link found in the README. When you download this code, you should add the packages folder directly into this repository’s folder.
There is a Python notebook called `Check_R_Packages.ipynb` which you can run; it will tell you if any packages are missing or have incorrect versions.
In addition to the two downloads mentioned above, you will need to download a few other things. These are mentioned in the Install section.
In addition to this code and the packages folder, you will also need:

- a Conda installation (for the Python pipeline)
- R version 3.4.4 and RStudio (for the R pipeline)

In addition to the full-sized Conda suggested, you may also wish to consider Miniconda, which is discussed in the next section.
There may be a few ways to get the Python environment working, but this is the method that I used and should be able to help with.
Launch a command prompt (e.g. the Anaconda Prompt). If your files are on another drive, switch to it by typing the drive letter (e.g. `E:`), then type `jupyter lab` or `jupyter notebook` to launch Jupyter. To create the Python environment, enter the following commands into the Terminal, answering `y` as required. Note that the `#` symbols are comments and shouldn’t be typed.

```
# This creates a new environment called 'rawh5'
conda create --name rawh5 python=3.9.2

# Activate this environment
conda activate rawh5

# Install all of these packages
conda install scipy h5py numba numpy pandas pywavelets matplotlib scikit-image scikit-learn statsmodels
```
You can, of course, provide an alternative environment name other than `rawh5`. Should you ever forget what you called the environment, you can get a list of the available environments with the command `conda env list`.
That should be all that is required to install the Python environment. When you come back to it from now on, you should only need to type `conda activate rawh5` (or whatever you called your environment) into the Terminal.
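In other words, a typical return session (a sketch, assuming you kept the `rawh5` name) is just:

```
# Reactivate the environment and launch Jupyter
conda activate rawh5
jupyter lab
```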
Conda is a large package manager that installs a lot of surplus software onto your computer. Miniconda may be a more useful alternative, but only if you are familiar with the command line, as many of the visual parts of Conda are not available. You can find a recent version here. Once installed, open Anaconda Prompt (Miniconda3) from the Windows start menu and follow the code example above to create an environment. You may also wish to install either Jupyter Lab or Notebook (as these won’t be installed by default, unlike with Conda).
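For example, a minimal way to add JupyterLab under Miniconda (assuming you are happy to use the conda-forge channel) is:

```
# Install JupyterLab into the currently active environment
conda install -c conda-forge jupyterlab
```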
These are the same requirements as for Alvaro’s original pipeline, which exclusively used RStudio. Implementations of R itself, and R packages, can change quite considerably from version to version. It is because of this that these instructions are quite rigid, to ensure that the correct packages are installed. Note that if you already have a working RStudio installation, you do not need to complete these stages.
Open RStudio and ensure that the version of R being used is 3.4.4; if not, change it via Tools > Global Options. Set the working directory to the repository folder, e.g. `setwd("E:/Github/reimsPP/")` or the equivalent for your computer. Then enter the following commands into the console, answering `y` as required. Note that the `#` symbols are comments and shouldn’t be typed.

```r
# This command ensures that RStudio looks in the 'packages' folder
.libPaths("./packages")

# Each of the following commands installs a single package.
# Verify that each has been successfully installed before continuing.
install.packages("MALDIquantForeign")
install.packages("doParallel")
install.packages("iterators")
install.packages("reticulate")
install.packages("plotly")
install.packages("htmltools")
install.packages("shiny")
install.packages("dplyr")
install.packages("tictoc")
```
If you encounter any errors during the installation of these packages in R, please return to Python and run the notebook called `Check_R_Packages.ipynb`. This should identify any problems with missing packages or incorrect versions. If any issues are highlighted, please download those packages from the shared folder and try again.
You’ll need to run the Python code first in order to perform the pre-processing of the files. The final output from the Python code is then used by the R code to align the files, perform across-file peak matching, and generate a CSV file for statistical analysis.

In addition to the ‘main’ Python and R files, there are two additional files which might be of some use…
`Check_R_Packages.ipynb`

This Python notebook can be used to check whether you have the correct R packages installed, as specified in this wiki.

`H5_File_Checker.ipynb`

Use this file after the Python preprocessing to see whether all of the files were successfully processed. For various reasons, some files may not be fully processed (e.g. no peaks detected); this notebook allows you to write a new metadata file without these failed files, which can then be used in the R pipeline without causing crashes.
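For a sense of the kind of check involved, here is a minimal Python sketch; the `filename` column, the `.h5` naming convention, and all paths are hypothetical assumptions, so use the notebook itself rather than this snippet.

```python
import h5py
import pandas as pd

# Hypothetical sketch of a post-processing file check. All paths and the
# 'filename' column are placeholder assumptions, not the notebook's actual code.
meta = pd.read_csv('C:/Path/To/Metadata.csv')
keep = []
for name in meta['filename']:
    h5_path = f'J:/Folder/For/Results/Final/{name}.h5'
    try:
        with h5py.File(h5_path, 'r') as f:
            keep.append(len(f.keys()) > 0)  # crude test: file opens and contains data
    except OSError:
        keep.append(False)  # missing or unreadable file counts as failed

# Write a reduced metadata file containing only the successful files
meta[keep].to_csv('C:/Path/To/Metadata_without_failures.csv', index=False)
```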
You’ll need to launch Anaconda and your Jupyter installation of choice, and then activate the environment that you created previously. In the Terminal type `conda activate rawh5` or similar; if successful, the prompt should change from `(base)` to `(rawh5)`.
The file to open and edit is called Raw2H5_PeakPick.ipynb. Within this file, you will need to navigate to the first editable cell and update the parameters accordingly. Descriptions for each are given below.
`metadataName = 'C:/Path/To/Metadata.csv'`

Specify the exact or relative path to the metadata (CSV) file. Exact paths begin with the drive letter (e.g. `C:/REIMS_PP/Results/`) and are recommended due to their lack of ambiguity.

`rawPath = 'Z:/Path/To/Raw/Data/'`

The location of the raw files. They should all be in one single folder, not nested.

`resPath = 'J:/Folder/For/Results/'`

A folder into which the results will be saved.

`method = 'average+filtering2'`

- `'average+filtering2'` (default): average the scans and filter noise using an FFT with more levels of noise detection.
- `'average+filtering'`: average the scans but don’t filter noise. Only use this option when you have problems with the filtering algorithms on your data.
- `'convert'`: not part of Alvaro’s original pipeline. Instead, it simply reads the raw files and exports the chosen scans to an H5 file. This option is not compatible with further analysis in the RStudio pipeline.

`scans = 'csv'`

- `'csv'` (default): uses the start/end scans as specified in the metadata CSV.
- `'all'`: only recommended as part of the `'convert'` option above; this will cause problems with the RStudio pipeline.

`smooth_input = 19` (default)

This is the parameter used for smoothing the interpolated spectra.

`step_size_input = 5` and `mz_step_input = 5` (defaults)

These are used as interval sizes for determining the noise levels in the spectra. It is recommended that you leave these values unchanged.
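By way of illustration only, a filled-in parameters cell might look like the following; the paths here are hypothetical placeholders.

```python
# Hypothetical example values for the 'Parameters to change' cell
metadataName = 'C:/REIMS_PP/Metadata.csv'    # exact path to the metadata CSV
rawPath = 'D:/REIMS_Raw/'                    # one single, non-nested folder of raw files
resPath = 'C:/REIMS_PP/Results/'             # folder into which results are saved
method = 'average+filtering2'                # default: average scans + FFT noise filtering
scans = 'csv'                                # default: use start/end scans from the metadata
smooth_input = 19                            # default smoothing parameter
step_size_input = 5                          # default noise-interval size; leave unchanged
mz_step_input = 5                            # default noise-interval size; leave unchanged
```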
Alter the various parameters in the first editable cell (called ‘Parameters to change’) according to the options specified above. Ensure that the spelling is correct and that there are no errant spaces for paths and methods.
Once these are correct, you can run each cell in turn and the results should be output to the results folder that you specified.
If you encounter any interesting errors/bugs/etc., please log them in the Issues tab in the toolbar at the top.
Open RStudio and ensure that the version of R being used is 3.4.4. This is visible at the top of the console window. If it is another version, you can change it via Tools > Global Options.
Use the files browser within RStudio to navigate to the repository folder.
Alternatively, run the `setwd` command, as in `setwd(dir = 'C:/Folder/reimsPP/')`.
Open the file called Raw2H5_Combine.R by clicking on it in the files browser. The second section contains the various parameters that need to be defined. Edit these accordingly, being careful to ensure that they are entered as requested.
`rootDir = 'J:/Folder/For/Results/Final/'`

Where are the processed H5 files? They are likely to be in a subfolder of the results folder called ‘Final’, so ensure that you provide the path to this ‘Final’ folder. [The reason being that some intermediate H5 files are also produced.]

`metadataPathName = 'C:/Path/To/Metadata.csv'`

As for the Python pipeline, specify the path to the CSV metadata file.

`doLockmass = FALSE`

- `FALSE`: do not perform lock mass correction. Note the lack of speech marks ‘ ’ around the logical `TRUE` and `FALSE` values.
- `TRUE`: perform lock mass correction using the m/z value specified below.

`lockmass_mz = 554.2615`

The default m/z for lock mass correction is LeuEnk with m/z = 554.2615. This may not be suitable for your data, so check accordingly. Note that this has no effect if `doLockmass = FALSE`.

`smooth_input = 19`

This parameter is not used in RStudio, and will be removed in future versions.

`peakMatchTol = 50`

The peak matching tolerance (in ppm) for assimilating the peak lists from the various files. For example, at m/z 554.2615 a 50 ppm tolerance corresponds to roughly ±0.028 m/z (554.2615 × 50 × 10⁻⁶ ≈ 0.0277).

`postprocessing = '~'`

- `'on-off'`: use very carefully. On-off biomarker detection is a method that overfits very easily. It creates a profile per class and combines the different classes.
- `'classification'`: doesn’t take the classes into account when building the list of peaks, giving more peaks in the final profile.

Once you have edited the parameters in §2 as required, you can start to run the subsequent sections. Work down each one, the next being to load the dependent packages.
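For reference, a filled-in parameters section might look like the following sketch; the paths are hypothetical placeholders.

```r
# Hypothetical example values for §2 of Raw2H5_Combine.R
rootDir = 'C:/REIMS_PP/Results/Final/'        # the 'Final' folder of processed H5 files
metadataPathName = 'C:/REIMS_PP/Metadata.csv' # same metadata CSV as the Python pipeline
doLockmass = FALSE                            # logical: no speech marks
lockmass_mz = 554.2615                        # LeuEnk; ignored when doLockmass = FALSE
smooth_input = 19                             # not used in RStudio
peakMatchTol = 50                             # peak matching tolerance in ppm
postprocessing = '~'                          # or 'on-off' / 'classification'
```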
Whilst intermediate outputs are saved in case of problems, it is expected that the entire pipeline is completed in one session.
The ‘final’ result, in §9, is a CSV file called ‘matrix_for_postprocessing.csv’. This should be saved in the results folder, and is ready for further analysis using other resources (there are no statistical resources included in this repository).
The Python and RStudio pipelines work independently of each other, so you can complete them at separate times. Once you have generated the H5 files using the Python pipeline, then you can change certain parameters within RStudio without having to return to the Python code.
No substantial modifications have been made to the code with respect to Alvaro’s original implementation. This was confirmed by comparison of CSV files from the original pipeline and this newer implementation.