Basic GUIDE (version with plotly visualisation and updated functions)¶

Rodin Class object¶

In [1]:
import rodin

The Rodin class requires two data files: a features table and a class labels file.

  1. Features File:
    • The first two columns must represent 'mass to charge' ratios and 'retention times' values, respectively.
      (for targeted analysis: the first column should include the annotation of metabolites, parameter feat_stat must be set to 'ann'.)
    • All subsequent columns should contain the intensities.
  1. Class Labels File:
    • The first column should contain sample IDs. These IDs must correspond to the names of the columns in the features table.
    • Additional columns in the class labels file represent the classes of the samples, such as treatment groups, age, etc. The names of these columns are not fixed.

To create an object of the Rodin class using these files, use the following command:

In [2]:
obj=rodin.create_object_csv("./data/features.csv","./data/class_labels.csv",
                            feat_sep=',',class_sep=',')

Once the Rodin object is created, it has three attributes:

  1. X: This attribute holds the intensities data from the samples.

  2. features: This contains the metabolites features data, including 'mass to charge' ratios and 'retention times'.

  3. samples: This attribute includes the classes data, which correspond to various characteristics of the samples such as treatment groups, age, etc.

In [3]:
print(obj.samples)
   Sample ID     Dose     Sex
0        N10  Control  Female
1         H3     High  Female
2        N23  Control  Female
3        N21  Control  Female
4         H9     High  Female
..       ...      ...     ...
66       H15     High    Male
67       H16     High    Male
68       H21     High    Male
69       H27     High    Male
70       H22     High    Male

[71 rows x 3 columns]

Pre-Processing¶

The transform function offers a convenient method for preprocessing data, comprising several key steps:

  1. Imputation: Fills in missing data points within the dataset.

  2. Filtering Features (thresh): Eliminates features that have more than a specified threshold of missing values. This step ensures the quality and reliability of the dataset by removing less informative features.

  3. Data Normalization (norm): Normalizes the data using one of two methods:

    • Quantile Normalization ('q'): Adjusts the distributions of intensities across samples to be more similar.
    • Total Intensity Normalization ('t'): Divides each value in a column by the sum of values in that column, normalizing the data based on the total intensity of each feature.
  4. Log2 Transformation (log): Applies a log2 transformation to the normalized data. This transformation is useful for stabilizing variance across the dataset and making the data more suitable for linear models and other statistical analyses.

  5. Scaling (scale): Scales the data by dividing each feature by its unit variance.

These preprocessing steps are crucial for preparing the data for further analysis, ensuring that it is clean, normalized, and ready for accurate and meaningful interpretation. However, you are able to skip particular step by setting its parameter to None.

In [4]:
obj.transform(thresh=0.5,norm='q')
Number of features filtered: 1972
Out[4]:
< Rodin object > 
dim: 24233 X 71

Statistical Tests¶

After the data in the X attribute has been transformed, statistical tests can be performed. In this basic guide, we will focus on using one-way ANOVA (Analysis of Variance).

  • One-way ANOVA: This statistical test is used to determine whether there are any statistically significant differences between the means of three or more independent groups.

To perform one-way ANOVA using the Rodin class:

  1. Provide a column from the samples attribute as an argument. This column should represent the different groups or classes among your samples.
  2. Run the function designated for performing ANOVA.

Following the execution of one-way ANOVA, the features attribute of the Rodin object will be updated. It will now include data about the statistical tests performed. The p_adj column displays the p-values adjusted using the Benjamini-Hochberg method.

In [5]:
obj.oneway_anova('Dose')
Out[5]:
mz time p_value(owa) Dose p_adj(owa) Dose
0 85.0284 38.3 0.867775 0.929819
1 85.0285 213.1 0.003990 0.036638
2 85.0285 213.1 0.817019 0.900804
3 85.0285 69.4 0.960532 0.979629
4 85.0285 69.4 0.006463 0.049960
... ... ... ... ...
26200 1245.6035 276.0 0.329061 0.536328
26201 1253.8981 29.2 0.000134 0.003188
26202 1255.9144 29.4 0.000485 0.008308
26203 1274.1295 40.1 0.154725 0.343972
26204 1274.1842 39.9 0.008926 0.060915

24233 rows × 4 columns

The fold_change function is recommended for pathway analysis as it calculates the log fold change difference between classes in the dataset. The function uses the first class appearing in the specified column of the samples attribute as the reference, if it is not provided as parameter.

In [6]:
obj.fold_change('Dose')
Out[6]:
mz time p_value(owa) Dose p_adj(owa) Dose lfc (High vs Control) lfc (Low vs Control) lfc (others vs Control)
0 85.0284 38.3 0.867775 0.929819 -0.092725 -0.046480 -0.069603
1 85.0285 213.1 0.003990 0.036638 0.128757 -0.255070 -0.063157
2 85.0285 213.1 0.817019 0.900804 -0.052215 0.080238 0.014011
3 85.0285 69.4 0.960532 0.979629 -0.042041 0.012479 -0.014781
4 85.0285 69.4 0.006463 0.049960 -0.140702 0.161327 0.010312
... ... ... ... ... ... ... ...
26200 1245.6035 276.0 0.329061 0.536328 0.253545 0.079541 0.166543
26201 1253.8981 29.2 0.000134 0.003188 0.394721 0.792212 0.593466
26202 1255.9144 29.4 0.000485 0.008308 0.220599 0.734591 0.477595
26203 1274.1295 40.1 0.154725 0.343972 0.286580 0.063715 0.175147
26204 1274.1842 39.9 0.008926 0.060915 0.185690 -0.249075 -0.031692

24233 rows × 7 columns

With the same approach the following tests are available:

ttest('categorical')
pls_da('categorical')

twoway_anova(['cat1','cat2'])

sf_lg('binary') - Logistic Regression
sf_lr('continuis') - Linear Regression

rf_class('categorical') - Random Forest Classifier
rf_regress('continuis') - Random Forest Regressor

Additonal parameteres as degree, moderator or n_estimators could be provided.

Visualisation¶

Plotly library is used by default, to get static plots set interactive parameter to False (except of volcano plot)

Results of statistical tests could be visualized using volcano function.

In [7]:
obj.volcano(p='p_adj(owa) Dose', effect_size='lfc (High vs Control)', sign_line=0.00001, effect_size_line=[-1,1])

Metabolites of interest could be plotted using boxplot, violinplot and regplot providing the raw index numbers (indexes from the original data are saved in the Rodin object)

In [8]:
obj.boxplot(rows=[9999,4561],
               hue='Dose',category_order=['Control','Low','High'])

Slicing¶

The Rodin class also offers a convenient way to slice data, enhancing the flexibility and specificity of data analysis. When slicing, for example, by a certain criterion such as p-value in the features table, all main attributes of the Rodin object are updated correspondingly. This means that if you slice the features table based on p-value, the X attribute (intensities data) will be updated to reflect this change. This feature is particularly useful as it allows for the calculation of principal components on the updated dataset and facilitates the visualization of results with selected features.

Key Analysis Options Post-Slicing:

  • Principal Component Analysis (PCA): Once the data is sliced and the X attribute is updated, you can perform PCA on this refined dataset to uncover patterns and relationships.
  • Umap and t-SNE: In addition to PCA, the Rodin class supports other visualization techniques like UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding).
In [9]:
obj_2 = obj[obj.features[obj.features['p_adj(owa) Dose']<0.01]]
In [10]:
obj_2.run_pca()
obj_2.run_umap()

After running dimenional reduction methods n-components will be stored in the class as well in dr attribute

In [11]:
obj_2
Out[11]:
< Rodin object > 
dim: 1518 X 71
dr: pca, umap
In [12]:
obj_2.plot(dr_name='umap',hue='Dose')
In [13]:
obj_2.plot(dr_name='pca',hue='Dose')

To understand the patterns in the data - clustergram function could be utilitized:

In [16]:
obj_2.clustergram(hue='Dose',standardize='row',width=1050)

Pathway Analysis¶

The Rodin class incorporates the mummichog algorithm for pathway analysis, which is particularly useful for identifying metabolic pathways in mass spectrometry data.

Input Requirements:

  • To effectively use the mummichog algorithm, users are required to specify two key columns from the features table:
    • A column with p-values, for instance, p_adj(owa).
    • A column for statistics, where log fold change (LFC) could be used, such as in a comparison like lfc (others vs Control).
    • Mass spectrometry analysis mode such as positive or negative (our lab uses custom modes f_positive | f_negative).

Output:

  • The function analyze_pathways produces a detailed table containing information about the identified pathways and compounds.
  • The results of this analysis are stored in the uns attribute of the Rodin object. This allows for convenient access and further analysis of the pathway data.

For a comprehensive understanding of the parameters and functionality of the analyze_pathways function refer to the dedicated pathways guide.

In [17]:
obj.analyze_pathways('p_adj(owa) Dose','lfc (others vs Control)',mode='f_positive')
Started @ Fri Aug 30 14:54:57 2024

Read 24233 features as reference list.
Automatically choosing (p < 0.000100) as significant cutoff.
Using 415 features (p < 0.000100) as significant list.

Pathway Analysis...
Query number of significant compounds = 76 compounds
Resampling, 100 permutations to estimate background ...
Got 1692 cpd2mzFeatures
Got 3449 ListOfEmpiricalCompounds
Got 2198 merged ListOfEmpiricalCompounds
Got 1132 final ListOfEmpiricalCompounds
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
Pathway background is estimated on 11900 random pathway values
Out[17]:
pathway overlap_size pathway_size p-value overlap_EmpiricalCompounds (id)
0 Aspartate and asparagine metabolism 8 57 0.01646920426854886 E245,E1139,E468,E442,E430,E281,E300,E101
1 Glutamate metabolism 3 11 0.016637257373329974 E245,E442,E101
2 Vitamin H (biotin) metabolism 2 5 0.019662213259389966 E468,E1752
3 Lysine metabolism 5 29 0.019914292916561632 E468,E1054,E667,E281,E1442
4 Nitrogen metabolism 2 6 0.027056549869758844 E442,E300
... ... ... ... ... ...
114 Starch and Sucrose Metabolism 0 18 0.3344256785144106
115 Trihydroxycoprostanoyl-CoA beta-oxidation 0 1 0.3344256785144106
116 Omega-6 fatty acid metabolism 0 8 0.3344256785144106
117 Phytanic acid peroxisomal oxidation 0 8 0.3344256785144106
118 Selenoamino acid metabolism 0 20 0.3344256785144106

119 rows × 5 columns

In [18]:
obj
Out[18]:
< Rodin object > 
dim: 24233 X 71
dr: pca, umap
uns: pathways, compounds
In [19]:
obj.uns['compounds']
Out[19]:
input_row EID str_row_ion compounds compound_names m/z retention_time p_value statistic
0 51 E11 M+H[1+] C02218 2-Aminoacrylate; Dehydroalanine 88.0394 74.1 0.602135 0.029470
1 51 E14 M+2H[2+] C00439 N-Formimidoyl-L-glutamate 88.0394 74.1 0.602135 0.029470
2 52 E16 M+2H[2+] C00437 N2-Acetyl-L-ornithine 88.0579 68.0 0.627157 0.106719
3 53 E20 M+H[1+] C00555 4-Aminobutanal; 4-Aminobutyraldehyde; Butyrald... 88.0758 23.8 0.640468 -0.023287
4 54 E20 M+H[1+] C00555 4-Aminobutanal; 4-Aminobutyraldehyde; Butyrald... 88.0758 23.8 0.839262 0.065401
... ... ... ... ... ... ... ... ... ...
1756 26030 E2193 M+Na[1+] lnlccoa;CE5162;C16163;lneldccoa linoleic coenzyme A$trans,cis-2,9-octadecadien... 1048.3058 129.2 0.000440 -0.725758
1757 26044 E2190 M+Na[1+] C16218;C00510 trans-Octadec-2-enoyl-CoA$Octadecenoyl-CoA (n-... 1054.3533 287.6 0.155874 -0.213352
1758 26044 E2194 M+H[1+] CE4843;C02249;CE5150 cis,cis-11,14-eicosadienoyl-CoA$C20:4-CoA$tran... 1054.3533 287.6 0.155874 -0.213352
1759 26149 E2018 M+Na[1+] CE4817 3(S)-hydroxy-docosa-7,10,13,16,19-all-cis-pent... 1114.3193 269.9 0.936672 0.045092
1760 26173 E2198 M+Na[1+] C15613;C17343;C05448 3alpha,7alpha,12alpha-Trihydroxy-5beta-cholest... 1222.4244 290.2 0.162603 0.358422

1761 rows × 9 columns

Compounds could be aggregated by pathways using function show_compounds. Provide cutoff_eids to reflect the same eids as in pathways table.cutoff_path is used to filter the pathways of interest.

In [20]:
obj.show_compounds(cutoff_eids=0.0001,cutoff_path=0.017)
Out[20]:
str_row_ion compounds compound_names input_row m/z retention_time p_value statistic
pathway compound_id
Aspartate and asparagine metabolism E101 M+H[1+] C02356;C01026;C00334;C05145;C01205;C03284 (S)-2-Aminobutanoate; (S)-2-Aminobutanoic acid... 354 104.0707 57.2 1.797939e-07 0.187342
E281 M+H[1+] C05936;C00408 N4-Acetylaminobutanal$L-Pipecolate; Pipecolini... 1032 130.0863 72.4 6.369287e-15 -0.161556
E281 M(C13)+H[1+] C05936;C00408 N4-Acetylaminobutanal$L-Pipecolate; Pipecolini... 1073 131.0897 76.7 7.092228e-11 -0.158624
E300 M+H[1+] C02642;C00152 N-Carbamoyl-beta-alanine$L-Asparagine 1122 133.0608 79.8 9.416820e-05 -0.020385
E430 M+H[1+] C01035 4-Guanidinobutanoate 1583 146.0923 30.8 6.932411e-18 0.361329
E442 M+H[1+] C00064;C05100;C00819 L-Glutamine$3-Ureidoisobutyrate$D-Glutamine; D... 1616 147.0765 76.9 1.029164e-06 0.053800
E468 M+H[1+] C00047 L-Lysine 1623 147.1127 79.6 3.517658e-14 -0.136833
E468 M(C13)+H[1+] C00047 L-Lysine 1658 148.1161 77.7 8.028457e-12 -0.159557
E1139 M+H[1+] C01239 N-Acetyl-beta-D-glucosaminylamine 4248 221.1133 164.1 1.263207e-22 0.278673
E245 M+H[1+] C00669 gamma-L-Glutamyl-L-cysteine 5327 251.0697 69.1 8.833216e-05 -0.125672
Glutamate metabolism E101 M+H[1+] C02356;C01026;C00334;C05145;C01205;C03284 (S)-2-Aminobutanoate; (S)-2-Aminobutanoic acid... 354 104.0707 57.2 1.797939e-07 0.187342
E442 M+H[1+] C00064;C05100;C00819 L-Glutamine$3-Ureidoisobutyrate$D-Glutamine; D... 1616 147.0765 76.9 1.029164e-06 0.053800
E245 M+H[1+] C00669 gamma-L-Glutamyl-L-cysteine 5327 251.0697 69.1 8.833216e-05 -0.125672

After conducting pathway analysis in the Rodin class, users can visualize the results through barplots, violin or reg plots based on Entity IDs (EIDs) of metabolites or pathways names. By default, these plots will only display statistically significant metabolites (0.05 by default), omitting those not meeting the significance threshold. This focused approach ensures clearer and more meaningful visualizations, aiding in the interpretation of the pathway analysis outcomes.

In [21]:
obj.violinplot(eids=['E468','E1752	'],hue='Dose',significant=0.0001,category_order=['Control','Low','High'])
In [22]:
obj.boxplot(pathways=['Vitamin H (biotin) metabolism'],
               hue='Dose',significant=0.001,category_order=['Control','Low','High'],grid_dim=(1,3),figsize=(10,5))

Saving and Importing Objects in Rodin

  • Saving the Object: To save the current state of a Rodin object, use the save method. This method allows you to store the object for future use, preserving all its data and attributes.

    obj.save('path_to_save.pkl')

  • Importing the Object: If you need to continue working with a previously saved Rodin object, you can easily import it back into your environment using the import_object method. This method reloads the object with all its data and attributes intact.

    obj = rodin.import_object('path_to_saved_object.pkl')