basics

Basic GUIDE (version with plotly visualisation and updated functions)¶

Rodin Class object¶

In [1]:

import rodin

The Rodin class requires two data files: a features table and a class labels file.

Features File:
- The first two columns must represent 'mass to charge' ratios and 'retention times' values, respectively.
  (for targeted analysis: the first column should include the annotation of metabolites, parameter feat_stat must be set to 'ann'.)
- All subsequent columns should contain the intensities.

Class Labels File:
- The first column should contain sample IDs. These IDs must correspond to the names of the columns in the features table.
- Additional columns in the class labels file represent the classes of the samples, such as treatment groups, age, etc. The names of these columns are not fixed.

To create an object of the Rodin class using these files, use the following command:

In [2]:

obj=rodin.create_object_csv("./data/features.csv","./data/class_labels.csv",
                            feat_sep=',',class_sep=',')

Once the Rodin object is created, it has three attributes:

X: This attribute holds the intensities data from the samples.
features: This contains the metabolites features data, including 'mass to charge' ratios and 'retention times'.
samples: This attribute includes the classes data, which correspond to various characteristics of the samples such as treatment groups, age, etc.

In [3]:

print(obj.samples)

   Sample ID     Dose     Sex
0        N10  Control  Female
1         H3     High  Female
2        N23  Control  Female
3        N21  Control  Female
4         H9     High  Female
..       ...      ...     ...
66       H15     High    Male
67       H16     High    Male
68       H21     High    Male
69       H27     High    Male
70       H22     High    Male

[71 rows x 3 columns]

Pre-Processing¶

The transform function offers a convenient method for preprocessing data, comprising several key steps:

Imputation: Fills in missing data points within the dataset.
Filtering Features (thresh): Eliminates features that have more than a specified threshold of missing values. This step ensures the quality and reliability of the dataset by removing less informative features.
Data Normalization (norm): Normalizes the data using one of two methods:
- Quantile Normalization ('q'): Adjusts the distributions of intensities across samples to be more similar.
- Total Intensity Normalization ('t'): Divides each value in a column by the sum of values in that column, normalizing the data based on the total intensity of each feature.
Log2 Transformation (log): Applies a log2 transformation to the normalized data. This transformation is useful for stabilizing variance across the dataset and making the data more suitable for linear models and other statistical analyses.
Scaling (scale): Scales the data by dividing each feature by its unit variance.

These preprocessing steps are crucial for preparing the data for further analysis, ensuring that it is clean, normalized, and ready for accurate and meaningful interpretation. However, you are able to skip particular step by setting its parameter to None.

In [4]:

obj.transform(thresh=0.5,norm='q')

Number of features filtered: 1972

Out[4]:

< Rodin object > 
dim: 24233 X 71

Statistical Tests¶

After the data in the X attribute has been transformed, statistical tests can be performed. In this basic guide, we will focus on using one-way ANOVA (Analysis of Variance).

One-way ANOVA: This statistical test is used to determine whether there are any statistically significant differences between the means of three or more independent groups.

To perform one-way ANOVA using the Rodin class:

Provide a column from the samples attribute as an argument. This column should represent the different groups or classes among your samples.
Run the function designated for performing ANOVA.

Following the execution of one-way ANOVA, the features attribute of the Rodin object will be updated. It will now include data about the statistical tests performed. The p_adj column displays the p-values adjusted using the Benjamini-Hochberg method.

In [5]:

obj.oneway_anova('Dose')

Out[5]:

	mz	time	p_value(owa) Dose	p_adj(owa) Dose
0	85.0284	38.3	0.867775	0.929819
1	85.0285	213.1	0.003990	0.036638
2	85.0285	213.1	0.817019	0.900804
3	85.0285	69.4	0.960532	0.979629
4	85.0285	69.4	0.006463	0.049960
...	...	...	...	...
26200	1245.6035	276.0	0.329061	0.536328
26201	1253.8981	29.2	0.000134	0.003188
26202	1255.9144	29.4	0.000485	0.008308
26203	1274.1295	40.1	0.154725	0.343972
26204	1274.1842	39.9	0.008926	0.060915

24233 rows × 4 columns

The fold_change function is recommended for pathway analysis as it calculates the log fold change difference between classes in the dataset. The function uses the first class appearing in the specified column of the samples attribute as the reference, if it is not provided as parameter.

In [6]:

obj.fold_change('Dose')

Out[6]:

	mz	time	p_value(owa) Dose	p_adj(owa) Dose	lfc (High vs Control)	lfc (Low vs Control)	lfc (others vs Control)
0	85.0284	38.3	0.867775	0.929819	-0.092725	-0.046480	-0.069603
1	85.0285	213.1	0.003990	0.036638	0.128757	-0.255070	-0.063157
2	85.0285	213.1	0.817019	0.900804	-0.052215	0.080238	0.014011
3	85.0285	69.4	0.960532	0.979629	-0.042041	0.012479	-0.014781
4	85.0285	69.4	0.006463	0.049960	-0.140702	0.161327	0.010312
...	...	...	...	...	...	...	...
26200	1245.6035	276.0	0.329061	0.536328	0.253545	0.079541	0.166543
26201	1253.8981	29.2	0.000134	0.003188	0.394721	0.792212	0.593466
26202	1255.9144	29.4	0.000485	0.008308	0.220599	0.734591	0.477595
26203	1274.1295	40.1	0.154725	0.343972	0.286580	0.063715	0.175147
26204	1274.1842	39.9	0.008926	0.060915	0.185690	-0.249075	-0.031692

24233 rows × 7 columns

With the same approach the following tests are available:

ttest('categorical')
pls_da('categorical')

twoway_anova(['cat1','cat2'])

sf_lg('binary') - Logistic Regression
sf_lr('continuis') - Linear Regression

rf_class('categorical') - Random Forest Classifier
rf_regress('continuis') - Random Forest Regressor

Additonal parameteres as degree, moderator or n_estimators could be provided.

Visualisation¶

Plotly library is used by default, to get static plots set interactive parameter to False (except of volcano plot)

Results of statistical tests could be visualized using volcano function.

In [7]:

obj.volcano(p='p_adj(owa) Dose', effect_size='lfc (High vs Control)', sign_line=0.00001, effect_size_line=[-1,1])

Metabolites of interest could be plotted using boxplot, violinplot and regplot providing the raw index numbers (indexes from the original data are saved in the Rodin object)

In [8]:

obj.boxplot(rows=[9999,4561],
               hue='Dose',category_order=['Control','Low','High'])

Slicing¶

The Rodin class also offers a convenient way to slice data, enhancing the flexibility and specificity of data analysis. When slicing, for example, by a certain criterion such as p-value in the features table, all main attributes of the Rodin object are updated correspondingly. This means that if you slice the features table based on p-value, the X attribute (intensities data) will be updated to reflect this change. This feature is particularly useful as it allows for the calculation of principal components on the updated dataset and facilitates the visualization of results with selected features.

Key Analysis Options Post-Slicing:

Principal Component Analysis (PCA): Once the data is sliced and the X attribute is updated, you can perform PCA on this refined dataset to uncover patterns and relationships.
Umap and t-SNE: In addition to PCA, the Rodin class supports other visualization techniques like UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding).

In [9]:

obj_2 = obj[obj.features[obj.features['p_adj(owa) Dose']<0.01]]

In [10]:

obj_2.run_pca()
obj_2.run_umap()

After running dimenional reduction methods n-components will be stored in the class as well in dr attribute

In [11]:

obj_2

Out[11]:

< Rodin object > 
dim: 1518 X 71
dr: pca, umap

In [12]:

obj_2.plot(dr_name='umap',hue='Dose')

In [13]:

obj_2.plot(dr_name='pca',hue='Dose')

To understand the patterns in the data - clustergram function could be utilitized:

In [16]:

obj_2.clustergram(hue='Dose',standardize='row',width=1050)

Pathway Analysis¶

The Rodin class incorporates the mummichog algorithm for pathway analysis, which is particularly useful for identifying metabolic pathways in mass spectrometry data.

Input Requirements:

To effectively use the mummichog algorithm, users are required to specify two key columns from the features table:
- A column with p-values, for instance, p_adj(owa).
- A column for statistics, where log fold change (LFC) could be used, such as in a comparison like lfc (others vs Control).
- Mass spectrometry analysis mode such as positive or negative (our lab uses custom modes f_positive | f_negative).

Output:

The function analyze_pathways produces a detailed table containing information about the identified pathways and compounds.
The results of this analysis are stored in the uns attribute of the Rodin object. This allows for convenient access and further analysis of the pathway data.

For a comprehensive understanding of the parameters and functionality of the analyze_pathways function refer to the dedicated pathways guide.

In [17]:

obj.analyze_pathways('p_adj(owa) Dose','lfc (others vs Control)',mode='f_positive')

Started @ Fri Aug 30 14:54:57 2024

Read 24233 features as reference list.
Automatically choosing (p < 0.000100) as significant cutoff.
Using 415 features (p < 0.000100) as significant list.

Pathway Analysis...
Query number of significant compounds = 76 compounds
Resampling, 100 permutations to estimate background ...

Got 1692 cpd2mzFeatures
Got 3449 ListOfEmpiricalCompounds
Got 2198 merged ListOfEmpiricalCompounds
Got 1132 final ListOfEmpiricalCompounds
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

Pathway background is estimated on 11900 random pathway values

Out[17]:

	pathway	overlap_size	pathway_size	p-value	overlap_EmpiricalCompounds (id)
0	Aspartate and asparagine metabolism	8	57	0.01646920426854886	E245,E1139,E468,E442,E430,E281,E300,E101
1	Glutamate metabolism	3	11	0.016637257373329974	E245,E442,E101
2	Vitamin H (biotin) metabolism	2	5	0.019662213259389966	E468,E1752
3	Lysine metabolism	5	29	0.019914292916561632	E468,E1054,E667,E281,E1442
4	Nitrogen metabolism	2	6	0.027056549869758844	E442,E300
...	...	...	...	...	...
114	Starch and Sucrose Metabolism	0	18	0.3344256785144106
115	Trihydroxycoprostanoyl-CoA beta-oxidation	0	1	0.3344256785144106
116	Omega-6 fatty acid metabolism	0	8	0.3344256785144106
117	Phytanic acid peroxisomal oxidation	0	8	0.3344256785144106
118	Selenoamino acid metabolism	0	20	0.3344256785144106

119 rows × 5 columns

In [18]:

obj

Out[18]:

< Rodin object > 
dim: 24233 X 71
dr: pca, umap
uns: pathways, compounds

In [19]:

obj.uns['compounds']

Out[19]:

	input_row	EID	str_row_ion	compounds	compound_names	m/z	retention_time	p_value	statistic
0	51	E11	M+H[1+]	C02218	2-Aminoacrylate; Dehydroalanine	88.0394	74.1	0.602135	0.029470
1	51	E14	M+2H[2+]	C00439	N-Formimidoyl-L-glutamate	88.0394	74.1	0.602135	0.029470
2	52	E16	M+2H[2+]	C00437	N2-Acetyl-L-ornithine	88.0579	68.0	0.627157	0.106719
3	53	E20	M+H[1+]	C00555	4-Aminobutanal; 4-Aminobutyraldehyde; Butyrald...	88.0758	23.8	0.640468	-0.023287
4	54	E20	M+H[1+]	C00555	4-Aminobutanal; 4-Aminobutyraldehyde; Butyrald...	88.0758	23.8	0.839262	0.065401
...	...	...	...	...	...	...	...	...	...
1756	26030	E2193	M+Na[1+]	lnlccoa;CE5162;C16163;lneldccoa	linoleic coenzyme A$trans,cis-2,9-octadecadien...	1048.3058	129.2	0.000440	-0.725758
1757	26044	E2190	M+Na[1+]	C16218;C00510	trans-Octadec-2-enoyl-CoA$Octadecenoyl-CoA (n-...	1054.3533	287.6	0.155874	-0.213352
1758	26044	E2194	M+H[1+]	CE4843;C02249;CE5150	cis,cis-11,14-eicosadienoyl-CoA$C20:4-CoA$tran...	1054.3533	287.6	0.155874	-0.213352
1759	26149	E2018	M+Na[1+]	CE4817	3(S)-hydroxy-docosa-7,10,13,16,19-all-cis-pent...	1114.3193	269.9	0.936672	0.045092
1760	26173	E2198	M+Na[1+]	C15613;C17343;C05448	3alpha,7alpha,12alpha-Trihydroxy-5beta-cholest...	1222.4244	290.2	0.162603	0.358422

1761 rows × 9 columns

Compounds could be aggregated by pathways using function show_compounds. Provide cutoff_eids to reflect the same eids as in pathways table.cutoff_path is used to filter the pathways of interest.

In [20]:

obj.show_compounds(cutoff_eids=0.0001,cutoff_path=0.017)

Out[20]:

		str_row_ion	compounds	compound_names	input_row	m/z	retention_time	p_value	statistic
pathway	compound_id
Aspartate and asparagine metabolism	E101	M+H[1+]	C02356;C01026;C00334;C05145;C01205;C03284	(S)-2-Aminobutanoate; (S)-2-Aminobutanoic acid...	354	104.0707	57.2	1.797939e-07	0.187342
	E281	M+H[1+]	C05936;C00408	N4-Acetylaminobutanal$L-Pipecolate; Pipecolini...	1032	130.0863	72.4	6.369287e-15	-0.161556
	E281	M(C13)+H[1+]	C05936;C00408	N4-Acetylaminobutanal$L-Pipecolate; Pipecolini...	1073	131.0897	76.7	7.092228e-11	-0.158624
	E300	M+H[1+]	C02642;C00152	N-Carbamoyl-beta-alanine$L-Asparagine	1122	133.0608	79.8	9.416820e-05	-0.020385
	E430	M+H[1+]	C01035	4-Guanidinobutanoate	1583	146.0923	30.8	6.932411e-18	0.361329
	E442	M+H[1+]	C00064;C05100;C00819	L-Glutamine$3-Ureidoisobutyrate$D-Glutamine; D...	1616	147.0765	76.9	1.029164e-06	0.053800
	E468	M+H[1+]	C00047	L-Lysine	1623	147.1127	79.6	3.517658e-14	-0.136833
	E468	M(C13)+H[1+]	C00047	L-Lysine	1658	148.1161	77.7	8.028457e-12	-0.159557
	E1139	M+H[1+]	C01239	N-Acetyl-beta-D-glucosaminylamine	4248	221.1133	164.1	1.263207e-22	0.278673
	E245	M+H[1+]	C00669	gamma-L-Glutamyl-L-cysteine	5327	251.0697	69.1	8.833216e-05	-0.125672
Glutamate metabolism	E101	M+H[1+]	C02356;C01026;C00334;C05145;C01205;C03284	(S)-2-Aminobutanoate; (S)-2-Aminobutanoic acid...	354	104.0707	57.2	1.797939e-07	0.187342
	E442	M+H[1+]	C00064;C05100;C00819	L-Glutamine$3-Ureidoisobutyrate$D-Glutamine; D...	1616	147.0765	76.9	1.029164e-06	0.053800
	E245	M+H[1+]	C00669	gamma-L-Glutamyl-L-cysteine	5327	251.0697	69.1	8.833216e-05	-0.125672

After conducting pathway analysis in the Rodin class, users can visualize the results through barplots, violin or reg plots based on Entity IDs (EIDs) of metabolites or pathways names. By default, these plots will only display statistically significant metabolites (0.05 by default), omitting those not meeting the significance threshold. This focused approach ensures clearer and more meaningful visualizations, aiding in the interpretation of the pathway analysis outcomes.

In [21]:

obj.violinplot(eids=['E468','E1752	'],hue='Dose',significant=0.0001,category_order=['Control','Low','High'])

In [22]:

obj.boxplot(pathways=['Vitamin H (biotin) metabolism'],
               hue='Dose',significant=0.001,category_order=['Control','Low','High'],grid_dim=(1,3),figsize=(10,5))

Saving and Importing Objects in Rodin

Saving the Object: To save the current state of a Rodin object, use the save method. This method allows you to store the object for future use, preserving all its data and attributes.

obj.save('path_to_save.pkl')
Importing the Object: If you need to continue working with a previously saved Rodin object, you can easily import it back into your environment using the import_object method. This method reloads the object with all its data and attributes intact.

obj = rodin.import_object('path_to_saved_object.pkl')