Delightful Data Analysis¶
End-to-End Proteomics Analysis Pipeline¶
Welcome! This notebook showcases a complete proteomics workflow - from raw DIA-NN quantification to publication-ready insights. We'll explore quality control, dimensionality reduction, missing data imputation, and differential expression analysis using real experimental data.
%%capture
%load_ext autoreload
%autoreload 2
%load_ext rpy2.ipython
# Standard data science toolkit
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
# Custom project utilities for streamlined QC and analysis
from ProjectUtility import dim_red_utility, mis_val_utility, correlation_utilities
from ProjectUtility.core import (
clean_axes,
convert_palette_to_hex,
create_group_color_mapping,
norm_loading,
parse_fasta_file,
add_desc
)
Data Preparation¶
Let's start by loading and cleaning our DIA-NN output. We'll streamline column names to make them more readable while preserving the experimental structure.
# Load raw DIA-NN quantification matrix
df = pd.read_csv('020_2025_DUN_DH-report.pg_matrix.tsv', sep='\t')
# Extract only protein groups and intensity columns
intensity_cols = [col for col in df.columns if col.endswith('.raw')]
df = df[['Protein.Group'] + intensity_cols].set_index('Protein.Group')
# Clean up column names for better readability
df.columns = (
df.columns
.str.replace('.raw', '', regex=False)
.str.split('-GB-').str[1]
.str.replace('-', '', regex=False)
.str.replace('+', '', regex=False)
.str.replace('Rep2', '', regex=False)
.str.replace('Rep', '', regex=False)
)
# Apply standardized column names
standardized_cols = [
'2T1_1', '2T1_2', '2T1_3',
'SLBP12_1', 'SLBP12_2', 'SLBP12_3',
'SLBP12TET_1', 'SLBP12TET_2', 'SLBP12TET_3',
'SLBP1_1', 'SLBP1_2', 'SLBP1_3',
'SLBP1TET_1', 'SLBP1TET_2', 'SLBP1TET_3'
]
df.columns = standardized_cols
# Save cleaned data for reproducibility
df.to_pickle('indata.pkl')
df.head()
| 2T1_1 | 2T1_2 | 2T1_3 | SLBP12_1 | SLBP12_2 | SLBP12_3 | SLBP12TET_1 | SLBP12TET_2 | SLBP12TET_3 | SLBP1_1 | SLBP1_2 | SLBP1_3 | SLBP1TET_1 | SLBP1TET_2 | SLBP1TET_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Protein.Group | |||||||||||||||
| Phleomycin | 4.347880e+07 | 36651100.0 | 3.500330e+07 | 38470400.0 | 34726400.0 | 29039600.0 | 24671200.0 | 35611800.0 | 38989400.00 | 35651600.0 | 40151800.0 | 40348200.0 | 39719800.0 | 37193800.00 | 34347500.00 |
| Puromycin | 3.821860e+08 | 425111000.0 | 3.694850e+08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Tb04.24M18.150:mRNA-p1 | 9.676820e+03 | 20313.4 | 8.381180e+03 | NaN | NaN | NaN | 12192.6 | 13784.5 | 5528.62 | 10335.9 | NaN | NaN | NaN | 8260.32 | 6979.02 |
| Tb05.5K5.100:mRNA-p1;Tb927.5.4450:mRNA-p1 | 5.654800e+06 | 5339560.0 | 6.262950e+06 | 4577010.0 | 5836170.0 | 6658720.0 | 6350950.0 | 6026440.0 | 4977160.00 | 5657600.0 | 5669860.0 | 5763490.0 | 6027480.0 | 5277680.00 | 5985870.00 |
| Tb05.5K5.110:mRNA-p1;Tb927.5.4460:mRNA-p1 | 6.031810e+07 | 59744200.0 | 6.273160e+07 | 64689500.0 | 66848700.0 | 65379700.0 | 65218100.0 | 58591900.0 | 63377600.00 | 64882100.0 | 67925200.0 | 64827100.0 | 66387600.0 | 69991700.00 | 65262800.00 |
Quality Control¶
Before diving into analysis, let's thoroughly assess data quality. This step is crucial - catching issues early saves time and ensures reliable results!
# Load our prepared dataset
df = pd.read_pickle('indata.pkl')
df.head()
| 2T1_1 | 2T1_2 | 2T1_3 | SLBP12_1 | SLBP12_2 | SLBP12_3 | SLBP12TET_1 | SLBP12TET_2 | SLBP12TET_3 | SLBP1_1 | SLBP1_2 | SLBP1_3 | SLBP1TET_1 | SLBP1TET_2 | SLBP1TET_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Protein.Group | |||||||||||||||
| Phleomycin | 4.347880e+07 | 36651100.0 | 3.500330e+07 | 38470400.0 | 34726400.0 | 29039600.0 | 24671200.0 | 35611800.0 | 38989400.00 | 35651600.0 | 40151800.0 | 40348200.0 | 39719800.0 | 37193800.00 | 34347500.00 |
| Puromycin | 3.821860e+08 | 425111000.0 | 3.694850e+08 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Tb04.24M18.150:mRNA-p1 | 9.676820e+03 | 20313.4 | 8.381180e+03 | NaN | NaN | NaN | 12192.6 | 13784.5 | 5528.62 | 10335.9 | NaN | NaN | NaN | 8260.32 | 6979.02 |
| Tb05.5K5.100:mRNA-p1;Tb927.5.4450:mRNA-p1 | 5.654800e+06 | 5339560.0 | 6.262950e+06 | 4577010.0 | 5836170.0 | 6658720.0 | 6350950.0 | 6026440.0 | 4977160.00 | 5657600.0 | 5669860.0 | 5763490.0 | 6027480.0 | 5277680.00 | 5985870.00 |
| Tb05.5K5.110:mRNA-p1;Tb927.5.4460:mRNA-p1 | 6.031810e+07 | 59744200.0 | 6.273160e+07 | 64689500.0 | 66848700.0 | 65379700.0 | 65218100.0 | 58591900.0 | 63377600.00 | 64882100.0 | 67925200.0 | 64827100.0 | 66387600.0 | 69991700.00 | 65262800.00 |
Missing Value Analysis¶
Missing values in proteomics can be informative! Let's visualize their patterns to understand if they're random or systematic (e.g., abundance-dependent).
# Generate comprehensive missing values dashboard
mv_analyzer = mis_val_utility.MissingValuesAnalyzer(df)
fig, axes, summary = mv_analyzer.plot_missing_dashboard()
Key Observations:¶
- Good news! Missing values are sparse and concentrated in the low-intensity range
- This pattern suggests typical left-censored data (proteins below detection limit)
Validation with Known Markers¶
Let's verify our data integrity using selection markers:
- Puromycin resistance: Expected only in 2T1 samples (transgene marker)
- Phleomycin resistance: Should be present across all samples (background marker)
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
# Puromycin: should be 2T1-specific
np.log2(df.loc['Puromycin']).plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Puromycin Resistance Marker', fontweight='bold')
axes[0].set_ylabel('Log2 Intensity')
clean_axes(axes[0])
# Phleomycin: should be ubiquitous
np.log2(df.loc['Phleomycin']).plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Phleomycin Resistance Marker', fontweight='bold')
axes[1].set_ylabel('Log2 Intensity')
clean_axes(axes[1])
plt.tight_layout()
plt.show()
Perfect! Markers behave as expected - this validates our sample identities and data integrity.
# Filter out proteins with excessive missingness
# Rationale:除Puromycin外 proteins missing in >50% of samples are unreliable
print(f"Before filtering: {df.shape[0]} proteins")
missing_count = df.isna().sum(axis=1)
df = df[missing_count < 8] # Keep proteins present in at least half the samples
print(f"After filtering: {df.shape[0]} proteins ({df.shape[0]/6835*100:.1f}% retained)")
Before filtering: 6835 proteins After filtering: 6790 proteins (99.3% retained)
Color-Coding Experimental Groups¶
Consistent visual representation helps spot patterns across multiple plots. Let's assign colors to our experimental groups.
# Generate sample-to-color mapping (3 replicates per group)
sample_groups, color_dictionary = create_group_color_mapping(
df.columns, group_size=3, return_color_to_group=True
)
sample_groups
{'2T1_1': '#FF5733',
'2T1_2': '#FF5733',
'2T1_3': '#FF5733',
'SLBP12_1': '#33FF57',
'SLBP12_2': '#33FF57',
'SLBP12_3': '#33FF57',
'SLBP12TET_1': '#3357FF',
'SLBP12TET_2': '#3357FF',
'SLBP12TET_3': '#3357FF',
'SLBP1_1': '#FF33A8',
'SLBP1_2': '#FF33A8',
'SLBP1_3': '#FF33A8',
'SLBP1TET_1': '#33FFF5',
'SLBP1TET_2': '#33FFF5',
'SLBP1TET_3': '#33FFF5'}
# Customize group labels for clarity
color_dictionary = {
'#FF5733': '2T1',
'#33FF57': 'SLBP-12',
'#3357FF': 'SLBP-12-TET',
'#FF33A8': 'SLBP-1',
'#33FFF5': 'SLBP-1-TET'
}
Dimensionality Reduction¶
Time to explore global patterns! We'll use PCA and MDS to visualize sample relationships in high-dimensional space.
from sklearn.preprocessing import StandardScaler
def preprocess_for_dim_reduction(df):
"""
Prepare data for dimensionality reduction:
- Log2 transform (stabilizes variance)
- Remove remaining missing values
- Z-score normalization (centers and scales features)
"""
scaler = StandardScaler()
# Log transform and clean protein IDs
processed = np.log2(df).dropna().copy()
processed.index = [idx.split(':')[0] for idx in processed.index]
# Standardize features
processed = pd.DataFrame(
scaler.fit_transform(processed),
index=processed.index,
columns=processed.columns
)
return processed
# Generate comprehensive dimensionality reduction dashboard
fig, axes, results_dict = dim_red_utility.create_dim_reduction_dashboard(
in_df=preprocess_for_dim_reduction(df),
sample_palette=sample_groups,
feature_palette={},
top=500, # Focus on 500 most variable proteins
color_dictionary=color_dictionary,
title="Multi-Method Dimensionality Reduction Analysis"
)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
# Display variance explained
print("\nVariance explained by first 3 principal components:")
print(results_dict['explained_variance'].head(3))
Explained variance ratio: [0.64777143 0.08738508 0.07778569 0.05570276 0.03574467] Variance explained by first 3 principal components: component explained_variance cumulative_variance 0 1 64.777143 64.777143 1 2 8.738508 73.515651 2 3 7.778569 81.294220
Critical Finding: Outlier Detection!¶
Alert: Two samples show unexpected behavior across all reduction methods:
- SLBP12-TET replicate A
- SLBP1-TET replicate C
These outliers cluster separately from their biological replicates - we need to investigate further!
Hierarchical Clustering: A Different Perspective¶
Let's use unsupervised clustering to confirm our outlier observations.
# Generate clustered heatmap with sample dendrograms
sns.clustermap(
preprocess_for_dim_reduction(df),
z_score=0, # Normalize by protein (rows)
cmap='Blues',
figsize=(8, 6),
cbar_kws={'label': 'Z-score'}
)
plt.suptitle('Sample Clustering Analysis', y=0.99)
Text(0.5, 0.99, 'Sample Clustering Analysis')
Root Cause Identified!¶
Clustering confirms the outliers. After investigating with the MS facility:
- Problem: Poor protein recovery during trypsin digestion for these two samples
- Action: We want to discard those two samples for differential expression analysis, howewer we will keep them for the sake of simplicity
Lesson: Always communicate with your core facility when you spot anomalies!
Intensity Distribution Assessment¶
DIA-NN includes built-in normalization, but let's verify intensity distributions are comparable across samples.
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(
data=np.log2(df),
showfliers=False,
palette=sample_groups.values(),
ax=ax
)
ax.set_title('Pre-Normalization Intensity Distribution', fontweight='bold', fontsize=14)
ax.set_ylabel('Log2 Intensity', fontsize=12)
ax.set_xlabel('Sample', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
# Apply median normalization for additional balance
data_normalized = norm_loading(np.log2(df))
fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(
data=data_normalized,
showfliers=False,
palette=sample_groups.values(),
ax=ax
)
ax.set_title('Post-Normalization Intensity Distribution', fontweight='bold', fontsize=14)
ax.set_ylabel('Log2 Intensity (Normalized)', fontsize=12)
ax.set_xlabel('Sample', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
medians [23.48822472 23.49156809 23.49019743 23.47039845 23.46962243 23.27145354 23.18479772 23.48068322 23.48804703 23.45988983 23.45657874 23.458376 23.4732135 23.45718642 23.48640297] target 23.441776005765004 norm_facs [0.99802247 0.99788043 0.99793865 0.99878049 0.99881351 1.00731894 1.01108391 0.99834301 0.99803002 0.99922788 0.99936893 0.99929236 0.99866071 0.99934304 0.99809988]
Validating Knockdown Targets¶
Our experimental design includes conditional knockdowns:
- Protein 1 (Tb927.3.870): Depleted in SLBP-1-TET and SLBP-12-TET samples
- Protein 2 (Tb927.3.1910): Depleted only in SLBP-12-TET samples
Let's verify the knockdown efficiency!
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
# Protein 1 knockdown
data_normalized.loc['Tb927.3.870:mRNA-p1'].plot(
kind='bar', ax=axes[0], color='darkred', alpha=0.7
)
axes[0].set_title('Protein 1 Knockdown Efficiency', fontweight='bold')
axes[0].set_ylabel('Log2 Intensity (Normalized)')
axes[0].axhline(y=data_normalized.loc['Tb927.3.870:mRNA-p1'].median(),
color='gray', linestyle='--', alpha=0.5, label='Median')
clean_axes(axes[0])
# Protein 2 knockdown
data_normalized.loc['Tb927.3.1910:mRNA-p1'].plot(
kind='bar', ax=axes[1], color='darkblue', alpha=0.7
)
axes[1].set_title('Protein 2 Knockdown Efficiency', fontweight='bold')
axes[1].set_ylabel('Log2 Intensity (Normalized)')
axes[1].axhline(y=data_normalized.loc['Tb927.3.1910:mRNA-p1'].median(),
color='gray', linestyle='--', alpha=0.5, label='Median')
clean_axes(axes[1])
plt.tight_layout()
plt.show()
Excellent! Both knockdowns show clear depletion in the expected samples. This confirms:
- Successful tetracycline induction
- Target-specific depletion
- Biological system is responding as designed
Missing Data Imputation¶
The great debate: To impute or not to impute?
Here, we'll use MissForest (random forest-based imputation) to demonstrate the approach. This method:
- Leverages correlations between proteins
- Handles non-linear relationships
- Provides more nuanced estimates than simple substitution
Note: Given our low missingness rate, results with/without imputation should be similar.
%%R -i data_normalized
suppressPackageStartupMessages({
library(missForest)
library(doParallel)
})
# MissForest imputation with group-wise processing
apply_missForest_to_subsets <- function(df, group_vector, cores = 3) {
registerDoParallel(cores = cores)
list_df_imputed <- list()
unique_groups <- unique(group_vector)
for(group in unique_groups) {
df_subset <- df[, group_vector == group, drop = FALSE]
# Impute missing values using random forest
imputed_subset <- missForest(
df_subset,
ntree = 200,
maxiter = 20,
mtry = 3,
nodesize = c(5, 5),
parallelize = "forests",
replace = TRUE
)$ximp
list_df_imputed[[length(list_df_imputed) + 1]] <- imputed_subset
}
df_imputed_full <- do.call(cbind, list_df_imputed)
stopImplicitCluster()
return(df_imputed_full)
}
# Define sample groups (3 replicates each)
subset_vector <- c(1,1,1, 2,2,2, 3,3,3, 4,4,4, 5,5,5)
# Perform imputation
data_normalized_imputed <- apply_missForest_to_subsets(data_normalized, subset_vector)
randomForest 4.7-1.2 Type rfNews() to see new features/changes/bug fixes. Loading required package: rngtools
# Transfer imputed data back to Python
%R -o data_normalized_imputed
data_normalized_imputed.head()
| 2T1_1 | 2T1_2 | 2T1_3 | SLBP12_1 | SLBP12_2 | SLBP12_3 | SLBP12TET_1 | SLBP12TET_2 | SLBP12TET_3 | SLBP1_1 | SLBP1_2 | SLBP1_3 | SLBP1TET_1 | SLBP1TET_2 | SLBP1TET_3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Phleomycin | 25.323631 | 25.074094 | 25.009328 | 25.166517 | 25.019809 | 24.972966 | 24.828505 | 25.044285 | 25.166902 | 25.068093 | 25.243021 | 25.248122 | 25.209547 | 25.132037 | 24.986135 |
| Tb04.24M18.150:mRNA-p1 | 13.214134 | 14.279813 | 13.006072 | 16.301880 | 16.453885 | 16.657805 | 13.724168 | 13.727974 | 12.408211 | 13.325080 | 13.206031 | 13.275544 | 13.098157 | 13.003434 | 12.744546 |
| Tb05.5K5.100:mRNA-p1;Tb927.5.4450:mRNA-p1 | 22.386686 | 22.300921 | 22.531869 | 22.098991 | 22.449922 | 22.832711 | 22.849021 | 22.485554 | 22.203065 | 22.414439 | 22.420724 | 22.442619 | 22.492958 | 22.316802 | 22.470352 |
| Tb05.5K5.110:mRNA-p1;Tb927.5.4460:mRNA-p1 | 25.794976 | 25.777542 | 25.849295 | 25.915385 | 25.963554 | 26.152356 | 26.246494 | 25.761441 | 25.866413 | 25.931280 | 26.001025 | 25.931730 | 25.949610 | 26.043560 | 25.910431 |
| Tb05.5K5.120:mRNA-p1;Tb927.5.4470:mRNA-p1 | 22.775044 | 22.771027 | 22.735765 | 22.778995 | 22.881175 | 23.037907 | 22.919583 | 22.680461 | 22.663162 | 22.822548 | 22.840285 | 22.780760 | 22.793283 | 22.743623 | 22.670075 |
# Verify imputation preserved biological signal
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))
data_normalized_imputed.loc['Tb927.3.870:mRNA-p1'].plot(
kind='bar', ax=axes[0], color='darkred', alpha=0.7
)
axes[0].set_title('Protein 1 (Post-Imputation)', fontweight='bold')
axes[0].set_ylabel('Log2 Intensity')
clean_axes(axes[0])
data_normalized_imputed.loc['Tb927.3.1910:mRNA-p1'].plot(
kind='bar', ax=axes[1], color='darkblue', alpha=0.7
)
axes[1].set_title('Protein 2 (Post-Imputation)', fontweight='bold')
axes[1].set_ylabel('Log2 Intensity')
clean_axes(axes[1])
plt.tight_layout()
plt.show()
Validated! Imputation preserved our knockdown signals perfectly. Ready for statistical analysis!
Differential Expression Analysis¶
Time for the main event! We'll use limma to identify differentially expressed proteins.
Strategy: Generate all pairwise comparisons, then select the most biologically relevant contrasts. This exploratory approach helps uncover unexpected patterns.
# Create output directory
!mkdir -p table_comparisons_diaNN
%%R -i data_normalized_imputed
options(warn = -1)
library(limma)
# Define experimental groups
f <- factor(
c('T1', 'T1', 'T1',
'S12', 'S12', 'S12',
'S12T', 'S12T', 'S12T',
'S1', 'S1', 'S1',
'S1T', 'S1T', 'S1T'),
levels = c("T1", "S12", "S12T", "S1", "S1T")
)
# Create design matrix
design <- model.matrix(~0 + f)
colnames(design) <- c("T1", "S12", "S12T", "S1", "S1T")
# Fit linear model
fit2 <- lmFit(data_normalized_imputed, design)
# Generate all pairwise contrasts
contrast_names <- list()
for (i in 1:length(levels(f))) {
for (j in 1:length(levels(f))) {
if (i != j) {
contrast_name <- paste(levels(f)[i], "-", levels(f)[j], sep = "")
contrast_names[[length(contrast_names) + 1]] <- contrast_name
}
}
}
contrast_vector <- unlist(contrast_names)
contrast_matrix <- makeContrasts(contrasts = contrast_vector, levels = design)
# Apply contrasts and empirical Bayes moderation
fit2 <- contrasts.fit(fit2, contrast_matrix)
fit2 <- eBayes(fit2)
# Export all comparison results
export_results <- function(fit, contrast_index, name) {
results <- topTable(fit, coef = contrast_index, sort.by = "none", number = Inf)
write.csv(results, paste0('table_comparisons_diaNN/', name, "_comparison.csv"))
}
for (idx in 1:length(colnames(contrast_matrix))) {
contrast_name <- colnames(contrast_matrix)[idx]
export_results(fit2, idx, contrast_name)
}
cat("✓ Generated", length(contrast_vector), "pairwise comparisons\n")
✓ Generated 20 pairwise comparisons
Volcano Plots: Visualizing Differential Expression¶
Let's examine one key comparison and track our knockdown targets through the analysis.
# Focus on a biologically relevant comparison
comparison_name = 'S12-S12T'
comparison_file = f'table_comparisons_diaNN/{comparison_name}_comparison.csv'
print(f"Analyzing: {comparison_name}")
Analyzing: S12-S12T
# Track our knockdown targets
target_proteins = ['Tb927.3.870:mRNA-p1', 'Tb927.3.1910:mRNA-p1']
# Generate volcano and MA plots
result_table = pd.read_csv(comparison_file, index_col=0)
# Calculate -log10 p-values for visualization
result_table['log10pval'] = -np.log10(result_table['P.Value'])
result_table['log10adjpval'] = -np.log10(result_table['adj.P.Val'])
fig, axes = plt.subplots(figsize=(14, 5), ncols=2)
# Volcano plot
ax = axes[0]
result_table.plot(
x='logFC', y='log10adjpval',
kind='scatter', s=5, alpha=0.3,
ax=ax, c='gray', label='All proteins'
)
result_table.loc[target_proteins].plot(
x='logFC', y='log10adjpval',
kind='scatter', s=50, alpha=1,
ax=ax, c='red', label='Target proteins', edgecolors='black'
)
ax.axhline(y=-np.log10(0.05), color='blue', linestyle='--', alpha=0.5, label='FDR 0.05')
ax.axvline(x=0, color='black', linestyle='-', alpha=0.3)
ax.set_title('Volcano Plot: Identifying Significant Changes', fontweight='bold')
ax.set_xlabel('Log2 Fold Change')
ax.set_ylabel('-Log10 Adjusted P-value')
ax.legend()
# MA plot
ax = axes[1]
result_table.plot(
x='AveExpr', y='logFC',
kind='scatter', s=5, alpha=0.3,
ax=ax, c='gray', label='All proteins'
)
result_table.loc[target_proteins].plot(
x='AveExpr', y='logFC',
kind='scatter', s=50, alpha=1,
ax=ax, c='red', label='Target proteins', edgecolors='black'
)
ax.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax.set_title('MA Plot: Fold Change vs. Abundance', fontweight='bold')
ax.set_xlabel('Average Log2 Expression')
ax.set_ylabel('Log2 Fold Change')
ax.legend()
plt.suptitle(f'Differential Expression: {comparison_name}', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# P-value distribution (QC check)
fig, ax = plt.subplots(figsize=(8, 4))
result_table['P.Value'].plot(kind='hist', bins=30, ax=ax, color='steelblue', edgecolor='black')
ax.set_title('P-value Distribution (QC)', fontweight='bold')
ax.set_xlabel('Raw P-value')
ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Interactive Visualization with PlotHub¶
Static plots are informative, but interactive exploration takes analysis to the next level!
We'll prepare our data for PlotHub - enabling:
- Dynamic filtering and selection
- Hover-over protein annotations
- Linked views across multiple plots
- Easy sharing with collaborators
# Export normalized data for PlotHub
data_normalized_imputed.to_csv('table_comparisons_diaNN/in_data.csv')
# Load protein annotations from FASTA
prot_to_desc = parse_fasta_file('TriTrypDB-68_TbruceiTREU927_AnnotatedProteins.fasta')
!mkdir -p table_comparisons_PlotHub
# Define sample group mappings
sample_to_columns = {
'T1': ['2T1_1', '2T1_2', '2T1_3'],
'S12': ['SLBP12_1', 'SLBP12_2', 'SLBP12_3'],
'S12T': ['SLBP12TET_1', 'SLBP12TET_2', 'SLBP12TET_3'],
'S1': ['SLBP1_1', 'SLBP1_2', 'SLBP1_3'],
'S1T': ['SLBP1TET_1', 'SLBP1TET_2', 'SLBP1TET_3']
}
def prepare_for_plothub(data_df, comparison_df, data_cols):
"""
Format data for PlotHub compatibility:
- Add gene IDs and accessions
- Calculate significance metrics
- Append raw intensity values
- Include protein descriptions
"""
result = comparison_df.copy()
# Format identifiers
result['Gene_acc'] = range(result.shape[0])
result['Gene_id'] = result.index.str.replace(':mRNA-p1', '', regex=False)
# Prepare metrics
result['log_AveExpr'] = result['AveExpr']
result['FDR'] = -np.log10(result['adj.P.Val'])
# Select core columns
result = result[['Gene_acc', 'Gene_id', 'logFC', 'log_AveExpr', 'FDR']]
# Add descriptions
result['Desc'] = add_desc(data_df, prot_to_desc)
# Append intensity values for the relevant samples
intensity_data = data_df[data_cols].copy()
result = result.join(intensity_data)
result = result.reset_index(drop=True)
return result
# Process all comparisons
print("Preparing PlotHub-compatible files...\n")
comparison_files = [f for f in os.listdir('table_comparisons_diaNN') if f.endswith('_comparison.csv')]
for file_name in comparison_files:
# Load data
data_df = pd.read_csv('table_comparisons_diaNN/in_data.csv', index_col=0)
comparison_df = pd.read_csv(f'table_comparisons_diaNN/{file_name}', index_col=0)
# Extract comparison name and get relevant columns
comparison = file_name.replace('_comparison.csv', '')
group1, group2 = comparison.split('-')
data_cols = sample_to_columns[group1] + sample_to_columns[group2]
# Format and save
formatted = prepare_for_plothub(data_df, comparison_df, data_cols)
output_path = f'table_comparisons_PlotHub/for_web_limma_{comparison}.csv'
formatted.to_csv(output_path, index=False)
print(f"✓ {comparison}: {formatted.shape[0]} proteins, {len(data_cols)} samples")
print(f"\n✓ All {len(comparison_files)} comparisons ready for PlotHub!")
Preparing PlotHub-compatible files... ✓ T1-S1: 6790 proteins, 6 samples ✓ S1-S12T: 6790 proteins, 6 samples ✓ S12T-S12: 6790 proteins, 6 samples ✓ S12T-S1T: 6790 proteins, 6 samples ✓ S1T-S12T: 6790 proteins, 6 samples ✓ S12-S12T: 6790 proteins, 6 samples ✓ S1-S12: 6790 proteins, 6 samples ✓ T1-S12T: 6790 proteins, 6 samples ✓ T1-S1T: 6790 proteins, 6 samples ✓ S12-S1T: 6790 proteins, 6 samples ✓ S1-S1T: 6790 proteins, 6 samples ✓ S1-T1: 6790 proteins, 6 samples ✓ T1-S12: 6790 proteins, 6 samples ✓ S12-S1: 6790 proteins, 6 samples ✓ S12T-S1: 6790 proteins, 6 samples ✓ S1T-S1: 6790 proteins, 6 samples ✓ S1T-T1: 6790 proteins, 6 samples ✓ S12T-T1: 6790 proteins, 6 samples ✓ S12-T1: 6790 proteins, 6 samples ✓ S1T-S12: 6790 proteins, 6 samples ✓ All 20 comparisons ready for PlotHub!
Analysis Complete!¶
What We Accomplished:¶
✓ Data QC: Identified and explained sample outliers
✓ Validation: Confirmed knockdown efficiency and marker expression
✓ Normalization: Balanced intensity distributions across samples
✓ Imputation: Applied sophisticated missing data handling
✓ Statistics: Performed comprehensive differential expression analysis
✓ Visualization: Created publication-quality static and interactive plots
Next Steps:¶
- Upload HTML to your web server
- See this output in PlotHub, by clicking on "Upload Test File".
Analysis performed with custom utilities from the Delightful Data Analysis framework