Delightful Data Analysis¶

End-to-End Proteomics Analysis Pipeline¶

Welcome! This notebook showcases a complete proteomics workflow - from raw DIA-NN quantification to publication-ready insights. We'll explore quality control, dimensionality reduction, missing data imputation, and differential expression analysis using real experimental data.

In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%load_ext rpy2.ipython
In [2]:
# Standard data science toolkit
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
In [3]:
# Custom project utilities for streamlined QC and analysis
from ProjectUtility import dim_red_utility, mis_val_utility, correlation_utilities
from ProjectUtility.core import (
    clean_axes, 
    convert_palette_to_hex, 
    create_group_color_mapping,
    norm_loading,
    parse_fasta_file,
    add_desc
)

Data Preparation¶

Let's start by loading and cleaning our DIA-NN output. We'll streamline column names to make them more readable while preserving the experimental structure.

In [4]:
# Load raw DIA-NN quantification matrix
df = pd.read_csv('020_2025_DUN_DH-report.pg_matrix.tsv', sep='\t')

# Extract only protein groups and intensity columns
intensity_cols = [col for col in df.columns if col.endswith('.raw')]
df = df[['Protein.Group'] + intensity_cols].set_index('Protein.Group')

# Clean up column names for better readability
df.columns = (
    df.columns
    .str.replace('.raw', '', regex=False)
    .str.split('-GB-').str[1]
    .str.replace('-', '', regex=False)
    .str.replace('+', '', regex=False)
    .str.replace('Rep2', '', regex=False)
    .str.replace('Rep', '', regex=False)
)

# Apply standardized column names
standardized_cols = [
    '2T1_1', '2T1_2', '2T1_3', 
    'SLBP12_1', 'SLBP12_2', 'SLBP12_3', 
    'SLBP12TET_1', 'SLBP12TET_2', 'SLBP12TET_3', 
    'SLBP1_1', 'SLBP1_2', 'SLBP1_3', 
    'SLBP1TET_1', 'SLBP1TET_2', 'SLBP1TET_3'
]
df.columns = standardized_cols

# Save cleaned data for reproducibility
df.to_pickle('indata.pkl')
df.head()
Out[4]:
2T1_1 2T1_2 2T1_3 SLBP12_1 SLBP12_2 SLBP12_3 SLBP12TET_1 SLBP12TET_2 SLBP12TET_3 SLBP1_1 SLBP1_2 SLBP1_3 SLBP1TET_1 SLBP1TET_2 SLBP1TET_3
Protein.Group
Phleomycin 4.347880e+07 36651100.0 3.500330e+07 38470400.0 34726400.0 29039600.0 24671200.0 35611800.0 38989400.00 35651600.0 40151800.0 40348200.0 39719800.0 37193800.00 34347500.00
Puromycin 3.821860e+08 425111000.0 3.694850e+08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Tb04.24M18.150:mRNA-p1 9.676820e+03 20313.4 8.381180e+03 NaN NaN NaN 12192.6 13784.5 5528.62 10335.9 NaN NaN NaN 8260.32 6979.02
Tb05.5K5.100:mRNA-p1;Tb927.5.4450:mRNA-p1 5.654800e+06 5339560.0 6.262950e+06 4577010.0 5836170.0 6658720.0 6350950.0 6026440.0 4977160.00 5657600.0 5669860.0 5763490.0 6027480.0 5277680.00 5985870.00
Tb05.5K5.110:mRNA-p1;Tb927.5.4460:mRNA-p1 6.031810e+07 59744200.0 6.273160e+07 64689500.0 66848700.0 65379700.0 65218100.0 58591900.0 63377600.00 64882100.0 67925200.0 64827100.0 66387600.0 69991700.00 65262800.00

Quality Control¶

Before diving into analysis, let's thoroughly assess data quality. This step is crucial - catching issues early saves time and ensures reliable results!

In [5]:
# Load our prepared dataset
df = pd.read_pickle('indata.pkl')
df.head()
Out[5]:
2T1_1 2T1_2 2T1_3 SLBP12_1 SLBP12_2 SLBP12_3 SLBP12TET_1 SLBP12TET_2 SLBP12TET_3 SLBP1_1 SLBP1_2 SLBP1_3 SLBP1TET_1 SLBP1TET_2 SLBP1TET_3
Protein.Group
Phleomycin 4.347880e+07 36651100.0 3.500330e+07 38470400.0 34726400.0 29039600.0 24671200.0 35611800.0 38989400.00 35651600.0 40151800.0 40348200.0 39719800.0 37193800.00 34347500.00
Puromycin 3.821860e+08 425111000.0 3.694850e+08 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Tb04.24M18.150:mRNA-p1 9.676820e+03 20313.4 8.381180e+03 NaN NaN NaN 12192.6 13784.5 5528.62 10335.9 NaN NaN NaN 8260.32 6979.02
Tb05.5K5.100:mRNA-p1;Tb927.5.4450:mRNA-p1 5.654800e+06 5339560.0 6.262950e+06 4577010.0 5836170.0 6658720.0 6350950.0 6026440.0 4977160.00 5657600.0 5669860.0 5763490.0 6027480.0 5277680.00 5985870.00
Tb05.5K5.110:mRNA-p1;Tb927.5.4460:mRNA-p1 6.031810e+07 59744200.0 6.273160e+07 64689500.0 66848700.0 65379700.0 65218100.0 58591900.0 63377600.00 64882100.0 67925200.0 64827100.0 66387600.0 69991700.00 65262800.00

Missing Value Analysis¶

Missing values in proteomics can be informative! Let's visualize their patterns to understand if they're random or systematic (e.g., abundance-dependent).

In [6]:
# Generate comprehensive missing values dashboard
mv_analyzer = mis_val_utility.MissingValuesAnalyzer(df)
fig, axes, summary = mv_analyzer.plot_missing_dashboard()
[Figure: missing values dashboard]

Key Observations:¶

  • Good news! Missing values are sparse and concentrated in the low-intensity range
  • This pattern suggests typical left-censored data (proteins below detection limit)
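As a quick sanity check of the left-censoring observation above (a minimal sketch that only assumes the df loaded earlier), compare the typical abundance of fully observed proteins with proteins that have gaps:

mean_intensity = np.log2(df).mean(axis=1)   # per-protein average log2 intensity
has_missing = df.isna().any(axis=1)         # proteins with at least one missing value
print(f"Median log2 intensity, fully observed proteins: {mean_intensity[~has_missing].median():.2f}")
print(f"Median log2 intensity, proteins with gaps:      {mean_intensity[has_missing].median():.2f}")

If missingness is abundance-dependent, the second number should come out noticeably lower.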

Validation with Known Markers¶

Let's verify our data integrity using selection markers:

  • Puromycin resistance: Expected only in 2T1 samples (transgene marker)
  • Phleomycin resistance: Should be present across all samples (background marker)
In [7]:
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))

# Puromycin: should be 2T1-specific
np.log2(df.loc['Puromycin']).plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Puromycin Resistance Marker', fontweight='bold')
axes[0].set_ylabel('Log2 Intensity')
clean_axes(axes[0])

# Phleomycin: should be ubiquitous
np.log2(df.loc['Phleomycin']).plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Phleomycin Resistance Marker', fontweight='bold')
axes[1].set_ylabel('Log2 Intensity')
clean_axes(axes[1])

plt.tight_layout()
plt.show()
[Figure: Puromycin and Phleomycin marker intensities per sample]

Perfect! Markers behave as expected - this validates our sample identities and data integrity.


In [8]:
# Filter out proteins with excessive missingness
# Rationale: Puromycin aside (its absence outside 2T1 is expected), proteins missing
# in more than half of the samples are too unreliable to analyze
n_before = df.shape[0]
print(f"Before filtering: {n_before} proteins")

missing_count = df.isna().sum(axis=1)
df = df[missing_count < 8]  # Keep proteins quantified in at least 8 of 15 samples

print(f"After filtering: {df.shape[0]} proteins ({df.shape[0]/n_before*100:.1f}% retained)")
Before filtering: 6835 proteins
After filtering: 6790 proteins (99.3% retained)

Color-Coding Experimental Groups¶

Consistent visual representation helps spot patterns across multiple plots. Let's assign colors to our experimental groups.

In [9]:
# Generate sample-to-color mapping (3 replicates per group)
sample_groups, color_dictionary = create_group_color_mapping(
    df.columns, group_size=3, return_color_to_group=True
)
sample_groups
Out[9]:
{'2T1_1': '#FF5733',
 '2T1_2': '#FF5733',
 '2T1_3': '#FF5733',
 'SLBP12_1': '#33FF57',
 'SLBP12_2': '#33FF57',
 'SLBP12_3': '#33FF57',
 'SLBP12TET_1': '#3357FF',
 'SLBP12TET_2': '#3357FF',
 'SLBP12TET_3': '#3357FF',
 'SLBP1_1': '#FF33A8',
 'SLBP1_2': '#FF33A8',
 'SLBP1_3': '#FF33A8',
 'SLBP1TET_1': '#33FFF5',
 'SLBP1TET_2': '#33FFF5',
 'SLBP1TET_3': '#33FFF5'}
In [10]:
# Customize group labels for clarity
color_dictionary = {
    '#FF5733': '2T1',
    '#33FF57': 'SLBP-12',
    '#3357FF': 'SLBP-12-TET',
    '#FF33A8': 'SLBP-1',
    '#33FFF5': 'SLBP-1-TET'
}
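For readers without the ProjectUtility helpers, a sample-to-color mapping like the one above can be sketched directly with a seaborn palette (simple_group_color_mapping is a hypothetical stand-in; the hex codes will differ from the custom helper's):

def simple_group_color_mapping(columns):
    """Minimal stand-in for create_group_color_mapping (assumes group = name before the final underscore)."""
    groups = [col.rsplit('_', 1)[0] for col in columns]
    palette = sns.color_palette('tab10', n_colors=len(set(groups))).as_hex()
    group_to_color = dict(zip(dict.fromkeys(groups), palette))  # first-seen group order
    return {col: group_to_color[g] for col, g in zip(columns, groups)}

# simple_group_color_mapping(df.columns) -> {'2T1_1': '#1f77b4', '2T1_2': '#1f77b4', ...}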

Dimensionality Reduction¶

Time to explore global patterns! We'll use PCA and MDS to visualize sample relationships in high-dimensional space.

In [11]:
from sklearn.preprocessing import StandardScaler

def preprocess_for_dim_reduction(df):
    """
    Prepare data for dimensionality reduction:
    - Log2 transform (stabilizes variance)
    - Remove remaining missing values
    - Z-score normalization (centers and scales features)
    """
    scaler = StandardScaler()
    
    # Log transform and clean protein IDs
    processed = np.log2(df).dropna().copy()
    processed.index = [idx.split(':')[0] for idx in processed.index]
    
    # Standardize features
    processed = pd.DataFrame(
        scaler.fit_transform(processed),
        index=processed.index,
        columns=processed.columns
    )
    return processed

# Generate comprehensive dimensionality reduction dashboard
fig, axes, results_dict = dim_red_utility.create_dim_reduction_dashboard(
    in_df=preprocess_for_dim_reduction(df),
    sample_palette=sample_groups,
    feature_palette={},
    top=500,  # Focus on 500 most variable proteins
    color_dictionary=color_dictionary,
    title="Multi-Method Dimensionality Reduction Analysis"
)

plt.tight_layout(rect=[0, 0.03, 1, 0.95])

# Display variance explained
print("\nVariance explained by first 3 principal components:")
print(results_dict['explained_variance'].head(3))
Explained variance ratio: [0.64777143 0.08738508 0.07778569 0.05570276 0.03574467]

Variance explained by first 3 principal components:
   component  explained_variance  cumulative_variance
0          1           64.777143            64.777143
1          2            8.738508            73.515651
2          3            7.778569            81.294220
[Figure: multi-method dimensionality reduction dashboard]
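To make the dashboard less of a black box, here is a minimal scikit-learn PCA on the same preprocessed matrix (a sketch of roughly what the PCA panel computes; create_dim_reduction_dashboard layers MDS and other methods on top of this):

from sklearn.decomposition import PCA

X = preprocess_for_dim_reduction(df)
top500 = X.loc[X.var(axis=1).nlargest(500).index]   # 500 most variable proteins
pca = PCA(n_components=2)
coords = pca.fit_transform(top500.T)                # samples as observations

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(coords[:, 0], coords[:, 1], c=[sample_groups[s] for s in top500.columns])
for (x, y), name in zip(coords, top500.columns):
    ax.annotate(name, (x, y), fontsize=7)
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.show()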

Critical Finding: Outlier Detection!¶

Alert: Two samples show unexpected behavior across all reduction methods:

  • SLBP12-TET replicate A
  • SLBP1-TET replicate C

These outliers cluster separately from their biological replicates - we need to investigate further!


Hierarchical Clustering: A Different Perspective¶

Let's use unsupervised clustering to confirm our outlier observations.

In [12]:
# Generate clustered heatmap with sample dendrograms
sns.clustermap(
    preprocess_for_dim_reduction(df),
    z_score=0,  # Normalize by protein (rows)
    cmap='Blues',
    figsize=(8, 6),
    cbar_kws={'label': 'Z-score'}
)
plt.suptitle('Sample Clustering Analysis', y=0.99)
Out[12]:
Text(0.5, 0.99, 'Sample Clustering Analysis')
[Figure: clustered heatmap with sample dendrograms]

Root Cause Identified!¶

Clustering confirms the outliers. After investigating with the MS facility:

  • Problem: Poor protein recovery during trypsin digestion for these two samples
  • Action: Ideally we would exclude these two samples from the differential expression analysis; for simplicity, however, we keep them here

Lesson: Always communicate with your core facility when you spot anomalies!


Intensity Distribution Assessment¶

DIA-NN includes built-in normalization, but let's verify intensity distributions are comparable across samples.

In [16]:
fig, ax = plt.subplots(figsize=(10, 5))

sns.boxplot(
    data=np.log2(df),
    showfliers=False,
    palette=sample_groups.values(),
    ax=ax
)

ax.set_title('Pre-Normalization Intensity Distribution', fontweight='bold', fontsize=14)
ax.set_ylabel('Log2 Intensity', fontsize=12)
ax.set_xlabel('Sample', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
[Figure: pre-normalization log2 intensity distributions]
In [17]:
# Apply median normalization for additional balance
data_normalized = norm_loading(np.log2(df))

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(
    data=data_normalized,
    showfliers=False,
    palette=sample_groups.values(),
    ax=ax
)

ax.set_title('Post-Normalization Intensity Distribution', fontweight='bold', fontsize=14)
ax.set_ylabel('Log2 Intensity (Normalized)', fontsize=12)
ax.set_xlabel('Sample', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
medians [23.48822472 23.49156809 23.49019743 23.47039845 23.46962243 23.27145354
 23.18479772 23.48068322 23.48804703 23.45988983 23.45657874 23.458376
 23.4732135  23.45718642 23.48640297]
target 23.441776005765004
norm_facs [0.99802247 0.99788043 0.99793865 0.99878049 0.99881351 1.00731894
 1.01108391 0.99834301 0.99803002 0.99922788 0.99936893 0.99929236
 0.99866071 0.99934304 0.99809988]
[Figure: post-normalization log2 intensity distributions]
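The printed medians, target, and scaling factors above are consistent with a simple loading-normalization scheme along these lines (a sketch of what norm_loading appears to do, not its actual implementation):

logged = np.log2(df)
target = logged.median().mean()           # grand mean of the per-sample medians
norm_facs = target / logged.median()      # per-sample scaling factors, all close to 1
data_norm_sketch = logged * norm_facs     # rescale each column so the medians agree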

Validating Knockdown Targets¶

Our experimental design includes conditional knockdowns:

  • Protein 1 (Tb927.3.870): Depleted in SLBP-1-TET and SLBP-12-TET samples
  • Protein 2 (Tb927.3.1910): Depleted only in SLBP-12-TET samples

Let's verify the knockdown efficiency!

In [18]:
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))

# Protein 1 knockdown
data_normalized.loc['Tb927.3.870:mRNA-p1'].plot(
    kind='bar', ax=axes[0], color='darkred', alpha=0.7
)
axes[0].set_title('Protein 1 Knockdown Efficiency', fontweight='bold')
axes[0].set_ylabel('Log2 Intensity (Normalized)')
axes[0].axhline(y=data_normalized.loc['Tb927.3.870:mRNA-p1'].median(), 
                color='gray', linestyle='--', alpha=0.5, label='Median')
clean_axes(axes[0])

# Protein 2 knockdown
data_normalized.loc['Tb927.3.1910:mRNA-p1'].plot(
    kind='bar', ax=axes[1], color='darkblue', alpha=0.7
)
axes[1].set_title('Protein 2 Knockdown Efficiency', fontweight='bold')
axes[1].set_ylabel('Log2 Intensity (Normalized)')
axes[1].axhline(y=data_normalized.loc['Tb927.3.1910:mRNA-p1'].median(), 
                color='gray', linestyle='--', alpha=0.5, label='Median')
clean_axes(axes[1])

plt.tight_layout()
plt.show()
[Figure: knockdown efficiency bar plots for Tb927.3.870 and Tb927.3.1910]

Excellent! Both knockdowns show clear depletion in the expected samples. This confirms:

  1. Successful tetracycline induction
  2. Target-specific depletion
  3. Biological system is responding as designed

Missing Data Imputation¶

The great debate: To impute or not to impute?

Here, we'll use MissForest (random forest-based imputation) to demonstrate the approach. This method:

  • Leverages correlations between proteins
  • Handles non-linear relationships
  • Provides more nuanced estimates than simple substitution

Note: Given our low missingness rate, results with/without imputation should be similar.
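For a Python-only alternative (a sketch, not what this notebook uses), scikit-learn's KNNImputer similarly borrows information from correlated proteins:

from sklearn.impute import KNNImputer

# Rows are proteins, columns are samples: each gap is filled from the 5 proteins with
# the most similar intensity profiles. Unlike the R function below, this version is
# not applied group by group.
imputer = KNNImputer(n_neighbors=5)
imputed_knn = pd.DataFrame(
    imputer.fit_transform(data_normalized),
    index=data_normalized.index,
    columns=data_normalized.columns,
)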

In [19]:
%%R -i data_normalized
suppressPackageStartupMessages({
    library(missForest)
    library(doParallel)
})

# MissForest imputation with group-wise processing
apply_missForest_to_subsets <- function(df, group_vector, cores = 3) {
    registerDoParallel(cores = cores)
    
    list_df_imputed <- list()
    unique_groups <- unique(group_vector)
    
    for(group in unique_groups) {
        df_subset <- df[, group_vector == group, drop = FALSE]
        
        # Impute missing values using random forest
        imputed_subset <- missForest(
            df_subset,
            ntree = 200,
            maxiter = 20,
            mtry = 3,
            nodesize = c(5, 5),
            parallelize = "forests",
            replace = TRUE
        )$ximp
        
        list_df_imputed[[length(list_df_imputed) + 1]] <- imputed_subset
    }
    
    df_imputed_full <- do.call(cbind, list_df_imputed)
    stopImplicitCluster()
    
    return(df_imputed_full)
}

# Define sample groups (3 replicates each)
subset_vector <- c(1,1,1, 2,2,2, 3,3,3, 4,4,4, 5,5,5)

# Perform imputation
data_normalized_imputed <- apply_missForest_to_subsets(data_normalized, subset_vector)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Loading required package: rngtools
In [20]:
# Transfer imputed data back to Python
%R -o data_normalized_imputed
data_normalized_imputed.head()
Out[20]:
2T1_1 2T1_2 2T1_3 SLBP12_1 SLBP12_2 SLBP12_3 SLBP12TET_1 SLBP12TET_2 SLBP12TET_3 SLBP1_1 SLBP1_2 SLBP1_3 SLBP1TET_1 SLBP1TET_2 SLBP1TET_3
Phleomycin 25.323631 25.074094 25.009328 25.166517 25.019809 24.972966 24.828505 25.044285 25.166902 25.068093 25.243021 25.248122 25.209547 25.132037 24.986135
Tb04.24M18.150:mRNA-p1 13.214134 14.279813 13.006072 16.301880 16.453885 16.657805 13.724168 13.727974 12.408211 13.325080 13.206031 13.275544 13.098157 13.003434 12.744546
Tb05.5K5.100:mRNA-p1;Tb927.5.4450:mRNA-p1 22.386686 22.300921 22.531869 22.098991 22.449922 22.832711 22.849021 22.485554 22.203065 22.414439 22.420724 22.442619 22.492958 22.316802 22.470352
Tb05.5K5.110:mRNA-p1;Tb927.5.4460:mRNA-p1 25.794976 25.777542 25.849295 25.915385 25.963554 26.152356 26.246494 25.761441 25.866413 25.931280 26.001025 25.931730 25.949610 26.043560 25.910431
Tb05.5K5.120:mRNA-p1;Tb927.5.4470:mRNA-p1 22.775044 22.771027 22.735765 22.778995 22.881175 23.037907 22.919583 22.680461 22.663162 22.822548 22.840285 22.780760 22.793283 22.743623 22.670075
In [21]:
# Verify imputation preserved biological signal
fig, axes = plt.subplots(ncols=2, figsize=(12, 4))

data_normalized_imputed.loc['Tb927.3.870:mRNA-p1'].plot(
    kind='bar', ax=axes[0], color='darkred', alpha=0.7
)
axes[0].set_title('Protein 1 (Post-Imputation)', fontweight='bold')
axes[0].set_ylabel('Log2 Intensity')
clean_axes(axes[0])

data_normalized_imputed.loc['Tb927.3.1910:mRNA-p1'].plot(
    kind='bar', ax=axes[1], color='darkblue', alpha=0.7
)
axes[1].set_title('Protein 2 (Post-Imputation)', fontweight='bold')
axes[1].set_ylabel('Log2 Intensity')
clean_axes(axes[1])

plt.tight_layout()
plt.show()
[Figure: knockdown target profiles after imputation]

Validated! Imputation preserved our knockdown signals perfectly. Ready for statistical analysis!
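Beyond the visual check, a quick numeric sanity check (a sketch that assumes the row and column labels survived the R round trip intact) confirms that only the missing cells were changed:

observed = data_normalized.notna()
max_shift = (data_normalized_imputed[observed] - data_normalized[observed]).abs().max().max()
print(f"Largest change on observed values: {max_shift:.2e}")
print(f"Remaining missing values: {int(data_normalized_imputed.isna().sum().sum())}")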


Differential Expression Analysis¶

Time for the main event! We'll use limma to identify differentially expressed proteins.

Strategy: Generate all pairwise comparisons, then select the most biologically relevant contrasts. This exploratory approach helps uncover unexpected patterns.
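The nested loops in the R cell below simply enumerate every ordered pair of group labels; the Python equivalent makes the bookkeeping explicit:

from itertools import permutations

groups = ['T1', 'S12', 'S12T', 'S1', 'S1T']
contrasts = [f'{a}-{b}' for a, b in permutations(groups, 2)]
print(len(contrasts), contrasts[:4])   # 20 contrasts: 'T1-S12', 'T1-S12T', 'T1-S1', 'T1-S1T'

Each comparison therefore also appears with its sign flipped (e.g. S12-S12T and S12T-S12), which is statistically redundant but convenient when browsing the exported tables.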

In [22]:
# Create output directory
!mkdir -p table_comparisons_diaNN
In [23]:
%%R -i data_normalized_imputed
options(warn = -1)
library(limma)

# Define experimental groups
f <- factor(
    c('T1', 'T1', 'T1',
      'S12', 'S12', 'S12',
      'S12T', 'S12T', 'S12T',
      'S1', 'S1', 'S1',
      'S1T', 'S1T', 'S1T'),
    levels = c("T1", "S12", "S12T", "S1", "S1T")
)

# Create design matrix
design <- model.matrix(~0 + f)
colnames(design) <- c("T1", "S12", "S12T", "S1", "S1T")

# Fit linear model
fit2 <- lmFit(data_normalized_imputed, design)

# Generate all pairwise contrasts
contrast_names <- list()
for (i in 1:length(levels(f))) {
    for (j in 1:length(levels(f))) {
        if (i != j) {
            contrast_name <- paste(levels(f)[i], "-", levels(f)[j], sep = "")
            contrast_names[[length(contrast_names) + 1]] <- contrast_name
        }
    }
}

contrast_vector <- unlist(contrast_names)
contrast_matrix <- makeContrasts(contrasts = contrast_vector, levels = design)

# Apply contrasts and empirical Bayes moderation
fit2 <- contrasts.fit(fit2, contrast_matrix)
fit2 <- eBayes(fit2)

# Export all comparison results
export_results <- function(fit, contrast_index, name) {
    results <- topTable(fit, coef = contrast_index, sort.by = "none", number = Inf)
    write.csv(results, paste0('table_comparisons_diaNN/', name, "_comparison.csv"))
}

for (idx in 1:length(colnames(contrast_matrix))) {
    contrast_name <- colnames(contrast_matrix)[idx]
    export_results(fit2, idx, contrast_name)
}

cat("✓ Generated", length(contrast_vector), "pairwise comparisons\n")
✓ Generated 20 pairwise comparisons

Volcano Plots: Visualizing Differential Expression¶

Let's examine one key comparison and track our knockdown targets through the analysis.

In [24]:
# Focus on a biologically relevant comparison
comparison_name = 'S12-S12T'
comparison_file = f'table_comparisons_diaNN/{comparison_name}_comparison.csv'
print(f"Analyzing: {comparison_name}")
Analyzing: S12-S12T
In [25]:
# Track our knockdown targets
target_proteins = ['Tb927.3.870:mRNA-p1', 'Tb927.3.1910:mRNA-p1']
In [26]:
# Generate volcano and MA plots
result_table = pd.read_csv(comparison_file, index_col=0)

# Calculate -log10 p-values for visualization
result_table['log10pval'] = -np.log10(result_table['P.Value'])
result_table['log10adjpval'] = -np.log10(result_table['adj.P.Val'])

fig, axes = plt.subplots(figsize=(14, 5), ncols=2)

# Volcano plot
ax = axes[0]
result_table.plot(
    x='logFC', y='log10adjpval',
    kind='scatter', s=5, alpha=0.3,
    ax=ax, c='gray', label='All proteins'
)
result_table.loc[target_proteins].plot(
    x='logFC', y='log10adjpval',
    kind='scatter', s=50, alpha=1,
    ax=ax, c='red', label='Target proteins', edgecolors='black'
)
ax.axhline(y=-np.log10(0.05), color='blue', linestyle='--', alpha=0.5, label='FDR 0.05')
ax.axvline(x=0, color='black', linestyle='-', alpha=0.3)
ax.set_title('Volcano Plot: Identifying Significant Changes', fontweight='bold')
ax.set_xlabel('Log2 Fold Change')
ax.set_ylabel('-Log10 Adjusted P-value')
ax.legend()

# MA plot
ax = axes[1]
result_table.plot(
    x='AveExpr', y='logFC',
    kind='scatter', s=5, alpha=0.3,
    ax=ax, c='gray', label='All proteins'
)
result_table.loc[target_proteins].plot(
    x='AveExpr', y='logFC',
    kind='scatter', s=50, alpha=1,
    ax=ax, c='red', label='Target proteins', edgecolors='black'
)
ax.axhline(y=0, color='black', linestyle='-', alpha=0.3)
ax.set_title('MA Plot: Fold Change vs. Abundance', fontweight='bold')
ax.set_xlabel('Average Log2 Expression')
ax.set_ylabel('Log2 Fold Change')
ax.legend()

plt.suptitle(f'Differential Expression: {comparison_name}', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# P-value distribution (QC check)
fig, ax = plt.subplots(figsize=(8, 4))
result_table['P.Value'].plot(kind='hist', bins=30, ax=ax, color='steelblue', edgecolor='black')
ax.set_title('P-value Distribution (QC)', fontweight='bold')
ax.set_xlabel('Raw P-value')
ax.set_ylabel('Frequency')
plt.tight_layout()
plt.show()
[Figure: volcano and MA plots for S12-S12T]
[Figure: p-value distribution histogram]
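A quick tally of significant hits in this comparison (using the result_table loaded above; the thresholds are illustrative, not the study's formal cutoffs):

sig = result_table[(result_table['adj.P.Val'] < 0.05) & (result_table['logFC'].abs() > 1)]
print(f"{sig.shape[0]} proteins at FDR < 0.05 and |log2FC| > 1")
print(sig.sort_values('adj.P.Val')[['logFC', 'adj.P.Val']].head(10))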

Interactive Visualization with PlotHub¶

Static plots are informative, but interactive exploration takes analysis to the next level!

We'll prepare our data for PlotHub - enabling:

  • Dynamic filtering and selection
  • Hover-over protein annotations
  • Linked views across multiple plots
  • Easy sharing with collaborators
In [27]:
# Export normalized data for PlotHub
data_normalized_imputed.to_csv('table_comparisons_diaNN/in_data.csv')
In [28]:
# Load protein annotations from FASTA
prot_to_desc = parse_fasta_file('TriTrypDB-68_TbruceiTREU927_AnnotatedProteins.fasta')
!mkdir -p table_comparisons_PlotHub
In [29]:
# Define sample group mappings
sample_to_columns = {
    'T1': ['2T1_1', '2T1_2', '2T1_3'],
    'S12': ['SLBP12_1', 'SLBP12_2', 'SLBP12_3'],
    'S12T': ['SLBP12TET_1', 'SLBP12TET_2', 'SLBP12TET_3'],
    'S1': ['SLBP1_1', 'SLBP1_2', 'SLBP1_3'],
    'S1T': ['SLBP1TET_1', 'SLBP1TET_2', 'SLBP1TET_3']
}

def prepare_for_plothub(data_df, comparison_df, data_cols):
    """
    Format data for PlotHub compatibility:
    - Add gene IDs and accessions
    - Calculate significance metrics
    - Append raw intensity values
    - Include protein descriptions
    """
    result = comparison_df.copy()
    
    # Format identifiers
    result['Gene_acc'] = range(result.shape[0])
    result['Gene_id'] = result.index.str.replace(':mRNA-p1', '', regex=False)
    
    # Prepare metrics
    result['log_AveExpr'] = result['AveExpr']
    result['FDR'] = -np.log10(result['adj.P.Val'])
    
    # Select core columns
    result = result[['Gene_acc', 'Gene_id', 'logFC', 'log_AveExpr', 'FDR']]
    
    # Add descriptions
    result['Desc'] = add_desc(data_df, prot_to_desc)
    
    # Append intensity values for the relevant samples
    intensity_data = data_df[data_cols].copy()
    result = result.join(intensity_data)
    result = result.reset_index(drop=True)
    
    return result

# Process all comparisons
print("Preparing PlotHub-compatible files...\n")
comparison_files = [f for f in os.listdir('table_comparisons_diaNN') if f.endswith('_comparison.csv')]

for file_name in comparison_files:
    # Load data
    data_df = pd.read_csv('table_comparisons_diaNN/in_data.csv', index_col=0)
    comparison_df = pd.read_csv(f'table_comparisons_diaNN/{file_name}', index_col=0)
    
    # Extract comparison name and get relevant columns
    comparison = file_name.replace('_comparison.csv', '')
    group1, group2 = comparison.split('-')
    data_cols = sample_to_columns[group1] + sample_to_columns[group2]
    
    # Format and save
    formatted = prepare_for_plothub(data_df, comparison_df, data_cols)
    output_path = f'table_comparisons_PlotHub/for_web_limma_{comparison}.csv'
    formatted.to_csv(output_path, index=False)
    
    print(f"✓ {comparison}: {formatted.shape[0]} proteins, {len(data_cols)} samples")

print(f"\n✓ All {len(comparison_files)} comparisons ready for PlotHub!")
Preparing PlotHub-compatible files...

✓ T1-S1: 6790 proteins, 6 samples
✓ S1-S12T: 6790 proteins, 6 samples
✓ S12T-S12: 6790 proteins, 6 samples
✓ S12T-S1T: 6790 proteins, 6 samples
✓ S1T-S12T: 6790 proteins, 6 samples
✓ S12-S12T: 6790 proteins, 6 samples
✓ S1-S12: 6790 proteins, 6 samples
✓ T1-S12T: 6790 proteins, 6 samples
✓ T1-S1T: 6790 proteins, 6 samples
✓ S12-S1T: 6790 proteins, 6 samples
✓ S1-S1T: 6790 proteins, 6 samples
✓ S1-T1: 6790 proteins, 6 samples
✓ T1-S12: 6790 proteins, 6 samples
✓ S12-S1: 6790 proteins, 6 samples
✓ S12T-S1: 6790 proteins, 6 samples
✓ S1T-S1: 6790 proteins, 6 samples
✓ S1T-T1: 6790 proteins, 6 samples
✓ S12T-T1: 6790 proteins, 6 samples
✓ S12-T1: 6790 proteins, 6 samples
✓ S1T-S12: 6790 proteins, 6 samples

✓ All 20 comparisons ready for PlotHub!

Analysis Complete!¶

What We Accomplished:¶

✓ Data QC: Identified and explained sample outliers
✓ Validation: Confirmed knockdown efficiency and marker expression
✓ Normalization: Balanced intensity distributions across samples
✓ Imputation: Applied random forest-based (MissForest) missing data imputation
✓ Statistics: Ran limma differential expression across all 20 pairwise contrasts
✓ Visualization: Created static QC and differential expression plots plus PlotHub-ready tables for interactive exploration

Next Steps:¶

  • Upload HTML to your web server
  • View this output in PlotHub by clicking "Upload Test File".

Analysis performed with custom utilities from the Delightful Data Analysis framework
