4. Statistical analysis; Computational methods
Normalization¶
- Can be data-driven: for example, normalizing gene expression values or gene counts.
- Can be experimental: for example, in qRT-PCR the mRNA level of a housekeeping gene such as ACTB or GAPDH is measured in parallel with the target gene and used for normalization.
- The mRNA level of Gene X is then normalized to the level of the housekeeping gene in each sample.
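The two kinds of normalization above can be sketched in a few lines. This is a minimal illustration, not a prescribed protocol: it assumes qRT-PCR results are available as Ct values and uses the common 2^-ΔΔCt calculation (which itself assumes roughly 100% amplification efficiency), and it uses counts-per-million as one simple example of count normalization. The function names and Ct values are hypothetical.

```python
def delta_ct(ct_target, ct_reference):
    """Normalize the target gene's Ct to the housekeeping gene in the same sample."""
    return ct_target - ct_reference


def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression (treated vs. control) by the 2^-ddCt calculation.

    Assumption: both genes amplify with roughly 100% efficiency.
    """
    ddct = (delta_ct(ct_target_treated, ct_ref_treated)
            - delta_ct(ct_target_control, ct_ref_control))
    return 2.0 ** (-ddct)


def counts_per_million(counts):
    """Scale raw gene counts from one sample by its library size."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


# Hypothetical Ct values: Gene X and GAPDH in a treated and a control sample.
print(fold_change(22.0, 18.0, 24.0, 18.0))
# Hypothetical raw counts for three genes in one sample.
print(counts_per_million([10, 30, 60]))
```

A lower Ct means more starting template, which is why the exponent carries a minus sign.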
Statistical analysis¶
- t-test
- Compares means between two groups. Independent t-tests compare different subjects (e.g., drug vs. placebo), while paired t-tests analyze the same subjects before and after intervention (e.g., pre/post-treatment blood pressure).
- Chi-square tests
- Analyzes relationships between categorical variables. Commonly used to compare proportions across groups, such as response rates between treatment arms or disease prevalence across populations.
- ANOVA (analysis of variance)
- Extends t-tests to compare means across multiple groups. One-way ANOVA compares one factor (e.g., multiple drug doses), while factorial ANOVA examines interactions between factors (e.g., drug and gender effects).
- Linear regression
- Models relationships between continuous variables. In biomedicine, it is used to predict outcomes from predictors (e.g., how BMI relates to cholesterol) or to adjust for confounding factors.
- Logistic regression
- Predicts binary outcomes (e.g., disease/no disease) based on predictor variables. Essential for developing risk prediction models and analyzing case-control studies.
- Survival analysis
- Analyzes time-to-event data with censored observations
- Kaplan-Meier curves: visualize survival probabilities over time
- Log-rank test: compares survival between groups
- Cox proportional hazards model: assesses the effect of multiple variables on the hazard of the event
- Non-parametric tests
- Used when the data do not follow a normal distribution:
- Mann-Whitney U test: non-parametric alternative to the independent t-test
- Kruskal-Wallis test: non-parametric alternative to one-way ANOVA
- Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
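To make a few of the tests listed above concrete, here is a minimal standard-library sketch of the independent and paired t statistics, a chi-square test on a 2x2 table, and a Kaplan-Meier estimate with censoring. All function names and data are hypothetical. The independent test is written in Welch's unequal-variance form, which is one of several variants, and the t-test functions return only the statistic because a p-value needs the t distribution. In practice a statistics library is the usual route (for example, SciPy's `scipy.stats.ttest_ind`, `ttest_rel`, and `chi2_contingency`).

```python
import math
from statistics import mean, stdev, variance


def welch_t(group_a, group_b):
    """t statistic for two independent groups (Welch's unequal-variance form)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


def paired_t(before, after):
    """t statistic for paired measurements on the same subjects."""
    diffs = [y - x for x, y in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom), no continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals, col_totals = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))


def kaplan_meier(times, events):
    """Kaplan-Meier estimate; events[i] is 1 if the event occurred, 0 if censored.

    Returns (time, survival probability) at each distinct event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        n_leaving = sum(1 for ti in times if ti == t)  # events plus censored at t
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Hypothetical data.
print(welch_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7]))              # drug vs. placebo
print(paired_t([140, 150, 160, 155], [135, 142, 150, 150]))    # pre/post blood pressure
print(chi_square_2x2([[30, 20], [20, 30]]))                    # responders by treatment arm
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))                # subject 2 is censored
```

Censored subjects leave the risk set without lowering the survival estimate, which is the key difference between Kaplan-Meier and a naive proportion of survivors.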
GWAS, genome-wide association studies¶
GWAS Application: From Experimental Design to Results
- Experimental Design: A case-control study investigates genetic associations with Type 2 Diabetes (T2D) in diverse populations. It recruits 5,000 diagnosed T2D patients and 5,000 matched healthy controls, with detailed phenotypic data including clinical measurements, family history, and environmental exposures. Participants provide informed consent for genotyping and data sharing under appropriate ethical approvals.
- Upstream Process: DNA is extracted from blood samples collected from all participants, following standardized protocols to ensure high-quality genomic material. The extracted DNA is quantified, quality-checked, and genotyped on a high-density SNP array covering more than 800,000 genetic variants across the genome. Quality control before analysis includes monitoring call rates, testing for Hardy-Weinberg equilibrium, and running technical replicates.
- Downstream Analysis: Initial quality filtering removes samples with low genotyping rates (<98%) and SNPs with poor call rates, minor allele frequency <1%, or significant deviation from Hardy-Weinberg equilibrium. Population stratification is addressed using principal component analysis and ancestry-informative markers. Association testing uses logistic regression models adjusted for age, sex, BMI, and ancestry components, with the genome-wide significance threshold set at p < 5×10^-8. Identified signals undergo replication testing in independent cohorts and fine-mapping to pinpoint likely causal variants.
- Expected Results: The study identifies several genomic loci significantly associated with T2D risk, including both previously known and novel associations. Pathway enrichment analysis reveals biological processes related to insulin signaling, glucose metabolism, and pancreatic beta-cell function. Polygenic risk scores built from significant variants show predictive value for disease risk beyond traditional clinical factors. Functional annotation of associated variants gives insight into potential regulatory mechanisms and guides follow-up experiments to validate biological effects and explore therapeutic implications for personalized medicine.
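Two steps of the GWAS workflow above lend themselves to a short sketch: per-SNP quality control (minor allele frequency and Hardy-Weinberg equilibrium) and a polygenic risk score. This is an illustration under stated assumptions, not the study's pipeline. The Hardy-Weinberg check uses a chi-square goodness-of-fit approximation, the `min_hwe_p` cutoff of 1e-6 is a hypothetical choice (the text only says "significant deviation"), and the risk score is the usual weighted sum of allele dosages with hypothetical effect sizes. All function names are illustrative.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # significance threshold quoted in the text


def allele_frequencies(n_aa, n_ab, n_bb):
    """Frequencies of alleles A and B from genotype counts at one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return p, 1.0 - p


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit p-value (1 df) against Hardy-Weinberg proportions.

    Requires a polymorphic SNP (both alleles observed), otherwise an
    expected count is zero.
    """
    n = n_aa + n_ab + n_bb
    p, q = allele_frequencies(n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(stat / 2.0))


def passes_snp_qc(n_aa, n_ab, n_bb, min_maf=0.01, min_hwe_p=1e-6):
    """Keep a SNP only if MAF >= min_maf and it does not deviate strongly from HWE."""
    p, q = allele_frequencies(n_aa, n_ab, n_bb)
    if min(p, q) < min_maf:
        return False
    return hwe_p_value(n_aa, n_ab, n_bb) >= min_hwe_p


def polygenic_risk_score(dosages, effect_sizes):
    """Weighted sum of risk-allele dosages (0, 1, or 2) for one individual."""
    return sum(d * beta for d, beta in zip(dosages, effect_sizes))


# Hypothetical genotype counts (AA, AB, BB).
print(passes_snp_qc(250, 500, 250))   # common SNP in Hardy-Weinberg proportions
print(passes_snp_qc(490, 20, 490))    # strong heterozygote deficit
print(passes_snp_qc(995, 5, 0))       # rare variant, MAF below 1%
print(polygenic_risk_score([0, 1, 2], [0.1, 0.2, 0.3]))
print(3e-9 < GENOME_WIDE_ALPHA)       # would this association p-value be genome-wide significant?
```

The 5×10^-8 threshold corresponds to a Bonferroni correction of 0.05 for roughly one million independent tests.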
eQTL, expression quantitative trait locus¶
eQTL (Expression Quantitative Trait Locus) Analysis
Experimental Design: A study investigates genetic variants that affect gene expression in liver tissue. It recruits 300 individuals undergoing liver biopsy during bariatric surgery or routine clinical procedures, and participants provide informed consent for genetic testing and tissue collection. Detailed clinical data, including liver function tests, metabolic parameters, and medication history, are recorded to account for potential confounders in downstream analyses.
Upstream Process: DNA and RNA are extracted simultaneously from the liver tissue samples. DNA is genotyped on a genome-wide SNP array capturing 1 million variants. RNA is quality-checked, including RNA integrity number (RIN) assessment, and RNA-seq libraries are prepared and sequenced to a depth of 30 million paired-end reads per sample. Tissue processing, nucleic acid extraction, and sequencing follow standardized protocols to minimize batch effects and technical variation.
Downstream Analysis: Genotype data are filtered on call rate, minor allele frequency, and Hardy-Weinberg equilibrium. RNA-seq reads are aligned to the reference genome, gene expression is quantified, and expression values are normalized for sequencing depth and composition biases. eQTL mapping uses linear regression to test for association between each variant's genotype and the expression of genes within 1 Mb of that variant (cis-eQTLs), adjusting for age, sex, principal components of genetic ancestry, and technical covariates. Statistical significance is determined using permutation-based methods with FDR control.
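The core of the cis-eQTL test described above can be sketched as follows. This is a simplified illustration with several assumptions: the gene's position is taken to be its transcription start site, the regression has no covariates, and the Benjamini-Hochberg procedure stands in for FDR control (the permutation step mentioned in the text is not shown). Function names and data are hypothetical.

```python
from statistics import mean

CIS_WINDOW_BP = 1_000_000  # 1 Mb window from the text


def is_cis(variant_pos, gene_tss, window=CIS_WINDOW_BP):
    """True if the variant lies within the cis window of the gene (same chromosome assumed)."""
    return abs(variant_pos - gene_tss) <= window


def eqtl_slope(dosages, expression):
    """Least-squares slope of expression on genotype dosage (0, 1, or 2 alternate alleles)."""
    x_bar, y_bar = mean(dosages), mean(expression)
    sxx = sum((x - x_bar) ** 2 for x in dosages)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(dosages, expression))
    return sxy / sxx


def benjamini_hochberg(p_values, alpha=0.05):
    """Flag tests significant at false discovery rate alpha (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # largest rank that meets its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant


# Hypothetical variant-gene pair and normalized expression values.
print(is_cis(1_500_000, 1_000_000))
print(eqtl_slope([0, 1, 2, 0, 1, 2], [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]))
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))
```

The slope is the estimated change in expression per additional copy of the alternate allele, which is the effect size usually reported for an eQTL.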
Expected Results: The analysis identifies thousands of significant cis-eQTLs, where genetic variants correlate with the expression of nearby genes. Integration with GWAS data shows that disease-associated variants often act as eQTLs, which offers mechanistic insight into how genetic risk factors influence disease through gene regulation. Tissue-specific eQTL effects help explain why certain variants affect disease risk in specific organs. Colocalization analysis distinguishes shared causal variants from coincidental overlap between GWAS and eQTL signals. The findings contribute to functional annotation of the genome and help prioritize genes for therapeutic targeting based on their genetic regulation.