FAQ

  1. What are the Different Types of Datasets with Expression Profiles on GEO?

    • GEO primarily hosts functional genomics data: mRNA, microRNA (miRNA), circRNA, long non-coding RNA (lncRNA), and other small RNA (piRNA, snoRNA, tRNA-derived) expression profiles from microarrays and sequencing, plus ChIP-seq, ATAC-seq, and epigenomic data such as DNA methylation and histone modifications. Mass-spectrometry proteomics and metabolomics data are generally deposited in dedicated repositories (for example PRIDE and MetaboLights) rather than in GEO.

  2. Can I perform Expression Analyses without Giant Computers?

    • For smaller-scale analyses, various bioinformatics tools and R packages (such as DESeq2, edgeR, and limma) can be utilized on personal computers or laptops with moderate specifications. These tools are efficient for analyzing moderate-sized datasets without requiring extensive computational resources.

  3. What are the common types of Expression Analyses?

    • Differential expression analysis, pathway analysis, co-expression network analysis, cluster analysis, functional enrichment analysis, and integrative analyses combining multiple omics datasets are common types of expression analyses.

  4. What are the Data Collection Methods for Differential Expression Analysis (DEA)?

    • DEA is typically conducted on transcriptomic (RNA-seq, microarray), epigenomic (ChIP-seq, ATAC-seq), and other omics data types. It involves comparing gene or feature expression levels between different conditions or groups to identify significant differences.

  5. What are the Data Types used for Conducting Differential Expression Analysis?

    • DEA can be run on transcriptomic data (mRNA, miRNA, circRNA, lncRNA, and other small RNAs) and, with suitable tools, on other count- or intensity-based omics data such as ChIP-seq. The choice depends on the research question and the biological context of the study.

  6. How do you decide expression profiles for the DEA Project?

    • Selecting the type of expression profiles for DEA depends on the research objective, the specific biological process or pathways of interest, the availability of datasets, and the type of samples or conditions being studied.

  7. Can I perform DEA using GEO datasets without downloading them?

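    • Yes. GEO2R, the web tool on the GEO site, runs a limma-based comparison between sample groups in the browser without any manual download. The GEOquery package in R does download data, but it fetches a series (for example its processed series matrix) directly into the R session by accession number, so there is no need to retrieve and parse files by hand.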

  8. How to Interpret DEA Results?

    • Interpretation involves identifying differentially expressed genes or features, examining fold changes and statistical significance, analyzing pathway enrichment, functional annotation, and validating findings through biological context and existing literature.
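
As a rough illustration of the first two interpretation steps (fold change and significance), the sketch below filters a small results table. The gene names, values, and cutoffs are all hypothetical; real thresholds depend on the study.

```python
import math

# Hypothetical results: (gene, mean in control, mean in disease, adjusted p-value).
# All numbers are made up for illustration.
results = [
    ("GENE_A", 100.0, 400.0, 0.001),
    ("GENE_B", 250.0, 260.0, 0.800),
    ("GENE_C", 80.0, 20.0, 0.010),
    ("GENE_D", 50.0, 110.0, 0.200),
]


def log2_fold_change(control_mean, disease_mean, pseudocount=1.0):
    """log2(disease / control); the pseudocount avoids division by zero."""
    return math.log2((disease_mean + pseudocount) / (control_mean + pseudocount))


def call_significant(rows, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Keep genes with |log2FC| >= lfc_cutoff and adjusted p < padj_cutoff."""
    hits = []
    for gene, ctrl, dis, padj in rows:
        lfc = log2_fold_change(ctrl, dis)
        if abs(lfc) >= lfc_cutoff and padj < padj_cutoff:
            direction = "up" if lfc > 0 else "down"
            hits.append((gene, round(lfc, 2), direction))
    return hits


hits = call_significant(results)
for gene, lfc, direction in hits:
    print(gene, lfc, direction)
```

GENE_D has a large fold change but fails the adjusted p-value cutoff, which is why both criteria are usually checked together.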

  9. What is the outline for writing a Research Paper on my DEA Project?

    • A research paper typically includes sections like Introduction, Methods (including data acquisition and analysis methods), Results (DEA results and interpretations), Discussion (interpretation, implications, and limitations), and Conclusion (summary and future directions).

Each of these aspects plays a crucial role in conducting and communicating the findings of a Differential Expression Analysis project. Depending on the specific details of your study, you'll need additional information and clear guidance for a comprehensive understanding and execution of each step.

  1. Why should every Biomed undergrad perform at least a few differential analyses on an expression data type of their choice for a disease of their interest using GEO datasets?

Performing differential analyses on expression data from GEO datasets offers several benefits for biomedical undergraduates interested in research:

  • Hands-on Experience: Conducting differential expression analyses provides practical experience working with real-world biological data, enhancing practical bioinformatics and data analysis skills.

  • Understanding Disease Mechanisms: By focusing on a disease of interest, undergraduates can delve deeper into the molecular underpinnings of the condition, gaining insights into gene regulation, pathway alterations, and potential biomarkers.

  • Integration of Theoretical Knowledge: Applying differential expression analysis allows students to integrate classroom learning with practical application, reinforcing their understanding of molecular biology, genomics, and bioinformatics concepts.

  • Skill Development: Engaging in such analyses hones critical thinking, problem-solving, and analytical skills essential for research and biomedical careers.

  • Exposure to Research Methods: It introduces undergraduates to standard research methodologies used in biomedicine, providing a glimpse into the research process and the utilization of publicly available datasets.

  • Contribution to Scientific Knowledge: Even in small-scale projects, undergraduates can contribute insights or novel findings to the scientific community, fostering a sense of accomplishment and encouraging further exploration.

  • Career Readiness: Acquiring proficiency in bioinformatics tools and data analysis techniques boosts employability and prepares students for potential roles in research, healthcare, or biotech industries.

  • Networking and Mentorship: Engaging in research projects can lead to mentorship opportunities and networking with faculty, researchers, and peers, potentially opening doors for future collaborations or recommendations.

  • Problem Identification and Solving: Through analyzing disease-related expression data, students develop the ability to identify research questions and seek solutions to address specific biomedical problems.

  • Appreciation of Data-Driven Research: Understanding how to navigate and utilize publicly available datasets fosters an appreciation for data-driven research and encourages lifelong learning in a rapidly evolving field.

In summary, performing differential expression analyses on expression data from GEO datasets equips biomedical undergraduates with invaluable skills, enhances their understanding of disease mechanisms, and prepares them for future careers in research and healthcare.

  1. What are the R packages specialized for analyzing various types of expression data?

    There are several R packages available for performing differential expression analysis on different types of expression data. Here are some of the commonly used packages for each type of data:

    • miRNA expression data: edgeR, DESeq2, limma

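    • circRNA expression data: CircTest, or edgeR, DESeq2, and limma applied to back-splice junction counts (DCC and CIRCexplorer2 are Python tools that detect and quantify circRNAs upstream of R)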

    • lncRNA expression data: edgeR, DESeq2, limma

    • snRNA expression data: edgeR, DESeq2, limma

    • shRNA expression data: edgeR, DESeq2, limma

    • Small RNA expression data (piRNAs, snoRNAs, tRNAs): edgeR, DESeq2, limma

    • ChIP-seq data: DiffBind, ChIPseeker, csaw

    • ATAC-seq data: DiffBind, csaw, and edgeR on peak counts (PeakSeq is a standalone peak caller, not an R package)

    • Ribosome profiling data: riboSeqR, Riborex

    • Epigenomics data (DNA methylation, histone modifications, chromatin structure): DSS, ChAMP, edgeR

    • Metabolomics data: MetaboAnalystR, CAMERA, MetaboDiff

    • Proteomics data: limma, MSnbase, MSstats, DEP (voom is designed for RNA-seq counts, not protein intensities)

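    • Exome sequencing data: exome sequencing measures DNA variants rather than expression, so count-based packages such as edgeR and DESeq2 do not apply directly; they become relevant only when the data are RNA-derived counts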

    Each package has its own advantages and considerations, so it is recommended to read the documentation and tutorials associated with each package to understand their features and usage.

  2. What are the main statistical methods used in Differential Expression Analysis (DEA)?

    Differential Expression Analysis commonly employs statistical methods like the negative binomial distribution for RNA-seq data and linear models for microarray data. These methods assess gene expression changes between conditions, considering sample variance, normalization, and false discovery rate (FDR) control.

  3. How can one handle batch effects in Differential Expression Analysis?

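    Batch effects, which arise from technical variation such as processing date or sequencing run, can confound DEA. The usual approach is to include batch as a covariate in the model design. ComBat and surrogate variable analysis (SVA) estimate and adjust for known or hidden batch structure, while limma's removeBatchEffect is intended mainly for visualization (for example PCA plots or heatmaps) rather than for producing input to the differential test. None of these methods can separate batch from biology when the two are completely confounded.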

  4. What role does normalization play in Differential Expression Analysis?

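    Normalization removes technical variation, such as differences in sequencing depth and library composition, so that samples are comparable. For differential expression of RNA-seq counts, between-sample methods such as Trimmed Mean of M-values (TMM, used by edgeR) or the median-of-ratios method (used by DESeq2) are standard. Reads Per Kilobase per Million mapped reads (RPKM) and Transcripts Per Million (TPM) also correct for gene length and are better suited to comparing genes within a sample or to visualization than to count-based DE testing.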
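
As one concrete between-sample normalization, the sketch below computes median-of-ratios size factors, the approach popularized by DESeq2. It is a simplified pure-Python illustration on a made-up count matrix, not the package's own code.

```python
import math
from statistics import median

# Hypothetical raw counts: rows are genes, columns are samples.
# Sample 2 was sequenced twice as deeply as sample 1, sample 3 half as deeply.
counts = [
    [100, 200, 50],
    [400, 800, 200],
    [30, 60, 15],
    [0, 5, 2],  # contains a zero, so it is skipped when building the reference
]


def size_factors(count_matrix):
    """Median over genes of (count / geometric mean across samples), per sample."""
    n_samples = len(count_matrix[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in count_matrix:
        if any(c <= 0 for c in row):
            continue  # the geometric mean is undefined with zero counts
        log_geo_mean = sum(math.log(c) for c in row) / len(row)
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


factors = size_factors(counts)
# Dividing each column by its size factor puts samples on a common scale.
normalized = [[c / f for c, f in zip(row, factors)] for row in counts]
print([round(f, 3) for f in factors])
```

Using the median makes the factor robust to a handful of strongly changed genes, which is the main reason this kind of method is preferred over simple total-count scaling.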

  5. How can one integrate multi-omics data for a comprehensive analysis?

    Integrative analyses merge data from different omics layers (e.g., transcriptomics, epigenomics, metabolomics) to uncover complex biological relationships. Tools like mixOmics, MOFA (Multi-Omics Factor Analysis), or Confero enable multi-omics data integration, revealing inter-omic correlations.

  6. What methods are available for pathway enrichment analysis after Differential Expression Analysis?

    Pathway enrichment tools, such as Enrichr, DAVID, or g:Profiler, test whether biological pathways or gene sets are overrepresented among differentially expressed genes. They help interpret DEA results in the context of biological functions and pathways.
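
A common way to formalize over-representation is a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation with hypothetical gene counts; the tools named above may use variants of this test and their own gene-set databases.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K are in the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical numbers, for illustration only.
background_genes = 20000  # N: genes tested in the experiment
pathway_genes = 100       # K: background genes annotated to the pathway
de_genes = 200            # n: differentially expressed genes
overlap = 5               # k: DE genes that are in the pathway

p_value = hypergeom_upper_tail(overlap, pathway_genes, de_genes, background_genes)
print(f"enrichment p-value: {p_value:.4g}")
```

By chance alone about one pathway gene would be expected among the 200 DE genes here, so an overlap of five gives a small p-value. When many pathways are tested, these p-values also need multiple-testing correction.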

  1. Can one perform Differential Expression Analysis on single-cell RNA-seq (scRNA-seq) data?

    Yes, specialized packages like Seurat, scran, or Monocle handle Differential Expression Analysis on scRNA-seq data. These tools account for the unique challenges posed by analyzing gene expression in individual cells, including high noise levels and sparsity.

  2. How does False Discovery Rate (FDR) control impact Differential Expression Analysis outcomes?

    FDR control (e.g., the Benjamini-Hochberg procedure) adjusts p-values for the many tests run in parallel, one per gene. Calling genes significant at an adjusted p-value below a threshold such as 0.05 keeps the expected proportion of false positives among the called genes at or below that threshold, rather than eliminating false positives altogether.
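
The Benjamini-Hochberg adjustment itself is short enough to write out. This is a minimal sketch on made-up p-values, intended to show the mechanics rather than replace a library routine.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
padj = benjamini_hochberg(pvals)
significant = [i for i, p in enumerate(padj) if p < 0.05]
print([round(p, 5) for p in padj], significant)
```

Note that the third and fourth genes have raw p-values below 0.05 but adjusted values slightly above it, so only the first two are called at a 5% FDR.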

  3. What considerations are crucial for choosing the appropriate DEA tool for a specific dataset?

    Factors such as data distribution, sample size, data preprocessing steps, computational efficiency, and assumptions of statistical methods should guide the choice of the appropriate DEA tool for accurate and reliable analysis.

  4. How do biological replicates influence the reliability of Differential Expression Analysis results?

    Biological replicates increase statistical power and confidence in DEA by capturing inherent biological variability. Adequate replicates help distinguish true biological changes from experimental noise.

  5. Are there specific pre-processing steps essential before conducting Differential Expression Analysis?

    Pre-processing steps such as quality control, data normalization, filtering out low-quality features or samples, and batch effect removal are critical to ensure robust and reliable results in Differential Expression Analysis.

  1. How does the design matrix specification in Limma differ for different expression data types like miRNA, lncRNA, and mRNA?

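    The design matrix in limma encodes the experimental factors and covariates, and it is built the same way whatever the molecule type: it depends on the experimental design, not on whether the features are miRNAs, lncRNAs, or mRNAs. What changes between data types is mostly the upstream processing (for example, voom for sequencing counts versus log-intensities for arrays). Adjust the design matrix for pairing or blocking in paired experiments, or to include batch as a covariate.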

  2. What considerations should be made when specifying contrasts for miRNA expression data in Limma?

    When analyzing miRNA data using Limma, specifying contrasts should focus on biological conditions of interest, for example, diseased vs. healthy states or treated vs. control groups. Ensuring meaningful contrasts enhances the identification of differentially expressed miRNAs.

  1. How does one account for potential confounding factors in differential expression analysis using Limma for lncRNA expression data?

    In Limma, controlling for confounding factors such as batch effects or sample variability can be achieved by including them as covariates in the linear model design. Adjusting for these factors helps in isolating true biological differences.
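
To make the idea of a covariate in the design concrete, the sketch below builds a small design matrix (intercept, condition, batch) and fits it by ordinary least squares for a single feature. The sample labels and expression values are hypothetical and noise-free; in R the equivalent design would come from `model.matrix(~ condition + batch)`.

```python
def design_matrix(conditions, batches):
    """Columns: intercept, condition == 'disease', batch == 'B2'."""
    return [[1.0, float(c == "disease"), float(b == "B2")]
            for c, b in zip(conditions, batches)]


def least_squares(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        piv = A[col][col]
        A[col] = [v / piv for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Hypothetical, unbalanced study: most disease samples were run in batch B2.
conditions = ["control", "control", "control", "disease",
              "control", "disease", "disease", "disease"]
batches = ["B1", "B1", "B1", "B1", "B2", "B2", "B2", "B2"]
# Simulated log-expression: baseline 5, disease effect +2, batch B2 effect +1.
y = [5.0, 5.0, 5.0, 7.0, 6.0, 8.0, 8.0, 8.0]

beta = least_squares(design_matrix(conditions, batches), y)

disease = [v for v, c in zip(y, conditions) if c == "disease"]
control = [v for v, c in zip(y, conditions) if c == "control"]
naive_diff = sum(disease) / len(disease) - sum(control) / len(control)
print("adjusted effect:", round(beta[1], 3), "naive difference:", round(naive_diff, 3))
```

The naive difference of group means absorbs part of the batch effect, while the coefficient from the model with batch in the design recovers the simulated disease effect.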

  1. What steps are essential in the design formula for mRNA expression data to control for multiple factors or interactions in Limma?

    For mRNA expression data, the design formula in Limma should consider multiple factors or interactions like treatment, timepoints, and covariates. Constructing appropriate design matrices including these factors enables comprehensive analysis of differential expression.

  2. How can one accommodate paired experimental designs in Differential Expression Analysis using Limma for ChIP-seq data?

    For paired designs in ChIP-seq data, include the pair (for example, the donor or subject) as a blocking factor in the design matrix alongside the condition. limma then compares conditions within pairs, which removes between-pair variability and improves the sensitivity of differential binding analysis.
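
For a two-condition study, modeling the pairing amounts to testing the within-pair differences. The sketch below computes an ordinary paired t-statistic for one hypothetical peak; limma's moderated statistic would additionally borrow variance information across peaks.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(cond_a, cond_b):
    """Paired t-statistic and degrees of freedom for matched samples."""
    diffs = [b - a for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical log2 signal for one peak in four donors, before and after treatment.
control = [5.1, 6.3, 4.8, 7.0]
treated = [6.0, 7.1, 5.9, 7.8]

t_stat, df = paired_t(control, treated)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The donors differ a lot from one another at baseline, but every donor shifts by a similar amount, which is exactly the situation where a paired analysis gains power.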

  3. What strategies should be employed when dealing with high-dimensional data, such as proteomics, in Limma's design matrix?

    For high-dimensional datasets like proteomics, the design matrix describes the samples, not the features, so it stays the same size however many proteins are measured. Multiple comparisons are handled after model fitting by adjusting the p-values (for example, with the Benjamini-Hochberg method), and dimensionality reduction such as PCA is used beforehand for quality control and exploration rather than inside the design matrix.

  4. How does Limma address the issue of heteroscedasticity commonly observed in certain types of expression data like circRNA or snRNA?

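    limma addresses unequal variances in two ways. Empirical Bayes moderation shrinks each gene's variance estimate toward a value pooled across all genes, which stabilizes the test statistics when there are few samples. For count data, where the variance depends on the mean, the voom transformation estimates the mean-variance trend and passes precision weights to the linear model.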
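
The shrinkage step can be written as a weighted average of the gene's own variance and a prior variance, weighted by their degrees of freedom. In the sketch below the prior variance and prior degrees of freedom are simply assumed; limma estimates them from the full set of genes.

```python
def moderated_variance(s2, d, s0_2, d0):
    """Posterior variance: (d0 * s0^2 + d * s^2) / (d0 + d)."""
    return (d0 * s0_2 + d * s2) / (d0 + d)


# Hypothetical per-gene sample variances, each with 4 residual degrees of freedom.
gene_variances = [0.01, 0.25, 4.0]
residual_df = 4

# Assumed prior values, chosen for illustration only.
prior_variance = 0.25
prior_df = 4

moderated = [moderated_variance(s2, residual_df, prior_variance, prior_df)
             for s2 in gene_variances]
print([round(v, 3) for v in moderated])
```

Very small variances are pulled up and very large ones pulled down, so a gene with an accidentally tiny variance no longer produces an inflated test statistic.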

  5. Can Limma's design matrix account for complex experimental designs involving multiple treatment groups and time-series data in metabolomics analysis?

    Yes, Limma can handle complex designs by incorporating factors such as multiple treatment groups and time-series data into the design matrix. This allows for flexible modeling of various experimental conditions in metabolomics studies.

  6. How does Limma adjust for outliers or influential data points that might impact differential expression analysis in ATAC-seq data?

    limma offers several ways to limit the influence of outliers: robust regression when fitting the model (the method = "robust" option of lmFit), robust empirical Bayes moderation (robust = TRUE in eBayes), and sample-level quality weights (arrayWeights) that down-weight unreliable samples. These options help mitigate the impact of extreme values in ATAC-seq data.

  7. In what ways can Limma's design matrix accommodate non-standard experimental designs, such as dose-response experiments in epigenomic data analysis?

    Limma's design matrix can be adapted to handle non-standard designs by encoding dose levels or continuous variables to capture dose-response relationships. Including these factors in the design enables the analysis of differential epigenomic changes across dose gradients.

  8. How do edgeR and DESeq2 algorithms differ in their approach to RNA-seq data analysis?

    Both edgeR and DESeq2 are popular algorithms used for differential gene expression analysis in RNA-seq data. While they have similar goals, they differ in their approach to data analysis.

    • edgeR: edgeR models read counts with a negative binomial distribution, which allows for overdispersion (more variability than a Poisson model predicts). It normalizes for library composition with TMM, estimates gene-wise dispersions with empirical Bayes shrinkage toward a common or trended value, and tests for differential expression with an exact test, or with likelihood ratio or quasi-likelihood F-tests in its generalized linear model (GLM) framework.

    • DESeq2: DESeq2 also uses a negative binomial GLM. It normalizes with median-of-ratios size factors, shrinks gene-wise dispersion estimates toward a fitted mean-dispersion trend, and tests with a Wald test or a likelihood ratio test. It can additionally shrink log fold changes, which helps with ranking and visualizing low-count genes.

    In summary, both edgeR and DESeq2 use negative binomial models, and both borrow information across genes to stabilize dispersion estimates. They differ mainly in their normalization method, the details of dispersion estimation, and the tests they offer. The choice between edgeR and DESeq2 often depends on the specific dataset and research question at hand.
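
The overdispersion that both packages model can be seen from the negative binomial variance function, variance = mean + dispersion × mean². The sketch below compares it with the Poisson variance for a few mean counts; the dispersion value is an assumption chosen for illustration.

```python
def nb_variance(mu, dispersion):
    """Negative binomial variance: mu + dispersion * mu^2 (Poisson if dispersion is 0)."""
    return mu + dispersion * mu ** 2


dispersion = 0.1  # assumed value; the square root is the biological coefficient of variation

for mu in (10, 100, 1000):
    poisson_var = nb_variance(mu, 0.0)
    nb_var = nb_variance(mu, dispersion)
    print(f"mean={mu:5d}  Poisson variance={poisson_var:9.1f}  NB variance={nb_var:11.1f}")
```

For highly expressed genes the quadratic term dominates, which is why a Poisson model would badly understate the variability between biological replicates.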

  9. How does the t-test algorithm work for gene expression analysis?

    The t-test algorithm is commonly used for differential gene expression analysis. It helps to identify genes that are differentially expressed between two groups of samples, such as disease versus control groups.

    The t-test algorithm works by comparing the means of gene expression values between the two groups and assessing whether the difference is statistically significant. It calculates a t-value, which measures the difference between the means relative to the variability within the groups.

    Here's a step-by-step process of how the t-test algorithm works for gene expression analysis:

    1. Data preprocessing: The gene expression data is preprocessed to remove any noise or outliers and to normalize the values.

    2. Grouping: The samples are divided into two groups, typically based on their experimental conditions (e.g., disease vs. control).

    3. Calculating means: The mean gene expression value is calculated for each group.

    4. Calculating variances: The variance or standard deviation of gene expression values is calculated for each group.

    5. Calculating t-value: The t-value is calculated using the formula t = (mean1 - mean2) / sqrt((variance1/n1) + (variance2/n2)), where mean1 and mean2 are the means of the two groups, variance1 and variance2 are their respective variances, and n1 and n2 are the sample sizes.

    6. Calculating degrees of freedom: The degrees of freedom are calculated based on the sample sizes and used to determine the p-value associated with the t-value.

    7. Determining significance: The p-value is compared to a predefined significance level (e.g., 0.05) to determine whether the difference in gene expression is statistically significant. If the p-value is below the significance level, it suggests that the gene is differentially expressed.

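    It's important to note that the t-test assumes the expression values in each group are approximately normally distributed. The formula in step 5 is the Welch form, which does not require equal variances; the classic Student's t-test instead pools the two variances and does assume they are equal. With the small sample sizes typical of expression studies, per-gene variance estimates are unstable, which is why moderated tests such as limma's are often preferred. If the assumptions are violated, alternative statistical tests may be used instead.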
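
Steps 3 to 6 can be sketched for a single gene as follows. The expression values are hypothetical, and the degrees of freedom use the Welch-Satterthwaite approximation that goes with the unpooled formula in step 5. Converting the t-value to a p-value needs the t-distribution, which is not in Python's standard library, so that last step is left out here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group1, group2):
    """t-value and Welch-Satterthwaite degrees of freedom for two groups."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = variance(group1), variance(group2)  # sample variances
    se2 = v1 / n1 + v2 / n2                      # squared standard error
    t = (mean(group1) - mean(group2)) / sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


# Hypothetical normalized log2 expression of one gene.
disease = [8.1, 7.9, 8.4, 8.0]
control = [6.9, 7.2, 7.0, 7.1]

t_value, dof = welch_t(disease, control)
print(f"t = {t_value:.2f}, df = {dof:.1f}")
```

In a real analysis this calculation is repeated for every gene, and the resulting p-values are then corrected for multiple testing.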

  10. Are there any special considerations for analyzing time course data using limma?

    There are a few special considerations. Time course experiments typically involve measurements taken at multiple time points, and the aim is often to identify genes that change over time. With a small number of time points, time can be treated as a factor and the makeContrasts function used to define contrasts between time points of interest. With many time points, a smooth curve such as a regression spline can be fitted instead.

    Additionally, when the same subject is measured at several time points, those measurements are correlated. The duplicateCorrelation function in limma can be used to estimate this within-subject correlation, which is then passed to the model fit.

    Overall, limma provides flexibility for analyzing time course data by allowing the incorporation of time-related factors and accounting for correlation between time points.

  11. Can you explain linear models and empirical Bayes methods?

    Linear models and empirical Bayes methods are statistical techniques used in the analysis of gene expression data. A linear model describes the relationship between a dependent variable (here, the expression of one gene) and one or more independent variables (such as treatment group). Fitting one linear model per gene identifies genes that are differentially expressed between two or more groups of samples. The empirical Bayes method then improves the estimate of each gene's variance by borrowing information from all other genes, which gives stable results even when the number of arrays is small.

    In the limma package, this combination supports microarray experiments with arbitrary numbers of treatments and RNA samples.

  12. Other resources about DEA with limma in R

    [Differential gene expression analysis | Functional genomics II](https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/data-analysis/differential-gene-expression-analysis/)

    [Differential Expression Analysis - of Microarray Data](https://www3.nd.edu/~steve/Rcourse/Lecture11v1.pdf)

    [A comparison of methods for differential expression analysis ...](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91)

    [Methods for evaluating gene expression from Affymetrix ...](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-284)

    [Best practices on the differential expression analysis of multi ...](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02337-8)

    [Differential expression analysis | Functional genomics II](https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/microarrays/analysis-of-microarray-data/differential-expression-analysis/)

    [Differential gene expression analysis based on linear ...](https://www.nature.com/articles/s41598-023-43686-7)

    [Pre-processing and differential expression analysis of Agilent ...](https://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-64)

    [Differential Expression Analysis: Understanding th - pluto.bio](https://pluto.bio/blog/differential-expression-analysis-techniques-and-benefits)

    [Differential Expression - omicsoft doc](https://omicsoftdocs.github.io/ArraySuiteDoc/tutorials/Microarray/Differential_Expression/)

    [Differential gene expression (DGE) analysis](https://hbctraining.github.io/Training-modules/planning_successful_rnaseq/lessons/sample_level_QC.html)

    [Testing for differentially expressed genes with microarray ...](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC154240/)

    [Linear Models for Differential Expression in Microarray Studies](https://online.stat.psu.edu/stat555/node/12/)

    [Differential expression analysis - Clariom S array -workflow](https://support.bioconductor.org/p/9143200/)

    [Dream: powerful differential expression analysis for ...](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8055218/)

    [limma powers differential expression analyses for RNA ...](https://academic.oup.com/nar/article/43/7/e47/2414268)

    [RNA-Seq differential expression analysis: An extended ...](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190152)

    [Commonly used statistical methods for detecting ...](https://ologyjournals.com/beij/beij_00001.php)

    [Robustness of differential gene expression analysis ...](https://www.sciencedirect.com/science/article/pii/S200103702100221X)

    [An end to end workflow for differential gene expression ...](https://bioconductor.org/packages/release/workflows/vignettes/maEndToEnd/inst/doc/MA-Workflow.html)

    [limma](https://kasperdanielhansen.github.io/genbioconductor/html/limma.html)

    [limma](https://bioconductor.org/packages/release/bioc/html/limma.html)

    [limma: Linear Models for Microarray and RNA-Seq Data ...](https://bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf)

    [Differential expression analysis using a model-based gene ...](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04438-4)

    [Differential Expression with Limma-Voom](https://ucdavis-bioinformatics-training.github.io/2018-June-RNA-Seq-Workshop/thursday/DE.html)

    [R: Introduction to the LIMMA Package](http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/01Introduction.html)

    [Statistical approaches for differential expression analysis in ...](https://academic.oup.com/bioinformatics/article/37/Supplement_1/i34/6319701)

    [How to install limma in R studio : r/RStudio](https://www.reddit.com/r/RStudio/comments/qwhpxx/how_to_install_limma_in_r_studio/)

    [Differential expression analysis using a model-based gene ...](https://pubmed.ncbi.nlm.nih.gov/34670485/)

    [limma](https://bioconductor.riken.jp/packages/3.9/bioc/html/limma.html)