Simulation plan

Objectives
Pseudo algorithm (Data simulation)
Simulation setting
Evaluation steps
Open questions
Appendix

Objectives

Investigate the impact of window size
Investigate the impact of the percentage of differentially expressed genes (DEGs).
Investigate the impact of the dispersion parameter and of its functional form
Investigate the performance of different parameter estimation methods for the genes within a window

Pseudo algorithm (Data simulation)

1. Draw base expression values for 20,000 genes from an exponential distribution (\(\lambda = 1/250\)). (See Questions.)
2. Specify the percentage of DEGs.
3. For the DEGs, draw a log2 fold change from a normal distribution, \(N(\mu = 0, \sigma = 0.7)\).
4. Shift each base value (drawn from \(\text{Exp}(\lambda = 1/250)\)) up and down by its log2 fold change (drawn from \(N(\mu = 0, \sigma = 0.7)\)) to get the expression levels for the two conditions.
5. Specify a value or a function for the dispersion parameter (the shape parameter of the gamma mixing distribution).
   - Constant value: 1/0.015 (used in DESeq). (See Questions.)
   - Function: \(1/r = 0.01 + 9/(\mu + 100)\) (used in DESeq). (See Questions.)
   - Other functions. (See Questions.)
6. With the expression levels (\(\mu\)) and the dispersion values (\(r\)), simulate two transcriptomes.
7. Repeat steps 1-6 20,000 times.
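
The data simulation steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the planned implementation. The function names (`dispersion_r`, `draw_poisson`, `draw_nb`, `simulate_pair`) are hypothetical. The fold change is assumed to be split symmetrically, half up in one condition and half down in the other. Each count is assumed to be drawn as a gamma-Poisson mixture with shape \(r\) and mean \(\mu\). The Poisson sampler uses only the standard library and is slow for large means, so a real run would likely use a numerical library instead.

```python
import random


def dispersion_r(mu, constant=None):
    """Gamma shape r: a constant if given, else from 1/r = 0.01 + 9/(mu + 100)."""
    if constant is not None:
        return constant
    return 1.0 / (0.01 + 9.0 / (mu + 100.0))


def draw_poisson(lam, rng):
    """Count unit-rate exponential arrivals before time lam (simple, O(lam))."""
    count, t = 0, rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def draw_nb(mu, r, rng):
    """Negative binomial count with mean mu, drawn as a gamma-Poisson mixture."""
    if mu <= 0:
        return 0
    lam = rng.gammavariate(r, mu / r)  # gamma with shape r and scale mu / r
    return draw_poisson(lam, rng)


def simulate_pair(n_genes=20000, deg_fraction=0.10, constant_r=None, seed=0):
    """Simulate one pair of transcriptomes (steps 1-6); returns one dict per gene."""
    rng = random.Random(seed)
    # Step 1: base expression values from Exp(rate = 1/250).
    base = [rng.expovariate(1 / 250) for _ in range(n_genes)]
    # Step 2: choose which genes are DEGs.
    deg_idx = set(rng.sample(range(n_genes), round(deg_fraction * n_genes)))
    genes = []
    for g, b in enumerate(base):
        # Step 3: log2 fold change from N(0, 0.7) for DEGs, 0 otherwise.
        lfc = rng.gauss(0.0, 0.7) if g in deg_idx else 0.0
        # Step 4 (assumption): half the fold change up, half down.
        mu_a, mu_b = b * 2 ** (lfc / 2), b * 2 ** (-lfc / 2)
        # Step 5: dispersion from the constant or the mean-dependent function.
        r_a, r_b = dispersion_r(mu_a, constant_r), dispersion_r(mu_b, constant_r)
        # Step 6: draw the two counts.
        genes.append({
            "is_deg": g in deg_idx,
            "mu_a": mu_a, "mu_b": mu_b,
            "count_a": draw_nb(mu_a, r_a, rng),
            "count_b": draw_nb(mu_b, r_b, rng),
        })
    return genes


# Small demo run; the plan itself calls for 20,000 genes.
genes = simulate_pair(n_genes=1000, deg_fraction=0.10, seed=1)
print(sum(g["is_deg"] for g in genes), "DEGs out of", len(genes))
```

Calling `simulate_pair` with a different `seed` for each repetition would give independent simulated pairs. Passing `constant_r=1 / 0.015` switches from the mean-dependent function to the constant dispersion value.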

Simulation setting

Percentage of DEGs: 5%, 10%, 15%, 20%, and possibly 25% and 30%
Possibly different dispersion parameter values and dispersion functions
Window size: 50 to 1,000 in steps of 50. (See Questions.)
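
As a rough illustration, the settings above could be enumerated as a grid. This sketch assumes the DEG percentages and window sizes are fully crossed, which the plan does not state. It includes the optional 25% and 30% levels, and the names `deg_percentages`, `window_sizes`, and `settings` are hypothetical.

```python
from itertools import product

deg_percentages = [5, 10, 15, 20, 25, 30]   # 25 and 30 are listed as optional
window_sizes = list(range(50, 1001, 50))    # 50, 100, ..., 1000

# Assumption: every DEG percentage is combined with every window size.
settings = [
    {"deg_percent": pct, "window_size": w}
    for pct, w in product(deg_percentages, window_sizes)
]

print(len(window_sizes), "window sizes,", len(settings), "settings in total")
```

Dispersion values or functions, if varied, would add a third factor to the same product.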

Evaluation steps

Compare the estimated dispersion value, \(\hat{r}\), or the estimated dispersion function with the true value or function.
Evaluate type I error rate and power at different simulation settings.
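
A minimal sketch of both evaluation steps, assuming each gene has a known DEG label from the simulation and a reject/do-not-reject call from whichever test is being evaluated. The helper names `relative_error` and `type1_error_and_power` are hypothetical, and relative error is only one possible way to compare \(\hat{r}\) with the true \(r\).

```python
def relative_error(r_hat, r_true):
    """Signed relative error of an estimated dispersion value."""
    return (r_hat - r_true) / r_true


def type1_error_and_power(is_deg, rejected):
    """is_deg, rejected: equal-length sequences of booleans, one entry per gene."""
    false_pos = sum(r and not d for d, r in zip(is_deg, rejected))
    true_pos = sum(r and d for d, r in zip(is_deg, rejected))
    n_null = sum(not d for d in is_deg)
    n_deg = len(is_deg) - n_null
    # Type I error rate: rejected non-DEGs / all non-DEGs.
    type1 = false_pos / n_null if n_null else float("nan")
    # Power: rejected DEGs / all DEGs.
    power = true_pos / n_deg if n_deg else float("nan")
    return type1, power


is_deg = [True, True, False, False, False, False]
rejected = [True, False, True, False, False, False]
print(type1_error_and_power(is_deg, rejected))
```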

Open questions

Which is better in the simulation: drawing the simulation parameters at random or fixing them at specific values? And which is better in general?

Appendix

Questions

Supp. plots

Exponential distribution (\(\lambda = 1/250\))

Gene fold changes

Dispersion parameter function

Supp. proof

If we want a precise estimate of, for example, the type I error rate, a sufficient number of simulation repetitions is needed. Assume each simulation repetition is a Bernoulli trial \(X_i\) with rate \(p\), and there are \(n\) trials corresponding to \(n\) simulation repetitions. In the end, we count the number of false positives, and this count follows a binomial distribution, \(Binom(n,p)\).

\begin{align*}
\hat{p} &= \frac{\sum X_i}{n}\\
CI_{\hat{p}} &= \hat{p} \pm Z_{\alpha/2}\,SE(\hat{p})\\
CI_{\hat{p}} &= \hat{p} \pm 1.96 \sqrt{\frac{p(1-p)}{n}}
\end{align*}

If I want a confidence interval with a total width smaller than 1% (a half-width smaller than 0.005), then:

\begin{align*}
1.96 \sqrt{\frac{p(1-p)}{n}} &< 0.005 \\
n &\geq 7{,}300 \text{, if } p = 0.05 \\
n &\geq 24{,}587 \text{, if } p = 0.8
\end{align*}
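
The bound above can be checked numerically by solving the inequality for \(n\). The sketch below does that and gives thresholds of about 7,300 and 24,587 for the two rates considered. The function name `required_repetitions` and its default arguments are hypothetical.

```python
import math


def required_repetitions(p, half_width=0.005, z=1.96):
    """Smallest integer n with z * sqrt(p * (1 - p) / n) at or below half_width."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)


for p in (0.05, 0.8):
    print(p, required_repetitions(p))
```

Because \(p(1-p)\) is largest at \(p = 0.5\), calling the function with `p=0.5` gives a conservative number of repetitions when the rate is unknown.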

Simulation plan

Qike Li,

February 18, 2016

Questions

Q1: Is this a reasonable distribution to use?

Q2: Is it acceptable to use these values without justification?

Q3: Other reasonable functions?

Q4: Should the window size go even lower or higher? Coarser granularity?
