Sample Probability Problem

Problem

Suppose we are given a set \(\Omega\) consisting of \(N\) different elements, and a map

\[f:\quad\Omega\rightarrow\{0,1\}\]

on that set, which maps every item to either \(0\) or \(1\).

The fraction of items that are mapped to \(1\) is given by

\[p=\frac{1}{N}\sum_{x\in\Omega}f(x)\]

Given a subset \(S\subseteq\Omega\) of size \(n\) define the sample proportion

\[\bar{p}=\frac{1}{n}\sum_{x\in S}f(x)\]

For all the subsets of size \(n\) find the expectation value and the variation of the random variable \(\bar{p}\).

Solution

Let us first compute

\[\begin{split}\begin{aligned} \operatorname{E}[f(x)] & =p\\ \operatorname{E}[f(x)^{2}] & =\frac{1}{N}\sum_{x\in\Omega}f(x)^{2}\\ & =\frac{1}{N}\sum_{x\in\Omega}f(x)\\ & =p\\\end{aligned}\end{split}\]
\[\begin{split}\begin{aligned} \operatorname{Var}[f(x)] & =\operatorname{E}[f(x)^{2}]-\operatorname{E}[f(x)]^{2}\\ & =p-p^{2}\\ & =p(1-p)\\\end{aligned}\end{split}\]

Let us define the set of all subsets of size \(n\)

\[\Sigma_{n}=\{S|S\in2^{\Omega}\text{and}|S|=n\}\]

where \(2^{\Omega}\) defines the powerset of \(\Omega\), and the size of the set is

\[|\Sigma_{n}|={N \choose n}\]

For example, if \(\Omega=\{1,2,3\}\) then

\[\begin{split}\begin{aligned} \Sigma_{0} & =\{\{\}\} & |\Sigma_{0}|=1\\ \Sigma_{1} & =\{\{1\},\{2\},\{3\}\} & |\Sigma_{1}|=3\\ \Sigma_{2} & =\{\{1,2\},\{1,3\},\{2,3\}\}\quad & |\Sigma_{2}|=3\\ \Sigma_{3} & =\{\{1,2,3\}\} & |\Sigma_{0}|=1\\\end{aligned}\end{split}\]

Now the expectation value and variance that we are required to compute are given in terms of:

\[\begin{split}\begin{aligned} \operatorname{E}[\bar{p}] & =\frac{1}{|\Sigma_{n}|}\sum_{S\in\Sigma_{n}}\frac{1}{n}\sum_{x\in S}f(x)\\ & =\frac{1}{n|\Sigma_{n}|}\sum_{S\in\Sigma_{n}}\sum_{x\in S}f(x)\\ \operatorname{E}[\bar{p}^{2}] & =\frac{1}{|\Sigma_{n}|}\sum_{S\in\Sigma_{n}}\left(\frac{1}{n}\sum_{x\in S}f(x)\right)^{2}\\ & =\frac{1}{n^{2}|\Sigma_{n}|}\sum_{S\in\Sigma_{n}}\left(\sum_{x\in S}f(x)^{2}+2\sum_{x<y\,\in S}f(x)f(y)\right)\\\end{aligned}\end{split}\]

To compute the sums that appear in the expressions above we will use arguments of symmetry to show that the terms on the left and right hand side will be the same up to an integer constant

\[\sum_{S\in\Sigma_{n}}\sum_{x\in S}f(x)=A\sum_{x\in\Omega}f(x)\]

where \(A\) is a constant.

Counting the total number of terms on the left and the right it follows that

\[|\Sigma_{n}|n=A\,N\quad\therefore\quad A=\frac{n}{N}|\Sigma_{n}|\]

Similarly,

\[\sum_{S\in\Sigma_{n}}\sum_{x<y\,\in S}f(x)f(y)=B\sum_{x<y\,\in\Omega}f(x)f(y)\]

and counting the the terms on the left and the right

\[\Sigma_{n}\frac{n(n-1)}{2}=B\,\frac{N(N-1)}{2}\quad\therefore\quad B=|\Sigma_{n}|\frac{n(n-1)}{N(N-1)}\]

Since \(f(x)\) is either \(0\) or \(1\) it can be shown that

\[\begin{split}\begin{aligned} \sum_{x\in\Omega}f(x) & =pN\\ \sum_{x\in\Omega}f(x)^{2} & =pN\\ \sum_{x<y\,\in\Omega}f(x)f(y) & =\frac{pN(pN-1)}{2}\\\end{aligned}\end{split}\]

Finally, we obtain simplified expressions for the expectations

\[\begin{split}\begin{aligned} \operatorname{E}[\bar{p}] & =\frac{1}{n|\Sigma_{n}|}\sum_{S\in\Sigma_{n}}\sum_{x\in S}f(x)\\ & =\frac{1}{n|\Sigma_{n}|}A\sum_{x\in\Omega}f(x)\\ & =\frac{1}{n|\Sigma_{n}|}\frac{n}{N}|\Sigma_{n}|\sum_{x\in\Omega}f(x)\\ & =p\\ \operatorname{E}[\bar{p}^{2}] & =\frac{1}{n^{2}|\Sigma_{n}|}\sum_{S\in\Sigma_{n}}\left(\sum_{x\in S}f(x)^{2}+2\sum_{x<y\,\in S}f(x)f(y)\right)\\ & =\frac{1}{n^{2}|\Sigma_{n}|}\left(A\sum_{x\in\Omega}f(x)^{2}+2B\sum_{x<y\,\in\Omega}f(x)f(y)\right)\\ & =\frac{1}{n^{2}|\Sigma_{n}|}\left(|\Sigma_{n}|\frac{n}{N}\sum_{x\in\Omega}f(x)^{2}+2|\Sigma_{n}|\frac{n(n-1)}{N(N-1)}\sum_{x<y\,\in\Omega}f(x)f(y)\right)\\ & =\frac{p}{n}+\frac{1}{n}\frac{n-1}{N-1}p(pN-1)\end{aligned}\end{split}\]

Thus, we obtain the following final result

\[\begin{split}\begin{aligned} \operatorname{E}[\bar{p}] & =p\\ \operatorname{Var}[\bar{p}] & =\frac{p}{n}+\frac{1}{n}\frac{n-1}{N-1}p(pN-1)-p^{2}\end{aligned}\end{split}\]

It can be seen that this agrees with the Central Limit Theorem in the limit \(N\to\infty\)

\[\lim_{N\to\infty}\operatorname{Var}[\bar{p}]=\frac{\operatorname{Var}[f(x)]}{n}=\frac{p-p^{2}}{n}\]