Malthus and Ricardo, Wages and Rent


Ferdinand Lassalle’s Iron Law of Wages, which follows from Malthus, and David Ricardo’s Law of Rent are among the very first relatively quantitative statements or observations in economics, and can in my opinion be considered a sort of ancestor of modern economics.

In the Iron Law of Wages, as population increases, the labor supply increases and thus the wage price decreases – which does mean that we assume labor demand is unaffected by population and thus labor demand is effectively exogenous.  Wages continue to decrease until they hit subsistence levels for laborers.  A further decrease in wages is unsustainable as laborers will literally be unable to sustain themselves, which may cause a decrease in population.  A decrease in population, i.e. a decrease in the labor supply, pushes wages back up to the long-term level, which is the minimum subsistence level.  If the wage price is above subsistence level, population will rise (the assumption is that any wage above the subsistence level contributes to population growth) until the wage decreases to the subsistence level.

Malthus’s Iron Law of Population is the observation that, given enough food, population grows exponentially (geometrically), while agricultural output grows only linearly (arithmetically). Agricultural output is limited by 1. the amount of new land that can be put to agricultural use and 2. the amount of additional intensification that can be done to increase the output of existing agricultural lands, which Malthus understandably assumes to have diminishing returns. For the former limit, his evidence is the population growth in the early United States, where new land was plentiful (despite the existence of Natives on those lands); for the latter limit, the diminishing returns of agricultural intensification, his evidence is an appeal to the common sense of the times (which may be understandable – we can suppose that it would have been hard for someone in the early 1800s to think that agricultural output could grow to accommodate an exponentially growing population, or that in the future longer years of education would lead to declining fertility rates). Since linear growth has no hope of staying above exponential growth in the long run, Malthus’s conclusion is that once population hits the level where the masses can only afford a subsistence level of living, that will be the long-run equilibrium for wages and quality of life. There may be ameliorating factors such as improvements in agricultural technology, delay in bearing children, and contraception, or direct decreases to population such as war and disease, but Malthus’s opinion was that none of these can overturn the Iron Law of Population. In any case, once population hits the level where people are living at subsistence levels, whether it is war, disease, or famine that keeps population from going above this long-run equilibrium doesn’t change the fact that the factors holding population at this equilibrium are painful to humanity.





The Terms of Trade of Brazil


An article in the New York Times by Paul Krugman talked about a current economic downturn in Brazil. What happened:

First, the global environment deteriorated sharply, with plunging prices for the commodity exports still crucial to the Brazilian economy. Second, domestic private spending also plunged, maybe because of an excessive buildup of debt. Third, policy, instead of fighting the slump, exacerbated it, with fiscal austerity and monetary tightening even as the economy was headed down.

What didn’t happen:

Maybe the first thing to say about Brazil’s crisis is what it wasn’t. Over the past few decades those who follow international macroeconomics have grown more or less accustomed to “sudden stop” crises in which investors abruptly turn on a country they’ve loved not wisely but too well. That was the story of the Mexican crisis of 1994-5, the Asian crises of 1997-9, and, in important ways, the crisis of southern Europe after 2009. It’s also what we seem to be seeing in Turkey and Argentina now.

We know how this story goes: the afflicted country sees its currency depreciate (or, in the case of the euro countries, its interest rates soar). Ordinarily currency depreciation boosts an economy, by making its products more competitive on world markets. But sudden-stop countries have large debts in foreign currency, so the currency depreciation savages balance sheets, causing a severe drop in domestic demand. And policymakers have few good options: raising interest rates to prop up the currency would just hit demand from another direction.

But while you might have assumed that Brazil was a similar case — its 9 percent decline in real G.D.P. per capita is comparable to that of sudden-stop crises of the past — it turns out that it isn’t. Brazil does not, it turns out, have a lot of debt in foreign currency, and currency effects on balance sheets don’t seem to be an important part of the story. What happened instead?

Slowly going over the three points that Krugman made in the beginning:

1. Commodity prices went down and Brazil exports a lot of commodities.

Brazil’s exports in 2016:


At a glance, we have among commodities: vegetable products, mineral products (5% crude petroleum, 10% iron and copper ore), foodstuffs, animal products, metals, and precious metals. Though this selection may over- or underestimate the true percentage of commodity exports among all of Brazil’s exports, let’s use it for our approximation. The total percentage of these products is about 60%, where around 36% are agricultural commodities, around 27% are metal commodities (metals + iron and copper ore), around 5% is crude petroleum, and around 2% are precious metals. These categorizations are improvised simplifications and don’t follow any formal definitions.

Looking at the S&P GSCI Agricultural & LiveStock Index Spot (SPGSAL):


we definitely do see a downtrend in the last several years in agricultural commodities.

Looking at the S&P GSCI Industrial Metals Index Spot (GYX):


there was a decline from 2011 but a rise from 2016.

Looking at the S&P GSCI Precious Metals Index Spot (SPGSPM):


it’s been flat since around 2013.

Looking at S&P GSCI Crude Oil Index Spot (G39):


it has been low after a decline in 2014 with volatility in 2017-2018.

But instead of eyeballing this phenomenon with a bunch of different charts, there’s a way to mathematically eyeball it in one chart, called the terms of trade.

Investopedia’s definition of terms of trade:

What are ‘Terms of Trade – TOT’?

Terms of trade represent the ratio between a country’s export prices and its import prices. The ratio is calculated by dividing the price of the exports by the price of the imports and multiplying the result by 100. When a country’s TOT is less than 100%, more capital is leaving the country than is entering the country. When the TOT is greater than 100%, the country is accumulating more capital from exports than it is spending on imports.

But how exactly do you calculate the “price of exports and imports” of a country like, say, Brazil, which has USD 190B of exports a year and surely thousands if not more different products, and what do we do about the changing quantities of each of those products every year? How do we understand the terms of trade in a way that doesn’t vaguely seem like the current account balance (which is the total value of exports minus imports, or net value of exports: \( EX - IM = \sum_{i}^{}{p_i \cdot q_i} - \sum_{i}^{}{p'_i \cdot q'_i} \), where \( p_i\), \( q_i \) are the price and quantity of export product \(i\) and \( p'_i\), \( q'_i \) are the price and quantity of import product \(i\))?

The answer is by deciding on a base year against which to compare the year in question. For example, for the prices of products in the year in question, we sum the values of exports for each product in that year, i.e. \( \sum_{i} {p_{i,n} \cdot q_{i,n}} \) where \(i\) is the index for each different product and \(n\) is the year in question. For the prices of products in the base year \(0\), we take the price of each product \(i\) in that base year multiplied by the quantity of that product \(i\) in the year in question \(n\). In other words, we fix the quantity of each product \(q_i\) to the quantity of each product in the year in question \(q_{i,n}\) so that we are strictly comparing prices between years \(n\) and \(0\) and not letting changes in quantity \(q\) get in the way. This is the Paasche index.

Another way we can do this is: for the prices of products in the year in question \(n\), we sum the prices of each product in that year \( p_{i,n} \) multiplied by the quantity of each product from the base year \( q_{i,0} \), and for the prices in the base year \(0\), we take the price of each product \(i\) in that base year multiplied by the quantity of that product \(i\) also in the base year \(0\). So this time, instead of fixing the quantity of each product to the year in question \(n\), we fix the quantity of each product to the base year \(0\). This is the Laspeyres index.

Paasche index:

$$ P_{\textrm{Paasche}} = \frac{\sum_{i}{p_{i,n} \cdot q_{i,n}}}{\sum_{i}{p_{i,0} \cdot q_{i,n}}} $$

Laspeyres index:

$$ P_{\textrm{Laspeyres}} = \frac{\sum_{i}{p_{i,n} \cdot q_{i,0}}}{\sum_{i}{p_{i,0} \cdot q_{i,0}}} $$


Thus, by using such a price index calculation we “cancel out” the effect of changing export or import quantities so that we are only looking at the change in the price of exports or imports between two time periods. With a base year \(0\), we can calculate the price index for exports in year \(n\), the price index for imports in year \(n\), and then divide the former by the latter to obtain the terms of trade for year \(n\).
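As a concrete illustration, here is a small Python sketch of both indices and the terms of trade computed from them. The products, prices, and quantities are entirely made up for the example:

```python
# Toy sketch (made-up numbers) of the Paasche and Laspeyres price indices
# and a terms of trade figure computed from them.

def paasche(p0, pn, qn):
    """Paasche index: quantities fixed to the year in question n."""
    return sum(pn[i] * qn[i] for i in pn) / sum(p0[i] * qn[i] for i in p0)

def laspeyres(p0, pn, q0):
    """Laspeyres index: quantities fixed to the base year 0."""
    return sum(pn[i] * q0[i] for i in pn) / sum(p0[i] * q0[i] for i in p0)

# Hypothetical export prices/quantities in base year 0 and year n.
export_p0 = {"soy": 1.0, "iron": 2.0}
export_q0 = {"soy": 10, "iron": 5}
export_pn = {"soy": 2.0, "iron": 2.0}
export_qn = {"soy": 8, "iron": 10}

# Hypothetical import prices/quantities.
import_p0 = {"machinery": 4.0}
import_q0 = {"machinery": 3}
import_pn = {"machinery": 5.0}
import_qn = {"machinery": 4}

export_index = paasche(export_p0, export_pn, export_qn)  # 36/28
import_index = paasche(import_p0, import_pn, import_qn)  # 20/16
terms_of_trade = 100 * export_index / import_index
```

A terms of trade above 100 here means export prices rose more than import prices relative to the base year.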


A terms of trade chart quantitatively summarizes all the above eyeballing we did with the visualization of Brazil’s exports and the charts of commodities indices as well as the eyeballing we didn’t do with Brazil’s imports. And we see what we expect in the above graph, which is a drop in Brazil’s terms of trade in the last several years.

2. Brazil’s consumer spending declined due to rising household debt (the red graph):


3. Brazil implemented fiscal austerity to try to deal with “long-term solvency problems” and raised interest rates to try to deal with inflation, which was caused by depreciation in the currency. The currency depreciated due to lower commodity prices, which of course is also reflected in the terms of trade graph above.

Depreciating currency (blue) and inflation (change in or first derivative of red):


Interest rates raised to combat inflation:


We can see that interest rates rose in late 2015 as a response to rising inflation. Inflation drops in response over the next couple of years, but this rise in interest rates contributed to the slowdown in Brazil’s economy.


So we have a drop in the terms of trade (due to a drop in commodity prices), a drop in consumer spending (due to a rise in household debt in preceding years), and then fiscal austerity and monetary contraction as government policy responses, causing a recession in Brazil.


Centered regular text.


Centering equation by centering the latex text:

\displaystyle\sum_{n=1}^\infty \frac{1}{n^2} = \frac{\pi^2}{6}. \\ 4

Above does not work with tables.  Multiple lines with \\ look awkward.  Only first line is centered.


Aligning equations using \begin{array}. To center, begin the LaTeX with <p align="center"> and end it with \quad </p>. Will need to check by going between Visual and Text modes. Note that there will probably be strange behavior with the location of </p> in Text mode when going between the two modes, and it will probably have to be corrected multiple times. Note also that the lines need to be tight in Text mode – no extra line of space between the LaTeX code – and this needs to be edited and checked in Text mode, not Visual mode:


\begin{array}{rcl} f: R^3 & \to & R \\  (x,y,z) & \to & x + y + z \\  f(x,y,z) & = & x + y + z  \end{array}


Centered table:

\begin{tabular}{ |c|c|c| }  \hline  One & Two & Three \\  \hline  1/2 & 49 & 6 \\  1/3 & 32 & 8 \\  \hline  \end{tabular}



rcl: three columns, the first column right-justified, the middle one centered, and the third column left-justified

Test Coin2


Suppose there are two coins and the percentage that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \), \(p \neq q \), and the values are given and known. If you are free to flip one of the coins any number of times, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?

The distribution of heads after \(n\) flips for a coin will be a binomial distribution with means at \(pn\) and \(qn\).


Two binomial distributions, n = 20. The means are pn = 10 and qn = 14.

Setting Up Our Hypothesis Test

Let’s say we want to test if our coin is the \(p\) coin and let’s say we arbitrarily decide to call the smaller probability \(p\), i.e. \(p < q\). We know that coin flips give us a binomial distribution, and we know the standard deviation of a binomial random variable \(X_p\) (let \(X_p\) or \(X_{p,n}\) be a binomial random variable for the number of flips that are heads, where the probability of a head on a flip is \(p\) and we do \(n\) number of flips), which is:

$$ \textrm{Standard Deviation of }{X_p} = \sqrt{ Var\left( {X_p} \right) } = \sqrt{ np(1-p) } $$


Digression: we can also split our \(n\) Bernoulli trial coin flips that make up our binomial random variable \(X_p\) into \(m\) number of binomial random variables \(X_{p,k}\) each with \(k\) trials, such that \(k \times m = n\). Then the standard error of the mean proportion of heads from \(m\) binomial random variables (each with \(k\) trials) is:

$$ \textrm{Standard error of the mean} = \sqrt{ Var\left( \overline{X_{p,k}} \right) } = \sqrt{ Var \left( {1 \over m} \sum_{i=1}^{m} {X_{p,k}} \right) } $$
$$= \sqrt{ Var(\sum_{i=1}^{m} X_{p,k}) \over m^2 } = \sqrt{ m \cdot Var(X_{p,k}) \over m^2 }= \sqrt{ {m \cdot kp(1-p) \over m^2 } } = \sqrt{ { kp(1-p) \over m} } $$

This standard error above is for the random variable \(X_{p,k}\), each of which has \(k\) Bernoulli trials. In other words, the standard deviation of \( {1 \over m} \sum_{i=1}^{m} X_{p,k} \) is \( \sqrt{ kp(1-p) \over m }\). But if you simply change \(k\) to \(km = n\) and reduce \(m\) to \(1\), you get the same result as if you took all \(km = n\) trials as the number of trials for one binomial random variable, our original \(X_p\): where we now say that the standard deviation of \( {1 \over 1} \sum_{i=1}^{1} X_{p,n} = X_{p,n} = X_p \) is \( \sqrt{ np(1-p) \over 1 } = \sqrt{ np(1-p) } \).

By going from \(m\) repetitions of \(X_{p,k}\) to \(1\) repetition of \(X_{p,n}\), both the mean and the standard deviation is multiplied by \(m\). The mean of \(X_{p,k}\) is \(kp\) and the mean of \(X_{p,n}\) is \(mkp = np\); the standard deviation of \(X_{p,k}\) is \( \sqrt{ kp(1-p) } \) and the standard deviation of \(X_{p,n}\) is \( \sqrt{ mkp(1-p) } =\sqrt{ np(1-p) } \). The standard error of the mean of \(m\) repetitions of \(X_{p,k}\) is \( \sqrt{ { kp(1-p) \over m} } \) while the mean of \(m\) repetitions of \(X_{p,k}\) is of course just \( {1 \over m} \sum_{i=1}^{m} \mathbb{E} \left[ X_{p,k} \right] = {1 \over m} m (kp) = kp \). So when going from \(1\) repetition of \(X_{p,k}\) to \(m\) repetitions of \(X_{p,k}\), the mean goes from \(kp\) to \(mkp = np\) and the standard error of the mean of \(X_{p,k}\) goes from \( \sqrt{ { kp(1-p) \over m} } \) to the standard deviation of \( X_{p,n} \) by multiplying the standard error of the mean of \( X_{p,k} \) by \(m\): \( m \cdot \sqrt{ { kp(1-p) \over m} } = \sqrt{ { m^2 \cdot kp(1-p) \over m} } = \sqrt{ { mkp(1-p)} } = \sqrt{ { np(1-p)} } \).
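A quick numeric check of this digression, with arbitrary values chosen for \(p\), \(k\), and \(m\):

```python
import math

# Numeric check of the digression: splitting n = k*m Bernoulli trials into
# m batches of k trials each, the standard error of the mean batch count,
# scaled by m, equals the standard deviation of the single n-trial binomial.
p, k, m = 0.3, 10, 5
n = k * m

se_mean_of_batches = math.sqrt(k * p * (1 - p) / m)  # SE of mean of m batches
sd_combined = math.sqrt(n * p * (1 - p))             # SD of one n-trial binomial

assert math.isclose(m * se_mean_of_batches, sd_combined)
```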


Knowing the standard deviation of our random variable \(X_p\), a 0.05 significance level for a result that “rejects” the null would mean some cutoff value \(c\) where \(c > pn\). If \(x_p\) (the sample number of heads from \(n\) coin tosses) is “too far away” from \(pn\), i.e. we have \(x_p > c\), then we reject the null hypothesis that we have been flipping the \(p\) coin.

But note that if we choose a \(c\) that far exceeds \(qn\) as well, we are in a weird situation. If \(x_p > c\), then \(x_p\) is “too large” for \(pn\) but also quite a bit larger than \(qn\) (i.e. \( x_p > qn > pn \)). This puts us in an awkward situation because while \(x_p\) is much larger than \(pn\), making us want to reject the hypothesis that we were flipping the \(p\) coin, it is also quite a bit larger than \(qn\), so perhaps we obtained a result that was pretty extreme “no matter which coin we had.” If we assume the null hypothesis that we have the \(p\) coin, our result \(x_p\) is very unlikely, but it is also quite unlikely even if we had the \(q\) coin, our alternative hypothesis. But still, it is more unlikely that it is the \(p\) coin than the \(q\) coin, so perhaps it’s not that awkward. But what if \(x_p\) does not exceed \(c\)? Then we can’t reject the null hypothesis that we have the \(p\) coin. But our sample result \(x_p\) might in fact be closer to \(qn\) than \(pn\) – \(x_p\) might even be right on the dot of \(qn\) – and yet we aren’t allowing ourselves to use that to form a better conclusion, which is a truly awkward situation.

If \(c\) is, instead, somewhere in between \(pn\) and \(qn\), and \(x_p > c\), we may reject the null hypothesis that our coin is the \(p\) coin while \(x_p\) is in a region close to \(qn\), i.e. a region that is a more likely result if we actually had been flipping the \(q\) coin, bringing us closer to the conclusion that this is the \(q\) coin. However, if we reverse the experiment – if we use the same critical value \(c\) and say that if \(x_p < c\) then we reject our null hypothesis that \(q\) is our coin – then the power and significance of the test for each coin is different, which is also awkward.

Above, the pink region is the probability that \(X_p\) ends in the critical region, where \(x_p > c\), assuming the null hypothesis that we have the \(p\) coin. This is also the Type I Error rate (a.k.a. false positive) – the probability that we end up falsely rejecting the null hypothesis, assuming that the null hypothesis is true.

Above, the green region is the power \(1-\beta\), the probability that we get a result in the critical region \(x_p > c\) assuming that the alternative hypothesis is true, that we have the \(q\) coin. The blue-gray region is \(\beta\), or the Type II Error rate (a.k.a. false negative) – the probability that we fail to reject the null hypothesis (that we have the \(p\) coin) when what’s actually true is the alternative hypothesis (that we have the \(q\) coin).
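To make these regions concrete, here is a small Python sketch using the normal approximations to the two binomial distributions (the values of \(p\), \(q\), \(n\), and \(c\) are made up for illustration):

```python
import math

def norm_cdf(x, mu, sigma):
    """CDF of a normal distribution, computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical example: normal approximations to the two binomials.
p, q, n, c = 0.5, 0.7, 20, 13

# Type I error rate: P(X_p > c) assuming the null (the p coin).
alpha = 1.0 - norm_cdf(c, p * n, math.sqrt(n * p * (1 - p)))
# Power: P(X_q > c) assuming the alternative (the q coin).
power = 1.0 - norm_cdf(c, q * n, math.sqrt(n * q * (1 - q)))
# Type II error rate.
beta = 1.0 - power
```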

Now let us “reverse” the experiment with the same critical value – we want to test our null hypothesis that we have the \(q\) coin:

We have \(x_p < c\). We fail to reject the null hypothesis that we have the \(p\) coin, and on the flip side we would reject the null hypothesis that we have the \(q\) coin. But we have failed a tougher test (the first one, with a small \(\alpha_p\)) and succeeded in rejecting an easier test (the second one, with a larger \(\alpha_q\)). In hypothesis testing, we would like to be conservative, so it is awkward to have failed a tougher test but "be ok with it" since we succeeded with an easier test. Common sense also, obviously, says that something is strange when \(x_p\) is closer to \(qn\) than \(pn\) and yet we conclude that since \(x_p\) is on the "\(p\)-side of \(c\)," we have the \(p\) coin. In reality, we wouldn't take one result and apply two hypothesis tests to that one result. But we would like the one test procedure to make sense whichever null hypothesis we start with, \(p\) coin or \(q\) coin (since it is arbitrary which null hypothesis we choose in the beginning, for we have no knowledge of which coin we have before we start the experiment).

What we can do, then, is to select \(c\) so that the probability, under the hypothesis that we have the \(p\) coin, that \(X_p > c\) is equal to the probability that, under the hypothesis that we have the \(q\) coin, that \(X_q < c\). In our set up, we have two binomial distributions, which are discrete, as opposed to the normal distributions above. In addition, binomial distributions, unless the mean is at \(n/2\), are generally not symmetric, as can be seen in the very first figure, copied below as well, where the blue distribution is symmetric but the green one is not.

We can pretend that the blue distribution is the binomial distribution for the \(p\) coin and the green distribution for the \(q\) coin. The pmf of a binomial random variable, say \(X_p\) (that generates Heads or Tails with probability of Heads \(p\)) is:

$$ {n \choose h} p^h (1-p)^{n-h} $$

where \(n\) is the total number of flips and \(h\) is the number of Heads among those flips. We let \(c\) be the critical number of Heads that would cause us to reject the null hypothesis that the coin we have is the \(p\) coin in favor of the alternative hypothesis that we have the \(q\) coin. The area of the critical region, i.e. the probability that we get \(c\) heads or more assuming the hypothesis that we have the \(p\) coin, is:

$$ Pr(X_p > c) = \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] $$

And the reverse, the probability that we get \(c\) heads or fewer assuming the hypothesis that we have the \(q\) coin, is:

$$ Pr(X_q < c) = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$

So we want to set these two equal to each other and solve for \(c\):

$$ \sum_{i=c}^{n} \left[ {n \choose i} p^i (1-p)^{n-i} \right] = \sum_{i=0}^{c} \left[ {n \choose i} q^i (1-q)^{n-i} \right] $$

But since the binomial distribution is discrete, there may not be a \(c\) that actually works. For large \(n\), a normal distribution can approximate the binomial distribution. In that case, we can draw the figure below, which is two normal distributions, each centered on \(pn\) and \(qn\) (the means of the true binomial distributions), and since normal distributions are symmetric, the point at which the distributions cross will be our critical value. The critical regions for \(X_p\) (to the right of \(c\)) and for \(X_q\) (to the left of \(c\)) will have the same area.

If we pretend that these normal distributions are binomial distributions, i.e. if we pretend that our binomial distributions are symmetric (i.e. we pretend that \(n\) is going to be large enough that both our binomial distributions of \(X_p\) and \(X_q\) are symmetric enough), then to find \(c\) we can find the value on the horizontal axis at which, i.e. the number of Heads at which, the two binomial probability distributions are equal to each other.

$$ {n \choose c} p^c (1-p)^{n-c} = {n \choose c} q^c (1-q)^{n-c} $$
$$ p^c (1-p)^{n-c} = q^c (1-q)^{n-c} $$
$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n-c} = 1 $$
$$ \left({p \over q}\right)^c \left({1-p \over 1-q}\right)^{n} \left({1-q \over 1-p}\right)^c = 1 $$
$$ \left({p(1-q) \over q(1-p)}\right)^c = \left({1-q \over 1-p}\right)^{n} $$
$$ c \cdot log \left({p(1-q) \over q(1-p)}\right) = n \cdot log \left({1-q \over 1-p}\right) $$

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

A binomial random variable \(X_p\) has mean \(pn\) and standard deviation \(\sqrt{np(1-p)}\). For a normal distribution \(X_{\textrm{norm}}\) with mean \(\mu_{\textrm{norm}}\) and standard deviation \(\sigma_{\textrm{norm}}\), the value \( c_{\alpha} = \mu_{\textrm{norm}} + 1.645\sigma_{\textrm{norm}}\) is the value where the area from that value \(c_{\alpha}\) to infinity is \(0.05 = \alpha\). Thus, \( c_{\alpha} \) is the critical value for a normal random variable where \( Pr(X_{\textrm{norm}} > c_{\alpha}) = 0.05 \). So for a binomial random variable \(X_p\), we would have \(c_{\textrm{binomial, }\alpha} = pn + 1.645\sqrt{np(1-p)}\).

Thus, we have that this critical value for a binomial random variable \(X_p\):

$$ c = n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) $$

must also be

$$ c_{\textrm{binomial, }\alpha} \geq pn + 1.645\sqrt{np(1-p)} $$

for the area to the right of \(c\) to be \(\leq 0.05\). To actually find the critical value \(c_{\textrm{binomial, }\alpha}\), we can just use

$$ c_{\textrm{binomial, }\alpha} \geq pn + 1.645\sqrt{np(1-p)} $$

Since we are given the values of \(p\) and \(q\), we would plug in those values to find the required \(n\) needed to reach this condition for the critical value. So we have

$$ n \cdot log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) = pn + 1.645\sqrt{np(1-p)} $$

$$ \sqrt{n} = 1.645\sqrt{p(1-p)} / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p \right] $$

$$ n = 1.645^2p(1-p) / \left[ log \left({1-q \over 1-p}\right) / log \left({p(1-q) \over q(1-p)}\right) - p\right]^2 $$

For example, if \(p = 0.3\) and \(q = 0.7\), we have \(n = 14.2066 \), or rather, \(n \geq 14.2066 \).

Wolfram Alpha calculation of above, enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.3, q = 0.7

Note that if we switch the values so that \(p = 0.7\) and \(q = 0.3\), we obtain the same \(n_{\textrm{min}}\). This makes sense: \(n_{\textrm{min}}\) depends on \(c\), and \(c\) is the value on the horizontal axis at which the two normal distributions from above (approximations of binomial distributions) with means at \(pn\) and \(qn\) cross each other, which is unchanged by the swap. In addition, because here \(q = 1 - p\), we have \(p(1-p) = q(1-q)\), so the \(\sqrt{p(1-p)}\) factor in the formula is also unchanged; for a pair with \(q \neq 1 - p\), swapping \(p\) and \(q\) would change \(p(1-p)\) and give a slightly different \(n_{\textrm{min}}\).

So if we generate a sample such that the number of samples is \(n \geq 14.2066\), we can use our resulting \(x_p\) and make a hypothesis test regarding if we have the \(p\) or \(q\) coin with \(\alpha = 0.05\) significance level.

If \(p\) and \(q\) are closer, say \(p = 0.4\) and \(q = 0.5\), then we have \(n \geq 263.345\). This makes intuitive sense: the closer the probabilities of the two coins are, the more times we have to flip our coin to be sure that we have one of the coins rather than the other. To be more precise, the smaller the effect size is, the larger the sample size we need in order to reach the same certainty about a result. An example of an effect size measure is Cohen’s d, where:

$$\textrm{Cohen’s d } = {\mu_2 - \mu_1 \over \textrm{StDev (or Pooled StDev)}} $$

Wolfram Alpha calculation of above for \(n\) with \(p = 0.4\) and \(q = 0.5\), or enter the following into Wolfram Alpha:

1.645^2 * p * (1-p) / (ln((1-q)/(1-p))/ln(p*(1-q)/(q*(1-p))) - p )^2; p = 0.4, q = 0.5
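The same calculation can be done in a couple of lines of Python, as a sketch of the closed-form normal-approximation formula derived above:

```python
import math

def n_min_normal_approx(p, q, z=1.645):
    """n from the formula above: normal approximation, one-sided alpha = 0.05."""
    a = math.log((1 - q) / (1 - p)) / math.log(p * (1 - q) / (q * (1 - p)))
    return z**2 * p * (1 - p) / (a - p) ** 2

n1 = n_min_normal_approx(0.3, 0.7)  # ~14.21
n2 = n_min_normal_approx(0.4, 0.5)  # ~263.3
```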

From here, where the question was originally asked, there is an answer that finds the exact values for the two \(n_{\textrm{min}}\) using R with the actual binomial distributions (not using normal distributions as approximations):

Due to the discreteness of the distributions, the \(n_{\textrm{min}}\)’s found are slightly different: \(n_{\textrm{min}} = 17\) for the first case and \(n_{\textrm{min}} = 268\) for the second case; the difference comes from using the normal distribution as an approximation for the binomial distribution.
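The exact search can also be sketched in Python with the true binomial tail sums. The rejection convention here – reject the \(p\) coin when heads \(\geq c\), reject the \(q\) coin when heads \(\leq c-1\) – is my assumption and may differ in detail from the R answer:

```python
from itertools import count
from math import comb

def upper_tail(n, p, c):
    """P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(c, n + 1))

def lower_tail(n, q, c):
    """P(X <= c) for X ~ Binomial(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(0, c + 1))

def exact_n_min(p, q, alpha=0.05):
    """Smallest n for which some cutoff c gives both tails <= alpha.

    Convention (an assumption): reject the p coin when heads >= c,
    reject the q coin when heads <= c - 1.
    """
    for n in count(1):
        for c in range(1, n + 2):
            if upper_tail(n, p, c) <= alpha and lower_tail(n, q, c - 1) <= alpha:
                return n
```

Under this convention, `exact_n_min(0.3, 0.7)` returns 17, matching the exact answer quoted above.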

Test Coin

Suppose there are two coins and the percentage that each coin flips a Head is \(p\) and \(q\), respectively. \(p, q \in [0,1] \) and the values are given and known. If you are free to flip one of the coins, how many times \(n\) do you have to flip the coin to decide with some significance level \( \left( \textrm{say } \alpha = 0.05 \right) \) that it’s the \(p\) coin or the \(q\) coin that you’ve been flipping?

The distribution of heads after \(n\) flips for a coin will be a binomial distribution with means at \(pn\) and \(qn\).


Two binomial distributions, n = 20. The means are pn = 10 and qn = 14.

The Usual Hypothesis Test

In the usual hypothesis test, for example with data \(x_i, i=1, 2, 3, …, n\) from a random variable \(X\), to test whether the mean \( \mu \) is \(\leq\) some constant \(\mu_0\):

$$ \begin{array}{rl} H_0 & : \mu \leq \mu_0 \ \left( \textrm{ and } X \sim N(\mu_0, \textrm{ some } \sigma^2 ) \right) \\ H_1 & : \mu > \mu_0 \end{array} $$

If the sample mean of the data points \( \overline{x} \) is “too large compared to” \( \mu_0 \), then we reject the null hypothesis \( H_0 \).

If we have the probability distribution of the random variable (even if we don’t know the true value of the mean \( \mu \)), we may be able to know something about the probability distribution of a statistic obtained from manipulating the sample data, e.g. the sample mean.  This, the probability distribution of a statistic (obtained from manipulating sample data), is called the sampling distribution.  And a property of the sampling distribution, the standard deviation of a statistic, is the standard error.  For example, the standard error of the mean is:

Sample Data: \(x\) \(\qquad\) Sample Mean: \( \overline{x} \)

Variance: \( Var(x) \) \(\qquad\) Standard Deviation: \( StDev(x) = \sigma(x) = \sqrt{Var(x)} \)

Variance of the Sample Mean: \( Var( \overline{x} ) = Var \left( \frac{1}{n}  \sum_{i=1}^{n}{ x_i } \right) = \frac{1}{n^2} \sum_{i=1}^{n} { Var(x_i) } = \frac{1}{n^2} n Var(x) = \frac{1}{n} Var(x) = {\sigma^2 \over n} \)

Standard Deviation of the Sample Mean, Standard Error of the Mean: \(  \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \)

Thus, if the random variable is \(i.i.d.\) (independent and identically distributed), then with the sample mean \( \overline{x} \) we obtain from the data, we can assume this \( \overline{x} \) has a standard deviation of \( \frac{\sigma}{\sqrt{n}} \).  This standard deviation, being smaller than the standard deviation \(\sigma\) of the original \(X\), means that \(\overline{X}\) is narrower around the mean than \(X\). This means \(\overline{X}\) gives us a better ability than \(X\) to home in on what the data says about \( \mu \), i.e. a narrower, more precise “range of certainty” from the sample data, at the same significance level.

Thus, given our sample \(x_i, i = 1, \dots, n \), we can calculate the statistic \( \overline{x} = \frac{1}{n} \sum_{i=1}^{n} {x_i} \), our sample mean.  From the data (or given information), we would like to calculate the standard error of the mean, the standard deviation of this sample mean as a random variable (where the sample mean is a statistic, i.e. can be treated as a random variable): \(  \frac{1}{\sqrt{n}} StDev(x) = {\sigma \over \sqrt{n}} \). This standard error of the mean gives us a “range of certainty” around the \(\overline{x}\) with which to make an inference.
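A simulation sketch of the standard error of the mean (the sample size, number of repetitions, and normal draws are arbitrary choices for illustration):

```python
import math
import random
import statistics

# Simulation check: the standard deviation of the sample mean should
# shrink to sigma / sqrt(n).
random.seed(0)
n = 25         # sample size per experiment
reps = 20000   # number of repeated experiments
sigma = 1.0    # standard deviation of each draw

sample_means = [
    statistics.fmean(random.gauss(0.0, sigma) for _ in range(n))
    for _ in range(reps)
]

observed_se = statistics.stdev(sample_means)
theoretical_se = sigma / math.sqrt(n)  # 1/sqrt(25) = 0.2
```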

A. If we know/are given the true standard deviation \( \sigma \)

If we are given the true standard deviation \( \sigma \) of the random variable \( X \), then we can calculate the standard error of the sample mean: \(  \frac{\sigma}{\sqrt{n}} \).  So under the null hypothesis \( H_0: \mu \leq \mu_0 \), we want to check if the null hypothesis can hold against a test using the sample data.

A.a Digression about \(H_0: \mu \leq \mu_0\) and \(H_0: \mu = \mu_0\)

If the \(\mu\) we infer from the sample data is “too extreme,” in this case “too large” compared to \(\mu_0\), i.e. the test statistic is greater than some critical value that depends on \(\mu_0\), i.e. \(c(\mu_0)\), we reject the null hypothesis. If we check a \(\mu_1\) that is \(\mu_1 < \mu_0\) (since our null hypothesis is \( H_0: \mu \leq \mu_0 \)), our critical value \(c(\mu_1)\) will be less extreme than \(c(\mu_0)\) (in other words \( c(\mu_1) < c(\mu_0) \)), and thus it would be “easier to reject” the null hypothesis if using \( c(\mu_1) \). Rejecting a hypothesis test ought to be conservative since rejecting a null hypothesis is reaching a conclusion, so we would like the test to be “the hardest to reject” that we can (a conclusion, i.e. a rejection here, should be as conservative as possible). The “hardest to reject” part of the range of \(H_0: \mu \leq \mu_0\) would be \( \mu = \mu_0 \), where the critical value \( c(\mu_0) \) would be the largest possible critical value. Testing a \(\mu_1 < \mu_0\) would mean that we may obtain a test statistic that is too extreme/large for \(\mu_1\) (i.e. \( t > c(\mu_1) \)) but not too extreme/large for \(\mu_0\) (i.e. \( t \not> c(\mu_0) \)). But if we test using \(\mu_0\) and the test statistic is extreme/large enough that we reject the null hypothesis of \(\mu = \mu_0\), that would also reject all other null hypotheses using \(\mu_1\) where \(\mu_1 < \mu_0\).

So under the null hypothesis \( H_0: \mu \leq \mu_0 \), or the “effective” null hypothesis \( H_0: \mu = \mu_0 \), we have that \( X \sim N(\mu_0, \sigma^2) \) with \( \sigma \) known, and thus \( \overline{X} \sim N(\mu_0, \sigma^2/n) \).  This means that

\( \frac{\overline{X} - \mu_0}{\sigma / \sqrt{n}} \sim N(0, 1) \)

Then we can use a standard normal table to find where on the standard normal the \( \alpha = 0.05 \) cutoff lies – for a one-tailed test, the cutoff is at \( Z_{\alpha} = 1.645 \) where \( Z \sim N(0, 1) \).  So if

\( \frac{\overline{X} - \mu_0}{\sigma / \sqrt{n}} > 1.645 = Z_{\alpha} \),

then this result is “too large compared to \( \mu_0 \)” and we reject the null hypothesis \( H_0: \mu \leq \mu_0 \).  If \( \frac{\overline{X} - \mu_0}{\sigma / \sqrt{n}} \leq 1.645 = Z_{\alpha} \), then we fail to reject the null hypothesis \( H_0: \mu \leq \mu_0 \).
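As a concrete sketch of this procedure (the sample numbers and the function name are made up for illustration):

```python
import math

def one_sided_z_test(xbar, mu0, sigma, n, z_alpha=1.645):
    """One-sided z-test of H0: mu <= mu0 when sigma is known.

    Returns the z statistic and whether to reject H0 at the
    significance level corresponding to z_alpha (1.645 for alpha = 0.05).
    """
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return z, z > z_alpha

# Hypothetical sample: mean 10.5 from n = 25 draws, sigma = 2, testing mu0 = 10
z, reject = one_sided_z_test(xbar=10.5, mu0=10.0, sigma=2.0, n=25)
# z = 0.5 / (2/5) = 1.25, which is not > 1.645, so we fail to reject H0
```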

B. If we don’t know the standard deviation \( \sigma \)

If we don’t know the value of the standard deviation \( \sigma \) of our random variable \( X \sim N( \mu, \sigma^2 ) \) (which is to be expected if we already don’t know the value of the mean \(\mu\) of \( X \)), then we need to estimate \( \sigma \) from our data \( x_i, i = 1, 2, \dots, n \).  We can estimate \( \sigma \) with the sample standard deviation \( s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \), i.e. by computing the sample variance \( s^2 = { \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \) and taking its square root.
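As a quick sanity check (the data values are arbitrary), the \(n-1\) formula matches the Python standard library’s sample variance and sample standard deviation:

```python
import math
from statistics import variance, stdev

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
xbar = sum(data) / n

# Sample variance with the n-1 (Bessel-corrected) denominator
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)
s = math.sqrt(s2)

assert math.isclose(s2, variance(data))  # statistics.variance also divides by n-1
assert math.isclose(s, stdev(data))
```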

However, note that while the estimator for the sample variance is unbiased:

$$ \begin{aligned}
\mathbb{E}\left[s^2\right] &= \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] \\
&= \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] = \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu + \mu - \overline{x})^2 } \right] \\
&= \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left( (x_i - \mu) - (\overline{x} - \mu) \right)^2 } \right] \\
&= \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { \left( (x_i - \mu)^2 - 2 (x_i - \mu) (\overline{x} - \mu) + (\overline{x} - \mu)^2 \right) } \right] \\
&= \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) \sum_{i=1}^{n} { (x_i - \mu) } + \sum_{i=1}^{n} { (\overline{x} - \mu)^2 } \right] \\
&= \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 (\overline{x} - \mu) (n \overline{x} - n \mu) + n (\overline{x} - \mu)^2 \right] \\
&= \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - 2 n (\overline{x} - \mu)^2 + n (\overline{x} - \mu)^2 \right] \\
&= \frac{1}{n-1} \mathbb{E} \left[ \sum_{i=1}^{n} { (x_i - \mu)^2 } - n (\overline{x} - \mu)^2 \right] \\
&= \frac{1}{n-1} \left( \sum_{i=1}^{n} { \mathbb{E} \left[ (x_i - \mu)^2 \right] } - n \, \mathbb{E} \left[ (\overline{x} - \mu)^2 \right] \right) \\
&= \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - 2 \mu \mathbb{E} [x_i] + \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - 2 \mu \mathbb{E} [\overline{x}] + \mu^2 \right) \right) \\
&= \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \mu^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \mu^2 \right) \right) \\
&= \frac{1}{n-1} \left( \sum_{i=1}^{n} { \left( \mathbb{E} \left[ x_i^2 \right] - \left( \mathbb{E} [x_i] \right)^2 \right) } - n \left( \mathbb{E} \left[ \overline{x}^2 \right] - \left( \mathbb{E} [\overline{x}] \right)^2 \right) \right) \\
&= \frac{1}{n-1} \left( \sum_{i=1}^{n} { Var(x_i) } - n \, Var(\overline{X}) \right) = \frac{1}{n-1} \left( \sum_{i=1}^{n} { \sigma^2 } - n \frac{\sigma^2}{n} \right) \\
&= \frac{1}{n-1} \left( n \sigma^2 - \sigma^2 \right) = \sigma^2
\end{aligned} $$

that does not allow us to say that the square root of the above estimator gives us an unbiased estimator for the standard deviation \( \sigma \). In other words:

\( \mathbb{E}\left[s^2\right] = \mathbb{E}\left[ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \right] = \sigma^2 \)

but

\( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \neq \sigma \)

because the expectation operator and the square root function do not commute:

\( \sigma = \sqrt{\sigma^2} = \sqrt{ \mathbb{E}[s^2] } \neq \mathbb{E}[\sqrt{s^2}] = \mathbb{E}[s] \)

B.a The sample standard deviation \( s = \sqrt{s^2} = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \) is a biased estimator of \( \sigma \)

In fact, we can infer the bias of \( \mathbb{E} [s] \) to some extent. The square root function \( f(x) = \sqrt{x} \) is a concave function. A concave function \( f \) is:

$$ \forall x_1, x_2 \in X, \forall t \in [0, 1]: \quad f(tx_1 + (1 - t) x_2 ) \geq tf(x_1) + (1 - t) f(x_2) $$

The left-hand side of the inequality is the blue portion of the curve, \( \{ f( \textrm{mixture of } x_1 \textrm{ and } x_2 ) \} \), and the right-hand side of the inequality is the red line segment, \( \{ \textrm{mixture of } f(x_1) \textrm{ and } f(x_2) \} \). We can see visually what it means for a function to be concave: between two arbitrary \(x\)-values \(x_1\) and \(x_2\), the blue portion always lies at or above the red segment.

Jensen’s Inequality says that if \( g(x) \) is a convex function, then:

$$ g( \mathbb{E}[X] ) \leq \mathbb{E}\left[ g(X) \right] $$

and if \( f(x) \) is a concave function, then:

$$ f( \mathbb{E}[X] ) \geq \mathbb{E}\left[ f(X) \right] $$

The figure above showing the concave function \(f(x) = \sqrt{x}\) gives an intuitive illustration of Jensen’s Inequality as well (since Jensen’s Inequality can be said to be a generalization of the “mixture” of \(x_1\) and \(x_2\) property of convex and concave functions to the expectation operator). The left-hand side \( f(\mathbb{E}[X]) \) is like \( f( \textrm{a mixture of } X \textrm{ values} ) \) and the right-hand side \( \mathbb{E}\left[ f(X) \right] \) is like \( {\textrm{a mixture of } f(X) \textrm{ values} } \) where the “mixture” in both cases is the “long-term mixture” of \( X \) values that is determined by the probability distribution of \( X \).

Since \( f(z) = \sqrt{z} \) is a concave function, going back to our estimation of the standard deviation of \( X \) using \(\sqrt{s^2}\), we have (taking \( Z = s^2 \)):

$$ \begin{aligned}
f( \mathbb{E}[Z] ) &\geq \mathbb{E}\left[ f(Z) \right] \longrightarrow \\
\sqrt{\mathbb{E}[Z]} &\geq \mathbb{E}\left[ \sqrt{Z} \right] \longrightarrow \\
\sqrt{ \mathbb{E}[s^2] } &\geq \mathbb{E}\left[ \sqrt{s^2} \right] \longrightarrow \\
\sqrt{ Var(X) } &\geq \mathbb{E}\left[s\right] \\
\textrm{StDev} (X) = \sigma &\geq \mathbb{E}\left[s\right]
\end{aligned} $$

Thus, \( \mathbb{E} [s] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \leq \sigma \). So \( s \) is biased and on average underestimates the true \(\sigma\).
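We can observe this underestimation empirically with a small simulation (the sample size, trial count, and seed are arbitrary choices):

```python
import random
import statistics

random.seed(0)
sigma = 2.0          # true standard deviation
n = 5                # small samples make the bias easy to see
trials = 100_000

# Average the sample standard deviation s over many samples of size n
s_values = [
    statistics.stdev([random.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(trials)
]
mean_s = statistics.fmean(s_values)

# mean_s comes out near c_4(5) * sigma = 0.94 * 2.0 = 1.88, below sigma = 2.0
assert mean_s < sigma
```

The factor \(0.94\) matches the \(c_4(5)\) value discussed below.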

However, the exact bias \( \textrm{Bias}(s) = \mathbb{E} [s] - \sigma \) is not as clean to derive.

\( \frac{(n-1)s^2}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } \) follows a \( \chi^2 \) distribution with \( n-1 \) degrees of freedom. In addition, \( \sqrt{ \frac{(n-1)s^2}{\sigma^2} } = \frac{\sqrt{n-1}\,s}{\sigma} = \frac{1}{\sigma} \sqrt{ \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \) follows a \( \chi \) distribution with \( n-1 \) degrees of freedom. A \( \chi \) distribution with \(k\) degrees of freedom has mean \( \mu_{\chi} = \sqrt{2}\, \frac{\Gamma \left( \frac{k+1}{2} \right) } { \Gamma \left( \frac{k}{2} \right)} \), where \( \Gamma(z) \) is the Gamma function, so \( \mathbb{E} \left[ \frac{\sqrt{n-1}\,s}{\sigma} \right] = \mu_{\chi} \).

If \(n\) is a positive integer, then \( \Gamma(n) = (n - 1)! \). For complex \(z\) with positive real part, \( \Gamma(z) = \int_{0}^{\infty}{x^{z-1} e^{-x} \, dx} \), and the function extends to all complex numbers except the non-positive integers, at which \( \Gamma(z) \) blows up to \(\infty\) or \(-\infty\).

From the mean of a \( \chi \) distribution above, we have:

\( \mathbb{E}[s] = {1 \over \sqrt{n - 1} } \cdot \mu_{\chi} \cdot \sigma \)

and replacing \(k\) with \(n-1\) degrees of freedom for the value of \(\mu_{\chi}\), we have:

\( \mathbb{E}[s] = \sqrt{ {2 \over n - 1} } \cdot { \Gamma(n/2) \over \Gamma((n-1)/2) } \cdot \sigma \)

Wikipedia tells us that:

\( \sqrt{ {2 \over n - 1} } \cdot { \Gamma(n/2) \over \Gamma((n-1)/2) } = c_4(n) = 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} - O(n^{-4}) \)

So we have:

\( \textrm{Bias} (s) = \mathbb{E}[s] - \sigma = c_4(n) \cdot \sigma - \sigma = ( c_4(n) - 1) \cdot \sigma \)

\( = \left( \left( 1 - {1 \over 4n} - {7 \over 32n^2} - {19 \over 128n^3} - O(n^{-4}) \right) - 1 \right) \cdot \sigma = - \left( {1 \over 4n} + {7 \over 32n^2} + {19 \over 128n^3} + O(n^{-4}) \right) \cdot \sigma \)

Thus, as \(n\) becomes large, the magnitude of the bias becomes small.
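A small sketch (the function names are ad hoc, and the standard library’s log-gamma is used to avoid overflow) comparing the exact \(c_4(n)\) with its series approximation:

```python
import math

def c4_exact(n):
    # c4(n) = sqrt(2/(n-1)) * Gamma(n/2) / Gamma((n-1)/2), computed via lgamma
    return math.sqrt(2.0 / (n - 1)) * math.exp(
        math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    )

def c4_series(n):
    # 1 - 1/(4n) - 7/(32 n^2) - 19/(128 n^3), ignoring the O(n^-4) tail
    return 1 - 1 / (4 * n) - 7 / (32 * n**2) - 19 / (128 * n**3)

# The bias factor approaches 1 as n grows:
# c4_exact(2) ≈ 0.7979, c4_exact(10) ≈ 0.9727, c4_exact(100) ≈ 0.9975
```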

From Wikipedia, here are values of \(c_4(n)\) for several \(n\):

$$ \begin{array}{c|c|c}
n & c_4(n) & \textrm{Numerical value of } c_4(n) \\
\hline
2 & \sqrt{2 / \pi} & 0.798\ldots \\
3 & \sqrt{\pi} / 2 & 0.886\ldots \\
5 & \tfrac{3}{4}\sqrt{\pi / 2} & 0.940\ldots \\
10 & \tfrac{128}{105}\sqrt{2 / \pi} & 0.973\ldots \\
100 & \textrm{–} & 0.997\ldots
\end{array} $$

Thus, for the most part, we don’t have to worry too much about this bias, especially with large \(n\). So we have
\( \mathbb{E}[\hat{\sigma}] = \mathbb{E}[s] = \mathbb{E}[\sqrt{s^2}] = \mathbb{E} \left[ \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \right] \approx \sigma \)

More rigorously, our estimator \( \hat{\sigma} = s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^{n} { (x_i - \overline{x})^2 } } \) is a consistent estimator of \( \sigma \) (even though it is a biased estimator of \( \sigma \)).

An estimator is consistent if \( \forall \epsilon > 0 \):

$$ \lim\limits_{n \to \infty} \textrm{Pr } (|\hat{\theta} – \theta| > \epsilon ) = 0 $$

In other words, as \( n \to \infty \), the probability that our estimator \( \hat{\theta} \) “misses” the true value of the parameter \(\theta\) by greater than some arbitrary positive amount (no matter how small) goes to \(0\).

For the sample standard deviation \(s\) as our estimator of the true standard deviation \(\sigma\) (i.e. let \(\hat{\sigma} = s\)), we have \( \mathbb{E}[\hat{\sigma}] = c_4(n) \, \sigma \to \sigma \) as \( n \to \infty \) (the bias vanishes), and the variance \( Var(\hat{\sigma}) = \mathbb{E}[s^2] - (\mathbb{E}[s])^2 = \sigma^2 \left( 1 - c_4(n)^2 \right) \) also goes to \(0\). So by Chebyshev’s inequality, for any \( \epsilon > 0 \):

\( \lim_{n \to \infty} \textrm{Pr } (|\hat{\sigma} - \sigma| > \epsilon) = 0 \)

Since \(s\) is a consistent estimator of \(\sigma\), we are fine to use \(s\) to estimate \(\sigma\) as long as we have large \(n\).

So back to the matter at hand: we want to know the sampling distribution of \(\overline{X} \) to see “what we can say” about \(\overline{X}\), specifically, the standard deviation of \(\overline{X}\), i.e. the standard error of the mean of \(X\). Not knowing the true standard deviation \(\sigma\) of \(X\), we use a consistent estimator of \(\sigma\) to estimate it: \(s = \sqrt{{1 \over n-1} \sum_{i=1}^n {(x_i - \overline{x})^2}}\).

So instead of the case where we know the value of \(\sigma\),
\(\overline{X} \sim N(\mu, \sigma^2/n)\)
we have, instead, something like:
\(\overline{X} \quad “\sim” \quad N(\mu, s^2/n)\)

When we know the value of \(\sigma\), we have
\({ \overline{X} - \mu \over \sigma/\sqrt{n} } \sim N(0,1) \)
When we don’t know the value of \(\sigma\) and use the estimate \(s\), instead of having something like
\({ \overline{X} - \mu \over s/\sqrt{n} } \quad “\sim” \quad N(0,1) \)
we actually have the exact distribution:
\({ \overline{X} - \mu \over s/\sqrt{n} } \sim T_{n-1} \)
the student’s t-distribution with \(n-1\) degrees of freedom.

Thus, finally, when we don’t know the true standard deviation \(\sigma\), under the null hypothesis \( H_0: \mu \leq \mu_0 \) we can use the expression above to create a test statistic
\( t = { \overline{x} - \mu_0 \over s/\sqrt{n} } \sim T_{n-1} \)
and check it against the student’s t-distribution with \(n-1\) degrees of freedom, \(T_{n-1}\), at some critical value with some significance level, say \(\alpha = 0.05\).

So if the test statistic exceeds the critical value \(T_{n-1, \alpha}\) at significance level \(\alpha = 0.05\):

\( t = { \overline{x} - \mu_0 \over s/\sqrt{n} } > T_{n-1, \alpha} \)

then we reject our null hypothesis \( H_0: \mu \leq \mu_0 \) at \(\alpha = 0.05\) significance level. If not, then we fail to reject our null hypothesis.
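Putting part B together as a sketch (the sample values are invented, and 1.895 is the one-tailed \(t\)-table critical value for \(\alpha = 0.05\) with 7 degrees of freedom):

```python
import math
from statistics import mean, stdev

# Hypothetical sample; we test H0: mu <= 10
sample = [10.8, 9.9, 11.2, 10.4, 10.9, 9.7, 11.5, 10.6]
n = len(sample)
xbar = mean(sample)
s = stdev(sample)                       # sample standard deviation (n-1 denominator)

t = (xbar - 10.0) / (s / math.sqrt(n))  # test statistic, distributed as T_{n-1}

# One-tailed critical value for alpha = 0.05 and n-1 = 7 degrees of freedom,
# from a t table: T_{7, 0.05} ≈ 1.895.  Here t ≈ 2.88 > 1.895, so we reject H0.
reject = t > 1.895
```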








If under the null hypothesis \( H_0 \) we have a probability distribution, then the sample data gives us a sample standard deviation, and hence the standard error of our test statistic.


Back to our case with 2 coins.  Let’s say we want to test whether our coin is the \(p\) coin, where we arbitrarily call the smaller of the two probabilities \(p\), i.e. \(p \leq q\).  We know that coin flips give us a binomial distribution, and we know the standard error of the mean proportion of heads from \(n\) flips.  So a 0.05 significance level corresponds to some cutoff value \(c\) with \(c > p\).  But note that if \(c\) ends up very large relative to \(q\), e.g. it gets close to \(q\) or even exceeds \(q\), we are in a weird situation.


Alternatively, we can decide on some cutoff value \(c\) between \(p\) and \(q\).  If we move \(c\) around, the significance level and the power of the test, whether testing \(p\) or \(q\), change accordingly.



The Usual Colors of the Sky

Why is the sky blue? Why are sunsets red? The answers and explanations to these questions can be found fairly easily on the internet. But there are many subtle “Wait a minute…”-type questions that fall between the cracks, and their answers tend to be subtler and require more “difficult” searching around the internet to find.

So, why is the sky blue?

No, but first, what color is sunlight?

Sunlight spectrum

Visible light generally runs from 390 nm (violet) to 720 nm (red), with maximum eye sensitivity at 555 nm (green). Visible sunlight is a mix of these colors at the intensities we see in the figure above.



Barcodes and Modular Arithmetic


Here is an example of a UPC-A barcode, taken from wikipedia:

UPC-A barcode example

A UPC-A barcode has 12 digits.  The first digit indicates how the numbers are generally used – for example, a particular industry might use a certain first digit for certain kinds of items.  The twelfth and last digit is a check digit, constructed from the other digits in a particular way so that, later on, it can tell us whether the numbers contain an error.


The check digit is constructed as follows:

We have 11 digits, \(A, B, C, D, E, F, G, H, I, J, K\), in positions 1 through 11, and we let \(L\) be the twelfth digit, the check digit.  We sum the digits in the odd positions and multiply by 3, then add the sum of the digits in the even positions:

$$\text{Let}\ S = 3\cdot(A+C+E+G+I+K)+(B+D+F+H+J)$$

We take this modulo 10, i.e. the remainder of \(S\) when divided by 10.  If this is 0, that is our twelfth digit; if not, we subtract it from 10 and that is our twelfth digit:

$$L = \begin{cases} 0, & \text{if}\ S \equiv 0 \pmod{10} \\ 10 - (S \bmod 10), & \text{otherwise} \end{cases}$$

So the logic is that if all 12 digits are correct, they satisfy the check digit equation:

$$3\cdot(A+C+E+G+I+K)+(B+D+F+H+J+L) \equiv 0 \pmod{10}$$

If there is an error in the 12th digit, of course the check digit equation won’t be satisfied.  If there is an error in any one single digit among the first 11 digits, then the check digit equation will also not be satisfied.  Thus, the check digit equation will detect any single digit error.

To see that a single digit error among the first 11 digits will cause the check digit equation to not be satisfied, first note that if a digit in one of the even positions is off by some \(d\) (where \(1 \leq |d| \leq 9\)), then \(S\) changes by \(d\), which is not a multiple of 10, so we will have \(S \pmod{10} \not\equiv 0\).  But what about the digits in the odd positions, whose sum is multiplied by 3 – and why multiplied by 3?

Take a digit in one of the odd positions and call it \(O\).  If the digit is off from the correct value by \(d\), how will that manifest itself in \(S\) as well as \(S \pmod{10}\)?  The correct \(O\) gives a term \(3 \cdot O\) in \(S\), while an incorrect digit of \(O + d\) gives a term \(3 \cdot O + 3 \cdot d\), so \(S\) changes by \(3 \cdot d\).  Since \(|d|\) is between 1 and 9 and 3 shares no factor with 10, \(3 \cdot d\) is never a multiple of 10, so the change always shows up in \(S \pmod{10}\) and the error is detected.
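The construction and check above can be written out directly (the 11-digit payload below is an arbitrary illustration):

```python
def upc_check_digit(digits11):
    """Compute the UPC-A check digit from the first 11 digits."""
    odd_sum = sum(digits11[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits11[1::2])  # positions 2, 4, ..., 10
    s = 3 * odd_sum + even_sum
    return (10 - s % 10) % 10       # 0 if s % 10 == 0, else 10 - (s % 10)

def upc_is_valid(digits12):
    """Check the full 12-digit code against the check digit equation."""
    return upc_check_digit(digits12[:11]) == digits12[11]

# Hypothetical 11-digit payload plus its computed check digit
payload = [0, 3, 6, 0, 0, 0, 2, 9, 1, 4, 5]
code = payload + [upc_check_digit(payload)]
assert upc_is_valid(code)

# Any single-digit error breaks the check digit equation
corrupted = code.copy()
corrupted[4] = (corrupted[4] + 3) % 10
assert not upc_is_valid(corrupted)
```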

Portfolio Insurance and Black Monday, October 19, 1987

On the thirtieth anniversary of Black Monday, the stock market crash of October 19th and 20th in 1987, there have been mentions of “portfolio insurance” having possibly exacerbated the crash.


Portfolio insurance, in principle, is exactly what you might expect it to be: if you own a stock, Stock A, you insure it with a put option on Stock A.  Your position becomes equivalent to a call option on Stock A until the put option expires, with the price of this position being the premium of the put option when you bought it.

If you are managing a portfolio on behalf of clients, though, and you just need to insure the portfolio up to a certain date, after which, say, you hand over the portfolio, then to buy American put options to insure the portfolio would be unnecessary.  European put options would suffice.  So let’s suppose that we are only interested in European options.

In the article that I cite at the bottom (Abken, 1987), it seems that at the time, buying put options as insurance had a few issues.  This is assuming that the portfolio we want to insure is a stock index: the S&P 500 index.  The issues were:

  • Exchange-traded index options only had maturities of up to nine months
  • Exchange-traded index options had a limited number of strike prices
  • It’s implied that only American options were available (which we would expect to carry a premium over European options).

Thus, instead of using put options to insure the portfolio, the portfolio and put options are replicated by holding some of the money in the portfolio and some in bonds – Treasury bills, which we assume provide the risk-free rate.

Without worrying about the math, the Black-Scholes formula lets us write the value of our stock index \(S\) plus put option \(P\) as:

$$S + P = S \cdot N_1 + K \cdot DF \cdot N_2$$
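As a numerical sketch of this replication identity, assuming the usual Black-Scholes European put formula with \(N_1 = N(d_1)\), \(N_2 = N(-d_2)\), and \(DF = e^{-rT}\) (the parameter values below are arbitrary):

```python
import math
from statistics import NormalDist

N = NormalDist().cdf  # standard normal CDF

def black_scholes_put(S, K, r, sigma, T):
    """European put price under Black-Scholes; also returns d1 and d2."""
    d1 = (math.log(S / K) + (r + sigma**2 / 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    price = K * math.exp(-r * T) * N(-d2) - S * N(-d1)
    return price, d1, d2

S, K, r, sigma, T = 100.0, 95.0, 0.05, 0.2, 0.75
P, d1, d2 = black_scholes_put(S, K, r, sigma, T)
DF = math.exp(-r * T)

# Stock + put == S*N(d1) + K*DF*N(-d2): hold S*N(d1) in the index and
# K*DF*N(-d2) in Treasury bills to replicate the insured portfolio
lhs = S + P
rhs = S * N(d1) + K * DF * N(-d2)
assert math.isclose(lhs, rhs)
```

The replicating weights shift as \(S\) moves, which is why this strategy requires continual rebalancing between the index and bills.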






Abken, Peter A.  “An Introduction to Portfolio Insurance.”  Economic Review, November/December 1987: 2-25.

Link to article (archived).


Value-added Tax and Sales Tax

(This is mostly a summary of and heavily borrowed from, archived).
(The first three figures are taken from Wikipedia).


Comparing No Tax, Sales Tax, and VAT

Imagine three companies in a value chain that produces and then sells a widget to a consumer. The raw materials producer sells raw materials to the manufacturer for $1.00, earning a gross margin (revenue – Cost Of Goods Sold, COGS) of $1.00. The manufacturer sells its product, the widget, to the retailer for $1.20, earning a gross margin of $0.20. The retailer sells the widget to a non-business consumer (for the customer to use and consume) for $1.50, earning a gross margin of $0.30.

No tax example

Imagine we add a sales tax of 10%.  Sales tax applies only to the transaction with the final end-user, the consumer, i.e. the final transaction. So the consumer pays the retailer $1.50 + 10% sales tax = $1.65. The retailer remits the sales tax, $0.15, to the government.

Sales tax example

  • Only retailers remit collected sales tax to the government. Different regions, products, and types of consumers may have different sales taxes, so retailers are burdened with the maintenance of functions that process this for every different kind of sales tax.
  • There are cases (e.g. in the U.S., remote sales, i.e. cross-state or internet sales) where the retailer isn’t required to charge sales tax on its sales to consumers.  Instead, the consumer is responsible for remitting a use tax to the government on his or her remote purchases.
  • Only end-users pay the sales tax. Thus, someone who is an end-user has an incentive to masquerade as a business and purchase products for usage.
  • The government thus burdens businesses (namely non-retailers, in this example) with proving, via certifications, that they are businesses (and thus do not need to pay sales tax on the products they buy) and that they sell to other businesses (and thus do not need to charge and remit sales tax on the products they sell).

Now let’s take away the 10% sales tax and add a 10% VAT. The raw materials producer charges the manufacturer $1.00 + a $0.10 VAT (10% of the $1.00 value it added to the product) and remits the $0.10 VAT to the government. The manufacturer sells its product to the retailer for $1.20 + 10%, or $0.12: $0.10 of which is the VAT that the raw materials producer charged the manufacturer, now getting “paid back” by this transaction, and the remaining $0.02 of which is 10% of the value added to the product by the manufacturer ($1.20 – $1.00 = $0.20); the manufacturer remits the $0.02 to the government. The retailer sells the product to the customer for $1.50 + 10% of $1.50, or $0.15: $0.12 of which is VAT that it paid to the previous 2 companies in the value chain, now being “paid back” by this transaction, and the remaining $0.03 of which is 10% of the value that the retailer added to the product ($0.30); the retailer remits the $0.03 to the government.

VAT example

  • From the consumer’s viewpoint, nothing has changed. End-users still pay the same $1.65 for the widget.
  • The government earns the same $0.15 as it earned with sales tax.  But instead of receiving all of it from the final transaction between the retailer and the consumer, it earns it in bits of [10% * each value added by each company in the value chain], which are $0.03, $0.02, and $0.10.
  • From the perspective of each business, they’re charged VAT by companies that they purchase from and they charge VAT to companies/consumers that purchase from them.  When they’re charged VAT on purchases ($0.12 for the retailer in the example), they are effectively charged the VAT of all companies that are further down the value chain from them ($0.10 for the raw materials producer and $0.02 for the manufacturer).  When they charge VAT on their sales, they effectively charge VAT for the value they added ($0.03) plus the VAT of all companies that are further down them in the value chain ($0.12).  The difference, the VAT for the value they added, is remitted to the government ($0.15 – $0.12 = $0.03).  So when a company purchases and is charged VAT, they are effectively “in the red” for that amount of VAT ($0.12) until they can sell their product up the value chain and charge that amount of “downstream” VAT + the VAT they owe on the value they added ($0.12 + $0.03 = $0.15).  Then with that sale, they get “refunded” the portion of VAT that they paid before ($0.12) and remit the remainder ($0.03) to the government.
  • All buyers, whether they’re a consumer or a business, pay the VAT. So there is no incentive for anyone to masquerade as anyone else (e.g. a consumer to masquerade as a business).
  • All businesses process VAT (charged VAT on products they buy, charge VAT on products they sell, and pay VAT to the government) so all businesses are burdened with the maintenance of functions that process this.
  • Because businesses are charged VAT when they buy products and remain “in the red” that amount of VAT until they can sell those products, they are incentivized to make sure they charge VAT on the products they sell in order to make up that VAT they were already charged (e.g. the manufacturer paid the raw materials producer $0.10 of VAT so it’s incentivized to charge VAT on the products it sells to the retailer to make sure to make up for that $0.10). Because everyone is incentivized to charge VAT on their buyers, “everyone collects the tax for the government.”
  • This symmetry where everyone in the VAT system charges and is charged VAT doesn’t exist when it comes to cross-border trade, which is discussed below.
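The remittance arithmetic of the widget example can be tabulated in a few lines (a sketch; the price chain is taken from the example above):

```python
# Each stage: (seller, sale price net of VAT); each cost basis is the prior sale
chain = [("raw materials producer", 1.00),
         ("manufacturer", 1.20),
         ("retailer", 1.50)]
VAT_RATE = 0.10

cost = 0.0
remitted = []
for seller, price in chain:
    vat_charged = VAT_RATE * price  # VAT collected on the sale
    vat_paid = VAT_RATE * cost      # VAT already paid on purchases
    remitted.append((seller, round(vat_charged - vat_paid, 2)))
    cost = price

# remitted == [("raw materials producer", 0.10), ("manufacturer", 0.02),
#              ("retailer", 0.03)]; the total is the same $0.15 as the sales tax
assert round(sum(v for _, v in remitted), 2) == 0.15
```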


Sales Tax versus VAT

  • Most countries (166 out of 193 countries) in the world use VAT. The US uses sales tax and is the only one to do so in the OECD.
  • Asymmetry creates perverse incentives: In sales tax, only retailers charge sales tax and remit sales tax to the government. Consumers want to masquerade as businesses, retailers are not especially incentivized to make sure that their buyers are charged sales tax, and if there are ways to get around sales tax (remote sales, which include cross-state sales and online sales; wholesaling to consumers), retailers and consumers might want to do that. In the US, retailers don’t need to charge sales tax to consumers buying in a state in which the retailer doesn’t have a physical presence.  In this case, there is a use tax charged on the consumer to make up for this, but compliance with the use tax is low.  (Source, archived.)  Estimates of sales tax lost due to remote sales in 2012 varied up to a high of USD $23 billion, where total retail sales that year (excluding food sales, because many states don’t charge sales tax on food) were around USD $350 billion a month, i.e. around USD $4.2 trillion that year.  (Archived, archived, archived.)
  • In VAT, all businesses process VAT the same way, so there are no such perverse incentives.
    • The big exception to this is cross-border trade, i.e. imports and exports.  Governments have a choice of whether to charge VAT on goods it exports and goods it imports and whether to charge differently for every country it trades with.  This creates a huge potential for asymmetries in the VAT system.
  • Imports and Exports:
    • Sales Tax countries charge sales tax on imported goods if and when they reach the end-user.  If the imported good is exported again, then it hasn’t reached an end-user, and thus is never sales taxed.
    • Both sales tax and VAT are consumption taxes – the purpose is to tax consumption.  This is why the sales tax doesn’t tax a good that is imported and then exported without being consumed in the country.  VAT accomplishes this as well, but also would ideally keep the cross-border trade situation simple and sensible when dealing with other VAT countries or sales tax countries.
    • In order to have as symmetric and fair a system of cross-border trade, VAT countries generally:
      • Do not charge VAT on goods that are exported.  When a good is exported, the government refunds the exporter the entire VAT that the exporter paid on its cost-of-goods-sold purchases;
      • Charge VAT on goods that were imported on that good’s first subsequent sale after importation, for the full sale value (not just the value added by the importer, which is [sale price – cost of goods sold], but the full [sale price]). I.e. after the importer imports the good, when that importer sells that good, the sales transaction is VAT-taxed for the full sale price.  This is assuming that the imported good is being sold to another domestic company and not being immediately exported.
      • Note that if an imported good is exported, the government does not receive any VAT.  This is the same as in sales tax and it accomplishes what a consumption tax is supposed to do (which in the case of a good that is imported and then exported without being consumed in the country is to not tax the good).  Each company in the value chain plays its usual part in the VAT system, but the last one, the exporter, is refunded by the government all the VAT it paid on its purchases of cost of goods sold.

Cross-border VAT

  • The reason VAT countries don’t tax their goods upon export is because sales tax countries don’t tax their goods upon export, so this keeps that part of the trade symmetrical.  This also prevents any case of a good being double-taxed during a cross-border trade.
  • The reason VAT countries tax their import goods (subsequent to the import transaction) is because if that good is going to be consumed in the country, not VAT-taxing it would mean the good would be untaxed during and after its cross-border transaction and thus have an advantage over similar competing domestic goods at this and every following point on the value chain since domestic goods have been VAT-taxed up to this point and will be VAT-taxed on all following points on the value chain.
  • The reason why an imported good’s subsequent sales transaction is VAT-taxed its full sales price instead of just the value added by the importer is because:
    • If the good is only taxed by its value-added amount instead, this still is not enough to offset the disadvantage of domestic goods (which have been VAT-taxed for all value that has been added to the product up to that point, not just the value added by the last company to sell it).
    • The government of the country in which the good is consumed ought, in principle, to capture the entire consumption tax on the good.  Thus, when the good went across the border and the exporting country’s government refunds the VAT to its exporter, the good is effectively “untaxed” at this point (the exporting country’s government has refunded all previous VAT on it and the importing country’s government has yet to tax any of the value that has so far been added by producers of the exporting country).  By VAT-taxing it by its full sales price after importation, the government of the importing country captures the VAT that the exporting country’s government refunded to its exporter or “resets” the VAT to where it ought to be at this point in the value chain for itself.
  • Missing trader fraud/Carousel fraud: A type of fraud that exists when a good is imported into and then exported out of a VAT country without the good being consumed in the country.  Since cross-border trade is a point of asymmetry in the VAT system, it makes sense that this is where fraud occurs.
    • Company A imports a good legitimately, paying the exporter EUR 100 for the good.  This transaction is VAT-free.
    • Company A sells the good to Company B for EUR 110.  This transaction is VAT-taxed its full sales price (since it is the transaction subsequent to the good being imported and the good is not being immediately exported), and Company A owes the government this VAT.  Note that the good in this transaction is “new” to the country and thus the government has not received any VAT from this good further down the value chain (that VAT was collected by the exporter’s government and refunded back to the exporter).  If the VAT is 20%, that 20% is charged on the full EUR 110 sale price of the transaction (not the value added by Company A, which is EUR 10).  The total price of the transaction with VAT comes to EUR 132, and the government expects to be remitted a VAT of EUR 22 from Company A for this sales transaction.
    • Company B exports the good.  This transaction is VAT-free.  Furthermore, Company B has paid Company A a VAT of EUR 22, so it is entitled to a refund of EUR 22 from the government as the good is being exported and not consumed in the country.
    • Company A disappears or goes bankrupt without paying the VAT (of EUR 22) on the sale of goods by Company A to Company B.  This is key to the fraud because Company A is supposed to pay VAT on the full sales price of its sale of the good (20% * EUR 110 = EUR 22), not just VAT on the value added (20% * EUR 10 = EUR 2).  In the diagram above that depicts cross-border trade, the retailer would owe the government $0.15 of taxes, not $0.03 as in the diagrams that are further above that don’t depict cross-border trade.  When Company A disappears, the government loses the VAT on all value that has been added to the good from this point all the way down the value chain.
    • With no fraud occurring, the government is supposed to earn 0 tax: charge VAT starting from the importer selling the good to domestic companies but then refund all that VAT to the exporter at the end who exports the good since the good is not to be consumed in the country.  But instead, in this case where Company A disappears, the government has lost EUR 22 by refunding Company B, the exporter, for the VAT that it paid on its purchase of the good.
    • In reality, if the importer (Company A) and the exporter (Company B) are working together, the good may never even physically leave the port, and is imported, sold, and exported only on paper.  If there are many companies in between Company A and Company B (e.g. it could instead be A -> X -> … -> Z -> B), it could be difficult for the government to prove any wrongdoing by Company B as the link between A and B will be weak (Company B may even be innocent in some cases where the only fraud is Company A disappearing without paying its VAT), and thus the government will be obligated to refund Company B the VAT that it paid on its purchases.
    • According to sources found in Wikipedia, it’s estimated that the UK lost around GBP 2 billion in 2002-2003 (archived) and between GBP 2 billion and GBP 8 billion annually for the years (archive) between 2004 and 2006 due to this kind of fraud.  Total UK retail sales (archived) for these years were around GBP 250 billion to GBP 300 billion.  For the EU, 2008 estimates were EUR 170 billion lost (archived) due to this type of fraud.  Total EU-27 retail turnover in 2010 was around EUR 2.3 trillion (archived).
    • The 2006 BBC article linked above says that, in order to combat the losses from this fraud, the government is implementing a new system where:

      Under the new rules the last company to sell on goods like mobile phones – such as a retailer – will be responsible for paying the VAT.

      So it sounds like the portion of VAT that the government is “missing” from the imported good will be paid by the retailer instead of the importer – in other words, it’s a bit like the sales tax system.  In this system, if a good is imported and then exported, the amount that the government refunds the exporter will be smaller than in the previous system, making the fraud much less damaging.  And if the retailer disappears without paying the taxes it owes, that’s the same as a retailer disappearing in a sales tax system without paying taxes, or a raw materials producer in a VAT system disappearing without paying taxes.  (These cases are a much simpler sort of tax evasion, unlike missing trader/carousel fraud, where the importer disappears without paying the “missing” chunk of VAT that the government is owed while the exporter is “refunded” that same chunk by the government, even though the government never received it from the missing trader.)
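The fraud mechanics walked through above can be sketched as a toy ledger.  This is a minimal sketch: the EUR figures come from the example, and the “ledger” framing is just for illustration.

```python
# A toy ledger for the missing trader / carousel fraud described above,
# using the example's numbers: EUR 100 import, EUR 110 resale, 20% VAT.
VAT_RATE = 0.20

# Company A imports the good for EUR 100 -- cross-border, so VAT-free.
# Company A then sells to Company B for EUR 110; VAT is charged on the
# full sales price because the good is "new" to this country's VAT chain.
sale_price = 110.0
vat_collected_by_a = VAT_RATE * sale_price  # EUR 22, paid by B to A

government_ledger = 0.0
# Fraud: Company A disappears without remitting the EUR 22 it collected,
# so the government's ledger never receives it.
# Company B exports the good and legitimately claims a refund of the
# EUR 22 of VAT it paid on its purchase.
government_ledger -= vat_collected_by_a

print(f"Government net position: EUR {government_ledger:.2f}")  # EUR -22.00
```

With no fraud, Company A’s EUR 22 remittance and Company B’s EUR 22 refund would cancel to zero, matching the “government is supposed to earn 0 tax” point above.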

Instead of taxing the importer the “missing” VAT, tax the retailer the “missing” VAT

In Country A, the government refunds the manufacturer/exporter the VAT that it paid on its purchases.  In Country B, the importer is charged VAT only on its value added, and that VAT is remitted to the government.  The retailer is charged VAT on its value added (10% * $0.20 = $0.02) and on the “missing” VAT from the value added to the product prior to importation, which is $1.20 (the price that the importer paid the manufacturer/exporter), so that comes to 10% * $1.20 = $0.12.

If the good is exported, refund the exporter as usual

If the good is imported into Country B and then exported without reaching a consumer, the amount of VAT that the importer is charged is only on its value added (10% * $0.10 = $0.01).  Thus, the amount of VAT that the government refunds the exporter is only that amount ($0.01).  If the importer disappears, the government only loses $0.01, which is 10% of the value added by the importer, not 10% of the full sales price of the product at this point.  Furthermore, if the importer and exporter work together and the importer sells the good to the exporter at a much higher price (raising the value added and the potential VAT that the government is supposed to refund the exporter), the exporter still needs to legitimately export the product in order to qualify for the refund.  It’s theoretically possible for the exporter to operate at a loss to make this possible, but this might raise an additional red flag for the government, increasing the risk of attempting the fraud.
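A rough comparison of the government’s exposure in the two systems, if the importer disappears.  The figures are taken from the surrounding examples (10% VAT; the importer buys at $1.20 and adds $0.10 of value).

```python
# Government's exposure when the importer disappears, under the standard
# VAT rules vs. the modified rules above (10% VAT; the importer buys at
# $1.20 and adds $0.10 of value, per the surrounding examples).
VAT = 0.10
import_price = 1.20
importer_value_added = 0.10
resale_price = import_price + importer_value_added  # $1.30

# Standard rules: the importer's sale is taxed on the full resale price,
# and that full amount is refunded to the exporter when the good leaves.
loss_standard = VAT * resale_price          # $0.13 at risk

# Modified rules: only the importer's value added is taxed, so only that
# much is refunded to the exporter.
loss_modified = VAT * importer_value_added  # $0.01 at risk

print(f"standard: ${loss_standard:.2f}, modified: ${loss_modified:.2f}")
```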


Economic Impact of Sales Tax versus VAT

Back to the diagrams for no tax, sales tax, and VAT:

No tax

Sales tax


Although only the consumer actually pays the tax in the end, prices are raised at the other transaction points too, and that adds friction that discourages those transactions by some amount.

One can also think of this as an overhead cost for the businesses involved.  In the no tax and sales tax situations, the manufacturer buys goods at $1.00 and sells them at $1.20.  In the VAT situation, the manufacturer buys goods at $1.10 and sells them at $1.32, which minus the VAT remitted to the government becomes $1.30.  In both cases, the manufacturer earns a profit of $0.20, but there is an overhead of $0.10 in the VAT case.  An extreme analogy is: if you are a company that makes a profit of $1 on each good you sell, would you rather buy goods for $2 and sell them for $3 to earn your $1 profit or buy goods for $1,002 and sell them for $1,003 to earn your $1 profit?  Surely the former is easier and has less friction.
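The manufacturer’s cash flows from the paragraph above, as a quick check (a sketch using the paragraph’s numbers; 10% VAT, buy at $1.00, sell at $1.20):

```python
# Manufacturer's per-unit cash flows under sales tax vs. VAT, using the
# numbers from the paragraph above (10% VAT, buy at $1.00, sell at $1.20).
VAT = 0.10

# Sales tax (or no tax): intermediate transactions are untaxed.
profit_sales_tax = 1.20 - 1.00              # $0.20

# VAT: pay input VAT, charge output VAT, remit the difference.
buy_with_vat = 1.00 * (1 + VAT)             # $1.10 paid to supplier
sell_with_vat = 1.20 * (1 + VAT)            # $1.32 charged to customer
remitted = 0.12 - 0.10                      # output VAT minus input VAT credit
profit_vat = sell_with_vat - buy_with_vat - remitted  # $0.20 -- same profit

overhead = buy_with_vat - 1.00              # but $0.10 more cash in motion
print(f"profit: ${profit_vat:.2f}, extra cash tied up: ${overhead:.2f}")
```

Both cases net the same $0.20 profit; the difference is the extra $0.10 of cash that has to move through each VAT transaction, which is the “overhead” in the analogy above.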

Back to the economic interpretation: if the demand and supply curve of a transaction point in a no tax situation is this:


then by adding a tax to the transaction, the price is increased.  For convenience, we add a second supply curve that is “supply + tax.”  The end result would be the same if we left the supply curve alone and added a “demand – tax” curve instead, but the supply + tax curve is more convenient: for each quantity, it takes the price at which producers would sell (the supply curve) and adds the tax on top, giving the price the purchaser actually faces.  Alternatively, instead of drawing a new curve, one can take a vertical segment equal to the tax per product and “fit it in between” the two original curves, starting from the left side of the diagram; the end result is the same.


(The diagram says “Consumer Surplus” but since this may represent a business-to-business transaction, it’d be clearer to just say “Purchaser Surplus.”)  So a tax will cause less quantity to be transacted, a higher post-tax price, some government tax revenue, lower purchaser and producer surpluses, and some deadweight loss.  Unless the government tax revenue is spent in a way that can overcome that deadweight loss (e.g. spending on things that have positive externalities), we have an inefficient outcome.
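To make the diagram concrete, here is a toy linear model.  All parameter values are made up for illustration and are not taken from the article’s diagrams: demand p = 10 − q, supply p = 2 + q, and a per-unit tax of 2.

```python
# Linear demand/supply with a per-unit tax inserted as "supply + tax".
# All numbers are illustrative.
a_d, b_d = 10.0, 1.0   # demand:  p = 10 - 1*q
a_s, b_s = 2.0, 1.0    # supply:  p = 2 + 1*q
t = 2.0                # per-unit tax

# No tax: 10 - q = 2 + q  ->  q = 4 (at p = 6)
q_no_tax = (a_d - a_s) / (b_d + b_s)

# With tax: 10 - q = (2 + q) + t  ->  q = 3
q_tax = (a_d - a_s - t) / (b_d + b_s)
p_buyer = a_d - b_d * q_tax    # price purchasers pay:  7
p_seller = p_buyer - t         # price producers keep:  5

revenue = t * q_tax                        # government revenue:    6
deadweight = 0.5 * t * (q_no_tax - q_tax)  # lost-surplus triangle: 1

print(q_tax, p_buyer, p_seller, revenue, deadweight)
```

This reproduces the diagram’s story: quantity transacted falls (4 → 3), the purchaser pays more and the producer keeps less, the government collects some revenue, and the deadweight triangle is surplus that no one captures.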

So in the VAT system, businesses are contending with higher prices (which is like more overhead) and lost quantity transacted compared to the sales tax system.  This is another cost that the VAT system pays (in addition to the missing trader/carousel fraud) in order to have a “symmetric” system where almost everyone in the value chain pays and collects VAT.