<h1>Theoretically rigorous introduction to EM, Mixture of Gaussians and K-means</h1>
<p><em>Kojin Oshiba (kojinoshiba@college.harvard.edu), 2017-12-25</em></p>
<p>This post ties together EM (Expectation Maximization), GMM (Gaussian Mixture Models), K means and variational inference. If you have taken an introductory machine learning course and have learned these algorithms, but the connections between them are not yet clear, this is the post for you. If two or more of the above terms are unfamiliar, I suggest you read the Wikipedia page for each before reading this post.</p>
<h2 id="tldr">TL;DR:</h2>
<ul>
<li>EM is a variational inference algorithm to optimize the lower bound of the log likelihood.</li>
<li>GMM is a specific example of EM where the base distribution is MVN (multivariate normal).</li>
<li>K means is GMM where variance and cluster assignment probabilities are fixed and cluster assignments are hard.</li>
</ul>
<p>This post is more theoretical than the other ones, but I assure you that you’ll have a very deep understanding of the above algorithms by the end of it.</p>
<h2 id="theory-of-variational-inference">Theory of Variational Inference</h2>
<p>Let me first introduce the EM algorithm in a rather theoretical way. I found this theoretical introduction more “intuitive” than other attempts; I hope you will, too. In short, the goal of EM is to increase the likelihood, and it does so by constructing and maximizing a lower bound on it. Let <script type="math/tex">X=\{X_1,X_2,...,X_N\}</script> denote the data we have at hand. Then, the likelihood of a model parameterized by <script type="math/tex">\theta</script> is</p>
<script type="math/tex; mode=display">l(\theta)=p(X\lvert \theta)</script>
<p>Now let me introduce latent variables for cluster membership <script type="math/tex">Z=\{Z_1,Z_2,...,Z_N\}</script>. Here, each <script type="math/tex">Z_i</script> is a <script type="math/tex">K</script>-dimensional one-hot vector indicating which of the <script type="math/tex">K</script> clusters data point <script type="math/tex">i</script> belongs to. Then, using the law of total probability,</p>
<script type="math/tex; mode=display">l(\theta)=p(X\lvert \theta)=\sum_Z p(X,Z\lvert \theta)</script>
<p>The learning objective, as usual, is to maximize the log likelihood:</p>
<script type="math/tex; mode=display">argmax_{\theta} \ log \: l(\theta)=argmax_{\theta} \ log \: \sum_Z p(X,Z\lvert \theta)</script>
<p>This would have been easier if the log likelihood were of the form <script type="math/tex">\sum_Z log(p(X,Z\lvert \theta))</script>: with the sum outside the log, we could take the derivative easily. Life is not so easy here, since the sum is inside the log. <strong>How can we move the sum outside the log?</strong> This question motivates us to introduce Jensen’s Inequality:</p>
<blockquote>
<p>For any concave function <script type="math/tex">\phi</script>, <script type="math/tex">\phi(EX) \geq E\phi(X)</script>
<cite><a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s Inequality</a></cite></p>
</blockquote>
<p>Using Jensen’s Inequality, we can take the sum outside:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& log \ \sum_Z p(X,Z\lvert \theta) \\
& =log \ \sum_Z \frac{q(Z)}{q(Z)} p(X,Z\lvert \theta) \\
& =log \ \sum_Z q(Z) \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& =log \ E_{Z\sim q} \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& \geq E_{Z\sim q} \ log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& = \sum_Z q(Z) log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& \equiv L(q,\theta)
\end{align} %]]></script>
<p>Jensen’s Inequality is used from line 4 to 5 to provide <script type="math/tex">L(q,\theta)</script> as the lower bound for <script type="math/tex">log \ \sum_Z p(X,Z\lvert \theta)</script>. You should be able to see that <script type="math/tex">L(q,\theta)</script> only has the sum outside the log.</p>
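<p>Jensen’s Inequality is easy to check numerically. The sketch below (assuming NumPy is available; the sample values are arbitrary) confirms that for the concave function <script type="math/tex">log</script>, the log of a mean dominates the mean of the logs:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=10_000)  # arbitrary positive samples

log_of_mean = np.log(np.mean(x))  # log(E[X])
mean_of_log = np.mean(np.log(x))  # E[log X]

# log is concave, so Jensen gives log(E[X]) >= E[log X]
assert log_of_mean >= mean_of_log
```

The gap between the two sides is exactly the slack that EM's lower bound leaves, and it closes only when the argument of the expectation is constant.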
<p>I abruptly introduced <script type="math/tex">q(Z)</script> without really explaining it. What is <script type="math/tex">q(Z)</script> here? We can think of <script type="math/tex">q(Z)</script> as a probability distribution we choose, which we wish to be as close as possible to the true posterior <script type="math/tex">p(Z\lvert X,\theta)</script>. To see why this is the case, consider the difference between the true model log likelihood and our lower bound approximation <script type="math/tex">L(q,\theta)</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& log \: l(\theta)-L(q,\theta) \\
&= log \: l(\theta)-\sum_Z q(Z) log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
&= \sum_Z q(Z) log \: l(\theta)-\sum_Z q(Z) log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
&= \sum_Z q(Z) log \: l(\theta)-\sum_Z q(Z) log \ \frac{p(Z\lvert X,\theta)p(X\lvert \theta)}{q(Z)} \\
&= \sum_Z q(Z) (log \: p(X\lvert \theta)- log \: p(Z\lvert X,\theta) - log \: p(X\lvert \theta) + log \: q(Z)) \\
&= \sum_Z q(Z) log \: \frac{q(Z)}{p(Z\lvert X,\theta)} \\
&= KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))
\end{align} %]]></script>
<p>Note that from line 2 to 3, we are using the fact that <script type="math/tex">\sum_Z q(Z)=1</script>. To summarize, we have so far:</p>
<script type="math/tex; mode=display">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script>
<p>This equation explains why we want <script type="math/tex">q(Z)</script> to be as close to <script type="math/tex">p(Z\lvert X,\theta)</script> as possible. It’s because we want to have <script type="math/tex">KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script> smaller so as to make <script type="math/tex">L(q,\theta)</script> a tighter lower bound for <script type="math/tex">log \: l(\theta)</script>. To sum up, we have:</p>
<ul>
<li>Original goal: <script type="math/tex">argmax_{\theta} \ log \: l(\theta)</script></li>
<li>New goal: <script type="math/tex">argmax_{q,\theta} \ L(q,\theta)</script></li>
</ul>
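<p>The decomposition above is an exact identity, and we can verify it numerically. The sketch below (assuming NumPy; all parameter values are made up) builds a toy two-component mixture with a single observation and checks that <script type="math/tex">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script> holds for an arbitrary <script type="math/tex">q</script>:</p>

```python
import numpy as np

def normal_pdf(x, mu, var=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# toy 2-component mixture, one observation (all values are arbitrary)
x = 1.3
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])

joint = pi * normal_pdf(x, mu)          # p(x, Z=k | theta)
log_lik = np.log(joint.sum())           # log l(theta)
posterior = joint / joint.sum()         # p(Z | x, theta)

q = np.array([0.7, 0.3])                # an arbitrary q(Z)
L = np.sum(q * np.log(joint / q))       # lower bound L(q, theta)
KL = np.sum(q * np.log(q / posterior))  # KL(q || p(Z|x,theta))

assert np.isclose(log_lik, L + KL)      # log l = L + KL, exactly
assert L <= log_lik                     # L is a lower bound
```

Replacing `q` with `posterior` drives `KL` to zero and makes `L` equal `log_lik`, which is exactly the E step below.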
<h2 id="from-variational-inference-to-em">From Variational Inference to EM</h2>
<p>Now that we introduced the new optimization problem we care about, let’s actually solve it! This will yield the formula for the EM algorithm in the most general way possible. Since <script type="math/tex">L(q,\theta)</script> has two parameters <script type="math/tex">q</script> and <script type="math/tex">\theta</script>, let’s optimize the function w.r.t. one parameter at a time.</p>
<h3 id="e-step-argmax_q--lqtheta">E Step: <script type="math/tex">argmax_{q} \ L(q,\theta)</script></h3>
<p>Let’s maximize <script type="math/tex">L(q,\theta)</script> w.r.t. <script type="math/tex">q</script>. Since we know that <script type="math/tex">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script> and also that <script type="math/tex">log \: l(\theta)</script> is fixed and doesn’t depend on <script type="math/tex">q</script>, this is equivalent to <script type="math/tex">argmin_{q} \ KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script>. Since we know that the minimum value of Kullback–Leibler divergence is <script type="math/tex">0</script>, we want to set <script type="math/tex">q(Z)</script> such that:</p>
<script type="math/tex; mode=display">KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))=0 \Leftrightarrow q(Z)=p(Z\lvert X,\theta)</script>
<p>Thus, we have the update formula for the E step:</p>
<script type="math/tex; mode=display">q(Z)=p(Z\lvert X,\theta)</script>
<p>Because we are minimizing the KL divergence but the likelihood itself doesn’t change, <strong>E step is equivalent to making the lower bound tighter</strong>.</p>
<h3 id="m-step-argmax_theta--lqtheta">M Step: <script type="math/tex">argmax_{\theta} \ L(q,\theta)</script></h3>
<p>It is not so hard to maximize <script type="math/tex">L(q,\theta)</script> w.r.t. <script type="math/tex">\theta</script>. But the intuition behind M step is trickier. This is because <script type="math/tex">log \: l(\theta)</script> depends on <script type="math/tex">\theta</script> and thus we first need to understand why <script type="math/tex">\theta</script> that maximizes <script type="math/tex">L(q,\theta)</script> also increases <script type="math/tex">log \: l(\theta)</script>. To see this, recall</p>
<script type="math/tex; mode=display">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script>
<p>In E step, we minimized the second term on the RHS (the KL divergence) to its smallest possible value, <script type="math/tex">0</script>. Hence, this term can only increase (or stay at zero) as we change <script type="math/tex">\theta</script> in M step. The first term on the RHS cannot decrease, because it is exactly what we maximize in M step. Hence, as a whole, <script type="math/tex">log \: l(\theta)</script> must also not decrease.</p>
<p>Now, since we are updating <script type="math/tex">\theta</script>, let’s rewrite the <script type="math/tex">q(Z)</script> obtained in the E step as <script type="math/tex">q(Z\lvert X,\theta^{old})</script>, to make explicit that it was computed with the old parameters. Using this notation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& L(q,\theta) \\
&= \sum_Z q(Z\lvert X,\theta^{old}) log \ \frac{p(X,Z\lvert \theta)}{q(Z\lvert X,\theta^{old})} \\
&= \sum_Z q(Z\lvert X,\theta^{old}) log \ p(X,Z\lvert \theta) - \sum_Z q(Z\lvert X,\theta^{old}) log \ q(Z\lvert X,\theta^{old}) \\
&= Q(\theta,\theta^{old}) + const
\end{align} %]]></script>
<p>where <script type="math/tex">Q(\theta,\theta^{old})\equiv \sum_Z q(Z\lvert X,\theta^{old}) log \ p(X,Z\lvert \theta)</script>.
Hence,</p>
<script type="math/tex; mode=display">argmax_{\theta} \ L(q,\theta)=argmax_{\theta} \ Q(\theta,\theta^{old})</script>
<p><script type="math/tex">Q(\theta,\theta^{old})</script> depends on how we parametrize the model; GMM is one such case and is worked out below. We now have the update rule of the M step. In contrast to the E step, <strong>M step is equivalent to pushing the log likelihood itself higher</strong>.</p>
<h2 id="gmm-as-em">GMM as EM</h2>
<p>Now, we will show that GMM is a specific example of EM. To see this, let</p>
<ul>
<li><script type="math/tex">X=\{X_1,X_2,...,X_N\}</script> be the data.</li>
<li><script type="math/tex">Z=\{Z_1,Z_2,...,Z_N\}</script> be the cluster membership.</li>
<li><script type="math/tex">\theta=\{\mu_1,...,\mu_K,\Sigma_1,...,\Sigma_K,\pi_1,...,\pi_K\}</script> be the parameters. <script type="math/tex">\mu_k,\Sigma_k</script> are the parameters of the <script type="math/tex">k</script>-th Gaussian, and <script type="math/tex">\pi_k</script> are the prior mixture probabilities.</li>
</ul>
<p>The whole model is a mixture of MVN, as follows:
<script type="math/tex">p(X,Z\lvert \theta)=\prod_{n=1}^N \prod_{k=1}^K \pi_k^{Z_{nk}} N(X_n\lvert \mu_k, \Sigma_k)^{Z_{nk}}</script></p>
<h3 id="e-step-qzpzlvert-xtheta">E Step: <script type="math/tex">q(Z)=p(Z\lvert X,\theta)</script></h3>
<p><script type="math/tex">% <![CDATA[
\begin{align}
& q(Z) \\
&=p(Z\lvert X,\theta) \\
&=\frac{p(X,Z\lvert \theta)}{p(X\lvert \theta)} \\
&=\frac{\prod_{n=1}^N \prod_{k=1}^K \pi_k^{Z_{nk}} N(X_n\lvert \mu_k, \Sigma_k)^{Z_{nk}}}{\prod_{n=1}^N \sum_{k=1}^K \pi_k N(X_n\lvert \mu_k, \Sigma_k)}
\end{align} %]]></script></p>
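<p>Because each <script type="math/tex">Z_n</script> is one-hot, this posterior factorizes over data points, and in practice the E step reduces to computing the responsibilities <script type="math/tex">\gamma(Z_{nk})=\frac{\pi_k N(X_n\lvert \mu_k,\Sigma_k)}{\sum_j \pi_j N(X_n\lvert \mu_j,\Sigma_j)}</script>. A minimal 1-D sketch (assuming NumPy; the data and parameter values are made up):</p>

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def e_step(X, pi, mu, var):
    """Responsibilities gamma[n, k] = p(Z_nk = 1 | x_n, theta)."""
    # unnormalized: pi_k * N(x_n | mu_k, var_k), broadcast over n and k
    weighted = pi[None, :] * normal_pdf(X[:, None], mu[None, :], var[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([-2.1, -1.9, 0.1, 2.0, 2.2])
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
var = np.array([1.0, 1.0])

gamma = e_step(X, pi, mu, var)
assert np.allclose(gamma.sum(axis=1), 1.0)  # each row is a distribution over k
```

Points near a component's mean get responsibility close to 1 for that component; the ambiguous point at 0.1 is split between the two.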
<h3 id="m-step-argmax_theta--qthetathetaold">M Step: <script type="math/tex">argmax_{\theta} \ Q(\theta,\theta^{old})</script></h3>
<p><script type="math/tex">% <![CDATA[
\begin{align}
& Q(\theta,\theta^{old}) \\
&= \sum_Z q(Z\lvert X,\theta^{old}) log \ p(X,Z\lvert \theta) \\
&= E_{Z\sim q(\cdot \lvert X,\theta^{old})} log \ p(X,Z\lvert \theta) \\
&= E_{Z\sim q(\cdot \lvert X,\theta^{old})} \sum_{n=1}^N \sum_{k=1}^K Z_{nk} (log \ \pi_k + log \ N(X_n\lvert \mu_k, \Sigma_k)) \\
\end{align} %]]></script>
Let <script type="math/tex">E_{Z\sim q(\cdot \lvert X,\theta^{old})}Z_{nk}=\gamma(Z_{nk})</script>. Then,
<script type="math/tex">% <![CDATA[
\begin{align}
Q(\theta,\theta^{old}) &= \sum_{n=1}^N \sum_{k=1}^K \gamma(Z_{nk}) (log \ \pi_k + log \ N(X_n\lvert \mu_k, \Sigma_k))
\end{align} %]]></script></p>
<p>Since we have the constraint that <script type="math/tex">\sum_k \pi_k =1</script> (the cluster membership probability sums to 1), the ultimate optimization problem becomes:</p>
<script type="math/tex; mode=display">argmax_{\theta} \ Q'(\theta)=Q(\theta,\theta^{old})+\lambda(\sum_k \pi_k - 1)</script>
<p>Since <script type="math/tex">\theta=\{\mu_1,...,\mu_K,\Sigma_1,...,\Sigma_K,\pi_1,...,\pi_K\}</script>, we can solve this for <script type="math/tex">\pi_k,\mu_k,\Sigma_k</script> one at a time, writing <script type="math/tex">N_k \equiv \sum_{n=1}^N \gamma(Z_{nk})</script> for the effective number of points assigned to cluster <script type="math/tex">k</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{\partial Q'}{\partial \pi_k} = 0 \Leftrightarrow \pi_k = \frac{N_k}{N} \\
& \frac{\partial Q'}{\partial \mu_k} = 0 \Leftrightarrow \mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(Z_{nk})X_n \\
& \frac{\partial Q'}{\partial \Sigma_k} = 0 \Leftrightarrow \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(Z_{nk})(X_n-\mu_k)(X_n-\mu_k)^T \\
\end{align} %]]></script>
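<p>These three updates translate directly into code. A 1-D sketch (assuming NumPy; the toy data and responsibilities are made up, with hard, well-separated clusters so the expected results are obvious):</p>

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form GMM updates for 1-D data, given responsibilities gamma[n, k]."""
    N_k = gamma.sum(axis=0)                      # effective cluster sizes
    pi = N_k / len(X)                            # pi_k = N_k / N
    mu = (gamma * X[:, None]).sum(axis=0) / N_k  # responsibility-weighted means
    var = (gamma * (X[:, None] - mu[None, :]) ** 2).sum(axis=0) / N_k
    return pi, mu, var

# hard, perfectly separated responsibilities (made-up toy data)
X = np.array([0.0, 2.0, 10.0, 12.0])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

pi, mu, var = m_step(X, gamma)
assert np.allclose(pi, [0.5, 0.5])
assert np.allclose(mu, [1.0, 11.0])
assert np.allclose(var, [1.0, 1.0])
```

Alternating `e_step` and `m_step` until the log likelihood stops improving is the full GMM-EM algorithm.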
<h2 id="k-means-as-gmm">K means as GMM</h2>
<p>It is straightforward to see that K means is a specific instance of GMM. In GMM, assume the following:</p>
<ul>
<li>Shared spherical covariance: <script type="math/tex">\Sigma_k = \sigma^2 I_D</script></li>
<li>Constant membership probability: <script type="math/tex">\pi_k=\frac{1}{K}</script></li>
<li>Hard membership assignment: <script type="math/tex">Z_{nk}=
\begin{cases}
1, if \quad k=argmax_j \ N(X_n\lvert \mu_j,\Sigma_j)\\
0, otherwise
\end{cases}</script></li>
</ul>
<p>With these three additional assumptions, we have the K means! To see this, let’s take a look at the E step and the M step.</p>
<h3 id="e-step-qzpzlvert-xtheta-1">E Step: <script type="math/tex">q(Z)=p(Z\lvert X,\theta)</script></h3>
<p>Since the membership assignment is hard, for each point we want to find the cluster that maximizes its likelihood:
<script type="math/tex">% <![CDATA[
\begin{align}
&argmax_k \ N(X_n\lvert \mu_k,\Sigma_k) \\
&= argmax_k \ exp(-\frac{\lVert X_n-\mu_k\rVert^2}{2\sigma^2}) \\
&= argmin_k \ \lVert X_n-\mu_k\rVert^2
\end{align} %]]></script></p>
<p>Hence we see that E step is equivalent to assigning points to the nearest centroid.</p>
<h3 id="m-step-argmax_theta--qthetathetaold-1">M Step: <script type="math/tex">argmax_{\theta} \ Q(\theta,\theta^{old})</script></h3>
<p>We only need to update <script type="math/tex">\mu_k</script> according to the update rule derived above:</p>
<script type="math/tex; mode=display">\mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(Z_{nk})X_n</script>
<p>This is equivalent to taking the mean of data points in each cluster.</p>
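<p>Putting the two steps together gives the familiar K means loop. A minimal 1-D sketch (assuming NumPy; the data and initial centroids are made up, and the empty-cluster edge case is ignored):</p>

```python
import numpy as np

def kmeans(X, mu, n_iter=20):
    """Lloyd's algorithm: the hard-assignment limit of GMM EM (1-D sketch)."""
    for _ in range(n_iter):
        # E step: assign each point to the nearest centroid
        labels = np.argmin((X[:, None] - mu[None, :]) ** 2, axis=1)
        # M step: each centroid becomes the mean of its assigned points
        mu = np.array([X[labels == k].mean() for k in range(len(mu))])
    return mu, labels

X = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
mu, labels = kmeans(X, mu=np.array([0.5, 9.0]))
assert np.allclose(sorted(mu), [1.0, 11.0])
```

Note that, compared with full GMM-EM, neither <script type="math/tex">\pi_k</script> nor <script type="math/tex">\Sigma_k</script> is updated, exactly because the K means assumptions fix them.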
<h2 id="summary">Summary</h2>
<p>I hope you were able to see the theoretical derivation of EM and that GMM and K means are both specific instances of EM. To read more about this material, I suggest you refer to the following:</p>
<ul>
<li>Information Theory, Inference, and Learning Algorithms (MacKay) Chapter 20, 22, 33.7</li>
<li>Machine Learning: A Probabilistic Perspective (Murphy) Chapter 11.4</li>
<li>Pattern Recognition and Machine Learning (Bishop) Chapter 9</li>
</ul>
<h1>Six Statistics Terms That Will Change How You See Life (人生の見方が変わる統計用語６選)</h1>
<p><em>2017-12-19</em></p>
<p>For me, the greatest joy of studying statistics is the way statistical thinking pays off in everyday decision making and in my career. To share that joy with as many people as possible, here are six statistical concepts, each with an example of how it can be useful in life. I have mixed in a few ideas from machine learning, computer science, and mathematics as well.</p>
<h2 id="1-平均と偏差-mean-and-variance">1. Mean and variance (平均と偏差)</h2>
<p>“Company A or Company B, which should I join… Company A’s average may be lower, but the variance is higher, so if I choose the right department I could grow like crazy!”</p>
<p>You probably already know what a mean is. Variance, familiar in Japan through <em>hensachi</em> (standardized test scores), measures how spread out the data are.
For example, when choosing a school, how capable the other students are is an important signal, but you should look not only at their average ability but also at its variance (spread). Since you will only ever interact with a handful of the students, the larger the variance, the better your chances of finding a truly high-quality community among them.
American college students’ math skills are often compared with those of Japanese students; my sense is that the former have a lower mean but a larger variance. Indeed, the friends I now spend time with are frighteningly good at math, so I suspect I have found my way into roughly the top 10% of the community. That is only a gut feeling, of course.</p>
<p>Even outside clearly ranked qualities like ability, a high-variance environment probably makes it easier to find a place where you belong.
<img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/variance.png" alt="image-center" class="align-center" /></p>
<h2 id="2-ジップの法則-zipfs-law">2. Zipf’s Law (ジップの法則)</h2>
<p>“If I work my way up I might reach an annual salary of 10 million yen, but I can’t hope for exponential income growth… Let me think about how to grow my income nonlinearly!”</p>
<p>Have you seen the <a href="https://www.cnn.co.jp/business/35095041.html">news</a> that the eight richest people own as much wealth as the bottom 50%? Inequality has not always been quite this extreme, but this kind of concentration of wealth in a tiny elite has occurred in every era, because incomes grow explosively as you move up the distribution. Such heavily skewed, power-law relationships are known as Zipf’s Law, and they appear everywhere, not just in wealth. Word frequencies in books follow Zipf’s Law: the most common words, such as the particles <em>te-ni-wo-ha</em> and pronouns like <em>watashi</em> and <em>boku</em>, occur overwhelmingly more often than all the others. Twitter follower counts, Facebook Likes, company headcounts — all sorts of things follow Zipf’s Law.</p>
<p>Perhaps because of how school trains us, we tend to assume things grow linearly. That is why, when we learn that wealth is concentrated in a small elite, we may feel more outrage than is warranted. Seen through Zipf’s Law, linear self-improvement will rarely put you far ahead of others in your career. The gap between 10 million and 100 million yen a year is huge in money but not so large as a fraction of the population; plodding along on the same track leaves you in roughly the same relative position. Whether you care about that is a separate question. My point is that if you can come to feel that Zipf-like growth is as natural as linear growth, many things will come into view.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/zipfs.png" alt="image-center" class="align-center" /></p>
<h2 id="3-過学習-overfitting">3. Overfitting (過学習)</h2>
<p>“I studied so hard for the test and still couldn’t score… wait, maybe I’m overfitting!”</p>
<p>If you breeze through the practice problems but stumble when the real exam presents a slightly different pattern, you should suspect overfitting. Overfitting means fitting a model so tightly to the data at hand that the model stops being useful on newly arriving data.</p>
<p>You were unbeatable at university entrance exams but can’t perform once you join a company; a business that thrived in Japan fails overseas; you switched partners in the tennis club and suddenly became much weaker; the approach that made one girl swoon does nothing for the next. In cases like these, the skills you built may have overfit to your past experience.</p>
<p>In statistics we handle this by regularizing the model (making it less sensitive to the data) or by gathering more training data. The same applies to life: when learning from experience, aim for lessons that generalize, and steer yourself toward as wide a variety of experiences as you can.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/overfitting.png" alt="image-center" class="align-center" /></p>
<h2 id="4-平均への回帰-regression-to-the-mean">4. Regression to the mean (平均への回帰)</h2>
<p>“The new Southern All Stars album was underwhelming, but maybe it’s just regression to the mean. I’ll stay a fan a while longer.”</p>
<p>Have you ever been let down by a favorite artist’s new album? Or found that a friend, A, who attended an elite university went on to a less brilliant life than you expected? This need not mean that the artist’s or A’s ability has declined. Observed data generally contain noise, and that noise is random: sometimes it pushes the result up, sometimes down. Perhaps the album that first hooked you on the artist (say, their second) carried positive noise that made them look better than they really are. If so, even if their ability has grown since, their third album may simply have carried negative noise. Statisticians call this phenomenon regression to the mean.</p>
<p>We tend to place high expectations on the children of accomplished parents, or to assume that because a stock trade went well, future trades will too, but such thinking runs against regression to the mean. Keeping it in mind lets you manage your expectations more intelligently.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/regression_mean.png" alt="image-center" class="align-center" /></p>
<h2 id="5-探索と活用-exploration-and-exploitation">5. Exploration and exploitation (探索と活用)</h2>
<p>“College students are still in the exploration phase of life. Rather than rushing to exploit, let me try all sorts of things.”</p>
<p>These are terms from reinforcement learning, a branch of computer science. Reinforcement learning is the technology behind things like self-driving cars and shogi engines: by letting an artificial intelligence try out many different patterns, we make it steadily better (we “reinforce” it). Typically it starts with exploration: in shogi, for instance, it tries all kinds of moves simply to learn how likely each is to win. As the AI matures, it shifts to exploitation, playing mostly the moves it has learned tend to win.</p>
<p>If you think of your own brain as the AI and learning from experience as reinforcement learning, it becomes clear that exploration-versus-exploitation thinking matters for our decisions too. When should you switch from broadening your general education to sharpening a specialty? Concretely, should you attend a vocational school or a general university? Casting the question as exploration versus exploitation organizes it neatly. The more reinforcement learning you study, the more deliberately you can weigh the exploration/exploitation trade-off in your own life decisions.</p>
<h2 id="6-全体最適と極所最適-local-optima-and-global-optima">6. Local and global optima (局所最適と全体最適)</h2>
<p>“My sales numbers have plateaued lately. Maybe I’m stuck in a local optimum, so let me try a completely new approach.”</p>
<p>One look at the picture below makes this clear: the point marked “local optima” sits on a small hill. Climbing the small hill does not let you walk straight onto the big one; you must first descend and then climb toward the big hill. In other words, merely continuing to improve the status quo does not necessarily lead to the best possible outcome. Sometimes you need to try something radically new. This idea comes from the field of mathematical optimization.</p>
<p>If your productivity has stopped improving, your relationship has stopped deepening, or A/B tests no longer lift your KPIs the way they used to, then aim for the global optimum and have the courage to walk down the small hill you are standing on.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/optima.png" alt="image-center" class="align-center" /></p>
<p>There are many more terms I regularly reach for when thinking things through, such as causation versus correlation, selection bias, recall and precision, and undefined (primitive) terms, but I will stop here for now. May statistics make your life even a little bit better!</p>
<h1>Multivariate Normal Cheatsheet</h1>
<p><em>2017-12-14</em></p>
<p>The multivariate normal (MVN) is used everywhere in machine learning, from simple regressions, linear discriminant analysis, and Kalman filters to Gaussian processes. Yet very few textbooks summarize its important characteristics concisely. So here they are.</p>
<h2 id="definition">Definition</h2>
<p><script type="math/tex">\mathbf{Y} \sim N_k(\mathbf{\mu},\mathbf{V})</script> if <script type="math/tex">Y=A\mathbf{Z}+\mathbf{\mu}</script>.</p>
<ul>
<li><script type="math/tex">\mathbf{Y}</script>: <script type="math/tex">k</script> dimensional vector</li>
<li><script type="math/tex">\mathbf{\mu}</script>: <script type="math/tex">k</script> dimensional mean vector</li>
<li><script type="math/tex">\mathbf{V}</script>: <script type="math/tex">k \times k</script> dimensional covariance matrix. It is positive semi-definite: <script type="math/tex">\mathbf{x'Vx}\geq 0</script> for all <script type="math/tex">\mathbf{x}</script>.</li>
<li><script type="math/tex">A</script>: <script type="math/tex">k \times m</script> dimensional matrix. <script type="math/tex">\mathbf{V} = AA'</script></li>
<li><script type="math/tex">\mathbf{Z}=(Z_1,...,Z_m)</script> where <script type="math/tex">Z_i \sim_{iid} N(0,1)</script></li>
</ul>
<h2 id="equivalent-definition">Equivalent Definition</h2>
<p><script type="math/tex">\mathbf{Y} \sim N_k(\mathbf{\mu},\mathbf{V})</script> if all linear combinations of <script type="math/tex">\mathbf{t'Y}</script> are univariate normal, i.e.</p>
<p><script type="math/tex">\mathbf{t'Y} \sim N(\mathbf{t'\mu},\mathbf{t'Vt})</script> for any <script type="math/tex">\mathbf{t}</script>.</p>
<h2 id="pdf-and-mgf">PDF and MGF</h2>
<ul>
<li>PDF: <script type="math/tex">f(\mathbf{y}) = \frac{1}{(2\pi)^\frac{k}{2}\lvert \mathbf{V}\rvert^\frac{1}{2}}exp(-\frac{1}{2}(\mathbf{y-\mu})'\mathbf{V}^{-1}(\mathbf{y-\mu}))</script></li>
<li>MGF: <script type="math/tex">M_{\mathbf{Y}}(\mathbf{t}) = E(e^{\mathbf{t'Y}}) = exp(\mathbf{t'\mu}+\frac{1}{2}\mathbf{t'Vt})</script></li>
</ul>
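<p>As a sanity check on the PDF formula, note that when <script type="math/tex">\mathbf{V}</script> is diagonal the density must factor into a product of independent univariate normals. A sketch (assuming NumPy; the evaluation point and parameters are arbitrary):</p>

```python
import numpy as np

def mvn_pdf(y, mu, V):
    """Density of N_k(mu, V), written directly from the formula above."""
    k = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.inv(V) @ diff
    norm_const = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(V))
    return np.exp(-0.5 * quad) / norm_const

def normal_pdf(x, m, var):
    return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# With a diagonal V the joint density is the product of the marginals
mu = np.array([1.0, -2.0])
V = np.diag([4.0, 9.0])
y = np.array([0.5, 0.0])

expected = normal_pdf(0.5, 1.0, 4.0) * normal_pdf(0.0, -2.0, 9.0)
assert np.isclose(mvn_pdf(y, mu, V), expected)
```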
<h2 id="linear-transformations-of-mvn">Linear Transformations of MVN</h2>
<p>If <script type="math/tex">\mathbf{Y} \sim N_k(\mathbf{\mu},\mathbf{V})</script>,</p>
<ul>
<li>
<script type="math/tex; mode=display">\mathbf{X} = B\mathbf{Y} + \mathbf{b} \sim N(B\mathbf{\mu} + \mathbf{b},B\mathbf{V}B')</script>
</li>
<li>
<script type="math/tex; mode=display">\mathbf{a'Y} \sim N(\mathbf{a'\mu},\mathbf{a'Va})</script>
</li>
</ul>
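<p>Both facts can be checked by simulation: sample <script type="math/tex">\mathbf{Y}</script>, apply the affine map, and compare the empirical mean and covariance with <script type="math/tex">B\mathbf{\mu}+\mathbf{b}</script> and <script type="math/tex">B\mathbf{V}B'</script>. A sketch (assuming NumPy; the particular <script type="math/tex">B</script>, <script type="math/tex">\mathbf{b}</script>, and parameters are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
V = np.array([[2.0, 0.5], [0.5, 1.0]])

# Y ~ N_2(mu, V); X = B Y + b should have mean B mu + b and covariance B V B'
Y = rng.multivariate_normal(mu, V, size=200_000)
B = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([3.0, -3.0])
X = Y @ B.T + b

assert np.allclose(X.mean(axis=0), B @ mu + b, atol=0.05)
assert np.allclose(np.cov(X.T), B @ V @ B.T, atol=0.1)
```

The tolerances are loose because the check is against Monte Carlo estimates, not exact values.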
<h2 id="within-mvn">Within MVN…</h2>
<p>Let <script type="math/tex">Y=\begin{pmatrix}
\mathbf{Y_1} \\
\mathbf{Y_2}
\end{pmatrix} \sim N_{k_1+k_2}(\mathbf{\mu},\mathbf{V})</script>, where <br />
<script type="math/tex">\mathbf{\mu}=\begin{pmatrix}
\mu_1 \\
\mu_2
\end{pmatrix}</script> and <script type="math/tex">\mathbf{V}=\begin{pmatrix}
\mathbf{V_{11}} & \mathbf{V_{12}} \\
\mathbf{V_{21}} & \mathbf{V_{22}}
\end{pmatrix}</script></p>
<p>Then,</p>
<ul>
<li>Uncorrelatedness implies independence: <script type="math/tex">\mathbf{V_{12}}=0 \Leftrightarrow \mathbf{Y_1} \perp \mathbf{Y_2}</script></li>
<li>Marginal: <script type="math/tex">\mathbf{Y_1} \sim N(\mu_1,\mathbf{V_{11}})</script> (and likewise <script type="math/tex">\mathbf{Y_2} \sim N(\mu_2,\mathbf{V_{22}})</script>)</li>
<li>Conditional: <script type="math/tex">\mathbf{Y_2}\lvert \mathbf{Y_1} \sim N(\mu_{2\cdot 1},\mathbf{V_{22\cdot 1}})</script>, where <script type="math/tex">\mu_{2\cdot 1}=\mu_2+\mathbf{V_{21}}\mathbf{V_{11}}^{-1}(\mathbf{Y_1}-\mu_1)</script>, <script type="math/tex">\mathbf{V_{22\cdot 1}}=\mathbf{V_{22}}-\mathbf{V_{21}}\mathbf{V_{11}}^{-1}\mathbf{V_{12}}</script>.</li>
</ul>
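<p>The conditional distribution can be coded directly from the block partition; note the Schur-complement minus sign in the conditional covariance. A sketch (assuming NumPy; the bivariate example with correlation 0.8 is made up so the answer is easy to check by hand):</p>

```python
import numpy as np

def mvn_conditional(mu, V, k1, y1):
    """Parameters of Y2 | Y1 = y1 for Y = (Y1, Y2) ~ N(mu, V); Y1 has dim k1."""
    mu1, mu2 = mu[:k1], mu[k1:]
    V11, V12 = V[:k1, :k1], V[:k1, k1:]
    V21, V22 = V[k1:, :k1], V[k1:, k1:]
    W = V21 @ np.linalg.inv(V11)
    cond_mean = mu2 + W @ (y1 - mu1)
    cond_cov = V22 - W @ V12  # Schur complement: the minus sign matters
    return cond_mean, cond_cov

mu = np.array([0.0, 0.0])
V = np.array([[1.0, 0.8], [0.8, 1.0]])  # unit variances, correlation 0.8
m, c = mvn_conditional(mu, V, k1=1, y1=np.array([1.0]))
assert np.isclose(m[0], 0.8)             # E[Y2 | Y1 = 1] = rho * y1
assert np.isclose(c[0, 0], 1 - 0.8 ** 2) # Var shrinks to 1 - rho^2
```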
<h2 id="joint-distribution">Joint Distribution</h2>
<p><script type="math/tex">Y\lvert \theta \sim N_k(\theta,A_1),\theta\sim N_k(\mu,A_2)</script> then <script type="math/tex">(Y,\theta)\sim N_{2k}(\begin{pmatrix}
\mu \\
\mu
\end{pmatrix},\begin{pmatrix}
A_1+A_2 & A_2 \\
A_2 & A_2
\end{pmatrix})</script></p>
<h1>What regression coefficients really mean</h1>
<p><em>2017-12-11</em></p>
<p>There is nothing in statistics that is as easy to use as regression yet as hard to interpret correctly. In causal inference especially, even the coefficients of a linear regression are hard to interpret. Stop a random statistics student on campus and ask them to interpret a fitted linear regression model, and you may find them struggling to give a crisp explanation of what it captures. This is because, often, the coefficients are not a causal effect: they also contain what is known as selection bias, which I’ll explain in this post.</p>
<h3 id="the-goal-of-regression--learn-cef">The goal of regression = learn CEF.</h3>
<p>CEF stands for conditional expectation function. The CEF describes how much <script type="math/tex">Y</script> changes as <script type="math/tex">X</script> changes. More formally, the CEF is defined as</p>
<script type="math/tex; mode=display">\mu(x) = E[Y\lvert X=x]</script>
<p>In short, the goal of any regression is to estimate <script type="math/tex">\hat{\mu}(x)</script>.</p>
<h3 id="regressions-can-be-parametric-or-non-parametric">Regressions can be parametric or nonparametric.</h3>
<p>There are two ways to estimate the CEF. One is parametric: we approximate the CEF with a function of some parameters and learn the parameter values that best fit the CEF.</p>
<script type="math/tex; mode=display">\hat{\mu}(x) = g_{\theta}(x)</script>
<p>Here, <script type="math/tex">\theta</script> is the parameter we would like to estimate. For example, in linear regressions, <script type="math/tex">\theta</script> will be the coefficients for the features.</p>
<p>On the other hand, there’s a non parametric approach to the problem, where we directly estimate <script type="math/tex">\mu(x)</script>. This includes methods like random forest and other decision tree algorithms.</p>
<h3 id="regression-does-not-imply-causality">Regression does not imply causality.</h3>
<p>As we shall see, CEF is not causal. To see this, let’s think about CEF in the case of an experiment with binary treatment. In this case, CEF must be linear, since the model is saturated. Namely, treatment <script type="math/tex">D</script> can only take on two possible values, and hence a simple linear model can capture everything about the changes in <script type="math/tex">D</script>. Hence, we have:</p>
<script type="math/tex; mode=display">\mu(d) = E[Y\lvert D=d] = \beta_0 + \beta D</script>
<p>This encodes how much <script type="math/tex">Y</script> changes as <script type="math/tex">D</script> changes.</p>
<p>Since</p>
<p><script type="math/tex">E[Y\lvert D=0] = \beta_0</script>
and
<script type="math/tex">E[Y\lvert D=1] = \beta_0 + \beta</script></p>
<p>we can write these two in a more concise notation:</p>
<script type="math/tex; mode=display">E[Y\lvert D] = E[Y\lvert D=0] + D (E[Y\lvert D=1] - E[Y\lvert D=0])</script>
<p>This is our CEF. The reason I used <script type="math/tex">\beta=E[Y\lvert D=1] - E[Y\lvert D=0]</script> in my previous post about the experiment is that, in the randomized experiment case, the ATE is exactly this CEF difference. In general, however, <script type="math/tex">\beta</script> will not be a causal effect. Let’s see what <script type="math/tex">\beta</script> consists of:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta &= E[Y\lvert D=1] - E[Y\lvert D=0] \\
&= E[Y(1)\lvert D=1]-E[Y(0)\lvert D=0] (\because \text{SUTVA})\\
&= E[Y(1)\lvert D=1]-E[Y(0)\lvert D=1]+E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0] \\
&= ATT + E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0] \\
&= ATE + (ATT-ATE) + (E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0])
\end{align} %]]></script>
<p>Before explaining what this decomposition means, note that if the unconfoundedness assumption holds, <script type="math/tex">\beta</script> will reduce to ATE, as shown in the previous post.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta &= E[Y\lvert D=1] - E[Y\lvert D=0] \\
&= E[Y(1)\lvert D=1]-E[Y(0)\lvert D=0] (\because \text{SUTVA})\\
&= E[Y(1)]-E[Y(0)] (\because \text{unconfoundedness}) \\
&= ATE
\end{align} %]]></script>
<p>This is why running a linear regression on a binary treatment experiment is enough to capture the causal effect.</p>
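<p>This identity is easy to verify by simulation: with a binary treatment, the OLS slope on <script type="math/tex">D</script> equals the difference in group means exactly, because the model is saturated. A sketch (assuming NumPy; the outcome model is made up):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
D = rng.integers(0, 2, size=n).astype(float)      # randomized binary treatment
Y = 1.0 + 2.0 * D + rng.normal(0.0, 1.0, size=n)  # made-up outcome model

# OLS of Y on an intercept and D
Xmat = np.column_stack([np.ones(n), D])
coef, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)

# In the saturated binary case, the slope IS the difference in group means
diff_in_means = Y[D == 1].mean() - Y[D == 0].mean()
assert np.isclose(coef[1], diff_in_means)
```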
<p>Now let’s get back to the more general case. We have</p>
<script type="math/tex; mode=display">\beta = ATE + (ATT-ATE) + (E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0])</script>
<p>What does this mean? It clearly means the coefficient of the regression is not just ATE. It consists of two more terms:</p>
<ul>
<li><script type="math/tex">E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0]</script>: Type I Selection Bias. This arises when <script type="math/tex">Y(0) \not\perp D</script>, meaning that the default outcome under control differs between those actually treated and those who are not.</li>
<li><script type="math/tex">ATT-ATE</script>: Type II Selection Bias. This arises when <script type="math/tex">Y(1)-Y(0) \not\perp D</script>, meaning that the benefit of the treatment differs between those actually treated and those who are not.</li>
</ul>
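<p>The full decomposition is itself an identity that can be checked by simulation. The sketch below (assuming NumPy; the data-generating process is made up) builds potential outcomes with an unobserved confounder so that both selection biases are positive, and verifies <script type="math/tex">\beta = ATE + (ATT-ATE) + (E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0])</script>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
ability = rng.normal(0, 1, size=n)            # unobserved confounder
Y0 = ability + rng.normal(0, 1, size=n)       # potential outcome without treatment
Y1 = Y0 + 1.0 + 0.5 * ability                 # effect is larger for high ability
D = ability + rng.normal(0, 1, size=n) > 0    # selection into treatment
Y = np.where(D, Y1, Y0)                       # observed outcome (SUTVA)

beta = Y[D].mean() - Y[~D].mean()             # naive difference in means
ATE = (Y1 - Y0).mean()
ATT = (Y1 - Y0)[D].mean()
type1 = Y0[D].mean() - Y0[~D].mean()          # Type I selection bias
type2 = ATT - ATE                             # Type II selection bias

assert np.isclose(beta, ATE + type2 + type1)  # the decomposition is exact
assert type1 > 0 and type2 > 0                # both biases inflate beta here
```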
<h3 id="example-regression-for-measuring-the-effect-going-to-college">Example: Regression for measuring the effect of going to college</h3>
<p>Selection biases might be hard to grasp in the abstract, so let me illustrate with an example. Again, think of a study measuring the effect of going to college on future income. This time, as is usually the case, we cannot randomly assign students to go to college, so this is an example of what is known as an observational study. What do the selection biases say?</p>
<p>If the Type I bias is positive, it means that even before entering college, the students who will enter college have a higher earning potential. This makes intuitive sense: those who enroll in college may come from a better socioeconomic background, be more motivated to advance their careers, and so on.</p>
<p>If the Type II bias is positive, it means that those who go to college gain more from it than those who don’t would have. In other words, even if those who skipped college had attended, their income would not have risen as much as it did for those who actually went.</p>
<p>Hence, these two biases can prevent <script type="math/tex">\beta</script> from capturing the true “going to college” effect averaged over both those who did and those who didn’t go to college.</p>
<h3 id="summary-how-to-interpret-regression-coefficients">Summary: how to interpret regression coefficients</h3>
<p>When you run a linear regression on a binary treatment outside a randomized experiment, don’t jump to the conclusion that the coefficient captures the true causal effect. If the regression says that going to college raises future income by $1000 a year, look for the two selection biases. It might be that those who went to college would have earned $200 more than those who didn’t even without attending (Type I), and that they also gained $100 more from attending than the others would have (Type II). In that case, the true “going to college” effect is $700.</p>
<h1>Formal definition of an experiment</h1>
<p><em>2017-12-09</em></p>
<p>Science experiments, social experiments, thought experiments… We use the word “experiment” fairly often in everyday life. But have you ever wondered what experiments really are? Probably not… But don’t close the page yet! It is actually quite interesting to learn how statisticians formally define experiments, and that is what this post is about. By the end, you will be able to tell your friends running a thought experiment that what they are doing is, statistically speaking, not a valid experiment. What great knowledge to have!</p>
<h3 id="jumping-right-in-the-definition-of-an-experiment">Jumping right in, the definition of an experiment.</h3>
<p>Recall the previous post about causal inference. In any scientific study, we care about the following three types of variables:</p>
<ul>
<li><script type="math/tex">X_i</script>: Pretreatment covariates.</li>
<li><script type="math/tex">D_i</script>: Treatment.</li>
<li><script type="math/tex">Y_i</script>: Observed outcome.</li>
</ul>
<p>Given these variables, let <script type="math/tex">p_i</script> be the probability of unit i receiving a treatment given its covariates and potential outcomes. Formally,</p>
<script type="math/tex; mode=display">p_i = P(D_i=1 \mid X,Y(0),Y(1))</script>
<p>Then an experiment can be defined quite simply; it is a study where <script type="math/tex">p_i</script> is controlled by and known to the researcher. In other words, if the experimenter can decide whether to assign treatment or control to each unit, it is an experiment. For example, AB testing is an experiment since we are controlling how to allocate users to different groups.</p>
<h3 id="randomized-experiments">Randomized Experiments</h3>
<p>The most important type of experiment is the randomized experiment. The intuitive way to understand it is, it is any experiment where units are randomly assigned to treatment or control, according to <script type="math/tex">p_i</script>. More rigorously, a randomized experiment should satisfy the following three conditions:</p>
<ul>
<li><script type="math/tex">% <![CDATA[
0 < p_i < 1 %]]></script>, non-deterministic. Each unit has some chance (even if very small) of being assigned to treatment/control.</li>
<li><script type="math/tex">p_i \perp Y_{-i}, X_{-i} \forall i</script>, individualistic. The treatment assignment probability doesn’t depend on the covariates and potential outcomes of other units.</li>
<li><script type="math/tex">p_i \perp Y_{i}(0), Y_i(1) \forall i</script>, unconfounded. The treatment assignment probability is independent of the unit’s own potential outcomes.</li>
</ul>
<p>The first two assumptions should make intuitive sense. The last one needs a bit of thought and, in fact, it is the most crucial for characterizing a randomized experiment. Why are experiments not randomized if the treatment assignment is confounded? To see this, think of an experiment for measuring the effect of going to college on future income. Suppose you can randomly assign high school grads to go / not go to college (this is unethical in practice). If the experiment is confounded, it means that the probability of a student being assigned to go to college depends on their future income (supposing we are God and know their future). In the likely scenario, those who will earn more will have a higher probability (<script type="math/tex">p_i</script>) of being assigned to go to college. Consider the difference between the treated and the control in this scenario. Will it have a causal interpretation, in the sense that it captures the causal effect of going to college on income? The answer is no. This is because we can’t tell whether the difference in future income is due to the students who went to college already being smart, or due to the college education making them more capable of earning. Hence, unconfoundedness is crucial for deriving causal effects from randomized experiments.</p>
<p>Let me rephrase this argument in a more rigorous way.</p>
<p>Recall from the previous post that one of the metrics that we care about is the Average Treatment Effect:</p>
<script type="math/tex; mode=display">E[\tau_i] = E[Y(1)]-E[Y(0)] = \frac{1}{N} \sum_{i=1}^N [Y_i(1)-Y_i(0)]</script>
<p>We cannot directly obtain <script type="math/tex">E[Y(1)]-E[Y(0)]</script> since we can only observe one of the potential outcomes for each unit. On the other hand, what is available is the following:</p>
<script type="math/tex; mode=display">\beta = E[Y|D=1]-E[Y|D=0]</script>
<p>I used the character <script type="math/tex">\beta</script> for a reason I’ll explain in the next post about causal inference. This is the average outcome for the treated minus the average outcome for the control. My claim here is that only under the unconfoundedness assumption does <script type="math/tex">\beta</script> have a causal interpretation. It’s actually simple math:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta &= E[Y|D=1]-E[Y|D=0] \\
&= E[Y(1)|D=1]-E[Y(0)|D=0] (\because \text{SUTVA})\\
&= E[Y(1)]-E[Y(0)] (\because \text{unconfoundedness}) \\
&= ATE
\end{align} %]]></script>
<p>I’m using the second SUTVA assumption (<script type="math/tex">Y_i=Y_i(d)</script>) from the first to the second line. From the second to the third line, I’m using the fact that if <script type="math/tex">p_i \perp Y_{i}(0), Y_i(1)</script> (unconfounded) then <script type="math/tex">E[Y(1) \mid D=1] = E[Y(1)]</script> and <script type="math/tex">E[Y(0) \mid D=0] = E[Y(0)]</script>.</p>
<p>Hence, we have that <em>the difference in <script type="math/tex">Y</script> between treatment and control equals the ATE if the unconfoundedness assumption holds</em>. This is a very important result in causal inference, and it is why randomized experiments are such a convenient type of experiment.</p>
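<p>To make this concrete, here is a small simulation sketch (my own toy example, with made-up potential outcomes) showing that the difference in means recovers the ATE under a randomized assignment but not under a confounded one:</p>

```python
import random

random.seed(0)
N = 100_000

# Hypothetical potential outcomes: Y(0) is a baseline, and the
# treatment adds a constant effect of 10 to everyone, so ATE = 10.
y0 = [random.gauss(50, 10) for _ in range(N)]
y1 = [y + 10 for y in y0]
true_ate = sum(a - b for a, b in zip(y1, y0)) / N

def diff_in_means(assign):
    treated = [y1[i] for i in range(N) if assign[i]]
    control = [y0[i] for i in range(N) if not assign[i]]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Randomized (unconfounded): p_i = 1/2, independent of the potential outcomes.
randomized = [random.random() < 0.5 for _ in range(N)]
beta_randomized = diff_in_means(randomized)

# Confounded: units with a high Y(0) are more likely to be treated.
confounded = [random.random() < (0.8 if y0[i] > 50 else 0.2) for i in range(N)]
beta_confounded = diff_in_means(confounded)
```

<p>With the randomized assignment, the difference in means lands near the true effect; with the confounded assignment, it is badly inflated, because high earners are over-represented among the treated.</p>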
<h3 id="examples-of-randomized-experiments">Examples of Randomized Experiments</h3>
<p>Last but not least, let me introduce four types of randomized experiments. Some are not as widely used as the others, but they all have the above property of <script type="math/tex">\beta=ATE</script> and hence you should consider these when running your AB tests:</p>
<ul>
<li>Bernoulli Trials: Each unit is assigned to treatment with the same probability <script type="math/tex">p</script>.</li>
<li>Completely Randomized Experiments: Randomly sample <script type="math/tex">n_t</script> units and assign them to treatment. AB tests are typically completely randomized experiments.</li>
<li>Stratified Randomized Experiments: Subgroup units based on the covariates <script type="math/tex">X</script>. Within each subgroup, run a completely randomized experiment.</li>
<li>Paired Randomized Experiments: Special case of stratified randomized experiments with two samples in each group.</li>
</ul>
<p>That’s it! Now you should be confident about what experiments are, and why AB tests are so powerful for finding causal effects. On the other side of the spectrum, there are observational studies, where <script type="math/tex">p_i</script> is unknown. Current research in causal inference is very much focused on observational studies, as experiments are already well studied. I’ll write about observational studies in the coming posts.</p>Kojin Oshibakojinoshiba@college.harvard.eduScience experiments, social experiments, thought experiments, … We use the word “experiment” somewhat often in real life. But have you ever wondered what experiments really are? Probably not… But don’t close the page yet! It is actually quite interesting to learn about how statisticians formally define experiments. This post is about that. If you read this post, you will be able to tell your friends having a thought experiment that what they are doing is, statistically speaking, not a valid experiment. What great knowledge to have!The Theory behind AB Testing: Introduction to Causal Inference2017-12-07T00:00:00+00:002017-12-07T00:00:00+00:00http://kojinoshiba.com/causal%20inference/theory-behind-ab-testing<p>If you’ve been working in the tech industry, or have thought about doing so, you’ve probably heard of AB testing. Some of you have even conducted one. The idea is as follows: say you own a website with a red login button. You’re thinking that you should change it to blue, since it looks too ugly. To determine which color you should use, you ask your users. You randomly sample 100 users and show 50 users the red button and 50 users the blue button. You measure the ratio of people who login to your website for each group, and see if there’s a big difference.</p>
<p>This is an example of a <em>completely randomized experiment</em> (CRE). In CRE, you first sample <em>units</em> (users), randomly assign a fraction of people to <em>treatment</em> (the red button) vs <em>control</em> (the blue button) and measure the difference in the <em>outcome</em> (the login rate).</p>
<p>Here, we can intuitively see that the difference in the login rate is caused by the different colors of the login button. In other words, the difference in the login rate has a causal interpretation.</p>
<p>On the other hand, imagine a situation where you started a campaign on your website, say a 10% off sale for all ice creams sold on your website. To see the effect of your campaign, you compare the total number of ice creams sold before and after the campaign. You saw a 20% increase in the sales volume. “The campaign was a success!”, you conclude. But can you really call the campaign a success?</p>
<p>Not really. For a simple example, imagine that the campaign was running for two months in June and July. In May, you had $1 million in monthly revenue, and after the campaign, in August, you have $1.2 million. Now consider this question: when are users more likely to buy ice creams, in May or in August? If you live in a certain part of the northern hemisphere like me (and so do the users), they would be more likely to buy ice creams in August. So, it is hard to tell if that $0.2 million increase in sales came from the campaign or from the fact that it was summer after the campaign was over. In such a scenario, the difference in the revenue does NOT have a causal interpretation: we don’t know if it was due to the campaign.</p>
<h3 id="causal-inference-is-the-theory-behind-ab-testing">Causal inference is the theory behind AB testing.</h3>
<p>When can we say that the increase in the login rate was <em>caused</em> by the change in the login button color? When can we say that the increase in revenue was <em>caused</em> by the campaign conducted? In general, when can we learn that something <em>caused</em> something else?</p>
<p>This is what causal inference is all about. Causal inference is a field for understanding the causal relationships between different events.</p>
<p>Let me formalize the notation used in causal inference. Let <script type="math/tex">i</script> index the units. Units are the users in the above example.</p>
<ul>
<li><script type="math/tex">X_i</script>: Pretreatment covariates. These can be age, gender, registration rate, past purchase history of the users. It is often a <script type="math/tex">d</script> dimensional vector where <script type="math/tex">d</script> is the number of features about each user.</li>
<li><script type="math/tex">D_i</script>: Treatment. For the above example, <script type="math/tex">D_i=1</script> if the button that user <script type="math/tex">i</script> is shown is red and <script type="math/tex">D_i=0</script> if blue. <script type="math/tex">D_i</script> is often binary, but can also take multiple, or even continuous, values.</li>
<li><script type="math/tex">Y_i</script>: Observed outcome. In the above examples, this can be an indicator of whether user <script type="math/tex">i</script> logged in or the total amount of ice cream user <script type="math/tex">i</script> bought.</li>
</ul>
<h3 id="potential-outcomes-define-what-are-fundamentally-unknown">Potential Outcomes define what is fundamentally unknown.</h3>
<p>In the real world, each unit can receive only one of the two treatments. If a user sees the blue button, they cannot see the red one, and vice versa. While <script type="math/tex">Y_i</script> can take on two values, one under treatment <script type="math/tex">Y_i(1)</script> and one under control <script type="math/tex">Y_i(0)</script>, we can observe only one of the two. Hence, <script type="math/tex">Y_i(1)</script> and <script type="math/tex">Y_i(0)</script> are called the potential outcomes. The fact that we can only observe one of the two (or of many, in the case with more than two treatments) is so fundamental that it is called the “Fundamental Problem of Causal Inference”.</p>
<h3 id="before-analysis-we-need-sutva">Before analysis, we need SUTVA.</h3>
<p>Before moving on to understanding the causal effects in depth, I will introduce two assumptions that are often employed to make the modeling simple. SUTVA stands for Stable Unit Treatment Value Assumption, and it consists of two assumptions about the data:</p>
<ul>
<li><script type="math/tex">Y_i(d_1,...,d_N) = Y_i(d_i)</script>. No Interference. “My outcome is only dependent on my treatment assignment and not others.” E.g. the amount of ice cream I bought will depend on whether I was targeted by the campaign but not on whether my friends were targeted by the campaign.</li>
<li><script type="math/tex">Y_i=Y_i(d)</script> if <script type="math/tex">D_i=d</script>. No Hidden Variations in Treatment. E.g. if I’m administered a medicine, I can only take it or not take it; I cannot take half the dose, take an older version of it that is less effective, etc.</li>
</ul>
<p>The SUTVA assumptions make sense in some cases, but not in others. For example, we should be skeptical of the no-interference assumption if the treatment of a unit’s friends, neighbors, family, etc. can affect that unit’s potential outcomes. The ice cream example can potentially violate no interference (and thus have a <em>spillover effect</em>).</p>
<p>The second assumption can break down if the treatments were not correctly standardized and prepared, as in the case of medicine doses.</p>
<p>However, to understand the most important concepts in causal inference that are useful for business and the social sciences, it is OK to assume SUTVA (for now).</p>
<h3 id="finally-we-can-talk-about-causal-effects">Finally, we can talk about causal effects!</h3>
<p>Under the SUTVA assumptions, we can finally define what we ultimately care about: the causal effects. First, what is the increase in my ice cream purchases if I’m exposed to the campaign? That is defined as the <em>Individual Treatment Effect</em> (ITE):</p>
<script type="math/tex; mode=display">\tau_i = Y_i(1)-Y_i(0)</script>
<p>This should be easy to understand. The difference in the outcome when I’m assigned treatment vs. control is my causal effect of the treatment. Next, let’s think about this at the population level. Say there are N users on the website. The <em>Average Treatment Effect</em> (ATE) is the ITE averaged over the population:</p>
<script type="math/tex; mode=display">E[\tau_i] = E[Y(1)]-E[Y(0)] = \frac{1}{N} \sum_{i=1}^N [Y_i(1)-Y_i(0)]</script>
<p>These two are probably the most important metrics in causal inference. In addition, we can define the treatment effect only for those who were treated, the Average Treatment effect on the Treated (ATT) <script type="math/tex">E[\tau_i \mid D_i=1]</script>, and the ATE for users with a specific set of covariates, <script type="math/tex">ATE(x) = E[\tau_i \mid X_i=x]</script>.</p>
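<p>A tiny God’s-eye-view sketch (hypothetical numbers of my own) makes these definitions and the Fundamental Problem concrete:</p>

```python
# God's-eye view: BOTH potential outcomes for four hypothetical users
# (login indicators under each button color).
y0 = [0, 1, 0, 1]  # outcome under control (blue button)
y1 = [1, 1, 0, 1]  # outcome under treatment (red button)
d = [1, 0, 1, 0]   # realized treatment assignment

ite = [a - b for a, b in zip(y1, y0)]  # individual treatment effects
ate = sum(ite) / len(ite)              # average treatment effect

# What we actually observe: exactly one potential outcome per unit.
y_obs = [y1[i] if d[i] else y0[i] for i in range(len(d))]
```

<p>In practice we never see both columns <code>y0</code> and <code>y1</code>; we only see <code>y_obs</code> and <code>d</code>, which is exactly why estimating the ATE requires the assumptions discussed above.</p>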
<h3 id="population-vs-sample">Population vs Sample</h3>
<p>The treatment effects I just defined above are over the population (e.g. all users on your website). But usually, it is hard to run experiments over the entire population. We need to sample a fraction of the population, <script type="math/tex">S</script> (a set of units of size <script type="math/tex">n_S</script>). In such cases, we define the <em>Sample Average Treatment Effect</em> (SATE):</p>
<script type="math/tex; mode=display">\tau_S = \frac{1}{n_S}\sum_{i\in S}[Y_i(1)-Y_i(0)]</script>
<p>In contrast to SATE, ATE defined above is also referred to as the Population Average Treatment Effect (PATE). In later posts, we’ll discuss if SATE is a valid estimator of PATE.</p>
<h3 id="are-these-notations-even-helpful">Are these notations even helpful?</h3>
<p>You might wonder, especially if you know AB tests well, whether these mathematical formalizations of the problem are of any help. To see that they are, consider a simple case where there’s a spillover effect. If users’ friends being in a campaign affects the users’ own ice cream purchases, the above treatment effects cannot be estimated from a simple comparison of outcomes between the two sample groups! In such a case, we need to assess the existence of spillover effects and incorporate them into our causal effect estimation. Otherwise, we can wrongly conclude that a campaign was successful or unsuccessful, or misjudge the degree of its success. Hence, being able to talk about experiments using the potential outcomes model described above is crucial for being certain that the effect you measured has a causal interpretation. If you’re not convinced yet, I hope to write more posts in the future about the cool techniques developed in causal inference that are useful everywhere in business and academia.</p>Kojin Oshibakojinoshiba@college.harvard.eduIf you’ve been working in the tech industry, or have thought about doing so, you’ve probably heard of AB testing. Some of you have even conducted one. The idea is as follows: say you own a website with a red login button. You’re thinking that you should change it to blue, since it looks too ugly. To determine which color you should use, you ask your users. You randomly sample 100 users and show 50 users the red button and 50 users the blue button. You measure the ratio of people who login to your website for each group, and see if there’s a big difference.Is Bell Curve really a Great Intellectual Fraud?2017-12-03T00:00:00+00:002017-12-03T00:00:00+00:00http://kojinoshiba.com/probability/bell-curve-is-important<h3 id="bell-curve--great-intellectual-fraud">Bell Curve = Great Intellectual Fraud?</h3>
<p>I recently read a New York Times best-seller titled “The Black Swan” by Nassim Nicholas Taleb. The book discusses how hard it is to predict rare events like Black Monday and 9/11. He claims that for predicting these events, statistical modeling is of no use. I am generally for the idea that current statistical modeling cannot handle outliers like the aforementioned. However, there was one part of the book which I had trouble agreeing with. Here is a summary of that passage, taken from wikipedia:</p>
<blockquote>
<p>Almost everything in social life is produced by rare but consequential shocks and jumps; all the while almost everything studied about social life focuses on the ‘normal,’ particularly with ‘bell curve’ methods of inference that tell you close to nothing. Why? Because the bell curve ignores large deviations, cannot handle them, yet makes us confident that we have tamed uncertainty. Its nickname in this book is GIF, Great Intellectual Fraud. <a href="https://en.wikipedia.org/wiki/Black_swan_theory#Background">Black Swan Theory (wikipedia)</a></p>
</blockquote>
<p>As a statistics student, I was offended to hear Taleb claim that the ‘bell curve’ is a Great Intellectual Fraud, and that this book is convincing many business people to think that way. Just because current statistical models failed to capture events like the financial crisis doesn’t mean the ‘bell curve’ is a fraud. In fact, the ‘bell curve’ has enormously enhanced our ability to understand the world. It is arguably the most important concept you can learn in probability, with many real-world applications aside from the prediction of rare events.</p>
<p>Hence, I decided to write about the importance of ‘bell curve’. I hope to convince my readers why it is definitely not an intellectual fraud.</p>
<h3 id="bell-curves-as-distributions">Bell curves as distributions.</h3>
<p>To prepare you for the next sections: what Taleb refers to as the ‘bell curve’ is called the ‘normal distribution’ or ‘Gaussian distribution’ by statisticians. I’ll occasionally use these terms to refer to the same concept. For the reasoning behind these names, please refer to <a href="https://www.quora.com/Why-is-the-Normal-distribution-called-Normal">this link</a>.</p>
<h3 id="entropy-as-a-measure-of-information-encoded">Entropy as a measure of information encoded</h3>
<p>Say you have a dataset in front of you, and you want to understand it. Any attempt to understand a dataset can be framed as fitting a “distribution”: there is always some underlying data generation process, and we hope to find a distribution that best models that process. Initially, you have no clue where to start, but you want to model the data somehow, just to get started. In such a situation, we hope to find a distribution that makes the fewest assumptions about the data.</p>
<p>So what is a good default distribution to start modeling the data with? How can we find a distribution that makes the fewest assumptions about the data? It seems hard to single out the best one, but in theory, we can find such a distribution. In fact, that distribution is the bell curve.</p>
<p>To formalize the argument, let’s define a measure of how much uncertainty a distribution encodes: entropy. Entropy is defined on distributions (and hence on random variables), and it quantifies how much uncertainty there is in the distribution. It is defined as follows:</p>
<script type="math/tex; mode=display">H(X) \equiv -\sum_{k=1}^K p(X=k) \log_2 p(X=k)</script>
<p>Note that for simplicity, I’m assuming that the distribution takes only finitely many (<script type="math/tex">K</script>) values. The higher the entropy, the fewer assumptions the distribution makes about the underlying data generation process. To see why this is the case, consider a coin flip where we don’t know the probability of landing heads.</p>
<h3 id="entropy-for-a-bernoulli-random-variable">Entropy for a Bernoulli random variable</h3>
<p>First, let’s naively assume that it is a fair coin. Let <script type="math/tex">X</script> be the indicator that the coin landed heads under this assumption. As noted in the previous post,</p>
<script type="math/tex; mode=display">X\sim Bern(\frac{1}{2})</script>
<p>Hence,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H(X) &= -(p(X=0)\log_2 p(X=0)+p(X=1)\log_2 p(X=1))\\
&= -2\cdot \frac{1}{2}\cdot \log_2 \frac{1}{2} \\
&= 1 \\
\end{align} %]]></script>
<p>Similarly, consider the other extreme assumption that the coin always lands heads. Let <script type="math/tex">Y</script> be the indicator that the coin lands heads. Now,</p>
<script type="math/tex; mode=display">Y\sim Bern(1)</script>
<p>Hence,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H(Y) &= -(p(Y=0)\log_2 p(Y=0)+p(Y=1)\log_2 p(Y=1))\\
&= -1\cdot \log_2 1 \\
&= 0 \\
\end{align} %]]></script>
<p>Consider the implications of these two distributions. When we model a coin flip as a fair coin flip, the entropy is high, meaning that we are making few assumptions about the coin. On the other hand, in the case of the always-heads coin, the entropy is low, meaning that we are making strong assumptions about the coin. This should make intuitive sense; no one, when seeing a coin for the first time, would assume that the coin always lands heads! For different values of <script type="math/tex">p</script>, the probability of the coin landing heads, the entropy is as follows:</p>
<figure>
<img src="/assets/images/2017-12-03-bell-curve-is-important/max_entropy.png" />
<figcaption>Entropy for different values of p.</figcaption>
</figure>
<p>Hence, when given a coin, if we want to make the fewest assumptions about <script type="math/tex">p</script>, <script type="math/tex">X\sim Bern(\frac{1}{2})</script> is the initial distribution we should assume, as it has the highest entropy.</p>
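<p>The curve above is easy to check numerically (a sketch; the function name is mine, and I use the convention <script type="math/tex">0\log_2 0 = 0</script>):</p>

```python
import math

def bernoulli_entropy(p):
    """Entropy of Bern(p) in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

grid = [k / 100 for k in range(101)]
entropies = [bernoulli_entropy(p) for p in grid]
best_p = grid[entropies.index(max(entropies))]  # maximized at p = 1/2
```

<p>Sweeping over the grid confirms that the entropy peaks at one bit, exactly at the fair coin.</p>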
<h3 id="bell-curves-maximize-entropy">‘Bell curves’ maximize entropy.</h3>
<p>In the example of a coin flip, the outcome was binary. What if the outcome were continuous and took values from <script type="math/tex">-\infty</script> to <script type="math/tex">\infty</script>? What is the distribution that maximizes the entropy?</p>
<p>The continuous version of entropy (also called differential entropy) is:</p>
<script type="math/tex; mode=display">H(X) \equiv -\int_{x} p(x) \log_2 p(x) dx</script>
<p>where <script type="math/tex">p(x)</script> is the <a href="https://en.wikipedia.org/wiki/Probability_density_function">PDF</a> of <script type="math/tex">X</script>.</p>
<p>We would like to maximize this subject to the constraint that the PDF integrates to 1:</p>
<script type="math/tex; mode=display">\int_{-\infty}^{\infty} p(x) dx = 1</script>
<p>This is a simple optimization problem which can be solved using a <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multiplier</a>. The objective is to maximize:</p>
<script type="math/tex; mode=display">-\int_{-\infty}^{\infty} p(x) \log_2 p(x) dx + \lambda_0\left(\int_{-\infty}^{\infty} p(x) dx - 1\right)</script>
<p>We actually need two more assumptions here: that the mean and the variance of the distribution are known. In other words, we consider a situation where we know nothing about the data BUT its mean and variance. Let them be <script type="math/tex">\mu</script> and <script type="math/tex">\sigma^2</script>. From the definitions of mean and variance,</p>
<script type="math/tex; mode=display">\int_{-\infty}^{\infty} xp(x) dx = \mu</script>
<script type="math/tex; mode=display">\int_{-\infty}^{\infty} (x-\mu)^2 p(x) dx = \sigma^2</script>
<p>Adding this as constraints to the optimization problem, the final objective to maximize becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& -\int_{-\infty}^{\infty} p(x) \log_2 p(x) dx \\
& +\lambda_0\left(\int_{-\infty}^{\infty} p(x) dx - 1\right) \\
& +\lambda_1\left(\int_{-\infty}^{\infty} xp(x) dx - \mu\right) \\
& +\lambda_2\left(\int_{-\infty}^{\infty} (x-\mu)^2 p(x) dx - \sigma^2\right)
\end{align} %]]></script>
<p>Solving this yields,</p>
<script type="math/tex; mode=display">p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}</script>
<p>which is exactly the PDF of a normal distribution!</p>
<p>Hence, we have that, given the mean and the variance of the data, the normal distribution is the one that makes the fewest assumptions about the data.</p>
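<p>As a sanity check of this result, we can compare the closed-form differential entropies (in bits) of a few distributions that share the same variance; the formulas below are standard, and the normal comes out on top:</p>

```python
import math

sigma2 = 1.0  # fix the variance; the mean doesn't affect differential entropy

# Closed-form differential entropies (in bits) of distributions with variance sigma2:
h_normal = 0.5 * math.log2(2 * math.pi * math.e * sigma2)   # N(mu, sigma2)
h_uniform = math.log2(math.sqrt(12 * sigma2))               # Uniform of width sqrt(12 * sigma2)
h_laplace = math.log2(2 * math.e * math.sqrt(sigma2 / 2))   # Laplace with variance sigma2
```

<p>The same ordering holds for any variance, which is exactly the max-entropy property derived above.</p>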
<h3 id="normal-is-not-gif">Normal is not GIF!</h3>
<p>What does this derivation of the normal distribution tell us about Taleb’s claim about the ‘bell curve’? It says that his claim doesn’t make sense! First of all, as we have seen, nowhere in statistics or information theory is it claimed that the normal distribution is what we should use all the time to model data. The result only says that when no information about the data is given (but the mean and the variance), the normal distribution makes as few assumptions as possible.</p>
<p>Of course, the fact that the normal distribution makes the fewest assumptions doesn’t mean we shouldn’t use it. On the contrary, it is a good starting point when confronted with a dataset, since it avoids injecting any bias we may have about the distribution of the data.</p>
<p>In the following posts, I would also like to introduce other important concepts involving the normal distribution, such as the central limit theorem, Kalman Filters and Gaussian Processes.</p>
<!-- ### Central Limit Theorem:
### Gaussian Processes: All distributions are mixtures of 'bell curves'.
### Kalman Filter: How 'bell curves' can track a space shuttle. -->Kojin Oshibakojinoshiba@college.harvard.eduBell Curve = Great Intellectual Fraud? I recently read a New York Times best-seller titled “The Black Swan” by Nassim Nicholas Taleb. The book discusses how hard it is to predict rare events like Black Monday and 9/11. He claims that for predicting these events, statistical modeling is of no use. I am generally for the idea that current statistical modeling cannot handle outliers like the aforementioned. However, there was one part of the book which I had trouble agreeing with. Here is a summary of that passage, taken from wikipedia:Must-know probability distributions from a single coin toss2017-11-22T00:00:00+00:002017-11-22T00:00:00+00:00http://kojinoshiba.com/probability/must-know-distributions<p>There are countless probability distributions. Some of them are so widely used and beautiful that they deserve a name. Surprisingly, all of those distributions can be derived starting from a single coin toss. Here’s the demonstration.</p>
<h3 id="consider-a-coin-toss">Consider a coin toss…</h3>
<p>Let’s start again with a coin toss. This time, think of a coin that lands heads with probability <script type="math/tex">p</script> and tails with probability <script type="math/tex">1-p</script>. This is called a Bernoulli distribution, and we write it as <script type="math/tex">Bern(p)</script>. Surprisingly, almost all important distributions we encounter in statistics and machine learning can be derived by combining this single coin toss somehow. Let’s start with a simple example.</p>
<h3 id="binomial-distribution">Binomial Distribution</h3>
<p>This is equivalent to tossing the same coin <script type="math/tex">n</script> times.
Namely, let <script type="math/tex">X_i \sim_{iid} Bern(p)</script>, i.e., coin tosses with the same probability of landing heads that are independent of each other. Then <script type="math/tex">Y</script>, the total number of heads in these <script type="math/tex">n</script> coin tosses, is:</p>
<script type="math/tex; mode=display">Y = \sum_{i=1}^n X_i \sim Bin(n,p)</script>
<p>This was a pretty straightforward example. Now, let’s move on to something a bit more complicated.</p>
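<p>A quick simulation sketch of this construction (the parameter values are my arbitrary choices):</p>

```python
import random

random.seed(0)
n, p = 20, 0.3

def binomial_draw(n, p):
    """Sum of n independent Bern(p) coin tosses."""
    return sum(random.random() < p for _ in range(n))

draws = [binomial_draw(n, p) for _ in range(50_000)]
mean = sum(draws) / len(draws)  # should be close to n * p
```

<p>The sample mean comes out close to <script type="math/tex">np</script>, as expected for <script type="math/tex">Bin(n,p)</script>.</p>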
<h3 id="uniform-distribution">Uniform Distribution</h3>
<p>If you know what a uniform distribution is, it might seem counterintuitive at first that it can be generated using coin tosses. But it can! This time, let <script type="math/tex">X_i \sim_{iid} Bern(\frac{1}{2})</script>, i.e., fair coin tosses. Then,</p>
<script type="math/tex; mode=display">U = \sum_{i=1}^{\infty} \frac{X_i}{2^i} \sim Unif(0,1)</script>
<p>Conceptually, we can generate a uniform distribution by infinitely many tosses of a fair coin. To see why, let’s think about the CDF of <script type="math/tex">Unif(0,1)</script>. Using the above expression of the uniform distribution, we hope to derive <script type="math/tex">P(U \leq u) = u</script>. To do so, think of <script type="math/tex">U</script> as a binary expansion <script type="math/tex">U=0.X_1X_2...</script>. Similarly, <script type="math/tex">u=0.u_1u_2...</script>. To see when <script type="math/tex">U \leq u</script>, we only need to check the first binary digit at which <script type="math/tex">X_i</script> and <script type="math/tex">u_i</script> differ (e.g. the comparison of <script type="math/tex">0.001010</script> and <script type="math/tex">0.000101</script> can be made by comparing the third digit). Conditioning on <script type="math/tex">J=j</script>, the digit at which the two numbers first differ, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
P(U\leq u) =P(U<u)= \sum_{j=1}^{\infty} P(U\leq u|J=j)P(J=j) %]]></script>
<p>The first equation follows from the fact that <script type="math/tex">P(U=u)=0</script>. The second equation is using <a href="https://en.wikipedia.org/wiki/Law_of_total_probability">the law of total probability</a>.</p>
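<p>We can check this construction empirically by truncating the binary expansion at a finite number of fair coin flips (a sketch of mine; 32 bits is an arbitrary truncation):</p>

```python
import random

random.seed(0)

def uniform_from_coins(bits=32):
    """U = sum of X_i / 2^i over i = 1..bits, with X_i iid fair coin flips."""
    return sum(random.getrandbits(1) / 2 ** i for i in range(1, bits + 1))

samples = [uniform_from_coins() for _ in range(50_000)]
# Empirical CDF check: P(U <= u) should be close to u.
frac_below_quarter = sum(s <= 0.25 for s in samples) / len(samples)
```

<p>The empirical CDF tracks the diagonal, and the sample mean sits near 1/2, consistent with <script type="math/tex">Unif(0,1)</script>.</p>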
<h3 id="exponential-distribution">Exponential Distribution</h3>
<p>Exponential distribution can be defined using a uniform distribution:</p>
<script type="math/tex; mode=display">X=-\log U\sim Expo(1)</script>
<p>where <script type="math/tex">U\sim Unif(0,1)</script>. Because the uniform distribution is generated using coin tosses, we can also think of exponential distribution as generated from them.</p>
<p>If you haven’t seen exponential distributions before, it is a distribution that comes up a lot in <a href="https://en.wikipedia.org/wiki/Poisson_point_process">Poisson Processes</a> (which is an important probability concept I’ll describe in another post). The most important thing to remember about the exponential distribution is the <em>memoryless property</em>:</p>
<script type="math/tex; mode=display">P(X>a+b \mid X>a)=P(X>b)</script>
<p>To have an intuitive understanding, think of <script type="math/tex">X</script> as a waiting time of a bus at a bus stop. <script type="math/tex">X</script> following an exponential distribution means that the time you have waited so far tells you nothing about the time you will wait until the bus comes (hence the term “memoryless”). This property is crucial in modeling natural phenomena such as radioactive particle decays in physics. Surprisingly, a <em>continuous</em> random variable is memoryless if and only if it has an exponential distribution (<a href="https://en.wikipedia.org/wiki/Memorylessness">proof</a>).</p>
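<p>Here is a small simulation sketch (my own; the values of <script type="math/tex">a</script> and <script type="math/tex">b</script> are arbitrary) of both the construction from a uniform and the memoryless property:</p>

```python
import math
import random

random.seed(0)
# X = -log(U) with U ~ Unif(0, 1); using 1 - random() keeps U strictly positive.
samples = [-math.log(1.0 - random.random()) for _ in range(200_000)]

def survival(t):
    """Empirical P(X > t)."""
    return sum(x > t for x in samples) / len(samples)

a, b = 0.5, 1.0
lhs = survival(a + b) / survival(a)  # P(X > a+b | X > a)
rhs = survival(b)                    # P(X > b)
```

<p>The two empirical probabilities agree, matching the memoryless property of the exponential distribution.</p>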
<h3 id="geometric-distribution">Geometric Distribution</h3>
<p>In contrast to the Exponential, a <em>discrete</em> random variable is memoryless if and only if it has a Geometric distribution. The Geometric can be obtained by taking the floor of an Exponential:</p>
<script type="math/tex; mode=display">G=\lfloor X \rfloor</script>
<p>where <script type="math/tex">X</script> is Exponential.</p>
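Concretely, <script type="math/tex">\lfloor Expo(\lambda) \rfloor</script> is Geometric starting at 0 with success probability <script type="math/tex">p = 1-e^{-\lambda}</script>. A sketch checking this (assuming numpy; note that numpy’s <code>exponential</code> takes a scale <script type="math/tex">1/\lambda</script>, not a rate):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.5
g = np.floor(rng.exponential(scale=1 / lam, size=200_000)).astype(int)

# floor(Expo(lam)) is Geometric on {0, 1, 2, ...} with p = 1 - exp(-lam)
p = 1 - np.exp(-lam)
for k in range(4):
    print(k, np.mean(g == k), (1 - p) ** k * p)  # empirical vs. theoretical pmf
```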
<h3 id="gamma-distribution">Gamma Distribution</h3>
<p>The Gamma distribution is just a sum (or <em>convolution</em>) of i.i.d. Exponentials. Let <script type="math/tex">X_i \sim_{iid} Expo</script>. Then,</p>
<script type="math/tex; mode=display">G_r = \sum_{i=1}^r X_i \sim Gamma(r)</script>
<p>Note that this only defines the Gamma distribution when <script type="math/tex">r</script> is a positive integer. A more general definition for real <script type="math/tex">r > 0</script> is left to other sources.</p>
<p>The Gamma distribution is important because it is the conjugate prior for the Poisson distribution (coming soon). Hence, it is important in Bayesian statistics. <a href="https://en.wikipedia.org/wiki/Inverse-gamma_distribution">The inverse of the Gamma distribution</a> is also widely used in Bayesian statistics to model the prior variance of a Normal distribution (again, coming soon). I know this is a lot of information; for now, think of the Gamma as a distribution derived from the Exponential that is important in Bayesian statistics.</p>
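The sum-of-Exponentials construction can be checked directly (a sketch assuming numpy): <script type="math/tex">Gamma(r)</script> has mean and variance both equal to <script type="math/tex">r</script>.

```python
import numpy as np

rng = np.random.default_rng(4)
r = 5

# G_r = X_1 + ... + X_r with X_i iid Expo(1) follows Gamma(r)
g = rng.exponential(size=(100_000, r)).sum(axis=1)

# Compare against numpy's direct Gamma(r) sampler: both near mean r, variance r
direct = rng.gamma(shape=r, size=100_000)
print(g.mean(), direct.mean())
print(g.var(), direct.var())
```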
<h3 id="chi-squared-distribution">Chi Squared Distribution</h3>
<p>The Chi Squared distribution is just a scaled special case of the Gamma distribution. Namely, the Chi Squared distribution with <script type="math/tex">n</script> degrees of freedom is twice a <script type="math/tex">Gamma(\frac{n}{2})</script>:</p>
<script type="math/tex; mode=display">W^2 \sim \chi^2_n \sim 2\,Gamma\left(\frac{n}{2}\right)</script>
<p>For now, don’t worry too much about the “<script type="math/tex">n</script> degrees of freedom” part (it just means there are <script type="math/tex">n</script> parameters we can vary). The Chi Squared distribution is often used in <a href="https://en.wikipedia.org/wiki/Chi-squared_test">hypothesis testing</a>. As noted in the next section, it has a close relationship with the more famous Normal distribution.</p>
<h3 id="normal-distribution">Normal Distribution</h3>
<p>The Normal distribution can be defined using the Chi distribution (which is just the square root of the Chi Squared distribution) and a random sign. A random sign is a random variable that takes the value <script type="math/tex">1</script> with probability <script type="math/tex">1/2</script> and <script type="math/tex">-1</script> with probability <script type="math/tex">1/2</script>. Let <script type="math/tex">W</script> be a Chi random variable with one degree of freedom and <script type="math/tex">S</script> a random sign. Then,</p>
<script type="math/tex; mode=display">Z=SW\sim N(0,1)</script>
<p><script type="math/tex">Z</script> is a standard normal, because it has mean <script type="math/tex">\mu=0</script> and variance <script type="math/tex">\sigma^2=1</script>. Any normal distribution can be obtained from <script type="math/tex">Z</script>:</p>
<script type="math/tex; mode=display">X=\mu+\sigma Z \sim N(\mu,\sigma^2)</script>
<p>Another relationship between the Normal and Chi Squared distributions is that the Chi Squared distribution is the sum of squares of i.i.d. standard normals. Since the sum of <script type="math/tex">n</script> i.i.d. <script type="math/tex">Gamma(1/2)</script> random variables is <script type="math/tex">Gamma(n/2)</script>,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
Z_1^2+...+Z_n^2 &=W_1^2+...+W_n^2 \quad (\because Z_i = S_iW_i \Rightarrow Z_i^2 = W_i^2) \\
&\sim 2\,Gamma(n/2) \\
&\sim \chi^2_n
\end{align} %]]></script>
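Both directions of this relationship can be simulated (a sketch assuming numpy): build a standard normal as sign times Chi(1), and build <script type="math/tex">\chi^2_n</script> as a sum of squared standard normals.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 200_000

# Z = S * W with S a random sign and W ~ Chi(1) gives a standard normal
w = np.sqrt(rng.chisquare(df=1, size=n_samples))
s = rng.choice([-1, 1], size=n_samples)
z = s * w
print(z.mean(), z.var())  # near 0 and 1

# Sum of n squared standard normals is Chi Squared with n degrees of freedom
n = 4
chi2 = (rng.standard_normal(size=(n_samples, n)) ** 2).sum(axis=1)
print(chi2.mean(), chi2.var())  # chi^2_n has mean n and variance 2n
```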
<h3 id="student-t-cauchy-log-normal">Student-t, Cauchy, Log Normal</h3>
<p>These are all distributions that can be derived using Normals and/or the Chi Squared. I will not go into depth on each one, but you will surely come across each of these as you study more statistics and machine learning.</p>
<ul>
<li>Student-t: <script type="math/tex">T=\frac{Z}{\sqrt{V_n/n}}</script> where <script type="math/tex">Z</script> is a standard normal and <script type="math/tex">V_n \sim \chi^2_n</script> is independent of <script type="math/tex">Z</script>. Widely used in <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">hypothesis testing</a>.</li>
<li>Cauchy: <script type="math/tex">Cauchy \sim \frac{Z_1}{Z_2}</script> where <script type="math/tex">Z_1,Z_2</script> are i.i.d. standard normals. Famous because its mean and variance are undefined.</li>
<li>Log Normal: <script type="math/tex">Y=e^X</script> where <script type="math/tex">X</script> is Normal. It’s just the exponential of a normal random variable, but widely used in modeling.</li>
</ul>
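All three constructions in one sketch (assuming numpy; the degrees of freedom <code>df = 5</code> is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
size = 100_000

z = rng.standard_normal(size)
v = rng.chisquare(df=5, size=size)   # independent of z

t = z / np.sqrt(v / 5)               # Student-t with 5 degrees of freedom
cauchy = z / rng.standard_normal(size)  # ratio of independent standard normals
y = np.exp(z)                        # Log Normal

# t_5 has mean 0 and variance n/(n-2) = 5/3
print(t.mean(), t.var())
# Log Normal with X ~ N(0,1) has mean exp(1/2)
print(y.mean(), np.exp(0.5))
# The Cauchy sample keeps producing huge outliers -- its mean is undefined
print(np.abs(cauchy).max())
```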
<h3 id="beta-distribution">Beta Distribution</h3>
<p>The Beta distribution is widely used in Bayesian statistics because it is the conjugate prior of the Bernoulli and Binomial distributions.</p>
<script type="math/tex; mode=display">B = \frac{G_a}{G_a+G_b}</script>
<p>where <script type="math/tex">G_a \sim Gamma(a)</script>, <script type="math/tex">G_b \sim Gamma(b)</script> and they are independent. As you can immediately see, the Beta distribution has a close connection with the Gamma distribution. A fun fact is that <script type="math/tex">\frac{X_1}{X_1+X_2}</script> and <script type="math/tex">X_1+X_2</script> are independent if and only if <script type="math/tex">X_1,X_2</script> are independent Gammas. In that case, <script type="math/tex">\frac{X_1}{X_1+X_2}</script> is Beta and <script type="math/tex">X_1+X_2</script> is also Gamma. If we think of <script type="math/tex">X_1</script> as the waiting time at a bus stop and <script type="math/tex">X_2</script> as the waiting time at a station, this tells us that the proportion of the total wait spent at the bus stop tells you nothing about the total time waited.</p>
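Both the construction and the fun fact can be checked by simulation (a sketch assuming numpy; <code>a = 2, b = 5</code> are arbitrary shape parameters):

```python
import numpy as np

rng = np.random.default_rng(7)
a, b = 2.0, 5.0
ga = rng.gamma(shape=a, size=200_000)
gb = rng.gamma(shape=b, size=200_000)

beta = ga / (ga + gb)   # Beta(a, b)
total = ga + gb         # Gamma(a + b)

# Beta(a, b) has mean a / (a + b)
print(beta.mean(), a / (a + b))
# The ratio and the total are independent, so their correlation is ~0
print(np.corrcoef(beta, total)[0, 1])
```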
<h3 id="poisson-distribution">Poisson Distribution</h3>
<p>Last but not least, the Poisson distribution is defined using the Exponential and a Poisson process. This is tricky, so let me explain it in depth. Think of buses arriving at a bus stop. Let <script type="math/tex">% <![CDATA[
0<T_1<T_2<... %]]></script> be their arrival times, with <script type="math/tex">T_n \sim Gamma(n,\lambda)</script>, or equivalently, <script type="math/tex">T_1,T_2-T_1,... \sim_{iid} Expo(\lambda)</script> (think about why). Then <script type="math/tex">N_t=\max\{n:T_n \leq t\}</script>, the number of bus arrivals up to time <script type="math/tex">t</script>, follows a Poisson distribution:</p>
<script type="math/tex; mode=display">N_t=\max\{n:T_n \leq t\} \sim Pois(\lambda t)</script>
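The bus-stop story translates directly into code (a sketch assuming numpy; <script type="math/tex">\lambda = 2</script> and <script type="math/tex">t = 3</script> are arbitrary test values): simulate arrival times as cumulative sums of i.i.d. Exponential gaps, count arrivals up to time <script type="math/tex">t</script>, and check the Poisson mean and variance.

```python
import numpy as np

rng = np.random.default_rng(8)
lam, t, n_sims = 2.0, 3.0, 50_000

# Arrival times T_n are cumulative sums of iid Expo(lam) gaps;
# 40 gaps per simulation is far more than enough since lam * t = 6
gaps = rng.exponential(scale=1 / lam, size=(n_sims, 40))
arrivals = np.cumsum(gaps, axis=1)
n_t = np.sum(arrivals <= t, axis=1)  # N_t = number of arrivals up to time t

# N_t ~ Pois(lam * t): mean and variance should both be lam * t = 6
print(n_t.mean(), n_t.var())
```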
<h3 id="thats-it">That’s it!</h3>
<p>It all started from a coin toss… and we’ve come this far. Perhaps by now you’ve forgotten about that coin toss at the beginning, but if you carefully track how the distributions relate to one another, you’ll see how a single coin toss magically opens the door to the world of probability distributions.</p>
<h1 id="what-is-probability">What is probability? (2017-11-12)</h1>
<p>We come across probability not just in statistics classrooms but also in real life. But have you thought about what probability really means? I would like to introduce you to a formal definition of probability.</p>
<h2 id="consider-a-coin-toss">Consider a coin toss…</h2>
<p>Let’s start with a simple example. Imagine tossing a fair coin where the outcome is either heads or tails. What is the probability of tossing heads?
You’re right, <script type="math/tex">0.5</script>. But why not <script type="math/tex">-0.7</script>, <script type="math/tex">20</script>, or <script type="math/tex">\frac{1024}{7}</script>? Well, the (somewhat boring) answer is: because it is defined as such. To give you a more in-depth answer, this post will introduce the formal way of defining what a probability is.</p>
<h2 id="the-world-of-omega-f-p">The world of <script type="math/tex">\Omega</script>, <script type="math/tex">F</script>, <script type="math/tex">P</script>.</h2>
<p>Probability, or more formally, a <strong>probability space</strong> is defined using three letters: <script type="math/tex">\Omega</script>, <script type="math/tex">F</script>, <script type="math/tex">P</script>. What are they? Let’s take a fair coin toss as an example.</p>
<h3 id="omega-is-a-sample-space"><script type="math/tex">\Omega</script> is a sample space.</h3>
<p>A sample space is a set of all the possible outcomes in a certain process. In a coin toss, a coin can only land heads (H) or tails (T), so there are the two only outcomes. So we have:</p>
<script type="math/tex; mode=display">\Omega = \{ H,T \}</script>
<p>Each element in <script type="math/tex">\Omega</script> is an <strong>outcome</strong> and is often referred to as <script type="math/tex">\omega</script> (lowercase omega).</p>
<h3 id="f-is-a-set-of-events"><script type="math/tex">F</script> is a set of ‘events’.</h3>
<p>First, let’s define an event. An <strong>event</strong> is a set of outcomes (possibly none, one, or several). For example:</p>
<ul>
<li><script type="math/tex">\{ H \}</script> is “an event that a coin lands heads”</li>
<li><script type="math/tex">\{ T \}</script> is “an event that a coin lands tails”</li>
<li><script type="math/tex">\{ H,T \}</script> is “an event that a coin lands heads or tails”</li>
<li><script type="math/tex">\{ \}</script> is “an event that a coin lands neither heads nor tails”</li>
</ul>
<p>In fact, these are the only events that are defined in a probability space of a coin toss!
A <strong>set of events</strong> is literally a set that contains all the possible events in a certain process. Referring to the four events above, we have:</p>
<script type="math/tex; mode=display">F = \{\{ \},\{ H \},\{ T \},\{ H,T \}\}</script>
<p>Note that <script type="math/tex">F</script> contains <script type="math/tex">\Omega=\{ H,T \}</script>. This is always the case in any probability space. To more formally define <script type="math/tex">F</script>, we need to introduce a more difficult concept called a <script type="math/tex">\sigma</script>-algebra, but I’ll leave that to future posts for now.</p>
<h3 id="p-is-a-probability-measure"><script type="math/tex">P</script> is a probability measure.</h3>
<p>A <strong>probability measure</strong> is a function that takes in an event as input, and spits out a probability of that event between 0 and 1. Formally, <script type="math/tex">P: F \rightarrow [0,1]</script>. In our example,</p>
<ul>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ H \})=0.5</script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ T \})=0.5</script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ H,T \})=1</script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ \})=0</script></span></li>
</ul>
<p>For <script type="math/tex">P</script> to be a probability measure, we need two more conditions (axioms).</p>
<ul>
<li><span class="tex2jax_ignore"><script type="math/tex">% <![CDATA[
\begin{align}&P(\Omega)=1\end{align} %]]></script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\bigcup_{j=1}^{\infty}A_j) = \sum_{j=1}^{\infty}P(A_j)</script></span> where <script type="math/tex">A_j</script> are disjoint.</li>
</ul>
<p>The first equation seems intuitive. Recall that <script type="math/tex">\Omega</script> contains all the outcomes that can possibly happen. This equation is saying that the probability of either one of the all possible outcomes happening is 1. <br />
<br />
The second equation looks mysterious, so let me break it down. First, the <script type="math/tex">A_j</script> being disjoint means that when one event happens, another event cannot happen. For example, <script type="math/tex">\{H\}</script> and <script type="math/tex">\{T\}</script> are disjoint, whereas <script type="math/tex">\{H\}</script> and <script type="math/tex">\{H,T\}</script> are not (because they overlap). Formally, two events <script type="math/tex">A,B</script> (or sets) are disjoint when <script type="math/tex">A \cap B=\{\}=\phi</script>. <script type="math/tex">\phi</script> is just a commonly used notation for the empty set.<br />
<br />
The equation is essentially saying that the probability that at least one of multiple disjoint events happens, <script type="math/tex">P(\bigcup_{j=1}^{\infty}A_j)</script>, is the same as the sum of the probabilities of the individual events, <script type="math/tex">\sum_{j=1}^{\infty}P(A_j)</script>. <br />
For example, in our case, since <script type="math/tex">\{H\}</script> and <script type="math/tex">\{T\}</script> are disjoint, it must be the case that</p>
<script type="math/tex; mode=display">P(\{H\}\cup\{T\})=P(\{H\})+P(\{T\})</script>
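The coin-toss probability space is small enough to write down in code. A minimal sketch (plain Python): <script type="math/tex">F</script> is the power set of <script type="math/tex">\Omega</script>, and <script type="math/tex">P</script> assigns each outcome probability <script type="math/tex">1/2</script>.

```python
from itertools import chain, combinations

# Omega = {H, T}; F = the power set of Omega
omega = frozenset({"H", "T"})
F = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

def P(event):
    """Probability measure: each outcome contributes probability 1/2."""
    return sum(0.5 for _ in event)

# The four events {}, {H}, {T}, {H,T} and their probabilities 0, 0.5, 0.5, 1
for event in F:
    print(sorted(event), P(event))

# Axiom 1: P(Omega) = 1
print(P(omega))
# Additivity on the disjoint events {H} and {T}
h, t = frozenset({"H"}), frozenset({"T"})
print(P(h | t) == P(h) + P(t))
```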
<h2 id="thats-it">That’s it!</h2>
<p>This is, in fact, all we need to define what a probability is. As a side note, physicists did try defining a negative probability. I’m not going into any details about it, but you can <a href="https://en.wikipedia.org/wiki/Negative_probability">read more about it</a> if interested.</p>
<h1 id="tech-harvard-undergrad">Job hunting for tech undergrads at Harvard (2017-10-27)</h1>
<p>I’m currently a junior, but since I took a year off, my close classmates are seniors, which means it’s job-hunting season. Here are some things I’ve noticed watching their job hunts. There aren’t many chances to learn about the job-hunting scene for tech undergrads in the US, so I hope this is useful.</p>
<h2 id="就職先">Where people end up</h2>
<p>Grouping students by level (in the sense of academic performance), the following trends emerge.</p>
<h3 id="超トップ層">The very top tier</h3>
<p>(1) Graduate school. The top programs in the US: Stanford, MIT, Carnegie Mellon</p>
<p>(2) Starting a company. Very few. Occasionally someone who was selected for the Thiel Fellowship or the YC Fellowship, or who raised funding from Founders Fund or Sequoia.</p>
<h3 id="トップ層">The top tier</h3>
<p>(1) Larger unicorns. Representative companies: Uber, Airbnb, Palantir</p>
<p>(2) Rare positions at big companies. Examples: Google APM, Apple ML Engineer</p>
<p>(3) Hot quant firms. Representative companies: Jane Street, Two Sigma, D. E. Shaw</p>
<p>(4) Hot startups and mid-size companies. Representative companies: varies</p>
<h3 id="中上位層">The upper-middle tier</h3>
<p>(1) Software engineer at a big company. Representative companies: Facebook > Google > Microsoft (in order of popularity)</p>
<p>(2) Decent quant firms.</p>
<h3 id="中下位下位層">The lower-middle to bottom tiers</h3>
<p>At this point they haven’t decided where to work yet, so I don’t know.</p>
<h2 id="就活時期">Timeline</h2>
<h3 id="１２年夏のインターン">Internships after freshman and sophomore years</h3>
<p>Normally, companies only take juniors as interns. However, big companies like Google and Facebook also offer internship opportunities to freshmen and sophomores. Students with a keen sense of direction who have already settled on a major apply to these. I know many people who realized through these internships that work at a big company is boring, and decided to start a company or join a mid-size company instead.</p>
<h3 id="３年夏のインターン">Junior summer internship</h3>
<p>Students intern at a company close to their intended post-graduation career. If you do solid work over the summer you get a return offer (the right to come back as a full-time employee), so joining the company you interned at is quite common.</p>
<h3 id="４年秋冬">Senior fall to winter</h3>
<p>Job hunting gets going around the time of the junior summer internship. Big companies finish their hiring process in October to November, mid-size companies by December to January. Startups hire year-round. Graduate school applicants start preparing around this time and apply in the winter.</p>
<h2 id="就活の方法">How people job hunt</h2>
<h3 id="大学のインタビュープログラム">The university’s interview program</h3>
<p>Big companies come to campus to run interviews. In tech that’s Google, Facebook, Palantir and others; in finance, GS and JP Morgan; in consulting, McKinsey and BCG. About 40% of students participate, and about half of them (so about 20% of all students) get offers this way.</p>
<h3 id="リファラル">Referrals</h3>
<p>Close friends refer each other to the companies where they interned during junior summer. With a referral you always get past the resume screen. Many people also get referred by recent graduates or by employees they met at events. Everyone is good at using their connections.</p>
<h3 id="オンライン">Online</h3>
<p>Plenty of people also apply online. It’s harder than the two routes above in the sense that you start on the same footing as every other applicant, but people with real ability and enthusiasm do get offers.</p>
<h2 id="給料">Salary</h2>
<p>Annual compensation, based on my rough impressions. Sites like Glassdoor have per-company numbers, so see those for details.</p>
<p>Graduate school: ¥4–5 million</p>
<p>Mid-size companies, unicorns, big companies: ¥10–15 million</p>
<p>Quant firms: ¥12–18 million</p>