<h1>Theoretically rigorous introduction to EM, Mixture of Gaussians and K-means</h1>
<p><em>Kojin Oshiba (kojinoshiba@college.harvard.edu), 2017-12-25</em></p>
<p>This post ties together EM (Expectation Maximization), GMM (Gaussian Mixture Models), K means and variational inference. If you have taken an introductory machine learning course and have learned these algorithms, but the connections between them are not yet clear, this is the post for you. If two or more of the above terms are unfamiliar, I suggest you read the Wikipedia page for each before reading this post.</p>
<h2 id="tldr">TL;DR:</h2>
<ul>
<li>EM is a variational inference algorithm to optimize the lower bound of the log likelihood.</li>
<li>GMM is a specific example of EM where the base distribution is MVN (multivariate normal).</li>
<li>K means is GMM where variance and cluster assignment probabilities are fixed and cluster assignments are hard.</li>
</ul>
<p>This post is more theoretical than the other ones, but I assure you that you’ll have a very deep understanding of the above algorithms by the end of it.</p>
<h2 id="theory-of-variational-inference">Theory of Variational Inference</h2>
<p>Let me first introduce the EM algorithm in a rather theoretical way. I found this theoretical introduction more “intuitive” than other attempts; I hope you will, too. In short, the goal of EM is to increase the likelihood, and it does so by constructing and maximizing a lower bound on it. Let <script type="math/tex">X=\{X_1,X_2,...,X_N\}</script> denote the data we have at hand. Then, the likelihood of a model parameterized by <script type="math/tex">\theta</script> is</p>
<script type="math/tex; mode=display">l(\theta)=p(X\lvert \theta)</script>
<p>Now let me introduce latent variables for cluster membership <script type="math/tex">Z=\{Z_1,Z_2,...,Z_N\}</script>. Here, each <script type="math/tex">Z_i</script> is a <script type="math/tex">K</script>-dimensional one-hot vector indicating which of the <script type="math/tex">K</script> clusters data point <script type="math/tex">i</script> belongs to. Then, using the law of total probability,</p>
<script type="math/tex; mode=display">l(\theta)=p(X\lvert \theta)=\sum_Z p(X,Z\lvert \theta)</script>
<p>The learning objective, as usual, is to maximize the log likelihood:</p>
<script type="math/tex; mode=display">argmax_{\theta} \ log \: l(\theta)=argmax_{\theta} \ log \: \sum_Z p(X,Z\lvert \theta)</script>
<p>This would have been easier if the log likelihood were of the form <script type="math/tex">\sum_Z log(p(X,Z\lvert \theta))</script>: with the sum outside the log, we could take the derivative easily. Life is not so easy here, since the sum is inside the log. <strong>How can we move the sum outside the log?</strong> This question motivates us to introduce Jensen’s Inequality:</p>
<blockquote>
<p>For any concave function <script type="math/tex">\phi</script>, <script type="math/tex">\phi(EX) \geq E\phi(X)</script>
<cite><a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s Inequality</a></cite></p>
</blockquote>
<p>Using Jensen’s Inequality, we can take the sum outside:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& log \ \sum_Z p(X,Z\lvert \theta) \\
& =log \ \sum_Z \frac{q(Z)}{q(Z)} p(X,Z\lvert \theta) \\
& =log \ \sum_Z q(Z) \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& =log \ E_{Z\sim q} \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& \geq E_{Z\sim q} \ log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& = \sum_Z q(Z) log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
& \equiv L(q,\theta)
\end{align} %]]></script>
<p>Jensen’s Inequality is used from line 4 to 5 to provide <script type="math/tex">L(q,\theta)</script> as the lower bound for <script type="math/tex">log \ \sum_Z p(X,Z\lvert \theta)</script>. You should be able to see that <script type="math/tex">L(q,\theta)</script> only has the sum outside the log.</p>
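<p>Jensen’s Inequality is easy to check numerically. The sketch below (assuming NumPy is available; the sample values are arbitrary) confirms that for the concave function <script type="math/tex">log</script>, the log of a mean dominates the mean of the logs:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=10_000)  # arbitrary positive samples

log_of_mean = np.log(np.mean(x))  # log(E[X])
mean_of_log = np.mean(np.log(x))  # E[log X]

# log is concave, so Jensen gives log(E[X]) >= E[log X]
assert log_of_mean >= mean_of_log
```

The gap between the two sides is exactly the slack that EM's lower bound leaves, and it closes only when the argument of the expectation is constant.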
<p>I abruptly introduced <script type="math/tex">q(Z)</script> without really explaining it. What is <script type="math/tex">q(Z)</script> here? We can think of <script type="math/tex">q(Z)</script> as a probability distribution we choose, which we wish to be as close as possible to the true posterior <script type="math/tex">p(Z\lvert X,\theta)</script>. To see why this is the case, consider the difference between the true model log likelihood and our lower bound approximation <script type="math/tex">L(q,\theta)</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& log \: l(\theta)-L(q,\theta) \\
&= log \: l(\theta)-\sum_Z q(Z) log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
&= \sum_Z q(Z) log \: l(\theta)-\sum_Z q(Z) log \ \frac{p(X,Z\lvert \theta)}{q(Z)} \\
&= \sum_Z q(Z) log \: l(\theta)-\sum_Z q(Z) log \ \frac{p(Z\lvert X,\theta)p(X\lvert \theta)}{q(Z)} \\
&= \sum_Z q(Z) (log \: p(X\lvert \theta)- log \: p(Z\lvert X,\theta) - log \: p(X\lvert \theta) + log \: q(Z)) \\
&= \sum_Z q(Z) log \: \frac{q(Z)}{p(Z\lvert X,\theta)} \\
&= KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))
\end{align} %]]></script>
<p>Note that from line 2 to 3, we are using the fact that <script type="math/tex">\sum_Z q(Z)=1</script>. To summarize, we have so far:</p>
<script type="math/tex; mode=display">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script>
<p>This equation explains why we want <script type="math/tex">q(Z)</script> to be as close to <script type="math/tex">p(Z\lvert X,\theta)</script> as possible. It’s because we want to have <script type="math/tex">KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script> smaller so as to make <script type="math/tex">L(q,\theta)</script> a tighter lower bound for <script type="math/tex">log \: l(\theta)</script>. To sum up, we have:</p>
<ul>
<li>Original goal: <script type="math/tex">argmax_{\theta} \ log \: l(\theta)</script></li>
<li>New goal: <script type="math/tex">argmax_{q,\theta} \ L(q,\theta)</script></li>
</ul>
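<p>The decomposition above is an exact identity, and we can verify it numerically. The sketch below (assuming NumPy; all parameter values are made up) builds a toy two-component mixture with a single observation and checks that <script type="math/tex">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script> holds for an arbitrary <script type="math/tex">q</script>:</p>

```python
import numpy as np

def normal_pdf(x, mu, var=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# toy 2-component mixture, one observation (all values are arbitrary)
x = 1.3
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])

joint = pi * normal_pdf(x, mu)          # p(x, Z=k | theta)
log_lik = np.log(joint.sum())           # log l(theta)
posterior = joint / joint.sum()         # p(Z | x, theta)

q = np.array([0.7, 0.3])                # an arbitrary q(Z)
L = np.sum(q * np.log(joint / q))       # lower bound L(q, theta)
KL = np.sum(q * np.log(q / posterior))  # KL(q || p(Z|x,theta))

assert np.isclose(log_lik, L + KL)      # log l = L + KL, exactly
assert L <= log_lik                     # L is a lower bound
```

Replacing `q` with `posterior` drives `KL` to zero and makes `L` equal `log_lik`, which is exactly the E step below.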
<h2 id="from-variational-inference-to-em">From Variational Inference to EM</h2>
<p>Now that we introduced the new optimization problem we care about, let’s actually solve it! This will yield the formula for the EM algorithm in the most general way possible. Since <script type="math/tex">L(q,\theta)</script> has two parameters <script type="math/tex">q</script> and <script type="math/tex">\theta</script>, let’s optimize the function w.r.t. one parameter at a time.</p>
<h3 id="e-step-argmax_q--lqtheta">E Step: <script type="math/tex">argmax_{q} \ L(q,\theta)</script></h3>
<p>Let’s maximize <script type="math/tex">L(q,\theta)</script> w.r.t. <script type="math/tex">q</script>. Since we know that <script type="math/tex">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script> and also that <script type="math/tex">log \: l(\theta)</script> is fixed and doesn’t depend on <script type="math/tex">q</script>, this is equivalent to <script type="math/tex">argmin_{q} \ KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script>. Since we know that the minimum value of Kullback–Leibler divergence is <script type="math/tex">0</script>, we want to set <script type="math/tex">q(Z)</script> such that:</p>
<script type="math/tex; mode=display">KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))=0 \Leftrightarrow q(Z)=p(Z\lvert X,\theta)</script>
<p>Thus, we have the update formula for the E step:</p>
<script type="math/tex; mode=display">q(Z)=p(Z\lvert X,\theta)</script>
<p>Because we are minimizing the KL divergence but the likelihood itself doesn’t change, <strong>E step is equivalent to making the lower bound tighter</strong>.</p>
<h3 id="m-step-argmax_theta--lqtheta">M Step: <script type="math/tex">argmax_{\theta} \ L(q,\theta)</script></h3>
<p>It is not so hard to maximize <script type="math/tex">L(q,\theta)</script> w.r.t. <script type="math/tex">\theta</script>. But the intuition behind M step is trickier. This is because <script type="math/tex">log \: l(\theta)</script> depends on <script type="math/tex">\theta</script> and thus we first need to understand why <script type="math/tex">\theta</script> that maximizes <script type="math/tex">L(q,\theta)</script> also increases <script type="math/tex">log \: l(\theta)</script>. To see this, recall</p>
<script type="math/tex; mode=display">log \: l(\theta)=L(q,\theta)+KL(q(Z)\lvert \lvert p(Z\lvert X,\theta))</script>
<p>In E step, we minimized the second term on the RHS (the KL divergence) to its smallest possible value, <script type="math/tex">0</script>. Hence, this term can only increase (or stay at zero) as we change <script type="math/tex">\theta</script> in M step. The first term on the RHS cannot decrease, because it is exactly what we maximize in M step. Hence, as a whole, <script type="math/tex">log \: l(\theta)</script> must also not decrease.</p>
<p>Now, since we are updating <script type="math/tex">\theta</script>, let’s rewrite the <script type="math/tex">q(Z)</script> obtained in the E step as <script type="math/tex">q(Z\lvert X,\theta^{old})</script>, to make explicit that it was computed with the old parameters. Using this notation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& L(q,\theta) \\
&= \sum_Z q(Z\lvert X,\theta^{old}) log \ \frac{p(X,Z\lvert \theta)}{q(Z\lvert X,\theta^{old})} \\
&= \sum_Z q(Z\lvert X,\theta^{old}) log \ p(X,Z\lvert \theta) - \sum_Z q(Z\lvert X,\theta^{old}) log \ q(Z\lvert X,\theta^{old}) \\
&= Q(\theta,\theta^{old}) + const
\end{align} %]]></script>
<p>where <script type="math/tex">Q(\theta,\theta^{old})\equiv \sum_Z q(Z\lvert X,\theta^{old}) log \ p(X,Z\lvert \theta)</script>.
Hence,</p>
<script type="math/tex; mode=display">argmax_{\theta} \ L(q,\theta)=argmax_{\theta} \ Q(\theta,\theta^{old})</script>
<p><script type="math/tex">Q(\theta,\theta^{old})</script> depends on how we parametrize the model; GMM is one such case and is worked out below. We now have the update rule of the M step. In contrast to the E step, <strong>M step is equivalent to pushing the log likelihood itself higher</strong>.</p>
<h2 id="gmm-as-em">GMM as EM</h2>
<p>Now, we will show that GMM is a specific example of EM. To see this, let</p>
<ul>
<li><script type="math/tex">X=\{X_1,X_2,...,X_N\}</script> be the data.</li>
<li><script type="math/tex">Z=\{Z_1,Z_2,...,Z_N\}</script> be the cluster membership.</li>
<li><script type="math/tex">\theta=\{\mu_1,...,\mu_K,\Sigma_1,...,\Sigma_K,\pi_1,...,\pi_K\}</script> be the parameters. <script type="math/tex">\mu_k,\Sigma_k</script> are the parameters of the <script type="math/tex">k</script>-th Gaussian, and <script type="math/tex">\pi_k</script> are the prior mixture probabilities.</li>
</ul>
<p>The whole model is a mixture of MVN, as follows:
<script type="math/tex">p(X,Z\lvert \theta)=\prod_{n=1}^N \prod_{k=1}^K \pi_k^{Z_{nk}} N(X_n\lvert \mu_k, \Sigma_k)^{Z_{nk}}</script></p>
<h3 id="e-step-qzpzlvert-xtheta">E Step: <script type="math/tex">q(Z)=p(Z\lvert X,\theta)</script></h3>
<p><script type="math/tex">% <![CDATA[
\begin{align}
& q(Z) \\
&=p(Z\lvert X,\theta) \\
&=\frac{p(X,Z\lvert \theta)}{p(X\lvert \theta)} \\
&=\frac{\prod_{n=1}^N \prod_{k=1}^K \pi_k^{Z_{nk}} N(X_n\lvert \mu_k, \Sigma_k)^{Z_{nk}}}{\prod_{n=1}^N \sum_{k=1}^K \pi_k N(X_n\lvert \mu_k, \Sigma_k)}
\end{align} %]]></script></p>
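<p>Because each <script type="math/tex">Z_n</script> is one-hot, this posterior factorizes over data points, and in practice the E step reduces to computing the responsibilities <script type="math/tex">\gamma(Z_{nk})=\frac{\pi_k N(X_n\lvert \mu_k,\Sigma_k)}{\sum_j \pi_j N(X_n\lvert \mu_j,\Sigma_j)}</script>. A minimal 1-D sketch (assuming NumPy; the data and parameter values are made up):</p>

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def e_step(X, pi, mu, var):
    """Responsibilities gamma[n, k] = p(Z_nk = 1 | x_n, theta)."""
    # unnormalized: pi_k * N(x_n | mu_k, var_k), broadcast over n and k
    weighted = pi[None, :] * normal_pdf(X[:, None], mu[None, :], var[None, :])
    return weighted / weighted.sum(axis=1, keepdims=True)

X = np.array([-2.1, -1.9, 0.1, 2.0, 2.2])
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
var = np.array([1.0, 1.0])

gamma = e_step(X, pi, mu, var)
assert np.allclose(gamma.sum(axis=1), 1.0)  # each row is a distribution over k
```

Points near a component's mean get responsibility close to 1 for that component; the ambiguous point at 0.1 is split between the two.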
<h3 id="m-step-argmax_theta--qthetathetaold">M Step: <script type="math/tex">argmax_{\theta} \ Q(\theta,\theta^{old})</script></h3>
<p><script type="math/tex">% <![CDATA[
\begin{align}
& Q(\theta,\theta^{old}) \\
&= \sum_Z q(Z\lvert X,\theta^{old}) log \ p(X,Z\lvert \theta) \\
&= E_{Z\sim q(\cdot \lvert X,\theta^{old})} log \ p(X,Z\lvert \theta) \\
&= E_{Z\sim q(\cdot \lvert X,\theta^{old})} \sum_{n=1}^N \sum_{k=1}^K Z_{nk} (log \ \pi_k + log \ N(X_n\lvert \mu_k, \Sigma_k)) \\
\end{align} %]]></script>
Let <script type="math/tex">E_{Z\sim q(\cdot \lvert X,\theta^{old})}Z_{nk}=\gamma(Z_{nk})</script>. Then,
<script type="math/tex">% <![CDATA[
\begin{align}
Q(\theta,\theta^{old}) &= \sum_{n=1}^N \sum_{k=1}^K \gamma(Z_{nk}) (log \ \pi_k + log \ N(X_n\lvert \mu_k, \Sigma_k))
\end{align} %]]></script></p>
<p>Since we have the constraint that <script type="math/tex">\sum_k \pi_k =1</script> (the cluster membership probability sums to 1), the ultimate optimization problem becomes:</p>
<script type="math/tex; mode=display">argmax_{\theta} \ Q'(\theta)=Q(\theta,\theta^{old})+\lambda(\sum_k \pi_k - 1)</script>
<p>Since <script type="math/tex">\theta=\{\mu_1,...,\mu_K,\Sigma_1,...,\Sigma_K,\pi_1,...,\pi_K\}</script>, we can solve this for <script type="math/tex">\pi_k,\mu_k,\Sigma_k</script> one at a time, writing <script type="math/tex">N_k \equiv \sum_{n=1}^N \gamma(Z_{nk})</script> for the effective number of points assigned to cluster <script type="math/tex">k</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{\partial Q'}{\partial \pi_k} = 0 \Leftrightarrow \pi_k = \frac{N_k}{N} \\
& \frac{\partial Q'}{\partial \mu_k} = 0 \Leftrightarrow \mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(Z_{nk})X_n \\
& \frac{\partial Q'}{\partial \Sigma_k} = 0 \Leftrightarrow \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(Z_{nk})(X_n-\mu_k)(X_n-\mu_k)^T \\
\end{align} %]]></script>
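<p>These three updates translate directly into code. A 1-D sketch (assuming NumPy; the toy data and responsibilities are made up, with hard, well-separated clusters so the expected results are obvious):</p>

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form GMM updates for 1-D data, given responsibilities gamma[n, k]."""
    N_k = gamma.sum(axis=0)                      # effective cluster sizes
    pi = N_k / len(X)                            # pi_k = N_k / N
    mu = (gamma * X[:, None]).sum(axis=0) / N_k  # responsibility-weighted means
    var = (gamma * (X[:, None] - mu[None, :]) ** 2).sum(axis=0) / N_k
    return pi, mu, var

# hard, perfectly separated responsibilities (made-up toy data)
X = np.array([0.0, 2.0, 10.0, 12.0])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

pi, mu, var = m_step(X, gamma)
assert np.allclose(pi, [0.5, 0.5])
assert np.allclose(mu, [1.0, 11.0])
assert np.allclose(var, [1.0, 1.0])
```

Alternating `e_step` and `m_step` until the log likelihood stops improving is the full GMM-EM algorithm.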
<h2 id="k-means-as-gmm">K means as GMM</h2>
<p>It is straightforward to see that K means is a specific instance of GMM. In GMM, assume the following:</p>
<ul>
<li>Shared spherical covariance: <script type="math/tex">\Sigma_k = \sigma^2 I_D</script></li>
<li>Constant membership probability: <script type="math/tex">\pi_k=\frac{1}{K}</script></li>
<li>Hard membership assignment: <script type="math/tex">Z_{nk}=
\begin{cases}
1, if \quad k=argmax_j \ N(X_n\lvert \mu_j,\Sigma_j)\\
0, otherwise
\end{cases}</script></li>
</ul>
<p>With these three additional assumptions, we have the K means! To see this, let’s take a look at the E step and the M step.</p>
<h3 id="e-step-qzpzlvert-xtheta-1">E Step: <script type="math/tex">q(Z)=p(Z\lvert X,\theta)</script></h3>
<p>Since the membership assignment is hard, for each point we want to find the cluster that maximizes its likelihood:
<script type="math/tex">% <![CDATA[
\begin{align}
&argmax_k \ N(X_n\lvert \mu_k,\Sigma_k) \\
&= argmax_k \ exp(-\frac{\lVert X_n-\mu_k\rVert^2}{2\sigma^2}) \\
&= argmin_k \ \lVert X_n-\mu_k\rVert^2
\end{align} %]]></script></p>
<p>Hence we see that E step is equivalent to assigning points to the nearest centroid.</p>
<h3 id="m-step-argmax_theta--qthetathetaold-1">M Step: <script type="math/tex">argmax_{\theta} \ Q(\theta,\theta^{old})</script></h3>
<p>We only need to update <script type="math/tex">\mu_k</script> according to the update rule derived above:</p>
<script type="math/tex; mode=display">\mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(Z_{nk})X_n</script>
<p>This is equivalent to taking the mean of data points in each cluster.</p>
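<p>Putting the two steps together gives the familiar K means loop. A minimal 1-D sketch (assuming NumPy; the data and initial centroids are made up, and the empty-cluster edge case is ignored):</p>

```python
import numpy as np

def kmeans(X, mu, n_iter=20):
    """Lloyd's algorithm: the hard-assignment limit of GMM EM (1-D sketch)."""
    for _ in range(n_iter):
        # E step: assign each point to the nearest centroid
        labels = np.argmin((X[:, None] - mu[None, :]) ** 2, axis=1)
        # M step: each centroid becomes the mean of its assigned points
        mu = np.array([X[labels == k].mean() for k in range(len(mu))])
    return mu, labels

X = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
mu, labels = kmeans(X, mu=np.array([0.5, 9.0]))
assert np.allclose(sorted(mu), [1.0, 11.0])
```

Note that, compared with full GMM-EM, neither <script type="math/tex">\pi_k</script> nor <script type="math/tex">\Sigma_k</script> is updated, exactly because the K means assumptions fix them.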
<h2 id="summary">Summary</h2>
<p>I hope you were able to see the theoretical derivation of EM and that GMM and K means are both specific instances of EM. To read more about this material, I suggest you refer to the following:</p>
<ul>
<li>Information Theory, Inference, and Learning Algorithms (MacKay) Chapter 20, 22, 33.7</li>
<li>Machine Learning: A Probabilistic Perspective (Murphy) Chapter 11.4</li>
<li>Pattern Recognition and Machine Learning (Bishop) Chapter 9</li>
</ul>
<h1>Six Statistics Terms That Will Change How You See Life (人生の見方が変わる統計用語６選)</h1>
<p><em>2017-12-19</em></p>
<p>For me, the greatest joy of studying statistics is the way statistical thinking pays off in everyday decision making and in my career. To share that joy with as many people as possible, here are six statistical concepts, each with an example of how it can be useful in life. I have mixed in a few ideas from machine learning, computer science, and mathematics as well.</p>
<h2 id="1-平均と偏差-mean-and-variance">1. Mean and variance (平均と偏差)</h2>
<p>“Company A or Company B, which should I join… Company A’s average may be lower, but the variance is higher, so if I choose the right department I could grow like crazy!”</p>
<p>You probably already know what a mean is. Variance, familiar in Japan through <em>hensachi</em> (standardized test scores), measures how spread out the data are.
For example, when choosing a school, how capable the other students are is an important signal, but you should look not only at their average ability but also at its variance (spread). Since you will only ever interact with a handful of the students, the larger the variance, the better your chances of finding a truly high-quality community among them.
American college students’ math skills are often compared with those of Japanese students; my sense is that the former have a lower mean but a larger variance. Indeed, the friends I now spend time with are frighteningly good at math, so I suspect I have found my way into roughly the top 10% of the community. That is only a gut feeling, of course.</p>
<p>Even outside clearly ranked qualities like ability, a high-variance environment probably makes it easier to find a place where you belong.
<img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/variance.png" alt="image-center" class="align-center" /></p>
<h2 id="2-ジップの法則-zipfs-law">2. Zipf’s Law (ジップの法則)</h2>
<p>“If I work my way up I might reach an annual salary of 10 million yen, but I can’t hope for exponential income growth… Let me think about how to grow my income nonlinearly!”</p>
<p>Have you seen the <a href="https://www.cnn.co.jp/business/35095041.html">news</a> that the eight richest people own as much wealth as the bottom 50%? Inequality has not always been quite this extreme, but this kind of concentration of wealth in a tiny elite has occurred in every era, because incomes grow explosively as you move up the distribution. Such heavily skewed, power-law relationships are known as Zipf’s Law, and they appear everywhere, not just in wealth. Word frequencies in books follow Zipf’s Law: the most common words, such as the particles <em>te-ni-wo-ha</em> and pronouns like <em>watashi</em> and <em>boku</em>, occur overwhelmingly more often than all the others. Twitter follower counts, Facebook Likes, company headcounts — all sorts of things follow Zipf’s Law.</p>
<p>Perhaps because of how school trains us, we tend to assume things grow linearly. That is why, when we learn that wealth is concentrated in a small elite, we may feel more outrage than is warranted. Seen through Zipf’s Law, linear self-improvement will rarely put you far ahead of others in your career. The gap between 10 million and 100 million yen a year is huge in money but not so large as a fraction of the population; plodding along on the same track leaves you in roughly the same relative position. Whether you care about that is a separate question. My point is that if you can come to feel that Zipf-like growth is as natural as linear growth, many things will come into view.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/zipfs.png" alt="image-center" class="align-center" /></p>
<h2 id="3-過学習-overfitting">3. Overfitting (過学習)</h2>
<p>“I studied so hard for the test and still couldn’t score… wait, maybe I’m overfitting!”</p>
<p>If you breeze through the practice problems but stumble when the real exam presents a slightly different pattern, you should suspect overfitting. Overfitting means fitting a model so tightly to the data at hand that the model stops being useful on newly arriving data.</p>
<p>You were unbeatable at university entrance exams but can’t perform once you join a company; a business that thrived in Japan fails overseas; you switched partners in the tennis club and suddenly became much weaker; the approach that made one girl swoon does nothing for the next. In cases like these, the skills you built may have overfit to your past experience.</p>
<p>In statistics we handle this by regularizing the model (making it less sensitive to the data) or by gathering more training data. The same applies to life: when learning from experience, aim for lessons that generalize, and steer yourself toward as wide a variety of experiences as you can.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/overfitting.png" alt="image-center" class="align-center" /></p>
<h2 id="4-平均への回帰-regression-to-the-mean">4. Regression to the mean (平均への回帰)</h2>
<p>“The new Southern All Stars album was underwhelming, but maybe it’s just regression to the mean. I’ll stay a fan a while longer.”</p>
<p>Have you ever been let down by a favorite artist’s new album? Or found that a friend, A, who attended an elite university went on to a less brilliant life than you expected? This need not mean that the artist’s or A’s ability has declined. Observed data generally contain noise, and that noise is random: sometimes it pushes the result up, sometimes down. Perhaps the album that first hooked you on the artist (say, their second) carried positive noise that made them look better than they really are. If so, even if their ability has grown since, their third album may simply have carried negative noise. Statisticians call this phenomenon regression to the mean.</p>
<p>We tend to place high expectations on the children of accomplished parents, or to assume that because a stock trade went well, future trades will too, but such thinking runs against regression to the mean. Keeping it in mind lets you manage your expectations more intelligently.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/regression_mean.png" alt="image-center" class="align-center" /></p>
<h2 id="5-探索と活用-exploration-and-exploitation">5. Exploration and exploitation (探索と活用)</h2>
<p>“College students are still in the exploration phase of life. Rather than rushing to exploit, let me try all sorts of things.”</p>
<p>These are terms from reinforcement learning, a branch of computer science. Reinforcement learning is the technology behind things like self-driving cars and shogi engines: by letting an artificial intelligence try out many different patterns, we make it steadily better (we “reinforce” it). Typically it starts with exploration: in shogi, for instance, it tries all kinds of moves simply to learn how likely each is to win. As the AI matures, it shifts to exploitation, playing mostly the moves it has learned tend to win.</p>
<p>If you think of your own brain as the AI and learning from experience as reinforcement learning, it becomes clear that exploration-versus-exploitation thinking matters for our decisions too. When should you switch from broadening your general education to sharpening a specialty? Concretely, should you attend a vocational school or a general university? Casting the question as exploration versus exploitation organizes it neatly. The more reinforcement learning you study, the more deliberately you can weigh the exploration/exploitation trade-off in your own life decisions.</p>
<h2 id="6-全体最適と極所最適-local-optima-and-global-optima">6. Local and global optima (局所最適と全体最適)</h2>
<p>“My sales numbers have plateaued lately. Maybe I’m stuck in a local optimum, so let me try a completely new approach.”</p>
<p>One look at the picture below makes this clear: the point marked “local optima” sits on a small hill. Climbing the small hill does not let you walk straight onto the big one; you must first descend and then climb toward the big hill. In other words, merely continuing to improve the status quo does not necessarily lead to the best possible outcome. Sometimes you need to try something radically new. This idea comes from the field of mathematical optimization.</p>
<p>If your productivity has stopped improving, your relationship has stopped deepening, or A/B tests no longer lift your KPIs the way they used to, then aim for the global optimum and have the courage to walk down the small hill you are standing on.</p>
<p><img src="http://kojinoshiba.com/assets/images/2017-12-19-six-statistics-words/optima.png" alt="image-center" class="align-center" /></p>
<p>There are many more terms I regularly reach for when thinking things through, such as causation versus correlation, selection bias, recall and precision, and undefined (primitive) terms, but I will stop here for now. May statistics make your life even a little bit better!</p>
<h1>Multivariate Normal Cheatsheet</h1>
<p><em>2017-12-14</em></p>
<p>The multivariate normal (MVN) is used everywhere in machine learning, from simple regressions, linear discriminant analysis, and Kalman filters to Gaussian processes. Yet very few textbooks summarize its important characteristics concisely. So here they are.</p>
<h2 id="definition">Definition</h2>
<p><script type="math/tex">\mathbf{Y} \sim N_k(\mathbf{\mu},\mathbf{V})</script> if <script type="math/tex">Y=A\mathbf{Z}+\mathbf{\mu}</script>.</p>
<ul>
<li><script type="math/tex">\mathbf{Y}</script>: <script type="math/tex">k</script> dimensional vector</li>
<li><script type="math/tex">\mathbf{\mu}</script>: <script type="math/tex">k</script> dimensional mean vector</li>
<li><script type="math/tex">\mathbf{V}</script>: <script type="math/tex">k \times k</script> dimensional covariance matrix. It is positive semi-definite: <script type="math/tex">\mathbf{x'Vx}\geq 0</script> for all <script type="math/tex">\mathbf{x}</script>.</li>
<li><script type="math/tex">A</script>: <script type="math/tex">k \times m</script> dimensional matrix. <script type="math/tex">\mathbf{V} = AA'</script></li>
<li><script type="math/tex">\mathbf{Z}=(Z_1,...,Z_m)</script> where <script type="math/tex">Z_i \sim_{iid} N(0,1)</script></li>
</ul>
<h2 id="equivalent-definition">Equivalent Definition</h2>
<p><script type="math/tex">\mathbf{Y} \sim N_k(\mathbf{\mu},\mathbf{V})</script> if all linear combinations of <script type="math/tex">\mathbf{t'Y}</script> are univariate normal, i.e.</p>
<p><script type="math/tex">\mathbf{t'Y} \sim N(\mathbf{t'\mu},\mathbf{t'Vt})</script> for any <script type="math/tex">\mathbf{t}</script>.</p>
<h2 id="pdf-and-mgf">PDF and MGF</h2>
<ul>
<li>PDF: <script type="math/tex">f(\mathbf{y}) = \frac{1}{(2\pi)^\frac{k}{2}\lvert \mathbf{V}\rvert^\frac{1}{2}}exp(-\frac{1}{2}(\mathbf{y-\mu})'\mathbf{V}^{-1}(\mathbf{y-\mu}))</script></li>
<li>MGF: <script type="math/tex">M_{\mathbf{Y}}(\mathbf{t}) = E(e^{\mathbf{t'Y}}) = exp(\mathbf{t'\mu}+\frac{1}{2}\mathbf{t'Vt})</script></li>
</ul>
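<p>As a sanity check on the PDF formula, note that when <script type="math/tex">\mathbf{V}</script> is diagonal the density must factor into a product of independent univariate normals. A sketch (assuming NumPy; the evaluation point and parameters are arbitrary):</p>

```python
import numpy as np

def mvn_pdf(y, mu, V):
    """Density of N_k(mu, V), written directly from the formula above."""
    k = len(mu)
    diff = y - mu
    quad = diff @ np.linalg.inv(V) @ diff
    norm_const = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(V))
    return np.exp(-0.5 * quad) / norm_const

def normal_pdf(x, m, var):
    return np.exp(-(x - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# With a diagonal V the joint density is the product of the marginals
mu = np.array([1.0, -2.0])
V = np.diag([4.0, 9.0])
y = np.array([0.5, 0.0])

expected = normal_pdf(0.5, 1.0, 4.0) * normal_pdf(0.0, -2.0, 9.0)
assert np.isclose(mvn_pdf(y, mu, V), expected)
```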
<h2 id="linear-transformations-of-mvn">Linear Transformations of MVN</h2>
<p>If <script type="math/tex">\mathbf{Y} \sim N_k(\mathbf{\mu},\mathbf{V})</script>,</p>
<ul>
<li>
<script type="math/tex; mode=display">\mathbf{X} = B\mathbf{Y} + \mathbf{b} \sim N(B\mathbf{\mu} + \mathbf{b},B\mathbf{V}B')</script>
</li>
<li>
<script type="math/tex; mode=display">\mathbf{a'Y} \sim N(\mathbf{a'\mu},\mathbf{a'Va})</script>
</li>
</ul>
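<p>Both facts can be checked by simulation: sample <script type="math/tex">\mathbf{Y}</script>, apply the affine map, and compare the empirical mean and covariance with <script type="math/tex">B\mathbf{\mu}+\mathbf{b}</script> and <script type="math/tex">B\mathbf{V}B'</script>. A sketch (assuming NumPy; the particular <script type="math/tex">B</script>, <script type="math/tex">\mathbf{b}</script>, and parameters are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
V = np.array([[2.0, 0.5], [0.5, 1.0]])

# Y ~ N_2(mu, V); X = B Y + b should have mean B mu + b and covariance B V B'
Y = rng.multivariate_normal(mu, V, size=200_000)
B = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([3.0, -3.0])
X = Y @ B.T + b

assert np.allclose(X.mean(axis=0), B @ mu + b, atol=0.05)
assert np.allclose(np.cov(X.T), B @ V @ B.T, atol=0.1)
```

The tolerances are loose because the check is against Monte Carlo estimates, not exact values.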
<h2 id="within-mvn">Within MVN…</h2>
<p>Let <script type="math/tex">Y=\begin{pmatrix}
\mathbf{Y_1} \\
\mathbf{Y_2}
\end{pmatrix} \sim N_{k_1+k_2}(\mathbf{\mu},\mathbf{V})</script>, where <br />
<script type="math/tex">\mathbf{\mu}=\begin{pmatrix}
\mu_1 \\
\mu_2
\end{pmatrix}</script> and <script type="math/tex">\mathbf{V}=\begin{pmatrix}
\mathbf{V_{11}} & \mathbf{V_{12}} \\
\mathbf{V_{21}} & \mathbf{V_{22}}
\end{pmatrix}</script></p>
<p>Then,</p>
<ul>
<li>Uncorrelatedness implies independence: <script type="math/tex">\mathbf{V_{12}}=0 \Leftrightarrow \mathbf{Y_1} \perp \mathbf{Y_2}</script></li>
<li>Marginal: <script type="math/tex">\mathbf{Y_1} \sim N(\mu_1,\mathbf{V_{11}})</script> (and likewise <script type="math/tex">\mathbf{Y_2} \sim N(\mu_2,\mathbf{V_{22}})</script>)</li>
<li>Conditional: <script type="math/tex">\mathbf{Y_2}\lvert \mathbf{Y_1} \sim N(\mu_{2\cdot 1},\mathbf{V_{22\cdot 1}})</script>, where <script type="math/tex">\mu_{2\cdot 1}=\mu_2+\mathbf{V_{21}}\mathbf{V_{11}}^{-1}(\mathbf{Y_1}-\mu_1)</script>, <script type="math/tex">\mathbf{V_{22\cdot 1}}=\mathbf{V_{22}}-\mathbf{V_{21}}\mathbf{V_{11}}^{-1}\mathbf{V_{12}}</script>.</li>
</ul>
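<p>The conditional distribution can be coded directly from the block partition; note the Schur-complement minus sign in the conditional covariance. A sketch (assuming NumPy; the bivariate example with correlation 0.8 is made up so the answer is easy to check by hand):</p>

```python
import numpy as np

def mvn_conditional(mu, V, k1, y1):
    """Parameters of Y2 | Y1 = y1 for Y = (Y1, Y2) ~ N(mu, V); Y1 has dim k1."""
    mu1, mu2 = mu[:k1], mu[k1:]
    V11, V12 = V[:k1, :k1], V[:k1, k1:]
    V21, V22 = V[k1:, :k1], V[k1:, k1:]
    W = V21 @ np.linalg.inv(V11)
    cond_mean = mu2 + W @ (y1 - mu1)
    cond_cov = V22 - W @ V12  # Schur complement: the minus sign matters
    return cond_mean, cond_cov

mu = np.array([0.0, 0.0])
V = np.array([[1.0, 0.8], [0.8, 1.0]])  # unit variances, correlation 0.8
m, c = mvn_conditional(mu, V, k1=1, y1=np.array([1.0]))
assert np.isclose(m[0], 0.8)             # E[Y2 | Y1 = 1] = rho * y1
assert np.isclose(c[0, 0], 1 - 0.8 ** 2) # Var shrinks to 1 - rho^2
```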
<h2 id="joint-distribution">Joint Distribution</h2>
<p><script type="math/tex">Y\lvert \theta \sim N_k(\theta,A_1),\theta\sim N_k(\mu,A_2)</script> then <script type="math/tex">(Y,\theta)\sim N_{2k}(\begin{pmatrix}
\mu \\
\mu
\end{pmatrix},\begin{pmatrix}
A_1+A_2 & A_2 \\
A_2 & A_2
\end{pmatrix})</script></p>
<h1>What regression coefficients really mean</h1>
<p><em>2017-12-11</em></p>
<p>There is nothing in statistics that is as easy to use as regression yet as hard to interpret correctly. In causal inference especially, even the coefficients of a linear regression are hard to interpret. Stop a random statistics student on campus and ask them to interpret a fitted linear regression model, and you may find them struggling to give a crisp explanation of what it captures. This is because, often, the coefficients are not a causal effect: they also contain what is known as selection bias, which I’ll explain in this post.</p>
<h3 id="the-goal-of-regression--learn-cef">The goal of regression = learn CEF.</h3>
<p>CEF stands for conditional expectation function. The CEF describes how much <script type="math/tex">Y</script> changes as <script type="math/tex">X</script> changes. More formally, the CEF is defined as</p>
<script type="math/tex; mode=display">\mu(x) = E[Y\lvert X=x]</script>
<p>In short, the goal of any regression is to estimate <script type="math/tex">\hat{\mu}(x)</script>.</p>
<h3 id="regressions-can-be-parametric-or-non-parametric">Regressions can be parametric or nonparametric.</h3>
<p>There are two ways to estimate the CEF. One is parametric: we approximate the CEF with a function of some parameters and learn the parameter values that best fit the CEF.</p>
<script type="math/tex; mode=display">\hat{\mu}(x) = g_{\theta}(x)</script>
<p>Here, <script type="math/tex">\theta</script> is the parameter we would like to estimate. For example, in linear regressions, <script type="math/tex">\theta</script> will be the coefficients for the features.</p>
<p>On the other hand, there’s a non parametric approach to the problem, where we directly estimate <script type="math/tex">\mu(x)</script>. This includes methods like random forest and other decision tree algorithms.</p>
<h3 id="regression-does-not-imply-causality">Regression does not imply causality.</h3>
<p>As we shall see, CEF is not causal. To see this, let’s think about CEF in the case of an experiment with binary treatment. In this case, CEF must be linear, since the model is saturated. Namely, treatment <script type="math/tex">D</script> can only take on two possible values, and hence a simple linear model can capture everything about the changes in <script type="math/tex">D</script>. Hence, we have:</p>
<script type="math/tex; mode=display">\mu(d) = E[Y\lvert D=d] = \beta_0 + \beta D</script>
<p>This encodes how much <script type="math/tex">Y</script> changes as <script type="math/tex">D</script> changes.</p>
<p>Since</p>
<p><script type="math/tex">E[Y\lvert D=0] = \beta_0</script>
and
<script type="math/tex">E[Y\lvert D=1] = \beta_0 + \beta</script></p>
<p>we can write these two in a more concise notation:</p>
<script type="math/tex; mode=display">E[Y\lvert D] = E[Y\lvert D=0] + D (E[Y\lvert D=1] - E[Y\lvert D=0])</script>
<p>This is our CEF. The reason I used <script type="math/tex">\beta=E[Y\lvert D=1] - E[Y\lvert D=0]</script> in my previous post about the experiment is that, in the randomized experiment case, the ATE is exactly this CEF difference. In general, however, <script type="math/tex">\beta</script> will not be a causal effect. Let’s see what <script type="math/tex">\beta</script> consists of:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta &= E[Y\lvert D=1] - E[Y\lvert D=0] \\
&= E[Y(1)\lvert D=1]-E[Y(0)\lvert D=0] (\because \text{SUTVA})\\
&= E[Y(1)\lvert D=1]-E[Y(0)\lvert D=1]+E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0] \\
&= ATT + E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0] \\
&= ATE + (ATT-ATE) + (E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0])
\end{align} %]]></script>
<p>Before explaining what this decomposition means, note that if the unconfoundedness assumption holds, <script type="math/tex">\beta</script> will reduce to ATE, as shown in the previous post.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta &= E[Y\lvert D=1] - E[Y\lvert D=0] \\
&= E[Y(1)\lvert D=1]-E[Y(0)\lvert D=0] (\because \text{SUTVA})\\
&= E[Y(1)]-E[Y(0)] (\because \text{unconfoundedness}) \\
&= ATE
\end{align} %]]></script>
<p>This is why running a linear regression on a binary treatment experiment is enough to capture the causal effect.</p>
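<p>This identity is easy to verify by simulation: with a binary treatment, the OLS slope on <script type="math/tex">D</script> equals the difference in group means exactly, because the model is saturated. A sketch (assuming NumPy; the outcome model is made up):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
D = rng.integers(0, 2, size=n).astype(float)      # randomized binary treatment
Y = 1.0 + 2.0 * D + rng.normal(0.0, 1.0, size=n)  # made-up outcome model

# OLS of Y on an intercept and D
Xmat = np.column_stack([np.ones(n), D])
coef, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)

# In the saturated binary case, the slope IS the difference in group means
diff_in_means = Y[D == 1].mean() - Y[D == 0].mean()
assert np.isclose(coef[1], diff_in_means)
```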
<p>Now let’s get back to the more general case. We have</p>
<script type="math/tex; mode=display">\beta = ATE + (ATT-ATE) + (E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0])</script>
<p>What does this mean? It clearly means the coefficient of the regression is not just ATE. It consists of two more terms:</p>
<ul>
<li><script type="math/tex">E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0]</script>: Type I Selection Bias. This arises when <script type="math/tex">Y(0) \not\perp D</script>, meaning that the default outcome under control differs between those actually treated and those who are not.</li>
<li><script type="math/tex">ATT-ATE</script>: Type II Selection Bias. This arises when <script type="math/tex">Y(1)-Y(0) \not\perp D</script>, meaning that the benefit of the treatment differs between those actually treated and those who are not.</li>
</ul>
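<p>The full decomposition is itself an identity that can be checked by simulation. The sketch below (assuming NumPy; the data-generating process is made up) builds potential outcomes with an unobserved confounder so that both selection biases are positive, and verifies <script type="math/tex">\beta = ATE + (ATT-ATE) + (E[Y(0)\lvert D=1]-E[Y(0)\lvert D=0])</script>:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
ability = rng.normal(0, 1, size=n)            # unobserved confounder
Y0 = ability + rng.normal(0, 1, size=n)       # potential outcome without treatment
Y1 = Y0 + 1.0 + 0.5 * ability                 # effect is larger for high ability
D = ability + rng.normal(0, 1, size=n) > 0    # selection into treatment
Y = np.where(D, Y1, Y0)                       # observed outcome (SUTVA)

beta = Y[D].mean() - Y[~D].mean()             # naive difference in means
ATE = (Y1 - Y0).mean()
ATT = (Y1 - Y0)[D].mean()
type1 = Y0[D].mean() - Y0[~D].mean()          # Type I selection bias
type2 = ATT - ATE                             # Type II selection bias

assert np.isclose(beta, ATE + type2 + type1)  # the decomposition is exact
assert type1 > 0 and type2 > 0                # both biases inflate beta here
```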
<h3 id="example-regression-for-measuring-the-effect-going-to-college">Example: Regression for measuring the effect of going to college</h3>
<p>Selection biases might be hard to grasp in the abstract, so let me illustrate with an example. Again, think of a study measuring the effect of going to college on future income. This time, as is usually the case, we cannot randomly assign students to go to college, so this is an example of what is known as an observational study. What do the selection biases say?</p>
<p>If the Type I bias is positive, it means that even before entering college, the students who will enter college have a higher earning potential. This makes intuitive sense: those who enroll in college may come from a better socioeconomic background, be more motivated to advance their careers, and so on.</p>
<p>If the Type II bias is positive, it means that those who go to college gain more from it than those who don’t would have. In other words, even if those who skipped college had attended, their income would not have risen as much as it did for those who actually went.</p>
<p>Hence, these two biases can prevent <script type="math/tex">\beta</script> from capturing the true “going to college” effect averaged over both those who did and those who didn’t go to college.</p>
<h3 id="summary-how-to-interpret-regression-coefficients">Summary: how to interpret regression coefficients</h3>
<p>When you run a linear regression on a binary treatment outside a randomized experiment, don’t jump to the conclusion that the coefficient captures the true causal effect. If the regression says that going to college raises future income by $1000 a year, look for the two selection biases. It might be that those who went to college would have earned $200 more than those who didn’t even without attending (Type I), and that they also gained $100 more from attending than the others would have (Type II). In that case, the true “going to college” effect is $700.</p>
<h1>Formal definition of an experiment</h1>
<p><em>2017-12-09</em></p>
<p>Science experiments, social experiments, thought experiments… We use the word “experiment” fairly often in everyday life. But have you ever wondered what experiments really are? Probably not… But don’t close the page yet! It is actually quite interesting to learn how statisticians formally define experiments, and that is what this post is about. By the end, you will be able to tell your friends running a thought experiment that what they are doing is, statistically speaking, not a valid experiment. What great knowledge to have!</p>
<h3 id="jumping-right-in-the-definition-of-an-experiment">Jumping right in, the definition of an experiment.</h3>
<p>Recall the previous post about causal inference. In any scientific study, we care about the following three types of variables:</p>
<ul>
<li><script type="math/tex">X_i</script>: Pretreatment covariates.</li>
<li><script type="math/tex">D_i</script>: Treatment.</li>
<li><script type="math/tex">Y_i</script>: Observed outcome.</li>
</ul>
<p>Given these variables, let <script type="math/tex">p_i</script> be the probability of unit i receiving a treatment given its covariates and potential outcomes. Formally,</p>
<script type="math/tex; mode=display">p_i = P(D_i=1 \mid X,Y(0),Y(1))</script>
<p>Then an experiment can be defined quite simply; it is a study where <script type="math/tex">p_i</script> is controlled by and known to the researcher. In other words, if the experimenter can decide whether to assign treatment or control to each unit, it is an experiment. For example, AB testing is an experiment since we are controlling how to allocate users to different groups.</p>
<h3 id="randomized-experiments">Randomized Experiments</h3>
<p>The most important type of experiment is the randomized experiment. The intuitive way to understand it is, it is any experiment where units are randomly assigned to treatment or control, according to <script type="math/tex">p_i</script>. More rigorously, a randomized experiment should satisfy the following three conditions:</p>
<ul>
<li><script type="math/tex">% <![CDATA[
0 < p_i < 1 %]]></script>, non-deterministic. Each unit has some chance (even if very small) of being assigned to treatment/control.</li>
<li><script type="math/tex">p_i \perp Y_{-i}, X_{-i} \forall i</script>, individualistic. The treatment assignment probability doesn’t depend on the covariates and potential outcomes of other units.</li>
<li><script type="math/tex">p_i \perp Y_{i}(0), Y_i(1) \forall i</script>, unconfounded. The treatment assignment probability is independent of the unit’s own potential outcomes.</li>
</ul>
<p>The first two assumptions should make intuitive sense. The last one needs a bit of thought and, in fact, it is the most crucial for characterizing a randomized experiment. Why are experiments not randomized if the treatment assignment is confounded? To see this, think of an experiment for measuring the effect of going to college on future income. Suppose you can randomly assign high school grads to go / not go to college (this is unethical in practice). If the experiment is confounded, it means that the probability of a student being assigned to go to college depends on their future income (supposing we are God and know their future). In the likely scenario, those who will earn more will have a higher probability (<script type="math/tex">p_i</script>) of being assigned to go to college. Consider the difference between the treated and the control in this scenario. Will it have a causal interpretation, in the sense that it captures the causal effect of going to college on income? The answer is no. This is because we can’t tell whether the difference in future income is due to the students who went to college already being smart, or due to the college education making them more capable of earning. Hence, unconfoundedness is crucial for deriving causal effects from randomized experiments.</p>
<p>Let me rephrase this argument in a more rigorous way.</p>
<p>Recall from the previous post that one of the metrics that we care about is the Average Treatment Effect:</p>
<script type="math/tex; mode=display">E[\tau_i] = E[Y(1)]-E[Y(0)] = \frac{1}{N} \sum_{i=1}^N [Y_i(1)-Y_i(0)]</script>
<p>We cannot directly obtain <script type="math/tex">E[Y(1)]-E[Y(0)]</script> since we can only observe one of the potential outcomes for each unit. On the other hand, what is available is the following:</p>
<script type="math/tex; mode=display">\beta = E[Y|D=1]-E[Y|D=0]</script>
<p>I used the character <script type="math/tex">\beta</script> for a reason I’ll explain in the next post about causal inference. This is the average outcome for the treated minus the average outcome for the control. My claim here is that only under the unconfoundedness assumption does <script type="math/tex">\beta</script> have a causal interpretation. It’s actually simple math:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\beta &= E[Y|D=1]-E[Y|D=0] \\
&= E[Y(1)|D=1]-E[Y(0)|D=0] (\because \text{SUTVA})\\
&= E[Y(1)]-E[Y(0)] (\because \text{unconfoundedness}) \\
&= ATE
\end{align} %]]></script>
<p>I’m using the second SUTVA assumption (<script type="math/tex">Y_i=Y_i(d)</script>) from the first to the second line. From the second to the third line, I’m using the fact that if <script type="math/tex">p_i \perp Y_{i}(0), Y_i(1)</script> (unconfounded) then <script type="math/tex">E[Y(1) \mid D=1] = E[Y(1)]</script> and <script type="math/tex">E[Y(0) \mid D=0] = E[Y(0)]</script>.</p>
<p>Hence, we have that <em>the difference in <script type="math/tex">Y</script> between treatment and control equals the ATE if the unconfoundedness assumption holds</em>. This is a very important result in causal inference, and it is why randomized experiments are such a convenient type of experiment.</p>
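<p>To make this concrete, here is a small simulation sketch (my own toy example, with made-up potential outcomes) showing that the difference in means recovers the ATE under a randomized assignment but not under a confounded one:</p>

```python
import random

random.seed(0)
N = 100_000

# Hypothetical potential outcomes: Y(0) is a baseline, and the
# treatment adds a constant effect of 10 to everyone, so ATE = 10.
y0 = [random.gauss(50, 10) for _ in range(N)]
y1 = [y + 10 for y in y0]
true_ate = sum(a - b for a, b in zip(y1, y0)) / N

def diff_in_means(assign):
    treated = [y1[i] for i in range(N) if assign[i]]
    control = [y0[i] for i in range(N) if not assign[i]]
    return sum(treated) / len(treated) - sum(control) / len(control)

# Randomized (unconfounded): p_i = 1/2, independent of the potential outcomes.
randomized = [random.random() < 0.5 for _ in range(N)]
beta_randomized = diff_in_means(randomized)

# Confounded: units with a high Y(0) are more likely to be treated.
confounded = [random.random() < (0.8 if y0[i] > 50 else 0.2) for i in range(N)]
beta_confounded = diff_in_means(confounded)
```

<p>With the randomized assignment, the difference in means lands near the true effect; with the confounded assignment, it is badly inflated, because high earners are over-represented among the treated.</p>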
<h3 id="examples-of-randomized-experiments">Examples of Randomized Experiments</h3>
<p>Last but not least, let me introduce four types of randomized experiments. Some are not as widely used as the others, but they all have the above property of <script type="math/tex">\beta=ATE</script> and hence you should consider these when running your AB tests:</p>
<ul>
<li>Bernoulli Trials: Each unit is assigned to treatment with the same probability <script type="math/tex">p</script>.</li>
<li>Completely Randomized Experiments: Randomly sample <script type="math/tex">n_t</script> units and assign them to treatment. AB tests are typically completely randomized experiments.</li>
<li>Stratified Randomized Experiments: Subgroup units based on the covariates <script type="math/tex">X</script>. Within each subgroup, run a completely randomized experiment.</li>
<li>Paired Randomized Experiments: Special case of stratified randomized experiments with two samples in each group.</li>
</ul>
<p>That’s it! Now you should be confident about what experiments are, and why AB tests are so powerful for finding causal effects. On the other side of the spectrum, there are observational studies, where <script type="math/tex">p_i</script> is unknown. Current research in causal inference is very much focused on observational studies, as experiments are already well studied. I’ll write about observational studies in the coming posts.</p>Kojin Oshibakojinoshiba@college.harvard.eduScience experiments, social experiments, thought experiments, … We use the word “experiment” somewhat often in real life. But have you ever wondered what experiments really are? Probably not… But don’t close the page yet! It is actually quite interesting to learn about how statisticians formally define experiments. This post is about that. If you read this post, you will be able to tell your friends having a thought experiment that what they are doing is, statistically speaking, not a valid experiment. What great knowledge to have!The Theory behind AB Testing: Introduction to Causal Inference2017-12-07T00:00:00+00:002017-12-07T00:00:00+00:00http://kojinoshiba.com/causal%20inference/theory-behind-ab-testing<p>If you’ve been working in the tech industry, or have thought about doing so, you’ve probably heard of AB testing. Some of you have even conducted one. The idea is as follows: say you own a website with a red login button. You’re thinking that you should change it to blue, since it looks too ugly. To determine which color you should use, you ask your users. You randomly sample 100 users and show 50 users the red button and 50 users the blue button. You measure the ratio of people who login to your website for each group, and see if there’s a big difference.</p>
<p>This is an example of a <em>completely randomized experiment</em> (CRE). In CRE, you first sample <em>units</em> (users), randomly assign a fraction of people to <em>treatment</em> (the red button) vs <em>control</em> (the blue button) and measure the difference in the <em>outcome</em> (the login rate).</p>
<p>Here, we can intuitively see that the difference in the login rate is caused by the different colors of the login button. In other words, the difference in the login rate has a causal interpretation.</p>
<p>On the other hand, imagine a situation where you started a campaign on your website, say a 10% off sale for all ice creams sold on your website. To see the effect of your campaign, you compare the total number of ice creams sold before and after the campaign. You saw a 20% increase in the sales volume. “The campaign was a success!”, you conclude. But can you really call the campaign a success?</p>
<p>Not really. For a simple example, imagine that the campaign was running for two months in June and July. In May, you had $1 million in monthly revenue, and after the campaign, in August, you have $1.2 million. Now consider this question: when are users more likely to buy ice creams, in May or in August? If you live in a certain part of the northern hemisphere like me (and so do the users), they would be more likely to buy ice creams in August. So, it is hard to tell if that $0.2 million increase in sales came from the campaign or from the fact that it was summer after the campaign was over. In such a scenario, the difference in the revenue does NOT have a causal interpretation: we don’t know if it was due to the campaign.</p>
<h3 id="causal-inference-is-the-theory-behind-ab-testing">Causal inference is the theory behind AB testing.</h3>
<p>When can we say that the increase in the login rate was <em>caused</em> by the change in the login button color? When can we say that the increase in revenue was <em>caused</em> by the campaign conducted? In general, when can we learn that something <em>caused</em> something else?</p>
<p>This is what causal inference is all about. Causal inference is a field for understanding the causal relationships between different events.</p>
<p>Let me formalize the notation used in causal inference. Let <script type="math/tex">i</script> index the units. Units are the users in the above example.</p>
<ul>
<li><script type="math/tex">X_i</script>: Pretreatment covariates. These can be age, gender, registration rate, past purchase history of the users. It is often a <script type="math/tex">d</script> dimensional vector where <script type="math/tex">d</script> is the number of features about each user.</li>
<li><script type="math/tex">D_i</script>: Treatment. For the above example, <script type="math/tex">D_i=1</script> if the button that user <script type="math/tex">i</script> is shown is red and <script type="math/tex">D_i=0</script> if blue. <script type="math/tex">D_i</script> is often binary, but can also take multiple, or even continuous, values.</li>
<li><script type="math/tex">Y_i</script>: Observed outcome. In the above examples, this can be an indicator of whether user <script type="math/tex">i</script> logged in or the total amount of ice cream user <script type="math/tex">i</script> bought.</li>
</ul>
<h3 id="potential-outcomes-define-what-are-fundamentally-unknown">Potential Outcomes define what is fundamentally unknown.</h3>
<p>In the real world, each unit can receive only one of the two treatments. If a user sees the blue button, they cannot see the red one, and vice versa. While <script type="math/tex">Y_i</script> can take on two values, one under treatment <script type="math/tex">Y_i(1)</script> and one under control <script type="math/tex">Y_i(0)</script>, we can observe only one of the two. Hence, <script type="math/tex">Y_i(1)</script> and <script type="math/tex">Y_i(0)</script> are called the potential outcomes. The fact that we can only observe one of the two (or of many, in the case with more than two treatments) is so fundamental that it is called the “Fundamental Problem of Causal Inference”.</p>
<h3 id="before-analysis-we-need-sutva">Before analysis, we need SUTVA.</h3>
<p>Before moving on to understanding the causal effects in depth, I will introduce two assumptions that are often employed to make the modeling simple. SUTVA stands for Stable Unit Treatment Value Assumption, and it consists of two assumptions about the data:</p>
<ul>
<li><script type="math/tex">Y_i(d_1,...,d_N) = Y_i(d_i)</script>. No Interference. “My outcome is only dependent on my treatment assignment and not others.” E.g. the amount of ice cream I bought will depend on whether I was targeted by the campaign but not on whether my friends were targeted by the campaign.</li>
<li><script type="math/tex">Y_i=Y_i(d)</script> if <script type="math/tex">D_i=d</script>. No Hidden Variations in Treatment. E.g. if I’m administered a medicine, I can only take it or not take it; I cannot take half the dose, take an older version of it that is less effective, etc.</li>
</ul>
<p>The SUTVA assumptions make sense in some cases, but not in others. For example, we should be skeptical of the no-interference assumption if the treatment of a unit’s friends, neighbors, family, etc. can affect that unit’s potential outcomes. The ice cream example can potentially violate no interference (and thus have a <em>spillover effect</em>).</p>
<p>The second assumption can break down if the treatments were not correctly standardized and prepared, as in the case of medicine doses.</p>
<p>However, to understand the most important concepts in causal inference that are useful for business and the social sciences, it is OK to assume SUTVA (for now).</p>
<h3 id="finally-we-can-talk-about-causal-effects">Finally, we can talk about causal effects!</h3>
<p>Under the SUTVA assumptions, we can finally define what we ultimately care about: the causal effects. First, what is the increase in my ice cream purchases if I’m exposed to the campaign? That is defined as the <em>Individual Treatment Effect</em> (ITE):</p>
<script type="math/tex; mode=display">\tau_i = Y_i(1)-Y_i(0)</script>
<p>This should be easy to understand. The difference in the outcome when I’m assigned treatment vs. control is my causal effect of the treatment. Next, let’s think about this at the population level. Say there are N users on the website. The <em>Average Treatment Effect</em> (ATE) is the ITE averaged over the population:</p>
<script type="math/tex; mode=display">E[\tau_i] = E[Y(1)]-E[Y(0)] = \frac{1}{N} \sum_{i=1}^N [Y_i(1)-Y_i(0)]</script>
<p>These two are probably the most important metrics in causal inference. In addition, we can define the treatment effect only for those who were treated, the Average Treatment effect on the Treated (ATT) <script type="math/tex">E[\tau_i \mid D_i=1]</script>, and the ATE for users with a specific set of covariates, <script type="math/tex">ATE(x) = E[\tau_i \mid X_i=x]</script>.</p>
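<p>A tiny God’s-eye-view sketch (hypothetical numbers of my own) makes these definitions and the Fundamental Problem concrete:</p>

```python
# God's-eye view: BOTH potential outcomes for four hypothetical users
# (login indicators under each button color).
y0 = [0, 1, 0, 1]  # outcome under control (blue button)
y1 = [1, 1, 0, 1]  # outcome under treatment (red button)
d = [1, 0, 1, 0]   # realized treatment assignment

ite = [a - b for a, b in zip(y1, y0)]  # individual treatment effects
ate = sum(ite) / len(ite)              # average treatment effect

# What we actually observe: exactly one potential outcome per unit.
y_obs = [y1[i] if d[i] else y0[i] for i in range(len(d))]
```

<p>In practice we never see both columns <code>y0</code> and <code>y1</code>; we only see <code>y_obs</code> and <code>d</code>, which is exactly why estimating the ATE requires the assumptions discussed above.</p>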
<h3 id="population-vs-sample">Population vs Sample</h3>
<p>The treatment effects I just defined above are over the population (e.g. all users on your website). But usually, it is hard to run experiments over the entire population. We need to sample a fraction of the population, <script type="math/tex">S</script> (a set of units of size <script type="math/tex">n_S</script>). In such cases, we define the <em>Sample Average Treatment Effect</em> (SATE):</p>
<script type="math/tex; mode=display">\tau_S = \frac{1}{n_S}\sum_{i\in S}[Y_i(1)-Y_i(0)]</script>
<p>In contrast to SATE, ATE defined above is also referred to as the Population Average Treatment Effect (PATE). In later posts, we’ll discuss if SATE is a valid estimator of PATE.</p>
<h3 id="are-these-notations-even-helpful">Are these notations even helpful?</h3>
<p>You might wonder, especially if you know AB tests well, whether these mathematical formalizations of the problem are of any help. To see that they are, consider a simple case where there’s a spillover effect. If users’ friends being in a campaign affects the users’ own ice cream purchases, the above treatment effects cannot be estimated from a simple comparison of outcomes between the two sample groups! In such a case, we need to assess the existence of spillover effects and incorporate them into our causal effect estimation. Otherwise, we can wrongly conclude that a campaign was successful or unsuccessful, or misjudge the degree of its success. Hence, being able to talk about experiments using the potential outcomes model described above is crucial for being certain that the effect you measured has a causal interpretation. If you’re not convinced yet, I hope to write more posts in the future about the cool techniques developed in causal inference that are useful everywhere in business and academia.</p>Kojin Oshibakojinoshiba@college.harvard.eduIf you’ve been working in the tech industry, or have thought about doing so, you’ve probably heard of AB testing. Some of you have even conducted one. The idea is as follows: say you own a website with a red login button. You’re thinking that you should change it to blue, since it looks too ugly. To determine which color you should use, you ask your users. You randomly sample 100 users and show 50 users the red button and 50 users the blue button. You measure the ratio of people who login to your website for each group, and see if there’s a big difference.Is Bell Curve really a Great Intellectual Fraud?2017-12-03T00:00:00+00:002017-12-03T00:00:00+00:00http://kojinoshiba.com/probability/bell-curve-is-important<h3 id="bell-curve--great-intellectual-fraud">Bell Curve = Great Intellectual Fraud?</h3>
<p>I recently read a New York Times best-seller titled “The Black Swan” by Nassim Nicholas Taleb. The book discusses how hard it is to predict rare events like Black Monday and 9/11. He claims that for predicting these events, statistical modeling is of no use. I am generally for the idea that current statistical modeling cannot handle outliers like the aforementioned. However, there was one part of the book which I had trouble agreeing with. Here is a summary of that passage, taken from wikipedia:</p>
<blockquote>
<p>Almost everything in social life is produced by rare but consequential shocks and jumps; all the while almost everything studied about social life focuses on the ‘normal,’ particularly with ‘bell curve’ methods of inference that tell you close to nothing. Why? Because the bell curve ignores large deviations, cannot handle them, yet makes us confident that we have tamed uncertainty. Its nickname in this book is GIF, Great Intellectual Fraud. <a href="https://en.wikipedia.org/wiki/Black_swan_theory#Background">Black Swan Theory (wikipedia)</a></p>
</blockquote>
<p>As a statistics student, I was offended to hear Taleb claim that the ‘bell curve’ is a Great Intellectual Fraud, and that this book is convincing many business people to think that way. Just because current statistical models failed to capture events like the financial crisis doesn’t mean the ‘bell curve’ is a fraud. In fact, the ‘bell curve’ has enormously enhanced our ability to understand the world. It is arguably the most important concept you can learn in probability, with many real-world applications aside from the prediction of rare events.</p>
<p>Hence, I decided to write about the importance of ‘bell curve’. I hope to convince my readers why it is definitely not an intellectual fraud.</p>
<h3 id="bell-curves-as-distributions">Bell curves as distributions.</h3>
<p>To prepare you for the next sections: what Taleb refers to as the ‘bell curve’ is called the ‘normal distribution’ or ‘Gaussian distribution’ by statisticians. I’ll occasionally use these terms to refer to the same concept. For the reasoning behind these names, please refer to <a href="https://www.quora.com/Why-is-the-Normal-distribution-called-Normal">this link</a>.</p>
<h3 id="entropy-as-a-measure-of-information-encoded">Entropy as a measure of information encoded</h3>
<p>Say you have a dataset in front of you, and you want to understand it. Any attempt to understand a dataset can be framed as fitting a “distribution”: there is always some underlying data generation process, and we hope to find a distribution that best models that process. Initially, you have no clue where to start, but you want to model the data somehow, just to get started. In such a situation, we hope to find a distribution that makes the fewest assumptions about the data.</p>
<p>So what is a good default distribution to start modeling the data with? How can we find a distribution that makes the fewest assumptions about the data? It seems hard to single out the best one, but in theory, we can find such a distribution. In fact, that distribution is the bell curve.</p>
<p>To formalize the argument, let’s define a measure of how much uncertainty a distribution encodes: entropy. Entropy is defined on distributions (and hence on random variables), and it quantifies how much uncertainty there is in the distribution. It is defined as follows:</p>
<script type="math/tex; mode=display">H(X) \equiv -\sum_{k=1}^K p(X=k) \log_2 p(X=k)</script>
<p>Note that for simplicity, I’m assuming that the distribution takes only finitely many (<script type="math/tex">K</script>) values. The higher the entropy, the fewer assumptions the distribution makes about the underlying data generation process. To see why this is the case, consider a coin flip where we don’t know the probability of landing heads.</p>
<h3 id="entropy-for-a-bernoulli-random-variable">Entropy for a Bernoulli random variable</h3>
<p>First, let’s naively assume that it is a fair coin. Let <script type="math/tex">X</script> be the indicator that the coin landed heads under this assumption. As noted in the previous post,</p>
<script type="math/tex; mode=display">X\sim Bern(\frac{1}{2})</script>
<p>Hence,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H(X) &= -(p(X=0)\log_2 p(X=0)+p(X=1)\log_2 p(X=1))\\
&= -2\cdot \frac{1}{2}\cdot \log_2 \frac{1}{2} \\
&= 1 \\
\end{align} %]]></script>
<p>Similarly, consider the other extreme assumption that the coin always lands heads. Let <script type="math/tex">Y</script> be the indicator that the coin lands heads. Now,</p>
<script type="math/tex; mode=display">Y\sim Bern(1)</script>
<p>Hence,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H(Y) &= -(p(Y=0)\log_2 p(Y=0)+p(Y=1)\log_2 p(Y=1))\\
&= -1\cdot \log_2 1 \\
&= 0 \\
\end{align} %]]></script>
<p>Consider the implications of these two distributions. When we model a coin flip as a fair coin flip, the entropy is high, meaning that we are making few assumptions about the coin. On the other hand, in the case of the always-heads coin, the entropy is low, meaning that we are making strong assumptions about the coin. This should make intuitive sense; no one, when seeing a coin for the first time, would assume that the coin always lands heads! For different values of <script type="math/tex">p</script>, the probability of the coin landing heads, the entropy is as follows:</p>
<figure>
<img src="/assets/images/2017-12-03-bell-curve-is-important/max_entropy.png" />
<figcaption>Entropy for different values of p.</figcaption>
</figure>
<p>Hence, when given a coin, if we want to make the fewest assumptions about <script type="math/tex">p</script>, <script type="math/tex">X\sim Bern(\frac{1}{2})</script> is the initial distribution we should assume, as it has the highest entropy.</p>
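<p>The curve above is easy to check numerically (a sketch; the function name is mine, and I use the convention <script type="math/tex">0\log_2 0 = 0</script>):</p>

```python
import math

def bernoulli_entropy(p):
    """Entropy of Bern(p) in bits, with the convention 0 * log2(0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

grid = [k / 100 for k in range(101)]
entropies = [bernoulli_entropy(p) for p in grid]
best_p = grid[entropies.index(max(entropies))]  # maximized at p = 1/2
```

<p>Sweeping over the grid confirms that the entropy peaks at one bit, exactly at the fair coin.</p>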
<h3 id="bell-curves-maximize-entropy">‘Bell curves’ maximize entropy.</h3>
<p>In the example of a coin flip, the outcome was binary. What if the outcome were continuous and took values from <script type="math/tex">-\infty</script> to <script type="math/tex">\infty</script>? What is the distribution that maximizes the entropy?</p>
<p>The continuous version of entropy (also called differential entropy) is:</p>
<script type="math/tex; mode=display">H(X) \equiv -\int_{x} p(x) \log_2 p(x) dx</script>
<p>where <script type="math/tex">p(x)</script> is the <a href="https://en.wikipedia.org/wiki/Probability_density_function">PDF</a> of <script type="math/tex">X</script>.</p>
<p>We would like to maximize this subject to the constraint that the PDF integrates to 1:</p>
<script type="math/tex; mode=display">\int_{-\infty}^{\infty} p(x) dx = 1</script>
<p>This is a simple optimization problem which can be solved using a <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multiplier</a>. The objective is to maximize:</p>
<script type="math/tex; mode=display">-\int_{-\infty}^{\infty} p(x) \log_2 p(x) dx + \lambda_0\left(\int_{-\infty}^{\infty} p(x) dx - 1\right)</script>
<p>We actually need two more assumptions here: that the mean and the variance of the distribution are known. In other words, we consider a situation where we know nothing about the data BUT its mean and variance. Let them be <script type="math/tex">\mu</script> and <script type="math/tex">\sigma^2</script>. From the definitions of mean and variance,</p>
<script type="math/tex; mode=display">\int_{-\infty}^{\infty} xp(x) dx = \mu</script>
<script type="math/tex; mode=display">\int_{-\infty}^{\infty} (x-\mu)^2 p(x) dx = \sigma^2</script>
<p>Adding this as constraints to the optimization problem, the final objective to maximize becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& -\int_{-\infty}^{\infty} p(x) \log_2 p(x) dx \\
& +\lambda_0\left(\int_{-\infty}^{\infty} p(x) dx - 1\right) \\
& +\lambda_1\left(\int_{-\infty}^{\infty} xp(x) dx - \mu\right) \\
& +\lambda_2\left(\int_{-\infty}^{\infty} (x-\mu)^2 p(x) dx - \sigma^2\right)
\end{align} %]]></script>
<p>Solving this yields,</p>
<script type="math/tex; mode=display">p(x) = \frac{1}{\sqrt{2\pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}</script>
<p>which is exactly the PDF of a normal distribution!</p>
<p>Hence, we have that, given the mean and the variance of the data, the normal distribution is the one that makes the fewest assumptions about the data.</p>
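<p>As a sanity check of this result, we can compare the closed-form differential entropies (in bits) of a few distributions that share the same variance; the formulas below are standard, and the normal comes out on top:</p>

```python
import math

sigma2 = 1.0  # fix the variance; the mean doesn't affect differential entropy

# Closed-form differential entropies (in bits) of distributions with variance sigma2:
h_normal = 0.5 * math.log2(2 * math.pi * math.e * sigma2)   # N(mu, sigma2)
h_uniform = math.log2(math.sqrt(12 * sigma2))               # Uniform of width sqrt(12 * sigma2)
h_laplace = math.log2(2 * math.e * math.sqrt(sigma2 / 2))   # Laplace with variance sigma2
```

<p>The same ordering holds for any variance, which is exactly the max-entropy property derived above.</p>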
<h3 id="normal-is-not-gif">Normal is not GIF!</h3>
<p>What does this derivation of the normal distribution tell us about Taleb’s claim about the ‘bell curve’? It says that his claim doesn’t make sense! First of all, as we have seen, nowhere in statistics or information theory is it claimed that the normal distribution is what we should use all the time to model data. The result only says that when no information about the data is given (but the mean and the variance), the normal distribution makes as few assumptions as possible.</p>
<p>Of course, the fact that the normal distribution makes the fewest assumptions doesn’t mean we shouldn’t use it. On the contrary, it is a good starting point when confronted with a dataset, since it avoids injecting any bias we may have about the distribution of the data.</p>
<p>In the following posts, I would also like to introduce other important concepts involving the normal distribution, such as the central limit theorem, Kalman Filters and Gaussian Processes.</p>
<!-- ### Central Limit Theorem:
### Gaussian Processes: All distributions are mixtures of 'bell curves'.
### Kalman Filter: How 'bell curves' can track a space shuttle. -->Kojin Oshibakojinoshiba@college.harvard.eduBell Curve = Great Intellectual Fraud? I recently read a New York Times best-seller titled “The Black Swan” by Nassim Nicholas Taleb. The book discusses how hard it is to predict rare events like Black Monday and 9/11. He claims that for predicting these events, statistical modeling is of no use. I am generally for the idea that current statistical modeling cannot handle outliers like the aforementioned. However, there was one part of the book which I had trouble agreeing with. Here is a summary of that passage, taken from wikipedia:Must-know probability distributions from a single coin toss2017-11-22T00:00:00+00:002017-11-22T00:00:00+00:00http://kojinoshiba.com/probability/must-know-distributions<p>There are countless probability distributions. Some of them are so widely used and beautiful that they deserve a name. Surprisingly, all of those distributions can be derived starting from a single coin toss. Here’s the demonstration.</p>
<h3 id="consider-a-coin-toss">Consider a coin toss…</h3>
<p>Let’s start again with a coin toss. This time, think of a coin that lands heads with probability <script type="math/tex">p</script> and tails with probability <script type="math/tex">1-p</script>. This is called a Bernoulli distribution, and we write it as <script type="math/tex">Bern(p)</script>. Surprisingly, almost all important distributions we encounter in statistics and machine learning can be derived by combining this single coin toss somehow. Let’s start with a simple example.</p>
<h3 id="binomial-distribution">Binomial Distribution</h3>
<p>This is equivalent to tossing the same coin <script type="math/tex">n</script> times.
Namely, let <script type="math/tex">X_i \sim_{iid} Bern(p)</script>, i.e., coin tosses with the same probability of landing heads that are independent of each other. Then <script type="math/tex">Y</script>, the total number of heads in these <script type="math/tex">n</script> coin tosses, is:</p>
<script type="math/tex; mode=display">Y = \sum_{i=1}^n X_i \sim Bin(n,p)</script>
<p>This was a pretty straightforward example. Now, let’s move on to something a bit more complicated.</p>
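<p>A quick simulation sketch of this construction (the parameter values are my arbitrary choices):</p>

```python
import random

random.seed(0)
n, p = 20, 0.3

def binomial_draw(n, p):
    """Sum of n independent Bern(p) coin tosses."""
    return sum(random.random() < p for _ in range(n))

draws = [binomial_draw(n, p) for _ in range(50_000)]
mean = sum(draws) / len(draws)  # should be close to n * p
```

<p>The sample mean comes out close to <script type="math/tex">np</script>, as expected for <script type="math/tex">Bin(n,p)</script>.</p>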
<h3 id="uniform-distribution">Uniform Distribution</h3>
<p>If you know what a uniform distribution is, it might seem counterintuitive at first that it can be generated using coin tosses. But it can! This time, let <script type="math/tex">X_i \sim_{iid} Bern(\frac{1}{2})</script>, i.e., fair coin tosses. Then,</p>
<script type="math/tex; mode=display">U = \sum_{i=1}^{\infty} \frac{X_i}{2^i} \sim Unif(0,1)</script>
<p>Conceptually, we can generate a uniform distribution by infinitely many tosses of a fair coin. To see why, let’s think about the CDF of <script type="math/tex">Unif(0,1)</script>. Using the above expression of the uniform distribution, we hope to derive <script type="math/tex">P(U \leq u) = u</script>. To do so, think of <script type="math/tex">U</script> as a binary expansion <script type="math/tex">U=0.X_1X_2...</script>. Similarly, <script type="math/tex">u=0.u_1u_2...</script>. To see when <script type="math/tex">U \leq u</script>, we only need to check the first binary digit at which <script type="math/tex">X_i</script> and <script type="math/tex">u_i</script> differ (e.g. the comparison of <script type="math/tex">0.001010</script> and <script type="math/tex">0.000101</script> can be made by comparing the third digit). Conditioning on <script type="math/tex">J=j</script>, the digit at which the two numbers first differ, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
P(U\leq u) =P(U<u)= \sum_{j=1}^{\infty} P(U\leq u|J=j)P(J=j) %]]></script>
<p>The first equation follows from the fact that <script type="math/tex">P(U=u)=0</script>. The second equation is using <a href="https://en.wikipedia.org/wiki/Law_of_total_probability">the law of total probability</a>.</p>
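<p>We can check this construction empirically by truncating the binary expansion at a finite number of fair coin flips (a sketch of mine; 32 bits is an arbitrary truncation):</p>

```python
import random

random.seed(0)

def uniform_from_coins(bits=32):
    """U = sum of X_i / 2^i over i = 1..bits, with X_i iid fair coin flips."""
    return sum(random.getrandbits(1) / 2 ** i for i in range(1, bits + 1))

samples = [uniform_from_coins() for _ in range(50_000)]
# Empirical CDF check: P(U <= u) should be close to u.
frac_below_quarter = sum(s <= 0.25 for s in samples) / len(samples)
```

<p>The empirical CDF tracks the diagonal, and the sample mean sits near 1/2, consistent with <script type="math/tex">Unif(0,1)</script>.</p>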
<h3 id="exponential-distribution">Exponential Distribution</h3>
<p>Exponential distribution can be defined using a uniform distribution:</p>
<script type="math/tex; mode=display">X=-\log U\sim Expo(1)</script>
<p>where <script type="math/tex">U\sim Unif(0,1)</script>. Because the uniform distribution is generated using coin tosses, we can also think of exponential distribution as generated from them.</p>
<p>If you haven’t seen exponential distributions before, it is a distribution that comes up a lot in <a href="https://en.wikipedia.org/wiki/Poisson_point_process">Poisson Processes</a> (which is an important probability concept I’ll describe in another post). The most important thing to remember about the exponential distribution is the <em>memoryless property</em>:</p>
<script type="math/tex; mode=display">P(X>a+b \mid X>a)=P(X>b)</script>
<p>To have an intuitive understanding, think of <script type="math/tex">X</script> as a waiting time of a bus at a bus stop. <script type="math/tex">X</script> following an exponential distribution means that the time you have waited so far tells you nothing about the time you will wait until the bus comes (hence the term “memoryless”). This property is crucial in modeling natural phenomena such as radioactive particle decays in physics. Surprisingly, a <em>continuous</em> random variable is memoryless if and only if it has an exponential distribution (<a href="https://en.wikipedia.org/wiki/Memorylessness">proof</a>).</p>
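<p>Here is a small simulation sketch (my own; the values of <script type="math/tex">a</script> and <script type="math/tex">b</script> are arbitrary) of both the construction from a uniform and the memoryless property:</p>

```python
import math
import random

random.seed(0)
# X = -log(U) with U ~ Unif(0, 1); using 1 - random() keeps U strictly positive.
samples = [-math.log(1.0 - random.random()) for _ in range(200_000)]

def survival(t):
    """Empirical P(X > t)."""
    return sum(x > t for x in samples) / len(samples)

a, b = 0.5, 1.0
lhs = survival(a + b) / survival(a)  # P(X > a+b | X > a)
rhs = survival(b)                    # P(X > b)
```

<p>The two empirical probabilities agree, matching the memoryless property of the exponential distribution.</p>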
<h3 id="geometric-distribution">Geometric Distribution</h3>
<p>In contrast to the Exponential, a <em>discrete</em> random variable is memoryless if and only if it has a Geometric distribution. The Geometric can be obtained by taking the floor of an Exponential:</p>
<script type="math/tex; mode=display">G=\lfloor X \rfloor</script>
<p>where <script type="math/tex">X</script> is Exponential.</p>
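Concretely, <script type="math/tex">\lfloor Expo(\lambda) \rfloor</script> is Geometric starting at 0 with success probability <script type="math/tex">p = 1-e^{-\lambda}</script>. A sketch checking this (assuming numpy; note that numpy’s <code>exponential</code> takes a scale <script type="math/tex">1/\lambda</script>, not a rate):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 0.5
g = np.floor(rng.exponential(scale=1 / lam, size=200_000)).astype(int)

# floor(Expo(lam)) is Geometric on {0, 1, 2, ...} with p = 1 - exp(-lam)
p = 1 - np.exp(-lam)
for k in range(4):
    print(k, np.mean(g == k), (1 - p) ** k * p)  # empirical vs. theoretical pmf
```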
<h3 id="gamma-distribution">Gamma Distribution</h3>
<p>The Gamma distribution is just a sum (or <em>convolution</em>) of i.i.d. Exponentials. Let <script type="math/tex">X_i \sim_{iid} Expo</script>. Then,</p>
<script type="math/tex; mode=display">G_r = \sum_{i=1}^r X_i \sim Gamma(r)</script>
<p>Note that this only defines the Gamma distribution when <script type="math/tex">r</script> is a positive integer. A more general definition for real <script type="math/tex">r > 0</script> is left to other sources.</p>
<p>The Gamma distribution is important because it is the conjugate prior for the Poisson distribution (coming soon). Hence, it is important in Bayesian statistics. <a href="https://en.wikipedia.org/wiki/Inverse-gamma_distribution">The inverse of the Gamma distribution</a> is also widely used in Bayesian statistics to model the prior variance of a Normal distribution (again, coming soon). I know this is a lot of information; for now, think of the Gamma as a distribution derived from the Exponential that is important in Bayesian statistics.</p>
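The sum-of-Exponentials construction can be checked directly (a sketch assuming numpy): <script type="math/tex">Gamma(r)</script> has mean and variance both equal to <script type="math/tex">r</script>.

```python
import numpy as np

rng = np.random.default_rng(4)
r = 5

# G_r = X_1 + ... + X_r with X_i iid Expo(1) follows Gamma(r)
g = rng.exponential(size=(100_000, r)).sum(axis=1)

# Compare against numpy's direct Gamma(r) sampler: both near mean r, variance r
direct = rng.gamma(shape=r, size=100_000)
print(g.mean(), direct.mean())
print(g.var(), direct.var())
```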
<h3 id="chi-squared-distribution">Chi Squared Distribution</h3>
<p>The Chi Squared distribution is just a scaled special case of the Gamma distribution. Namely, the Chi Squared distribution with <script type="math/tex">n</script> degrees of freedom is twice a <script type="math/tex">Gamma(\frac{n}{2})</script>:</p>
<script type="math/tex; mode=display">W^2 \sim \chi^2_n \sim 2\,Gamma\left(\frac{n}{2}\right)</script>
<p>For now, don’t worry too much about the “<script type="math/tex">n</script> degrees of freedom” part (it just means there are <script type="math/tex">n</script> parameters we can vary). The Chi Squared distribution is often used in <a href="https://en.wikipedia.org/wiki/Chi-squared_test">hypothesis testing</a>. As noted in the next section, it has a close relationship with the more famous Normal distribution.</p>
<h3 id="normal-distribution">Normal Distribution</h3>
<p>The Normal distribution can be defined using the Chi distribution (which is just the square root of the Chi Squared distribution) and a random sign. A random sign is a random variable that takes the value <script type="math/tex">1</script> with probability <script type="math/tex">1/2</script> and <script type="math/tex">-1</script> with probability <script type="math/tex">1/2</script>. Let <script type="math/tex">W</script> be a Chi random variable with one degree of freedom and <script type="math/tex">S</script> a random sign. Then,</p>
<script type="math/tex; mode=display">Z=SW\sim N(0,1)</script>
<p><script type="math/tex">Z</script> is a standard normal, because it has mean <script type="math/tex">\mu=0</script> and variance <script type="math/tex">\sigma^2=1</script>. Any normal distribution can be obtained from <script type="math/tex">Z</script>:</p>
<script type="math/tex; mode=display">X=\mu+\sigma Z \sim N(\mu,\sigma^2)</script>
<p>Another relationship between the Normal and Chi Squared distributions is that the Chi Squared distribution is the sum of squares of i.i.d. standard normals. Since the sum of <script type="math/tex">n</script> i.i.d. <script type="math/tex">Gamma(1/2)</script> random variables is <script type="math/tex">Gamma(n/2)</script>,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
Z_1^2+...+Z_n^2 &=W_1^2+...+W_n^2 \quad (\because Z_i = S_iW_i \Rightarrow Z_i^2 = W_i^2) \\
&\sim 2\,Gamma(n/2) \\
&\sim \chi^2_n
\end{align} %]]></script>
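Both directions of this relationship can be simulated (a sketch assuming numpy): build a standard normal as sign times Chi(1), and build <script type="math/tex">\chi^2_n</script> as a sum of squared standard normals.

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 200_000

# Z = S * W with S a random sign and W ~ Chi(1) gives a standard normal
w = np.sqrt(rng.chisquare(df=1, size=n_samples))
s = rng.choice([-1, 1], size=n_samples)
z = s * w
print(z.mean(), z.var())  # near 0 and 1

# Sum of n squared standard normals is Chi Squared with n degrees of freedom
n = 4
chi2 = (rng.standard_normal(size=(n_samples, n)) ** 2).sum(axis=1)
print(chi2.mean(), chi2.var())  # chi^2_n has mean n and variance 2n
```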
<h3 id="student-t-cauchy-log-normal">Student-t, Cauchy, Log Normal</h3>
<p>These are all distributions that can be derived using Normals and/or the Chi Squared. I will not go into depth on each one, but you will surely come across each of these as you study more statistics and machine learning.</p>
<ul>
<li>Student-t: <script type="math/tex">T=\frac{Z}{\sqrt{V_n/n}}</script> where <script type="math/tex">Z</script> is a standard normal and <script type="math/tex">V_n \sim \chi^2_n</script> is independent of <script type="math/tex">Z</script>. Widely used in <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">hypothesis testing</a>.</li>
<li>Cauchy: <script type="math/tex">Cauchy \sim \frac{Z_1}{Z_2}</script> where <script type="math/tex">Z_1,Z_2</script> are i.i.d. standard normals. Famous because its mean and variance are undefined.</li>
<li>Log Normal: <script type="math/tex">Y=e^X</script> where <script type="math/tex">X</script> is Normal. It’s just the exponential of a normal random variable, but widely used in modeling.</li>
</ul>
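All three constructions in one sketch (assuming numpy; the degrees of freedom <code>df = 5</code> is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
size = 100_000

z = rng.standard_normal(size)
v = rng.chisquare(df=5, size=size)   # independent of z

t = z / np.sqrt(v / 5)               # Student-t with 5 degrees of freedom
cauchy = z / rng.standard_normal(size)  # ratio of independent standard normals
y = np.exp(z)                        # Log Normal

# t_5 has mean 0 and variance n/(n-2) = 5/3
print(t.mean(), t.var())
# Log Normal with X ~ N(0,1) has mean exp(1/2)
print(y.mean(), np.exp(0.5))
# The Cauchy sample keeps producing huge outliers -- its mean is undefined
print(np.abs(cauchy).max())
```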
<h3 id="beta-distribution">Beta Distribution</h3>
<p>The Beta distribution is widely used in Bayesian statistics because it is the conjugate prior of the Bernoulli and Binomial distributions.</p>
<script type="math/tex; mode=display">B = \frac{G_a}{G_a+G_b}</script>
<p>where <script type="math/tex">G_a \sim Gamma(a)</script>, <script type="math/tex">G_b \sim Gamma(b)</script> and they are independent. As you can immediately see, the Beta distribution has a close connection with the Gamma distribution. A fun fact is that <script type="math/tex">\frac{X_1}{X_1+X_2}</script> and <script type="math/tex">X_1+X_2</script> are independent if and only if <script type="math/tex">X_1,X_2</script> are independent Gammas. In that case, <script type="math/tex">\frac{X_1}{X_1+X_2}</script> is Beta and <script type="math/tex">X_1+X_2</script> is also Gamma. If we think of <script type="math/tex">X_1</script> as the waiting time at a bus stop and <script type="math/tex">X_2</script> as the waiting time at a station, this tells us that the proportion of the total wait spent at the bus stop tells you nothing about the total time waited.</p>
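Both the construction and the fun fact can be checked by simulation (a sketch assuming numpy; <code>a = 2, b = 5</code> are arbitrary shape parameters):

```python
import numpy as np

rng = np.random.default_rng(7)
a, b = 2.0, 5.0
ga = rng.gamma(shape=a, size=200_000)
gb = rng.gamma(shape=b, size=200_000)

beta = ga / (ga + gb)   # Beta(a, b)
total = ga + gb         # Gamma(a + b)

# Beta(a, b) has mean a / (a + b)
print(beta.mean(), a / (a + b))
# The ratio and the total are independent, so their correlation is ~0
print(np.corrcoef(beta, total)[0, 1])
```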
<h3 id="poisson-distribution">Poisson Distribution</h3>
<p>Last but not least, the Poisson distribution is defined using the Exponential and a Poisson process. This is tricky, so let me explain it in depth. Think of buses arriving at a bus stop. Let <script type="math/tex">% <![CDATA[
0<T_1<T_2<... %]]></script> be their arrival times, with <script type="math/tex">T_n \sim Gamma(n,\lambda)</script>, or equivalently, <script type="math/tex">T_1,T_2-T_1,... \sim_{iid} Expo(\lambda)</script> (think about why). Then <script type="math/tex">N_t=\max\{n:T_n \leq t\}</script>, the number of bus arrivals up to time <script type="math/tex">t</script>, follows a Poisson distribution:</p>
<script type="math/tex; mode=display">N_t=\max\{n:T_n \leq t\} \sim Pois(\lambda t)</script>
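The bus-stop story translates directly into code (a sketch assuming numpy; <script type="math/tex">\lambda = 2</script> and <script type="math/tex">t = 3</script> are arbitrary test values): simulate arrival times as cumulative sums of i.i.d. Exponential gaps, count arrivals up to time <script type="math/tex">t</script>, and check the Poisson mean and variance.

```python
import numpy as np

rng = np.random.default_rng(8)
lam, t, n_sims = 2.0, 3.0, 50_000

# Arrival times T_n are cumulative sums of iid Expo(lam) gaps;
# 40 gaps per simulation is far more than enough since lam * t = 6
gaps = rng.exponential(scale=1 / lam, size=(n_sims, 40))
arrivals = np.cumsum(gaps, axis=1)
n_t = np.sum(arrivals <= t, axis=1)  # N_t = number of arrivals up to time t

# N_t ~ Pois(lam * t): mean and variance should both be lam * t = 6
print(n_t.mean(), n_t.var())
```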
<h3 id="thats-it">That’s it!</h3>
<p>It all started from a coin toss… and we’ve come this far. Perhaps by now you’ve forgotten about that coin toss at the beginning, but if you carefully track how the distributions relate to one another, you’ll see how a single coin toss magically opens the door to the world of probability distributions.</p>
<h1 id="what-is-probability">What is probability? (2017-11-12)</h1>
<p>We come across probability not just in statistics classrooms but also in real life. But have you thought about what probability really means? I would like to introduce you to a formal definition of probability.</p>
<h2 id="consider-a-coin-toss">Consider a coin toss…</h2>
<p>Let’s start with a simple example. Imagine tossing a fair coin where the outcome is either heads or tails. What is the probability of tossing heads?
You’re right, <script type="math/tex">0.5</script>. But why not <script type="math/tex">-0.7</script>, <script type="math/tex">20</script>, or <script type="math/tex">\frac{1024}{7}</script>? Well, the (somewhat boring) answer is: because it is defined as such. To give you a more in-depth answer, this post will introduce the formal way of defining what a probability is.</p>
<h2 id="the-world-of-omega-f-p">The world of <script type="math/tex">\Omega</script>, <script type="math/tex">F</script>, <script type="math/tex">P</script>.</h2>
<p>Probability, or more formally, a <strong>probability space</strong> is defined using three letters: <script type="math/tex">\Omega</script>, <script type="math/tex">F</script>, <script type="math/tex">P</script>. What are they? Let’s take a fair coin toss as an example.</p>
<h3 id="omega-is-a-sample-space"><script type="math/tex">\Omega</script> is a sample space.</h3>
<p>A sample space is a set of all the possible outcomes in a certain process. In a coin toss, a coin can only land heads (H) or tails (T), so there are the two only outcomes. So we have:</p>
<script type="math/tex; mode=display">\Omega = \{ H,T \}</script>
<p>Each element in <script type="math/tex">\Omega</script> is an <strong>outcome</strong> and is often referred to as <script type="math/tex">\omega</script> (lowercase omega).</p>
<h3 id="f-is-a-set-of-events"><script type="math/tex">F</script> is a set of ‘events’.</h3>
<p>First, let’s define an event. An <strong>event</strong> is a set of outcomes (possibly none, one, or several). For example:</p>
<ul>
<li><script type="math/tex">\{ H \}</script> is “an event that a coin lands heads”</li>
<li><script type="math/tex">\{ T \}</script> is “an event that a coin lands tails”</li>
<li><script type="math/tex">\{ H,T \}</script> is “an event that a coin lands heads or tails”</li>
<li><script type="math/tex">\{ \}</script> is “an event that a coin lands neither heads nor tails”</li>
</ul>
<p>In fact, these are the only events that are defined in a probability space of a coin toss!
A <strong>set of events</strong> is literally a set that contains all the possible events in a certain process. Referring to the four events above, we have:</p>
<script type="math/tex; mode=display">F = \{\{ \},\{ H \},\{ T \},\{ H,T \}\}</script>
<p>Note that <script type="math/tex">F</script> contains <script type="math/tex">\Omega=\{ H,T \}</script>. This is always the case in any probability space. To more formally define <script type="math/tex">F</script>, we need to introduce a more difficult concept called a <script type="math/tex">\sigma</script>-algebra, but I’ll leave that to future posts for now.</p>
<h3 id="p-is-a-probability-measure"><script type="math/tex">P</script> is a probability measure.</h3>
<p>A <strong>probability measure</strong> is a function that takes in an event as input, and spits out a probability of that event between 0 and 1. Formally, <script type="math/tex">P: F \rightarrow [0,1]</script>. In our example,</p>
<ul>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ H \})=0.5</script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ T \})=0.5</script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ H,T \})=1</script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\{ \})=0</script></span></li>
</ul>
<p>For <script type="math/tex">P</script> to be a probability measure, we need two more conditions (axioms).</p>
<ul>
<li><span class="tex2jax_ignore"><script type="math/tex">% <![CDATA[
\begin{align}&P(\Omega)=1\end{align} %]]></script></span></li>
<li><span class="tex2jax_ignore"><script type="math/tex">P(\bigcup_{j=1}^{\infty}A_j) = \sum_{j=1}^{\infty}P(A_j)</script></span> where <script type="math/tex">A_j</script> are disjoint.</li>
</ul>
<p>The first equation seems intuitive. Recall that <script type="math/tex">\Omega</script> contains all the outcomes that can possibly happen. This equation is saying that the probability of either one of the all possible outcomes happening is 1. <br />
<br />
The second equation looks mysterious, so let me break it down. First, the <script type="math/tex">A_j</script> being disjoint means that when one event happens, another event cannot happen. For example, <script type="math/tex">\{H\}</script> and <script type="math/tex">\{T\}</script> are disjoint, whereas <script type="math/tex">\{H\}</script> and <script type="math/tex">\{H,T\}</script> are not (because they overlap). Formally, two events <script type="math/tex">A,B</script> (or sets) are disjoint when <script type="math/tex">A \cap B=\{\}=\phi</script>. <script type="math/tex">\phi</script> is just a commonly used notation for the empty set.<br />
<br />
The equation is essentially saying that the probability that at least one of multiple disjoint events happens, <script type="math/tex">P(\bigcup_{j=1}^{\infty}A_j)</script>, is the same as the sum of the probabilities of the individual events, <script type="math/tex">\sum_{j=1}^{\infty}P(A_j)</script>. <br />
For example, in our case, since <script type="math/tex">\{H\}</script> and <script type="math/tex">\{T\}</script> are disjoint, it must be the case that</p>
<script type="math/tex; mode=display">P(\{H\}\cup\{T\})=P(\{H\})+P(\{T\})</script>
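The coin-toss probability space is small enough to write down in code. A minimal sketch (plain Python): <script type="math/tex">F</script> is the power set of <script type="math/tex">\Omega</script>, and <script type="math/tex">P</script> assigns each outcome probability <script type="math/tex">1/2</script>.

```python
from itertools import chain, combinations

# Omega = {H, T}; F = the power set of Omega
omega = frozenset({"H", "T"})
F = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(omega), r) for r in range(len(omega) + 1))]

def P(event):
    """Probability measure: each outcome contributes probability 1/2."""
    return sum(0.5 for _ in event)

# The four events {}, {H}, {T}, {H,T} and their probabilities 0, 0.5, 0.5, 1
for event in F:
    print(sorted(event), P(event))

# Axiom 1: P(Omega) = 1
print(P(omega))
# Additivity on the disjoint events {H} and {T}
h, t = frozenset({"H"}), frozenset({"T"})
print(P(h | t) == P(h) + P(t))
```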
<h2 id="thats-it">That’s it!</h2>
<p>This is, in fact, all we need to define what a probability is. As a side note, physicists did try defining a negative probability. I’m not going into any details about it, but you can <a href="https://en.wikipedia.org/wiki/Negative_probability">read more about it</a> if interested.</p>
<h1 id="tech-harvard-undergrad">Job hunting for tech undergrads at Harvard (2017-10-27)</h1>
<p>I’m currently a junior, but since I took a year off, my close classmates are seniors, which means it’s job-hunting season. Here are some things I’ve noticed watching their job hunts. There aren’t many chances to learn about the job-hunting scene for tech undergrads in the US, so I hope this is useful.</p>
<h2 id="就職先">Where people end up</h2>
<p>Grouping students by level (in the sense of academic performance), the following trends emerge.</p>
<h3 id="超トップ層">The very top tier</h3>
<p>(1) Graduate school. The top programs in the US: Stanford, MIT, Carnegie Mellon</p>
<p>(2) Starting a company. Very few. Occasionally someone who was selected for the Thiel Fellowship or the YC Fellowship, or who raised funding from Founders Fund or Sequoia.</p>
<h3 id="トップ層">The top tier</h3>
<p>(1) Larger unicorns. Representative companies: Uber, Airbnb, Palantir</p>
<p>(2) Rare positions at big companies. Examples: Google APM, Apple ML Engineer</p>
<p>(3) Hot quant firms. Representative companies: Jane Street, Two Sigma, D. E. Shaw</p>
<p>(4) Hot startups and mid-size companies. Representative companies: varies</p>
<h3 id="中上位層">The upper-middle tier</h3>
<p>(1) Software engineer at a big company. Representative companies: Facebook > Google > Microsoft (in order of popularity)</p>
<p>(2) Decent quant firms.</p>
<h3 id="中下位下位層">The lower-middle to bottom tiers</h3>
<p>At this point they haven’t decided where to work yet, so I don’t know.</p>
<h2 id="就活時期">Timeline</h2>
<h3 id="１２年夏のインターン">Internships after freshman and sophomore years</h3>
<p>Normally, companies only take juniors as interns. However, big companies like Google and Facebook also offer internship opportunities to freshmen and sophomores. Students with a keen sense of direction who have already settled on a major apply to these. I know many people who realized through these internships that work at a big company is boring, and decided to start a company or join a mid-size company instead.</p>
<h3 id="３年夏のインターン">Junior summer internship</h3>
<p>Students intern at a company close to their intended post-graduation career. If you do solid work over the summer you get a return offer (the right to come back as a full-time employee), so joining the company you interned at is quite common.</p>
<h3 id="４年秋冬">Senior fall to winter</h3>
<p>Job hunting gets going around the time of the junior summer internship. Big companies finish their hiring process in October to November, mid-size companies by December to January. Startups hire year-round. Graduate school applicants start preparing around this time and apply in the winter.</p>
<h2 id="就活の方法">How people job hunt</h2>
<h3 id="大学のインタビュープログラム">The university’s interview program</h3>
<p>Big companies come to campus to run interviews. In tech that’s Google, Facebook, Palantir and others; in finance, GS and JP Morgan; in consulting, McKinsey and BCG. About 40% of students participate, and about half of them (so about 20% of all students) get offers this way.</p>
<h3 id="リファラル">Referrals</h3>
<p>Close friends refer each other to the companies where they interned during junior summer. With a referral you always get past the resume screen. Many people also get referred by recent graduates or by employees they met at events. Everyone is good at using their connections.</p>
<h3 id="オンライン">Online</h3>
<p>Plenty of people also apply online. It’s harder than the two routes above in the sense that you start on the same footing as every other applicant, but people with real ability and enthusiasm do get offers.</p>
<h2 id="給料">Salary</h2>
<p>Annual compensation, based on my rough impressions. Sites like Glassdoor have per-company numbers, so see those for details.</p>
<p>Graduate school: ¥4–5 million</p>
<p>Mid-size companies, unicorns, big companies: ¥10–15 million</p>
<p>Quant firms: ¥12–18 million</p>