Machine learning vs. statistics

The most widely used machine learning models are probably OLS and logistic regression. In that sense, the “10 year challenge” joke (ML is just statistics) has some truth to it.

But there are also important differences. Maybe the paper Blumenstock2015predicting would be a great example.

Because statistical models are quite useful and powerful, ML should use them. But ML is more than that. It includes many areas outside of these regression models. For instance, unsupervised machine learning methods and deep learning are a departure from the regression models.

Key differences

In the social sciences, the reason you construct a statistical model is often to test a hypothesis. Say we want to know how certain demographic factors affect voting behavior. Then we construct a model that explains voting behavior with demographic factors. The focus of the analysis is whether the demographic factors matter or not, and how much.

By contrast, in machine learning your aim is to teach computers to do certain things, often prediction. So your goal is how well you can predict voting behavior, or whether two people know each other or not. You care more about how well you can teach computers to predict than about the importance of individual factors.

Variable/feature selection

Because of this difference, there are interesting differences in approach. In social science models, you usually select variables carefully, based on theory and the existing literature. By contrast, in machine learning, you often throw in as many features as you can find and then use other techniques to avoid overfitting. This procedure in machine learning is called feature selection.
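As a rough illustration of the "throw in many features, then narrow them down" idea, here is a minimal sketch of one simple filter-style approach: rank features by the absolute Pearson correlation with the target and keep the top k. This is only one of many possible techniques, not necessarily the one these notes have in mind. The feature names, the numbers, and the helper select_top_k are all made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_top_k(features, target, k):
    """Keep the k feature names most correlated (in absolute value) with the target."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: three candidate features, one target.
features = {
    "age": [20, 30, 40, 50, 60, 70],
    "income": [3, 1, 4, 1, 5, 9],
    "noise": [5, 5, 5, 5, 5, 5],  # constant column, carries no information
}
turnout = [0.2, 0.3, 0.45, 0.5, 0.65, 0.7]

selected = select_top_k(features, turnout, k=2)
print(selected)
```

In practice the scores should be computed on the training data only, so that the selection step does not leak information from the test set.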

Out of sample prediction

Because you're including many features, it is super important to avoid overfitting. And because the whole point is teaching computers to do a certain task, it is crucially important that the model can generalize to out-of-sample data. So in machine learning you tend to care more about the problem setup: you split the data into a training set and a test set at the beginning and never look at the test set during training. For the training procedure itself, you further split your training set into a training set and a validation set. Or you make use of cross-validation to simulate out-of-sample testing.
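The splitting procedure described here can be sketched in a few lines of plain Python. This is a minimal sketch under simple assumptions (rows are independent, so a random shuffle is fine); the function names train_test_split and k_fold are hypothetical helpers written for this note, not a particular library's API.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy once, then hold out `test_fraction` of the rows as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(rows, k=5):
    """Yield (train, validation) pairs; each row lands in validation exactly once."""
    for i in range(k):
        validation = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, validation

data = list(range(100))  # stand-in for 100 observations

# Step 1: set the test set aside at the very beginning and do not touch it.
train, test = train_test_split(data)

# Step 2: cross-validate inside the training set only.
folds = list(k_fold(train, k=5))
for fold_train, fold_val in folds:
    print(len(fold_train), len(fold_val))
```

The test set is used once, at the very end, to report how the final model does on data it has never seen.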


And in machine learning you tend to use regularization a lot more, because you tend to use more features and it's critical not to overfit.
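To make the effect of regularization concrete, here is a minimal sketch of ridge (L2) regression in its simplest form: one feature, no intercept, where the closed-form estimate is sum(x*y) / (sum(x*x) + lambda). The data and the function name ridge_slope are made up for illustration; real problems have many features and would normally use a library.

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for y ≈ beta * x (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0 is plain least squares; larger penalties shrink the slope toward zero.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)}
for lam, beta in slopes.items():
    print(lam, round(beta, 3))
```

The penalty strength lambda is usually chosen by validation or cross-validation rather than by theory.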

Measure of success

In social science, the measure of success is often what kind of effect you find and how strong the effect is. So you do so-called star-gazing (scanning regression tables for significance stars).

In ML, success is usually measured by accuracy, AUROC, F1, and other metrics (R², RMSE, …). It's again all about how well computers can do the job, usually prediction.
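A few of these metrics are simple enough to write out by hand, which helps to see what they measure. The sketch below computes accuracy, precision, recall, and F1 for binary labels, and RMSE for a numeric target. The labels and predictions are made up, and the function names are hypothetical.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def rmse(y_true, y_pred):
    """Root mean squared error for a numeric target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical held-out labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = classification_metrics(y_true, y_pred)
error = rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(metrics, error)
```

These numbers only say something about generalization when they are computed on data that was not used for training.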

“Fuck normal behaviors”