Introduction
Linear regression is talked about a lot. It is probably the first machine learning algorithm people learn about, and for good reason: it is easy to understand. The algorithm takes your data points and draws a straight line that minimizes the (squared vertical) distance from the line to all of the data points, and you can then use that line to make predictions. But in what scenarios should this actually be used? What types of data and predictions can be made well with this type of algorithm? Well, that's what Cassie Kozyrkov talks about in a segment of her Making Friends with Machine Learning course (link to video here).
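To make the idea concrete, here is a minimal sketch of fitting such a line with scikit-learn. The library choice and the made-up numbers are mine, not from the video:

```python
# A minimal sketch of ordinary least squares with scikit-learn.
# The data points here are invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Feature: movie length in minutes; target: some numeric rating.
lengths = np.array([[90], [105], [120], [135], [150]])
ratings = np.array([6.0, 6.8, 7.5, 8.1, 8.9])

model = LinearRegression()
model.fit(lengths, ratings)  # finds the line that minimizes squared error

print(model.coef_[0], model.intercept_)  # slope and intercept of the fitted line
print(model.predict([[140]]))            # use the line to predict a new point
```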
Summary Of Article
In the video, Cassie lists two conditions that need to be met for linear regression to be a good algorithm for the situation. The first one is pretty straightforward: the data has to be numerical, so things like currency, calories, height, speed, etc. The second criterion is a little less straightforward: the "value of the feature is more meaningful than just a threshold". Cassie explains it with a story. She went to Zurich and presented a decision tree to some of her friends about what types of movies she likes, and on the plane ride back they started recommending 5-hour-long movies to her. They had misunderstood the decision tree. It said that if a movie is longer than 127 minutes (the threshold), then recommend it. It did not say that the longer the movie is, the more Cassie will like it. And this is the type of data you need for linear regression to work: the more of a certain feature, the better (or worse) it is, not "it is good as long as the feature is above or below a certain threshold" (an example of an algorithm you could use in the threshold scenario would be logistic regression).
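Here is a small sketch (mine, not Cassie's) of the difference between the threshold-style logic of the decision tree and the "more is more" relationship that linear regression assumes. The 127-minute threshold is from the story; the slope and intercept are made-up values for illustration:

```python
# Contrast: threshold logic vs. linear ("value matters everywhere") logic.

def tree_style_recommendation(length_minutes: float) -> bool:
    """Threshold logic: anything past 127 minutes is treated the same."""
    return length_minutes > 127

def linear_style_score(length_minutes: float) -> float:
    """Linear logic: every extra minute changes the predicted score.
    The slope and intercept are invented for illustration."""
    return 0.02 * length_minutes + 4.0

# A 128-minute movie and a 300-minute movie look identical to the threshold rule...
print(tree_style_recommendation(128), tree_style_recommendation(300))  # True, True
# ...but very different to the linear model, where length keeps mattering.
print(linear_style_score(128), linear_style_score(300))                # 6.56 vs 10.0
```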
My Take
Overall, I really liked this video. It was informative, and I liked how it consolidated all of that information into two overarching points that are easy to refer back to. The video was also really easy to understand, and the story Cassie added was extremely helpful for understanding the second, harder point. I am still wondering how to know whether you need linear or polynomial regression, for example. Is it just based on what the data looks like (whether the scatter looks more like a polynomial curve than a straight line), or is there another way to decide? I also really liked the visuals on the slides, which enhanced the quality of the presentation. For example, Cassie had an exercise in which she demonstrated the kinds of classification decisions that machines have to make, which was a fun introduction to the core part of the video on linear regression.
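The video does not answer that question, so this is just my own guess at a practical check: compare a straight-line fit against polynomial fits on held-out data and keep whichever predicts better, rather than only eyeballing the scatter. A rough scikit-learn sketch on synthetic data:

```python
# My own guess (not from the video): pick the degree by validation error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(60, 1))
# Synthetic, deliberately curved data with some noise.
y = 0.5 * X[:, 0] ** 2 - 2 * X[:, 0] + rng.normal(0, 1, size=60)

for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(degree, mse)  # the degree with the lowest validation error is the better bet
```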
Conclusion
All in all, this video was really good. It was informative and concise, and the story used to explain the harder point made it much easier to follow. I also liked the visuals presented in the video, which aided understanding and made it more fun and interactive. I highly recommend you watch it when you get the opportunity (link to video found here).