I just spent 8 hours editing machine learning videos, and here’s what I learned.

5 min read · Oct 10, 2020

It has nothing to do with video editing skills, and everything to do with machine learning.

To beginners like myself, I hope this piece helps you tap into the uncharted territory of machine learning.

To the professionals working with ML day in, day out, hopefully seeing things through a beginner's eyes offers you a fresh perspective. If I've shared anything inaccurately, feel free to correct me in the comments below. For myself, writing this helps consolidate what I've learned so my future self can refer back to it.

So how did it all get started?

Machine learning has been a buzzword for as long as I can remember. It has been sitting on my ‘to-find-out-more’ list ever since I started working. I never really understood what its Wikipedia definition meant, nor was it a priority for me to dig up more content or videos to learn about it.

Coincidentally, I work in the software industry, so it always hits me when people are talking about machine learning and I just sit there not understanding what's happening.

Then recently at work, I was tasked with co-organizing a Machine Learning event, which presented me with an opportunity (or rather a challenge?) to look at this topic up close.

To make a preview video of the event, I started watching all the machine learning related videos in JMP so I could take bits and chunks of them, edit them and combine them into a cool trailer video (hopefully!).

Video editors would know that this requires watching and re-watching the footage many times over. There are a couple of things I heard repeatedly from the pros. These became my 5 key takeaways on ML:

1. Identifying and understanding the problem is key

“Failing to define the problem is one of the biggest problems I see for failure in machine learning efforts,” said Richard D. De Veaux from Williams College, one of the presenters in JMP’s previous webinar. His talk covers the mistakes and detours he endured by overlooking this part of the process.

This applies not just to ML projects, but to any problem we face at work. The holy questions we should repeatedly ask ourselves are: Why am I doing this? What is the problem I’m solving? Do I understand the problem well enough to even start solving it?

Most often, it’s almost like a knee-jerk reaction that we want to jump right into data and crack some insights out of it. But wait, what are we really trying to achieve here?

2. Supervised vs unsupervised learning models

Generally, there are two broad categories of machine learning models: supervised and unsupervised.

Imagine machine learning as a child. Supervised models are similar to teaching the kid how to behave by showing examples together with the right answers. The algorithm learns from these labeled examples how to predict the output for new inputs. An example is determining whether a new email in our inbox is spam.

Unsupervised models are similar to setting your child free in the playground and letting the kid find his/her own way back home afterwards. There are no right answers given; the algorithm tries to find structure or derive insights from your data on its own. An example of an unsupervised application is the algorithms search engines use to group and distribute relevant content.

Which type of model to work with goes back to point #1.
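To make the difference concrete for myself, here is a minimal sketch in plain Python. Everything in it is my own assumption for illustration: the tiny email data set, the word-counting "spam filter", and the function names `train_word_counts`, `predict` and `two_means` are all made up, and real spam filters and search engines are far more sophisticated. The supervised half learns from emails that come with a label; the unsupervised half is handed numbers with no labels and splits them into two groups on its own.

```python
from collections import Counter

# --- Supervised: every training example comes with the right answer (a label).
# Made-up toy data, purely for illustration.
labeled_emails = [
    ("win free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

def train_word_counts(examples):
    """Count how often each word appears under each label."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    """Pick the label whose training emails used these words most often."""
    scores = {label: sum(c[word] for word in text.split())
              for label, c in counts.items()}
    return max(scores, key=scores.get)

model = train_word_counts(labeled_emails)
print(predict(model, "claim your free prize"))  # spam

# --- Unsupervised: no labels at all, the algorithm looks for structure itself.
def two_means(values, iterations=10):
    """Split numbers into two groups around two moving centers (1-D k-means, k=2)."""
    centers = [min(values), max(values)]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the average of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups

print(two_means([1, 2, 3, 20, 21, 22]))  # [[1, 2, 3], [20, 21, 22]]
```

Notice that nobody told `two_means` which numbers belong together; it worked that out from the data alone, while `predict` could only answer because someone labeled the training emails first.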

3. Data preparation & cleaning is 90% of the work

Housekeeping is 90% of the ML work: not training the model, not collecting data, not interpreting results. Missing values, messy data, outliers and other issues that affect data quality need to be cleaned up and prepared before the data is ready for analysis. It is most often the undervalued part of the process. Remember: Garbage In, Garbage Out. There is an entire industry dedicated to collecting and cleaning data like this and this.
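As a beginner's sketch of what this housekeeping can look like, here is a small plain-Python example. The `raw_ages` list, the `clean` function and the 0 to 120 plausibility range are all hypothetical choices of mine, not something from the webinar. It counts missing and implausible values and replaces them with the median of the believable ones.

```python
from statistics import median

# Made-up example data: None marks a missing value, 380 looks like a typo.
raw_ages = [34, None, 29, 41, 380, 37, None, 45]

def clean(values, low=0, high=120):
    """Replace missing or out-of-range values with the median of the plausible ones.

    The low/high bounds are an assumption; they depend on what the column means.
    """
    present = [v for v in values if v is not None]
    plausible = [v for v in present if low <= v <= high]
    fill = median(plausible)
    cleaned = [v if v is not None and low <= v <= high else fill for v in values]
    report = {
        "missing": len(values) - len(present),
        "out_of_range": len(present) - len(plausible),
    }
    return cleaned, report

cleaned, report = clean(raw_ages)
print(cleaned)  # [34, 37, 29, 41, 37, 37, 37, 45]
print(report)   # {'missing': 2, 'out_of_range': 1}
```

Filling gaps with the median is only one option. Whether to fill, drop, or keep a missing value as it is depends on the problem, so I keep the report around to see how much was changed.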

Richard was asked to analyze dropout rates for a depression clinical trial so doctors could intervene earlier and prevent dropouts. He started off with an exploratory tree to analyze the data and found that younger patients and more depressed patients have a higher tendency to drop out. And the doctors went “Well, duh.”

He went back to the drawing board and restudied the data. Lo and behold, it turned out that the answer to this problem lay in the data set's missing values. (Drum roll, please.) He found that patients who skipped the first and second round of doctor visits were the ones likely to drop out of the clinical study, which gave the doctors something they could act on.
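The idea that a missing value can itself be the signal was new to me, so here is a rough sketch of how one might check for it. To be clear, this is not Richard's data or his method: the `patients` records, the field names and the function `dropout_rate_by_skipped_visits` are invented for illustration, and I am assuming "skipped" means a missing record for either of the first two visits.

```python
# Made-up records: visit1/visit2 hold a score recorded at the visit,
# or None if the patient did not show up. Not real trial data.
patients = [
    {"visit1": 14,   "visit2": 11,   "dropped_out": False},
    {"visit1": None, "visit2": None, "dropped_out": True},
    {"visit1": 20,   "visit2": None, "dropped_out": True},
    {"visit1": 9,    "visit2": 8,    "dropped_out": False},
    {"visit1": None, "visit2": 15,   "dropped_out": True},
    {"visit1": 17,   "visit2": 16,   "dropped_out": True},
    {"visit1": None, "visit2": 12,   "dropped_out": False},
    {"visit1": 13,   "visit2": 12,   "dropped_out": False},
]

def dropout_rate_by_skipped_visits(rows):
    """Compare dropout rates for patients who did vs. did not skip an early visit."""
    totals = {True: 0, False: 0}
    dropouts = {True: 0, False: 0}
    for row in rows:
        # Treat the missing value as information instead of cleaning it away.
        skipped = row["visit1"] is None or row["visit2"] is None
        totals[skipped] += 1
        dropouts[skipped] += row["dropped_out"]
    return {skipped: dropouts[skipped] / totals[skipped] for skipped in totals}

print(dropout_rate_by_skipped_visits(patients))  # {True: 0.75, False: 0.25}
```

In this toy data, patients who skipped an early visit drop out three times as often. Had I filled in or deleted those gaps first, that pattern would have disappeared.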

4. Newest & latest might not always be the best in solving your problems

The shiny object syndrome: whenever new models, tools and techniques emerge, they capture a lot of attention and interest. But newest is not always better, so always keep your toolbox open. Other techniques, like Design of Experiments and Statistical Modeling, might be better suited to the problems we’re facing. Again, back to point #1, it all depends on what we’re trying to solve.

5. Overfitting and underfitting

Overfitting and underfitting are behind many underperforming ML models.

Imagine the same child playing in the playground meets some other kids who are bullies and picks up the act of bullying. He/she assumes that it’s well-accepted behavior and goes on to bully others in the future. This is overfitting. It happens when an algorithm learns the errors, noise and randomness in its training data set as if they were real patterns, resulting in a less accurate model when applied in the real world.

Underfitting is when you throw the kid into a pre-university exam when he/she is not yet done with preschool. The model is too simple, or hasn’t learned enough from the data, to capture the underlying pattern, so it can’t predict future outcomes accurately.

The gist is to strike a balance between overfitting and underfitting your models.
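Here is a small plain-Python sketch that helped me see the balance in numbers. All of it is my own toy setup, not anything from the webinar: the data follows a made-up pattern (y = 2x) with alternating noise of +1 and -1, the test points are noise-free to keep the arithmetic simple, and the three "models" (`fit_mean`, `fit_line`, `fit_memorizer`) are hypothetical stand-ins for an underfit, a balanced and an overfit model.

```python
# Made-up training data: the true pattern is y = 2x, plus alternating noise of +1 / -1.
train = [(x, 2 * x + (1 if x % 2 == 0 else -1)) for x in range(8)]
# Fresh points the models have never seen; noise-free here as a simplification.
test = [(x + 0.2, 2 * (x + 0.2)) for x in range(7)]

def fit_mean(data):
    """Underfit: ignore x and always predict the average y."""
    avg = sum(y for _, y in data) / len(data)
    return lambda x: avg

def fit_line(data):
    """Balanced: an ordinary least-squares straight line."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_memorizer(data):
    """Overfit: copy the y of the closest training point, noise included."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared error: average of (prediction - actual) squared."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, fit in [("mean", fit_mean), ("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(f"{name:10s} train error={mse(model, train):6.2f}  test error={mse(model, test):6.2f}")
```

The memorizer gets a perfect score on the data it has already seen, yet does worse than the plain straight line on new points, because it copied the noise. The mean model does badly on both, because it never learned the pattern at all. The line sits in between, which is the balance described above.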

Oversimplified, I know. But to get a better idea of this (shameless plug time!), join the event mentioned above and hear from the pros! Register for free here

All in all, it’s pretty amazing to be living in this age and seeing how technology advances, helping people make better decisions, saving lives and bringing convenience. I’m thankful to have had the opportunity to learn more about it, and hopefully so are you!

Thanks to all the ML practitioners who have been doing great work and sharing what they've learned with the world! And if you’ve made it this far, thanks for reading my humble sharing! :)