Fifty Shades of Greyscale: Classifying YouTube videos with Machine Learning

By Mark Heydenrych

November 14, 2018

The magician pushes a piece of brightly coloured fabric into his hand. He squeezes tight, mutters an incantation, and, POOF – a dove leaps from his hands! The audience understands, intellectually, that there is a mechanism somewhere that made this happen. But on a gut level, it seems to be magic.

But do you know what's worse? Being the magician, and still being surprised every time you manage to produce a dove. That's what machine learning looks like sometimes. A bunch of data goes in, and by a process that appears to be entirely magical, we get answers. That's what my experience with machine learning has been like. Even as someone who understands the technical mechanisms at play, the results often feel akin to witchcraft.

So let me move a few steps back and explain about machine learning, the problem we tried to solve, and how I went from baffled audience member to slightly less baffled magician.

Machine Learning: An Introduction

Machine learning is a broad field, and one filled with lots of jargon and confusion. You will frequently see definitions of machine learning that are incredibly unhelpful. Wikipedia is a particular culprit in this regard, defining machine learning as

a field of artificial intelligence that uses statistical techniques to give computer systems the ability to "learn".

This is a frustrating explanation, because it uses the word learn to define machine learning. It really doesn't get us anywhere.

So I propose a different definition: machine learning refers to methods that allow computers to identify patterns in data. The more data you have, and the higher the quality of the data, the better the patterns you will be able to recognize. Once you have these patterns, you can figure out how they apply to new inputs. There is, of course, more to this from a technical standpoint. But in principle, this is all it is: extract patterns from past data, and then apply patterns to future data.

There are many types of problem that machine learning is great for, but this article focuses on Classification. Classification, as the name suggests, is about finding out what class data belongs to. It's the problem of finding the correct label or tag for a data point. In some cases, every data point has one and only one label. This is called a single-label classification problem. Alternatively, some problems deal with data points that can have multiple labels. These are multi-label classification problems.

Another important concept in machine learning is that of features. A feature is anything that can be used to describe a data point. For example, when classifying shapes, the number of faces could be considered a feature. When classifying colours, the red, green, and blue values could be features. For every problem, there are different features, and the number of features and their quality can be crucial to the results of your machine learning.

This research project focused on the problem of classifying and labelling YouTube videos.

Why are we doing this?

Research into machine learning is interesting (and fun), but if it doesn't produce any business results then the time spent doing the research should be used elsewhere. Classifying YouTube videos, in particular, is not important for Jonah. So why is this valuable?

The broader problem – labelling videos based on their visual and auditory aspects – has many applications beyond YouTube. For example, a well-trained model might be able to identify distracted or otherwise impaired driving from video feeds of traffic.

More generally, this research helps us understand the process and the pitfalls of machine learning, knowledge which will have applications in other projects moving forward.

How Grey Was My Salad?

One of the biggest questions in any machine learning problem is what feature set to use. As mentioned above, the quality of your features contributes significantly to the quality of your results. Frequently, you get to choose the features. Sometimes, you don't have this luxury. In this case, we used a pre-existing dataset from a challenge called YouTube 8M – so named because it originally contained a total of 8 million videos for training. (This has since been reduced to approximately 6 million videos of higher quality.) Since it was an existing dataset, we had to use the features provided. Each video in the set has features representing both visual and audible aspects, and labels like 'Video Game,' 'Food,' or 'Car' attached. In total, there are over 3,800 possible labels, and each video has an average of three.

However, the provided features are much more basic than I expected. One video feature and one audio feature is given for each second of video, taken from the first frame of that second. The audio feature is a decimal number indicating the volume of sound on that frame. The video feature is a decimal number representing the average of the red, green, and blue values. Not one for each of red, green, and blue, but an average of these three. Essentially, this means that we get only a single greyscale value for each second of video. When I first discovered this, it made the problem seem unsolvable. What kind of grey is 'Dance?' How can we tell the grey of 'Car' from the grey of 'Food?' The question seems so nonsensical that I was daunted.

After recovering, I saw this as an opportunity to test an interesting idea. I mentioned earlier that the quality and richness of the feature set can be crucial to the quality of the results. But here we had the opportunity to test how accurate our results could get with incredibly limited features. If we could obtain accuracy significantly better than a random classifier, I would consider it a success. A random classifier has a 50% chance to return "present" and a 50% chance to give "absent" for any label. When such a classifier applies the 'Game' label, it would be correct only about 13% of the time (788,288 / 6,000,000), and we could expect an even lower hit rate for all other labels, since they are less common. Any result better than this is a good one. In real terms, getting significantly above 50% accuracy means that the classifier is right more often than it's wrong.
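To make the baseline concrete, here is a minimal sketch of the arithmetic, assuming the figures quoted in this post (788,288 'Game' videos out of roughly 6 million). The function name is made up for illustration. For a coin-flip classifier, expected accuracy is 50% whatever the label frequency, while the share of its "present" answers that are actually right equals the label's base rate.

```python
def coin_flip_baseline(positives: int, total: int) -> dict:
    """Expected scores for a classifier that answers 'present' at random half the time."""
    base_rate = positives / total
    # It is right when it says "present" on a positive or "absent" on a negative.
    accuracy = 0.5 * base_rate + 0.5 * (1 - base_rate)
    # Of the videos it marks "present", only the base-rate share truly carry the label.
    precision = base_rate
    return {"base_rate": base_rate, "accuracy": accuracy, "precision": precision}


game = coin_flip_baseline(positives=788_288, total=6_000_000)
print({name: round(value, 3) for name, value in game.items()})
```

The base rate works out to roughly 0.13, so a trained classifier has to beat both that hit rate and the 50% accuracy mark to be worth anything.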

As an example of how little visual distinction there can be, look at the following three images (each with its averages of red, green, and blue, and its average greyscale):

  • Salad (colour and grey versions): RGB(178, 161, 81)
  • Car (colour and grey versions): RGB(89, 85, 92)
  • Sonic (colour and grey versions): RGB(88, 90, 173)

For a human, it is relatively straightforward to distinguish these images, and each one has a lot of colour diversity. However, when turned into features, they end up being pure greyscale, which loses so much information that we can no longer identify them. That's the kind of data we have to work with – grey, and nothing but.
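As a rough illustration of how the greyscale feature is described above, the sketch below averages the red, green, and blue values of the three example images. Matching each RGB triple to an image name is an assumption based on the order in which they appear, and `average_grey` is a hypothetical helper, not part of the dataset tooling.

```python
def average_grey(rgb: tuple[int, int, int]) -> float:
    """Collapse an RGB triple into a single grey value by plain averaging."""
    return sum(rgb) / 3


# Average RGB values quoted for the three example images.
frames = {
    "Salad": (178, 161, 81),
    "Car": (89, 85, 92),
    "Sonic": (88, 90, 173),
}
greys = {name: average_grey(rgb) for name, rgb in frames.items()}
for name, grey in greys.items():
    print(f"{name}: {grey:.1f}")
```

Whatever hue the image had is gone after this step: only a single brightness-like number per frame survives.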

Well... that's not entirely true. Rather than a single grey image, the learning algorithm will be dealing with a sequence of greyscale values. As such, the first ten seconds of this video would look like this:

[Image: Car video greyscale]

In this image, each grey block corresponds to one frame of the video. A few things stand out:

  • Every scene transition is very clear, with a rapid change in brightness. The first two seconds are a title card, followed by a still shot, and then a change in angle.
  • Within a single scene, there is typically very little colour variation (although if you look closely, there is a tiny bit of variation in the last half of the image).

It is not only the individual greyscale values, but also these transitions that provide important information to the machine learning algorithms. While the job is still daunting, it seems a little less daunting now – there are patterns that can be extracted here.

Like the brightness values, the average volume tends to be consistent within scenes and change between them. For example, the sound of a car driving around will have a particular volume pattern. Machine learning classifiers are excellent at detecting such patterns.
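One way to picture the transitions described above is to look at how much the value changes from one second to the next. This is only a sketch: the grey values and the threshold are invented for illustration, and the classifiers in this project learn such patterns on their own rather than through an explicit rule like this.

```python
def scene_changes(values: list[float], threshold: float = 20.0) -> list[int]:
    """Return the indices where the value jumps by more than `threshold` from the previous second."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > threshold
    ]


# Made-up per-second grey values: a title card, a still shot, then a new camera angle.
grey_per_second = [200, 201, 90, 91, 90, 92, 140, 141, 139, 140]
print(scene_changes(grey_per_second))  # [2, 6]
```

The same function could be pointed at the per-second volume values, since they behave in a similar way within and between scenes.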

The features and labels for each video are encoded in the TFRecord format used by TensorFlow, an open source project originally developed by Google. This gave me a brief headache at the beginning of the project, since TFRecord files, being binary encoded, can be very difficult to work with. Fortunately, tools exist to convert TFRecord files to JSON, which made life much easier. The JSON data were turned into a vector of values for use with the learning algorithms.
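The last step, turning a JSON record into a vector, might look something like the sketch below. The JSON layout (the `grey` and `volume` keys) and the choice to pad or truncate every video to a fixed number of seconds are assumptions made for illustration; the post does not describe the actual record structure.

```python
import json


def record_to_vector(record_json: str, seconds: int = 5) -> list[float]:
    """Flatten one JSON record into a fixed-length vector: grey values first, then volumes."""
    record = json.loads(record_json)

    def fit(values: list[float]) -> list[float]:
        # Truncate long videos and zero-pad short ones so every vector has the same length.
        return (list(values) + [0.0] * seconds)[:seconds]

    return fit(record["grey"]) + fit(record["volume"])


# Hypothetical record with three seconds of data.
example = '{"id": "abc", "labels": ["Car"], "grey": [0.8, 0.8, 0.3], "volume": [0.1, 0.2, 0.6]}'
vector = record_to_vector(example)
print(vector)
```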

Machine Learning in Practice

In general, there are five steps in machine learning:

  1. Label Data
  2. Prepare Data
  3. Train and Test Models
  4. Deploy Model
  5. Monitor Model


The classification of items in large datasets can be a laborious and time-consuming process. In this case, however, the 8M dataset was labelled by YouTube's automatic classifiers before being released for public use.

To see a problem in which labelling itself had to be carried out, and for more details on all of these steps, I recommend this AI Lab post.


Fortunately, using a prebuilt dataset meant we could avoid labelling the data, but we still had to prepare it for our specific use.

It is essential to keep in mind the question you are trying to answer. This will tell you how to prepare your data. For example, in our first attempt at solving this, we produced a single-label classifier, focusing on the Game label simply because it was the most common. In this case, our question becomes: Does this video have the Game label? Since numbers are significantly more efficient for machines to deal with than strings, we can use 1 when the label Game is present, and 0 when it is not. Changing the label to Game or Not-Game would be equivalent, in theory.

Things get a little more complicated when moving to multi-label classification. There are a variety of approaches to this, but I'm only going to focus on the approach that actually worked: binary relevance. In binary relevance, we train a separate single-label classifier for each label we're interested in. Essentially, for each label, we are asking the question Is this label present? In this problem, we focused specifically on the twelve most common labels:

  • Game (788,288 videos)
  • Video Game (539,945 videos)
  • Vehicle (415,890 videos)
  • Concert (378,135 videos)
  • Musician (286,532 videos)
  • Cartoon (236,948 videos)
  • Performance art (203,343 videos)
  • Car (200,813 videos)
  • Dance (181,578 videos)
  • Guitar (156,226 videos)
  • String instrument (144,667 videos)
  • Food (135,357 videos)

Each single-label classifier needed a prepared copy of the dataset, with the target value set to 0 or 1 based on the presence of the appropriate label: Game vs Not-Game, Car vs Not-Car, Food vs Not-Food, etc. To try to predict the labels of a new video, we tested it against each of the classifiers in turn.
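The binary relevance setup can be sketched in a few lines. This is an illustration under stated assumptions, not the project's code: only three of the twelve labels are used, the feature names are invented, and the "classifiers" are stand-in rules rather than trained models.

```python
LABELS = ["Game", "Car", "Food"]  # a subset of the twelve labels, for brevity


def binary_targets(videos: list[list[str]], label: str) -> list[int]:
    """One 0/1 target per video: is `label` among that video's labels?"""
    return [1 if label in video_labels else 0 for video_labels in videos]


def predict_labels(features: dict, classifiers: dict) -> list[str]:
    """Ask each single-label classifier in turn and keep the labels that answer yes."""
    return [label for label, is_present in classifiers.items() if is_present(features)]


videos = [["Game", "Video Game"], ["Car", "Vehicle"], ["Food"], ["Game", "Car"]]
targets = {label: binary_targets(videos, label) for label in LABELS}
print(targets)

# Stand-in classifiers (hypothetical rules, not trained models), keyed by label.
classifiers = {
    "Game": lambda f: f["mean_grey"] > 0.5,
    "Car": lambda f: f["mean_volume"] > 0.7,
    "Food": lambda f: False,
}
predicted = predict_labels({"mean_grey": 0.6, "mean_volume": 0.9}, classifiers)
print(predicted)
```

Each entry in `targets` corresponds to one prepared copy of the dataset, and `predict_labels` mirrors the step of testing a new video against every classifier.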

There is one final thing to consider: correlations between labels. This becomes the question If this has a particular label, does it also have a similar label? For example, a video with the label Game frequently also has the label Video Game; in the same way, String instrument and Guitar frequently occur together. It is possible to train classifiers for these correlations, either for every possible pair of labels, or only for particular labels that frequently occur together. We included these correlations in our training, and found that it significantly improved the accuracy of the multi-label classifiers.
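A simple first step towards finding labels that frequently occur together is to count label pairs across the dataset. The sketch below does only that counting, on made-up videos; how the correlations were actually fed into training is not described in this post, so nothing here should be read as the method used.

```python
from collections import Counter
from itertools import combinations


def label_pair_counts(videos: list[list[str]]) -> Counter:
    """Count how many videos carry each unordered pair of labels."""
    counts = Counter()
    for video_labels in videos:
        # Sort so that (A, B) and (B, A) are counted as the same pair.
        for pair in combinations(sorted(set(video_labels)), 2):
            counts[pair] += 1
    return counts


videos = [
    ["Game", "Video Game"],
    ["Video Game", "Game", "Cartoon"],
    ["Guitar", "String instrument"],
    ["Food"],
]
pair_counts = label_pair_counts(videos)
print(pair_counts.most_common(1))
```

Pairs with high counts would be the natural candidates if classifiers are trained only for labels that frequently occur together.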


Once the data was labelled and prepared, we had to train the models. YouTube 8M includes both training and testing datasets, allowing us to determine the accuracy of the trained models. Two models were trained and tested: random forest and logistic regression.

Random Forest Algorithm

The random forest algorithm is an extension of the decision tree algorithm. The decision tree algorithm defines a decision tree for its classifier in which the features are the input, and the result of following the decision tree is the output – in this case, the presence or absence of a given label. A known problem with decision trees is that of overfitting. Overfitting means that the decision tree matches the training data too well. While this might seem like a good thing, it actually results in worse predictions of future data, as the classifier treats the irrelevant noise in its inputs as significant.

The random forest algorithm addresses this by training multiple decision trees on the same data. Since there's an element of randomness in the learning algorithms, the different decision trees grow in slightly different ways. The result of the fuzziness that is introduced through this process is that consensus predictions based on future data are more accurate.
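The consensus idea can be shown with a heavily simplified stand-in for a random forest: each "tree" is a single split on one feature, and the randomness comes from giving each tree a different bootstrap sample of the data. Both simplifications are assumptions made to keep the sketch short, and the data is invented. A real random forest grows full trees and also randomises which features each split may use.

```python
import random


def train_stump(sample: list[tuple[float, int]]) -> float:
    """Pick the threshold (from the sampled values) that makes the fewest training errors."""
    def errors(threshold: float) -> int:
        return sum((x > threshold) != bool(y) for x, y in sample)

    return min((x for x, _ in sample), key=errors)


def train_forest(data: list[tuple[float, int]], n_trees: int = 25, seed: int = 0) -> list[float]:
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump(rng.choices(data, k=len(data))) for _ in range(n_trees)]


def predict(forest: list[float], x: float) -> int:
    """Majority vote: label 1 if more than half of the stumps say so."""
    votes = sum(x > threshold for threshold in forest)
    return 1 if votes > len(forest) / 2 else 0


# Made-up data: one grey value per video, label 1 for "bright" videos.
data = [(g, 0) for g in (20, 35, 50, 65, 80)] + [(g, 1) for g in (150, 170, 190, 210, 230)]
forest = train_forest(data)
print(predict(forest, 250), predict(forest, 5))
```

Because every stump sees a slightly different sample, the thresholds differ from tree to tree, and the vote smooths out the quirks of any single one.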

Logistic Regression

Logistic regression fits a logistic function to the training data. In the form used here, it only works when the outcome is binary – most often, as in this case, absence vs. presence. New data is used as input, and the output of the function is the result. Typically, the actual result will be some real value between 0 and 1, which is rounded to indicate the presence or absence of the label.
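The prediction step of logistic regression is small enough to write out. In the sketch below the weights and bias are hypothetical numbers chosen by hand, not values fitted to the YouTube 8M data, and the two-element feature vector is only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """The logistic function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of the features, passed through the logistic function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_label(features: list[float], weights: list[float], bias: float, cutoff: float = 0.5) -> int:
    """Round the probability to 1 (label present) or 0 (label absent)."""
    return 1 if predict_probability(features, weights, bias) >= cutoff else 0


weights, bias = [2.0, -1.0], -0.5  # hypothetical values, not fitted
print(predict_label([0.9, 0.2], weights, bias))  # 1
print(predict_label([0.1, 0.8], weights, bias))  # 0
```

Training is the part that finds good weights; once they exist, classifying a new video is just this weighted sum and a cutoff.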


Which is better?

Random forest and logistic regression are both commonly used for classification. Typically one or the other will give sufficiently good results, and both should be tested. In this case, the random forest algorithm gave good results for the single-label problem, while for multiple labels logistic regression was significantly better. This is because our single-label classifiers focused only on the most common labels, which the random forest algorithm handled well. But logistic regression outperformed the random forest models on the rarer labels tackled by the multi-label classifiers. The results are presented below:

Single Label

One of the most important ways of quantifying the performance of a classifier is called a confusion matrix. A confusion matrix shows the following:

  • True Positives: The number of positive predictions that were correct
  • False Positives: The number of positive predictions that were wrong
  • True Negatives: The number of negative predictions that were correct
  • False Negatives: The number of negative predictions that were wrong

Precision measures how many of the classifier's positive predictions were correct, while recall measures how many of the actual positives it managed to find. The F1 score is the harmonic mean of these two. (For more details on these, please refer to this article)

                      Actual Positive   Actual Negative
Predicted Positive    16,003            9,917
Predicted Negative    6,506             79,163

Precision: 0.6174

Recall: 0.711

F1 Score: 0.661
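The three figures above can be recomputed from the confusion matrix. The sketch assumes that the first row of the matrix holds the true positives and false positives and that the second row starts with the false negatives, which is the reading that reproduces the reported numbers.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of positive predictions that were right
    recall = tp / (tp + fn)     # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp=16_003, fp=9_917, fn=6_506)
print(f"{precision:.4f} {recall:.3f} {f1:.3f}")  # 0.6174 0.711 0.661
```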

Multi Label

The confusion matrix is difficult, if not impossible, to calculate for multi-label problems. However, there are other characteristics we can calculate. The most important of these is the Hamming loss. Hamming loss can be understood as the number of labels in the prediction that were incorrect divided by the number of labels in the correct list. For example, imagine we predict the labels (Dance, Guitar, Musician) and the correct labels were (Guitar, Musician, String instrument). In this case the number of incorrect labels is 2 (Dance should be removed, String instrument added), and the number of labels in the correct list is 3. Therefore the Hamming loss in this case is 2/3, or approximately 0.667. As with all loss measures, this should be minimised.
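The worked example above translates directly into set arithmetic. The first function follows the description given in this post. The second shows the more common convention, used for instance by scikit-learn's `hamming_loss`, of dividing by the total number of possible labels instead; the value 12 is an assumption based on the twelve labels this project focused on. Both function names are made up for this sketch.

```python
def hamming_loss_as_described(predicted: set[str], correct: set[str]) -> float:
    """Wrong labels (to remove or to add) divided by the size of the correct label list."""
    wrong = predicted ^ correct  # symmetric difference: in one set but not the other
    return len(wrong) / len(correct)


def hamming_loss_standard(predicted: set[str], correct: set[str], n_labels: int) -> float:
    """Wrong labels divided by the total number of possible labels."""
    return len(predicted ^ correct) / n_labels


predicted = {"Dance", "Guitar", "Musician"}
correct = {"Guitar", "Musician", "String instrument"}
print(round(hamming_loss_as_described(predicted, correct), 3))        # 0.667
print(round(hamming_loss_standard(predicted, correct, n_labels=12), 3))  # 0.167
```

The two conventions give different numbers for the same prediction, so it matters which one a reported Hamming loss refers to.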

Hamming Loss: 0.378

Precision: 0.818

Recall: 0.564

F1 Score: 0.668


Once acceptable results were obtained for the multi-label problem, the service was deployed using PredictionIO. This provides both REST and Python interfaces to query the model. Part of the deployment involved the creation of a Python script to convert a YouTube video to a TFRecord (using a tool provided by YouTube) and then convert that TFRecord to JSON for use in PredictionIO. Once this tool was created, a YouTube URL could be submitted, and the tags would be returned.
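Querying the deployed model over REST might look roughly like the sketch below. The host, the port 8000, and the `/queries.json` path reflect PredictionIO's usual defaults as far as I am aware, and the `features` field in the payload is purely hypothetical, since the query format depends on how the engine was written. The request is only built here, not sent.

```python
import json
import urllib.request


def build_query(features: list[float], host: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying one video's feature vector."""
    body = json.dumps({"features": features}).encode("utf-8")  # hypothetical payload shape
    return urllib.request.Request(
        url=f"{host}/queries.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query([0.8, 0.8, 0.3, 0.1, 0.2, 0.6])
print(request.get_method(), request.full_url)

# Sending it requires a running engine:
# with urllib.request.urlopen(request) as response:
#     tags = json.load(response)
```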


The final step was to make this service available throughout Jonah. A simple web service was set up (using Flask, a tool for creating Python web services) and hosted on an internal machine. The user interface, including the resulting labels, is shown below:


What we learned

Success! In spite of the incredibly limited feature set, we obtained a reasonable level of accuracy. A sequence of differing grey values was sufficient for the model to distinguish among labels. This raises an interesting question: would a richer feature set provide better classification accuracy? If I had been defining such a dataset, I would, at least, have included the red, green, and blue values individually. This would have tripled the size of our dataset, but how much benefit would it give? It certainly wouldn't be able to triple the accuracy – it can't, since we are already above 2/3 accuracy. Eventually, we would run into the problem of diminishing returns. We could include RGB, HSV, alpha channels, histograms, and region decomposition while not significantly improving the results. Richer and richer feature sets would provide increasingly little improvement.

There is an interesting implication to this fact: when designing a dataset for ourselves (rather than using a prebuilt dataset) we should begin with the simplest, smallest feature set that is in some way representative of the data. Only when satisfactory results cannot be obtained with this feature set should more features be added, and then, as few features as possible. My instinct going into this was that a richer feature set is always better, but that instinct has been challenged. Choosing the correct algorithm had a much greater effect on the quality of prediction – the random forest algorithm performed quite poorly on the multi-label problem, while logistic regression was far more effective.

This last point makes an interesting final argument in favour of limited datasets. If a richer dataset had been provided from the beginning, the results of using the random forest algorithm may have been sufficient, and logistic regression may never have been tested. By limiting your feature set, it is likely that differences in the effectiveness of tested algorithms will show more clearly.

Future Work

We have identified several areas for future research. The most interesting of these is to try deep learning, a different paradigm than the traditional machine learning algorithms employed here. Deep learning is commonly used in image and video processing, and is often capable of extracting patterns that traditional machine learning can’t. Fortunately, this will not require any further processing of the data, since deep learning works best with vectors of values, which we already have.

Another interesting advance that needs to be researched is that of time-based tags. While some videos will have the same tags throughout, this is not always true. For example, in a concert it is possible that the Guitar tag would only occur during certain parts of the video, and equally that Dance would only match certain periods. The current implementation gives back one set of tags for the entire video. Being able to determine when tags occur can be very useful – for example, determining when certain events occur on traffic camera footage can help with future planning.


As with a magician's show, the results may seem remarkable and the means arcane, but we can assure ourselves that there is no magic happening here. Our conclusion should be that machine learning is a predictable, repeatable process that can take in very limited data and give back meaningful predictions about future data.

This is important. When you come across a problem such as this in business, a problem where you need to predict the properties of something – events or users, videos or songs – your first thought should be "We can use machine learning for this." Machine learning isn't perfect, of course, but it shouldn't be seen as magic. It is a monumentally powerful tool that can solve many problems.

At Jonah, we have successfully applied similar machine learning techniques to multiple business problems, and have built a strong foundation in this field. If you'd like to know more, we'd love to hear from you.

About Jonah Group

Jonah Group is a digital consultancy that designs and builds high-performance software applications for the enterprise. Our industry is constantly changing, so we help our clients keep pace by making them aware of the possibilities of digital technology as it relates to their business.
