AI Prediction Comparisons – Tabular Matrix and Gradient Boosting
Artificial Intelligence has been on the rise these past few years due to several forces, including increased data retention by organizations, access to better hardware (particularly the ability to compute on the GPU or on a grid of computers) and, finally, better software algorithms. In fact, so many options are now available that it becomes difficult to select the best one. Given the options and the criteria, which is the better choice? This paper dives into two algorithms for AI prediction on a relatively large set of data and describes the hardware and software used with each. We hope that by the end of this paper, the reader gets a sense of the strengths and weaknesses of the algorithms and can make a more informed decision when applying them to their own situation.
Rossmann Dataset
Kaggle competitions are very popular in the AI community. Sponsors post their particular problem, teams compete to solve it, and the winner usually receives a prize. The Rossmann dataset came from one such competition held a few years ago, focused on sales prediction. To give an idea of the size of the training data, there were 1,115 stores with three years of daily sales data, which gave us slightly more than a million data points. Further, each row after the preprocessing stage had about 90 columns. Finally, we note that 64-bit Excel 365 was able to handle the data size, but was barely functional: it took slightly more than 1GB of RAM just to load the data.
The goal of the competition was to predict sales for the next month and a half for these 1,115 stores. Accuracy of the predictions was measured with the Root Mean Square Percentage Error (RMSPE), which is just a formal way of asking how far the predictions were from the actual sales.
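For concreteness, the following is a minimal NumPy sketch of the RMSPE calculation; the function name and example numbers are purely illustrative and not taken from the competition data.

```python
import numpy as np

def rmspe(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Root Mean Square Percentage Error: square each row's percentage
    error, average them, and take the square root."""
    pct_error = (actual - predicted) / actual
    return float(np.sqrt(np.mean(pct_error ** 2)))

# Predictions that each miss by exactly 10% give an RMSPE of 0.10 (10%).
print(rmspe(np.array([100.0, 200.0, 300.0]),
            np.array([110.0, 180.0, 330.0])))  # 0.1
```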
Data Cleaning and Augmentation
This is a very critical step in the prediction process: the quality of the prediction is heavily dependent on the abundance and quality of data. This paper, however, is not focused on this step, and borrows the solution adopted by FastAI, which uses Python's Pandas library to do many of the transformations. Here are some of the highlights (a brief Pandas sketch follows the list):
- Join ancillary data, like weather and state names, to the initial training and test datasets.
- Expand date-time fields into categorical/continuous equivalents like Year, Month, Day, Day of Year, Is Month End, Is Year Start, etc. This is very important for trends.
- Fill in Null values with an appropriate stub.
- Fix errors and outliers.
- Limit number of unique categories on certain variables.
- Create and store relationships across rows, like running averages, time since the last event, etc. For example, we may want to know how many days have passed since the last school holiday. Note – we have to sort the rows for relationships like this to make sense.
- Particular to this dataset, omit rows where a store recorded zero sales on that day.
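To make a few of these steps concrete, here is a minimal Pandas sketch. The frame and column names (train_df, store_df, Store, Date, Sales, SchoolHoliday, CompetitionDistance) follow the Rossmann files, but the logic shown is a simplified illustration rather than the full FastAI preprocessing.

```python
import pandas as pd

# Join ancillary store-level data onto the daily training rows.
df = train_df.merge(store_df, on='Store', how='left')

# Expand the date into categorical/continuous parts.
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['DayOfYear'] = df['Date'].dt.dayofyear
df['IsMonthEnd'] = df['Date'].dt.is_month_end

# Fill nulls with a stub and drop zero-sales days.
df['CompetitionDistance'] = df['CompetitionDistance'].fillna(0)
df = df[df['Sales'] != 0]

# Relationships across rows only make sense once the rows are sorted.
df = df.sort_values(['Store', 'Date'])

def days_since(flag: pd.Series, dates: pd.Series) -> pd.Series:
    """Days elapsed since the most recent row where the flag was set."""
    last = dates.where(flag == 1).ffill()
    return (dates - last).dt.days

df['AfterSchoolHoliday'] = (
    df.groupby('Store', group_keys=False)
      .apply(lambda g: days_since(g['SchoolHoliday'], g['Date']))
)
```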
The resulting cleaned and augmented dataset is what we will use across all of our subsequent algorithms.
Technique 1 – FastAI Tabular Predictions
FastAI is a software package that acts as an abstraction over PyTorch. Essentially, one can solve AI problems on the GPU (NVIDIA CUDA) by deconstructing the problem into tensors (or matrices). The approach offers a speed advantage: even particularly large datasets can be computed on quickly, with a tendency towards rather accurate results.
The Rossmann problem has already been solved by FastAI, and the solution can be found in the FastAI course notebooks. The most important part of the model is the learner definition.
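The original notebook code is not reproduced here, so what follows is a minimal sketch in the style of FastAI's tabular API. The column lists, layer sizes, batch size, learning rate and epoch count are assumptions made for illustration; only the overall shape (a tabular learner with embeddings and the exp_rmspe metric) reflects the published solution.

```python
from fastai.tabular.all import *

# df is the cleaned, augmented dataframe; train_idx / valid_idx come from the
# date-based split described below. The column lists here are a small
# hypothetical subset of the ~90 available columns.
cat_names = ['Store', 'DayOfWeek', 'StateHoliday', 'PromoInterval']
cont_names = ['CompetitionDistance', 'Max_TemperatureC', 'trend']

procs = [Categorify, FillMissing, Normalize]
to = TabularPandas(df, procs=procs, cat_names=cat_names, cont_names=cont_names,
                   y_names='Sales', splits=(train_idx, valid_idx))
dls = to.dataloaders(bs=512)

# A tabular model with embeddings for the categorical variables and two large
# hidden layers (plus the output layer); exp_rmspe is the competition metric.
learn = tabular_learner(dls, layers=[1000, 500], metrics=exp_rmspe)
learn.fit_one_cycle(5, 1e-3)
```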
Essentially, it states that a tabular model with three fairly large layers is used. It relies on the notion of embeddings, which roughly allow us to model categorical variables as learned numeric vectors, and it reports the Root Mean Square Percentage Error required by the competition.
Any prediction algorithm requires that the data be divided into two sets: training and validation (note that the validation set is different from the test dataset). The idea is to train the model using the training dataset and then validate it against the validation set. Since the validation set contains the actual values, the model can be adjusted until its predictions converge to those actual values. Once the model has been trained and validated, it is run against the test dataset to see whether it generalizes to unseen data.
FastAI splits the training data into training and validation sets by taking the size of the test dataset, setting aside that many of the first rows of the training data, and then adding a few more rows until the date changes. This makes sense: since we are trying to predict the future, picking the most recent dates for validation mirrors that task. We note that the split between training and validation sets is usually 80/20 or 70/30; this split is a little unusual in that it uses only about 5% of the data for validation.
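A rough Pandas sketch of that split is shown below. It assumes the cleaned training frame train_df is sorted by Date in descending order (most recent rows first) with a default RangeIndex, and that test_df holds the competition's test rows; the variable names are assumptions.

```python
n = len(test_df)                              # validation set ~ size of the test set
boundary_date = train_df['Date'].iloc[n - 1]  # date reached after the first n rows

# Extend the cut past n until the date changes, so a single calendar day never
# straddles the training and validation sets.
in_valid = (train_df.index < n) | (train_df['Date'] == boundary_date).to_numpy()

valid_idx = list(train_df.index[in_valid])
train_idx = list(train_df.index[~in_valid])

print(f"{len(valid_idx) / len(train_df):.1%} of the rows used for validation")
```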
In this paper, we adopt the solution from FastAI and run it on an NVIDIA Titan X (Pascal), which has 3,584 cores at 1.5GHz and 12GB of GDDR5X memory. This is quite a powerful card for AI programming. The online notebook took 11:27 minutes to run; on the Titan X, it took 8:41 minutes. Also note that even while the model is learning, it consumes only about half a gigabyte of GPU memory and utilizes only 21% of the GPU.
Thus, you really don't need a very high-end graphics card to leverage GPU computing for AI algorithms; a mainstream GPU should do the job, and fairly quickly.
Our final result is a Root Mean Square Percentage Error of 10.31%. The number 1 entry on Kaggle scored 10.02%, so this is very good!
We do note one limitation: we currently have no way of knowing which columns in our spreadsheet were the main predictors of sales. The statistical approach we look at next speaks to this.
Technique 2 – R Gradient Boosting Using H2O Package
In this second technique, we used Gradient Boosting, which has been gaining traction in recent years, particularly in the form of XGBoost. However, we used R's H2O package, for its simplicity, to achieve the same result.
Since we already enhanced our data in Technique 1, we do not repeat that work here; we simply load the cleaned data into R. Rather than writing our own solution from scratch, we borrowed from an initial RPubs solution by Saleem. The following are the notable points:
- We initialize the H2O cluster with unlimited threads and a maximum memory size of 8GB. This is very critical, as otherwise these models would take too much memory and end up not being useful.
- GBM takes quite a few parameters, which can impact the results. The GBM model we used in our test is different from the one presented in that solution; a sketch of the setup follows this list.
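The R code itself is not reproduced here; the sketch below expresses the same idea through H2O's Python API instead (the paper used the R package). The hyperparameter values are placeholders rather than the ones actually used, and train_split / valid_split stand in for the frames produced by the split from Technique 1.

```python
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Start a local H2O cluster with unlimited threads and an 8 GB memory cap,
# mirroring the initialization described in the first bullet above.
h2o.init(nthreads=-1, max_mem_size="8G")

train_hf = h2o.H2OFrame(train_split)
valid_hf = h2o.H2OFrame(valid_split)

predictors = [c for c in train_hf.columns if c != 'Sales']

# Placeholder hyperparameters -- not the values used in the paper's model.
gbm = H2OGradientBoostingEstimator(ntrees=300, max_depth=10,
                                   learn_rate=0.05, seed=42)
gbm.train(x=predictors, y='Sales',
          training_frame=train_hf, validation_frame=valid_hf)
```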
In particular, we sliced our training and validation datasets the same way we did in Technique 1. This ensures that both models are learning on similar datasets.
The GBM algorithm was run on a Core i5-6200U (a 6th-generation laptop processor) with a maximum frequency of 2.4GHz. The system had 16GB of RAM, but only 8GB was required, as stated above. We note that H2O can also be run on a cluster of computers, which can significantly reduce the time it takes to build the model.
The resulting RMSPE of our model is 11.42%, slightly higher than the 10.31% we obtained from Technique 1, but still fairly close. Using this technique, we also obtain the most important and least important variables, which is extremely useful for planning purposes. We also note that the model took slightly less than an hour to build.
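Continuing the sketch above, the variable ranking can be read straight off the trained model; H2O's R package exposes the same information through h2o.varimp.

```python
# Variable importance from the trained GBM: one row per predictor with its
# relative, scaled, and percentage importance.
importance = gbm.varimp(use_pandas=True)
print(importance.head(10))   # strongest predictors
print(importance.tail(10))   # weakest predictors

# Validation-set performance metrics reported by H2O.
print(gbm.model_performance(valid=True))
```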
Prediction Differences
So far, we have only built the models; we haven't really tested them against unknown data. We expect that when we run the test data through the models, the accuracy of the predictions should be more or less the same. In this section, we obtain the predictions and then compare them to see how similar they are.
We line up the predictions from both models and compare them in Excel.
The Root Mean Square Percentage Error between the two sets of predictions is 10.93%, so our predictions are indeed different, but within the range of our expectations. A few other metrics worth looking at (a short sketch of the comparison follows this list):
- 78.8% of the data had a difference of less than 10%.
- 0.7% of the data had a difference over 100%, with only 13 of the 41,088 entries over 200% and only 1 entry over 300%.
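The comparison is straightforward to reproduce outside of Excel. In the minimal sketch below, preds_fastai and preds_gbm are assumed to be the two models' test-set predictions lined up row by row, with the FastAI output treated as the baseline for the percentage differences.

```python
import numpy as np

# Relative difference between the two models, row by row (41,088 test rows).
rel_diff = np.abs(preds_gbm - preds_fastai) / preds_fastai
rmspe_between = np.sqrt(np.mean(((preds_gbm - preds_fastai) / preds_fastai) ** 2))

print(f"RMSPE between the two models: {rmspe_between:.2%}")
print(f"Rows within 10%:  {np.mean(rel_diff < 0.10):.1%}")
print(f"Rows over 100%:   {np.mean(rel_diff > 1.00):.1%}")
print(f"Rows over 200%:   {int((rel_diff > 2.00).sum())}")
print(f"Rows over 300%:   {int((rel_diff > 3.00).sum())}")
```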
Thus, all things considered, the two sets of predictions agree quite well. With some tweaking of the GBM parameters we could probably reduce this difference further, but it is quite good as it stands.
Conclusion
In this paper, we ran two distinct AI prediction models on a fairly large sales dataset, summarized as follows:
- FastAI Tabular Predictions: This ran on the GPU using a simple tabular model with the help of embeddings. The strength of this type of modeling is that the accuracy is extremely high and the model is extremely fast to build, so you can tune it fairly easily. However, this kind of model currently does not indicate which independent variables most heavily influence the prediction. This is a major limitation, especially if you are trying to figure out how to increase sales.
- R Gradient Boosting using the H2O package: This model used Gradient Boosting and ran on a laptop CPU. We saw that the accuracy went down just a little (but could probably be recovered through some more tuning) and that the model took longer to build. We noted that H2O's statistical methods can be applied across a cluster of computers to help build the models faster. Using GBM, we saw that it ranks the influence of the independent variables, which is extremely useful if one is trying to drive up sales.
Thus, if the actual prediction is the only thing you care about, using FastAI Tabular or a similar library is probably the way to go, as it is much faster to build, computationally less expensive to maintain and has a higher level of accuracy. A typical use case might look like "What is the sales forecast for the next year, so we can schedule our production supply accordingly?"
However, if you care more about the major drivers that influence the prediction, then using a statistical package like H2O GBM makes more sense. A use case for this might look like "We want to grow our business and open a new store; could this negatively impact our existing stores' sales?"