"The cloud is no longer relegated to handling ancillary jobs, but is quickly becoming the base for mission critical -- or even all -- enterprise IT operations, the head of Amazon Web Services said. The cloud is the new normal." - Andy Jassy, CEO - Amazon Web Services in 2014
Depending on an organization's size and the maturity of its cloud adoption, a comparison of cloud ML service providers and the associated feasibility analysis may be preceded by the question of cloud versus on-premises development. However, organizations are increasingly developing proofs of concept for cloud development and machine learning, with a specific focus on adopting the public cloud, so it seems relevant to discuss how much machine learning costs on each cloud platform.
In case you're only here for the numbers, I have provided the cost comparison first. A brief note on the lessons I learnt follows, for those who may want to adapt and replicate this for their own businesses.
The intention of this analysis is to simulate the same project on Amazon AWS, Microsoft Azure, and Google Cloud Platform. I imagined a scenario where the machine learning problem has been defined, the project duration and overheads have been estimated, and the compute and storage infrastructure has been decided. What follows is the financial exercise of estimating the cost of running the same model on all three of the major cloud providers in the market.
Project specs: We estimated the project with the following parameters:
The duration for the project is 1 year in total:
| ||AWS EC2||Azure Virtual Machines||Google Compute Engine|
|Virtual machine||GPU: 1x NVIDIA V100 Tensor Core; vCPU: 8; Mem: 61 GB; GPU mem: 16 GB; Storage: EBS-only; EBS bandwidth: 1.5 Gbps; Networking performance: Up to 10 Gigabit||GPU: 1x NVIDIA Tesla V100; vCPU: 6; Mem: 112 GB; GPU mem: 16 GB; Storage: 736 GB||GPU: 1x NVIDIA Tesla V100 (billed separately); vCPU: 8; Mem: 52 GB; GPU mem: 16 GB; Storage: 375 GB SSD|
|Image||Deep Learning AMI (Conda on Ubuntu)||Data Science Virtual Machine for Linux (Ubuntu)||Deep Learning VM Image|
|Storage||EBS Throughput Optimized HDD (st1) volumes, $0.045 per GB-month of provisioned storage||Temp storage included with Compute; Azure Managed Disk||Zonal persistent disk, $0.040 per GB-month|
|Notes|| ||Microsoft Azure price-matches AWS for similar computing capacity.||Calculation for hourly cost is $0.4736 (VM) + $2.48 (GPU) = $2.95. Google offers a 30% sustained use discount over 1 year, so the effective price is 0.70 × $2.95 = $2.07.|
Google automatically applies a 'sustained use' discount of 30% over the year. The prices do not reflect this.
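The Google hourly figure in the Notes row can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function and constant names are hypothetical, and the discount is modelled as a flat 30% applied to the combined rate, as the article does, which simplifies how a provider may actually compute it on an invoice.

```python
# Figures quoted in the Notes row of the comparison table.
VM_HOURLY_USD = 0.4736          # Compute Engine VM, per hour
GPU_HOURLY_USD = 2.48           # V100 GPU, billed separately, per hour
SUSTAINED_USE_DISCOUNT = 0.30   # flat 30%, as stated in the article


def effective_hourly_rate(vm: float, gpu: float, discount: float) -> float:
    """Return the combined hourly rate after a flat fractional discount."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return (vm + gpu) * (1.0 - discount)


list_rate = VM_HOURLY_USD + GPU_HOURLY_USD
discounted = effective_hourly_rate(
    VM_HOURLY_USD, GPU_HOURLY_USD, SUSTAINED_USE_DISCOUNT
)

print(f"list: ${list_rate:.2f}/h, discounted: ${discounted:.2f}/h")
# list: $2.95/h, discounted: $2.07/h
```

Keeping the VM and GPU rates as separate inputs mirrors the way Google bills the GPU separately from the machine.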
This specific virtual machine was selected based on its capability to handle high-performance computing and machine learning workloads and its compatibility with NVIDIA’s Tesla V100 GPU.
With virtual machines provisioned, the project team is able to immediately start working on the machine learning model because the machine images provided by the above vendors come with the data and development tools they need.
Below are the costs of doing the project for 1 year based on the project specifications above.
|Model training costs||Monthly||Annual||Monthly||Annual||Monthly||Annual|
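A monthly and annual figure of this kind can be derived from an hourly compute rate and a per-GB storage price. The following is a minimal sketch, not the calculation used for the table: the 730 hours per month (continuous use) and the 500 GB of storage are illustrative assumptions rather than the project's actual parameters, and the function name is hypothetical. The $2.07 hourly rate and $0.040 per GB-month are the Google figures quoted in the specification table.

```python
HOURS_PER_MONTH = 730  # assumption: continuous use, 8760 hours / 12 months


def project_costs(
    hourly_rate: float,
    storage_gb: float,
    storage_price_gb_month: float,
    hours_per_month: float = HOURS_PER_MONTH,
) -> dict:
    """Project monthly and annual cost from compute and storage prices."""
    compute = hourly_rate * hours_per_month
    storage = storage_gb * storage_price_gb_month
    monthly = compute + storage
    return {"monthly": monthly, "annual": monthly * 12}


# Google figures from the specification table; 500 GB is a made-up volume size.
google = project_costs(
    hourly_rate=2.07,
    storage_gb=500,
    storage_price_gb_month=0.040,
)

print(f"monthly: ${google['monthly']:,.2f}, annual: ${google['annual']:,.2f}")
```

Swapping in another provider's hourly rate and storage price gives a like-for-like comparison, provided the same usage assumptions are applied to each.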
As has been the case for many years, Google undercuts the competition at every level and comes out as the cheapest option for most organizations, whereas AWS and Microsoft have kept their pricing constant.
As we will see below, comparing machine learning development across platforms comes down to more than just the price. The business leader must answer a few questions and work within certain constraints.
The biggest problem I faced in this analysis was the disparity in terminology and offerings across the three platforms compared above. Service providers vary in how they bill their customers, offer discounts, operate, and handle processes internally and with other providers.
This article notes that the majority of organizations have a multi-cloud strategy. Despite that, more often than not the choice of machine learning cloud platform comes down to the ecosystem the organization already works in. While at Jonah we are platform-agnostic and constantly explore proof-of-concept options on all cloud services, most organizations will have some degree of lock-in with their vendors. Although it helps to occasionally peek over the ecosystem wall and look at the market, doing so is neither advisable nor reasonable every time a new decision is needed.
As Dave Bartoletti of Forrester tweeted, "Don't pick a cloud before you pick a strategy."
Jonah Group is a digital consultancy that designs and builds high-performance software applications for the enterprise. Our industry is constantly changing, so we help our clients keep pace by making them aware of the possibilities of digital technology as it relates to their business.