On-Premises vs Cloud: Insights from Our Recent AI Project

Johnny Chan
4 min read · Jun 14, 2024


Image from Pixabay by BrianPenny

As a cloud and AI service provider, we offer a range of services, from managing cloud resources to building and deploying AI applications for our clients. Today, I’d like to share some valuable insights we gained from our recent project: updating a machine learning (ML) model for a client on their on-premises servers.

Background

We’re a young startup with a growing base of loyal customers who rely on our services to regularly update and optimize their AI/ML models. Our recent project involved using the latest 2024 data to fine-tune a model for a long-established industry leader. This client primarily uses on-premises servers for all internal applications and wanted to leverage their existing infrastructure to build AI applications.

The Importance of Scoping

In any external project, proper planning and scoping are crucial. Two common pricing models in our industry are Time and Material (T&M) and fixed-price contracts. Fixed-price contracts provide certainty for the client, while T&M contracts offer flexibility for the service provider. For new projects with unknown variables, T&M is often the safer choice. For recurring projects, a fixed-price contract can be more appealing to clients. In our case, despite the project being recurring, we opted for T&M due to the uncertainties surrounding the client’s on-premises server.

On-Premises vs Cloud

AI projects, unlike traditional software development, demand robust infrastructure because training, deployment, and inference are all computationally heavy. A poor initial setup can delay a project, a lesson we learned with this client who, despite being an industry leader, was running somewhat outdated server technology. That situation made us hesitant to offer a fixed-price contract, since we had limited control over their on-premises servers. Scaling up on-premises servers also often requires approval from multiple levels of management, which can introduce unaccounted-for time and delays, and this further reinforced our decision. Had we been working on cloud infrastructure instead, we could have offered a fixed-price contract with far more confidence. The cloud’s scalability, flexibility, and control over the environment significantly reduce the time and complexity of scaling resources, which makes project timelines and costs more predictable.

Why T&M was the Right Choice

We initially estimated 80 service hours for the project. During the model training phase, however, we realized this was unachievable. Training a customized XGBoost model for a classification problem was taking far longer than expected because of the large hyperparameter search space. Even on a general-purpose VM with an 8-core CPU and 32 GB of RAM, training was slow.
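
To give a sense of why the estimate slipped, here is a minimal sketch of the kind of randomized hyperparameter search involved. The dataset, feature count, and parameter ranges below are illustrative assumptions rather than the client’s actual setup; the point is that even a modest search space multiplies into hundreds of CPU-bound training runs.

    # Illustrative sketch of an XGBoost hyperparameter search (not the client's code).
    from scipy.stats import randint, uniform
    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    # Synthetic stand-in for the client's 2024 training data.
    X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)

    param_distributions = {
        "n_estimators": randint(200, 1000),
        "max_depth": randint(3, 12),
        "learning_rate": uniform(0.01, 0.3),
        "subsample": uniform(0.6, 0.4),
        "colsample_bytree": uniform(0.6, 0.4),
    }

    search = RandomizedSearchCV(
        estimator=XGBClassifier(tree_method="hist", n_jobs=8),  # 8 cores, as on the original VM
        param_distributions=param_distributions,
        n_iter=200,   # 200 candidates x 5 folds = 1,000 full training runs
        cv=5,
        scoring="roc_auc",
        verbose=1,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)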

Scaling up the infrastructure involved several meetings and email exchanges, leading to additional service hours. Even after scaling up to a 12-core, 64GB RAM server, the speed was still insufficient. By the time we finished training all the models, we had exceeded our estimated hours.

During the deployment phase, we ran into network-related performance issues with the API. Troubleshooting and reconfiguring the servers required multiple meetings with the client’s IT team and a week-long wait for them to resolve the issues. In the end, the project took about 50% more hours to complete than initially estimated.
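
A quick timing probe like the sketch below is one way to confirm that the slowness sits in the network path rather than in model inference; the endpoint URL and payload here are placeholders, not the client’s actual API.

    # Rough latency probe for a prediction API (endpoint and payload are hypothetical).
    import statistics
    import time

    import requests

    URL = "http://onprem-host:8000/predict"   # placeholder internal endpoint
    payload = {"features": [0.1, 0.2, 0.3]}   # placeholder feature vector

    latencies = []
    for _ in range(100):
        start = time.perf_counter()
        resp = requests.post(URL, json=payload, timeout=10)
        resp.raise_for_status()
        latencies.append(time.perf_counter() - start)

    # Comparing these wall-clock numbers against the server-side inference time
    # (returned in the response or taken from server logs) shows what the network adds.
    print(f"p50: {statistics.median(latencies) * 1000:.1f} ms")
    print(f"p95: {statistics.quantiles(latencies, n=20)[18] * 1000:.1f} ms")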

Lessons Learned

  1. Consider the Client’s Infrastructure: In AI projects, it’s crucial to consider not just the model development but also the training environment and deployment. Factors such as resource constraints, maintenance, and security can impact the project timeline.
  2. Educate Clients on On-Premises vs Cloud Servers: Many clients prefer on-premises operations due to privacy concerns, so it’s important to educate them about the advantages of cloud operations. In our case, we demonstrated to the client that GPU-accelerated compute resources could cut the project time by at least half (see the sketch after this list).
  3. Set Small Milestones: Engage clients at every stage of the project by setting small milestones. Regular feedback can help address small issues before they become larger problems.
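
On the second point, the case for accelerated compute is straightforward to demonstrate. The sketch below assumes XGBoost 2.0+ and a CUDA-capable GPU (which their on-premises servers lacked, but which cloud accelerated-compute instances provide on demand) and simply times the same classifier on CPU and GPU.

    # Sketch: time the same classifier on CPU vs GPU (requires XGBoost >= 2.0 and a CUDA GPU).
    import time

    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    # Synthetic stand-in data, sized to make the difference visible.
    X, y = make_classification(n_samples=200_000, n_features=60, random_state=0)

    for device in ("cpu", "cuda"):
        model = XGBClassifier(
            n_estimators=500,
            max_depth=8,
            tree_method="hist",
            device=device,   # "cuda" moves tree construction onto the GPU
        )
        start = time.perf_counter()
        model.fit(X, y)
        print(f"{device}: {time.perf_counter() - start:.1f} s")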

Conclusion

In today’s rapidly evolving AI landscape, the limitations of on-premises servers are becoming increasingly apparent. Technology often advances faster than on-premises hardware can keep up, and investing in server hardware is a significant cost for a company. Cloud servers, on the other hand, offer a more flexible and cost-effective foundation for AI applications: they are easier to maintain, scale better, and can be updated to leverage the latest technologies without substantial hardware investments. For businesses looking to stay at the forefront of AI, cloud servers represent a compelling and practical alternative to on-premises solutions.

Written by Johnny Chan

Co-founder of Hazl AI, a one-stop platform for AI and cloud services. Visit us at hazl.ca
