How New AWS AI/ML and Big Data Offerings Help You Build a Better Business (and Kingdom) — Part 1
Never Be The Same
In the modern world, things change fast.
Twenty years ago, nobody could have imagined that in 2022 everything would be smart (even light bulbs), connected to the Internet (the watch on your wrist is more powerful than the supercomputers of the past), and so cloudy (most of the hard work happens not on your pocket-size device but somewhere else you don't need to worry about).
It was never a secret that using information technologies (IT) efficiently once required businesses and individuals to be Computer Science gurus. Back then, those gurus could only have dreamed of how convenient and streamlined IT would become: today, almost anyone can use it to build something great.
One of the most intriguing things of our day is Data Science, which combines the best of both worlds: human intelligence and computer performance. With Data Science, we can solve real-world problems more efficiently than with either alone, whether it's skin cancer detection via a phone camera or personalized recommendations for films and series on Netflix. The human mind's flexibility helps us avoid writing endless lines of code to account for all possible cases, as traditional programming demands. This is why we now talk about Artificial Intelligence (AI): machines whose outstanding performance lets them learn and solve problems faster than ever, a process known as Machine Learning or, simply, ML.
In recent years, these technologies have evolved to adapt to the new normal brought on by COVID-19. As part of this shift, Amazon Web Services (AWS) introduced myriad new offerings already actively used by Neurons Lab to help our Customers build a better business.
These offerings include:
- AWS Graviton — custom Arm-based processors with built-in acceleration for Big Data workloads.
- Amazon SageMaker Studio — the first fully integrated Data Science development environment (IDE) in the cloud.
- Amazon CodeGuru — an AI/ML-powered code analyzer that helps you write better applications faster.
In order for your business to grow, it needs solid ground underneath. This makes it vital for cloud service providers to continuously return to yesterday’s releases and improve them based on today’s needs. AWS surely knows how to do this right.
In this three-part blog series, we discuss what's new in AWS AI/ML and Big Data and how it can make your cloud journey more manageable, sustainable, reliable, performant, cost-efficient, and secure. In this part, we focus on the core services without which we can't imagine any Data Science endeavor, and how AWS made them even more powerful. In part two, we talk about managed services that help companies without AI/ML expertise or historical data start their journey and get value from day one. In part three, we turn to Big Data services that let you easily collect, quickly process, reliably store, and deeply analyze data, so that your AI/ML workloads achieve higher performance, accuracy, and revenue.
All Flavors of Amazon SageMaker
In the heart of AWS Data Science lies Amazon SageMaker — a family of comprehensive-yet-convenient services covering the entire AI/ML lifecycle (so-called Machine Learning Operations or ML Ops): model development, training, inference, monitoring, and continuous evolution.
In the first half of 2022, SageMaker received dozens of new features based on customer feedback.
Model Development — Amazon SageMaker Studio, Notebooks, and Data Wrangler
What’s the first thing that comes to your mind when somebody talks about Artificial Intelligence or Machine Learning?
It’s Python — a programming language that’s easy to use regardless of the complexity of the task you want to solve. Whether you want to explore historical data or design a cutting-edge neural network, Amazon SageMaker Studio and Notebooks are what you typically use to deal with Python in AWS.
To streamline model development, AWS now bundles both IDEs with the most recent version of JupyterLab (existing instances can also be updated in a few clicks). For the first time, this brings code auto-completion, syntax highlighting, and static analysis into the notebooks, so your scientists can deliver models of even greater quality.
What's more, SageMaker also supports R, the second most popular language in the Data Science world, with many tools for statistics and graphics available out of the box. And if you need a more personalized experience for R, you can now build your own RStudio image and bring it to SageMaker without losing the seamless interoperability with other cloud services, as if it were the default, AWS-served environment.
No matter the path you choose, the data you obtain must be specially prepared before it is used in AI/ML (in particular, cleaned of outlying, duplicate, and invalid records). This is exactly the job for Amazon SageMaker Data Wrangler, which offers a catalog of the most common operations you can apply to your data without any extra code. For example, the latest release added data sampling and splitting, a gold-standard way to reduce bias in the model's knowledge and, hence, increase its accuracy.
If you need something not yet available in the default bundle, you can build it yourself with familiar tools like Pandas and PySpark. To help you here, SageMaker offers a rich library of code snippets that you can extend with your own findings. This new version also allows you to measure data quality and see how applied transformations affect it in real-time, without using any third-party tools.
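As a sketch of what such a custom Pandas transform might look like (the `amount` column and its valid range are hypothetical; in practice, the bounds would come from your domain knowledge or Data Wrangler's data-quality report):

```python
import pandas as pd

def clean_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical custom transform: drop duplicate and invalid rows."""
    df = df.drop_duplicates()
    # Treat negative or implausibly large amounts as invalid records
    return df[df["amount"].between(0, 1_000)]

raw = pd.DataFrame({"amount": [10, 12, 11, 11, -5, 10_000]})
cleaned = clean_transform(raw)
print(len(cleaned))  # 3 valid, unique rows remain
```

The same function body can be pasted into a Data Wrangler custom-transform step, which hands your snippet the current DataFrame and expects the transformed one back.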
With its focus on simple, visual, code-free editing, Data Wrangler suits both startup- and enterprise-grade workloads, from megabytes to terabytes (and even petabytes) of data, thanks to the new capability to provision instances of various sizes (up to 24xlarge). Once the data reaches the end of this processing pipeline, it can be ingested into other services in minutes (not days) thanks to new built-in integrations with, among others, Amazon SageMaker Feature Store, Training, and Autopilot.
Model Training with Amazon SageMaker Canvas, Data Wrangler, and JumpStart
When discussing novelties in model training, we couldn't leave out Amazon SageMaker Canvas. This is a point-and-click AI/ML editor, recently released as part of Studio, that allows anyone to start getting value from famous models (like Amazon Forecast's CNN-QR) without any prior Data Science experience or writing a single line of code. For instance, business analysts can apply it to quickly predict churn, optimize revenue, and plan inventory.
It can now also be used to onboard your own engineering team through interactive product tours. So don't be afraid if you're only starting the AI/ML journey and don't have any datasets yet: Canvas ships these playgrounds with sample datasets spanning various business cases (like predicting house prices or diabetic patient re-admission), so you can likely find one relevant to your domain.
The artifacts you get from Canvas can be easily integrated with the rest of your architecture, as Canvas received first-class support in AWS CloudFormation (and, hence, the AWS Cloud Development Kit, aka CDK). Finally, you can combine it with the new granular metrics and graphs in Amazon SageMaker Experiments to keep better control over model quality across its versions.
In our story, it's essential not to forget that the same Data Wrangler discussed in the previous section can now be connected to SageMaker Autopilot in a plug-and-play fashion. Autopilot, similar to Canvas, automatically derives the most efficient model from the data you supply. At the same time, you can deeply control the underlying process if needed, including the algorithms applied, their hyperparameters, and the metrics to be optimized.
If you need more and still don't want to write custom code (but want to retain the ability to do so if required), Amazon SageMaker JumpStart is for you. This is a curated collection of the most popular AI/ML models which can be trained, configured, and deployed in a few clicks. The list grows rapidly: just since this past summer, it has been extended with state-of-the-art implementations like LightGBM, CatBoost, AutoGluon-Tabular, and TabTransformer, all purposely tuned to uncover the full potential of AWS services, including distributed training that scales automatically to the data volume.
Suppose you need to control everything within the training pipeline at the lowest possible level, yet sacrificing security and stability by manually configuring the underlying machines for all those Dockers, Kubernetes, and Kubeflows is not an option. In that case, we recommend you look at Bottlerocket, a container-first operating system (OS) built by AWS with cloud best practices from day one. To better support AI/ML, it now lets the containers running on it use GPUs, which is especially beneficial for models demanding intense computation, like BERT (used, in particular, for predictive input) or MegaMolBART (devoted to drug-discovery tasks).
Model Inference, Monitoring, & Continuous Evolution — Amazon SageMaker Serverless Inference and JumpStart
Arguably the most significant advancement is the new capability to generate predictions from models (in both batch and real-time modes) without provisioning any servers, by utilizing Amazon SageMaker Serverless Inference.
This not only reduces the administrative burden and removes the need to plan your workloads ahead of time, but also tremendously simplifies cases where it’s hard to predict required resources at all, like in model testing. Here, as in other serverless offerings from AWS, you only pay for the capacity that is actually used.
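A serverless endpoint differs from a classic one only in its configuration: instead of an instance type and count, you declare memory and concurrency limits. A minimal sketch via boto3 (the model name is a placeholder for a model you have already registered in SageMaker):

```python
# Hypothetical serverless endpoint configuration; names are placeholders.
endpoint_config = {
    "EndpointConfigName": "my-serverless-config",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "my-trained-model",  # an existing SageMaker model
        "ServerlessConfig": {
            "MemorySizeInMB": 2048,  # chosen in 1 GB increments
            "MaxConcurrency": 5,     # concurrent invocations before throttling
        },
    }],
}
# import boto3
# client = boto3.client("sagemaker")
# client.create_endpoint_config(**endpoint_config)
# client.create_endpoint(EndpointName="my-endpoint",
#                        EndpointConfigName="my-serverless-config")
```

From there, invoking the endpoint looks identical to invoking an instance-backed one; AWS scales capacity (including down to zero) behind the scenes.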
Last but not least, it is not enough to deploy a model once and forget about it for eternity. AI/ML models represent the real world, which is not static in any sense, so they should be continuously monitored and regularly updated to reflect changes in the data behind them. Because model training usually takes a significant amount of time, businesses tend to run such updates rarely, leading to degraded model accuracy, user experience, and, ultimately, profitability.
However, you no longer need to limit yourself in this way, as SageMaker JumpStart now allows you to incrementally add new knowledge to existing models without retraining them from scratch. After all, we don't relearn how to count from 1 to 10 when starting statistics in school, do we?
This is accompanied by a new automatic tuning feature that supersedes the long, tedious, and error-prone process of manually determining optimal hyperparameters. As a result, you get better AI/ML models while spending fewer hours of your data scientists' time and, thereby, fewer of your dollars.
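The idea behind automatic tuning can be illustrated with a toy search over a synthetic validation surface. SageMaker uses smarter strategies (such as Bayesian optimization) under the hood; the scoring function below is purely illustrative, with its optimum planted at learning_rate=0.1, depth=6:

```python
import itertools

def validation_score(learning_rate, depth):
    # Stand-in for a real training + validation run: a synthetic surface
    # whose best point is learning_rate=0.1, depth=6
    return -((learning_rate - 0.1) ** 2) - 0.01 * (depth - 6) ** 2

# Automate the search instead of hand-picking hyperparameters one by one
grid = itertools.product([0.01, 0.05, 0.1, 0.2], [2, 4, 6, 8])
best = max(grid, key=lambda p: validation_score(*p))
print(best)  # (0.1, 6)
```

The service does the same at scale: it proposes hyperparameter combinations, launches training jobs, compares their validation metrics, and keeps the winner, all without a human in the loop.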
In The Next Episode
In part one of our blog series, we have shown how you can embrace the power of Amazon SageMaker in your Machine Learning Operations and deliver more value for a lower price. Whether you are a physiotherapy app for healthier posture (like iPlena) or a smart medical billing coding assistant (like SmartNoteMD), nothing in Data Science exists that SageMaker can’t help you with.
In the second part, we will extend this notion with AWS-managed AI/ML, enabling businesses of any size — from young startups to large enterprises — to get robust, production-ready models without cramming all those AUC-ROC, cross-entropy, and other complex Data Science terms (they sound like something from Tolkien's books, don't they?).
Why wait for the next part to start your cloud journey?
Contact Neurons Lab and we will show you how AWS and AI/ML can benefit your business today. As an AWS Advanced Tier Services Partner with 3 AWS Service Validations and 20+ successful AWS Customer Launches in AI/ML and Big Data, we are here for your success.