How New AWS AI/ML and Big Data Offerings Help You Build a Better Business (and Kingdom) — Part 3
The Data, Or There And Back Again
We often think about AI/ML as a magic wand that you just wave and it will predict the future like a crystal ball. Unfortunately, while the cutting-edge offerings from Amazon Web Services (AWS) aim to deliver the magic wand (the leitmotif of this three-part blog series), the crystal ball is still in the backlog.
No Data Science algorithms can exist in a vacuum without the data. In the past, it was always required to train the model before it could understand your subject field and make any viable predictions (like custom models on top of Amazon SageMaker, to which our first blog is devoted).
Today, you often don’t need to train the model, particularly when using managed AI/ML services like Amazon Textract, as discussed in the second blog. Even then, however, it’s still crucial to measure its efficiency based on user feedback so you can quickly adapt to (and outperform) a continuously changing real world.
This is the topic of this final part — how new Big Data services and enhancements to existing ones, released by AWS in 2022, make it easier to collect, process, store, and analyze the data. All to enable you to use Big Data more efficiently for greater Customer satisfaction and, hence, profit.
Often, we only bother with what happens with data when it’s already in the cloud but not how it appears there. Perhaps lovely leprechauns carry data in their little bags over the rainbow and put it into your Amazon S3 buckets while you are sleeping? Nobody really knows the truth.
In reality, the data isn’t mana which you obtain and replenish automatically. If you are starting work on a new product from scratch (especially in a domain in which you haven’t done a deep dive before), you often don’t have any data collected ahead of time. However, this doesn’t mean that you should wait until Ragnarök for your disks to be magically filled with terabytes of data to start getting value.
Traditionally, this issue is resolved in two ways. The first is to scrape data from all over the Internet, which results in poor quality, as every source has its own level of accuracy, completeness, and trustworthiness. The second is to buy data from companies that specialize in it (like Foursquare), which leads to huge expenditures (e.g., a dataset with all US business locations may cost you $100,000 for one year of use) and keeps small businesses and startups from delivering their products to the market in a timely manner.
AWS Data Exchange And Open Data
To break this state of affairs, Amazon integrated its Open Data initiative into AWS Data Exchange, enabling you to easily use a plethora of datasets in your AI/ML workloads without building any ingestion pipelines or studying rocket science. You can just find whatever you need (like COVID-19 statistics or US sociodemographics), press one button in the console, and that’s all: a copy of the data is then in the storage repository you prefer (like Amazon S3 or Redshift). As these datasets are composed and curated by AWS and Community experts, in most cases you can use them as-is, without any extra preparation. And, like all the best things in life, open datasets are free.
If you have historical data but it’s stored outside the cloud in some traditionally hard-to-integrate systems (like Salesforce, SAP, or Google Analytics), we recommend that you think about Amazon AppFlow. This service enables you to establish data ingestion from these complicated sources in a few clicks without writing any custom code.
Starting in June 2022, Amazon also began offering built-in connectors for popular marketing and product analytics platforms such as Facebook Ads, Google Ads, and Mixpanel. Despite its simplicity, AppFlow fits the needs of both small businesses and large enterprises, as it can ingest data at any velocity and scale, from megabytes to terabytes.
AWS DMS And DataSync
If data is already in the cloud, you may need to expose it to an OLAP system or to some other Data Science-specialized storage, which should be able to process many records in a single query while imposing more relaxed latency constraints than OLTP. If any of these cases applies to you, take a look at AWS DMS and DataSync.
The first works best if the data in your source is structured or semi-structured (stored in relational or document-oriented databases, respectively). The second is best if you need to handle unstructured data (like raw images or a bunch of XLSX files without a common format).
AWS DMS now natively supports Babelfish for Amazon Aurora, which allows you to use a PostgreSQL database as if it were Microsoft SQL Server. This way, you can achieve the performance and price benefits of the former while introducing little-to-no changes in the code you may have been building for years around the latter.
AWS DataSync, in its turn, can now copy data to and from Google Cloud Storage and Azure Files, and therefore benefit you if some of your services are running in clouds other than AWS, like GCP or Azure.
Amazon SageMaker Ground Truth
If you think none of the above approaches could help you obtain the desired data, you can instead ask Amazon SageMaker Ground Truth to synthesize said data. At release, this feature supports the generation of images with other media types (like text, audio, and video) expected to come later this year. You only need to specify the generator parameters (like the number of images, their dimensions, and objects depicted in them), and, in seconds, you’ll receive a unique dataset.
Ground Truth not only produces images that look like real ones but also automatically labels them (marks the objects and their positions, which the model should learn to recognize), so they can be used in model development, training, and monitoring as-is, without any extra preparations.
You can give your own images to SageMaker to be used as a baseline during synthesis and it will try to reproduce, in generated pictures, the same patterns as in the examples. SageMaker wins the game when you have some historical data that you really need to use (e.g., to make the model reflect your business’ uniqueness), but the volume of data is not enough to achieve acceptable accuracy.
To further improve the results, SageMaker can add scratches, dents, and other alterations, which help the inference algorithm handle defects better. Let’s imagine we have a frame with a soccer player who likely scored a goal. The frame contains broken pixels near the ball making it harder to detect its position, but a model trained in such a way will be able to analyze the frame accurately (and inspire roars of “hooray!”). After all, we rarely have ideal photos in the real world, right?
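To make the idea of deliberately damaged training images more tangible, here is a minimal sketch in plain Python. It is not how SageMaker Ground Truth works internally (that is not something we can show here); the function name `add_broken_pixels` and the list-of-rows image format are assumptions made purely for illustration.

```python
import random


def add_broken_pixels(image, fraction, seed=0):
    """Return a copy of a grayscale image (list of rows) with some pixels set to 0.

    Hypothetical helper for illustration only: it mimics "broken pixels"
    so that a model trained on the result sees imperfect frames.
    """
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    height, width = len(image), len(image[0])
    corrupted = [row[:] for row in image]  # never mutate the original frame
    coords = [(y, x) for y in range(height) for x in range(width)]
    n_broken = int(height * width * fraction)
    for y, x in rng.sample(coords, n_broken):
        corrupted[y][x] = 0  # a "dead" pixel
    return corrupted


# An 8x8 all-white frame with a quarter of its pixels knocked out.
frame = [[255] * 8 for _ in range(8)]
noisy_frame = add_broken_pixels(frame, fraction=0.25)
```

Training on both `frame` and `noisy_frame` style inputs is the general augmentation idea: the model learns that a few dead pixels near the ball do not change where the ball is.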
The Power of Serverless and Auto-Scaling
Data processing is all about going serverless. If you think that Big Data shouldn’t be about provisioning and patching all these big servers and spending a significant amount of time and money, you’re in the right place. This year, AWS did a great job adding serverless variants to many of its Big Data offerings.
The most notable example is Amazon EMR (formerly Amazon Elastic MapReduce), a comprehensive data processing service. EMR enables you to easily and quickly spin up tools found in almost every modern data-driven architecture, such as Apache Hadoop, Hive, and Hue. The new serverless deployment option, EMR Serverless, allows you to use the service even more efficiently, as you pay only for the resources you actually use. Additional benefits of this version include fine-grained scaling, resilience to Availability Zone failures, and multi-tenant application support.
Another great release is Amazon MSK Serverless, which can help you build near-infinitely scalable message queues and brokers without worrying about manually managing all the ZooKeepers.
AWS Glue, a serverless data integration service on top of Apache Spark, finally received long-anticipated auto-scaling support (in both batch and streaming versions). With this feature, you no longer need to spend endless hours guessing the most optimal capacity for your ETL (extract, transform, and load) pipelines, as Glue automatically sets up and tears down the compute resources depending on processing jobs’ complexity and the volume of data flowing through them.
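As a rough sketch of what turning this on can look like, the snippet below builds the arguments for a Glue `create_job` call. To the best of our knowledge, auto-scaling is switched on with the `--enable-auto-scaling` job parameter on Glue 3.0 or later, and `NumberOfWorkers` then acts as the upper bound; the job name, role ARN, and script path are placeholders we made up.

```python
def glue_job_with_auto_scaling(name, role_arn, script_location, max_workers):
    """Build keyword arguments for a Glue create_job call with auto-scaling on.

    Illustrative helper (not part of any AWS SDK). The result could be passed
    as boto3.client("glue").create_job(**job_definition).
    """
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "3.0",            # auto-scaling needs Glue 3.0 or later
        "WorkerType": "G.1X",
        "NumberOfWorkers": max_workers,  # ceiling; Glue scales below it as needed
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


# Placeholder values, for illustration only.
job_definition = glue_job_with_auto_scaling(
    name="nightly-orders-etl",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_location="s3://example-bucket/scripts/orders_etl.py",
    max_workers=20,
)
```

The point of the design is that you state only a ceiling rather than an exact capacity, so the guessing game moves from you to the service.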
Serverlessness and auto-scaling meet each other in the new On-Demand mode for Amazon Kinesis Data Streams, which is the best pick if you need to serve gigabytes of streaming data per second without planning for capacity or sacrificing reliability. As well, if you already use provisioned capacity mode, you can switch to the new option in a few clicks without a single second of downtime.
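If you prefer code to clicks, the same switch can be sketched with the Kinesis `UpdateStreamMode` API. The helper names and the stream ARN below are ours and purely illustrative; the client is passed in, so the snippet itself does not need AWS credentials to run.

```python
def on_demand_request(stream_arn):
    """Build the parameters for switching a stream to on-demand capacity mode."""
    return {
        "StreamARN": stream_arn,
        "StreamModeDetails": {"StreamMode": "ON_DEMAND"},
    }


def switch_to_on_demand(kinesis_client, stream_arn):
    """Apply the switch using a client such as boto3.client("kinesis")."""
    return kinesis_client.update_stream_mode(**on_demand_request(stream_arn))


# Placeholder ARN, for illustration only.
request = on_demand_request(
    "arn:aws:kinesis:us-east-1:123456789012:stream/example-clickstream"
)
```

With real credentials, `switch_to_on_demand(boto3.client("kinesis"), stream_arn)` would send this request.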
More Flexibility, More Performance
While making Big Data services more serverless and flexible, AWS hasn’t forgotten about making them more powerful and performant. The same Amazon EMR we discussed above now has the Result Fragment Caching feature, which automatically preserves the query results derived from static data so they can be used in future requests without re-calculations, achieving up to a 15x performance boost. These cached fragments can be shared across multiple EMR clusters (including serverless ones) and different queries, achieving even greater time and money savings.
If data decides to change one day, EMR is smart enough to detect this change and rebuild the cache so you’ll never receive stale results. EMR is also smart enough to gracefully handle heterogeneous queries where some data is static (e.g., fulfilled orders from the last year offloaded to Amazon S3) and other data is dynamic (e.g., orders from the last month that may be updated with a new address until delivered and which are stored in Amazon Redshift). The importance of such mixed queries will only grow as the data lake house design pattern is used more frequently (for details, see the “Analyze” section).
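The caching and invalidation behavior described above can be pictured with a toy model. This is emphatically not EMR’s implementation: the `FragmentCache` class and the idea of an explicit `data_version` number are our simplifying assumptions, used only to show why unchanged data gives a cache hit and changed data forces a recalculation.

```python
class FragmentCache:
    """Toy cache keyed by (query fragment, version of the underlying data)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, fragment, data_version, compute):
        key = (fragment, data_version)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        # Data changed (or first run): drop stale results for this fragment.
        self._store = {k: v for k, v in self._store.items() if k[0] != fragment}
        self._store[key] = compute()
        return self._store[key]


orders_last_year = [120, 80, 200]
cache = FragmentCache()
fragment = "sum(orders_last_year)"

first = cache.get_or_compute(fragment, 1, lambda: sum(orders_last_year))
second = cache.get_or_compute(fragment, 1, lambda: sum(orders_last_year))  # hit

orders_last_year.append(50)  # the "static" data changed after all
third = cache.get_or_compute(fragment, 2, lambda: sum(orders_last_year))   # recomputed
```

Here the second call is served from the cache, while the third one notices the new data version and never returns the stale total.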
You can extend the EMR cluster with the new Spark profiling plugin from AWS to accelerate it further and find other low-hanging optimization opportunities. This plugin performs dynamic code analysis powered by Amazon CodeGuru Profiler to give you real-time insight as to how much this or that code section costs (in $).
As well, if you use AWS Glue in your Big Data architecture, it now allows you to automatically decompress the input stream so you can ingest more data faster without writing custom code and instead saving priceless engineering hours.
Not All Data Who Wander Are Lost
Like some other rings, Big Data services also need one ring to rule them all, and this is AWS Step Functions: a first-in-class serverless orchestrator that allows you to seamlessly manage data processing workflows. Step Functions doesn’t come alone; it brings comprehensive exception handling and the ability to apply basic transformations to the data it carries, at no additional cost and in a code-free way.
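To give a feel for what exception handling and basic transformations look like in a workflow, here is a small state machine definition in Amazon States Language, assembled as a Python dictionary. The state names, the Lambda ARN, and the `$.rows` field are invented for the example; `Retry`, `Catch`, and `ResultSelector` are standard fields of a Task state.

```python
import json

# Hypothetical workflow: transform one batch, retry on failure, then give up.
state_machine = {
    "Comment": "Illustrative data processing workflow",
    "StartAt": "TransformBatch",
    "States": {
        "TransformBatch": {
            "Type": "Task",
            # Placeholder ARN of a function that does the actual work.
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform-batch",
            # Retry failed attempts with exponential backoff: 5s, 10s, 20s.
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            # Anything still failing after the retries goes to a failure state.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RecordFailure"}],
            # Code-free transformation: keep only the row count from the result.
            "ResultSelector": {"rowCount.$": "$.rows"},
            "Next": "Done",
        },
        "RecordFailure": {
            "Type": "Fail",
            "Error": "TransformFailed",
            "Cause": "Batch could not be transformed after retries",
        },
        "Done": {"Type": "Succeed"},
    },
}

definition = json.dumps(state_machine, indent=2)
```

The resulting `definition` string is what you would supply when creating the state machine, for example through the console or an SDK.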
In 2022 H1, Amazon extended Step Functions with improved observability features so you can better control the state of your data at any given point in time (this is especially beneficial for long-running tasks, which are allowed to last for up to one year). If something goes wrong, quickly find the root cause and resolve it (this is especially useful for parallelized workflows that are typically complex to debug).
Suppose you’re only starting your Data Science journey. In this case, you’re not obligated to master each of over 220 AWS services and 10,000 API actions before building your first data pipeline. Just open the new Serverless Workflows Collection and you’ll find myriad reusable constructs curated by AWS experts, repeatable patterns uncovered by Community members like you, and even full-scale apps from APN Partners for you as reference architecture examples.
How to collect and process the data is only part of the question. Like train passengers who want a breath of fresh air during a stopover, data that has gone through all these validations, transformations, and aggregations needs to be stored in some reliable and resilient place — in a place where it won’t be lost and from where it can be quickly passed into analytics or AI/ML whenever requested.
Amazon Aurora Serverless
Our first guest is Amazon Aurora Serverless, which allows you to run MySQL- and PostgreSQL-compatible databases without worrying about any resource planning. With independent storage and compute scaling, Aurora Serverless can automatically grow to hundreds of thousands of transactions in seconds and shrink back when the capacity is no longer needed, also in seconds. This power is available to you at one-tenth the cost of other commercial-grade offerings.
The recently released v2 closes all the gaps between serverless and managed versions of Aurora — the former now supports all the features of its older brother, such as Global Database, RDS Proxy, and read replicas. In turn, managed Aurora (also referred to as provisioned capacity mode) now allows you to push its characteristics beyond its limits by mixing both capacity types in a single cluster.
You can imagine the scenario of a flower shop, which has a steady, well-known load for most of the year with some extreme peaks during holidays like Valentine’s Day. You can greatly (up to 69%) reduce your costs for the predictable workloads by running provisioned Aurora and applying Reserved Instances.
However, to handle spikes without sacrificing the User Experience (and preventing them from buying these lovely white anemones), you can extend this cluster with a serverless capacity. Amazon Aurora Serverless scales in fine-grained increments to provide just the right amount of resources to fit your needs (which is another enhancement of v2).
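A hedged sketch of how such a mixed cluster could be described through the RDS API follows. The cluster and instance identifiers and the capacity range are placeholders; `ServerlessV2ScalingConfiguration` and the `db.serverless` instance class are, as far as we know, the pieces that add serverless capacity to an otherwise provisioned cluster.

```python
def serverless_v2_scaling(cluster_id, min_acu, max_acu):
    """Parameters for modify_db_cluster: the capacity range for serverless instances."""
    return {
        "DBClusterIdentifier": cluster_id,
        "ServerlessV2ScalingConfiguration": {
            "MinCapacity": min_acu,
            "MaxCapacity": max_acu,
        },
    }


def serverless_reader(cluster_id, instance_id, engine="aurora-postgresql"):
    """Parameters for create_db_instance: a serverless reader in an existing cluster."""
    return {
        "DBClusterIdentifier": cluster_id,
        "DBInstanceIdentifier": instance_id,
        "DBInstanceClass": "db.serverless",  # marks the instance as Serverless v2
        "Engine": engine,
    }


# Placeholder names for the flower shop example.
scaling = serverless_v2_scaling("flower-shop", min_acu=0.5, max_acu=16)
reader = serverless_reader("flower-shop", "flower-shop-valentine-reader")
```

With a client created via `boto3.client("rds")`, these would be passed as `modify_db_cluster(**scaling)` and `create_db_instance(**reader)`, leaving the provisioned instances in the cluster untouched.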
Both serverless and managed Aurora now support Zero Downtime Patching (ZDP) for PostgreSQL. This way, you can update your database engine to new minor versions without interrupting ongoing requests or stopping the acceptance of new ones, making it ideal for enterprise-grade workloads with strict Service Level Agreement (SLA) requirements.
Aurora has also achieved native compatibility with the PostgreSQL LO module to better manage large objects (LO). Let’s imagine that you are developing a new video game console and want to allow your users to select different avatars. One day, you notice a trend that users often upload the same images of popular game characters. To use storage more efficiently, you decide to check if the supplied image is already in your Amazon S3 bucket and if it is, just reference that without re-uploading. In addition, you want to delete such a file when all references to it are released. Both of these optimizations can be applied automatically by the LO module, which is why this update is notable.
Amazon RDS is a collection of managed relational databases that powers Aurora behind the scenes, meaning all the RDS enhancements inherently apply to Aurora too. At the same time, RDS offers you extra engines in addition to MySQL and PostgreSQL: MS SQL Server, MariaDB, and Oracle.
This spring, RDS also received significant enhancements. Notably, you can now provision read replicas in a cascaded fashion, up to three levels deep with up to five replicas per instance (or up to 155 replicas in total per source instance). Not only does this help you achieve up to 30x more read capacity, but it’s also an ideal instrument for multi-AZ and multi-region disaster recovery. This is possible because each replica can have its own independent automated backups, ensuring faster promotion to a source instance in case of failure.
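The total of 155 follows directly from the two limits: five replicas on the first level, five for each of those on the second, and five for each of those on the third. A quick check in Python (the function name is ours):

```python
def max_cascaded_replicas(per_instance=5, levels=3):
    """Total replicas when every instance has `per_instance` replicas, `levels` deep."""
    return sum(per_instance ** level for level in range(1, levels + 1))


# 5 + 25 + 125
total_replicas = max_cascaded_replicas()
```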
To better monitor and manage such distributed clusters, Amazon RDS Performance Insights now allows you to specify any time interval for metrics analysis. Using this, you can easily find any bottlenecks and inefficiencies in the database usage patterns and get rid of them without any external tools.
With Amazon DocumentDB, you can run MongoDB-compatible document-oriented data stores in a fully-managed manner, similar to how RDS works for relational databases. AWS didn’t pass over DocumentDB either, adding a portion of great upgrades.
DocumentDB now dynamically resizes its storage when you delete or insert data into the Mongo collections, so you only pay for space that is actually consumed — there is no charge for empty space.
To test how the new Mongo engine features, configuration changes, and code adjustments may affect your production workloads, you no longer need to recreate the database from backups (which can take extensive time, making you less agile). The new fast cloning capability solves this problem gracefully — you just specify the target cluster parameters (like how many instances you want and of which sizes), and in minutes (or even seconds), it can be fully operating and filled with the data you need.
As you can see from these two announcements, DocumentDB gets more and more features from its parent, RDS, with each release. The third feature is not an exception — DocumentDB now has its own version of Performance Insights, offering you the same fully-managed monitoring tool belt with which you are familiar from your RDS experience (and the previous section of our blog post).
Amazon ElastiCache and MemoryDB for Redis
If your app needs to serve dynamic data blazingly fast, even while serving millions of requests per second (like real-time dating apps such as Tinder), put it into Amazon ElastiCache or MemoryDB. Both offer you ultra-fast performance, equally fast auto-scaling, and fully-managed deployment options, with the former intended for transient data (like user sessions) and the latter for something more persistent (like user profiles). The other difference is latency (still very low in both variants): microseconds in ElastiCache and sub-millisecond in MemoryDB. Such outstanding results are achieved because all the data you store in them is always available in operating memory (even if backed up by a copy on a local drive for durability), eliminating inherently slow disk access.
Both services are compatible with Redis, a leader among open-source in-memory data stores. This means you can replace your custom and on-premises installations in a drop-in way and continue using the same tools and features you are familiar with. ElastiCache also supports Memcached, another leader in caching with a focus on simplicity; however, this is a story for another blog.
Some of the most remarkable parts of Redis are the built-in data structures such as hashes, lists, and sets that allow you to perform traditionally compute-intensive operations (like set algebra) right in the database layer, so more resources remain for your business logic. However, developers often overlook this because they are used to storing all objects simply as serialized strings.
To mitigate this, AWS introduced native JSON support in both ElastiCache and MemoryDB so Redis can automatically find the most optimal in-memory representation for supplied data and convert it to native structures. Your engineers can continue using the same code for caching as before, while benefiting from increased performance and reduced costs.
If you were already doing this but with the custom code, you can decrease its verbosity and improve reliability by migrating to this feature, as such cross-domain type conversions often require tons of boilerplate constructs and are error-prone. Moreover, you can now query the nested JSON fields without leaving Redis, which speeds up the queries as less data has to be sent over the network.
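The network saving from querying a nested field can be estimated without any Redis server at all. The sketch below simulates it in plain Python: `get_path` is a hypothetical stand-in for a server-side path lookup (roughly what a command like `JSON.GET key $.preferences.language` does), and the profile document is invented.

```python
import json


def get_path(document, dotted_path):
    """Follow a dotted path such as 'preferences.language' into nested dicts."""
    value = document
    for part in dotted_path.split("."):
        value = value[part]
    return value


profile = {
    "id": 42,
    "name": "Frodo",
    "preferences": {
        "language": "en",
        "interests": ["hiking", "rings", "second breakfast"],
    },
}

# Storing the object as one serialized string means shipping all of it.
whole_payload = json.dumps(profile)

# With native JSON support, only the requested field has to leave the server.
field_payload = json.dumps(get_path(profile, "preferences.language"))

bytes_saved = len(whole_payload) - len(field_payload)
```

For a tiny profile the difference is modest, but it grows with document size, which is where the speed-up mentioned above comes from.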
It is not uncommon that when data reaches this step… nothing happens, and everybody forgets about it for a while.
The key challenge of Data Science isn’t grabbing all the data in a single place, but doing something to extract value from it to make it meaningful. Just like how you need to transform the oil into gasoline for it to become gold, data should become actionable for you to obtain competitive advantages. This is why we put “collect,” “process,” “store,” and “analyze” under the same “data analytics” umbrella to reflect the critical importance of the “analyze” part.
When you know more, you can do more and do it quickly. This is the credo of Amazon Athena, built around the Community-beloved Trino engine (formerly known as PrestoSQL), which allows you to run queries in standard SQL against structured, semi-structured, and unstructured data wherever it may be stored.
As it is serverless, Athena doesn’t require you to set up any infrastructure to work — you just specify where the data is (e.g., lake in Amazon S3) and Athena does all the remaining hard work to provide the answers you demand. As well, you only pay for data that is actually scanned to fulfill your request at a democratized price of $5 per terabyte. This all makes Athena an ideal solution for ad-hoc queries, which data scientists should be able to run whenever needed without straining their time and your budget.
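Because the price depends only on the data scanned, the cost of a query is easy to estimate up front. The snippet below is a back-of-the-envelope sketch: it assumes one terabyte equals 10^12 bytes and ignores any per-query minimums or regional price differences.

```python
PRICE_PER_TB_USD = 5.0     # price quoted above
BYTES_PER_TB = 10 ** 12    # assumption: decimal terabytes


def athena_query_cost(bytes_scanned):
    """Estimated cost in USD of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Scanning a 200 GB table in full versus one 2 GB partition of it.
full_scan_cost = athena_query_cost(200 * 10 ** 9)
partition_cost = athena_query_cost(2 * 10 ** 9)
```

The hundredfold gap between the two numbers is the practical argument for partitioning and columnar formats: the less Athena has to read, the less you pay.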
To begin with, Athena received connectors for many new data sources, such as:
- SAP HANA (whose integration complexity we touched on a little in the “Collect” section).
- Microsoft SQL Server, so you can use both OLAP and OLTP data in a single SQL query without any preliminary transformations.
- Google BigQuery, enabling you to build data lakes spanning many clouds.
Athena also now natively works with the Amazon Ion format, which combines the ease of use of text encoding with the efficiency of binary encoding. Ion is well-suited for sparsely populated hierarchical data, like medical history records, which was complex to process with traditional SQL.
The most prominent challenge with data lakes is that they are by design append-only, meaning you can’t insert, update, or delete an arbitrary row in your data set. Even if you do this with storage-level hacks, there are no guarantees that somebody else won’t do the same in the middle of your transformation, rendering the data broken.
There are many solutions to this problem (like Apache Hudi), following the delta lake design pattern, and today Athena has support for one of them — Apache Iceberg. With this update, Athena can offer you fast mutating operations (CRUD is complete now) with ACID compliance (no more data races or race conditions).
No less importantly, this allows you to manage the data lineage via time-travel and schema evolution capabilities, so you can trace the whole history of the data from the day when it landed in the cloud until it’s deleted. This all spans beyond Athena borders and works with other Iceberg-compatible services like Amazon EMR. So, if you are altering data in any of them, you can be sure that all others will see your changes immediately and that these changes, if run concurrently, will never corrupt any single bit.
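Time travel is easier to grasp with a toy model. The class below is not Apache Iceberg’s table format; it is a deliberately tiny sketch, under our own assumptions, of the underlying idea: every change produces a new immutable snapshot, so any past state of the table stays readable.

```python
class SnapshotTable:
    """Toy table where each mutation creates a new snapshot instead of overwriting."""

    def __init__(self):
        self._snapshots = [{}]  # snapshot 0 is the empty table

    def _commit(self, rows):
        self._snapshots.append(rows)
        return len(self._snapshots) - 1  # id of the new snapshot

    def upsert(self, key, row):
        rows = dict(self._snapshots[-1])  # copy, so older snapshots stay intact
        rows[key] = row
        return self._commit(rows)

    def delete(self, key):
        rows = dict(self._snapshots[-1])
        rows.pop(key, None)
        return self._commit(rows)

    def read(self, snapshot_id=None):
        """Read the latest state, or the state as of a given snapshot."""
        index = -1 if snapshot_id is None else snapshot_id
        return dict(self._snapshots[index])


orders = SnapshotTable()
created = orders.upsert(1, {"address": "Bag End"})
moved = orders.upsert(1, {"address": "Rivendell"})
removed = orders.delete(1)

latest = orders.read()              # the order is gone now
as_created = orders.read(created)   # but its history is still there
```

Reading `as_created` after the delete is the toy equivalent of asking a table what it looked like on the day the data first landed.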
Another enhancement of this version is support for Apache Hive views, which was previously limited by differences between Athena’s SQL and HiveQL syntax. Views are used in almost every Business Intelligence (BI) solution, so this update is critical for those who prefer self-hosted Hive Metastores. As Trino runs the query, it consults the execution plan to identify referenced views and automatically unrolls them into matching SQL so you don’t need to write the same query fragment again and again.
For example, by joining Customer profiles with their orders, you can both build day-to-day sales reports and detect outlying or fraudulent transactions. Views are literally smart code snippets — they don’t have associated physical data and never increase your storage footprint. It is the best practice to use them whenever possible to make your queries simpler to read and maintain.
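Here is the same example in runnable form. We use SQLite from the Python standard library as a stand-in, since it needs no infrastructure; the table names and rows are invented, and Athena or Hive syntax may differ in details. One view joins customers with their orders, and two different questions reuse it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT, amount REAL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES
        (1, 1, '2022-06-01', 40.0),
        (2, 2, '2022-06-01', 60.0),
        (3, 1, '2022-06-02', 5000.0);

    -- The view stores no rows of its own; it is just a named query fragment.
    CREATE VIEW customer_orders AS
        SELECT c.name, o.order_date, o.amount
        FROM customers c JOIN orders o ON o.customer_id = c.id;
    """
)

# Question 1: a day-to-day sales report.
daily_sales = conn.execute(
    "SELECT order_date, SUM(amount) FROM customer_orders "
    "GROUP BY order_date ORDER BY order_date"
).fetchall()

# Question 2: suspiciously large transactions, reusing the same view.
outliers = conn.execute(
    "SELECT name, amount FROM customer_orders WHERE amount > 1000"
).fetchall()
```

Neither query repeats the join, which is exactly the readability and maintenance benefit described above.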
To granularly control this process and speed it up further, Athena now offers visual analysis and tuning tools that allow you to decompose any query into an operations graph that the engine runs behind the scenes, trace the data flow through it, and see precisely how much time each step takes.
Knowledge is power, but only if it’s continuous. When your workloads become structured and predictable enough, you may find it valuable to extend your data lake with a data warehouse.
Thus, you can promise (as an SLA) your business users that they will get all the answers exactly when they need them, not later, which is vital for mission-critical questions. For example, knowing how many goods remain in the physical (non-data) warehouse is critical for efficient inventory planning, and you usually don’t have the luxury of waiting a week to get this info; you have just 15 minutes at the end of the workday.
In this niche, the winner is Amazon Redshift, with which you can analyze exabytes of data across your operational stores, warehouses, and lakes using standard SQL. Redshift doesn’t replace a data lake you already use but is complementary. On top of this collaboration, you can implement a data lake house design pattern that combines the robustness and determinism of the former with the cost-efficiency and flexibility of the latter in a single bottle of good old wine.
New Amazon Redshift Serverless further advances this concept while allowing you to benefit from the same features as in a standard, managed Redshift. At the same time, you only pay the minimal possible amount and only for the capacity consumed on a per-second basis. Redshift Serverless includes (but is not limited to):
- Automated table design — a feature that continuously optimizes the physical layout of data in the background (without any actions needed from you) to accelerate the queries’ performance.
- Data sharing that enables multiple clusters, serverless and not, to use the same data without building nontrivial sync mechanisms (ideal for access management and Quality of Service).
- Rich customization capabilities that enable you to write User-Defined Functions (UDF) in any language for which AWS Lambda has built-in or custom runtime and call them right from SQL.
If you operate managed Redshift, AWS also has a few considerable enhancements for you. First, you can apply classic resize to your clusters in minutes, not hours, as before. This is especially beneficial if changes you’re going to introduce don’t fit elastic resize (e.g., modifications in node types). Moreover, you can perform resize operations in parallel with restoring your cluster from a backup.
Previously, you had to do this sequentially, which is inefficient, as each step’s results are never used but overwritten when the next one runs. For example, a cluster takes a whole day to be restored with the DC2 node type, as in the backed-up one, and then for another two days it still cannot be accessed as it is being transformed (resized) to RA3. Similarly, you can encrypt an unencrypted Redshift cluster or change its underlying AWS KMS key during both resize and restore.
Another feature we encourage you to consider is Automated Materialized Views or AutoMV (supported in both provisioned and serverless capacity modes). To better understand the idea, you could treat it as a combination of usual, virtual views (described in the Athena section) and Result Fragment Caching (from the EMR one). Materialized Views cache the results of the frequently used query pattern to avoid recalculating them on each request, as happens with virtual views.
Automated Materialized Views move this notion to the next level by applying AI/ML to automatically determine which query parts you use repeatedly and extract them into views. Similar to Result Fragment Caching, AutoMV can replace some or all steps in the query execution plan with generated views so you don’t need to manually reference them to get performance improvements. The key difference from the EMR implementation is that these views apply to both dynamic and static data, not just the latter.
Finally, yet no less importantly, the Amazon Redshift ODBC driver now follows an open-source delivery model, so you can have deeper visibility into the query execution flow and adjust it granularly if needed. Aside from this, all the Redshift drivers (ODBC, JDBC, and Python) can now natively apply binary encoding to the data when transferring it to and from the cluster, so you get a massive reduction in network bandwidth usage and, thereby, an increase in overall analytics performance.
Seeing is believing. No matter how cool your data lake house and SQL queries are, business users won’t trust their results until they can inspect them visually.
This is the purpose of Amazon QuickSight — to help non-technical users get deeper insight from analyzed data in the most convenient form. QuickSight enables you to build BI dashboards with rich analytics features — from traditional bar charts to AI/ML-powered anomaly detection. They can scale in seconds from a few to a few thousand consumers, and you can use them from any device with a browser, without any client software or server infrastructure. Nobody has ever doubted that AWS would continue its evolution and, these days, QuickSight gets updates consistently every two weeks.
In particular, QuickSight dashboards now support rolling date and time ranges. For example, the “orders over last 15 days” chart implies different start and end dates, depending on when it’s opened. Previously, authors had to regularly and manually update such visuals to provide viewers with the most up-to-date data, but with this release this can be done automatically.
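The logic behind a rolling range is simple enough to sketch in a few lines of Python. This is an illustration of the concept rather than anything QuickSight exposes; the function name and the explicit `today` parameter are our own choices.

```python
from datetime import date, timedelta


def rolling_range(days, today):
    """Start and end dates of a 'last N days' window, relative to `today`."""
    return today - timedelta(days=days), today


# The same "orders over last 15 days" chart opened on two different days.
opened_in_july = rolling_range(15, today=date(2022, 7, 20))
opened_in_august = rolling_range(15, today=date(2022, 8, 3))
```

The chart definition stays the same; only the day on which it is opened moves the window.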
In a similar fashion, window and aggregate functions received Level Aware Calculations (LAC) support, which allows you to specify a window to partition data by and a level to group it by. This helps to perform advanced calculations (e.g., the sum of sales per quarter, per region) right in QuickSight that were previously required to be pre-computed outside the service — for example, in the data lake house.
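To show what “a level to group by” means in practice, here is a plain Python sketch with invented sales rows. It is not QuickSight’s calculation syntax; it only mirrors the idea of aggregating at two different levels (region and quarter versus region alone) and combining the results.

```python
from collections import defaultdict

# (region, quarter, sales) rows, invented for illustration.
sales = [
    ("EU", "Q1", 100.0),
    ("EU", "Q2", 300.0),
    ("US", "Q1", 200.0),
    ("US", "Q1", 200.0),
]


def sum_by(rows, key):
    """Sum the sales column, grouped at the level given by `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# Level 1: sales per quarter, per region.
per_region_quarter = sum_by(sales, key=lambda row: (row[0], row[1]))

# Level 2: sales per region only.
per_region = sum_by(sales, key=lambda row: row[0])

# Combining levels: each quarter's share of its region's total.
share_of_region = {
    (region, quarter): total / per_region[region]
    for (region, quarter), total in per_region_quarter.items()
}
```

Computations like `share_of_region` are the kind that previously had to be prepared upstream, because they mix two grouping levels in one result.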
Amazon QuickSight Q is a feature that allows users to ask questions about their data in natural language and get answers in the form of AI/ML-generated visuals and narratives. This summer, AWS added an API to submit questions to Q programmatically, which is ideal if you want to embed QuickSight logic (without its interface) into your custom UI/UX to better match your branding and styling. Another quality-of-life improvement is the account management API, which is intended to reduce manual steps in provisioning QuickSight, whether you use CI/CD, CLI, or IaC automation.
As data analytics evolves, it becomes clearer that modern BI reports should be dynamic and interactive. Business users should be able to adjust most dashboard parameters (e.g. filtering and sorting criteria) so they can view the data precisely from the angle they want without any roundtrips to its authors. This self-service experience was further improved when AWS introduced the Bookmarks capability, allowing viewers to make a snapshot of such parameters and quickly reapply it later, saving inestimable hours to be invested into something more valuable rather than pressing the same buttons again and again.
So that you do not drown in all this abundance of new features, QuickSight now offers a Community Hub that has everything to deepen your BI mastery — community forums, a learning center, and an announcements newsfeed which are all easily accessible right from the service home screen.
The Road Most Traveled
In our three-part blog series, we’ve shown you that Data Science isn’t about confusing matrices, big servers, and colorful diagrams. It’s about finding the right tool to get the maximum value for your company from your data.
Our story started in part one with novelties in Amazon SageMaker and how they can make it a breeze for you to use even the most complex ML models and algorithms.
In part two, we went further and focused on how Customers without years of Data Science expertise and billions of dollars to hire experts can get value from AWS-managed AI.
In this final part, we went deeper to reveal the foundation under it all — data analytics and how serverless, performant, and resilient it became in 2022. We can’t wait to see which great products you will build using these new services and features. With all this knowledge in mind, it’s your turn now to roll the d20 and see which brilliant treasures (not mimics!) it will take you to.
There’s no better time to build your future in AWS than today, and we at Neurons Lab work hard to make your cloud journey a cakewalk. As AWS Advanced Tier experts in AI/ML and Big Data, we always welcome you to contact us to learn how you can apply Data Science to your business to help it live long and prosper.