The Story of Big Data on AWS


Amazon Web Services (AWS) is a subsidiary of Amazon that provides cloud computing services to both individuals and companies. While newer cloud providers such as Microsoft Azure and Google Cloud are growing at a faster rate, AWS still holds a commanding position at the top of the cloud provider market.

Big Data and AWS

AWS offers a wealth of different cloud services, ranging from infrastructure-as-a-service (IaaS) to data warehouses. Some of the most popular AWS offerings include:

  • Amazon S3, which allows enterprises to store and retrieve their data from anywhere on the web at any time. The S3 service is mainly used for backing up critical data, but it can also serve static web content or share files.
  • Amazon EC2, which provides compute capacity in the cloud. Enterprises can rent EC2 instances and run their applications on them.
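To make the S3 service above concrete, here is a minimal sketch using the AWS SDK for Python (boto3). The bucket name, application name, and key layout are hypothetical, and the actual upload call is commented out because it requires boto3 and configured AWS credentials; only the key-building helper runs as-is.

```python
from datetime import datetime, timezone

def backup_key(app: str, filename: str, when: datetime) -> str:
    # Date-partitioned key layout (a common convention, not an AWS requirement)
    # keeps backups easy to browse and easy to expire with S3 lifecycle rules.
    return f"backups/{app}/{when:%Y/%m/%d}/{filename}"

key = backup_key("billing", "db-dump.sql.gz",
                 datetime(2023, 5, 1, tzinfo=timezone.utc))

# To actually upload (requires boto3 and AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# s3.upload_file("db-dump.sql.gz", "my-backup-bucket", key)
```

Partitioning keys by date like this also pays off later, when query services such as Athena scan only the prefixes they need.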


When using a cloud storage provider for data backup or to run apps, disaster recovery is also imperative. Some enterprises think that their data and apps are ultimately safe from disaster in a system like AWS, but things can go wrong there too. AWS disaster recovery entails backing up those cloud workloads to another location.


We are now firmly in the age of the Big Data economy. Data-driven technologies and Big Data systems such as Hadoop facilitate the gathering, processing, and analysis of large stores of real-time data in a variety of formats from multiple sources. Cloud storage providers also have an important role to play in Big Data, and AWS provides several services that can integrate with Big Data, providing handling and storage of these datasets.


In this post, you’ll find out about the main AWS Big Data services, in addition to finding out some best practices for working with Big Data on AWS.


AWS & Big Data

AWS provides over 50 different services to help you build and deploy Big Data analytics applications. However, going through each of those services would take an age, so let’s get down to the nitty-gritty of what AWS offers for companies looking to benefit from Big Data.


Big Data Frameworks

Starting off with Big Data analysis, AWS offers its Amazon EMR service, which provides a managed cluster platform of EC2 cloud instances, making it easier to run jobs using Hadoop, Spark, or other Big Data frameworks. You can easily scale workloads up or down depending on data volume. Yelp, the popular user review platform, uses EMR jobs to process over 30 terabytes of data every day.
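EMR runs work as "steps" submitted to a cluster. The sketch below builds a Spark step definition in the shape boto3's EMR API expects; the step name, script location, and cluster ID are all placeholders, and the submission call is commented out since it needs credentials and a running cluster.

```python
def spark_step(name: str, script_s3_uri: str,
               action_on_failure: str = "CONTINUE") -> dict:
    # command-runner.jar is EMR's built-in launcher; here it invokes
    # spark-submit on the cluster with a PySpark script stored in S3.
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_uri],
        },
    }

step = spark_step("daily-clickstream", "s3://my-bucket/jobs/clickstream.py")

# To submit to a running cluster (requires boto3 and credentials):
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-XXXXXXXX", Steps=[step])
```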


You also have the option of using Amazon Athena, an interactive query service that works on data stored in Amazon S3 (Simple Storage Service). With Athena, you use standard SQL to query large-scale datasets stored in S3.
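As a sketch of what "standard SQL against S3" looks like in practice, the helper below builds a simple aggregation query; the table, column names, and database are hypothetical, and the Athena call itself is commented out because it requires credentials and an S3 results location.

```python
def athena_query(table: str, day: str) -> str:
    # A plain SQL aggregation; Athena scans the S3-backed table in place,
    # so partitioning by day keeps the amount of data scanned (and cost) down.
    return (
        f"SELECT status, COUNT(*) AS hits "
        f"FROM {table} "
        f"WHERE day = '{day}' "
        f"GROUP BY status"
    )

sql = athena_query("access_logs", "2023-05-01")

# To run it (requires boto3, credentials, and an S3 output location):
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=sql,
#     QueryExecutionContext={"Database": "weblogs"},
#     ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
# )
```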


Real-Time Analysis

AWS provides some powerful services for loading large volumes of streaming data into the cloud for real-time analytics, giving enterprises the chance to make rapid decisions that improve areas such as marketing and sales.


Amazon Kinesis Firehose lets you load streaming data into data stores and analytics tools such as S3 and Splunk. This enables near real-time analysis of Big Data. Kinesis Analytics, a related service, lets you easily analyze streaming data with standard SQL queries that run continuously.
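A minimal sketch of sending an event through Firehose: each record carries raw bytes, and newline-delimited JSON is a common (but not mandatory) convention so that downstream tools such as Athena can parse the files Firehose lands in S3. The stream name is hypothetical, and the delivery call is commented out since it needs credentials and an existing delivery stream.

```python
import json

def firehose_record(event: dict) -> dict:
    # Firehose delivers the Data bytes as-is; appending "\n" keeps records
    # separable once Firehose concatenates them into S3 objects.
    payload = json.dumps(event, separators=(",", ":")) + "\n"
    return {"Data": payload.encode("utf-8")}

record = firehose_record({"user": "u1", "action": "click"})

# To stream it (requires boto3, credentials, and a delivery stream):
# import boto3
# firehose = boto3.client("firehose")
# firehose.put_record(DeliveryStreamName="clickstream", Record=record)
```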


Big Data Storage

Amazon S3 gives you object storage that scales with your datasets, which is useful for keeping all that data in a secure and reliable system. You could also use a petabyte-scale NoSQL database such as Apache HBase via Amazon's EMR service. Deploying an HBase cluster on EMR and combining it with a Big Data framework such as Hadoop can give you fast data access alongside Big Data analytics.
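One detail worth illustrating for HBase on EMR is row-key design: prefixing keys with a hash-derived "salt" spreads sequential writes across regions instead of hammering one region server. The helper below is a generic sketch of that technique; the table, column family, and host names in the commented happybase snippet are placeholders, and the write itself requires a reachable HBase Thrift server.

```python
import hashlib

def salted_row_key(user_id: str, buckets: int = 16) -> bytes:
    # Derive a stable salt from the ID itself so reads can recompute it;
    # a two-digit prefix keeps keys sortable within each salt bucket.
    salt = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % buckets
    return f"{salt:02d}|{user_id}".encode("utf-8")

key = salted_row_key("user-12345")

# Writing the row via the happybase client (needs an HBase Thrift server,
# e.g. on the EMR master node):
# import happybase
# conn = happybase.Connection("emr-master-host")
# conn.table("events").put(key, {b"d:last_seen": b"2023-05-01"})
```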


Data Warehousing

Amazon Redshift is a petabyte-scale data warehouse that allows organizations to pull disparate data sources into a single repository for business intelligence and reporting. A sub-service within Redshift is Redshift Spectrum, which enables you to query and analyze Big Data cost-effectively by running SQL queries directly against large volumes of unstructured data stored in Amazon S3.
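Spectrum works by defining an external table over files in S3 and then querying those files in place, without loading them into the warehouse. The sketch below generates such DDL; the schema, table, columns, and S3 prefix are all hypothetical placeholders.

```python
def spectrum_external_table(schema: str, table: str, s3_prefix: str) -> str:
    # DDL for an external table over JSON files in S3. The external schema
    # itself would already be mapped to an AWS Glue or external catalog.
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"  event_id VARCHAR(64),\n"
        f"  event_time TIMESTAMP,\n"
        f"  payload VARCHAR(8192)\n"
        f")\n"
        f"ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{s3_prefix}';"
    )

ddl = spectrum_external_table("spectrum", "events", "s3://my-data-lake/events/")
```

Once the table exists, ordinary Redshift SQL can join it against local warehouse tables, which is what makes Spectrum useful for ad-hoc analysis of data that never needed loading.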


AWS and Big Data Challenges

The challenges of using AWS for Big Data workloads and storage are similar to those with any cloud provider: security, disaster recovery, and availability are paramount concerns.


One security concern is that Big Data frameworks distribute data across several AWS instances for faster processing, which increases the number of systems on which security issues can arise.


Creating cloud backups is also imperative. As mentioned, AWS allows enterprises to store large datasets within the S3 service and then use its querying services to get insights from that data. However, relying on AWS alone is not wise—AWS instances are also vulnerable to natural disasters or cloud provider errors that could result in data loss. This makes the creation of AWS backups a vital part of any disaster recovery plan for your Big Data workloads.
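One simple form of such a backup is copying critical S3 objects to a bucket in a different region. The helper below assembles the arguments for boto3's `copy_object` call; the bucket names, key, and region are hypothetical, and the copy itself is commented out because it requires credentials and a destination bucket in the DR region.

```python
def cross_region_copy_args(src_bucket: str, key: str, dest_bucket: str) -> dict:
    # copy_object performs a server-side copy, so the object's bytes never
    # transit the machine running this script.
    return {
        "CopySource": {"Bucket": src_bucket, "Key": key},
        "Bucket": dest_bucket,
        "Key": key,
    }

args = cross_region_copy_args("prod-data",
                              "backups/2023/05/01/dump.gz",
                              "dr-data-eu")

# To copy (requires boto3 and credentials; destination bucket must already
# exist in the disaster-recovery region):
# import boto3
# s3 = boto3.client("s3", region_name="eu-west-1")
# s3.copy_object(**args)
```

For ongoing protection rather than one-off copies, S3's built-in cross-region replication can automate the same idea at the bucket level.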


There’s also a need for an advanced security solution that can keep track of high-velocity Big Data, intelligently identifying and protecting sensitive information within all that data, such as personally identifiable information (PII). Machine learning algorithms could have a role to play here.


With AWS or any other cloud solution, it’s important to extend existing governance, teams, skills, and data management procedures to incorporate cloud infrastructure.


Wrap Up


AWS provides a suite of exciting services that can empower your business to get the most out of Big Data. This post has only touched on the available services, so you can visit the AWS Big Data page for a more comprehensive breakdown of how AWS integrates with Big Data frameworks and facilitates analysis of these datasets. Remember to always bear security, governance, compliance, and AWS disaster recovery in mind.