Databases & Analytics


Types of Databases

Relational vs Non-Relational

The key differences are how data is managed and how it is stored

| Relational | Non-Relational |
| --- | --- |
| Organized by tables, rows, and columns | Varied data storage models |
| Rigid schema (Structured Query Language / SQL) | Flexible schema (NoSQL) - data stored in key-value pairs, columns, documents, or graphs |
| Rules enforced within the database | Rules can be defined in application code (outside the database) |
| Typically scaled vertically - e.g. we deploy a database on an instance and then change the instance type to one with more CPU and memory if we need to scale | Scales horizontally |
| Supports complex queries and joins | Unstructured, simple query language; supports any kind of schema |
| Amazon RDS, Oracle, MySQL, IBM DB2, PostgreSQL | Amazon DynamoDB, MongoDB, Redis, Neo4j |

Relational Databases

Here we have an example of a table in a relational database. We have columns and rows, and each row is an individual item in the database. If we have multiple tables, we can connect them - that's how we join different tables together. If we want to query the database, we use the Structured Query Language (SQL). Here, we might select the first name from the employees table where the location equals Sydney, e.g. `SELECT first_name FROM employees WHERE location = 'Sydney';`. The query pulls out records based on the conditions defined in the SQL statement.
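To make that concrete, here's a minimal runnable sketch using Python's built-in sqlite3 module; the employees table, columns, and sample rows are hypothetical, invented for illustration:

```python
import sqlite3

# In-memory relational database: a rigid schema of named, typed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, first_name TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO employees (first_name, location) VALUES (?, ?)",
    [("Alice", "Sydney"), ("Bob", "London")],
)

# SQL pulls out only the rows matching the WHERE condition.
for (first_name,) in conn.execute(
    "SELECT first_name FROM employees WHERE location = ?", ("Sydney",)
):
    print(first_name)  # -> Alice
```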

Types of Non-Relational DB (NoSQL)

Here, for example, we have DynamoDB. DynamoDB has items in a table, and those items have attributes. The attributes can be different for every item - they don't have to be the same, and some can be missing - so it's much more flexible in that regard. We've also got MongoDB, which is a document database: you store documents as items in the database. And then we have graph databases like Amazon Neptune. These are all non-relational. For a graph database, think of Facebook and the way it stores relationships between different people: you're storing those relationships in a graph database.
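To illustrate the different models, here's a conceptual sketch using plain Python structures (all names and values are hypothetical):

```python
# Key-value / item model (DynamoDB-style): items in the same table can
# carry different attributes.
items = [
    {"orderid": "101", "book": "AWS Basics", "price": 20},
    {"orderid": "102", "book": "NoSQL Notes"},  # no "price" - that's allowed
]

# Document model (MongoDB-style): each item is a nested document.
order_doc = {"orderid": "101", "customer": {"name": "Alice", "city": "Sydney"}}

# Graph model (Neptune-style): relationships stored as edges between nodes.
friendships = [("alice", "bob"), ("bob", "carol")]
```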

Operational vs Analytical

The key differences are the use cases and how the database is optimized

| Operational / Transactional | Analytical |
| --- | --- |
| Online Transaction Processing (OLTP) - think of these as the production databases; e.g. on an eCommerce platform, customer orders go into this database | Online Analytics Processing (OLAP) - the source data comes from OLTP DBs: data is loaded from the operational databases into the analytical database, which becomes a data warehouse, typically separated from the front-facing (operational) databases |
| Production DBs that process transactions, e.g. adding customer records, checking stock availability (INSERT, UPDATE, DELETE) | Data warehouse. Typically separated from the customer-facing DBs. Data is extracted for decision making |
| Short transactions and simple queries | Long transactions and complex queries |
| Relational examples: Amazon RDS, Oracle, IBM DB2 | Relational examples: Amazon Redshift, Teradata, HP Vertica |
| Non-relational examples: MongoDB, Cassandra, Neo4j, HBase | Non-relational examples: Amazon EMR, MapReduce |

For example, we might have several Amazon RDS databases in different parts of the world; those are our operational databases. Then we have Redshift, which is a data warehouse for analytics. We can copy and load the data into Redshift and then perform our analytics on data aggregated from all of the different operational databases we have.

AWS Databases

| Data Store | Use Case |
| --- | --- |
| Database on EC2 | Need full control over the instance and database; third-party database engine (not available in RDS) |
| Amazon RDS | Need a traditional relational database, e.g. Oracle, PostgreSQL, Microsoft SQL Server, MariaDB, MySQL; data is well-formed and structured |
| Amazon DynamoDB | NoSQL database; in-memory performance; high I/O needs; dynamic scaling |
| Amazon Redshift | Data warehouse for large volumes of aggregated data |
| Amazon ElastiCache | Fast temporary storage for small amounts of data; in-memory database |
| Amazon EMR | Analytics workloads using the Hadoop framework |

Amazon Relational Database Service (RDS)

RDS is a managed relational database. Now it runs on EC2 instances, and that means you need to choose your instance type to determine the performance characteristics for your database, the CPU, the memory, the storage, and so on.

RDS supports the following database engines:

  • Amazon Aurora
  • MySQL
  • MariaDB
  • PostgreSQL
  • Oracle
  • Microsoft SQL Server

If the database engine you need is not on this list, you can't use RDS - you have to install it on EC2.

Amazon RDS Scaling Up (vertically)

Scaling up, or scaling vertically, is where we change our instance type. Here we've moved from a db.m4.large with 2 vCPUs and 8 GiB of RAM up to a larger instance type in the same family (such as a db.m4.xlarge), which has more processing power and more RAM.
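With the AWS SDK, that scale-up operation maps to a single API call; a hedged boto3 sketch (the instance identifier and target class are hypothetical):

```python
import boto3

rds = boto3.client("rds")
rds.modify_db_instance(
    DBInstanceIdentifier="digitalcloud",  # hypothetical instance name
    DBInstanceClass="db.m4.xlarge",       # larger type = more vCPUs and RAM
    ApplyImmediately=True,                # otherwise applied in the maintenance window
)
```

Note that changing the instance class restarts the instance, which is the downtime mentioned later in the DynamoDB comparison.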

Disaster Recovery (DR) and Scaling Out (Horizontally)

Now, we can also scale out for some specific performance characteristics, and we can implement disaster recovery to give our applications high availability and fault tolerance. So how does that work?

We have something called an RDS master. This runs in an availability zone and is our primary database - the one on which both reads and writes are performed. We can then have a standby instance. The master synchronously replicates to the standby, so the standby is always in sync with the master. This is known as Multi-AZ: you essentially have a passive standby that is used for disaster recovery. In a dire situation, when the master fails or becomes inaccessible, the standby takes over automatically.

We also have Read Replicas. Read Replicas use asynchronous replication, so there's a little more delay, but they also replicate from the master, and they're used for scaling reads - that is, scaling the queries on your database. This is a way of scaling out (horizontally), but only for queries: the Read Replica is used for queries, while writes still always go to the master database. So here we have our app server: all writes go to the master, but reads can instead go to the Read Replica, so we can offload some of the performance impact from the master to the replica.
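Creating a Read Replica is likewise a single API call; a hedged boto3 sketch (both identifiers are hypothetical):

```python
import boto3

rds = boto3.client("rds")

# Create an asynchronous Read Replica of the master; reads can then be
# directed at the replica's endpoint, while writes still go to the master.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="digitalcloud-replica",  # hypothetical replica name
    SourceDBInstanceIdentifier="digitalcloud",    # hypothetical master name
)
```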


Create Amazon RDS Database

Let's head over to the Management Console > Services > Database > RDS. On the main screen here, if we go to Dashboard, we've got Amazon Aurora. AWS wants us to choose Amazon Aurora - that's their own database engine for RDS - but we don't have to use it. Let's click on Create database.

So for now, rather than choosing the Engine as Aurora, I’m going to choose MySQL instead and you can see we have a choice of other database engines if we want to. So I’m running MySQL.

Let's scroll down. Here we've got some Templates. What we're going to do is use the Free tier, and the Free tier will configure this environment to make sure we stay within the free tier. There are some limitations - for example, you won't be able to enable Multi-AZ, but that's fine for this. Otherwise, you could use the presets for production or dev/test workloads.

We need to provide a database identifier. I'm just going to call mine digitalcloud. The Master username is here, admin is fine, and I'm just going to put in a password.

Next, we can choose the Database instance class. Now, because we’ve already selected the free tier, most of them are going to be grayed out and if we deselect that and change to Production, then we’d be able to choose from some of these larger database instance types. So I’m happy with the db.t2.micro.

Scrolling down a bit, we have Storage and it’s using EBS (Elastic Block Store) because this is running on EC2. So we can choose from General purpose, Provisioned IOPS, or Magnetic. I’m going to leave it on General-Purpose (SSD) and 20 GiB is fine and it enables storage auto-scaling by default. So this means that the storage layer will automatically scale for us as we use more storage.

As you can see, Multi-AZ is grayed out; that's because we're on the free tier.

Note here that you can enable Public access. If you do that, it means that application instances could come in from the outside world to a public address and connect to your database. You might want it to be completely internal, so just leave it on No. You can also choose a VPC security group if you want to. In this case, it chooses the default, and that's fine for this purpose.

Let’s just check the Additional configuration. I want the Initial database name to be digitalcloud and I’m going to leave the automatic backups on. As you can see, you can also define your retention period. By default, it’s 7 days. You can knock it down to zero, which means no backups, or put it up to 35. We’ll leave it as it is. And I’m happy with not having a preference for the backup window. That’s the time when AWS will take the backup. You can choose your own time frame to make sure it doesn’t interfere with your production systems if you want to.

Let’s just scroll down and Create database.

Our database is now in the creating status and it can take several minutes for an RDS database to deploy. Once it’s deployed we’ll see that we have an endpoint and that’s the endpoint we can then use for our applications. So let’s just give it a few minutes to deploy.

So my database is almost deployed. We can see it’s in backing up status now and if I just click on the database here.

We can now see that we have an Endpoint & port. This endpoint would be used in your application to point it to the RDS database, and the port would be 3306. You can also see a bit of information about the availability zone it's in, the VPC, the subnet group, and so on. Now, if you are pointing an application at this database, make sure you go and check your VPC security group, because your application will need to be allowed to talk to RDS on port 3306.

At the moment, if we scroll down, we can see that we don't have a rule there. The default security group only has an inbound rule allowing access from itself. So if your application was in this particular security group, that would work; otherwise, you would need to allow port 3306 inbound from another security group or IP address.
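That rule could also be added programmatically; a hedged boto3 sketch (both security group IDs are hypothetical placeholders):

```python
import boto3

ec2 = boto3.client("ec2")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # hypothetical RDS security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3306,  # MySQL port used by this RDS instance
        "ToPort": 3306,
        # Allow inbound MySQL traffic from the app servers' security group.
        "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210"}],
    }],
)
```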

We then have the Monitoring tab and here we can see a little bit of information about CPU, database connections, storage space, freeable memory, and then read and write IOPS.

We’ve got Logs & events and you can see some information is being populated in here.

Under Configuration, you can see a few more details on the configuration of this RDS instance.

And under Maintenance & backups, you can see the configuration and any actual backups that have been taken.

Back up at the top, you’ve got Actions. Here, you can do things like Take snapshot of your instance. Remember, it uses the Elastic Block Store. So you can always take snapshots and use those to restore your database, as well as the automatic backups that take place. You can Create read replica here and you can Promote a database instance that perhaps is a Read Replica to be a primary database. You can also, Delete, Reboot, and Stop your instance.
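Taking a snapshot can also be done with one API call; a minimal boto3 sketch (the snapshot name is hypothetical; the instance identifier matches the walkthrough):

```python
import boto3

rds = boto3.client("rds")
rds.create_db_snapshot(
    DBInstanceIdentifier="digitalcloud",                # instance from the walkthrough
    DBSnapshotIdentifier="digitalcloud-manual-snap-1",  # hypothetical snapshot name
)
```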

So all I’m going to do now is just go in and Delete my database.

And we don't want to create a final snapshot, because we'd have to pay to retain it, and we don't want to retain automated backups either. So we just acknowledge that everything is going to get deleted, type delete me, and that's it.

Amazon Aurora

Amazon Aurora is actually part of the RDS family. We could see it in the RDS Console, but it’s not called Amazon RDS. It’s just called Amazon Aurora and this is a proprietary database engine created by AWS.

So here you can see that we have six different copies of our data, two in each availability zone. We have our Primary database, which handles reads and writes, and then we have our Replicas in different availability zones. Each replica can serve reads, but all writes go to the primary, and the data is persisted across a shared logical volume that spans multiple availability zones.

Aurora Fault Tolerance

Amazon Aurora Key Features

| Aurora Feature | Benefit |
| --- | --- |
| High performance and scalability | Offers high performance, self-healing storage that scales up to 64TB, point-in-time recovery, and continuous backup to S3 |
| DB compatibility | Compatible with existing MySQL and PostgreSQL open source databases |
| Aurora Replicas | In-region read scaling and failover target - up to 15 (can use Auto Scaling) |
| MySQL Read Replicas | Cross-region read scaling and failover target - up to 5 (each can have up to 15 Aurora Replicas) |
| Global Database | Cross-region cluster with read scaling (fast replication / low-latency reads). Can remove secondary and promote |
| Multi-Master | Scales out writes within a region: you can have multiple database instances you can write to, with the data synchronized between all of the masters |
| Serverless | On-demand, auto-scaling configuration for Amazon Aurora - does not support read replicas or public IPs (can only access through VPC or Direct Connect - not VPN) |

Amazon DynamoDB

Now, with RDS, you can change the configuration by changing the instance type, but that means downtime: the instance has to restart, and your application won't be able to connect to the database for a period of time. DynamoDB does not have that problem. In DynamoDB we create a table - the table isn't itself a database; DynamoDB is the database service, and you create multiple tables within it. A DynamoDB table scales horizontally: your data is distributed across different backend servers to make sure you can scale and get great performance.

  • DynamoDB is made up of:
    • Tables
      • That’s where all of your data is stored
    • Items
      • These are like rows in a relational database
    • Attributes
      • Each item has a series of attributes associated with it. So that’s the data we see in these different fields, such as the order ID, the book, the price, and the date.

Amazon DynamoDB Key Features

| DynamoDB Feature | Benefit |
| --- | --- |
| Serverless | Fully managed, fault-tolerant service - no instances to launch or manage |
| Highly available | 99.99% availability SLA - 99.999% for Global Tables! |
| NoSQL type of database with Name / Value structure | Flexible schema, good for when data is not well structured or unpredictable |
| Horizontal scaling | Seamless scalability to any scale with push-button scaling or Auto Scaling |
| DynamoDB Accelerator (DAX) | Fully managed in-memory cache for DynamoDB that increases performance (microsecond latency) |
| Backup | Point-in-time recovery down to the second in the last 35 days; on-demand backup and restore |
| Global Tables | Fully managed multi-region, multi-master solution - a table can be replicated across multiple regions with the data kept in sync between them |

Create Amazon DynamoDB Table

In the Console, go to Services > Database > DynamoDB. In DynamoDB we need to create a table. So let’s go to Tables and Create table.

We need to give it a name. I'll just simply call it MyTable, and we need to give it a Partition key. So what are we storing in this table? Maybe we're storing some orders and various information associated with those orders. Or maybe this table is going to be a series of users from a leaderboard for a gaming application, and we're going to store some information about those people. Let's go with that one, and we'll just use userid for the partition key. Now, this can be a String, which is what we have here; it could also be Binary or a Number. We'll leave it as a string. You can then have a Sort key, which allows you to search on another key that is related to the partition key. We'll leave that blank; it's optional.

Now we have some Settings options here for Default settings or Customize settings. I'm going to click on Customize so I can show you a bit more. We have something called Capacity mode: On-demand or Provisioned. With Provisioned, we specify things called Read capacity units and Write capacity units, which is a way of defining the performance of your database - you pay for the read and write capacity units you provision. Alternatively, you could choose On-demand, and then you pay for the actual reads and writes your application performs, instead of provisioning capacity and paying for, say, five read capacity units and five write capacity units regardless of whether you use them. I'll just leave it on the default here of Provisioned. This is very low and it will be within the free tier.
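For reference, here's a minimal boto3 sketch that creates the same table programmatically (the table name and key match the walkthrough):

```python
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="MyTable",
    AttributeDefinitions=[{"AttributeName": "userid", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "userid", "KeyType": "HASH"}],  # partition key
    # Provisioned capacity mode; for On-demand you would pass
    # BillingMode="PAY_PER_REQUEST" and omit ProvisionedThroughput.
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
```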

We can also create something called a Secondary index, but that's beyond the scope of what we need to do here.

And so what I want to do now is just create the table. I’m not going to change the Encryption status. So let’s just click on Create table.

That's it. That literally took a couple of seconds to create my table, and we can see that it has a partition key of userid.

So let's just click on the table. Now, of course, there's nothing in our table; we don't have any items in it at this point. At the bottom here, we could actually Create item.

So let’s just go ahead and do that. So maybe this is going to be the first user and that’s going to be me. So the value for userid is neal001 and the type of that is going to be a string and we can’t change that. We could then add additional attributes. So maybe we need to add another string.

This one is going to be the firstname. So that would just be neal. We could have another one as a string and this might be lastname, davis, and then perhaps we have a number and that’s going to be bestscore and it’s 900. So let’s just create this item and see what happens.

So now we have an item in our table which has a few different attributes associated with it. Now if we wanted to, we could go and just create more items and each item can have the same or different attributes and that’s one of the benefits of a NoSQL database because it gives you that flexibility. With a structured database like SQL, a relational database, you do actually have to make sure that you have some value in every single column. You can’t have different columns for different rows or items in the database. So that’s one of the advantages.
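For reference, the same item can be written with boto3; a minimal sketch mirroring the walkthrough (the second item is hypothetical, included to show that attributes can differ per item):

```python
import boto3

table = boto3.resource("dynamodb").Table("MyTable")

# The item created in the walkthrough: userid is the partition key.
table.put_item(Item={
    "userid": "neal001",
    "firstname": "neal",
    "lastname": "davis",
    "bestscore": 900,
})

# A second item with different attributes - no schema change required.
table.put_item(Item={"userid": "jane001", "country": "AU"})
```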

Now, there are a few tabs up here as well. We have something called Global Tables, a really cool solution where you can create a replica of your table and you can create that in another region.

So for example, I could then choose one of these other regions and have a replica created there for me.

We can take backups: there's automated backup, and you can also create on-demand backups.

You really just need to understand that it's a NoSQL type of database, that you have push-button scaling, so you can scale it without any interruption to traffic, and of course, it's a serverless service, so you don't have to provision or manage instances. So that's it, let's just go and delete our table.

Once I type delete to confirm and click Delete table, that's it.

Amazon Redshift

So let's say we have our Amazon RDS databases and these are our production databases - perhaps in different geographies around the world - and this is where we're taking orders from our customers. These are our transactional, or OLTP, databases. What we might want to do is copy and load the data from those transactional databases into Redshift, which becomes our data warehouse. Once the data is loaded into Redshift, we can analyze it using Structured Query Language (SQL) with Business Intelligence tools such as QuickSight. That should give you an idea of how we use a data warehouse such as Redshift.
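As a sketch of the analytics side, here's how a SQL query might be run against a Redshift cluster with the Redshift Data API via boto3 (the cluster identifier, database, user, and query are all hypothetical):

```python
import boto3

client = boto3.client("redshift-data")
resp = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",  # hypothetical cluster
    Database="analytics",                     # hypothetical database
    DbUser="awsuser",                         # hypothetical DB user
    Sql="SELECT region, SUM(order_total) FROM orders GROUP BY region;",
)
# The query runs asynchronously; results are fetched later with
# get_statement_result using this statement ID.
print(resp["Id"])
```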

Amazon Elastic MapReduce (EMR)

You may have heard of Hadoop, which is a framework for big data - meaning you have large quantities of data and need to perform some kind of analytics or processing on that data. Amazon Elastic MapReduce (EMR) provides a managed implementation of Hadoop for you.

An EMR cluster always runs within a single availability zone. It can attach to various data stores: S3, Glacier, Redshift, DynamoDB, RDS, and the Hadoop Distributed File System (HDFS). You can optionally attach EBS volumes, and you get root access to your cluster instances in EMR. EMR scales by adding cluster instances or by deploying multiple clusters, which can be in different availability zones if you like. The key things to remember are that it's a managed implementation of Hadoop and it's used for big data analytics and business intelligence.
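For illustration, a hedged boto3 sketch of launching a small EMR cluster (the name, instance types, and counts are hypothetical; the roles shown are EMR's default role names):

```python
import boto3

emr = boto3.client("emr")
resp = emr.run_job_flow(
    Name="big-data-analytics",                # hypothetical cluster name
    ReleaseLabel="emr-6.9.0",                 # EMR release bundling Hadoop
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                   # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,  # keep running for interactive work
    },
    JobFlowRole="EMR_EC2_DefaultRole",        # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",            # default EMR service role
)
print(resp["JobFlowId"])
```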

Amazon ElastiCache

Amazon ElastiCache is a database service that’s typically used for very high performance in-memory caching of data and often that data comes from another database such as RDS.

We have an RDS database and an application instance, and we put an ElastiCache node in the middle. The nodes run on EC2 instances, so you have to launch EC2 instances here and pay for them for the time they're running. Now, the application writes some data to RDS. That data then gets loaded into ElastiCache, which means that the next time the application needs to read the data, it gets what's called a cache hit - the data is found in the cache and retrieved a lot faster than it would be from RDS.
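Here's a minimal cache-aside sketch using the open-source redis client library, assuming a hypothetical ElastiCache for Redis endpoint; query_rds is a stand-in for a real database call:

```python
import redis

# Hypothetical ElastiCache for Redis endpoint.
cache = redis.Redis(host="my-cache.abc123.0001.use1.cache.amazonaws.com", port=6379)

def query_rds(product_id: str) -> str:
    # Stand-in for a real SQL lookup against RDS.
    return f"details-for-{product_id}"

def get_product(product_id: str) -> str:
    cached = cache.get(product_id)
    if cached is not None:
        return cached.decode()            # cache hit: served from memory
    value = query_rds(product_id)         # cache miss: fall back to RDS
    cache.set(product_id, value, ex=300)  # cache the result for 5 minutes
    return value
```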

| Use case | Benefit |
| --- | --- |
| Web session store | With load-balanced web servers, store web session information in Redis; if a server is lost, the session info is not lost, and another web server can pick it up. It's very high performance, which suits this use case |
| Database caching | Use Memcached in front of Amazon RDS to cache popular queries to offload work from RDS and return results faster to users |
| Leaderboards | Use Redis to provide a live leaderboard for millions of users of your mobile app |
| Streaming data dashboards | Provide a landing spot for streaming sensor data on the factory floor, providing live real-time dashboard displays |

The key thing to remember is that it's an in-memory database used for caching data.

Amazon Athena and AWS Glue

Amazon Athena is a service that we can point at data on Amazon S3 and run SQL queries against that data. It gives us the ability to take a data lake - essentially a large quantity of data on S3 - and run queries against it as if it were a database. Athena can query data in various formats: it might be stored in a CSV file, a TSV file, in JSON format, or in Parquet or ORC formats. We then have AWS Glue. Glue is a metadata catalog that you can use with Amazon Athena: you create a metadata catalog that describes the structure of your data in S3 and then query that data in S3 using SQL.
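A hedged boto3 sketch of running an Athena query over data in S3 (the Glue database, table, columns, and S3 locations are hypothetical):

```python
import boto3

athena = boto3.client("athena")
resp = athena.start_query_execution(
    QueryString="SELECT location, COUNT(*) FROM employees GROUP BY location;",
    # The database/table schema is defined in the AWS Glue Data Catalog.
    QueryExecutionContext={"Database": "my_glue_database"},
    # Athena writes query results to S3.
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
# The query runs asynchronously; poll get_query_execution and fetch rows
# with get_query_results using this ID.
print(resp["QueryExecutionId"])
```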

Amazon Athena

AWS Glue

Amazon Kinesis

Amazon Kinesis is a family of services that are all associated with streaming data. The main thing to remember is to think about Kinesis whenever you hear about streaming data.

We have data coming in from some kind of source - things like IoT sensors, or some kind of device on the Internet or in a factory - and they're sending data into Kinesis Data Streams. That's one of the products in the portfolio. Kinesis Data Streams ingests streaming data and can then process it and forward it on. In this case, maybe it forwards the data to another member of the Kinesis family, Kinesis Data Analytics, which performs real-time SQL processing on the data; perhaps it then sends the results to Lambda, and Lambda stores them in a DynamoDB table.

Then we have Kinesis Data Firehose. Firehose is a service that loads data straight to a destination. In other words, it might just put the data into S3, or it might put it into Amazon Redshift, and then you'll be able to do something with that data later on. It's a way of taking the streaming data as it comes in and saying, "Right, I'm just going to dump it all into Amazon S3 or load it into a Redshift database, and then I can perform my queries on it later."
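As a producer-side sketch, here's how a device might send records into a Kinesis data stream with boto3 (the stream name and payload are hypothetical):

```python
import json
import boto3

kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="sensor-stream",  # hypothetical stream
    Data=json.dumps({"sensor_id": "s1", "temperature": 21.5}).encode(),
    PartitionKey="s1",           # records with the same key go to the same shard
)
```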

Examples of streaming data use cases include IoT sensors constantly reporting equipment status or temperature readings, and vehicles sending back location coordinates.

We're really talking about data that arrives continuously, often in fairly large and fairly consistent volumes. There are lots of different use cases here, and it's really about dealing with that constant stream of data as it comes in.

Kinesis Data Streams

Amazon Kinesis Data Firehose

Amazon Kinesis Data Analytics

Other Databases and Analytics Services

AWS Data Pipeline

Amazon QuickSight

Amazon Neptune

Amazon DocumentDB

Amazon QLDB

Amazon Managed Blockchain

Exam-Cram


Amazon Relational Database Service (RDS)

Amazon Aurora

Amazon DynamoDB

Amazon Redshift

Amazon EMR

Amazon ElastiCache

Amazon Athena

AWS Glue

Amazon Kinesis Data Streams

Amazon Kinesis Data Firehose

Amazon Kinesis Data Analytics

AWS Data Pipeline

Amazon QuickSight

Amazon Neptune

Amazon DocumentDB

Amazon QLDB

Amazon Managed Blockchain