How I passed the HashiCorp Vault exam

Reading Time: 3 minutes

In this post, I will discuss how I prepared for and passed the HashiCorp Vault exam.

Note: I had only used Vault for a limited-scope PoC before I started preparing for the exam, so some concepts were really new to me.

This exam is very developer-focused; if you are not comfortable with development terminology you may find it harder, but it is certainly not impossible.

It took me a total of two weeks to prepare for the exam. During those two weeks I worked through hands-on activities and read the documentation, some of which I will cover in short videos in the coming days.

To prepare for this exam I used the official study guide published by HashiCorp, which can be found here.

Vault Concepts

What is Vault – https://www.vaultproject.io/docs/what-is-vault

Watch the video from Mike Møller Nielsen

Intro to Vault (Armon)

11 fundamental concepts – https://www.vaultproject.io/docs/concepts – Read them and re-read them, as they are really important! Some core topics you should focus on:

Vault Fundamentals

Describe authentication methods:

Authentication – https://www.vaultproject.io/docs/auth

Concepts https://www.vaultproject.io/docs/concepts/auth

AWS auth method – https://www.vaultproject.io/docs/auth/aws

Also, I recommend that you complete all the labs for “Authentication” as this is a major topic for the exam.

https://learn.hashicorp.com/collections/vault/auth-methods

CLI and UI – Understand what the CLI commands do and review all the CLI options; a short dev-server session is sketched at the end of this section.

CLI access to Vault https://www.vaultproject.io/docs/commands/index.html

Vault UI – https://www.vaultproject.io/docs/configuration/ui

Also, review the Vault CLI available within the UI (the web console) and why it differs from the binary CLI.
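
If you want quick hands-on practice with the CLI, a minimal dev-server session looks something like the sketch below (the secret path and values are just placeholders):

# Start a dev-mode server (in-memory storage, auto-unsealed, root token printed to stdout)
vault server -dev

# In another terminal, point the CLI at the dev server and log in with the printed root token
export VAULT_ADDR='http://127.0.0.1:8200'
vault login <root-token>

# Basic KV v2 operations (dev mode mounts kv-v2 at secret/)
vault kv put secret/myapp/config username=demo password=example
vault kv get secret/myapp/config

# Check seal/HA status and list enabled auth methods
vault status
vault auth list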

Vault Policies

This is a very important topic, and I recommend that you create a dev-mode server, create some policies, create users with those policies attached, and play around with the concepts.

Knowing how policies work will help you answer questions quickly, as there are many questions related to policies. It is also important to understand what “*” does and what “+” does; a minimal policy illustrating both follows the lab links below. My recommendation is to go over the following labs:

https://learn.hashicorp.com/tutorials/vault/policies?in=vault/policies

https://learn.hashicorp.com/tutorials/vault/getting-started-policies?in=vault/getting-started

https://learn.hashicorp.com/tutorials/vault/policy-templating?in=vault/policies
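
As a quick illustration of the two wildcards (the paths and capabilities here are just examples, not an exam answer):

# "*" is a glob at the end of a path: everything under secret/data/team1/
path "secret/data/team1/*" {
  capabilities = ["create", "read", "update", "delete", "list"]
}

# "+" matches exactly one path segment: secret/data/<anything>/config,
# but not deeper paths like secret/data/app/nested/config
path "secret/data/+/config" {
  capabilities = ["read"]
}

Write it to a dev server with vault policy write my-policy policy.hcl, then test it with a token that has only that policy attached.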

Tokens

You have to know this very well! This is the heart and soul of the Vault engine so knowing this and going through hands-on labs will help you understand Vault really well.

Root Token – https://www.vaultproject.io/docs/concepts/tokens

Learn the difference between service and batch tokens, and between tokens with a parent and orphan tokens.

Also, understand how token leases and TTLs work and what token accessors are; a few commands to practice with are shown below.
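
A few token commands worth practicing on a dev server (the policy name and TTLs below are placeholders):

# Service token (the default type) vs. batch token
vault token create -policy="my-policy" -ttl=1h
vault token create -type=batch -policy="my-policy" -ttl=20m

# Orphan token (no parent, so it is not revoked when its creator's token is revoked)
vault token create -orphan -policy="my-policy"

# Inspect a token by its value or by its accessor
vault token lookup <token>
vault token lookup -accessor <accessor>

# Renew and revoke
vault token renew <token>
vault token revoke <token>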

See the following video on auto-unseal and batch tokens:

Secrets Management

This is a core topic and you must know the ins and outs of it: understand how each secrets engine works and what the use case for each engine is (a couple of CLI commands to get hands-on are sketched after the links below).

Review the following topics:

https://www.vaultproject.io/docs/secrets

https://www.vaultproject.io/docs/secrets/databases

https://www.vaultproject.io/docs/secrets/aws
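
To get a feel for how engines are mounted and listed, a couple of CLI commands to try on a dev server (the mount paths are just examples):

# Enable the AWS and database secrets engines
vault secrets enable -path=aws aws
vault secrets enable database

# See everything that is currently mounted
vault secrets list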

Complete the following labs:

Vault API

You will see questions about the Vault API. Review the following topics and understand how the token header is sent via curl (a short example follows below):

https://www.vaultproject.io/docs/auth/approle.html

Understand when to use AppRole vs. other authentication methods.

https://learn.hashicorp.com/tutorials/vault/getting-started-apis

Watch this video from Mike Møller Nielsen – he explains how the API and curl work with response wrapping.
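
As a minimal sketch of the token header (the secret path, role_id, and secret_id are placeholders):

# Read a KV v2 secret over the HTTP API, passing the token in the X-Vault-Token header
curl --header "X-Vault-Token: $VAULT_TOKEN" \
     $VAULT_ADDR/v1/secret/data/myapp/config

# Log in with AppRole; the response contains a client token you then pass in X-Vault-Token
curl --request POST \
     --data '{"role_id": "<role-id>", "secret_id": "<secret-id>"}' \
     $VAULT_ADDR/v1/auth/approle/login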

Vault Architecture

This is an important topic as well; you don’t necessarily have to build an HA Vault cluster, but it helps to understand how the deployment works (a minimal server configuration is sketched at the end of this section). I used the following exercises to deploy Vault HA on AWS:

https://github.com/hashicorp/vault-guides/tree/master/operations/provision-vault/quick-start/terraform-aws (you must know how to use #Terraform)

Watch Bryan Krausen’s Vault HA video:
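
If you want to look at something closer to a real deployment than dev mode, a minimal server configuration file looks roughly like the sketch below (the paths, addresses, and the choice of integrated Raft storage are assumptions; Consul is another common storage backend):

# vault-server.hcl -- minimal example, not production hardened
storage "raft" {
  path    = "/opt/vault/data"
  node_id = "vault-1"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1   # lab only; always use TLS for anything real
}

api_addr     = "http://10.0.0.10:8200"
cluster_addr = "https://10.0.0.10:8201"
ui           = true

Start it with vault server -config=vault-server.hcl, then initialize and unseal it.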

Overall, you must go through all the topics in the study guide here, as the questions come from a wide variety of areas, but the topics and items I have shared should get you comfortable with Vault.

Some tips for taking the exam:

  • Arrive 15 mins before the exam starts
  • Read the questions and answers carefully
  • If you don’t know the answer, mark the question and move on
  • You have 60 minutes to answer all the questions
  • Take Ned Bellavance’s Pluralsight course if you can (linked here)

Terraform Cloud Series – Part 4 (remote state)

Reading Time: 2 minutes

Continuing from where we left off, in this post I will discuss how to tap into a workspace’s state file.

In the previous post, we connected workspaces so a child workspace runs after its parent. In some cases, however, a stack needs to fetch data sources in order to cross-reference a resource name, ID, etc., which makes the Terraform code more reusable and flexible.

Let’s look at an example of how to pull data from a remote state file stored in the Terraform cloud.

If we look at the execution flow in the previous post, we executed 1-poc-network and the run trigger executed 2-poc-security-groups; but 2-poc-security-groups requires the vpc_id created in 1-poc-network. So, let’s look at the code and break it down a bit.

module "vote_service_sg" {
  source = "terraform-aws-modules/security-group/aws"
  name        = "access-security-group"
  description = "Security group for user-service with custom ports open 
  within VPC, and PostgreSQL publicly open"
  vpc_id      = "VPC_ID" # --> VPC ID associating Security group to VPC
  ingress_cidr_blocks      = ["10.10.0.0/16","10.10.105.0/24","78.1.10.100/32"]
  ingress_rules            = ["https-443-tcp"]
  ingress_with_cidr_blocks = [
    {
      from_port   = 8080
      to_port     = 8090
      protocol    = "tcp"
      description = "User-service ports"
      cidr_blocks = "10.10.0.0/16"
    },
    {
      rule        = "postgresql-tcp"
      cidr_blocks = "0.0.0.0/0"
    },
  ]
  tags = var.default_tags
}

Looking at the vpc_id argument, notice that we have to provide the VPC ID every time this code is executed.

vpc_id      = "VPC_ID" # --> VPC ID associating Security group to VPC

If we changed this to a variable it would work, but it would still require someone to find the VPC ID and input the value; a lot of work!

What if we could fetch the data from the previous stack and let Terraform figure this out? We need to add the following code block to our Terraform stack:

data "terraform_remote_state" "vpc" {
  backend = "remote"
  config = {
    organization = "securectl-poc"
    workspaces = {
      name = "1-poc-network"
    }
  }
}

Let me explain how to interpret the remote state:

data "terraform_remote_state" "vpc" {
  backend = "remote"

The section above declares a terraform_remote_state data source named “vpc” with a backend type of remote.

  config = {
    organization = "securectl-poc"
    workspaces = {
      name = "1-poc-network"

And in the section above, we set up the config that allows us to fetch the needed data from the remote state file. Notice that two inputs are required (an example of the matching output in the parent workspace is sketched after this list):

  • organization
  • workspace name
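
One prerequisite worth calling out: a workspace can only share values that it explicitly declares as outputs. A minimal sketch of what 1-poc-network would need to expose (the module name "vpc" here is an assumption):

output "vpc_id" {
  description = "ID of the VPC created by this workspace"
  value       = module.vpc.vpc_id
}

With an output shaped like this, the child would reference data.terraform_remote_state.vpc.outputs.vpc_id directly; the nested vpc_id.vpc_id reference used later in this post suggests the output in my workspace wraps the ID inside a map.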

Now that we have our remote state set up, let’s change the code to fetch data from it:

data "terraform_remote_state" "vpc" {
  backend = "remote"
  config = {
    organization = "securectl-poc"
    workspaces = {
      name = "1-poc-network"
    }
  }
}

module "vote_service_sg" {
  source = "terraform-aws-modules/security-group/aws"
  name        = "access-security-group"
  description = "Security group for user-service with custom ports open 
  within VPC, and PostgreSQL publicly open"
  vpc_id      = data.terraform_remote_state.vpc.outputs.vpc_id.vpc_id
  ingress_cidr_blocks      = ["10.10.0.0/16","10.10.105.0/24","78.1.10.100/32"]
  ingress_rules            = ["https-443-tcp"]
  ingress_with_cidr_blocks = [
    {
      from_port   = 8080
      to_port     = 8090
      protocol    = "tcp"
      description = "User-service ports"
      cidr_blocks = "10.10.0.0/16"
    },
    {
      rule        = "postgresql-tcp"
      cidr_blocks = "0.0.0.0/0"
    },
  ]
  tags = var.default_tags
}

Notice that vpc_id now points to a value from the remote state file of the 1-poc-network workspace.

data.terraform_remote_state.vpc.outputs.vpc_id.vpc_id

As you can see, our code is now reusable, extracting output information from the remote state.

Using this method, we can create dependencies within our Terraform stacks and use the remote state to extract required attributes. I hope this helped you understand how the backend/remote state works; try it out yourself!

Terraform Cloud Series – Part 3 (Connect Workspace Trigger)

Reading Time: 4 minutes

In the previous posts, we covered how to get started with Terraform Cloud and set up VCS with our source repository. In this post, we will look at how we can use the run trigger capability for dependent workspaces/stacks.

For the purpose of the demo, I will create the following resources using the run trigger feature of TF Cloud, executing the stacks in this order:

  • 1-poc-network
  • 2-poc-security-groups
  • 3-poc-buckets

1-poc-network

This will create the required network, i.e. VPC, subnets, IGW, and SG resources needed to create AWS EC2 instances and other resources that require a network.

2-poc-security-groups

This will create an application-specific security group for the purpose of the demo.

3-poc-buckets

Additional resources needed to support the application i.e. S3 buckets, policies, etc.

Putting it all together, here is how the visualization looks:

Essentially, we are creating a sort of job dependency, but my experience with the trigger has been mixed, as there seem to be a lot of limitations around checks and balances. In my opinion, the workflow above is good for a repeatable, idempotent process where you don’t care if the process is executed multiple times and you expect the same result every time, regardless of the number of executions.

What I experienced is that if a parent job errors during the apply phase, TF Cloud will still trigger the downstream jobs, so there seems to be no way to tell a downstream job that its parent failed. Regardless of these limitations, it is still a good feature for setting up simple chaining. If you need a more task-driven setup, in my opinion, GitLab CI/CD is a better tool.

Now let’s look at the workspaces and how to set up the trigger for child jobs:

If we look at the 1-poc-network workspace, under the run trigger option we have the option to attach child workspace.

Note: Even if the run trigger is set up, a child job can still be executed by itself or via a VCS code commit.

Notice that I don’t have a trigger set up on the parent job; that is because the trigger is configured on 2-poc-security-groups, telling it to run when 1-poc-network executes. Yes, I know it is confusing; it took me by surprise too!

So, let’s look at the trigger properties for 2-poc-security-groups:

So basically we are saying that when the 1-poc-network job is executed, TF Cloud should also execute 2-poc-security-groups. Now let’s also look at 3-poc-buckets:

Now you get the idea of how the flow works! Also, if you are planning to take the HashiCorp Terraform Associate exam, TF Cloud knowledge is a plus and will help you pass. I will do another post on the TF Associate exam.

Trigger the parent job

Now, let me trigger the job (in this case from a git repo commit); as soon as I commit, the Terraform job is scheduled and executed.

Notice that it picked up the trigger, and TF Cloud will execute the dependent workspace after the apply completes for the source job.

Similar to before, 2-poc-security-groups also detected the downstream trigger:

Now, notice that there was nothing to do, as my bucket was already created. However, when I changed the bucket name in the repo, the job still executed independently.

Conclusion

The Terraform Cloud run trigger feature allows users to create stack dependencies when working with a large stack. It is a good method when you need multiple connected workspaces, or when you are changing dependent resources that require a complete teardown and re-create.

Terraform Cloud Series – Part 2

Reading Time: 5 minutes

So, let’s continue from where we left off. In this post, I will focus on the same AWS VPC build process, but this time the code will reside in a git repository (GitLab for this demo). I am assuming the audience is familiar with GitLab/GitHub; otherwise, I recommend learning the basics of Git before continuing with the rest of the demo.

For Part 2 of this series, I will be creating a new workspace for simplicity.

And I will break the blog into the following areas:

  • Connect to a version control provider
    • Setup/Configure application
    • Generate Tokens
  • Integrate Terraform Cloud with Gitlab
  • Create AWS VPC network using git repo
  • Setup Cloud provider tokens/run time vars
  • Update the code base in git

Connect to a version control provider

Once signed into Terraform Cloud, click “New workspace”; you will be asked to set up the backend repository and cloud provider token:

For the purpose of this lab, I will be using GitLab to set up my backend.

Note: If you are planning on using GitHub or GitLab, one thing to keep in mind is that each environment lifecycle should be its own repo/project. If you combine all your code in a single root repo, it will be very difficult to manage stack deployment and organization.

Setup/Configure application

Note: You will need properties from both GitLab and Terraform Cloud, so I suggest you open two windows/tabs to work in parallel.

Once you have signed into GitLab, go to your account settings and then Applications:

Here we will create a new application that will integrate with Terraform Cloud; I am going to call my application “TerraformCloudisFun”.

Notice the Redirect URL is a garbage value; that is on purpose. We will come back and fix it later. Go ahead and save the application.

Now, let’s configure the Terraform Cloud section:

  • You should already be on the “VCS Providers” section under your organization.
  • If not, you can get there by clicking your org –> New Workspace –> VCS provider.
  • Again, I am calling my provider “TerraformCloudisFun” to keep the naming consistent.
  • We will need to provide the application ID & secret generated in the step above.
  • Add the VCS provider and the application is created.

Integrate Terraform Cloud with Gitlab

Locate the callback URL and copy it; we need to update the GitLab application we created in the earlier step with it.

  • If you are still on the application page, click the edit button and update the callback URL with the Terraform Cloud callback URL:
  • Save & update the application.
  • Now, back in Terraform Cloud, click “Connect organization”.
  • Terraform will try to access GitLab.com and authorize the application.

That’s it; the backend is configured and ready to be used.

Create AWS VPC network using git repo

Now that we have our backend ready, let us try to create the AWS VPC by pulling the code directly from version control.

The Terraform application we created will fetch the repos/projects from GitLab.com:

Select your repository or working project for provisioning and create workspace:

You might have to wait a bit before the workspace is ready to be configured.

Hit the configure button and provide the required properties for the cloud provider:

Setup Cloud provider tokens/run time vars

I will add my AWS IAM user access key and secret, which are needed to create the stack in AWS.

  1. AWS Access Key
  2. AWS Access Secret
  3. Additional tag values

Notice that TF Cloud allows you to encrypt the secret values, but this information may appear in TF outputs/debug logs.

  • Select the “Sensitive” checkbox & save the variables.

Now we are ready to create the stack using TF Cloud.

Hit the “Queue Plan” button; the run will generate a plan, and if there are any errors, it will stop:

If all looks good, TF Cloud will ask the user to verify and apply the changes:

Apply the changes and provide comments.

While it is creating the stack, you can look at the raw logs:

If everything goes as planned, the job will change its status to success:

Update the code base in git

For the final piece, I will update one of the subnet CIDR ranges in the TF code block from 10.10.104.0/24 to 10.10.105.0/24 and push the changes to GitLab.

From:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version = "2.29.0"

  # insert the 12 required variables here
  name = "poc-vpc-${var.prefix}"
  cidr = "10.10.0.0/16"
   azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.10.1.0/24"]
  public_subnets  = ["10.10.104.0/24"]

  enable_nat_gateway = true
  enable_vpn_gateway = true
  enable_s3_endpoint = true

  tags = var.default_tags
}

To:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version = "2.29.0"

  # insert the 12 required variables here
  name = "poc-vpc-${var.prefix}"
  cidr = "10.10.0.0/16"
   azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.10.1.0/24"]
  public_subnets  = ["10.10.105.0/24"]

  enable_nat_gateway = true
  enable_vpn_gateway = true
  enable_s3_endpoint = true

  tags = var.default_tags
}


Terraform detected the changes from the backend and generated a new infra plan:

The change is detected, and we can see the subnet will be re-created.

Since it is obvious that pushing this change impacts the network, I have the ability to discard the run with a comment:

In the next post, we will discuss organization of projects/workspaces, the state file, and advanced features.

Hope this post helped to get you started with Terraform Cloud.

Terraform Cloud Series – Part 1

Reading Time: 4 minutes

What is Terraform Cloud? Terraform Cloud is a managed platform for teams/enterprises to create and run TF stacks. With Terraform Cloud, the tfstate file & stack state are stored in the Terraform Cloud platform. I won’t go into how the application works, but you can read up on it at the link below:

https://www.terraform.io/docs/cloud/index.html

TF Cloud is currently free, so go sign up and get hands-on with it. Once you have created your account, you will see the following:

In order to create cloud resources, you need to create a workspace. Think of a workspace as the place where you add/modify/change cloud resources, i.e. VPC, subnet, compute, etc. For Part 1, we will start small and work our way up to a more complex setup.

To create a working workspace, we will need to have the following tools:

  • TF Cloud workspace
  • TF with a remote backend
  • Backend repo control provider
  • Stack management & lifecycle management with Terraform Cloud
  • Terraform code for building stack

For Part 1 of the series, I will limit it to creating a TF workspace and setting up my Terraform templates with a remote backend. Also, if you need sample Terraform templates, you can get them from my git repo here.

Creating Workspace

Once signed into Terraform Cloud, click “New workspace”; you will be asked to set up the backend repository or use “No VCS connection”:

For the purpose of this lab, I will be using no VCS connection to set up my backend. Before you begin, stage a local directory and download the sample code from here.

TF Cloud backend

In order to use TF Cloud, you need to configure a remote backend. Let’s create a new file called backend.tf in the location where we staged the Terraform code, copy-paste the contents below, and update the organization and workspace name:

terraform {
  backend "remote" {
    organization = "example-demo-org"

    workspaces {
      name = "example-demo-org-sandbox"
    }
  }
}

Note: You will need to update Terraform to version >= 0.12.19 to work with terraform login.

Verify that all the information in backend.tf and the cloud provider access key & token have been updated. After that, execute the following command:

terraform login 

The above command will ask you to generate a token, or, if you already have a token created, you can provide it.

If everything goes as planned, you can execute the following commands next:

terraform init
terraform plan

If no errors are indicated, Terraform should spit out a plan with TF backend:

Wait! Shouldn’t I see something in my workspace? No, not yet! With the remote backend, only the tfstate file & plan are stored on Terraform Cloud. As soon as you apply the changes, you will see the queued plan created for the stack, asking for confirmation.

Also, you can see in my AWS account that my custom VPC is not created yet.

After confirming the TF plan, accept the changes and let’s see what happens.

Status changes from confirmation to applying.

Explore the TF Cloud stack; notice we can see the output as it is captured during execution:

Congratulations! You have successfully created a stack using Terraform Cloud, and the stack state file is managed by Terraform Cloud. In the next post, I will show you how to use GitLab or GitHub as a remote repository.

If you have questions or get stuck somewhere in this tutorial, please contact me or leave a comment.

AWS Pricing Calculator **NEW

Reading Time: 3 minutes

Recently I needed to create a quote for AWS infrastructure, and I noticed that AWS is switching from the “AWS Simple Calculator” to the “AWS Pricing Calculator”. So, let’s give it a try.

The process is pretty straightforward: you punch in some inputs and AWS generates the cost estimate for your AWS kit. There is a bit of a learning curve, but not bad.

https://calculator.aws/#/addService


Once you click the URL, you start with a blank pricing sheet that allows you to add services, and you simply input your requirements.

For instance, let’s say we need to provision 10 EC2 instances: simply click configure and add your inputs.

There are two methods:

  • Quick estimate
  • Advanced estimate

For this demo, I am sticking with a quick estimate!

Check out this nice feature: I just plug in my numbers for “vCPUs” and “Memory” and AWS automatically suggests that I should use “r5a.8xlarge”. This is handy since I don’t have to scramble to figure out which instance type I need for my use case.

Next, I need to define how many ec2 instances I need to add.


Great, but what about the pricing model? Not to worry! The new pricing calculator allows us to select the pricing model:

Another example with “Standard Reserved Instances”:

Next, we can add storage for EBS block volume:

Finally, we add the EC2 estimate to the overall pricing estimate and continue adding additional resources.

Give it a try! it is free!

Attached is an example of exported output from the Pricing Calculator:

Terraform registry AWS module

Reading Time: 2 minutes

I am starting to transition from TF 0.11 to TF 0.12. Recently I started working with the AWS Terraform registry VPC module and ran into the following issue.

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  version = "2.29.0"

  # insert the 12 required variables here
  name = "my-vpc"
  cidr = "10.0.0.0/16"
   azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway = true
  enable_vpn_gateway = true

  tags = {
    Terraform = "true"
    Environment = "dev"
  }
}

(dev-tools) ➜  sandbox-vpc  terraform plan       

Error: Call to unknown function

  on .terraform/modules/vpc/main.tf line 288, in resource "aws_subnet" "public":
 288:   availability_zone               = length(regexall("^[a-z]{2}-", element(var.azs, count.index))) > 0 ? element(var.azs, count.index) : null

There is no function named "regexall".


Error: Call to unknown function

  on .terraform/modules/vpc/main.tf line 289, in resource "aws_subnet" "public":
 289:   availability_zone_id            = length(regexall("^[a-z]{2}-", element(var.azs, count.index))) == 0 ? element(var.azs, count.index) : null

There is no function named "regexall".


Error: Call to unknown function

  on .terraform/modules/vpc/main.tf line 316, in resource "aws_subnet" "private":
 316:   availability_zone               = length(regexall("^[a-z]{2}-", element(var.azs, count.index))) > 0 ? element(var.azs, count.index) : null

There is no function named "regexall".


Error: Call to unknown function

  on .terraform/modules/vpc/main.tf line 317, in resource "aws_subnet" "private":
 317:   availability_zone_id            = length(regexall("^[a-z]{2}-", element(var.azs, count.index))) == 0 ? element(var.azs, count.index) : null

There is no function named "regexall".


Error: Call to unknown function

  on .terraform/modules/vpc/main.tf line 343, in resource "aws_subnet" "database":
 343:   availability_zone               = length(regexall("^[a-z]{2}-", element(var.azs, count.index))) > 0 ? element(var.azs, count.index) : null

There is no function named "regexall".

Turns out this is an issue with TF version 0.12.6.

(dev-tools) ➜  sandbox-vpc  terraform --version
Terraform v0.12.6
+ provider.aws v2.55.0

Fix

Upgrade TF to 0.12.24 (see also the version-pinning note after the upgrade log below):

(dev-tools) ➜  sandbox-vpc  brew upgrade terraform
Updating Homebrew...
==> Auto-updated Homebrew!
Updated Homebrew from 8d3aa49ae to c1708ff6b.
Updated 2 taps (homebrew/core and homebrew/cask).
==> Updated Formulae
openssl@1.1 ✔
==> Updated Casks
loginputmac                                 openttd                                     wacom-inkspace

==> Upgrading 1 outdated package:
terraform 0.12.6 -> 0.12.24
==> Upgrading terraform 0.12.6 -> 0.12.24 
==> Downloading https://homebrew.bintray.com/bottles/terraform-0.12.24.mojave.bottle.tar.gz
==> Downloading from https://akamai.bintray.com/2a/2a21a77589673b2064c9fa7587a79a0375d69a8e02f824e5dc22dc960bf2d78b?__gda__=exp=1585
######################################################################## 100.0%
==> Pouring terraform-0.12.24.mojave.bottle.tar.gz
🍺  /usr/local/Cellar/terraform/0.12.24: 6 files, 51.2MB
==> `brew cleanup` has not been run in 30 days, running now...
/usr/local/share/ghostscript/9.19/Resource/CIDFSubst/ipaexg.ttf
(dev-tools) ➜  sandbox-vpc  terraform --version   
Terraform v0.12.24
+ provider.aws v2.55.0
(dev-tools) ➜  sandbox-vpc  terraform plan        
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.


------------------------------------------------------------------------

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # module.vpc.aws_eip.nat[0] will be created
  + resource "aws_eip" "nat" {
      + allocation_id     = (known after apply)
      + association_id    = (known after apply)
      + domain            = (known after apply)
      + id                = (known after apply)
      + instance          = (known after apply)
      + network_interface = (known after apply)
      + private_dns       = (known after apply)
      + private_ip        = (known after apply)
      + public_dns        = (known after apply)
      + public_ip         = (known after apply)
      + public_ipv4_pool  = (known after apply)
      + tags              = {
          + "Environment" = "dev"
          + "Name"        = "my-vpc-us-east-1a"
          + "Terraform"   = "true"
        }
      + vpc               = true
    }
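
To avoid hitting this class of problem again, it may be worth pinning a minimum Terraform version in the configuration itself; a small sketch (the exact constraint is your call):

terraform {
  # Fail fast if someone runs this stack with a Terraform binary older than 0.12.24
  required_version = ">= 0.12.24"
}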

Getting started with AWS Athena – Part 4

Reading Time: 2 minutes

In the previous blog (Part 3), I compared a basic workload on Athena with other query engines, both on-prem and cloud-based solutions. In this post, we will do a bit of a deep dive to understand how the service works and how Amazon built the Athena service.

First, let’s understand the service flow. The figure below explains how the AWS Athena service works and how you can take cold data and run analytics on the dataset.

Athena flow

Let’s break down the entire flow:

  • When you create a table, the table metadata is stored in the metadata catalog, indicated with the red arrow.
  • The table definition has a reference to where the data resides in the S3 bucket, indicated by the blue pointers.
  • Athena will also create an S3 bucket to store service logs, indicated by the dotted line.
  • AWS Athena relies on the Presto in-memory query engine for fast query analytics.
  • The results can either be displayed on the Athena console or pushed to AWS QuickSight for slicing and dicing the data.
  • AWS QuickSight is a great way to understand data, slice and dice it, and publish dashboards.

There are some limitations with AWS Athena, shown in the table below:

Service limits

 

Athena service limitations

Action                     | Limit
Parallel submits           | 1
Parallel query executions  | 5
Number of databases        | 100
Tables per database        | 100
Partitions per table       | 20K
S3 log bucket              | Log bucket created for service outputs

Conclusion

Again, AWS Athena is a good way to start learning about your data quality and data trends, and to convert raw data into dashboards in a few clicks.

In Part-5 I will touch more on AWS Athena + QuickSight and how data can be quickly converted to dashboards.

Hope this post helps you understand the AWS Athena workflow. Comments and questions are welcome!

Thanks!

Getting started with AWS Athena – Part 3

Reading Time: 3 minutes

In the previous blog (Part 2), I created two tables using JSON and CSV formats. In this post (Part 3), I will talk about how one can explore a dataset and query large data with predicate filtering and some basic inner joins using Athena. I will also compare the performance with a Hadoop cluster and AWS EMR.

For this benchmark, I am comparing the following platforms:

  • AWS EMR (1 master, 4 cores [m3.xlarge])
  • On-Prem Hadoop cluster (4 nodes)
    • Hive
    • Impala
    • Hive+Spark
  • AWS Athena

First I need to set up my tables, again using a similar method to the previous blog; I simply generated my DDL and created an “external” table on top of my S3 dataset.

Before I create the tables, I should give readers some context about the dataset. I downloaded it from data.gov and I am using the “Consumer Complaint” dataset. For accessibility reasons, I am providing the direct link to the dataset:

https://catalog.data.gov/dataset/consumer-complaint-database

Data.gov provides quite an extensive amount of open data that can be used for benchmarks and data discovery. I downloaded the CSV-formatted file and converted it to a JSON file. I did some testing with the JSON file, but the numbers don’t seem to be accurate, so I will not include them in this post for now.

DDL for Text Table


 CREATE EXTERNAL TABLE IF NOT EXISTS default.Consumer_Complaint_csv (
 `date_received` string,
 `product` string,
 `sub-product` string,
 `issue` string,
 `sub-issue` string,
 `consumer-complaint-narrative` string,
 `company-public-response` string,
 `company` string,
 `state` string,
 `zip_code` int,
 `tags` string,
 `consumer-consent-provided` string,
 `submitted_via` string,
 `date_sent` string,
 `company-response` string,
 `timely_response` string,
 `consumer-disputed` string,
 `complaint_id` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION 's3://athena1s3bucket/csv/';

Great, I have my table created. Now let’s execute a basic query to make sure I can access the data from the S3 bucket.

select count(*) from Consumer_Complaint_csv;

I’ve created similar tables on AWS EMR and the on-prem Hadoop cluster, using the DDL below:

create external Table default.Consumer_Complaint_csv
(
 `date_received` string,
 `product` string,
 `sub-product` string,
 `issue` string,
 `sub-issue` string,
 `consumer-complaint-narrative` string,
 `company-public-response` string,
 `company` string,
 `state` string,
 `zip_code` int,
 `tags` string,
 `consumer-consent-provided` string,
 `submitted_via` string,
 `date_sent` string,
 `company-response` string,
 `timely_response` string,
 `consumer-disputed` string,
 complaint_id int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
 'serialization.format' = ',',
 'field.delim' = ','
) LOCATION '/tmp/Consumer/csv' 
;

HDFS Cluster config:

Now that I have my tables created in all environments, let’s start executing some queries for benchmarking purposes. I used the queries below to extract data, and for consistency the structure is the same across the board.

Q1  – Simple count

select count(*) from consumer_complaint_csv;

Q2 – Simple count with predicate filter

select count(*), zip_code From consumer_complaint_csv where state = 'TX' 
group by zip_code having count(*) > 20 ;

Q3 – Self inner join with predicate filter

select a.product, a.company , a.issue , count(*) as count_ttl
from default.Consumer_Complaint_csv a 
join default.Consumer_Complaint_csv b 
on (a.company = b.company)
where a.state = 'TX'
group by a.product, a.company , a.issue 
having count(*) > 50;

Q4 – Self inner join with predicate filter and in list

select a.product, a.company , a.issue , count(*) as count_ttl
from default.Consumer_Complaint_csv a 
join default.Consumer_Complaint_csv b 
on (a.company = b.company)
where a.state = 'TX'
and a.product in ('Debt collection','Credit card')
group by a.product, a.company , a.issue 
having count(*) > 50;

Looking at the queries, nothing fancy, just simple SQL. My goal here is to measure performance and see whether AWS Athena holds up to its promise and is performant enough. I’m sure I could get better performance with Parquet or ORC files, but the goal here is to see if the service works. I can say that I am impressed; not having to worry about what is under the hood or the infrastructure makes it a good tool.

Now let’s look at the numbers:

Benchmark chart

Note: All the timings above are in seconds.

One thing to note is that the dataset size is 500MB for Q3 & Q4 due to the self join, while for Q1 & Q2 the dataset size is 263MB.

Conclusion

On the performance front, it is not bad but not great. Keeping in mind that I don’t have to pay for the underlying infrastructure, only for my query executions, that’s bang for the buck!

Overall I like the performance and I will certainly leverage Athena for my future designs.

I am not saying one should ditch the Hadoop cluster or EMR and start using Athena for ongoing operations. I think Athena has its place in the toolkit; it can be a good starting point for data discovery and understanding data when one does not know the quality of the data.

Hope this post helps you understand a bit more about AWS Athena as a service. Do give it a try yourself and let me know your thoughts and comments.

Thanks!