Amazon S3 BigLake External Table


Cross-cloud Data Pipelines with Amazon S3 BigLake

Create Amazon S3 Bucket and Upload Data

Create an Amazon S3 bucket in the us-west-2 region to store the data that you want to query with BigQuery.

aws-s3.tf
resource "random_string" "velano_collectives_suffix" {
  length  = 4
  special = false
  numeric = true
  lower   = true
  upper   = false
}

module "aws_s3_bucket_velano_collectives" {
  source  = "terraform-aws-modules/s3-bucket/aws"
  version = "~> 4.10.1"

  bucket = "velano-collectives-${random_string.velano_collectives_suffix.result}"
}

resource "aws_s3_object" "products" {
  bucket = module.aws_s3_bucket_velano_collectives.s3_bucket_id
  key    = "products.csv"
  source = "../data/products.csv"
}

products.csv is a sample data file that you can upload to the S3 bucket.
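The contents of products.csv are not shown here; for illustration only, a file of roughly this shape would work (the column names and rows below are hypothetical, not taken from the actual file):

```csv
id,name,price
1,Widget,9.99
2,Gadget,24.50
3,Doohickey,3.25
```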

Connect to Amazon S3

To connect BigQuery to Amazon S3, you need to create a BigQuery connection that allows BigQuery to access data stored in S3. This involves creating an AWS IAM policy and role, configuring a trust relationship, and setting up the BigQuery connection.

Here are the steps to follow:

  1. Create an AWS IAM Policy and Role for BigQuery
  2. Create a BigQuery connection
  3. Add a trust relationship to the AWS role
    • Add a trust policy to the AWS role
    • Configure a custom AWS identity provider

See Connect to Amazon S3 for more details.
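If you choose the custom identity provider variant of step 3, that piece can also be managed with Terraform. A minimal sketch, assuming the audience is the BigQuery connection's identity (the value shown is a placeholder, and older AWS provider versions may also require `thumbprint_list`):

```hcl
resource "aws_iam_openid_connect_provider" "google" {
  url = "https://accounts.google.com"

  # The audience must be the BigQuery connection's identity;
  # the value below is hypothetical.
  client_id_list = ["000000000000-example.apps.googleusercontent.com"]
}
```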

After completing these steps, the AWS IAM policy and role would look like this:

aws-iam.tf
resource "aws_iam_policy" "bigquery" {
  name = "bigquery-policy"

  policy = <<-EOF
            {
              "Version": "2012-10-17",
              "Statement": [
                  {
                      "Sid": "BucketLevelAccess",
                      "Effect": "Allow",
                      "Action": ["s3:ListBucket"],
                      "Resource": ["${module.aws_s3_bucket_velano_collectives.s3_bucket_arn}"]
                  },
                  {
                      "Sid": "ObjectLevelAccess",
                      "Effect": "Allow",
                      "Action": ["s3:GetObject","s3:PutObject"],
                      "Resource": ["${module.aws_s3_bucket_velano_collectives.s3_bucket_arn}/*"]
                  }
              ]
            }
            EOF
}

resource "aws_iam_role" "bigquery" {
  name                 = "bigquery-role"
  max_session_duration = 43200

  assume_role_policy = <<-EOF
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Federated": "accounts.google.com"
          },
          "Action": "sts:AssumeRoleWithWebIdentity",
          "Condition": {
            "StringEquals": {
              "accounts.google.com:sub": "${google_bigquery_connection.aws.aws[0].access_role[0].identity}"
            }
          }
        }
      ]
    }
    EOF
}

resource "aws_iam_role_policy_attachment" "main" {
  role       = aws_iam_role.bigquery.name
  policy_arn = aws_iam_policy.bigquery.arn
}

And the BigQuery connection would look like this:

gcp-bq-conns.tf
resource "google_bigquery_connection" "aws" {
  connection_id = "aws"
  friendly_name = "aws"
  description   = "Created by Terraform"

  location = "aws-us-west-2"
  aws {
    access_role {
      # This must be constructed as a string instead of referencing the
      # AWS resources directly to avoid a resource dependency cycle
      # in Terraform.
      iam_role_id = "arn:aws:iam::545757050262:role/bigquery-role"
    }
  }
}

Create Amazon S3 BigLake Tables

Create a Dataset

create schema if not exists products_aws
options (
  location = 'aws-us-west-2'
)
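The dataset can equivalently be managed with Terraform instead of DDL; a sketch, assuming the same dataset ID:

```hcl
resource "google_bigquery_dataset" "products_aws" {
  dataset_id = "products_aws"

  # BigQuery Omni datasets are located in the AWS region,
  # not a Google Cloud region.
  location = "aws-us-west-2"
}
```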

Create a Table

create external table products_aws.products
with connection `aws-us-west-2.aws`
options (
  format = "csv",
  uris = ["s3://velano-collectives-n1y3/products.csv"]
)
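The same table can also be declared in Terraform rather than DDL. A sketch, assuming the connection resource from gcp-bq-conns.tf and the bucket suffix shown above:

```hcl
resource "google_bigquery_table" "products" {
  dataset_id = "products_aws"
  table_id   = "products"

  external_data_configuration {
    autodetect    = true
    source_format = "CSV"

    # References the BigQuery connection created earlier.
    connection_id = google_bigquery_connection.aws.id
    source_uris   = ["s3://velano-collectives-n1y3/products.csv"]
  }
}
```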


Query Amazon S3 BigLake Tables

To query the Amazon S3 BigLake table, your user or service account needs the following roles:

  • BigQuery Connection User (roles/bigquery.connectionUser)
  • BigQuery Data Viewer (roles/bigquery.dataViewer)
  • BigQuery User (roles/bigquery.user)
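The roles above can be granted with Terraform as well; a sketch, using the project ID from the query below and a hypothetical user as the member:

```hcl
resource "google_project_iam_member" "bigquery_roles" {
  for_each = toset([
    "roles/bigquery.connectionUser",
    "roles/bigquery.dataViewer",
    "roles/bigquery.user",
  ])

  project = "velano-collective-ac8f"
  member  = "user:analyst@example.com" # hypothetical principal
  role    = each.value
}
```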

After you have the necessary permissions, you can query the table to see the data:

select
  *,
  _FILE_NAME AS file_name
from `velano-collective-ac8f.products_aws.products`
