Geospatial data lakes with Amazon Redshift

Data lake architectures help organizations offload data from premium storage systems without losing the ability to query and analyze it. This architecture can be useful for geospatial data, where builders might have terabytes of infrequently accessed data in their databases that they want to maintain cost-effectively. However, this approach requires a data lake query engine that supports geographic information systems (GIS) data types and functions.

Amazon Redshift supports querying spatial data, including the GEOMETRY and GEOGRAPHY data types and the functions commonly used in GIS queries. Amazon Redshift also lets you query geospatial data both in your data lakes on Amazon S3 and in your Redshift data warehouse, giving you a choice of how to access your data. Additionally, AWS Lake Formation and support for AWS Identity and Access Management (IAM) in Esri’s ArcGIS Pro give you a way to securely bridge data between your geospatial data lakes and map visualization tools. You can set up, manage, and secure geospatial data lakes in the cloud with a few clicks.

In this post, we walk through how to set up a geospatial data lake using Lake Formation and query the data with ArcGIS Pro using Amazon Redshift Serverless.

Solution overview

In our example, a county public health department has used Lake Formation to secure a data lake that contains protected health information (PHI). Epidemiologists within the county want to create a map of the clinics providing vaccinations to their communities. The county’s GIS analysts need access to the data lake to create the required maps without being able to access the PHI.

This solution uses Lake Formation tags to allow column-level access to the public information in the database, including the clinic names, addresses, zip codes, and longitude/latitude coordinates, without allowing access to the PHI within the same tables. We use Redshift Serverless and Amazon Redshift Spectrum to access this data from ArcGIS Pro, GIS mapping software from Esri, an AWS Partner.

The following diagram shows the architecture for this solution.

End-to-end architecture showing ArcGIS Pro data integration with AWS analytics services through Redshift connector

The following is a sample schema for this post.

Description | Column Name | Geoproperty Tag
Patient ID | patient_id | No
Clinic ID | clinic_id | Yes
Address of Clinic | clinic_address | Yes
Clinic Zip Code | clinic_zip | Yes
Clinic City | clinic_city | Yes
Patient First Name | first_name | No
Patient Last Name | last_name | No
Patient Address | patient_address | No
Patient Zip Code | patient_zip | No
Vaccination Type | vaccination_type | No
Latitude of Clinic | clinic_lat | Yes
Longitude of Clinic | clinic_long | Yes

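The walkthrough provides a downloadable CSV, but if you want to generate your own sample file matching this schema, a short script like the following works. All values below are invented for illustration, and the filename matches the one used in the upload step later in this post:

```python
import csv

# Columns mirror the sample schema; the clinic_* fields are the ones
# that will later receive the geoproperty LF-Tag.
FIELDS = [
    "patient_id", "clinic_id", "clinic_address", "clinic_zip",
    "clinic_city", "first_name", "last_name", "patient_address",
    "patient_zip", "vaccination_type", "clinic_lat", "clinic_long",
]

# Entirely fictitious rows, for illustration only.
ROWS = [
    ["p-001", "c-01", "100 Main St", "77002", "Houston",
     "Jane", "Doe", "200 Oak Ave", "77005", "flu", "29.7604", "-95.3698"],
    ["p-002", "c-02", "300 Elm St", "77007", "Houston",
     "John", "Roe", "400 Pine Rd", "77008", "covid-19", "29.7805", "-95.4105"],
]

with open("data-with-geocode.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(FIELDS)
    writer.writerows(ROWS)
```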
In the following sections, we walk through the steps to set up the solution:

  1. Deploy the solution infrastructure using AWS CloudFormation.
  2. Upload a CSV with sample data to an Amazon Simple Storage Service (Amazon S3) bucket and run an AWS Glue crawler to crawl the data.
  3. Set up Lake Formation permissions.
  4. Configure the Amazon Redshift Query Editor v2.
  5. Set up the schemas in Amazon Redshift.
  6. Create a view in Amazon Redshift.
  7. Create a local database user in ArcGIS Pro.
  8. Connect ArcGIS Pro to the Redshift database.

Prerequisites

You should have the following prerequisites:

Set up the infrastructure with AWS CloudFormation

To create the environment for the demo, complete the following steps:

  1. Log in to the AWS Management Console as an AWS account administrator and a Lake Formation data lake administrator—this account needs to be both an account admin and a data lake admin for the template to complete.
  2. Open the AWS CloudFormation console.
  3. Choose Launch Stack.

The CloudFormation template creates the following components:

  • S3 bucket: samp-clinic-db-{ACCOUNT_ID}
  • AWS Glue database: samp-clinical-glue-db
  • AWS Glue crawler: samp-glue-crawler
  • Redshift Serverless workgroup: samp-clinical-rs-wg
  • Redshift Serverless namespace: samp-clinical-rs-ns
  • IAM role for Amazon Redshift: demo-RedshiftIAMRole-{UNIQUE_ID}
  • IAM role for AWS Glue: samp-clinical-glue-role
  • Lake Formation tag: geoproperty

Upload a CSV to the S3 bucket and run the AWS Glue crawler

The next step is to create a data lake in our demo environment and then use an AWS Glue crawler to populate the AWS Glue database and update the schema and metadata in the AWS Glue Data Catalog.

The CloudFormation stack created the S3 bucket we will use, as well as the AWS Glue database and crawler. We have provided a fictitious test dataset that represents the patient and clinic information. Download the file and complete the following steps:

  1. On the AWS CloudFormation console, open the stack you just launched.
  2. On the Resources tab, choose the link to the S3 bucket.
  3. Choose Upload and add the CSV file (data-with-geocode.csv), then choose Upload.
  4. On the AWS Glue console, choose Crawlers in the navigation pane.
  5. Select the crawler you created with the CloudFormation stack and choose Run.

The crawler run should only take a minute to complete, and will populate a table named clinic-sample_s3_{ACCOUNT_ID} with a fictitious dataset.

  6. Choose Tables in the navigation pane and open the table the crawler populated.

You will see that the dataset contains fields that contain PHI and personally identifiable information (PII).

AWS Glue table 'clinic-sample_s3' schema definition with patient and clinic fields, input/output formats, and database properties

We now have a database set up and the Data Catalog populated with the schema and metadata we will use for the rest of the demo.

Set up Lake Formation permissions

In this next set of steps, we demonstrate how to secure PHI data to maintain compliance and empower GIS analysts to work effectively. To secure the data lake, we use AWS Lake Formation. In order to properly set up Lake Formation permissions, we need to gather details on how access to the data lake is established.

The Data Catalog provides metadata and schema information that enables services to access data within the data lake. To access the data lake from ArcGIS Pro, we use the ArcGIS Pro Redshift connector, which allows a connection from ArcGIS Pro to Amazon Redshift. Amazon Redshift can access the Data Catalog and provide connectivity to the data lake. The CloudFormation template created a Redshift Serverless instance and namespace and an IAM role that we will use to configure this connection. We still need to set up Lake Formation permissions so that GIS analysts can only access publicly available fields and not those containing PHI or PII. We will assign a Lake Formation tag on the columns containing the publicly available information and assign permissions to the GIS analysts to allow access to columns with this tag.

By default, the Lake Formation configuration allows Super access to IAMAllowedPrincipals; this maintains backward compatibility, as detailed in Changing the default settings for your data lake. To demonstrate a more secure configuration, we will remove this default.

  1. On the Lake Formation console, choose Administration in the navigation pane.
  2. In the Data Catalog settings section, make sure Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are unchecked.

AWS Data Catalog settings interface showing unchecked IAM-only access control options for new databases and tables

  3. In the navigation pane, under Permissions, choose Data permissions.
  4. Select IAMAllowedPrincipals and choose Revoke.
  5. Choose Tables in the navigation pane.
  6. Open the table clinic-sample_s3_{ACCOUNT_ID} and choose Edit schema.
  7. Select the fields beginning with clinic_ and choose Edit LF-Tags.
  8. The CloudFormation stack created a Lake Formation tag named geoproperty. Assign geoproperty as the key and true as the value on all the clinic_ fields, then choose Save.

Next, we need to grant the Amazon Redshift IAM role permission to access fields tagged with geoproperty = true.

  1. Choose Data lake permissions, then choose Grant.
  2. For the IAM role, choose demo-RedshiftIAMRole-UNIQUE_ID.
  3. Select geoproperty for the key and true for the value.
  4. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.

Configure the Amazon Redshift Query Editor v2

Next, we need to perform the initial configuration of Amazon Redshift required for database operations. We use an AWS Secrets Manager secret created by the template to make sure password access is managed securely in accordance with AWS best practices.

  1. On the Amazon Redshift console, choose Query editor v2.
  2. When you first open the query editor, a one-time configuration for the account appears. For this post, keep the default options and choose Configure account.

For more information about these options, refer to Configuring your AWS account.

Redshift query editor configuration interface with AWS KMS encryption settings and optional S3 bucket path input

The query editor will require credentials to connect to the serverless instance; these have been created by the template and stored in Secrets Manager.

  3. Select Other ways to connect, then select AWS Secrets Manager.
  4. For Secret, select (Redshift-admin-credentials).
  5. Choose Save.

Redshift connection interface displaying IAM Identity Center and AWS Secrets Manager authentication methods with credential selector

Set up schemas in Amazon Redshift

An external schema in Amazon Redshift is a feature used to reference schemas that exist in external data sources. For information on creating external schemas, see External schemas in Amazon Redshift Spectrum. We use an external schema to provide access to the data lake in Amazon Redshift. From ArcGIS Pro, we will connect to Amazon Redshift to access the geospatial data.

The IAM role used in the creation of the external schema needs to be associated with the Redshift namespace. This has already been set up by the CloudFormation template, but it’s a good practice to verify that the role is set up correctly before proceeding.

  1. On the Redshift Serverless console, choose Namespace configuration in the navigation pane.
  2. Choose the namespace (samp-clinical-rs-ns).

Amazon Redshift Serverless console displaying namespace configuration with status, workgroup and creation details

On the Security and encryption tab, you should see the IAM role created by CloudFormation. If this role or the namespace isn’t present, verify the stack in AWS CloudFormation before proceeding.

  3. Copy the ARN of the role for use in a later step.

Redshift security configuration panel showing single synchronized IAM role with complete ARN and management options

  4. Choose Query data to return to the query editor.

Amazon Redshift Serverless interface displaying namespace configuration with management and query data controls

  5. In the query editor, enter the following SQL command; be sure to replace the example role ARN with your own. This command creates an external schema that uses the Redshift role associated with our namespace to attach to the AWS Glue database.

CREATE EXTERNAL SCHEMA samp_clinic_sch_ext FROM DATA CATALOG
DATABASE 'samp-clinical-glue-db'
IAM_ROLE 'arn:aws:iam::{ACCOUNT_ID}:role/demo-RedshiftIAMRole-{UNIQUE_ID}';

  6. In the query editor, run a SELECT query against the external table:

SELECT * FROM "dev"."samp_clinic_sch_ext"."clinic-sample_s3_{ACCOUNT_ID}";

Because the associated role has been granted access only to columns tagged with geoproperty = true, only those fields will be returned, as shown in the following screenshot (the data in this example is fictionalized).

Query result displaying 20 medical clinics with details like name, address, and coordinates

  7. Use the following command to create a local schema in Amazon Redshift. The external schema can’t be modified, so we will use this local schema to add a geometry field with a Redshift function.

CREATE SCHEMA samp_clinic_sch_local;

Create a view in Amazon Redshift

For the data to be viewable from ArcGIS Pro, we need to create a view. Now that the schemas are established, we can create one that ArcGIS Pro can access.

Amazon Redshift provides many geospatial functions that can be used to create views with fields used by ArcGIS Pro to add points onto a map. We will use one of these functions because the dataset contains latitude and longitude.

Use the following SQL code in the Amazon Redshift Query Editor to create a new view named clinic_location_view. Replace {ACCOUNT_ID} with your own account ID.

CREATE OR REPLACE VIEW "samp_clinic_sch_local"."clinic_location_view" AS
SELECT
    clinic_id as id,
    clinic_lat as lat,
    clinic_long as long,
    ST_MAKEPOINT(long, lat) as geom
FROM
    "dev"."samp_clinic_sch_ext"."clinic-sample_s3_{ACCOUNT_ID}"
WITH NO SCHEMA BINDING;

The new view that is created under your local schema will have a column named geom containing map-based points that can be used by ArcGIS Pro to add points during map creation. The points in this example are for the clinics providing vaccines. In a real-world scenario, as new clinics are built and their data is added to the data lake, their locations would be added to the map created using this data.
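Note that ST_MakePoint takes coordinates in (x, y) order, meaning (longitude, latitude); swapping them is a common mapping bug that places points far from their true location. A minimal local sketch of the same construction in plain Python (no Redshift required; make_point is a hypothetical helper, not part of the solution):

```python
def make_point(lon: float, lat: float) -> str:
    """Build a WKT point the way ST_MakePoint(x, y) does:
    x is the longitude, y is the latitude."""
    return f"POINT({lon} {lat})"

# A Houston-area clinic (fictitious coordinates):
wkt = make_point(-95.3698, 29.7604)
print(wkt)  # POINT(-95.3698 29.7604)
```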

Create a local database user for ArcGIS Pro

For this demo, we use a database user and group to provide access for ArcGIS Pro clients. Enter the following SQL code into the Amazon Redshift Query Editor to create a database user and group:

CREATE USER dbuser with PASSWORD 'SET_PASSWORD_HERE';
CREATE GROUP esri_developer_group;
ALTER GROUP esri_developer_group ADD USER dbuser;

After the commands are complete, use the following code to grant permissions to the group:

GRANT USAGE ON SCHEMA samp_clinic_sch_local TO GROUP esri_developer_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA samp_clinic_sch_local GRANT SELECT ON TABLES TO GROUP esri_developer_group;
GRANT SELECT ON ALL TABLES IN SCHEMA samp_clinic_sch_local TO GROUP esri_developer_group;

Connect ArcGIS Pro to the Redshift database

In order to add the database connection to ArcGIS Pro, you need the endpoint for the Redshift Serverless workgroup. You can access the endpoint information on the samp-clinical-rs-wg workgroup details page on the Redshift Serverless console. The Redshift namespaces and workgroups are listed by default, as shown in the following screenshot.

Amazon Redshift Serverless namespace and workgroup status dashboard with performance metrics

You can copy the endpoint in the General information section. This endpoint will need to be modified: remove the trailing :5439/dev when configuring the connector in ArcGIS Pro.

Amazon Redshift Serverless workgroup details showing configuration and connection information
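The trimming described above can be sketched as a tiny helper (hypothetical, for illustration; ArcGIS Pro only needs the hostname portion of the endpoint):

```python
def server_from_endpoint(endpoint: str) -> str:
    """Strip the :5439/dev (port and database) suffix from a
    Redshift Serverless endpoint, leaving just the hostname."""
    return endpoint.split(":", 1)[0]

# Example with a made-up account ID:
endpoint = "samp-clinical-rs-wg.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439/dev"
print(server_from_endpoint(endpoint))
```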

  1. Open ArcGIS Pro with the project file you want to add the Redshift connection to.
  2. On the menu, choose Insert, then choose Connections, Database, and New Database Connection.
  3. For Database Platform, choose Amazon Redshift.
  4. For Server, insert the endpoint you copied (remove everything following .com from the endpoint).
  5. For Database, choose your database.

Amazon Redshift Serverless connection settings with server, authentication, and database fields

If your ArcGIS Pro client doesn’t have access to the endpoint, you will receive an error during this step. A network path must exist between the ArcGIS Pro client and the Redshift Serverless endpoint. You can set up the network path with AWS Direct Connect, AWS Site-to-Site VPN, or AWS Client VPN. Although it’s not recommended for security reasons, you can also configure Amazon Redshift with a publicly available endpoint. Be sure to consult your security and network teams for best practices and policy guidance before allowing public access to your Redshift Serverless instance.

If a network path exists and you’re still having issues connecting, verify that the security group rules allow inbound communication from your ArcGIS Pro subnet over the port your Redshift Serverless instance is running on. The default port is 5439, but you can configure a range of ports depending on your environment; see Connecting to Amazon Redshift Serverless for more information.
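Before digging into security group rules, it can help to confirm basic TCP reachability from the ArcGIS Pro host to the workgroup endpoint. The following is a minimal sketch in plain Python (the endpoint in the comment is a placeholder you would replace with your own):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (substitute your workgroup endpoint, without the :5439/dev suffix):
# can_reach("samp-clinical-rs-wg.<account>.us-east-1.redshift-serverless.amazonaws.com", 5439)
```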

If connectivity is successful, ArcGIS Pro will add the Amazon Redshift connection under Connection File Name.

  6. Choose OK.
  7. Choose the connection to display the view that was created to include geometry (clinic_location_view).
  8. Choose (right-click) the view and choose Add To Current Map.

ArcGIS Pro will add the points from the view onto the map. The final map displayed has the symbology edited to use red crosses to represent the clinics instead of dots.

Professional GIS interface showing Houston metropolitan vaccination clinics with topographic base map, toolbars, and database connectivity

Clean up

After you have finished the demo, complete the following steps to clean up your resources:

  1. On the Amazon S3 console, open the bucket created by the CloudFormation stack and delete the data-with-geocode.csv file.
  2. On the AWS CloudFormation console, delete the demo stack to remove the resources it created.

Conclusion

In this post, we reviewed how to set up Redshift Serverless to use geospatial data contained within a data lake to enhance maps in ArcGIS Pro. This technique helps builders and GIS analysts use available datasets in data lakes and transform them in Amazon Redshift to further enrich the data before presenting it on a map. We also showed how to secure a data lake using Lake Formation, crawl a geospatial dataset with AWS Glue, and visualize the data in ArcGIS Pro.

For additional best practices for storing geospatial data in Amazon S3 and querying it with Amazon Redshift, see How to partition your geospatial data lake for analysis with Amazon Redshift. We invite you to leave feedback in the comments section.


About the authors

Jeremy Spell is a Cloud Infrastructure Architect working with Amazon Web Services (AWS) Professional Services. He enjoys architecting and building solutions for customers. In his free time Jeremy makes Texas style BBQ, and spends time with his family and church community.

Jeff Demuth is a solutions architect who joined Amazon Web Services (AWS) in 2016. He focuses on the geospatial community and is passionate about geographic information systems (GIS) and technology. Outside of work, Jeff enjoys traveling, building Internet of Things (IoT) applications, and tinkering with the latest gadgets.
