
Data lake architectures help organizations offload data from premium storage systems without losing the ability to query and analyze the data. This architecture can be useful for geospatial data, where builders might have terabytes of infrequently accessed data in their databases that they want to maintain cost-effectively. However, this requires their data lake query engine to support geographic information systems (GIS) data types and functions.
Amazon Redshift supports querying spatial data, including the GEOMETRY and GEOGRAPHY data types and the functions used to query GIS systems. Amazon Redshift also lets you query geospatial data both in your data lakes on Amazon S3 and in your Redshift data warehouse, giving you the choice of how you access your data. In addition, AWS Lake Formation and support for AWS Identity and Access Management (IAM) in Esri’s ArcGIS Pro give you a way to securely bridge data between your geospatial data lakes and map visualization tools. You can set up, manage, and secure geospatial data lakes in the cloud with a few clicks.
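For example, you can construct and inspect geometries directly in SQL. The following is a minimal sketch using the Redshift spatial functions ST_GeomFromText and ST_AsText (the coordinates are illustrative):

-- Build a point geometry from well-known text (WKT) and render it back as text
SELECT ST_AsText(ST_GeomFromText('POINT(-77.0369 38.9072)'));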
In this post, we walk through how to set up a geospatial data lake using Lake Formation and query the data with ArcGIS Pro using Amazon Redshift Serverless.
Solution overview
In our example, a county public health department has used Lake Formation to secure their data lake that contains protected health information (PHI). Epidemiologists within the county want to create a map of the clinics providing vaccinations to their communities. The county’s GIS analysts need access to the data lake to create the required maps without being able to access the PHI data.
This solution uses Lake Formation tags to allow column-level access in the database to the public information that includes the clinic names, addresses, zip codes, and longitude/latitude coordinates without allowing access to the PHI data within the same tables. We use Redshift Serverless and Amazon Redshift Spectrum to access this data from ArcGIS Pro, a GIS mapping software from Esri, an AWS Partner.
The following diagram shows the architecture for this solution.
The following is a sample schema for this post.
| Description | Column Name | Geoproperty Tag |
| --- | --- | --- |
| Patient ID | patient_id | No |
| Clinic ID | clinic_id | Yes |
| Address of Clinic | clinic_address | Yes |
| Clinic Zip Code | clinic_zip | Yes |
| Clinic City | clinic_city | Yes |
| Patient First Name | first_name | No |
| Patient Last Name | last_name | No |
| Patient Address | patient_address | No |
| Patient Zip Code | patient_zip | No |
| Vaccination Type | vaccination_type | No |
| Latitude of Clinic | clinic_lat | Yes |
| Longitude of Clinic | clinic_long | Yes |
In the following sections, we walk through the steps to set up the solution:
- Deploy the solution infrastructure using AWS CloudFormation.
- Upload a CSV with sample data to an Amazon Simple Storage Service (Amazon S3) bucket and run an AWS Glue crawler to crawl the data.
- Set up Lake Formation permissions.
- Configure the Amazon Redshift Query Editor v2.
- Set up the schemas in Amazon Redshift.
- Create a view in Amazon Redshift.
- Create a local database user in ArcGIS Pro.
- Connect ArcGIS Pro to the Redshift database.
Prerequisites
You should have the following prerequisites:
- An AWS account
- Lake Formation enabled in your target AWS Region
- Familiarity with Lake Formation and setting permissions on tables
- ArcGIS Pro
- Network connectivity from the ArcGIS Pro client to the virtual private cloud (VPC) where Amazon Redshift resources will be deployed using either VPN or AWS Direct Connect
Set up the infrastructure with AWS CloudFormation
To create the environment for the demo, complete the following steps:
- Log in to the AWS Management Console as an AWS account administrator and a Lake Formation data lake administrator; the account needs both roles for the template to complete.
- Open the AWS CloudFormation console.
- Choose Launch Stack.
The CloudFormation template creates the following components:
- S3 bucket – samp-clinic-db-{ACCOUNT_ID}
- AWS Glue database – samp-clinical-glue-db
- AWS Glue crawler – samp-glue-crawler
- Redshift Serverless workgroup – samp-clinical-rs-wg
- Redshift Serverless namespace – samp-clinical-rs-ns
- IAM role for Amazon Redshift – demo-RedshiftIAMRole-{UNIQUE_ID}
- IAM role for AWS Glue – samp-clinical-glue-role
- Lake Formation tag – geoproperty
Upload a CSV to the S3 bucket and run the AWS Glue crawler
The next step is to create a data lake in our demo environment and then use an AWS Glue crawler to populate the AWS Glue database and update the schema and metadata in the AWS Glue Data Catalog.
The CloudFormation stack created the S3 bucket we will use as well as the AWS Glue database and crawler. We have provided a fictitious test dataset that represents the patient and clinic information. Download the file and complete the following steps:
- On the AWS CloudFormation console, open the stack you just launched.
- On the Resources tab, choose the link to the S3 bucket.
- Choose Upload and add the CSV file (data-with-geocode.csv), then choose Upload.
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Select the crawler you created with the CloudFormation stack and choose Run.
The crawler run should only take a minute to complete, and will populate a table named clinic-sample-s3_ACCOUNT_ID with a fictitious dataset.
- Choose Tables in the navigation pane and open the table the crawler populated.
You will see that the dataset contains fields that contain PHI and personally identifiable information (PII).
We now have a database set up and the Data Catalog populated with the schema and metadata we will use for the rest of the demo.
Set up Lake Formation permissions
In this next set of steps, we demonstrate how to secure PHI data to maintain compliance and empower GIS analysts to work effectively. To secure the data lake, we use AWS Lake Formation. In order to properly set up Lake Formation permissions, we need to gather details on how access to the data lake is established.
The Data Catalog provides metadata and schema information that enables services to access data within the data lake. To access the data lake from ArcGIS Pro, we use the ArcGIS Pro Redshift connector, which allows a connection from ArcGIS Pro to Amazon Redshift. Amazon Redshift can access the Data Catalog and provide connectivity to the data lake. The CloudFormation template created a Redshift Serverless instance and namespace and an IAM role that we will use to configure this connection. We still need to set up Lake Formation permissions so that GIS analysts can only access publicly available fields and not those containing PHI or PII. We will assign a Lake Formation tag on the columns containing the publicly available information and assign permissions to the GIS analysts to allow access to columns with this tag.
By default, the Lake Formation configuration allows Super access to IAMAllowedPrincipals; this is to maintain backward compatibility, as detailed in Changing the default settings for your data lake. To demonstrate a more secure configuration, we will remove this default setting.
- On the Lake Formation console, choose Administration in the navigation pane.
- In the Data Catalog settings section, make sure Use only IAM access control for new databases and Use only IAM access control for new tables in new databases are unchecked.
- In the navigation pane, under Permissions, choose Data permissions.
- Select IAMAllowedPrincipals and choose Revoke.
- Choose Tables in the navigation pane.
- Open the table clinic-sample-s3_ACCOUNT_ID and choose Edit schema.
- Select the fields beginning with clinic_ and choose Edit LF-Tags.
- The CloudFormation stack created a Lake Formation tag named geoproperty. Assign geoproperty as the key and true as the value on all the clinic_ fields, then choose Save.
Next, we need to grant the Amazon Redshift IAM role permission to access fields tagged with geoproperty = true.
- Choose Data lake permissions, then choose Grant.
- For the IAM role, choose demo-RedshiftIAMRole-UNIQUE_ID.
- Select geoproperty for the key and true for the value.
- Under Database permissions, select Describe, and under Table permissions, select Select and Describe.
Configure the Amazon Redshift Query Editor v2
Next, we need to perform the initial configuration of Amazon Redshift required for database operations. We use an AWS Secrets Manager secret created by the template to make sure password access is managed securely in accordance with AWS best practices.
- On the Amazon Redshift console, choose Query editor v2.
- When you first start Amazon Redshift, a one-time configuration for the account appears. For this post, leave the options at their defaults and choose Configure account.
For more information about these options, refer to Configuring your AWS account.
The query editor will require credentials to connect to the serverless instance; these have been created by the template and stored in Secrets Manager.
- Select Other ways to connect, then select AWS Secrets Manager.
- For Secret, select Redshift-admin-credentials.
- Choose Save.
Set up schemas in Amazon Redshift
An external schema in Amazon Redshift is a feature used to reference schemas that exist in external data sources. For information on creating external schemas, see External schemas in Amazon Redshift Spectrum. We use an external schema to provide access to the data lake in Amazon Redshift. From ArcGIS Pro, we will connect to Amazon Redshift to access the geospatial data.
The IAM role used in the creation of the external schema needs to be associated with the Redshift namespace. This has already been set up by the CloudFormation template, but it’s a good practice to verify that the role is set up correctly before proceeding.
- On the Redshift Serverless console, choose Namespace configuration in the navigation pane.
- Choose the namespace (samp-clinical-rs-ns).
On the Security and encryption tab, you should see the IAM role created by CloudFormation. If this role or the namespace isn’t present, verify the stack in AWS CloudFormation before proceeding.
- Copy the ARN of the role for use in a later step.
- Choose Query data to return to the query editor.
- In the query editor, enter the following SQL command; be sure to replace the example role ARN with your own. This SQL command will create an external schema that uses the same Redshift role associated with our namespace to attach to the AWS Glue database.
CREATE EXTERNAL SCHEMA samp_clinic_sch_ext FROM DATA CATALOG
database 'samp-clinical-glue-db'
IAM_ROLE 'arn:aws:iam::{ACCOUNT_ID}:role/demo-RedshiftIAMRole-{UNIQUE_ID}';
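Before querying, you can optionally confirm that the external schema registered correctly by listing external schemas from the SVV_EXTERNAL_SCHEMAS system view; this is a quick sanity check, not a required step:

-- List external schemas and the Data Catalog databases they point to
SELECT schemaname, databasename FROM svv_external_schemas;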
- In the query editor, run a select query against the external schema:
SELECT * FROM "dev"."samp_clinic_sch_ext"."clinic-sample-s3_{ACCOUNT_ID}";
Because the associated role has been granted access to columns tagged with geoproperty = true, only those fields will be returned, as shown in the following screenshot (the data in this example is fictionalized).
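You can also verify the column-level control from the other direction: selecting a column that wasn’t tagged with geoproperty, such as patient_id, should fail with a permission error instead of returning data (the exact error message may vary):

-- Expected to fail: the Redshift role has no Lake Formation grant on this column
SELECT patient_id FROM "dev"."samp_clinic_sch_ext"."clinic-sample-s3_{ACCOUNT_ID}";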
- Use the following command to create a local schema in Amazon Redshift. The external schema can’t be updated; we will use this local schema to add a geometry field with a Redshift function.
CREATE SCHEMA samp_clinic_sch_local;
Create a view in Amazon Redshift
For the data to be viewable from ArcGIS Pro, we need to create a view. Now that the schemas have been established, we can create one that ArcGIS Pro can access.
Amazon Redshift provides many geospatial functions that can be used to create views with fields used by ArcGIS Pro to add points onto a map. We will use one of these functions because the dataset contains latitude and longitude.
Use the following SQL code in the Amazon Redshift Query Editor to create a new view named clinic_location_view. Replace {ACCOUNT_ID} with your own account ID.
CREATE OR REPLACE VIEW "samp_clinic_sch_local"."clinic_location_view" AS
SELECT
    clinic_id as id,
    clinic_lat as lat,
    clinic_long as long,
    ST_MAKEPOINT(long, lat) as geom
FROM
    "dev"."samp_clinic_sch_ext"."clinic-sample-s3_{ACCOUNT_ID}"
WITH NO SCHEMA BINDING;
The new view that is created under your local schema will have a column named geom containing map-based points that can be used by ArcGIS Pro to add points during map creation. The points in this example are for the clinics providing vaccines. In a real-world scenario, as new clinics are built and their data is added to the data lake, their locations would be added to the map created using this data.
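Before moving on to ArcGIS Pro, you can spot-check the view in the query editor. Rendering the geometry as well-known text (WKT) is a quick way to confirm the points were built with longitude first, which is the x/y order ST_MakePoint expects:

-- Preview a few rows, rendering the geometry as WKT
SELECT id, ST_AsText(geom) FROM "samp_clinic_sch_local"."clinic_location_view" LIMIT 5;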
Create a local database user for ArcGIS Pro
For this demo, we use a database user and group to provide access for ArcGIS Pro clients. Enter the following SQL code into the Amazon Redshift Query Editor to create a database user and group:
CREATE USER dbuser WITH PASSWORD 'SET_PASSWORD_HERE';
CREATE GROUP esri_developer_group;
ALTER GROUP esri_developer_group ADD USER dbuser;
After the commands are complete, use the following code to grant permissions to the group:
GRANT USAGE ON SCHEMA samp_clinic_sch_local TO GROUP esri_developer_group;
ALTER DEFAULT PRIVILEGES IN SCHEMA samp_clinic_sch_local GRANT SELECT ON TABLES TO GROUP esri_developer_group;
GRANT SELECT ON ALL TABLES IN SCHEMA samp_clinic_sch_local TO GROUP esri_developer_group;
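To confirm the grants took effect, you can check the new user’s privileges with the built-in has_schema_privilege and has_table_privilege functions while still connected as the admin user:

-- Both queries should return true for dbuser
SELECT has_schema_privilege('dbuser', 'samp_clinic_sch_local', 'usage');
SELECT has_table_privilege('dbuser', 'samp_clinic_sch_local.clinic_location_view', 'select');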
Connect ArcGIS Pro to the Redshift database
In order to add the database connection to ArcGIS Pro, you need the endpoint for the Redshift Serverless workgroup. You can access the endpoint information on the samp-clinical-rs-wg workgroup details page on the Redshift Serverless console. The Redshift namespaces and workgroups are listed by default, as shown in the following screenshot.
You can copy the endpoint in the General information section. This endpoint will need to be modified; remove the :5439/dev suffix when configuring the connector in ArcGIS Pro.
- Open ArcGIS Pro with the project file you want to add the Redshift connection to.
- On the menu, choose Insert and then Connections, Database, and New Database Connection.
- For Database Platform, choose Amazon Redshift.
- For Server, insert the endpoint you copied (remove everything following .com from the endpoint).
- For Database, choose your database.
If your ArcGIS Pro client doesn’t have access to the endpoint, you will receive an error during this step. A network path must exist between the ArcGIS Pro client and the Redshift Serverless endpoint. You can set up the network path with Direct Connect, AWS Site-to-Site VPN, or AWS Client VPN. Although it’s not recommended for security reasons, you can also configure Amazon Redshift with a publicly available endpoint. Be sure you consult your security and network teams for best practices and policy guidance before allowing public access to your Redshift Serverless instance.
If a network path exists and you’re having issues connecting, verify the security group rules allow communication inbound from your ArcGIS Pro subnet over the port your Redshift Serverless instance is running on. The default port is 5439, but you can configure a range of ports depending on your environment; see Connecting to Amazon Redshift Serverless for more information.
If connectivity is successful, ArcGIS Pro will add the Amazon Redshift connection under Connection File Name.
- Choose OK.
- Choose the connection to display the view that was created to include geometry (clinic_location_view).
- Choose (right-click) the view and choose Add To Current Map.
ArcGIS Pro will add the points from the view onto the map. The final map displayed has the symbology edited to use red crosses to represent the clinics instead of dots.
Clean up
After you have finished the demo, complete the following steps to clean up your resources:
- On the Amazon S3 console, open the bucket created by the CloudFormation stack and delete the data-with-geocode.csv file.
- On the AWS CloudFormation console, delete the demo stack to remove the resources it created.
Conclusion
In this post, we reviewed how to set up Redshift Serverless to use geospatial data contained within a data lake to enhance maps in ArcGIS Pro. This technique helps builders and GIS analysts use available datasets in data lakes and transform them in Amazon Redshift to further enrich the data before presenting it on a map. We also showed how to secure a data lake using Lake Formation, crawl a geospatial dataset with AWS Glue, and visualize the data in ArcGIS Pro.
For additional best practices for storing geospatial data in Amazon S3 and querying it with Amazon Redshift, see How to partition your geospatial data lake for analysis with Amazon Redshift. We invite you to leave feedback in the comments section.
About the authors
Jeremy Spell is a Cloud Infrastructure Architect working with Amazon Web Services (AWS) Professional Services. He enjoys architecting and building solutions for customers. In his free time Jeremy makes Texas style BBQ, and spends time with his family and church community.
Jeff Demuth is a solutions architect who joined Amazon Web Services (AWS) in 2016. He focuses on the geospatial community and is passionate about geographic information systems (GIS) and technology. Outside of work, Jeff enjoys traveling, building Internet of Things (IoT) applications, and tinkering with the latest gadgets.