Extracting data from PostgreSQL to Amazon Redshift
Extracting data from PostgreSQL is a foundational step in many modern data integration projects, particularly when centralizing analytics in data warehouses like Amazon Redshift. Establishing an efficient and reliable data pipeline ensures that your organization can benefit from timely, accurate insights while minimizing operational complexity. In this article, we will guide you through the essential steps for moving your PostgreSQL data into Redshift. We begin with the creation of an identity in PostgreSQL, ensuring secure and controlled access for your extraction process. For users of Matillion, a widely adopted ETL platform, we discuss how to check for (or acquire) the necessary JDBC driver that enables connectivity between services. Next, we examine network connectivity requirements to allow seamless data transfer from your PostgreSQL source to the Redshift target. Finally, we cover techniques for querying data, handling both the initial full load and subsequent incremental updates. Whether you are setting up your pipeline for the first time or refining an existing process, this article provides practical guidance to streamline your PostgreSQL to Redshift data integration.
What is PostgreSQL?
PostgreSQL is a powerful, open-source relational database known for advanced SQL features like transactional integrity, multi-version concurrency control (MVCC), and diverse indexing, including full-text search. Highly extensible, it supports custom data types and procedural languages, making it adaptable for varied applications. Adhering to ACID principles and SQL standards, PostgreSQL ensures reliable, consistent, and interoperable data management. Its active community drives frequent updates and a broad ecosystem of tools, solidifying its reputation as a top choice for scalable, reliable, and feature-rich database solutions in both enterprise and open-source environments.
What is Amazon Redshift?
Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse optimized for fast querying and analysis of large datasets. Built on PostgreSQL, it uses columnar storage, parallel processing, and compression for high-performance analytic workloads. Redshift ingests data from sources like Amazon S3, DynamoDB, and on-premises databases, and supports complex SQL queries via BI tools within AWS. Key features include automatic scaling, workload management, and materialized views, supporting enterprise BI and real-time analytics. Its architecture enables seamless connectivity and integration with AWS services, allowing users to build robust end-to-end analytics pipelines.
Why Move Data from PostgreSQL into Amazon Redshift
The Advantages of Transferring Data from PostgreSQL to Redshift for Analytics
A data engineer or architect may choose to copy data from PostgreSQL into Amazon Redshift for several compelling reasons. Firstly, the data housed within PostgreSQL often contains valuable information that can inform business decisions and drive analytical insights. By transferring this data to Redshift, it becomes possible to integrate PostgreSQL data with other enterprise data sources in a centralized data warehouse, thereby amplifying its value through comprehensive analysis and cross-source correlations. Importantly, performing complex queries, reporting, and data integration operations directly on the transactional PostgreSQL database can negatively impact its performance and affect critical business applications. By leveraging Redshift’s high-performance, scalable architecture for analytical workloads, organizations can ensure that operational databases remain unburdened while still gaining timely and actionable intelligence from their data.
Creating a User in PostgreSQL
This guide provides step-by-step instructions for creating a user (role) in a PostgreSQL database using the PostgreSQL command-line interface (psql). These steps assume you have administrative (superuser or sufficient privilege) access to the database server.
1. Connect to PostgreSQL
First, access your PostgreSQL environment via the terminal:
```bash
psql -U postgres
```
- Replace `postgres` with your superuser or admin username if different.
2. Check Existing Roles
To list existing roles, run:
\du
3. Create a New User
Use the `CREATE ROLE` statement or its shortcut `CREATE USER` to create a new database user. Here is an example:
CREATE USER username WITH PASSWORD 'user_password';
- Replace `username` with the desired PostgreSQL username.
- Replace `user_password` with a strong, secure password.
Example:
CREATE USER reporting_user WITH PASSWORD 'Str0ngP@ssw0rd!';
4. Grant Privileges (Optional)
By default, the new user cannot create databases or modify schemas. To grant additional privileges, use:
Grant Permission to Create Databases
ALTER USER username CREATEDB;
Grant Permission to Login (Almost always required for users):
The `LOGIN` privilege is included by default when using `CREATE USER`. To be explicit, you can write:
CREATE ROLE username WITH LOGIN PASSWORD 'user_password';
Grant User All Privileges on a Specific Database
GRANT ALL PRIVILEGES ON DATABASE database_name TO username;
Example:
GRANT ALL PRIVILEGES ON DATABASE mydb TO reporting_user;
5. Verify the New User
To confirm the new user was created successfully, run:
\du
The new username should appear in the list of roles.
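If you prefer a plain SQL check (for example, from a client that does not support psql meta-commands), you can query the pg_roles catalog view instead; the role name below is the reporting_user example from earlier.

```sql
-- Confirm the role exists and can log in (reporting_user is the example role created above)
SELECT rolname, rolcanlogin, rolcreatedb
FROM pg_roles
WHERE rolname = 'reporting_user';
```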
Note:
Managing users in PostgreSQL is a privileged operation. Only superusers or users with adequate rights can create and manage roles and permissions. For more advanced configurations—such as role inheritance or restricting user access to specific tables or schemas—refer to the PostgreSQL documentation on roles and privileges.
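For a pipeline that only extracts data, read-only access is usually sufficient. As a minimal sketch, assuming the source tables live in the public schema and reusing the reporting_user example role, the grants might look like this:

```sql
-- Allow the extraction role to see objects in the schema and read existing tables
GRANT USAGE ON SCHEMA public TO reporting_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO reporting_user;

-- Optionally extend SELECT to tables created in this schema in the future
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO reporting_user;
```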
Checking the JDBC Driver
The PostgreSQL JDBC driver is already installed in Matillion Data Productivity Cloud, so there is no need to download or install it separately. To confirm the presence of the JDBC driver, follow these steps:
- In Matillion Data Productivity Cloud, create a new Orchestration pipeline.
- Drag and drop a "Database Query" component into the workspace.
- Open the configuration for the "Database Query" component and locate the "Database Type" dropdown list.
- Check that "PostgreSQL" appears as an option in the dropdown list. This confirms that the corresponding JDBC driver is available.
If you would like to check the version of the JDBC driver in your environment, follow the instructions provided in the Matillion documentation: https://docs.matillion.com/data-productivity-cloud/designer/docs/database-query/#database-driver-versions
Checking network connectivity
To enable Matillion Data Productivity Cloud to connect to your PostgreSQL database, you must ensure the database accepts incoming connections according to your deployment configuration:
- Full SaaS Agent Configuration: Configure your PostgreSQL database (and any relevant firewalls or network controls) to allow incoming connections from the IP addresses published by Matillion for Full SaaS environments. You can find the current list of required IP addresses here: https://docs.matillion.com/data-productivity-cloud/security/docs/allowing-ip-addresses/
- Hybrid SaaS Deployment: In this deployment model, connections to your PostgreSQL database originate from within your own virtual private cloud (VPC) where the Hybrid SaaS agent is running. You must allow incoming connections from the appropriate VPC resources. For assistance with network diagnostics, you can use the utilities available at: https://exchange.matillion.com/data-productivity-cloud/pipeline/check-network-access/
Note: If your PostgreSQL database is referenced using a DNS address (rather than a direct IP), ensure that the Full SaaS or Hybrid SaaS agent can resolve the hostname to the correct address. Network DNS settings must be configured to allow proper name resolution within your chosen deployment.
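Beyond firewalls, the PostgreSQL server itself must be configured to accept remote connections (listen_addresses in postgresql.conf plus a matching pg_hba.conf rule). As a quick, hedged sanity check, the queries below can be run as a superuser; note that pg_hba_file_rules requires PostgreSQL 10 or later.

```sql
-- Check whether the server listens on more than localhost
SHOW listen_addresses;

-- Review the client authentication rules currently loaded from pg_hba.conf (PostgreSQL 10+)
SELECT type, database, user_name, address, auth_method
FROM pg_hba_file_rules;
```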
Querying Data from a PostgreSQL Database
To efficiently retrieve and replicate data from a PostgreSQL database—especially when integrating with platforms like Amazon Redshift—follow the best practices outlined below.
Example SELECT Queries (PostgreSQL)
Here are common SQL `SELECT` statements used to query data from PostgreSQL:
```sql
-- Select all columns from a table
SELECT * FROM public.customers;

-- Select specific columns from a table
SELECT id, first_name, last_name FROM public.customers;

-- Filter data using WHERE
SELECT * FROM public.orders WHERE order_date >= '2024-01-01';

-- Join two tables
SELECT o.id, o.order_date, c.first_name, c.last_name
FROM public.orders o
JOIN public.customers c ON o.customer_id = c.id
WHERE o.order_status = 'SHIPPED';
```
Datatype Conversion
When replicating or synchronizing data from PostgreSQL to Amazon Redshift, pay attention to datatype conversions. Some PostgreSQL datatypes do not have a direct equivalent in Redshift and may require explicit casting or adjustment. For example:
- `serial` or `bigserial` (PostgreSQL) ⇨ `INTEGER` or `BIGINT` (Redshift)
- `text` ⇨ `VARCHAR`
- `timestamp with time zone` ⇨ `TIMESTAMPTZ`
Consult the official documentation for detailed mapping between PostgreSQL and Redshift datatypes.
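Where a type has no direct equivalent, an explicit cast in the extraction query keeps the mapping predictable. The example below is only a sketch against a hypothetical public.orders table; the column names and the VARCHAR length are illustrative assumptions.

```sql
-- Cast columns during extraction so they map cleanly onto Redshift types
SELECT
    id::BIGINT                     AS id,                 -- serial/bigserial -> BIGINT
    notes::VARCHAR(65535)          AS notes,              -- text -> bounded VARCHAR
    order_modified_at::TIMESTAMPTZ AS order_modified_at   -- timestamp with time zone -> TIMESTAMPTZ
FROM public.orders;
```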
Loading Patterns: Initial and Incremental Loads
The optimal loading strategy from PostgreSQL to Redshift (or similar targets) is a two-stage approach:
1. Initial (Once-Off) Load
- Extract the entire dataset from the source table.
- Use the Database Query component without a filter clause:
SELECT * FROM public.orders;
- All available rows are retrieved and replicated.
2. Incremental Loads
- Only extract and replicate data that has changed (new or updated) since the last load.
- Use the Database Query component with a filter clause referencing a suitable column (typically a timestamp or incrementing ID):
SELECT * FROM public.orders
WHERE order_modified_at > '2024-06-01 00:00:00';
Adjust the value in the filter dynamically with each load, so that only newly changed rows are processed; one way to obtain that value is sketched below.
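A common way to derive that filter value, sketched here under the assumption that the Redshift target table public.orders already holds the previously loaded rows, is to read the high-water mark from the target and feed it into the next extraction (for example via a pipeline variable):

```sql
-- Run against the Redshift target; the result replaces the literal timestamp
-- in the WHERE clause of the next incremental extraction from PostgreSQL
SELECT MAX(order_modified_at) AS last_loaded_at
FROM public.orders;
```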
Further Reading: A detailed explanation of these patterns can be found on Matillion Exchange: Incremental Load Data Replication Strategy
By following the above query patterns and load strategies, and understanding datatype implications, your data integrations with PostgreSQL will be robust and performant.
Data Integration Architecture
Loading data in advance of integration represents a "divide and conquer" strategy, where the challenge of integrating heterogeneous data sources is separated into two distinct steps. This approach underpins the Extract, Load, and Transform (ELT) architecture, delivering clear advantages over traditional models. Data is first extracted and loaded efficiently into Amazon Redshift, and only then subjected to necessary integration and transformation. Data integration inherently requires transformation—for example, harmonizing different formats or applying business rules—and this is best achieved through robust data transformation pipelines. In the ELT paradigm, all data transformation and integration processes are executed within the target Redshift database itself. This confers further benefits: transformations are fast, on-demand, and naturally scalable with the Redshift cluster, while organizations avoid the cost and operational complexity of maintaining separate data processing infrastructure externally.
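To make the ELT idea concrete, here is a small sketch of a transformation executed inside Redshift after the load step. It assumes hypothetical staged tables stage.customers and stage.orders and an analytics schema; all names are illustrative only.

```sql
-- Example in-warehouse (ELT) transformation: reshape staged data into an analytics table
CREATE TABLE analytics.customer_orders AS
SELECT
    c.id              AS customer_id,
    c.first_name,
    c.last_name,
    COUNT(o.id)       AS order_count,
    MAX(o.order_date) AS latest_order_date
FROM stage.customers c
LEFT JOIN stage.orders o
    ON o.customer_id = c.id
GROUP BY c.id, c.first_name, c.last_name;
```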