Extracting data from PostgreSQL to Snowflake
Extracting data from PostgreSQL is a fundamental step for organizations seeking to leverage modern cloud data platforms such as Snowflake. Whether you are consolidating analytics workloads, migrating legacy systems, or building robust data pipelines, a smooth and reliable data transfer process is essential. This article walks through the key stages involved in extracting data from a PostgreSQL source and loading it into Snowflake, with practical considerations for users working with Matillion. Specifically, we will cover:

- Creating an appropriate identity in PostgreSQL to facilitate secure data access.
- For Matillion users, checking for (or acquiring) the JDBC driver that enables connectivity between PostgreSQL and Matillion.
- Network connectivity requirements for successful data transfer between your source and target environments.
- Querying data from PostgreSQL, both for the initial full extract and for ongoing incremental updates.

By the end of this article, you will have a clear understanding of each required step, along with actionable guidance to help you establish a reliable extraction and loading workflow between PostgreSQL and Snowflake.
What is PostgreSQL?
PostgreSQL is a powerful, open-source object-relational database management system renowned for its robustness, extensibility, and standards compliance. It supports advanced data types, full ACID compliance, complex queries, and comprehensive indexing methods, making it suitable for a variety of transactional and analytical workloads. With features such as multi-version concurrency control (MVCC), table partitioning, and support for procedural languages (e.g., PL/pgSQL, PL/Python), PostgreSQL can handle demanding enterprise applications. Its extensibility enables users to define custom data types, operators, and functions, while its active community consistently contributes improvements and security enhancements, reinforcing PostgreSQL’s position as a reliable choice for mission-critical data management.
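As a brief illustration of that extensibility, here is a minimal sketch that defines a custom composite type and a PL/pgSQL function. The type and function names (`currency_amount`, `to_usd`) are purely hypothetical and not part of any standard schema.

```
-- Hypothetical example of PostgreSQL extensibility:
-- a custom composite type and a PL/pgSQL function
CREATE TYPE currency_amount AS (
    amount   NUMERIC(12,2),
    currency CHAR(3)
);

CREATE FUNCTION to_usd(amt currency_amount, fx_rate NUMERIC)
RETURNS NUMERIC
LANGUAGE plpgsql
AS $$
BEGIN
    -- Convert the amount to USD using the supplied exchange rate
    RETURN amt.amount * fx_rate;
END;
$$;
```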
What is Snowflake?
Snowflake is a cloud-based data platform designed to handle large-scale data warehousing, analytics, and data sharing. Architected from the ground up for the cloud, it decouples compute from storage, enabling independent scalability of both resources. Its multi-cluster architecture efficiently manages concurrency, making it well-suited for environments with varying workload patterns. Snowflake supports structured and semi-structured data, such as JSON, Avro, and Parquet, using its robust SQL engine for high-performance querying. With native support for time travel, zero-copy cloning, and secure data sharing between accounts, Snowflake facilitates agile data operations while maintaining strong security and governance controls.
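As a brief illustration, the sketch below exercises these features in Snowflake SQL. It assumes a hypothetical table `raw_orders` with a VARIANT column `v`; the table and column names are illustrative only.

```
-- Query a JSON attribute stored in a VARIANT column
SELECT v:customer.name::STRING AS customer_name
FROM raw_orders;

-- Time travel: query the table as it looked one hour ago
SELECT COUNT(*) FROM raw_orders AT (OFFSET => -3600);

-- Zero-copy cloning: create an instant copy without duplicating storage
CREATE TABLE raw_orders_dev CLONE raw_orders;
```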
Why Move Data from PostgreSQL into Snowflake
The Benefits of Copying Data from PostgreSQL to Snowflake for Advanced Analytics
A data engineer or architect may wish to copy data from PostgreSQL into Snowflake for several compelling reasons. PostgreSQL databases often contain valuable operational data that can yield important business insights when properly analyzed, but the full value of that data is realized only when it is integrated with information from other sources, enabling comprehensive analytics and more informed decision-making. Performing these integration and analytical tasks within Snowflake, rather than directly on PostgreSQL, offers a significant operational advantage: it avoids adding substantial workload to the production PostgreSQL system. Snowflake's scalable architecture is designed for high-performance analytics and data integration, so analytical processing does not impact the performance of transactional systems.
Creating an Identity in PostgreSQL
The following instructions guide you through creating a new user (also referred to as a "role") in a PostgreSQL database. These steps assume you have sufficient privileges (a superuser or a user with CREATEROLE permission).
1. Connect to PostgreSQL
Depending on your setup, use the `psql` command-line interface or a graphical client. To start `psql` as the default PostgreSQL superuser (often `postgres`):
psql -U postgres
If you need to specify a database:
psql -U postgres -d your_database_name
2. Create a User with a Password
Execute the following SQL command to create a new user with login privileges.
CREATE USER username WITH PASSWORD 'your_strong_password';
Replace `username` and `your_strong_password` with the desired username and a secure password.
3. (Optional) Grant Additional Permissions
- Grant superuser privileges:
ALTER USER username WITH SUPERUSER;
- Allow user to create databases:
ALTER USER username WITH CREATEDB;
- Allow user to create roles:
ALTER USER username WITH CREATEROLE;
4. (Optional) Assign Database Privileges
To allow the user to access and manipulate a specific database, use the following command after connecting to that database:
GRANT ALL PRIVILEGES ON DATABASE database_name TO username;
Or, to grant more restrictive privileges, you could use specific statements such as:
GRANT CONNECT ON DATABASE database_name TO username;
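If the identity will be used only to read data for extraction, a more restrictive set of grants is usually sufficient. The sketch below assumes the tables to be extracted live in the `public` schema; adjust the database, schema, and role names to your environment.

```
-- Hypothetical read-only grants for an extraction user
GRANT CONNECT ON DATABASE database_name TO username;
GRANT USAGE ON SCHEMA public TO username;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO username;

-- Optionally cover tables created in the future
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO username;
```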
Example: Creating a User and Granting Access
```
-- Create user 'appuser' with password 'securePassword123'
CREATE USER appuser WITH PASSWORD 'securePassword123';

-- Allow 'appuser' to create databases
ALTER USER appuser WITH CREATEDB;

-- Grant connect privilege to an existing database 'appdb'
GRANT CONNECT ON DATABASE appdb TO appuser;
```
5. Verify the New User
List all users to confirm creation:
\du
This command is entered in the `psql` interface. It displays existing roles and their privileges.
Note:
For production systems, always use strong and unique passwords. Adjust privileges according to the principle of least privilege to enhance security.
Checking the JDBC Driver
The PostgreSQL JDBC driver is already installed in Matillion Data Productivity Cloud—there’s no need to download or add the driver manually. To confirm that the JDBC driver is present and available for use:
- Create a Matillion Orchestration Pipeline: Start a new orchestration pipeline within Matillion Data Productivity Cloud.
- Add a “Database Query” Component: In your pipeline, add the Database Query component from the components panel.
- Check the Dropdown for Database Types: With the Database Query component selected, locate the Database Type dropdown. If the PostgreSQL JDBC driver is present, “PostgreSQL” will be listed as a selectable database type.
If you would also like to check which version of the driver is installed, please refer to the official documentation: Check JDBC Driver Versions.
Checking network connectivity
To enable successful communication between Matillion Data Productivity Cloud and your PostgreSQL database, you must ensure that the database is configured to allow incoming connections from the appropriate sources, depending on your deployment architecture:
- Full SaaS agent configuration: You need to allow incoming connections from the set of Matillion-managed IP addresses. The list of required IP addresses can be found at Matillion Data Productivity Cloud Allowed IP Addresses. Configure your PostgreSQL instance's firewall or network access rules to permit connections from these addresses.
- Hybrid SaaS deployment: In this scenario, your agent runs within your own virtual private cloud (VPC). Therefore, you must allow incoming connections from your VPC's IP address range. You can use the network checking utilities available at the Matillion Exchange: Check Network Access page to verify network accessibility from your VPC to the PostgreSQL instance.
Additionally, if your PostgreSQL database is being referenced using a DNS name rather than a static IP address, you must ensure that the Full SaaS or Hybrid SaaS agent is able to resolve the DNS name correctly. This means verifying that the agent has access to appropriate DNS resolution services and that any custom DNS entries are reachable from the environment where the agent is running.
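As a quick sanity check, assuming the `psql` client is available in the agent's environment, you can attempt a connection from that environment; a successful `SELECT 1` confirms both DNS resolution and network reachability. The hostname, database name, and user below are placeholders only.

```
# Hypothetical connectivity test; replace host, dbname, and user with your own values
psql "host=postgres.example.internal port=5432 dbname=appdb user=appuser sslmode=require" -c "SELECT 1;"
```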
Querying Data from a PostgreSQL Database
This guide describes best practices for querying data from PostgreSQL, especially when integrating with platforms such as Snowflake. We'll cover SQL SELECT query examples, discuss potential datatype conversion, and explain the pattern for data extraction using initial and incremental loads.
Example PostgreSQL SELECT Queries
Here are several common SELECT statement examples for PostgreSQL:
```
-- Retrieve all columns and rows from a table
SELECT * FROM public.orders;

-- Retrieve specific columns
SELECT order_id, customer_id, order_date FROM public.orders;

-- Apply filtering
SELECT * FROM public.orders WHERE order_date >= '2024-01-01';

-- Aggregation example
SELECT customer_id, COUNT(*) AS order_count FROM public.orders GROUP BY customer_id;
```
Datatype Conversion Between PostgreSQL and Snowflake
When loading data from PostgreSQL into Snowflake, be aware that some datatypes may not have exact matches and may require conversion. Example considerations:
| PostgreSQL Datatype | Target Snowflake Datatype |
|---|---|
| SERIAL | NUMBER, IDENTITY |
| TEXT, VARCHAR | STRING |
| BYTEA | BINARY |
| TIMESTAMP WITH TIME ZONE | TIMESTAMP_TZ |
| BOOLEAN | BOOLEAN |
Always verify that your destination schema in Snowflake accommodates any datatype differences that could affect data fidelity or SQL query behavior.
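Where a type has no direct equivalent, it can help to make the conversion explicit in the extraction query itself. The sketch below assumes `public.orders` has a BYTEA column named `payload` and a TIMESTAMP WITH TIME ZONE column named `last_updated`; both column names are illustrative.

```
-- Hypothetical example: make conversions explicit during extraction
SELECT
    order_id,
    encode(payload, 'hex')          AS payload_hex,      -- BYTEA rendered as a hex string
    last_updated AT TIME ZONE 'UTC' AS last_updated_utc  -- normalize timestamps to UTC
FROM public.orders;
```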
Pattern: Initial and Incremental Data Loads
For efficient data pipeline management, it's recommended to use a two-phase extraction strategy:
- Initial Load:
  - Extracts the full dataset from PostgreSQL.
  - No filter clause is used.
  - Example: `SELECT * FROM public.orders;`
- Incremental Load:
  - Extracts only new or changed records (since the last extraction).
  - Uses a filter clause (WHERE condition) based on a suitable column, e.g., `last_updated`.
  - Example: `SELECT * FROM public.orders WHERE last_updated > '2024-06-01 00:00:00';`
Both phases should use the same Database Query component to ensure consistency.
See more: Incremental Load Data Replication Strategy — Matillion Exchange
Note: Make sure to select appropriate filter columns such as monotonically increasing IDs or reliable timestamp columns to track and manage incremental changes properly.
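A common way to manage the filter value is a high-water mark: after each run, record the largest value of the filter column, then use it as the lower bound of the next incremental extract. A minimal sketch against the `public.orders` example:

```
-- After each extract, record the highest value of the filter column
SELECT MAX(last_updated) AS high_watermark FROM public.orders;

-- The next incremental run filters on the stored high-water mark,
-- e.g. if the previous run returned '2024-06-01 00:00:00':
SELECT * FROM public.orders WHERE last_updated > '2024-06-01 00:00:00';
```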
Data Integration Architecture
One of the key advantages of the ELT (Extract, Load, Transform) architecture is the ability to "divide and conquer" the data integration process by separating it into two distinct steps: first loading raw data into the Snowflake database, and then performing integration and transformation within Snowflake itself. This approach allows organizations to efficiently manage large datasets and streamline their workflows. Effective data integration inherently requires robust data transformation, and the most efficient way to achieve this is through data transformation pipelines, which automate and manage the various transformation tasks needed to prepare data for analysis. By conducting these transformation and integration steps directly within the target Snowflake environment, ELT offers several additional benefits: the process is fast, scalable, and executed on-demand, leveraging Snowflake's powerful compute resources. This also eliminates the need for maintaining separate, dedicated data processing infrastructure, resulting in streamlined operations and cost savings.
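To illustrate the transform-after-load step, the sketch below runs a transformation entirely inside Snowflake, assuming a `raw_orders` table has already been loaded from PostgreSQL. The schema, table, and column names are illustrative only.

```
-- Hypothetical in-Snowflake transformation: build a curated table
-- from raw data that has already been loaded
CREATE OR REPLACE TABLE analytics.daily_order_totals AS
SELECT
    customer_id,
    DATE_TRUNC('day', order_date) AS order_day,
    COUNT(*)                      AS order_count,
    SUM(order_total)              AS total_value
FROM raw.raw_orders
GROUP BY customer_id, DATE_TRUNC('day', order_date);
```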