Integrate data from PostgreSQL to Databricks using Matillion

Our PostgreSQL to Databricks connector enables seamless, timely data transfers into Databricks, automatically keeping your information current without manual coding or complex ETL scripts.


Extracting data from PostgreSQL to Databricks

Extracting data from PostgreSQL is a foundational task for organizations looking to leverage robust analytics and data processing platforms such as Databricks. Moving data seamlessly from a transactional database into an analytics environment can unlock new insights, but the process involves several key considerations to ensure reliability, security, and efficiency.

In this article, we will guide you through each critical step of extracting data from PostgreSQL and loading it into Databricks. We begin by outlining how to create an identity in PostgreSQL for secure and controlled access. For users leveraging Matillion, we then review how to confirm that the appropriate JDBC driver is available to facilitate the connection between PostgreSQL and Databricks. Next, we address the often-overlooked step of establishing the network connectivity needed for data to flow from your source system to your target analytics platform. Finally, we discuss strategies for querying data, covering both initial loads and ongoing incremental updates to efficiently maintain your Databricks environment.

By the end of this article, you'll have a clear understanding of the end-to-end process, allowing you to manage your data extraction pipeline with confidence and with best practices in mind.


What is PostgreSQL?

PostgreSQL is a powerful, open-source object-relational database management system renowned for its reliability, robust feature set, and standards compliance. Supporting advanced data types such as JSON, XML, and arrays, PostgreSQL provides extensibility through custom functions, procedural languages, and user-defined data types. Its strong support for ACID transactions ensures data integrity, while concurrency is efficiently managed via Multi-Version Concurrency Control (MVCC). Additionally, PostgreSQL offers rich indexing options, full-text search capabilities, and extensive support for advanced queries, including common table expressions and window functions. Its active development community constantly contributes enhancements, solidifying PostgreSQL’s position as a preferred choice for enterprises and developers seeking scalability, security, and performance in critical applications.
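
To make some of these features concrete, here is a brief, illustrative query (the table and column names are hypothetical) that reads a JSONB field and ranks rows with a window function:

-- Hypothetical table: events(id, payload JSONB, occurred_at TIMESTAMP)
SELECT
  id,
  payload ->> 'customer_id' AS customer_id,        -- extract a JSON field as text
  occurred_at,
  ROW_NUMBER() OVER (
    PARTITION BY payload ->> 'customer_id'
    ORDER BY occurred_at DESC
  ) AS recency_rank                                -- window function: per-customer recency
FROM events
WHERE payload ? 'customer_id';                     -- only rows that contain the key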


What is Databricks?

Databricks is a unified analytics platform built on Apache Spark, designed as a managed data lakehouse rather than a traditional database. Leveraging Delta Lake, it combines scalable, low-cost data lake storage with warehouse reliability, supporting ACID transactions and schema enforcement. Databricks efficiently handles both structured and unstructured data, integrates with formats like Parquet, CSV, and JSON, and works seamlessly with major cloud storage providers such as AWS S3, Azure Data Lake, and Google Cloud Storage. Collaborative notebooks, robust APIs, and advanced machine learning tools enable teams to operate large-scale data pipelines with strong data governance, lineage, and security.
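
As a small, illustrative sketch of the Delta Lake behavior described above (the table and column names are hypothetical), the following Databricks SQL creates a Delta table and then modifies it with an ACID-compliant UPDATE:

-- Hypothetical example: create a managed Delta table in Databricks SQL
CREATE TABLE IF NOT EXISTS raw_orders (
  order_id     BIGINT,
  customer_id  BIGINT,
  total_amount DECIMAL(18, 2),
  order_date   TIMESTAMP
) USING DELTA;

-- Delta tables support transactional changes such as UPDATE and MERGE
UPDATE raw_orders SET total_amount = 0 WHERE total_amount IS NULL;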

Why Move Data from PostgreSQL into Databricks

The Advantages of Copying Data from PostgreSQL to Databricks for Advanced Analytics

A data engineer or architect may wish to copy data from PostgreSQL into Databricks for several strategic reasons. First, the data stored in PostgreSQL often holds significant value, containing critical operational information or transactional records that can yield actionable insights. By integrating this data with other sources within the Databricks platform, organizations can realize its full potential—enabling advanced analytics, reporting, and data science workflows that span multiple data domains. Importantly, running these complex integration and analysis tasks within Databricks, rather than querying PostgreSQL directly, helps to protect the operational database from excessive workloads, thus maintaining its performance and reliability for day-to-day business operations. This approach allows teams to maximize the utility of their PostgreSQL data while ensuring robust, scalable analytics in an environment suited for high-performance processing.

Creating a User in PostgreSQL

To create a user (also referred to as a "role" with login privileges) in PostgreSQL, you can use either SQL commands within the psql client or utilities like createuser. Below you'll find step-by-step instructions using both methods.


1. Using SQL in the psql Client

  1. Connect to your database cluster as a superuser or a user with CREATEROLE privileges:

   psql -U postgres

  2. Create the user using the CREATE ROLE command:

CREATE ROLE example_user WITH LOGIN PASSWORD 'your_secure_password';

  • Replace example_user with your desired username.
  • Replace 'your_secure_password' with a strong password.

  • (Optional) Grant additional privileges:

    • To allow the user to create databases:

ALTER ROLE example_user CREATEDB;

    • To grant superuser privileges (use with caution):

ALTER ROLE example_user SUPERUSER;


2. Using the createuser Command-Line Utility

  1. Run the following command as the OS user that owns the PostgreSQL cluster (usually postgres):

   createuser --interactive --pwprompt

  2. Follow the prompts to specify:
    • The name of the new role.
    • Whether the user should be a superuser, allowed to create databases, etc.
    • A password for the new role.

3. Example: Creating a User With Common Privileges

This SQL script creates a user that can log in and create databases:

CREATE ROLE app_user WITH
  LOGIN
  PASSWORD 'StrongPassw0rd!'
  CREATEDB;


Note:
Always use strong, unique passwords and assign the minimum required privileges for security.
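
For example, a least-privilege setup for an extraction user might look like the following sketch (the database, schema, and role names are purely illustrative):

-- Allow the role to connect to the source database and read one schema
GRANT CONNECT ON DATABASE sales_db TO app_user;
GRANT USAGE ON SCHEMA public TO app_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_user;

-- Also make tables created later (by the role running this statement) readable
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO app_user;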

Checking the JDBC driver

The PostgreSQL JDBC driver is already installed in Matillion Data Productivity Cloud—there is no need to download or install it yourself. To confirm the presence of the JDBC driver, you can follow these steps:

  1. Create a Matillion Orchestration pipeline. Start by creating a new orchestration pipeline in your Matillion environment.
  2. Add a "Database Query" component. In the pipeline canvas, drag and drop a "Database Query" component onto your workflow.
  3. Check Database Types. Open the component settings and look for the "Database Type" dropdown menu. If the PostgreSQL database appears in the list of available database types, this confirms that the JDBC driver is present and ready to use.

If you need to check the exact version of the JDBC driver installed, refer to the official documentation here: Database Query component – Database Driver Versions.
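
When cross-checking driver compatibility against that documentation, it can also help to know the exact PostgreSQL server version you are connecting to; either of these standard queries returns it:

SELECT version();      -- full server version string, including build details
SHOW server_version;   -- version number only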

Checking network connectivity

To successfully connect Matillion Data Productivity Cloud to your PostgreSQL database, you must ensure that the database accepts incoming connections from the appropriate sources based on your deployment configuration.

Full SaaS Agent Configuration:
If you are using the Full SaaS agent configuration, you need to update your network and PostgreSQL database firewall settings to allow incoming connections from the IP addresses used by Matillion Data Productivity Cloud. The full list of required IP addresses can be found here: Matillion Allowed IP Addresses.

Hybrid SaaS Deployment:
For a Hybrid SaaS deployment, incoming connections should be allowed from your own Virtual Private Cloud (VPC) where the Matillion Hybrid SaaS agent is running. This typically involves configuring your firewall or security groups to permit traffic from your VPC’s IP range. To verify connectivity from your VPC, you can use the tools and utilities available here: Matillion Network Access Checker.

DNS Resolution:
If your PostgreSQL database is being referenced using a DNS hostname (rather than an IP address), ensure that the appropriate agent (Full SaaS or Hybrid SaaS) can resolve the DNS address to the correct IP. This may require appropriate DNS setup and access permissions for the agent to perform name resolution successfully.
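
If you have SQL access to the server, you can also inspect its connection-related settings from within PostgreSQL itself. The sketch below uses standard settings and the pg_hba_file_rules system view (available in PostgreSQL 10 and later, and typically restricted to superusers) to show what the server listens on and which client addresses it will accept:

SHOW listen_addresses;   -- network interfaces the server listens on
SHOW port;               -- port the server accepts connections on

-- Client authentication rules loaded from pg_hba.conf
SELECT type, database, user_name, address, netmask, auth_method
FROM pg_hba_file_rules;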

Querying Data from a PostgreSQL Database

This guide explains how to query data from a PostgreSQL database, provides examples of SQL SELECT statements, highlights potential datatype conversion issues (such as those that arise when loading into Databricks), and outlines recommended practices for initial and incremental data loads.


Basic SQL SELECT Statements

A basic query to retrieve all columns from a table:

SELECT * FROM customers;

To select specific columns:

SELECT customer_id, first_name, last_name FROM customers;

Filtering rows using a WHERE clause:

SELECT * FROM orders
WHERE order_date >= '2024-01-01';

Sorting the results:

SELECT * FROM products
ORDER BY price DESC;

Aggregating data:

SELECT status, COUNT(*) AS order_count
FROM orders
GROUP BY status;

Joining multiple tables:

SELECT o.order_id, c.first_name, c.last_name, o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;


Datatype Conversion (PostgreSQL ↔ Databricks)

When moving data between PostgreSQL and Databricks (or other technologies), datatype conversions may occur. Examples:

  • PostgreSQL TIMESTAMP may convert to Databricks TIMESTAMP, but with potential timezone interpretation differences.
  • PostgreSQL NUMERIC may become Databricks DECIMAL, so precision and scale should be checked.
  • PostgreSQL BYTEA has no exact one-to-one equivalent and is commonly mapped to Databricks BINARY.

Best Practice: Always review and map PostgreSQL datatypes to their Databricks equivalents before data transfer to prevent data integrity or precision loss.
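
One way to apply such a mapping is to make the conversions explicit in the extraction query itself. The sketch below is illustrative only (the column names are hypothetical) and assumes you want UTC timestamps, fixed-precision decimals, and text-encoded binary data:

-- Hypothetical extraction query with explicit, Databricks-friendly conversions
SELECT
  order_id,
  created_at AT TIME ZONE 'UTC' AS created_at_utc,   -- normalize timestamps to UTC
  amount::NUMERIC(18, 2)        AS amount,           -- pin precision/scale for the DECIMAL mapping
  encode(payload, 'base64')     AS payload_b64       -- ship BYTEA as text if binary handling is uncertain
FROM orders;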


Initial Load vs Incremental Load Patterns

The optimal pattern is to perform a one-time initial load followed by incremental loads. Both loads use the same Database Query component, but differ in their filtering logic.

Initial Load

  • Purpose: Capture all existing historical data.
  • Implementation: Execute a query with no filtering condition.

Example:

SELECT * FROM orders;

Incremental Load

  • Purpose: Retrieve only new or changed records since the last load, reducing processing time and cost.
  • Implementation: Use a filter (usually on a monotonically increasing field such as last_updated or an ID).

Example: (Fetching rows updated since the last load)

SELECT * FROM orders
WHERE last_updated > '2024-06-26 00:00:00';

Update the filter value ('2024-06-26 00:00:00') dynamically based on the last successful load, as sketched below.
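
A common way to derive that filter value is to record a high-watermark after each successful load. A minimal sketch, assuming the same orders table, is:

-- Capture the high-watermark to use as the filter in the next incremental run
SELECT COALESCE(MAX(last_updated), TIMESTAMP '1970-01-01 00:00:00') AS next_watermark
FROM orders;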

Further Reading:
Refer to the Matillion incremental-load replication strategy article for more details on designing robust incremental and CDC (change data capture) pipelines.


Data Integration Architecture

Loading data in advance of integration exemplifies the "divide and conquer" principle by splitting the data pipeline into two distinct steps: first, extracting and loading raw data into Databricks, and then integrating and transforming that data. This approach is a core advantage of the ELT (Extract, Load, Transform) architecture, as it allows data teams to address each stage independently, improving both flexibility and manageability.

Effective data integration relies on robust data transformation, and this is best achieved with data transformation pipelines: structured workflows that automate the necessary cleaning, shaping, and joining of data. Another important advantage of the ELT approach is that all transformation and integration work happens directly within Databricks, enabling fast, on-demand, and highly scalable operations using Databricks' compute and storage resources. As a result, there is no need to invest in or maintain separate data processing infrastructure, which streamlines operations and reduces costs.
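
As a hedged sketch of what the transformation step can look like once raw data has been loaded (the table names here are hypothetical), a transformation pipeline in Databricks might materialize a curated Delta table directly in SQL:

-- Illustrative in-Databricks transformation: join raw loaded tables into a curated Delta table
CREATE OR REPLACE TABLE curated_customer_orders
USING DELTA AS
SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  COUNT(o.order_id)   AS order_count,
  SUM(o.total_amount) AS lifetime_value
FROM raw_customers AS c
LEFT JOIN raw_orders AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;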

Get started today

Matillion's comprehensive data pipeline platform offers more than point solutions.