Extracting data from PostgreSQL to Databricks
Extracting data from PostgreSQL is a foundational task for organisations looking to leverage robust analytics and data processing platforms such as Databricks. Moving data seamlessly from a transactional database into an analytics environment can unlock new insights, but the process involves several key considerations to ensure reliability, security, and efficiency. In this article, we will guide you through each critical step of extracting data from PostgreSQL and loading it into Databricks. We begin by outlining how to create an identity in PostgreSQL for secure and controlled access. For users leveraging Matillion, we will review how to check for, or acquire, the appropriate JDBC driver to facilitate the connection between PostgreSQL and Databricks. Next, we address the often-overlooked aspect of establishing the necessary network connectivity to ensure data can flow from your source system to your target analytics platform. Finally, we will discuss strategies for querying data, covering both initial data loads and ongoing incremental updates to efficiently maintain your Databricks environment. By the end of this article, you'll have a clear understanding of the end-to-end process, allowing you to manage your data extraction pipeline with confidence and best practices.
What is PostgreSQL?
PostgreSQL is a powerful, open-source object-relational database management system renowned for its reliability, robust feature set, and standards compliance. Supporting advanced data types such as JSON, XML, and arrays, PostgreSQL provides extensibility through custom functions, procedural languages, and user-defined data types. Its strong support for ACID transactions ensures data integrity, while concurrency is efficiently managed via Multi-Version Concurrency Control (MVCC). Additionally, PostgreSQL offers rich indexing options, full-text search capabilities, and extensive support for advanced queries, including common table expressions and window functions. Its active development community constantly contributes enhancements, solidifying PostgreSQL’s position as a preferred choice for enterprises and developers seeking scalability, security, and performance in critical applications.
What is Databricks?
Databricks is a unified analytics platform built on Apache Spark, designed as a managed data lakehouse rather than a traditional database. Leveraging Delta Lake, it combines scalable, low-cost data lake storage with warehouse reliability, supporting ACID transactions and schema enforcement. Databricks efficiently handles both structured and unstructured data, integrates with formats like Parquet, CSV, and JSON, and works seamlessly with major cloud storage providers such as AWS S3, Azure Data Lake, and Google Cloud Storage. Collaborative notebooks, robust APIs, and advanced machine learning tools enable teams to operate large-scale data pipelines with strong data governance, lineage, and security.
Why Move Data from PostgreSQL into Databricks
The Advantages of Copying Data from PostgreSQL to Databricks for Advanced Analytics
A data engineer or architect may wish to copy data from PostgreSQL into Databricks for several strategic reasons. First, the data stored in PostgreSQL often holds significant value, containing critical operational information or transactional records that can yield actionable insights. By integrating this data with other sources within the Databricks platform, organizations can realize its full potential—enabling advanced analytics, reporting, and data science workflows that span multiple data domains. Importantly, running these complex integration and analysis tasks within Databricks, rather than querying PostgreSQL directly, helps to protect the operational database from excessive workloads, thus maintaining its performance and reliability for day-to-day business operations. This approach allows teams to maximize the utility of their PostgreSQL data while ensuring robust, scalable analytics in an environment suited for high-performance processing.
Creating a User in PostgreSQL
To create a user (also referred to as a "role" with login privileges) in PostgreSQL, you can use either SQL commands within the psql client or utilities like createuser. Below you'll find step-by-step instructions using both methods.
1. Using SQL in the psql Client
- Connect to your database cluster as a superuser or a user with CREATEROLE privileges:
psql -U postgres
- Create the user using the CREATE ROLE command:
CREATE ROLE example_user WITH LOGIN PASSWORD 'your_secure_password';
- Replace example_user with your desired username.
- Replace 'your_secure_password' with a strong password.
- (Optional) Grant additional privileges:
- To allow the user to create databases:
ALTER ROLE example_user CREATEDB;
- To grant superuser privileges (use with caution):
ALTER ROLE example_user SUPERUSER;
2. Using the createuser Command-Line Utility
- Run the following command as the OS user that owns the PostgreSQL cluster (usually postgres):
createuser --interactive --pwprompt
- Follow the prompts to specify:
- The name of the new role.
- Whether the user should be a superuser, allowed to create databases, etc.
- A password for the new role.
3. Example: Creating a User With Common Privileges
This SQL statement creates a user with login and database-creation privileges:
CREATE ROLE app_user WITH LOGIN PASSWORD 'StrongPassw0rd!' CREATEDB;
Note:
Always use strong, unique passwords and assign the minimum required privileges for security.
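As a minimal sketch of this least-privilege approach, the statements below grant an extraction user read-only access to a single schema. The database name analytics_source and the schema name sales are assumptions for illustration; substitute your own object names.
GRANT CONNECT ON DATABASE analytics_source TO app_user;
GRANT USAGE ON SCHEMA sales TO app_user;
GRANT SELECT ON ALL TABLES IN SCHEMA sales TO app_user;
-- Make tables created later in this schema (by the role running this) readable as well
ALTER DEFAULT PRIVILEGES IN SCHEMA sales GRANT SELECT ON TABLES TO app_user;
This keeps the extraction user unable to modify source data while still allowing it to read everything needed for the load.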
Checking the JDBC driver
The PostgreSQL JDBC driver is already installed in Matillion Data Productivity Cloud—there is no need to download or install it yourself. To confirm the presence of the JDBC driver, you can follow these steps:
- Create a Matillion Orchestration pipeline. Start by creating a new orchestration pipeline in your Matillion environment.
- Add a "Database Query" component. In the pipeline canvas, drag and drop a "Database Query" component onto your workflow.
- Check Database Types. Open the component settings and look for the "Database Type" dropdown menu. If PostgreSQL appears in the list of available database types, this confirms that the JDBC driver is present and ready to use.
If you need to check the exact version of the JDBC driver installed, refer to the official documentation here: Database Query component – Database Driver Versions.
Checking network connectivity
To successfully connect Matillion Data Productivity Cloud to your PostgreSQL database, you must ensure that the database accepts incoming connections from the appropriate sources based on your deployment configuration.
Full SaaS Agent Configuration:
If you are using the Full SaaS agent configuration, you need to update your network and PostgreSQL database firewall settings to allow incoming connections from the IP addresses used by Matillion Data Productivity Cloud. The full list of required IP addresses can be found here: Matillion Allowed IP Addresses.
Hybrid SaaS Deployment:
For a Hybrid SaaS deployment, incoming connections should be allowed from your own Virtual Private Cloud (VPC) where the Matillion Hybrid SaaS agent is running. This typically involves configuring your firewall or security groups to permit traffic from your VPC’s IP range. To verify connectivity from your VPC, you can use the tools and utilities available here: Matillion Network Access Checker.
DNS Resolution:
If your PostgreSQL database is being referenced using a DNS hostname (rather than an IP address), ensure that the appropriate agent (Full SaaS or Hybrid SaaS) can resolve the DNS address to the correct IP. This may require appropriate DNS setup and access permissions for the agent to perform name resolution successfully.
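Once the firewall and DNS pieces are in place, a quick sanity check on the database side can confirm that PostgreSQL itself is configured for remote connections. These are standard PostgreSQL commands shown as a sketch; you will still need to review the pg_hba.conf rules that apply to the agent's address.
-- Should be '*' or include the interface the agent will reach
SHOW listen_addresses;
-- Port the server listens on (5432 by default)
SHOW port;
-- Location of the host-based authentication file that controls client access
SHOW hba_file;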
Querying Data from a PostgreSQL Database
This section explains how to query data from a PostgreSQL database, provides examples of SQL SELECT statements, highlights potential datatype conversion issues (such as when moving data into Databricks), and outlines recommended practices for initial and incremental data loads.
Basic SQL SELECT Statements
A basic query to retrieve all columns from a table:
SELECT * FROM customers;
To select specific columns:
SELECT customer_id, first_name, last_name FROM customers;
Filtering rows using a WHERE clause:
SELECT * FROM orders WHERE order_date >= '2024-01-01';
Sorting the results:
SELECT * FROM products ORDER BY price DESC;
Aggregating data:
SELECT status, COUNT(*) AS order_count FROM orders GROUP BY status;
Joining multiple tables:
SELECT o.order_id, c.first_name, c.last_name, o.total_amount FROM orders o JOIN customers c ON o.customer_id = c.customer_id;
Datatype Conversion (PostgreSQL ↔ Databricks)
When moving data between PostgreSQL and Databricks (or other technologies), datatype conversions may occur. Examples:
- PostgreSQL TIMESTAMP may convert to Databricks TIMESTAMP, but with potential timezone differences.
- PostgreSQL NUMERIC may become Databricks DECIMAL.
- PostgreSQL BYTEA may not have a direct equivalent in Databricks and may require explicit handling (for example, casting to BINARY).
Best Practice: Always review and map PostgreSQL datatypes to their Databricks equivalents before data transfer to prevent data integrity or precision loss.
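As an illustration of that mapping step, the Databricks SQL below applies explicit conversions after the raw data has landed. The staging table staging.orders_raw and its columns are hypothetical, and the precision, scale, and source time zone are assumptions to adjust for your own data.
SELECT
  order_id,
  CAST(total_amount AS DECIMAL(18, 4)) AS total_amount,                 -- pin NUMERIC precision explicitly
  to_utc_timestamp(last_updated, 'Europe/London') AS last_updated_utc,  -- normalize timestamps to UTC
  CAST(payload AS BINARY) AS payload                                    -- handle BYTEA-sourced values that arrived as strings
FROM staging.orders_raw;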
Initial Load vs Incremental Load Patterns
The optimal pattern is to perform a one-time initial load followed by incremental loads. Both loads use the same Database Query component, but differ in their filtering logic.
Initial Load
- Purpose: Capture all existing historical data.
- Implementation: Execute a query with no filtering condition.
Example:
SELECT * FROM orders;
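For very large tables, one common refinement (not required by the basic pattern above) is to break the initial load into key-ranged batches so that no single query has to stream the entire table. The batch size and the integer order_id key below are illustrative assumptions.
-- First batch
SELECT * FROM orders WHERE order_id > 0 ORDER BY order_id LIMIT 100000;
-- Next batches: replace 0 with the highest order_id returned by the previous batch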
Incremental Load
- Purpose: Retrieve only new or changed records since the last load, reducing processing time and cost.
- Implementation: Use a filter, usually on a monotonically increasing field such as last_updated or an ID.
Example (fetching rows updated since the last load):
SELECT * FROM orders WHERE last_updated > '2024-06-26 00:00:00';
Update the filter value ('2024-06-26 00:00:00') dynamically based on the last successful load.
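One way to derive that filter value is to read the current high-water mark from the target before each run; in Matillion this would typically be captured into a pipeline variable and substituted into the source query. The target table name staging.orders is an assumption for illustration.
-- Run against Databricks: find the latest timestamp already loaded
SELECT MAX(last_updated) AS high_water_mark FROM staging.orders;
-- Run against PostgreSQL, substituting the value retrieved above
SELECT * FROM orders WHERE last_updated > '<high_water_mark>';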
Further Reading:
Refer to the Matillion incremental-load replication strategy article for more details on designing robust incremental and CDC (change data capture) pipelines.
Data Integration Architecture
Loading data in advance of integration exemplifies the "divide and conquer" principle by splitting the data pipeline into two distinct steps: first, extracting and loading raw data into Databricks, and then subsequently integrating and transforming that data. This approach is a core advantage of the ELT (Extract, Load, Transform) architecture, as it allows data teams to address each stage independently, improving both flexibility and manageability. Effective data integration relies on robust data transformation, and this is best achieved by employing data transformation pipelines: structured workflows that automate the necessary cleaning, shaping, and joining of data. Moreover, another important advantage of the ELT approach is that all transformation and integration processes occur directly within the target Databricks environment. This enables fast, on-demand, and highly scalable operations using Databricks' compute and storage resources. As a result, there is no need to invest in or maintain separate data processing infrastructure, which streamlines operations and reduces costs.
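As a small sketch of what that in-Databricks integration step can look like, the Delta Lake MERGE below folds newly loaded rows into a curated table. The names staging.orders and analytics.orders are assumptions for illustration.
MERGE INTO analytics.orders AS target
USING staging.orders AS source
  ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *     -- refresh rows that changed in the source
WHEN NOT MATCHED THEN INSERT *;    -- add rows seen for the first time
Because the merge runs entirely on Databricks compute, it scales with the platform rather than with the PostgreSQL source.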