Python for Data Engineering: When SQL Isn't Enough


We have all been there. You are staring at a 500-line SQL Common Table Expression (CTE) that just failed. The error message is cryptic, the logic is brittle, and the only way to debug it is to comment out lines one by one.

For years, SQL has been the lingua franca of data, and it remains unbeatable for set-based operations and aggregations. However, the data landscape has shifted. We are no longer managing just rows and columns; we are managing systems. Data streams in from APIs, unstructured JSON blobs, and third-party SaaS tools, often requiring logic that standard SQL stored procedures cannot handle gracefully.

At Stellans, we believe that modern data infrastructure requires a shift in perspective. This is not about abandoning SQL; it is about integrating Python to strengthen your stack. SQL still handles the heavy lifting of data transformation, while Python handles the complexity, connectivity, and system control.

Here is how we determine when to stop querying and start engineering.

The Tipping Point: Recognizing the Limits of SQL

SQL is powerful, but it functions best within its specific design scope rather than as a general-purpose programming language. Forcing SQL to perform tasks outside its wheelhouse often results in “Frankenstein” queries that are impossible to maintain.

Non-Tabular Data Struggles

Clean tabular data is rare right out of the box; modern data ecosystems involve nested JSON structures, XML, and scraped web data. Handling these formats directly within a data warehouse often leads to performance bottlenecks. Modern warehouses like Snowflake and BigQuery do offer JSON parsing capabilities, but writing complex extraction logic inside a query string is error-prone. A telltale sign you have hit the tipping point: regex appearing in SQL WHERE clauses to parse log files.

The “Spaghetti Code” Trap

Modern software development thrives on modularity, but SQL scripts rarely have it. Because SQL lacks classes and easy unit tests, reusing code often means copy-pasting segments into new views. This leads to the “spaghetti code” trap: massive, monolithic scripts where a single change on line 40 breaks the logic on line 400.

Integration Bottlenecks

Data should not live in a silo. You need to pull data from Salesforce, push insights to HubSpot, or interact with a custom internal API. Standard database links often cannot handle authentication protocols (like OAuth2) or complex rate-limiting logic. Relying solely on SQL for integrations results in fragile connections that break silently when an API token expires or an endpoint changes.

Core Python for Engineering: The Essential Toolkit

To build robust, scalable systems, we treat data pipelines as software products. This means utilizing a Python ecosystem designed for reliability and structure.

Connecting to APIs with Python

Ingesting data from APIs is one of the primary drivers for adopting Python ETL scripts. Unlike static database connections, APIs require active management.

You need to handle pagination (looping through pages of results), respect rate limits to avoid getting banned, and manage backoff strategies when the server is busy. Python allows us to write modular functions to handle these edge cases elegantly. For example, a simple Python script can detect a 429 Too Many Requests status, wait for the specified time, and retry automatically. Trying to replicate this logic in a stored procedure is a recipe for frustration.
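To make the pattern concrete, here is a minimal sketch of pagination with retry-and-backoff. The `get_page` callable and the fake client below are stand-ins for real HTTP calls (e.g. `requests.get` plus the server's `Retry-After` header); production code would also cap the total wait time.

```python
import time

def fetch_all_pages(get_page, max_retries=3):
    """Walk every page of a paginated API, backing off and retrying
    when the server answers 429 Too Many Requests.

    `get_page(page)` is any callable returning a tuple of
    (status_code, retry_after_seconds, items, next_page).
    """
    page, results = 1, []
    while page is not None:
        for attempt in range(max_retries):
            status, retry_after, items, next_page = get_page(page)
            if status == 429:
                time.sleep(retry_after)  # honor the server's Retry-After hint
                continue
            results.extend(items)
            page = next_page
            break
        else:
            raise RuntimeError(f"page {page} still failing after {max_retries} tries")
    return results

# Stand-in for real HTTP calls: page 2 is rate-limited once, then succeeds.
state = {"throttled": False}
def fake_get(page):
    if page == 2 and not state["throttled"]:
        state["throttled"] = True
        return (429, 0, [], page)
    pages = {1: ([1, 2], 2), 2: ([3], None)}
    items, nxt = pages[page]
    return (200, 0, items, nxt)

print(fetch_all_pages(fake_get))  # → [1, 2, 3]
```

Because the transport is injected as a callable, the retry logic itself is a plain function you can unit test without touching the network — exactly the kind of modularity SQL cannot offer.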

Advanced Data Validation with Pydantic

One of the biggest risks in data engineering is “garbage in, garbage out.” If an upstream source changes its schema without warning, it can silently corrupt your warehouse.

This is where data validation with Pydantic becomes essential. Pydantic allows us to define strict data models using Python type hints. Before data ever hits your warehouse, it is passed through a Pydantic model. If a field that is usually a string suddenly arrives as an integer, or if a required field is missing, Pydantic catches it immediately.

By enforcing schema at the ingestion layer, we stop bad data at the door. This brings a level of type safety and reliability that SQL constraints alone cannot match.
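As a minimal sketch of this gatekeeping, here is a Pydantic model for a hypothetical event record — the field names are illustrative, not from a real feed:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical schema for an incoming event record.
class Event(BaseModel):
    user_id: int
    email: str
    source: str = "web"   # optional field with a default

ok = Event(user_id=42, email="a@example.com")   # well-formed: passes through

try:
    Event(email="b@example.com")                # missing user_id: rejected
except ValidationError as exc:
    print(f"rejected with {len(exc.errors())} validation error(s)")
```

In a pipeline, every incoming record is pushed through the model inside a `try/except ValidationError`; failures are routed to a dead-letter queue instead of the warehouse.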

Complex Transformations

Unique business requirements sometimes demand logic that exceeds SQL’s capabilities. Maybe you need to apply a recursive calculation, perform string manipulation beyond what standard regex can express, or use a specialized mathematical library.

Python shines here because it allows for imperative programming. You can write loops, use if/else logical branches that are readable, and import libraries designed for specific tasks. Complex data transformations that would require a customized User Defined Function (UDF) in SQL can often be written as a clean, testable Python function.
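A small example of the kind of recursion that is painful in SQL but trivial in Python — flattening arbitrarily nested JSON into warehouse-friendly columns (the payload below is made up):

```python
def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dicts into a single level,
    joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))   # recurse into sub-dicts
        else:
            flat[new_key] = value
    return flat

payload = {"user": {"id": 7, "address": {"city": "Austin"}}, "active": True}
print(flatten(payload))
# → {'user.id': 7, 'user.address.city': 'Austin', 'active': True}
```

Because it is a plain function, it gets a unit test, a docstring, and reuse across pipelines — none of which a one-off SQL UDF provides.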

Scaling Logic: Pandas vs Spark

Once you decide to process data outside the warehouse, you face a new choice: Pandas vs Spark. We see many teams struggle to pick the right tool, often defaulting to Spark when it is overkill, or sticking with Pandas until the server crashes.

Pandas for Single-Node Agility

Pandas is the gold standard for in-memory data manipulation. It is incredibly fast for datasets that fit within the RAM of a single machine. Best for:

    • Datasets that fit comfortably in a single machine’s memory
    • Rapid prototyping and exploratory analysis
    • Daily deltas in the megabyte-to-low-gigabyte range

PySpark for Distributed Scale

Single-node processing fails when data grows into the terabytes; at that scale you need distributed power. PySpark allows you to execute logic across a cluster of machines. Best for:

    • Terabyte-scale datasets that exceed single-node memory
    • Workloads that must be parallelized across a cluster
    • Pipelines that repeatedly hit “Out of Memory” errors on one node

Making the Decision

Our rule of thumb at Stellans: start with Pandas, scale with Spark. If your daily delta is only 500MB, keep the overhead low; do not incur the cost of managing a Spark cluster prematurely. However, if you are hitting “Out of Memory” errors frequently, it is time to distribute the load.
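To illustrate the single-node side, a daily aggregation at this scale is a one-liner in Pandas (the data below is made up). When the same workload outgrows one machine, it translates almost directly to PySpark (`df.groupBy("region").sum("amount")`), so starting small costs you little rework later.

```python
import pandas as pd

# A toy "daily delta" that fits easily in memory -- no cluster needed.
orders = pd.DataFrame({
    "region": ["EU", "EU", "US", "US", "US"],
    "amount": [120.0, 80.0, 200.0, 50.0, 150.0],
})

daily = orders.groupby("region", as_index=False)["amount"].sum()
print(daily)
#   region  amount
# 0     EU   200.0
# 1     US   400.0
```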

Orchestration with Airflow

Writing the scripts is only half the battle; running them reliably is the other half. Enterprise data pipelines need more sophisticated scheduling than simple cron jobs can provide.

Why Cron Jobs Are Not Enough

Cron fires scripts at specific times; it has no idea whether the previous step succeeded. If your API ingestion script fails at 2:00 AM, your transformation script will still fire at 2:30 AM, processing empty or incomplete data.

Best Practices for Airflow DAG Writing

Apache Airflow solves this by managing dependencies. When writing an Airflow DAG, you define a Directed Acyclic Graph in which each task executes only after its upstream tasks succeed.

We structure DAGs to maintain a clear “Extract -> Load -> Transform” pattern.

  1. Task A (Python): Fetch data from API.
    • If success…
  2. Task B (Python): Validate data with Pydantic.
    • If success…
  3. Task C (SQL): Trigger the dbt model to transform data in the warehouse.

This ensures that if the API is down, the pipeline pauses and alerts the engineering team, rather than polluting the warehouse with partial data.
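The three-task pattern above can be sketched as a DAG file. This is a sketch only — the DAG id, schedule, task names, and dbt command are illustrative, not from a real deployment, and the operator imports assume Airflow 2.x:

```python
# dags/api_ingest_pipeline.py -- sketch of the Extract -> Validate -> Transform pattern.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def fetch_from_api(**context):
    ...  # Task A: pull raw records from the API into staging

def validate_with_pydantic(**context):
    ...  # Task B: reject malformed records before loading

with DAG(
    dag_id="api_ingest_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="fetch_from_api",
                             python_callable=fetch_from_api)
    validate = PythonOperator(task_id="validate_with_pydantic",
                              python_callable=validate_with_pydantic)
    transform = BashOperator(task_id="run_dbt_model",
                             bash_command="dbt run --select staging_orders")

    # Task C only fires if A and B both succeed -- the guarantee cron lacks.
    extract >> validate >> transform
```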

Building a Hybrid Architecture

Adopting Python enhances your existing SQL capabilities rather than replacing them. In fact, our most successful deployments utilize a hybrid architecture.

We use Python for the “outer loop”:

    • Ingesting data from APIs and third-party SaaS tools
    • Validating schemas with Pydantic before loading
    • Orchestrating and monitoring the pipeline with Airflow

We use SQL for the “inner loop”:

    • Set-based transformations and aggregations inside the warehouse
    • Modeling business logic (e.g., dbt models)

This methodology leverages the strengths of both tools. It allows your Data Engineering Services to run like a well-oiled machine, capable of scaling as your organization grows. We have applied this exact hybrid approach in numerous engagements, helping clients move from fragile scripts to resilient platforms. You can see examples of this maturity in our projects portfolio.

Conclusion

The transition from “SQL-only” to “Software Engineering” is a hallmark of a mature data organization. Python offers the control, connectivity, and rigor required to build systems that scale. It allows us to validate data before it causes damage, integrate with any tool in the modern stack, and orchestrate complex workflows clearly.

Is your data infrastructure struggling to scale beyond simple SQL scripts? We build robust, scalable data systems. Reach out to Stellans today to modernize your architecture.

Frequently Asked Questions

Q: When should I use Python over SQL for data engineering? A: You should use Python when you need to connect to external APIs, parse unstructured data (like complex JSON or logs), apply complex procedural logic (recursion/loops), or require strict data validation (via Pydantic) before the data enters your warehouse.

Q: What is Pydantic used for in Data Engineering? A: Pydantic is used for data validation and settings management using Python type annotations. In data engineering, it allows you to define strict schemas for incoming data, ensuring structural integrity and data quality before loading it into your database.

Q: Pandas vs Spark: Which should I choose? A: Choose Pandas for datasets that fit within a single machine’s memory, as it offers fast manipulation and agility. Choose Spark (PySpark) for the distributed processing of massive datasets (terabytes) across clusters when single-node memory is insufficient.


Article By:

Anton Malyshev

Co-founder
