ETL Optimization in Microsoft Fabric: Simplifying Bloated Dataflows
Every enterprise Fabric deployment we audit has the same problem: dataflows that started clean have grown into bloated, redundant transformation chains consuming 3-5x more capacity than necessary. Here is how EPC Group identifies and fixes the most common ETL anti-patterns.
The bloated dataflow problem
When organizations first adopt Microsoft Fabric, dataflows are typically designed with care — clean transformations, logical naming, efficient query folding. Then reality sets in. New source columns get added. Business rules change. Different team members modify dataflows without understanding the original design. Three months later, a dataflow that refreshed in 8 minutes now takes 45 minutes and consumes 4x the capacity units.
This is not a theoretical problem. In the last 12 months, EPC Group has optimized Fabric environments for enterprises where bloated ETL was the single largest driver of capacity cost overruns. One client was running an F128 capacity ($24,000/month) when an F32 ($6,000/month) would have sufficed after optimization. That is $216,000 per year in avoidable spend.
The root causes are consistent across every engagement: broken query folding, redundant transformations, missing incremental refresh, unnecessary data type conversions, and dataflows that load entire tables when only a filtered subset is needed.
Anti-pattern 1: Broken query folding
Query folding is the most impactful optimization lever in Dataflow Gen2. When it works, your M query steps translate to SQL that executes at the source — filtering, aggregating, and joining before data ever enters Fabric compute. When it breaks, every row from the source table streams into memory and Fabric processes it locally.
Common query folding breakers we find in enterprise dataflows:
- Table.Buffer() calls: Forces full materialization in memory. Often added by developers trying to "speed up" downstream steps, but it actually prevents folding for everything after it.
- Custom M functions in filter steps: If you write a custom function for a filter condition, the engine cannot translate it to SQL. Use native M filter syntax instead.
- Adding index columns mid-query: Table.AddIndexColumn() breaks folding because SQL sources cannot natively generate arbitrary row indices in the way M expects.
- Merging queries from different sources: A join between a SQL Server table and an Excel file cannot fold — the SQL side materializes fully so the M engine can perform the join in memory.
- Type changes after non-foldable steps: Even a simple Table.TransformColumnTypes() breaks folding if any prior step already broke it.
The fix: EPC Group audits every step in every query using the "View Native Query" indicator in Power Query. Steps where the indicator disappears have broken folding. We restructure queries to push foldable operations (filters, column selection, type changes, basic calculations) before any non-foldable step, and we replace non-foldable patterns with foldable alternatives wherever possible.
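As a sketch of this restructuring, the query below keeps foldable operations first (server, database, and column names are hypothetical; the ordering pattern is what matters):

```m
// Hypothetical source and column names -- the step ordering is the point.
let
    Source = Sql.Database("sql-prod.contoso.com", "SalesDB"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],

    // Foldable: becomes a SQL WHERE clause executed at the source
    ActiveOrders = Table.SelectRows(Orders, each [Status] = "Active"),

    // Foldable: becomes the SQL SELECT column list
    SelectedColumns = Table.SelectColumns(
        ActiveOrders, {"OrderID", "OrderDate", "CustomerID", "Amount"})

    // Inserting Table.Buffer(SelectedColumns) here would materialize the
    // table in memory and break folding for every step after it.
in
    SelectedColumns
```

With the filter and column selection up front, the source database returns only active rows and four columns; any non-foldable steps that must exist then operate on that reduced set rather than the full table.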
Anti-pattern 2: Redundant transformations across dataflows
In large Fabric environments, different teams often build separate dataflows that pull the same source data and apply similar (but not identical) transformations. We routinely find:
- Three dataflows pulling the same SQL Server table, each filtering for different departments but duplicating 15 identical cleansing steps.
- Multiple dataflows that each compute the same calculated column (e.g., fiscal quarter from a date) independently.
- Staging dataflows that load data to lakehouse tables, then downstream dataflows that re-read and re-transform the same data because the staging layer was not designed as a shared resource.
The fix: Implement a layered dataflow architecture. Bronze-layer dataflows handle ingestion and basic cleansing, outputting to shared lakehouse tables. Silver-layer dataflows or notebooks apply business transformations. Gold-layer semantic models consume the silver output. This eliminates duplication and establishes clear ownership boundaries.
Anti-pattern 3: Full refresh where incremental refresh should be used
The default behavior for Dataflow Gen2 is full refresh — it re-reads and re-processes the entire source table on every run. For a 100-million-row fact table, this means processing 100 million rows every day even when only 50,000 rows changed. The capacity consumption is proportional to the full table size, not the delta.
The fix: Configure incremental refresh on every fact table dataflow. Requirements: a reliable datetime column (created_date, modified_date, or a CDC timestamp), a defined refresh window (typically 7-30 days), and a historical window that matches your reporting needs. For sources without a reliable datetime column, implement watermark patterns using SQL Server Change Tracking, CDC, or application-level audit columns.
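The underlying pattern is a foldable window filter on the datetime column. Dataflow Gen2's built-in incremental refresh is configured in the dataflow settings, but the equivalent manual watermark filter (table and column names are illustrative, and RangeStart/RangeEnd are assumed datetime parameters) looks like this:

```m
// Illustrative watermark-style filter. RangeStart and RangeEnd are
// assumed datetime parameters defining the refresh window.
let
    Source = Sql.Database("sql-prod.contoso.com", "SalesDB"),
    Transactions = Source{[Schema = "dbo", Item = "Transactions"]}[Data],

    // Folds to a SQL WHERE on modified_date, so only rows inside the
    // refresh window ever stream out of the source on each run.
    RefreshWindow = Table.SelectRows(
        Transactions,
        each [modified_date] >= RangeStart and [modified_date] < RangeEnd)
in
    RefreshWindow
```

Because the filter folds, the 50,000-row delta is all the source returns — the other 100 million rows are never read.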
The impact is dramatic. A client's 120-million-row transaction table went from 38-minute full refresh to 4-minute incremental refresh — a 90% reduction in both time and capacity consumption.
Anti-pattern 4: Unnecessary data type conversions and column bloat
Power Query's automatic type detection is helpful during development but creates problems at scale. We frequently find:
- Columns stored as text that should be whole numbers (e.g., surrogate keys, integer ID fields), consuming 5-10x more memory. (Codes with leading zeros, such as US zip codes, are a legitimate exception and should stay text.)
- Decimal precision set to Currency or Decimal Number when Fixed Decimal or Whole Number would suffice.
- Dataflows that load 80 columns from a source table when only 15 are used downstream. Every unused column consumes capacity during refresh and storage in the lakehouse.
- Multiple type conversion steps scattered throughout the query instead of a single consolidated step at the end.
The fix: Select only needed columns immediately after the source step (this folds to SQL SELECT). Consolidate all type conversions into a single step. Use the most memory-efficient type for each column. Remove columns that no downstream consumer uses — if in doubt, check the semantic model and report references.
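A minimal sketch of these fixes together (table and column names are hypothetical):

```m
// Hypothetical table and column names.
let
    Source = Sql.Database("sql-prod.contoso.com", "SalesDB"),
    FactSales = Source{[Schema = "dbo", Item = "FactSales"]}[Data],

    // Prune immediately to the columns downstream consumers actually
    // use -- this folds to the SQL SELECT column list.
    Selected = Table.SelectColumns(
        FactSales, {"SaleID", "SaleDate", "Quantity", "UnitPrice"}),

    // One consolidated type step with memory-efficient types:
    // whole numbers for keys and counts, fixed decimal for money.
    Typed = Table.TransformColumnTypes(Selected, {
        {"SaleID", Int64.Type},
        {"SaleDate", type date},
        {"Quantity", Int64.Type},
        {"UnitPrice", Currency.Type}
    })
in
    Typed
```

In M, `Currency.Type` corresponds to the Fixed Decimal type, which stores values as scaled integers and is typically cheaper than full Decimal Number precision.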
Anti-pattern 5: M query spaghetti
M queries in enterprise dataflows often grow organically into unreadable chains of 30-50 steps with auto-generated names like "Changed Type2", "Merged Queries1", "Expanded Table3". When the original developer leaves, no one understands what the query does, so the next developer adds workaround steps instead of refactoring.
The fix: EPC Group refactors M queries following these principles:
- Descriptive step names: Every step gets a name that explains what it does (e.g., "FilterActiveLeases" not "Filtered Rows3").
- Foldable steps first: Column selection, filtering, type changes, and basic calculations come before any non-foldable operation.
- Shared functions in a library query: Repeated transformation logic (date parsing, address standardization, status mapping) is extracted into reusable M functions referenced across dataflows.
- Documentation comments: M supports // comments — every non-obvious step gets a comment explaining the business rule.
- Step count target: If a query exceeds 15 steps, it is a candidate for splitting into a staging query and a transformation query.
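As an example of a shared library function, a fiscal-quarter helper might look like the sketch below (this assumes a July fiscal year start — adjust the month offset for your calendar):

```m
// Hypothetical shared library function: fiscal quarter from a date,
// assuming the fiscal year starts in July.
let
    FiscalQuarterFromDate = (d as date) as text =>
        let
            // Shift six months so July maps to fiscal month 1
            FiscalMonth = Date.Month(Date.AddMonths(d, 6)),
            Quarter = Number.RoundUp(FiscalMonth / 3)
        in
            "FQ" & Text.From(Quarter)
in
    FiscalQuarterFromDate
```

Each dataflow references this library query instead of re-implementing the rule, so when the business changes its fiscal calendar the calculation is updated in exactly one place.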
Dataflow Gen2 best practices checklist
After optimizing dozens of enterprise Fabric environments, EPC Group applies this checklist to every dataflow:
| Practice | Impact | Effort |
|---|---|---|
| Verify query folding on every step | 60-80% faster refresh | Low |
| Enable incremental refresh on fact tables | 80-95% less data processed | Medium |
| Select only needed columns at source | 30-50% memory reduction | Low |
| Consolidate type conversions | 10-20% faster refresh | Low |
| Eliminate cross-source merges where possible | 50-70% faster on affected queries | Medium |
| Implement layered architecture (bronze/silver/gold) | Eliminates redundant processing | High |
| Schedule refreshes during off-peak hours | Reduces throttling risk | Low |
| Monitor with Fabric Capacity Metrics app | Identifies regressions early | Low |
When to use notebooks instead of Dataflow Gen2
Dataflow Gen2 is not always the right tool. EPC Group recommends switching to Fabric notebooks (PySpark) when:
- Transformation logic requires complex joins across 5+ tables from different sources.
- Data volumes exceed 1 GB in-memory after filtering — PySpark distributes processing across the Fabric compute cluster.
- Business rules require Python libraries (e.g., fuzzy matching, geospatial calculations, ML feature engineering) that M cannot express.
- You need unit-testable, version-controlled transformation code that can go through a CI/CD pipeline.
- Real-time or near-real-time ingestion is required — Fabric notebooks support streaming via Spark Structured Streaming.
The hybrid approach works best: Dataflow Gen2 for the majority of straightforward ingestion pipelines, notebooks for complex transformations, and Power BI semantic models consuming the lakehouse gold layer via Direct Lake.
EPC Group's ETL optimization engagement
EPC Group delivers a structured 4-week Fabric ETL Optimization engagement:
- Week 1: Dataflow inventory and profiling. We catalog every dataflow, measure refresh times and CU consumption, identify query folding breaks, and map data lineage from source to report.
- Week 2: Optimization plan. Prioritized list of changes ranked by capacity savings impact. Includes architecture recommendations (layered design, notebook migration candidates, incremental refresh candidates).
- Weeks 3-4: Implementation. We refactor the top-priority dataflows, implement incremental refresh, restructure queries for folding, and validate capacity savings. Knowledge transfer to your team on maintaining the optimized state.
Typical outcomes: 40-70% reduction in refresh duration, 30-60% reduction in capacity utilization, and a right-sizing recommendation that often drops the Fabric SKU by one or two tiers.
Frequently Asked Questions
How do I know if my Fabric dataflows are bloated?
The clearest indicators are: refresh times that have grown 3-5x since initial deployment, capacity utilization spikes that force throttling, duplicate data copies across multiple dataflows, and M queries with 20+ transformation steps where most could be pushed to the source. EPC Group runs a Dataflow Health Assessment that profiles every dataflow in your Fabric workspace, identifies redundant transformations, measures query folding coverage, and produces a prioritized optimization roadmap with estimated capacity savings.
What is query folding and why does it matter for Fabric performance?
Query folding is when the Power Query (M) engine translates your transformation steps into native source queries — typically SQL — so the work happens at the source database instead of consuming Fabric capacity. When query folding breaks, every row comes into memory and Fabric compute processes it locally, which is 10-100x slower and more expensive. In our optimization engagements, restoring query folding on broken dataflows typically cuts refresh time by 60-80% and reduces capacity consumption proportionally.
Should we use Dataflow Gen2 or Fabric notebooks for ETL?
Use Dataflow Gen2 for low-to-medium complexity ingestion and transformation tasks, especially when the source supports query folding (SQL databases, OData, many REST APIs). Use Fabric notebooks (PySpark) for complex transformations, multi-source joins that exceed M query capabilities, ML feature engineering, and workloads exceeding 1 GB of in-memory processing. Most enterprises use both: Dataflow Gen2 for the 70% of pipelines that are straightforward, notebooks for the 30% that require code.
How does incremental refresh work in Fabric Dataflow Gen2?
Dataflow Gen2 supports incremental refresh by partitioning data based on a datetime column. You define a refresh window (e.g., last 7 days) and a historical window (e.g., last 3 years). On each refresh, only the refresh window is re-queried from the source; historical partitions are untouched. This reduces refresh time, source system load, and Fabric capacity consumption. The key requirement is a reliable datetime column in the source — if your source lacks one, EPC Group implements watermark patterns using change tracking or CDC.
What capacity savings can we expect from ETL optimization?
In our enterprise Fabric optimization engagements, we typically deliver 40-70% reduction in refresh duration, 30-60% reduction in CU consumption, and elimination of throttling-induced refresh failures. For a client running 200+ dataflows on an F64 capacity, optimization reduced monthly Fabric cost from $12,000 to $7,200 by right-sizing the capacity from F64 to F32 after eliminating redundant compute. The ROI on a 4-week optimization engagement is usually realized within 2-3 months of capacity savings.
Stop overpaying for Fabric capacity
EPC Group's Dataflow Health Assessment identifies exactly where your Fabric ETL is wasting capacity and how much you can save. Fixed fee, 2-week turnaround. Call (888) 381-9725 or request an assessment below.
Request a Dataflow Health Assessment