

ETL Optimization in Microsoft Fabric: Simplifying Bloated Dataflows

By Errin O'Connor|April 15, 2026|14 min read

Every enterprise Fabric deployment we audit has the same problem: dataflows that started clean have grown into bloated, redundant transformation chains consuming 3-5x more capacity than necessary. Here is how EPC Group identifies and fixes the most common ETL anti-patterns.

The bloated dataflow problem

When organizations first adopt Microsoft Fabric, dataflows are typically designed with care — clean transformations, logical naming, efficient query folding. Then reality sets in. New source columns get added. Business rules change. Different team members modify dataflows without understanding the original design. Three months later, a dataflow that refreshed in 8 minutes now takes 45 minutes and consumes 4x the capacity units.

This is not a theoretical problem. In the last 12 months, EPC Group has optimized Fabric environments for enterprises where bloated ETL was the single largest driver of capacity cost overruns. One client was running an F128 capacity ($24,000/month) when an F32 ($6,000/month) would have sufficed after optimization. That is $216,000 per year in avoidable spend.

The root causes are consistent across every engagement: broken query folding, redundant transformations, missing incremental refresh, unnecessary data type conversions, and dataflows that load entire tables when only a filtered subset is needed.

Anti-pattern 1: Broken query folding

Query folding is the most impactful optimization lever in Dataflow Gen2. When it works, your M query steps translate to SQL that executes at the source — filtering, aggregating, and joining before data ever enters Fabric compute. When it breaks, every row from the source table streams into memory and Fabric processes it locally.

Common query folding breakers we find in enterprise dataflows:

  • Table.Buffer() calls: Forces full materialization in memory. Often added by developers trying to "speed up" downstream steps, but it actually prevents folding for everything after it.
  • Custom M functions in filter steps: If you write a custom function for a filter condition, the engine cannot translate it to SQL. Use native M filter syntax instead.
  • Adding index columns mid-query: Table.AddIndexColumn() breaks folding because SQL sources cannot natively generate arbitrary row indices in the way M expects.
  • Merging queries from different sources: A join between a SQL Server table and an Excel file cannot fold — the SQL side materializes fully so the M engine can perform the join in memory.
  • Type changes after non-foldable steps: Even a simple Table.TransformColumnTypes() breaks folding if any prior step already broke it.

The fix: EPC Group audits every step in every query using Power Query's step folding indicators (or the "View Native Query" option in Power BI Desktop, where the connector supports it). A step whose indicator shows no folding — or where "View Native Query" is grayed out — has broken folding. We restructure queries to push foldable operations (filters, column selection, type changes, basic calculations) before any non-foldable step, and we replace non-foldable patterns with foldable alternatives wherever possible.
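The difference folding makes can be illustrated outside of M. The sketch below (plain Python with sqlite3; the table and column names are invented) contrasts filtering at the source — what a folded query does — with pulling every row and filtering in memory, which is what a broken fold forces Fabric compute to do:

```python
import sqlite3

# In-memory stand-in for a source database; schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(i, "EMEA" if i % 2 else "AMER", i * 10.0) for i in range(1, 1001)],
)

# Folded pattern: the filter executes at the source, so only
# matching rows ever leave the database engine.
folded = conn.execute(
    "SELECT id, amount FROM sales WHERE region = 'EMEA'"
).fetchall()

# Broken-fold pattern: every row streams into application memory,
# then gets filtered locally -- the expensive path.
all_rows = conn.execute("SELECT id, region, amount FROM sales").fetchall()
local = [(r[0], r[2]) for r in all_rows if r[1] == "EMEA"]

assert sorted(folded) == sorted(local)  # same answer either way...
print(len(all_rows), len(folded))       # ...but far fewer rows moved
```

Same result either way; the folded path simply moves half the rows. At enterprise row counts and Fabric capacity pricing, that difference is the whole ballgame.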

Anti-pattern 2: Redundant transformations across dataflows

In large Fabric environments, different teams often build separate dataflows that pull the same source data and apply similar (but not identical) transformations. We routinely find:

  • Three dataflows pulling the same SQL Server table, each filtering for different departments but duplicating 15 identical cleansing steps.
  • Multiple dataflows that each compute the same calculated column (e.g., fiscal quarter from a date) independently.
  • Staging dataflows that load data to lakehouse tables, then downstream dataflows that re-read and re-transform the same data because the staging layer was not designed as a shared resource.

The fix: Implement a layered dataflow architecture. Bronze-layer dataflows handle ingestion and basic cleansing, outputting to shared lakehouse tables. Silver-layer dataflows or notebooks apply business transformations. Gold-layer semantic models consume the silver output. This eliminates duplication and establishes clear ownership boundaries.

Anti-pattern 3: Full refresh where incremental refresh should be used

The default behavior for a Dataflow Gen2 is full refresh — it re-reads and re-processes the entire source table on every run. For a 100-million-row fact table, this means processing 100 million rows every day even when only 50,000 rows changed. The capacity consumption is proportional to the full table size, not the delta.

The fix: Configure incremental refresh on every fact table dataflow. Requirements: a reliable datetime column (created_date, modified_date, or a CDC timestamp), a defined refresh window (typically 7-30 days), and a historical window that matches your reporting needs. For sources without a reliable datetime column, implement watermark patterns using SQL Server Change Tracking, CDC, or application-level audit columns.
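The watermark pattern mentioned above can be sketched in a few lines. This is plain Python with sqlite3; the table, column names, and in-variable watermark are hypothetical stand-ins for a real source and a persisted metadata table:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, modified_date TEXT)")
base = datetime(2026, 1, 1)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, (base + timedelta(days=i)).isoformat()) for i in range(100)],
)

# Watermark = high-water mark of the last successful refresh.
# A real implementation persists this in a metadata table.
watermark = (base + timedelta(days=92)).isoformat()

# Incremental pull: only rows modified after the watermark are
# re-read, instead of re-processing all 100 rows on every run.
delta = conn.execute(
    "SELECT id, modified_date FROM orders WHERE modified_date > ?",
    (watermark,),
).fetchall()

# Advance the watermark to the newest row just processed.
new_watermark = max(row[1] for row in delta)
print(len(delta))
```

In Dataflow Gen2 the equivalent filter is defined through the incremental refresh settings rather than hand-written, but the mechanics — filter on a datetime column, advance the high-water mark — are the same.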

The impact is dramatic. A client's 120-million-row transaction table went from 38-minute full refresh to 4-minute incremental refresh — a 90% reduction in both time and capacity consumption.

Anti-pattern 4: Unnecessary data type conversions and column bloat

Power Query's automatic type detection is helpful during development but creates problems at scale. We frequently find:

  • Numeric identifier columns cast to text (e.g., surrogate keys, order numbers), consuming 5-10x more memory than integer types. (Zip codes are the opposite case: they look numeric but should stay text to preserve leading zeros.)
  • Decimal precision set to Currency or Decimal Number when Fixed Decimal or Whole Number would suffice.
  • Dataflows that load 80 columns from a source table when only 15 are used downstream. Every unused column consumes capacity during refresh and storage in the lakehouse.
  • Multiple type conversion steps scattered throughout the query instead of a single consolidated step at the end.

The fix: Select only needed columns immediately after the source step (this folds to SQL SELECT). Consolidate all type conversions into a single step. Use the most memory-efficient type for each column. Remove columns that no downstream consumer uses — if in doubt, check the semantic model and report references.
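The column-pruning rule is the folding idea applied to the SELECT list. A small sketch (Python with sqlite3; the wide table is simulated and its column names invented) counting how many cells actually move through the pipeline:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simulate a wide source table; real source tables often have 80+ columns.
cols = [f"c{i}" for i in range(20)]
conn.execute(f"CREATE TABLE wide ({', '.join(c + ' INTEGER' for c in cols)})")
conn.executemany(
    f"INSERT INTO wide VALUES ({', '.join('?' * 20)})",
    [tuple(range(20))] * 1000,
)

# Anti-pattern: SELECT * drags every column through the pipeline.
star = conn.execute("SELECT * FROM wide").fetchall()

# Fix: name only the columns downstream consumers use. In a dataflow
# this is Table.SelectColumns immediately after the source step,
# which folds to a SQL SELECT list.
needed = conn.execute("SELECT c0, c1, c2 FROM wide").fetchall()

cells_star = len(star) * len(star[0])        # every cell moved
cells_needed = len(needed) * len(needed[0])  # only what's used
print(cells_star, cells_needed)
```

Pruning 20 columns down to 3 cuts the cells moved by 85% in this toy case; pruning 80 columns down to 15 in a real dataflow behaves the same way, and the saving applies on every refresh.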

Anti-pattern 5: M query spaghetti

M queries in enterprise dataflows often grow organically into unreadable chains of 30-50 steps with auto-generated names like "Changed Type2", "Merged Queries1", "Expanded Table3". When the original developer leaves, no one understands what the query does, so the next developer adds workaround steps instead of refactoring.

The fix: EPC Group refactors M queries following these principles:

  • Descriptive step names: Every step gets a name that explains what it does (e.g., "FilterActiveLeases" not "Filtered Rows3").
  • Foldable steps first: Column selection, filtering, type changes, and basic calculations come before any non-foldable operation.
  • Shared functions in a library query: Repeated transformation logic (date parsing, address standardization, status mapping) is extracted into reusable M functions referenced across dataflows.
  • Documentation comments: M supports // comments — every non-obvious step gets a comment explaining the business rule.
  • Step count target: If a query exceeds 15 steps, it is a candidate for splitting into a staging query and a transformation query.
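As an example of the shared-function idea, repeated logic such as deriving fiscal quarter from a date — called out earlier as a common duplicated calculation — belongs in one documented, testable function. A Python sketch (in a real dataflow this would be a reusable M function in a library query; the July fiscal-year start is an assumption):

```python
from datetime import date

def fiscal_quarter(d: date, fy_start_month: int = 7) -> str:
    """Return e.g. 'FY2027 Q1' for a July-start fiscal year.

    fy_start_month is an assumption -- set it to your organization's
    fiscal calendar (7 = July start, common in the U.S. public sector).
    """
    # Months elapsed since the fiscal year began, 0-11.
    offset = (d.month - fy_start_month) % 12
    quarter = offset // 3 + 1
    # Label the fiscal year by the calendar year in which it ends.
    fy = d.year + 1 if d.month >= fy_start_month else d.year
    return f"FY{fy} Q{quarter}"

print(fiscal_quarter(date(2026, 7, 1)))   # first day of FY2027
print(fiscal_quarter(date(2026, 4, 15)))  # late in FY2026
```

Once the rule lives in one place, three dataflows computing fiscal quarter independently become three callers of the same function, and a change to the fiscal calendar is a one-line edit instead of a hunt across workspaces.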

Dataflow Gen2 best practices checklist

After optimizing dozens of enterprise Fabric environments, EPC Group applies this checklist to every dataflow:

| Practice | Impact | Effort |
| --- | --- | --- |
| Verify query folding on every step | 60-80% faster refresh | Low |
| Enable incremental refresh on fact tables | 80-95% less data processed | Medium |
| Select only needed columns at source | 30-50% memory reduction | Low |
| Consolidate type conversions | 10-20% faster refresh | Low |
| Eliminate cross-source merges where possible | 50-70% faster on affected queries | Medium |
| Implement layered architecture (bronze/silver/gold) | Eliminates redundant processing | High |
| Schedule refreshes during off-peak hours | Reduces throttling risk | Low |
| Monitor with Fabric Capacity Metrics app | Identifies regressions early | Low |

When to use notebooks instead of Dataflow Gen2

Dataflow Gen2 is not always the right tool. EPC Group recommends switching to Fabric notebooks (PySpark) when:

  • Transformation logic requires complex joins across 5+ tables from different sources.
  • Data volumes exceed 1 GB in-memory after filtering — PySpark distributes processing across the Fabric compute cluster.
  • Business rules require Python libraries (e.g., fuzzy matching, geospatial calculations, ML feature engineering) that M cannot express.
  • You need unit-testable, version-controlled transformation code that can go through a CI/CD pipeline.
  • Real-time or near-real-time ingestion is required — Fabric notebooks support streaming via Spark Structured Streaming.

The hybrid approach works best: Dataflow Gen2 for the majority of straightforward ingestion pipelines, notebooks for complex transformations, and Power BI semantic models consuming the lakehouse gold layer via Direct Lake.

EPC Group's ETL optimization engagement

EPC Group delivers a structured 4-week Fabric ETL Optimization engagement:

  • Week 1: Dataflow inventory and profiling. We catalog every dataflow, measure refresh times and CU consumption, identify query folding breaks, and map data lineage from source to report.
  • Week 2: Optimization plan. Prioritized list of changes ranked by capacity savings impact. Includes architecture recommendations (layered design, notebook migration candidates, incremental refresh candidates).
  • Weeks 3-4: Implementation. We refactor the top-priority dataflows, implement incremental refresh, restructure queries for folding, and validate capacity savings. Knowledge transfer to your team on maintaining the optimized state.

Typical outcomes: 40-70% reduction in refresh duration, 30-60% reduction in capacity utilization, and a right-sizing recommendation that often drops the Fabric SKU by one or two tiers.

Frequently Asked Questions

How do I know if my Fabric dataflows are bloated?

The clearest indicators are: refresh times that have grown 3-5x since initial deployment, capacity utilization spikes that force throttling, duplicate data copies across multiple dataflows, and M queries with 20+ transformation steps where most could be pushed to the source. EPC Group runs a Dataflow Health Assessment that profiles every dataflow in your Fabric workspace, identifies redundant transformations, measures query folding coverage, and produces a prioritized optimization roadmap with estimated capacity savings.

What is query folding and why does it matter for Fabric performance?

Query folding is when the Power Query (M) engine translates your transformation steps into native source queries — typically SQL — so the work happens at the source database instead of consuming Fabric capacity. When query folding breaks, every row comes into memory and Fabric compute processes it locally, which is 10-100x slower and more expensive. In our optimization engagements, restoring query folding on broken dataflows typically cuts refresh time by 60-80% and reduces capacity consumption proportionally.

Should we use Dataflow Gen2 or Fabric notebooks for ETL?

Use Dataflow Gen2 for low-to-medium complexity ingestion and transformation tasks, especially when the source supports query folding (SQL databases, OData, many REST APIs). Use Fabric notebooks (PySpark) for complex transformations, multi-source joins that exceed M query capabilities, ML feature engineering, and workloads exceeding 1 GB of in-memory processing. Most enterprises use both: Dataflow Gen2 for the 70% of pipelines that are straightforward, notebooks for the 30% that require code.

How does incremental refresh work in Fabric Dataflow Gen2?

Dataflow Gen2 supports incremental refresh by partitioning data based on a datetime column. You define a refresh window (e.g., last 7 days) and a historical window (e.g., last 3 years). On each refresh, only the refresh window is re-queried from the source; historical partitions are untouched. This reduces refresh time, source system load, and Fabric capacity consumption. The key requirement is a reliable datetime column in the source — if your source lacks one, EPC Group implements watermark patterns using change tracking or CDC.

What capacity savings can we expect from ETL optimization?

In our enterprise Fabric optimization engagements, we typically deliver 40-70% reduction in refresh duration, 30-60% reduction in CU consumption, and elimination of throttling-induced refresh failures. For a client running 200+ dataflows on an F64 capacity, optimization reduced monthly Fabric cost from $12,000 to $7,200 by right-sizing the capacity from F64 to F32 after eliminating redundant compute. The ROI on a 4-week optimization engagement is usually realized within 2-3 months of capacity savings.

Stop overpaying for Fabric capacity

EPC Group's Dataflow Health Assessment identifies exactly where your Fabric ETL is wasting capacity and how much you can save. Fixed fee, 2-week turnaround. Call (888) 381-9725 or request an assessment below.

Request a Dataflow Health Assessment

Microsoft Fabric Architecture: 2026 Considerations for ETL Optimization

Microsoft Fabric F-SKU pricing in 2026 starts at F2 ($263/mo) and scales to F2048 ($269,000/mo). F64 ($5,257/mo) is the inflection point; it includes Power BI Premium capacity-equivalent features and unlocks Direct Lake mode across the full Fabric workload set (Data Engineering, Data Warehouse, Real-Time Intelligence, Data Science, Data Activator). For a typical Fortune 500 analytics workload, F64-F128 is the most common starting point.

OneLake (Microsoft Fabric's unified data lake) uses a shortcut model that lets a single physical Parquet dataset serve both Fabric Lakehouse queries (Spark) and Fabric Warehouse queries (T-SQL) without copying data. This eliminates the historical lakehouse-versus-warehouse either/or decision and reduces the typical enterprise data-platform footprint by 30-50% versus comparable Snowflake-plus-Databricks dual-vendor deployments.

Decision factors EPC Group evaluates

  • Microsoft Purview lineage tracking across Fabric workloads
  • OneLake shortcut strategy for cross-workload data sharing
  • Real-Time Intelligence vs Power BI streaming deployment patterns
  • Fabric vs Snowflake/Databricks consolidation TCO analysis
  • F-SKU capacity sizing (F2 to F2048) with Direct Lake compatibility

For a tailored read on this topic in your specific tenant, contact EPC Group at contact@epcgroup.net or +1 (888) 381-9725. Engagement options at /pricing.