Azure Kubernetes Service (AKS) Enterprise Guide 2026: Architecture, Security, and Operations
Azure Kubernetes Service is the dominant managed Kubernetes platform for enterprises running on Azure. This guide covers enterprise AKS cluster architecture, networking (Azure CNI, Kubenet, CNI Overlay), security hardening with Entra ID and pod identity, autoscaling strategies, monitoring with Prometheus and Grafana, and GitOps with Flux — based on EPC Group's 150+ enterprise AKS deployments.
Azure Kubernetes Service (AKS) is Microsoft's managed Kubernetes platform. According to the 2025 CNCF Survey, 96% of organizations are either using or evaluating Kubernetes. AKS handles control plane management, automatic upgrades, patching, and scaling. EPC Group has deployed AKS for 150+ enterprise organizations — from startups running a single microservice to Fortune 500 companies operating 100+ clusters across multiple regions.
Key facts
- EPC Group: 150+ enterprise AKS deployments across healthcare, financial services, and government.
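- AKS control plane: free on the Free tier. The Standard tier, which adds the financially backed uptime SLA, is billed per cluster per hour. On either tier you also pay for worker node VMs, managed disks, load balancers, container registry, and egress bandwidth.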
- Typical 3-node production cluster (Standard_D4s_v5 VMs): $400–$600/month for compute; $250–$400/month with 1-year Azure Reserved Instances.
- Enterprise clusters with autoscaling, monitoring, and multiple node pools: $2,000–$10,000/month depending on workload scale.
- Spot node pools: 60–90% compute cost savings for fault-tolerant batch workloads.
- EPC Group deploys all production clusters with CIS Kubernetes Benchmark compliance validation.
Why AKS for enterprise container workloads
AKS simplifies Kubernetes operations by managing the control plane and providing built-in integration with Azure services. On the Standard tier it offers a financially backed uptime SLA for the API server: 99.95% for clusters that use availability zones and 99.9% for clusters that do not. Microsoft operates the Kubernetes API server, etcd, controller manager, and scheduler; the Free tier carries no control plane charge, so you pay only for worker nodes and associated resources.
The organizations that succeed with AKS invest in three areas: networking architecture, security hardening, and operational maturity (monitoring, GitOps, and incident response).
Enterprise cluster architecture
Node pool design
- System node pool — dedicated to system pods (CoreDNS, metrics-server, konnectivity-agent). Minimum 3 nodes across availability zones. Taint with CriticalAddonsOnly=true:NoSchedule. Recommended size: Standard_D4s_v5 (4 vCPU, 16 GB RAM).
- Application node pool(s) — run user workloads. Create separate node pools for different workload profiles: general compute (D-series), memory-optimized (E-series), and GPU (N-series). Enable Cluster Autoscaler with min/max node counts.
- Spot node pool — for fault-tolerant workloads (batch processing, CI/CD build agents). Spot pricing saves 60–90% on compute. Spot nodes can be evicted with 30-second notice — only schedule workloads that handle interruptions gracefully.
- Availability zones — spread node pools across 3 availability zones for high availability. Combined with pod topology spread constraints, workloads survive a full zone failure.
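The zone-spread and Spot guidance above can be sketched as a manifest generator. This is a minimal illustration, not EPC Group's template: the function name, the app name, and the image reference are hypothetical. The toleration matches the `kubernetes.azure.com/scalesetpriority=spot:NoSchedule` taint that AKS places on Spot node pools, and the output is JSON, which `kubectl apply -f` accepts alongside YAML.

```python
import json


def zone_spread_deployment(name, image, replicas=3, spot=False):
    """Build a Deployment that spreads replicas evenly across availability zones."""
    labels = {"app": name}
    pod_spec = {
        "containers": [{"name": name, "image": image}],
        # Keep the per-zone replica counts within 1 of each other.
        "topologySpreadConstraints": [{
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": labels},
        }],
    }
    if spot:
        # Allow scheduling onto Spot nodes, which carry this taint on AKS.
        pod_spec["tolerations"] = [{
            "key": "kubernetes.azure.com/scalesetpriority",
            "operator": "Equal",
            "value": "spot",
            "effect": "NoSchedule",
        }]
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {"metadata": {"labels": labels}, "spec": pod_spec},
        },
    }


if __name__ == "__main__":
    # Hypothetical workload name and image, for illustration only.
    manifest = zone_spread_deployment(
        "batch-worker", "example.azurecr.io/batch-worker:1.0", spot=True
    )
    print(json.dumps(manifest, indent=2))
```

A toleration only permits scheduling on Spot nodes; to require it, a node affinity or node selector on the same label would also be needed.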
Networking: CNI, Kubenet, and CNI Overlay
AKS networking determines how pods communicate with each other, with Azure services, and with external networks. The networking plugin choice impacts IP address planning, security policy enforcement, and cluster scalability.
- Azure CNI — assigns real VNet IP addresses to each pod. Enables direct pod-to-VNet communication without NAT. Use for production enterprise clusters needing Windows node pools, direct pod connectivity to Azure services via private endpoints, or advanced networking features.
- Kubenet — uses a NAT overlay network where pods get IPs from a secondary CIDR range. Use only for simple dev/test clusters or when VNet IP address space is extremely limited.
- Azure CNI Overlay (recommended) — provides CNI benefits with efficient IP address management using an overlay network. EPC Group recommends Azure CNI Overlay for all new enterprise clusters.
Note: With traditional Azure CNI, a 100-node cluster running 30 pods per node requires 3,000 VNet IP addresses for pods alone. CNI Overlay solves this by assigning pods IPs from a separate overlay CIDR that doesn't consume VNet address space.
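The arithmetic in the note can be worked through as a small sizing helper. It is a rough sketch under stated assumptions: one VNet IP per node, the five addresses Azure reserves in every subnet, and no headroom for upgrade surge nodes or future growth. The function names are illustrative.

```python
AZURE_RESERVED_IPS_PER_SUBNET = 5  # Azure reserves five addresses in each subnet


def vnet_ips_needed(nodes, max_pods_per_node, overlay):
    """VNet addresses consumed by nodes, plus pods unless an overlay CIDR is used."""
    pod_ips = 0 if overlay else nodes * max_pods_per_node
    return nodes + pod_ips


def smallest_prefix(required_ips):
    """Longest subnet prefix (smallest subnet) that fits the required addresses."""
    total = required_ips + AZURE_RESERVED_IPS_PER_SUBNET
    host_bits = (total - 1).bit_length()  # integer ceil(log2(total))
    return 32 - host_bits


if __name__ == "__main__":
    for overlay in (False, True):
        ips = vnet_ips_needed(nodes=100, max_pods_per_node=30, overlay=overlay)
        mode = "CNI Overlay" if overlay else "traditional Azure CNI"
        print(f"{mode}: {ips} VNet IPs, subnet /{smallest_prefix(ips)} or larger")
    # traditional Azure CNI: 3100 VNet IPs, subnet /20 or larger
    # CNI Overlay: 100 VNet IPs, subnet /25 or larger
```

For the 100-node, 30-pods-per-node example, the pod addresses alone account for 3,000 of the 3,100 VNet IPs in the traditional model.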
Security hardening
Production AKS security requires defense-in-depth across the cluster, node, pod, and container layers.
Cluster security
- Private cluster — API server accessible only via private endpoint (no public IP).
- Authorized IP ranges — if public API server is required, restrict to known corporate IP ranges.
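- Node OS auto-upgrade — set the node OS upgrade channel (for example, NodeImage or SecurityPatch) so patched node images roll out automatically. AKS Automatic clusters come with automatic upgrades preconfigured.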
- Kubernetes version — use the latest stable N-1 version (not bleeding edge, not outdated).
- Azure Policy — apply "Kubernetes cluster pod security restricted standards" built-in initiative.
Node security
- Node image — use AKS Ubuntu or Azure Linux (Mariner) with FIPS-enabled images for compliance.
- SSH access — disable SSH to nodes in production. Use Azure Bastion + kubectl for troubleshooting.
- Node OS disk encryption — enable host-based encryption for OS and temp disks (EncryptionAtHost).
- Confidential VMs — for sensitive workloads, use DCasv5/ECasv5 confidential VMs with SEV-SNP.
Pod security
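- Pod Security Standards — enforce the "restricted" profile (non-root, no privilege escalation, all capabilities dropped, RuntimeDefault seccomp). A read-only root filesystem is not part of the profile but is worth adding as extra hardening.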
- Network policies — default deny all ingress/egress; explicitly allow required communication paths.
- Resource limits — set CPU and memory requests/limits on all pods to prevent noisy neighbors.
- Service mesh — use Istio or Linkerd for mTLS between services, traffic management, and observability.
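The first three pod security controls can be sketched as manifest fragments. This is an illustrative example, assuming a network policy engine is enabled on the cluster. The function names and the default CPU and memory values are placeholders, not recommendations. The container settings cover non-root, read-only root filesystem, and no privilege escalation, plus dropped capabilities and the RuntimeDefault seccomp profile.

```python
import json


def default_deny_policy(namespace):
    """NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector matches all pods; listing both policy types
        # with no rules denies all ingress and all egress.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def hardened_container(name, image, cpu="250m", memory="256Mi"):
    """Container spec with a locked-down securityContext and explicit resources."""
    return {
        "name": name,
        "image": image,
        "securityContext": {
            "runAsNonRoot": True,
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        # Placeholder values: size requests and limits from observed usage.
        "resources": {
            "requests": {"cpu": cpu, "memory": memory},
            "limits": {"cpu": cpu, "memory": memory},
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny_policy("payments"), indent=2))
    print(json.dumps(hardened_container("api", "example.azurecr.io/api:1.0"), indent=2))
```

With default deny in place, each required path (including DNS egress to CoreDNS) needs its own allow policy.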
Container security
- Image scanning — Microsoft Defender for Containers scans ACR images for CVEs on push and continuously.
- Image provenance — enable Notation (Notary v2) for container image signing and verification.
- Base images — use Microsoft-maintained base images (mcr.microsoft.com) with regular updates.
- ACR tasks — automate base image rebuild when upstream images are updated.
Identity: Entra ID and Workload Identity
Cluster operator authentication
- Enable AKS-managed Entra ID integration. Operators download a kubeconfig with az aks get-credentials and then sign in with their Entra ID identity when they run kubectl (through the kubelogin plugin). MFA and Conditional Access policies apply to cluster access.
- Map Entra ID groups to Kubernetes ClusterRoles and Roles. Example: "AKS-Cluster-Admins" bound to cluster-admin, "AKS-App-Developers" bound to a namespace-scoped Role.
- Use Entra ID PIM for the AKS admin group. Operators activate admin membership on demand with time-limited access, approval workflow, and justification.
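The group-to-role mapping described above might look like the following sketch. With AKS-managed Entra ID integration, a group subject is identified by the group's object ID rather than its display name. The all-zero GUIDs, the namespace, and the role name `app-developer` are hypothetical placeholders.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def entra_group_binding(group_object_id, role, namespace=None):
    """Bind an Entra ID group to a ClusterRole (no namespace) or a namespaced Role."""
    cluster_wide = namespace is None
    binding = {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "ClusterRoleBinding" if cluster_wide else "RoleBinding",
        "metadata": {"name": f"{role}-{group_object_id[:8]}"},
        "roleRef": {
            "apiGroup": RBAC_API_GROUP,
            "kind": "ClusterRole" if cluster_wide else "Role",
            "name": role,
        },
        "subjects": [{
            "apiGroup": RBAC_API_GROUP,
            "kind": "Group",
            "name": group_object_id,  # Entra ID group object ID
        }],
    }
    if not cluster_wide:
        binding["metadata"]["namespace"] = namespace
    return binding


if __name__ == "__main__":
    # Placeholder object IDs standing in for "AKS-Cluster-Admins" and "AKS-App-Developers".
    admins = entra_group_binding("00000000-0000-0000-0000-000000000001", "cluster-admin")
    devs = entra_group_binding(
        "00000000-0000-0000-0000-000000000002", "app-developer", namespace="team-a"
    )
    print(json.dumps([admins, devs], indent=2))
```

The namespaced binding assumes a Role named `app-developer` already exists in that namespace.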
Workload Identity (pod identity)
AKS Workload Identity is the recommended method for pods to authenticate to Azure services (SQL, Cosmos DB, Key Vault, Storage) without storing credentials.
- A Kubernetes service account is federated with an Entra ID managed identity.
- The workload presents its projected Kubernetes service account token to Entra ID, which validates it against the cluster's OIDC issuer and returns an Entra ID token.
- The pod receives an Azure access token without any stored secrets.
- Use Azure Key Vault with the Secrets Store CSI Driver: Workload Identity authenticates to Key Vault, and the CSI driver mounts secrets as files in the pod. Avoid keeping sensitive values in plain Kubernetes Secret objects; their data is base64-encoded, not encrypted, so anyone who can read the object can decode it.
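The Kubernetes side of the Workload Identity wiring can be sketched as follows. It assumes the OIDC issuer and Workload Identity are already enabled on the cluster and that a federated credential exists on the managed identity. The namespace, service account name, client ID, and image are hypothetical placeholders.

```python
import json


def federated_subject(namespace, service_account):
    """Subject string the federated credential on the managed identity must match."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def workload_identity_manifests(namespace, service_account, client_id, image):
    """ServiceAccount annotated with the identity's client ID, and a pod opted in by label."""
    sa = {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": service_account,
            "namespace": namespace,
            "annotations": {"azure.workload.identity/client-id": client_id},
        },
    }
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{service_account}-demo",
            "namespace": namespace,
            # This label opts the pod in to token projection by the webhook.
            "labels": {"azure.workload.identity/use": "true"},
        },
        "spec": {
            "serviceAccountName": service_account,
            "containers": [{"name": "app", "image": image}],
        },
    }
    return sa, pod


if __name__ == "__main__":
    sa, pod = workload_identity_manifests(
        "orders", "orders-sa",
        "00000000-0000-0000-0000-000000000000",  # placeholder client ID
        "example.azurecr.io/orders:1.0",         # placeholder image
    )
    print(federated_subject("orders", "orders-sa"))
    print(json.dumps([sa, pod], indent=2))
```

No secret appears in either manifest; the client ID is an identifier, not a credential.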
Autoscaling strategies
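AKS provides three autoscaling levels (HPA, VPA, and Cluster Autoscaler), with KEDA as an event-driven extension of HPA. For production, EPC Group recommends combining HPA with Cluster Autoscaler. Avoid letting HPA and VPA act on the same CPU or memory metric for one workload, because they will work against each other.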
- Horizontal Pod Autoscaler (HPA) — scales pod replicas based on CPU, memory, or custom metrics.
- Vertical Pod Autoscaler (VPA) — adjusts pod resource requests and limits based on historical usage.
- Cluster Autoscaler — adjusts the number of worker nodes based on pending pod scheduling requests. When pods can't be scheduled because nodes are full, Cluster Autoscaler adds nodes.
- KEDA (Kubernetes Event-Driven Autoscaling) — extends HPA with event-driven triggers (Azure Service Bus queue depth, HTTP request rate) for more responsive scaling.
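To make the HPA behavior concrete, the scaling rule documented for Kubernetes is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below is a simplified model of that rule: it ignores stabilization windows, pod readiness, and multiple metrics, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA rule: scale by the metric ratio, then clamp to min/max."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: leave the replica count alone.
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 4 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60, min_replicas=2, max_replicas=10))  # 6
    # 62% against a 60% target is inside the tolerance band.
    print(desired_replicas(4, 62, 60, min_replicas=2, max_replicas=10))  # 4
```

If the extra replicas do not fit on existing nodes, they stay Pending, which is the signal Cluster Autoscaler reacts to.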
GitOps with Flux
GitOps uses Git as the single source of truth for Kubernetes cluster configuration. AKS natively supports Flux v2 through the AKS GitOps extension. Flux continuously reconciles cluster state with the desired state defined in Git.
- No more kubectl apply from CI/CD pipelines.
- Complete audit trail via Git history.
- Easy rollback — git revert restores a previous state.
- No manual changes to the cluster (enforced by policy).
Never store plaintext secrets in Git. Either encrypt them with SOPS (originally a Mozilla project) using a key held in Azure Key Vault, so that only ciphertext is committed, or use the External Secrets Operator to sync secrets from Azure Key Vault directly into Kubernetes.
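The reconciliation idea behind GitOps can be illustrated with a toy loop. This is a conceptual sketch, not how Flux is implemented: the desired state stands in for manifests read from Git, the actual state stands in for what the cluster reports, and all names are hypothetical.

```python
def reconcile(desired, actual, prune=True):
    """Return the actions needed to make `actual` match `desired`.

    Both arguments map a resource name to its spec.
    """
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            # Covers both a new commit and a manual change made in the cluster.
            actions.append(("update", name))
    if prune:
        for name in sorted(actual):
            if name not in desired:
                actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    git_state = {"web": {"image": "web:2.0"}, "worker": {"image": "worker:1.4"}}
    cluster_state = {"web": {"image": "web:1.9"}, "debug-pod": {"image": "busybox"}}
    for action in reconcile(git_state, cluster_state):
        print(action)
    # ('update', 'web')
    # ('create', 'worker')
    # ('delete', 'debug-pod')
```

Because the loop always converges on what Git says, reverting a commit and waiting for the next reconciliation is enough to roll back.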
Cost optimization
- Right-size nodes — monitor actual CPU and memory utilization. If nodes average 30% utilization, switch to smaller VM SKUs.
- Spot node pools — add Spot node pools for fault-tolerant workloads. 60–90% savings on compute.
- Cluster Autoscaler tuning — set scale-down utilization threshold to 65% (default is 50%) for more aggressive node removal. Reduce scale-down delay after add to 5 minutes (default 10) for faster cost recovery.
- Azure Reservations — purchase 1-year or 3-year reservations for baseline node VMs (your minimum node count). Savings: 30–60% vs. pay-as-you-go.
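- Dev/test cluster scheduling — use the start/stop cluster feature to shut down non-production AKS clusters outside business hours. A cluster that runs only during business hours can save 60% or more of its compute cost; storage such as managed disks is still billed while the cluster is stopped.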
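Two of the levers above reduce to simple arithmetic, sketched here under stated assumptions. The savings figure counts compute hours only and assumes a fixed weekly schedule. The scale-down check is a simplification that looks at CPU requests alone, and the 4000m allocatable figure is a round illustrative number rather than a measured value.

```python
HOURS_PER_WEEK = 24 * 7


def start_stop_compute_savings(hours_per_day, days_per_week):
    """Fraction of weekly compute hours avoided by stopping the cluster off-hours."""
    running = hours_per_day * days_per_week
    return 1 - running / HOURS_PER_WEEK


def is_scale_down_candidate(requested_cpu_m, allocatable_cpu_m, threshold=0.5):
    """True when requested CPU falls below the scale-down utilization threshold."""
    return requested_cpu_m / allocatable_cpu_m < threshold


if __name__ == "__main__":
    # A cluster running 10 hours a day, 5 days a week, is off about 70% of the week.
    print(f"{start_stop_compute_savings(10, 5):.0%}")
    # A node with 2200m requested out of 4000m allocatable is 55% utilized.
    print(is_scale_down_candidate(2200, 4000, threshold=0.50))  # False
    print(is_scale_down_candidate(2200, 4000, threshold=0.65))  # True
```

The second pair of results shows why raising the threshold to 65% removes nodes more aggressively: the same node becomes eligible for removal.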
Frequently asked questions
What is Azure Kubernetes Service (AKS)?
AKS is a managed Kubernetes platform on Azure. Microsoft manages the control plane (API server, etcd, controller manager, scheduler), with no control plane charge on the Free tier. You pay for worker node VMs and associated resources (storage, networking, load balancers).
AKS supports Kubernetes versions within the N-2 support window, provides Entra ID authentication, Azure Monitor Container Insights, and native integration with Azure Container Registry, Azure Key Vault, and Azure Policy.
How much does AKS cost?
The AKS control plane is free on the Free tier; the Standard tier, which adds the uptime SLA, is billed per cluster per hour. Beyond that, you pay for worker node VMs. A typical 3-node production cluster using Standard_D4s_v5 VMs (4 vCPU, 16 GB RAM) costs approximately $400–$600/month.
With 1-year Azure Reserved Instances, this drops to $250–$400/month. Add $100–$200/month for networking and storage. Enterprise clusters with autoscaling and multiple node pools typically cost $2,000–$10,000/month.
Should I use Azure CNI or Kubenet networking?
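EPC Group recommends Azure CNI Overlay for all new enterprise clusters. Nodes get VNet IP addresses while pods draw from a separate overlay CIDR, so the cluster keeps Azure CNI features such as network policies and connectivity to private endpoints without consuming a VNet IP per pod. The trade-off is that pod traffic leaving the cluster is SNATed to the node IP, so pods are not directly addressable from the VNet.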
Use Kubenet only for simple dev/test clusters or when VNet IP address space is extremely limited. Avoid traditional Azure CNI for large clusters — a 100-node cluster with 30 pods per node requires 3,000 VNet IP addresses.
How do I secure AKS for production workloads?
Production AKS security requires defense-in-depth across four layers: cluster (private API server, Azure Policy), node (FIPS-enabled images, no SSH, disk encryption), pod (restricted Pod Security Standards, network policies, resource limits), and container (ACR image scanning, signed images, Microsoft-maintained base images). EPC Group deploys all production clusters with CIS Kubernetes Benchmark compliance validation.
What is GitOps for AKS and how does it work?
GitOps declares cluster configuration in Git, and an in-cluster agent (Flux v2) continuously reconciles the actual cluster state to match. When a developer merges a pull request updating a Kubernetes manifest or Helm chart, Flux automatically detects the change and applies it.
This eliminates kubectl apply from CI/CD pipelines, provides a Git audit trail, and makes rollback a git revert operation. AKS natively supports Flux v2 through the microsoft.flux cluster extension, which is managed by the Microsoft.KubernetesConfiguration resource provider.
Schedule a consultation
EPC Group has completed 10,000+ implementations across Azure, Power BI, Microsoft Fabric, SharePoint, and Copilot. Talk to an Azure architect about your AKS deployment. Call (888) 381-9725 or request a discovery call.
Frequently Asked Questions
How does AKS autoscaling work?
AKS provides three levels of autoscaling: Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on CPU, memory, or custom metrics. Vertical Pod Autoscaler (VPA) adjusts pod resource requests and limits based on historical usage. Cluster Autoscaler adjusts the number of worker nodes based on pending pod scheduling requests. For production, EPC Group recommends HPA combined with Cluster Autoscaler — HPA scales pods to handle increased traffic, and when pods cannot be scheduled due to insufficient node resources, Cluster Autoscaler adds nodes to the node pool. KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with event-driven triggers (Azure Service Bus queue depth, HTTP request rate) for more responsive scaling.
