<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://anuragkhuntia.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://anuragkhuntia.github.io/" rel="alternate" type="text/html" /><updated>2025-10-15T07:02:12+00:00</updated><id>https://anuragkhuntia.github.io/feed.xml</id><title type="html">Anurag Khuntia</title><subtitle>Personal blog on networking, Kubernetes, and systems.</subtitle><author><name>Anurag Khuntia</name></author><entry><title type="html">Cosmolet: Technical Deep Dive</title><link href="https://anuragkhuntia.github.io/Cosmolet-deep-dive/" rel="alternate" type="text/html" title="Cosmolet: Technical Deep Dive" /><published>2025-10-15T00:00:00+00:00</published><updated>2025-10-15T00:00:00+00:00</updated><id>https://anuragkhuntia.github.io/Cosmolet-deep-dive</id><content type="html" xml:base="https://anuragkhuntia.github.io/Cosmolet-deep-dive/"><![CDATA[<h1 id="cosmolet-deep-dive-into-kubernetes-bgp-service-advertisement">Cosmolet: Deep Dive into Kubernetes BGP Service Advertisement</h1>

<p>When running Kubernetes on bare metal, one of the biggest challenges is integrating cluster networking with the existing data center fabric.<br />
While cloud environments provide managed LoadBalancers, on-prem clusters typically depend on external devices, static routes, or CNI-specific BGP integrations that often break the abstraction between Kubernetes and the underlay.</p>

<p><strong>Cosmolet</strong> aims to bridge this gap.</p>

<p>It’s a lightweight, CNI-agnostic Kubernetes controller that automates the advertisement of Kubernetes Service IPs directly to your network using <strong>FRR (Free Range Routing)</strong>.<br />
Cosmolet runs as a <strong>DaemonSet</strong> across all cluster nodes, continuously monitoring the cluster state and syncing service reachability with the routing fabric — without the need for external proxies or vendor-specific solutions.</p>

<hr />

<h2 id="design-philosophy">Design Philosophy</h2>

<p>Cosmolet’s core principle is <strong>simplicity without compromise</strong>:</p>
<ul>
  <li>It should work in any Kubernetes environment — regardless of the CNI.</li>
  <li>It should integrate naturally with standard BGP deployments.</li>
  <li>It should remain transparent, observable, and secure.</li>
</ul>

<p>This philosophy results in an agent that directly interacts with the <strong>Linux networking stack</strong> and <strong>FRR</strong>, making BGP-based service advertisement both predictable and portable.</p>

<hr />

<h2 id="operating-modes">Operating Modes</h2>

<p>Cosmolet operates in <strong>two distinct modes</strong> — <code class="language-plaintext highlighter-rouge">Connected</code> and <code class="language-plaintext highlighter-rouge">Dynamic</code> — to accommodate different FRR configurations and deployment styles.<br />
It automatically determines the mode at runtime based on FRR’s configuration parameters.</p>

<h3 id="connected-mode">Connected Mode</h3>
<p>If FRR is configured with <code class="language-plaintext highlighter-rouge">bgpd distributed connected</code>,
then Cosmolet runs in <strong>Connected Mode</strong>.<br />
In this mode:</p>
<ul>
  <li>FRR automatically advertises all locally connected routes.</li>
  <li>Cosmolet’s responsibility is to <strong>synchronize service IPs with the loopback interface</strong>.</li>
  <li>It adds or removes <code class="language-plaintext highlighter-rouge">/32</code> (or <code class="language-plaintext highlighter-rouge">/128</code> for IPv6) addresses on the loopback, one per Kubernetes service IP, which FRR then sees as connected routes.</li>
  <li>No direct BGP configuration commands are issued — FRR handles advertisement natively.</li>
</ul>

<p>Connected mode is ideal for modern, distributed FRR setups commonly found in large-scale fabrics (e.g., Cumulus, SONiC, or containerized FRR instances).</p>

<h3 id="dynamic-mode">Dynamic Mode</h3>
<p>If FRR does <strong>not</strong> have the <code class="language-plaintext highlighter-rouge">distributed connected</code> parameter, Cosmolet switches to <strong>Dynamic Mode</strong>.<br />
In this mode:</p>
<ul>
  <li>Cosmolet takes direct control of BGP advertisement.</li>
  <li>It uses FRR’s <code class="language-plaintext highlighter-rouge">vtysh</code> CLI or API to add or remove <code class="language-plaintext highlighter-rouge">network</code> statements under the BGP configuration context.</li>
  <li>Each node independently advertises only those service IPs it’s responsible for.</li>
</ul>

<p>This mode ensures compatibility with simpler or traditional FRR setups where automatic route distribution isn’t configured.</p>
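
<p>The mode selection described above can be sketched as a small shell function. This is an illustration, not Cosmolet’s actual code: the function name <code class="language-plaintext highlighter-rouge">detect_mode</code> is hypothetical, and it assumes the mode can be inferred by searching the FRR running configuration for the marker string quoted in this post.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: reads an FRR running configuration on stdin and
# prints the mode that would be selected under the rule described above.
# The marker string is taken from this post; adjust it to your FRR setup.
detect_mode() {
  if grep -q "distributed connected"; then
    echo "connected"
  else
    echo "dynamic"
  fi
}
</code></pre></div></div>

<p>On a node with FRR installed, the running configuration could be piped in with <code class="language-plaintext highlighter-rouge">vtysh -c "show running-config" | detect_mode</code>.</p>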

<hr />

<h2 id="service-discovery-and-health-validation">Service Discovery and Health Validation</h2>

<p>Cosmolet continuously watches Kubernetes <code class="language-plaintext highlighter-rouge">Service</code> and <code class="language-plaintext highlighter-rouge">Endpoints</code> objects using native client-go informers.<br />
For every eligible service (<code class="language-plaintext highlighter-rouge">LoadBalancer</code> or <code class="language-plaintext highlighter-rouge">ClusterIP</code>), it checks whether:</p>
<ul>
  <li>The service has healthy endpoints.</li>
  <li>The service IP is assigned or reachable on the node.</li>
</ul>

<p>If both conditions are met, the service IP is added to the loopback and advertised to the fabric.</p>

<p>If all endpoints of a service become unhealthy, Cosmolet <strong>withdraws the advertisement</strong> — preventing blackhole routes and ensuring that BGP reflects real application availability.</p>

<p>This <strong>health-aware advertisement</strong> model forms the foundation of reliable BGP-based service publishing in Kubernetes.</p>
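
<p>The advertise-or-withdraw decision can be summarised as a predicate. The following is a minimal sketch, assuming the caller already knows the service type and the number of ready endpoints. The name <code class="language-plaintext highlighter-rouge">should_advertise</code> is hypothetical, and the check that the service IP is assigned on the node is left out for brevity.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical predicate: succeeds (exit status 0) when a service should be
# advertised, fails otherwise.
# $1 = service type, $2 = number of ready endpoints.
should_advertise() (
  svc_type=$1
  ready_endpoints=$2
  case "$svc_type" in
    LoadBalancer|ClusterIP) ;;   # eligible service types named in this post
    *) exit 1 ;;                 # anything else is never advertised
  esac
  # Advertise only while at least one endpoint is ready.
  [ "$ready_endpoints" -gt 0 ]
)

if should_advertise LoadBalancer 2; then echo "advertise"; else echo "withdraw"; fi
if should_advertise ClusterIP 0; then echo "advertise"; else echo "withdraw"; fi
</code></pre></div></div>

<p>The second call prints <code class="language-plaintext highlighter-rouge">withdraw</code>: with zero ready endpoints the route is pulled rather than left to blackhole traffic.</p>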

<hr />

<h2 id="loopback-ip-synchronization">Loopback IP Synchronization</h2>

<p>Loopback management is central to Cosmolet’s operation.</p>

<p>For each service IP that should be advertised, the controller runs:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip addr add 10.30.21.232/32 dev lo 
</code></pre></div></div>
<p>and when no longer valid:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ip addr del 10.30.21.232/32 dev lo 
</code></pre></div></div>
<p>Cosmolet automatically reconciles the loopback state on every iteration:</p>

<ul>
  <li>Active IPs → retained</li>
  <li>Stale IPs → removed</li>
  <li>Excluded IPs → preserved (based on config.yaml)</li>
</ul>

<p>This ensures that the node’s loopback interface always reflects the real set of advertised service IPs.</p>
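
<p>One reconciliation pass can be sketched as a pure planning step that only prints the <code class="language-plaintext highlighter-rouge">ip</code> commands it would run. The function name <code class="language-plaintext highlighter-rouge">plan_loopback</code> and its three space-separated arguments are assumptions made for this example, and it handles IPv4 <code class="language-plaintext highlighter-rouge">/32</code> addresses only.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: prints the ip(8) commands needed to move the loopback
# from its current set of service IPs to the desired set.
# $1 = IPs currently on lo
# $2 = IPs that should be advertised
# $3 = excluded IPs that must never be touched
plan_loopback() (
  current=$1
  desired=$2
  excluded=$3
  for ip in $desired; do
    case " $current " in
      *" $ip "*) ;;                              # already present: retain
      *) echo "ip addr add $ip/32 dev lo" ;;     # missing: add
    esac
  done
  for ip in $current; do
    case " $desired $excluded " in
      *" $ip "*) ;;                              # active or excluded: keep
      *) echo "ip addr del $ip/32 dev lo" ;;     # stale: remove
    esac
  done
)

plan_loopback "10.30.21.232 10.30.21.5" "10.30.21.232 10.30.21.240" ""
</code></pre></div></div>

<p>Printing the plan instead of executing it keeps the sketch safe to run on any machine; applying the commands for real requires <code class="language-plaintext highlighter-rouge">NET_ADMIN</code>.</p>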

<hr />
<h2 id="frr-integration-and-control-flow">FRR Integration and Control Flow</h2>

<p>FRR (Free Range Routing) acts as the BGP stack beneath Cosmolet.
Cosmolet communicates with FRR in one of two ways:</p>

<p>Dynamic Mode:
Direct vtysh command execution:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vtysh <span class="nt">-c</span> <span class="s2">"configure terminal"</span> <span class="se">\</span>
      <span class="nt">-c</span> <span class="s2">"router bgp 65001"</span> <span class="se">\</span>
      <span class="nt">-c</span> <span class="s2">"network 10.30.21.232/32"</span>
</code></pre></div></div>
<p>or its removal:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vtysh <span class="nt">-c</span> <span class="s2">"configure terminal"</span> <span class="se">\</span>
      <span class="nt">-c</span> <span class="s2">"router bgp 65001"</span> <span class="se">\</span>
      <span class="nt">-c</span> <span class="s2">"no network 10.30.21.232/32"</span>
</code></pre></div></div>
<p>Connected Mode:
FRR automatically detects connected loopback IPs, so no direct CLI execution is needed. Cosmolet’s role ends after adding the IP to <code class="language-plaintext highlighter-rouge">lo</code>.</p>

<p>This modular integration ensures that Cosmolet can adapt to both fully distributed and standalone FRR deployments, without code or configuration changes.</p>
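
<p>For Dynamic Mode, the two <code class="language-plaintext highlighter-rouge">vtysh</code> invocations above differ only in the final statement. Below is a hedged sketch of a helper that builds the command line for either case. The name <code class="language-plaintext highlighter-rouge">bgp_network_cmd</code> is hypothetical, and the helper prints the command rather than running it, so it can be tried without FRR installed.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: prints the vtysh command that advertises or
# withdraws one prefix. $1 = advertise|withdraw, $2 = local ASN, $3 = prefix.
bgp_network_cmd() (
  action=$1
  asn=$2
  prefix=$3
  case "$action" in
    advertise) stmt="network $prefix" ;;
    withdraw)  stmt="no network $prefix" ;;
    *) exit 2 ;;                          # unknown action: fail without output
  esac
  printf 'vtysh -c "configure terminal" -c "router bgp %s" -c "%s"\n' "$asn" "$stmt"
)

bgp_network_cmd advertise 65001 10.30.21.232/32
bgp_network_cmd withdraw 65001 10.30.21.232/32
</code></pre></div></div>

<p>Keeping command construction separate from execution makes the advertise and withdraw paths easy to log and to test.</p>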

<hr />
<h2 id="high-availability-and-scalability">High Availability and Scalability</h2>

<p>Cosmolet runs as a DaemonSet — ensuring that every node independently manages its local advertisements.
This design offers:</p>

<ul>
  <li>Horizontal scalability: Each node runs autonomously.</li>
  <li>No single point of failure: BGP advertisements are localized.</li>
  <li>Leader election support (optional): Used for centralized coordination or cluster-wide statistics.</li>
</ul>

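<p>This decentralized architecture matches Kubernetes’ fault-tolerance model. If a node fails, its BGP sessions go down and its peers remove the routes it advertised; if only a service’s endpoints fail, Cosmolet withdraws that service’s advertisement while the node stays up.</p>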

<hr />
<h2 id="security-and-privileges">Security and Privileges</h2>

<p>Cosmolet follows a minimal-privilege principle:</p>

<ul>
  <li>Runs as a non-root container.</li>
  <li>Uses Linux capabilities (NET_ADMIN) only where required.</li>
  <li>No dependency on sudo or privileged mode.</li>
  <li>RBAC limited to read-only access on Service and Endpoints resources.</li>
</ul>

<p>This ensures compatibility with hardened cluster policies and secure multi-tenant setups.</p>

<hr />

<h2 id="deployment-approaches">Deployment Approaches</h2>

<p>Cosmolet supports multiple deployment methods:</p>

<ul>
  <li>Helm chart for production use with configurable BGP, FRR, and namespace settings.</li>
  <li>YAML manifests for quick testing or custom integration.</li>
  <li>Custom DaemonSet templates for embedding within existing network operators.</li>
</ul>

<p>Since it always runs as a DaemonSet, deployment is per-node and aligns with FRR’s operational model.</p>

<hr />

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>Cosmolet simplifies one of the most complex aspects of bare-metal Kubernetes — integrating service IP routing with real BGP networks.
Its two-mode architecture ensures that it can adapt to both modern and legacy FRR topologies, making it an effective choice for clusters that span racks, fabrics, or hybrid environments.</p>

<p>By bringing BGP awareness natively into the Kubernetes control plane, Cosmolet enables operators to run cloud-like networking on bare metal — predictably, efficiently, and transparently.</p>]]></content><author><name>Anurag Khuntia</name></author><category term="Kubernetes" /><category term="Networking" /><category term="Kubernetes" /><category term="BGP" /><category term="FRR" /><category term="Bare-Metal" /><category term="LoadBalancer" /><category term="On-Prem" /><summary type="html"><![CDATA[Exploring automated BGP service IP advertisement for on-prem Kubernetes clusters using FRR.]]></summary></entry><entry><title type="html">Cosmolet: Dynamic BGP Service Advertisement for Bare-Metal Kubernetes</title><link href="https://anuragkhuntia.github.io/Cosmolet/" rel="alternate" type="text/html" title="Cosmolet: Dynamic BGP Service Advertisement for Bare-Metal Kubernetes" /><published>2025-10-14T00:00:00+00:00</published><updated>2025-10-14T00:00:00+00:00</updated><id>https://anuragkhuntia.github.io/Cosmolet</id><content type="html" xml:base="https://anuragkhuntia.github.io/Cosmolet/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>Kubernetes has transformed how we orchestrate containerized workloads. However, running Kubernetes on bare-metal clusters exposes networking challenges absent in cloud environments. Unlike managed cloud LoadBalancers, bare-metal setups often require manual route configuration, external proxies, or appliances to expose services externally.</p>

<p><strong>Cosmolet</strong> is an open-source Kubernetes controller designed to solve this problem. By integrating with <strong>FRR (Free Range Routing)</strong>, Cosmolet dynamically advertises service IPs over BGP, enabling bare-metal clusters to expose services to the network fabric automatically. It automates service discovery, loopback IP management, and BGP advertisement, ensuring healthy and available services are reachable without manual intervention.</p>

<hr />

<h2 id="challenges-in-bare-metal-kubernetes-networking">Challenges in Bare-Metal Kubernetes Networking</h2>

<p>Running Kubernetes on bare-metal introduces several unique hurdles:</p>

<ul>
  <li><strong>Manual Route Management:</strong> Administrators often configure static routes or modify external routers to make services reachable.</li>
  <li><strong>Service Reliability:</strong> Without health-aware routing, traffic may be sent to pods that are down, causing black-hole routes.</li>
  <li><strong>Overlay Dependency:</strong> Many solutions rely on overlays, adding latency and operational overhead.</li>
  <li><strong>Scalability:</strong> As nodes and services grow, managing route advertisements manually becomes error-prone and unsustainable.</li>
</ul>

<p>Cosmolet addresses these challenges by running a lightweight <strong>DaemonSet</strong> on each node to monitor services and pods, manage loopback IPs, and advertise IPs dynamically to the network via BGP.</p>

<hr />

<h2 id="core-features">Core Features</h2>

<p><strong>Automatic Service Discovery</strong><br />
Cosmolet continuously monitors Kubernetes services across namespaces. It detects new, updated, and deleted services automatically, ensuring the network reflects the cluster’s current state.</p>

<p><strong>BGP Advertisement</strong><br />
In <strong>dynamic mode</strong>, Cosmolet advertises service IPs via BGP using FRR, and withdraws them when services are unhealthy or inactive, preventing black-hole traffic.</p>

<p><strong>Health-Aware Routing</strong><br />
Cosmolet evaluates pod liveness probes before advertising service IPs. Only healthy services are announced to BGP peers, improving reliability.</p>

<p><strong>Node-Local Loopback Management</strong><br />
Service IPs are added to each node’s loopback interface. Stale or inactive IPs are removed automatically, keeping routing tables accurate.</p>

<p><strong>Observability</strong><br />
Cosmolet exposes a <code class="language-plaintext highlighter-rouge">/metrics</code> endpoint compatible with Prometheus, offering metrics such as loopback IP states, BGP advertisement status, and control loop timing. Logs provide insight into loopback management, pod health checks, and BGP operations.</p>

<p><strong>Security and RBAC</strong><br />
Cosmolet operates with a minimal-privilege Kubernetes service account. It requires only permissions to list pods, services, and nodes, adhering to the principle of least privilege.</p>

<hr />

<h2 id="how-cosmolet-works">How Cosmolet Works</h2>

<p>Cosmolet operates in a recurring loop on each node:</p>

<ol>
  <li><strong>Service Discovery:</strong> Queries the Kubernetes API for services in configured namespaces.</li>
  <li><strong>Pod Health Check:</strong> Filters for pods scheduled on the local node and verifies their liveness probes.</li>
  <li><strong>Loopback Management:</strong> Adds active service IPs to the node’s loopback interface and removes stale ones.</li>
  <li><strong>BGP Advertisement:</strong> Uses <code class="language-plaintext highlighter-rouge">vtysh</code> and FRR to advertise or withdraw service IPs to BGP peers.</li>
  <li><strong>Metrics Exposure:</strong> Updates Prometheus metrics for observability.</li>
  <li><strong>Logging:</strong> Provides detailed debug information, including active and removed IPs, health checks, and BGP operations.</li>
</ol>

<hr />

<h2 id="deployment-approaches">Deployment Approaches</h2>

<p>Cosmolet can be deployed in multiple ways depending on operational preference and cluster management style:</p>

<ul>
  <li><strong>DaemonSet on Kubernetes:</strong> Run Cosmolet as a DaemonSet so every node participates in service IP advertisement and loopback management.</li>
  <li><strong>Helm Charts:</strong> Package Cosmolet for reproducible, configurable deployment across clusters.</li>
  <li><strong>GitOps / Operator Models:</strong> Integrate Cosmolet with GitOps pipelines or Kubernetes operators for automated configuration and lifecycle management.</li>
</ul>

<p>This flexibility allows clusters of any size or topology to integrate seamlessly with existing FRR-based BGP fabrics.</p>

<hr />

<h2 id="operational-workflow">Operational Workflow</h2>

<ul>
  <li>Cosmolet identifies active service IPs.</li>
  <li>Updates the node’s loopback interface.</li>
  <li>Advertises IPs via BGP in dynamic mode.</li>
  <li>Removes stale IPs from loopback.</li>
  <li>Exposes metrics and logs for observability.</li>
</ul>

<hr />

<h2 id="benefits">Benefits</h2>

<ul>
  <li><strong>Automation:</strong> Reduces manual intervention in BGP advertisement.</li>
  <li><strong>Health-Aware Routing:</strong> Ensures only healthy services are advertised.</li>
  <li><strong>High Availability:</strong> Node-local loopbacks with dynamic advertisement maintain service reachability even during failures.</li>
  <li><strong>Observability:</strong> Prometheus metrics and detailed logs provide real-time insights.</li>
  <li><strong>Security:</strong> Operates with minimal privileges and RBAC.</li>
</ul>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Cosmolet bridges Kubernetes and bare-metal network fabrics by providing automated service discovery, loopback management, health-aware BGP advertisement, and observability. It simplifies networking, reduces operational complexity, and allows Kubernetes services to be exposed reliably and efficiently across on-premises networks.</p>

<p><strong>Explore Cosmolet:</strong> <a href="https://github.com/platformbuilds/cosmolet">GitHub / Documentation</a></p>]]></content><author><name>Anurag Khuntia</name></author><category term="Kubernetes" /><category term="Networking" /><category term="Kubernetes" /><category term="BGP" /><category term="FRR" /><category term="Bare-Metal" /><category term="LoadBalancer" /><category term="On-Prem" /><summary type="html"><![CDATA[Exploring automated BGP service IP advertisement for on-prem Kubernetes clusters using FRR.]]></summary></entry><entry><title type="html">Kubernetes Networking Across On-Prem Datacenters with BGP, ECMP, and BFD</title><link href="https://anuragkhuntia.github.io/k8s-bgp-ecmp-bfd/" rel="alternate" type="text/html" title="Kubernetes Networking Across On-Prem Datacenters with BGP, ECMP, and BFD" /><published>2025-09-21T00:00:00+00:00</published><updated>2025-09-21T00:00:00+00:00</updated><id>https://anuragkhuntia.github.io/k8s-bgp-ecmp-bfd</id><content type="html" xml:base="https://anuragkhuntia.github.io/k8s-bgp-ecmp-bfd/"><![CDATA[<h2 id="introduction">Introduction</h2>

<p>Kubernetes powers a vast array of mission-critical applications, many of which operate in on-premises datacenters spanning multiple sites. Unlike public cloud environments, which abstract networking complexity behind managed services, on-premises deployments expose the underlying challenges of routing Pods and Services efficiently across a distributed infrastructure. Designing a resilient, scalable, and observable network fabric is crucial to ensure predictable connectivity, high availability, and seamless failover behavior for workloads.</p>

<p>In this post, we look at how modern datacenter networking concepts (<strong>Clos fabrics</strong>, <strong>dynamic routing with BGP</strong>, <strong>ECMP load balancing</strong>, and <strong>BFD rapid failover</strong>) can be used to integrate Kubernetes clusters directly into enterprise network fabrics.</p>

<hr />

<h2 id="core-networking-challenges-in-on-prem-kubernetes">Core Networking Challenges in On-Prem Kubernetes</h2>

<p>On-premises Kubernetes networking presents unique challenges that require careful architectural consideration:</p>

<ul>
  <li>
    <p><strong>Scalable Routing</strong>: As the number of nodes, Pods, and Services grows, traditional Layer 2 overlays and static routing become bottlenecks. Without native Layer 3 routing visibility, scaling beyond a few hundred nodes is difficult.</p>
  </li>
  <li>
    <p><strong>Rapid Failover</strong>: Links or nodes can fail at any time. Slow detection and rerouting can render workloads unreachable, causing downtime for critical applications.</p>
  </li>
  <li>
    <p><strong>Multi-Datacenter Connectivity</strong>: Applications often span multiple datacenters. Maintaining consistent routing policies and ensuring Service reachability across sites are essential for high availability and global access.</p>
  </li>
  <li>
    <p><strong>Operational Simplicity</strong>: Managing IPs, routing tables, and point-to-point links in a large fabric can quickly become overwhelming. Automation and dynamic routing protocols are required to reduce human error and maintain consistency.</p>
  </li>
</ul>

<hr />

<h2 id="clos-fabrics-and-dynamic-routing">Clos Fabrics and Dynamic Routing</h2>

<p>A <strong>Clos network</strong>, commonly referred to as a <strong>spine-leaf architecture</strong>, forms the backbone of modern datacenter fabrics. It provides high bandwidth, predictable latency, and multiple equal-cost paths to enable resilient connectivity.</p>

<ul>
  <li><strong>Leaf Switches</strong> connect directly to servers, including Kubernetes nodes, and handle BGP peering with upstream layers or external networks. They forward traffic to spine switches for inter-leaf communication.</li>
  <li><strong>Spine Switches</strong> interconnect leaf switches, creating a high-speed backbone that ensures low latency, redundancy, and ECMP-enabled multiple-path traffic forwarding.</li>
  <li><strong>Border Leaf Switches</strong> act as gateways to external networks, connecting the datacenter to WANs, other datacenters, or the Internet.</li>
  <li>In very large-scale deployments, a <strong>Super Spine</strong> layer can be introduced above the spine layer to interconnect multiple spine blocks, further reducing oversubscription and improving multi-datacenter scalability.</li>
</ul>

<p>Dynamic routing protocols, particularly <strong>BGP</strong>, allow nodes and services to advertise their presence directly into the fabric, providing a scalable, loop-free routing topology. <strong>ECMP (Equal-Cost Multi-Path)</strong> spreads traffic across multiple spine-leaf paths, optimizing bandwidth utilization and providing redundancy.</p>

<hr />

<h2 id="scaling-control-planes-with-ipv6-bgp-unnumbered">Scaling Control Planes with IPv6 BGP Unnumbered</h2>

<p>Traditional IPv4 BGP deployments require dedicated subnets for each point-to-point link, which quickly consumes IP space in large fabrics.</p>

<p><strong>IPv6 BGP Unnumbered</strong> solves this problem:</p>

<ul>
  <li>Uses link-local IPv6 addresses for BGP session establishment.</li>
  <li>No need to allocate /31 or /30 subnets per link.</li>
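  <li>IPv4 routes for Pods and Services are still exchanged over these sessions, carried with IPv6 next hops (the extended next-hop encoding defined in RFC 8950).</li>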
</ul>

<p>This approach simplifies automation, reduces operational overhead, and ensures consistent routing without wasting precious IPv4 addresses.</p>

<p><strong>Key takeaway</strong>: IPv6 handles the control plane; IPv4 remains for workloads.</p>
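
<p>To make the address savings concrete, here is a back-of-the-envelope calculation in shell. It relies only on the fact that a <code class="language-plaintext highlighter-rouge">/31</code> holds 2 addresses and a <code class="language-plaintext highlighter-rouge">/30</code> holds 4. The helper name <code class="language-plaintext highlighter-rouge">p2p_addresses</code> and the 512-link fabric are made up for illustration.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: IPv4 addresses consumed by numbered point-to-point
# links. $1 = number of links, $2 = prefix length used per link (31 or 30).
p2p_addresses() (
  links=$1
  case "$2" in
    31) per_link=2 ;;   # a /31 contains 2 addresses
    30) per_link=4 ;;   # a /30 contains 4 addresses
    *) exit 2 ;;        # other prefix lengths are out of scope here
  esac
  echo $(( links * per_link ))
)

p2p_addresses 512 31   # prints 1024
p2p_addresses 512 30   # prints 2048
</code></pre></div></div>

<p>With unnumbered sessions, none of these link addresses need to be allocated or tracked.</p>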

<hr />

<h2 id="running-frr-on-kubernetes-nodes">Running FRR on Kubernetes Nodes</h2>

<p><strong>FRR (Free Range Routing)</strong> is an open-source routing software suite that provides implementations of standard routing protocols such as BGP, OSPF, RIP, and IS-IS. It allows network devices—including Linux servers, virtual machines, and Kubernetes nodes—to participate in IP routing just like traditional routers.</p>

<p>In simpler terms, FRR turns a regular machine into a fully capable router, enabling it to advertise, receive, and manage network routes dynamically. It’s widely used in data centers, cloud networking, and Kubernetes environments to integrate workloads directly into the network fabric, support <strong>ECMP load balancing</strong>, and enable fast failover with <strong>BFD</strong>.</p>

<p>Running FRR on nodes converts each Kubernetes worker into a <strong>mini-router</strong>:</p>

<ul>
  <li>Advertises <strong>Pod CIDRs</strong> and <strong>Service VIPs</strong> directly to the fabric.</li>
  <li>Supports <strong>BFD</strong> for rapid failure detection (&lt;1 second).</li>
  <li>Uses <strong>loopbacks</strong> as stable next-hops for ECMP.</li>
  <li>Avoids overlay encapsulation overhead, enabling native L3 routing visibility.</li>
</ul>

<p>This architecture makes workloads <strong>first-class citizens</strong> on the network: switches and routers see Pods and Services as native IP prefixes that they can route directly, not as opaque encapsulated traffic.</p>

<hr />

<h2 id="addressing-model">Addressing Model</h2>

<p>A clean addressing scheme simplifies routing and policy management:</p>

<ul>
  <li>
    <p><strong>Node Segment (10.10.19.x)</strong>: Each Kubernetes node is assigned a unique loopback IP that serves as its BGP router ID. For example, Node1 could have <code class="language-plaintext highlighter-rouge">10.10.19.10</code>. This address remains stable even if the physical interface changes.</p>
  </li>
  <li>
    <p><strong>Pod Segment (10.10.20.x)</strong>: Pods receive dynamic IPs from per-node CIDRs. For instance, a Pod on Node1 might get <code class="language-plaintext highlighter-rouge">10.10.20.5</code>. These IPs are advertised automatically by FRR, reducing manual configuration.</p>
  </li>
  <li>
    <p><strong>Service Segment (10.10.21.x)</strong>: ClusterIP or LoadBalancer VIPs are assigned <code class="language-plaintext highlighter-rouge">/32</code> routes and advertised to the fabric via node loopbacks. This ensures services are reachable across the network, enabling high availability and ECMP forwarding.</p>
  </li>
</ul>
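
<p>The three segments can be told apart with a simple prefix match. Here is a small sketch, assuming each segment is a <code class="language-plaintext highlighter-rouge">/24</code> as the <code class="language-plaintext highlighter-rouge">x</code> in the examples suggests; <code class="language-plaintext highlighter-rouge">segment_of</code> is a hypothetical helper.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: names the segment an IPv4 address belongs to,
# using the example ranges from this post.
segment_of() {
  case "$1" in
    10.10.19.*) echo "node" ;;
    10.10.20.*) echo "pod" ;;
    10.10.21.*) echo "service" ;;
    *)          echo "unknown" ;;
  esac
}

segment_of 10.10.19.10   # prints node
segment_of 10.10.21.66   # prints service
</code></pre></div></div>

<p>Keeping each role in its own range is what makes this kind of check, and the matching prefix filters on the switches, a one-line rule.</p>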

<hr />

<h2 id="sample-frr-configuration">Sample FRR Configuration</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>frr version 8.1
frr defaults traditional
<span class="nb">hostname </span>cp1
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config

interface eno5
 ipv6 nd ra-interval 6
 no ipv6 nd suppress-ra
<span class="nb">exit

</span>interface eno7
 ipv6 nd ra-interval 6
 no ipv6 nd suppress-ra
<span class="nb">exit

</span>interface lo
 ! Node loopback
 ip address 10.10.19.10/32
 ! Service VIPs
 ip address 10.10.21.66/32
 ip address 10.10.21.227/32
<span class="nb">exit

</span>router bgp 65496
 bgp router-id 10.10.19.10
 bgp bestpath as-path multipath-relax

 neighbor TOR peer-group
 neighbor TOR remote-as internal
 neighbor TOR bfd
 neighbor TOR timers 1 3

 neighbor eno5 interface peer-group TOR
 neighbor eno7 interface peer-group TOR

 bgp fast-convergence

 address-family ipv4 unicast
  network 10.10.19.10/32
  network 10.10.21.66/32
  network 10.10.21.227/32
  redistribute kernel
  redistribute connected
 exit-address-family
<span class="nb">exit</span>

</code></pre></div></div>

<h2 id="fabrics-overview">Fabrics Overview</h2>

<p>Modern Kubernetes deployments often involve multiple interconnected fabrics:</p>

<ul>
  <li>
    <p><strong>App Fabrics</strong>: Individual datacenter clusters advertise node, Pod, and Service routes into the Clos fabric. ECMP load-balances traffic, and BFD enables sub-second failover.</p>
  </li>
  <li>
    <p><strong>DCI Fabric</strong>: Interconnects App Fabrics across geographies, sharing routing information and enabling cross-datacenter service reachability.</p>
  </li>
  <li>
    <p><strong>Edge Fabric</strong>: Handles north-south traffic, enforces security policies, and routes external traffic to the correct internal services.</p>
  </li>
</ul>

<p><img src="/assets/images/kubernets_onprem_fabric.png" alt="Clos fabric with Leaf, Spine, Border Leaf, and Super Spine" /></p>

<h2 id="bgp-ecmp-and-bfd-in-action">BGP, ECMP, and BFD in Action</h2>

<h3 id="multi-bgp-neighbor-setup">Multi-BGP Neighbor Setup</h3>

<p>Each Kubernetes node typically peers with two upstream leaf switches. <strong>ECMP</strong> allows traffic to traverse both uplinks efficiently, while <strong>BFD</strong> ensures rapid failure detection.</p>

<ul>
  <li><strong>Service VIPs</strong> are advertised from node loopbacks and propagated across the spine-leaf fabric, DCI, and edge networks, providing seamless connectivity even during failures.</li>
  <li><strong>Pod IPs</strong> are redistributed automatically from kernel routes, allowing real-time updates without manual intervention.</li>
</ul>

<h3 id="service-advertisement-flow">Service Advertisement Flow</h3>

<ul>
  <li>Node advertises Service VIP to fabric via BGP.</li>
  <li>Leaf switches propagate the route across spines (ECMP paths).</li>
  <li>DCI fabric shares the route across datacenters.</li>
  <li>Edge fabric routes external traffic to the correct node.</li>
  <li>Failures are handled seamlessly, with remaining nodes continuing to advertise VIPs.</li>
</ul>

<h3 id="why-service-ips-are-added-to-loopback">Why Service IPs Are Added to Loopback</h3>

<ul>
  <li>Service IPs are virtual and unbound from physical interfaces.</li>
  <li>Assigning them to the node’s loopback places them on an interface that is always up, so they stay in the routing table regardless of the state of any physical link.</li>
  <li>Advertising these as /32 routes ensures accurate reachability.</li>
  <li>Supports HA and ECMP forwarding of service traffic across the fabric.</li>
</ul>

<h3 id="why-pod-ips-are-advertised-automatically">Why Pod IPs Are Advertised Automatically</h3>

<ul>
  <li>Pods obtain dynamic IPs within node-specific CIDRs.</li>
  <li>These IPs map to kernel-managed routes automatically.</li>
  <li>FRR redistributes kernel routes, advertising pod presence dynamically.</li>
  <li>Automation reduces manual config and promotes real-time fabric updates.</li>
</ul>

<h2 id="extended-theory">Extended Theory</h2>

<h3 id="bgp-benefits-in-k8s">BGP Benefits in K8s</h3>

<ul>
  <li><strong>Loop-free routing:</strong> BGP prevents routing loops even with multiple paths.</li>
  <li><strong>Scalability:</strong> Thousands of nodes and services can be advertised without massive L2 overlays.</li>
  <li><strong>Policy Control:</strong> Route-maps and filters can enforce service placement policies.</li>
</ul>

<h3 id="ecmp-considerations">ECMP Considerations</h3>

<ul>
  <li>Provides <strong>load distribution</strong> across multiple paths.</li>
  <li>Works best with <strong>stable loopback next-hops</strong>.</li>
  <li>May require tuning for <strong>hash algorithms</strong> to avoid uneven traffic flows.</li>
</ul>
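
<p>The idea behind ECMP hashing is that every packet of a flow maps to the same next hop. The sketch below imitates this with the POSIX <code class="language-plaintext highlighter-rouge">cksum</code> utility standing in for a switch’s hardware hash. The function name <code class="language-plaintext highlighter-rouge">ecmp_pick</code> and the comma-separated flow key are assumptions; real devices use their own hash functions and field selections.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: maps a flow key to one of N equal-cost paths.
# $1 = flow key, e.g. "src_ip,dst_ip,proto,src_port,dst_port"
# $2 = number of equal-cost next hops
ecmp_pick() (
  flow=$1
  paths=$2
  # cksum prints "CRC SIZE" for stdin input; keep only the CRC field.
  hash=$(printf '%s' "$flow" | cksum | cut -d ' ' -f 1)
  # The path index is the hash modulo the number of paths.
  echo $(( hash % paths ))
)

ecmp_pick "10.0.0.1,10.10.21.66,6,40000,443" 2
ecmp_pick "10.0.0.1,10.10.21.66,6,40000,443" 2   # same flow, same path index
</code></pre></div></div>

<p>Because the result depends only on the flow key, a single large flow always lands on one path, which is why a poor choice of hash fields can leave links unevenly loaded.</p>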

<h3 id="bfd-insights">BFD Insights</h3>

<ul>
  <li>Detects link or node failure within the configured detection time (interval × detect multiplier), typically <strong>well under a second</strong>.</li>
  <li>Reduces downtime by quickly withdrawing unreachable routes.</li>
  <li>Works alongside BGP to trigger failover without waiting for BGP timers.</li>
</ul>
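
<p>BFD’s detection time is roughly the product of the negotiated interval and the detect multiplier. Below is a small worked example, using 300 ms and a multiplier of 3 as illustrative values (they are not taken from this post’s configuration); <code class="language-plaintext highlighter-rouge">bfd_detect_ms</code> is a hypothetical helper.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: approximate BFD detection time in milliseconds.
# $1 = negotiated interval in ms, $2 = detect multiplier.
bfd_detect_ms() (
  interval_ms=$1
  multiplier=$2
  echo $(( interval_ms * multiplier ))
)

bfd_detect_ms 300 3   # prints 900
</code></pre></div></div>

<p>Compared with the 3-second BGP hold time set by <code class="language-plaintext highlighter-rouge">timers 1 3</code> in the sample configuration, a detection time of about 900 ms lets BFD bring the session down first.</p>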

<h2 id="best-practices">Best Practices</h2>

<ul>
  <li>Automate FRR deployment using tools like <strong>Ansible</strong>, <strong>Helm charts</strong>, or <strong>Kubernetes operators</strong>.</li>
  <li>Maintain a consistent <strong>IP addressing scheme</strong> across datacenters to avoid conflicts.</li>
  <li>Actively monitor <strong>BGP</strong> and <strong>BFD</strong> sessions for anomalies.</li>
  <li>Automate adding <strong>Service IPs into FRR</strong> using <strong><a href="https://github.com/platformbuilds/cosmolet">Cosmolet</a></strong> (an open-source project by PlatformCosmo).</li>
  <li>Secure routing with <strong>prefix filters</strong> and <strong>network policies</strong>.</li>
  <li>Regularly test <strong>failover scenarios</strong> to ensure resilience.</li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>Integrating Kubernetes clusters into <strong>Clos fabrics</strong> using <strong>FRR</strong>, <strong>BGP</strong>, <strong>ECMP</strong>, and <strong>BFD</strong> transforms on-premises networking. Pods and Services become natively routable, highly available, and globally reachable across datacenters.</p>

<p>Leveraging <strong>IPv6 BGP unnumbered</strong> reduces operational complexity, while <strong>BFD</strong> ensures sub-second failover.</p>

<p>This design delivers <strong>cloud-like networking behavior</strong> for on-prem deployments — allowing Kubernetes services to remain globally reachable, highly available, and scalable across fabrics.</p>]]></content><author><name>Anurag Khuntia</name></author><category term="Kubernetes" /><category term="Networking" /><category term="Kubernetes" /><category term="BGP" /><category term="ECMP" /><category term="BFD" /><category term="Datacenters" /><category term="On-Prem" /><summary type="html"><![CDATA[Exploring scalable Kubernetes networking for multi-site on-prem clusters.]]></summary></entry></feed>