Anurag Khuntia

Cosmolet: Technical Deep Dive

2025-10-15T00:00:00+00:00

Cosmolet: Deep Dive into Kubernetes BGP Service Advertisement

When running Kubernetes on bare-metal, one of the biggest challenges is integrating Kubernetes networking with the existing data center fabric.
While cloud environments provide managed LoadBalancers, on-prem clusters typically depend on external devices, static routes, or CNI-specific BGP integrations that often break the abstraction between Kubernetes and the underlay.

Cosmolet aims to bridge this gap.

It’s a lightweight, CNI-agnostic Kubernetes controller that automates the advertisement of Kubernetes Service IPs directly to your network using FRR (Free Range Routing).
Cosmolet runs as a DaemonSet across all cluster nodes, continuously monitoring the cluster state and syncing service reachability with the routing fabric — without the need for external proxies or vendor-specific solutions.

Design Philosophy

Cosmolet’s core principle is simplicity without compromise:

It should work in any Kubernetes environment — regardless of the CNI.
It should integrate naturally with standard BGP deployments.
It should remain transparent, observable, and secure.

This philosophy results in an agent that directly interacts with the Linux networking stack and FRR, making BGP-based service advertisement both predictable and portable.

Operating Modes

Cosmolet operates in two distinct modes — Connected and Dynamic — to accommodate different FRR configurations and deployment styles.
It automatically determines the mode at runtime based on FRR’s configuration parameters.

Connected Mode

If FRR's BGP configuration contains redistribute connected, Cosmolet runs in Connected Mode.
In this mode:

FRR automatically advertises all locally connected routes.
Cosmolet’s responsibility is to keep the loopback interface in sync with the current set of service IPs.
It adds or removes /32 (or /128 for IPv6) addresses on lo for the Kubernetes service IPs; each address appears as a connected host route that FRR can advertise.
No direct BGP configuration commands are issued — FRR handles advertisement natively.

Connected mode is ideal for modern, distributed FRR setups commonly found in large-scale fabrics (e.g., Cumulus, SONiC, or containerized FRR instances).

Dynamic Mode

If FRR's BGP configuration does not contain redistribute connected, Cosmolet switches to Dynamic Mode.
In this mode:

Cosmolet takes direct control of BGP advertisement.
It uses the FRR vtysh CLI or API to programmatically add or remove network statements under the BGP configuration context.
Each node independently advertises only those service IPs it’s responsible for.

This mode ensures compatibility with simpler or traditional FRR setups where automatic route distribution isn’t configured.
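
The mode selection described above can be sketched as a small text check. This is only an illustration: the source does not say how Cosmolet inspects FRR, so the sketch assumes the configuration text comes from something like vtysh -c "show running-config" and that the deciding parameter is FRR's redistribute connected statement inside the router bgp block. The function name detect_mode is hypothetical.

```python
def detect_mode(running_config: str) -> str:
    """Return "connected" if the BGP block redistributes connected routes,
    otherwise "dynamic". Assumes FRR's usual layout: top-level stanzas start
    in column 0 and their children are indented."""
    in_bgp = False
    for raw in running_config.splitlines():
        if not raw.strip():
            continue
        if not raw[0].isspace():
            # A top-level line either opens the BGP block or leaves it.
            in_bgp = raw.startswith("router bgp ")
            continue
        if in_bgp and raw.strip().startswith("redistribute connected"):
            return "connected"
    return "dynamic"


if __name__ == "__main__":
    sample = (
        "router bgp 65001\n"
        " address-family ipv4 unicast\n"
        "  redistribute connected\n"
        " exit-address-family\n"
        "exit\n"
    )
    print(detect_mode(sample))
```

Restricting the match to the router bgp block matters because other daemons (OSPF, for example) accept the same statement, and that would not cause BGP to advertise the loopback addresses.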

Service Discovery and Health Validation

Cosmolet continuously watches Kubernetes Service and Endpoints objects using native client-go informers.
For every eligible service (LoadBalancer or ClusterIP), it checks whether:

The service has healthy endpoints.
The service IP is assigned or reachable on the node.

If both conditions are met, the service IP is added to the loopback and advertised to the fabric.

If all endpoints of a service become unhealthy, Cosmolet withdraws the advertisement — preventing blackhole routes and ensuring that BGP reflects real application availability.

This health-aware advertisement model forms the foundation of reliable BGP-based service publishing in Kubernetes.
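
A minimal sketch of the eligibility rule described above, written in Python for brevity rather than as the controller's actual code. The Service dataclass and its fields are hypothetical stand-ins for what the informers would provide; ready_endpoints is assumed to be the number of ready addresses in the service's Endpoints object.

```python
from dataclasses import dataclass
from typing import List

ELIGIBLE_TYPES = {"LoadBalancer", "ClusterIP"}


@dataclass
class Service:
    name: str
    type: str
    ip: str                  # service IP; "None" or "" for headless services
    ready_endpoints: int = 0


def should_advertise(svc: Service) -> bool:
    """Advertise only eligible services that have an IP and a healthy endpoint."""
    if svc.type not in ELIGIBLE_TYPES:
        return False
    if not svc.ip or svc.ip == "None":
        return False
    return svc.ready_endpoints > 0


def advertised_ips(services: List[Service]) -> List[str]:
    """Return the de-duplicated, sorted list of IPs that pass the check."""
    return sorted({s.ip for s in services if should_advertise(s)})
```

When the last ready endpoint disappears, the service simply drops out of advertised_ips on the next pass, which is what triggers the withdrawal.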

Loopback IP Synchronization

Loopback management is central to Cosmolet’s operation.

For each service IP that should be advertised, the controller runs:

ip addr add 10.30.21.232/32 dev lo

and when no longer valid:

ip addr del 10.30.21.232/32 dev lo

Cosmolet automatically reconciles the loopback state on every iteration:

Active IPs → retained
Stale IPs → removed
Excluded IPs → preserved (based on config.yaml)

This ensures that the node’s loopback interface always reflects the real set of advertised service IPs.
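
The reconciliation rules above reduce to a few set operations. The sketch below is an illustration under assumptions, not Cosmolet's implementation: plan_loopback and ip_commands are hypothetical names, and the current address list is assumed to have been read from the loopback interface beforehand.

```python
import ipaddress
from typing import Iterable, List, Tuple


def plan_loopback(desired: Iterable[str], current: Iterable[str],
                  excluded: Iterable[str] = ()) -> Tuple[List[str], List[str]]:
    """Return (to_add, to_remove) for the loopback interface.

    Active IPs (in both sets) are left alone, stale IPs are removed,
    and excluded IPs are never removed."""
    desired, current, excluded = set(desired), set(current), set(excluded)
    to_add = sorted(desired - current)
    to_remove = sorted(current - desired - excluded)
    return to_add, to_remove


def ip_commands(to_add: List[str], to_remove: List[str], dev: str = "lo") -> List[str]:
    """Render the plan as ip(8) commands, using /32 for IPv4 and /128 for IPv6."""
    def host_prefix(ip: str) -> str:
        return f"{ip}/{ipaddress.ip_address(ip).max_prefixlen}"

    cmds = [f"ip addr add {host_prefix(ip)} dev {dev}" for ip in to_add]
    cmds += [f"ip addr del {host_prefix(ip)} dev {dev}" for ip in to_remove]
    return cmds
```

In practice the excluded list would need to cover addresses that live on lo for other reasons, such as 127.0.0.1 and the node's own router-ID address, so that reconciliation never deletes them.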

FRR Integration and Control Flow

FRR (Free Range Routing) acts as the BGP stack beneath Cosmolet. Cosmolet communicates with FRR in one of two ways:

Dynamic Mode: Direct vtysh command execution:

vtysh -c "configure terminal" \
      -c "router bgp 65001" \
      -c "network 10.30.21.232/32"

or its removal:

vtysh -c "configure terminal" \
      -c "router bgp 65001" \
      -c "no network 10.30.21.232/32"

Connected Mode: FRR automatically detects connected loopback IPs, so no direct CLI execution is needed. Cosmolet’s role ends after adding the IP to lo.

This modular integration ensures that Cosmolet can adapt to both fully distributed and standalone FRR deployments, without code or configuration changes.
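
For Dynamic Mode, the vtysh invocations shown above can be assembled as an argument list instead of a shell string, which avoids quoting problems and lets the IP be validated first. This is a sketch under assumptions: vtysh_args is a hypothetical helper, and the address-family line for IPv6 is an addition that the commands in this post do not show.

```python
import ipaddress
from typing import List


def vtysh_args(asn: int, ip: str, withdraw: bool = False) -> List[str]:
    """Build the argv for announcing or withdrawing one host route.

    ipaddress.ip_address raises ValueError on anything that is not a
    plain IP address, so malformed input never reaches vtysh."""
    addr = ipaddress.ip_address(ip)
    statement = f"network {addr}/{addr.max_prefixlen}"
    if withdraw:
        statement = "no " + statement

    args = ["vtysh", "-c", "configure terminal", "-c", f"router bgp {asn}"]
    if addr.version == 6:
        # IPv6 network statements belong under the IPv6 unicast address family.
        args += ["-c", "address-family ipv6 unicast"]
    args += ["-c", statement]
    return args


if __name__ == "__main__":
    # Print the command instead of executing it; running it for real would
    # mean passing the list to subprocess.run(args, check=True).
    print(" ".join(vtysh_args(65001, "10.30.21.232")))
```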

High Availability and Scalability

Cosmolet runs as a DaemonSet — ensuring that every node independently manages its local advertisements. This design offers:

Horizontal scalability: Each node runs autonomously.
No single point of failure: BGP advertisements are localized.
Leader election support (optional): Used for centralized coordination or cluster-wide statistics.

This decentralized architecture matches Kubernetes’ fault-tolerance model: if a node fails, its BGP sessions drop and its peers withdraw the routes learned from it; if only a service’s endpoints fail, Cosmolet withdraws that service’s advertisement itself.

Security and Privileges

Cosmolet follows a minimal-privilege principle:

Runs as a non-root container.
Uses Linux capabilities (NET_ADMIN) only where required.
No dependency on sudo or privileged mode.
RBAC limited to read-only access on Service and Endpoints resources.

This ensures compatibility with hardened cluster policies and secure multi-tenant setups.

Deployment Approaches

Cosmolet supports multiple deployment methods:

Helm chart for production use with configurable BGP, FRR, and namespace settings.
YAML manifests for quick testing or custom integration.
Custom DaemonSet templates for embedding within existing network operators.

Since it always runs as a DaemonSet, deployment is per-node and aligns with FRR’s operational model.

Closing Thoughts

Cosmolet simplifies one of the most complex aspects of bare-metal Kubernetes — integrating service IP routing with real BGP networks. Its two-mode architecture ensures that it can adapt to both modern and legacy FRR topologies, making it an effective choice for clusters that span racks, fabrics, or hybrid environments.

By bringing BGP awareness natively into the Kubernetes control plane, Cosmolet enables operators to run cloud-like networking on bare metal — predictably, efficiently, and transparently.

Cosmolet: Dynamic BGP Service Advertisement for Bare-Metal Kubernetes

2025-10-14T00:00:00+00:00

Introduction

Kubernetes has transformed how we orchestrate containerized workloads. However, running Kubernetes on bare-metal clusters exposes networking challenges absent in cloud environments. Unlike managed cloud LoadBalancers, bare-metal setups often require manual route configuration, external proxies, or appliances to expose services externally.

Cosmolet is an open-source Kubernetes controller designed to solve this problem. By integrating with FRR (Free Range Routing), Cosmolet dynamically advertises service IPs over BGP, enabling bare-metal clusters to expose services to the network fabric automatically. It automates service discovery, loopback IP management, and BGP advertisement, ensuring healthy and available services are reachable without manual intervention.

Challenges in Bare-Metal Kubernetes Networking

Running Kubernetes on bare-metal introduces several unique hurdles:

Manual Route Management: Administrators often configure static routes or modify external routers to make services reachable.
Service Reliability: Without health-aware routing, traffic may be sent to pods that are down, causing black-hole routes.
Overlay Dependency: Many solutions rely on overlays, adding latency and operational overhead.
Scalability: As nodes and services grow, managing route advertisements manually becomes error-prone and unsustainable.

Cosmolet addresses these by running a lightweight DaemonSet on each node to monitor services and pods, manage loopback IPs, and advertise IPs dynamically to the network via BGP.

Core Features

Automatic Service Discovery
Cosmolet continuously monitors Kubernetes services across namespaces. It detects new, updated, and deleted services automatically, ensuring the network reflects the cluster’s current state.

BGP Advertisement
In dynamic mode, Cosmolet advertises service IPs via BGP using FRR, and withdraws them when services are unhealthy or inactive, preventing black-hole traffic.

Health-Aware Routing
Cosmolet evaluates pod liveness probes before advertising service IPs. Only healthy services are announced to BGP peers, improving reliability.

Node-Local Loopback Management
Service IPs are added to each node’s loopback interface. Stale or inactive IPs are removed automatically, keeping routing tables accurate.

Observability
Cosmolet exposes a /metrics endpoint compatible with Prometheus, offering metrics such as loopback IP states, BGP advertisement status, and control loop timing. Logs provide insight into loopback management, pod health checks, and BGP operations.

Security and RBAC
Cosmolet operates with a minimal-privilege Kubernetes service account. It requires only permissions to list pods, services, and nodes, adhering to the principle of least privilege.

How Cosmolet Works

Cosmolet operates in a recurring loop on each node:

Service Discovery: Queries Kubernetes API for services in configured namespaces.
Pod Health Check: Filters for pods scheduled on the local node and verifies their liveness probes.
Loopback Management: Adds active service IPs to the node’s loopback interface and removes stale ones.
BGP Advertisement: Uses vtysh and FRR to advertise or withdraw service IPs to BGP peers.
Metrics Exposure: Updates Prometheus metrics for observability.
Logging: Provides detailed debug information, including active and removed IPs, health checks, and BGP operations.
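
The first two steps of the loop above (service discovery and the node-local pod health check) can be sketched as a selector match over ready local pods. The Pod and Svc dataclasses are hypothetical, and the ready flag is an assumption standing in for whatever probe state Cosmolet reads from the pod status; the source does not spell that out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pod:
    node: str
    ready: bool
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Svc:
    ip: str
    selector: Dict[str, str]


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """A pod backs a service when every selector key/value is in its labels."""
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def local_active_ips(services: List[Svc], pods: List[Pod], node: str) -> List[str]:
    """Service IPs backed by at least one ready pod scheduled on this node."""
    local_ready = [p for p in pods if p.node == node and p.ready]
    return sorted(
        s.ip for s in services
        if any(matches(s.selector, p.labels) for p in local_ready)
    )
```

The result is the set of IPs this node should hold on its loopback; a node with no ready local pods for a service does not attract that service's traffic.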

Deployment Approaches

Cosmolet can be deployed in multiple ways depending on operational preference and cluster management style:

DaemonSet on Kubernetes: Run Cosmolet as a DaemonSet so every node participates in service IP advertisement and loopback management.
Helm Charts: Package Cosmolet for reproducible, configurable deployment across clusters.
GitOps / Operator Models: Integrate Cosmolet with GitOps pipelines or Kubernetes operators for automated configuration and lifecycle management.

This flexibility allows clusters of any size or topology to integrate seamlessly with existing FRR-based BGP fabrics.

Operational Workflow

Cosmolet identifies active service IPs.
Updates the node’s loopback interface.
Advertises IPs via BGP in dynamic mode.
Removes stale IPs from loopback.
Exposes metrics and logs for observability.

Benefits

Automation: Reduces manual intervention in BGP advertisement.
Health-Aware Routing: Ensures only healthy services are advertised.
High Availability: Node-local loopbacks with dynamic advertisement maintain service reachability even during failures.
Observability: Prometheus metrics and detailed logs provide real-time insights.
Security: Operates with minimal privileges and RBAC.

Conclusion

Cosmolet bridges Kubernetes and bare-metal network fabrics by providing automated service discovery, loopback management, health-aware BGP advertisement, and observability. It simplifies networking, reduces operational complexity, and allows Kubernetes services to be exposed reliably and efficiently across on-premises networks.


Kubernetes Networking Across On-Prem Datacenters with BGP, ECMP, and BFD

2025-09-21T00:00:00+00:00

Introduction

Kubernetes powers a vast array of mission-critical applications, many of which operate in on-premises datacenters spanning multiple sites. Unlike public cloud environments, which abstract networking complexity behind managed services, on-premises deployments expose the underlying challenges of routing Pods and Services efficiently across a distributed infrastructure. Designing a resilient, scalable, and observable network fabric is crucial to ensure predictable connectivity, high availability, and seamless failover behavior for workloads.

In this post, we look at how modern datacenter networking concepts, such as Clos fabrics, dynamic routing with BGP, ECMP load balancing, and rapid failover with BFD, can be used to integrate Kubernetes clusters directly into enterprise network fabrics.

Core Networking Challenges in On-Prem Kubernetes

On-premises Kubernetes networking presents unique challenges that require careful architectural consideration:

Scalable Routing: As the number of nodes, Pods, and Services grows, traditional Layer 2 overlays and static routing become bottlenecks. Without native Layer 3 routing visibility, scaling beyond a few hundred nodes is difficult.
Rapid Failover: Links or nodes can fail at any time. Slow detection and rerouting can render workloads unreachable, causing downtime for critical applications.
Multi-Datacenter Connectivity: Applications often span multiple datacenters. Maintaining consistent routing policies and ensuring Service reachability across sites are essential for high availability and global access.
Operational Simplicity: Managing IPs, routing tables, and point-to-point links in a large fabric can quickly become overwhelming. Automation and dynamic routing protocols are required to reduce human error and maintain consistency.

Clos Fabrics and Dynamic Routing

A Clos network, commonly referred to as a spine-leaf architecture, forms the backbone of modern datacenter fabrics. It provides high bandwidth, predictable latency, and multiple equal-cost paths to enable resilient connectivity.

Leaf Switches connect directly to servers, including Kubernetes nodes, handling BGP peering with upstream layers or external networks. They forward traffic to spine switches for inter-leaf communication.
Spine Switches interconnect leaf switches, creating a high-speed backbone that ensures low latency, redundancy, and ECMP-enabled multiple-path traffic forwarding.
Border Leaf Switches act as gateways to external networks, connecting the datacenter to WANs, other datacenters, or the Internet.
In very large-scale deployments, a Super Spine layer can be introduced above the spine layer to interconnect multiple spine blocks, further reducing oversubscription and improving multi-datacenter scalability.

Dynamic routing protocols, particularly BGP, allow nodes and services to advertise their presence directly into the fabric, providing a scalable, loop-free routing topology. ECMP (Equal-Cost Multi-Path) spreads traffic across multiple spine-leaf paths, optimizing bandwidth utilization and providing redundancy.

Scaling Control Planes with IPv6 BGP Unnumbered

Traditional IPv4 BGP deployments require dedicated subnets for each point-to-point link, which quickly consumes IP space in large fabrics.

IPv6 BGP Unnumbered solves this problem:

Uses link-local IPv6 addresses for BGP session establishment.
No need to allocate /31 or /30 subnets per link.
IPv4 routes for Pods and Services are still exchanged, carried with IPv6 next hops (the extended next-hop encoding of RFC 8950, formerly RFC 5549).

This approach simplifies automation, reduces operational overhead, and ensures consistent routing without wasting precious IPv4 addresses.

Key takeaway: IPv6 handles the control plane; IPv4 remains for workloads.
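
To make the address savings concrete, here is a rough count of the IPv4 addresses that numbered point-to-point links would consume. The topology figures are illustrative assumptions, not numbers from a real fabric: every leaf connects to every spine, and every node has two uplinks.

```python
def p2p_links(spines: int, leaves: int, nodes: int, uplinks_per_node: int = 2) -> int:
    """Count point-to-point links: a full spine-leaf mesh plus node uplinks."""
    return spines * leaves + nodes * uplinks_per_node


def ipv4_addresses_needed(links: int, prefix_len: int = 31) -> int:
    """Addresses consumed if each link gets its own subnet of the given size."""
    return links * 2 ** (32 - prefix_len)


if __name__ == "__main__":
    links = p2p_links(spines=4, leaves=16, nodes=320)
    print(links)                             # 704 links
    print(ipv4_addresses_needed(links, 31))  # 1408 addresses with /31s
    print(ipv4_addresses_needed(links, 30))  # 2816 addresses with /30s
```

With unnumbered peering all of these per-link allocations disappear, and the only IPv4 addresses left to plan are the loopbacks and the workload ranges.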

Running FRR on Kubernetes Nodes

FRR (Free Range Routing) is an open-source routing software suite that provides implementations of standard routing protocols such as BGP, OSPF, RIP, and IS-IS. It allows network devices—including Linux servers, virtual machines, and Kubernetes nodes—to participate in IP routing just like traditional routers.

In simpler terms, FRR turns a regular machine into a fully capable router, enabling it to advertise, receive, and manage network routes dynamically. It’s widely used in data centers, cloud networking, and Kubernetes environments to integrate workloads directly into the network fabric, support ECMP load balancing, and enable fast failover with BFD.

Running FRR on nodes converts each Kubernetes worker into a mini-router:

Advertises Pod CIDRs and Service VIPs directly to the fabric.
Supports BFD for rapid failure detection (<1 second).
Uses loopbacks as stable next-hops for ECMP.
Avoids overlay encapsulation overhead, enabling native L3 routing visibility.

This architecture ensures that all workloads are first-class citizens on the network, visible to switches and routers for direct routing, allowing the datacenter fabric to see Pods and Services as native IP prefixes, not just encapsulated traffic.

Addressing Model

A clean addressing scheme simplifies routing and policy management:

Node Segment (10.10.19.x): Each Kubernetes node is assigned a unique loopback IP that serves as its BGP router ID. For example, Node1 could have 10.10.19.10. This address remains stable even if the physical interface changes.
Pod Segment (10.10.20.x): Pods receive dynamic IPs from per-node CIDRs. For instance, a Pod on Node1 might get 10.10.20.5. These IPs are advertised automatically by FRR, reducing manual configuration.
Service Segment (10.10.21.x): ClusterIP or LoadBalancer VIPs are assigned /32 routes and advertised to the fabric via node loopbacks. This ensures services are reachable across the network, enabling high availability and ECMP forwarding.
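
One practical benefit of a scheme like this is that any address can be classified mechanically, for example when building prefix filters. The sketch below assumes each "x" segment is a /24, which the addressing model above implies but does not state; segment_of is a hypothetical helper.

```python
import ipaddress
from typing import Optional

# Assumed /24 boundaries for the three segments in the addressing model.
SEGMENTS = {
    "node": ipaddress.ip_network("10.10.19.0/24"),
    "pod": ipaddress.ip_network("10.10.20.0/24"),
    "service": ipaddress.ip_network("10.10.21.0/24"),
}


def segment_of(ip: str) -> Optional[str]:
    """Return the segment name an address belongs to, or None if outside all."""
    addr = ipaddress.ip_address(ip)
    for name, network in SEGMENTS.items():
        if addr in network:
            return name
    return None


if __name__ == "__main__":
    for ip in ("10.10.19.10", "10.10.20.5", "10.10.21.66"):
        print(ip, segment_of(ip))
```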

Sample FRR Configuration

frr version 8.1
frr defaults traditional
hostname cp1
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config

interface eno5
 ipv6 nd ra-interval 6
 no ipv6 nd suppress-ra
exit

interface eno7
 ipv6 nd ra-interval 6
 no ipv6 nd suppress-ra
exit

interface lo
 ! Node loopback
 ip address 10.10.19.10/32
 ! Service VIPs
 ip address 10.10.21.66/32
 ip address 10.10.21.227/32
exit

router bgp 65496
 bgp router-id 10.10.19.10
 bgp bestpath as-path multipath-relax

 neighbor TOR peer-group
 neighbor TOR remote-as internal
 neighbor TOR bfd
 neighbor TOR timers 1 3

 neighbor eno5 interface peer-group TOR
 neighbor eno7 interface peer-group TOR

 bgp fast-convergence

 address-family ipv4 unicast
  network 10.10.19.10/32
  network 10.10.21.66/32
  network 10.10.21.227/32
  redistribute kernel
  redistribute connected
 exit-address-family
exit

Fabrics Overview

Modern Kubernetes deployments often involve multiple interconnected fabrics:

App Fabrics: Individual datacenter clusters advertise node, Pod, and Service routes into the Clos fabric. ECMP load-balances traffic, and BFD enables sub-second failover.
DCI Fabric: Interconnects App Fabrics across geographies, sharing routing information and enabling cross-datacenter service reachability.
Edge Fabric: Handles north-south traffic, enforces security policies, and routes external traffic to the correct internal services.

BGP, ECMP, and BFD in Action

Multi-BGP Neighbor Setup

Each Kubernetes node typically peers with two upstream leaf switches. ECMP allows traffic to traverse both uplinks efficiently, while BFD ensures rapid failure detection.

Service VIPs are advertised from node loopbacks and propagated across the spine-leaf fabric, DCI, and edge networks, providing seamless connectivity even during failures.
Pod IPs are redistributed automatically from kernel routes, allowing real-time updates without manual intervention.

Service Advertisement Flow

Node advertises Service VIP to fabric via BGP.
Leaf switches propagate the route across spines (ECMP paths).
DCI fabric shares the route across datacenters.
Edge fabric routes external traffic to the correct node.
Failures are handled seamlessly, with remaining nodes continuing to advertise VIPs.

Why Service IPs Are Added to Loopback

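Service IPs are virtual and not bound to any physical interface.
Assigning them to the node’s loopback places them on an interface that is always up, independent of any single uplink.
Advertising them as /32 routes gives the fabric an exact host route for each service.
When several nodes advertise the same /32, the fabric can spread traffic across them with ECMP, which also provides high availability.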

Why Pod IPs Are Advertised Automatically

Pods obtain dynamic IPs within node-specific CIDRs.
These IPs map to kernel-managed routes automatically.
FRR redistributes kernel routes, advertising pod presence dynamically.
Automation reduces manual config and promotes real-time fabric updates.

Extended Theory

BGP Benefits in K8s

Loop-free routing: BGP prevents routing loops even with multiple paths.
Scalability: Thousands of nodes and services can be advertised without massive L2 overlays.
Policy Control: Route-maps and filters can enforce service placement policies.

ECMP Considerations

Provides load distribution across multiple paths.
Works best with stable loopback next-hops.
May require tuning for hash algorithms to avoid uneven traffic flows.
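
The hashing behaviour behind these points can be illustrated with a toy next-hop selector. This is a conceptual sketch only: real switches and the Linux kernel use their own hash functions and field selections, and SHA-256 is used here purely as a convenient stand-in.

```python
import hashlib
from typing import Sequence, Tuple

# (source IP, destination IP, protocol, source port, destination port)
Flow = Tuple[str, str, int, int, int]


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Map a flow's 5-tuple onto one of the equal-cost next hops.

    Hashing the whole tuple keeps every packet of a flow on the same path
    (no reordering) while spreading different flows across paths."""
    key = "|".join(str(part) for part in flow).encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:8], "big") % len(next_hops)]


if __name__ == "__main__":
    hops = ["leaf1", "leaf2"]
    counts = {hop: 0 for hop in hops}
    for src_port in range(40000, 41000):
        counts[pick_next_hop(("10.0.0.1", "10.10.21.66", 6, src_port, 443), hops)] += 1
    print(counts)
```

Two things follow from this model. A few large flows can still load one path unevenly, because the hash balances flows rather than bytes. And with a plain modulo, changing the number of next hops remaps many existing flows, which is why some platforms offer resilient hashing.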

BFD Insights

Detects link or node failure in milliseconds.
Reduces downtime by quickly withdrawing unreachable routes.
Works alongside BGP to trigger failover without waiting for BGP timers.
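
The gap between BFD and BGP timers is easy to quantify. In asynchronous mode, RFC 5880 defines the local detection time as the remote detect multiplier times the larger of the local required receive interval and the remote desired transmit interval. The sketch below applies that formula; the 300 ms intervals and multiplier of 3 are assumed to match FRR's bfdd defaults, and the 3-second hold time comes from the timers 1 3 line in the sample configuration.

```python
def bfd_detection_ms(local_rx_ms: int, remote_tx_ms: int, remote_detect_mult: int) -> int:
    """Asynchronous-mode BFD detection time in milliseconds (RFC 5880)."""
    return remote_detect_mult * max(local_rx_ms, remote_tx_ms)


if __name__ == "__main__":
    bfd_ms = bfd_detection_ms(local_rx_ms=300, remote_tx_ms=300, remote_detect_mult=3)
    bgp_hold_ms = 3 * 1000  # hold time from "timers 1 3"
    print(bfd_ms)       # 900
    print(bgp_hold_ms)  # 3000
```

Even with BGP timers tuned as aggressively as in the sample configuration, BFD at these settings detects a failure in under a second, and lowering the intervals brings detection into the low hundreds of milliseconds.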

Best Practices

Automate FRR deployment using tools like Ansible, Helm charts, or Kubernetes operators.
Maintain a consistent IP addressing scheme across datacenters to avoid conflicts.
Actively monitor BGP and BFD sessions for anomalies.
Automate adding Service IPs into FRR using Cosmolet (an open-source project by PlatformCosmo).
Secure routing with prefix filters and network policies.
Regularly test failover scenarios to ensure resilience.

Conclusion

Integrating Kubernetes clusters into Clos fabrics using FRR, BGP, ECMP, and BFD transforms on-premises networking. Pods and Services become natively routable, highly available, and globally reachable across datacenters.

Leveraging IPv6 BGP unnumbered reduces operational complexity, while BFD ensures sub-second failover. This architecture delivers cloud-like networking behavior for enterprises that demand predictable, scalable, and resilient on-premises deployments, keeping Kubernetes services highly available and reachable across fabrics.