Delivering Distributed AI at the Edge with aarna.ml

Posted by Anna Williams
Sep 17, 2025

Not all AI workloads are the same. For some applications—like physical AI, real-time dialogue agents, digital avatars, and computer vision—speed is essential. Delays introduced by network round trips or by centralized processing are no longer acceptable. These applications call for compute to be situated close to where data is generated, ensuring low latency and reducing bandwidth requirements. Centralized deployments can’t always make the cut when responsiveness and scale matter.

As a result, there’s a growing need for inference that is:

  • geographically distributed
  • dynamically orchestrated
  • tightly optimized for latency and bandwidth

Equally crucial are edge compute architectures that deliver strong performance per token and per watt.

This rising demand is driving explosive growth in distributed inference infrastructure—spanning GPU clusters in regional data centers and edge locations—while retaining cloud-like flexibility and scale. Between 2025 and 2030, the market is projected to grow from around USD 106.15 billion to about USD 254.98 billion, at a CAGR of 19.2%.
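
As a quick arithmetic check, the cited growth rate is consistent with those two endpoints:

```python
# Sanity-check the projected CAGR from the cited 2025 and 2030 market sizes.
start, end, years = 106.15, 254.98, 5   # USD billions, 2025 -> 2030
cagr = (end / start) ** (1 / years) - 1
print(f"Implied CAGR: {cagr:.1%}")      # -> Implied CAGR: 19.2%
```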

Why NVIDIA MGX Servers Matter for Edge Inference

NVIDIA’s MGX servers (a modular reference design) are well-suited for both intense datacenter workloads and edge inference. Key advantages include:

  • Modular scaling (from a single rack unit up to many racks), enabling growth from small edge sites to large core deployments.
  • High performance per watt, meaning more compute capacity in energy-and-cost constrained environments.
  • Integration with NVIDIA AI Enterprise tools like NVCF (NVIDIA Cloud Functions) and NIM, giving access to a broad set of models and vertical solutions.

When paired with NVIDIA Spectrum-X Ethernet networking, these servers can deliver more of the performance potential from GPUs. Spectrum-X brings consistent, predictable network performance—even in multi-tenant environments—and reduces runtimes for large transformer-style models.

Challenges in Building an Edge Inference Stack

While a stack like MGX + Spectrum-X + NVIDIA AI Enterprise provides a strong foundation, there are important challenges to address for successful distributed inference and GPU-as-a-Service (GPUaaS):

  1. Managing many sites – These include edge and core locations, often with minimal physical staffing (“lights-out” operation), so compute, storage, networking, and gateways must be managed remotely at low OPEX.
  2. Tenant isolation – Multiple tenants sharing the same infrastructure must be isolated for security and performance, avoiding “noisy neighbor” problems.
  3. Workload-site matchmaking – Assigning tasks to GPU sites based on latency, data gravity, bandwidth, or compliance (a simple matchmaking heuristic is sketched after this list).
  4. Utilization efficiency – Since GPUs are expensive, utilization should be maximized. This means supporting dynamic scaling of compute for bursty workloads, scheduling batch jobs efficiently, and making idle capacity available (e.g., via NVCF).
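
To make item 3 concrete, here is a minimal sketch of workload-site matchmaking. The site and workload descriptors, field names, and scoring rule are illustrative assumptions, not part of any NVIDIA or aarna.ml API:

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    rtt_ms: float          # measured round-trip latency to the workload's users
    free_gpus: int         # currently unallocated GPUs
    region: str            # used for data-residency / compliance checks

@dataclass
class Workload:
    name: str
    max_rtt_ms: float      # latency budget (e.g., a real-time dialogue agent)
    gpus_needed: int
    allowed_regions: set   # compliance / data-gravity constraint

def pick_site(workload: Workload, sites: list) -> Site:
    """Filter sites on hard constraints, then prefer low latency and spare capacity."""
    feasible = [
        s for s in sites
        if s.rtt_ms <= workload.max_rtt_ms
        and s.free_gpus >= workload.gpus_needed
        and s.region in workload.allowed_regions
    ]
    if not feasible:
        raise RuntimeError(f"No site satisfies constraints for {workload.name}")
    # Latency dominates; spare capacity breaks ties.
    return min(feasible, key=lambda s: (s.rtt_ms, -s.free_gpus))

# Example: a latency-sensitive avatar workload lands on the nearest edge site.
sites = [Site("edge-muc", 8.0, 4, "eu"), Site("core-fra", 22.0, 32, "eu")]
job = Workload("avatar-chat", max_rtt_ms=15.0, gpus_needed=2, allowed_regions={"eu"})
print(pick_site(job, sites).name)  # -> edge-muc
```

A production scheduler would also weigh bandwidth cost, data gravity, and per-tenant quotas, but the filter-then-rank structure stays the same.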

Key Requirements: Secure, Dynamic Tenancy & Isolation

To meet these challenges, ideal software should provide:

  • Zero-touch management across possibly thousands of edge/core sites (to reduce operational expense).
  • Strict isolation across tenants—covering compute, storage, and networking to ensure both performance and security.
  • Dynamic resource scaling—so infrastructure adapts to fluctuating workloads.
  • Mechanisms to monetize underused capacity—for example, registering spare GPU capacity with NVCF to serve inference jobs (a hypothetical tenant spec illustrating these requirements is sketched after this list).
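
As a rough illustration of what these requirements might look like when expressed declaratively, here is a hypothetical per-tenant specification; the field names and values are assumptions for this sketch, not the aarna.ml GPU CMS or NVCF schema:

```python
from dataclasses import dataclass

@dataclass
class TenantSpec:
    """Illustrative per-tenant isolation and scaling policy (hypothetical schema)."""
    tenant_id: str
    gpu_quota: int                      # dedicated GPUs (or MIG slices)
    vlan_id: int                        # isolated network segment
    storage_quota_tb: float             # dedicated storage allocation
    autoscale_min_gpus: int = 1         # floor for dynamic scaling
    autoscale_max_gpus: int = 8         # ceiling for bursty workloads
    offload_idle_to_nvcf: bool = False  # offer spare capacity for inference jobs

# Example: a computer-vision tenant that releases idle GPUs for external inference.
cv_tenant = TenantSpec(
    tenant_id="cv-analytics",
    gpu_quota=4,
    vlan_id=2101,
    storage_quota_tb=20.0,
    autoscale_max_gpus=4,
    offload_idle_to_nvcf=True,
)
print(cv_tenant)
```

A management layer can then enforce the quotas, program the isolated network segment, and decide when idle GPUs are safe to offer for external inference jobs.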

Edge Workloads Include AI and RAN

Beyond AI inference, workloads like 5G/6G Radio Access Network (RAN) software are also edge-based and benefit from GPU acceleration. Instead of keeping separate infrastructure that sits underutilized (often at 20-30% usage), combining AI and RAN workloads on the same GPU infrastructure improves efficiency.

aarna.ml’s GPU Cloud Management Software (CMS)

aarna.ml offers GPU Cloud Management Software that addresses many of the above needs. Key features:

  • On-demand isolation across CPU, GPU, network, storage, WAN gateway.
  • Supports bare-metal, VMs, or containerized deployments.
  • Automated infrastructure management across many sites.
  • Tenant discovery, onboarding, RBAC (Role-Based Access Control), and billing.
  • Integration with both open source (Ray, vLLM) and commercial PaaS platforms (like Red Hat OpenShift).
  • Ability to integrate with NVCF for monetizing unused compute.
  • Centralized orchestration of multiple edge sites.

Reference Architecture: Combining NVIDIA + aarna.ml for Edge Inference

Putting hardware and software together, here’s what the ideal setup looks like at each edge site:

  • NVIDIA MGX servers equipped with high-speed network cards or DPUs.
  • Spectrum-X switches for internal and out-of-band management networks.
  • NVIDIA AI Enterprise tools (NIM, NVCF).
  • Optionally, Quantum InfiniBand switches for high bandwidth East-West communication.
  • High performance storage.
  • The aarna.ml GPU CMS.
  • Integration with local IT infrastructure (gateways, DNS, etc.).

Process:

  1. Install: Edge and core sites are equipped with hardware and tested.
  2. Onboard: Infrastructure gets added to aarna.ml GPU CMS. Tenants are created. Resources (servers, GPUs, portions thereof) are allocated.
  3. Isolation: Each tenant gets fully isolated resources (compute, GPU, memory, networking, storage) so that workloads don’t interfere with one another.
  4. Workload Deployment: Per-tenant clusters, built with Kubernetes or commercial cluster software, run RAN and AI/ML workloads. The clusters can also be registered with NVCF for distributed inference.
  5. Dynamic Scaling: Clusters scale up/down automatically based on policies (for example, RAN traffic patterns). During off-peak times AI inference workloads can use spare capacity; during peak RAN times, resources shift back to RAN (a simple capacity-shifting policy is sketched after these steps).
  6. External Connectivity: Endpoints for inference are made accessible via DNS, load balancers, firewalls, and gateways, all configured automatically and securely without manual steps.
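
As a rough sketch of the policy behavior described in step 5, the snippet below repartitions a site’s GPUs between RAN and AI inference based on sampled RAN load; the headroom margin and pool names are assumptions for illustration:

```python
def rebalance_gpus(total_gpus: int, ran_load: float) -> dict:
    """Split a site's GPUs between RAN and AI inference based on the current
    RAN traffic load (0.0-1.0). RAN keeps a 20% headroom margin; whatever is
    left over serves inference (and could be offered for NVCF jobs)."""
    ran_share = min(1.0, ran_load * 1.2)                   # load plus headroom
    ran_gpus = min(total_gpus, max(1, round(total_gpus * ran_share)))
    return {"ran": ran_gpus, "inference": total_gpus - ran_gpus}

# Off-peak: most GPUs serve inference; at peak, capacity shifts back to RAN.
print(rebalance_gpus(total_gpus=8, ran_load=0.15))  # {'ran': 1, 'inference': 7}
print(rebalance_gpus(total_gpus=8, ran_load=0.80))  # {'ran': 8, 'inference': 0}
```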

Conclusion: Making Edge AI Scalable and Multi-Tenant

The future of AI increasingly depends on pushing compute to the edge. To do that well requires more than just fast GPUs—it demands architectures that deliver:

  • Secure multi-tenant isolation
  • Dynamic scaling
  • High utilization
  • Seamless integration with cloud native and AI services

The combination of NVIDIA MGX servers, Spectrum-X networking, and NVIDIA AI Enterprise delivers the performance, while aarna.ml’s GPU CMS adds the orchestration and management layers needed to turn infrastructure into scalable, revenue-generating services. For telecom operators especially, this offers a path to combine AI workloads with network functions (like RAN) for greater efficiency and new service opportunities.

Now is an opportune time for organizations to experiment, pilot, and partner with technologies like aarna.ml + NVIDIA to bring edge-based AI to production.
