Velda feature highlights

  • Containerized Execution: Each task runs in a container isolated from other workloads on the same node, and workloads cannot access tasks belonging to other users.
  • Environment customization: Every workload runs in a fully customizable environment, including system packages (apt, pip, etc.).
  • Cluster Management: Compute resources are allocated on demand from the cloud or Kubernetes, and can be shared with other Kubernetes workloads on the same node.
  • Interactive Development: Developers get dev environments that can serve as their primary development solution, with full capabilities such as IDE and Docker access, or VS Code access from the browser.
  • Service development: Run services directly in the cluster with one command and reach them via DNS names, port-forwarding, or a load balancer. See "vrun is all you need".
  • SSO integration: Use your SSO provider to manage user access and login.
  • RBAC: Role-based access control limits who can see other users' jobs and data.
  • Web access: Start development in VS Code Web or view task logs directly from the browser.

What's right for me?

Both Velda and Slurm are options for managing HPC clusters, and they provide a similar command-line interface to run jobs. Users who are already familiar with Slurm will find it easy to get started with Velda, while Velda offers a more modern, cloud-native approach.

Resource Management and Deployment Model

Slurm’s Traditional Node-Based System: Slurm operates on a legacy HPC model where entire compute nodes are allocated to jobs, regardless of their actual resource requirements. Slurm is designed for traditional MPI applications. To ensure optimal performance, Slurm clusters must be carefully configured with partition systems, quality-of-service (QoS) configurations, and resource limit hierarchies. This often creates significant inefficiencies with modern, heterogeneous workloads. Users must understand these complex configurations to maximize cluster capacity and efficiency.

Velda’s Cloud-Native Architecture: Velda leverages multiple backend options to allocate compute resources: statically managed nodes in a traditional data center, virtual machines from a cloud platform, pods from Kubernetes, or resources provisioned through custom scripts. Velda supports a variety of instance configurations and can dynamically provision compute based on workload demands. This enables precise resource allocation and efficient multi-tenancy. The platform abstracts away infrastructure complexity while providing enterprise-grade security, automatic scaling, and comprehensive observability.
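
To make the Kubernetes-backed option concrete, the sketch below shows what on-demand, right-sized pod provisioning generally looks like with the official Kubernetes Python client. It is illustrative only, not Velda's implementation; the namespace, image, labels, and resource requests are placeholder assumptions.

```python
from kubernetes import client, config

# Illustrative sketch of on-demand, pod-based provisioning (not Velda's code).
# A scheduler creates a right-sized pod when a workload needs it and deletes
# it when the workload finishes, instead of holding whole nodes.

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
core = client.CoreV1Api()

def provision_worker(name: str, cpu: str, memory: str, gpus: int) -> None:
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"app": "compute-worker"}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="worker",
                    image="ubuntu:22.04",           # placeholder image
                    command=["sleep", "infinity"],  # a real worker would run the task
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": cpu, "memory": memory, "nvidia.com/gpu": str(gpus)},
                        limits={"nvidia.com/gpu": str(gpus)},
                    ),
                )
            ],
        ),
    )
    core.create_namespaced_pod(namespace="default", body=pod)

def release_worker(name: str) -> None:
    # Return capacity to the cluster as soon as the workload completes.
    core.delete_namespaced_pod(name=name, namespace="default")
```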

Development Experience

Slurm was built for running jobs, not for daily development. Slurm does not provide a native way to allocate development instances. Developers often rely on error-prone workarounds, such as submitting "sleep" jobs and using direct SSH access to the allocated node. Alternatively, they may use a shared login node, risking system instability. Debugging running jobs is challenging, and running services is poorly supported. These systems are not easy for new users to adopt and often require manual steps to sync code and builds from other systems to the Slurm cluster.

Velda puts the developer experience first. It offers the features of a modern Cloud Development Environment (CDE), allowing for the instant onboarding of new users by cloning instances from templates. It also manages resources for development instances with features like auto-shutdown, all accessible through a rich user interface. Developers access the cluster through dynamically provisioned personal environments. They can connect via SSH or any modern IDE (VSCode, Cursor, PyCharm, etc.), or access VSCode with one click from a browser. When running workloads, it can automatically forward ports for debugging or make services accessible via a developer-chosen DNS name. This makes accessing the cluster feel just like working on a local machine.

Velda's superior developer experience can eliminate the need for separate development GPUs or compute-heavy instances, thereby improving overall resource utilization.

Job Scheduler & Workload Management

Slurm has powerful workload management features, thanks to its long-standing industry adoption. Velda, on the other hand, is an emerging product. It supports the common features needed for AI/ML training, serving, and batch jobs, but may currently lack some long-tail capabilities that Slurm provides, such as gang scheduling with suspend/resume. These features will be added as usage grows.

Built for Model Training & AI Workloads

Velda is built for the full lifecycle of machine learning and natively supports many scheduling features, such as:

  • Advanced GPU scheduling options, e.g.:
    • Gang scheduling: start all workloads only once all requested resources are ready (see the sketch after this list).
    • Network-topology-aware cloud instance allocation, extending your resource allocation backend with specific topology options (e.g., EC2 Placement Groups in AWS, Compact Placement in Google Cloud, or custom configuration in Kueue).
  • Define workflows that involve heterogeneous steps (e.g., data processing, training, evaluation, and model deployment).
  • Platform-agnostic: supports PyTorch, JAX, Ray, and other frameworks.
  • Start model serving and serving development with one command, accessible via custom DNS names. See "vrun is all you need".
  • Preemption and multi-cloud support are coming soon.
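
As a conceptual illustration of the gang-scheduling item above, the sketch below shows the general technique: hold every task of a job until the full set of requested slots is ready, then release them all at once. This is a generic sketch, not Velda's scheduler; the class and method names are invented for illustration.

```python
import threading

# Gang-scheduling sketch: tasks are released together only once every requested
# slot (GPU, node, ...) is available, so no task runs against a partial allocation.

class GangBarrier:
    def __init__(self, required_slots: int):
        self.required_slots = required_slots
        self.ready_slots = 0
        self._cond = threading.Condition()

    def slot_ready(self) -> None:
        """Called by the allocator each time a requested slot becomes available."""
        with self._cond:
            self.ready_slots += 1
            if self.ready_slots >= self.required_slots:
                self._cond.notify_all()

    def wait_for_gang(self, timeout=None) -> bool:
        """Blocks a task until the whole gang can start; returns False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self.ready_slots >= self.required_slots, timeout=timeout
            )

# Usage: every task of an 8-GPU job calls barrier.wait_for_gang() before starting,
# and the allocator calls barrier.slot_ready() as each GPU is provisioned.
barrier = GangBarrier(required_slots=8)
```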

Workload Isolation

Slurm provides basic process isolation through Linux user accounts and relies on shared filesystems for data access. This approach creates potential security vulnerabilities and makes it difficult to ensure complete workload separation, especially in multi-tenant environments or when workloads require different system dependencies. Users often get direct SSH access to compute nodes, even those not allocated to their workloads.

Velda implements per-user, per-instance containers. Each user’s workload runs in a completely isolated environment, which addresses the challenge of secure multi-tenancy. This architecture allows fully customizable packages to be installed while enabling safe multi-tenancy for sensitive computational workloads. When developers need to access a compute node (e.g., for debugging), they can only access the container for their specific workload and are unable to interfere with other workloads on the same node.
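
For intuition, the sketch below shows what launching each task in its own container can look like using the Docker SDK for Python. It is not Velda's actual isolation mechanism; the image, mount layout, and security options are assumptions.

```python
import docker

# Per-user, per-task container sketch (not Velda's code): each task runs in its
# own container with only that user's data mounted, so it cannot see processes
# or files belonging to other users on the same node.

client = docker.from_env()

def run_isolated_task(user: str, task_id: str, image: str, command: list[str]):
    return client.containers.run(
        image,
        command,
        name=f"task-{user}-{task_id}",       # one container per user per task
        detach=True,
        volumes={f"/data/{user}": {"bind": "/workspace", "mode": "rw"}},
        security_opt=["no-new-privileges"],  # block privilege escalation inside the task
        pids_limit=4096,                     # illustrative per-task guardrail
    )
```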

Enterprise Features and Compliance

Modern organizations require robust access controls, audit logging, and compliance capabilities. Slurm’s basic accounting and limited Role-Based Access Control (RBAC) features may not meet enterprise security requirements and often rely heavily on third-party tools or customizations, such as syncing SSO with an LDAP directory.

Velda provides comprehensive enterprise features, including SSO / SAML integration, usage reporting, and RBAC policies. It also includes features for optimizing resource usage and cost, supporting the efficient management of workloads. Integration with existing security infrastructure ensures seamless adoption in regulated environments.
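
As a minimal illustration of what an RBAC policy check involves, consider the sketch below. It is a generic example, not Velda's policy model; the role names and permission strings are invented.

```python
from dataclasses import dataclass

# Generic RBAC sketch: a role grants permission strings, and a check decides
# whether a user may view a job owned by someone else.

@dataclass(frozen=True)
class Role:
    name: str
    permissions: frozenset

ROLES = {
    "admin":  Role("admin",  frozenset({"job.view.any", "job.submit"})),
    "member": Role("member", frozenset({"job.view.own", "job.submit"})),
}

def can_view_job(viewer: str, viewer_role: str, job_owner: str) -> bool:
    perms = ROLES[viewer_role].permissions
    return "job.view.any" in perms or ("job.view.own" in perms and viewer == job_owner)

assert can_view_job("alice", "member", job_owner="alice")    # own job: allowed
assert not can_view_job("alice", "member", job_owner="bob")  # someone else's job: denied
assert can_view_job("root", "admin", job_owner="bob")        # admin: allowed
```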

Begin your Velda journey today

Get Started