The real cost in ML infrastructure is not the GPU's hourly rate; it is human behavior and poorly designed workflows. A single g6.2xlarge EC2 instance costs roughly $1 per hour, or about $700 per month if left on continuously. For a team of three engineers running three instances that sit idle most of the time, that is over $2,100 per month doing nothing. If deleting a GPU instance feels risky, the workflow is probably misaligned with modern ephemeral infrastructure practices.
Most teams do not need GPUs 24/7. They need them for development sessions, short-lived training, batch experiments, and occasional model exploration. Traditional always-on infrastructure creates inertia and encourages waste.
Other workflow components, such as ECR repositories, CloudWatch logs, and S3 datasets, also incur persistent costs that accumulate if unmanaged. Separating ephemeral compute from persistent resources is key to safe and predictable cost optimization.
Note: All prices mentioned in this post are accurate at the time of writing and may change over time.
The Alternatives
Before diving into ephemeral GPUs, let's look at the common alternatives and why they may or may not fit development-heavy workloads:
Managed Services (e.g., Amazon SageMaker)
Pros: Simplified training and deployment, built-in pipelines, managed infrastructure
Cons: Limited control over CUDA versions, OS-level debugging, or custom PyTorch builds
Best for: Teams comfortable within the SageMaker environment or running standardized workloads
Kubernetes
Pros: Scales to zero, supports complex multi-node training
Cons: High operational overhead. Node scheduling, GPU allocation, autoscaling, device plugins, and monitoring complicate the workflow. It is overkill for development-heavy, intermittent workloads
Best for: Large teams with continuous, high-throughput GPU workloads. If a team already operates a mature GPU-enabled Kubernetes platform, ephemeral GPU nodes can still work well, but the operational cost should be consciously accepted rather than assumed.
Always-On EC2 with Auto Scaling
Pros: Handles spikes in demand
Cons: Auto Scaling does not eliminate idle GPU costs. Instances still accrue cost when idle, leaving human behavior unchecked
Best for: Rarely justified unless workloads are truly continuous
The goal is simple: stop paying for idle GPUs. Ephemeral infrastructure addresses both the cost and the behavioral inertia.
The Ephemeral Infrastructure Pattern
Core Principle
Anything that can be destroyed safely should be destroyed. This requires a strict separation of resources:
Persistent Resources (Backbone)
- S3 buckets for datasets, models, and artifacts
- ECR repositories for Docker images
- CloudWatch logs
- Elastic IPs when necessary
- Parameter Store or Secrets Manager for secrets
Ephemeral Resources (Disposable Compute)
- GPU EC2 instances
- Instance-specific IAM roles and security groups
This separation ensures destroying ephemeral resources is risk-free.
Terraform State Isolation
Persistent and ephemeral resources live in separate Terraform states. Ephemeral modules read outputs from persistent ones via terraform_remote_state but cannot modify persistent resources. This ensures a terraform destroy cannot wipe critical data.
Ephemeral modules should have read-only access to the persistent state backend to prevent accidental modification of long-lived resources.
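A minimal sketch of this wiring, assuming an S3 state backend; the bucket, key, and output names are illustrative placeholders for your own layout:

```hcl
# Ephemeral module: read-only lookup of the persistent stack's outputs.
data "terraform_remote_state" "persistent" {
  backend = "s3"

  config = {
    bucket = "my-team-terraform-state"     # hypothetical state bucket
    key    = "persistent/terraform.tfstate"
    region = "us-east-1"
  }
}

# Consume persistent outputs; the data source cannot modify that state.
locals {
  artifact_bucket = data.terraform_remote_state.persistent.outputs.artifact_bucket
  ecr_repo_url    = data.terraform_remote_state.persistent.outputs.ecr_repo_url
}
```

Pair this with an IAM policy that grants the ephemeral workspace only s3:GetObject on the persistent state key, so a misconfigured apply cannot write to it.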
Bootstrap and AMI Strategy
Use hardened base AMIs with OS-level dependencies, CUDA drivers, and ML frameworks. Bootstrap scripts should:
- Pull configuration from persistent state
- Handle partial failures gracefully
- Be idempotent
Best Practices:
- Rebuild base AMIs periodically to pick up driver and security updates
- Bootstrap dynamically at startup so routine dependency changes do not require a new AMI
- Log everything to CloudWatch for debugging
Pitfalls to Avoid
- Version conflicts between CUDA, drivers, and PyTorch
- Incomplete dependency installation. Always test in isolated ephemeral environments
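A sketch of what this looks like in practice, assuming a hardened AMI and a Parameter Store entry; the AMI variable, parameter name, and instance profile are hypothetical:

```hcl
# Illustrative ephemeral GPU instance booting from a hardened AMI.
resource "aws_instance" "gpu_dev" {
  ami                  = var.gpu_ami_id           # hardened AMI with CUDA drivers + ML frameworks
  instance_type        = "g6.2xlarge"
  iam_instance_profile = var.gpu_instance_profile # grants SSM + S3 access

  user_data = <<-EOF
    #!/usr/bin/env bash
    set -euo pipefail   # fail fast so partial bootstraps are visible

    # Pull runtime configuration from Parameter Store.
    DATASET_BUCKET=$(aws ssm get-parameter --name /ml/dataset-bucket \
      --query Parameter.Value --output text)

    # Idempotent: safe to re-run if the instance reboots mid-bootstrap.
    mkdir -p /opt/ml && echo "$DATASET_BUCKET" > /opt/ml/dataset_bucket

    # Ship bootstrap logs to CloudWatch for debugging.
    systemctl restart amazon-cloudwatch-agent || true
  EOF

  tags = {
    Lifecycle = "ephemeral"
    Owner     = var.owner
  }
}
```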
Kill SSH, Embrace SSM
Because ephemeral GPU instances are short-lived and created dynamically, traditional SSH key management becomes risky. Keys rarely rotate, leave no audit trail, and are a liability if a developer’s laptop is lost. To maintain security, auditability, and ease of access, use AWS SSM Session Manager instead:
- IAM-authenticated sessions
- Full session logging
- Tag-based access control
- No inbound ports or keys
Secrets are retrieved at runtime via Parameter Store with scoped IAM policies. Rotation happens automatically without developer intervention.
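Tag-based access control can be expressed as an IAM condition on ssm:StartSession. A sketch, assuming instances carry an Owner tag and developers authenticate as IAM users (the policy name is illustrative):

```hcl
# Developers may only start sessions on instances tagged with their own username.
data "aws_iam_policy_document" "ssm_dev_access" {
  statement {
    effect    = "Allow"
    actions   = ["ssm:StartSession"]
    resources = ["arn:aws:ec2:*:*:instance/*"]

    condition {
      test     = "StringEquals"
      variable = "ssm:resourceTag/Owner"
      values   = ["$${aws:username}"]   # escaped so Terraform emits the IAM policy variable
    }
  }
}

resource "aws_iam_policy" "ssm_dev_access" {
  name   = "ssm-ephemeral-gpu-access"
  policy = data.aws_iam_policy_document.ssm_dev_access.json
}
```

Teams using SSO-federated roles would swap aws:username for a session-tag condition, but the shape is the same.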
Persistent Resource Cost Management
Even persistent resources have ongoing costs: ECR images accumulate, CloudWatch logs grow without retention policies, and S3 datasets expand over time. Discipline in managing them is critical; these costs grow silently if ignored.
Workflow Guidelines for Teams
- Developer provisions GPU via ephemeral Terraform module
- Instance boots from hardened AMI, runs bootstrap, and attaches to persistent resources
- Developer completes experiments, training, or batch jobs
- If idle for more than a configurable threshold, the instance is destroyed automatically
- Logs, datasets, and images remain persistent
- Tagging and IAM policies enforce isolation and auditability
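One way to enforce the idle threshold is a CloudWatch alarm with EC2's built-in terminate action. The sketch below assumes an aws_instance.gpu_dev resource; thresholds and periods are illustrative, and true GPU utilization would need a custom metric (e.g., nvidia-smi scraped by the CloudWatch agent) rather than CPU:

```hcl
# Terminate the instance after an hour of sustained low utilization.
resource "aws_cloudwatch_metric_alarm" "idle_terminate" {
  alarm_name          = "gpu-dev-idle-${aws_instance.gpu_dev.id}"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "LessThanThreshold"
  threshold           = 5     # percent
  period              = 300   # 5-minute datapoints
  evaluation_periods  = 12    # 12 datapoints = 1 hour of idleness

  dimensions = {
    InstanceId = aws_instance.gpu_dev.id
  }

  # Built-in EC2 alarm action: terminate the idle instance.
  alarm_actions = ["arn:aws:automate:${var.aws_region}:ec2:terminate"]
}
```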
Monitoring and Metrics
Track GPU utilization, storage usage, and overall costs. Use:
- CloudWatch metrics for EC2 and storage
- AWS Cost Explorer for budget tracking
- Alerts for underutilized GPUs or storage anomalies
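Budget alerts can also be codified. A hypothetical monthly budget scoped to the Lifecycle=ephemeral tag, with an email alert at 80 percent of actual spend (the limit, tag, and address are placeholders):

```hcl
resource "aws_budgets_budget" "gpu_monthly" {
  name         = "gpu-ephemeral-monthly"
  budget_type  = "COST"
  limit_amount = "1000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Only count spend tagged as ephemeral GPU compute.
  cost_filter {
    name   = "TagKeyValue"
    values = ["user:Lifecycle$ephemeral"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["ml-team@example.com"]
  }
}
```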
Visibility drives behavioral change, which is the real ROI of ephemeral GPUs.
Behavioral Change is the Real ROI
Before ephemeral GPUs, our engineers left instances idle for days or weeks at a time. After adopting ephemeral infrastructure:
- Idle time dropped near zero
- GPU costs fell approximately 80 percent
Ephemeral infrastructure aligns costs with actual usage, not calendar time.
When Not to Use This Pattern
- 24/7 production inference
- Ultra-low-latency services
- Long warm-up workloads
- Strict compliance systems
Best suited for development, research, batch workloads, training jobs, and staging environments.
Additional Considerations
- Spot Instances can be combined with ephemeral GPUs for 50–70 percent cost reduction
- Multi-node distributed training using ephemeral clusters works for Horovod, DeepSpeed, or multi-GPU jobs. Ensure bootstrapping handles cluster coordination
- CI/CD pipelines can integrate ephemeral instances. Spin up ephemeral GPU, run training, destroy instance, and persist artifacts
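The Spot variant of the ephemeral instance is a small change to the same module. A sketch, reusing the hypothetical hardened-AMI variable from earlier; Spot capacity can be reclaimed with a two-minute notice, so training jobs should checkpoint to S3:

```hcl
resource "aws_instance" "gpu_spot" {
  ami           = var.gpu_ami_id   # same hardened AMI as on-demand
  instance_type = "g6.2xlarge"

  instance_market_options {
    market_type = "spot"
    spot_options {
      spot_instance_type             = "one-time"   # one-time requests must terminate
      instance_interruption_behavior = "terminate"
    }
  }

  tags = {
    Lifecycle = "ephemeral"
  }
}
```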
Key Takeaways
- Treat GPUs as disposable and destroy them when idle
- Separate persistent and ephemeral resources with isolated Terraform states
- Use SSM Session Manager to eliminate SSH keys and improve auditability
- Track and manage persistent resource costs rigorously
- Behavioral change drives cost savings more than technical optimization
- Ephemeral infrastructure enables experimentation, predictable costs, and safer workflows
- Consider Spot instances and ephemeral multi-node clusters for additional efficiency
If GPUs are always-on by default, it is time to rethink your architecture. Build ephemeral, automated workflows and align costs with usage.