Triton Inference Server (UChicago)¶
Overview¶
The Triton Inference Server is deployed as a Kubernetes cluster service within the UChicago Analysis Facility. This service provides high-performance AI model inference with automatic scaling capabilities.
Key Information
- Access: Internal to UChicago AF Kubernetes cluster only (not externally exposed)
- Endpoint: triton-traefik.triton.svc.cluster.local:8001 (gRPC)
- Version: Release 2.61.0 (NGC container 25.09)
- GPU Resources: 1-3 pods, each pod allocated 1 GPU (see hardware specs)
- CPU/Memory: Burstable (no minimum request, scales as needed)
- Autoscaling: Managed by Horizontal Pod Autoscaler (HPA) based on average queue time metrics
The Triton service automatically scales the number of server instances based on workload demand, ensuring efficient resource utilization.
Accessing Triton¶
Access to the Triton Inference Server is currently restricted to services and workloads running inside the UChicago AF Kubernetes cluster. If you need to access Triton from your analysis workflows, you can connect to the load balancer endpoint within the cluster environment using gRPC.
Load Balancer Endpoint (gRPC): triton-traefik.triton.svc.cluster.local:8001
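As a minimal sketch of connecting from inside the cluster, the following assumes the official `tritonclient` Python package (installable with `pip install tritonclient[grpc]`, not part of the AF deployment itself) and that the code runs where the service DNS name resolves:

```python
# Sketch of a gRPC connectivity check against the in-cluster endpoint.
# Assumes the tritonclient package is installed and that this runs inside
# the AF Kubernetes cluster, where the service DNS name resolves.

TRITON_URL = "triton-traefik.triton.svc.cluster.local:8001"

def triton_is_ready(url: str = TRITON_URL) -> bool:
    """Return True when the Triton server reports itself live and ready."""
    import tritonclient.grpc as grpcclient  # lazy import: cluster-only use
    client = grpcclient.InferenceServerClient(url=url)
    return client.is_server_live() and client.is_server_ready()
```

From here, the same client object can be used to issue inference requests against any model loaded on the server.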
Model Repository Access¶
Triton can access models from two storage options at the UChicago Analysis Facility:
- CVMFS - For existing production ML models already stored in CVMFS (via Kubernetes hostPath mount)
- S3 Storage - For uploading new models to https://s3.af.uchicago.edu
Using Models from CVMFS¶
If your production ML models are already stored in CVMFS, Triton can access them directly through a Kubernetes hostPath mount. This enables you to deploy existing models without needing to copy them to S3.
To use CVMFS models, contact the AF administrators with:
- Path to your model(s) in CVMFS
- Model name and type
- Backend requirements (TensorFlow, PyTorch, ONNX, etc.)
- Expected duration/timeframe for model usage
Uploading Your Models to S3¶
To upload and deploy your machine learning models on Triton, follow these steps:
1. Request Access and Credentials¶
Contact the UChicago AF administrators to request access to the S3 model repository.
Include in your request:
- Your UChicago AF username
- Brief description of your models and use case
- Expected storage requirements
2. Create Your Model Directory¶
Once approved, you'll receive S3 credentials. Create a subdirectory in the model repository named after your AF username.
This keeps your models organized and separates them from other users' models.
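Within your user subdirectory, Triton expects its standard repository layout: one folder per model, numbered version subfolders, and a config.pbtxt describing the model. A local sketch of preparing that layout before upload, where `jdoe` and `my_model` are placeholder names:

```python
# Build the standard Triton model-repository layout locally before upload.
# "jdoe" and "my_model" are placeholders; substitute your AF username and
# your model's name.
from pathlib import Path

user_dir = Path("models") / "jdoe"        # your username subdirectory
model_dir = user_dir / "my_model"
version_dir = model_dir / "1"             # numeric version folder
version_dir.mkdir(parents=True, exist_ok=True)

# Minimal config.pbtxt; the required fields depend on your backend and
# model, so treat these values as illustrative only.
(model_dir / "config.pbtxt").write_text(
    'name: "my_model"\n'
    'backend: "onnxruntime"\n'
    'max_batch_size: 8\n'
)
# The model file itself (e.g. model.onnx) goes inside the version folder.
```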
3. Upload Models¶
Upload your models to your directory using any S3-compatible client (for example, the AWS CLI, s3cmd, or MinIO's mc).
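As one concrete sketch of the upload, using the AWS CLI pointed at the AF endpoint; the bucket name and username are placeholders you will receive with your credentials:

```shell
# Hedged sketch: upload a prepared model directory with the AWS CLI.
# <bucket> and <username> are placeholders from your access approval;
# any S3-compatible client works equally well.
aws s3 cp ./my_model/ "s3://<bucket>/models/<username>/my_model/" \
    --recursive \
    --endpoint-url https://s3.af.uchicago.edu
```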
4. Request Model Activation¶
After uploading your models, contact the AF administrators to have your models added to the Triton server configuration.
Include:
- Your username
- Model directory path
- Model name and type
- Any specific backend requirements (see below)
- Expected duration/timeframe for model usage
The Triton server polls the model repository every 60 seconds, so once configured, your models should become available automatically.
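Because of that 60-second polling interval, a short wait loop can confirm when a newly configured model comes online. A sketch, again assuming the `tritonclient` package and in-cluster execution, with `my_model` as a placeholder name:

```python
# Poll until a newly uploaded model is loaded, or give up after a timeout.
# Assumes tritonclient[grpc] is installed and this runs inside the AF
# cluster; the model name passed in is whatever you registered.
import time

TRITON_URL = "triton-traefik.triton.svc.cluster.local:8001"

def wait_for_model(name: str, timeout_s: float = 180.0) -> bool:
    """Return True once Triton reports the model ready, False on timeout."""
    import tritonclient.grpc as grpcclient  # lazy import: cluster-only use
    client = grpcclient.InferenceServerClient(url=TRITON_URL)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if client.is_model_ready(name):
            return True
        time.sleep(10)  # well under the 60 s repository poll interval
    return False
```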
Requesting Additional Features¶
Backend Support¶
Triton supports multiple machine learning frameworks and backends (TensorFlow, PyTorch, ONNX, TensorRT, etc.). If you need support for a backend that is not currently included in the UChicago Triton deployment, contact the AF administrators.
Required Information:
- Backend/framework name (e.g., TensorFlow, PyTorch, ONNX)
- Framework version required
- Model type and format
- Expected use case
The AF team can deploy additional backends as extensions to the existing Triton configuration.
GPU Pinning¶
If your inference workloads require running models on a specific GPU model or have special GPU requirements, you can request GPU affinity configuration.
Required Information:
- Preferred GPU model or node type
- Expected workload characteristics
- Resource requirements (GPU memory, compute)
- Performance requirements
The AF team can configure GPU affinity for your inference service to ensure optimal performance.
Additional Model Repositories¶
Additional model repositories can be mounted into the Triton pods through the Helm chart configuration used at UChicago.
The following directories are currently configured:
s3://https://s3.af.uchicago.edu:443/triton-d4363a43-23b5-4a13-836b-c98175f4ac41/models/cvmfs/atlas.cern.ch/repo/sw/database/GroupData/BTagging/20250213/
Please get in touch if you want additional paths included.
Support and Contact¶
For any questions, access requests, model configuration issues, or troubleshooting assistance, please see the Getting Help page for UChicago facility contact information.
See Also:
- Triton Deployment Documentation - Technical details about the infrastructure and deployment configuration
- Machine Learning Containers - For running ML training and other ML workloads