Triton Inference Server (UChicago)

Overview

The Triton Inference Server is deployed as a Kubernetes cluster service within the UChicago Analysis Facility. This service provides high-performance AI model inference with automatic scaling capabilities.

Key Information

  • Access: Internal to UChicago AF Kubernetes cluster only (not externally exposed)
  • Endpoint: triton-traefik.triton.svc.cluster.local:8001 (gRPC)
  • Version: Release 2.61.0 (NGC container 25.09)
  • GPU Resources: 1-3 pods, each pod allocated 1 GPU (see hardware specs)
  • CPU/Memory: Burstable (no minimum request, scales as needed)
  • Autoscaling: Managed by Horizontal Pod Autoscaler (HPA) based on average queue time metrics

The Triton service automatically scales the number of server instances based on workload demand, ensuring efficient resource utilization.

Accessing Triton

Access to the Triton Inference Server is currently restricted to services and workloads running inside the UChicago AF Kubernetes cluster. If you need to access Triton from your analysis workflows, you can connect to the load balancer endpoint within the cluster environment using gRPC.

Load Balancer Endpoint (gRPC):

triton-traefik.triton.svc.cluster.local:8001
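
From a pod or notebook running inside the cluster, you can talk to this endpoint with the Python Triton client (pip install tritonclient[grpc]). The following is a minimal sketch, assuming a hypothetical model named my-model with a single FP32 input INPUT0 of shape [1, 4] and an output OUTPUT0; substitute your own model and tensor names:

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the in-cluster load balancer endpoint (gRPC).
client = grpcclient.InferenceServerClient(
    url="triton-traefik.triton.svc.cluster.local:8001"
)

# Quick sanity check that the server is reachable from inside the AF cluster.
assert client.is_server_live()

# Hypothetical model and tensor names -- replace with your own.
infer_input = grpcclient.InferInput("INPUT0", [1, 4], "FP32")
infer_input.set_data_from_numpy(np.random.rand(1, 4).astype(np.float32))
infer_output = grpcclient.InferRequestedOutput("OUTPUT0")

result = client.infer(model_name="my-model", inputs=[infer_input], outputs=[infer_output])
print(result.as_numpy("OUTPUT0"))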

Model Repository Access

Triton can access models from two storage options at the UChicago Analysis Facility:

  1. CVMFS - For existing production ML models already stored in CVMFS (via Kubernetes hostPath mount)
  2. S3 Storage - For uploading new models to https://s3.af.uchicago.edu

Using Models from CVMFS

If your production ML models are already stored in CVMFS, Triton can access them directly through a Kubernetes hostPath mount. This enables you to deploy existing models without needing to copy them to S3.

To use CVMFS models, contact the AF administrators with:

  • Path to your model(s) in CVMFS
  • Model name and type
  • Backend requirements (TensorFlow, PyTorch, ONNX, etc.)
  • Expected duration/timeframe for model usage

Uploading Your Models to S3

To upload and deploy your machine learning models on Triton, follow these steps:

1. Request Access and Credentials

Contact the UChicago AF administrators to request access to the S3 model repository.

Include in your request:

  • Your UChicago AF username
  • Brief description of your models and use case
  • Expected storage requirements

2. Create Your Model Directory

Once approved, you'll receive S3 credentials. Create a subdirectory in the model repository using your AF username:

s3://triton-models/<your-username>/

This keeps your models organized and separates them from other users' models. Within this subdirectory, each model must follow Triton's standard repository layout: typically a config.pbtxt at the model root and the model file(s) in a numbered version subdirectory (see the upload sketch after the commands below).

3. Upload Models

Upload your models to your directory using any S3-compatible client, for example the AWS CLI, s3cmd, or the MinIO client mc (the mc example assumes an alias named s3 configured for the AF endpoint):

aws s3 cp /path/to/your/model s3://triton-models/<your-username>/model-name/ --recursive
s3cmd put /path/to/your/model s3://triton-models/<your-username>/model-name/ --recursive
mc cp --recursive /path/to/your/model s3/triton-models/<your-username>/model-name/
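
If you prefer a programmatic upload, the same can be done with boto3. This is a sketch assuming an ONNX model and the credentials issued in step 1; the file and key names below are illustrative and follow Triton's standard layout (config.pbtxt at the model root, model files in a numbered version directory):

import boto3

# Endpoint and credentials provided by the AF administrators (step 1).
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.af.uchicago.edu",
    aws_access_key_id="<your-access-key>",
    aws_secret_access_key="<your-secret-key>",
)

bucket = "triton-models"
prefix = "<your-username>/model-name"

# config.pbtxt sits at the model root; model files go in version "1".
s3.upload_file("config.pbtxt", bucket, f"{prefix}/config.pbtxt")
s3.upload_file("model.onnx", bucket, f"{prefix}/1/model.onnx")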

4. Request Model Activation

After uploading your models, contact the AF administrators to have your models added to the Triton server configuration.

Include:

  • Your username
  • Model directory path
  • Model name and type
  • Any specific backend requirements (see below)
  • Expected duration/timeframe for model usage

The Triton server polls the model repository every 60 seconds, so once configured, your models should become available automatically.
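
Because the repository is re-scanned roughly once a minute, a short wait loop can confirm that your model has been picked up. Here is a sketch using the same Python client as above (the model name is a placeholder):

import time
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(
    url="triton-traefik.triton.svc.cluster.local:8001"
)

# Poll for up to ~5 minutes; the server re-scans the repository every
# 60 seconds, so a newly configured model may take a moment to appear.
for _ in range(10):
    if client.is_model_ready("my-model"):
        print("model is ready")
        break
    time.sleep(30)
else:
    print("model not ready yet -- check with the AF administrators")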

Requesting Additional Features

Backend Support

Triton supports multiple machine learning frameworks and backends (TensorFlow, PyTorch, ONNX, TensorRT, etc.). If you need support for a backend that is not currently included in the UChicago Triton deployment, contact the AF administrators.

Required Information:

  • Backend/framework name (e.g., TensorFlow, PyTorch, ONNX)
  • Framework version required
  • Model type and format
  • Expected use case

The AF team can deploy additional backends as extensions to the existing Triton configuration.

GPU Pinning

If your inference workloads must run on a specific GPU model, or have other special GPU requirements, you can request a GPU affinity configuration.

Required Information:

  • Preferred GPU model or node type
  • Expected workload characteristics
  • Resource requirements (GPU memory, compute)
  • Performance requirements

The AF team can configure GPU affinity for your inference service to ensure optimal performance.

Additional Model Repositories

We can mount additional model repositories into the Triton pods as part of the Helm chart configuration that UChicago uses.

The following repository paths are currently configured (the first uses Triton's S3 path syntax, in which s3:// is followed by the full HTTPS endpoint and bucket):

  • s3://https://s3.af.uchicago.edu:443/triton-d4363a43-23b5-4a13-836b-c98175f4ac41/models
  • /cvmfs/atlas.cern.ch/repo/sw/database/GroupData/BTagging/20250213/

Please get in touch if you want additional paths included.

Support and Contact

For any questions, access requests, model configuration issues, or troubleshooting assistance, please see the Getting Help page for UChicago facility contact information.

