Deploy Your Model on Kubernetes¶
Prerequisites¶
- MinIO must be running and contain a bucket named `mlflow`. If you haven't set it up, see MinIO Deployment for instructions.
- A Ray Tune job must have completed and logged the best model to MLflow, which was stored in MinIO. If you haven’t done this yet, refer to the following guides:
- MLflow Deployment: Set up the MLflow tracking server and configure MinIO as the artifact store.
- Ray Deployment: Deploy a Ray cluster on Kubernetes.
- Ray Tune Integration Guide: Learn how to integrate Ray Tune with MLflow, Optuna, imbalanced-learn, XGBoost, and MinIO.
- Ray Tune Job Submission: Run the tuning job and log the best model to MLflow.
- gRPCurl installed (used later to test the gRPC endpoint).
This guide walks you through how to deploy the model you previously trained with Ray and logged to MinIO via MLflow. You'll learn how to serve it using KServe with both REST and gRPC endpoints, and enable autoscaling—including scale-to-zero support.
Grant KServe Permission to Load Models from MinIO¶
To allow KServe to pull models from MinIO (or any S3-compatible storage), you'll need to provide access credentials via a Kubernetes `Secret` and bind it to a `ServiceAccount`. Then, reference that service account in your `InferenceService`.
Start by creating a `Secret` that holds your S3 credentials. It should include your access key and secret key, and be annotated with the MinIO endpoint and, for a local or test environment, a setting that disables HTTPS. These values should match what you specified when deploying MinIO on Kubernetes; for more details, refer to the reference configuration below or revisit this article. A sketch of such a Secret is shown right after the reference configuration.
Info
apiVersion: apps/v1
kind: Deployment
metadata:
name: minio
namespace: minio
spec:
replicas: 1
selector:
matchLabels:
app: minio
template:
metadata:
labels:
app: minio
spec:
containers:
- name: minio
image: minio/minio
args:
- server
- /data
- --console-address
- :9001
env:
- name: MINIO_ROOT_USER
value: minio_user
- name: MINIO_ROOT_PASSWORD
value: minio_password
ports:
- containerPort: 9000
protocol: TCP
- containerPort: 9001
protocol: TCP
livenessProbe:
httpGet:
path: /minio/health/live
port: 9000
initialDelaySeconds: 30
periodSeconds: 20
timeoutSeconds: 15
failureThreshold: 6
readinessProbe:
httpGet:
path: /minio/health/ready
port: 9000
initialDelaySeconds: 15
periodSeconds: 10
timeoutSeconds: 10
failureThreshold: 3
volumeMounts:
- name: storage
mountPath: /data
volumes:
- name: storage
hostPath:
path: /home/docker/data/minio
type: DirectoryOrCreate
restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
name: minio-console
namespace: minio
spec:
selector:
app: minio
type: NodePort
ports:
- name: console
port: 9001
targetPort: 9001
nodePort: 30901
---
apiVersion: v1
kind: Service
metadata:
name: minio-api
namespace: minio
spec:
selector:
app: minio
type: ClusterIP
ports:
- name: api
port: 9000
targetPort: 9000
---
apiVersion: batch/v1
kind: Job
metadata:
name: minio-create-bucket
namespace: minio
spec:
backoffLimit: 6
completions: 1
template:
metadata:
labels:
job: minio-create-bucket
spec:
initContainers:
- name: wait-for-minio
image: busybox
command:
- sh
- -c
- |
until nc -z minio-api.minio.svc.cluster.local 9000; do
echo "Waiting for MinIO..."
sleep 2
done
echo "MinIO is ready!"
containers:
- name: minio-create-buckets
image: minio/mc
command:
- sh
- -c
- |
mc alias set minio http://minio-api.minio.svc.cluster.local:9000 minio_user minio_password &&
for bucket in mlflow dbt sqlmesh ray; do
if ! mc ls minio/$bucket >/dev/null 2>&1; then
echo "Creating bucket: $bucket"
mc mb minio/$bucket
echo "Bucket created: $bucket"
else
echo "Bucket already exists: $bucket"
fi
done
restartPolicy: OnFailure
terminationGracePeriodSeconds: 30
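Below is a minimal sketch of such a Secret, assuming the `minio_user`/`minio_password` credentials and the in-cluster `minio-api` endpoint from the configuration above; the Secret name `s3-credentials` is only an example.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials  # example name; referenced by the ServiceAccount below
  annotations:
    serving.kserve.io/s3-endpoint: minio-api.minio.svc.cluster.local:9000  # MinIO API endpoint
    serving.kserve.io/s3-usehttps: "0"  # disable HTTPS for a local/test setup
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: minio_user
  AWS_SECRET_ACCESS_KEY: minio_password
```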
Next, create a `ServiceAccount` that references the secret. This will allow KServe to inject the credentials when pulling models.
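A minimal sketch, assuming the Secret above is named `s3-credentials`; the name `sa` is what the InferenceService manifests below reference via `serviceAccountName`.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa                # referenced by serviceAccountName in the InferenceService
secrets:
  - name: s3-credentials  # the Secret holding the S3 credentials and annotations
```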
Finally, define an `InferenceService` that uses the `ServiceAccount` and points to the model artifact stored in MinIO. In this example, we deploy a model saved in MLflow format using the v2 inference protocol.
Deploy the Fraud Detection MLflow Model with InferenceService¶
This example shows how to deploy your trained MLflow model on KServe using both the REST and gRPC protocols. The `InferenceService` configuration specifies the model format (`mlflow`), the v2 inference protocol, and the S3 URI where the model is stored. The `serviceAccountName` field allows KServe to access the model stored in MinIO using the credentials provided earlier.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "fraud-detection-rest"
spec:
predictor:
minReplicas: 0
scaleTarget: 1
scaleMetric: qps
model:
modelFormat:
name: mlflow
protocolVersion: v2
storageUri: s3://miflow/2/5ccd7dcabc1f49c1bc45f1f94d945dd6/artifacts/model
serviceAccountName: sa
Apply the configuration using `kubectl`:
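For example, assuming the manifest above is saved as `fraud-detection-rest.yaml` (the filename is up to you):

```bash
kubectl apply -f fraud-detection-rest.yaml
```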
Once deployed, KServe will expose a REST endpoint where you can send inference requests. You can verify the service status using:
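For example, in the namespace you deployed to (here the default namespace); the service is ready once `READY` is `True` and a URL is shown:

```bash
kubectl get inferenceservice fraud-detection-rest
```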
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "fraud-detection-grpc"
spec:
predictor:
minReplicas: 0
scaleTarget: 1
scaleMetric: qps
model:
modelFormat:
name: mlflow
protocolVersion: v2
storageUri: s3://miflow/2/5ccd7dcabc1f49c1bc45f1f94d945dd6/artifacts/model
ports:
- containerPort: 9000
name: h2c
protocol: TCP
serviceAccountName: sa
Apply the configuration using `kubectl`:
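Again assuming a local filename of your choice, for example `fraud-detection-grpc.yaml`:

```bash
kubectl apply -f fraud-detection-grpc.yaml
```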
Once deployed, KServe will expose a gRPC endpoint where you can send inference requests. You can verify the service status using:
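As before, check that the service reports `READY` as `True`:

```bash
kubectl get inferenceservice fraud-detection-grpc
```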
Make sure the `storageUri` matches the path where your model is saved.
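If you are unsure of the exact path, one way to check it (assuming the MinIO credentials and the `minio-api` Service from the deployment above) is to browse the bucket with the MinIO client; the experiment and run IDs in your URI will differ:

```bash
# Reach the in-cluster MinIO API locally
kubectl -n minio port-forward svc/minio-api 9000:9000 &
# Point the MinIO client at it and browse the mlflow bucket
mc alias set minio http://localhost:9000 minio_user minio_password
mc ls --recursive minio/mlflow
```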
During deployment, you may encounter a few common issues:
- The model fails to load inside the pod during the storage initialization phase. This is usually a permission issue: make sure your access credentials are correctly configured as shown in the section above.
- Sometimes the model loads successfully into the model server, but inference requests still fail. This could be due to:
  - A mismatch between the model version and the model server runtime. In this case, try explicitly setting the `runtimeVersion`.
  - Incorrect port settings, which prevent the server from responding properly.
  - An architecture mismatch, for example if you trained the model on a Mac (ARM64) but are using an x86-based KServe runtime image.
  - Deployment in a control plane namespace. Namespaces labeled with `control-plane` are skipped by KServe's webhook to avoid privilege escalation. This prevents the storage initializer from being injected into the pod, leading to errors like `No such file or directory: '/mnt/models'`.

The debugging commands sketched below can help pinpoint which of these you are hitting.
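A minimal debugging sketch, assuming the REST service above in the default namespace and KServe's usual pod label (`serving.kserve.io/inferenceservice`) and init container name (`storage-initializer`):

```bash
# Overall status and the reason for any failure
kubectl get inferenceservice fraud-detection-rest
kubectl describe inferenceservice fraud-detection-rest

# Inspect the predictor pod: events, the storage-initializer init container
# (which downloads the model from MinIO), and the model server itself
POD=$(kubectl get pods -l serving.kserve.io/inferenceservice=fraud-detection-rest \
  -o jsonpath='{.items[0].metadata.name}')
kubectl describe pod "$POD"
kubectl logs "$POD" -c storage-initializer
kubectl logs "$POD" -c kserve-container
```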
Test the Endpoints¶
Determine the ingress IP and port:
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
This step retrieves the external IP address of the Istio Ingress Gateway and stores it in `INGRESS_HOST`, and extracts the port named `http2` to set as `INGRESS_PORT`, allowing you to construct the full service endpoint for sending inference requests.
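If your cluster has no external load balancer (common for local clusters), the first command returns an empty `INGRESS_HOST`. One workaround, assuming a default Istio install with the gateway service named `istio-ingressgateway`, is to port-forward the gateway and target it locally:

```bash
# Forward local port 8080 to the ingress gateway's HTTP port
kubectl -n istio-system port-forward svc/istio-ingressgateway 8080:80 &
export INGRESS_HOST=localhost
export INGRESS_PORT=8080
```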
Set the required environment variables for the HTTP inference request:
export INPUT_PATH=input-example-rest-v2.json
export SERVICE_HOSTNAME=$(kubectl get inferenceservice fraud-detection-rest -o jsonpath='{.status.url}' | cut -d "/" -f 3)
input-example-rest-v2.json
{
"parameters": {
"content_type": "pd"
},
"inputs": [
{
"name": "has_fraud_7d",
"shape": [2],
"datatype": "BOOL",
"data": [false, false]
},
{
"name": "num_transactions_7d",
"shape": [2],
"datatype": "INT64",
"data": [7, 6]
},
{
"name": "account_age_days",
"shape": [2],
"datatype": "INT64",
"data": [655, 236]
},
{
"name": "credit_score",
"shape": [2],
"datatype": "INT64",
"data": [680, 737]
},
{
"name": "has_2fa_installed",
"shape": [2],
"datatype": "BOOL",
"data": [true, true]
}
]
}
# -v Enable verbose output for debugging
# -H "Host:..." Set the Host header to route through the ingress gateway
# -H "Content-Type:..." Specify the request content type as JSON
# -d @... Provide the input payload from the specified JSON file
# http://${INGRESS_HOST}... Target the model's inference endpoint
curl -v \
-H "Host: ${SERVICE_HOSTNAME}" \
-H "Content-Type: application/json" \
-d @${INPUT_PATH} \
http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/fraud-detection-rest/infer
Expected Output
* Trying 127.0.0.1:80...
* Connected to 127.0.0.1 (127.0.0.1) port 80
> POST /v2/models/mlflow-apple-demand/infer HTTP/1.1
> Host: mlflow-apple-demand.default.127.0.0.1.sslip.io
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 1089
>
* upload completely sent off: 1089 bytes
< HTTP/1.1 200 OK
< ce-endpoint: mlflow-apple-demand
< ce-id: 9ddc841e-a8d4-405f-a7e4-73f7aa9bab09
< ce-inferenceservicename: mlserver
< ce-modelid: mlflow-apple-demand
< ce-namespace: default
< ce-requestid: 9ddc841e-a8d4-405f-a7e4-73f7aa9bab09
< ce-source: io.seldon.serving.deployment.mlserver.default
< ce-specversion: 0.3
< ce-type: io.seldon.serving.inference.response
< content-length: 240
< content-type: application/json
< date: Fri, 02 May 2025 04:06:58 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 247
<
* Connection #0 to host 127.0.0.1 left intact
{"model_name":"mlflow-apple-demand","id":"9ddc841e-a8d4-405f-a7e4-73f7aa9bab09","parameters":{"content_type":"np"},"outputs":[{"name":"output-1","shape":[1,1],"datatype":"FP32","parameters":{"content_type":"np"},"data":[1486.56298828125]}]}
Download the `open_inference_grpc.proto` file:
curl -O https://raw.githubusercontent.com/kserve/open-inference-protocol/main/specification/protocol/open_inference_grpc.proto
Downloading this `.proto` file gives you the standard gRPC interface definition for the Open Inference Protocol. It allows your client or server to communicate with ML model servers using a unified API.
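As a quick sanity check, you can inspect the downloaded schema locally with `grpcurl` before talking to the server (no connection is needed when a proto source file is supplied):

```bash
# List the services defined in the proto file
grpcurl -proto open_inference_grpc.proto list

# Show the methods of the inference service, including ModelInfer used below
grpcurl -proto open_inference_grpc.proto describe inference.GRPCInferenceService
```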
Set the required environment variables for the gRPC inference request:
export PROTO_FILE=open_inference_grpc.proto
export INPUT_PATH=input-example-grpc-v2.json
export SERVICE_HOSTNAME=$(kubectl get inferenceservice fraud-detection-grpc -o jsonpath='{.status.url}' | cut -d "/" -f 3)
These variables specify the protobuf schema for gRPC, the input payload to send, and the target hostname for routing the request through the ingress gateway.
input-example-grpc-v2.json
{
"model_name": "fraud-detection-grpc",
"inputs": [
{
"name": "has_fraud_7d",
"shape": [2],
"datatype": "BOOL",
"contents": {
"bool_contents": [false, false]
}
},
{
"name": "num_transactions_7d",
"shape": [2],
"datatype": "INT64",
"contents": {
"int64_contents": [7, 6]
}
},
{
"name": "account_age_days",
"shape": [2],
"datatype": "INT64",
"contents": {
"int64_contents": [655, 236]
}
},
{
"name": "credit_score",
"shape": [2],
"datatype": "INT64",
"contents": {
"int64_contents": [680, 737]
}
},
{
"name": "has_2fa_installed",
"shape": [2],
"datatype": "BOOL",
"contents": {
"bool_contents": [true, true]
}
}
]
}
# -vv Verbose output for debugging
# -plaintext Use plaintext (non-TLS) connection
# -proto Path to the .proto file describing the gRPC service
# -authority Sets the HTTP/2 authority header (useful for ingress routing)
# -d Read the request body from stdin
# ${INGRESS_HOST}... Target host and port of the gRPC server
# inference.GRPC... Fully-qualified gRPC method to call
# <<<... Provide JSON request body from file as stdin
grpcurl -vv \
-plaintext \
-proto ${PROTO_FILE} \
-authority ${SERVICE_HOSTNAME} \
-d @ \
${INGRESS_HOST}:${INGRESS_PORT} \
inference.GRPCInferenceService.ModelInfer \
<<< $(cat "$INPUT_PATH")