Deploy Your Model on Kubernetes

Prerequisites

  • MinIO must be running and contain a bucket named mlflow.
  • A Ray Tune job must have completed and logged the best model to MLflow, with the artifact stored in MinIO. If you haven’t done this yet, refer to the earlier guides on training with Ray Tune and tracking models with MLflow.
  • gRPCurl installed
    brew install grpcurl
    

This guide walks you through how to deploy the model you previously trained with Ray and logged to MinIO via MLflow. You'll learn how to serve it using KServe with both REST and gRPC endpoints, and enable autoscaling—including scale-to-zero support.

Grant KServe Permission to Load Models from MinIO

To allow KServe to pull models from MinIO (or any S3-compatible storage), you'll need to provide access credentials via a Kubernetes Secret and bind it to a ServiceAccount. Then, reference that service account in your InferenceService.

Start by creating a Secret that holds your S3 credentials. This secret should include your access key and secret key, and be annotated with the MinIO endpoint and settings to disable HTTPS if you're working in a local or test environment.

secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: s3creds
  annotations:
     serving.kserve.io/s3-endpoint: minio-api.minio.svc.cluster.local:9000
     serving.kserve.io/s3-usehttps: "0" # by default 1, if testing with minio you can set to 0
type: Opaque
stringData: # This is for raw credential string. For base64 encoded string, use `data` instead
  AWS_ACCESS_KEY_ID: minio_user
  AWS_SECRET_ACCESS_KEY: minio_password

These values should match what you specified when deploying MinIO on Kubernetes. For more details, refer to the configuration section below or revisit this article.

Info
minio.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: minio
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio
          args:
            - server
            - /data
            - --console-address
            - :9001
          env:
            - name: MINIO_ROOT_USER
              value: minio_user
            - name: MINIO_ROOT_PASSWORD
              value: minio_password
          ports:
            - containerPort: 9000
              protocol: TCP
            - containerPort: 9001
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /minio/health/live
              port: 9000
            initialDelaySeconds: 30
            periodSeconds: 20
            timeoutSeconds: 15
            failureThreshold: 6
          readinessProbe:
            httpGet:
              path: /minio/health/ready
              port: 9000
            initialDelaySeconds: 15
            periodSeconds: 10
            timeoutSeconds: 10
            failureThreshold: 3
          volumeMounts:
            - name: storage
              mountPath: /data
      volumes:
        - name: storage
          hostPath:
            path: /home/docker/data/minio
            type: DirectoryOrCreate
      restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: minio-console
  namespace: minio
spec:
  selector:
    app: minio
  type: NodePort
  ports:
    - name: console
      port: 9001
      targetPort: 9001
      nodePort: 30901
---
apiVersion: v1
kind: Service
metadata:
  name: minio-api
  namespace: minio
spec:
  selector:
    app: minio
  type: ClusterIP
  ports:
    - name: api
      port: 9000
      targetPort: 9000
---
apiVersion: batch/v1
kind: Job
metadata:
  name: minio-create-bucket
  namespace: minio
spec:
  backoffLimit: 6
  completions: 1
  template:
    metadata:
      labels:
        job: minio-create-bucket
    spec:
      initContainers:
        - name: wait-for-minio
          image: busybox
          command:
            - sh
            - -c
            - |
              until nc -z minio-api.minio.svc.cluster.local 9000; do
                echo "Waiting for MinIO..."
                sleep 2
              done
              echo "MinIO is ready!"
      containers:
        - name: minio-create-buckets
          image: minio/mc
          command:
            - sh
            - -c
            - |
              mc alias set minio http://minio-api.minio.svc.cluster.local:9000 minio_user minio_password &&
              for bucket in mlflow dbt sqlmesh ray; do
                if ! mc ls minio/$bucket >/dev/null 2>&1; then
                  echo "Creating bucket: $bucket"
                  mc mb minio/$bucket
                  echo "Bucket created: $bucket"
                else
                  echo "Bucket already exists: $bucket"
                fi
              done
      restartPolicy: OnFailure
      terminationGracePeriodSeconds: 30

Next, create a ServiceAccount that references the secret. This will allow KServe to inject the credentials when pulling models.

sa.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa
secrets:
- name: s3creds
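
Apply both manifests before creating the InferenceService. A minimal sketch, assuming you saved them under the file names shown above and are working in the default namespace:

# Create the credentials Secret and the ServiceAccount that references it
kubectl apply -f secret.yaml
kubectl apply -f sa.yaml

# Confirm the secret is listed on the service account
kubectl get serviceaccount sa -o yaml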

Finally, define an InferenceService that uses the ServiceAccount and points to the model artifact stored in MinIO. In this example, we are deploying a model saved in MLflow format using the v2 inference protocol.

inference-service-rest-v2.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fraud-detection-rest"
spec:
  predictor:
    minReplicas: 0
    scaleTarget: 1
    scaleMetric: qps
    model:
      modelFormat:
        name: mlflow
      protocolVersion: v2
      storageUri: s3://mlflow/2/5ccd7dcabc1f49c1bc45f1f94d945dd6/artifacts/model
    serviceAccountName: sa
inference-service-grpc-v2.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fraud-detection-grpc"
spec:
  predictor:
    minReplicas: 0
    scaleTarget: 1
    scaleMetric: qps
    model:
      modelFormat:
        name: mlflow
      protocolVersion: v2
      storageUri: s3://mlflow/2/5ccd7dcabc1f49c1bc45f1f94d945dd6/artifacts/model
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
    serviceAccountName: sa

Deploy the Fraud Detection MLflow Model with InferenceService

This example shows how to deploy your trained MLflow model on KServe using both the REST and gRPC protocols. The InferenceService configuration specifies the model format (mlflow), the v2 inference protocol, and the S3 URI where the model is stored. The serviceAccountName field allows KServe to access the model stored in MinIO using the credentials provided earlier.

inference-service-rest-v2.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fraud-detection-rest"
spec:
  predictor:
    minReplicas: 0
    scaleTarget: 1
    scaleMetric: qps
    model:
      modelFormat:
        name: mlflow
      protocolVersion: v2
      storageUri: s3://mlflow/2/5ccd7dcabc1f49c1bc45f1f94d945dd6/artifacts/model
    serviceAccountName: sa

Apply the configuration using kubectl:

kubectl apply -f inference-service-rest-v2.yaml

Once deployed, KServe will expose a REST endpoint where you can send inference requests. You can verify the service status using:

kubectl get inferenceservice fraud-detection-rest
inference-service-grpc-v2.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "fraud-detection-grpc"
spec:
  predictor:
    minReplicas: 0
    scaleTarget: 1
    scaleMetric: qps
    model:
      modelFormat:
        name: mlflow
      protocolVersion: v2
      storageUri: s3://mlflow/2/5ccd7dcabc1f49c1bc45f1f94d945dd6/artifacts/model
      ports:
        - containerPort: 9000
          name: h2c
          protocol: TCP
    serviceAccountName: sa

Apply the configuration using kubectl:

kubectl apply -f inference-service-grpc-v2.yaml

Once deployed, KServe will expose a gRPC endpoint where you can send inference requests. You can verify the service status using:

kubectl get inferenceservice fraud-detection-grpc

Make sure the storageUri matches the path where your model is saved.
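
If you are not sure of the exact path, one way to verify it is to list the MLflow bucket with the MinIO client. This is only a sketch: it assumes you have mc installed locally and port-forward access to the minio-api service, and it reuses the example run ID from the manifests above.

# Forward the MinIO API to your machine (run in a separate terminal)
kubectl -n minio port-forward svc/minio-api 9000:9000

# Point an mc alias at the forwarded endpoint, using the credentials from minio.yaml
mc alias set local http://localhost:9000 minio_user minio_password

# List the run's artifacts; the model directory is what storageUri should point to
mc ls --recursive local/mlflow/2/5ccd7dcabc1f49c1bc45f1f94d945dd6/artifacts/model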

During deployment, you may encounter a few common issues; the debugging commands after this list can help pinpoint which one you are hitting:

  1. The model fails to load inside the pod during the storage initialization phase. This is usually a permission issue—make sure your access credentials are correctly configured as shown in the section above.
  2. Sometimes the model loads successfully into the model server, but inference requests still fail. This could be due to:
    • A mismatch between the model version and the model server runtime. In this case, try explicitly setting the runtimeVersion.
    • Incorrect port settings, which prevent the server from responding properly.
    • Architecture mismatch—for example, if you trained the model on a Mac (ARM64) but are using an x86-based KServe runtime image.
    • Deployment in a control plane namespace. Namespaces labeled with control-plane are skipped by KServe’s webhook to avoid privilege escalation. This prevents the storage initializer from being injected into the pod, leading to errors like: No such file or directory: '/mnt/models'.
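
To narrow down which of these you are hitting, inspect the InferenceService events and the predictor pod's containers. A rough sketch, assuming the default namespace: KServe labels predictor pods with serving.kserve.io/inferenceservice, injects the model download step as an init container named storage-initializer, and runs the model server in a container named kserve-container.

# Status and recent events for the InferenceService
kubectl describe inferenceservice fraud-detection-rest

# Find the predictor pod(s) behind the service
kubectl get pods -l serving.kserve.io/inferenceservice=fraud-detection-rest

# Logs from the model download step (credential and storageUri problems show up here)
kubectl logs <predictor-pod-name> -c storage-initializer

# Logs from the model server itself (runtime and port problems show up here)
kubectl logs <predictor-pod-name> -c kserve-container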

Test the Endpoints

Determine the ingress IP and port:

export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

This step retrieves the external IP address of the Istio Ingress Gateway and stores it in INGRESS_HOST, and extracts the port named http2 to set as INGRESS_PORT, allowing you to construct the full service endpoint for sending inference requests.
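
If your cluster has no external load balancer (for example, a local kind or minikube setup), the first command may return an empty INGRESS_HOST. One workable alternative, assuming you can port-forward to the gateway, is:

# Forward the Istio ingress gateway to localhost (run in a separate terminal)
kubectl -n istio-system port-forward svc/istio-ingressgateway 8080:80

export INGRESS_HOST=localhost
export INGRESS_PORT=8080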

Set the required environment variables for the HTTP inference request:

export INPUT_PATH=input-example-rest-v2.json
export SERVICE_HOSTNAME=$(kubectl get inferenceservice fraud-detection-rest -o jsonpath='{.status.url}' | cut -d "/" -f 3)
input-example-rest-v2.json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "has_fraud_7d",
      "shape": [2],
      "datatype": "BOOL",
      "data": [false, false]
    },
    {
      "name": "num_transactions_7d",
      "shape": [2],
      "datatype": "INT64",
      "data": [7, 6]
    },
    {
      "name": "account_age_days",
      "shape": [2],
      "datatype": "INT64",
      "data": [655, 236]
    },
    {
      "name": "credit_score",
      "shape": [2],
      "datatype": "INT64",
      "data": [680, 737]
    },
    {
      "name": "has_2fa_installed",
      "shape": [2],
      "datatype": "BOOL",
      "data": [true, true]
    }
  ]
}
test-commands.txt
# -v                       Enable verbose output for debugging
# -H "Host:..."            Set the Host header to route through the ingress gateway
# -H "Content-Type:..."    Specify the request content type as JSON
# -d @...                  Provide the input payload from the specified JSON file
# http://${INGRESS_HOST}... Target the model's inference endpoint
curl -v \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d @${INPUT_PATH} \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/fraud-detection-rest/infer

Expected Output

*   Trying 127.0.0.1:80...
* Connected to 127.0.0.1 (127.0.0.1) port 80
> POST /v2/models/mlflow-apple-demand/infer HTTP/1.1
> Host: mlflow-apple-demand.default.127.0.0.1.sslip.io
> User-Agent: curl/8.7.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 1089
> 
* upload completely sent off: 1089 bytes
< HTTP/1.1 200 OK
< ce-endpoint: mlflow-apple-demand
< ce-id: 9ddc841e-a8d4-405f-a7e4-73f7aa9bab09
< ce-inferenceservicename: mlserver
< ce-modelid: mlflow-apple-demand
< ce-namespace: default
< ce-requestid: 9ddc841e-a8d4-405f-a7e4-73f7aa9bab09
< ce-source: io.seldon.serving.deployment.mlserver.default
< ce-specversion: 0.3
< ce-type: io.seldon.serving.inference.response
< content-length: 240
< content-type: application/json
< date: Fri, 02 May 2025 04:06:58 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 247
< 
* Connection #0 to host 127.0.0.1 left intact
{"model_name":"mlflow-apple-demand","id":"9ddc841e-a8d4-405f-a7e4-73f7aa9bab09","parameters":{"content_type":"np"},"outputs":[{"name":"output-1","shape":[1,1],"datatype":"FP32","parameters":{"content_type":"np"},"data":[1486.56298828125]}]}

Download the open_inference_grpc.proto file:

curl -O https://raw.githubusercontent.com/kserve/open-inference-protocol/main/specification/protocol/open_inference_grpc.proto
open_inference_grpc.proto
// Copyright 2023 The KServe Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//    http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

syntax = "proto3";
package inference;

// Inference Server GRPC endpoints.
service GRPCInferenceService
{
  // The ServerLive API indicates if the inference server is able to receive 
  // and respond to metadata and inference requests.
  rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {}

  // The ServerReady API indicates if the server is ready for inferencing.
  rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {}

  // The ModelReady API indicates if a specific model is ready for inferencing.
  rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {}

  // The ServerMetadata API provides information about the server. Errors are 
  // indicated by the google.rpc.Status returned for the request. The OK code 
  // indicates success and other codes indicate failure.
  rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {}

  // The per-model metadata API provides information about a model. Errors are 
  // indicated by the google.rpc.Status returned for the request. The OK code 
  // indicates success and other codes indicate failure.
  rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {}

  // The ModelInfer API performs inference using the specified model. Errors are
  // indicated by the google.rpc.Status returned for the request. The OK code 
  // indicates success and other codes indicate failure.
  rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
}

message ServerLiveRequest {}

message ServerLiveResponse
{
  // True if the inference server is live, false if not live.
  bool live = 1;
}

message ServerReadyRequest {}

message ServerReadyResponse
{
  // True if the inference server is ready, false if not ready.
  bool ready = 1;
}

message ModelReadyRequest
{
  // The name of the model to check for readiness.
  string name = 1;

  // The version of the model to check for readiness. If not given the
  // server will choose a version based on the model and internal policy.
  string version = 2;
}

message ModelReadyResponse
{
  // True if the model is ready, false if not ready.
  bool ready = 1;
}

message ServerMetadataRequest {}

message ServerMetadataResponse
{
  // The server name.
  string name = 1;

  // The server version.
  string version = 2;

  // The extensions supported by the server.
  repeated string extensions = 3;
}

message ModelMetadataRequest
{
  // The name of the model.
  string name = 1;

  // The version of the model to check for readiness. If not given the
  // server will choose a version based on the model and internal policy.
  string version = 2;
}

message ModelMetadataResponse
{
  // Metadata for a tensor.
  message TensorMetadata
  {
    // The tensor name.
    string name = 1;

    // The tensor data type.
    string datatype = 2;

    // The tensor shape. A variable-size dimension is represented
    // by a -1 value.
    repeated int64 shape = 3;
  }

  // The model name.
  string name = 1;

  // The versions of the model available on the server.
  repeated string versions = 2;

  // The model's platform. See Platforms.
  string platform = 3;

  // The model's inputs.
  repeated TensorMetadata inputs = 4;

  // The model's outputs.
  repeated TensorMetadata outputs = 5;

  // Optional Model Properties
  map<string, string> properties = 6;
}

message ModelInferRequest
{
  // An input tensor for an inference request.
  message InferInputTensor
  {
    // The tensor name.
    string name = 1;

    // The tensor data type.
    string datatype = 2;

    // The tensor shape.
    repeated int64 shape = 3;

    // Optional inference input tensor parameters.
    map<string, InferParameter> parameters = 4;

    // The tensor contents using a data-type format. This field must
    // not be specified if "raw" tensor contents are being used for
    // the inference request.
    InferTensorContents contents = 5;
  }

  // An output tensor requested for an inference request.
  message InferRequestedOutputTensor
  {
    // The tensor name.
    string name = 1;

    // Optional requested output tensor parameters.
    map<string, InferParameter> parameters = 2;
  }

  // The name of the model to use for inferencing.
  string model_name = 1;

  // The version of the model to use for inference. If not given the
  // server will choose a version based on the model and internal policy.
  string model_version = 2;

  // Optional identifier for the request. If specified will be
  // returned in the response.
  string id = 3;

  // Optional inference parameters.
  map<string, InferParameter> parameters = 4;

  // The input tensors for the inference.
  repeated InferInputTensor inputs = 5;

  // The requested output tensors for the inference. Optional, if not
  // specified all outputs produced by the model will be returned.
  repeated InferRequestedOutputTensor outputs = 6;

  // The data contained in an input tensor can be represented in "raw"
  // bytes form or in the repeated type that matches the tensor's data
  // type. To use the raw representation 'raw_input_contents' must be
  // initialized with data for each tensor in the same order as
  // 'inputs'. For each tensor, the size of this content must match
  // what is expected by the tensor's shape and data type. The raw
  // data must be the flattened, one-dimensional, row-major order of
  // the tensor elements without any stride or padding between the
  // elements. Note that the FP16 and BF16 data types must be represented as
  // raw content as there is no specific data type for a 16-bit float type.
  //
  // If this field is specified then InferInputTensor::contents must
  // not be specified for any input tensor.
  repeated bytes raw_input_contents = 7;
}

message ModelInferResponse
{
  // An output tensor returned for an inference request.
  message InferOutputTensor
  {
    // The tensor name.
    string name = 1;

    // The tensor data type.
    string datatype = 2;

    // The tensor shape.
    repeated int64 shape = 3;

    // Optional output tensor parameters.
    map<string, InferParameter> parameters = 4;

    // The tensor contents using a data-type format. This field must
    // not be specified if "raw" tensor contents are being used for
    // the inference response.
    InferTensorContents contents = 5;
  }

  // The name of the model used for inference.
  string model_name = 1;

  // The version of the model used for inference.
  string model_version = 2;

  // The id of the inference request if one was specified.
  string id = 3;

  // Optional inference response parameters.
  map<string, InferParameter> parameters = 4;

  // The output tensors holding inference results.
  repeated InferOutputTensor outputs = 5;

  // The data contained in an output tensor can be represented in
  // "raw" bytes form or in the repeated type that matches the
  // tensor's data type. To use the raw representation 'raw_output_contents'
  // must be initialized with data for each tensor in the same order as
  // 'outputs'. For each tensor, the size of this content must match
  // what is expected by the tensor's shape and data type. The raw
  // data must be the flattened, one-dimensional, row-major order of
  // the tensor elements without any stride or padding between the
  // elements. Note that the FP16 and BF16 data types must be represented as
  // raw content as there is no specific data type for a 16-bit float type.
  //
  // If this field is specified then InferOutputTensor::contents must
  // not be specified for any output tensor.
  repeated bytes raw_output_contents = 6;
}

// An inference parameter value. The Parameters message describes a 
// “name”/”value” pair, where the “name” is the name of the parameter
// and the “value” is a boolean, integer, or string corresponding to 
// the parameter.
message InferParameter
{
  // The parameter value can be a string, an int64, a boolean
  // or a message specific to a predefined parameter.
  oneof parameter_choice
  {
    // A boolean parameter value.
    bool bool_param = 1;

    // An int64 parameter value.
    int64 int64_param = 2;

    // A string parameter value.
    string string_param = 3;

    // A double parameter value.
    double double_param = 4;

    // A uint64 parameter value.
    uint64 uint64_param = 5;
  }
}

// The data contained in a tensor represented by the repeated type
// that matches the tensor's data type. Protobuf oneof is not used
// because oneofs cannot contain repeated fields.
message InferTensorContents
{
  // Representation for BOOL data type. The size must match what is
  // expected by the tensor's shape. The contents must be the flattened,
  // one-dimensional, row-major order of the tensor elements.
  repeated bool bool_contents = 1;

  // Representation for INT8, INT16, and INT32 data types. The size
  // must match what is expected by the tensor's shape. The contents
  // must be the flattened, one-dimensional, row-major order of the
  // tensor elements.
  repeated int32 int_contents = 2;

  // Representation for INT64 data types. The size must match what
  // is expected by the tensor's shape. The contents must be the
  // flattened, one-dimensional, row-major order of the tensor elements.
  repeated int64 int64_contents = 3;

  // Representation for UINT8, UINT16, and UINT32 data types. The size
  // must match what is expected by the tensor's shape. The contents
  // must be the flattened, one-dimensional, row-major order of the
  // tensor elements.
  repeated uint32 uint_contents = 4;

  // Representation for UINT64 data types. The size must match what
  // is expected by the tensor's shape. The contents must be the
  // flattened, one-dimensional, row-major order of the tensor elements.
  repeated uint64 uint64_contents = 5;

  // Representation for FP32 data type. The size must match what is
  // expected by the tensor's shape. The contents must be the flattened,
  // one-dimensional, row-major order of the tensor elements.
  repeated float fp32_contents = 6;

  // Representation for FP64 data type. The size must match what is
  // expected by the tensor's shape. The contents must be the flattened,
  // one-dimensional, row-major order of the tensor elements.
  repeated double fp64_contents = 7;

  // Representation for BYTES data type. The size must match what is
  // expected by the tensor's shape. The contents must be the flattened,
  // one-dimensional, row-major order of the tensor elements.
  repeated bytes bytes_contents = 8;
}

Downloading this .proto file gives you the standard gRPC interface definition for the Open Inference Protocol. It allows your client or server to communicate with ML model servers using a unified API.
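
With the proto file on disk, you can also inspect the service definition locally before sending any traffic. Recent grpcurl versions can describe a symbol from a -proto source without contacting a server; treat this as an optional sanity check rather than a required step.

# List the RPCs exposed by the Open Inference Protocol service
grpcurl -proto open_inference_grpc.proto describe inference.GRPCInferenceService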

Set the required environment variables for the gRPC inference request:

export PROTO_FILE=open_inference_grpc.proto
export INPUT_PATH=input-example-grpc-v2.json
export SERVICE_HOSTNAME=$(kubectl get inferenceservice fraud-detection-grpc -o jsonpath='{.status.url}' | cut -d "/" -f 3)

These variables specify the protobuf schema for gRPC, the input payload to send, and the target hostname for routing the request through the ingress gateway.
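
Before sending a full inference request, you can check that the model is loaded using the ModelReady RPC defined in the proto above. This sketch mirrors the ModelInfer call shown below and assumes the served model name matches the InferenceService name:

grpcurl -plaintext \
  -proto ${PROTO_FILE} \
  -authority ${SERVICE_HOSTNAME} \
  -d '{"name": "fraud-detection-grpc"}' \
  ${INGRESS_HOST}:${INGRESS_PORT} \
  inference.GRPCInferenceService.ModelReady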

input-example-grpc-v2.json
{
  "model_name": "fraud-detection-grpc",
  "inputs": [
    {
      "name": "has_fraud_7d",
      "shape": [2],
      "datatype": "BOOL",
      "contents": {
        "bool_contents": [false, false]
      }
    },
    {
      "name": "num_transactions_7d",
      "shape": [2],
      "datatype": "INT64",
      "contents": {
        "int64_contents": [7, 6]
      }
    },
    {
      "name": "account_age_days",
      "shape": [2],
      "datatype": "INT64",
      "contents": {
        "int64_contents": [655, 236]
      }
    },
    {
      "name": "credit_score",
      "shape": [2],
      "datatype": "INT64",
      "contents": {
        "int64_contents": [680, 737]
      }
    },
    {
      "name": "has_2fa_installed",
      "shape": [2],
      "datatype": "BOOL",
      "contents": {
        "bool_contents": [true, true]
      }
    }
  ]
}
test-commands.txt
# -vv                Verbose output for debugging
# -plaintext         Use plaintext (non-TLS) connection
# -proto             Path to the .proto file describing the gRPC service
# -authority         Sets the HTTP/2 authority header (useful for ingress routing)
# -d                 Read the request body from stdin
# ${INGRESS_HOST}... Target host and port of the gRPC server
# inference.GRPC...  Fully-qualified gRPC method to call
# <<<...             Provide JSON request body from file as stdin
grpcurl -vv \
  -plaintext \
  -proto ${PROTO_FILE} \
  -authority ${SERVICE_HOSTNAME} \
  -d @ \
  ${INGRESS_HOST}:${INGRESS_PORT} \
  inference.GRPCInferenceService.ModelInfer \
  <<< $(cat "$INPUT_PATH")

Expected Output

TODO: Add expected output for gRPC inference request