How It Works¶
KServe can be installed in two modes:
- Serverless Mode (Recommended): Powered by Knative and Istio, this mode offers benefits such as automatic scaling, enhanced security, simplified traffic management, and seamless integration with serverless workflows.1
- RawDeployment Mode: Utilizes native Kubernetes resources like Deployment, Service, Ingress, and Horizontal Pod Autoscaler, providing a more traditional approach to model serving.2
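The mode can also be chosen per service. A minimal sketch, assuming the standard serving.kserve.io/deploymentMode annotation and an illustrative model location, of opting a single InferenceService (introduced below) into RawDeployment mode:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model                                     # illustrative name
  annotations:
    serving.kserve.io/deploymentMode: RawDeployment  # override the default mode for this service only
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://my-bucket/models/sklearn/model  # placeholder path
```

Cluster-wide defaults are typically configured in KServe's inferenceservice-config ConfigMap; the annotation above only changes the mode for this one service.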

Architecture Components¶
Control Plane¶
- KServe Controller: Handles the creation of services, ingress resources, model server containers, and model agent containers to facilitate request/response logging, batching, and model retrieval.4
- Ingress Gateway: Acts as the entry point for directing external or internal traffic to the appropriate services.4
If operating in Serverless mode, the following additional components are included:
- Knative Serving Controller: Manages service revisions, sets up network routing configurations, and provisions serverless containers with a queue proxy to report traffic metrics and enforce concurrency limits.4
- Knative Activator: Responsible for reviving pods that have been scaled down to zero and routing incoming requests to them.4
- Knative Autoscaler (KPA): Monitors application traffic and dynamically adjusts the number of replicas based on predefined metrics.4
Data Plane¶
- InferenceService: A Kubernetes custom resource designed to simplify the deployment and management of machine learning models for inference. It integrates components such as predictors, transformers, and explainers, offering capabilities like autoscaling, version control, and traffic splitting to optimize model serving in production.5
- Predictor: The core component of the InferenceService, responsible for hosting the model and exposing it through a network endpoint for inference requests.5
- Explainer: An optional component that generates model explanations alongside predictions, offering insights into the model's decision-making process.5
- Transformer: Allows users to define custom pre-processing and post-processing steps, enabling data transformations before predictions or explanations are generated.5
Serving Runtimes¶
KServe utilizes two types of Custom Resource Definitions (CRDs) to define model serving environments: ServingRuntimes and ClusterServingRuntimes. The primary distinction between them is their scope: ServingRuntimes are namespace-scoped, while ClusterServingRuntimes are cluster-scoped.
- ServingRuntime / ClusterServingRuntime: These CRDs specify templates for Pods capable of serving one or more specific model formats. Each ServingRuntime includes essential details such as the runtime's container image and the list of supported model formats.8

KServe provides several pre-configured ClusterServingRuntimes, enabling users to deploy popular model formats without the need to manually define the runtimes.
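As a rough sketch of what such a runtime definition contains (the image, tag, and arguments below mirror the stock sklearn runtime but should be treated as illustrative):

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-sklearnserver          # illustrative name
spec:
  supportedModelFormats:
    - name: sklearn                    # model formats this runtime can serve
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:latest   # runtime container image
      args:
        - --model_name={{.Name}}           # templated with the InferenceService name
        - --model_dir=/mnt/models
        - --http_port=8080
```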

Core Concepts¶
Inference Service¶
The following example demonstrates the minimum setup required to deploy an InferenceService in KServe10:
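A manifest along these lines (reconstructed from the fields described below, so treat the exact layout as a sketch rather than the canonical example):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                # framework/format of the model
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```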
Predictor:
- Model Format: Specifies the framework or format of the model being served. In this example, the model format is sklearn.
- Storage URI: Indicates the location of the model file. Here, the model is stored in a Google Cloud Storage bucket at gs://kfserving-examples/models/sklearn/1.0/model.
To ensure that the pod can successfully load the model, proper permissions must be configured.1617
Once applied, your InferenceService will be successfully deployed:
```
NAME           URL                                            READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                     AGE
sklearn-iris   http://sklearn-iris.kserve-test.example.com    True           100                              sklearn-iris-predictor-default-47q2g    7d23h
```
Inference Protocol¶
KServe's data plane protocol provides a framework-agnostic inference API that works seamlessly across various ML/DL frameworks and model servers. It supports two versions: V1 and V2.
To use the v2 REST protocol for inference with the deployed model, set the protocolVersion field to v2.11
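For example, a sketch that differs from the minimal manifest above only in the protocolVersion line:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-v2              # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2            # switch the data plane to the v2 protocol
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```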
Not all serving runtimes support the v2 inference protocol and gRPC; check here9 for more information.

To use the v2 gRPC protocol for inference with the deployed model, set the container port to 8081 and the port name to h2c.12 (This setup does not apply to TensorFlow and PyTorch, which have their own settings.)
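A sketch of the gRPC variant, with the port number and name set as described above (everything else is illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-grpc            # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      protocolVersion: v2
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      ports:
        - containerPort: 8081        # gRPC port
          name: h2c                  # HTTP/2 cleartext, required for gRPC
          protocol: TCP
```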
Autoscaling¶
Configure the condition under which the predictor scales using the scaleTarget and scaleMetric fields.18
Enable scale-to-zero by setting minReplicas: 0.18
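A sketch combining both settings; the concurrency target of 5 is an arbitrary illustrative value:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-autoscale       # illustrative name
spec:
  predictor:
    minReplicas: 0                   # allow scale-to-zero
    scaleMetric: concurrency         # metric the autoscaler watches
    scaleTarget: 5                   # target value per replica for that metric
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```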
Canary Rollout¶
KServe supports canary rollout, a deployment strategy that allows you to gradually shift traffic between different versions of a model. This approach minimizes risks by enabling you to test new versions (canary models) with a small percentage of traffic before fully rolling them out.
Promote the canary model or roll back to the previous model using the canaryTrafficPercent field. In addition, you can use the serving.kserve.io/enable-tag-routing annotation to route traffic explicitly. This allows you to direct traffic to the canary model (model v2) or the previous model (model v1) by including a tag in the request URL.13
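A sketch of a canary step; the 10 percent split and the v2 storage path are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  annotations:
    serving.kserve.io/enable-tag-routing: "true"   # enable tag-based routing via the request URL
spec:
  predictor:
    canaryTrafficPercent: 10         # send 10% of traffic to the latest (canary) revision
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model-v2   # placeholder for the new model version
```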
Inference Graph¶
Modern ML inference systems are increasingly complex, often requiring multiple models to generate a single prediction. KServe simplifies this process by supporting InferenceGraph, allowing users to define and deploy intricate ML inference pipelines in a declarative and scalable manner for production use.14

- InferenceGraph: Composed of routing Nodes, each containing a series of routing Steps. Each Step can direct traffic to either an InferenceService or another Node within the graph, making the InferenceGraph highly modular. The InferenceGraph supports four types of routing Nodes: Sequence, Switch, Ensemble, and Splitter.14
- Sequence Node: Enables users to define a series of Steps where each Step routes to an InferenceService or another Node in a sequential manner. The output of one Step can be configured to serve as the input for the next Step.14
- Ensemble Node: Facilitates model ensembles by running multiple models independently and combining their outputs into a single prediction. Various methods, such as majority voting for classification or averaging for regression, can be used to aggregate the results.14
- Splitter Node: Distributes traffic across multiple targets based on a specified weighted distribution.14
- Switch Node: Allows users to specify routing conditions to determine which Step to execute. The response is returned as soon as a condition is met. If no conditions are satisfied, the graph returns the original request.14
Example15
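A minimal two-step Sequence graph might look like the following sketch; the service names model-a and model-b stand in for hypothetical InferenceServices:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: example-graph                # illustrative name
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - serviceName: model-a       # hypothetical InferenceService
          data: $request             # receives the original request
        - serviceName: model-b       # hypothetical InferenceService
          data: $response            # receives the previous step's output
```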
Behind the Scenes¶
When you apply an InferenceService using kubectl apply, the following steps occur behind the scenes (in Serverless mode):
- The KServe Controller receives the request and deploys a Knative Service.
- A Knative Revision is prepared to manage versioning and traffic routing.
- The Transformer and Predictor Pods are deployed, with autoscaling configurations set up via the Knative Autoscaler.
- The Predictor Pod uses an InitContainer (Storage Initializer) to fetch the model from a storage location (e.g., GCS, S3).
- Once the model is retrieved, the Predictor Pod serves it using the specified Serving Runtime.
- The Predictor Pod exposes its endpoint through a Queue Proxy, which handles traffic metrics and concurrency limits. The endpoint is then made accessible externally via a Service.
- The Transformer Pod, which handles pre-processing and post-processing logic, does not require a storage initializer. It simply deploys the transformer container.
- Similar to the Predictor Pod, the Transformer Pod exposes its endpoint through a Queue Proxy, making it accessible externally via a Service.
- Finally, the backend of your AI application can call the InferenceService endpoints to execute pre-processing, prediction, and post-processing. The system dynamically scales up or down based on the configured autoscaling metrics.
This seamless orchestration ensures efficient and scalable model serving for your AI applications.