ISTIO & KServe

IRSA (IAM Roles for Service Accounts)


Make sure OIDC is enabled for your Cluster

eksctl utils associate-iam-oidc-provider --region ap-south-1 --cluster basic-cluster --approve

Create IAM Policy

The policy below grants read/list access to the pytorch-models bucket. The exact Action entries were lost in the original notes; s3:ListBucket on the bucket and s3:GetObject on its objects are reasonable assumptions for this test:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::pytorch-models",
            "Sid": "List"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::pytorch-models/*"
        }
    ]
}
aws iam create-policy \
    --policy-name S3ListTest \
    --policy-document file://iam-s3-test-policy.json

Make sure the policy is created

aws iam get-policy-version --policy-arn arn:aws:iam::ACCOUNT_ID:policy/S3ListTest --version-id v1


Create IRSA

eksctl create iamserviceaccount \
  --name s3-list-sa \
  --cluster basic-cluster \
  --attach-policy-arn arn:aws:iam::ACCOUNT_ID:policy/S3ListTest \
  --approve \
  --region ap-south-1

This will create the s3-list-sa service account with our defined policy attached

❯ kg sa
NAME         SECRETS   AGE
default      0         10m
s3-list-sa   0         5s
k describe sa s3-list-sa
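If the IRSA wiring worked, describing the service account will show the eks.amazonaws.com/role-arn annotation pointing at the role eksctl just created. Roughly (the role name is generated by eksctl, so yours will differ):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-list-sa
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/eksctl-basic-cluster-addon-iamserviceaccount-…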

Create a Pod with the Service Account attached to it


apiVersion: v1
kind: Pod
metadata:
  name: aws-cli
spec:
  serviceAccountName: s3-list-sa
  containers:
  - name: aws-cli
    image: amazon/aws-cli:latest
    command:
      - sleep
      - "3600"
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
k apply -f aws-cli-pod.yaml
k exec -it aws-cli -- aws sts get-caller-identity
kubectl exec -it aws-cli -- aws s3 ls s3://YOUR_BUCKET
❯ kubectl exec -it aws-cli -- aws s3 ls s3://pytorch-models
                           PRE config/
                           PRE model-store/
kubectl exec -it aws-cli -- bash

Now we can run aws s3 commands to check
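Under the hood, the EKS pod identity webhook mutates pods that use this service account: it mounts a projected service account token and sets the AWS SDK environment variables, which is why the aws CLI in the pod can assume the role without any static credentials. A rough sketch of what gets injected into the pod spec (values vary per cluster):

# injected into the aws-cli container by the EKS pod identity webhook (illustrative)
env:
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::ACCOUNT_ID:role/eksctl-basic-cluster-addon-iamserviceaccount-…
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
volumeMounts:
  - name: aws-iam-token
    mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
    readOnly: true
volumes:
  - name: aws-iam-token
    projected:
      sources:
        - serviceAccountToken:
            audience: sts.amazonaws.com
            expirationSeconds: 86400
            path: token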




Make sure OIDC is enabled for your Cluster

eksctl utils associate-iam-oidc-provider --region ap-south-1 --cluster basic-cluster --approve

Install the EBS CSI Addon

First create the IRSA for EBS

eksctl create iamserviceaccount \
    --name ebs-csi-controller-sa \
    --namespace kube-system \
    --cluster basic-cluster \
    --role-name AmazonEKS_EBS_CSI_DriverRole \
    --role-only \
    --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
    --approve \
    --region ap-south-1

Create the EBS CSI Driver Addon

eksctl create addon --name aws-ebs-csi-driver --cluster basic-cluster --service-account-role-arn arn:aws:iam::581084559687:role/AmazonEKS_EBS_CSI_DriverRole --region ap-south-1 --force

This creates the CSI controller and node service accounts in kube-system:

ebs-csi-controller-sa                0         4s
ebs-csi-node-sa                      0         4s

Testing out the EBS Dynamic PV

git clone https://github.com/kubernetes-sigs/aws-ebs-csi-driver.git
cd aws-ebs-csi-driver/examples/kubernetes/dynamic-provisioning/

Set the Storage Class to GP3

echo "parameters:
  type: gp3" >> manifests/storageclass.yaml

Deploy the example Pod, SC and PVC

kubectl apply -f manifests/
❯ kubectl describe storageclass ebs-sc
Name:            ebs-sc
IsDefaultClass:  No
Parameters:            type=gp3
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     WaitForFirstConsumer
Events:                <none>
❯ kg pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS   REASON   AGE
pvc-a1ab07e7-20c4-43f8-9ca8-be47542b3128   4Gi        RWO            Delete           Bound    default/ebs-claim   ebs-sc                  48s
❯ kg pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
ebs-claim   Bound    pvc-a1ab07e7-20c4-43f8-9ca8-be47542b3128   4Gi        RWO            ebs-sc         69s
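For reference, the claim and pod in the sample manifests look roughly like this (a sketch matching the names and sizes in the output above; the actual example may differ slightly):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 4Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: centos
    command: ["/bin/sh", "-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: ebs-claim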

We can exec inside the pod to verify

❯ kubectl exec -it app -- bash
[root@app /]# ls
bin  data  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
[root@app /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          80G  4.7G   76G   6% /
tmpfs            64M     0   64M   0% /dev
tmpfs           1.9G     0  1.9G   0% /sys/fs/cgroup
/dev/nvme1n1    3.9G   28K  3.9G   1% /data
/dev/nvme0n1p1   80G  4.7G   76G   6% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           3.3G   12K  3.3G   1% /run/secrets/
tmpfs           1.9G     0  1.9G   0% /proc/acpi
tmpfs           1.9G     0  1.9G   0% /sys/firmware

Nodes can also be described!

kubectl describe node
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           83873772Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3955652Ki
  pods:                        17
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           76224326324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3400644Ki
  pods:                        17
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         265m (13%)  100m (5%)
  memory                      320Mi (9%)  1068Mi (32%)
  ephemeral-storage           0 (0%)      0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0




Just add the relevant instance type to the nodegroup spec and that's it!

  - name: ng-gpu-spot-1
    instanceType: g4dn.xlarge
    desiredCapacity: 1
    ssh:
      allow: true
    spot: true
    labels:
      role: spot
    propagateASGTags: true
    iam:
      withAddonPolicies:
        autoScaler: true
        awsLoadBalancerController: true
        certManager: true
        externalDNS: true
        ebs: true

GPU Nodes on EKS

❯ kubectl get nodes -L node.kubernetes.io/instance-type
NAME   STATUS   ROLES    AGE   VERSION                INSTANCE-TYPE
…      Ready    <none>   98m   v1.25.13-eks-43840fb   g4dn.xlarge
…      Ready    <none>   98m   v1.25.13-eks-43840fb   t3a.medium
…      Ready    <none>   98m   v1.25.13-eks-43840fb   t3a.medium
…      Ready    <none>   98m   v1.25.13-eks-43840fb   t3a.medium
…      Ready    <none>   98m   v1.25.13-eks-43840fb   t3a.medium
…      Ready    <none>   98m   v1.25.13-eks-43840fb   g4dn.xlarge
…      Ready    <none>   98m   v1.25.13-eks-43840fb   t3a.medium
…      Ready    <none>   98m   v1.25.13-eks-43840fb   t3a.medium


MIG (Multi-Instance GPU) can partition the GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores.




With the MIG single strategy, you create MIG devices that are all the same size. For instance, on a P4d.24XL (8× A100 40GB), the options include creating 56 slices of 1g.5gb, 24 slices of 2g.10gb, 16 slices of 3g.20gb, 8 slices of 4g.20gb, or 8 slices of 7g.40gb.

Suppose we have several teams working on simulations and deep learning. With the MIG single strategy, you can give each team the GPU power they need, making the most of the P4d.24XL. This approach is great when all tasks need the same amount of GPU and when tasks can run at the same time.

On the other hand, the mixed strategy offers more flexibility. It lets you combine different-sized slices, such as a few 1g.5gb slices with some 2g.10gb and 3g.20gb slices. This approach is particularly beneficial when your cluster handles tasks with diverse GPU needs.
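On EKS this choice usually surfaces as a single knob plus the resource names your pods request. A rough sketch, assuming the NVIDIA GPU Operator's Helm values (the operator is installed later in this post) and a 2g.10gb profile under the mixed strategy:

# gpu-operator Helm values snippet (assumed layout)
mig:
  strategy: mixed   # or "single"

# a pod requesting one 2g.10gb MIG slice under the mixed strategy
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["sleep", "3600"]
    resources:
      limits:
        nvidia.com/mig-2g.10gb: 1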


Istio is the leading example of a new class of projects called Service Meshes. Service meshes manage traffic between microservices at layer 7 of the OSI Model. Using this in-depth knowledge of the traffic semantics – for example HTTP request hosts, methods, and paths – traffic handling can be much more sophisticated.

Istio works by having a small network proxy sit alongside each microservice. This so-called “sidecar” intercepts all of the service’s traffic, and handles it more intelligently than a simple layer 3 network can. Istio uses the Envoy proxy as its sidecar.


Istio Architecture:

An Istio service mesh is logically split into a data plane and a control plane.

  • The data plane is composed of a set of intelligent proxies (Envoy) deployed as sidecars. These proxies mediate and control all network communication between microservices. They also collect and report telemetry on all mesh traffic.
  • The control plane manages and configures the proxies to route traffic.

Key Use Cases of Istio:

  1. Traffic Management: Istio enables sophisticated traffic routing and load balancing, such as canary releases, A/B testing, and blue-green deployments.
  2. Security: It provides authentication, authorization, and encryption for service-to-service communication, enhancing the overall security of your microservices.
  3. Observability: Istio offers powerful tools for monitoring, tracing, and logging, which help in diagnosing issues, understanding performance, and maintaining a healthy application.
  4. Resilience: Istio can automatically retry failed requests, handle timeouts, and implement circuit breaking to improve the overall reliability of microservices.

Simple Examples:

  1. Traffic Routing: Imagine you have a web application with a new feature. You can use Istio to gradually send a percentage of user traffic to the new feature, allowing you to test it with a subset of users before a full rollout (see the sketch after this list).
  2. Security: Istio can ensure that only authenticated services can communicate with each other. You can demonstrate this by creating policies that deny communication between services unless they meet specific authentication requirements.
  3. Observability: You can show how Istio generates telemetry data and integrates with tools like Grafana and Prometheus to monitor service performance and troubleshoot issues.
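As a concrete illustration of the traffic-routing example above, here is a minimal sketch (the reviews service and its v1/v2 subsets are made-up names) that sends 90% of traffic to one version and 10% to the other:

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: reviews
spec:
  host: reviews
  subsets:
  - name: v1
    labels:
      version: v1
  - name: v2
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1
      weight: 90
    - destination:
        host: reviews
        subset: v2
      weight: 10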


Why use Istio?

  1. Fine-Grained Traffic Control: While Kubernetes provides basic load balancing and Ingress controllers, Istio offers more advanced traffic control. With Istio, you can perform complex routing, such as canary releases, A/B testing, and gradual deployments, which Kubernetes Ingress may not handle as flexibly.
  2. Security: Istio provides robust security features, including mutual TLS (mTLS) authentication between microservices. This ensures secure communication, which might not be as comprehensive in a basic Kubernetes setup (a minimal mTLS policy sketch follows this list).
  3. Observability: Istio offers advanced observability features, including metrics, distributed tracing, and logging, which are valuable for diagnosing issues and understanding performance. While Kubernetes provides some observability, Istio enhances it significantly.
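For the mTLS point above, enforcing strict mutual TLS for every workload in a namespace is a single policy object; a minimal sketch (namespace assumed to be default):

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: default
spec:
  mtls:
    mode: STRICT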

When not to use Istio

  1. Simplicity: If you have a simple, monolithic application or a small-scale microservices setup, the overhead of configuring and managing Istio may outweigh its benefits. Simpler solutions like basic Kubernetes ingress controllers or a lightweight proxy can suffice.
  2. Learning Curve: Istio has a steep learning curve, which may be challenging for teams new to microservices or containers. If your team lacks the necessary expertise, you might consider simpler alternatives.
  3. Resource Overhead: The sidecar proxies deployed with Istio can consume additional resources, which may be a concern in resource-constrained environments. This overhead may not be justified for smaller applications.

What is a Service Mesh?

Istio Installation on EKS Cluster

We’re going to use HELM to install Istio

istio_chart_url     = ""
istio_chart_version = "1.19.1"
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update
kubectl create namespace istio-system
helm install istio-base istio/base -n istio-system --set defaultRevision=default
helm install istiod istio/istiod -n istio-system --wait
kubectl create namespace istio-ingress

It's important to set the correct annotations so the NLB can be allocated

helm install istio-ingress istio/gateway -n istio-ingress \
  --set "labels.istio=ingressgateway" \
  --set service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-type"="nlb" \
  --set service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-scheme"="internet-facing" \
  --set service.annotations."service\.beta\.kubernetes\.io/aws-load-balancer-attributes"="load_balancing.cross_zone.enabled=true"
helm ls -n istio-system
helm ls -n istio-ingress
❯ helm ls -A
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
istio-base      istio-system    1               2023-10-11 23:09:52.890167 +0530 IST    deployed        base-1.19.1     1.19.1     
istio-ingress   istio-ingress   1               2023-10-11 23:10:30.631913 +0530 IST    deployed        gateway-1.19.1  1.19.1     
istiod          istio-system    1               2023-10-11 23:10:05.03971 +0530 IST     deployed        istiod-1.19.1   1.19.1
kubectl rollout restart deployment istio-ingress -n istio-ingress
for ADDON in kiali jaeger prometheus grafana
do
    ADDON_URL="..."  # URL of the $ADDON manifest from the Istio samples/addons directory (link omitted in the original)
    kubectl apply -f $ADDON_URL
done
# Visualize Istio Mesh console using Kiali
kubectl port-forward svc/kiali 20001:20001 -n istio-system
# Get to the Prometheus UI
kubectl port-forward svc/prometheus 9090:9090 -n istio-system
# Visualize metrics in using Grafana
kubectl port-forward svc/grafana 3000:3000 -n istio-system
kubectl get pods,svc -n istio-system
kubectl get pods,svc -n istio-ingress
❯ kubectl get pods,svc -n istio-system
NAME                              READY   STATUS    RESTARTS   AGE
pod/grafana-5f98b97b64-mrlzg      1/1     Running   0          25s
pod/istiod-5b4f9c79cd-pgnxf       1/1     Running   0          15m
pod/jaeger-76cd7c7566-htt97       1/1     Running   0          29s
pod/kiali-7799445c94-s7w67        0/1     Running   0          30s
pod/prometheus-67599c8d5c-6lgc4   2/2     Running   0          27s
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                 AGE
service/grafana            ClusterIP    <none>        3000/TCP                                25s
service/istiod             ClusterIP    <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP   15m
service/jaeger-collector   ClusterIP    <none>        14268/TCP,14250/TCP,9411/TCP            28s
service/kiali              ClusterIP   <none>        20001/TCP,9090/TCP                      30s
service/prometheus         ClusterIP   <none>        9090/TCP                                27s
service/tracing            ClusterIP    <none>        80/TCP,16685/TCP                        28s
service/zipkin             ClusterIP    <none>        9411/TCP                                28s
❯ kubectl get pods,svc -n istio-ingress
NAME                                 READY   STATUS    RESTARTS   AGE
pod/istio-ingress-858849b544-q76zv   1/1     Running   0          2m10s
NAME                    TYPE           CLUSTER-IP       EXTERNAL-IP                                                                      PORT(S)                                      AGE
service/istio-ingress   LoadBalancer   15021:32125/TCP,80:30181/TCP,443:30285/TCP   2m10s

Enable Sidecar Injection

kubectl label namespace default istio-injection=enabled



Install the Gateway API CRDs

kubectl get crd gateways.gateway.networking.k8s.io &> /dev/null || \
  { kubectl kustomize "" | kubectl apply -f -; }


Testing out Istio Installation

Test ISTIO Installation with the BookInfo Example

k apply -f

We can hit the bookinfo endpoint

kubectl exec "$(kubectl get pod -l app=ratings -o jsonpath='{.items[0].metadata.name}')" -c ratings -- curl -sS productpage:9080/productpage | grep -o "<title>.*</title>"

Install the bookinfo gateway

k apply -f
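The bookinfo gateway manifest (link omitted above) defines a Kubernetes Gateway API Gateway plus an HTTPRoute pointing at the productpage service. Roughly, as a sketch that may differ in detail from the sample:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: bookinfo-gateway
spec:
  gatewayClassName: istio
  listeners:
  - name: http
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: Same
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: bookinfo
spec:
  parentRefs:
  - name: bookinfo-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /productpage
    backendRefs:
    - name: productpage
      port: 9080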

Now we need to figure out the Gateway URL which we can hit

export INGRESS_HOST=$(kubectl get gtw bookinfo-gateway -o jsonpath='{.status.addresses[0].value}')
export INGRESS_PORT=$(kubectl get gtw bookinfo-gateway -o jsonpath='{.spec.listeners[?(@.name=="http")].port}')
❯ kubectl get gtw bookinfo-gateway
NAME               CLASS   ADDRESS                                                                   PROGRAMMED   AGE
bookinfo-gateway   istio   True         49s

You can get the LoadBalancer URL from the service as well

k get svc
export GATEWAY_URL=$INGRESS_HOST:$INGRESS_PORT
curl -s "http://${GATEWAY_URL}/productpage" | grep -o "<title>.*</title>"
echo "http://${GATEWAY_URL}/productpage"


KServe is an open-source framework designed to streamline the deployment and serving of machine learning models on Kubernetes. It simplifies the process of deploying models for inference in production environments.


KServe is specifically tailored for serving machine learning models efficiently and scalably on Kubernetes. It simplifies the operational aspects of ML model serving, allowing data scientists and ML engineers to focus on their models while taking advantage of Kubernetes' flexibility and scaling capabilities.

KServe Architecture


KServe's architecture has a control plane and a data plane. The control plane manages model deployments, creating service endpoints, handling scaling, and connecting to model storage (like cloud storage).

The data plane processes requests and responses for specific models. It includes the predictor for inference, the transformer for data processing, and optionally, an explainer for model explainability.


  1. KServe Controller: Manages the lifecycle of model deployments, including creating services, ingresses, containers, and model agent containers for logging and batching.
  2. Model Store: A repository for registered models, usually cloud-based (e.g., Amazon S3, Google Cloud Storage).
  3. Predictor: Receives inference requests and invokes the transformer for data processing and, optionally, the explainer for model explainability.

Supported Frameworks and Runtimes:

KServe supports various machine learning and deep learning frameworks, including TensorFlow, ONNX, PyTorch, and more. It also works with classical ML models based on SKLearn, XGBoost, and others. You can even extend KServe to support custom runtimes that follow the V2 inference protocol.

KServe Model Mesh


Installing KServe

We need to create an istio ingress class which will be used by KServe


apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: istio
spec:
  controller: istio.io/ingress-controller

k apply -f istio-kserve-ingress.yaml

Install the Metrics API

kubectl apply -f

Install Cert Manager

kubectl apply -f

Install KServe Manifest

kubectl apply -f

KServe Runtime

kubectl apply -f
kubectl patch configmap/inferenceservice-config -n kserve --type=strategic -p '{"data": {"deploy": "{\"defaultDeploymentMode\": \"RawDeployment\"}"}}'
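The patch above just sets the default deployment mode in KServe's inferenceservice-config ConfigMap; after patching, the relevant entry looks roughly like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
  namespace: kserve
data:
  deploy: |
    {"defaultDeploymentMode": "RawDeployment"}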

Test KServe Installation

We can deploy the IRIS Example to test the KServe Installation


apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"


iris-input.json:

{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}

Install iris inference service

k apply -f iris.yaml
kubectl get inferenceservices sklearn-iris
❯ kubectl get inferenceservices sklearn-iris
NAME           URL                                       READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
sklearn-iris   True                                                                  2m51s
SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
sklearn-iris-predictor-78cc5dd598-njsjw   2/2     Running   0          92s

Now we need the Ingress URL to hit the iris endpoint

export INGRESS_HOST=$(kubectl -n istio-ingress get service istio-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
export INGRESS_PORT=$(kubectl -n istio-ingress get service istio-ingress -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json
❯ curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json
*   Trying
* Connected to ( port 80 (#0)
> POST /v1/models/sklearn-iris:predict HTTP/1.1
> Host:
> User-Agent: curl/8.1.2
> Accept: */*
> Content-Type: application/json
> Content-Length: 170
< HTTP/1.1 200 OK
< date: Tue, 10 Oct 2023 17:07:43 GMT
< server: istio-envoy
< content-length: 23
< content-type: application/json
< x-envoy-upstream-service-time: 3
* Connection #0 to host left intact

So we have successfully installed KServe!

S3 Access for KServe

Enable OIDC if not done already

eksctl utils associate-iam-oidc-provider --region ap-south-1 --cluster basic-cluster --approve

Create IRSA for S3 Read Only Access

eksctl create iamserviceaccount \
	--cluster=basic-cluster \
	--name=s3-read-only \
	--attach-policy-arn=arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
	--override-existing-serviceaccounts \
	--region ap-south-1 \
	--approve

Create S3 Secret to be used by our Inference Service


apiVersion: v1
kind: Secret
metadata:
  name: s3-secret
  annotations:
    serving.kserve.io/s3-endpoint: "" # replace with your s3 endpoint e.g minio-service.kubeflow:9000
    serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
    serving.kserve.io/s3-region: "ap-south-1"
    serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
type: Opaque
k apply -f s3-secret.yaml
k patch serviceaccount s3-read-only -p '{"secrets": [{"name": "s3-secret"}]}'

Deploy SDXL on EKS with KServe


This deployment is going to take the .mar files we generated in the TorchServe assignment and deploy them with KServe!





The TorchServe handler (sdxl_handler.py); the name of the zipped weights inside the archive is assumed here to be model.zip:

import logging
import zipfile
from abc import ABC

import numpy as np
import torch
from diffusers import StableDiffusionXLPipeline

from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("Loading sdxl_handler...")


class SDXLHandler(BaseHandler, ABC):
    def __init__(self):
        self.initialized = False

    def initialize(self, ctx):
        """In this initialize function, the Stable Diffusion model is loaded and
        initialized here.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artefacts parameters.
        """
        self.manifest = ctx.manifest
        properties = ctx.system_properties
        model_dir = properties.get("model_dir")

        logger.info(f"torch.cuda.is_available() = {torch.cuda.is_available()}")
        logger.info(f'gpu_id = {properties.get("gpu_id")}')

        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available() and properties.get("gpu_id") is not None
            else "cpu"
        )

        # unpack the zipped model weights shipped inside the .mar (archive name assumed)
        with zipfile.ZipFile(model_dir + "/model.zip", "r") as zip_ref:
            zip_ref.extractall(model_dir + "/model")

        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            model_dir + "/model", torch_dtype=torch.float16, variant="fp16"
        )
        self.pipe = self.pipe.to(self.device)

        logger.info(f"loaded StableDiffusionXL model from {model_dir}/model")
        self.initialized = True

    def preprocess(self, requests):
        """Basic text preprocessing, of the user's prompt.
        Args:
            requests (str): The Input data in the form of text is passed on to the preprocess
            function.
        Returns:
            list : The preprocess function returns a list of prompts.
        """
        inputs = []
        for _, data in enumerate(requests):
            input_text = data.get("data")
            if input_text is None:
                input_text = data.get("body")
            if isinstance(input_text, (bytes, bytearray)):
                input_text = input_text.decode("utf-8")
            logger.info("received text: '%s'", input_text)
            inputs.append(input_text)
        return inputs

    def inference(self, inputs):
        """Generates the image relevant to the received text.
        Args:
            inputs (list): List of Text from the pre-process function is passed here
        Returns:
            list : It returns a list of the generated images for the input text
        """
        inferences = self.pipe(
            inputs, num_inference_steps=50, height=1024, width=1024
        ).images

        logger.info(f"generated images: {inferences}")
        return inferences

    def postprocess(self, inference_output):
        """Post Process Function converts the generated image into Torchserve readable format.
        Args:
            inference_output (list): It contains the generated image of the input text.
        Returns:
            (list): Returns a list of the images.
        """
        images = []
        for image in inference_output:
            # convert the PIL image into a nested list so TorchServe can JSON-serialize it
            images.append(np.asarray(image).tolist())
        return images

Now we need to start a torchserve docker container to create the model archive file

docker run -it --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v `pwd`:/opt/src pytorch/torchserve:0.8.1-gpu bash

Create MAR from handler and model artifact

cd /opt/src
torch-model-archiver --model-name sdxl --version 1.0 --handler sdxl_handler.py --extra-files model.zip -r requirements.txt  # handler / extra-files names assumed to match the files above

This will create sdxl.mar

Create an S3 bucket where we will store these model .mar files, which will be loaded by KServe

❯ aws s3 cp s3://pytorch-models/config/
upload: ./ to s3://pytorch-models/config/
aws s3 cp sdxl.mar s3://pytorch-models/model-store/

Deploy SDXL!


apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
spec:
  predictor:
    serviceAccountName: s3-read-only
    model:
      modelFormat:
        name: pytorch
      storageUri: s3://pytorch-fr-models
      resources:
        limits:
          memory: 4Gi

Make sure serviceAccountName is correct!

❯ k apply -f sdxl.yaml
inferenceservice.serving.kserve.io/torchserve created

It will start the torchserve predictor pod

❯ kgpo
NAME                                    READY   STATUS     RESTARTS      AGE
torchserve-predictor-69d746c8dc-2sptp   0/2     Init:0/2   1 (52s ago)   79s
klo torchserve-predictor-69d746c8dc-wqnw9



Once created, you will get a URL for your InferenceService

❯ kg isvc
NAME           URL                                       READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION   AGE
sklearn-iris   True                                                                  169m
torchserve     True                                                                  6m13s

Now we need to figure out how to access this model

export INGRESS_HOST=$(kubectl -n istio-ingress get service istio-ingress -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
export INGRESS_PORT=$(kubectl -n istio-ingress get service istio-ingress -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
MODEL_NAME=sdxl
curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict -d @./input.json

We can always SSH into our GPU Node to see what’s happening

sh-4.2$ nvidia-smi
Wed Oct 11 20:54:12 2023
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   57C    P0    72W /  70W |  13750MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    0   N/A  N/A    119154      C   /home/venv/bin/python           13747MiB |
❯ echo http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict

Python script to hit the inference endpoint and save the result

import requests
import json
import numpy as np
from PIL import Image
input = {
	"instances": [
		{
			"data": "a cat on a farm"
		}
	]
}
headers = {
	"Host": ""  # the SERVICE_HOSTNAME of the InferenceService (left blank in the original)
}
url = ""  # http://<INGRESS_HOST>:<INGRESS_PORT>/v1/models/sdxl:predict (left blank in the original)
response = requests.post(url, data=json.dumps(input), headers=headers)
# with open("raw.txt", "w") as f:
# 	f.write(response.text)
# with open("raw.txt", "r") as f:
# 	a = f.read()
image = Image.fromarray(np.array(json.loads(response.text)['predictions'][0], dtype="uint8"))
image.save("out.jpg")





That’s it ! We have deployed SDXL on KServe on Kubernetes on EKS !

Monitoring the Deployment

Nvidia DCGM (Data Center GPU Manager)


During the cluster creation process, the NVIDIA device plugin will get installed. You will need to remove it after cluster creation because we will use the NVIDIA GPU Operator instead.

k get ds -n kube-system
kubectl -n kube-system delete daemonset nvidia-device-plugin-daemonset

What is a DaemonSet?

A DaemonSet is a controller that ensures that the pod runs on all the nodes of the cluster. If a node is added/removed from a cluster, DaemonSet automatically adds/deletes the pod.
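A minimal sketch of the object itself (names are illustrative): a DaemonSet looks like any other pod template, just without a replica count, and the controller puts one copy on every node. The NVIDIA device plugin and the dcgm-exporter installed below are exactly this kind of resource:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: busybox:1.36
        command: ["sh", "-c", "while true; do sleep 3600; done"]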


kubectl create namespace gpu-operator
curl > dcgm-metrics.csv
kubectl create configmap metrics-config -n gpu-operator --from-file=dcgm-metrics.csv
helm repo add nvidia && helm repo update
helm install --wait \
	--generate-name -n gpu-operator \
	--create-namespace nvidia/gpu-operator \
	--set ""=metrics-config \
	--set "dcgmExporter.env[0].name"=DCGM_EXPORTER_COLLECTORS \
	--set "dcgmExporter.env[0].value"=/etc/dcgm-exporter/dcgm-metrics.csv \
	--set toolkit.enabled=false
kubectl -n gpu-operator logs -f $(kubectl -n gpu-operator get pods | grep dcgm | cut -d ' ' -f 1 | head -n 1)
❯ kubectl -n gpu-operator logs -f $(kubectl -n gpu-operator get pods | grep dcgm | cut -d ' ' -f 1 | head -n 1)
Defaulted container "nvidia-dcgm-exporter" out of: nvidia-dcgm-exporter, toolkit-validation (init)
time="2023-10-12T15:27:05Z" level=info msg="Starting dcgm-exporter"
time="2023-10-12T15:27:05Z" level=info msg="DCGM successfully initialized!"
time="2023-10-12T15:27:05Z" level=info msg="Collecting DCP Metrics"
time="2023-10-12T15:27:05Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-10-12T15:27:05Z" level=info msg="Initializing system entities of type: GPU"
time="2023-10-12T15:27:05Z" level=info msg="Initializing system entities of type: NvSwitch"
time="2023-10-12T15:27:05Z" level=info msg="Not collecting switch metrics: no switches to monitor"
time="2023-10-12T15:27:05Z" level=info msg="Initializing system entities of type: NvLink"
time="2023-10-12T15:27:05Z" level=info msg="Not collecting link metrics: no switches to monitor"
time="2023-10-12T15:27:05Z" level=info msg="Kubernetes metrics collection enabled!"
time="2023-10-12T15:27:05Z" level=info msg="Pipeline starting"
time="2023-10-12T15:27:05Z" level=info msg="Starting webserver"

Modify Prometheus to scrape metrics from the DCGM exporter


Add the following scrape job under scrape_configs

    - job_name: 'kubernetes-pod-dcgm-exporter'
      honor_labels: true
      sample_limit: 10000
      metrics_path: /api/v1/metrics/prometheus
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: keep
        regex: '^DCGM.*$'
      - source_labels: [__address__]
        action: replace
        regex: ([^:]+)(?::\d+)?
        replacement: ${1}:9400
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: Namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_container_name]
        action: replace
        target_label: container_name
      - source_labels: [__meta_kubernetes_pod_controller_name]
        action: replace
        target_label: pod_controller_name
      - source_labels: [__meta_kubernetes_pod_controller_kind]
        action: replace
        target_label: pod_controller_kind
      - source_labels: [__meta_kubernetes_pod_phase]
        action: replace
        target_label: pod_phase
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: NodeName
k apply -f prometheus.yaml
kubectl port-forward svc/prometheus 9090:9090 -n istio-system

Open http://localhost:9090



The metrics can be viewed on Grafana as well!

kubectl port-forward svc/grafana 3000:3000 -n istio-system

Download the DCGM Grafana dashboard JSON and import it into Grafana


Under Load





kubectl port-forward svc/kiali 20001:20001 -n istio-system




Grafana Metrics for Istio



Metrics for the SDXL Deployment