MIG (Multi-Instance GPU) can partition a single GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores.
With the MIG single strategy, all MIG devices on a node use the same profile. For instance, on a p4d.24xlarge (eight A100 40GB GPUs), the options include creating 56 slices of 1g.5gb, 24 slices of 2g.10gb, 16 slices of 3g.20gb, or eight slices of either 4g.20gb or 7g.40gb (one per GPU).
Suppose we have several teams working on simulations and deep learning. With the MIG single strategy, you can give each team its own identically sized slice of GPU power while making the most of the p4d.24xlarge. This approach works well when all tasks need the same amount of GPU and can run at the same time.
On the other hand, the mixed strategy offers more flexibility. It lets you combine different-sized slices, such as a few 1g.5gb slices with some 2g.10gb and 3g.20gb slices. This approach is particularly beneficial when your cluster handles tasks with diverse GPU needs.
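To make this concrete, here is a minimal sketch of a pod that requests a single 1g.5gb slice. It assumes the NVIDIA device plugin is running with the mixed strategy, which exposes per-profile resource names such as nvidia.com/mig-1g.5gb (under the single strategy, slices are advertised simply as nvidia.com/gpu). The pod name and image are placeholders.

```bash
# Sketch: request one 1g.5gb MIG slice (mixed-strategy resource naming assumed)
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo                      # placeholder name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi", "-L"]     # prints the MIG device visible to this pod
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1      # per-profile resource exposed by the mixed strategy
EOF
```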
Istio
Istio is the leading example of a new class of projects called service meshes. Service meshes manage traffic between microservices at layer 7 of the OSI model. Using this in-depth knowledge of traffic semantics – for example, HTTP request hosts, methods, and paths – they can handle traffic in much more sophisticated ways.
Istio works by having a small network proxy sit alongside each microservice. This so-called “sidecar” intercepts all of the service’s traffic, and handles it more intelligently than a simple layer 3 network can. Istio uses the Envoy proxy as its sidecar.
Istio Architecture:
An Istio service mesh is logically split into a data plane and a control plane.
The data plane is composed of a set of intelligent proxies (Envoy) deployed as sidecars. These proxies mediate and control all network communication between microservices. They also collect and report telemetry on all mesh traffic.
The control plane manages and configures the proxies to route traffic.
Key Use Cases of Istio:
Traffic Management: Istio enables sophisticated traffic routing and load balancing, such as canary releases, A/B testing, and blue-green deployments.
Security: It provides authentication, authorization, and encryption for service-to-service communication, enhancing the overall security of your microservices.
Observability: Istio offers powerful tools for monitoring, tracing, and logging, which help in diagnosing issues, understanding performance, and maintaining a healthy application.
Resilience: Istio can automatically retry failed requests, handle timeouts, and implement circuit breaking to improve the overall reliability of microservices.
Simple Examples:
Traffic Routing: Imagine you have a web application with a new feature. You can use Istio to gradually send a percentage of user traffic to the new feature, allowing you to test it with a subset of users before a full rollout (see the sketch after this list).
Security: Istio can ensure that only authenticated services can communicate with each other. You can demonstrate this by creating policies that deny communication between services unless they meet specific authentication requirements.
Observability: You can show how Istio generates telemetry data and integrates with tools like Grafana and Prometheus to monitor service performance and troubleshoot issues.
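To illustrate the traffic-routing example, here is a minimal sketch of an Istio VirtualService that splits traffic 90/10 between two versions of a service. The service name reviews and the subsets v1/v2 are placeholders, and the sketch assumes a matching DestinationRule already defines those subsets.

```bash
# Sketch: weighted canary routing between two subsets of the "reviews" service
cat <<'EOF' | kubectl apply -f -
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
        subset: v1        # current version keeps 90% of traffic
      weight: 90
    - destination:
        host: reviews
        subset: v2        # new version receives 10% of traffic
      weight: 10
EOF
```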
Why use Istio?
Fine-Grained Traffic Control: While Kubernetes provides basic load balancing and Ingress controllers, Istio offers more advanced traffic control. With Istio, you can perform complex routing, such as canary releases, A/B testing, and gradual deployments, which Kubernetes Ingress may not handle as flexibly.
Security: Istio provides robust security features, including mutual TLS (mTLS) authentication between microservices. This ensures secure communication, which might not be as comprehensive in a basic Kubernetes setup.
Observability: Istio offers advanced observability features, including metrics, distributed tracing, and logging, which are valuable for diagnosing issues and understanding performance. While Kubernetes provides some observability, Istio enhances it significantly.
When not to use Istio?
Simplicity: If you have a simple, monolithic application or a small-scale microservices setup, the overhead of configuring and managing Istio may outweigh its benefits. Simpler solutions like basic Kubernetes Ingress controllers or a lightweight proxy can suffice.
Learning Curve: Istio has a steep learning curve, which may be challenging for teams new to microservices or containers. If your team lacks the necessary expertise, you might consider simpler alternatives.
Resource Overhead: The sidecar proxies deployed with Istio can consume additional resources, which may be a concern in resource-constrained environments. This overhead may not be justified for smaller applications.
It's important to set the correct annotations so that the ALB can be provisioned.
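The sketch below shows one way such annotations could be attached to the istio-ingressgateway Service through an IstioOperator overlay. It assumes the AWS Load Balancer Controller is installed; the exact annotation values depend on whether you front the gateway with an ALB or an NLB and on your networking setup, so treat them as placeholders.

```bash
# Sketch: pass load-balancer annotations to the ingress gateway Service via IstioOperator
cat > ingress-gateway-overlay.yaml <<'EOF'
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
      k8s:
        serviceAnnotations:
          # assumed values; adjust scheme/target type for your environment
          service.beta.kubernetes.io/aws-load-balancer-type: "external"
          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
          service.beta.kubernetes.io/aws-load-balancer-scheme: "internet-facing"
EOF
istioctl install -f ingress-gateway-overlay.yaml -y
```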
Enable SideCar Injection
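A minimal way to do this is to label the namespace where the workloads run (the default namespace is assumed here) so that Istio injects the Envoy sidecar into new pods:

```bash
# Label the namespace so new pods get the Envoy sidecar injected automatically
kubectl label namespace default istio-injection=enabled --overwrite

# Verify the label
kubectl get namespace default --show-labels
```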
Install the Gateway CRDs
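Assuming this refers to the Kubernetes Gateway API CRDs that Istio can use, a sketch of the install step looks like the following; the release version in the URL is an assumption, so use whichever version your Istio release documents.

```bash
# Install the Kubernetes Gateway API CRDs if they are not already present
kubectl get crd gateways.gateway.networking.k8s.io &> /dev/null || \
  kubectl apply -f "https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml"
```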
Testing out Istio Installation
Test the Istio Installation with the BookInfo Example
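A sketch of deploying the BookInfo sample into the sidecar-injected namespace; the release branch in the URL is an assumption, so match it to your Istio version.

```bash
# Deploy the BookInfo sample application
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/bookinfo/platform/kube/bookinfo.yaml

# Each pod should show 2/2 containers once the sidecar is injected
kubectl get pods
```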
We can hit the bookinfo endpoint
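One way to check this, borrowed from the Istio getting-started guide, is to exec into one of the BookInfo pods and curl the productpage service from inside the mesh:

```bash
# Call the productpage service from inside the mesh and extract the page title
kubectl exec "$(kubectl get pod -l app=ratings -o jsonpath='{.items[0].metadata.name}')" \
  -c ratings -- curl -sS productpage:9080/productpage | grep -o "<title>.*</title>"
```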
Install the bookinfo gateway
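A sketch of installing the BookInfo gateway and virtual service that expose the app through the Istio ingress gateway (again, pin the URL to your Istio version):

```bash
# Expose BookInfo through the Istio ingress gateway
kubectl apply -f https://raw.githubusercontent.com/istio/istio/release-1.20/samples/bookinfo/networking/bookinfo-gateway.yaml

# Check for configuration errors
istioctl analyze
```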
Now we need to figure out the gateway URL that we can hit.
You can also get the LoadBalancer URL directly from the istio-ingressgateway service.
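A sketch of reading the external hostname and port off the istio-ingressgateway Service (on EKS the address is a load-balancer DNS name rather than an IP):

```bash
# Read the load-balancer hostname and HTTP port from the ingress gateway Service
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

export GATEWAY_URL="${INGRESS_HOST}:${INGRESS_PORT}"
echo "http://${GATEWAY_URL}/productpage"

# Or simply inspect the Service directly
kubectl get svc istio-ingressgateway -n istio-system
```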
KServe
KServe is an open-source framework designed to streamline the deployment and serving of machine learning models on Kubernetes. It simplifies the process of deploying models for inference in production environments.
KServe is specifically tailored for serving machine learning models efficiently and scalably on Kubernetes. It simplifies the operational aspects of ML model serving, allowing data scientists and ML engineers to focus on their models while taking advantage of Kubernetes' flexibility and scaling capabilities.
KServe Architecture
KServe's architecture has a control plane and a data plane. The control plane manages model deployments, creating service endpoints, handling scaling, and connecting to model storage (like cloud storage).
The data plane processes requests and responses for specific models. It includes the predictor for inference, the transformer for data processing, and optionally, an explainer for model explainability.
Components:
KServe Controller: Manages the lifecycle of model deployments, including creating services, ingresses, containers, and model agent containers for logging and batching.
Model Store: A repository for registered models, usually cloud-based (e.g., Amazon S3, Google Cloud Storage).
Predictor: The core component that receives inference requests and returns predictions from the model; it works alongside the optional transformer (pre- and post-processing) and the optional explainer (model explainability).
Supported Frameworks and Runtimes:
KServe supports various machine learning and deep learning frameworks, including TensorFlow, ONNX, PyTorch, and more. It also works with classical ML models based on SKLearn, XGBoost, and others. You can even extend KServe to support custom runtimes that follow the V2 inference protocol.
We can deploy the IRIS Example to test the KServe Installation
iris.yaml
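The manifest is likely something close to the sklearn-iris example from the KServe docs; a sketch:

```bash
# Sketch of iris.yaml: an InferenceService serving a pre-trained sklearn Iris model
cat > iris.yaml <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # public example model from the KServe documentation
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF
```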
iris-input.json
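A sketch of the request payload, two Iris feature vectors in the V1 instances format:

```bash
# Sketch of iris-input.json: two rows of Iris features (sepal/petal measurements)
cat > iris-input.json <<'EOF'
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF
```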
Install iris inference service
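Assuming the manifest above, the install and readiness check are:

```bash
kubectl apply -f iris.yaml

# Wait until READY is True and a URL is assigned
kubectl get inferenceservice sklearn-iris
```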
Now we need the Ingress URL to hit the iris endpoint
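A sketch of resolving the ingress address and calling the predict endpoint; the service hostname is taken from the InferenceService status and passed as the Host header:

```bash
# Ingress gateway address (load-balancer hostname on EKS)
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway \
  -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')

# Hostname KServe assigned to the InferenceService
export SERVICE_HOSTNAME=$(kubectl get inferenceservice sklearn-iris \
  -o jsonpath='{.status.url}' | cut -d "/" -f 3)

# V1 predict call routed through the ingress gateway
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" \
  -d @./iris-input.json
```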
So we have successfully installed KServe!
S3 Access for KServe
Enable OIDC if not done already
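A sketch with eksctl; the cluster name and region are placeholders:

```bash
# Associate an IAM OIDC provider with the cluster (no-op if already associated)
eksctl utils associate-iam-oidc-provider \
  --cluster <your-cluster-name> \
  --region <your-region> \
  --approve
```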
Create IRSA for S3 Read Only Access
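A sketch of creating an IAM role for a service account (IRSA) with read-only S3 access; the service account and cluster names are placeholders:

```bash
# Create a service account bound to an IAM role with S3 read-only access
eksctl create iamserviceaccount \
  --name s3-read-only \
  --namespace default \
  --cluster <your-cluster-name> \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve
```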
Create S3 Secret to be used by our Inference Service
s3-secret.yaml
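A sketch of what this secret might look like when credentials come from IRSA: the secret only carries the KServe S3 annotations (region and endpoint are assumptions here) and is then attached to the service account created above.

```bash
# Sketch of s3-secret.yaml: KServe reads the S3 region/endpoint from these annotations;
# with IRSA no access keys are stored in the secret itself
cat > s3-secret.yaml <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: s3creds
  annotations:
    serving.kserve.io/s3-endpoint: s3.amazonaws.com
    serving.kserve.io/s3-usehttps: "1"
    serving.kserve.io/s3-region: us-east-1   # assumed region
EOF
kubectl apply -f s3-secret.yaml

# Attach the secret to the IRSA service account so KServe picks up the annotations
kubectl patch serviceaccount s3-read-only \
  --type merge -p '{"secrets": [{"name": "s3creds"}]}'
```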
Deploy SDXL on EKS with KServe
This deployment is going to take the .mar file we generated in the TorchServe assignment and serve it over KServe!
config/config.properties
requirement.txt
sdxl_handler.py
Now we need to start a torchserve docker container to create the model archive file
Create MAR from handler and model artifact
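A sketch of running the archiver inside a TorchServe container; the handler, requirements file, and model-artifact names come from the earlier assignment and are assumptions here.

```bash
# Start a TorchServe container with the working directory mounted
docker run --rm -it -v "$(pwd)":/home/model-server/sdxl \
  pytorch/torchserve:latest-gpu bash

# Inside the container: package the handler and model artifacts into sdxl.mar
cd /home/model-server/sdxl
torch-model-archiver \
  --model-name sdxl \
  --version 1.0 \
  --handler sdxl_handler.py \
  --extra-files sdxl-model.zip \
  -r requirement.txt \
  --export-path .       # sdxl-model.zip is an assumed artifact name from the assignment
```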
This will create sdxl.mar
Create an S3 bucket where we will store these model .mar files, which will be loaded by KServe
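A sketch with the AWS CLI; the bucket name is a placeholder, and the model-store/config layout matches what KServe's TorchServe runtime expects under the storage URI:

```bash
# Create the bucket and upload the model archive plus the TorchServe config
aws s3 mb s3://<your-sdxl-bucket>
aws s3 cp sdxl.mar s3://<your-sdxl-bucket>/model-store/sdxl.mar
aws s3 cp config/config.properties s3://<your-sdxl-bucket>/config/config.properties
```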
Deploy SDXL!
sdxl.yaml
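A sketch of what sdxl.yaml might look like: a PyTorch (TorchServe) predictor pointing at the bucket layout above, using the IRSA service account and one GPU. The name, bucket, and resource sizes are assumptions.

```bash
cat > sdxl.yaml <<'EOF'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sdxl
spec:
  predictor:
    serviceAccountName: s3-read-only     # must match the IRSA service account
    model:
      modelFormat:
        name: pytorch                    # served by the TorchServe runtime
      storageUri: "s3://<your-sdxl-bucket>/"
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/gpu: "1"
EOF

# Apply once the serviceAccountName matches your IRSA service account
kubectl apply -f sdxl.yaml
```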
Make sure serviceAccountName is correct!
It will start the torchserve predictor pod
Once created, you will get a URL for your InferenceService.
Now we need to figure out how to access this model.
We can always SSH into our GPU Node to see what’s happening
Python script to hit the inference endpoint and save the results
test_kserve.py
That’s it! We have deployed SDXL on KServe on Kubernetes on EKS!
Monitoring the Deployment
Nvidia DCGM (Data Center GPU Manager)
During the cluster creation process, the NVIDIA device plugin will get installed. You will need to remove it after cluster creation because we will use the NVIDIA GPU Operator instead.
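A sketch of removing it; the DaemonSet name and namespace are assumptions, so list the DaemonSets first and delete the one that matches.

```bash
# Find the device plugin DaemonSet installed at cluster creation
kubectl get daemonsets -A | grep -i nvidia

# Delete it (name/namespace assumed; adjust to what the previous command shows)
kubectl delete daemonset nvidia-device-plugin-daemonset -n kube-system
```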
What is a DaemonSet?
A DaemonSet is a controller that ensures a copy of a pod runs on every node (or a selected set of nodes) in the cluster. When a node is added to or removed from the cluster, the DaemonSet automatically adds or deletes the pod on that node.
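For illustration, a minimal DaemonSet sketch that runs one placeholder pod per node:

```bash
# Sketch: a DaemonSet that schedules one copy of a pod on every node
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-heartbeat          # placeholder name
spec:
  selector:
    matchLabels:
      app: node-heartbeat
  template:
    metadata:
      labels:
        app: node-heartbeat
    spec:
      containers:
      - name: heartbeat
        image: busybox:1.36
        command: ["sh", "-c", "while true; do echo alive; sleep 60; done"]
EOF

# One pod per node should appear
kubectl get pods -o wide -l app=node-heartbeat
```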