Xoriant’s Contribution to the Linux Foundation’s “Cloud-Native PNDA” Project
By Atul Dambalkar and Sreenivasa Gopireddy
Recently, Xoriant contributed to Cloud-Native PNDA, an open-source Linux Foundation community project. The Xoriant team supported the migration of Red-PNDA to Cloud-Native PNDA (Red-PNDA with containerization) for the Deployment Manager module. In this blog, we’ll go through the challenges, learnings and outcomes of this cloud-native development project for the Linux Foundation.
Overview and Comparison of PNDA, Red PNDA and Cloud-Native PNDA
Key components of the PNDA platform include:
1. HDFS
HDFS (Hadoop Distributed File System) is an open-source distributed file system that provides fault tolerance through a self-healing mechanism. It is well suited to large-scale data processing workloads. In PNDA, Apache Gobblin runs every half hour to copy all data from Kafka into the master dataset in HDFS. Applications can also store their output data in HDFS.
2. HBase
HBase is a distributed, scalable key-value data store designed for fast, random access to big data sets, i.e., billions of rows and millions of columns. In PNDA, a custom app that reads data from a source such as Kafka or HDFS can write arbitrary key/value data into HBase.
3. Apache Spark Streaming
Spark Streaming is an extension of the core Spark API that allows scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from sources such as Kafka, Kinesis, or TCP sockets. It can then be processed with complex algorithms expressed through high-level functions such as join, map, window and reduce. Processed data can be pushed out to filesystems, databases, and live dashboards.
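To make the map, window and reduce steps concrete, here is a minimal plain-Python illustration of a windowed count. It is not Spark code and does not use the Spark API; the function name, the sample events and the 10-second window are all hypothetical and chosen only to show the idea.

```python
from collections import defaultdict
from functools import reduce


def windowed_counts(events, window_seconds):
    """Count keys per fixed-size time window.

    `events` is an iterable of (timestamp_in_seconds, key) pairs.
    """
    # map: turn every event into ((window_start, key), 1)
    mapped = [((ts - ts % window_seconds, key), 1) for ts, key in events]

    # window: collect the values that share the same (window_start, key)
    grouped = defaultdict(list)
    for window_key, value in mapped:
        grouped[window_key].append(value)

    # reduce: sum the values inside each group
    return {k: reduce(lambda a, b: a + b, vs) for k, vs in grouped.items()}


# Hypothetical sample stream: (timestamp, key)
events = [(0, "a"), (3, "b"), (7, "a"), (12, "a"), (14, "b")]
counts = windowed_counts(events, window_seconds=10)
```

In Spark Streaming the same steps would run continuously and in parallel over incoming micro-batches rather than over an in-memory list.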
4. Apache Kafka
Apache Kafka is a publish-subscribe messaging system. It is a distributed streaming platform known for its high throughput. In PNDA, it enables messaging-based data processing for deployed applications.
5. Deployment Manager
The Deployment Manager component provides an API for deploying data processing applications based on Spark and Spark Streaming. It extracts the packages that contain the program binaries and configuration files for a specific task and creates application instances from those packages inside the PNDA platform. It can also Start, Stop, Terminate and Restart applications, and it tracks run-time application status and makes application logs viewable from the PNDA console.
6. Package Repository
Package Repository provides an API for uploading application packages into the PNDA platform.
Rancher-Kubernetes Cluster-based PNDA Platform Set-Up
To create the initial set-up of the Cloud-Native PNDA platform, we needed a Kubernetes cluster. We chose the Rancher platform to set up this cluster, which was created with Rancher Kubernetes Engine on Xoriant’s OpenStack-based private cloud.
Xoriant’s Cloud-Native PNDA Development Journey
The work involved Kubernetization of PNDA components such as the Deployment Manager, the addition of necessary components such as the Spark Operator (Kubernetes CRD), and the development of new features. An important aspect of any open-source contribution is understanding the existing components in detail so that new features can be added as extensions. The original PNDA project is written in Python. We took a deep dive into the code and developed the new features iteratively, in the Deployment Manager module and a few other components.
New PNDA project features contributed by Xoriant Cloud Engineers
1. Addition of Spark Operator (CRD) in the Helm Chart-based Deployment
We added the Spark-Kubernetes Operator Helm Chart to the PNDA platform. The Helm Chart creates the necessary service accounts and adds the RBAC policies needed for Pod creation inside the Kubernetes platform. This enables Spark applications to be deployed declaratively (from a YAML file) as Kubernetes Pods. The Spark-Kubernetes Operator also makes it possible to track the status of deployed Spark application Pods.
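As an illustration of what a declarative Spark application definition can look like, the sketch below builds one as a Python dictionary, which is the form the Python Kubernetes API accepts as a request body. This is not taken from the PNDA code base. The apiVersion and field names follow our understanding of the SparkApplication resource of the open-source Spark Operator and should be checked against the CRD actually installed; the application name, image, file path, namespace, service account and Spark version are hypothetical.

```python
import json


def build_spark_application(name, image, main_file, namespace="pnda", executors=2):
    """Return a SparkApplication manifest as a plain dictionary.

    All default values here are illustrative assumptions.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",  # assumed CRD version
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": "2.4.5",  # hypothetical version
            "restartPolicy": {"type": "Never"},
            # The service account must be allowed to create Pods (RBAC).
            "driver": {"cores": 1, "memory": "512m", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "512m"},
        },
    }


manifest = build_spark_application(
    name="demo-app",
    image="example/spark-py:latest",
    main_file="local:///opt/app/main.py",
)
print(json.dumps(manifest, indent=2))
```

The same structure written as YAML could be applied with kubectl; keeping it as a dictionary makes it easy to generate from package metadata.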
2. Enhancements to Deployment Manager Module
We enhanced the Deployment Manager module to deploy Spark and Spark Streaming applications through the Spark-Kubernetes Operator. With the new code in place, we also built the Docker image for the Deployment Manager. The enhancements were:
- Creation of Spark application Pods with the Python Kubernetes API through the Spark-Kubernetes Operator (CRD)
- Management of applications deployed inside the PNDA platform with the Python Kubernetes API through the PNDA console
- Displaying application Pod run-time status and logs in the PNDA console
- Modifications to the Helm Chart to use the newly created Docker image
3. Creation of Spark Application Pods Inside Kubernetes Cluster
We used the Python Kubernetes API to create and deploy the Spark application Pods through the Spark-Kubernetes Operator (CRD). This effectively allowed the PNDA platform to deploy Spark applications declaratively inside the Kubernetes cluster.
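A minimal sketch of such a call is shown below, assuming the official Python Kubernetes client and its CustomObjectsApi. It is not the Deployment Manager's actual code. The helper name is hypothetical, and the group, version and plural values are assumptions about the installed Spark Operator CRD. The client object is passed in as a parameter so that the function can be exercised without a cluster.

```python
# Assumed identifiers of the Spark Operator custom resource.
GROUP = "sparkoperator.k8s.io"
VERSION = "v1beta2"
PLURAL = "sparkapplications"


def submit_spark_application(api, manifest):
    """Create a SparkApplication object from a manifest dictionary.

    `api` is expected to behave like kubernetes.client.CustomObjectsApi.
    With the official client installed it would typically be obtained as:

        from kubernetes import client, config
        config.load_incluster_config()  # or config.load_kube_config()
        api = client.CustomObjectsApi()
    """
    namespace = manifest["metadata"].get("namespace", "default")
    return api.create_namespaced_custom_object(
        group=GROUP,
        version=VERSION,
        namespace=namespace,
        plural=PLURAL,
        body=manifest,
    )
```

Once the object exists, the operator picks it up and creates the driver and executor Pods; the caller does not create Pods directly.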
4. Management of Applications Deployed Inside the PNDA Platform
We added functionality to control the state of deployed applications. This involved invoking the necessary Spark-Kubernetes Operator APIs for the Start, Restart and Delete actions on deployed Spark applications. Because Kubernetes does not support stopping or pausing a Pod in its current state and resuming it later, we removed the application Stop functionality from the PNDA UI console.
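The Delete and Restart actions could be sketched as follows, again assuming the CustomObjectsApi of the official Python Kubernetes client. The helper names are hypothetical, and implementing Restart as "delete, then create again" is our assumption for illustration, not a description of how the Deployment Manager or the operator implements it.

```python
# Assumed identifiers of the Spark Operator custom resource.
GROUP = "sparkoperator.k8s.io"
VERSION = "v1beta2"
PLURAL = "sparkapplications"


def delete_spark_application(api, name, namespace):
    """Delete a SparkApplication object by name.

    Note: some older releases of the Python client also require a
    `body` argument (delete options) for this call.
    """
    return api.delete_namespaced_custom_object(
        group=GROUP,
        version=VERSION,
        namespace=namespace,
        plural=PLURAL,
        name=name,
    )


def restart_spark_application(api, manifest):
    """Restart by deleting the object and submitting the manifest again.

    Deletion is asynchronous on a real cluster, so production code would
    need to wait until the old object is gone before re-creating it.
    """
    name = manifest["metadata"]["name"]
    namespace = manifest["metadata"].get("namespace", "default")
    delete_spark_application(api, name, namespace)
    return api.create_namespaced_custom_object(
        group=GROUP,
        version=VERSION,
        namespace=namespace,
        plural=PLURAL,
        body=manifest,
    )
```

There is deliberately no stop or pause helper here, which mirrors the removal of the Stop action from the console.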
5. Displaying Application Pod Run-Time Status and Logs
Along with the above enhancements to the Deployment Manager, we updated the feature that fetches and displays application Pod run-time status, such as Pending, Running, Succeeded, Failed and CrashLoopBackOff. The status is displayed inside the PNDA console. We also updated the PNDA console to display application run-time logs.
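One way to derive such a status is sketched below, assuming the CoreV1Api of the official Python Kubernetes client. In Kubernetes, Pending, Running, Succeeded and Failed are Pod phases, while CrashLoopBackOff is reported as a container waiting reason, so the sketch checks the container statuses first. The helper names and the default of 100 log lines are hypothetical, and this is not the console's actual code.

```python
# Container waiting reasons that are shown instead of the Pod phase.
SURFACED_REASONS = {"CrashLoopBackOff"}


def display_status(pod):
    """Return the status string to show for a Pod object."""
    for container_status in (pod.status.container_statuses or []):
        state = container_status.state
        waiting = state.waiting if state is not None else None
        if waiting is not None and waiting.reason in SURFACED_REASONS:
            return waiting.reason
    # Otherwise fall back to the Pod phase (Pending, Running, ...).
    return pod.status.phase


def fetch_status_and_logs(core_api, pod_name, namespace, tail_lines=100):
    """Fetch the display status and the last log lines of a Pod.

    `core_api` is expected to behave like kubernetes.client.CoreV1Api.
    """
    pod = core_api.read_namespaced_pod_status(name=pod_name, namespace=namespace)
    logs = core_api.read_namespaced_pod_log(
        name=pod_name, namespace=namespace, tail_lines=tail_lines
    )
    return display_status(pod), logs
```

For a Spark application, the Pod of interest would usually be the driver Pod, since its logs carry the application output.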
6. Modifications to the Helm Chart
We modified the existing Helm Chart with the necessary configuration and settings so that the newly created Docker image for the Deployment Manager is referenced during the Kubernetes deployment.
Benefits of the New Functionality
With the newly added functionality, Cloud-Native PNDA now offers the following:
- Support for Spark Operator (CRD)
- Easy deployment and management (Start/Restart/Delete) of Spark and Spark Streaming applications
- Ability to view Pod status (Started/Running/Completed) of the deployed Spark applications
- Ability to view run-time logs of the deployed Spark applications through the PNDA console
For more information on the Cloud-Native PNDA project, visit:
Do you have questions? Would you like to discuss modernization using Kubernetes with one of our experts? Write to PE@xoriant.com
Disclaimer: The screenshots used in the blog are part of the open-source PNDA platform. All rights belong to the respective owners.