
High-Performance Data Centre

Kerala Genome Data Centre: A State-of-the-Art Facility for Genomic Data Sharing and Collaboration

High-Performance Computing (HPC) clusters have become a critical tool for managing and processing the large datasets common in bioinformatics and artificial intelligence (AI) systems. Standard server arrays, while capable, often lack the power and speed to handle such tasks efficiently, which can result in crippling execution delays. In contrast, HPC clusters, made up of a mix of Central Processing Unit (CPU) and Graphics Processing Unit (GPU) nodes and managed by an efficient cluster control unit, are specifically designed for these computationally intensive tasks.

CPU Compute Nodes function as the backbone of the cluster. These powerful computational workhorses share similarities with traditional servers but are equipped with higher-performance CPUs, more substantial amounts of RAM, and vast storage capabilities. They are perfectly suited to bioinformatics tasks, which frequently involve complex search operations, slicing and splicing of genomic sequences, or other data-intensive tasks. A section of these units can also be set aside as Virtual Machines (VMs) for general applications such as task management tools like JIRA, internal Intranet applications, or database management systems.

GPU Compute Nodes, on the other hand, are the heavy lifters for the mathematics and model computations at the core of AI and machine learning (ML) algorithms. These nodes feature fewer CPUs but more GPUs, which are remarkably efficient at mathematical processing. The CPUs in these nodes handle task scheduling and distribution to the GPUs, making GPU Compute Nodes ideal for AI/ML tasks that require massive parallel processing.
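To make that contrast concrete, here is a small CPU-side Python sketch of the data-parallel pattern GPUs accelerate: the same multiplication applied across a million elements at once rather than one element at a time. NumPy's vectorized path merely stands in for what GPU hardware does natively across thousands of cores; the array size is arbitrary.

```python
import time
import numpy as np

n = 1_000_000
x = np.random.rand(n)
y = np.random.rand(n)

# Element-at-a-time: the way a single scalar core works through the data.
start = time.perf_counter()
out_loop = np.empty(n)
for i in range(n):
    out_loop[i] = x[i] * y[i]
loop_t = time.perf_counter() - start

# One vectorized expression over the whole array: the data-parallel
# pattern that GPU hardware executes across thousands of cores at once.
start = time.perf_counter()
out_vec = x * y
vec_t = time.perf_counter() - start

assert np.allclose(out_loop, out_vec)
print(f"scalar loop: {loop_t:.3f}s  vectorized: {vec_t:.4f}s")
```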

Just as the cardiovascular system in our body ensures efficient and effective transport of blood, the network system within an HPC cluster ensures fast and reliable data transfer between nodes. This network infrastructure is vital for maintaining quick data turnaround times and seamless computational execution across the cluster.

The storage in an HPC cluster is a three-tiered system designed to balance cost-effectiveness with performance and capacity. It typically includes high-speed storage for frequently accessed data, intermediate storage for less frequently used data, and long-term storage for archival purposes. The multi-tiered approach ensures fast data throughput while providing ample space for the varied needs of bioinformatics and AI applications.

The vision for KGDC’s high-capacity data centre

  • Data sources: humans, animals, plants, and environmental samples
  • Infrastructure, data storage, and computational resources

High-Performance Server Arrays

  • High-performance server arrays are integral to any data centre designed to handle genomic data. These arrays comprise 100 servers, each fitted with 66 CPU cores and 1,200 GPU cores, for a total of 6,600 CPU cores and 120,000 GPU cores in the array. The CPU cores handle a variety of tasks, including data input/output, instruction execution, and multitasking, while the GPU cores specialize in performing simultaneous calculations, crucial for tasks such as DNA sequencing and genomic analysis.
  • The collective computing power amounts to 120 TeraFLOPS of CPU performance (trillions of operations per second) and 1 PetaFLOPS of GPU performance (quadrillions of operations per second); a quick arithmetic check of these figures follows this list. These performance numbers mean that our servers are built to execute massively parallel operations, making them ideal for the extensive data processing involved in genomic analysis.
  • Moreover, our servers are equipped with a total of 84TB of internal RAM, which serves as a massive workspace for computations, contributing to rapid processing times. As for storage, the server arrays boast 630TB of internal storage. This is over a thousand times more than what an average laptop holds, underlining the scale of operations that our data centre can handle.
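As a sanity check on the figures above, the short Python sketch below re-derives the core totals and the implied per-core CPU throughput. The per-core figure is our own illustrative derivation, not a vendor specification.

```python
# Back-of-the-envelope check of the server-array figures quoted above.
SERVERS = 100
CPU_CORES_PER_SERVER = 66
GPU_CORES_PER_SERVER = 1_200

total_cpu_cores = SERVERS * CPU_CORES_PER_SERVER   # 6,600
total_gpu_cores = SERVERS * GPU_CORES_PER_SERVER   # 120,000

# 120 TeraFLOPS spread over 6,600 CPU cores implies roughly
# 18 GigaFLOPS per core (an assumed, illustrative derivation).
cpu_tflops = 120
per_core_gflops = cpu_tflops * 1_000 / total_cpu_cores

print(f"CPU cores: {total_cpu_cores:,}  GPU cores: {total_gpu_cores:,}")
print(f"Implied per-core CPU throughput: ~{per_core_gflops:.1f} GFLOPS")
```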

AI/ML Environment

  • Our AI/ML environment, optimized for high-performance computing, is furnished with state-of-the-art GPUs designed for intensive computational workloads. It’s capable of delivering 480 TeraFLOPS and 30 PetaOPS (INT) of computational power, addressing the enormous datasets and complex computations required by Machine Learning (ML) and Artificial Intelligence (AI).
    We utilize eight configurations of Nvidia’s A100 Tensor Core GPUs, with each configuration containing three GPUs. The A100 is Nvidia’s most advanced data centre GPU, designed for the most demanding computational tasks. These GPUs are integral to performing sophisticated pattern recognition and statistical analysis on genome data, indispensable for predictive models and data mining in genomics.
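The quoted throughput is consistent with 24 A100s. Assuming Nvidia's published A100 peak figures (roughly 19.5 TFLOPS FP32 per GPU, and 1,248 TOPS INT8 with structured sparsity, which is our assumption for what the "(INT)" figure refers to), the Python check below shows the totals line up.

```python
# Sanity check: do 8 x 3 = 24 A100s account for the quoted throughput?
# Peak figures are Nvidia's published A100 numbers (FP32: 19.5 TFLOPS;
# INT8 with structured sparsity: 1,248 TOPS), used here as assumptions
# for an order-of-magnitude check, not vendor guarantees.
CONFIGS, GPUS_PER_CONFIG = 8, 3
A100_FP32_TFLOPS = 19.5
A100_INT8_TOPS_SPARSE = 1_248

gpus = CONFIGS * GPUS_PER_CONFIG                  # 24 GPUs
fp_total = gpus * A100_FP32_TFLOPS                # ~468 TFLOPS
int_total = gpus * A100_INT8_TOPS_SPARSE / 1_000  # ~30 PetaOPS

print(f"{gpus} GPUs -> ~{fp_total:.0f} TFLOPS FP32, ~{int_total:.0f} PetaOPS INT8")
```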

Specialized Hardware and Software

  • To cater to the unique requirements of genomics, we are using specialized hardware and software for faster analysis. Our dedicated Bio-IT platform provides a substantial performance boost to specific bioinformatics pipelines, making the complex calculations involved in genomics more efficient. This platform’s primary function is to accelerate computational tasks related to genomic sequencing, genomic mapping, and structural bioinformatics, among others.
  • We’re also leveraging integrated analysis software to execute bioinformatics pipelines effectively. These software packages, used worldwide by researchers and clinicians, will facilitate a myriad of tasks, from raw sequence data processing to advanced data visualization and statistical analysis.

Cluster, Workflow, and Monitoring Software

  • Our data centre uses cutting-edge cluster, workflow, and monitoring software to ensure smooth and efficient operations. Altair PBS Pro is a powerful cluster scheduler that manages and optimizes workloads within our computing environment, delivering the resources necessary for data processing tasks and ensuring a high level of performance (a minimal job-submission sketch follows this list).
  • Nextflow is an open-source workflow framework designed to simplify the deployment of complex computational pipelines across clouds and clusters. It enables the development of scalable and reproducible scientific workflows using software containers, fostering better collaboration and facilitating reproducible results, which are crucial in genomic research.
  • SolarWinds, an IT monitoring and management suite, provides comprehensive visibility into and performance monitoring of our network infrastructure. It helps us maintain system performance and identify and address potential issues before they escalate, ensuring the reliability and integrity of our data centre.
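To illustrate how work reaches the scheduler in practice, here is a minimal Python sketch that writes a PBS Pro job script and submits it with qsub. The resource request, queue name ("workq"), and job contents are illustrative assumptions, not KGDC's actual configuration; it must run on a host with PBS Pro installed.

```python
"""Minimal sketch of submitting a bioinformatics job to PBS Pro."""
import subprocess
import tempfile

job_script = """#!/bin/bash
#PBS -N genome-align
#PBS -l select=1:ncpus=16:mem=64gb
#PBS -l walltime=04:00:00
#PBS -q workq

cd "$PBS_O_WORKDIR"
echo "Running on $(hostname) with 16 cores"
# The actual bioinformatics pipeline command would go here.
"""

# Write the script to a temporary file so qsub can read it.
with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
    f.write(job_script)
    script_path = f.name

# qsub prints the new job's ID (e.g. "1234.pbsserver") on success.
job_id = subprocess.run(
    ["qsub", script_path], capture_output=True, text=True, check=True
).stdout.strip()
print(f"Submitted job {job_id}; monitor with: qstat {job_id}")
```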

Persistent Storage Architecture

  • Our data centre adopts a three-tier strategic storage architecture to handle the vast amounts of genomic data (a simple tier-placement sketch follows this list).
    • Primary Storage: 3PB of high-throughput NVMe SSDs will be used for active data, which remains quickly accessible and can be processed without delay. NVMe SSDs are known for their low latency and high I/O capabilities, making them ideal for the frequent read-write operations involved in genomic analysis.
    • Secondary Storage: 8PB of storage composed of SATA/SAS drives and an NVMe SSD cache will be used for less active data. This tier is designed for efficiency and cost-effectiveness. Once active processing of data is complete, it’s moved here for reporting and reference purposes.
    • Archival Storage: 20PB of tape storage will serve as our archival system. Tape storage is cost-effective, scalable, and reliable, making it perfect for long-term storage of enormous genomic datasets.
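The Python sketch below illustrates one simple way such a data lifecycle could be expressed: choose a tier based on how recently a dataset was accessed. The 30-day and one-year thresholds are invented for illustration; real placement would be governed by the storage-management software.

```python
from datetime import datetime, timedelta, timezone

# Illustrative tiering policy only; thresholds are assumptions.
TIERS = {
    "primary (NVMe, 3PB)":   timedelta(days=30),   # actively processed
    "secondary (SATA, 8PB)": timedelta(days=365),  # reporting/reference
    "archive (tape, 20PB)":  timedelta.max,        # long-term retention
}

def place(last_accessed: datetime) -> str:
    """Pick a storage tier from how recently a dataset was touched."""
    age = datetime.now(timezone.utc) - last_accessed
    for tier, limit in TIERS.items():
        if age <= limit:
            return tier
    return "archive (tape, 20PB)"

recent = datetime.now(timezone.utc) - timedelta(days=3)
stale = datetime.now(timezone.utc) - timedelta(days=900)
print(place(recent))  # primary (NVMe, 3PB)
print(place(stale))   # archive (tape, 20PB)
```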

Multipurpose Computing Platform

  • Our data centre will feature a robust virtualization platform equipped with 320 CPU cores and 2.5TB of RAM. This platform can host over 50 virtual machines, providing a versatile solution for a variety of computing tasks. Virtualization enables efficient hardware utilization, improved system provisioning and deployment, and enhanced disaster recovery. It will support various critical tasks, including administrative tasks, database systems, tracking and ticketing, version control for applications and documents, and intranet and internet servers.
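A rough capacity check shows what each VM could receive if the platform's resources were split evenly across 50 machines. The even split is an illustrative assumption; real VMs would be sized per workload, and CPU cores are often oversubscribed.

```python
# Rough capacity check for the virtualization platform (illustrative:
# real VM sizing is per-workload, not an even split).
TOTAL_CORES = 320
TOTAL_RAM_TB = 2.5
VMS = 50

cores_per_vm = TOTAL_CORES / VMS           # 6.4 vCPUs per VM
ram_gb_per_vm = TOTAL_RAM_TB * 1024 / VMS  # 51.2 GB per VM

print(f"~{cores_per_vm:.1f} vCPUs and ~{ram_gb_per_vm:.0f} GB RAM per VM")
```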

Infrastructure Redundancy

  • To ensure reliable, uninterrupted operation, we plan to build redundancy into our data centre’s infrastructure: two high-speed dedicated leased-line internet connections, two advanced active-active firewalls, and two sets of UPS systems and power generators. If a component fails, a backup component is always ready to take over, so our operations are never interrupted.

Security Measures

  • Physical Security: Our data centre will employ stringent physical security measures, including biometric access control systems and CCTV cameras for round-the-clock surveillance. These measures will ensure only authorized personnel can access our data centre, providing a robust first line of defence against physical threats.

Safety Measures

  • The safety of our data centre’s physical infrastructure is paramount. Hence, we’ll install comprehensive alert systems for fire, smoke, humidity, and electrical anomalies. These systems will enable us to swiftly detect and respond to any potential threats to our data centre.
  • Internal Data Security: We will implement a zero-trust data access policy: every attempt to access our network, whether from within or outside, must be authenticated and verified (a minimal sketch of this per-request check follows below). This policy greatly reduces the chances of a security breach.
  • Network Security: We will deploy a suite of network security measures, including secure firewalls, VPNs, Security Information and Event Management (SIEM) systems, and email and endpoint monitoring tools. These tools will ensure our network remains secure from threats and attacks.
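The sketch below, referenced in the Internal Data Security item, shows the shape of a zero-trust check in Python: each request presents a credential that is verified before any resource is served. The HMAC scheme, key handling, and identifiers are illustrative assumptions, not our production design.

```python
"""Sketch of the per-request verification a zero-trust policy implies."""
import hashlib
import hmac

SECRET_KEY = b"rotate-me-regularly"  # would live in a secrets manager

def sign(user: str, resource: str) -> str:
    """Issue a tag binding this user to this specific resource."""
    msg = f"{user}:{resource}".encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def verify(user: str, resource: str, tag: str) -> bool:
    """Re-derive and compare in constant time; no valid tag, no access."""
    expected = sign(user, resource)
    return hmac.compare_digest(expected, tag)

token = sign("analyst01", "/genomes/batch-42")
assert verify("analyst01", "/genomes/batch-42", token)      # granted
assert not verify("analyst01", "/genomes/batch-99", token)  # denied
```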
In conclusion, our proposed state-of-the-art genomic data centre is meticulously planned to handle the massive computing power required for genomic data analysis, deliver robust data security, and ensure operational continuity. The centre will be a significant asset in the field of medical research and analysis, contributing to advancements in our understanding of genomics.

Standards for data collection, analysis, and sharing

Ethical guidelines for responsible use of sensitive genomic data

As a genomic data centre, we will handle highly sensitive and private data. To protect this data, we will employ high-security data transfer systems such as Aspera and SureSync. Both systems use military-grade encryption for data transfers, safeguarding the data during transmission. Moreover, we will have stringent data policies and access controls, such as multi-factor authentication and encrypted data storage, to prevent unauthorized access to the data. Regular audits will be conducted to ensure that our data handling practices comply with all relevant regulations and best practices.
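Independent of the transfer tool, a checksum comparison is a common, product-agnostic way to confirm integrity end to end: the bytes that arrived are the bytes that were sent. The Python sketch below hashes the source and received files and compares digests; the file paths are hypothetical.

```python
"""Illustrative end-to-end integrity check for a genomic data transfer."""
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks so large genomes fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths for illustration.
source = Path("sample.fastq.gz")
received = Path("incoming/sample.fastq.gz")

if source.exists() and received.exists():
    ok = sha256_of(source) == sha256_of(received)
    print("transfer verified" if ok else "MISMATCH: retransmit")
```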


The Genomics Revolution

Transforming our understanding of genetic traits and diseases.

High-Performance Data Centre

A hub for genomic data sharing and collaboration.

One Health Approach

Recognizing the interconnectedness of human, animal, and environmental health.

Value to Kerala

Attracting leading companies and start-ups, improving healthcare outcomes, and supporting sustainable development.

Join us in Transforming Kerala into a Knowledge Economy
