AzHOP backbone cost analyses
In the rapidly evolving landscape of High-Performance Computing (HPC) and Artificial Intelligence (AI), the quest for optimizing operational cost without compromising performance has become paramount. Microsoft Azure’s HPC On-Demand Platform (AzHOP) serves as an innovative solution that addresses both scale and flexibility needs. However, one area that often warrants scrutiny is the daily cost associated with the backbone infrastructure of AzHOP, which includes critical components such as Management VMs, persistent storage volumes, and more.
We will compare the daily backbone costs associated with different AzHOP configurations. Specifically, we will look at setups with SLURM DB and Azure Active Directory (AAD) enabled. We will explore three different storage options to examine how each impacts the overall cost and performance:
- 4TB Azure Files
- 4TB Premium Azure NetApp Files
- Azure Managed Lustre File System (AMLFS)
The objective is to arm decision-makers and technical experts with concrete insights that can guide them in selecting the most cost-effective yet performant backbone infrastructure for their Azure HPC deployments.
Azure Files (4TB)
The experiment was conducted over the period from September 2nd to September 4th. Cost data for both the starting day, September 2nd, and the concluding day, September 4th, are partial and therefore lower than the figures from September 3rd. In contrast, the data for September 3rd represents a complete 24-hour cycle.
Let’s break down the cost for 09/03 in the table below:
Date | Service Name | Cost (USD) |
---|---|---|
Sep 03 | Virtual Machines | 9.960719999999998 |
Sep 03 | Storage | 9.644774781549 |
Sep 03 | Azure Database for MariaDB | 2.989564258064516 |
Sep 03 | Virtual Network | 0.12240000000000001 |
Sep 03 | Azure DNS | 0.006294902258064519 |
Sep 03 | Bandwidth | 0.00022860545162111526 |
Sep 03 | Advanced Threat Protection | 0.0000011999999999999997 |
The primary cost components for running AzHOP are Virtual Machines and Azure Files (billed under Storage in the table above). This experiment allocates 4TB for Azure Files, but the size can be scaled down to 1TB, depending on your storage requirements. An additional cost comes from Azure Database for MariaDB, which serves as the database backend for SLURM accounting. If SLURM accounting is not critical for your use case, you can disable it to reduce costs further. With Azure Files reduced to 1TB and MariaDB disabled, the estimated minimum daily cost is approximately $12.5/day.
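The arithmetic behind the $12.5/day estimate can be sketched as follows. This is only a rough illustration: it assumes the whole Storage line is Azure Files and that its cost scales linearly with provisioned size, neither of which is confirmed by the cost export. The function names are hypothetical, and the figures are the Sep 03 values rounded to four decimals.

```python
# Sep 03 daily costs (USD) for the Azure Files setup, rounded from the table.
costs_sep03 = {
    "Virtual Machines": 9.9607,
    "Storage": 9.6448,
    "Azure Database for MariaDB": 2.9896,
    "Virtual Network": 0.1224,
    "Azure DNS": 0.0063,
    "Bandwidth": 0.0002,
    "Advanced Threat Protection": 0.0000,
}


def full_daily_cost(costs):
    """Sum every service line for the day."""
    return sum(costs.values())


def minimal_daily_cost(costs, provisioned_tb=4, target_tb=1):
    """Estimate the cost with a smaller share and no SLURM accounting DB.

    Assumption: the Storage line scales linearly with provisioned size.
    """
    reduced = dict(costs)
    reduced["Storage"] = costs["Storage"] * target_tb / provisioned_tb
    reduced.pop("Azure Database for MariaDB", None)
    return sum(reduced.values())


print(f"Full backbone:   ${full_daily_cost(costs_sep03):.2f}/day")
print(f"Minimal backbone: ${minimal_daily_cost(costs_sep03):.2f}/day")
```

Under these assumptions the full backbone comes to about $22.72/day and the reduced setup to about $12.50/day, which matches the estimate quoted above.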
Azure NetApp Files (4TB Premium)
As in the previous experiment, the cost data for September 2nd and September 4th are partial and do not represent a full 24-hour cycle. Only the data from September 3rd covers a complete 24-hour period.
Here is a table breakdown for 09/03:
Date | Service Name | Cost (USD) |
---|---|---|
Sep 03 | Azure NetApp Files | 13.469614080000001 |
Sep 03 | Virtual Machines | 9.953625947537999 |
Sep 03 | Azure Database for MariaDB | 2.989564258064516 |
Sep 03 | Storage | 0.20762154493899998 |
Sep 03 | Virtual Network | 0.12240000000000001 |
Sep 03 | Azure DNS | 0.006294712258064519 |
Sep 03 | Bandwidth | 0.00023048860579729093 |
Sep 03 | Advanced Threat Protection | 6e-7 |
With Azure NetApp Files (ANF), the smallest size that can be provisioned is 4TB, which costs approximately $13.5/day. Even if you run your setup without MariaDB, the projected total is roughly $24 to $25/day. This is about double the cost of using Azure Files (1TB without MariaDB).
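The comparison can be checked with a short calculation. The sketch below sums the Sep 03 ANF figures (rounded to four decimals) with MariaDB excluded and divides by the $12.5/day Azure Files estimate quoted earlier; the helper name `daily_total` is hypothetical.

```python
# Sep 03 daily costs (USD) for the ANF setup, rounded from the table.
anf_costs_sep03 = {
    "Azure NetApp Files": 13.4696,
    "Virtual Machines": 9.9536,
    "Azure Database for MariaDB": 2.9896,
    "Storage": 0.2076,
    "Virtual Network": 0.1224,
    "Azure DNS": 0.0063,
    "Bandwidth": 0.0002,
    "Advanced Threat Protection": 0.0000,
}


def daily_total(costs, exclude=()):
    """Sum the service lines, skipping any service named in `exclude`."""
    return sum(cost for name, cost in costs.items() if name not in exclude)


anf_without_db = daily_total(anf_costs_sep03, exclude=("Azure Database for MariaDB",))
azure_files_minimal = 12.5  # estimate quoted for Azure Files (1TB, no MariaDB)
ratio = anf_without_db / azure_files_minimal

print(f"ANF without MariaDB: ${anf_without_db:.2f}/day")
print(f"Ratio to Azure Files minimal setup: {ratio:.2f}x")
```

With these inputs the ANF setup without MariaDB totals about $23.76/day, roughly 1.9 times the Azure Files estimate.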
AMLFS
NOTE: If you want to use integrated Azure Blob storage with AMLFS, you must specify it in the Blob integration section when you create the file system. You can’t add an HSM-integrated blob container to an existing file system. Integrating blob storage when you create a file system is optional, but it’s the only way to use Lustre Hierarchical Storage Management (HSM) features. If you don’t want the benefits of Lustre HSM, you can import and export data for the Azure Managed Lustre file system by using client commands directly.
Without Blob integration
Setup
With Blob integration
Details on AMLFS
Determining network size
The subnet size you need depends on the size of the file system you create. The following table gives rough minimum subnet sizes for Azure Managed Lustre file systems of different capacities.
Storage capacity | Recommended CIDR prefix value |
---|---|
4 TiB to 16 TiB | /27 or larger |
20 TiB to 40 TiB | /26 or larger |
44 TiB to 92 TiB | /25 or larger |
96 TiB to 196 TiB | /24 or larger |
200 TiB to 400 TiB | /23 or larger |
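The table can be turned into a small lookup to sanity-check a planned subnet before creating the file system. This is a minimal sketch using Python's standard `ipaddress` module; the names `SUBNET_GUIDE`, `recommended_prefix`, and `subnet_is_large_enough` are hypothetical, and capacities that fall outside the table's rows are rejected rather than guessed.

```python
import ipaddress

# (min TiB, max TiB, longest acceptable prefix length), copied from the table.
SUBNET_GUIDE = [
    (4, 16, 27),
    (20, 40, 26),
    (44, 92, 25),
    (96, 196, 24),
    (200, 400, 23),
]


def recommended_prefix(capacity_tib):
    """Return the recommended CIDR prefix length for a given capacity."""
    for low, high, prefix in SUBNET_GUIDE:
        if low <= capacity_tib <= high:
            return prefix
    raise ValueError(f"{capacity_tib} TiB is not covered by the sizing table")


def subnet_is_large_enough(cidr, capacity_tib):
    """A subnet is large enough if its prefix is no longer than recommended."""
    network = ipaddress.ip_network(cidr)
    return network.prefixlen <= recommended_prefix(capacity_tib)


# Example: an 8 TiB file system in a /27 subnet.
print(recommended_prefix(8))
print(subnet_is_large_enough("10.4.0.0/27", 8))
```

For example, an 8 TiB file system falls in the first row, so a /27 subnet such as 10.4.0.0/27 meets the recommendation, while the same subnet would be too small for a 20 TiB file system.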
Steps to mount AMLFS to AzHOP
- Create the AMLFS resource group in the same region as AzHOP. AMLFS RG details:
Attribute | Value |
---|---|
Subscription | XXXX |
Resource group | JZ-AMLFS |
Region | South Central US |
Availability zone | 1 |
File system name | lustre |
Storage capacity | 8 TiB |
Throughput per TiB | 250 MB/s |
Total Throughput | 2000 MB/s |
Virtual network | (New) lustre-vnet |
Subnet | (New) default (10.4.0.0/27) |
Maintenance window | Sunday, 12:00 |
- Create VNet peering between AMLFS and AzHOP.
  - Select Allow access to remote virtual network for both VNets
  - Select Allow traffic to remote virtual network for both VNets
- In the AzHOP RG, edit nsg-common.
  - Change Inbound security rule 3100 to Allow
  - Change Outbound security rule 3100 to Allow
- Install pre-built client software on AzHOP
- Connect clients to an AMLFS
[root@scheduler ~]# mkdir /lustre
[root@scheduler ~]# sudo mount -t lustre -o noatime,flock 10.4.0.4@tcp:/lustrefs /lustre
[root@scheduler ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs 3.9G 417M 3.5G 11% /run
tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sda2 30G 3.6G 26G 13% /
/dev/sda1 494M 74M 421M 15% /boot
/dev/sda15 495M 12M 484M 3% /boot/efi
/dev/sdb1 16G 45M 15G 1% /mnt/resource
nfsfilespya6el4wo2vwgx.file.core.windows.net:/nfsfilespya6el4wo2vwgx/nfshome 1.0T 0 1.0T 0% /clusterhome
tmpfs 783M 0 783M 0% /run/user/1000
10.4.0.4@tcp:/lustrefs 8.0T 1.3M 7.6T 1% /lustre
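When the mount step needs to be repeated on several nodes, it can help to build the command and verify the result programmatically. The sketch below is illustrative only: the helper names are hypothetical, the command is assembled but not executed, and the check simply looks for the mount point in the last column of `df -h` output like the listing above.

```python
def lustre_mount_command(mgs_ip, mount_point, fs_name="lustrefs",
                         options=("noatime", "flock")):
    """Build the argument list for the mount command used above."""
    source = f"{mgs_ip}@tcp:/{fs_name}"
    return ["mount", "-t", "lustre", "-o", ",".join(options), source, mount_point]


def is_mounted(df_output, mount_point):
    """Return True if `mount_point` appears in the last column of df output."""
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields and fields[-1] == mount_point:
            return True
    return False


sample_df = """Filesystem Size Used Avail Use% Mounted on
/dev/sda2 30G 3.6G 26G 13% /
10.4.0.4@tcp:/lustrefs 8.0T 1.3M 7.6T 1% /lustre
"""

print(" ".join(lustre_mount_command("10.4.0.4", "/lustre")))
print(is_mounted(sample_df, "/lustre"))
```

The argument list could be passed to `subprocess.run` on a client that has the Lustre client software installed and root privileges; that step is left out here so the sketch runs anywhere.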