
H100/H200 GPU Tender Announcement

Dear industry colleagues:
 
To support large-scale video model training, HPC AI Tech has decided to publicly solicit bids from GPU service providers. We sincerely invite leading and prospective suppliers to collaborate with us to unleash AI productivity.
 
Bidders must be compliant legal entities with a good business reputation and relevant experience delivering similar services (supporting documents may be attached). Subcontracting is not allowed.

GPUs

Quantity: 36–64 H100 or H200 (preferred) SXM nodes (288–512 GPUs) in a single cluster.
Priority is given to offers that allow the GPUs to be replaced with H200/Blackwell in the future (please include a replacement clause).
Form: bare metal, or cloud hosts with a management console from mature cloud vendors (preferred).

CPUs

Virtual machine CPU nodes
At least three CPU virtual machines (distributed across different physical machines, with an HCI/virtualization management platform provided), each with 16 cores, 32 GB of memory, a 300 GB system disk, and at least 10 TB of NVMe data disk.
Bare metal CPU nodes
At least three bare metal CPU servers, each configured with:
AMD EPYC 7H12 64-core CPU * 2
DDR4 3200 MHz 64 GB * 8
Intel SATA SSD 480 GB * 2 (RAID 1)
U.2 16 TB NVMe local disk * 4
Mellanox ConnectX-4 Lx 25 GbE dual-port NIC * 1
MCX755106AS-HEAT 200 Gbps dual-port NIC * 1

Storage

SSD shared storage
For example, Lustre, DDN, WEKA, etc.
IOPS: min{22,000 × storage capacity (TiB), 3,200,000}
Throughput: min{300 × storage capacity (TiB), 20,000} MB/s (a worked sketch follows at the end of this section)
Capacity: above 150 TB
Interface: Kubernetes integration is required; mounting the shared storage through CSI is preferred but not mandatory.
Interface: Providing CSI (Container Storage Interface) support is the best option; otherwise, a file system administration API (e.g., usage querying, quota setting) would still be a valuable addition.
HDD cold storage
At least 300 TB
Connected to the SSD shared storage and the GPU nodes via a 200 Gbps (preferred) storage network.
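
As a hedged check of the SSD shared storage targets above, the short sketch below evaluates the two performance formulas in Python for an assumed capacity of 160 TiB (a hypothetical value chosen only to exceed the 150 TB floor; it is not part of the tender). At that size both targets hit their caps.

    # Hedged sketch: evaluate the tender's SSD shared storage performance formulas.
    # Only the formulas come from the tender; the capacity value is hypothetical.

    def required_iops(capacity_tib: float) -> float:
        """IOPS target: min{22,000 * capacity (TiB), 3,200,000}."""
        return min(22_000 * capacity_tib, 3_200_000)

    def required_throughput_mb_s(capacity_tib: float) -> float:
        """Throughput target: min{300 * capacity (TiB), 20,000} MB/s."""
        return min(300 * capacity_tib, 20_000)

    capacity = 160  # TiB, hypothetical example above the 150 TB floor
    print(f"IOPS       >= {required_iops(capacity):,.0f}")                  # 3,200,000 (cap reached)
    print(f"Throughput >= {required_throughput_mb_s(capacity):,.0f} MB/s")  # 20,000 (cap reached)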

Network

Public network: at least 2 Gbps of shared bandwidth, which can be dynamically allocated between CPU and GPU nodes based on VPC.
Storage network: at least 200 Gbps.
Compute network: 400 Gbps × 8 InfiniBand (IB); IB is required rather than RoCE.
Public IPs: two for now, with the option to add more later. All ports are open by default.
Ethernet interconnect speed between nodes within the cluster (25 Gbps or above preferred):
  • CPU to CPU: minimum 10 Gbps
  • GPU to GPU: minimum 10 Gbps
  • CPU to GPU: minimum 5 Gbps
  • CPU to storage: minimum 10 Gbps

Delivery and Testing

  • All resources should be available in October (preferred) or November.
  • Functional testing requires only 2 GPU nodes, plus the storage, CPU, and network resources, for about 3 days.
  • Full-cluster testing can be done at the beginning of official delivery and takes about 3 days.
  • The few most competitive suppliers will be invited to take part in the scheduled functional test.

Business

Region: any region is acceptable; under otherwise similar conditions, priority is given to the western United States. All resources must be connected to the same data center.
Scalability: GPU, CPU, storage, and network bandwidth must all be expandable on demand in the future. (The estimated lead time for expansion must be stated clearly, e.g., a few hours after the request, or 2 days/weeks.)
Signing and payment
  • Price is the most important factor.
  • The signing entity should be in Singapore, the United States, or the Cayman Islands; payment will be in US dollars.
  • A one-year contract is acceptable. If the penalty for early termination is reasonable, a two- or three-year contract can also be considered.
  • Monthly prepayment is expected, along with the ability to give monthly notice of early termination without penalty.
  • A reliable SLA and clear fault compensation terms are expected.
  • Tiered pricing is welcome, e.g. a GPU monthly price tiered as a > b > c > d > e:
      Month-to-month: a K
      Over 3 months: b K
      Over 6 months: c K
      Over 9 months: d K
      Over 12 months: e K
If we use the resources continuously for a long period, you may return vouchers every quarter to make up the price difference retroactively from the beginning (an illustrative calculation follows below).
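
As a hedged illustration only, the sketch below shows one way such a quarterly voucher could be computed: the buyer is invoiced month to month at the top tier price, and each quarter the supplier returns the difference between what was already paid and the lower tier price that now applies retroactively. All prices and the billing unit are hypothetical example values, not figures from the tender, which only labels the tiers a > b > c > d > e.

    # Hypothetical illustration of retroactive tier alignment via quarterly vouchers.
    # Prices are per billing unit per month in thousands of USD; purely example values.

    MONTHLY_PRICE = 280                                # "a": month-to-month price invoiced
    TIER_PRICE = {3: 260, 6: 250, 9: 240, 12: 230}     # "b".."e": price once usage exceeds N months

    def applicable_price(months_used: int) -> int:
        """Tier price that applies retroactively once `months_used` months are reached."""
        price = MONTHLY_PRICE
        for threshold, tier in sorted(TIER_PRICE.items()):
            if months_used > threshold:
                price = tier
        return price

    def cumulative_voucher(months_used: int) -> int:
        """Total voucher owed so far: amount paid at list price minus the retroactive tier price."""
        return (MONTHLY_PRICE - applicable_price(months_used)) * months_used

    refunded = 0
    for quarter_end in (3, 6, 9, 12):
        owed = cumulative_voucher(quarter_end)
        print(f"quarter ending month {quarter_end}: voucher = {owed - refunded} K")
        refunded = owed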
Bidding Documents
(i) Deadline for submission of bid documents: 9:00 am, September 30, 2024 (Singapore time). Bid documents received after the deadline or that do not meet the requirements will not be accepted.
(ii) Due to the large number of potential bidders, we cannot reply to each bid document one by one. Thank you for your understanding.
(iii) Please strictly follow the requirements and send complete materials to tender@hpcaitech.com
Please check the completeness of the following content before sending your bid documents:
  • An introduction to the signing entity and its business experience.
  • Cluster location, detailed configuration of the demand items, performance benchmarks (especially storage), cluster delivery time, and scalability projections.
  • Quotations broken down by demand item, SLA, and fault compensation terms.
  • Contract period, payment, and termination terms (multiple options may be provided for reference).
 
HPC AI TECHNOLOGY PTE. LTD.
September 26, 2024

 
