Dear industry colleagues:
To support large-scale video model training, HPC AI Tech has decided to publicly solicit GPU service providers. We sincerely invite leading and prospective suppliers to collaborate with us to liberate AI productivity.
Bidders must be compliant entities with a good business reputation and relevant experience delivering similar services (supporting documents may be attached). Subcontracting is not permitted.
GPUs
Quantity: 36-64 H100 or H200 (preferred) SXM nodes (288-512 GPUs) in a single cluster.
Preference will be given to GPUs that can be replaced with H200/Blackwell in the future (please include a replacement clause).
Form: bare metal, or cloud hosts with a management console from a mature cloud vendor (preferred).
Reference configuration (does not need to match exactly):
H200
CPU: 2 x Intel® Xeon® Platinum 8580
GPU: 8 x NVIDIA HGX H200 SXM5 GPUs
Memory: 32 x DDR5 5600MHz 64GB
OS Disk: 2 x M.2 960GB NVMe SSD
Data Disk: 8 x 2.5" 7.68TB NVMe SSD
NIC: 1 x CX7 200G
NIC: 8 x IB 400G
NIC: 1 x 25G
H100
CPU: 2 * Intel Xeon Platinum 8480+ (2.0GHz, 56 cores, 105MB cache, 350W)
GPU: 8 * NVIDIA HGX H100 SXM5
Memory: 32 * 64G DDR5 4800MHz
Hard Drive: 1 * 960GB M.2 SSD
Hard Drive: 8 * 7.68T U.2 NVMe
IB Card: 1 * BlueField-3 DPU, Dual-port 200Gb/s
IB Card: 8 * MCX75310AAC, Single-port 400Gb/s
Network Card: 1 * 25 GbE, Dual-port
CPUs
Virtual machine CPU nodes
At least three virtual machine CPU nodes (distributed across different physical machines, with an HCI/virtualization management platform provided), each with roughly 32 cores, 64GB memory, and at least a 16TB (32TB preferred) NVMe SSD data disk.
Bare metal CPU nodes
At least three bare metal CPU nodes, each with roughly:
AMD 7H12 64-Core * 2
DDR4 3200MHz 64G * 8
SSD SATA Intel 480G * 2 (RAID1)
U.2 16T NVMe local disk * 4
Mellanox CX4 Lx 25G 2P * 1
MCX755106AS-HEAT 200G 2P * 1
Storage
SSD shared storage
For example, Lustre, DDN, Weka, etc.
IOPS: min{22,000 * storage capacity (TiB), 3,200,000}
Throughput: min{300 * storage capacity (TiB), 20,000} MBps
Capacity: more than 150TB (a worked example of these targets follows this list)
Interface: Kubernetes integration is required; mounting the shared storage through CSI is preferred but not mandatory.
Interface: CSI (Container Storage Interface) support is the best option; if unavailable, a file-system administration API (e.g., usage querying, quota setting) would still be a valuable addition. A mounting sketch follows the Storage section.
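For clarity, here is a minimal worked sketch of how the IOPS and throughput targets above scale with capacity; the 160 TiB capacity used in the example is purely hypothetical.

    def required_iops(capacity_tib: float) -> int:
        # IOPS target: min{22,000 * capacity (TiB), 3,200,000}
        return int(min(22_000 * capacity_tib, 3_200_000))

    def required_throughput_mbps(capacity_tib: float) -> int:
        # Throughput target: min{300 * capacity (TiB), 20,000} MBps
        return int(min(300 * capacity_tib, 20_000))

    capacity = 160  # TiB, hypothetical example size
    print(required_iops(capacity), required_throughput_mbps(capacity))
    # 160 TiB -> IOPS >= 3,200,000 (capped), throughput >= 20,000 MBps (capped)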
HDD cold storage
At least 300TB
Connected to the SSD shared storage and GPU nodes via a 200Gbps (preferred) storage network.
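To illustrate the preferred CSI integration noted under SSD shared storage, below is a minimal sketch using the Kubernetes Python client; the StorageClass name, namespace, and requested size are hypothetical and depend on the supplier's CSI driver.

    # Hypothetical sketch: claim shared storage through a CSI-backed StorageClass.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    core = client.CoreV1Api()

    pvc_body = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "shared-fs"},
        "spec": {
            "accessModes": ["ReadWriteMany"],      # shared across CPU/GPU nodes
            "storageClassName": "shared-fs-csi",   # hypothetical CSI StorageClass name
            "resources": {"requests": {"storage": "10Ti"}},
        },
    }
    core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc_body)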
Network
Public network: at least 2Gbps of shared bandwidth, which can be dynamically allocated between CPU and GPU nodes by VPC.
Storage network: At least 200Gbps
Computing network: 8 * 400Gbps InfiniBand (not RoCE)
Public IPs: two for now, with the option to add more later. All ports are open by default.
Ethernet interconnection speed between different nodes within the cluster (25 Gbps or above preferred):
- CPU and CPU: minimum 10Gbps
- GPU and GPU: minimum 10Gbps
- CPU and GPU: minimum 5Gbps
- CPU and storage: minimum 10Gbps
Delivery and Testing
- All resources should be available in October (preferred) or November.
- Functional testing requires only 2 GPU nodes plus storage, CPU, and network, for about 3 days.
- Full-cluster testing can be done at the beginning of official delivery, for about 3 days.
- The few most competitive suppliers will be invited to the scheduled functional test.
Business
Region: Any region; the western United States is preferred when other conditions are comparable. All resources must be connected to the same data center.
Scalability: GPU, CPU, storage, and network bandwidth must all be expandable on demand in the future. (The estimated lead time for expansion must be stated clearly, e.g., a few hours after the request, or 2 days/weeks.)
Signing and payment
- Price is the most important factor.
- Singapore/United States/Cayman entity; payment in US dollars.
- A one-year contract is acceptable. If the penalty for mid-term termination is reasonable, a two- or three-year contract can also be considered.
- Monthly prepayment is expected, along with the ability to schedule mid-term termination on a monthly basis without penalty.
- A reliable SLA and clear fault compensation terms are expected.
- Tiered pricing is welcome, e.g., a GPU monthly price schedule (a > b > c > d > e):
  Month to month: a K
  Over 3 months: b K
  Over 6 months: c K
  Over 9 months: d K
  Over 12 months: e K
  If we use the resources continuously over a long period, you may return vouchers every quarter to make up the price difference retroactively from the beginning (a sketch of this calculation follows).
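For illustration only, a minimal sketch of the quarterly voucher mechanism under the tiered pricing above; the monthly rates are placeholders, not quotes, and the sketch assumes each month is prepaid at the month-to-month rate a.

    # Hypothetical sketch: quarterly vouchers re-price all months since the start
    # at the tier rate earned by cumulative usage. Rates are placeholders (a > b > c > d > e).
    RATE = {"a": 100, "b": 95, "c": 90, "d": 85, "e": 80}  # K USD per month, placeholders

    def earned_rate(months: int) -> int:
        # Tier rate earned after `months` months of continuous use.
        if months > 12: return RATE["e"]
        if months > 9:  return RATE["d"]
        if months > 6:  return RATE["c"]
        if months > 3:  return RATE["b"]
        return RATE["a"]

    def vouchers_by_quarter(quarters: int) -> list[int]:
        # Each month is prepaid at rate a; at each quarter end, a voucher returns the
        # difference so every month since the start is aligned to the earned tier rate.
        vouchers, already_returned = [], 0
        for q in range(1, quarters + 1):
            months = 3 * q
            overpaid = (RATE["a"] - earned_rate(months)) * months
            vouchers.append(overpaid - already_returned)
            already_returned = overpaid
        return vouchers

    print(vouchers_by_quarter(4))  # [0, 30, 60, 90] with the placeholder rates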
Bidding Documents
(i) Deadline for submission of bid documents: 9:00 am, September 30, 2024 (Singapore time). Bid documents received after the deadline or that do not meet the requirements will not be accepted.
(ii) Due to the large number of potential bidders, we cannot reply to each bid document one by one. Thank you for your understanding.
(iii) Please strictly follow the requirements and send complete materials to tender@hpcaitech.com
Please check the completeness of the following content before sending your bid documents:
- Introduction to the signing entity and relevant business experience.
- Cluster location, detailed configuration of the requested items, performance benchmarks (especially storage), cluster delivery time, and scalability forecast.
- Quotations broken down by requested item, SLA, and fault compensation terms.
- Contract term, payment terms, how tax is calculated when contracting with a Singapore entity, and termination terms (multiple options may be provided for reference).
HPC AI TECHNOLOGY PTE. LTD.
September 26, 2024