OSCER users,
Yesterday, OU was awarded a National Science Foundation
Campus Cyberinfrastructure Compute grant of $400,000 for
Graphics Processing Unit (GPU) cards for Machine Learning (ML)
calculations:
https://nsf.gov/awardsearch/showAward?AWD_ID=2201561&HistoricalAwards=false
(The GPU cards can also be used for traditional floating point
number crunching.)
Here's how we expect this project to play out:
SUMMARY:
(1) Near term (through mid-CY2023)
(1a) GPU-based Machine Learning computing starting very soon
(1b) GPU Debug Servers
(2) Middle term (late CY2023 and on): next generation GPUs (H100)
---
(1) Near term (through mid-CY2023)
(1a) GPU-based Machine Learning computing starting very soon
You'll be able to start doing GPU-based Machine Learning (ML)
and traditional floating point calculations very soon.
Specifically, we already have 2 OSCER-owned general-use
servers that each have dual NVIDIA A100 GPU cards in them.
Currently, both GPU servers are being used for a few specific
projects, but soon, at least one of those servers (with
dual A100 GPU cards) will become available for general use
by all OSCER users.
We're waiting for delivery of several more such servers,
hopefully by late July, at which point we'll have:
* 4 servers, each with dual A100 GPUs, each GPU card with
80 GB internal card memory, in an "NVlink" configuration;
* 3 servers, each with dual A100 GPUs, each GPU card with
40 GB internal card memory, in a non-NVlink configuration.
Note that "NVlink" means that the pair of GPU cards in
the server are connected to each other at
600 Gigabytes per second.
This is helpful when you want to gang the pair of GPU cards
together, to get more aggregate RAM for a calculation.
So a pair of 80 GB A100 GPU cards connected via NVlink will
give you up to 160 GB of internal card memory to use in
a single run, with very fast data movement between
that server's pair of GPU cards (600 GB/sec).
(We'll also have quite a few "condominium" A100 GPU cards
that are owned by specific faculty, so those are available
only to those research teams, not to all OSCER users.)
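To make the arithmetic above concrete, here is a minimal Python sketch that tallies the planned general-use A100 servers. The server counts and memory sizes come from the list above; the function names are made up for illustration, and the rule that only NVlink pairs get counted as a single memory pool is a simplifying assumption, not OSCER policy.

```python
# Planned general-use A100 servers, taken from the list above.
# Each entry: (number of servers, GPUs per server, GB per GPU, has NVlink)
PLANNED_SERVERS = [
    (4, 2, 80, True),   # dual 80 GB A100, NVlink
    (3, 2, 40, False),  # dual 40 GB A100, non-NVlink
]


def gpu_count(servers):
    """Total number of GPU cards across all listed servers."""
    return sum(n * gpus for n, gpus, _, _ in servers)


def total_memory_gb(servers):
    """Total GPU card memory (GB) across all listed servers."""
    return sum(n * gpus * gb for n, gpus, gb, _ in servers)


def max_single_run_memory_gb(servers):
    """Largest GPU memory pool (GB) for one run on one server.

    Assumption (hypothetical rule for this sketch): a pair of cards is
    counted as one pool only when connected via NVlink; otherwise only
    a single card's memory is counted.
    """
    return max(gpus * gb if nvlink else gb for _, gpus, gb, nvlink in servers)


print(gpu_count(PLANNED_SERVERS))                 # 14
print(total_memory_gb(PLANNED_SERVERS))           # 880
print(max_single_run_memory_gb(PLANNED_SERVERS))  # 160
```

The last number matches the 160 GB figure above: two 80 GB cards ganged together over NVlink.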
---
(1b) GPU Debug Servers
Also arriving in the next few months will be 2 GPU debug
servers.
These servers will be available for debugging and testing
jobs of up to 30 minutes per job.
(If you need longer, you'll have to do that on the A100 GPUs.)
One of the GPU debug servers will have a single A30 GPU card
with 24 GB of internal card memory.
The other GPU debug server will have dual A10 GPU cards,
each with 24 GB of internal card memory.
A30 is 53% as fast as A100, on both (i) ML calculations
(TensorFlow/Keras/PyTorch) and
(ii) traditional floating point number crunching.
A10 is 40% as fast as A100 on ML calculations, and
10% as fast as A100 on traditional floating point number
crunching.
So the GPU debug servers won't be useful for performance
benchmarking, but they'll be great for debugging and testing.
Why did we pick slow cards? Because they're much cheaper
than A100 GPU cards, and we didn't want to spend a huge amount
of our hardware budget on GPU debug servers.
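As a rough illustration of what those percentages mean in practice, here is a small Python sketch that estimates how long a test that takes a given time on an A100 would take on a debug card, and whether it would fit in the 30 minute limit. The relative speeds and the limit are the figures quoted above; the function names are hypothetical, and the sketch assumes runtime scales inversely with the quoted speed, which real jobs will only approximate.

```python
# Speed relative to an A100 (A100 = 1.0), as quoted above.
# "ml" = ML calculations, "fp" = traditional floating point number crunching.
RELATIVE_SPEED = {
    ("A30", "ml"): 0.53,
    ("A30", "fp"): 0.53,
    ("A10", "ml"): 0.40,
    ("A10", "fp"): 0.10,
}

DEBUG_LIMIT_MINUTES = 30  # per-job limit on the GPU debug servers


def estimated_debug_minutes(a100_minutes, card, workload):
    """Estimate runtime on a debug card from a known A100 runtime.

    Assumption: runtime is inversely proportional to relative speed.
    """
    return a100_minutes / RELATIVE_SPEED[(card, workload)]


def fits_debug_limit(a100_minutes, card, workload):
    """True if the estimated runtime is within the debug job limit."""
    return estimated_debug_minutes(a100_minutes, card, workload) <= DEBUG_LIMIT_MINUTES


# A test that takes 10 minutes on an A100:
print(fits_debug_limit(10, "A30", "ml"))  # True  (about 19 minutes)
print(fits_debug_limit(10, "A10", "ml"))  # True  (about 25 minutes)
print(fits_debug_limit(10, "A10", "fp"))  # False (about 100 minutes)
```

So a short ML test fits comfortably on either debug server, but floating point tests on the A10 cards need to be kept quite small.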
---
(2) Middle term (late CY2023 and on): next generation GPUs (H100)
Assuming no delays in releasing the new generation of
NVIDIA GPU card ("H100") or the new generation of
Intel and/or AMD CPUs/servers:
We expect to be able to submit a purchase order for
servers with dual H100 GPU cards in late 2022/early 2023.
Because of terrible supply chain delays, that probably
means that the H100 servers will be delivered in mid-to-late
Calendar Year 2023.
(But in the meantime you'll be able to use the A100 GPU cards,
which have pretty much the same features as the H100 GPUs,
including that many of the A100 GPUs have the same memory
size, 80 GB, as the initial release of H100 GPUs.)
We might decide to purchase only a modest number of
dual H100 servers initially, in case the early server options
don't include NVlink.
That is, we might buy some dual H100 servers without NVlink
initially, and then some dual H100 servers with NVlink later.
Note that H100 will be ~2.5 times (~250%) as fast as A100 on
both ML calculations and traditional floating point number
crunching.
---
References:
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
https://www.nvidia.com/content/dam/en-zz/Solutions/data-center/products/a30-gpu/pdf/a30-datasheet.pdf
https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a10/pdf/datasheet-new/nvidia-a10-datasheet.pdf
https://www.techpowerup.com/gpu-specs/a10-pcie.c3793
https://www.nvidia.com/en-us/data-center/h100/
---
Henry Neeman
Director, OU Supercomputing Center for Education & Research (OSCER)
Associate Professor, Gallogly College of Engineering
Adjunct Associate Professor, School of Computer Science
OU Information Technology
The University of Oklahoma
Engineering Lab 212, 200 Felgar St, Norman OK 73019
405-325-5386 (office), 405-325-5486 (fax), 405-245-3823 (cell)
http://www.oscer.ou.edu/