How to Rack 30 Petabytes of Storage

title: How to Rack 30 Petabytes of Storage

author: Standard Intelligence Team

content_type: article

publication: Blog

published: 2025-09-30T00:00:00

source_url: https://si.inc/posts/the-heap/

word_count: 3432

We built a storage cluster in downtown SF to store 90 million hours’ worth of video data. Why? We’re pretraining models to solve computer use. Text LLMs like Llama 3.1 405B require ~60 TB of text data to train, but video is so much larger that we need 500 times more storage. Instead of paying the $12 million per year it would cost to store all of this on AWS, we rented space from a colocation center in San Francisco and brought that cost down ~40x to $354k per year, including depreciation.

Our use case for data is unique. Most cloud providers care a great deal about redundancy, availability, and data integrity, which tend to be unnecessary for ML training data. Since pretraining data is a commodity (we can lose any individual 5% with minimal impact), we can tolerate far more data corruption than enterprises that need guarantees that their user data isn’t going anywhere. In other words, we don’t need AWS’s 11 nines of durability; two are more than enough.

Additionally, storage tends to be priced substantially above cost. Most companies use relatively small amounts of storage (even ones like Discord still use under a petabyte for messages), and the companies that use petabytes are so large that storage remains a tiny fraction of their total compute spend.

Data is one of our biggest constraints, and at cloud prices it would be prohibitively expensive. As long as the cost predictions favored a local datacenter and the project wouldn’t consume too much of the core team’s time, it made sense to stack hard drives ourselves.

1. We talked to some engineers at the Internet Archive, which had basically the same problem as us; even after massive friends & family discounts on AWS, it was still 10 times more cost-effective to buy racks and store the data themselves!

The Cost Breakdown: Cloud Alternatives vs In-House

Internet and electricity, which total $17.5k per month, are our only recurring expenses (the price of colocation space, cooling, etc. is bundled into the electricity cost). One-time costs were dominated by hard drive capex.

2. When choosing a datacenter location we had multiple options across the Bay Area, including one in Fremont through Hurricane Electric for around $10k in setup fees and $12.8k per month, which would have saved us $28.5k upfront and $4.7k per month. We ended up opting for a datacenter only a couple of blocks from our office in SF. Though this came at a premium, it was extremely helpful for setting up the initial nodes and for ongoing maintenance. Our team is just a few people, so any friction in getting to the datacenter would come at a noticeable cost to team productivity.

Table 1: Cost comparison of cloud alternatives vs in-house. AWS is $1,130,000/month including estimated egress, Cloudflare is $270,000/month (with bulk-discounted pricing), and our datacenter is $29,500/month (including recurring costs and depreciation).

Monthly Recurring Costs

Item | Cost | Notes
Internet | $7,500/month | 100 Gbps DIA from Zayo, 1-year term
Electricity | $10,000/month | 1 kW/PB at $330/kW per month; includes cabinet space and cooling; 1-year term
Total monthly | $17,500/month |

One-Time Costs

Category | Item | Cost | Details
Storage | Hard drives (HDDs) | $300,000 | 2,400 drives, mostly used 12 TB enterprise drives (3/4 SATA, 1/4 SAS); the DS4246 JBODs work with either
Storage infrastructure | NetApp DS4246 chassis | $35,000 | 100 dual SATA/SAS chassis, 4U each
Compute | CPU head nodes | $6,000 | 10 Intel RR2000s from eBay
Datacenter setup | Install fee | $38,500 | One-off datacenter install fee
Labor | Contractors | $27,000 | Contractors to help physically install racks, screw in drives, and wire cables
Networking & misc. | Install expenses | $20,000 | Power cables, 100GbE QSFP CX4 NICs, Arista router, copper jumpers, one-time internet install fee
Total one-time | | $426,500 |

Our price assuming three-year depreciation (including for the one-off install fees) is $17.5k/month in fixed monthly costs (internet, power, etc.) and $12k/month in depreciation, for $29.5k/month overall.
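As a sanity check on these totals, here is a minimal Python sketch that recomputes the in-house monthly figure from the two tables, assuming straight-line depreciation of every one-time cost over 36 months. The dictionary keys are illustrative labels, not names from our tooling. Unrounded, depreciation comes to about $11.8k/month, for a total of about $29.3k/month; rounding depreciation up to $12k gives the $29.5k/month and $354k/year quoted above.

```python
# Recompute the in-house monthly cost from the recurring and one-time tables.
# Assumption: straight-line depreciation of all one-time costs over 36 months.

ONE_TIME_COSTS = {
    "hard_drives": 300_000,
    "ds4246_chassis": 35_000,
    "head_nodes": 6_000,
    "datacenter_install": 38_500,
    "contractors": 27_000,
    "networking_misc": 20_000,
}
MONTHLY_RECURRING = {"internet": 7_500, "electricity": 10_000}
DEPRECIATION_MONTHS = 36

# Electricity line check: 30 PB * 1 kW/PB * $330 per kW-month = $9,900 (tabled as ~$10k).
ELECTRICITY_ESTIMATE = 30 * 1 * 330


def monthly_cost(one_time: dict, recurring: dict, months: int) -> float:
    """Recurring spend plus capex spread evenly over `months`."""
    return sum(recurring.values()) + sum(one_time.values()) / months


capex = sum(ONE_TIME_COSTS.values())
monthly = monthly_cost(ONE_TIME_COSTS, MONTHLY_RECURRING, DEPRECIATION_MONTHS)
print(f"one-time: ${capex:,}")            # one-time: $426,500
print(f"monthly:  ${monthly:,.0f}")       # monthly:  $29,347
print(f"yearly:   ${monthly * 12:,.0f}")  # yearly:   $352,167
```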

We compare our costs to two main providers: AWS’s public pricing as a baseline, and Cloudflare’s discounted pricing for 30 PB of storage. It’s important to note that AWS egress costs would be substantially lower if we used AWS GPUs. This isn’t reflected on our graph because AWS GPUs are priced substantially above market rates and large clusters are difficult to obtain, which makes them untenable at our compute scale.

Here are the pricing breakdowns:

AWS Pricing Breakdown

Cost Component | Rate | Monthly Cost | Notes
Storage | $0.021/GB/month | $630,000 | For data over 500 TB
Egress | $0.05/GB | $500,000 | Entire dataset egressed quarterly (10 PB/month)
Total AWS monthly | | $1,130,000 |

Cloudflare R2 Pricing

Pricing Tier | Rate | Monthly Cost | Notes
Published rate | $0.015/GB/month | $450,000 | No egress fees
Estimated private pricing | $0.009/GB/month | $270,000 | Estimated rate for >20 PB scale

3. Cloudflare has a more reasonable estimate for the 30 PB, placing it at an overall monthly cost of $270k with no egress fees. We also have bulk-discounted pricing estimates from pricing quotes; this was our main point of comparison for the datacenter.

That works out to about $38/TB/month for AWS, $10/TB/month for Cloudflare, and $1/TB/month for our datacenter: roughly 38x and 10x cheaper, respectively. (At the very cheapest end of the spectrum, Backblaze has a $6/TB product that is unsuitable for model training because of egress speed limitations; their $15/TB Overdrive AI-specific storage product is closer to Cloudflare’s in price and performance.)
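The per-TB figures follow directly from the rate tables. The sketch below recomputes them, assuming decimal units (30 PB = 30,000,000 GB), 10 PB/month of AWS egress, and our $29,500/month in-house total. The estimated R2 rate works out to exactly $9/TB/month, which the text rounds to $10.

```python
# Per-TB monthly cost from the pricing tables.
# Assumptions: decimal units (30 PB = 30,000,000 GB = 30,000 TB),
# AWS egress of 10 PB/month (whole dataset once per quarter),
# and our in-house total of $29,500/month.

DATASET_GB = 30_000_000
DATASET_TB = DATASET_GB / 1_000


def aws_monthly(storage_rate: float = 0.021, egress_rate: float = 0.05,
                egress_gb: int = 10_000_000) -> float:
    """Storage at the over-500 TB rate plus per-GB egress."""
    return DATASET_GB * storage_rate + egress_gb * egress_rate


def r2_monthly(rate: float) -> float:
    """R2 has no egress fees, so the bill is storage only."""
    return DATASET_GB * rate


monthly_by_provider = {
    "AWS": aws_monthly(),
    "R2 (published)": r2_monthly(0.015),
    "R2 (estimated private)": r2_monthly(0.009),
    "In-house": 29_500.0,
}
per_tb = {name: cost / DATASET_TB for name, cost in monthly_by_provider.items()}

for name, cost in monthly_by_provider.items():
    ratio = cost / monthly_by_provider["In-house"]
    print(f"{name:<24} ${cost:>12,.0f}/mo  ${per_tb[name]:6.2f}/TB/mo  {ratio:5.1f}x in-house")
```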

While we use Cloudflare as a comparison point, we’ve sometimes put more load on their R2 servers than they could handle. During past large model training runs we generated enough load that they rate-limited us, later confirming that we were saturating their metadata layer and that the rate limit wasn’t synthetic. Because our metadata on the heap is so simple, and we have a 100 Gbps DIA connection, we haven’t run into any issues there.

4. We love Cloudflare and use many of their products often; we include this anecdote as a fact about our scale being difficult to handle, not as a dig!

This setup was and is necessary for our video data pipelines, and we’re extremely happy that we made this investment. By gathering large scale data at low costs, we can be competitive with frontier labs with billions of dollars in capital.

Setup/The Process

We cared a lot about getting this built *fast*, because this kind of project can easily stretch on for months if you’re not careful. Hence Storage Stacking Saturday, or S3: we threw a hard drive stacking party in downtown SF, got our friends to come, and offered food and custom-engraved hard drives to everyone who helped. Stacking started at 6am and continued for 36 hours (with a break to sleep), and by the end we had 30 PB of functioning hardware racked and wired up. Later in the event we brought in contractors for additional help and professional installation.

People at the hard drive stacking party!

Cool shots of the servers

The storage software landscape offers many options, but every one of them comes with drawbacks. People experienced with Ceph strongly warned us to avoid it unless we were willing to hire dedicated Ceph specialists, and our research confirmed this advice. Ceph appears far more complex than most use cases justify; it’s only worthwhile for companies that absolutely need maximum performance and customizability and are prepared to invest heavily in tuning. MinIO is an interesting option if S3 compatibility is essential, but otherwise it’s a bit too fancy for us and similar use cases. Weka and Vast are absurdly expensive, at roughly $2k/TB/year, and are designed primarily for NVMe drives, not spinning disks.

Post-Mortem

Building the datacenter was a large endeavor and we definitely learned lessons, both good and bad.

Things That We Got Correct

We think the redundancy & capability tradeoffs we made are very reasonable at our disk speeds. We’re able to approximately saturate our 100G network for both read & write.

Doing this locally a couple blocks away was well worth it because of the amount of debugging and manual work needed.

eBay is good for finding vendors but bad for actually buying things. Once you’ve found vendors there, they can often supply all the parts you need directly and provide warranties, which are extremely valuable.

A dedicated 100G internet connection is pretty important, and issues are much, much easier to debug than with cloud products.

Having high-quality cable management during the racking process saved us a ton of time debugging in the long run; making it easy to switch up the networking saved us a lot of headache.

We had a very strong simplicity prior, and this saved an immense amount of effort. We’re quite happy that we didn’t use Ceph or MinIO; unlike, e.g., nginx, they don’t work out of the box. Instead, we wrote a simple Rust script and roughly saturated our 100 Gbps network for both reads and writes without any fancy code.

We were basically right about the price and advantages this offered, and did not substantially overestimate the amount of time and effort it would take. While our list of improvements is longer than this one, most of those items are minor; fundamentally, we built a cluster rivaling massive clouds for 40x cheaper.
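We haven’t described the internals of our Rust script here, so the following is only a hypothetical illustration of how a storage layer can get away with almost no metadata: if every object’s location is a pure function of its key, no central index is needed. This Python sketch uses rendezvous (highest-random-weight) hashing over a made-up list of drive mount points (`/mnt/d0000` and so on); the mount layout and function names are assumptions, not our actual code.

```python
import hashlib


def _score(key: str, drive: str) -> int:
    """Deterministic 64-bit pseudo-random weight for a (drive, key) pair."""
    digest = hashlib.blake2b(f"{drive}\0{key}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def place(key: str, drives: list) -> str:
    """Rendezvous hashing: the drive with the highest score owns the key."""
    if not drives:
        raise ValueError("no drives available")
    return max(drives, key=lambda drive: _score(key, drive))


def object_path(key: str, drives: list) -> str:
    """Full on-disk path; computed from the key alone, no lookup table consulted."""
    return f"{place(key, drives)}/{key}"


# Hypothetical layout: one mount point per drive, 2,400 drives total.
DRIVES = [f"/mnt/d{i:04d}" for i in range(2400)]

print(object_path("videos/000123.mkv", DRIVES))
```

A useful property of rendezvous hashing is that adding a drive only moves the keys that now score highest on the new drive; every other key stays put. Scoring all 2,400 drives per lookup is fine for a sketch, but a real system might pick a node first and then a drive within it.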

Difficult Bits

A map of reality only gets you so far. While setting up the datacenter we ran into a few problems and unexpected challenges:

We used front-loaders instead of top-loaders for our server racks, which meant screwing in every single drive individually: tedious for 2,400 HDDs.

Our storage wasn’t dense; with a denser array of hard drives we could have cut the physical placement and screwing work by 5x.

Shortcuts like daisy-chaining are usually a bad idea. We could have gotten substantially higher read/write speeds by not daisy-chaining chassis and instead giving each chassis its own HBA (Host Bus Adapter, not a significant cost).

Compatibility is key. In networking specifically, practically everything is locked to a specific brand, and we had many pain points here: fiber transceivers will almost never work unless they’re the right brand, but copper cables are much more forgiving.

FS.com is pretty good and well priced (though their speed estimates were pretty inconsistent); Amazon will also often have the parts you need quickly.

Networking came at substantial cost and required experimentation. In general, with our relatively non-sensitive training data, we optimized for convenience and ease of use over all else: we didn’t use DHCP because our used enterprise switches didn’t support it out of the box, and we didn’t use NAT because we wanted public IPs on the nodes for convenient, performant access from our servers. (We firewalled off unused ports and had basic security with nginx secure_link; we couldn’t do this if we were handling customer data, but it was fine for our use case.) While this is an area where a cloud solution would have saved us time, we had our networking up within days and the kinks ironed out within ~3 weeks.

We were often bottlenecked by a lack of easy monitor/keyboard access to servers; having idle crash carts on hand during setup helps.
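For readers unfamiliar with the nginx `secure_link` module mentioned above: nginx recomputes an MD5 over an expression you configure and compares it to a token in the URL, so only holders of a shared secret can mint valid links. The Python sketch below generates such links, assuming a hypothetical location configured with `secure_link $arg_md5,$arg_expires;` and `secure_link_md5 "$secure_link_expires$uri <secret>";`. The host, path, and secret are placeholders, and our actual configuration may differ.

```python
import base64
import hashlib
import time
from typing import Optional


def secure_link(base_url: str, uri: str, secret: str, ttl_s: int,
                now: Optional[int] = None) -> str:
    """Build a URL for an nginx location protected by secure_link.

    Assumed nginx config (hypothetical, adjust to match yours):
        secure_link $arg_md5,$arg_expires;
        secure_link_md5 "$secure_link_expires$uri <secret>";
    """
    expires = (int(time.time()) if now is None else now) + ttl_s
    # Same string nginx builds from the assumed secure_link_md5 expression.
    payload = f"{expires}{uri} {secret}".encode()
    # nginx expects the binary MD5 digest, base64url-encoded with '=' padding stripped.
    token = base64.urlsafe_b64encode(hashlib.md5(payload).digest()).rstrip(b"=").decode()
    return f"{base_url}{uri}?md5={token}&expires={expires}"


# Placeholder host (documentation IP range), path, and secret.
print(secure_link("http://203.0.113.10", "/heap/shard-000123.tar", "change-me", ttl_s=3600))
```

On the nginx side, the module only sets `$secure_link` (empty for a bad token, `0` for an expired one); returning 403 or 410 in those cases is up to the location’s own configuration.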

Ideas Worth Trying

Working KVMs are extremely useful, and you shouldn’t go without them or good IPMI. Physically going to a datacenter is really inconvenient, even if it’s a block away. IPMI is good, but only if you have pretty consistent machines.

Think through your management Ethernet network as carefully as your main data network: it’s really nice to be able to SSH into servers while configuring the network, and IPMI is great!

Overprovision your network: if feasible, it’s worth having 400 Gbps internally (you can use 100G cards etc. for this!).

We could have substantially increased density, at additional upfront cost, by buying 90-drive Supermicro SuperServers and putting 20 TB drives into them. This would have let us use 2 racks instead of 10, given us roughly the equivalent of 20 AMD EPYC 9654s in total CPU capacity, and used less total power.
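The rack counts above are straightforward to reproduce. This sketch assumes 42U of usable space per rack, 4U for both a DS4246 shelf and a 90-bay Supermicro chassis, 2U per head node, and decimal terabytes; it ignores space for switches and PDUs. Under those assumptions it reproduces the 10-rack and 2-rack figures.

```python
import math

RACK_U = 42  # assumed usable rack units per rack


def racks_needed(units_used: int, rack_u: int = RACK_U) -> int:
    """Whole racks needed to hold `units_used` rack units."""
    return math.ceil(units_used / rack_u)


# Current build: 100 DS4246 shelves (4U, 24 bays each) plus 10 head nodes (assumed 2U).
current_drives = 100 * 24                         # 2,400 bays
current_racks = racks_needed(100 * 4 + 10 * 2)    # 420U -> 10 racks

# Alternative: 30 PB of 20 TB drives in 90-bay chassis (assumed 4U, compute included).
dense_drives = math.ceil(30_000 / 20)             # 30,000 TB / 20 TB per drive -> 1,500
dense_chassis = math.ceil(dense_drives / 90)      # -> 17 chassis
dense_racks = racks_needed(dense_chassis * 4)     # 68U -> 2 racks

print(f"current: {current_drives} drives, {current_racks} racks")
print(f"dense:   {dense_drives} drives in {dense_chassis} chassis, {dense_racks} racks")
```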
