Data Science: Understanding Dataset and Infrastructure Scale
When your dataset has a billion rows or your API handles a million requests, what does that actually mean? Visualize data infrastructure numbers.
Data Scale Is the New Literacy
In 2026, data professionals work with numbers that would have seemed absurd a decade ago. A "small" dataset might have a million rows. A production database might hold billions of records. An API might handle tens of thousands of requests per second. But what do these numbers actually mean in terms of resources, time, and cost?
Row Counts in Context
- 10,000 rows: Fits in a spreadsheet. You could scroll through it in an afternoon. Processing takes milliseconds.
- 1 million rows: A CSV file might be 100-500 MB. Processing takes seconds to minutes. Still fits on a laptop.
- 100 million rows: You need a proper database. A full table scan might take minutes. This is where indexing strategy starts to matter seriously.
- 1 billion rows: You're in distributed systems territory. A single machine probably can't handle this efficiently. Query optimization is critical. Storage runs into terabytes.
- 100 billion rows: This is Google/Meta/Amazon scale. You need specialized infrastructure (BigQuery, Redshift, Snowflake clusters). Queries that scan the full dataset can take hours and cost hundreds of dollars.
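The jumps above are easy to sanity-check with a back-of-envelope calculation. This sketch assumes a hypothetical average row width of 200 bytes (uncompressed); real tables vary widely, so treat the constant as a placeholder:

```python
# Back-of-envelope: uncompressed storage footprint at various row counts.
# AVG_ROW_BYTES is an assumption, not a measurement -- adjust for your schema.
AVG_ROW_BYTES = 200

def table_size_gb(rows: int, row_bytes: int = AVG_ROW_BYTES) -> float:
    """Estimated uncompressed table size in GB (decimal, 10^9 bytes)."""
    return rows * row_bytes / 1e9

for rows in (10_000, 1_000_000, 100_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows ≈ {table_size_gb(rows):,.2f} GB")
```

At 200 bytes per row, a billion rows is already ~200 GB before indexes, replicas, or backups, which is why the list above flips from "laptop" to "distributed systems" at that point.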
API Request Scale
When someone says "we handle 10,000 requests per second," here's what that looks like over time:
- Per minute: 600,000 requests
- Per hour: 36 million requests
- Per day: 864 million requests
- Per month: ~26 billion requests
At 10K req/s, if each request generates 1 KB of log data, you're producing 10 MB of logs per second, 864 GB per day, and about 25 TB per month. That's just logs. The actual data processed per request is typically much larger.
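The multiplication behind those figures is worth making explicit, since the per-month total is where intuition usually fails. A minimal sketch, using the article's 10K req/s rate and 1 KB-per-request log assumption:

```python
# Totals implied by a sustained request rate, plus log volume at 1 KB/request.
RPS = 10_000       # sustained requests per second
LOG_BYTES = 1_000  # assumed log payload per request (1 KB)

SECONDS = {"minute": 60, "hour": 3_600, "day": 86_400, "month (30d)": 2_592_000}

for period, secs in SECONDS.items():
    requests = RPS * secs
    log_tb = requests * LOG_BYTES / 1e12
    print(f"per {period:<11}: {requests:>14,} requests, {log_tb:8,.2f} TB of logs")
```

Note that a "month" here is a flat 30 days; the point is the order of magnitude, not calendar precision.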
Storage Numbers
- 1 GB: a couple hundred high-res photos, or about 250 MP3 songs
- 1 TB: roughly 500 hours of HD video, or about 6.5 million document pages
- 1 PB (petabyte, 10^15 bytes): about 500 billion pages of text, or roughly 500,000 hours (~57 years) of continuous HD video at the 1-TB rate above
- 1 EB (exabyte, 10^18 bytes): one common estimate puts all words ever spoken by humans, transcribed to text, at about 5 EB
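A small helper makes these unit jumps concrete. This sketch uses the decimal (powers-of-1000) units the list above uses, not the binary KiB/MiB convention:

```python
# Convert a raw byte count into the decimal units (KB, MB, ..., EB) used above.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def human_bytes(n: float) -> str:
    """Format a byte count with the largest decimal unit that keeps n < 1000."""
    for unit in UNITS:
        if n < 1000 or unit == UNITS[-1]:
            return f"{n:,.1f} {unit}"
        n /= 1000

print(human_bytes(1e12))  # a terabyte
print(human_bytes(1e15))  # a petabyte
```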
Cost Implications
At major cloud providers, storage costs roughly $0.02-0.03 per GB per month. Seems cheap until you scale:
- 1 TB: ~$20-30/month
- 1 PB: ~$20,000-30,000/month
- 1 EB: ~$20-30 million/month
Processing costs are even more dramatic. A BigQuery scan of 1 PB costs about $5,000. If you run that query once a day, you're spending $150,000/month on a single query.
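The cost arithmetic above can be packaged into a two-line estimator. The prices here are the article's rough figures (~$0.023/GB-month storage, ~$5 per TB scanned for on-demand queries), which are assumptions that drift over time, so check current price sheets before budgeting:

```python
# Rough monthly cost sketch using assumed list prices.
STORAGE_PER_GB_MONTH = 0.023  # assumed $/GB-month for standard object storage
SCAN_PER_TB = 5.0             # assumed $/TB scanned, on-demand query pricing

def monthly_storage_cost(tb_stored: float) -> float:
    return tb_stored * 1_000 * STORAGE_PER_GB_MONTH

def monthly_query_cost(tb_scanned: float, runs_per_day: int) -> float:
    return tb_scanned * SCAN_PER_TB * runs_per_day * 30

print(monthly_storage_cost(1_000))   # storing 1 PB
print(monthly_query_cost(1_000, 1))  # one daily full-scan of 1 PB
```

Storing a petabyte costs on the order of $23K/month; scanning it daily costs over six times that, which is why partitioning and clustering (scanning less) usually beats storage optimization.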
Why Visualization Matters for Data Teams
When a PM asks "can we just add this field to every record?" and your table has 2 billion rows, the answer depends on understanding what 2 billion actually means for storage, processing time, and cost. Use the How Big? tool to show stakeholders the difference between "a million rows" (manageable) and "a billion rows" (infrastructure project).
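The "just add a field" question can be answered with the same back-of-envelope style. This sketch assumes a hypothetical 8-byte column and an illustrative backfill rate of 50,000 rows/second; both numbers are placeholders for your own measurements:

```python
# What "add this field to every record" means at 2 billion rows.
ROWS = 2_000_000_000
FIELD_BYTES = 8                 # assumed: a 64-bit value, uncompressed
BACKFILL_ROWS_PER_SEC = 50_000  # hypothetical single-writer update rate

extra_gb = ROWS * FIELD_BYTES / 1e9
backfill_hours = ROWS / BACKFILL_ROWS_PER_SEC / 3_600

print(f"extra storage: ~{extra_gb:.0f} GB (before indexes and replicas)")
print(f"naive backfill: ~{backfill_hours:.1f} hours of sustained writes")
```

Even in this optimistic sketch the backfill runs for half a day, and that is before replication lag, index maintenance, and lock contention enter the picture.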
Step-by-step guide
1. Enter your dataset size into the How Big? tool to understand its scale
2. Compare your dataset to known reference points (e.g., 1M rows vs 1B rows)
3. Estimate processing time by understanding the magnitude difference between dataset sizes
4. Use time and physical comparisons to communicate data scale to non-technical stakeholders
5. Plan infrastructure capacity by visualizing growth trajectories