When Tests Become the Bottleneck
Scaling end-to-end testing without breaking development velocity
When end-to-end tests are first introduced to a product they are usually awesome: you start with a suite that completes within ten minutes and catches real bugs before they hit production. Then, as the product grows and more tests get added, that suite stretches to twenty minutes, then thirty, and before you realize it, the complete run takes two hours.
At that point, nobody wants to run it on every commit anymore. The once-awesome suite turns into a slow, pre-release gatekeeper instead of a daily safety net, and the whole point of catching issues early starts to fade away.
Michael Feathers, in Working Effectively with Legacy Code, says: “to a large extent, legacy code is simply code without tests.” That rings true, but notice that when we talk about “testing” we often mean quite different things.
Unit tests run fast because they test small pieces of code directly. End-to-end tests act more like real users: they wait for pages to load, fill out forms, click around on the site. By nature they take longer, and as the suite grows they can also get flakier — sometimes a click doesn’t take, sometimes the network lags just enough to break the test.
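To make that concrete, here is roughly what a single end-to-end step looks like, sketched with Playwright (the tool, URL, and selectors are assumptions for illustration; the point isn't tool-specific). Every line drives a real browser and waits on the network, and each wait is a chance for timing to go wrong:

```ts
// A made-up sign-in flow, sketched with Playwright. Each step waits on a real
// browser and a real network round trip, which is where both the slowness and
// the flakiness come from.
import { test, expect } from '@playwright/test';

test('user can sign in', async ({ page }) => {
  await page.goto('https://example.com/login');            // full page load
  await page.locator('#email').fill('user@example.com');   // waits for the field to be ready
  await page.locator('#password').fill('correct horse');   // another round trip to the browser
  await page.locator('button[type="submit"]').click();     // the click that sometimes "doesn't take"
  await expect(page).toHaveURL(/\/dashboard/);             // waits for the redirect, up to a timeout
});
```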
When the feedback loop stretches from minutes to hours, developer habits shift. People start batching bigger commits to avoid waiting. They skip running the suite locally because nobody wants to lose half a morning if they don’t absolutely have to. Testing ends up happening mostly in CI, and even then it feels like crossing your fingers and hoping nothing breaks in areas you didn’t touch.
What the big players do
If you're drowning in a two-hour test suite, you're not alone. The good news is that the larger players hit this wall years ago and have worked out practices for dealing with it.
Google’s approach is basically to throw hardware at the problem. They built a massive parallel execution grid that can spin up thousands of test runners at once, so the whole suite finishes before lunch. For the rest of us who don’t have Google’s data centers, this idea still scales down: you can parallelize your tests using cloud services like Sauce Labs or BrowserStack, or spin up your own fleet of test runners on Kubernetes. The principle stays the same: if one runner takes two hours, then twenty runners might get it done in six minutes.
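You can get a scaled-down version of this with almost no infrastructure work, because most modern end-to-end runners support parallelism and sharding out of the box. A minimal sketch, assuming Playwright is your runner (the worker count is arbitrary):

```ts
// playwright.config.ts — a minimal sketch, assuming Playwright as the runner.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  fullyParallel: true,                      // let tests within a single file run in parallel too
  workers: process.env.CI ? 4 : undefined,  // parallel workers per machine in CI
});
```

Each CI job then picks up one slice of the suite, e.g. `npx playwright test --shard=3/20`, and twenty such jobs bring the two-hour run down to a few minutes plus startup overhead.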
Microsoft's Azure DevOps took a different approach with Test Impact Analysis. Instead of running everything faster, they got smarter about what to run in the first place. Their tools look at the code changes in a commit and work out which tests are actually relevant. Change a CSS file? Maybe you can skip the backend API tests. Update a database migration? Better run the integration tests but skip the UI ones.
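You don't need Microsoft's tooling to borrow the idea. A rough sketch of the same principle, not Azure DevOps' actual implementation, is to read the changed files from git and translate them into the test groups worth running; the path-to-tag mapping below is entirely hypothetical:

```ts
// impacted-tests.ts — a rough sketch of test impact analysis, not Azure DevOps' tooling.
// Reads the files changed on this branch and maps them to the test tags worth running.
import { execSync } from 'node:child_process';

// Hypothetical mapping from source areas to tags used in test titles.
const impactMap: Record<string, string[]> = {
  'src/api/':    ['@api', '@integration'],
  'src/ui/':     ['@ui'],
  'migrations/': ['@integration'],
  'styles/':     ['@visual'],
};

const changedFiles = execSync('git diff --name-only origin/main...HEAD')
  .toString()
  .trim()
  .split('\n');

const tags = new Set<string>();
for (const file of changedFiles) {
  for (const [prefix, fileTags] of Object.entries(impactMap)) {
    if (file.startsWith(prefix)) fileTags.forEach((tag) => tags.add(tag));
  }
}

// Feed the result to your runner, e.g. npx playwright test --grep "@api|@ui".
// Fall back to the smoke tests when nothing matches, so a change never runs zero tests.
console.log([...tags].join('|') || '@smoke');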
The immediate wins here are pretty practical. First, parallelize what you can. Second, organize your tests by what they actually verify, because not every test needs to run every time. Your quick smoke tests can run on every push. The full regression suite might run nightly. Performance tests can run on weekends or before big releases.
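One lightweight way to wire up those tiers, again assuming Playwright and tags embedded in test titles (the tag names are an assumption), is to split them into projects so each CI trigger picks the right one:

```ts
// playwright.config.ts — a sketch of tiered runs; the tag names are hypothetical.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  projects: [
    { name: 'smoke',      grep: /@smoke/ },       // every push:  npx playwright test --project=smoke
    { name: 'regression', grep: /@regression/ },  // nightly job
    { name: 'perf',       grep: /@perf/ },        // weekends or pre-release
  ],
});
```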
The real low-hanging fruit is to stop running tests that don't add real value. It's surprisingly common to see suites of thousands of tests, half of which cover the same happy paths in slightly different ways. It's like having six smoke detectors in a studio flat: more doesn't always mean safer. Go through your suite, find the duplicates, and focus on the tests that actually catch real bugs, not the ones that just pad your coverage report.
Testing without the Google budget
Now, if you’re reading this from a startup where “massive parallel execution grid” sounds like something out of science fiction — don’t panic. You don’t need Google’s infrastructure to catch bugs before your users see them. What matters more is being smart about where you spend your testing effort.
Martin Fowler’s test pyramid helps here. Lots of unit tests at the bottom: they’re fast, reliable, and catch the obvious stuff. A smaller number of integration tests in the middle: these catch the “my component works, but doesn’t play nicely with others” problems. And just a handful of end-to-end tests at the top: the critical user journeys that simply cannot break. It’s a bit like home security: you don’t need cameras in every room, but you definitely want them watching the front door.
Visual regression testing is where small teams can really punch above their weight. Tools like BackstopJS — or even simple screenshot comparison scripts — can catch unexpected UI changes across your whole app in minutes.
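If you're already on Playwright, its built-in screenshot assertion gets you a long way before you reach for a dedicated tool; BackstopJS offers a similar baseline-and-compare workflow with its own config. A minimal sketch, with hypothetical routes and a `baseURL` assumed in the config:

```ts
// visual.spec.ts — a minimal sketch using Playwright's screenshot assertion.
import { test, expect } from '@playwright/test';

const routes = ['/', '/pricing', '/login'];  // hypothetical pages; assumes baseURL is configured

for (const route of routes) {
  const name = route === '/' ? 'home' : route.slice(1).replace(/\//g, '-');

  test(`no visual drift on ${route}`, async ({ page }) => {
    await page.goto(route);
    // Compares against a committed baseline image; fails when pixels drift past the threshold.
    await expect(page).toHaveScreenshot(`${name}.png`, { maxDiffPixelRatio: 0.01 });
  });
}
```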
Don’t underestimate a good manual testing checklist. Sure, it’s not automated, and yes, it doesn’t scale forever. But if you’ve got three critical user flows, spending 15 minutes running through them before each release can be more useful than weeks spent building an automated suite that breaks every time you rename a CSS class. The trick is knowing when to switch: usually when you find yourself repeating the same manual checks every few days.
The real lesson here is to match your testing strategy to your product and your team. Catch the bugs that matter — and don’t let “perfect” stop you from shipping.