roadrunnertwice | The case of the default thread pool heuristic

As mentioned previously, this year I switched to hosting eardogger.com in what's either a highly unconventional environment or an unusually conventional environment, depending on your perspective. This has mostly gone completely fine! However, I did have one incident several weeks ago, and it was a funny one.

I was out reading webcomics on my phone, and got creepy 500 errors on Eardogger; when I got home, the logs showed a Resource temporarily unavailable error when trying to access the database.

⁉️ (Metal Gear Solid guard alert noise)

All right, first off: That database isn't a remote server; it's a file on the local disk. If THAT's "unavailable," something's very wrong. A quick web search indicated that error comes from the operating system itself, not anything in my tech stack (like sqlite maybe). At some point, I visited a page on the site, then tried to run a command in my SSH session:

$ ls
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: retry: Resource temporarily unavailable
-bash: fork: Resource temporarily unavailable

Hahahaha holy shit.

Ok anyway, long story short: I was hitting my user's process limit and being prevented from spawning new processes or threads. The 500s were happening when concurrent DB reads would have made the reader pool spawn a new thread, and it hit the wall instead.
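For the curious, that failure mode can be sketched in Rust. This is a minimal illustration, not Eardogger's actual reader pool: `run_on_new_thread` and the thread name are made up for the example. The point is that `std::thread::Builder::spawn` returns an `io::Result`, so a refused spawn can be handled as an error instead of a panic. On Linux, an EAGAIN from the OS ("Resource temporarily unavailable") should surface as `ErrorKind::WouldBlock`.

```rust
use std::io;
use std::thread;

/// Run `job` on a fresh OS thread and wait for its result.
/// Returns an error (instead of panicking) if the OS refuses to create
/// the thread, which is what happens once a per-user limit is reached.
fn run_on_new_thread<T, F>(job: F) -> io::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    // `thread::spawn` would panic on failure; `Builder::spawn` hands back
    // an io::Result so the caller can decide what to do.
    let handle = thread::Builder::new()
        .name("hypothetical-db-reader".into()) // illustrative name only
        .spawn(job)?;

    // A panic inside the worker is reported as a generic I/O error here.
    handle
        .join()
        .map_err(|_| io::Error::new(io::ErrorKind::Other, "worker thread panicked"))
}

/// True if the error looks like the OS saying "try again later" (EAGAIN).
fn is_resource_exhaustion(err: &io::Error) -> bool {
    err.kind() == io::ErrorKind::WouldBlock
}
```

A caller could match on the result of `run_on_new_thread(|| 2 + 2)` and, when `is_resource_exhaustion` returns true, queue the work or return a clean "busy" response rather than a bare 500.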

A given user may only run a certain number of processes at once on this server. Eardogger is at maximum a single process instance, so I thought I was fine. But then my web host upgraded my server's OS, which changed the process limit accounting to also include sub-process threads. And Eardogger IS multi-threaded.

How multi-threaded, exactly? Well, I was using the Tokio multi-threaded runtime with default configuration. And it turns out the default behavior is to immediately spawn one worker thread per logical CPU core...
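A rough way to see what a heuristic like that sees on a given machine is to ask the standard library for the logical core count. A minimal sketch: `logical_cores` is a name invented here, and treating this as the exact number Tokio's default uses is an assumption.

```rust
use std::thread;

/// Number of logical cores the OS reports as usable by this process.
/// Falls back to 1 if the platform can't answer the question.
fn logical_cores() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}
```

Printing `logical_cores()` on a laptop and then on a shared host is a quick way to find out how differently the same default will behave in the two places.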

...on what turns out to be a 128-core web server. The process limit (not advertised, but support will divulge if you ask) is 25.

I wouldn't want that even if it WAS allowed!! This app has like three goddamn users! I made my thread pool configurable and set it to single digits, and that immediately banished the errors and the shell lockups. 🌈 As a bonus, it also cut the app's cold startup time from "barely perceptible" to "legit gone" — apparently spawning more than a hundred threads on startup takes a noticeable amount of time, but Rust is so fast in general that it covered most of that sin and I wasn't immediately suspicious.
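One way to make a thread pool size configurable with a hard ceiling might look like the sketch below. Everything named here is hypothetical: the `APP_WORKER_THREADS` variable, the default of 2, and the ceiling of 8 are illustrative values, not the ones Eardogger uses.

```rust
use std::env;

/// Used when nothing (or nothing parsable) is configured. Assumed value.
const DEFAULT_WORKER_THREADS: usize = 2;
/// Hard ceiling, regardless of what the machine or the config says. Assumed value.
const MAX_WORKER_THREADS: usize = 8;

/// Turn an optional raw setting into a thread count in 1..=MAX_WORKER_THREADS.
fn resolve_worker_threads(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_WORKER_THREADS)
        .clamp(1, MAX_WORKER_THREADS)
}

/// Read the setting from a (hypothetical) environment variable.
fn worker_threads_from_env() -> usize {
    let raw = env::var("APP_WORKER_THREADS").ok();
    resolve_worker_threads(raw.as_deref())
}
```

With Tokio, the resulting number would be handed to `tokio::runtime::Builder::new_multi_thread().worker_threads(n)` before calling `enable_all()` and `build()`. The clamp keeps the value at 1 or more, which matters because `worker_threads` panics on 0. Tokio's blocking pool is sized separately via `max_blocking_threads`, so an app that leans on blocking tasks may want an explicit cap there too.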

Lessons learned:

- Resource heuristics that inspect the local system's capacity are a huge red flag, because they act normal on your laptop and then go berserk elsewhere. Set your ceiling explicitly!
- top and ps on Linux don't list threads by default; you have to pass an extra argument (top -H or ps -L, for instance) to see them.
- Wow.
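Since the usual tools hide threads unless asked, an app can also report its own thread count. On Linux, the kernel exposes it as the `Threads:` field of `/proc/self/status`. A minimal sketch, with function names invented for the example:

```rust
use std::fs;

/// Pull the `Threads:` value out of the text of a Linux /proc/<pid>/status file.
fn parse_thread_count(status: &str) -> Option<usize> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("Threads:"))
        .and_then(|rest| rest.trim().parse().ok())
}

/// Thread count of the current process.
/// Returns None on systems without /proc or if the file can't be read.
fn current_thread_count() -> Option<usize> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_thread_count(&status)
}
```

Logging `current_thread_count()` once at startup would make a surprise like "more than a hundred threads" visible without remembering the right flags for top or ps.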

EDIT: But actually, in most modern deployment scenarios, your server's CPU and memory resources really are much closer to the scene on your laptop, since the standard practice is to slice up computer resources into tiny single-purpose shards via containers or VMs. Like, Ruth was showing me something from her work that we would both consider "fairly extreme" in terms of resource allocation, and I was like "oh yeah, that's a whole lot of laptop... but it ain't a web server, you know?" So yes, this whole problem did in fact stem directly from my runtime environment being weirdly atavistic, and most people are probably fine with Tokio's default behavior.
