runtime: concurrent phase of GC causing full request stalls #17341
Comments
@zviadm Can you send an execution trace? There are several known issues around this, which we're working on for Go 1.8. From the trace we should be able to tell which one(s) are affecting you.
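For anyone wanting to capture such an execution trace themselves, the runtime/trace package is one way to do it. This is a minimal sketch, not the exact steps used in this report; the output file name is illustrative.

package main

import (
    "log"
    "os"
    "runtime/trace"
)

func main() {
    // Write the execution trace to a file for later inspection.
    f, err := os.Create("trace.out")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    if err := trace.Start(f); err != nil {
        log.Fatal(err)
    }
    defer trace.Stop()

    // ... run the workload being measured here ...
}

The resulting file can then be inspected with "go tool trace trace.out" (older releases may also expect the program binary as an argument).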
I reproduced this using the given gist. Log output here. Relevant bit of the trace: I attached the full trace here. For me, the pause appears in the split 6/8 (15.636125645s-19.117813135s). Edit: I originally posted a trace using Go 1.7. I updated the log and trace using head (dce0df2) instead.
Thanks! It's clear from the trace that this is #16528. Closing as a duplicate. @zviadm, if you subscribe to #16528, it would be great to get feedback on future fixes. You can also try CL 30031 from that issue, though based on my own tests that just shifts the pauses around (albeit to an easier-to-fix place, which is progress in some sense :)
Thanks for the quick responses; I have subscribed to the other issue. Will try to see what happens with that CL too, just as a data point.
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
Go 1.7
Also tried with Go at commit fd29628.
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0"
CXX="g++"
CGO_ENABLED="1"
What did you do?
If possible, provide a recipe for reproducing the error.
A complete runnable program is good.
A link on play.golang.org is best.
Wrote a benchmark to showcase the issue using the standard net/http library:
https://gist.github.com/zviadm/18263037d4b9b624fc8a5b4cb7005bb4
The benchmark creates a simple HTTP server, then opens a large number of connections to it and performs very simple requests from those connections/goroutines. This simulates an HTTP server with lots of clients, most of which are idle at any given time.
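The actual benchmark is in the gist linked above; as a rough illustration of its shape, here is a minimal sketch. The flag names mirror the command line shown below, but the implementation details are illustrative, not the gist's exact code.

package main

import (
    "flag"
    "fmt"
    "io"
    "log"
    "net/http"
    "sync/atomic"
    "time"
)

var (
    port       = flag.Int("port", 5002, "port for the test HTTP server")
    nroutines  = flag.Int("nroutines", 30000, "number of client goroutines")
    sleepT     = flag.Duration("sleep_t", 3*time.Second, "sleep between requests per goroutine")
    maxLatency = flag.Duration("max_latency", 30*time.Millisecond, "latency threshold to count as a stall")
)

func main() {
    flag.Parse()

    // Trivial handler so that request cost itself is negligible.
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        io.WriteString(w, "ok")
    })
    go func() {
        log.Fatal(http.ListenAndServe(fmt.Sprintf(":%d", *port), nil))
    }()
    time.Sleep(time.Second) // crude wait for the server to come up

    var total, stalled int64
    url := fmt.Sprintf("http://localhost:%d/", *port)
    for i := 0; i < *nroutines; i++ {
        go func() {
            // Each goroutine gets its own Transport, so it holds its own
            // persistent connection that sits idle between requests.
            client := &http.Client{Transport: &http.Transport{}}
            for {
                time.Sleep(*sleepT)
                start := time.Now()
                resp, err := client.Get(url)
                if err != nil {
                    continue
                }
                io.Copy(io.Discard, resp.Body)
                resp.Body.Close()
                atomic.AddInt64(&total, 1)
                if time.Since(start) > *maxLatency {
                    atomic.AddInt64(&stalled, 1)
                }
            }
        }()
    }

    // Periodically report how many requests exceeded the latency threshold.
    for {
        time.Sleep(10 * time.Second)
        log.Printf("requests=%d stalls(>%v)=%d",
            atomic.LoadInt64(&total), *maxLatency, atomic.LoadInt64(&stalled))
    }
}

Giving each client goroutine its own http.Transport forces each one to keep its own connection, which is what produces the large pool of mostly idle connections described above.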
What did you expect to see?
When run using something like:
$: ulimit -Sn 500000
$: GODEBUG=gctrace=1 ./net_gctest -port=5002 -nroutines=30000 -sleep_t=3s -max_latency=30ms &> benchout.txt
I expected it to sustain about 10k QPS, with no stalls or tail-latency spikes.
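(With 30000 goroutines each issuing one request roughly every 3 seconds, per the flags above, the expected steady-state throughput is about 30000 / 3 ≈ 10000 requests per second, which is where the 10k QPS figure comes from.)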
What did you see instead?
Sample output: https://gist.github.com/zviadm/82e9b899833db9dc741f9a6400a09d6c
gc 25 @222.337s 0%: 0.34+165+1.3 ms clock, 11+477/1307/2896+43 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
gc 26 @237.844s 0%: 0.30+165+1.4 ms clock, 9.8+504/1306/3076+47 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
gc 27 @253.378s 0%: 0.30+162+1.2 ms clock, 9.4+485/1286/2993+39 ms cpu, 1411->1415->723 MB, 1447 MB goal, 32 P
Even though the STW phases of the GC take <2ms, we see lots of stalls of >30ms while the concurrent phase of the GC runs. Pretty much all requests stall during that time period.
This is a pretty serious issue because it basically means that having lots of idle connections (or potentially just goroutines?) in a server causes huge tail-latency spikes. In real programs the GC runs more often, perhaps once every 5-10 seconds, so these spikes easily show up in p95/p99 latencies.
Note that CPU utilization is very low throughout the test. If we run the same test with a much smaller number of goroutines, say only 200 or 300, the same machine can do 200k QPS without any stalls.