cmd/trace: provide aggregate 'go tool trace' goroutine analysis view #29103
Btw, I am working on a PR that fixes this - it's fairly straightforward.
Additionally, accumulator buckets like 'sync time' are not very useful on their own. The system should report min/avg/max, or even percentiles - the total alone does not provide enough information.
I have created a project with the completed code changes here. Here is a screenshot of the new summary page: The PR is going to be some work, because the changes were made externally and involve the internal package, so maybe before I open the PR, someone could review the changes in the project and we can discuss there?
Thanks for the change. I agree that having some sort of summary in the goroutines view will be helpful. Based on the screenshot, I guess the Total is a sum, which is not very meaningful. The color bar seems useful for telling the characteristics of each goroutine. The goroutines with pc:0 are the goroutines that pre-existed before the tracing started, so we couldn't determine their origin. They are a mix of various goroutines, and I am not sure any data about them other than the count is meaningful.
It is a sum, but it is the same sum that was used in the 'detail' view to provide the 'exec time %' - it was just duplicated into the summary. Similarly for the other aspects. I think all of the sums are useful to a certain extent, as they give a profile of the goroutines (which is what the color bar represents). Again, this is the same representation as the original detail view. The min/avg/max fields are the most important, since, as you state, in many cases the sum isn't helpful when doing latency or performance analysis. I suspect that your pc:0 assessment is correct, if only because the sample I test with is captured after ALL of the goroutines are created (so they should all be pc:0). Furthermore, the 1:1 correspondence between the pc:0 routines and the number of actual routines seems too much of a coincidence.
@hyangah also, for the PC:0 routines, the detail is essentially the same for all of them, which seems strange.
@hyangah OK, so the count of the PC:0 routines is not related to the "real" goroutine count. But I just did a trace, and it shows the following goroutines as PC:0. Yet if I perform a full stack dump of the application, I get:

```
goroutine 71 [running]:
runtime/pprof.writeGoroutineStacks(0x1b4c360, 0xc000098000, 0x100d82f, 0xc0001464e0)
	/usr/local/Cellar/go/1.11.2/libexec/src/runtime/pprof/pprof.go:678 +0xa7
runtime/pprof.writeGoroutine(0x1b4c360, 0xc000098000, 0x2, 0xc000082400, 0x0)
	/usr/local/Cellar/go/1.11.2/libexec/src/runtime/pprof/pprof.go:667 +0x44
runtime/pprof.(*Profile).WriteTo(0x25ea580, 0x1b4c360, 0xc000098000, 0x2, 0xc000098000, 0x1ab3212)
	/usr/local/Cellar/go/1.11.2/libexec/src/runtime/pprof/pprof.go:328 +0x3e4
net/http/pprof.handler.ServeHTTP(0xc0002ae011, 0x9, 0x1b61760, 0xc000098000, 0xc0001a2000)
	/usr/local/Cellar/go/1.11.2/libexec/src/net/http/pprof/pprof.go:245 +0x210
net/http/pprof.Index(0x1b61760, 0xc000098000, 0xc0001a2000)
	/usr/local/Cellar/go/1.11.2/libexec/src/net/http/pprof/pprof.go:268 +0x723
net/http.HandlerFunc.ServeHTTP(0x1ad37b8, 0x1b61760, 0xc000098000, 0xc0001a2000)
	/usr/local/Cellar/go/1.11.2/libexec/src/net/http/server.go:1964 +0x44
net/http.(*ServeMux).ServeHTTP(0x25f9ea0, 0x1b61760, 0xc000098000, 0xc0001a2000)
	/usr/local/Cellar/go/1.11.2/libexec/src/net/http/server.go:2361 +0x127
net/http.serverHandler.ServeHTTP(0xc000238000, 0x1b61760, 0xc000098000, 0xc0001a2000)
	/usr/local/Cellar/go/1.11.2/libexec/src/net/http/server.go:2741 +0xab
net/http.(*conn).serve(0xc0001fc0a0, 0x1b62760, 0xc000152000)
	/usr/local/Cellar/go/1.11.2/libexec/src/net/http/server.go:1847 +0x646
created by net/http.(*Server).Serve
	/usr/local/Cellar/go/1.11.2/libexec/src/net/http/server.go:2851 +0x2f5
```

So it does seem that these routines are there, but the problem is they had no state transitions during the trace capture window.
So I think that either 1) they should be omitted entirely, or 2) the trace capture should also perform a full stack dump at trace start, and then create events like "in syscall" - not just "start syscall" / "stop syscall" - or find some other way of tracking where/what these routines are. Btw, I captured the trace while the application was idle, which leads to lots of PC:0 routines. *** Notice though that some of the routines in the trace capture do not appear at all in the stack dump, like 2 or 3... which is very strange.
Those goroutines don't affect any operation latency, which is what the tracing tool aims to analyze. I'd go with omitting them. Still, the count is meaningful, since it gives us an idea of how many goroutines are in the system during the traced period. What performance problems do you want the new page to help diagnose?
@hyangah I think the next enhancement would be to record the 'event' where the 'max' occurred, so you could click on the 'max' to open the detailed trace window at that exact moment in time. I believe the purpose of the tool is to easily observe "anti-patterns": code doing little work before a context switch, scheduler inefficiencies (scheduler delays). The min/avg/max gives you a simplified histogram of the performance of the routine, and for a request/response or event/reaction type system this is very important. The individual list by routine is still there - you click on the routine name from the summary - which then reports the various timings per routine, including min/avg/max. To summarize, the old summary page and pprof are not suitable for diagnosing performance problems in a concurrent environment. I will say that 'user events/tasks' is the way to Go! This is ideal for monitoring this type of work (multi-stage, multi-event) - I think it is a shame that it is not promoted and documented more.
You can select the goroutine with the min/max of each metric from the goroutine page (the table is sortable). On the other hand, I found the pprof outputs more helpful in analyzing inefficiencies involving context switching or scheduling while debugging production issues. If some of the information you present in your prototype is useful, how about presenting it in each goroutine view? I agree that this tool needs more love and the UI needs improvement. I think the discussion will be more fruitful if we have a list of specific questions, along with examples, that the new visualization can help answer.
@hyangah It is not the sorting - that is min/max by total - I added min/avg/max per event. For example, you may have 1000 ms of sync block, but if you see that the min/avg/max are all under 100 nanos, you don't really have any contention - the code is just acquiring the mutex a lot... Similarly, you might have 1000 ms of execution time, but if the min/avg/max is 10 us, that means the code is doing very little before it is descheduled, but it is running a lot. Using "totals" is difficult for performance analysis. Similarly, pprof doesn't break things apart by goroutine - this is the biggest problem with pprof. It is the aggregate only, so it is very hard to see what is happening - if you look at the pprof output in issue #29104 you will see that it is not very useful.
@robaho I am not sure I understand it all correctly, but it seems like I was questioning the value of the new goroutine overview page in #29103 (comment). I think the distribution of each observed event duration is a good addition, and I want it. As for the parsing API (those in internal/), the API is not stable and nowhere near ready to be public - at some point we were thinking about changing it dramatically for optimization. The good news is that it's part of the Go release and doesn't change often. So what I usually do is have a shell script that copies it and replaces the import paths with sed, and run it twice a year.
@hyangah I think I understand your concerns. The overview page is essential IMO; otherwise you need to jump into each goroutine to find the CPU usage (for example), and then remember it to perform comparisons - the summary allows easy determination of which goroutines are consuming the CPU (pprof doesn't give this). It also allows easy access to the analysis graphs without jumping into each routine. I did change the detail page to show the min/avg/max - I just didn't post the screenshot - which is really important, as it makes it easy to determine, when you have multiple routines of the same type, whether they are distributing the load equally and where the max latencies are occurring - totals don't allow this. I think using a histogram would be great; the only problem is that the UI might need to be more advanced, as it is already crowded just showing min/avg/max in addition to the total. In some ways the totals could be removed (in both the summary and the detail), since they are reflected in the bar, and then there would be more room to show histogram details.
@hyangah To clarify, the summary page shows the min/avg/max across all goroutines of that type, so it is very easy to find a high max or avg and then drill deeper.
@hyangah I just posted another commit that allows you to perform a detailed trace on the selected goroutine group by clicking on the 'count' of that group in the summary page. This is very useful as well. I think it may be helpful if you clone the repo and run it live, as the screenshots don't represent the usage too well.
I agree with @robaho that this sort of summary view would be nice. What's the current blocker?
You can use the goanalyzer in my GitHub. I think it is simply a lower-priority item to address. Plus, I would need to create a patch set, which I've been reluctant to do since it didn't seem it was going to be accepted.
The current 'go tool trace' has a 'goroutine analysis' page, that shows data similar to this:
and after selecting a goroutine, a detail page is displayed similar to this:
This should be changed so that either 1) the main summary page shows the 'detail' for each goroutine in a list, or 2) the summary page gets an option labelled 'details' that provides this information.
That would be far more useful than having to view the detail on each goroutine individually.