I have no idea if this is relevant to your situation, because I'm not familiar with DFU's code or with the Profiler you're using. I just wanted to pass along a bit of painful advice for understanding "sampling-based" profilers.
Quick note: profilers generally come in two forms. "Instrumented" profilers _can_ be more accurate in a world of many small functions, at the expense of massive slowdown and unrealistic interactions with drivers and the real world. "Sampling" profilers have low impact on run-time performance but can only give a statistical approximation of what's happening. Most profilers are sampling profilers, or at least default to that mode.
The problem with sampling profilers is that they don't see all your code. Commonly, they take one sample every 1ms, or maybe every 0.1ms, to see where in the code the CPU happens to be at that instant. If the rendering code is VSYNC'ed, and your frame time is close to a whole number of sampling intervals, you could easily see an apparent high usage in a small portion of the rendering pipeline, simply because the sampling time happens to line up there. Say you're locked to a 60Hz monitor. Frame time is 16.67ms, which isn't integral, but if we treat it as 17ms, every 17th sample will be looking at the same place in the rendering pipeline. If OnGui runs at that point in every frame, the sampler will hit it every time, even if OnGui only takes a few hundredths of a millisecond (say 0.02ms). This could lead to the profiler claiming that OnGui was taking ~6% of the time of the system (1 sample in 17), when in reality it is only consuming about 0.1% (0.02ms out of 17ms).
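To make the lock-step effect concrete, here is a rough Python sketch of a sampler and a frame loop. Everything in it is an assumption for illustration: the 17ms frame, the 0.02ms OnGui window starting 5ms into each frame, the 13.337ms "VSYNC off" frame time, and the function name `sampled_share` are all made up, and a real profiler has jitter that this ignores.

```python
def sampled_share(frame_us, start_us, dur_us, sample_us, phase_us, n_samples):
    """Fraction of samples that land inside one busy window per frame.

    All times are integer microseconds. The busy window (a stand-in for
    OnGui) starts start_us into every frame and lasts dur_us.
    """
    hits = 0
    for i in range(n_samples):
        t = phase_us + i * sample_us      # time of the i-th sample
        offset = t % frame_us             # position of that sample within its frame
        if start_us <= offset < start_us + dur_us:
            hits += 1
    return hits / n_samples


# Hypothetical numbers, chosen only to illustrate the effect.
FRAME_US = 17_000        # frame time rounded to 17 ms (VSYNC'ed)
SAMPLE_US = 1_000        # one sample every 1 ms
ONGUI_START_US = 5_000   # window begins 5 ms into each frame
ONGUI_DUR_US = 20        # window lasts 0.02 ms

true_share = ONGUI_DUR_US / FRAME_US

# Sampler happens to line up with the window: every 17th sample hits it.
aligned = sampled_share(FRAME_US, ONGUI_START_US, ONGUI_DUR_US,
                        SAMPLE_US, phase_us=10, n_samples=17_000)

# Same workload, sampler shifted by half a millisecond: it never hits.
misaligned = sampled_share(FRAME_US, ONGUI_START_US, ONGUI_DUR_US,
                           SAMPLE_US, phase_us=500, n_samples=17_000)

# Frame time that does not line up with the sampling interval
# (a stand-in for running with VSYNC off): the sampler drifts
# across the frame and the estimate matches the real share.
UNLOCKED_FRAME_US = 13_337
drifting = sampled_share(UNLOCKED_FRAME_US, ONGUI_START_US, ONGUI_DUR_US,
                         SAMPLE_US, phase_us=10, n_samples=13_337)

print(f"true share (17 ms frame):   {true_share:.4%}")
print(f"reported, sampler aligned:  {aligned:.4%}")
print(f"reported, sampler shifted:  {misaligned:.4%}")
print(f"reported, unlocked frame:   {drifting:.4%} "
      f"(true {ONGUI_DUR_US / UNLOCKED_FRAME_US:.4%})")
```

The same 0.02ms of work shows up as either about 6% or 0%, depending only on where the sampler's phase happens to fall relative to the frame. Once the frame time stops being a multiple of the sampling interval, the samples spread across the whole frame and the estimate settles near the real figure.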
There are other issues with sampling profilers, largely based on the fact that a 1GHz single-core computer can nominally execute 1 million opcodes in 1ms, so you have 1 sample point out of the 1 million things that CPU did in that time period. Of course, the CPU is never able to maintain full theoretical throughput, but given that modern processors are more in the range of 3GHz and 4+ cores, it's not unreasonable to take as a rule of thumb that your profiler is only seeing 0.0001% of the executing code (or maybe 0.001%, if you can speed it up to 0.1ms sampling) and then telling you which pieces of your code are most heavily used. It can work, but if it's telling you something that seems impossible, just remember that it's having to extrapolate wildly from limited data.
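For anyone who wants to check that arithmetic, here is a small Python sketch. It assumes the nominal best case of one instruction per clock cycle per core, which real CPUs don't sustain, and the function name and the `cores` parameter are just mine for illustration.

```python
def percent_of_instructions_sampled(clock_hz, samples_per_second, cores=1):
    """Percentage of nominally executed instructions that get sampled.

    Assumes one instruction per clock cycle per core (a nominal upper
    bound on throughput, not a measurement) and one sample point per
    sampling interval.
    """
    instructions_per_sample = clock_hz * cores / samples_per_second
    return 100.0 / instructions_per_sample


# 1 GHz single core, one sample every 1 ms (1,000 samples per second)
one_ms = percent_of_instructions_sampled(1_000_000_000, 1_000)

# Same CPU, one sample every 0.1 ms (10,000 samples per second)
tenth_ms = percent_of_instructions_sampled(1_000_000_000, 10_000)

# Nominal 3 GHz, 4 cores, one sample every 1 ms
modern = percent_of_instructions_sampled(3_000_000_000, 1_000, cores=4)

print(f"1 GHz, 1 ms sampling:        {one_ms:.6f}%")
print(f"1 GHz, 0.1 ms sampling:      {tenth_ms:.6f}%")
print(f"3 GHz x 4 cores, 1 ms:       {modern:.7f}%")
```

The first two cases give the 0.0001% and 0.001% figures above. The nominal 3GHz, 4-core case comes out about twelve times smaller still, which is why I treat 0.0001% as a generous rule of thumb rather than a precise number.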
In particular, for any kind of video application, I would STRONGLY recommend turning off VSYNC before profiling if you want to have any chance of understanding what the code is actually doing.
(And my apologies if you already have, and I've just wasted your time. But maybe it will help someone else...)