I did a quick test last week where I replaced the billboard LOD with a simple billboard shader and set all of the billboards to use the same texture. Performance-wise, it should have produced the same effect as what you described: since they all share the same texture/material, they should get batched. But that's kind of the problem... this dynamic batching is sucking up major CPU time just to save some draw calls.
The reason that the DFU billboard shader is so efficient is that it combines all the tree billboards into one mesh. Each of the quads that the trees are drawn onto is billboarded (its orientation changed to face the player) by the shader on the GPU, with no involvement of the CPU. All the CPU has to do is send instructions on how to render the combined mesh of all the trees. Very few draw calls, next to no batching. The trees are just one object. The CPU spends its time telling the GPU how to render the trees as a single collective object instead of wasting its time instructing the GPU how to render this one, then this other one, then this other one, then this other, etc...
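To make the idea concrete, here is a rough Python sketch of the combined-mesh approach. This is not DFU's actual code, and every name in it (`combine_billboards`, `billboard_vertex`, `QUAD_CORNERS`) is made up for illustration. It assumes each tree quad stores the tree's base position on all four vertices plus a 2D corner offset, and that the vertex shader expands the quad along the camera's right and up vectors.

```python
# Corner layout of one quad, relative to the tree's base (hypothetical convention):
# x runs from -0.5 to 0.5 across the width, y runs from 0 to 1 up the height.
QUAD_CORNERS = ((-0.5, 0.0), (0.5, 0.0), (0.5, 1.0), (-0.5, 1.0))


def combine_billboards(trees):
    """Build one vertex/index buffer for all trees.

    trees: list of (position, width, height), position being an (x, y, z) tuple.
    Returns (origins, corners, indices). This runs once, not every frame.
    """
    origins, corners, indices = [], [], []
    for position, width, height in trees:
        base = len(origins)
        for cx, cy in QUAD_CORNERS:
            origins.append(position)                  # same origin on all 4 vertices
            corners.append((cx * width, cy * height))  # scaled corner offset
        # Two triangles per quad
        indices.extend([base, base + 1, base + 2, base, base + 2, base + 3])
    return origins, corners, indices


def billboard_vertex(origin, corner, cam_right, cam_up=(0.0, 1.0, 0.0)):
    """Stand-in for the per-vertex shader work: push the vertex out from the
    tree's origin along the camera's right/up axes so the quad faces the camera."""
    return tuple(
        origin[i] + cam_right[i] * corner[0] + cam_up[i] * corner[1]
        for i in range(3)
    )
```

The point of the sketch is that `combine_billboards` is a one-time CPU cost, while the facing math in `billboard_vertex` is the part that would live in the shader and run on the GPU.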
The downside to this is that, I believe, you cannot use LODs or culling on individual trees, since the combined mesh is all the trees. They can either all be on, or all be off. Nothing in between. So when you play vanilla DFU, all the trees of a terrain chunk are rendered, unless you are looking straight up or maybe down to where none of the combined meshes are in the view frustum. Like I've said before, this takes the load off the CPU and puts it onto the GPU, but most modern GPUs can handle vanilla DFU billboarded trees in the thousands without breaking a sweat, so this isn't an issue. Most modern GPUs, however, can't handle thousands of tree meshes at the same time, so the combining meshes trick is kind of out of the question there. Too many polygons to combine anyhow.
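The all-or-nothing behaviour can be shown with a toy culling test. This is only a sketch under simplifying assumptions: the view volume is an axis-aligned box rather than a real frustum, each tree is treated as a single point, and the function names are hypothetical.

```python
def aabb_overlaps(a_min, a_max, b_min, b_max):
    """True if two axis-aligned boxes overlap on all three axes."""
    return all(a_min[i] <= b_max[i] and b_min[i] <= a_max[i] for i in range(3))


def union_bounds(points):
    """Bounding box that encloses every point (the combined mesh's bounds)."""
    lo = tuple(min(p[i] for p in points) for i in range(3))
    hi = tuple(max(p[i] for p in points) for i in range(3))
    return lo, hi


def drawn_individually(trees, view_min, view_max):
    """Separate objects: each tree is culled on its own."""
    return sum(1 for t in trees if aabb_overlaps(t, t, view_min, view_max))


def drawn_combined(trees, view_min, view_max):
    """One combined object: if any part of its bounds is in view, every tree is drawn."""
    lo, hi = union_bounds(trees)
    return len(trees) if aabb_overlaps(lo, hi, view_min, view_max) else 0


trees = [(0.0, 0.0, 0.0), (50.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
view_min, view_max = (-10.0, -10.0, -10.0), (10.0, 10.0, 10.0)
print(drawn_individually(trees, view_min, view_max))  # only the tree inside the view box
print(drawn_combined(trees, view_min, view_max))      # all trees, since the bounds overlap
```

With separate objects only the tree inside the view box is submitted; with the combined mesh all three are, which is the GPU-side cost described above.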
I predict that if you replace the LOD0 meshes with a billboarded sprite and don't alter the distance at which they stop rendering, there will be a nearly trivial gain in performance, because it still takes nearly as much CPU time to batch all the billboards as it would to batch billboards + meshes. You'd save a little bit because the meshes have multiple materials whereas the billboards do not, but the CPU time savings would still be quite minor.
My only solution (as I've mentioned before) is to do a custom billboard shader, in which we have to use a big texture atlas for the entire zone (tree zone). This way it would be made quite similar to the vanilla DFU billboards. The downside is that each tree would have just one viewing side; atm billboards have about 8.
I had been interpreting this to mean that we simply replace the SpeedTree billboard shader with a custom shader that does mostly the same thing and keep the mesh LOD0. Are you saying that we should scrap the LOD system and replace it with 8-view billboard sprites? If so, I can get behind that. But again, I think it will only help if we combine the billboard meshes the way Interkarma did.
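For what it's worth, an 8-view sprite on a texture atlas mostly comes down to picking a tile from the camera's direction. A minimal sketch, assuming the views are evenly spaced around the vertical axis and laid out left to right, row by row in a grid atlas with row 0 at v = 0 (all of this, including the names `view_index` and `atlas_uv`, is hypothetical and not taken from SpeedTree or DFU):

```python
import math


def view_index(tree_pos, cam_pos, tree_yaw=0.0, views=8):
    """Pick which of the pre-rendered views faces the camera.

    Only the horizontal (x, z) direction from tree to camera is used.
    View 0 is assumed to be the one seen from +z.
    """
    dx = cam_pos[0] - tree_pos[0]
    dz = cam_pos[2] - tree_pos[2]
    angle = (math.atan2(dx, dz) - tree_yaw) % (2.0 * math.pi)
    sector = 2.0 * math.pi / views
    # Offset by half a sector so each view is centred on its angle
    return int((angle + sector / 2.0) // sector) % views


def atlas_uv(tile, cols, rows):
    """UV rectangle (u0, v0, u1, v1) of a tile in a cols x rows atlas."""
    col, row = tile % cols, tile // cols
    return (col / cols, row / rows, (col + 1) / cols, (row + 1) / rows)
```

In a combined mesh this lookup would have to happen in the shader (or be baked per chunk), otherwise the CPU is back to touching every tree each frame.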
Without scrapping the LOD system, we can't combine meshes. If we can't combine meshes, then we must rely on dynamic batching. If we rely on dynamic batching, the CPU is going to get choked.