Antarctica: Performance

== Some idea to improve performance ==
 
* GPU scene management. Instead of culling objects on the CPU as Irrlicht does, GPUs are now capable of doing it (to some extent), which avoids round trips between CPU and GPU memory and can reuse intermediate results (like occluder rendering). The general idea is to send instanced objects to a geometry shader which discards the objects that are not visible, either because they are outside the view frustum or because they are occluded. In the latter case, occluder objects (i.e. the biggest meshes in the scene) are rendered first, the depth map is then "max-mipmapped" (i.e. the maximum depth is propagated instead of a weighted average) and the mipmap levels are used to reject occluded objects as early as possible. More details are available in the AMD SIGGRAPH material here: http://developer.amd.com/resources/documentation-articles/samples-demos/gpu-demos/ati-radeon-hd-4800-series-real-time-demos/
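As a rough illustration, here is a minimal GLSL sketch of the frustum-culling half of such a pass (the Hi-Z occlusion test against the max-mipmapped depth is omitted): one point primitive is submitted per object and only the IDs of visible objects are emitted, captured with transform feedback and later used to build the instance buffers. Names such as <code>v_sphere</code> and <code>u_frustumPlanes</code> are illustrative, not existing STK code.

<syntaxhighlight lang="glsl">
#version 330
// GPU frustum culling sketch: one input point per object, carrying its bounding
// sphere and ID. Visible IDs are captured with transform feedback
// (glTransformFeedbackVaryings on "visibleId"); rasterization would be disabled
// with GL_RASTERIZER_DISCARD for this pass.
layout(points) in;
layout(points, max_vertices = 1) out;

in vec4 v_sphere[];              // xyz = bounding sphere center (view space), w = radius
in float v_objectId[];           // object/instance index, passed through when visible

uniform vec4 u_frustumPlanes[6]; // plane equations, normals pointing inside the frustum

out float visibleId;

void main()
{
    vec3 center = v_sphere[0].xyz;
    float radius = v_sphere[0].w;
    for (int i = 0; i < 6; i++)
    {
        // Sphere entirely on the outer side of one plane -> object is culled.
        if (dot(u_frustumPlanes[i].xyz, center) + u_frustumPlanes[i].w < -radius)
            return;
    }
    visibleId = v_objectId[0];
    EmitVertex();
    EndPrimitive();
}
</syntaxhighlight>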
 
 
* We can go even further and have almost all draw calls generated by the GPU, using MultiDrawIndirect. While it is only core since GL 4.3 (GL_ARB_multi_draw_indirect), it requires a pipeline that minimises state changes, which is beneficial on all GPUs. In order to use MultiDrawIndirect, the renderer must gather textures into texture arrays to avoid texture rebinding (taking GPU VRAM availability into account), gather uniforms either into uniform buffer objects or into a VBO as instance attributes, and do all indexing in the shaders. See http://www.geeks3d.com/20140321/opengl-approaching-zero-driver-overhead/
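On the shader side, the per-draw indexing could look like the following minimal GLSL sketch, assuming GL_ARB_shader_draw_parameters is available and that per-draw data (here just a model matrix) lives in a shader storage buffer; the buffer layout and names are assumptions, not STK's actual code.

<syntaxhighlight lang="glsl">
#version 430
#extension GL_ARB_shader_draw_parameters : require
// With MultiDrawIndirect, gl_DrawIDARB identifies the current draw inside the
// batch, so per-draw data is fetched from a buffer instead of rebinding uniforms.

layout(std140, binding = 0) readonly buffer ObjectData
{
    mat4 ModelMatrix[];
};

layout(location = 0) in vec3 Position;

uniform mat4 ViewProjectionMatrix;

// Forwarded to the fragment shader, e.g. to select a layer of a texture array.
flat out int drawId;

void main()
{
    drawId = gl_DrawIDARB;
    gl_Position = ViewProjectionMatrix * ModelMatrix[gl_DrawIDARB] * vec4(Position, 1.0);
}
</syntaxhighlight>

The corresponding draw commands would be filled into a GL_DRAW_INDIRECT_BUFFER and submitted with a single glMultiDrawElementsIndirect call.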
 
  
 
* We may investigate compute shaders (a GL 4.3 feature...) to improve our post-processing filters. Post-processing filters usually read adjacent pixels (for instance a Gaussian blur), which means redundant texture fetches are done. While the texture cache mitigates the performance impact, compute shaders introduce shared memory, which is almost as fast to read as registers. This shared memory can remove the redundant fetches and avoids the need for a fullscreen-quad vertex shader. However there is some overhead involved in thread dispatch; most articles on the web say that compute shaders are a win for kernels with a radius > 7. One of our Gaussian shaders saw a +200% speed increase when reimplemented as a compute shader.
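A minimal sketch of such a shared-memory Gaussian blur (horizontal pass only) in GLSL is shown below; the work-group size, kernel radius, weights and bindings are placeholders and not taken from STK's actual filter.

<syntaxhighlight lang="glsl">
#version 430
// Horizontal Gaussian blur as a compute shader: each work group loads its
// 128 pixels plus RADIUS pixels of apron on each side into shared memory
// exactly once, then every invocation blurs from shared memory only.
layout(local_size_x = 128, local_size_y = 1) in;

layout(binding = 0) uniform sampler2D source;
layout(binding = 1, rgba16f) writeonly uniform image2D dest;

const int RADIUS = 8;
uniform float weights[RADIUS + 1];  // weights[0] = center tap

shared vec4 local_src[128 + 2 * RADIUS];

void main()
{
    ivec2 size = textureSize(source, 0);
    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    int lid = int(gl_LocalInvocationID.x);
    int y = min(pixel.y, size.y - 1);

    // Cooperative load: one texture fetch per texel of the tile + apron.
    for (int i = lid; i < 128 + 2 * RADIUS; i += 128)
    {
        int x = clamp(int(gl_WorkGroupID.x) * 128 + i - RADIUS, 0, size.x - 1);
        local_src[i] = texelFetch(source, ivec2(x, y), 0);
    }
    barrier();

    // Blur entirely from shared memory: no redundant texture fetches.
    vec4 sum = local_src[lid + RADIUS] * weights[0];
    for (int j = 1; j <= RADIUS; j++)
        sum += (local_src[lid + RADIUS - j] + local_src[lid + RADIUS + j]) * weights[j];

    if (pixel.x < size.x && pixel.y < size.y)
        imageStore(dest, pixel, sum);
}
</syntaxhighlight>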
 
* About shaders
** We use the whole InverseProjection matrix to get the view-space position from the projected screen position. However most of its coefficients are 0, which the shader does not know. Using an explicit formula may reduce the amount of computation (a sketch follows this list item).
** Similarly, division is a slower operation than mult/add, and it may be profitable to avoid dividing by the w component when we know the projection is orthographic (w stays 1). This shouldn't happen a lot though.
** Generally speaking, mult/add/sub operations are fast on GPUs for floats (and for ints on Radeons) and can even be fused into mad by the compiler. fdiv, fexp, flog, fsqrt, cos, sin and fract are usually slower (typically the cost of 4 mults). Integer division/modulo is one of the slowest base operations available (because it is almost never implemented in hardware and needs to be emulated with quite a lot of simpler instructions) and should be avoided. Consider using fdiv + fract on a float cast if you don't need exact precision.
** Avoid the discard operation and writing to gl_FragDepth: they disable the early-z feature of the GPU. Early z writes the depth value as soon as the fragment shader starts (thus hiding depth-write latency) and can also skip executing a fragment shader entirely if a fragment with a lower depth is already at the same position. In short, early z reduces bandwidth usage and partially hides some write latencies.
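For the first point, here is a minimal GLSL sketch of reconstructing the view-space position without the full InverseProjection multiply, assuming a standard symmetric perspective projection matrix; the function and parameter names are illustrative, not existing STK code.

<syntaxhighlight lang="glsl">
// uv is the screen position in [0,1], depth the value read from the depth buffer.
// For a symmetric perspective matrix only P[0][0], P[1][1], P[2][2] and P[3][2]
// matter here, so the full inverse-matrix multiply can be avoided.
vec3 getViewPosition(vec2 uv, float depth, mat4 ProjectionMatrix)
{
    vec2 ndc = uv * 2.0 - 1.0;
    float ndcZ = depth * 2.0 - 1.0;
    // clip.z = P[2][2] * viewZ + P[3][2] and clip.w = -viewZ for this matrix.
    float viewZ = -ProjectionMatrix[3][2] / (ndcZ + ProjectionMatrix[2][2]);
    // Undo the perspective divide (w = -viewZ) and the x/y scaling.
    float viewX = ndc.x * -viewZ / ProjectionMatrix[0][0];
    float viewY = ndc.y * -viewZ / ProjectionMatrix[1][1];
    return vec3(viewX, viewY, viewZ);
}
</syntaxhighlight>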
* The "target hardware" for STK maxed out is the HD 7750 (an entry-level AMD graphics card from 2012), which is used to tune the performance of our effects. As of 06/06/14 the cost of each 3D pass is (1680x1050 on Hacienda, single run):
** Solid Pass 1: 0.59 ms
** Shadows: 1.98 ms. This is heavily scene dependent too, and the cost can be reduced by caching some shadow maps.
** Radiance Hint: 1.87 ms
** GI: 3.54 ms. Unfortunately GPS2 does not provide helpful insight into what is causing such a high cost; the shader seems to spend a lot of time doing nothing.
** Env Map: 0.70 ms
** Sun light: 1.65 ms. High cost, but the shader also fetches the shadow map.
** SSAO: 7.1 ms. This scales very well with the number of samples (we currently have a "high quality" SSAO with 16 samples at full screen resolution; switching to 8 samples almost halves the cost).
** Solid Pass 2: 1.02 ms
** DoF: 5.84 ms
** Bloom: 1.61 ms. This could probably be lowered by using a compute shader.
** Tonemap: 0.46 ms
** There are also the Point Lights, Transparent and Displacement passes, which are heavily dependent on the scene and are thus not included in the list.
** The fps is below 30, and the sum of all 3D passes adds up to a cost of 27 ms; this means the 2D display is costing something like 5 ms (there is no accurate way to measure it).
* According to GPU PerfStudio 2 we are bandwidth bound only in the DoF pass (which isn't really optimised at the moment). Most shaders thus scale well with the number of instructions.