All demosaicing is now done with internal tiling in both the CPU and OpenCL code paths
(whenever the available memory and the chosen algorithm require it).
We use simple horizontal tiles for performance.
For CPU this does not require any copy of input data and we only have to stitch output data.
For OpenCL we have to copy image data before and after the tiling code, but this is fast because
the data is contiguous and everything happens in GPU memory.
We fall back to the CPU only if the input/tile height ratio is too large.
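
The horizontal tiling described above can be sketched roughly as follows. This is an illustration only, not the actual darktable code: `filter_rows` is a hypothetical stand-in for a demosaicer (a vertical 3-tap average), and `OVERLAP`, `tile_rows` and `max_tiles` are assumed names. Each tile reads its rows directly from the input (no input copy), and only the valid rows of the tile output are stitched into the final buffer. If too many tiles would be needed, the function reports that the caller should fall back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: the filter needs one row of context above and below. */
#define OVERLAP 1

/* Hypothetical stand-in for a demosaicer: vertical 3-tap average over a
   buffer of `height` rows, clamping at the top and bottom border. */
static void filter_rows(const float *in, float *out, int width, int height)
{
  for(int y = 0; y < height; y++)
  {
    const int ym = y > 0 ? y - 1 : 0;
    const int yp = y < height - 1 ? y + 1 : height - 1;
    for(int x = 0; x < width; x++)
      out[(size_t)y * width + x] = (in[(size_t)ym * width + x]
                                    + in[(size_t)y * width + x]
                                    + in[(size_t)yp * width + x]) / 3.0f;
  }
}

/* Process the image in horizontal tiles of `tile_rows` valid rows each.
   Returns the number of tiles used, or -1 if the caller should fall back
   (too many tiles, or the tile buffer could not be allocated). */
static int filter_tiled(const float *in, float *out, int width, int height,
                        int tile_rows, int max_tiles)
{
  assert(width > 0 && height > 0 && tile_rows > 0 && max_tiles > 0);
  const int num_tiles = (height + tile_rows - 1) / tile_rows;
  if(num_tiles > max_tiles) return -1; /* input/tile height ratio too large */

  float *tile_out = malloc(sizeof(float) * width * (tile_rows + 2 * OVERLAP));
  if(!tile_out) return -1;

  for(int t = 0; t < num_tiles; t++)
  {
    /* valid output rows [first, last) of this tile */
    const int first = t * tile_rows;
    const int last = first + tile_rows < height ? first + tile_rows : height;
    /* input rows including the overlap, clamped to the image */
    const int in_first = first - OVERLAP > 0 ? first - OVERLAP : 0;
    const int in_last = last + OVERLAP < height ? last + OVERLAP : height;

    /* rows are contiguous, so the tile input is just an offset pointer */
    filter_rows(in + (size_t)in_first * width, tile_out, width, in_last - in_first);

    /* stitch: copy only the valid rows, skipping the overlap rows */
    memcpy(out + (size_t)first * width,
           tile_out + (size_t)(first - in_first) * width,
           sizeof(float) * width * (last - first));
  }
  free(tile_out);
  return num_tiles;
}
```

Because the tiles are full-width row ranges, both the tile input and the stitched output are contiguous in memory, so stitching is a single `memcpy` per tile in this sketch.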
The pipe's details mask is calculated from the sharpened output data and written after internal tiling.
If we don't have to tile, there is no performance penalty at all.
The new internal tiling is faster in the vast majority of cases:
- stitching is much faster, especially with OpenCL.
We avoid transfers from/to graphics memory; everything is done in graphics memory.
This strategy leads to more tiles as we have to keep the output buffer for stitching.
On my 8GB NVIDIA card with default settings, a 40mpix X-Trans image doing Markesteijn 3-pass
previously took ~930msec with two tiles; the new internal tiling code does 10 tiles but takes just 860msec.
- the generic tiling required the costly tiling_roi variants.
- if we want a details blending mask and memory resources would require tiling, we now avoid
the CPU fallback, which drastically improves performance.
Some tiling-related logs and deduplications.