Demosaic internal tiling #19217

Conversation
force-pushed from 2642a4c to f137dd0
force-pushed from f305c0b to 4640644
Release note: The demosaicer module uses a faster internal tiling variant for the CPU and OpenCL code paths. Dual demosaicing and details blend masks are also supported by the tiling, so there are far fewer fallbacks to CPU code on smaller graphics cards / large raw files.
force-pushed from 81d2f57 to d516152
Did a lot of testing: a) could not spot any problem, b) dual demosaicing is much faster, c) OpenCL with restricted memory is also much faster.
Found an issue with the following scenario:
Need to check this with a plain master branch - currently agx is in my codebase and might have side effects ... so stay tuned.
Hi @jenshannoschwalm, I did a quick test, first on current master [61ebab8] and then on your PR [d516152]. Doing the same thing in your PR crashed; please see the attached "-d all" log. Please drop a line if you need different logging.
Also crashes for me. I only have my R5m2 patches and no agx. Also when switching to "AMaZE dual".
force-pushed from d516152 to 78bb0c3
Fixed for me with the latest commits.
And here is a log with dual-demosaicing full-res exports: dt-tiling-pr-dual-mosaic-pipe-log.txt
I was testing this PR and noticed the diffuse module failing on tiling and going to the CPU path. The issue is on master, and I can reproduce it whenever the central radius is large (more than 400).
This discovery led me to review the pixelpipe in more detail. I'm seeing more modules going into GPU tiling when I don't think that should be the case. I have a 12 GB card and I don't recall it doing this in the past. For example:
Even the kernel loading time for OpenCL seems long. It's midnight here, I'm tired, and I might be overlooking something. I think this should be a new issue.
force-pushed from 821d218 to e5d829d
Found the issue. I had set up dt to
Sorry to be late to the party. Just today I almost finished getting the freaking MS Exchange to cooperate with my Postfix via XOAUTH2 :-( To my setup:
Excerpts from my ~/.config/darktable/darktablerc:
I invoked dt with your commit from today, 5.3.0-286-ge5d829d6ae, like this: darktable -d pipe -d opencl 2>&1 >>darktable-demosaic.log Edited three files: one Nikon Z8, one Nikon Z6III and one Olympus EM5 MKii. And here is the zipped log file. I haven't even looked at it, because only so many days remain before my wife leaves for three weeks :)
force-pushed from ef7bca0 to 33d119e
@piratenpanda if you find time again for a test, I think it's all good now :-)
Can't reproduce anymore.
force-pushed from 33d119e to cd81b8b
Release note suggestion:
@TurboGit I think this is finally good now after a lot of preliminary work.
Thank you, @jenshannoschwalm!
force-pushed from ee1b051 to bd5d396
made it "draft" again as there is some overtiling and problems with low-mem systems |
force-pushed from 93af9d2 to 3c32e32
@TurboGit a lot more testing: a) no observed performance drops, b) no failures due to low memory, c) integration tests are good here. From my side it's finally good and ready for review/merge.
force-pushed from d5aa0ef to 26713a0
Some changes after review. Those were done some days ago but I forgot to publish them!
All demosaicing is now done with internal tiling in the CPU and OpenCL code (if required by available memory and the used algorithms). We use simple horizontal tiles for performance. For the CPU this does not require any copy of input data; we only have to stitch the output data. For OpenCL we have to copy image data before and after the tiling code, but this is fast as the data is contiguous and everything happens in GPU memory. Only if the input/tile height ratio is too large do we fall back to CPU. The pipe's detail mask is calculated from the sharpened output data after internal tiling. If we don't have to tile, there is no performance penalty at all.

In general, the new internal tiling is faster in the vast majority of cases:
- Stitching is much faster, especially with OpenCL. We avoid transfers from/to graphics memory; everything is done in graphics memory. This strategy leads to more tiles, as we have to keep the output buffer for stitching. On my 8 GB NVIDIA card with default settings, a 40 Mpix X-Trans file doing Markesteijn 3-pass with two tiles took ~930 msec; the new internal tiling code does 10 tiles but takes just 860 msec.
- The generic tiling required the costly tiling_roi variants.
- If we want a details blending mask and memory resources would require tiling, we now avoid the CPU fallback, with drastically improved performance.

Also some tiling-related logs and deduplications.
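To make the horizontal-bar idea concrete, here is a minimal CPU-side sketch. `demosaic_tile()`, the buffer layout and all names are hypothetical stand-ins for illustration, not the actual darktable code:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-tile demosaicer: reads `rows` full-width raw rows,
   writes 4-float RGBA pixels to `out`. Stand-in for the real kernel. */
extern void demosaic_tile(const float *in, float *out, int width, int rows);

void demosaic_internal_tiling(const float *in, float *out,
                              int width, int height,
                              int tile_rows, int overlap)
{
  /* scratch buffer for one padded output tile, 4 floats per pixel */
  float *tmp = malloc(sizeof(float) * 4 * (size_t)width * (tile_rows + 2 * overlap));

  for(int y = 0; y < height; y += tile_rows)
  {
    /* pad the bar vertically so the demosaicer sees valid neighbours */
    const int top = (y - overlap > 0) ? y - overlap : 0;
    const int bot = (y + tile_rows + overlap < height) ? y + tile_rows + overlap : height;

    /* full-width bars: point straight into the original raw buffer,
       so no input copy is required */
    demosaic_tile(in + (size_t)top * width, tmp, width, bot - top);

    /* stitching is a plain contiguous copy of the non-overlapping rows */
    const int rows = (y + tile_rows < height) ? tile_rows : height - y;
    memcpy(out + (size_t)4 * y * width,
           tmp + (size_t)4 * (y - top) * width,
           sizeof(float) * 4 * (size_t)rows * width);
  }
  free(tmp);
}
```

Because every tile spans the full width, both the input view and the output stitch are single contiguous memory regions, which is why the stitching cost stays negligible.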
force-pushed from 26713a0 to c544923
Fixed all as requested ...
Thanks!
EDIT 2:
Last part of demosaicer changes for 5.4 after preliminary work has been done.
Until now, tiled demosaic went through the generic _default_process_tiling_roi() variants with much larger overlaps because demosaic does the scaling. Stitching the tiled output was quite costly, especially for OpenCL, as that takes place in main memory.

We now always run demosaic in untiled mode and do the tiling internally. The internal tiles are horizontal "bars" over the full width, so in CPU code we can process each tile from the original raw data and the stitching is just a plain copy; on OpenCL the copy of in/out data is very fast as the full width is used.
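A hedged sketch of why the OpenCL copies stay cheap: a full-width horizontal band of a row-major linear buffer is one contiguous block, so a single on-device clEnqueueCopyBuffer moves it without any transfer to host memory. Buffer names and the linear RGBA-float layout are illustrative assumptions, not darktable's actual OpenCL code:

```c
#include <CL/cl.h>

/* Copy the full-width rows [row, row + rows) between two device buffers.
   The band is contiguous in a row-major layout, so one on-device copy
   suffices and no data touches host memory. Illustrative sketch only. */
static cl_int copy_band(cl_command_queue queue, cl_mem src, cl_mem dst,
                        size_t width, size_t row, size_t rows)
{
  const size_t bytes_per_row = width * 4 * sizeof(float); /* RGBA floats */
  return clEnqueueCopyBuffer(queue, src, dst,
                             row * bytes_per_row,   /* src offset */
                             row * bytes_per_row,   /* dst offset */
                             rows * bytes_per_row,  /* band size  */
                             0, NULL, NULL);
}
```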
Overall