jenshannoschwalm
Collaborator

@jenshannoschwalm jenshannoschwalm commented Aug 18, 2025

EDIT 2:
Last part of demosaicer changes for 5.4 after preliminary work has been done.

  1. a) Dual demosaicing, b) preparing the data required for the details threshold in mask blending and c) full green equilibration required the demosaicer to run in non-tiled mode until now. Especially on systems with restricted OpenCL memory or with large xtrans images, this resulted in fallbacks to CPU code and consequently bad performance.
  2. If tiling was possible we had to use the slower _default_process_tiling_roi() variants with much larger overlaps because demosaic does the scaling. Stitching the tiled output was quite costly, especially for OpenCL, as that takes place in main memory.

We now always run demosaic in untiled mode and do the tiling internally. The internal tiles are horizontal "bars" spanning the full image width. In CPU code each tile is processed directly from the original raw data and stitching is just a plain copy; in OpenCL code the copy of input/output data is very fast because full-width rows are used.
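The "horizontal bars" idea can be illustrated with a minimal Python sketch. It is not darktable's implementation: `plan_bars`, `vertical_blur` and `process_tiled` are hypothetical names, and a 3-row vertical mean stands in for a real demosaicer. Each bar reads a slice of the original rows plus a small overlap, and stitching copies only the rows the bar owns.

```python
from typing import List, Tuple


def plan_bars(height: int, bar_rows: int, overlap: int) -> List[Tuple[int, int, int, int]]:
    """Return (in_top, in_bottom, out_top, out_bottom) row ranges, one per bar.

    The out ranges partition [0, height); the in ranges add `overlap` rows of
    context above and below, clamped to the image.
    """
    bars = []
    for out_top in range(0, height, bar_rows):
        out_bottom = min(out_top + bar_rows, height)
        in_top = max(out_top - overlap, 0)
        in_bottom = min(out_bottom + overlap, height)
        bars.append((in_top, in_bottom, out_top, out_bottom))
    return bars


def vertical_blur(rows: List[List[float]]) -> List[List[float]]:
    """Stand-in for a demosaicer: 3-row vertical mean with edge clamping."""
    h = len(rows)
    out = []
    for y in range(h):
        above, below = rows[max(y - 1, 0)], rows[min(y + 1, h - 1)]
        out.append([(a + b + c) / 3.0 for a, b, c in zip(above, rows[y], below)])
    return out


def process_tiled(image: List[List[float]], bar_rows: int, overlap: int = 1) -> List[List[float]]:
    """Process full-width bars one at a time and stitch them by plain row copies."""
    height = len(image)
    result: List[List[float]] = [[] for _ in range(height)]
    for in_top, in_bottom, out_top, out_bottom in plan_bars(height, bar_rows, overlap):
        # Each bar is a slice of the original rows, so no separate input copy is prepared.
        tile_out = vertical_blur(image[in_top:in_bottom])
        # Stitching: keep only the rows this bar owns and drop the overlap rows.
        skip = out_top - in_top
        result[out_top:out_bottom] = tile_out[skip:skip + (out_bottom - out_top)]
    return result
```

With an overlap at least as large as the filter's vertical reach, the tiled result matches the untiled one row for row, which is what makes the stitching a plain copy.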

Overall

  1. the performance does not change if there is no tiling
  2. if we must tile the performance is always better
  3. dual demosaicing, details mask and green equilibration are handled internally
  4. Latest commits include OpenCL internal tiling

@jenshannoschwalm jenshannoschwalm added this to the 5.4 milestone Aug 18, 2025
@jenshannoschwalm jenshannoschwalm added the labels wip (pull request in making, tests and feedback needed), bug: wip (someone is currently working on that, check with them before taking over), scope: image processing (correcting pixels), scope: performance (doing everything the same but faster) and OpenCL (Related to darktable OpenCL code) Aug 18, 2025
@jenshannoschwalm jenshannoschwalm marked this pull request as draft August 18, 2025 05:03
@jenshannoschwalm jenshannoschwalm force-pushed the demosaic_internal_tiling branch 6 times, most recently from 2642a4c to f137dd0 Compare August 24, 2025 02:56
@jenshannoschwalm jenshannoschwalm force-pushed the demosaic_internal_tiling branch 4 times, most recently from f305c0b to 4640644 Compare August 31, 2025 01:18
@jenshannoschwalm jenshannoschwalm removed the labels wip (pull request in making, tests and feedback needed) and bug: wip (someone is currently working on that, check with them before taking over) Aug 31, 2025
@jenshannoschwalm
Collaborator Author

Release note: The demosaicer module uses a faster internal tiling variant for CPU and OpenCL codepaths. Dual demosaicing and details blend masks are also supported by the tiling, so there are far fewer fallbacks to CPU code on smaller graphics cards or with large raw files.

@jenshannoschwalm jenshannoschwalm force-pushed the demosaic_internal_tiling branch 3 times, most recently from 81d2f57 to d516152 Compare August 31, 2025 16:04
@jenshannoschwalm jenshannoschwalm marked this pull request as ready for review August 31, 2025 17:06
@jenshannoschwalm
Collaborator Author

jenshannoschwalm commented Aug 31, 2025

Did a lot of testing: a) could not spot any problem, b) dual demosaicing is much faster, c) OpenCL with restricted memory is also much faster.
Pinging some people in the hope of tests, as they have tested OpenCL code before or have special OpenCL settings iirc: @MStraeten @piratenpanda @sarunasb @gi-man @Macchiato17 @kofa73 (if you find any time) @AxelG-DE @da-phil

@MStraeten
Collaborator

Found an issue with the following scenario:
switched to dual demosaic rcd(dual), then to amaze, resulting in a crash:

Process 14946 stopped
* thread #14, name = 'worker res 0', stop reason = EXC_BAD_ACCESS (code=1, address=0x3d7b5fdd3d810fea)
    frame #0: 0x00000001201c6fdc libdemosaic.so`::amaze_demosaic(dt_dev_pixelpipe_iop_t *, const float *const, float *, const int, const int, uint32_t) [inlined] dt_iop_get_processed_minimum(piece=<unavailable>) at imageop_math.h:176:17 [opt]
   173 	{
   174 	  return  fmaxf(1.0f,
   175 	          fminf(piece->pipe->dsc.processed_maximum[0],
-> 176 	          fminf(piece->pipe->dsc.processed_maximum[1],
   177 	                piece->pipe->dsc.processed_maximum[2])));
   178 	}
   179
Target 0: (darktable) stopped.
warning: libdemosaic.so was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) up
frame #1: 0x00000001201c6fdc libdemosaic.so`amaze_demosaic(piece=0x00000003a39d4000, in=0x0000000370f18000, out=0x0000000000001780, width=4021, height=-1802201964, filters=1) at amaze.cc:136:25 [opt]
   133 	                    const int height,
   134 	                    const uint32_t filters)
   135 	{
-> 136 	  const float clip_pt = dt_iop_get_processed_minimum(piece);
   137 	  const float clip_pt8 = 0.8f * clip_pt;
   138
   139 	// this allows to pass AMAZETS to the code. On some machines larger AMAZETS is faster
(lldb)
frame #2: 0x00000001201a997c libdemosaic.so`process(self=<unavailable>, piece=0x00000001409e44c0, i=<unavailable>, o=0x0000000154000000, roi_in=<unavailable>, roi_out=0x0000000171284190) at demosaic.c:688:11 [opt]
   685 	        else if(method != DT_IOP_DEMOSAIC_AMAZE)
   686 	          demosaic_ppg(t_out, t_in, width, t_height, filters, d->median_thrs);
   687 	        else
-> 688 	          amaze_demosaic(t_in, t_out, width, t_height, filters, procmin);
   689 	      }
   690
   691 	      if(do_capture)
(lldb) (lldb)
warning: libdarktable.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
frame #3: 0x00000001008ac78c libdarktable.dylib`_pixelpipe_process_on_CPU(pipe=0x000000015111a800, dev=<unavailable>, input=0x00000003a39d4000, input_format=<unavailable>, roi_in=0x0000000171283e00, output=<unavailable>, out_format=<unavailable>, roi_out=0x0000000171284190, module=0x000000015132e000, piece=0x00000001409e44c0, tiling=0x0000000171283db0, pixelpipe_flow=0x0000000171283dd4, position=9) at pixelpipe_hb.c:1410:7 [opt]
   1407	    }
   1408	    else
   1409	    {
-> 1410	      module->process(module, piece, input, *output, roi_in, roi_out);
   1411	      if(relevant)
   1412	      {
   1413	        if(pipe->mask_display == DT_DEV_PIXELPIPE_DISPLAY_NONE
(lldb)
frame #4: 0x00000001008a9954 libdarktable.dylib`_dev_pixelpipe_process_rec(pipe=0x000000015111a800, dev=0x0000000151119c00, output=<unavailable>, cl_mem_output=<unavailable>, out_format=0x0000000171284178, roi_out=0x0000000171284190, modules=<unavailable>, pieces=<unavailable>, pos=9) at pixelpipe_hb.c:2692:10 [opt]
   2689	        valid_input_on_gpu_only = FALSE;
   2690	      }
   2691
-> 2692	      if(_pixelpipe_process_on_CPU(pipe, dev, input, input_format, &roi_in,
   2693	                                   output, out_format,
   2694	                                   roi_out, module, piece, &tiling, &pixelpipe_flow, pos))
   2695	        return TRUE;
(lldb)

I need to check this with a plain master branch. Currently agx is in my codebase and might have side effects ... so stay tuned.

@Macchiato17
Contributor

Macchiato17 commented Aug 31, 2025

Hi @jenshannoschwalm, I did a quick test, first in current master [61ebab8] and then in your PR [d516152]. Doing the same thing in your PR crashed; please see the attached "-d all" log. Please drop me a line if you need different logging.
darktable_dualDemosaicTest.zip

@piratenpanda
Contributor

piratenpanda commented Sep 1, 2025

> need to check this with a plain master branch - currently agx is in my codebase and might have side effect ... so stay tuned

Also crashes for me. I only have my R5m2 patches and no agx. It also happens when switching to "AMaZE (dual)".

@piratenpanda
Contributor

fixed for me with latest commits

@da-phil
Contributor

da-phil commented Sep 4, 2025

Here is my report for 3 images from different cameras and the following HW setup:

Hardware Information:

* **Processor:**                                   AMD Ryzen™ 7 8845HS

* **Graphics:**                                    AMD Radeon™ 780M Graphics × 16

Software Information:

* **OS Name:**                                     Ubuntu 24.04.3 LTS

* **Windowing System:**                            X11

* **Kernel Version:**                              Linux 6.15.11-061511-generic

* **OpenCL backend:**                      ROCr from ROCM-6.4.1

I didn't experience any issue and had the impression that the export was quite speedy! Well done Hanno 🙇

Here is the log file with a camera comment for each 2048px export: dt-tiling-pr-pipe-log.txt

And here are some full-res exports: dt-tiling-pr-full-res-pipe-log.txt

And here is a log with dual-demosaicing full-res exports: dt-tiling-pr-dual-mosaic-pipe-log.txt

@gi-man
Contributor

gi-man commented Sep 5, 2025

I was testing this PR and noticed the diffuse module failing on tiling and going to the CPU path. The issue is on master and I can reproduce whenever the central radius is large (more than 400).

   507.2148 process tiles             CL0 [full]           diffuse                3500   (798/370)  1865x1133 sc=0.579; IOP_CS_RGB
   507.2148 Error: process_tiling     CL0 [full]           diffuse                3500   (798/370)  1865x1133 sc=0.579; device=0 (nvidiacudanvidiageforcertx3060), DT_OPENCL_PROCESS_CL
   507.2148 pipe aborts               CL0 [full]           diffuse                3500   (798/370)  1865x1133 sc=0.579; couldn't run module on GPU, falling back to CPU
   507.2149 process                   CPU [full]           diffuse                3500   (798/370)  1865x1133 sc=0.579; IOP_CS_RGB 549MB
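Fallbacks like the one above can be spotted in a long `-d pipe` log with a small script. The sketch below is Python; the regular expression is inferred only from the sample lines quoted here and is an assumption about the log layout, not a documented format. `gpu_fallbacks` is a hypothetical helper name.

```python
import re
from typing import Iterable, List, Tuple

# Layout assumed from the sample lines: time, event text, device, [pipe], module.
LINE = re.compile(
    r"^\s*(?P<time>\d+\.\d+)\s+(?P<event>.+?)\s+(?P<device>CL\d+|CPU)"
    r"\s+\[(?P<pipe>\w+)\]\s+(?P<module>\S+)"
)


def gpu_fallbacks(lines: Iterable[str]) -> List[Tuple[float, str]]:
    """Return (timestamp, module) for every line reporting a fallback to CPU."""
    hits = []
    for line in lines:
        match = LINE.match(line)
        if match and "falling back to CPU" in line:
            hits.append((float(match["time"]), match["module"]))
    return hits


sample = [
    "   507.2148 process tiles             CL0 [full]           diffuse                3500   (798/370)  1865x1133 sc=0.579; IOP_CS_RGB",
    "   507.2148 pipe aborts               CL0 [full]           diffuse                3500   (798/370)  1865x1133 sc=0.579; couldn't run module on GPU, falling back to CPU",
    "   507.2149 process                   CPU [full]           diffuse                3500   (798/370)  1865x1133 sc=0.579; IOP_CS_RGB 549MB",
]
print(gpu_fallbacks(sample))
```

Feeding it the lines of a log file (for example via `open(path)`) lists which modules dropped to the CPU path and when.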

This discovery led me to review the pixelpipe in more detail. I'm seeing more modules going into GPU tiling when I don't think that should be the case. I have a 12GB card and I don't recall it doing this in the past. For example:

30.2292 process tiles             CL0 [full]           denoiseprofile         1000  (1475/604)  2206x1346 sc=0.770; IOP_CS_RGB
30.2292 process *tiled* ptp       CL0 [full]           denoiseprofile         1000  (1475/604)  2206x1346 sc=0.770; 3x1 tiles, size=928x1090

Even the kernel loading time for OpenCL seems long. It's midnight here, I'm tired and I might be overlooking something. I think this should be a new issue.

@jenshannoschwalm
Collaborator Author

  1. The updated force-pushed version hopefully fixes the issues reported by @piratenpanda.
  2. I am sure the tiling observed by @gi-man is not related to this PR or the recently introduced demosaic changes. It could be related to the new darkroom canvas (full pipe) visualisation, which allows dragging the image by 20% without a pipe run, as this increases the memory requirements by 40% (it depends on the hidden setting, but the default is 20%). Anyway, tiling requirements are calculated inside each module, so there could be problems we are not yet aware of.
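One possible reading of the 20% and 40% figures, under the assumption (not stated in this thread) that the margin enlarges both the width and the height of the processed region by the configured fraction: buffer memory scales with pixel count, which grows with the square of the linear factor. `area_growth` is a hypothetical helper for this arithmetic only.

```python
def area_growth(margin: float) -> float:
    """Relative increase in pixel count when width and height both grow by `margin`."""
    return (1.0 + margin) ** 2 - 1.0


# Assumed default margin of 20% in each dimension.
print(f"{area_growth(0.20):.0%}")  # prints 44%
```

Under that assumption a 20% margin costs roughly 44% more pixels, which is in the same range as the 40% mentioned above.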

@gi-man
Contributor

gi-man commented Sep 5, 2025

> I might be overlooking something

Found the issue. I had set up dt for small resources before I started using --conf resourcelevel="notebook". This explains why I had so many modules going into tiling. I will investigate the central radius issue with D&S and open an issue outside this thread.

@AxelG-DE

AxelG-DE commented Sep 5, 2025

@jenshannoschwalm

sorry to be late to the party. Just today I almost finished the freaking MS exchange to cooperate with my postfix via XOAUTH2 :-(

To my setup:

  • The GTX1060 carries the two monitors
  • The RTX2070super carries nothing, hence I dare to run a rather aggressive memory setting
  • OS : Linux - kernel 6.16.2-gentoo
  • Distro : Gentoo Base System release 2.17
  • Processor : Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
  • Memory : 32 GB (4 x 8 GB) + 5GB Swap
  • Graphics card0 : NVIDIA GeForce GTX 1060 6GB
    # do not get confused by my "priorities" settings below; I checked in detail, the numbering scheme differs between tools
  • Graphics card1 : NVIDIA GeForce RTX 2070 SUPER
  • Graphics driver : nvidia-drivers-575.64.05
  • OpenCL installed : Yes (opencl-headers-2024.10.24)
  • OpenCL activated : Yes
  • Xorg : xorg-server-21.1.18
  • Desktop : KDE 6
  • GTK+ : gtk+-3.24.49-r1
  • gcc : 14.3.0
  • cflags : CMAKE_FLAGS="-march=native -O2 -mtune=native -pipe"
  • CMAKE_BUILD_TYPE : "Release"

excerpts from my ~/.config/darktable/darktablerc:

gerber@brain darktable $ cat darktablerc | grep prio
opencl_device_priority=0,1/1,*/0,1,*/0,1,*/*
gerber@brain darktable $ cat darktablerc | grep opencl_scheduling_profile
opencl_scheduling_profile=default
gerber@brain darktable $ cat darktablerc | grep opencl
clplatform_intelropenclhdgraphics=TRUE
clplatform_openclon12=FALSE
opencl=FALSE
opencl_async_pixelpipe=true
opencl_avoid_atomics=false
opencl_building_gpu0=-cl-mad-enable -cl-no-signed-zeros -cl-unsafe-math-optimizations -cl-finite-math-only -cl-fast-relaxed-math
opencl_building_gpu1=-cl-mad-enable -cl-no-signed-zeros -cl-unsafe-math-optimizations -cl-finite-math-only -cl-fast-relaxed-math
opencl_building_gpu2=-cl-mad-enable -cl-no-signed-zeros -cl-unsafe-math-optimizations -cl-finite-math-only -cl-fast-relaxed-math
opencl_checksum=1983112761
opencl_device_priority=0,1/1,*/0,1,*/0,1,*/*
opencl_disable_drivers_blacklist=true
opencl_library=
opencl_mandatory_timeout=100
opencl_memory_headroom=600
opencl_memory_requirement=1024
opencl_micro_nap=70
opencl_number_event_handles=1028
opencl_scheduling_profile=default
opencl_size_roundup=16
opencl_synch_cache=active module
opencl_tune_headroom=TRUE
opencl_tuning_mode=memory size
opencl_use_cpu_devices=false
opencl_use_pinned_memory=true
tuneopencl=TRUE
gerber@brain darktable $ cat darktablerc | grep device
cldevice_v5_nvidiacudanvidiageforcegtx10606gb=0 70 0 16 16 1024 1 0 0.000 0.000 0.250
cldevice_v5_nvidiacudanvidiageforcegtx10606gb_building=-cl-fast-relaxed-math
cldevice_v5_nvidiacudanvidiageforcegtx10606gb_id1=800
cldevice_v5_nvidiacudanvidiageforcertx2070super=0 10 0 16 16 1024 1 0 0.000 0.000 0.250
cldevice_v5_nvidiacudanvidiageforcertx2070super_building=-cl-fast-relaxed-math
cldevice_v5_nvidiacudanvidiageforcertx2070super_id0=400
opencl_device_priority=0,1/1,*/0,1,*/0,1,*/*
opencl_use_cpu_devices=false
plugins/lighttable/midi/devices=
plugins/midi/devices=
gerber@brain darktable $ cat darktablerc | grep memory
cache_memory=282460585984
host_memory_limit=16017
opencl_memory_headroom=600
opencl_memory_requirement=1024
opencl_tuning_mode=memory size
opencl_use_pinned_memory=true
plugins/lighttable/preview/max_in_memory_images=4
gerber@brain darktable $ cat darktablerc | grep large
compress_xmp_tags=only large entries
plugins/darkroom/enlargecanvas/expanded=
plugins/darkroom/enlargecanvas/favorite=FALSE
plugins/darkroom/enlargecanvas/visible=FALSE
resource_large=700 64 128 900
resourcelevel=large
ui_last/colorpicker_large=
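To make the `opencl_device_priority` value above easier to read, here is a small Python sketch that splits it per pipe. It assumes the five slash-separated fields map to the image, preview, export, thumbnail and preview2 pipes in that order, as I understand the darktable user manual; `parse_priority` and `PIPES` are hypothetical names, and prefixes such as `!` or `+` are left untouched in the device entries.

```python
from typing import Dict, List

# Assumed field order; check the darktable manual for your version.
PIPES = ("image", "preview", "export", "thumbnail", "preview2")


def parse_priority(value: str) -> Dict[str, List[str]]:
    """Split an opencl_device_priority string into per-pipe device lists."""
    groups = value.split("/")
    if len(groups) != len(PIPES):
        raise ValueError(f"expected {len(PIPES)} fields, got {len(groups)}")
    return {pipe: group.split(",") for pipe, group in zip(PIPES, groups)}


for pipe, devices in parse_priority("0,1/1,*/0,1,*/0,1,*/*").items():
    print(f"{pipe:10s} {devices}")
```

Read this way, the setting above would prefer devices 0 then 1 for the image pipe and accept any device for preview2.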

I invoked dt with today's commit 5.3.0-286-ge5d829d6ae like this:

darktable -d pipe -d opencl 2>&1 >>darktable-demosaic.log

Edited three files: one Nikon Z8, one Nikon Z6iii and one Olympus EM5 MKii

And here is the zipped log file:
darktable-demosaic.log.zip

I haven't even looked at it, because there are only so many days before my wife leaves for three weeks :)

@jenshannoschwalm jenshannoschwalm force-pushed the demosaic_internal_tiling branch 2 times, most recently from ef7bca0 to 33d119e Compare September 6, 2025 06:24
@jenshannoschwalm
Collaborator Author

@piratenpanda if you find time again for a test, i think it's all good now :-)

@piratenpanda
Contributor

can't reproduce anymore

@jenshannoschwalm
Collaborator Author

jenshannoschwalm commented Sep 6, 2025

Release note suggestion:

The demosaicer module got some maintenance for slight performance gains.
It also now does internal tiling for large images / low memory instead of the generic tiling strategy.
This results in better performance in many cases especially with dual demosaicing or when using a
details blending mask (that made tiling impossible).

@TurboGit i think this is finally good now after a lot of preliminary work.
There is at least one idea for further performance improvements but we need more results with this code.

@sarunasb
Contributor

sarunasb commented Sep 6, 2025

demoasicer
→ demosaicer

Thank you, @jenshannoschwalm !

@jenshannoschwalm jenshannoschwalm force-pushed the demosaic_internal_tiling branch 2 times, most recently from ee1b051 to bd5d396 Compare September 7, 2025 06:23
@jenshannoschwalm jenshannoschwalm marked this pull request as draft September 9, 2025 05:24
@jenshannoschwalm
Collaborator Author

made it "draft" again as there is some overtiling and problems with low-mem systems

@jenshannoschwalm jenshannoschwalm force-pushed the demosaic_internal_tiling branch 2 times, most recently from 93af9d2 to 3c32e32 Compare September 10, 2025 06:02
@jenshannoschwalm jenshannoschwalm marked this pull request as ready for review September 10, 2025 06:03
@jenshannoschwalm
Collaborator Author

@TurboGit a lot more testing, a) no observed performance drops b) no failing due to low memory c) integration tests are good here.

From my side it's finally good and ready for review/merge.

@jenshannoschwalm jenshannoschwalm force-pushed the demosaic_internal_tiling branch 2 times, most recently from d5aa0ef to 26713a0 Compare September 10, 2025 15:55
Member

@TurboGit TurboGit left a comment

Some changes after review. Those were done some days ago but I forgot to publish them!

All demosaicing is now done with internal tiling in CPU and OpenCL code (if required by available
memory and used algorithms).

We use simple horizontal tiles for performance.
For CPU this does not require any copy of input data and we only have to stitch output data.
For OpenCL we have to copy image data before and after the tiling code but this is fast as data
is contiguous and all happens on gpu memory.
Only if the input/tile height ratio is too large we do a fallback to CPU.

The pipe's details mask is calculated and written from the sharpened output data after internal tiling.

If we don't have to tile, there is no performance penalty at all.
In general, the new internal tiling is faster in the vast majority of cases,
- stitching is much faster especially with OpenCL.
  We avoid transfer from/to graphics memory, all is done in graphics memory.
  This strategy leads to more tiles as we have to keep the output buffer for stitching.
  On my 8GB nvidia card with default settings, a 40mpix xtrans image demosaiced with
  Markesteijn 3-pass took ~930msec in two tiles with the generic tiling;
  the new internal tiling code uses 10 tiles but takes just 860msec.
- the generic tiling required the costly tiling_roi variants
- if we want a details blending mask and memory resources would require tiling, we now avoid
  the CPU fallback, which drastically improves performance.

Also includes some tiling-related logs and code deduplication.
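Why keeping the full output buffer resident leads to more tiles, and when a CPU fallback would kick in, can be sketched with a simple memory budget. This is a Python illustration under loud assumptions: `plan_tiles`, the buffer sizes (4-byte single-channel input, 16-byte four-channel output), the per-pixel working-set cost and the `max_tiles` limit are all hypothetical and not taken from darktable's code.

```python
import math
from typing import Optional


def plan_tiles(width: int, height: int, budget_bytes: int,
               work_bytes_per_pixel: int, overlap_rows: int,
               keep_output: bool = True, max_tiles: int = 64) -> Optional[int]:
    """Return the number of horizontal tiles, or None to signal a CPU fallback."""
    in_bytes = width * height * 4                               # assumed: float raw input
    out_bytes = width * height * 4 * 4 if keep_output else 0   # assumed: 4 x float output
    free = budget_bytes - in_bytes - out_bytes
    row_cost = width * work_bytes_per_pixel                     # working set for one tile row
    tile_rows = free // row_cost if free > 0 else 0
    usable = tile_rows - 2 * overlap_rows                       # rows that become output
    if usable <= 0:
        return None
    tiles = math.ceil(height / usable)
    # Too many thin tiles: give up on the device and let the caller use the CPU.
    return tiles if tiles <= max_tiles else None
```

With made-up numbers (8000x5000 pixels, a 1.2 GB budget, 40 bytes of working memory per pixel, 10 overlap rows) the sketch yields more tiles when the output buffer stays resident than when it does not, mirroring the trade-off described above; the timings quoted in this comment come from the author's measurements, not from this model.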
@jenshannoschwalm
Collaborator Author

fixed all as requested ...

Member

@TurboGit TurboGit left a comment

Thanks!

@TurboGit TurboGit merged commit 7fbf0a9 into darktable-org:master Sep 11, 2025
6 checks passed
@jenshannoschwalm jenshannoschwalm deleted the demosaic_internal_tiling branch September 12, 2025 04:19
Labels
OpenCL (Related to darktable OpenCL code), scope: image processing (correcting pixels), scope: performance (doing everything the same but faster)
10 participants