Description
I am running RTPEngine mr6.5.4.2, built from source on EL7, plus rtpengine-recording (the recording daemon) from the same suite. The libav* dependencies come from the nux-dextop repo. RTPEngine writes frames into the /proc sink (--recording-method=proc), and the recording daemon writes out mixed mono WAVs with file-only metadata and no DB. The full set of invocation options is:
/usr/local/sbin/rtpengine-recording \
--spool-dir=/recordings \
--output-storage=file \
--output-dir=/recordings \
--output-format=wav \
--output-mixed \
--pidfile=/var/run/rtpengine-recording.pid
What I am seeing is runaway growth in the number of worker threads spawned by the recording daemon, wildly disproportionate to the number of RTPEngine targets:
# cat /proc/rtpengine/0/status
Refcount: 1
Control PID: 3131
Targets: 72
# ps aux | grep -i rtpengine-rec
root 8635 19.4 5.3 4172872 416356 ? Sl 18:48 18:03 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
root 25573 0.0 0.0 112712 996 pts/0 S+ 20:21 0:00 grep --color=auto -i rtpengine-rec
# ps -p 8635 -lfT | wc -l
418
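To track whether the thread count keeps climbing, a minimal sketch like the following can sample it over time. It assumes Linux /proc, and the helper names thread_count and watch_threads are hypothetical, not part of rtpengine. It reads the kernel's own count from the "Threads:" line of /proc/<pid>/status, which avoids the header line that ps -lfT prints (so the 418 above corresponds to 417 threads).

```shell
#!/bin/sh
# Hypothetical helper (not part of rtpengine): read the kernel's thread
# count for a PID from the "Threads:" line of /proc/<pid>/status.
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}

# Hypothetical helper: print "HH:MM:SS <threads>" every <interval> seconds,
# <samples> times, to see whether the count keeps growing.
# Usage: watch_threads <pid> [interval-seconds] [samples]
watch_threads() {
    pid=$1
    interval=${2:-60}
    samples=${3:-10}
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf '%s %s\n' "$(date '+%H:%M:%S')" "$(thread_count "$pid")"
        i=$((i + 1))
        # Skip the sleep after the last sample.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, watch_threads 8635 300 12 would take one sample every five minutes for an hour.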
Almost all of these threads appear to be blocked in a futex wait (WCHAN futex_), so I assume some sort of deadlock, e.g.:
1 S root 8635 25622 1 0 80 0 - 1047316 futex_ 20:22 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root 8635 25623 1 0 80 0 - 1047316 futex_ 20:22 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root 8635 25625 1 0 80 0 - 1047316 futex_ 20:22 ? 00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
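Rather than eyeballing hundreds of ps lines, the wait channels can be tallied per thread. This is a minimal sketch assuming Linux /proc; thread_wchan_summary is a hypothetical helper name. It reads /proc/<pid>/task/<tid>/wchan, the kernel function each thread is sleeping in, and counts the distinct values.

```shell
#!/bin/sh
# Hypothetical helper: tally the kernel wait channel (wchan) of every thread
# of a process, most common first. A count dominated by a futex_* symbol
# means most threads are sleeping on a futex (e.g. a mutex or condvar wait).
thread_wchan_summary() {
    for task in /proc/"$1"/task/*; do
        # wchan has no trailing newline; it may read "0" for a running thread
        # or be unreadable without sufficient privileges, shown here as "?".
        wchan=$(cat "$task/wchan" 2>/dev/null)
        printf '%s\n' "${wchan:-?}"
    done | sort | uniq -c | sort -rn
}
```

Note that idle worker threads waiting on a condition variable also sleep in a futex wait, so the tally alone does not distinguish a deadlock from an ever-growing pool of idle threads. User-space stacks, e.g. from gdb -p 8635 -batch -ex 'thread apply all bt' (assuming debug symbols are available), would show which lock or queue each thread is waiting on.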
We detected the issue when the recording daemon started logging "Too many open files" errors, which struck me as curious given the relatively small number of concurrent streams being recorded and the fact that the recording daemon runs with EUID/EGID root.
However, I found that every one of those LWPs has several hundred open descriptors. For instance, PID 8635 above:
# cd /proc/8635/fd
# ls -w 5 | wc -l
291
The same seems to hold for all the LWPs:
# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done
SPID: find: ‘/proc/SPID/fd’: No such file or directory
0
8635: 284
8636: 284
8637: 284
8638: 284
8639: 284
8640: 284
[... same all the way down the line ...]
Since the descriptor count is identical across all the LWPs, I assume the descriptors are cloned into every LWP. Regardless, summed across all the LWPs of the process, this gives a rather large cumulative descriptor count:
# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done | awk '{print $2}' | awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum }'
find: ‘/proc/SPID/fd’: No such file or directory
110826
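A variant of the counting loop that iterates over /proc/<pid>/task instead of parsing ps output avoids the "SPID" header error above, and counting the entries inside fd/ (rather than find's output, which also prints the directory itself) removes an off-by-one per thread. This is a sketch assuming Linux /proc; fd_counts_per_task and fd_count_process are hypothetical helper names. On Linux, threads created with pthread_create share a single descriptor table (the CLONE_FILES clone flag), so each /proc/<tid>/fd listing is expected to show the same table; fd_count_process reports that table once for comparison with the per-thread sum.

```shell
#!/bin/sh
# Hypothetical helper: one line per thread, "<tid>: <count>", iterating over
# /proc/<pid>/task so no ps header line ends up in the loop. Only the entries
# inside fd/ are counted, not the directory itself.
fd_counts_per_task() {
    for task in /proc/"$1"/task/*; do
        n=$(ls "$task/fd" 2>/dev/null | wc -l | tr -d ' ')
        printf '%s: %s\n' "${task##*/}" "$n"
    done
}

# Hypothetical helper: size of the process-wide descriptor table, counted once.
fd_count_process() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}
```

For example, fd_counts_per_task 8635 | awk '{ sum += $2 } END { print sum }' gives the per-thread sum, and fd_count_process 8635 gives the single-table figure.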
The number of LWPs increases steadily; it had reached a peak of 1200 when we restarted the recording daemon. At that point, we seem to have run into the system-wide FD limit:
# cat /proc/sys/fs/file-max
763006
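Two different limits can produce descriptor errors: the per-process RLIMIT_NOFILE, whose errno EMFILE glibc reports as "Too many open files", and the system-wide file-max, whose errno ENFILE is reported as "Too many open files in system". A minimal sketch for checking both, assuming Linux /proc (system_file_handles and process_nofile_limit are hypothetical helper names):

```shell
#!/bin/sh
# Hypothetical helper: /proc/sys/fs/file-nr holds three numbers: allocated
# file handles, allocated-but-unused handles, and the maximum (file-max).
system_file_handles() {
    read -r allocated unused max < /proc/sys/fs/file-nr
    printf 'allocated=%s unused=%s max=%s\n' "$allocated" "$unused" "$max"
}

# Hypothetical helper: soft and hard RLIMIT_NOFILE, taken from the
# "Max open files" row of /proc/<pid>/limits.
process_nofile_limit() {
    awk '/^Max open files/ { print "soft=" $4, "hard=" $5 }' "/proc/$1/limits"
}
```

Running system_file_handles and process_nofile_limit 8635 when the errors appear would show which of the two limits is actually being reached.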
This situation appears to play out regardless of whether the recording daemon is invoked with an explicit --num-threads=... or left at the default (as now).
There is nothing interesting in the logs until the "Too many open files" messages start, just fairly routine entries such as:
INFO: [C 2fcb0ec6-e8ef-4e84-8fda-163e9ac7626d-94e8e401f7b2a8ed.meta] [S tag-1-media-1-component-2-RTCP-id-1] EOF on stream tag-1-media-1-component-2-RTCP-id-1
And:
WARNING: [C 63b7c294-a546-4623-a034-6d2b26f54cc3-63ab5712488d039b.meta] [S tag-0-media-1-component-1-RTP-id-2] [0x554f12d] Cannot decode RTP payload type 101 (telephone-event/8000)