Runaway LWPs/threads on recording daemon #808

@abalashov

I am running RTPEngine mr6.5.4.2, built from source on EL7, together with the recording daemon from the same suite. The libav* dependencies come from the nux-dextop repo. RTPEngine writes frames into the /proc sink (--recording-method=proc), and the recording daemon writes out mixed mono WAVs with file-only metadata and no DB. The full invocation is:

/usr/local/sbin/rtpengine-recording \
   --spool-dir=/recordings \
   --output-storage=file \
   --output-dir=/recordings \
   --output-format=wav \
   --output-mixed \
   --pidfile=/var/run/rtpengine-recording.pid

What I am seeing is runaway growth in the number of worker threads spawned by the recording daemon, wildly disproportionate to the number of RTPEngine targets:

# cat /proc/rtpengine/0/status 
Refcount:    1
Control PID: 3131
Targets:     72
# ps aux | grep -i rtpengine-rec
root      8635 19.4  5.3 4172872 416356 ?      Sl   18:48  18:03 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-recording.pid
root     25573  0.0  0.0 112712   996 pts/0    S+   20:21   0:00 grep --color=auto -i rtpengine-rec
# ps -p 8635 -lfT | wc -l
418
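
The same count can be taken without parsing ps output (which also counts the header line) by listing /proc/<pid>/task, where the kernel exposes one directory per LWP. This is a minimal sketch; the function name count_threads and its optional second argument (an alternate proc root, useful only for testing) are hypothetical and not part of RTPEngine.

```shell
# Count the LWPs of a process by counting the entries in /proc/<pid>/task.
# $1 = PID, $2 = proc root (hypothetical override, defaults to /proc).
count_threads() {
    pid=$1
    proc_root=${2:-/proc}
    n=0
    for d in "$proc_root/$pid/task"/*/; do
        # If the glob matches nothing, $d is the literal pattern and -d fails.
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example: count_threads 8635
```

Running this periodically (for instance from cron or a watch loop) gives a simple growth curve for the LWP count.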

Almost all of them appear to be sleeping in a futex wait, so I assume some sort of deadlock, e.g.:

1 S root      8635 25622     1  0  80   0 - 1047316 futex_ 20:22 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root      8635 25623     1  0  80   0 - 1047316 futex_ 20:22 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
1 S root      8635 25625     1  0  80   0 - 1047316 futex_ 20:22 ?      00:00:00 /usr/local/sbin/rtpengine-recording --spool-dir=/recordings --output-storage=file --output-dir=/recordings --output-format=wav --output-mixed --pidfile=/var/run/rtpengine-rec
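
To put a number on "almost all", the per-thread states can be tallied from /proc/<pid>/task/<tid>/status. This is a sketch under assumptions: the function name thread_states and the proc-root override are hypothetical. Note that idle workers parked on a condition variable also sleep in a futex wait, so the state alone does not tell idle threads apart from deadlocked ones.

```shell
# Tally thread states (R, S, D, ...) for one process.
# $1 = PID, $2 = proc root (hypothetical override, defaults to /proc).
thread_states() {
    pid=$1
    proc_root=${2:-/proc}
    # Each status file has a line such as "State:  S (sleeping)".
    cat "$proc_root/$pid/task"/*/status 2>/dev/null |
        awk '/^State:/ { count[$2]++ } END { for (s in count) print s, count[s] }' |
        sort
}

# Example: thread_states 8635
# The kernel wait channel per thread can be listed with procps, e.g.:
#   ps -L -p 8635 -o tid,stat,wchan
```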

The issue was detected when the recording daemon started complaining about hitting file descriptor limits ("Too many open files" error), which struck me as curious given the relatively small number of concurrent streams being recorded and the fact that the recording daemon is running as EUID/EGID root.

However, what I have found is that every one of those LWPs has several hundred open descriptors. For instance, PID 8635 above:

# cd /proc/8635/fd
# ls -w 5 | wc -l
291

This seems to be the story with all the LWPs:

# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done 
SPID: find: ‘/proc/SPID/fd’: No such file or directory
0
8635: 284
8636: 284
8637: 284
8638: 284
8639: 284
8640: 284
[... same all the way down the line ...]

Since the descriptor count is exactly the same across all the LWPs, I assume this is because they are cloned into every LWP. But regardless, it contributes to a rather large cumulative descriptor count across all the LWPs for that process:

# ps -p 8635 -fT | awk '{print $3}' | while read THIS_PID; do echo -n "$THIS_PID: "; find "/proc/$THIS_PID/fd" | wc -l; done | awk '{print $2}' | awk 'BEGIN { sum = 0 } { sum += $1 } END { print sum }'
find: ‘/proc/SPID/fd’: No such file or directory
110826
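
One caveat on the sum above: on Linux, threads created through pthreads normally share a single descriptor table, so the identical per-LWP figures may be views of the same table and not separate copies. Under that assumption, the more telling comparison is the process-wide count against the per-process limit. The sketch below shows one way to get both; the function names and the proc-root override are hypothetical.

```shell
# Count the open descriptors of a process.
# $1 = PID, $2 = proc root (hypothetical override, defaults to /proc).
count_fds() {
    pid=$1
    proc_root=${2:-/proc}
    n=0
    for f in "$proc_root/$pid/fd"/*; do
        # Entries for sockets and pipes are symlinks to non-files, so test -L too.
        if [ -e "$f" ] || [ -L "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Print the soft "Max open files" limit from /proc/<pid>/limits.
soft_fd_limit() {
    awk '/^Max open files/ { print $4 }' "${2:-/proc}/$1/limits"
}

# Example:
#   echo "open: $(count_fds 8635) / soft limit: $(soft_fd_limit 8635)"
```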

The number of LWPs increases steadily. It had reached about 1200 by the time we restarted the recording daemon. At that point, we seem to have bumped into the system-wide FD limit:

# cat /proc/sys/fs/file-max
763006
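
To check how close the system actually is to that ceiling, /proc/sys/fs/file-nr reports the number of allocated file handles, the number of allocated but unused handles, and the maximum (the same value as file-max). A minimal sketch follows; the function name fd_usage and its optional file argument are hypothetical.

```shell
# Report system-wide file handle usage from file-nr.
# $1 = path to file-nr (hypothetical override, defaults to /proc/sys/fs/file-nr).
fd_usage() {
    file_nr=${1:-/proc/sys/fs/file-nr}
    # Fields: allocated, allocated-but-unused, maximum.
    awk '{ printf "allocated=%d max=%d used=%.1f%%\n", $1, $3, ($3 > 0 ? 100 * $1 / $3 : 0) }' "$file_nr"
}

# Example: fd_usage
```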

This happens regardless of whether the recording daemon is invoked with an explicit --num-threads=... value or left at the default (as now).

There is nothing interesting in the logs (until the "Too many open files" messages start). Just fairly routine things like:

INFO: [C 2fcb0ec6-e8ef-4e84-8fda-163e9ac7626d-94e8e401f7b2a8ed.meta] [S tag-1-media-1-component-2-RTCP-id-1] EOF on stream tag-1-media-1-component-2-RTCP-id-1

And:

WARNING: [C 63b7c294-a546-4623-a034-6d2b26f54cc3-63ab5712488d039b.meta] [S tag-0-media-1-component-1-RTP-id-2] [0x554f12d] Cannot decode RTP payload type 101 (telephone-event/8000)
