
NumpyVisdomLogger crashes with segmentation fault
Closed, Resolved (Public)


During training after N episodes (between 1000 and 2000, varies), experiment crashes with the following traceback:

Process Process-1:
[1]    17259 segmentation fault (core dumped)  python
Traceback (most recent call last):                                                                   
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 249, in _bootstrap
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jens/repos/vislogger/vislogger/", line 66, in __show
    vis_task = queue.get()
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/jens/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 383, in _recv
    raise EOFError

The multiprocessing documentation says that EOFError is raised when there is nothing left to receive and the sending end has already been closed. Could it be that this error is only a symptom of the actual problem?
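The behaviour in the last frames of the traceback can be reproduced with the stdlib alone: if the sending end of a connection is closed before anything is sent, `recv_bytes` hits end-of-stream and raises EOFError, exactly like `_recv` does above. A minimal sketch:

```python
# Minimal repro of the EOFError from the traceback: close the sender
# before it writes anything, then try to receive on the other end.
from multiprocessing import Pipe

recv_end, send_end = Pipe(duplex=False)
send_end.close()  # the "dead sender" scenario

try:
    recv_end.recv_bytes(256)
    outcome = "received"
except EOFError:
    outcome = "EOFError"

print(outcome)
```

This supports the suspicion that the EOFError only reports that the other process died; it says nothing about *why* it died.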

Event Timeline

petersej created this task.

I can confirm that the segfault is in fact the root problem. Running without Visdom logging still segfaults, just much later (sample size 1, ca. 13000 episodes). The EOFError is then only a consequence of the dead sender. Random idea from the internet: is the stack memory limit too low? Mine is currently at 8192MB (ulimit -s).
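For reference, the limit that `ulimit -s` reports can also be checked from inside the process with the stdlib `resource` module (Unix only; note that `ulimit -s` prints kibibytes), which makes it easy to log alongside the experiment:

```python
# Inspect the soft/hard stack limits of the current process (Unix only).
# These are the values behind `ulimit -s` (soft) and `ulimit -Hs` (hard).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
limit = "unlimited" if soft == resource.RLIM_INFINITY else f"{soft // 1024} KiB"
print("soft stack limit:", limit)
```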

Running under python -m pdb now produced the following RuntimeWarning:

/home/jens/anaconda3/lib/python3.6/site-packages/matplotlib/ RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).

I'm assuming this is the root of the issue and that the segfault is indeed caused by running out of memory from the accumulating figures.

Idea: we currently run .savefig in a thread; maybe a little helper that calls .savefig and then .clear in that thread would help?
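A hypothetical sketch of that helper (the name `save_and_close` and the use of the Agg backend are my assumptions, not the vislogger API): save the figure from a worker thread, then immediately close it so pyplot drops its reference and figures never accumulate past the ~20-figure warning threshold.

```python
# Hypothetical helper: savefig + close in a background thread, so pyplot
# never retains more than a handful of open figures. Assumes the
# non-interactive Agg backend.
import os
import tempfile
import threading

import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt


def save_and_close(fig, path):
    """Save the figure in a background thread, then release it."""
    def _worker():
        fig.savefig(path)
        plt.close(fig)  # drops pyplot's reference so the memory is reclaimed
    t = threading.Thread(target=_worker)
    t.start()
    return t  # caller can join() when the file is actually needed


# Tiny usage example: save one figure and wait for it.
fig = plt.figure()
fig.gca().plot([0, 1], [0, 1])
out_path = os.path.join(tempfile.mkdtemp(), "plot.png")
save_and_close(fig, out_path).join()
print(os.path.exists(out_path), len(plt.get_fignums()))
```

Note that pyplot is not formally thread-safe, so this is only a sketch of the idea, not a guaranteed fix.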

I did two things that apparently solved the problem (>50k episodes without issue):

1. Moved matplotlib.use("agg") to the, because I was actually using Qt5Agg. This is a necessary fix regardless of this particular problem.
2. "Un-threaded" the saving of figures. This is likely the actual fix, but I still don't know why... Will mark as resolved regardless.
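For anyone hitting the same issue, fix 1 boils down to backend selection order: matplotlib.use() only has an effect if it runs before pyplot is imported anywhere in the process. A minimal sketch:

```python
# Fix 1 in isolation: force the non-interactive Agg backend *before* pyplot
# is imported anywhere, otherwise an interactive backend (here Qt5Agg) may
# already be active and hold on to window/GUI resources per figure.
import matplotlib
matplotlib.use("agg")

import matplotlib.pyplot as plt  # noqa: E402  (must come after use())

print(matplotlib.get_backend())
```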