
NumpyVisdomLogger crashes with segmentation fault
Closed, Resolved (Public)


During training after N episodes (between 1000 and 2000, varies), experiment crashes with the following traceback:

Process Process-1:
[1]    17259 segmentation fault (core dumped)  python
Traceback (most recent call last):                                                                   
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 249, in _bootstrap
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jens/repos/vislogger/vislogger/", line 66, in __show
    vis_task = queue.get()
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/jens/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/", line 383, in _recv
    raise EOFError

The multiprocessing documentation says that EOFError is raised when there is nothing left to receive and the sending end has already been closed. Could it be that this error is only a symptom of the actual problem?
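The behaviour in the last frames of the traceback can be reproduced with the stdlib alone: if the sending end of a connection is closed before anything is sent, `recv_bytes` hits end-of-stream and raises EOFError, exactly like `_recv` does above. A minimal sketch:

```python
# Minimal repro of the EOFError from the traceback: close the sender
# before it writes anything, then try to receive on the other end.
from multiprocessing import Pipe

recv_end, send_end = Pipe(duplex=False)
send_end.close()  # the "dead sender" scenario

try:
    recv_end.recv_bytes(256)
    outcome = "received"
except EOFError:
    outcome = "EOFError"

print(outcome)
```

This supports the suspicion that the EOFError only reports that the other process died; it says nothing about *why* it died.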

Event Timeline

petersej created this task.

I can confirm that the segfault is in fact the root problem. Running without Visdom logging still segfaults, just much later (sample size 1, ca. 13000 episodes). The EOFError is then only a consequence of the dead sender. Random idea from the internet: is the stack memory limit too low? Mine is currently at 8192MB (ulimit -s).
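For reference, the limit that `ulimit -s` reports can also be checked from inside the process with the stdlib `resource` module (Unix only; note that `ulimit -s` prints kibibytes), which makes it easy to log alongside the experiment:

```python
# Inspect the soft/hard stack limits of the current process (Unix only).
# These are the values behind `ulimit -s` (soft) and `ulimit -Hs` (hard).
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
limit = "unlimited" if soft == resource.RLIM_INFINITY else f"{soft // 1024} KiB"
print("soft stack limit:", limit)
```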

Running under python -m pdb now produced the following RuntimeWarning:

/home/jens/anaconda3/lib/python3.6/site-packages/matplotlib/ RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).

I'm assuming this is the root of the issue and that the segfault is indeed caused by running out of memory from the accumulating figures.

Idea: we currently run .savefig in a thread; maybe a little helper that calls .savefig and then .clear in that thread would help?
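A hypothetical sketch of that helper (the name `save_and_close` and the use of the Agg backend are my assumptions, not the vislogger API): save the figure from a worker thread, then immediately close it so pyplot drops its reference and figures never accumulate past the ~20-figure warning threshold.

```python
# Hypothetical helper: savefig + close in a background thread, so pyplot
# never retains more than a handful of open figures. Assumes the
# non-interactive Agg backend.
import os
import tempfile
import threading

import matplotlib
matplotlib.use("agg")
import matplotlib.pyplot as plt


def save_and_close(fig, path):
    """Save the figure in a background thread, then release it."""
    def _worker():
        fig.savefig(path)
        plt.close(fig)  # drops pyplot's reference so the memory is reclaimed
    t = threading.Thread(target=_worker)
    t.start()
    return t  # caller can join() when the file is actually needed


# Tiny usage example: save one figure and wait for it.
fig = plt.figure()
fig.gca().plot([0, 1], [0, 1])
out_path = os.path.join(tempfile.mkdtemp(), "plot.png")
save_and_close(fig, out_path).join()
print(os.path.exists(out_path), len(plt.get_fignums()))
```

Note that pyplot is not formally thread-safe, so this is only a sketch of the idea, not a guaranteed fix.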

I did two things that apparently solved the problem (>50k episodes without issue):

1. Moved matplotlib.use("agg") to the, because I was actually using Qt5Agg. This is a necessary fix regardless of this particular problem.
2. "Un-threaded" the saving of figures. This is likely the actual fix, but I still don't know why... Will mark as resolved regardless.
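For anyone hitting the same issue, fix 1 boils down to backend selection order: matplotlib.use() only has an effect if it runs before pyplot is imported anywhere in the process. A minimal sketch:

```python
# Fix 1 in isolation: force the non-interactive Agg backend *before* pyplot
# is imported anywhere, otherwise an interactive backend (here Qt5Agg) may
# already be active and hold on to window/GUI resources per figure.
import matplotlib
matplotlib.use("agg")

import matplotlib.pyplot as plt  # noqa: E402  (must come after use())

print(matplotlib.get_backend())
```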