During training, after N episodes (N varies between 1000 and 2000), the experiment crashes with the following traceback:
Process Process-1:
[1]    17259 segmentation fault (core dumped)  python vae_base.py
Traceback (most recent call last):
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jens/repos/vislogger/vislogger/numpyvisdomlogger.py", line 66, in __show
    vis_task = queue.get()
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/jens/anaconda3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/jens/anaconda3/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
The multiprocessing documentation says that EOFError is raised when there is nothing left to receive and the other end has already been closed. Could it be that this error is only a symptom of the actual problem, for example the segmentation fault reported at the top of the output?
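To illustrate the documented behaviour, here is a minimal sketch that is independent of vislogger and PyTorch. It only shows that a receiver gets EOFError once the sending end of a connection is closed and nothing is left to read. The helper name `drain` and the task strings are hypothetical, and closing the sender in the same process merely stands in for a sending process that has gone away.

```python
from multiprocessing import Pipe


def drain(receiver):
    """Receive until the sending end is closed; return the items received."""
    items = []
    while True:
        try:
            items.append(receiver.recv())
        except EOFError:
            # Nothing left to receive and the other end was closed.
            break
    return items


# One-way pipe: `receiver` can only receive, `sender` can only send.
receiver, sender = Pipe(duplex=False)
sender.send("task-1")
sender.send("task-2")
sender.close()  # stands in for the sending process going away

received = drain(receiver)
print(received)
```

If this is what happens in the crash above, the EOFError in the logger process would be a consequence of the other side disappearing rather than the root cause.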