diff --git a/README.md b/README.md
new file mode 100644
index 0000000..eb1e11a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,111 @@
+


+
+## Overview
+This is a fully automated framework for object detection featuring:
+- 2D + 3D implementations of prevalent object detectors, e.g. Mask R-CNN [1], Retina Net [2], Retina U-Net [3].
+- Modular and lightweight structure ensuring sharing of all processing steps (incl. backbone architecture) for comparability of models.
+- Training with bounding-box and/or pixel-wise annotations.
+- Dynamic patching and tiling of 2D + 3D images (for training and inference).
+- Weighted consolidation of box predictions across patch overlaps, ensembles, and dimensions [3].
+- Simultaneous monitoring and evaluation on object and patient level.
+- 2D + 3D output visualizations.
+- Integration of the COCO mean average precision metric [5].
+- Integration of MIC-DKFZ batch generators for extensive data augmentation [6].
+- Easy modification for evaluation of instance segmentation and/or semantic segmentation.
+
+[1] He, Kaiming, et al. "Mask R-CNN." ICCV, 2017.
+[2] Lin, Tsung-Yi, et al. "Focal Loss for Dense Object Detection." TPAMI, 2018.
+[3] Jaeger, Paul, et al. "Retina U-Net: Embarrassingly Simple Exploitation of Segmentation Supervision for Medical Object Detection." 2018.
+
+[5] https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py
+[6] https://github.com/MIC-DKFZ/batchgenerators

+
+## Installation
+Setup the package in a virtual environment:
+```
+git clone https://github.com/pfjaeger/medicaldetectiontoolkit.git
+cd medicaldetectiontoolkit
+virtualenv -p python3 venv
+source venv/bin/activate
+pip3 install -e .
+```
+Install the MIC-DKFZ batch generators:
+```
+cd ..
+git clone https://github.com/MIC-DKFZ/batchgenerators
+cd batchgenerators
+pip3 install -e .
+cd ../medicaldetectiontoolkit
+```
+
+## Prepare the Data
+This framework is meant for training models on your own data sets.
+An example data loader is provided in medicaldetectiontoolkit/experiments, including thorough documentation to ensure a quick start for your own project.
+
+## Execute
+1. Set I/O paths, model, and training specifics in the configs file: medicaldetectiontoolkit/experiments/your_experiment/configs.py
+2. Train the model:
+
+    ```
+    python exec.py --mode train --exp_source experiments/my_experiment --exp_dir path/to/experiment/directory
+    ```
+    This copies snapshots of the configs and model to the specified exp_dir, where all outputs will be saved. By default, the data is split into 60% training, 20% validation, and 20% testing data to perform a 5-fold cross validation (this can be changed to a hold-out test set in the configs), and all folds are trained iteratively. To train a single fold or a subset of folds, specify them via the folds argument:
+    ```
+    python exec.py --folds 0 1 2 .... # specify any combination of folds [0-4]
+    ```
+3. Run inference:
+    ```
+    python exec.py --mode test --exp_dir path/to/experiment/directory
+    ```
+    This runs the prediction pipeline and saves all results to exp_dir.
+
+
+## Models
+
+This framework features all models explored in [3] (implemented in 2D + 3D): the proposed Retina U-Net, a simple but effective architecture fusing state-of-the-art semantic segmentation with object detection,

+


+also implementations of prevalent object detectors, such as Mask R-CNN, Faster R-CNN+ (Faster R-CNN w/ RoIAlign), Retina Net, U-Faster R-CNN+ (the two-stage counterpart of Retina U-Net: Faster R-CNN with auxiliary semantic segmentation), and DetU-Net (a U-Net-like segmentation architecture with heuristics for object detection).


+


+
+## Training annotations
+This framework features training with pixel-wise and/or bounding-box annotations. To overcome the issue of transforming box coordinates during
+data augmentation, we feed the annotation masks through the augmentation pipeline (creating a pseudo-mask if only bounding-box annotations are provided) and derive the boxes afterwards.
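The mask-to-box step can be sketched as follows. This is a minimal 2D illustration with NumPy; the function name is hypothetical and not the toolkit's API:

```python
import numpy as np

def mask_to_bbox(mask):
    """Derive a [y1, x1, y2, x2] box from a binary instance mask."""
    ys, xs = np.where(mask > 0)
    return [int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max())]

# Pseudo-mask built from a box annotation; spatial augmentations are applied
# to the mask, and the box is re-derived afterwards.
mask = np.zeros((64, 64), dtype=np.uint8)
mask[10:20, 30:50] = 1         # box y: 10-19, x: 30-49 as a pseudo-mask
augmented = np.rot90(mask)     # stand-in for an arbitrary spatial augmentation
box = mask_to_bbox(augmented)  # box coordinates follow the augmentation
```

This sidesteps transforming box coordinates analytically for every augmentation type, at the cost of one mask pass per instance.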

+


+
+
+## Prediction pipeline
+This framework provides an inference module, which automatically handles patching of inputs, and tiling, ensembling, and weighted consolidation of output predictions:
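The patching idea can be sketched per image axis as follows; this is a simplified sketch with assumed patch size and overlap, not the toolkit's actual interface:

```python
def patch_starts(axis_len, patch_len, overlap):
    """Start coordinates of overlapping patches fully covering one image axis."""
    stride = patch_len - overlap
    starts = list(range(0, max(axis_len - patch_len, 0) + 1, stride))
    # Shift in an extra patch so the end of the axis is covered.
    if starts[-1] + patch_len < axis_len:
        starts.append(axis_len - patch_len)
    return starts

# A 300-voxel axis tiled with 128-voxel patches and 32 voxels of overlap
# yields patches at 0 and 96, plus a final patch shifted to 172 for the border.
tiles = patch_starts(300, 128, 32)
```

Applying this per spatial axis and taking the Cartesian product of the start coordinates gives the 2D or 3D patch grid; overlapping regions are then reconciled during consolidation.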


+

+
+
+## Consolidation of predictions (Weighted Box Clustering)
+Multiple predictions for the same image (from test-time augmentations, tested epochs, and overlapping patches) result in a large number of boxes (or cubes), which need to be consolidated. In semantic segmentation, the final output would typically be obtained by averaging every pixel over all predictions. As described in [3], **weighted box clustering** (WBC) does this for box predictions:
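A heavily simplified sketch of the clustering step: boxes are grouped greedily by IoU with the current highest-scoring box, and each cluster is reduced to one score-weighted box. The actual WBC of [3] additionally weights by factors such as overlap area and patch-center distance; this sketch only illustrates the averaging idea:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    return inter / (area(a) + area(b) - inter)

def cluster_boxes(boxes, scores, thresh=0.5):
    """Greedily cluster overlapping boxes; each cluster becomes one
    score-weighted average box with the cluster's mean score."""
    order = np.argsort(scores)[::-1]
    merged, used = [], set()
    for i in order:
        if i in used:
            continue
        members = [j for j in order
                   if j not in used and iou(boxes[i], boxes[j]) >= thresh]
        used.update(members)
        w = scores[members] / scores[members].sum()
        merged.append((np.sum(boxes[members] * w[:, None], axis=0),
                       scores[members].mean()))
    return merged
```

Unlike NMS, which discards all but the top-scoring box of a cluster, this keeps information from every overlapping prediction in the consolidated coordinates and score.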
+



+



+
+
+
+## Visualization / Monitoring
+By default, loss functions and performance metrics are monitored:


+
+
+Histograms of matched output predictions for training/validation/testing are plotted per foreground class:


+ +
+Input images, ground-truth annotations, and output predictions of a sampled validation batch are plotted after each epoch (here: a sampled 2D slice with +-3 neighbouring context slices in the channels):


+ +
+Zoomed into the last two lines of the plot:


+
+
+## How to cite this code
+Please cite the original publication [3].
+
+## License
+The code is published under the [Apache License Version 2.0](LICENSE).
+
+
+
+
diff --git a/assets/.directory b/assets/.directory
new file mode 100644
index 0000000..4e2d005
--- /dev/null
+++ b/assets/.directory
@@ -0,0 +1,4 @@
+[Dolphin]
+Timestamp=2018,11,4,16,51,18
+Version=3
+ViewMode=1
diff --git a/assets/annotations.png b/assets/annotations.png
new file mode 100644
index 0000000..bf615eb
Binary files /dev/null and b/assets/annotations.png differ
diff --git a/assets/baseline_figure.png b/assets/baseline_figure.png
new file mode 100644
index 0000000..2e6f71c
Binary files /dev/null and b/assets/baseline_figure.png differ
diff --git a/assets/hist_example.png b/assets/hist_example.png
new file mode 100644
index 0000000..26ccea9
Binary files /dev/null and b/assets/hist_example.png differ
diff --git a/assets/loss_monitoring.png b/assets/loss_monitoring.png
new file mode 100644
index 0000000..1b53c27
Binary files /dev/null and b/assets/loss_monitoring.png differ
diff --git a/assets/mdt_logo_2.png b/assets/mdt_logo_2.png
new file mode 100644
index 0000000..dcf9e84
Binary files /dev/null and b/assets/mdt_logo_2.png differ
diff --git a/assets/output_monitoring_1.png b/assets/output_monitoring_1.png
new file mode 100644
index 0000000..602a846
Binary files /dev/null and b/assets/output_monitoring_1.png differ
diff --git a/assets/output_monitoring_2.png b/assets/output_monitoring_2.png
new file mode 100644
index 0000000..dd98540
Binary files /dev/null and b/assets/output_monitoring_2.png differ
diff --git a/assets/prediction_pipeline.png b/assets/prediction_pipeline.png
new file mode 100644
index 0000000..1ed3a03
Binary files /dev/null and b/assets/prediction_pipeline.png differ
diff --git a/assets/retu_figure.png b/assets/retu_figure.png
new file mode 100644
index 0000000..cb1348f
Binary files /dev/null and b/assets/retu_figure.png differ
diff --git a/assets/toy_readme.png
b/assets/toy_readme.png new file mode 100644 index 0000000..a0c61b6 Binary files /dev/null and b/assets/toy_readme.png differ diff --git a/assets/wcs_hists.png b/assets/wcs_hists.png new file mode 100644 index 0000000..4565a57 Binary files /dev/null and b/assets/wcs_hists.png differ diff --git a/assets/wcs_readme.png b/assets/wcs_readme.png new file mode 100644 index 0000000..99384e1 Binary files /dev/null and b/assets/wcs_readme.png differ diff --git a/assets/wcs_sketch.png b/assets/wcs_sketch.png new file mode 100644 index 0000000..919d1ef Binary files /dev/null and b/assets/wcs_sketch.png differ diff --git a/assets/wcs_text.png b/assets/wcs_text.png new file mode 100644 index 0000000..75764a5 Binary files /dev/null and b/assets/wcs_text.png differ diff --git a/readme.txt b/cuda_functions/nms_2D/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/nms_2D/__init__.py diff --git a/cuda_functions/nms_2D/__pycache__/__init__.cpython-35.pyc b/cuda_functions/nms_2D/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..08425eb Binary files /dev/null and b/cuda_functions/nms_2D/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/nms_2D/__pycache__/__init__.cpython-36.pyc b/cuda_functions/nms_2D/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..2eb81da Binary files /dev/null and b/cuda_functions/nms_2D/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/nms_2D/__pycache__/pth_nms.cpython-35.pyc b/cuda_functions/nms_2D/__pycache__/pth_nms.cpython-35.pyc new file mode 100644 index 0000000..1bf0a6c Binary files /dev/null and b/cuda_functions/nms_2D/__pycache__/pth_nms.cpython-35.pyc differ diff --git a/cuda_functions/nms_2D/__pycache__/pth_nms.cpython-36.pyc b/cuda_functions/nms_2D/__pycache__/pth_nms.cpython-36.pyc new file mode 100644 index 0000000..839361c Binary files /dev/null and b/cuda_functions/nms_2D/__pycache__/pth_nms.cpython-36.pyc differ diff --git a/readme.txt 
b/cuda_functions/nms_2D/_ext/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/nms_2D/_ext/__init__.py diff --git a/cuda_functions/nms_2D/_ext/__pycache__/__init__.cpython-35.pyc b/cuda_functions/nms_2D/_ext/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..ab74db1 Binary files /dev/null and b/cuda_functions/nms_2D/_ext/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/nms_2D/_ext/__pycache__/__init__.cpython-36.pyc b/cuda_functions/nms_2D/_ext/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..3e87955 Binary files /dev/null and b/cuda_functions/nms_2D/_ext/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/nms_2D/_ext/nms/__init__.py b/cuda_functions/nms_2D/_ext/nms/__init__.py new file mode 100644 index 0000000..d71786f --- /dev/null +++ b/cuda_functions/nms_2D/_ext/nms/__init__.py @@ -0,0 +1,15 @@ + +from torch.utils.ffi import _wrap_function +from ._nms import lib as _lib, ffi as _ffi + +__all__ = [] +def _import_symbols(locals): + for symbol in dir(_lib): + fn = getattr(_lib, symbol) + if callable(fn): + locals[symbol] = _wrap_function(fn, _ffi) + else: + locals[symbol] = fn + __all__.append(symbol) + +_import_symbols(locals()) diff --git a/cuda_functions/nms_2D/_ext/nms/__pycache__/__init__.cpython-35.pyc b/cuda_functions/nms_2D/_ext/nms/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..e535879 Binary files /dev/null and b/cuda_functions/nms_2D/_ext/nms/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/nms_2D/_ext/nms/__pycache__/__init__.cpython-36.pyc b/cuda_functions/nms_2D/_ext/nms/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..7e1a9b1 Binary files /dev/null and b/cuda_functions/nms_2D/_ext/nms/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/nms_2D/_ext/nms/_nms.so b/cuda_functions/nms_2D/_ext/nms/_nms.so new file mode 100755 index 0000000..1856faf Binary 
files /dev/null and b/cuda_functions/nms_2D/_ext/nms/_nms.so differ diff --git a/cuda_functions/nms_2D/build.py b/cuda_functions/nms_2D/build.py new file mode 100644 index 0000000..4d9a96b --- /dev/null +++ b/cuda_functions/nms_2D/build.py @@ -0,0 +1,34 @@ +import os +import torch +from torch.utils.ffi import create_extension + + +sources = ['src/nms.c'] +headers = ['src/nms.h'] +defines = [] +with_cuda = False + +if torch.cuda.is_available(): + print('Including CUDA code.') + sources += ['src/nms_cuda.c'] + headers += ['src/nms_cuda.h'] + defines += [('WITH_CUDA', None)] + with_cuda = True + +this_file = os.path.dirname(os.path.realpath(__file__)) +print(this_file) +extra_objects = ['src/cuda/nms_kernel.cu.o'] +extra_objects = [os.path.join(this_file, fname) for fname in extra_objects] + +ffi = create_extension( + '_ext.nms', + headers=headers, + sources=sources, + define_macros=defines, + relative_to=__file__, + with_cuda=with_cuda, + extra_objects=extra_objects +) + +if __name__ == '__main__': + ffi.build() diff --git a/cuda_functions/nms_2D/pth_nms.py b/cuda_functions/nms_2D/pth_nms.py new file mode 100644 index 0000000..bfdc29a --- /dev/null +++ b/cuda_functions/nms_2D/pth_nms.py @@ -0,0 +1,39 @@ +import torch +from ._ext import nms + + +def nms_gpu(dets, thresh): + """ + dets has to be a tensor + """ + + scores = dets[:, 4] + order = scores.sort(0, descending=True)[1] + dets = dets[order].contiguous() + + keep = torch.LongTensor(dets.size(0)) + num_out = torch.LongTensor(1) + nms.gpu_nms(keep, num_out, dets, thresh) + return order[keep[:num_out[0]].cuda()].contiguous() + + + +def nms_cpu(dets, thresh): + + dets = dets.cpu() + x1 = dets[:, 0] + y1 = dets[:, 1] + x2 = dets[:, 2] + y2 = dets[:, 3] + scores = dets[:, 4] + + areas = (x2 - x1 + 1) * (y2 - y1 + 1) + order = scores.sort(0, descending=True)[1] + # order = torch.from_numpy(np.ascontiguousarray(scores.numpy().argsort()[::-1])).long() + + keep = torch.LongTensor(dets.size(0)) + num_out = 
torch.LongTensor(1) + nms.cpu_nms(keep, num_out, dets, order, areas, thresh) + + return keep[:num_out[0]] + diff --git a/cuda_functions/nms_2D/src/cuda/nms_kernel.cu b/cuda_functions/nms_2D/src/cuda/nms_kernel.cu new file mode 100644 index 0000000..1174f22 --- /dev/null +++ b/cuda_functions/nms_2D/src/cuda/nms_kernel.cu @@ -0,0 +1,87 @@ +// ------------------------------------------------------------------ +// Faster R-CNN +// Copyright (c) 2015 Microsoft +// Licensed under The MIT License [see fast-rcnn/LICENSE for details] +// Written by Shaoqing Ren +// ------------------------------------------------------------------ +#ifdef __cplusplus +extern "C" { +#endif + +#include +#include +#include +#include "nms_kernel.h" + +__device__ inline float devIoU(float const * const a, float const * const b) { + float left = fmaxf(a[0], b[0]), right = fminf(a[2], b[2]); + float top = fmaxf(a[1], b[1]), bottom = fminf(a[3], b[3]); + float width = fmaxf(right - left + 1, 0.f), height = fmaxf(bottom - top + 1, 0.f); + float interS = width * height; + float Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1); + float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1); + return interS / (Sa + Sb - interS); +} + +__global__ void nms_kernel(const int n_boxes, const float nms_overlap_thresh, + const float *dev_boxes, unsigned long long *dev_mask) { + const int row_start = blockIdx.y; + const int col_start = blockIdx.x; + + // if (row_start > col_start) return; + + const int row_size = + fminf(n_boxes - row_start * threadsPerBlock, threadsPerBlock); + const int col_size = + fminf(n_boxes - col_start * threadsPerBlock, threadsPerBlock); + + __shared__ float block_boxes[threadsPerBlock * 5]; + if (threadIdx.x < col_size) { + block_boxes[threadIdx.x * 5 + 0] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 0]; + block_boxes[threadIdx.x * 5 + 1] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 1]; + block_boxes[threadIdx.x * 5 + 2] = + dev_boxes[(threadsPerBlock * col_start + 
threadIdx.x) * 5 + 2]; + block_boxes[threadIdx.x * 5 + 3] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 3]; + block_boxes[threadIdx.x * 5 + 4] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 5 + 4]; + } + __syncthreads(); + + if (threadIdx.x < row_size) { + const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x; + const float *cur_box = dev_boxes + cur_box_idx * 5; + int i = 0; + unsigned long long t = 0; + int start = 0; + if (row_start == col_start) { + start = threadIdx.x + 1; + } + for (i = start; i < col_size; i++) { + if (devIoU(cur_box, block_boxes + i * 5) > nms_overlap_thresh) { + t |= 1ULL << i; + } + } + const int col_blocks = DIVUP(n_boxes, threadsPerBlock); + dev_mask[cur_box_idx * col_blocks + col_start] = t; + } +} + + +void _nms(int boxes_num, float * boxes_dev, + unsigned long long * mask_dev, float nms_overlap_thresh) { + + dim3 blocks(DIVUP(boxes_num, threadsPerBlock), + DIVUP(boxes_num, threadsPerBlock)); + dim3 threads(threadsPerBlock); + nms_kernel<<>>(boxes_num, + nms_overlap_thresh, + boxes_dev, + mask_dev); +} + +#ifdef __cplusplus +} +#endif diff --git a/cuda_functions/nms_2D/src/cuda/nms_kernel.cu.o b/cuda_functions/nms_2D/src/cuda/nms_kernel.cu.o new file mode 100644 index 0000000..00135bf Binary files /dev/null and b/cuda_functions/nms_2D/src/cuda/nms_kernel.cu.o differ diff --git a/cuda_functions/nms_2D/src/cuda/nms_kernel.h b/cuda_functions/nms_2D/src/cuda/nms_kernel.h new file mode 100644 index 0000000..2f40582 --- /dev/null +++ b/cuda_functions/nms_2D/src/cuda/nms_kernel.h @@ -0,0 +1,19 @@ +#ifndef _NMS_KERNEL +#define _NMS_KERNEL + +#ifdef __cplusplus +extern "C" { +#endif + +#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0)) +int const threadsPerBlock = sizeof(unsigned long long) * 8; + +void _nms(int boxes_num, float * boxes_dev, + unsigned long long * mask_dev, float nms_overlap_thresh); + +#ifdef __cplusplus +} +#endif + +#endif + diff --git a/cuda_functions/nms_2D/src/nms.c 
b/cuda_functions/nms_2D/src/nms.c new file mode 100644 index 0000000..4795cc1 --- /dev/null +++ b/cuda_functions/nms_2D/src/nms.c @@ -0,0 +1,69 @@ +#include +#include + +int cpu_nms(THLongTensor * keep_out, THLongTensor * num_out, THFloatTensor * boxes, THLongTensor * order, THFloatTensor * areas, float nms_overlap_thresh) { + // boxes has to be sorted + THArgCheck(THLongTensor_isContiguous(keep_out), 0, "keep_out must be contiguous"); + THArgCheck(THLongTensor_isContiguous(boxes), 2, "boxes must be contiguous"); + THArgCheck(THLongTensor_isContiguous(order), 3, "order must be contiguous"); + THArgCheck(THLongTensor_isContiguous(areas), 4, "areas must be contiguous"); + // Number of ROIs + long boxes_num = THFloatTensor_size(boxes, 0); + long boxes_dim = THFloatTensor_size(boxes, 1); + + long * keep_out_flat = THLongTensor_data(keep_out); + float * boxes_flat = THFloatTensor_data(boxes); + long * order_flat = THLongTensor_data(order); + float * areas_flat = THFloatTensor_data(areas); + + THByteTensor* suppressed = THByteTensor_newWithSize1d(boxes_num); + THByteTensor_fill(suppressed, 0); + unsigned char * suppressed_flat = THByteTensor_data(suppressed); + + // nominal indices + int i, j; + // sorted indices + int _i, _j; + // temp variables for box i's (the box currently under consideration) + float ix1, iy1, ix2, iy2, iarea; + // variables for computing overlap with box j (lower scoring box) + float xx1, yy1, xx2, yy2; + float w, h; + float inter, ovr; + + long num_to_keep = 0; + for (_i=0; _i < boxes_num; ++_i) { + i = order_flat[_i]; + if (suppressed_flat[i] == 1) { + continue; + } + keep_out_flat[num_to_keep++] = i; + ix1 = boxes_flat[i * boxes_dim]; + iy1 = boxes_flat[i * boxes_dim + 1]; + ix2 = boxes_flat[i * boxes_dim + 2]; + iy2 = boxes_flat[i * boxes_dim + 3]; + iarea = areas_flat[i]; + for (_j = _i + 1; _j < boxes_num; ++_j) { + j = order_flat[_j]; + if (suppressed_flat[j] == 1) { + continue; + } + xx1 = fmaxf(ix1, boxes_flat[j * boxes_dim]); + yy1 = 
fmaxf(iy1, boxes_flat[j * boxes_dim + 1]); + xx2 = fminf(ix2, boxes_flat[j * boxes_dim + 2]); + yy2 = fminf(iy2, boxes_flat[j * boxes_dim + 3]); + w = fmaxf(0.0, xx2 - xx1 + 1); + h = fmaxf(0.0, yy2 - yy1 + 1); + inter = w * h; + ovr = inter / (iarea + areas_flat[j] - inter); + if (ovr >= nms_overlap_thresh) { + suppressed_flat[j] = 1; + } + } + } + + long *num_out_flat = THLongTensor_data(num_out); + *num_out_flat = num_to_keep; + THByteTensor_free(suppressed); + return 1; +} \ No newline at end of file diff --git a/cuda_functions/nms_2D/src/nms.h b/cuda_functions/nms_2D/src/nms.h new file mode 100644 index 0000000..25ca0a3 --- /dev/null +++ b/cuda_functions/nms_2D/src/nms.h @@ -0,0 +1 @@ +int cpu_nms(THLongTensor * keep_out, THLongTensor * num_out, THFloatTensor * boxes, THLongTensor * order, THFloatTensor * areas, float nms_overlap_thresh); \ No newline at end of file diff --git a/cuda_functions/nms_2D/src/nms_cuda.c b/cuda_functions/nms_2D/src/nms_cuda.c new file mode 100644 index 0000000..5a9a70f --- /dev/null +++ b/cuda_functions/nms_2D/src/nms_cuda.c @@ -0,0 +1,67 @@ +// ------------------------------------------------------------------ +// Faster R-CNN +// Copyright (c) 2015 Microsoft +// Licensed under The MIT License [see fast-rcnn/LICENSE for details] +// Written by Shaoqing Ren +// ------------------------------------------------------------------ +#include +#include +#include +#include + +#include "cuda/nms_kernel.h" + + +extern THCState *state; + +int gpu_nms(THLongTensor * keep, THLongTensor* num_out, THCudaTensor * boxes, float nms_overlap_thresh) { + // boxes has to be sorted + THArgCheck(THLongTensor_isContiguous(keep), 0, "boxes must be contiguous"); + THArgCheck(THCudaTensor_isContiguous(state, boxes), 2, "boxes must be contiguous"); + // Number of ROIs + int boxes_num = THCudaTensor_size(state, boxes, 0); + int boxes_dim = THCudaTensor_size(state, boxes, 1); + + float* boxes_flat = THCudaTensor_data(state, boxes); + + const int col_blocks = 
DIVUP(boxes_num, threadsPerBlock); + THCudaLongTensor * mask = THCudaLongTensor_newWithSize2d(state, boxes_num, col_blocks); + unsigned long long* mask_flat = THCudaLongTensor_data(state, mask); + + _nms(boxes_num, boxes_flat, mask_flat, nms_overlap_thresh); + + THLongTensor * mask_cpu = THLongTensor_newWithSize2d(boxes_num, col_blocks); + THLongTensor_copyCuda(state, mask_cpu, mask); + THCudaLongTensor_free(state, mask); + + unsigned long long * mask_cpu_flat = THLongTensor_data(mask_cpu); + + THLongTensor * remv_cpu = THLongTensor_newWithSize1d(col_blocks); + unsigned long long* remv_cpu_flat = THLongTensor_data(remv_cpu); + THLongTensor_fill(remv_cpu, 0); + + long * keep_flat = THLongTensor_data(keep); + long num_to_keep = 0; + + int i, j; + for (i = 0; i < boxes_num; i++) { + int nblock = i / threadsPerBlock; + int inblock = i % threadsPerBlock; + + if (!(remv_cpu_flat[nblock] & (1ULL << inblock))) { + keep_flat[num_to_keep++] = i; + unsigned long long *p = &mask_cpu_flat[0] + i * col_blocks; + for (j = nblock; j < col_blocks; j++) { + remv_cpu_flat[j] |= p[j]; + } + } + } + + long * num_out_flat = THLongTensor_data(num_out); + * num_out_flat = num_to_keep; + + THLongTensor_free(mask_cpu); + THLongTensor_free(remv_cpu); + + return 1; +} diff --git a/cuda_functions/nms_2D/src/nms_cuda.h b/cuda_functions/nms_2D/src/nms_cuda.h new file mode 100644 index 0000000..0826111 --- /dev/null +++ b/cuda_functions/nms_2D/src/nms_cuda.h @@ -0,0 +1 @@ +int gpu_nms(THLongTensor * keep_out, THLongTensor* num_out, THCudaTensor * boxes, float nms_overlap_thresh); \ No newline at end of file diff --git a/readme.txt b/cuda_functions/nms_3D/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/nms_3D/__init__.py diff --git a/cuda_functions/nms_3D/__pycache__/__init__.cpython-35.pyc b/cuda_functions/nms_3D/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..1cf1238 Binary files /dev/null and 
b/cuda_functions/nms_3D/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/nms_3D/__pycache__/__init__.cpython-36.pyc b/cuda_functions/nms_3D/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..e09a2cb Binary files /dev/null and b/cuda_functions/nms_3D/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/nms_3D/__pycache__/pth_nms.cpython-35.pyc b/cuda_functions/nms_3D/__pycache__/pth_nms.cpython-35.pyc new file mode 100644 index 0000000..29a502f Binary files /dev/null and b/cuda_functions/nms_3D/__pycache__/pth_nms.cpython-35.pyc differ diff --git a/cuda_functions/nms_3D/__pycache__/pth_nms.cpython-36.pyc b/cuda_functions/nms_3D/__pycache__/pth_nms.cpython-36.pyc new file mode 100644 index 0000000..2fa4c5d Binary files /dev/null and b/cuda_functions/nms_3D/__pycache__/pth_nms.cpython-36.pyc differ diff --git a/readme.txt b/cuda_functions/nms_3D/_ext/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/nms_3D/_ext/__init__.py diff --git a/cuda_functions/nms_3D/_ext/__pycache__/__init__.cpython-35.pyc b/cuda_functions/nms_3D/_ext/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..6ee8ff3 Binary files /dev/null and b/cuda_functions/nms_3D/_ext/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/nms_3D/_ext/__pycache__/__init__.cpython-36.pyc b/cuda_functions/nms_3D/_ext/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..f733093 Binary files /dev/null and b/cuda_functions/nms_3D/_ext/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/nms_3D/_ext/nms/__init__.py b/cuda_functions/nms_3D/_ext/nms/__init__.py new file mode 100644 index 0000000..d71786f --- /dev/null +++ b/cuda_functions/nms_3D/_ext/nms/__init__.py @@ -0,0 +1,15 @@ + +from torch.utils.ffi import _wrap_function +from ._nms import lib as _lib, ffi as _ffi + +__all__ = [] +def _import_symbols(locals): + for symbol in dir(_lib): + fn = getattr(_lib, 
symbol) + if callable(fn): + locals[symbol] = _wrap_function(fn, _ffi) + else: + locals[symbol] = fn + __all__.append(symbol) + +_import_symbols(locals()) diff --git a/cuda_functions/nms_3D/_ext/nms/__pycache__/__init__.cpython-35.pyc b/cuda_functions/nms_3D/_ext/nms/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..10160ab Binary files /dev/null and b/cuda_functions/nms_3D/_ext/nms/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/nms_3D/_ext/nms/__pycache__/__init__.cpython-36.pyc b/cuda_functions/nms_3D/_ext/nms/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..74019e7 Binary files /dev/null and b/cuda_functions/nms_3D/_ext/nms/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/nms_3D/_ext/nms/_nms.so b/cuda_functions/nms_3D/_ext/nms/_nms.so new file mode 100755 index 0000000..c8498a0 Binary files /dev/null and b/cuda_functions/nms_3D/_ext/nms/_nms.so differ diff --git a/cuda_functions/nms_3D/build.py b/cuda_functions/nms_3D/build.py new file mode 100644 index 0000000..4d9a96b --- /dev/null +++ b/cuda_functions/nms_3D/build.py @@ -0,0 +1,34 @@ +import os +import torch +from torch.utils.ffi import create_extension + + +sources = ['src/nms.c'] +headers = ['src/nms.h'] +defines = [] +with_cuda = False + +if torch.cuda.is_available(): + print('Including CUDA code.') + sources += ['src/nms_cuda.c'] + headers += ['src/nms_cuda.h'] + defines += [('WITH_CUDA', None)] + with_cuda = True + +this_file = os.path.dirname(os.path.realpath(__file__)) +print(this_file) +extra_objects = ['src/cuda/nms_kernel.cu.o'] +extra_objects = [os.path.join(this_file, fname) for fname in extra_objects] + +ffi = create_extension( + '_ext.nms', + headers=headers, + sources=sources, + define_macros=defines, + relative_to=__file__, + with_cuda=with_cuda, + extra_objects=extra_objects +) + +if __name__ == '__main__': + ffi.build() diff --git a/cuda_functions/nms_3D/pth_nms.py b/cuda_functions/nms_3D/pth_nms.py 
new file mode 100644 index 0000000..3639b5b --- /dev/null +++ b/cuda_functions/nms_3D/pth_nms.py @@ -0,0 +1,38 @@ +import torch +from ._ext import nms + + +def nms_gpu(dets, thresh): + """ + dets has to be a tensor + """ + + scores = dets[:, -1] + order = scores.sort(0, descending=True)[1] + dets = dets[order].contiguous() + + keep = torch.LongTensor(dets.size(0)) + num_out = torch.LongTensor(1) + nms.gpu_nms(keep, num_out, dets, thresh) + return order[keep[:num_out[0]].cuda()].contiguous() + + +def nms_cpu(dets, thresh): + + dets = dets.cpu() + x1 = dets[:, 0] + y1 = dets[:, 1] + x2 = dets[:, 2] + y2 = dets[:, 3] + z1 = dets[:, 4] + z2 = dets[:, 5] + scores = dets[:, 6] + areas = (x2 - x1 +1) * (y2 - y1 +1) * (z2 - z1 +1) + order = scores.sort(0, descending=True)[1] + + keep = torch.LongTensor(dets.size(0)) + num_out = torch.LongTensor(1) + nms.cpu_nms(keep, num_out, dets, order, areas, thresh) + + return keep[:num_out[0]] + diff --git a/cuda_functions/nms_3D/src/cuda/nms_kernel.cu b/cuda_functions/nms_3D/src/cuda/nms_kernel.cu new file mode 100644 index 0000000..5692de8 --- /dev/null +++ b/cuda_functions/nms_3D/src/cuda/nms_kernel.cu @@ -0,0 +1,96 @@ +// ------------------------------------------------------------------ +// Faster R-CNN +// Copyright (c) 2015 Microsoft +// Licensed under The MIT License [see fast-rcnn/LICENSE for details] +// Written by Shaoqing Ren +// ------------------------------------------------------------------ +#ifdef __cplusplus +extern "C" { +#endif + +#include +#include +#include +#include "nms_kernel.h" + +__device__ inline float devIoU(float const * const a, float const * const b) { + float left = fmaxf(a[0], b[0]), right = fminf(a[2], b[2]); + float top = fmaxf(a[1], b[1]), bottom = fminf(a[3], b[3]); + float front = fmaxf(a[4], b[4]), back = fminf(a[5], b[5]); + + float width = fmaxf(right - left + 1, 0.f), height = fmaxf(bottom - top + 1, 0.f), depth = fmaxf(back - front + 1, 0.f); + float interS = width * height * depth; + float 
Sa = (a[2] - a[0] + 1) * (a[3] - a[1] + 1) * (a[5] - a[4] + 1); + float Sb = (b[2] - b[0] + 1) * (b[3] - b[1] + 1) * (b[5] - b[4] + 1); + //printf("IoU 3D %f \n", interS / (Sa + Sb - interS)); + + return interS / (Sa + Sb - interS); +} + +__global__ void nms_kernel(const int n_boxes, const float nms_overlap_thresh, + const float *dev_boxes, unsigned long long *dev_mask) { + const int row_start = blockIdx.y; + const int col_start = blockIdx.x; + + // if (row_start > col_start) return; + + const int row_size = + fminf(n_boxes - row_start * threadsPerBlock, threadsPerBlock); + const int col_size = + fminf(n_boxes - col_start * threadsPerBlock, threadsPerBlock); + + __shared__ float block_boxes[threadsPerBlock * 7]; + if (threadIdx.x < col_size) { + block_boxes[threadIdx.x * 7 + 0] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 7 + 0]; + block_boxes[threadIdx.x * 7 + 1] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 7 + 1]; + block_boxes[threadIdx.x * 7 + 2] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 7 + 2]; + block_boxes[threadIdx.x * 7 + 3] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 7 + 3]; + block_boxes[threadIdx.x * 7 + 4] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 7 + 4]; + block_boxes[threadIdx.x * 7 + 5] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 7 + 5]; + block_boxes[threadIdx.x * 7 + 6] = + dev_boxes[(threadsPerBlock * col_start + threadIdx.x) * 7 + 6]; + } + __syncthreads(); + + if (threadIdx.x < row_size) { + const int cur_box_idx = threadsPerBlock * row_start + threadIdx.x; + const float *cur_box = dev_boxes + cur_box_idx * 7; + int i = 0; + unsigned long long t = 0; + int start = 0; + if (row_start == col_start) { + start = threadIdx.x + 1; + } + for (i = start; i < col_size; i++) { + if (devIoU(cur_box, block_boxes + i * 7) > nms_overlap_thresh) { + t |= 1ULL << i; + } + } + const int col_blocks = DIVUP(n_boxes, threadsPerBlock); + dev_mask[cur_box_idx * 
col_blocks + col_start] = t; + } +} + + +void _nms(int boxes_num, float * boxes_dev, + unsigned long long * mask_dev, float nms_overlap_thresh) { + + + dim3 blocks(DIVUP(boxes_num, threadsPerBlock), + DIVUP(boxes_num, threadsPerBlock)); + dim3 threads(threadsPerBlock); + nms_kernel<<>>(boxes_num, + nms_overlap_thresh, + boxes_dev, + mask_dev); +} + +#ifdef __cplusplus +} +#endif diff --git a/cuda_functions/nms_3D/src/cuda/nms_kernel.cu.o b/cuda_functions/nms_3D/src/cuda/nms_kernel.cu.o new file mode 100644 index 0000000..ee3ed41 Binary files /dev/null and b/cuda_functions/nms_3D/src/cuda/nms_kernel.cu.o differ diff --git a/cuda_functions/nms_3D/src/cuda/nms_kernel.h b/cuda_functions/nms_3D/src/cuda/nms_kernel.h new file mode 100644 index 0000000..2f40582 --- /dev/null +++ b/cuda_functions/nms_3D/src/cuda/nms_kernel.h @@ -0,0 +1,19 @@ +#ifndef _NMS_KERNEL +#define _NMS_KERNEL + +#ifdef __cplusplus +extern "C" { +#endif + +#define DIVUP(m,n) ((m) / (n) + ((m) % (n) > 0)) +int const threadsPerBlock = sizeof(unsigned long long) * 8; + +void _nms(int boxes_num, float * boxes_dev, + unsigned long long * mask_dev, float nms_overlap_thresh); + +#ifdef __cplusplus +} +#endif + +#endif + diff --git a/cuda_functions/nms_3D/src/nms.c b/cuda_functions/nms_3D/src/nms.c new file mode 100644 index 0000000..dd64336 --- /dev/null +++ b/cuda_functions/nms_3D/src/nms.c @@ -0,0 +1,74 @@ +#include +#include + + +int cpu_nms(THLongTensor * keep_out, THLongTensor * num_out, THFloatTensor * boxes, THLongTensor * order, THFloatTensor * areas, float nms_overlap_thresh) { + // boxes has to be sorted + THArgCheck(THLongTensor_isContiguous(keep_out), 0, "keep_out must be contiguous"); + THArgCheck(THLongTensor_isContiguous(boxes), 2, "boxes must be contiguous"); + THArgCheck(THLongTensor_isContiguous(order), 3, "order must be contiguous"); + THArgCheck(THLongTensor_isContiguous(areas), 4, "areas must be contiguous"); + // Number of ROIs + long boxes_num = THFloatTensor_size(boxes, 0); + long 
boxes_dim = THFloatTensor_size(boxes, 1); + + long * keep_out_flat = THLongTensor_data(keep_out); + float * boxes_flat = THFloatTensor_data(boxes); + long * order_flat = THLongTensor_data(order); + float * areas_flat = THFloatTensor_data(areas); + + THByteTensor* suppressed = THByteTensor_newWithSize1d(boxes_num); + THByteTensor_fill(suppressed, 0); + unsigned char * suppressed_flat = THByteTensor_data(suppressed); + // nominal indices + int i, j; + // sorted indices + int _i, _j; + // temp variables for box i (the box currently under consideration) + float ix1, iy1, ix2, iy2, iz1, iz2, iarea; + // variables for computing overlap with box j (lower scoring box) + float xx1, yy1, xx2, yy2, zz1, zz2; + float w, h, d; + float inter, ovr; + + long num_to_keep = 0; + for (_i=0; _i < boxes_num; ++_i) { + i = order_flat[_i]; // from sorted index to nominal index in the boxes list. + if (suppressed_flat[i] == 1) { // skip boxes already suppressed by a higher-scoring box. + continue; + } + keep_out_flat[num_to_keep++] = i; // keep box i: num_to_keep is read, then incremented. 
+ ix1 = boxes_flat[i * boxes_dim]; + iy1 = boxes_flat[i * boxes_dim + 1]; + ix2 = boxes_flat[i * boxes_dim + 2]; + iy2 = boxes_flat[i * boxes_dim + 3]; + iz1 = boxes_flat[i * boxes_dim + 4]; + iz2 = boxes_flat[i * boxes_dim + 5]; + iarea = areas_flat[i]; + for (_j = _i + 1; _j < boxes_num; ++_j) { + j = order_flat[_j]; + if (suppressed_flat[j] == 1) { + continue; + } + xx1 = fmaxf(ix1, boxes_flat[j * boxes_dim]); + yy1 = fmaxf(iy1, boxes_flat[j * boxes_dim + 1]); + xx2 = fminf(ix2, boxes_flat[j * boxes_dim + 2]); + yy2 = fminf(iy2, boxes_flat[j * boxes_dim + 3]); + zz1 = fmaxf(iz1, boxes_flat[j * boxes_dim + 4]); + zz2 = fminf(iz2, boxes_flat[j * boxes_dim + 5]); + w = fmaxf(0.0, xx2 - xx1 + 1); + h = fmaxf(0.0, yy2 - yy1 + 1); + d = fmaxf(0.0, zz2 - zz1 + 1); + inter = w * h * d; + ovr = inter / (iarea + areas_flat[j] - inter); + if (ovr >= nms_overlap_thresh) { + suppressed_flat[j] = 1; // can be surpressed because score j < score i (from order: _j = _i + 1 ...) + } + } + } + + long *num_out_flat = THLongTensor_data(num_out); + *num_out_flat = num_to_keep; + THByteTensor_free(suppressed); + return 1; +} \ No newline at end of file diff --git a/cuda_functions/nms_3D/src/nms.h b/cuda_functions/nms_3D/src/nms.h new file mode 100644 index 0000000..d17d9c9 --- /dev/null +++ b/cuda_functions/nms_3D/src/nms.h @@ -0,0 +1 @@ +int cpu_nms(THLongTensor * keep_out, THLongTensor * num_out, THFloatTensor * boxes, THLongTensor * order, THFloatTensor * areas, float nms_overlap_thresh); diff --git a/cuda_functions/nms_3D/src/nms_cuda.c b/cuda_functions/nms_3D/src/nms_cuda.c new file mode 100644 index 0000000..5a9a70f --- /dev/null +++ b/cuda_functions/nms_3D/src/nms_cuda.c @@ -0,0 +1,67 @@ +// ------------------------------------------------------------------ +// Faster R-CNN +// Copyright (c) 2015 Microsoft +// Licensed under The MIT License [see fast-rcnn/LICENSE for details] +// Written by Shaoqing Ren +// ------------------------------------------------------------------ 
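The `cpu_nms` routine above is standard greedy non-maximum suppression over 3D boxes `[x1, y1, x2, y2, z1, z2]`: boxes are visited in descending score order, and any lower-scoring box whose IoU with a kept box reaches the threshold is suppressed. A minimal NumPy sketch of the same logic (the function name and interface are illustrative, not part of the toolkit; extents use the same `+1` convention as the C code):

```python
import numpy as np

def nms_3d(boxes, scores, thresh):
    """Greedy 3D NMS mirroring cpu_nms: returns indices of kept boxes."""
    x1, y1, x2, y2, z1, z2 = boxes.T
    # volumes with the +1 extent convention used in the C code
    areas = (x2 - x1 + 1) * (y2 - y1 + 1) * (z2 - z1 + 1)
    order = scores.argsort()[::-1]  # descending score order
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # intersection volume of box i with every remaining box
        w = np.maximum(0.0, np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]) + 1)
        h = np.maximum(0.0, np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]) + 1)
        d = np.maximum(0.0, np.minimum(z2[i], z2[rest]) - np.maximum(z1[i], z1[rest]) + 1)
        inter = w * h * d
        iou = inter / (areas[i] + areas[rest] - inter)
        # suppress when iou >= thresh, matching `ovr >= nms_overlap_thresh`
        order = rest[iou < thresh]
    return keep
```

The GPU path (`nms_kernel` / `gpu_nms`) computes the same pairwise IoUs, but records suppression decisions as 64-bit masks per block of boxes, which the host then reduces into the final keep list.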
+#include +#include +#include +#include + +#include "cuda/nms_kernel.h" + + +extern THCState *state; + +int gpu_nms(THLongTensor * keep, THLongTensor* num_out, THCudaTensor * boxes, float nms_overlap_thresh) { + // boxes has to be sorted + THArgCheck(THLongTensor_isContiguous(keep), 0, "boxes must be contiguous"); + THArgCheck(THCudaTensor_isContiguous(state, boxes), 2, "boxes must be contiguous"); + // Number of ROIs + int boxes_num = THCudaTensor_size(state, boxes, 0); + int boxes_dim = THCudaTensor_size(state, boxes, 1); + + float* boxes_flat = THCudaTensor_data(state, boxes); + + const int col_blocks = DIVUP(boxes_num, threadsPerBlock); + THCudaLongTensor * mask = THCudaLongTensor_newWithSize2d(state, boxes_num, col_blocks); + unsigned long long* mask_flat = THCudaLongTensor_data(state, mask); + + _nms(boxes_num, boxes_flat, mask_flat, nms_overlap_thresh); + + THLongTensor * mask_cpu = THLongTensor_newWithSize2d(boxes_num, col_blocks); + THLongTensor_copyCuda(state, mask_cpu, mask); + THCudaLongTensor_free(state, mask); + + unsigned long long * mask_cpu_flat = THLongTensor_data(mask_cpu); + + THLongTensor * remv_cpu = THLongTensor_newWithSize1d(col_blocks); + unsigned long long* remv_cpu_flat = THLongTensor_data(remv_cpu); + THLongTensor_fill(remv_cpu, 0); + + long * keep_flat = THLongTensor_data(keep); + long num_to_keep = 0; + + int i, j; + for (i = 0; i < boxes_num; i++) { + int nblock = i / threadsPerBlock; + int inblock = i % threadsPerBlock; + + if (!(remv_cpu_flat[nblock] & (1ULL << inblock))) { + keep_flat[num_to_keep++] = i; + unsigned long long *p = &mask_cpu_flat[0] + i * col_blocks; + for (j = nblock; j < col_blocks; j++) { + remv_cpu_flat[j] |= p[j]; + } + } + } + + long * num_out_flat = THLongTensor_data(num_out); + * num_out_flat = num_to_keep; + + THLongTensor_free(mask_cpu); + THLongTensor_free(remv_cpu); + + return 1; +} diff --git a/cuda_functions/nms_3D/src/nms_cuda.h b/cuda_functions/nms_3D/src/nms_cuda.h new file mode 100644 index 
0000000..08bf147 --- /dev/null +++ b/cuda_functions/nms_3D/src/nms_cuda.h @@ -0,0 +1 @@ +int gpu_nms(THLongTensor * keep_out, THLongTensor* num_out, THCudaTensor * boxes, float nms_overlap_thresh); diff --git a/readme.txt b/cuda_functions/roi_align_2D/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/roi_align_2D/__init__.py diff --git a/cuda_functions/roi_align_2D/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_2D/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..6a821bb Binary files /dev/null and b/cuda_functions/roi_align_2D/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_2D/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_2D/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..385ecda Binary files /dev/null and b/cuda_functions/roi_align_2D/__pycache__/__init__.cpython-36.pyc differ diff --git a/readme.txt b/cuda_functions/roi_align_2D/roi_align/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/roi_align_2D/roi_align/__init__.py diff --git a/cuda_functions/roi_align_2D/roi_align/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_2D/roi_align/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..438fada Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_2D/roi_align/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_2D/roi_align/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..5611b92 Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/roi_align_2D/roi_align/__pycache__/crop_and_resize.cpython-35.pyc b/cuda_functions/roi_align_2D/roi_align/__pycache__/crop_and_resize.cpython-35.pyc new file mode 100644 index 0000000..e23974d Binary files /dev/null and 
b/cuda_functions/roi_align_2D/roi_align/__pycache__/crop_and_resize.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_2D/roi_align/__pycache__/crop_and_resize.cpython-36.pyc b/cuda_functions/roi_align_2D/roi_align/__pycache__/crop_and_resize.cpython-36.pyc new file mode 100644 index 0000000..ca931d9 Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/__pycache__/crop_and_resize.cpython-36.pyc differ diff --git a/readme.txt b/cuda_functions/roi_align_2D/roi_align/_ext/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/roi_align_2D/roi_align/_ext/__init__.py diff --git a/cuda_functions/roi_align_2D/roi_align/_ext/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_2D/roi_align/_ext/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..080f7b4 Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/_ext/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_2D/roi_align/_ext/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_2D/roi_align/_ext/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..1a5aa20 Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/_ext/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__init__.py b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__init__.py new file mode 100644 index 0000000..4486c09 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__init__.py @@ -0,0 +1,15 @@ + +from torch.utils.ffi import _wrap_function +from ._crop_and_resize import lib as _lib, ffi as _ffi + +__all__ = [] +def _import_symbols(locals): + for symbol in dir(_lib): + fn = getattr(_lib, symbol) + if callable(fn): + locals[symbol] = _wrap_function(fn, _ffi) + else: + locals[symbol] = fn + __all__.append(symbol) + +_import_symbols(locals()) diff --git 
a/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..27f3502 Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..972175c Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/_crop_and_resize.so b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/_crop_and_resize.so new file mode 100755 index 0000000..e852f11 Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/_ext/crop_and_resize/_crop_and_resize.so differ diff --git a/cuda_functions/roi_align_2D/roi_align/build.py b/cuda_functions/roi_align_2D/roi_align/build.py new file mode 100755 index 0000000..3798d82 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/build.py @@ -0,0 +1,40 @@ +import os +import torch +from torch.utils.ffi import create_extension + + +sources = ['src/crop_and_resize.c'] +headers = ['src/crop_and_resize.h'] +defines = [] +with_cuda = False + +extra_objects = [] +if torch.cuda.is_available(): + print('Including CUDA code.') + sources += ['src/crop_and_resize_gpu.c'] + headers += ['src/crop_and_resize_gpu.h'] + defines += [('WITH_CUDA', None)] + extra_objects += ['src/cuda/crop_and_resize_kernel.cu.o'] + with_cuda = True + +extra_compile_args = ['-fopenmp', '-std=c99'] + +this_file = os.path.dirname(os.path.realpath(__file__)) +print(this_file) +sources = [os.path.join(this_file, fname) for fname in sources] +headers = 
[os.path.join(this_file, fname) for fname in headers] +extra_objects = [os.path.join(this_file, fname) for fname in extra_objects] + +ffi = create_extension( + '_ext.crop_and_resize', + headers=headers, + sources=sources, + define_macros=defines, + relative_to=__file__, + with_cuda=with_cuda, + extra_objects=extra_objects, + extra_compile_args=extra_compile_args +) + +if __name__ == '__main__': + ffi.build() diff --git a/cuda_functions/roi_align_2D/roi_align/crop_and_resize.py b/cuda_functions/roi_align_2D/roi_align/crop_and_resize.py new file mode 100755 index 0000000..4291ae4 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/crop_and_resize.py @@ -0,0 +1,66 @@ +import math +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch.autograd import Function + +from ._ext import crop_and_resize as _backend + + +class CropAndResizeFunction(Function): + + def __init__(self, crop_height, crop_width, extrapolation_value=0): + self.crop_height = crop_height + self.crop_width = crop_width + self.extrapolation_value = extrapolation_value + + def forward(self, image, boxes, box_ind): + crops = torch.zeros_like(image) + if image.is_cuda: + _backend.crop_and_resize_gpu_forward( + image, boxes, box_ind, + self.extrapolation_value, self.crop_height, self.crop_width, crops) + else: + _backend.crop_and_resize_forward( + image, boxes, box_ind, + self.extrapolation_value, self.crop_height, self.crop_width, crops) + + # save for backward + self.im_size = image.size() + self.save_for_backward(boxes, box_ind) + + return crops + + def backward(self, grad_outputs): + boxes, box_ind = self.saved_tensors + + grad_outputs = grad_outputs.contiguous() + grad_image = torch.zeros_like(grad_outputs).resize_(*self.im_size) + + if grad_outputs.is_cuda: + _backend.crop_and_resize_gpu_backward( + grad_outputs, boxes, box_ind, grad_image + ) + else: + _backend.crop_and_resize_backward( + grad_outputs, boxes, box_ind, grad_image + ) + + return grad_image, None, None + + 
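`CropAndResizeFunction` above dispatches to the C/CPU and CUDA backends, which compute a TensorFlow-style crop_and_resize: each box, given in normalized `[y1, x1, y2, x2]` coordinates, is bilinearly sampled into a fixed-size crop, with out-of-image samples set to `extrapolation_value`. A NumPy sketch of the forward pass for a single channel (function name is illustrative; it mirrors the sampling in `src/crop_and_resize.c`):

```python
import math
import numpy as np

def crop_and_resize_2d(image, box, crop_h, crop_w, extrapolation_value=0.0):
    """Bilinearly sample one (H, W) channel into a (crop_h, crop_w) crop."""
    H, W = image.shape
    y1, x1, y2, x2 = box  # normalized [0, 1] coordinates
    crop = np.full((crop_h, crop_w), extrapolation_value, dtype=np.float64)
    # spacing of sample points in source-image pixel units
    h_scale = (y2 - y1) * (H - 1) / (crop_h - 1) if crop_h > 1 else 0.0
    w_scale = (x2 - x1) * (W - 1) / (crop_w - 1) if crop_w > 1 else 0.0
    for y in range(crop_h):
        in_y = y1 * (H - 1) + y * h_scale if crop_h > 1 else 0.5 * (y1 + y2) * (H - 1)
        if in_y < 0 or in_y > H - 1:
            continue  # row stays at extrapolation_value
        y0, y1i = int(math.floor(in_y)), int(math.ceil(in_y))
        y_lerp = in_y - y0
        for x in range(crop_w):
            in_x = x1 * (W - 1) + x * w_scale if crop_w > 1 else 0.5 * (x1 + x2) * (W - 1)
            if in_x < 0 or in_x > W - 1:
                continue  # sample stays at extrapolation_value
            x0, x1i = int(math.floor(in_x)), int(math.ceil(in_x))
            x_lerp = in_x - x0
            # bilinear interpolation between the four neighbouring pixels
            top = image[y0, x0] + (image[y0, x1i] - image[y0, x0]) * x_lerp
            bot = image[y1i, x0] + (image[y1i, x1i] - image[y1i, x0]) * x_lerp
            crop[y, x] = top + (bot - top) * y_lerp
    return crop
```

The backward pass (`crop_and_resize_backward` and the CUDA backprop kernel) scatters each crop gradient back to the same four neighbouring pixels with the same lerp weights, using `atomicAdd` on the GPU since boxes may overlap.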
+class CropAndResize(nn.Module): + """ + Crop and resize ported from tensorflow + See more details on https://www.tensorflow.org/api_docs/python/tf/image/crop_and_resize + """ + + def __init__(self, crop_height, crop_width, extrapolation_value=0): + super(CropAndResize, self).__init__() + + self.crop_height = crop_height + self.crop_width = crop_width + self.extrapolation_value = extrapolation_value + + def forward(self, image, boxes, box_ind): + return CropAndResizeFunction(self.crop_height, self.crop_width, self.extrapolation_value)(image, boxes, box_ind) diff --git a/cuda_functions/roi_align_2D/roi_align/roi_align.py b/cuda_functions/roi_align_2D/roi_align/roi_align.py new file mode 100644 index 0000000..6931539 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/roi_align.py @@ -0,0 +1,48 @@ +import torch +from torch import nn + +from .crop_and_resize import CropAndResizeFunction, CropAndResize + + +class RoIAlign(nn.Module): + + def __init__(self, crop_height, crop_width, extrapolation_value=0, transform_fpcoor=True): + super(RoIAlign, self).__init__() + + self.crop_height = crop_height + self.crop_width = crop_width + self.extrapolation_value = extrapolation_value + self.transform_fpcoor = transform_fpcoor + + def forward(self, featuremap, boxes, box_ind): + """ + RoIAlign based on crop_and_resize. 
+ See more details on https://github.com/ppwwyyxx/tensorpack/blob/6d5ba6a970710eaaa14b89d24aace179eb8ee1af/examples/FasterRCNN/model.py#L301 + :param featuremap: NxCxHxW + :param boxes: Mx4 float box with (x1, y1, x2, y2) **without normalization** + :param box_ind: M + :return: MxCxoHxoW + """ + x1, y1, x2, y2 = torch.split(boxes, 1, dim=1) + image_height, image_width = featuremap.size()[2:4] + + if self.transform_fpcoor: + spacing_w = (x2 - x1) / float(self.crop_width) + spacing_h = (y2 - y1) / float(self.crop_height) + + nx0 = (x1 + spacing_w / 2 - 0.5) / float(image_width - 1) + ny0 = (y1 + spacing_h / 2 - 0.5) / float(image_height - 1) + nw = spacing_w * float(self.crop_width - 1) / float(image_width - 1) + nh = spacing_h * float(self.crop_height - 1) / float(image_height - 1) + + boxes = torch.cat((ny0, nx0, ny0 + nh, nx0 + nw), 1) + else: + x1 = x1 / float(image_width - 1) + x2 = x2 / float(image_width - 1) + y1 = y1 / float(image_height - 1) + y2 = y2 / float(image_height - 1) + boxes = torch.cat((y1, x1, y2, x2), 1) + + boxes = boxes.detach().contiguous() + box_ind = box_ind.detach() + return CropAndResizeFunction(self.crop_height, self.crop_width, self.extrapolation_value)(featuremap, boxes, box_ind) diff --git a/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize.c b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize.c new file mode 100644 index 0000000..e1fce67 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize.c @@ -0,0 +1,252 @@ +#include +#include +#include + + +void CropAndResizePerBox( + const float * image_data, + const int batch_size, + const int depth, + const int image_height, + const int image_width, + + const float * boxes_data, + const int * box_index_data, + const int start_box, + const int limit_box, + + float * corps_data, + const int crop_height, + const int crop_width, + const float extrapolation_value +) { + const int image_channel_elements = image_height * image_width; + const int image_elements 
= depth * image_channel_elements; + + const int channel_elements = crop_height * crop_width; + const int crop_elements = depth * channel_elements; + + int b; + #pragma omp parallel for + for (b = start_box; b < limit_box; ++b) { + const float * box = boxes_data + b * 4; + const float y1 = box[0]; + const float x1 = box[1]; + const float y2 = box[2]; + const float x2 = box[3]; + + const int b_in = box_index_data[b]; + if (b_in < 0 || b_in >= batch_size) { + printf("Error: batch_index %d out of range [0, %d)\n", b_in, batch_size); + exit(-1); + } + + const float height_scale = + (crop_height > 1) + ? (y2 - y1) * (image_height - 1) / (crop_height - 1) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width - 1) / (crop_width - 1) + : 0; + + for (int y = 0; y < crop_height; ++y) + { + const float in_y = (crop_height > 1) + ? y1 * (image_height - 1) + y * height_scale + : 0.5 * (y1 + y2) * (image_height - 1); + + if (in_y < 0 || in_y > image_height - 1) + { + for (int x = 0; x < crop_width; ++x) + { + for (int d = 0; d < depth; ++d) + { + // crops(b, y, x, d) = extrapolation_value; + corps_data[crop_elements * b + channel_elements * d + y * crop_width + x] = extrapolation_value; + } + } + continue; + } + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + for (int x = 0; x < crop_width; ++x) + { + const float in_x = (crop_width > 1) + ? 
x1 * (image_width - 1) + x * width_scale + : 0.5 * (x1 + x2) * (image_width - 1); + if (in_x < 0 || in_x > image_width - 1) + { + for (int d = 0; d < depth; ++d) + { + corps_data[crop_elements * b + channel_elements * d + y * crop_width + x] = extrapolation_value; + } + continue; + } + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + for (int d = 0; d < depth; ++d) + { + const float *pimage = image_data + b_in * image_elements + d * image_channel_elements; + + const float top_left = pimage[top_y_index * image_width + left_x_index]; + const float top_right = pimage[top_y_index * image_width + right_x_index]; + const float bottom_left = pimage[bottom_y_index * image_width + left_x_index]; + const float bottom_right = pimage[bottom_y_index * image_width + right_x_index]; + + const float top = top_left + (top_right - top_left) * x_lerp; + const float bottom = + bottom_left + (bottom_right - bottom_left) * x_lerp; + + corps_data[crop_elements * b + channel_elements * d + y * crop_width + x] = top + (bottom - top) * y_lerp; + } + } // end for x + } // end for y + } // end for b + +} + + +void crop_and_resize_forward( + THFloatTensor * image, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + THFloatTensor * crops +) { + const int batch_size = image->size[0]; + const int depth = image->size[1]; + const int image_height = image->size[2]; + const int image_width = image->size[3]; + + const int num_boxes = boxes->size[0]; + + // init output space + THFloatTensor_resize4d(crops, num_boxes, depth, crop_height, crop_width); + THFloatTensor_zero(crops); + + // crop_and_resize for each box + CropAndResizePerBox( + THFloatTensor_data(image), + batch_size, + depth, + image_height, + image_width, + + THFloatTensor_data(boxes), + THIntTensor_data(box_index), + 0, + 
num_boxes, + + THFloatTensor_data(crops), + crop_height, + crop_width, + extrapolation_value + ); + +} + + +void crop_and_resize_backward( + THFloatTensor * grads, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor * box_index, // range in [0, batch_size) + THFloatTensor * grads_image // resize to [bsize, c, hc, wc] +) +{ + // shape + const int batch_size = grads_image->size[0]; + const int depth = grads_image->size[1]; + const int image_height = grads_image->size[2]; + const int image_width = grads_image->size[3]; + + const int num_boxes = grads->size[0]; + const int crop_height = grads->size[2]; + const int crop_width = grads->size[3]; + + // n_elements + const int image_channel_elements = image_height * image_width; + const int image_elements = depth * image_channel_elements; + + const int channel_elements = crop_height * crop_width; + const int crop_elements = depth * channel_elements; + + // init output space + THFloatTensor_zero(grads_image); + + // data pointer + const float * grads_data = THFloatTensor_data(grads); + const float * boxes_data = THFloatTensor_data(boxes); + const int * box_index_data = THIntTensor_data(box_index); + float * grads_image_data = THFloatTensor_data(grads_image); + + for (int b = 0; b < num_boxes; ++b) { + const float * box = boxes_data + b * 4; + const float y1 = box[0]; + const float x1 = box[1]; + const float y2 = box[2]; + const float x2 = box[3]; + + const int b_in = box_index_data[b]; + if (b_in < 0 || b_in >= batch_size) { + printf("Error: batch_index %d out of range [0, %d)\n", b_in, batch_size); + exit(-1); + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height - 1) / (crop_height - 1) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width - 1) / (crop_width - 1) + : 0; + + for (int y = 0; y < crop_height; ++y) + { + const float in_y = (crop_height > 1) + ? 
y1 * (image_height - 1) + y * height_scale + : 0.5 * (y1 + y2) * (image_height - 1); + if (in_y < 0 || in_y > image_height - 1) + { + continue; + } + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + for (int x = 0; x < crop_width; ++x) + { + const float in_x = (crop_width > 1) + ? x1 * (image_width - 1) + x * width_scale + : 0.5 * (x1 + x2) * (image_width - 1); + if (in_x < 0 || in_x > image_width - 1) + { + continue; + } + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + for (int d = 0; d < depth; ++d) + { + float *pimage = grads_image_data + b_in * image_elements + d * image_channel_elements; + const float grad_val = grads_data[crop_elements * b + channel_elements * d + y * crop_width + x]; + + const float dtop = (1 - y_lerp) * grad_val; + pimage[top_y_index * image_width + left_x_index] += (1 - x_lerp) * dtop; + pimage[top_y_index * image_width + right_x_index] += x_lerp * dtop; + + const float dbottom = y_lerp * grad_val; + pimage[bottom_y_index * image_width + left_x_index] += (1 - x_lerp) * dbottom; + pimage[bottom_y_index * image_width + right_x_index] += x_lerp * dbottom; + } // end d + } // end x + } // end y + } // end b +} \ No newline at end of file diff --git a/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize.h b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize.h new file mode 100644 index 0000000..d494865 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize.h @@ -0,0 +1,16 @@ +void crop_and_resize_forward( + THFloatTensor * image, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + THFloatTensor * crops +); + +void crop_and_resize_backward( + THFloatTensor * grads, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor 
* box_index, // range in [0, batch_size) + THFloatTensor * grads_image // resize to [bsize, c, hc, wc] +); \ No newline at end of file diff --git a/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize_gpu.c b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize_gpu.c new file mode 100644 index 0000000..dd347c6 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize_gpu.c @@ -0,0 +1,68 @@ +#include +#include "cuda/crop_and_resize_kernel.h" + +extern THCState *state; + + +void crop_and_resize_gpu_forward( + THCudaTensor * image, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + THCudaTensor * crops +) { + const int batch_size = THCudaTensor_size(state, image, 0); + const int depth = THCudaTensor_size(state, image, 1); + const int image_height = THCudaTensor_size(state, image, 2); + const int image_width = THCudaTensor_size(state, image, 3); + + const int num_boxes = THCudaTensor_size(state, boxes, 0); + + // init output space + THCudaTensor_resize4d(state, crops, num_boxes, depth, crop_height, crop_width); + THCudaTensor_zero(state, crops); + cudaStream_t stream = THCState_getCurrentStream(state); + CropAndResizeLaucher( + THCudaTensor_data(state, image), + THCudaTensor_data(state, boxes), + THCudaIntTensor_data(state, box_index), + num_boxes, batch_size, image_height, image_width, + crop_height, crop_width, depth, extrapolation_value, + THCudaTensor_data(state, crops), + stream + ); +} + + +void crop_and_resize_gpu_backward( + THCudaTensor * grads, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + THCudaTensor * grads_image // resize to [bsize, c, hc, wc] +) { + // shape + const int batch_size = THCudaTensor_size(state, grads_image, 0); + const int depth = THCudaTensor_size(state, grads_image, 1); + const int image_height = 
THCudaTensor_size(state, grads_image, 2); + const int image_width = THCudaTensor_size(state, grads_image, 3); + + const int num_boxes = THCudaTensor_size(state, grads, 0); + const int crop_height = THCudaTensor_size(state, grads, 2); + const int crop_width = THCudaTensor_size(state, grads, 3); + + // init output space + THCudaTensor_zero(state, grads_image); + + cudaStream_t stream = THCState_getCurrentStream(state); + CropAndResizeBackpropImageLaucher( + THCudaTensor_data(state, grads), + THCudaTensor_data(state, boxes), + THCudaIntTensor_data(state, box_index), + num_boxes, batch_size, image_height, image_width, + crop_height, crop_width, depth, + THCudaTensor_data(state, grads_image), + stream + ); +} \ No newline at end of file diff --git a/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize_gpu.h b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize_gpu.h new file mode 100644 index 0000000..c2a64cf --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/crop_and_resize_gpu.h @@ -0,0 +1,16 @@ +void crop_and_resize_gpu_forward( + THCudaTensor * image, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + THCudaTensor * crops +); + +void crop_and_resize_gpu_backward( + THCudaTensor * grads, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + THCudaTensor * grads_image // resize to [bsize, c, hc, wc] +); \ No newline at end of file diff --git a/cuda_functions/roi_align_2D/roi_align/src/cuda/backup.cu b/cuda_functions/roi_align_2D/roi_align/src/cuda/backup.cu new file mode 100644 index 0000000..3a1ab8b --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/cuda/backup.cu @@ -0,0 +1,243 @@ +#include +#include +#include "crop_and_resize_kernel.h" + +#define CUDA_1D_KERNEL_LOOP(i, n) \ +for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \ + i += 
blockDim.x * gridDim.x) + + +__global__ +void CropAndResizeKernel( + const int nthreads, const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float extrapolation_value, float *crops_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) + int idx = out_idx; + const int x = idx % crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + const int d = idx % depth; + const int b = idx / depth; + + const float y1 = boxes_ptr[b * 4]; + const float x1 = boxes_ptr[b * 4 + 1]; + const float y2 = boxes_ptr[b * 4 + 2]; + const float x2 = boxes_ptr[b * 4 + 3]; + + // printf("INIT CUDA SCRIPT %f \n", idx); + + const int b_in = box_ind_ptr[b]; + if (b_in < 0 || b_in >= batch) + { + continue; + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height - 1) / (crop_height - 1) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width - 1) / (crop_width - 1) : 0; + + const float in_y = (crop_height > 1) + ? y1 * (image_height - 1) + y * height_scale + : 0.5 * (y1 + y2) * (image_height - 1); + if (in_y < 0 || in_y > image_height - 1) + { + crops_ptr[out_idx] = extrapolation_value; + continue; + } + + const float in_x = (crop_width > 1) + ? 
x1 * (image_width - 1) + x * width_scale + : 0.5 * (x1 + x2) * (image_width - 1); + if (in_x < 0 || in_x > image_width - 1) + { + crops_ptr[out_idx] = extrapolation_value; + continue; + } + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + const float *pimage = image_ptr + (b_in * depth + d) * image_height * image_width; + const float top_left = pimage[top_y_index * image_width + left_x_index]; + const float top_right = pimage[top_y_index * image_width + right_x_index]; + const float bottom_left = pimage[bottom_y_index * image_width + left_x_index]; + const float bottom_right = pimage[bottom_y_index * image_width + right_x_index]; + // if (top_left == 0){ + // const float top = top_right} + // elif (top_right == 0){ + // const float top = top_left} + // else{ + const float top = top_left + (top_right - top_left) * x_lerp; + //} + + //if (bottom_left == 0){ + // const float bottom = bottom_right} + // elif (bottom_right == 0){ + // const float bottom = bottom_left} + // else{ + const float bottom = bottom_left + (bottom_right - bottom_left) * x_lerp; + //} + + //if (top == 0){ + // crops_ptr[out_idx] = bottom } + // elif (bottom == 0){ + // crops_ptr[out_idx] = top + //} + // else{ + crops_ptr[out_idx] = top + (bottom - top) * y_lerp; + //} + } +} + +__global__ +void CropAndResizeBackpropImageKernel( + const int nthreads, const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float *grads_image_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) + int idx = out_idx; + const int x = idx % 
crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + const int d = idx % depth; + const int b = idx / depth; + + const float y1 = boxes_ptr[b * 4]; + const float x1 = boxes_ptr[b * 4 + 1]; + const float y2 = boxes_ptr[b * 4 + 2]; + const float x2 = boxes_ptr[b * 4 + 3]; + + const int b_in = box_ind_ptr[b]; + if (b_in < 0 || b_in >= batch) + { + continue; + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height - 1) / (crop_height - 1) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width - 1) / (crop_width - 1) : 0; + + const float in_y = (crop_height > 1) + ? y1 * (image_height - 1) + y * height_scale + : 0.5 * (y1 + y2) * (image_height - 1); + if (in_y < 0 || in_y > image_height - 1) + { + continue; + } + + const float in_x = (crop_width > 1) + ? x1 * (image_width - 1) + x * width_scale + : 0.5 * (x1 + x2) * (image_width - 1); + if (in_x < 0 || in_x > image_width - 1) + { + continue; + } + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + float *pimage = grads_image_ptr + (b_in * depth + d) * image_height * image_width; + const float dtop = (1 - y_lerp) * grads_ptr[out_idx]; + atomicAdd( + pimage + top_y_index * image_width + left_x_index, + (1 - x_lerp) * dtop + ); + atomicAdd( + pimage + top_y_index * image_width + right_x_index, + x_lerp * dtop + ); + + const float dbottom = y_lerp * grads_ptr[out_idx]; + atomicAdd( + pimage + bottom_y_index * image_width + left_x_index, + (1 - x_lerp) * dbottom + ); + atomicAdd( + pimage + bottom_y_index * image_width + right_x_index, + x_lerp * dbottom + ); + } +} + + +void CropAndResizeLaucher( + const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, 
int crop_height, int crop_width, int depth, + float extrapolation_value, float *crops_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, image_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, + crop_height, crop_width, depth, extrapolation_value, crops_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} + + +void CropAndResizeBackpropImageLaucher( + const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float *grads_image_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeBackpropImageKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, grads_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, + crop_height, crop_width, depth, grads_image_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} \ No newline at end of file diff --git a/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.cu b/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.cu new file mode 100644 index 0000000..0702551 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.cu @@ -0,0 +1,250 @@ +#include <stdio.h> +#include <math.h> +#include "crop_and_resize_kernel.h" + +#define 
CUDA_1D_KERNEL_LOOP(i, n) \ +for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \ + i += blockDim.x * gridDim.x) + + +__global__ +void CropAndResizeKernel( + const int nthreads, const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float extrapolation_value, float *crops_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) + int idx = out_idx; + //printf("start %i \n", idx); + const int x = idx % crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + const int d = idx % depth; + const int b = idx / depth; + + const float y1 = boxes_ptr[b * 4]; + const float x1 = boxes_ptr[b * 4 + 1]; + const float y2 = boxes_ptr[b * 4 + 2]; + const float x2 = boxes_ptr[b * 4 + 3]; + + const int b_in = box_ind_ptr[b]; + if (b_in < 0 || b_in >= batch) + { + continue; + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height) / (crop_height) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width) / (crop_width) : 0; + + + float tmp_in_y = (crop_height > 1) + ? y1 * (image_height ) + y * height_scale + height_scale/2 - 0.5 + : 0.5 * (y1 + y2) * (image_height); + + if (tmp_in_y > image_height - 1) + { + tmp_in_y = image_height - 1; + } + if (tmp_in_y < 0) + { + tmp_in_y = 0; + } + const float in_y = tmp_in_y; + + float tmp_in_x = (crop_width > 1) + ? 
x1 * (image_width ) + x * width_scale + width_scale/2 - 0.5 + : 0.5 * (x1 + x2) * (image_width ); + + if (tmp_in_x > image_width - 1) + { + tmp_in_x = image_width - 1; + } + if (tmp_in_x < 0) + { + tmp_in_x= 0; + } + const float in_x = tmp_in_x; + + //printf("height_scale %f \n", height_scale); + //printf("width_scale %f \n", width_scale); + //printf("in_x %f \n", in_x); + //printf("in_y %f \n", in_y); + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + const float *pimage = image_ptr + (b_in * depth + d) * image_height * image_width; + const float top_left = pimage[top_y_index * image_width + left_x_index]; + const float top_right = pimage[top_y_index * image_width + right_x_index]; + const float bottom_left = pimage[bottom_y_index * image_width + left_x_index]; + const float bottom_right = pimage[bottom_y_index * image_width + right_x_index]; + + const float top = top_left + (top_right - top_left) * x_lerp; + const float bottom = bottom_left + (bottom_right - bottom_left) * x_lerp; + crops_ptr[out_idx] = top + (bottom - top) * y_lerp; + } +} + +__global__ +void CropAndResizeBackpropImageKernel( + const int nthreads, const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float *grads_image_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) + int idx = out_idx; + const int x = idx % crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + const int d = idx % depth; + const int b = idx / depth; + + const float y1 = boxes_ptr[b * 4]; + const float x1 = 
boxes_ptr[b * 4 + 1]; + const float y2 = boxes_ptr[b * 4 + 2]; + const float x2 = boxes_ptr[b * 4 + 3]; + + const int b_in = box_ind_ptr[b]; + if (b_in < 0 || b_in >= batch) + { + continue; + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height ) / (crop_height ) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width ) / (crop_width ) : 0; + + float tmp_in_y = (crop_height > 1) + ? y1 * (image_height ) + y * height_scale + height_scale/2 - 0.5 + : 0.5 * (y1 + y2) * (image_height); + + if (tmp_in_y > image_height - 1) + { + tmp_in_y = image_height - 1; + } + if (tmp_in_y < 0) + { + tmp_in_y = 0; + } + const float in_y = tmp_in_y; + + float tmp_in_x = (crop_width > 1) + ? x1 * (image_width ) + x * width_scale + width_scale/2 - 0.5 + : 0.5 * (x1 + x2) * (image_width ); + + if (tmp_in_x > image_width - 1) + { + tmp_in_x = image_width - 1; + } + if (tmp_in_x < 0) + { + tmp_in_x= 0; + } + const float in_x = tmp_in_x; + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + float *pimage = grads_image_ptr + (b_in * depth + d) * image_height * image_width; + const float dtop = (1 - y_lerp) * grads_ptr[out_idx]; + atomicAdd( + pimage + top_y_index * image_width + left_x_index, + (1 - x_lerp) * dtop + ); + atomicAdd( + pimage + top_y_index * image_width + right_x_index, + x_lerp * dtop + ); + + const float dbottom = y_lerp * grads_ptr[out_idx]; + atomicAdd( + pimage + bottom_y_index * image_width + left_x_index, + (1 - x_lerp) * dbottom + ); + atomicAdd( + pimage + bottom_y_index * image_width + right_x_index, + x_lerp * dbottom + ); + } +} + + +void CropAndResizeLaucher( + const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int 
crop_height, int crop_width, int depth, + float extrapolation_value, float *crops_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, image_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, + crop_height, crop_width, depth, extrapolation_value, crops_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError in Roi Align () failed : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} + + +void CropAndResizeBackpropImageLaucher( + const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float *grads_image_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeBackpropImageKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, grads_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, + crop_height, crop_width, depth, grads_image_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError() failed in Roi Align : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} \ No newline at end of file diff --git a/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.cu.o b/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.cu.o new file mode 100644 index 0000000..2f1a1b9 Binary files /dev/null and b/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.cu.o differ diff --git 
a/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.h b/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.h new file mode 100644 index 0000000..893aee1 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/cuda/crop_and_resize_kernel.h @@ -0,0 +1,24 @@ +#ifndef _CropAndResize_Kernel +#define _CropAndResize_Kernel + +#ifdef __cplusplus +extern "C" { +#endif + +void CropAndResizeLaucher( + const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float extrapolation_value, float *crops_ptr, cudaStream_t stream); + +void CropAndResizeBackpropImageLaucher( + const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float *grads_image_ptr, cudaStream_t stream); + +#ifdef __cplusplus +} +#endif + +#endif \ No newline at end of file diff --git a/cuda_functions/roi_align_2D/roi_align/src/cuda/fix.cu b/cuda_functions/roi_align_2D/roi_align/src/cuda/fix.cu new file mode 100644 index 0000000..6eea4a8 --- /dev/null +++ b/cuda_functions/roi_align_2D/roi_align/src/cuda/fix.cu @@ -0,0 +1,243 @@ +#include <stdio.h> +#include <math.h> +#include "crop_and_resize_kernel.h" + +#define CUDA_1D_KERNEL_LOOP(i, n) \ +for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \ + i += blockDim.x * gridDim.x) + + +__global__ +void CropAndResizeKernel( + const int nthreads, const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float extrapolation_value, float *crops_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) + int idx = out_idx; + const 
int x = idx % crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + const int d = idx % depth; + const int b = idx / depth; + + const float y1 = boxes_ptr[b * 4]; + const float x1 = boxes_ptr[b * 4 + 1]; + const float y2 = boxes_ptr[b * 4 + 2]; + const float x2 = boxes_ptr[b * 4 + 3]; + + // printf("INIT CUDA SCRIPT %f \n", idx); + + const int b_in = box_ind_ptr[b]; + if (b_in < 0 || b_in >= batch) + { + continue; + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height ) / (crop_height ) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width) / (crop_width ) : 0; + + const float in_y = (crop_height > 1) + ? y1 * (image_height ) + y * height_scale + height_scale/2 - 0.5 + : 0.5 * (y1 + y2) * (image_height ); + if (in_y < 0 || in_y > image_height ) + { + crops_ptr[out_idx] = extrapolation_value; + continue; + } + + const float in_x = (crop_width > 1) + ? x1 * (image_width ) + x * width_scale + width_scale/2 - 0.5 + : 0.5 * (x1 + x2) * (image_width ); + if (in_x < 0 || in_x > image_width ) + { + crops_ptr[out_idx] = extrapolation_value; + continue; + } + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + const float *pimage = image_ptr + (b_in * depth + d) * image_height * image_width; + const float top_left = pimage[top_y_index * image_width + left_x_index]; + const float top_right = pimage[top_y_index * image_width + right_x_index]; + const float bottom_left = pimage[bottom_y_index * image_width + left_x_index]; + const float bottom_right = pimage[bottom_y_index * image_width + right_x_index]; + // if (top_left == 0){ + // const float top = top_right} + // elif (top_right == 0){ + // const float top = top_left} + // else{ + const float top = top_left + (top_right - top_left) 
* x_lerp; + //} + + //if (bottom_left == 0){ + // const float bottom = bottom_right} + // elif (bottom_right == 0){ + // const float bottom = bottom_left} + // else{ + const float bottom = bottom_left + (bottom_right - bottom_left) * x_lerp; + //} + + //if (top == 0){ + // crops_ptr[out_idx] = bottom } + // elif (bottom == 0){ + // crops_ptr[out_idx] = top + //} + // else{ + crops_ptr[out_idx] = top + (bottom - top) * y_lerp; + //} + } +} + +__global__ +void CropAndResizeBackpropImageKernel( + const int nthreads, const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float *grads_image_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) + int idx = out_idx; + const int x = idx % crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + const int d = idx % depth; + const int b = idx / depth; + + const float y1 = boxes_ptr[b * 4]; + const float x1 = boxes_ptr[b * 4 + 1]; + const float y2 = boxes_ptr[b * 4 + 2]; + const float x2 = boxes_ptr[b * 4 + 3]; + + const int b_in = box_ind_ptr[b]; + if (b_in < 0 || b_in >= batch) + { + continue; + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height ) / (crop_height ) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width ) / (crop_width ) : 0; + + const float in_y = (crop_height > 1) + ? y1 * (image_height ) + y * height_scale + height_scale/2 - 0.5 + : 0.5 * (y1 + y2) * (image_height ); + if (in_y < 0 || in_y > image_height ) + { + continue; + } + + const float in_x = (crop_width > 1) + ? 
x1 * (image_width ) + x * width_scale + width_scale/2 - 0.5 + : 0.5 * (x1 + x2) * (image_width ); + if (in_x < 0 || in_x > image_width ) + { + continue; + } + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + float *pimage = grads_image_ptr + (b_in * depth + d) * image_height * image_width; + const float dtop = (1 - y_lerp) * grads_ptr[out_idx]; + atomicAdd( + pimage + top_y_index * image_width + left_x_index, + (1 - x_lerp) * dtop + ); + atomicAdd( + pimage + top_y_index * image_width + right_x_index, + x_lerp * dtop + ); + + const float dbottom = y_lerp * grads_ptr[out_idx]; + atomicAdd( + pimage + bottom_y_index * image_width + left_x_index, + (1 - x_lerp) * dbottom + ); + atomicAdd( + pimage + bottom_y_index * image_width + right_x_index, + x_lerp * dbottom + ); + } +} + + +void CropAndResizeLaucher( + const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float extrapolation_value, float *crops_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, image_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, + crop_height, crop_width, depth, extrapolation_value, crops_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} + + +void CropAndResizeBackpropImageLaucher( + const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int 
num_boxes, int batch, int image_height, + int image_width, int crop_height, int crop_width, int depth, + float *grads_image_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeBackpropImageKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, grads_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, + crop_height, crop_width, depth, grads_image_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} \ No newline at end of file diff --git a/readme.txt b/cuda_functions/roi_align_3D/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/roi_align_3D/__init__.py diff --git a/cuda_functions/roi_align_3D/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_3D/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..853e83e Binary files /dev/null and b/cuda_functions/roi_align_3D/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_3D/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_3D/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..2cdfb29 Binary files /dev/null and b/cuda_functions/roi_align_3D/__pycache__/__init__.cpython-36.pyc differ diff --git a/readme.txt b/cuda_functions/roi_align_3D/roi_align/__init__.py similarity index 100% copy from readme.txt copy to cuda_functions/roi_align_3D/roi_align/__init__.py diff --git a/cuda_functions/roi_align_3D/roi_align/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_3D/roi_align/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..fa3d8d7 Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/__pycache__/__init__.cpython-35.pyc differ 
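The 2D kernels above sample each output cell at a half-pixel-shifted source coordinate (`in_y = y1 * image_height + y * height_scale + height_scale/2 - 0.5`) and bilinearly interpolate the four neighbours. As a sanity check, the same per-channel sampling can be sketched in NumPy; the function name and test box are illustrative only, not part of this repository, and out-of-range samples are filled with the extrapolation value (the non-fix kernel clamps instead):

```python
import numpy as np

def crop_and_resize_2d(image, box, crop_h, crop_w, extrapolation_value=0.0):
    """NumPy sketch of the CUDA bilinear crop-and-resize for one channel.
    box = (y1, x1, y2, x2) in normalized [0, 1] coordinates."""
    H, W = image.shape
    y1, x1, y2, x2 = box
    out = np.full((crop_h, crop_w), extrapolation_value, dtype=np.float32)
    # Scales match the kernel: (y2 - y1) * image_height / crop_height
    h_scale = (y2 - y1) * H / crop_h if crop_h > 1 else 0.0
    w_scale = (x2 - x1) * W / crop_w if crop_w > 1 else 0.0
    for y in range(crop_h):
        in_y = (y1 * H + y * h_scale + h_scale / 2 - 0.5) if crop_h > 1 \
            else 0.5 * (y1 + y2) * H
        if in_y < 0 or in_y > H - 1:   # outside the image: keep extrapolation value
            continue
        ty, by = int(np.floor(in_y)), int(np.ceil(in_y))
        y_lerp = in_y - ty
        for x in range(crop_w):
            in_x = (x1 * W + x * w_scale + w_scale / 2 - 0.5) if crop_w > 1 \
                else 0.5 * (x1 + x2) * W
            if in_x < 0 or in_x > W - 1:
                continue
            lx, rx = int(np.floor(in_x)), int(np.ceil(in_x))
            x_lerp = in_x - lx
            # Bilinear interpolation of the four neighbouring pixels
            top = image[ty, lx] + (image[ty, rx] - image[ty, lx]) * x_lerp
            bot = image[by, lx] + (image[by, rx] - image[by, lx]) * x_lerp
            out[y, x] = top + (bot - top) * y_lerp
    return out
```

For a 4x4 ramp image and the full-image box (0, 0, 1, 1) with a 2x2 crop, the samples land at source coordinates 0.5 and 2.5 on each axis, which is the half-pixel behaviour the `+ height_scale/2 - 0.5` term produces.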
diff --git a/cuda_functions/roi_align_3D/roi_align/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_3D/roi_align/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..cb9081a Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/roi_align_3D/roi_align/__pycache__/crop_and_resize.cpython-35.pyc b/cuda_functions/roi_align_3D/roi_align/__pycache__/crop_and_resize.cpython-35.pyc new file mode 100644 index 0000000..88ce998 Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/__pycache__/crop_and_resize.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_3D/roi_align/__pycache__/crop_and_resize.cpython-36.pyc b/cuda_functions/roi_align_3D/roi_align/__pycache__/crop_and_resize.cpython-36.pyc new file mode 100644 index 0000000..30d30f5 Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/__pycache__/crop_and_resize.cpython-36.pyc differ diff --git a/readme.txt b/cuda_functions/roi_align_3D/roi_align/_ext/__init__.py similarity index 100% rename from readme.txt rename to cuda_functions/roi_align_3D/roi_align/_ext/__init__.py diff --git a/cuda_functions/roi_align_3D/roi_align/_ext/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_3D/roi_align/_ext/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..d50935c Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/_ext/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_3D/roi_align/_ext/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_3D/roi_align/_ext/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..e2b65f5 Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/_ext/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/._crop_and_resize.so.swp 
b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/._crop_and_resize.so.swp new file mode 100644 index 0000000..3db0ea4 Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/._crop_and_resize.so.swp differ diff --git a/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__init__.py b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__init__.py new file mode 100644 index 0000000..4486c09 --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__init__.py @@ -0,0 +1,15 @@ + +from torch.utils.ffi import _wrap_function +from ._crop_and_resize import lib as _lib, ffi as _ffi + +__all__ = [] +def _import_symbols(locals): + for symbol in dir(_lib): + fn = getattr(_lib, symbol) + if callable(fn): + locals[symbol] = _wrap_function(fn, _ffi) + else: + locals[symbol] = fn + __all__.append(symbol) + +_import_symbols(locals()) diff --git a/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-35.pyc b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-35.pyc new file mode 100644 index 0000000..93afa7e Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-35.pyc differ diff --git a/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-36.pyc b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-36.pyc new file mode 100644 index 0000000..5dd726e Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/__pycache__/__init__.cpython-36.pyc differ diff --git a/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/_crop_and_resize.so b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/_crop_and_resize.so new file mode 100755 index 0000000..81dc147 Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/_ext/crop_and_resize/_crop_and_resize.so 
differ diff --git a/cuda_functions/roi_align_3D/roi_align/build.py b/cuda_functions/roi_align_3D/roi_align/build.py new file mode 100755 index 0000000..3798d82 --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/build.py @@ -0,0 +1,40 @@ +import os +import torch +from torch.utils.ffi import create_extension + + +sources = ['src/crop_and_resize.c'] +headers = ['src/crop_and_resize.h'] +defines = [] +with_cuda = False + +extra_objects = [] +if torch.cuda.is_available(): + print('Including CUDA code.') + sources += ['src/crop_and_resize_gpu.c'] + headers += ['src/crop_and_resize_gpu.h'] + defines += [('WITH_CUDA', None)] + extra_objects += ['src/cuda/crop_and_resize_kernel.cu.o'] + with_cuda = True + +extra_compile_args = ['-fopenmp', '-std=c99'] + +this_file = os.path.dirname(os.path.realpath(__file__)) +print(this_file) +sources = [os.path.join(this_file, fname) for fname in sources] +headers = [os.path.join(this_file, fname) for fname in headers] +extra_objects = [os.path.join(this_file, fname) for fname in extra_objects] + +ffi = create_extension( + '_ext.crop_and_resize', + headers=headers, + sources=sources, + define_macros=defines, + relative_to=__file__, + with_cuda=with_cuda, + extra_objects=extra_objects, + extra_compile_args=extra_compile_args +) + +if __name__ == '__main__': + ffi.build() diff --git a/cuda_functions/roi_align_3D/roi_align/crop_and_resize.py b/cuda_functions/roi_align_3D/roi_align/crop_and_resize.py new file mode 100755 index 0000000..cff4e90 --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/crop_and_resize.py @@ -0,0 +1,69 @@ +import math +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch.autograd import Function + +from ._ext import crop_and_resize as _backend + + +class CropAndResizeFunction(Function): + + def __init__(self, crop_height, crop_width, crop_zdepth, extrapolation_value=0): + self.crop_height = crop_height + self.crop_width = crop_width + self.crop_zdepth = crop_zdepth + 
self.extrapolation_value = extrapolation_value + + def forward(self, image, boxes, box_ind): + crops = torch.zeros_like(image) + + if image.is_cuda: + _backend.crop_and_resize_gpu_forward( + image, boxes, box_ind, + self.extrapolation_value, self.crop_height, self.crop_width, self.crop_zdepth, crops) + else: + _backend.crop_and_resize_forward( + image, boxes, box_ind, + self.extrapolation_value, self.crop_height, self.crop_width, self.crop_zdepth, crops) + + # save for backward + self.im_size = image.size() + self.save_for_backward(boxes, box_ind) + + return crops + + def backward(self, grad_outputs): + boxes, box_ind = self.saved_tensors + + grad_outputs = grad_outputs.contiguous() + grad_image = torch.zeros_like(grad_outputs).resize_(*self.im_size) + + if grad_outputs.is_cuda: + _backend.crop_and_resize_gpu_backward( + grad_outputs, boxes, box_ind, grad_image + ) + else: + _backend.crop_and_resize_backward( + grad_outputs, boxes, box_ind, grad_image + ) + + return grad_image, None, None + + +class CropAndResize(nn.Module): + """ + Crop and resize ported from tensorflow + See more details on https://www.tensorflow.org/api_docs/python/tf/image/crop_and_resize + """ + + def __init__(self, crop_height, crop_width, crop_zdepth, extrapolation_value=0): + super(CropAndResize, self).__init__() + + self.crop_height = crop_height + self.crop_width = crop_width + self.crop_zdepth = crop_zdepth + self.extrapolation_value = extrapolation_value + + def forward(self, image, boxes, box_ind): + return CropAndResizeFunction(self.crop_height, self.crop_width, self.crop_zdepth, self.extrapolation_value)(image, boxes, box_ind) diff --git a/cuda_functions/roi_align_3D/roi_align/roi_align.py b/cuda_functions/roi_align_3D/roi_align/roi_align.py new file mode 100644 index 0000000..6931539 --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/roi_align.py @@ -0,0 +1,48 @@ +import torch +from torch import nn + +from .crop_and_resize import CropAndResizeFunction, CropAndResize + + 
+class RoIAlign(nn.Module): + + def __init__(self, crop_height, crop_width, extrapolation_value=0, transform_fpcoor=True): + super(RoIAlign, self).__init__() + + self.crop_height = crop_height + self.crop_width = crop_width + self.extrapolation_value = extrapolation_value + self.transform_fpcoor = transform_fpcoor + + def forward(self, featuremap, boxes, box_ind): + """ + RoIAlign based on crop_and_resize. + See more details on https://github.com/ppwwyyxx/tensorpack/blob/6d5ba6a970710eaaa14b89d24aace179eb8ee1af/examples/FasterRCNN/model.py#L301 + :param featuremap: NxCxHxW + :param boxes: Mx4 float box with (x1, y1, x2, y2) **without normalization** + :param box_ind: M + :return: MxCxoHxoW + """ + x1, y1, x2, y2 = torch.split(boxes, 1, dim=1) + image_height, image_width = featuremap.size()[2:4] + + if self.transform_fpcoor: + spacing_w = (x2 - x1) / float(self.crop_width) + spacing_h = (y2 - y1) / float(self.crop_height) + + nx0 = (x1 + spacing_w / 2 - 0.5) / float(image_width - 1) + ny0 = (y1 + spacing_h / 2 - 0.5) / float(image_height - 1) + nw = spacing_w * float(self.crop_width - 1) / float(image_width - 1) + nh = spacing_h * float(self.crop_height - 1) / float(image_height - 1) + + boxes = torch.cat((ny0, nx0, ny0 + nh, nx0 + nw), 1) + else: + x1 = x1 / float(image_width - 1) + x2 = x2 / float(image_width - 1) + y1 = y1 / float(image_height - 1) + y2 = y2 / float(image_height - 1) + boxes = torch.cat((y1, x1, y2, x2), 1) + + boxes = boxes.detach().contiguous() + box_ind = box_ind.detach() + return CropAndResizeFunction(self.crop_height, self.crop_width, self.extrapolation_value)(featuremap, boxes, box_ind) diff --git a/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize.c b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize.c new file mode 100644 index 0000000..e1fce67 --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize.c @@ -0,0 +1,252 @@ +#include <TH/TH.h> +#include <stdio.h> +#include <math.h> + + +void CropAndResizePerBox( + const float * 
image_data, + const int batch_size, + const int depth, + const int image_height, + const int image_width, + + const float * boxes_data, + const int * box_index_data, + const int start_box, + const int limit_box, + + float * corps_data, + const int crop_height, + const int crop_width, + const float extrapolation_value +) { + const int image_channel_elements = image_height * image_width; + const int image_elements = depth * image_channel_elements; + + const int channel_elements = crop_height * crop_width; + const int crop_elements = depth * channel_elements; + + int b; + #pragma omp parallel for + for (b = start_box; b < limit_box; ++b) { + const float * box = boxes_data + b * 4; + const float y1 = box[0]; + const float x1 = box[1]; + const float y2 = box[2]; + const float x2 = box[3]; + + const int b_in = box_index_data[b]; + if (b_in < 0 || b_in >= batch_size) { + printf("Error: batch_index %d out of range [0, %d)\n", b_in, batch_size); + exit(-1); + } + + const float height_scale = + (crop_height > 1) + ? (y2 - y1) * (image_height - 1) / (crop_height - 1) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width - 1) / (crop_width - 1) + : 0; + + for (int y = 0; y < crop_height; ++y) + { + const float in_y = (crop_height > 1) + ? y1 * (image_height - 1) + y * height_scale + : 0.5 * (y1 + y2) * (image_height - 1); + + if (in_y < 0 || in_y > image_height - 1) + { + for (int x = 0; x < crop_width; ++x) + { + for (int d = 0; d < depth; ++d) + { + // crops(b, y, x, d) = extrapolation_value; + corps_data[crop_elements * b + channel_elements * d + y * crop_width + x] = extrapolation_value; + } + } + continue; + } + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + for (int x = 0; x < crop_width; ++x) + { + const float in_x = (crop_width > 1) + ? 
x1 * (image_width - 1) + x * width_scale + : 0.5 * (x1 + x2) * (image_width - 1); + if (in_x < 0 || in_x > image_width - 1) + { + for (int d = 0; d < depth; ++d) + { + corps_data[crop_elements * b + channel_elements * d + y * crop_width + x] = extrapolation_value; + } + continue; + } + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + for (int d = 0; d < depth; ++d) + { + const float *pimage = image_data + b_in * image_elements + d * image_channel_elements; + + const float top_left = pimage[top_y_index * image_width + left_x_index]; + const float top_right = pimage[top_y_index * image_width + right_x_index]; + const float bottom_left = pimage[bottom_y_index * image_width + left_x_index]; + const float bottom_right = pimage[bottom_y_index * image_width + right_x_index]; + + const float top = top_left + (top_right - top_left) * x_lerp; + const float bottom = + bottom_left + (bottom_right - bottom_left) * x_lerp; + + corps_data[crop_elements * b + channel_elements * d + y * crop_width + x] = top + (bottom - top) * y_lerp; + } + } // end for x + } // end for y + } // end for b + +} + + +void crop_and_resize_forward( + THFloatTensor * image, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + THFloatTensor * crops +) { + const int batch_size = image->size[0]; + const int depth = image->size[1]; + const int image_height = image->size[2]; + const int image_width = image->size[3]; + + const int num_boxes = boxes->size[0]; + + // init output space + THFloatTensor_resize4d(crops, num_boxes, depth, crop_height, crop_width); + THFloatTensor_zero(crops); + + // crop_and_resize for each box + CropAndResizePerBox( + THFloatTensor_data(image), + batch_size, + depth, + image_height, + image_width, + + THFloatTensor_data(boxes), + THIntTensor_data(box_index), + 0, + 
num_boxes, + + THFloatTensor_data(crops), + crop_height, + crop_width, + extrapolation_value + ); + +} + + +void crop_and_resize_backward( + THFloatTensor * grads, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor * box_index, // range in [0, batch_size) + THFloatTensor * grads_image // resize to [bsize, c, hc, wc] +) +{ + // shape + const int batch_size = grads_image->size[0]; + const int depth = grads_image->size[1]; + const int image_height = grads_image->size[2]; + const int image_width = grads_image->size[3]; + + const int num_boxes = grads->size[0]; + const int crop_height = grads->size[2]; + const int crop_width = grads->size[3]; + + // n_elements + const int image_channel_elements = image_height * image_width; + const int image_elements = depth * image_channel_elements; + + const int channel_elements = crop_height * crop_width; + const int crop_elements = depth * channel_elements; + + // init output space + THFloatTensor_zero(grads_image); + + // data pointer + const float * grads_data = THFloatTensor_data(grads); + const float * boxes_data = THFloatTensor_data(boxes); + const int * box_index_data = THIntTensor_data(box_index); + float * grads_image_data = THFloatTensor_data(grads_image); + + for (int b = 0; b < num_boxes; ++b) { + const float * box = boxes_data + b * 4; + const float y1 = box[0]; + const float x1 = box[1]; + const float y2 = box[2]; + const float x2 = box[3]; + + const int b_in = box_index_data[b]; + if (b_in < 0 || b_in >= batch_size) { + printf("Error: batch_index %d out of range [0, %d)\n", b_in, batch_size); + exit(-1); + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height - 1) / (crop_height - 1) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width - 1) / (crop_width - 1) + : 0; + + for (int y = 0; y < crop_height; ++y) + { + const float in_y = (crop_height > 1) + ? 
y1 * (image_height - 1) + y * height_scale + : 0.5 * (y1 + y2) * (image_height - 1); + if (in_y < 0 || in_y > image_height - 1) + { + continue; + } + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + for (int x = 0; x < crop_width; ++x) + { + const float in_x = (crop_width > 1) + ? x1 * (image_width - 1) + x * width_scale + : 0.5 * (x1 + x2) * (image_width - 1); + if (in_x < 0 || in_x > image_width - 1) + { + continue; + } + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + for (int d = 0; d < depth; ++d) + { + float *pimage = grads_image_data + b_in * image_elements + d * image_channel_elements; + const float grad_val = grads_data[crop_elements * b + channel_elements * d + y * crop_width + x]; + + const float dtop = (1 - y_lerp) * grad_val; + pimage[top_y_index * image_width + left_x_index] += (1 - x_lerp) * dtop; + pimage[top_y_index * image_width + right_x_index] += x_lerp * dtop; + + const float dbottom = y_lerp * grad_val; + pimage[bottom_y_index * image_width + left_x_index] += (1 - x_lerp) * dbottom; + pimage[bottom_y_index * image_width + right_x_index] += x_lerp * dbottom; + } // end d + } // end x + } // end y + } // end b +} \ No newline at end of file diff --git a/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize.h b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize.h new file mode 100644 index 0000000..d494865 --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize.h @@ -0,0 +1,16 @@ +void crop_and_resize_forward( + THFloatTensor * image, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + THFloatTensor * crops +); + +void crop_and_resize_backward( + THFloatTensor * grads, + THFloatTensor * boxes, // [y1, x1, y2, x2] + THIntTensor 
* box_index, // range in [0, batch_size) + THFloatTensor * grads_image // resize to [bsize, c, hc, wc] +); \ No newline at end of file diff --git a/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize_gpu.c b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize_gpu.c new file mode 100644 index 0000000..8e07b3d --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize_gpu.c @@ -0,0 +1,73 @@ +#include <THC/THC.h> +#include "cuda/crop_and_resize_kernel.h" + +extern THCState *state; + + +void crop_and_resize_gpu_forward( + THCudaTensor * image, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + const int crop_zdepth, + THCudaTensor * crops +) { + const int batch_size = THCudaTensor_size(state, image, 0); + const int depth = THCudaTensor_size(state, image, 1); + const int image_height = THCudaTensor_size(state, image, 2); + const int image_width = THCudaTensor_size(state, image, 3); + const int image_zdepth = THCudaTensor_size(state, image, 4); + + const int num_boxes = THCudaTensor_size(state, boxes, 0); + + // init output space + THCudaTensor_resize5d(state, crops, num_boxes, depth, crop_height, crop_width, crop_zdepth); + THCudaTensor_zero(state, crops); + + cudaStream_t stream = THCState_getCurrentStream(state); + CropAndResizeLaucher( + THCudaTensor_data(state, image), + THCudaTensor_data(state, boxes), + THCudaIntTensor_data(state, box_index), + num_boxes, batch_size, image_height, image_width, image_zdepth, + crop_height, crop_width, crop_zdepth, depth, extrapolation_value, + THCudaTensor_data(state, crops), + stream + ); +} + + +void crop_and_resize_gpu_backward( + THCudaTensor * grads, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + THCudaTensor * grads_image // resize to [bsize, c, hc, wc] +) { + // shape + const int batch_size = 
THCudaTensor_size(state, grads_image, 0); + const int depth = THCudaTensor_size(state, grads_image, 1); + const int image_height = THCudaTensor_size(state, grads_image, 2); + const int image_width = THCudaTensor_size(state, grads_image, 3); + const int image_zdepth = THCudaTensor_size(state, grads_image, 4); + + const int num_boxes = THCudaTensor_size(state, grads, 0); + const int crop_height = THCudaTensor_size(state, grads, 2); + const int crop_width = THCudaTensor_size(state, grads, 3); + const int crop_zdepth = THCudaTensor_size(state, grads, 4); + + // init output space + THCudaTensor_zero(state, grads_image); + + cudaStream_t stream = THCState_getCurrentStream(state); + CropAndResizeBackpropImageLaucher( + THCudaTensor_data(state, grads), + THCudaTensor_data(state, boxes), + THCudaIntTensor_data(state, box_index), + num_boxes, batch_size, image_height, image_width, image_zdepth, + crop_height, crop_width, crop_zdepth, depth, + THCudaTensor_data(state, grads_image), + stream + ); +} \ No newline at end of file diff --git a/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize_gpu.h b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize_gpu.h new file mode 100644 index 0000000..dd2eb5a --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/src/crop_and_resize_gpu.h @@ -0,0 +1,17 @@ +void crop_and_resize_gpu_forward( + THCudaTensor * image, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + const float extrapolation_value, + const int crop_height, + const int crop_width, + const int crop_zdepth, + THCudaTensor * crops +); + +void crop_and_resize_gpu_backward( + THCudaTensor * grads, + THCudaTensor * boxes, // [y1, x1, y2, x2] + THCudaIntTensor * box_index, // range in [0, batch_size) + THCudaTensor * grads_image // resize to [bsize, c, hc, wc] +); \ No newline at end of file diff --git a/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.cu 
b/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.cu new file mode 100644 index 0000000..e381dab --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.cu @@ -0,0 +1,361 @@ +#include <math.h> +#include <stdio.h> +#include "crop_and_resize_kernel.h" +#include + +#define CUDA_1D_KERNEL_LOOP(i, n) \ +for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; \ + i += blockDim.x * gridDim.x) + + +__global__ +void CropAndResizeKernel( + const int nthreads, const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int image_zdepth, int crop_height, int crop_width, int crop_zdepth, int depth, + float extrapolation_value, float *crops_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) // nthreads = total_count! + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) position in out grid!!! + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) NCYX yes seems like xy is exchanged! + // NCHWZ: out_idx = z + crop_zdepth * (w + crop_width * (h + crop_height * (d + depth * b))) z == last. + + int idx = out_idx; + + const int z = idx % crop_zdepth; + idx /= crop_zdepth; + const int x = idx % crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + + const int d = idx % depth; + const int b = idx / depth; // batch + + const float y1 = boxes_ptr[b * 6]; // b = batch -> 0 // normalized coords!! + const float x1 = boxes_ptr[b * 6 + 1]; + const float y2 = boxes_ptr[b * 6 + 2]; + const float x2 = boxes_ptr[b * 6 + 3]; + const float z1 = boxes_ptr[b * 6 + 4]; + const float z2 = boxes_ptr[b * 6 + 5]; + + const int b_in = box_ind_ptr[b]; // == 0 in my case. + if (b_in < 0 || b_in >= batch) + { + continue; + } + + // e.g. (0.4-0.3)*100 = 10 / 7 = 1.3 ratio proposal_size / crops_size. one cell in crops has size 1.3 in_pixel. + + const float height_scale = + (crop_height > 1) ? 
(y2 - y1) * (image_height ) / (crop_height ) : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width ) / (crop_width ) : 0; + + const float zdepth_scale = + (crop_zdepth > 1) ? (z2 - z1) * (image_zdepth ) / (crop_zdepth ) : 0; + + + // e.g. 0.3*100 + 5 * 1.3 . Which floating coordinate is going into cell? + // e.g. y: 30 (lower bound prop) + 7.5 (current crop position * scale) + + + float tmp_in_y = (crop_height > 1) + ? y1 * (image_height ) + y * height_scale + height_scale/2 - 0.5 + : 0.5 * (y1 + y2) * (image_height); + + if (tmp_in_y > image_height - 1) + { + tmp_in_y = image_height - 1; + } + if (tmp_in_y < 0) + { + tmp_in_y = 0; + } + const float in_y = tmp_in_y; + + + float tmp_in_x = (crop_width > 1) + ? x1 * (image_width ) + x * width_scale + width_scale/2 - 0.5 + : 0.5 * (x1 + x2) * (image_width ); + + if (tmp_in_x > image_width - 1) + { + tmp_in_x = image_width - 1; + } + if (tmp_in_x < 0) + { + tmp_in_x= 0; + } + const float in_x = tmp_in_x; + + + float tmp_in_z = (crop_zdepth > 1) + ? z1 * (image_zdepth ) + z * zdepth_scale + zdepth_scale/2 - 0.5 + : 0.5 * (z1 + z2) * (image_zdepth); + + if (tmp_in_z > image_zdepth - 1) + { + tmp_in_z = image_zdepth - 1; + } + if (tmp_in_z < 0) + { + tmp_in_z= 0; + } + const float in_z = tmp_in_z; + + // this is just rounding of the floating coord of grid cell. The distances to nearest grid points are + // memorized (lerp) to be used for bilinear interpolation later. + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; // + + const int front_z_index = floorf(in_z); + const int back_z_index = ceilf(in_z); + const float z_lerp = in_z - front_z_index; + + + // address of image + going to the right feature map. 
+ const float *pimage = image_ptr + (b_in * depth + d) * image_height * image_width * image_zdepth; + + // 1D address of corner points of in_coords to grid cell. + // NCHWZ: out_idx = z + crop_zdepth * (w + crop_width * (h + crop_height * (d + depth * b))) z == last. + const float top_left_front = pimage[front_z_index + image_zdepth * (left_x_index + image_width * top_y_index)]; + const float top_right_front = pimage[front_z_index + image_zdepth * (right_x_index + image_width * top_y_index)]; + const float bottom_left_front = pimage[front_z_index + image_zdepth * (left_x_index + image_width * bottom_y_index)]; + const float bottom_right_front = pimage[front_z_index + image_zdepth * (right_x_index + image_width * bottom_y_index)]; + const float top_left_back = pimage[back_z_index + image_zdepth * (left_x_index + image_width * top_y_index)]; + const float top_right_back = pimage[back_z_index + image_zdepth * (right_x_index + image_width * top_y_index)]; + const float bottom_left_back = pimage[back_z_index + image_zdepth * (left_x_index + image_width * bottom_y_index)]; + const float bottom_right_back = pimage[back_z_index + image_zdepth * (right_x_index + image_width * bottom_y_index)]; + + // Bilinear Interpolation!! These are pixel values now! lerp is the interpolation distance! + // No Maxpool, only one point is sampled! + const float top_front = top_left_front + (top_right_front - top_left_front) * x_lerp; + const float bottom_front = bottom_left_front + (bottom_right_front - bottom_left_front) * x_lerp; + const float top_back = top_left_back + (top_right_back - top_left_back) * x_lerp; + const float bottom_back = bottom_left_back + (bottom_right_back - bottom_left_back) * x_lerp; + + const float front = top_front + (bottom_front - top_front) * y_lerp; + const float back = top_back + (bottom_back - top_back) * y_lerp; + + crops_ptr[out_idx] = front + (back - front) * z_lerp; // assign interpolated value to Grid cell! 
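+ // Put differently (a note added for clarity, derived from the code above, not
+ // part of the original source): the value just written is plain trilinear
+ // interpolation of the eight neighboring voxels, one sample per output cell:
+ //   front = lerp_y(lerp_x(TLF, TRF), lerp_x(BLF, BRF))
+ //   back  = lerp_y(lerp_x(TLB, TRB), lerp_x(BLB, BRB))
+ //   out   = lerp_z(front, back),  with lerp_a(p, q) = p + (q - p) * a_lerp.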
+ + + } +} + +__global__ +void CropAndResizeBackpropImageKernel( + const int nthreads, const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int image_zdepth, int crop_height, int crop_width, int crop_zdepth, int depth, + float *grads_image_ptr) +{ + CUDA_1D_KERNEL_LOOP(out_idx, nthreads) + { + // NHWC: out_idx = d + depth * (w + crop_width * (h + crop_height * b)) + // NCHW: out_idx = w + crop_width * (h + crop_height * (d + depth * b)) + // NCHWZ: out_idx = z + crop_zdepth * (w + crop_width * (h + crop_height * (d + depth * b))) z == last. + int idx = out_idx; + + const int z = idx % crop_zdepth; + idx /= crop_zdepth; + const int x = idx % crop_width; + idx /= crop_width; + const int y = idx % crop_height; + idx /= crop_height; + const int d = idx % depth; + const int b = idx / depth; + + const float y1 = boxes_ptr[b * 6]; // b = batch -> 0 // normalized coords!! + const float x1 = boxes_ptr[b * 6 + 1]; + const float y2 = boxes_ptr[b * 6 + 2]; + const float x2 = boxes_ptr[b * 6 + 3]; + const float z1 = boxes_ptr[b * 6 + 4]; + const float z2 = boxes_ptr[b * 6 + 5]; + + + const int b_in = box_ind_ptr[b]; + if (b_in < 0 || b_in >= batch) + { + continue; + } + + const float height_scale = + (crop_height > 1) ? (y2 - y1) * (image_height ) / (crop_height ) + : 0; + const float width_scale = + (crop_width > 1) ? (x2 - x1) * (image_width ) / (crop_width ) : 0; + + const float zdepth_scale = + (crop_zdepth > 1) ? (z2 - z1) * (image_zdepth ) / (crop_zdepth ) : 0; + + + float tmp_in_y = (crop_height > 1) + ? y1 * (image_height ) + y * height_scale + height_scale/2 - 0.5 + : 0.5 * (y1 + y2) * (image_height); + if (tmp_in_y > image_height - 1) + { + tmp_in_y = image_height - 1; + } + if (tmp_in_y < 0) + { + tmp_in_y = 0; + } + const float in_y = tmp_in_y; + + + float tmp_in_x = (crop_width > 1) + ? 
x1 * (image_width ) + x * width_scale + width_scale/2 - 0.5 + : 0.5 * (x1 + x2) * (image_width ); + if (tmp_in_x > image_width - 1) + { + tmp_in_x = image_width - 1; + } + if (tmp_in_x < 0) + { + tmp_in_x= 0; + } + const float in_x = tmp_in_x; + + + float tmp_in_z = (crop_zdepth > 1) + ? z1 * (image_zdepth ) + z * zdepth_scale + zdepth_scale/2 - 0.5 + : 0.5 * (z1 + z2) * (image_zdepth); + if (tmp_in_z > image_zdepth - 1) + { + tmp_in_z = image_zdepth - 1; + } + if (tmp_in_z < 0) + { + tmp_in_z= 0; + } + const float in_z = tmp_in_z; + + const int top_y_index = floorf(in_y); + const int bottom_y_index = ceilf(in_y); + const float y_lerp = in_y - top_y_index; + + const int left_x_index = floorf(in_x); + const int right_x_index = ceilf(in_x); + const float x_lerp = in_x - left_x_index; + + const int front_z_index = floorf(in_z); + const int back_z_index = ceilf(in_z); + const float z_lerp = in_z - front_z_index; + + float *pimage = grads_image_ptr + (b_in * depth + d) * image_height * image_width * image_zdepth; + + // top left front + atomicAdd( + pimage + front_z_index + image_zdepth * (left_x_index + image_width * top_y_index), + (1 - x_lerp) * (1 - z_lerp) * (1 - y_lerp) * grads_ptr[out_idx] // THIS IS BACKWARD INTERPOL. + ); + + // top left back + atomicAdd( + pimage + back_z_index + image_zdepth * (left_x_index + image_width * top_y_index), + (1 - x_lerp) * (z_lerp) * (1 - y_lerp) * grads_ptr[out_idx] // THIS IS BACKWARD INTERPOL. + ); + + // top right front + atomicAdd( + pimage + front_z_index + image_zdepth * (right_x_index + image_width * top_y_index), + (x_lerp) * (1 - z_lerp) * (1 - y_lerp) * grads_ptr[out_idx] // THIS IS backward INTERPOL. + ); + + // top right back + atomicAdd( + pimage + back_z_index + image_zdepth * (right_x_index + image_width * top_y_index), + (x_lerp) * (z_lerp) * (1 - y_lerp) * grads_ptr[out_idx] // THIS IS backward INTERPOL. 
+ ); + + // bottom left front + atomicAdd( + pimage + front_z_index + image_zdepth * (left_x_index + image_width * bottom_y_index), + (1 - x_lerp) * (1 - z_lerp) * (y_lerp) * grads_ptr[out_idx] // THIS IS backward INTERPOL. + ); + + // bottom left back + atomicAdd( + pimage + back_z_index + image_zdepth * (left_x_index + image_width * bottom_y_index), + (1 - x_lerp) * (z_lerp) * (y_lerp) * grads_ptr[out_idx] // THIS IS backward INTERPOL. + ); + + // bottom right front + atomicAdd( + pimage + front_z_index + image_zdepth * (right_x_index + image_width * bottom_y_index), + (x_lerp) * (1 - z_lerp) * (y_lerp) * grads_ptr[out_idx] // THIS IS backward INTERPOL. + ); + + // bottom right back + atomicAdd( + pimage + back_z_index + image_zdepth * (right_x_index + image_width * bottom_y_index), + (x_lerp) * (z_lerp) * (y_lerp) * grads_ptr[out_idx] // THIS IS backward INTERPOL. + ); + + } +} + + + +void CropAndResizeLaucher( + const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int image_zdepth, int crop_height, int crop_width, int crop_zdepth, int depth, + float extrapolation_value, float *crops_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * crop_zdepth * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, image_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, image_zdepth, + crop_height, crop_width, crop_zdepth, depth, extrapolation_value, crops_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError() failed : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} + + +void CropAndResizeBackpropImageLaucher( + const float *grads_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, 
int image_height, + int image_width, int image_zdepth, int crop_height, int crop_width, int crop_zdepth, int depth, + float *grads_image_ptr, cudaStream_t stream) +{ + const int total_count = num_boxes * crop_height * crop_width * crop_zdepth * depth; + const int thread_per_block = 1024; + const int block_count = (total_count + thread_per_block - 1) / thread_per_block; + cudaError_t err; + + if (total_count > 0) + { + CropAndResizeBackpropImageKernel<<<block_count, thread_per_block, 0, stream>>>( + total_count, grads_ptr, boxes_ptr, + box_ind_ptr, num_boxes, batch, image_height, image_width, image_zdepth, + crop_height, crop_width, crop_zdepth, depth, grads_image_ptr); + + err = cudaGetLastError(); + if (cudaSuccess != err) + { + fprintf(stderr, "cudaCheckError() failed in Roi Align : %s\n", cudaGetErrorString(err)); + exit(-1); + } + } +} \ No newline at end of file diff --git a/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.cu.o b/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.cu.o new file mode 100644 index 0000000..d488598 Binary files /dev/null and b/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.cu.o differ diff --git a/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.h b/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.h new file mode 100644 index 0000000..9244582 --- /dev/null +++ b/cuda_functions/roi_align_3D/roi_align/src/cuda/crop_and_resize_kernel.h @@ -0,0 +1,24 @@ +#ifndef _CropAndResize_Kernel +#define _CropAndResize_Kernel + +#ifdef __cplusplus +extern "C" { +#endif + +void CropAndResizeLaucher( + const float *image_ptr, const float *boxes_ptr, + const int *box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int image_zdepth, int crop_height, int crop_width, int crop_zdepth, int depth, + float extrapolation_value, float *crops_ptr, cudaStream_t stream); + +void CropAndResizeBackpropImageLaucher( + const float *grads_ptr, const float *boxes_ptr, + const int 
*box_ind_ptr, int num_boxes, int batch, int image_height, + int image_width, int image_zdepth, int crop_height, int crop_width, int crop_zdepth, int depth, + float *grads_image_ptr, cudaStream_t stream); + +#ifdef __cplusplus +} +#endif + +#endif \ No newline at end of file diff --git a/default_configs.py b/default_configs.py new file mode 100644 index 0000000..44c2618 --- /dev/null +++ b/default_configs.py @@ -0,0 +1,134 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +"""Default Configurations script. Avoids changing configs of all experiments if general settings are to be changed.""" + +import os + +class DefaultConfigs: + + def __init__(self, model, server_env=None, dim=2): + + ######################### + # I/O # + ######################### + + self.model = model + self.dim = dim + # int [0 < dataset_size]. select n patients from dataset for prototyping. + self.select_prototype_subset = None + + # some default paths. + self.backbone_path = 'models/backbone.py' + self.source_dir = os.path.dirname(os.path.realpath(__file__)) #current dir. 
+ self.input_df_name = 'info_df.pickle' + self.model_path = 'models/{}.py'.format(self.model) + + if server_env: + self.source_dir = '/home/jaegerp/code/mamma_code/medicaldetectiontoolkit' + + + ######################### + # Data Loader # + ######################### + + #random seed for fold_generator and batch_generator. + self.seed = 0 + + #number of threads for multithreaded batch generation. + self.n_workers = 6 + + # if True, segmentation losses learn all categories, else only foreground vs. background. + self.class_specific_seg_flag = False + + ######################### + # Architecture # + ######################### + + self.weight_decay = 0.0 + + # nonlinearity to be applied after convs with nonlinearity. one of 'relu' or 'leaky_relu' + self.relu = 'relu' + + # if True initializes weights as specified in model script. else use default Pytorch init. + self.custom_init = False + + # if True adds high-res decoder levels to feature pyramid: P1 + P0. (e.g. set to true in retina_unet configs) + self.operate_stride1 = False + + ######################### + # Schedule # + ######################### + + # number of folds in cross validation. + self.n_cv_splits = 5 + + + # number of probabilistic samples in validation. + self.n_probabilistic_samples = None + + ######################### + # Testing / Plotting # + ######################### + + # perform mirroring at test time. (only XY. Z not done to not blow up predictions times). + self.test_aug = True + + # if True, test data lies in a separate folder and is not part of the cross validation. + self.hold_out_test_set = False + + # if hold_out_test_set provided, ensemble predictions over models of all trained cv-folds. + self.ensemble_folds = False + + # color specifications for all box_types in prediction_plot. 
+ self.box_color_palette = {'det': 'b', 'gt': 'r', 'neg_class': 'purple', + 'prop': 'w', 'pos_class': 'g', 'pos_anchor': 'c', 'neg_anchor': 'c'} + + # scan over confidence score in evaluation to optimize it on the validation set. + self.scan_det_thresh = False + + # plots roc-curves / prc-curves in evaluation. + self.plot_stat_curves = False + + # evaluates average precision per image and averages over images. instead computing one ap over data set. + self.per_patient_ap = False + + # threshold for clustering 2D box predictions to 3D Cubes. Overlap is computed in XY. + self.merge_3D_iou = 0.1 + + # monitor any value from training. + self.n_monitoring_figures = 1 + # dict to assign specific plot_values to monitor_figures > 0. {1: ['class_loss'], 2: ['kl_loss', 'kl_sigmas']} + self.assign_values_to_extra_figure = {} + + ######################### + # MRCNN # + ######################### + + # if True, mask loss is not applied. used for data sets, where no pixel-wise annotations are provided. + self.frcnn_mode = False + + # if True, unmolds masks in Mask R-CNN to full-res for plotting/monitoring. + self.return_masks_in_val = False + self.return_masks_in_test = False # needed if doing instance segmentation. evaluation not yet implemented. + + # add P6 to Feature Pyramid Network. + self.sixth_pooling = False + + # for probabilistic detection + self.n_latent_dims = 0 + + diff --git a/evaluator.py b/evaluator.py new file mode 100644 index 0000000..0f2a1b3 --- /dev/null +++ b/evaluator.py @@ -0,0 +1,437 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import os +import numpy as np +import pandas as pd +from sklearn.metrics import roc_auc_score, average_precision_score +from sklearn.metrics import roc_curve, precision_recall_curve +import utils.model_utils as mutils +import plotting +from multiprocessing import Pool + + +class Evaluator(): + + + def __init__(self, cf, logger, mode='test'): + """ + :param mode: either 'val_sampling', 'val_patient' or 'test'. handles prediction lists of different forms. + """ + self.cf = cf + self.logger = logger + self.mode = mode + + + def evaluate_predictions(self, results_list, monitor_metrics=None): + """ + Performs the matching of predicted boxes and ground truth boxes. Loops over list of matching IoUs and foreground classes. + Resulting info of each prediction is stored as one line in an internal dataframe, with the keys: + det_type: 'tp' (true positive), 'fp' (false positive), 'fn' (false negative), 'tn' (true negative) + pred_class: foreground class which the object predicts. + pid: corresponding patient-id. + pred_score: confidence score [0, 1] + fold: corresponding fold of CV. + match_iou: utilized IoU for matching. + :param results_list: list of model predictions. Either from train/val_sampling (patch processing) for monitoring with form: + [[[results_0, ...], [pid_0, ...]], [[results_n, ...], [pid_n, ...]], ...] 
+ Or from val_patient/testing (patient processing), with form: [[results_0, pid_0], [results_1, pid_1], ...]) + :param monitor_metrics (optional): dict of dicts with all metrics of previous epochs. + :return monitor_metrics: if provided (during training), return monitor_metrics now including results of current epoch. + """ + # gets results_list = [[batch_instances_box_lists], [batch_instances_pids]]*n_batches + # we want to evaluate one batch_instance (= 2D or 3D image) at a time. + + df_list_preds = [] + df_list_labels = [] + df_list_class_preds = [] + df_list_pids = [] + df_list_type = [] + df_list_match_iou = [] + + self.logger.info('evaluating in mode {}'.format(self.mode)) + + + if self.mode == 'train' or self.mode=='val_sampling': + # batch_size > 1, with varying patients across batch: + # [[[results_0, ...], [pid_0, ...]], [[results_n, ...], [pid_n, ...]], ...] + # -> [results_0, results_1, ..] , [pid_0, pid_1, ...] + batch_elements_list = [[b_box_list] for item in results_list for b_box_list in item[0]] + pid_list = [pid for item in results_list for pid in item[1]] + else: + # patient processing, one element per batch = one patient. + # [[results_0, pid_0], [results_1, pid_1], ...] -> [results_0, results_1, ..] , [pid_0, pid_1, ...] + batch_elements_list = [item[0] for item in results_list] + pid_list = [item[1] for item in results_list] + + for match_iou in self.cf.ap_match_ious: + self.logger.info('evaluating with match_iou: {}'.format(match_iou)) + for cl in list(self.cf.class_dict.keys()): + for pix, pid in enumerate(pid_list): + + len_df_list_before_patient = len(df_list_pids) + + # input of each batch element is a list of boxes, where each box is a dictionary. 
+ for bix, b_boxes_list in enumerate(batch_elements_list[pix]): + + b_tar_boxes = np.array([box['box_coords'] for box in b_boxes_list if + (box['box_type'] == 'gt' and box['box_label'] == cl)]) + b_cand_boxes = np.array([box['box_coords'] for box in b_boxes_list if + (box['box_type'] == 'det' and + box['box_pred_class_id'] == cl)]) + b_cand_scores = np.array([box['box_score'] for box in b_boxes_list if + (box['box_type'] == 'det' and + box['box_pred_class_id'] == cl)]) + + # check if predictions and ground truth boxes exist and match them according to match_iou. + if not 0 in b_cand_boxes.shape and not 0 in b_tar_boxes.shape: + overlaps = mutils.compute_overlaps(b_cand_boxes, b_tar_boxes) + match_cand_ixs = np.argwhere(np.max(overlaps, 1) > match_iou)[:, 0] + non_match_cand_ixs = np.argwhere(np.max(overlaps, 1) <= match_iou)[:, 0] + match_gt_ixs = np.argmax(overlaps[match_cand_ixs, :], + 1) if not 0 in match_cand_ixs.shape else np.array([]) + non_match_gt_ixs = np.array( + [ii for ii in np.arange(b_tar_boxes.shape[0]) if ii not in match_gt_ixs]) + unique, counts = np.unique(match_gt_ixs, return_counts=True) + + # check for double assignments, i.e. two predictions having been assigned to the same gt. + # according to the COCO-metrics, only one prediction counts as true positive, the rest counts as + # false positive. This case is supposed to be avoided by the model itself by, + # e.g. using a low enough NMS threshold. 
+ if np.any(counts > 1): + double_match_gt_ixs = unique[np.argwhere(counts > 1)[:, 0]] + keep_max = [] + double_match_list = [] + for dg in double_match_gt_ixs: + double_match_cand_ixs = match_cand_ixs[np.argwhere(match_gt_ixs == dg)] + keep_max.append(double_match_cand_ixs[np.argmax(b_cand_scores[double_match_cand_ixs])]) + double_match_list += [ii for ii in double_match_cand_ixs] + + fp_ixs = np.array([ii for ii in match_cand_ixs if + (ii in double_match_list and ii not in keep_max)]) + + match_cand_ixs = np.array([ii for ii in match_cand_ixs if ii not in fp_ixs]) + + df_list_preds += [ii for ii in b_cand_scores[fp_ixs]] + df_list_labels += [0] * fp_ixs.shape[0] + df_list_class_preds += [cl] * fp_ixs.shape[0] + df_list_pids += [pid] * fp_ixs.shape[0] + df_list_type += ['det_fp'] * fp_ixs.shape[0] + + # matched: + if not 0 in match_cand_ixs.shape: + df_list_preds += [ii for ii in b_cand_scores[match_cand_ixs]] + df_list_labels += [1] * match_cand_ixs.shape[0] + df_list_class_preds += [cl] * match_cand_ixs.shape[0] + df_list_pids += [pid] * match_cand_ixs.shape[0] + df_list_type += ['det_tp'] * match_cand_ixs.shape[0] + # rest fp: + if not 0 in non_match_cand_ixs.shape: + df_list_preds += [ii for ii in b_cand_scores[non_match_cand_ixs]] + df_list_labels += [0] * non_match_cand_ixs.shape[0] + df_list_class_preds += [cl] * non_match_cand_ixs.shape[0] + df_list_pids += [pid] * non_match_cand_ixs.shape[0] + df_list_type += ['det_fp'] * non_match_cand_ixs.shape[0] + # rest fn: + if not 0 in non_match_gt_ixs.shape: + df_list_preds += [0] * non_match_gt_ixs.shape[0] + df_list_labels += [1] * non_match_gt_ixs.shape[0] + df_list_class_preds += [cl] * non_match_gt_ixs.shape[0] + df_list_pids += [pid] * non_match_gt_ixs.shape[0] + df_list_type += ['det_fn'] * non_match_gt_ixs.shape[0] + # only fp: + if not 0 in b_cand_boxes.shape and 0 in b_tar_boxes.shape: + df_list_preds += [ii for ii in b_cand_scores] + df_list_labels += [0] * b_cand_scores.shape[0] + df_list_class_preds 
+= [cl] * b_cand_scores.shape[0] + df_list_pids += [pid] * b_cand_scores.shape[0] + df_list_type += ['det_fp'] * b_cand_scores.shape[0] + # only fn: + if 0 in b_cand_boxes.shape and not 0 in b_tar_boxes.shape: + df_list_preds += [0] * b_tar_boxes.shape[0] + df_list_labels += [1] * b_tar_boxes.shape[0] + df_list_class_preds += [cl] * b_tar_boxes.shape[0] + df_list_pids += [pid] * b_tar_boxes.shape[0] + df_list_type += ['det_fn'] * b_tar_boxes.shape[0] + + # empty patient with 0 detections needs patient dummy score, in order to not disappear from stats. + # filtered out for roi-level evaluation later. During training (and val_sampling), + # tn are assigned per sample independently of associated patients. + if len(df_list_pids) == len_df_list_before_patient: + df_list_preds += [0] * 1 + df_list_labels += [0] * 1 + df_list_class_preds += [cl] * 1 + df_list_pids += [pid] * 1 + df_list_type += ['patient_tn'] * 1 # true negative: no ground truth boxes, no detections. + + df_list_match_iou += [match_iou] * (len(df_list_preds) - len(df_list_match_iou)) + + self.test_df = pd.DataFrame() + self.test_df['pred_score'] = df_list_preds + self.test_df['class_label'] = df_list_labels + self.test_df['pred_class'] = df_list_class_preds + self.test_df['pid'] = df_list_pids + self.test_df['det_type'] = df_list_type + self.test_df['fold'] = self.cf.fold + self.test_df['match_iou'] = df_list_match_iou + if monitor_metrics is not None: + return self.return_metrics(monitor_metrics) + + + def return_metrics(self, monitor_metrics=None): + """ + calculates AP/AUC scores for internal dataframe. called directly from evaluate_predictions during training for monitoring, + or from score_test_df during inference (for single folds or aggregated test set). Loops over foreground classes + and score_levels (typically 'roi' and 'patient'), gets scores and stores them. Optionally creates plots of + prediction histograms and roc/prc curves. 
+ :param monitor_metrics: dict of dicts with all metrics of previous epochs. + this function adds metrics for current epoch and returns the same object. + :return: all_stats: list. Contains dicts with resulting scores for each combination of foreground class and + score_level. + :return: monitor_metrics + """ + df = self.test_df + + all_stats = [] + for cl in list(self.cf.class_dict.keys()): + cl_df = df[df.pred_class == cl] + + for score_level in self.cf.report_score_level: + stats_dict = {} + stats_dict['name'] = 'fold_{} {} cl_{}'.format(self.cf.fold, score_level, cl) + + if score_level == 'rois': + # kick out dummy entries for true negative patients. not needed on roi-level. + spec_df = cl_df[cl_df.det_type != 'patient_tn'] + stats_dict['ap'] = get_roi_ap_from_df([spec_df, self.cf.min_det_thresh, self.cf.per_patient_ap]) + # AUC not sensible on roi-level, since true negative box predictions do not exist. Would reward + # higher amounts of low confidence false positives. + stats_dict['auc'] = 0 + stats_dict['roc'] = None + stats_dict['prc'] = None + + # for the aggregated test set case, additionally get the scores for averaging over fold results. + if len(df.fold.unique()) > 1: + aps = [] + for fold in df.fold.unique(): + fold_df = spec_df[spec_df.fold == fold] + aps.append(get_roi_ap_from_df([fold_df, self.cf.min_det_thresh, self.cf.per_patient_ap])) + stats_dict['mean_ap'] = np.mean(aps) + stats_dict['mean_auc'] = 0 + + # on patient level, aggregate predictions per patient (pid): The patient predicted score is the highest + # confidence prediction for this class. The patient class label is 1 if roi of this class exists in patient, else 0. 
+ if score_level == 'patient': + spec_df = cl_df.groupby(['pid'], as_index=False).agg({'class_label': 'max', 'pred_score': 'max', 'fold': 'first'}) + + if len(spec_df.class_label.unique()) > 1: + stats_dict['auc'] = roc_auc_score(spec_df.class_label.tolist(), spec_df.pred_score.tolist()) + stats_dict['roc'] = roc_curve(spec_df.class_label.tolist(), spec_df.pred_score.tolist()) + else: + stats_dict['auc'] = np.nan + stats_dict['roc'] = np.nan + + if (spec_df.class_label == 1).any(): + stats_dict['ap'] = average_precision_score(spec_df.class_label.tolist(), spec_df.pred_score.tolist()) + stats_dict['prc'] = precision_recall_curve(spec_df.class_label.tolist(), spec_df.pred_score.tolist()) + else: + stats_dict['ap'] = np.nan + stats_dict['prc'] = np.nan + + # for the aggregated test set case, additionally get the scores for averaging over fold results. + if len(df.fold.unique()) > 1: + aucs = [] + aps = [] + for fold in df.fold.unique(): + fold_df = spec_df[spec_df.fold == fold] + if len(fold_df.class_label.unique()) > 1: + aucs.append(roc_auc_score(fold_df.class_label.tolist(), fold_df.pred_score.tolist())) + if (fold_df.class_label == 1).any(): + aps.append(average_precision_score(fold_df.class_label.tolist(), fold_df.pred_score.tolist())) + stats_dict['mean_auc'] = np.mean(aucs) + stats_dict['mean_ap'] = np.mean(aps) + + # fill new results into monitor_metrics dict. for simplicity, only one class (of interest) is monitored on patient level. 
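The patient-level aggregation above can be checked on a toy dataframe (hypothetical values; the column names match the code): the per-patient score is the highest-confidence prediction, and the patient label is 1 if any roi of the class exists.

```python
import pandas as pd

cl_df = pd.DataFrame({
    'pid':         ['a', 'a', 'b', 'b'],
    'class_label': [1,   1,   0,   0],
    'pred_score':  [0.3, 0.9, 0.2, 0.4],
    'fold':        [0,   0,   0,   0],
})
# per patient: max score, max label (1 if any positive roi), first fold id
spec_df = cl_df.groupby(['pid'], as_index=False).agg(
    {'class_label': 'max', 'pred_score': 'max', 'fold': 'first'})
```

Patient `a` collapses to score 0.9 with label 1, patient `b` to score 0.4 with label 0, which is exactly the input the AUC/AP calls expect.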
+ if monitor_metrics is not None and not (score_level == 'patient' and cl != self.cf.patient_class_of_interest): + score_level_name = 'patient' if score_level == 'patient' else self.cf.class_dict[cl] + monitor_metrics[score_level_name + '_ap'].append(stats_dict['ap'] if stats_dict['ap'] > 0 else None) + if score_level == 'patient': + monitor_metrics[score_level_name + '_auc'].append( + stats_dict['auc'] if stats_dict['auc'] > 0 else None) + + if self.cf.plot_prediction_histograms: + out_filename = os.path.join( + self.cf.plot_dir, 'pred_hist_{}_{}_{}_cl{}'.format( + self.cf.fold, 'val' if 'val' in self.mode else self.mode, score_level, cl)) + type_list = None if score_level == 'patient' else spec_df.det_type.tolist() + plotting.plot_prediction_hist(spec_df.class_label.tolist(), spec_df.pred_score.tolist(), type_list, out_filename) + + all_stats.append(stats_dict) + + # analysis of the hyper-parameter cf.min_det_thresh, for optimization on validation set. + if self.cf.scan_det_thresh: + conf_threshs = list(np.arange(0.9, 1, 0.01)) + pool = Pool(processes=10) + mp_inputs = [[spec_df, ii, self.cf.per_patient_ap] for ii in conf_threshs] + aps = pool.map(get_roi_ap_from_df, mp_inputs, chunksize=1) + pool.close() + pool.join() + self.logger.info('results from scanning over det_threshs:', [[i, j] for i, j in zip(conf_threshs, aps)]) + + if self.cf.plot_stat_curves: + out_filename = os.path.join(self.cf.plot_dir, '{}_{}_stat_curves'.format(self.cf.fold, self.mode)) + plotting.plot_stat_curves(all_stats, out_filename) + + + # get average stats over foreground classes on roi level. 
+        avg_ap = np.mean([d['ap'] for d in all_stats if 'rois' in d['name']])
+        all_stats.append({'name': 'average_foreground_roi', 'auc': 0, 'ap': avg_ap})
+        if len(df.fold.unique()) > 1:
+            avg_mean_ap = np.mean([d['mean_ap'] for d in all_stats if 'rois' in d['name']])
+            all_stats[-1]['mean_ap'] = avg_mean_ap
+            all_stats[-1]['mean_auc'] = 0
+
+        # in small data sets, values of model_selection_criterion can be identical across epochs, which breaks the
+        # ranking of model_selector. Thus, perturb identical values by a negligible random term.
+        for sc in self.cf.model_selection_criteria:
+            if 'val' in self.mode and monitor_metrics[sc].count(monitor_metrics[sc][-1]) > 1 and monitor_metrics[sc][-1] is not None:
+                monitor_metrics[sc][-1] += 1e-6 * np.random.rand()
+
+        return all_stats, monitor_metrics
+
+
+    def score_test_df(self, internal_df=True):
+        """
+        Writes out resulting scores to text files: First checks for the class-internal df of the (typically current) fold,
+        gets resulting scores, writes them to a text file and pickles the data frame. Also checks if data-frame pickles of
+        all folds of the cross-validation exist in exp_dir. If true, loads all dataframes, aggregates test sets over folds,
+        and calculates and writes out overall metrics.
+ """ + if internal_df: + + self.test_df.to_pickle(os.path.join(self.cf.exp_dir, '{}_test_df.pickle'.format(self.cf.fold))) + stats, _ = self.return_metrics() + + with open(os.path.join(self.cf.exp_dir, 'results.txt'), 'a') as handle: + handle.write('\n****************************\n') + handle.write('\nresults for fold {} \n'.format(self.cf.fold)) + handle.write('\n****************************\n') + handle.write('\nfold df shape {}\n \n'.format(self.test_df.shape)) + for s in stats: + handle.write('AUC {:0.4f} AP {:0.4f} {} \n'.format(s['auc'], s['ap'], s['name'])) + + fold_df_paths = [ii for ii in os.listdir(self.cf.exp_dir) if 'test_df.pickle' in ii] + if len(fold_df_paths) == self.cf.n_cv_splits: + with open(os.path.join(self.cf.exp_dir, 'results.txt'), 'a') as handle: + self.cf.fold = 'overall' + dfs_list = [pd.read_pickle(os.path.join(self.cf.exp_dir, ii)) for ii in fold_df_paths] + for ix, df in enumerate(dfs_list): + df['fold'] = ix + self.test_df = pd.concat(dfs_list) + stats, _ = self.return_metrics() + handle.write('\n****************************\n') + handle.write('\nOVERALL RESULTS \n') + handle.write('\n****************************\n') + handle.write('\ndf shape \n \n'.format(self.test_df.shape)) + for s in stats: + handle.write('\nAUC {:0.4f} (mu {:0.4f}) AP {:0.4f} (mu {:0.4f}) {}\n ' + .format(s['auc'], s['mean_auc'], s['ap'], s['mean_ap'], s['name'])) + results_table_path = os.path.join(("/").join(self.cf.exp_dir.split("/")[:-1]), 'results_table.txt') + with open(results_table_path, 'a') as handle2: + for s in stats: + handle2.write('\nAUC {:0.4f} (mu {:0.4f}) AP {:0.4f} (mu {:0.4f}) {} {}' + .format(s['auc'], s['mean_auc'], s['ap'], s['mean_ap'], s['name'], self.cf.exp_dir.split('/')[-1])) + handle2.write('\n') + + + +def get_roi_ap_from_df(inputs): + ''' + :param df: data frame. + :param det_thresh: min_threshold for filtering out low confidence predictions. + :param per_patient_ap: boolean flag. 
evaluate average precision per image and average over images, + instead of computing one ap over data set. + :return: average_precision (float) + ''' + df, det_thresh, per_patient_ap = inputs + + if per_patient_ap: + pids_list = df.pid.unique() + aps = [] + for match_iou in df.match_iou.unique(): + iou_df = df[df.match_iou == match_iou] + for pid in pids_list: + pid_df = iou_df[iou_df.pid == pid] + all_p = len(pid_df[pid_df.class_label == 1]) + pid_df = pid_df[(pid_df.det_type == 'det_fp') | (pid_df.det_type == 'det_tp')].sort_values('pred_score', ascending=False) + pid_df = pid_df[pid_df.pred_score > det_thresh] + if (len(pid_df) ==0 and all_p == 0): + pass + elif (len(pid_df) > 0 and all_p == 0): + aps.append(0) + else: + aps.append(compute_roi_ap(pid_df, all_p)) + return np.mean(aps) + + else: + aps = [] + for match_iou in df.match_iou.unique(): + iou_df = df[df.match_iou == match_iou] + all_p = len(iou_df[iou_df.class_label == 1]) + iou_df = iou_df[(iou_df.det_type == 'det_fp') | (iou_df.det_type == 'det_tp')].sort_values('pred_score', ascending=False) + iou_df = iou_df[iou_df.pred_score > det_thresh] + if all_p > 0: + aps.append(compute_roi_ap(iou_df, all_p)) + return np.mean(aps) + + + +def compute_roi_ap(df, all_p): + """ + adapted from: https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocotools/cocoeval.py + :param df: dataframe containing class labels of predictions sorted in descending manner by their prediction score. + :param all_p: number of all ground truth objects. (for denominator of recall.) + :return: + """ + tp = df.class_label.values + fp = (tp == 0) * 1 + #recall thresholds, where precision will be measured + R = np.linspace(.0, 1, 101, endpoint=True) + tp_sum = np.cumsum(tp) + fp_sum = np.cumsum(fp) + nd = len(tp) + rc = tp_sum / all_p + pr = tp_sum / (fp_sum + tp_sum) + # initialize precision array over recall steps. 
+    q = np.zeros((len(R),))
+
+    # numpy element access is slow without cython optimization;
+    # using plain python lists gives a significant speed improvement.
+    pr = pr.tolist()
+    q = q.tolist()
+    for i in range(nd - 1, 0, -1):
+        if pr[i] > pr[i - 1]:
+            pr[i - 1] = pr[i]
+
+    # discretize empiric recall steps with given bins.
+    inds = np.searchsorted(rc, R, side='left')
+    try:
+        for ri, pi in enumerate(inds):
+            q[ri] = pr[pi]
+    except IndexError:
+        # recall thresholds beyond the highest empirical recall keep their initial precision of 0.
+        pass
+
+    return np.mean(q)
\ No newline at end of file
diff --git a/exec.py b/exec.py
new file mode 100644
index 0000000..446393b
--- /dev/null
+++ b/exec.py
@@ -0,0 +1,219 @@
+#!/usr/bin/env python
+# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ).
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""execution script."""
+
+import argparse
+import os
+import time
+import torch
+
+import utils.exp_utils as utils
+from evaluator import Evaluator
+from predictor import Predictor
+from plotting import plot_batch_prediction
+
+
+def train(logger):
+    """
+    perform the training routine for a given fold. saves plots and selected parameters to the experiment dir
+    specified in the configs.
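`compute_roi_ap` above follows the COCO scheme: make precision monotonically non-increasing from the right, then read it off at 101 evenly spaced recall thresholds. A self-contained sketch of that computation (the helper name `interpolated_ap` is made up; it assumes the same input convention of labels sorted by descending prediction score):

```python
import numpy as np

def interpolated_ap(class_labels, n_gt, n_recall_bins=101):
    # class_labels: 1 for tp, 0 for fp, sorted by descending prediction score.
    tp = np.asarray(class_labels, dtype=float)
    rc = np.cumsum(tp) / n_gt                        # recall after each prediction
    pr = np.cumsum(tp) / np.arange(1, len(tp) + 1)   # precision after each prediction
    # precision envelope: running max from the right (COCO-style)
    for i in range(len(pr) - 1, 0, -1):
        pr[i - 1] = max(pr[i - 1], pr[i])
    R = np.linspace(0., 1., n_recall_bins)
    q = np.zeros(n_recall_bins)
    inds = np.searchsorted(rc, R, side='left')
    valid = inds < len(pr)          # recall levels never reached keep precision 0
    q[valid] = pr[inds[valid]]
    return q.mean()

# perfect detector: two predictions, both tp, two gt objects
ap_perfect = interpolated_ap([1, 1], n_gt=2)
# one fp ranked between two tp lowers the envelope past recall 0.5
ap_mixed = interpolated_ap([1, 0, 1], n_gt=2)
```

The vectorized bounds check replaces the `try/except` of the original, but the result is the same: recall bins beyond the highest empirical recall contribute precision 0.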
+ """ + logger.info('performing training in {}D over fold {} on experiment {} with model {}'.format( + cf.dim, cf.fold, cf.exp_dir, cf.model)) + + net = model.net(cf, logger).cuda() + optimizer = torch.optim.Adam(net.parameters(), lr=cf.learning_rate[0], weight_decay=cf.weight_decay) + model_selector = utils.ModelSelector(cf, logger) + train_evaluator = Evaluator(cf, logger, mode='train') + val_evaluator = Evaluator(cf, logger, mode=cf.val_mode) + + starting_epoch = 1 + if cf.resume_to_checkpoint: + starting_epoch = utils.load_checkpoint(cf.resume_to_checkpoint, net, optimizer) + logger.info('resumed to checkpoint {} at epoch {}'.format(cf.resume_to_checkpoint, starting_epoch)) + + # prepare monitoring + monitor_metrics, TrainingPlot = utils.prepare_monitoring(cf) + + logger.info('loading dataset and initializing batch generators...') + batch_gen = data_loader.get_train_generators(cf, logger) + + for epoch in range(starting_epoch, cf.num_epochs + 1): + + logger.info('starting training epoch {}'.format(epoch)) + for param_group in optimizer.param_groups: + param_group['lr'] = cf.learning_rate[epoch - 1] + + start_time = time.time() + + net.train() + train_results_list = [] + + for bix in range(cf.num_train_batches): + batch = next(batch_gen['train']) + tic_fw = time.time() + results_dict = net.train_forward(batch) + tic_bw = time.time() + optimizer.zero_grad() + results_dict['torch_loss'].backward() + optimizer.step() + logger.info('tr. batch {0}/{1} (ep. 
{2}) fw {3:.3f}s / bw {4:.3f}s / total {5:.3f}s || ' + .format(bix + 1, cf.num_train_batches, epoch, tic_bw - tic_fw, + time.time() - tic_bw, time.time() - tic_fw) + results_dict['logger_string']) + train_results_list.append([results_dict['boxes'], batch['pid']]) + monitor_metrics['train']['monitor_values'][epoch].append(results_dict['monitor_values']) + + _, monitor_metrics['train'] = train_evaluator.evaluate_predictions(train_results_list, monitor_metrics['train']) + train_time = time.time() - start_time + + logger.info('starting validation in mode {}.'.format(cf.val_mode)) + with torch.no_grad(): + net.eval() + if cf.do_validation: + val_results_list = [] + val_predictor = Predictor(cf, net, logger, mode='val') + for _ in range(batch_gen['n_val']): + batch = next(batch_gen[cf.val_mode]) + if cf.val_mode == 'val_patient': + results_dict = val_predictor.predict_patient(batch) + elif cf.val_mode == 'val_sampling': + results_dict = net.train_forward(batch, is_validation=True) + val_results_list.append([results_dict['boxes'], batch['pid']]) + monitor_metrics['val']['monitor_values'][epoch].append(results_dict['monitor_values']) + + _, monitor_metrics['val'] = val_evaluator.evaluate_predictions(val_results_list, monitor_metrics['val']) + model_selector.run_model_selection(net, optimizer, monitor_metrics, epoch) + + # update monitoring and prediction plots + TrainingPlot.update_and_save(monitor_metrics, epoch) + epoch_time = time.time() - start_time + logger.info('trained epoch {}: took {} sec. ({} train / {} val)'.format( + epoch, epoch_time, train_time, epoch_time-train_time)) + batch = next(batch_gen['val_sampling']) + results_dict = net.train_forward(batch, is_validation=True) + logger.info('plotting predictions from validation sampling.') + plot_batch_prediction(batch, results_dict, cf) + + +def test(logger): + """ + perform testing for a given fold (or hold out set). save stats in evaluator. 
+ """ + logger.info('starting testing model of fold {} in exp {}'.format(cf.fold, cf.exp_dir)) + net = model.net(cf, logger).cuda() + test_predictor = Predictor(cf, net, logger, mode='test') + test_evaluator = Evaluator(cf, logger, mode='test') + batch_gen = data_loader.get_test_generator(cf, logger) + test_results_list = test_predictor.predict_test_set(batch_gen, return_results=True) + test_evaluator.evaluate_predictions(test_results_list) + test_evaluator.score_test_df() + + +if __name__ == '__main__': + + parser = argparse.ArgumentParser() + parser.add_argument('--mode', type=str, default='train_test', + help='one out of: train / test / train_test / analysis / create_exp') + parser.add_argument('--folds', nargs='+', type=int, default=[0], + help='None runs over all folds in CV. otherwise specify list of folds.') + parser.add_argument('--exp_dir', type=str, default='/mnt/hdd/experiments/segmentation/final_test', + help='path to experiment dir. will be created if non existent.') + parser.add_argument('--server_env', default=False, action='store_true', + help='change IO settings to deploy models on a cluster.') + parser.add_argument('--slurm_job_id', type=str, default=None, help='job scheduler info') + parser.add_argument('--use_stored_settings', default=False, action='store_true', + help='load configs from existing exp_dir instead of source dir. always done for testing, ' + 'but can be set to true to do the same for training. 
useful in job scheduler environment, ' + 'where source code might change before the job actually runs.') + parser.add_argument('--resume_to_checkpoint', type=str, default=None, + help='if resuming to checkpoint, the desired fold still needs to be parsed via --folds.') + parser.add_argument('--exp_source', type=str, default='experiments/toy_exp', + help='specifies, from which source experiment to load configs and data_loader.') + + args = parser.parse_args() + folds = args.folds + resume_to_checkpoint = args.resume_to_checkpoint + + if args.mode == 'train' or args.mode == 'train_test': + + cf = utils.prep_exp(args.exp_source, args.exp_dir, args.server_env, args.use_stored_settings) + cf.slurm_job_id = args.slurm_job_id + model = utils.import_module('model', cf.model_path) + data_loader = utils.import_module('dl', os.path.join(args.exp_source, 'data_loader.py')) + if folds is None: + folds = range(cf.n_cv_splits) + + for fold in folds: + cf.fold_dir = os.path.join(cf.exp_dir, 'fold_{}'.format(fold)) + cf.fold = fold + cf.resume_to_checkpoint = resume_to_checkpoint + if not os.path.exists(cf.fold_dir): + os.mkdir(cf.fold_dir) + logger = utils.get_logger(cf.fold_dir) + train(logger) + cf.resume_to_checkpoint = None + if args.mode == 'train_test': + test(logger) + + elif args.mode == 'test': + + cf = utils.prep_exp(args.exp_source, args.exp_dir, args.server_env, is_training=False, use_stored_settings=True) + cf.slurm_job_id = args.slurm_job_id + model = utils.import_module('model', cf.model_path) + data_loader = utils.import_module('dl', os.path.join(args.exp_source, 'data_loader.py')) + if folds is None: + folds = range(cf.n_cv_splits) + + for fold in folds: + cf.fold_dir = os.path.join(cf.exp_dir, 'fold_{}'.format(fold)) + logger = utils.get_logger(cf.fold_dir) + cf.fold = fold + test(logger) + + # load raw predictions saved by predictor during testing, run aggregation algorithms and evaluation. 
+    elif args.mode == 'analysis':
+        cf = utils.prep_exp(args.exp_source, args.exp_dir, args.server_env, is_training=False, use_stored_settings=True)
+        logger = utils.get_logger(cf.exp_dir)
+
+        if cf.hold_out_test_set:
+            cf.folds = args.folds
+            predictor = Predictor(cf, net=None, logger=logger, mode='analysis')
+            results_list = predictor.load_saved_predictions(apply_wbc=True)
+            utils.create_csv_output(cf, logger, results_list)
+
+        else:
+            if folds is None:
+                folds = range(cf.n_cv_splits)
+            for fold in folds:
+                cf.fold_dir = os.path.join(cf.exp_dir, 'fold_{}'.format(fold))
+                cf.fold = fold
+                predictor = Predictor(cf, net=None, logger=logger, mode='analysis')
+                results_list = predictor.load_saved_predictions(apply_wbc=True)
+                logger.info('starting evaluation...')
+                evaluator = Evaluator(cf, logger, mode='test')
+                evaluator.evaluate_predictions(results_list)
+                evaluator.score_test_df()
+
+    # create experiment folder and copy scripts without starting the job.
+    # useful for cloud deployment where configs might change before the job actually runs.
+ elif args.mode == 'create_exp': + cf = utils.prep_exp(args.exp_source, args.exp_dir, args.server_env, use_stored_settings=True) + logger = utils.get_logger(cf.exp_dir) + logger.info('created experiment directory at {}'.format(args.exp_dir)) + + else: + raise RuntimeError('mode specified in args is not implemented...') diff --git a/experiments/lidc_exp/__pycache__/configs.cpython-35.pyc b/experiments/lidc_exp/__pycache__/configs.cpython-35.pyc new file mode 100644 index 0000000..0f55697 Binary files /dev/null and b/experiments/lidc_exp/__pycache__/configs.cpython-35.pyc differ diff --git a/experiments/lidc_exp/__pycache__/configs.cpython-36.pyc b/experiments/lidc_exp/__pycache__/configs.cpython-36.pyc new file mode 100644 index 0000000..19e7e83 Binary files /dev/null and b/experiments/lidc_exp/__pycache__/configs.cpython-36.pyc differ diff --git a/experiments/lidc_exp/__pycache__/data_loader.cpython-35.pyc b/experiments/lidc_exp/__pycache__/data_loader.cpython-35.pyc new file mode 100644 index 0000000..47b52d6 Binary files /dev/null and b/experiments/lidc_exp/__pycache__/data_loader.cpython-35.pyc differ diff --git a/experiments/lidc_exp/__pycache__/data_loader.cpython-36.pyc b/experiments/lidc_exp/__pycache__/data_loader.cpython-36.pyc new file mode 100644 index 0000000..9ffbba7 Binary files /dev/null and b/experiments/lidc_exp/__pycache__/data_loader.cpython-36.pyc differ diff --git a/experiments/lidc_exp/configs.py b/experiments/lidc_exp/configs.py new file mode 100644 index 0000000..a848c3e --- /dev/null +++ b/experiments/lidc_exp/configs.py @@ -0,0 +1,335 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import sys +import os +sys.path.append(os.path.dirname(os.path.realpath(__file__))) +import numpy as np +from default_configs import DefaultConfigs + +class configs(DefaultConfigs): + + def __init__(self, server_env=None): + + ######################### + # Preprocessing # + ######################### + + self.root_dir = '/path/to/raw/data' + self.raw_data_dir = '{}/data_nrrd'.format(self.root_dir) + self.pp_dir = '{}/pp_norm'.format(self.root_dir) + self.target_spacing = (0.7, 0.7, 1.25) + + ######################### + # I/O # + ######################### + + + # one out of [2, 3]. dimension the model operates in. + self.dim = 3 + + # one out of ['mrcnn', 'retina_net', 'retina_unet', 'detection_unet', 'ufrcnn', 'detection_unet']. + self.model = 'mrcnn' + + DefaultConfigs.__init__(self, self.model, server_env, self.dim) + + # int [0 < dataset_size]. select n patients from dataset for prototyping. If None, all data is used. + self.select_prototype_subset = None + + # path to preprocessed data. + self.pp_name = 'pp_norm' + self.input_df_name = 'info_df.pickle' + self.pp_data_path = '/path/to/preprocessed/data/{}'.format(self.pp_name) + self.pp_test_data_path = self.pp_data_path #change if test_data in separate folder. + + # settings for deployment in cloud. + if server_env: + # path to preprocessed data. 
+ self.pp_name = 'pp_fg_slices' + self.crop_name = 'pp_fg_slices_packed' + self.pp_data_path = '/path/to/preprocessed/data/{}/{}'.format(self.pp_name, self.crop_name) + self.pp_test_data_path = self.pp_data_path + self.select_prototype_subset = None + + ######################### + # Data Loader # + ######################### + + # select modalities from preprocessed data + self.channels = [0] + self.n_channels = len(self.channels) + + # patch_size to be used for training. pre_crop_size is the patch_size before data augmentation. + self.pre_crop_size_2D = [300, 300] + self.patch_size_2D = [288, 288] + self.pre_crop_size_3D = [156, 156, 96] + self.patch_size_3D = [128, 128, 64] + self.patch_size = self.patch_size_2D if self.dim == 2 else self.patch_size_3D + self.pre_crop_size = self.pre_crop_size_2D if self.dim == 2 else self.pre_crop_size_3D + + # ratio of free sampled batch elements before class balancing is triggered + # (>0 to include "empty"/background patches.) + self.batch_sample_slack = 0.2 + + # set 2D network to operate in 3D images. + self.merge_2D_to_3D_preds = True + + # feed +/- n neighbouring slices into channel dimension. set to None for no context. 
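The `n_3D_context` option configured just below multiplies the effective channel count when a 2D network is fed neighbouring slices. As a quick standalone check of that arithmetic (the helper name `effective_channels` is hypothetical):

```python
def effective_channels(n_channels, n_3d_context):
    # stacking +/- n neighbouring slices into the channel dimension (2D mode)
    if n_3d_context is None:
        return n_channels
    return n_channels * (n_3d_context * 2 + 1)

# one modality with 3 slices of context on each side -> 7 input channels
channels_with_context = effective_channels(1, 3)
```

The centre slice plus n slices above and n below gives 2n + 1 copies of each modality.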
+        self.n_3D_context = None
+        if self.n_3D_context is not None and self.dim == 2:
+            self.n_channels *= (self.n_3D_context * 2 + 1)
+
+
+        #########################
+        #      Architecture     #
+        #########################
+
+        self.start_filts = 48 if self.dim == 2 else 18
+        self.end_filts = self.start_filts * 4 if self.dim == 2 else self.start_filts * 2
+        self.res_architecture = 'resnet50'  # 'resnet101' , 'resnet50'
+        self.norm = None  # one of None, 'instance_norm', 'batch_norm'
+        self.weight_decay = 0
+
+        # one of 'xavier_uniform', 'xavier_normal', or 'kaiming_normal', None (=default = 'kaiming_uniform')
+        self.weight_init = None
+
+        #########################
+        #  Schedule / Selection #
+        #########################
+
+        self.num_epochs = 100
+        self.num_train_batches = 200 if self.dim == 2 else 200
+        self.batch_size = 20 if self.dim == 2 else 8
+
+        self.do_validation = True
+        # decide whether to validate on entire patient volumes (like testing) or sampled patches (like training)
+        # the former is more accurate, while the latter is faster (depending on volume size)
+        self.val_mode = 'val_sampling'  # one of 'val_sampling' , 'val_patient'
+        if self.val_mode == 'val_patient':
+            self.max_val_patients = 50  # if 'None' iterates over entire val_set once.
+        if self.val_mode == 'val_sampling':
+            self.num_val_batches = 50
+
+        #########################
+        #  Testing / Plotting   #
+        #########################
+
+        # set the top-n-epochs to be saved for temporal averaging in testing.
+        self.save_n_models = 5
+        self.test_n_epochs = 5
+        # set a minimum epoch number for saving in case of instabilities in the first phase of training.
+        self.min_save_thresh = 0 if self.dim == 2 else 0
+
+        self.report_score_level = ['patient', 'rois']  # choose list from 'patient', 'rois'
+        self.class_dict = {1: 'benign', 2: 'malignant'}  # 0 is background.
+        self.patient_class_of_interest = 2  # patient metrics are only plotted for one class.
+        self.ap_match_ious = [0.1]  # list of ious to be evaluated for ap-scoring.
+
+        self.model_selection_criteria = ['malignant_ap', 'benign_ap']  # criteria to average over for saving epochs.
+        self.min_det_thresh = 0.1  # minimum confidence value to select predictions for evaluation.
+
+        # threshold for clustering predictions together (wcs = weighted cluster scoring).
+        # needs to be >= the expected overlap of predictions coming from one model (typically NMS threshold).
+        # if too high, preds of the same object are separate clusters.
+        self.wcs_iou = 1e-5
+
+        self.plot_prediction_histograms = True
+        self.plot_stat_curves = False
+
+        #########################
+        #   Data Augmentation   #
+        #########################
+
+        self.da_kwargs = {
+            'do_elastic_deform': True,
+            'alpha': (0., 1500.),
+            'sigma': (30., 50.),
+            'do_rotation': True,
+            'angle_x': (0., 2 * np.pi),
+            'angle_y': (0., 0),
+            'angle_z': (0., 0),
+            'do_scale': True,
+            'scale': (0.8, 1.1),
+            'random_crop': False,
+            'rand_crop_dist': (self.patch_size[0] / 2. - 3, self.patch_size[1] / 2. - 3),
+            'border_mode_data': 'constant',
+            'border_cval_data': 0,
+            'order_data': 1
+        }
+
+        if self.dim == 3:
+            self.da_kwargs['do_elastic_deform'] = False
+            self.da_kwargs['angle_x'] = (0, 0.0)
+            self.da_kwargs['angle_y'] = (0, 0.0)  # must be 0!!
+            self.da_kwargs['angle_z'] = (0., 2 * np.pi)
+
+
+        #########################
+        #  Add model specifics  #
+        #########################
+
+        {'detection_unet': self.add_det_unet_configs,
+         'mrcnn': self.add_mrcnn_configs,
+         'ufrcnn': self.add_mrcnn_configs,
+         'retina_net': self.add_mrcnn_configs,
+         'retina_unet': self.add_mrcnn_configs,
+         }[self.model]()
+
+
+    def add_det_unet_configs(self):
+
+        self.learning_rate = [1e-4] * self.num_epochs
+
+        # aggregation from pixel prediction to object scores (connected component). One of ['max', 'median']
+        self.aggregation_operation = 'max'
+
+        # max number of roi candidates to identify per batch element and class.
+        self.n_roi_candidates = 10 if self.dim == 2 else 30
+
+        # loss mode: either weighted cross entropy ('wce'), batch-wise dice loss ('dice'), or the sum of both ('dice_wce')
+        self.seg_loss_mode = 'dice_wce'
+
+        # if <1, false positive predictions in foreground are penalized less.
+        self.fp_dice_weight = 1 if self.dim == 2 else 1
+
+        self.wce_weights = [1, 1, 1]
+        self.detection_min_confidence = self.min_det_thresh
+
+        # if 'True', loss distinguishes all classes, else only foreground vs. background (class agnostic).
+        self.class_specific_seg_flag = True
+        self.num_seg_classes = 3 if self.class_specific_seg_flag else 2
+        self.head_classes = self.num_seg_classes
+
+    def add_mrcnn_configs(self):
+
+        # learning rate is a list with one entry per epoch.
+        self.learning_rate = [1e-4] * self.num_epochs
+
+        # disable the re-sampling of mask proposals to original size for speed-up.
+        # since evaluation is detection-driven (box-matching) and not instance segmentation-driven (iou-matching),
+        # mask-outputs are optional.
+        self.return_masks_in_val = True
+        self.return_masks_in_test = False
+
+        # set number of proposal boxes to plot after each epoch.
+        self.n_plot_rpn_props = 5 if self.dim == 2 else 30
+
+        # number of classes for head networks: n_foreground_classes + 1 (background)
+        self.head_classes = 3
+
+        # seg_classes here refers to the first stage classifier (RPN)
+        self.num_seg_classes = 2  # foreground vs. background
+
+        # feature map strides per pyramid level are inferred from architecture.
+        self.backbone_strides = {'xy': [4, 8, 16, 32], 'z': [1, 2, 4, 8]}
+
+        # anchor scales are chosen according to expected object sizes in data set. Default uses only one anchor scale
+        # per pyramid level. (outer list are pyramid levels (corresponding to BACKBONE_STRIDES), inner list are scales per level.)
+        self.rpn_anchor_scales = {'xy': [[8], [16], [32], [64]], 'z': [[2], [4], [8], [16]]}
+
+        # choose which pyramid levels to extract features from: P2: 0, P3: 1, P4: 2, P5: 3.
+ self.pyramid_levels = [0, 1, 2, 3] + + # number of feature maps in rpn. typically lowered in 3D to save gpu-memory. + self.n_rpn_features = 512 if self.dim == 2 else 128 + + # anchor ratios and strides per position in feature maps. + self.rpn_anchor_ratios = [0.5, 1, 2] + self.rpn_anchor_stride = 1 + + # Threshold for first stage (RPN) non-maximum suppression (NMS): LOWER == HARDER SELECTION + self.rpn_nms_threshold = 0.7 if self.dim == 2 else 0.7 + + # loss sampling settings. + self.rpn_train_anchors_per_image = 6 #per batch element + self.train_rois_per_image = 6 #per batch element + self.roi_positive_ratio = 0.5 + self.anchor_matching_iou = 0.7 + + # factor of top-k candidates to draw from per negative sample (stochastic-hard-example-mining). + # poolsize to draw top-k candidates from will be shem_poolsize * n_negative_samples. + self.shem_poolsize = 10 + + self.pool_size = (7, 7) if self.dim == 2 else (7, 7, 3) + self.mask_pool_size = (14, 14) if self.dim == 2 else (14, 14, 5) + self.mask_shape = (28, 28) if self.dim == 2 else (28, 28, 10) + + self.rpn_bbox_std_dev = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.2]) + self.bbox_std_dev = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.2]) + self.window = np.array([0, 0, self.patch_size[0], self.patch_size[1], 0, self.patch_size_3D[2]]) + self.scale = np.array([self.patch_size[0], self.patch_size[1], self.patch_size[0], self.patch_size[1], + self.patch_size_3D[2], self.patch_size_3D[2]]) + if self.dim == 2: + self.rpn_bbox_std_dev = self.rpn_bbox_std_dev[:4] + self.bbox_std_dev = self.bbox_std_dev[:4] + self.window = self.window[:4] + self.scale = self.scale[:4] + + # pre-selection in proposal-layer (stage 1) for NMS-speedup. applied per batch element. + self.pre_nms_limit = 3000 if self.dim == 2 else 6000 + + # n_proposals to be selected after NMS per batch element. too high numbers blow up memory if "detect_while_training" is True, + # since proposals of the entire batch are forwarded through second stage in as one "batch". 
+ self.roi_chunk_size = 2500 if self.dim == 2 else 600 + self.post_nms_rois_training = 500 if self.dim == 2 else 75 + self.post_nms_rois_inference = 500 + + # Final selection of detections (refine_detections) + self.model_max_instances_per_batch_element = 10 if self.dim == 2 else 30 # per batch element and class. + self.detection_nms_threshold = 1e-5 # needs to be > 0, otherwise all predictions are one cluster. + self.model_min_confidence = 0.1 + + if self.dim == 2: + self.backbone_shapes = np.array( + [[int(np.ceil(self.patch_size[0] / stride)), + int(np.ceil(self.patch_size[1] / stride))] + for stride in self.backbone_strides['xy']]) + else: + self.backbone_shapes = np.array( + [[int(np.ceil(self.patch_size[0] / stride)), + int(np.ceil(self.patch_size[1] / stride)), + int(np.ceil(self.patch_size[2] / stride_z))] + for stride, stride_z in zip(self.backbone_strides['xy'], self.backbone_strides['z'] + )]) + + if self.model == 'ufrcnn': + self.operate_stride1 = True + self.class_specific_seg_flag = True + self.num_seg_classes = 3 if self.class_specific_seg_flag else 2 + self.frcnn_mode = True + + if self.model == 'retina_net' or self.model == 'retina_unet' or self.model == 'prob_detector': + # implement extra anchor-scales according to retina-net publication. + self.rpn_anchor_scales['xy'] = [[ii[0], ii[0] * (2 ** (1 / 3)), ii[0] * (2 ** (2 / 3))] for ii in + self.rpn_anchor_scales['xy']] + self.rpn_anchor_scales['z'] = [[ii[0], ii[0] * (2 ** (1 / 3)), ii[0] * (2 ** (2 / 3))] for ii in + self.rpn_anchor_scales['z']] + self.n_anchors_per_pos = len(self.rpn_anchor_ratios) * 3 + + self.n_rpn_features = 256 if self.dim == 2 else 64 + + # pre-selection of detections for NMS-speedup. per entire batch. + self.pre_nms_limit = 10000 if self.dim == 2 else 50000 + + # anchor matching iou is lower than in Mask R-CNN according to https://arxiv.org/abs/1708.02002 + self.anchor_matching_iou = 0.5 + + # if 'True', seg loss distinguishes all classes, else only foreground vs. 
background (class agnostic). + self.num_seg_classes = 3 if self.class_specific_seg_flag else 2 + + if self.model == 'retina_unet': + self.operate_stride1 = True + diff --git a/experiments/lidc_exp/data_loader.py b/experiments/lidc_exp/data_loader.py new file mode 100644 index 0000000..6c97670 --- /dev/null +++ b/experiments/lidc_exp/data_loader.py @@ -0,0 +1,451 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +''' +Example Data Loader for the LIDC data set. This dataloader expects preprocessed data in .npy or .npz files per patient and +a pandas dataframe in the same directory containing the meta-info e.g. file paths, labels, foreground slice-ids.
+''' + + +import numpy as np +import os +from collections import OrderedDict +import pandas as pd +import pickle +import time +import subprocess +import utils.dataloader_utils as dutils + +# batch generator tools from https://github.com/MIC-DKFZ/batchgenerators +from batchgenerators.dataloading.data_loader import SlimDataLoaderBase +from batchgenerators.transforms.spatial_transforms import MirrorTransform as Mirror +from batchgenerators.transforms.abstract_transforms import Compose +from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter +from batchgenerators.dataloading import SingleThreadedAugmenter +from batchgenerators.transforms.spatial_transforms import SpatialTransform +from batchgenerators.transforms.crop_and_pad_transforms import CenterCropTransform +from batchgenerators.transforms.utility_transforms import ConvertSegToBoundingBoxCoordinates + + + +def get_train_generators(cf, logger): + """ + wrapper function for creating the training batch generator pipeline. returns the train/val generators. + selects patients according to cv folds (generated by first run/fold of experiment): + splits the data into n-folds, where 1 split is used for val, 1 split for testing and the rest for training. (inner loop test set) + If cf.hold_out_test_set is True, adds the test split to the training data. 
+ """ + all_data = load_dataset(cf, logger) + all_pids_list = np.unique([v['pid'] for (k, v) in all_data.items()]) + + if not cf.created_fold_id_pickle: + fg = dutils.fold_generator(seed=cf.seed, n_splits=cf.n_cv_splits, len_data=len(all_pids_list)).get_fold_names() + with open(os.path.join(cf.exp_dir, 'fold_ids.pickle'), 'wb') as handle: + pickle.dump(fg, handle) + cf.created_fold_id_pickle = True + else: + with open(os.path.join(cf.exp_dir, 'fold_ids.pickle'), 'rb') as handle: + fg = pickle.load(handle) + + train_ix, val_ix, test_ix, _ = fg[cf.fold] + + train_pids = [all_pids_list[ix] for ix in train_ix] + val_pids = [all_pids_list[ix] for ix in val_ix] + + if cf.hold_out_test_set: + train_pids += [all_pids_list[ix] for ix in test_ix] + + train_data = {k: v for (k, v) in all_data.items() if any(p == v['pid'] for p in train_pids)} + val_data = {k: v for (k, v) in all_data.items() if any(p == v['pid'] for p in val_pids)} + + logger.info("data set loaded with: {} train / {} val / {} test patients".format(len(train_ix), len(val_ix), len(test_ix))) + batch_gen = {} + batch_gen['train'] = create_data_gen_pipeline(train_data, cf=cf, is_training=True) + batch_gen['val_sampling'] = create_data_gen_pipeline(val_data, cf=cf, is_training=False) + if cf.val_mode == 'val_patient': + batch_gen['val_patient'] = PatientBatchIterator(val_data, cf=cf) + batch_gen['n_val'] = len(val_ix) if cf.max_val_patients is None else cf.max_val_patients + else: + batch_gen['n_val'] = cf.num_val_batches + + return batch_gen + + +def get_test_generator(cf, logger): + """ + wrapper function for creating the test batch generator pipeline. + selects patients according to cv folds (generated by first run/fold of experiment) + If cf.hold_out_test_set is True, gets the data from an external folder instead. 
+ """ + if cf.hold_out_test_set: + cf.pp_data_path = cf.pp_test_data_path + test_ix = None + else: + with open(os.path.join(cf.exp_dir, 'fold_ids.pickle'), 'rb') as handle: + fold_list = pickle.load(handle) + _, _, test_ix, _ = fold_list[cf.fold] + # warnings.warn('WARNING: using validation set for testing!!!') + + test_data = load_dataset(cf, logger, test_ix) + logger.info("data set loaded with: {} test patients".format(len(test_ix))) + batch_gen = {} + batch_gen['test'] = PatientBatchIterator(test_data, cf=cf) + batch_gen['n_test'] = len(test_ix) + return batch_gen + + + +def load_dataset(cf, logger, subset_ixs=None): + """ + loads the dataset. if deployed in cloud also copies and unpacks the data to the working directory. + :param subset_ixs: subset indices to be loaded from the dataset. used e.g. for testing to only load the test folds. + :return: data: dictionary with one entry per patient (in this case per patient-breast, since they are treated as + individual images for training) each entry is a dictionary containing respective meta-info as well as paths to the preprocessed + numpy arrays to be loaded during batch-generation + """ + if cf.server_env: + copy_data = True + target_dir = os.path.join('/ssd', cf.slurm_job_id, cf.pp_name, cf.crop_name) + if not os.path.exists(target_dir): + cf.data_source_dir = cf.pp_data_path + os.makedirs(target_dir) + subprocess.call('rsync -av {} {}'.format( + os.path.join(cf.data_source_dir, cf.input_df_name), os.path.join(target_dir, cf.input_df_name)), shell=True) + logger.info('created target dir and info df at {}'.format(os.path.join(target_dir, cf.input_df_name))) + + elif subset_ixs is None: + copy_data = False + + cf.pp_data_path = target_dir + + + p_df = pd.read_pickle(os.path.join(cf.pp_data_path, cf.input_df_name)) + + if cf.select_prototype_subset is not None: + prototype_pids = p_df.pid.tolist()[:cf.select_prototype_subset] + p_df = p_df[p_df.pid.isin(prototype_pids)] + logger.warning('WARNING: using prototyping 
data subset!!!') + + if subset_ixs is not None: + subset_pids = [np.unique(p_df.pid.tolist())[ix] for ix in subset_ixs] + p_df = p_df[p_df.pid.isin(subset_pids)] + logger.info('subset: selected {} instances from df'.format(len(p_df))) + + if cf.server_env: + if copy_data: + copy_and_unpack_data(logger, p_df.pid.tolist(), cf.fold_dir, cf.data_source_dir, target_dir) + + class_targets = p_df['class_target'].tolist() + pids = p_df.pid.tolist() + imgs = [os.path.join(cf.pp_data_path, '{}_img.npy'.format(pid)) for pid in pids] + segs = [os.path.join(cf.pp_data_path,'{}_rois.npy'.format(pid)) for pid in pids] + + data = OrderedDict() + for ix, pid in enumerate(pids): + # for the experiment conducted here, malignancy scores are binarized: (benign: 1-2, malignant: 3-5) + targets = [1 if ii >= 3 else 0 for ii in class_targets[ix]] + data[pid] = {'data': imgs[ix], 'seg': segs[ix], 'pid': pid, 'class_target': targets} + data[pid]['fg_slices'] = p_df['fg_slices'].tolist()[ix] + + return data + + + +def create_data_gen_pipeline(patient_data, cf, is_training=True): + """ + create multi-threaded train/val/test batch generation and augmentation pipeline. + :param patient_data: dictionary containing one dictionary per patient in the train/test subset. + :param is_training: (optional) whether to perform data augmentation (training) or not (validation/testing) + :return: multithreaded_generator + """ + + # create instance of batch generator as first element in pipeline. + data_gen = BatchGenerator(patient_data, batch_size=cf.batch_size, cf=cf) + + # add transformations to pipeline.
+ my_transforms = [] + if is_training: + mirror_transform = Mirror(axes=np.arange(2, cf.dim+2, 1)) + my_transforms.append(mirror_transform) + spatial_transform = SpatialTransform(patch_size=cf.patch_size[:cf.dim], + patch_center_dist_from_border=cf.da_kwargs['rand_crop_dist'], + do_elastic_deform=cf.da_kwargs['do_elastic_deform'], + alpha=cf.da_kwargs['alpha'], sigma=cf.da_kwargs['sigma'], + do_rotation=cf.da_kwargs['do_rotation'], angle_x=cf.da_kwargs['angle_x'], + angle_y=cf.da_kwargs['angle_y'], angle_z=cf.da_kwargs['angle_z'], + do_scale=cf.da_kwargs['do_scale'], scale=cf.da_kwargs['scale'], + random_crop=cf.da_kwargs['random_crop']) + + my_transforms.append(spatial_transform) + else: + my_transforms.append(CenterCropTransform(crop_size=cf.patch_size[:cf.dim])) + + my_transforms.append(ConvertSegToBoundingBoxCoordinates(cf.dim, get_rois_from_seg_flag=False, class_specific_seg_flag=cf.class_specific_seg_flag)) + all_transforms = Compose(my_transforms) + # multithreaded_generator = SingleThreadedAugmenter(data_gen, all_transforms) + multithreaded_generator = MultiThreadedAugmenter(data_gen, all_transforms, num_processes=cf.n_workers, seeds=range(cf.n_workers)) + return multithreaded_generator + + +class BatchGenerator(SlimDataLoaderBase): + """ + creates the training/validation batch generator. Samples n_batch_size patients (draws a slice from each patient if 2D) + from the data set while maintaining foreground-class balance. Returned patches are cropped/padded to pre_crop_size. + Actual patch_size is obtained after data augmentation. + :param data: data dictionary as provided by 'load_dataset'. + :param batch_size: number of patients to sample for the batch + :return dictionary containing the batch data (b, c, x, y, (z)) / seg (b, 1, x, y, (z)) / pids / class_target + """ + def __init__(self, data, batch_size, cf): + super(BatchGenerator, self).__init__(data, batch_size) + + self.cf = cf + self.crop_margin = np.array(self.cf.patch_size)/8. 
#min distance of ROI center to edge of cropped_patch. + self.p_fg = 0.5 + + def generate_train_batch(self): + + batch_data, batch_segs, batch_pids, batch_targets, batch_patient_labels = [], [], [], [], [] + class_targets_list = [v['class_target'] for (k, v) in self._data.items()] + + #samples patients towards equilibrium of foreground classes on a roi-level (after randomly sampling the ratio "batch_sample_slack"). + batch_ixs = dutils.get_class_balanced_patients( + class_targets_list, self.batch_size, self.cf.head_classes - 1, slack_factor=self.cf.batch_sample_slack) + patients = list(self._data.items()) + + for b in batch_ixs: + patient = patients[b][1] + + data = np.transpose(np.load(patient['data'], mmap_mode='r'), axes=(1, 2, 0))[np.newaxis] + seg = np.transpose(np.load(patient['seg'], mmap_mode='r'), axes=(1, 2, 0)) + batch_pids.append(patient['pid']) + batch_targets.append(patient['class_target']) + + if self.cf.dim == 2: + # draw random slice from patient while oversampling slices containing foreground objects with p_fg. + if len(patient['fg_slices']) > 0: + fg_prob = self.p_fg / len(patient['fg_slices']) + bg_prob = (1 - self.p_fg) / (data.shape[3] - len(patient['fg_slices'])) + slices_prob = [fg_prob if ix in patient['fg_slices'] else bg_prob for ix in range(data.shape[3])] + slice_id = np.random.choice(data.shape[3], p=slices_prob) + else: + slice_id = np.random.choice(data.shape[3]) + + # if set to not None, add neighbouring slices to each selected slice in channel dimension. + if self.cf.n_3D_context is not None: + padded_data = dutils.pad_nd_image(data[0], [(data.shape[-1] + (self.cf.n_3D_context*2))], mode='constant') + padded_slice_id = slice_id + self.cf.n_3D_context + data = (np.concatenate([padded_data[..., ii][np.newaxis] for ii in range( + padded_slice_id - self.cf.n_3D_context, padded_slice_id + self.cf.n_3D_context + 1)], axis=0)) + else: + data = data[..., slice_id] + seg = seg[..., slice_id] + + # pad data if smaller than pre_crop_size.
+ if np.any([data.shape[dim + 1] < ps for dim, ps in enumerate(self.cf.pre_crop_size)]): + new_shape = [np.max([data.shape[dim + 1], ps]) for dim, ps in enumerate(self.cf.pre_crop_size)] + data = dutils.pad_nd_image(data, new_shape, mode='constant') + seg = dutils.pad_nd_image(seg, new_shape, mode='constant') + + # crop patches of size pre_crop_size, while sampling patches containing foreground with p_fg. + crop_dims = [dim for dim, ps in enumerate(self.cf.pre_crop_size) if data.shape[dim + 1] > ps] + if len(crop_dims) > 0: + fg_prob_sample = np.random.rand(1) + # with p_fg: sample random pixel from random ROI and shift center by random value. + if fg_prob_sample < self.p_fg and np.sum(seg) > 0: + seg_ixs = np.argwhere(seg == np.random.choice(np.unique(seg)[1:], 1)) + roi_anchor_pixel = seg_ixs[np.random.choice(seg_ixs.shape[0], 1)][0] + assert seg[tuple(roi_anchor_pixel)] > 0 + # sample the patch center coords. constrained by edges of images - pre_crop_size /2. And by + # distance to the desired ROI < patch_size /2. + # (here final patch size to account for center_crop after data augmentation). + sample_seg_center = {} + for ii in crop_dims: + low = np.max((self.cf.pre_crop_size[ii]//2, roi_anchor_pixel[ii] - (self.cf.patch_size[ii]//2 - self.crop_margin[ii]))) + high = np.min((data.shape[ii + 1] - self.cf.pre_crop_size[ii]//2, + roi_anchor_pixel[ii] + (self.cf.patch_size[ii]//2 - self.crop_margin[ii]))) + # happens if lesion on the edge of the image. don't care about roi anymore, + # just make sure pre-crop is inside image. + if low >= high: + low = data.shape[ii + 1] // 2 - (data.shape[ii + 1] // 2 - self.cf.pre_crop_size[ii] // 2) + high = data.shape[ii + 1] // 2 + (data.shape[ii + 1] // 2 - self.cf.pre_crop_size[ii] // 2) + sample_seg_center[ii] = np.random.randint(low=low, high=high) + + else: + # not guaranteed to be empty. probability of emptiness depends on the data.
+ sample_seg_center = {ii: np.random.randint(low=self.cf.pre_crop_size[ii]//2, + high=data.shape[ii + 1] - self.cf.pre_crop_size[ii]//2) for ii in crop_dims} + + for ii in crop_dims: + min_crop = int(sample_seg_center[ii] - self.cf.pre_crop_size[ii] // 2) + max_crop = int(sample_seg_center[ii] + self.cf.pre_crop_size[ii] // 2) + data = np.take(data, indices=range(min_crop, max_crop), axis=ii + 1) + seg = np.take(seg, indices=range(min_crop, max_crop), axis=ii) + + batch_data.append(data) + batch_segs.append(seg[np.newaxis]) + + data = np.array(batch_data).astype(np.float16) + seg = np.array(batch_segs).astype(np.uint8) + class_target = np.array(batch_targets) + return {'data': data, 'seg': seg, 'pid': batch_pids, 'class_target': class_target} + + + +class PatientBatchIterator(SlimDataLoaderBase): + """ + creates a test generator that iterates over entire given dataset returning 1 patient per batch. + Can be used for monitoring if cf.val_mode = 'patient_val' for a monitoring closer to actual evaluation (done in 3D), + if willing to accept speed-loss during training. + :return: out_batch: dictionary containing one patient with batch_size = n_3D_patches in 3D or + batch_size = n_2D_patches in 2D. + """ + def __init__(self, data, cf): #threads in augmenter + super(PatientBatchIterator, self).__init__(data, 0) + self.cf = cf + self.patient_ix = 0 + self.dataset_pids = [v['pid'] for (k, v) in data.items()] + self.patch_size = cf.patch_size + if len(self.patch_size) == 2: + self.patch_size = self.patch_size + [1] + + + def generate_train_batch(self): + + + pid = self.dataset_pids[self.patient_ix] + patient = self._data[pid] + data = np.transpose(np.load(patient['data'], mmap_mode='r'), axes=(1, 2, 0)) + seg = np.transpose(np.load(patient['seg'], mmap_mode='r'), axes=(1, 2, 0)) + batch_class_targets = np.array([patient['class_target']]) + + # pad data if smaller than patch_size seen during training.
+ if np.any([data.shape[dim] < ps for dim, ps in enumerate(self.patch_size)]): + new_shape = [np.max([data.shape[dim], self.patch_size[dim]]) for dim, ps in enumerate(self.patch_size)] + data = dutils.pad_nd_image(data, new_shape) # use 'return_slicer' to crop image back to original shape. + seg = dutils.pad_nd_image(seg, new_shape) + + # get 3D targets for evaluation, even if network operates in 2D. 2D predictions will be merged to 3D in predictor. + if self.cf.dim == 3 or self.cf.merge_2D_to_3D_preds: + out_data = data[np.newaxis, np.newaxis] + out_seg = seg[np.newaxis, np.newaxis] + out_targets = batch_class_targets + + batch_3D = {'data': out_data, 'seg': out_seg, 'class_target': out_targets, 'pid': pid} + converter = ConvertSegToBoundingBoxCoordinates(dim=3, get_rois_from_seg_flag=False, class_specific_seg_flag=self.cf.class_specific_seg_flag) + batch_3D = converter(**batch_3D) + batch_3D.update({'patient_bb_target': batch_3D['bb_target'], + 'patient_roi_labels': batch_3D['roi_labels'], + 'original_img_shape': out_data.shape}) + + if self.cf.dim == 2: + out_data = np.transpose(data, axes=(2, 0, 1))[:, np.newaxis] # (z, c, x, y ) + out_seg = np.transpose(seg, axes=(2, 0, 1))[:, np.newaxis] + out_targets = np.array(np.repeat(batch_class_targets, out_data.shape[0], axis=0)) + + # if set to not None, add neighbouring slices to each selected slice in channel dimension. 
+ if self.cf.n_3D_context is not None: + slice_range = range(self.cf.n_3D_context, out_data.shape[0] + self.cf.n_3D_context) + out_data = np.pad(out_data, ((self.cf.n_3D_context, self.cf.n_3D_context), (0, 0), (0, 0), (0, 0)), 'constant', constant_values=0) + out_data = np.array( + [np.concatenate([out_data[ii] for ii in range( + slice_id - self.cf.n_3D_context, slice_id + self.cf.n_3D_context + 1)], axis=0) for slice_id in + slice_range]) + + batch_2D = {'data': out_data, 'seg': out_seg, 'class_target': out_targets, 'pid': pid} + converter = ConvertSegToBoundingBoxCoordinates(dim=2, get_rois_from_seg_flag=False, class_specific_seg_flag=self.cf.class_specific_seg_flag) + batch_2D = converter(**batch_2D) + + if self.cf.merge_2D_to_3D_preds: + batch_2D.update({'patient_bb_target': batch_3D['patient_bb_target'], + 'patient_roi_labels': batch_3D['patient_roi_labels'], + 'original_img_shape': out_data.shape}) + else: + batch_2D.update({'patient_bb_target': batch_2D['bb_target'], + 'patient_roi_labels': batch_2D['roi_labels'], + 'original_img_shape': out_data.shape}) + + out_batch = batch_3D if self.cf.dim == 3 else batch_2D + patient_batch = out_batch + + # crop patient-volume to patches of patch_size used during training. stack patches up in batch dimension. + # in this case, 2D is treated as a special case of 3D with patch_size[z] = 1. + if np.any([data.shape[dim] > self.patch_size[dim] for dim in range(3)]): + patch_crop_coords_list = dutils.get_patch_crop_coords(data, self.patch_size) + new_img_batch, new_seg_batch, new_class_targets_batch = [], [], [] + + for cix, c in enumerate(patch_crop_coords_list): + + seg_patch = seg[c[0]:c[1], c[2]: c[3], c[4]:c[5]] + new_seg_batch.append(seg_patch) + + # if set to not None, add neighbouring slices to each selected slice in channel dimension. + # correct patch_crop coordinates by added slices of 3D context. 
+ if self.cf.dim == 2 and self.cf.n_3D_context is not None: + tmp_c_5 = c[5] + (self.cf.n_3D_context * 2) + if cix == 0: + data = np.pad(data, ((0, 0), (0, 0), (self.cf.n_3D_context, self.cf.n_3D_context)), 'constant', constant_values=0) + else: + tmp_c_5 = c[5] + + new_img_batch.append(data[c[0]:c[1], c[2]:c[3], c[4]:tmp_c_5]) + + data = np.array(new_img_batch)[:, np.newaxis] # (n_patches, c, x, y, z) + seg = np.array(new_seg_batch)[:, np.newaxis] # (n_patches, 1, x, y, z) + batch_class_targets = np.repeat(batch_class_targets, len(patch_crop_coords_list), axis=0) + + if self.cf.dim == 2: + if self.cf.n_3D_context is not None: + data = np.transpose(data[:, 0], axes=(0, 3, 1, 2)) + else: + # all patches have z dimension 1 (slices). discard dimension + data = data[..., 0] + seg = seg[..., 0] + + patch_batch = {'data': data, 'seg': seg, 'class_target': batch_class_targets, 'pid': pid} + patch_batch['patch_crop_coords'] = np.array(patch_crop_coords_list) + patch_batch['patient_bb_target'] = patient_batch['patient_bb_target'] + patch_batch['patient_roi_labels'] = patient_batch['patient_roi_labels'] + patch_batch['original_img_shape'] = patient_batch['original_img_shape'] + + converter = ConvertSegToBoundingBoxCoordinates(self.cf.dim, get_rois_from_seg_flag=False, class_specific_seg_flag=self.cf.class_specific_seg_flag) + patch_batch = converter(**patch_batch) + out_batch = patch_batch + + self.patient_ix += 1 + if self.patient_ix == len(self.dataset_pids): + self.patient_ix = 0 + + return out_batch + + + +def copy_and_unpack_data(logger, pids, fold_dir, source_dir, target_dir): + + + start_time = time.time() + with open(os.path.join(fold_dir, 'file_list.txt'), 'w') as handle: + for pid in pids: + handle.write('{}_img.npz\n'.format(pid)) + handle.write('{}_rois.npz\n'.format(pid)) + + subprocess.call('rsync -av --files-from {} {} {}'.format(os.path.join(fold_dir, 'file_list.txt'), + source_dir, target_dir), shell=True) + dutils.unpack_dataset(target_dir) + copied_files = 
os.listdir(target_dir) + logger.info("copying and unpacking data set finished: {} files in target dir: {}. took {} sec".format( + len(copied_files), target_dir, np.round(time.time() - start_time, 0))) + diff --git a/experiments/lidc_exp/pack_dataset.py b/experiments/lidc_exp/pack_dataset.py new file mode 100644 index 0000000..23eb174 --- /dev/null +++ b/experiments/lidc_exp/pack_dataset.py @@ -0,0 +1,83 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
+# ============================================================================== + +import numpy as np +from multiprocessing import Pool +import os +import subprocess + + +def get_case_identifiers(folder): + case_identifiers = [i[:-4] for i in os.listdir(folder) if i.endswith("npz")] + return case_identifiers + + +def convert_to_npy(npz_file): + if not os.path.isfile(npz_file[:-3] + "npy"): + a = np.load(npz_file)['data'] + np.save(npz_file[:-3] + "npy", a) + + +def unpack_dataset(folder, threads=8): + case_identifiers = get_case_identifiers(folder) + p = Pool(threads) + npz_files = [os.path.join(folder, i + ".npz") for i in case_identifiers] + p.map(convert_to_npy, npz_files) + p.close() + p.join() + + +def delete_npy(folder): + case_identifiers = get_case_identifiers(folder) + npy_files = [os.path.join(folder, i + ".npy") for i in case_identifiers] + npy_files = [i for i in npy_files if os.path.isfile(i)] + for n in npy_files: + os.remove(n) + + +def mp_pack(inputs): + ix , f = inputs + file_path, source_dir, target_dir = f + print('packing file number: {}'.format(ix)) + if 'npy' in file_path: + source_path = os.path.join(source_dir, file_path) + target_path = os.path.join(target_dir, file_path.split('.')[0] + '.npz') + arr = np.load(source_path, mmap_mode='r') + np.savez_compressed(target_path, data=arr) + print('target_path', target_path) + + +if __name__ == '__main__': + + use_previous = False + source_dir = '/mnt/hdd2/lidc/test_pp_rounding/' + target_dir = '/mnt/hdd2/lidc/test_pp_rounding_packed/' + + if use_previous: + file_list = [ii for ii in os.listdir(source_dir) if not ii in os.listdir(target_dir)] + else: + file_list = os.listdir(source_dir) + info_list = [[ii, source_dir, target_dir] for ii in file_list] + + if not os.path.exists(target_dir): + os.mkdir(target_dir) + + pool = Pool(processes=12) + p1 = pool.map(mp_pack, enumerate(info_list), chunksize=1) + pool.close() + pool.join() + + subprocess.call('cp {} {}'.format(os.path.join(source_dir, 
'info_df.pickle'), os.path.join(target_dir, 'info_df.pickle')), shell=True) \ No newline at end of file diff --git a/experiments/lidc_exp/preprocessing.py b/experiments/lidc_exp/preprocessing.py new file mode 100644 index 0000000..73c6ef5 --- /dev/null +++ b/experiments/lidc_exp/preprocessing.py @@ -0,0 +1,136 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import os +import SimpleITK as sitk +import numpy as np +from multiprocessing import Pool +import pandas as pd +import numpy.testing as npt +from skimage.transform import resize +import subprocess + +import configs +cf = configs.configs() + +# if a rater did not identify a nodule, this vote counts as 0 on the pixels and as 0 (background) on the malignancy score. +# this suppresses nodules without a pixel majority: they are not stored in the segmentation map and their malignancy labels are discarded. +# a pixel counts as foreground, if at least 2 raters drew it as foreground.
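The module-level comments above describe the rater-consensus rule that `pp_patient` implements further down via `np.mean(roi_raters, axis=0)` and the `>= 0.5` threshold: with four raters, a pixel survives only if at least two of them marked it. A minimal sketch of that voting step (the helper name `merge_rater_masks` is hypothetical and not part of the toolkit):

```python
import numpy as np

def merge_rater_masks(rater_masks):
    """Average binary per-rater masks and keep pixels marked by at least
    half of the raters (with 4 raters: >= 2 votes)."""
    votes = np.mean(np.asarray(rater_masks, dtype=float), axis=0)
    return (votes >= 0.5).astype(np.uint8)

# toy 1D "image" with 4 raters: pixel 0 gets 3 votes, pixel 1 gets 2, pixel 2 gets 0.
masks = [np.array([1, 1, 0]),
         np.array([1, 0, 0]),
         np.array([0, 0, 0]),
         np.array([1, 1, 0])]
merged = merge_rater_masks(masks)  # -> array([1, 1, 0], dtype=uint8)
```

Note that, as in the script below, raters who did not annotate a nodule contribute all-zero masks, so missing votes count against foreground.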
+ + +def resample_array(src_imgs, src_spacing, target_spacing): + + src_spacing = np.round(src_spacing, 3) + target_shape = [int(src_imgs.shape[ix] * src_spacing[::-1][ix] / target_spacing[::-1][ix]) for ix in range(len(src_imgs.shape))] + # print('target shape', target_shape, src_imgs.shape, src_spacing, target_spacing) + for i in range(len(target_shape)): + try: + assert target_shape[i] > 0 + except: + raise AssertionError("AssertionError:", src_imgs.shape, src_spacing, target_spacing) + + img = src_imgs.astype(float) + resampled_img = resize(img, target_shape, order=1, clip=True, mode='edge').astype('float32') + + return resampled_img + + +def pp_patient(inputs): + + ix, path = inputs + pid = path.split('/')[-1] + img = sitk.ReadImage(os.path.join(path, '{}_ct_scan.nrrd'.format(pid))) + img_arr = sitk.GetArrayFromImage(img) + print('processing {}'.format(pid), img.GetSpacing(), img_arr.shape) + img_arr = resample_array(img_arr, img.GetSpacing(), cf.target_spacing) + img_arr = np.clip(img_arr, -1200, 600) + #img_arr = (1200 + img_arr) / (600 + 1200) * 255 # a+x / (b-a) * (c-d) (c, d = new) + img_arr = img_arr.astype(np.float32) + img_arr = (img_arr - np.mean(img_arr)) / np.std(img_arr).astype(np.float16) + print('img arr shape after', img_arr.shape) + + # import matplotlib.pyplot as plt + # plt.figure() + # plt.hist(img_arr.flatten(), bins=100) + # plt.savefig(cf.root_dir + '/test.png') + # plt.close() + + df = pd.read_csv(os.path.join(cf.root_dir, 'characteristics.csv'), sep=';') + df = df[df.PatientID == pid] + + final_rois = np.zeros_like(img_arr, dtype=np.uint8) + mal_labels = [] + roi_ids = set([ii.split('.')[0].split('_')[-1] for ii in os.listdir(path) if '.nii.gz' in ii]) + + rix = 1 + for rid in roi_ids: + roi_id_paths = [ii for ii in os.listdir(path) if '{}.nii'.format(rid) in ii] + nodule_ids = [ii.split('_')[2].lstrip("0") for ii in roi_id_paths] + rater_labels = [df[df.NoduleID == int(ii)].Malignancy.values[0] for ii in nodule_ids] + 
rater_labels.extend([0] * (4-len(rater_labels))) + # print(nodule_ids, roi_id_paths, df.Malignancy.values, pid) + mal_label = np.mean([ii for ii in rater_labels if ii > -1]) + roi_rater_list = [] + for rp in roi_id_paths: + roi = sitk.ReadImage(os.path.join(cf.raw_data_dir, pid, rp)) + roi_arr = sitk.GetArrayFromImage(roi).astype(np.uint8) + roi_arr = resample_array(roi_arr, roi.GetSpacing(), cf.target_spacing) + assert roi_arr.shape == img_arr.shape, [roi_arr.shape, img_arr.shape, pid, roi.GetSpacing()] + for ix in range(len(img_arr.shape)): + npt.assert_almost_equal(roi.GetSpacing()[ix], img.GetSpacing()[ix]) + roi_rater_list.append(roi_arr) + roi_rater_list.extend([np.zeros_like(roi_rater_list[-1])]*(4-len(roi_id_paths))) + roi_raters = np.array(roi_rater_list) + roi_raters = np.mean(roi_raters, axis=0) + roi_raters[roi_raters < 0.5] = 0 + if np.sum(roi_raters) > 0: + mal_labels.append(mal_label) + final_rois[roi_raters >= 0.5] = rix + rix += 1 + else: + print('suppressed roi!', roi_id_paths) + with open(os.path.join(cf.pp_dir, 'suppressed_rois.txt'), 'a') as handle: + handle.write(" ".join(roi_id_paths)) + + fg_slices = [ii for ii in np.unique(np.argwhere(final_rois != 0)[:, 0])] + mal_labels = np.array(mal_labels) + assert len(mal_labels) + 1 == len(np.unique(final_rois)), [len(mal_labels), np.unique(final_rois), pid] + out_df = pd.read_pickle(os.path.join(cf.pp_dir, 'info_df.pickle')) + out_df.loc[len(out_df)] = {'pid': pid, 'class_target': mal_labels, 'spacing': img.GetSpacing(), 'fg_slices': fg_slices} + out_df.to_pickle(os.path.join(cf.pp_dir, 'info_df.pickle')) + np.save(os.path.join(cf.pp_dir, '{}_rois.npy'.format(pid)), final_rois) + np.save(os.path.join(cf.pp_dir, '{}_img.npy'.format(pid)), img_arr) + + + +if __name__ == "__main__": + + paths = [os.path.join(cf.raw_data_dir, ii) for ii in os.listdir(cf.raw_data_dir)] + + if not os.path.exists(cf.pp_dir): + os.mkdir(cf.pp_dir) + + df = pd.DataFrame(columns=['pid', 'class_target', 'spacing',
'fg_slices']) + df.to_pickle(os.path.join(cf.pp_dir, 'info_df.pickle')) + + pool = Pool(processes=12) + p1 = pool.map(pp_patient, enumerate(paths), chunksize=1) + pool.close() + pool.join() + # for i in enumerate(paths): + # pp_patient(i) + + subprocess.call('cp {} {}'.format(os.path.join(cf.pp_dir, 'info_df.pickle'), os.path.join(cf.pp_dir, 'info_df_bk.pickle')), shell=True) \ No newline at end of file diff --git a/experiments/toy_exp/__pycache__/configs.cpython-35.pyc b/experiments/toy_exp/__pycache__/configs.cpython-35.pyc new file mode 100644 index 0000000..6171c47 Binary files /dev/null and b/experiments/toy_exp/__pycache__/configs.cpython-35.pyc differ diff --git a/experiments/toy_exp/__pycache__/configs.cpython-36.pyc b/experiments/toy_exp/__pycache__/configs.cpython-36.pyc new file mode 100644 index 0000000..2b334d9 Binary files /dev/null and b/experiments/toy_exp/__pycache__/configs.cpython-36.pyc differ diff --git a/experiments/toy_exp/__pycache__/data_loader.cpython-35.pyc b/experiments/toy_exp/__pycache__/data_loader.cpython-35.pyc new file mode 100644 index 0000000..7347c22 Binary files /dev/null and b/experiments/toy_exp/__pycache__/data_loader.cpython-35.pyc differ diff --git a/experiments/toy_exp/__pycache__/data_loader.cpython-36.pyc b/experiments/toy_exp/__pycache__/data_loader.cpython-36.pyc new file mode 100644 index 0000000..ab4d287 Binary files /dev/null and b/experiments/toy_exp/__pycache__/data_loader.cpython-36.pyc differ diff --git a/experiments/toy_exp/configs.py b/experiments/toy_exp/configs.py new file mode 100644 index 0000000..d414d76 --- /dev/null +++ b/experiments/toy_exp/configs.py @@ -0,0 +1,345 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import sys +import os +sys.path.append(os.path.dirname(os.path.realpath(__file__))) +import numpy as np +from default_configs import DefaultConfigs + +class configs(DefaultConfigs): + + def __init__(self, server_env=None): + + ######################### + # Preprocessing # + ######################### + + self.root_dir = '/path/to/data' + + ######################### + # I/O # + ######################### + + + # one out of [2, 3]. dimension the model operates in. + self.dim = 2 + + # one out of ['mrcnn', 'retina_net', 'retina_unet', 'detection_unet', 'ufrcnn']. + self.model = 'ufrcnn' + + DefaultConfigs.__init__(self, self.model, server_env, self.dim) + + # int [0 < dataset_size]. select n patients from dataset for prototyping. + self.select_prototype_subset = None + self.hold_out_test_set = True + self.n_train_data = 1000 + + # choose one of the 3 toy experiments described in https://arxiv.org/pdf/1811.08661.pdf + # one of ['donuts_shape', 'donuts_pattern', 'circles_scale']. + toy_mode = 'donuts_shape' + + + # path to preprocessed data. + self.input_df_name = 'info_df.pickle' + self.pp_name = os.path.join(toy_mode, 'train') + self.pp_data_path = os.path.join(self.root_dir, self.pp_name) + self.pp_test_name = os.path.join(toy_mode, 'test') + self.pp_test_data_path = os.path.join(self.root_dir, self.pp_test_name) + + # settings for deployment in cloud. + if server_env: + # path to preprocessed data. 
+ pp_root_dir = '/path/to/data' + self.pp_name = os.path.join(toy_mode, 'train') + self.pp_data_path = os.path.join(pp_root_dir, self.pp_name) + self.pp_test_name = os.path.join(toy_mode, 'test') + self.pp_test_data_path = os.path.join(pp_root_dir, self.pp_test_name) + self.select_prototype_subset = None + + ######################### + # Data Loader # + ######################### + + # select modalities from preprocessed data + self.channels = [0] + self.n_channels = len(self.channels) + + # patch_size to be used for training. pre_crop_size is the patch_size before data augmentation. + self.pre_crop_size_2D = [320, 320] + self.patch_size_2D = [320, 320] + + self.patch_size = self.patch_size_2D if self.dim == 2 else self.patch_size_3D + self.pre_crop_size = self.pre_crop_size_2D if self.dim == 2 else self.pre_crop_size_3D + + # ratio of free sampled batch elements before class balancing is triggered + # (>0 to include "empty"/background patches.) + self.batch_sample_slack = 0.2 + + # set 2D network to operate in 3D images. + self.merge_2D_to_3D_preds = False + + # feed +/- n neighbouring slices into channel dimension. set to None for no context. 
+ self.n_3D_context = None + if self.n_3D_context is not None and self.dim == 2: + self.n_channels *= (self.n_3D_context * 2 + 1) + + + ######################### + # Architecture # + ######################### + + self.start_filts = 48 if self.dim == 2 else 18 + self.end_filts = self.start_filts * 4 if self.dim == 2 else self.start_filts * 2 + self.res_architecture = 'resnet50' # 'resnet101' , 'resnet50' + self.norm = None # one of None, 'instance_norm', 'batch_norm' + self.weight_decay = 0 + + # one of 'xavier_uniform', 'xavier_normal', or 'kaiming_normal', None (=default = 'kaiming_uniform') + self.weight_init = None + + ######################### + # Schedule / Selection # + ######################### + + self.num_epochs = 100 + self.num_train_batches = 200 if self.dim == 2 else 200 + self.batch_size = 20 if self.dim == 2 else 8 + + self.do_validation = True + # decide whether to validate on entire patient volumes (like testing) or sampled patches (like training) + # the former is more accurate, while the latter is faster (depending on volume size) + self.val_mode = 'val_patient' # one of 'val_sampling' , 'val_patient' + if self.val_mode == 'val_patient': + self.max_val_patients = None # if 'None' iterates over entire val_set once. + if self.val_mode == 'val_sampling': + self.num_val_batches = 50 + + ######################### + # Testing / Plotting # + ######################### + + # set the top-n-epochs to be saved for temporal averaging in testing. + self.save_n_models = 5 + self.test_n_epochs = 5 + + # set a minimum epoch number for saving in case of instabilities in the first phase of training. + self.min_save_thresh = 0 if self.dim == 2 else 0 + + self.report_score_level = ['patient', 'rois'] # choose list from 'patient', 'rois' + self.class_dict = {1: 'benign', 2: 'malignant'} # 0 is background. + self.patient_class_of_interest = 2 # patient metrics are only plotted for one class. + self.ap_match_ious = [0.1] # list of ious to be evaluated for ap-scoring. 
+ + self.model_selection_criteria = ['benign_ap', 'malignant_ap'] # criteria to average over for saving epochs. + self.min_det_thresh = 0.1 # minimum confidence value to select predictions for evaluation. + + # threshold for clustering predictions together (wcs = weighted cluster scoring). + # needs to be >= the expected overlap of predictions coming from one model (typically NMS threshold). + # if too high, preds of the same object are separate clusters. + self.wcs_iou = 1e-5 + + self.plot_prediction_histograms = True + self.plot_stat_curves = False + + ######################### + # Data Augmentation # + ######################### + + self.da_kwargs={ + 'do_elastic_deform': True, + 'alpha':(0., 1500.), + 'sigma':(30., 50.), + 'do_rotation':True, + 'angle_x': (0., 2 * np.pi), + 'angle_y': (0., 0), + 'angle_z': (0., 0), + 'do_scale': True, + 'scale':(0.8, 1.1), + 'random_crop':False, + 'rand_crop_dist': (self.patch_size[0] / 2. - 3, self.patch_size[1] / 2. - 3), + 'border_mode_data': 'constant', + 'border_cval_data': 0, + 'order_data': 1 + } + + if self.dim == 3: + self.da_kwargs['do_elastic_deform'] = False + self.da_kwargs['angle_x'] = (0, 0.0) + self.da_kwargs['angle_y'] = (0, 0.0) #must be 0!! + self.da_kwargs['angle_z'] = (0., 2 * np.pi) + + + ######################### + # Add model specifics # + ######################### + + {'detection_unet': self.add_det_unet_configs, + 'mrcnn': self.add_mrcnn_configs, + 'ufrcnn': self.add_mrcnn_configs, + 'ufrcnn_surrounding': self.add_mrcnn_configs, + 'retina_net': self.add_mrcnn_configs, + 'retina_unet': self.add_mrcnn_configs, + 'prob_detector': self.add_mrcnn_configs, + }[self.model]() + + + def add_det_unet_configs(self): + + self.learning_rate = [1e-4] * self.num_epochs + + # aggregation from pixel prediction to object scores (connected component). 
One of ['max', 'median'] + self.aggregation_operation = 'max' + + # max number of roi candidates to identify per image (slice in 2D, volume in 3D) + self.n_roi_candidates = 3 if self.dim == 2 else 8 + + # loss mode: either weighted cross entropy ('wce'), batch-wise dice loss ('dice'), or the sum of both ('dice_wce') + self.seg_loss_mode = 'dice_wce' + + # if <1, false positive predictions in foreground are penalized less. + self.fp_dice_weight = 1 if self.dim == 2 else 1 + + self.wce_weights = [1, 1, 1] + self.detection_min_confidence = self.min_det_thresh + + # if 'True', loss distinguishes all classes, else only foreground vs. background (class agnostic). + self.class_specific_seg_flag = True + self.num_seg_classes = 3 if self.class_specific_seg_flag else 2 + self.head_classes = self.num_seg_classes + + def add_mrcnn_configs(self): + + # learning rate is a list with one entry per epoch. + self.learning_rate = [1e-4] * self.num_epochs + + # disable mask head loss. (e.g. if no pixelwise annotations available) + self.frcnn_mode = False + + # disable the re-sampling of mask proposals to original size for speed-up. + # since evaluation is detection-driven (box-matching) and not instance segmentation-driven (iou-matching), + # mask-outputs are optional. + self.return_masks_in_val = True + self.return_masks_in_test = False + + # set number of proposal boxes to plot after each epoch. + self.n_plot_rpn_props = 5 if self.dim == 2 else 30 + + # number of classes for head networks: n_foreground_classes + 1 (background) + self.head_classes = 3 + + # seg_classes here refers to the first stage classifier (RPN) + self.num_seg_classes = 2 # foreground vs. background + + # feature map strides per pyramid level are inferred from architecture. + self.backbone_strides = {'xy': [4, 8, 16, 32], 'z': [1, 2, 4, 8]} + + # anchor scales are chosen according to expected object sizes in data set. Default uses only one anchor scale + # per pyramid level. 
(outer list are pyramid levels (corresponding to BACKBONE_STRIDES), inner list are scales per level.) + self.rpn_anchor_scales = {'xy': [[8], [16], [32], [64]], 'z': [[2], [4], [8], [16]]} + + # choose which pyramid levels to extract features from: P2: 0, P3: 1, P4: 2, P5: 3. + self.pyramid_levels = [0, 1, 2, 3] + + # number of feature maps in rpn. typically lowered in 3D to save gpu-memory. + self.n_rpn_features = 512 if self.dim == 2 else 128 + + # anchor ratios and strides per position in feature maps. + self.rpn_anchor_ratios = [0.5, 1, 2] + self.rpn_anchor_stride = 1 + + # Threshold for first stage (RPN) non-maximum suppression (NMS): LOWER == HARDER SELECTION + self.rpn_nms_threshold = 0.7 if self.dim == 2 else 0.7 + + # loss sampling settings. + self.rpn_train_anchors_per_image = 2 #per batch element + self.train_rois_per_image = 2 #per batch element + self.roi_positive_ratio = 0.5 + self.anchor_matching_iou = 0.7 + + # factor of top-k candidates to draw from per negative sample (stochastic-hard-example-mining). + # poolsize to draw top-k candidates from will be shem_poolsize * n_negative_samples. + self.shem_poolsize = 10 + + self.pool_size = (7, 7) if self.dim == 2 else (7, 7, 3) + self.mask_pool_size = (14, 14) if self.dim == 2 else (14, 14, 5) + self.mask_shape = (28, 28) if self.dim == 2 else (28, 28, 10) + + self.rpn_bbox_std_dev = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.2]) + self.bbox_std_dev = np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.2]) + self.window = np.array([0, 0, self.patch_size[0], self.patch_size[1]]) + self.scale = np.array([self.patch_size[0], self.patch_size[1], self.patch_size[0], self.patch_size[1]]) + + if self.dim == 2: + self.rpn_bbox_std_dev = self.rpn_bbox_std_dev[:4] + self.bbox_std_dev = self.bbox_std_dev[:4] + self.window = self.window[:4] + self.scale = self.scale[:4] + + # pre-selection in proposal-layer (stage 1) for NMS-speedup. applied per batch element. 
+ self.pre_nms_limit = 3000 if self.dim == 2 else 6000 + + # n_proposals to be selected after NMS per batch element. too high numbers blow up memory if "detect_while_training" is True, + # since proposals of the entire batch are forwarded through second stage in as one "batch". + self.roi_chunk_size = 800 if self.dim == 2 else 600 + self.post_nms_rois_training = 500 if self.dim == 2 else 75 + self.post_nms_rois_inference = 500 + + # Final selection of detections (refine_detections) + self.model_max_instances_per_batch_element = 10 if self.dim == 2 else 30 # per batch element and class. + self.detection_nms_threshold = 1e-5 # needs to be > 0, otherwise all predictions are one cluster. + self.model_min_confidence = 0.1 + + if self.dim == 2: + self.backbone_shapes = np.array( + [[int(np.ceil(self.patch_size[0] / stride)), + int(np.ceil(self.patch_size[1] / stride))] + for stride in self.backbone_strides['xy']]) + else: + self.backbone_shapes = np.array( + [[int(np.ceil(self.patch_size[0] / stride)), + int(np.ceil(self.patch_size[1] / stride)), + int(np.ceil(self.patch_size[2] / stride_z))] + for stride, stride_z in zip(self.backbone_strides['xy'], self.backbone_strides['z'] + )]) + if self.model == 'ufrcnn': + self.operate_stride1 = True + self.class_specific_seg_flag = True + self.num_seg_classes = 3 if self.class_specific_seg_flag else 2 + self.frcnn_mode = True + + if self.model == 'retina_net' or self.model == 'retina_unet' or self.model == 'prob_detector': + # implement extra anchor-scales according to retina-net publication. + self.rpn_anchor_scales['xy'] = [[ii[0], ii[0] * (2 ** (1 / 3)), ii[0] * (2 ** (2 / 3))] for ii in + self.rpn_anchor_scales['xy']] + self.rpn_anchor_scales['z'] = [[ii[0], ii[0] * (2 ** (1 / 3)), ii[0] * (2 ** (2 / 3))] for ii in + self.rpn_anchor_scales['z']] + self.n_anchors_per_pos = len(self.rpn_anchor_ratios) * 3 + + self.n_rpn_features = 256 if self.dim == 2 else 64 + + # pre-selection of detections for NMS-speedup. per entire batch. 
+ self.pre_nms_limit = 10000 if self.dim == 2 else 50000 + + # anchor matching iou is lower than in Mask R-CNN according to https://arxiv.org/abs/1708.02002 + self.anchor_matching_iou = 0.5 + + # if 'True', seg loss distinguishes all classes, else only foreground vs. background (class agnostic). + self.num_seg_classes = 3 if self.class_specific_seg_flag else 2 + + if self.model == 'retina_unet': + self.operate_stride1 = True + self.class_specific_seg_flag = True diff --git a/experiments/toy_exp/data_loader.py b/experiments/toy_exp/data_loader.py new file mode 100644 index 0000000..158896b --- /dev/null +++ b/experiments/toy_exp/data_loader.py @@ -0,0 +1,282 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +import numpy as np +import os +from collections import OrderedDict +import pandas as pd +import pickle +import time +import subprocess +import utils.dataloader_utils as dutils + +# batch generator tools from https://github.com/MIC-DKFZ/batchgenerators +from batchgenerators.dataloading.data_loader import SlimDataLoaderBase +from batchgenerators.transforms.spatial_transforms import MirrorTransform as Mirror +from batchgenerators.transforms.abstract_transforms import Compose +from batchgenerators.dataloading.multi_threaded_augmenter import MultiThreadedAugmenter +from batchgenerators.dataloading import SingleThreadedAugmenter +from batchgenerators.transforms.spatial_transforms import SpatialTransform +from batchgenerators.transforms.crop_and_pad_transforms import CenterCropTransform +from batchgenerators.transforms.utility_transforms import ConvertSegToBoundingBoxCoordinates + + + +def get_train_generators(cf, logger): + """ + wrapper function for creating the training batch generator pipeline. returns the train/val generators. + selects patients according to cv folds (generated by first run/fold of experiment): + splits the data into n-folds, where 1 split is used for val, 1 split for testing and the rest for training. (inner loop test set) + If cf.hold_out_test_set is True, adds the test split to the training data. 
+ """ + all_data = load_dataset(cf, logger) + all_pids_list = np.unique([v['pid'] for (k, v) in all_data.items()]) + + train_pids = all_pids_list[:cf.n_train_data] + val_pids = all_pids_list[1000:1500] + + train_data = {k: v for (k, v) in all_data.items() if any(p == v['pid'] for p in train_pids)} + val_data = {k: v for (k, v) in all_data.items() if any(p == v['pid'] for p in val_pids)} + + logger.info("data set loaded with: {} train / {} val patients".format(len(train_pids), len(val_pids))) + batch_gen = {} + batch_gen['train'] = create_data_gen_pipeline(train_data, cf=cf, do_aug=False) + batch_gen['val_sampling'] = create_data_gen_pipeline(val_data, cf=cf, do_aug=False) + if cf.val_mode == 'val_patient': + batch_gen['val_patient'] = PatientBatchIterator(val_data, cf=cf) + batch_gen['n_val'] = len(val_pids) if cf.max_val_patients is None else cf.max_val_patients + else: + batch_gen['n_val'] = cf.num_val_batches + + return batch_gen + + +def get_test_generator(cf, logger): + """ + wrapper function for creating the test batch generator pipeline. + selects patients according to cv folds (generated by first run/fold of experiment) + If cf.hold_out_test_set is True, gets the data from an external folder instead. + """ + if cf.hold_out_test_set: + cf.pp_data_path = cf.pp_test_data_path + cf.pp_name = cf.pp_test_name + test_ix = None + else: + with open(os.path.join(cf.exp_dir, 'fold_ids.pickle'), 'rb') as handle: + fold_list = pickle.load(handle) + _, _, test_ix, _ = fold_list[cf.fold] + # warnings.warn('WARNING: using validation set for testing!!!') + + test_data = load_dataset(cf, logger, test_ix) + logger.info("data set loaded with: {} test patients from {}".format(len(test_data.keys()), cf.pp_data_path)) + batch_gen = {} + batch_gen['test'] = PatientBatchIterator(test_data, cf=cf) + batch_gen['n_test'] = len(test_data.keys()) + return batch_gen + + + +def load_dataset(cf, logger, subset_ixs=None): + """ + loads the dataset. 
if deployed in cloud also copies and unpacks the data to the working directory. + :param subset_ixs: subset indices to be loaded from the dataset. used e.g. for testing to only load the test folds. + :return: data: dictionary with one entry per patient (in this case per patient-breast, since they are treated as + individual images for training) each entry is a dictionary containing respective meta-info as well as paths to the preprocessed + numpy arrays to be loaded during batch-generation + """ + if cf.server_env: + copy_data = True + target_dir = os.path.join('/ssd', cf.slurm_job_id, cf.pp_name) + if not os.path.exists(target_dir): + cf.data_source_dir = cf.pp_data_path + os.makedirs(target_dir) + subprocess.call('rsync -av {} {}'.format( + os.path.join(cf.data_source_dir, cf.input_df_name), os.path.join(target_dir, cf.input_df_name)), shell=True) + logger.info('created target dir and info df at {}'.format(os.path.join(target_dir, cf.input_df_name))) + + elif subset_ixs is None: + copy_data = False + + cf.pp_data_path = target_dir + + + p_df = pd.read_pickle(os.path.join(cf.pp_data_path, cf.input_df_name)) + + + if subset_ixs is not None: + subset_pids = [np.unique(p_df.pid.tolist())[ix] for ix in subset_ixs] + p_df = p_df[p_df.pid.isin(subset_pids)] + logger.info('subset: selected {} instances from df'.format(len(p_df))) + + if cf.server_env: + if copy_data: + copy_and_unpack_data(logger, p_df.pid.tolist(), cf.fold_dir, cf.data_source_dir, target_dir) + + class_targets = p_df['class_id'].tolist() + pids = p_df.pid.tolist() + imgs = [os.path.join(cf.pp_data_path, '{}.npy'.format(pid)) for pid in pids] + segs = [os.path.join(cf.pp_data_path,'{}.npy'.format(pid)) for pid in pids] + + data = OrderedDict() + for ix, pid in enumerate(pids): + + data[pid] = {'data': imgs[ix], 'seg': segs[ix], 'pid': pid, 'class_target': [class_targets[ix]]} + + return data + + + +def create_data_gen_pipeline(patient_data, cf, do_aug=True): + """ + create multi-threaded train/val/test 
batch generation and augmentation pipeline. + :param patient_data: dictionary containing one dictionary per patient in the train/test subset. + :param do_aug: (optional) whether to perform data augmentation (training) or not (validation/testing) + :return: multithreaded_generator + """ + + # create instance of batch generator as first element in pipeline. + data_gen = BatchGenerator(patient_data, batch_size=cf.batch_size, cf=cf) + + # add transformations to pipeline. + my_transforms = [] + if do_aug: + mirror_transform = Mirror(axes=np.arange(2, cf.dim+2, 1)) + my_transforms.append(mirror_transform) + spatial_transform = SpatialTransform(patch_size=cf.patch_size[:cf.dim], + patch_center_dist_from_border=cf.da_kwargs['rand_crop_dist'], + do_elastic_deform=cf.da_kwargs['do_elastic_deform'], + alpha=cf.da_kwargs['alpha'], sigma=cf.da_kwargs['sigma'], + do_rotation=cf.da_kwargs['do_rotation'], angle_x=cf.da_kwargs['angle_x'], + angle_y=cf.da_kwargs['angle_y'], angle_z=cf.da_kwargs['angle_z'], + do_scale=cf.da_kwargs['do_scale'], scale=cf.da_kwargs['scale'], + random_crop=cf.da_kwargs['random_crop']) + + my_transforms.append(spatial_transform) + else: + my_transforms.append(CenterCropTransform(crop_size=cf.patch_size[:cf.dim])) + + my_transforms.append(ConvertSegToBoundingBoxCoordinates(cf.dim, get_rois_from_seg_flag=False, class_specific_seg_flag=cf.class_specific_seg_flag)) + all_transforms = Compose(my_transforms) + # multithreaded_generator = SingleThreadedAugmenter(data_gen, all_transforms) + multithreaded_generator = MultiThreadedAugmenter(data_gen, all_transforms, num_processes=cf.n_workers, seeds=range(cf.n_workers)) + return multithreaded_generator + + +class BatchGenerator(SlimDataLoaderBase): + """ + creates the training/validation batch generator. Samples n_batch_size patients (draws a slice from each patient if 2D) + from the data set while maintaining foreground-class balance. Returned patches are cropped/padded to pre_crop_size. 
+ Actual patch_size is obtained after data augmentation. + :param data: data dictionary as provided by 'load_dataset'. + :param batch_size: number of patients to sample for the batch + :return dictionary containing the batch data (b, c, x, y, (z)) / seg (b, 1, x, y, (z)) / pids / class_target + """ + def __init__(self, data, batch_size, cf): + super(BatchGenerator, self).__init__(data, batch_size) + + self.cf = cf + + def generate_train_batch(self): + + batch_data, batch_segs, batch_pids, batch_targets = [], [], [], [] + class_targets_list = [v['class_target'] for (k, v) in self._data.items()] + + #samples patients towards equilibrium of foreground classes on a roi-level (after randomly sampling the ratio "batch_sample_slack"). + batch_ixs = dutils.get_class_balanced_patients( + class_targets_list, self.batch_size, self.cf.head_classes - 1, slack_factor=self.cf.batch_sample_slack) + patients = list(self._data.items()) + + for b in batch_ixs: + + patient = patients[b][1] + all_data = np.load(patient['data'], mmap_mode='r') + data = all_data[0].astype('float16') + seg = all_data[1].astype('uint8') + batch_pids.append(patient['pid']) + batch_targets.append(patient['class_target']) + batch_data.append(data[np.newaxis]) + batch_segs.append(seg[np.newaxis]) + + data = np.array(batch_data).astype(np.float16) + seg = np.array(batch_segs).astype(np.uint8) + class_target = np.array(batch_targets) + return {'data': data, 'seg': seg, 'pid': batch_pids, 'class_target': class_target} + + + +class PatientBatchIterator(SlimDataLoaderBase): + """ + creates a test generator that iterates over entire given dataset returning 1 patient per batch. + Can be used for monitoring if cf.val_mode = 'val_patient' for monitoring closer to actual evaluation (done in 3D), + if willing to accept speed-loss during training. + :return: out_batch: dictionary containing one patient with batch_size = n_3D_patches in 3D or + batch_size = n_2D_patches in 2D. 
+ """ + def __init__(self, data, cf): #threads in augmenter + super(PatientBatchIterator, self).__init__(data, 0) + self.cf = cf + self.patient_ix = 0 + self.dataset_pids = [v['pid'] for (k, v) in data.items()] + self.patch_size = cf.patch_size + if len(self.patch_size) == 2: + self.patch_size = self.patch_size + [1] + + + def generate_train_batch(self): + + + pid = self.dataset_pids[self.patient_ix] + patient = self._data[pid] + all_data = np.load(patient['data'], mmap_mode='r') + data = all_data[0].astype('float16') + seg = all_data[1].astype('uint8') + batch_class_targets = np.array([patient['class_target']]) + + out_data = data[None, None] + out_seg = seg[None, None] + + print('check patient data loader', out_data.shape, out_seg.shape) + batch_2D = {'data': out_data, 'seg': out_seg, 'class_target': batch_class_targets, 'pid': pid} + converter = ConvertSegToBoundingBoxCoordinates(dim=2, get_rois_from_seg_flag=False, class_specific_seg_flag=self.cf.class_specific_seg_flag) + batch_2D = converter(**batch_2D) + + batch_2D.update({'patient_bb_target': batch_2D['bb_target'], + 'patient_roi_labels': batch_2D['roi_labels'], + 'original_img_shape': out_data.shape}) + + self.patient_ix += 1 + if self.patient_ix == len(self.dataset_pids): + self.patient_ix = 0 + + return batch_2D + + + +def copy_and_unpack_data(logger, pids, fold_dir, source_dir, target_dir): + + + start_time = time.time() + with open(os.path.join(fold_dir, 'file_list.txt'), 'w') as handle: + for pid in pids: + handle.write('{}.npy\n'.format(pid)) + + subprocess.call('rsync -av --files-from {} {} {}'.format(os.path.join(fold_dir, 'file_list.txt'), + source_dir, target_dir), shell=True) + # dutils.unpack_dataset(target_dir) + copied_files = os.listdir(target_dir) + logger.info("copying and unpacking data set finsihed : {} files in target dir: {}. 
took {} sec".format( + len(copied_files), target_dir, np.round(time.time() - start_time, 0))) + diff --git a/experiments/toy_exp/generate_toys.py b/experiments/toy_exp/generate_toys.py new file mode 100644 index 0000000..9f336c9 --- /dev/null +++ b/experiments/toy_exp/generate_toys.py @@ -0,0 +1,94 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import os +import numpy as np +import pandas as pd +from multiprocessing import Pool +import configs as cf + +def multi_processing_create_image(inputs): + + + out_dir, six, foreground_margin, class_diameters, mode = inputs + print('proceesing {} {}'.format(out_dir, six)) + + img = np.random.rand(320, 320) + seg = np.zeros((320, 320)).astype('uint8') + center_x = np.random.randint(foreground_margin, img.shape[0] - foreground_margin) + center_y = np.random.randint(foreground_margin, img.shape[1] - foreground_margin) + class_id = np.random.randint(0, 2) + + for y in range(img.shape[0]): + for x in range(img.shape[0]): + if ((x - center_x) ** 2 + (y - center_y) ** 2 - class_diameters[class_id] ** 2) < 0: + img[y][x] += 0.2 + seg[y][x] = 1 + + if 'donuts' in mode: + whole_diameter = 4 + if class_id == 1: + for y in range(img.shape[0]): + for x in range(img.shape[0]): + if ((x - center_x) ** 2 + (y - center_y) ** 2 - whole_diameter 
** 2) < 0: + img[y][x] -= 0.2 + if mode == 'donuts_shape': + seg[y][x] = 0 + + out = np.concatenate((img[None], seg[None])) + out_path = os.path.join(out_dir, '{}.npy'.format(six)) + df = pd.read_pickle(os.path.join(out_dir, 'info_df.pickle')) + df.loc[len(df)] = [out_path, class_id, str(six)] + df.to_pickle(os.path.join(out_dir, 'info_df.pickle')) + np.save(out_path, out) + + +def get_toy_image_info(mode, n_images, out_dir, class_diameters=(20, 20)): + + if not os.path.exists(out_dir): + os.makedirs(out_dir) + + # enforced distance between object center and image edge. + foreground_margin = np.max(class_diameters) // 2 + + df = pd.DataFrame(columns=['path', 'class_id', 'pid']) + df.to_pickle(os.path.join(out_dir, 'info_df.pickle')) + return [[out_dir, six, foreground_margin, class_diameters, mode] for six in range(n_images)] + + +if __name__ == '__main__': + + cf = cf.configs() + + root_dir = os.path.join(cf.root_dir, 'donuts_shape') + info = [] + info += get_toy_image_info(mode='donuts_shape', n_images=1500, out_dir=os.path.join(root_dir, 'train')) + info += get_toy_image_info(mode='donuts_shape', n_images=1000, out_dir=os.path.join(root_dir, 'test')) + + root_dir = os.path.join(cf.root_dir, 'donuts_pattern') + info += get_toy_image_info(mode='donuts_pattern', n_images=1500, out_dir=os.path.join(root_dir, 'train')) + info += get_toy_image_info(mode='donuts_pattern', n_images=1000, out_dir=os.path.join(root_dir, 'test')) + + root_dir = os.path.join(cf.root_dir, 'circles_scale') + info += get_toy_image_info(mode='circles_scale', n_images=1500, out_dir=os.path.join(root_dir, 'train'), class_diameters=(19, 20)) + info += get_toy_image_info(mode='circles_scale', n_images=1000, out_dir=os.path.join(root_dir, 'test'), class_diameters=(19, 20)) + + print('starting creating {} images'.format(len(info))) + pool = Pool(processes=12) + pool.map(multi_processing_create_image, info, chunksize=1) + pool.close() + pool.join() + diff --git a/models/backbone.py 
b/models/backbone.py new file mode 100644 index 0000000..8d249ac --- /dev/null +++ b/models/backbone.py @@ -0,0 +1,288 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import torch.nn as nn +import torch.nn.functional as F +import torch + + +class FPN(nn.Module): + """ + Feature Pyramid Network from https://arxiv.org/pdf/1612.03144.pdf with options for modifications. + by default is constructed with Pyramid levels P2, P3, P4, P5. + """ + def __init__(self, cf, conv, operate_stride1=False): + """ + from configs: + :param input_channels: number of channel dimensions in input data. + :param start_filts: number of feature_maps in first layer. rest is scaled accordingly. + :param out_channels: number of feature_maps for output_layers of all levels in decoder. + :param conv: instance of custom conv class containing the dimension info. + :param res_architecture: string deciding whether to use "resnet50" or "resnet101". + :param operate_stride1: boolean flag. enables adding of Pyramid levels P1 (output stride 2) and P0 (output stride 1). + :param norm: string specifying type of feature map normalization. If None, no normalization is applied. + :param relu: string specifying type of nonlinearity. If None, no nonlinearity is applied. + :param sixth_pooling: boolean flag. 
enables adding of Pyramid level P6. + """ + super(FPN, self).__init__() + + self.start_filts = cf.start_filts + start_filts = self.start_filts + self.n_blocks = [3, 4, {"resnet50": 6, "resnet101": 23}[cf.res_architecture], 3] + self.block = ResBlock + self.block_expansion = 4 + self.operate_stride1 = operate_stride1 + self.sixth_pooling = cf.sixth_pooling + self.dim = conv.dim + + if operate_stride1: + self.C0 = nn.Sequential(conv(cf.n_channels, start_filts, ks=3, pad=1, norm=cf.norm, relu=cf.relu), + conv(start_filts, start_filts, ks=3, pad=1, norm=cf.norm, relu=cf.relu)) + + self.C1 = conv(start_filts, start_filts, ks=7, stride=(2, 2, 1) if conv.dim == 3 else 2, pad=3, norm=cf.norm, relu=cf.relu) + + else: + self.C1 = conv(cf.n_channels, start_filts, ks=7, stride=(2, 2, 1) if conv.dim == 3 else 2, pad=3, norm=cf.norm, relu=cf.relu) + + start_filts_exp = start_filts * self.block_expansion + + C2_layers = [] + C2_layers.append(nn.MaxPool2d(kernel_size=3, stride=2, padding=1) + if conv.dim == 2 else nn.MaxPool3d(kernel_size=3, stride=(2, 2, 1), padding=1)) + C2_layers.append(self.block(start_filts, start_filts, conv=conv, stride=1, norm=cf.norm, relu=cf.relu, + downsample=(start_filts, self.block_expansion, 1))) + for i in range(1, self.n_blocks[0]): + C2_layers.append(self.block(start_filts_exp, start_filts, conv=conv, norm=cf.norm, relu=cf.relu)) + self.C2 = nn.Sequential(*C2_layers) + + C3_layers = [] + C3_layers.append(self.block(start_filts_exp, start_filts * 2, conv=conv, stride=2, norm=cf.norm, relu=cf.relu, + downsample=(start_filts_exp, 2, 2))) + for i in range(1, self.n_blocks[1]): + C3_layers.append(self.block(start_filts_exp * 2, start_filts * 2, conv=conv, norm=cf.norm, relu=cf.relu)) + self.C3 = nn.Sequential(*C3_layers) + + C4_layers = [] + C4_layers.append(self.block( + start_filts_exp * 2, start_filts * 4, conv=conv, stride=2, norm=cf.norm, relu=cf.relu, downsample=(start_filts_exp * 2, 2, 2))) + for i in range(1, self.n_blocks[2]): + 
C4_layers.append(self.block(start_filts_exp * 4, start_filts * 4, conv=conv, norm=cf.norm, relu=cf.relu)) + self.C4 = nn.Sequential(*C4_layers) + + C5_layers = [] + C5_layers.append(self.block( + start_filts_exp * 4, start_filts * 8, conv=conv, stride=2, norm=cf.norm, relu=cf.relu, downsample=(start_filts_exp * 4, 2, 2))) + for i in range(1, self.n_blocks[3]): + C5_layers.append(self.block(start_filts_exp * 8, start_filts * 8, conv=conv, norm=cf.norm, relu=cf.relu)) + self.C5 = nn.Sequential(*C5_layers) + + if self.sixth_pooling: + C6_layers = [] + C6_layers.append(self.block( + start_filts_exp * 8, start_filts * 16, conv=conv, stride=2, norm=cf.norm, relu=cf.relu, downsample=(start_filts_exp * 8, 2, 2))) + for i in range(1, self.n_blocks[3]): + C6_layers.append(self.block(start_filts_exp * 16, start_filts * 16, conv=conv, norm=cf.norm, relu=cf.relu)) + self.C6 = nn.Sequential(*C6_layers) + + if conv.dim == 2: + self.P1_upsample = Interpolate(scale_factor=2, mode='bilinear') + self.P2_upsample = Interpolate(scale_factor=2, mode='bilinear') + else: + self.P1_upsample = Interpolate(scale_factor=(2, 2, 1), mode='trilinear') + self.P2_upsample = Interpolate(scale_factor=(2, 2, 1), mode='trilinear') + + self.out_channels = cf.end_filts + self.P5_conv1 = conv(start_filts*32 + cf.n_latent_dims, self.out_channels, ks=1, stride=1, relu=None) # + self.P4_conv1 = conv(start_filts*16, self.out_channels, ks=1, stride=1, relu=None) + self.P3_conv1 = conv(start_filts*8, self.out_channels, ks=1, stride=1, relu=None) + self.P2_conv1 = conv(start_filts*4, self.out_channels, ks=1, stride=1, relu=None) + self.P1_conv1 = conv(start_filts, self.out_channels, ks=1, stride=1, relu=None) + + if operate_stride1: + self.P0_conv1 = conv(start_filts, self.out_channels, ks=1, stride=1, relu=None) + self.P0_conv2 = conv(self.out_channels, self.out_channels, ks=3, stride=1, pad=1, relu=None) + + self.P1_conv2 = conv(self.out_channels, self.out_channels, ks=3, stride=1, pad=1, relu=None) + 
self.P2_conv2 = conv(self.out_channels, self.out_channels, ks=3, stride=1, pad=1, relu=None) + self.P3_conv2 = conv(self.out_channels, self.out_channels, ks=3, stride=1, pad=1, relu=None) + self.P4_conv2 = conv(self.out_channels, self.out_channels, ks=3, stride=1, pad=1, relu=None) + self.P5_conv2 = conv(self.out_channels, self.out_channels, ks=3, stride=1, pad=1, relu=None) + + if self.sixth_pooling: + self.P6_conv1 = conv(start_filts * 64, self.out_channels, ks=1, stride=1, relu=None) + self.P6_conv2 = conv(self.out_channels, self.out_channels, ks=3, stride=1, pad=1, relu=None) + + + def forward(self, x): + """ + :param x: input image of shape (b, c, y, x, (z)) + :return: list of output feature maps per pyramid level, each with shape (b, c, y, x, (z)). + """ + if self.operate_stride1: + c0_out = self.C0(x) + else: + c0_out = x + + c1_out = self.C1(c0_out) + c2_out = self.C2(c1_out) + c3_out = self.C3(c2_out) + c4_out = self.C4(c3_out) + c5_out = self.C5(c4_out) + if self.sixth_pooling: + c6_out = self.C6(c5_out) + p6_pre_out = self.P6_conv1(c6_out) + p5_pre_out = self.P5_conv1(c5_out) + F.interpolate(p6_pre_out, scale_factor=2) + else: + p5_pre_out = self.P5_conv1(c5_out) + + p4_pre_out = self.P4_conv1(c4_out) + F.interpolate(p5_pre_out, scale_factor=2) + p3_pre_out = self.P3_conv1(c3_out) + F.interpolate(p4_pre_out, scale_factor=2) + p2_pre_out = self.P2_conv1(c2_out) + F.interpolate(p3_pre_out, scale_factor=2) + + # plot feature map shapes for debugging. 
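The top-down pathway built from these `P*_conv1`/`P*_conv2` layers follows the standard FPN recipe: a 1x1 lateral convolution matches channel counts, then the coarser map is upsampled and added element-wise. A minimal numpy sketch of one merge step (nearest-neighbour upsampling stands in for `F.interpolate`; names and shapes are illustrative):

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x upsampling over the last two (spatial) axes
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

# toy lateral outputs; channels are assumed already matched by the 1x1 lateral convs
c4_lateral = np.ones((1, 8, 8))           # finer encoder level
p5_pre = np.full((1, 4, 4), 2.0)          # coarser level, already merged
p4_pre = c4_lateral + upsample2x(p5_pre)  # lateral + top-down sum
```

The real code additionally applies a 3x3 `P*_conv2` after each sum to smooth aliasing from the upsampling.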
+ # for ii in [c0_out, c1_out, c2_out, c3_out, c4_out, c5_out, c6_out]: + # print ("encoder shapes:", ii.shape) + # + # for ii in [p6_out, p5_out, p4_out, p3_out, p2_out, p1_out]: + # print("decoder shapes:", ii.shape) + + p2_out = self.P2_conv2(p2_pre_out) + p3_out = self.P3_conv2(p3_pre_out) + p4_out = self.P4_conv2(p4_pre_out) + p5_out = self.P5_conv2(p5_pre_out) + out_list = [p2_out, p3_out, p4_out, p5_out] + + if self.sixth_pooling: + p6_out = self.P6_conv2(p6_pre_out) + out_list.append(p6_out) + + if self.operate_stride1: + p1_pre_out = self.P1_conv1(c1_out) + self.P2_upsample(p2_pre_out) + p0_pre_out = self.P0_conv1(c0_out) + self.P1_upsample(p1_pre_out) + # p1_out = self.P1_conv2(p1_pre_out) # usually not needed. + p0_out = self.P0_conv2(p0_pre_out) + out_list = [p0_out] + out_list + + return out_list + + + def encoder_forward(self, x): + """ + :param x: input image of shape (b, c, y, x, (z)) + :return: list of output feature maps per pyramid level, each with shape (b, c, y, x, (z)). 
+ """ + if self.operate_stride1: + c0_out = self.C0(x) + else: + c0_out = x + + c1_out = self.C1(c0_out) + c2_out = self.C2(c1_out) + c3_out = self.C3(c2_out) + c4_out = self.C4(c3_out) + c5_out = self.C5(c4_out) + out_list = [c0_out, c1_out, c2_out, c3_out, c4_out, c5_out] + if self.sixth_pooling: + c6_out = self.C6(c5_out) + out_list += [c6_out] + + return out_list + + + def decoder_forward(self, encoder_list, inject=None): + + if inject is not None: + z = inject + + if self.dim == 2: + z = z.unsqueeze(-1).unsqueeze(-1).repeat( + 1, 1, encoder_list[-1].shape[-2], encoder_list[-1].shape[-1]) + else: + z = z.unsqueeze(-1).unsqueeze(-1).unsqueeze(-1).repeat( + 1, 1, encoder_list[-1].shape[-3], encoder_list[-1].shape[-2], encoder_list[-1].shape[-1]) + + x = torch.cat((encoder_list[-1], z), 1) + + else: + x = encoder_list[-1] + + if self.sixth_pooling: + p6_pre_out = self.P6_conv1(x) + p5_pre_out = self.P5_conv1(encoder_list[5]) + F.interpolate(p6_pre_out, scale_factor=2) + else: + p5_pre_out = self.P5_conv1(x) + + p4_pre_out = self.P4_conv1(encoder_list[4]) + F.interpolate(p5_pre_out, scale_factor=2) + p3_pre_out = self.P3_conv1(encoder_list[3]) + F.interpolate(p4_pre_out, scale_factor=2) + p2_pre_out = self.P2_conv1(encoder_list[2]) + F.interpolate(p3_pre_out, scale_factor=2) + + p2_out = self.P2_conv2(p2_pre_out) + p3_out = self.P3_conv2(p3_pre_out) + p4_out = self.P4_conv2(p4_pre_out) + p5_out = self.P5_conv2(p5_pre_out) + out_list = [p2_out, p3_out, p4_out, p5_out] + + if self.sixth_pooling: + p6_out = self.P6_conv2(p6_pre_out) + out_list.append(p6_out) + + if self.operate_stride1: + p1_pre_out = self.P1_conv1(c1_out) + self.P2_upsample(p2_pre_out) + p0_pre_out = self.P0_conv1(c0_out) + self.P1_upsample(p1_pre_out) + # p1_out = self.P1_conv2(p1_pre_out) # usually not needed. 
+ p0_out = self.P0_conv2(p0_pre_out) + out_list = [p0_out] + out_list + + return out_list + + + +class ResBlock(nn.Module): + + def __init__(self, start_filts, planes, conv, stride=1, downsample=None, norm=None, relu='relu'): + super(ResBlock, self).__init__() + self.conv1 = conv(start_filts, planes, ks=1, stride=stride, norm=norm, relu=relu) + self.conv2 = conv(planes, planes, ks=3, pad=1, norm=norm, relu=relu) + self.conv3 = conv(planes, planes * 4, ks=1, norm=norm, relu=None) + self.relu = nn.ReLU(inplace=True) if relu == 'relu' else nn.LeakyReLU(inplace=True) + if downsample is not None: + self.downsample = conv(downsample[0], downsample[0] * downsample[1], ks=1, stride=downsample[2], norm=norm, relu=None) + else: + self.downsample = None + self.stride = stride + + def forward(self, x): + residual = x + out = self.conv1(x) + out = self.conv2(out) + out = self.conv3(out) + if self.downsample: + residual = self.downsample(x) + out += residual + out = self.relu(out) + return out + + +class Interpolate(nn.Module): + def __init__(self, scale_factor, mode): + super(Interpolate, self).__init__() + self.interp = nn.functional.interpolate + self.scale_factor = scale_factor + self.mode = mode + + def forward(self, x): + x = self.interp(x, scale_factor=self.scale_factor, mode=self.mode, align_corners=False) + return x \ No newline at end of file diff --git a/models/detection_unet.py b/models/detection_unet.py new file mode 100644 index 0000000..0e58fdd --- /dev/null +++ b/models/detection_unet.py @@ -0,0 +1,214 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
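`decoder_forward` above can optionally concatenate a latent vector onto the deepest encoder map before the top-down pass; the `unsqueeze`/`repeat` tiling it uses is equivalent to this numpy broadcast (shapes are illustrative):

```python
import numpy as np

z = np.array([[1.0, 2.0]])       # (b, n_latent_dims)
c5 = np.zeros((1, 4, 3, 3))      # (b, c, y, x) deepest encoder output
# tile z over the spatial grid, then concatenate along the channel axis
z_map = np.broadcast_to(z[:, :, None, None], (1, 2, 3, 3))
x = np.concatenate([c5, z_map], axis=1)  # (b, c + n_latent_dims, y, x)
```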
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +""" +Unet-like Backbone architecture, with non-parametric heuristics for box detection on semantic segmentation outputs. +""" + +import torch +import torch.nn as nn +import torch.nn.functional as F +from scipy.ndimage.measurements import label as lb +import numpy as np +import utils.exp_utils as utils +import utils.model_utils as mutils + + +class net(nn.Module): + + def __init__(self, cf, logger): + + super(net, self).__init__() + self.cf = cf + self.logger = logger + backbone = utils.import_module('bbone', cf.backbone_path) + conv = mutils.NDConvGenerator(cf.dim) + + # set operate_stride1=True to generate a unet-like FPN.) + self.fpn = backbone.FPN(cf, conv, operate_stride1=True).cuda() + self.conv_final = conv(cf.end_filts, cf.num_seg_classes, ks=1, pad=0, norm=cf.norm, relu=None) + + if self.cf.weight_init is not None: + logger.info("using pytorch weight init of type {}".format(self.cf.weight_init)) + mutils.initialize_weights(self) + else: + logger.info("using default pytorch weight init") + + + def forward(self, x): + """ + forward pass of network. + :param x: input image. shape (b, c, y, x, (z)) + :return: seg_logits: shape (b, n_classes, y, x, (z)) + :return: out_box_coords: list over n_classes. elements are arrays(b, n_rois, (y1, x1, y2, x2, (z1), (z2))) + :return: out_max_scores: list over n_classes. 
elements are arrays(b, n_rois) + """ + + out_features = self.fpn(x)[0] + seg_logits = self.conv_final(out_features) + out_box_coords, out_max_scores = [], [] + smax = F.softmax(seg_logits, dim=1).detach().cpu().data.numpy() + + for cl in range(1, len(self.cf.class_dict.keys()) + 1): + max_scores = [[] for _ in range(x.shape[0])] + hard_mask = np.copy(smax).argmax(1) + hard_mask[hard_mask != cl] = 0 + hard_mask[hard_mask == cl] = 1 + # perform connected component analysis on argmaxed predictions, + # draw boxes around components and return coordinates. + box_coords, rois = get_coords(hard_mask, self.cf.n_roi_candidates, self.cf.dim) + + # for each object, choose the highest softmax score (in the respective class) + # of all pixels in the component as object score. + for bix, broi in enumerate(rois): + for nix, nroi in enumerate(broi): + component_score = np.max(smax[bix, cl][nroi > 0]) if self.cf.aggregation_operation == 'max' \ + else np.median(smax[bix, cl][nroi > 0]) + max_scores[bix].append(component_score) + out_box_coords.append(box_coords) + out_max_scores.append(max_scores) + return seg_logits, out_box_coords, out_max_scores + + + def train_forward(self, batch, **kwargs): + """ + train method (also used for validation monitoring). wrapper around forward pass of network. prepares input data + for processing, computes losses, and stores outputs in a dictionary. + :param batch: dictionary containing 'data', 'seg', etc. + :param kwargs: + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, n_classes] + 'monitor_values': dict of values to be monitored. 
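`forward` turns the softmax output into one binary mask per foreground class by argmaxing over the channel axis and thresholding on the class id; a toy numpy version of that step:

```python
import numpy as np

# (b, n_classes, y, x) softmax output: background + 2 foreground classes, one row of 2 pixels
smax = np.array([[[[0.1, 0.7]],
                  [[0.6, 0.2]],
                  [[0.3, 0.1]]]])
hard = smax.argmax(1)                      # (b, y, x) pixel-wise class ids
cl = 1
hard_mask = (hard == cl).astype(np.uint8)  # binary mask for class `cl`
```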
+ """ + img = batch['data'] + seg = batch['seg'] + var_img = torch.FloatTensor(img).cuda() + var_seg = torch.FloatTensor(seg).cuda().long() + var_seg_ohe = torch.FloatTensor(mutils.get_one_hot_encoding(seg, self.cf.num_seg_classes)).cuda() + results_dict = {} + seg_logits, box_coords, max_scores = self.forward(var_img) + + results_dict['boxes'] = [[] for _ in range(img.shape[0])] + for cix in range(len(self.cf.class_dict.keys())): + for bix in range(img.shape[0]): + for rix in range(len(max_scores[cix][bix])): + if max_scores[cix][bix][rix] > self.cf.detection_min_confidence: + results_dict['boxes'][bix].append({'box_coords': np.copy(box_coords[cix][bix][rix]), + 'box_score': max_scores[cix][bix][rix], + 'box_pred_class_id': cix + 1, # add 0 for background. + 'box_type': 'det'}) + + + for bix in range(img.shape[0]): + for tix in range(len(batch['bb_target'][bix])): + results_dict['boxes'][bix].append({'box_coords': batch['bb_target'][bix][tix], + 'box_label': batch['roi_labels'][bix][tix], + 'box_type': 'gt'}) + + # compute segmentation loss as either weighted cross entropy, dice loss, or the sum of both. + loss = torch.FloatTensor([0]).cuda() + if self.cf.seg_loss_mode == 'dice' or self.cf.seg_loss_mode == 'dice_wce': + loss += 1 - mutils.batch_dice(F.softmax(seg_logits, dim=1), var_seg_ohe, + false_positive_weight=float(self.cf.fp_dice_weight)) + + if self.cf.seg_loss_mode == 'wce' or self.cf.seg_loss_mode == 'dice_wce': + loss += F.cross_entropy(seg_logits, var_seg[:, 0], weight=torch.tensor(self.cf.wce_weights).float().cuda()) + + results_dict['seg_preds'] = np.argmax(F.softmax(seg_logits, 1).cpu().data.numpy(), 1)[:, np.newaxis] + results_dict['torch_loss'] = loss + results_dict['monitor_extra_values'] = {'loss': loss.item()} + results_dict['logger_string'] = "loss: {0:.2f}".format(loss.item()) + + + return results_dict + + + def test_forward(self, batch, **kwargs): + """ + test method. 
wrapper around forward pass of network without usage of any ground truth information. + prepares input data for processing and stores outputs in a dictionary. + :param batch: dictionary containing 'data' + :param kwargs: + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, n_classes] + """ + img = batch['data'] + var_img = torch.FloatTensor(img).cuda() + seg_logits, box_coords, max_scores = self.forward(var_img) + + results_dict = {} + results_dict['boxes'] = [[] for _ in range(img.shape[0])] + for cix in range(len(self.cf.class_dict.keys())): + for bix in range(img.shape[0]): + for rix in range(len(max_scores[cix][bix])): + if max_scores[cix][bix][rix] > self.cf.detection_min_confidence: + results_dict['boxes'][bix].append({'box_coords': np.copy(box_coords[cix][bix][rix]), + 'box_score': max_scores[cix][bix][rix], + 'box_pred_class_id': cix + 1, # add 0 for background. + 'box_type': 'det'}) + + results_dict['seg_preds'] = np.argmax(F.softmax(seg_logits, 1).cpu().data.numpy(), 1)[:, np.newaxis].astype('uint8') + return results_dict + + + +def get_coords(binary_mask, n_components, dim): + """ + loops over batch to perform connected component analysis on binary input mask. computes box coordinates around + the n_components largest components (rois). + :param binary_mask: (b, y, x, (z)). binary mask for one specific foreground class. + :param n_components: int. number of components to extract per batch element and class. + :return: coords (b, n, (y1, x1, y2, x2, (z1), (z2))) + :return: batch_components (b, n, (y1, x1, y2, x2, (z1), (z2))) + """ + binary_mask = binary_mask.astype('uint8') + batch_coords = [] + batch_components = [] + for ix, b in enumerate(binary_mask): + clusters, n_cands = lb(b) # performs connected component analysis.
+ uniques, counts = np.unique(clusters, return_counts=True) + # only keep n_components largest components. + keep_uniques = uniques[1:][np.argsort(counts[1:])[::-1]][:n_components] + # separate clusters and concat. + p_components = np.array([(clusters == ii) * 1 for ii in keep_uniques]) + p_coords = [] + if p_components.shape[0] > 0: + for roi in p_components: + mask_ixs = np.argwhere(roi != 0) + + # get coordinates around component. + roi_coords = [np.min(mask_ixs[:, 0]) - 1, np.min(mask_ixs[:, 1]) - 1, np.max(mask_ixs[:, 0]) + 1, + np.max(mask_ixs[:, 1]) + 1] + if dim == 3: + roi_coords += [np.min(mask_ixs[:, 2]), np.max(mask_ixs[:, 2])+1] + p_coords.append(roi_coords) + + p_coords = np.array(p_coords) + + # clip coords. + p_coords[p_coords < 0] = 0 + p_coords[:, :4][p_coords[:, :4] > binary_mask.shape[-2]] = binary_mask.shape[-2] + if dim == 3: + p_coords[:, 4:][p_coords[:, 4:] > binary_mask.shape[-1]] = binary_mask.shape[-1] + + batch_coords.append(p_coords) + batch_components.append(p_components) + return batch_coords, batch_components + diff --git a/models/mrcnn.py b/models/mrcnn.py new file mode 100644 index 0000000..24d15f7 --- /dev/null +++ b/models/mrcnn.py @@ -0,0 +1,1076 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
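`get_coords` above labels connected components, keeps the largest ones, and draws a box around each; a simplified 2D numpy version (without the 1-pixel margin and clipping of the original):

```python
import numpy as np
from scipy.ndimage import label

def boxes_from_mask(binary_mask, n_components=2):
    clusters, _ = label(binary_mask)  # connected component analysis
    uniques, counts = np.unique(clusters, return_counts=True)
    # skip background label 0, keep the largest components
    keep = uniques[1:][np.argsort(counts[1:])[::-1]][:n_components]
    boxes = []
    for cid in keep:
        ys, xs = np.where(clusters == cid)
        boxes.append([ys.min(), xs.min(), ys.max() + 1, xs.max() + 1])  # (y1, x1, y2, x2)
    return np.array(boxes)

mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:3, 1:4] = 1  # component of size 6
mask[5:7, 5:7] = 1  # component of size 4
```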
+# ============================================================================== + +""" +Parts are based on https://github.com/multimodallearning/pytorch-mask-rcnn +published under MIT license. +""" + +import utils.model_utils as mutils +import utils.exp_utils as utils +from cuda_functions.nms_2D.pth_nms import nms_gpu as nms_2D +from cuda_functions.nms_3D.pth_nms import nms_gpu as nms_3D +from cuda_functions.roi_align_2D.roi_align.crop_and_resize import CropAndResizeFunction as ra2D +from cuda_functions.roi_align_3D.roi_align.crop_and_resize import CropAndResizeFunction as ra3D + +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.utils + + +############################################################ +# Networks on top of backbone +############################################################ + +class RPN(nn.Module): + """ + Region Proposal Network. + """ + + def __init__(self, cf, conv): + + super(RPN, self).__init__() + self.dim = conv.dim + + self.conv_shared = conv(cf.end_filts, cf.n_rpn_features, ks=3, stride=cf.rpn_anchor_stride, pad=1, relu=cf.relu) + self.conv_class = conv(cf.n_rpn_features, 2 * len(cf.rpn_anchor_ratios), ks=1, stride=1, relu=None) + self.conv_bbox = conv(cf.n_rpn_features, 2 * self.dim * len(cf.rpn_anchor_ratios), ks=1, stride=1, relu=None) + + + def forward(self, x): + """ + :param x: input feature maps (b, in_channels, y, x, (z)) + :return: rpn_class_logits (b, n_anchors, 2) + :return: rpn_probs (b, n_anchors, 2) + :return: rpn_bbox (b, n_anchors, 2 * dim) + """ + + # Shared convolutional base of the RPN. + x = self.conv_shared(x) + + # Anchor Score. (batch, anchors per location * 2, y, x, (z)).
+ rpn_class_logits = self.conv_class(x) + # Reshape to (batch, 2, anchors) + axes = (0, 2, 3, 1) if self.dim == 2 else (0, 2, 3, 4, 1) + rpn_class_logits = rpn_class_logits.permute(*axes) + rpn_class_logits = rpn_class_logits.contiguous() + rpn_class_logits = rpn_class_logits.view(x.size()[0], -1, 2) + + # Softmax on last dimension (fg vs. bg). + rpn_probs = F.softmax(rpn_class_logits, dim=2) + + # Bounding box refinement. (batch, anchors_per_location * (y, x, (z), log(h), log(w), (log(d)), y, x, (z)) + rpn_bbox = self.conv_bbox(x) + + # Reshape to (batch, 2*dim, anchors) + rpn_bbox = rpn_bbox.permute(*axes) + rpn_bbox = rpn_bbox.contiguous() + rpn_bbox = rpn_bbox.view(x.size()[0], -1, self.dim * 2) + + return [rpn_class_logits, rpn_probs, rpn_bbox] + + + +class Classifier(nn.Module): + """ + Head network for classification and bounding box refinement. Performs RoiAlign, processes resulting features through a + shared convolutional base and finally branches off the classifier- and regression head. + """ + def __init__(self, cf, conv): + super(Classifier, self).__init__() + + self.dim = conv.dim + self.in_channels = cf.end_filts + self.pool_size = cf.pool_size + self.pyramid_levels = cf.pyramid_levels + # instance_norm does not work with spatial dims (1, 1, (1)) + norm = cf.norm if cf.norm != 'instance_norm' else None + + self.conv1 = conv(cf.end_filts, cf.end_filts * 4, ks=self.pool_size, stride=1, norm=norm, relu=cf.relu) + self.conv2 = conv(cf.end_filts * 4, cf.end_filts * 4, ks=1, stride=1, norm=norm, relu=cf.relu) + self.linear_class = nn.Linear(cf.end_filts * 4, cf.head_classes) + self.linear_bbox = nn.Linear(cf.end_filts * 4, cf.head_classes * 2 * self.dim) + + def forward(self, x, rois): + """ + :param x: input feature maps (b, in_channels, y, x, (z)) + :param rois: normalized box coordinates as proposed by the RPN to be forwarded through + the second stage (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ix). 
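The permute-and-flatten step above (shared by the class and bbox branches) maps a (b, 2*A, y, x) conv output to (b, n_anchors, 2); in numpy:

```python
import numpy as np

b, n_ratios, y, x = 1, 3, 2, 2
logits = np.arange(b * 2 * n_ratios * y * x, dtype=float).reshape(b, 2 * n_ratios, y, x)
# (b, 2*A, y, x) -> (b, y, x, 2*A) -> (b, A*y*x, 2)
flat = logits.transpose(0, 2, 3, 1).reshape(b, -1, 2)
```

Each row of `flat` holds the (bg, fg) pair for one anchor at one spatial location, which is why the softmax can then be taken over the last dimension.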
Proposals of all batch elements + have been merged to one vector, while the origin info has been stored for re-allocation. + :return: mrcnn_class_logits (n_proposals, n_head_classes) + :return: mrcnn_bbox (n_proposals, n_head_classes, 2 * dim) predicted corrections to be applied to proposals for refinement. + """ + x = pyramid_roi_align(x, rois, self.pool_size, self.pyramid_levels, self.dim) + x = self.conv1(x) + x = self.conv2(x) + x = x.view(-1, self.in_channels * 4) + mrcnn_class_logits = self.linear_class(x) + mrcnn_bbox = self.linear_bbox(x) + mrcnn_bbox = mrcnn_bbox.view(mrcnn_bbox.size()[0], -1, self.dim * 2) + + return [mrcnn_class_logits, mrcnn_bbox] + + + +class Mask(nn.Module): + """ + Head network for proposal-based mask segmentation. Performs RoiAlign, some convolutions and applies sigmoid on the + output logits to allow for overlapping classes. + """ + def __init__(self, cf, conv): + super(Mask, self).__init__() + self.pool_size = cf.mask_pool_size + self.pyramid_levels = cf.pyramid_levels + self.dim = conv.dim + self.conv1 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + self.conv2 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + self.conv3 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + self.conv4 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + if conv.dim == 2: + self.deconv = nn.ConvTranspose2d(cf.end_filts, cf.end_filts, kernel_size=2, stride=2) + else: + self.deconv = nn.ConvTranspose3d(cf.end_filts, cf.end_filts, kernel_size=2, stride=2) + + self.relu = nn.ReLU(inplace=True) if cf.relu == 'relu' else nn.LeakyReLU(inplace=True) + self.conv5 = conv(cf.end_filts, cf.head_classes, ks=1, stride=1, relu=None) + self.sigmoid = nn.Sigmoid() + + def forward(self, x, rois): + """ + :param x: input feature maps (b, in_channels, y, x, (z)) + :param rois: normalized box coordinates as proposed by the 
RPN to be forwarded through + the second stage (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ix). Proposals of all batch elements + have been merged to one vector, while the origin info has been stored for re-allocation. + :return: x: masks (n_sampled_proposals (n_detections in inference), n_classes, y, x, (z)) + """ + x = pyramid_roi_align(x, rois, self.pool_size, self.pyramid_levels, self.dim) + x = self.conv1(x) + x = self.conv2(x) + x = self.conv3(x) + x = self.conv4(x) + x = self.relu(self.deconv(x)) + x = self.conv5(x) + x = self.sigmoid(x) + return x + + +############################################################ +# Loss Functions +############################################################ + +def compute_rpn_class_loss(rpn_match, rpn_class_logits, shem_poolsize): + """ + :param rpn_match: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors. + :param rpn_class_logits: (n_anchors, 2). logits from RPN classifier. + :param shem_poolsize: int. factor of top-k candidates to draw from per negative sample + (stochastic-hard-example-mining). + :return: loss: torch tensor + :return: np_neg_ix: 1D array containing indices of the neg_roi_logits, which have been sampled for training. + """ + + # filter out neutral anchors. + pos_indices = torch.nonzero(rpn_match == 1) + neg_indices = torch.nonzero(rpn_match == -1) + + # loss for positive samples + if 0 not in pos_indices.size(): + pos_indices = pos_indices.squeeze(1) + roi_logits_pos = rpn_class_logits[pos_indices] + pos_loss = F.cross_entropy(roi_logits_pos, torch.LongTensor([1] * pos_indices.shape[0]).cuda()) + else: + pos_loss = torch.FloatTensor([0]).cuda() + + # loss for negative samples: draw hard negative examples (SHEM) + # that match the number of positive samples, but at least 1. 
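The stochastic hard example mining referenced above (`mutils.shem`) can be sketched as: rank negatives by predicted foreground probability, keep the hardest `negative_count * shem_poolsize`, then sample `negative_count` of those at random. A numpy sketch (the real version operates on torch tensors; the seeded RNG is an assumption for reproducibility):

```python
import numpy as np

def shem(neg_fg_probs, n_samples, poolsize, seed=0):
    rng = np.random.default_rng(seed)
    # hardest negatives = highest predicted foreground probability
    pool = np.argsort(neg_fg_probs)[::-1][:n_samples * poolsize]
    return rng.choice(pool, size=min(n_samples, pool.size), replace=False)

probs = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.05])
picked = shem(probs, n_samples=2, poolsize=2)
```

Sampling within the hard pool, rather than always taking the very hardest negatives, makes the mining robust to label noise.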
+ if 0 not in neg_indices.size(): + neg_indices = neg_indices.squeeze(1) + roi_logits_neg = rpn_class_logits[neg_indices] + negative_count = np.max((1, pos_indices.cpu().data.numpy().size)) + roi_probs_neg = F.softmax(roi_logits_neg, dim=1) + neg_ix = mutils.shem(roi_probs_neg, negative_count, shem_poolsize) + neg_loss = F.cross_entropy(roi_logits_neg[neg_ix], torch.LongTensor([0] * neg_ix.shape[0]).cuda()) + np_neg_ix = neg_ix.cpu().data.numpy() + else: + neg_loss = torch.FloatTensor([0]).cuda() + np_neg_ix = np.array([]).astype('int32') + + loss = (pos_loss + neg_loss) / 2 + return loss, np_neg_ix + + +def compute_rpn_bbox_loss(rpn_target_deltas, rpn_pred_deltas, rpn_match): + """ + :param rpn_target_deltas: (b, n_positive_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd)))). + Uses 0 padding to fill in unused bbox deltas. + :param rpn_pred_deltas: predicted deltas from RPN. (b, n_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd)))) + :param rpn_match: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors. + :return: loss: torch 1D tensor. + """ + if 0 not in torch.nonzero(rpn_match == 1).size(): + + indices = torch.nonzero(rpn_match == 1).squeeze(1) + # Pick bbox deltas that contribute to the loss + rpn_pred_deltas = rpn_pred_deltas[indices] + # Trim target bounding box deltas to the same length as rpn_bbox. + target_deltas = rpn_target_deltas[:rpn_pred_deltas.size()[0], :] + # Smooth L1 loss + loss = F.smooth_l1_loss(rpn_pred_deltas, target_deltas) + else: + loss = torch.FloatTensor([0]).cuda() + + return loss + + +def compute_mrcnn_class_loss(target_class_ids, pred_class_logits): + """ + :param target_class_ids: (n_sampled_rois) batch dimension was merged into roi dimension. + :param pred_class_logits: (n_sampled_rois, n_classes) + :return: loss: torch 1D tensor.
+ """ + if 0 not in target_class_ids.size(): + loss = F.cross_entropy(pred_class_logits, target_class_ids.long()) + else: + loss = torch.FloatTensor([0.]).cuda() + + return loss + + +def compute_mrcnn_bbox_loss(mrcnn_target_deltas, mrcnn_pred_deltas, target_class_ids): + """ + :param mrcnn_target_deltas: (n_sampled_rois, (dy, dx, (dz), log(dh), log(dw), (log(dh))) + :param mrcnn_pred_deltas: (n_sampled_rois, n_classes, (dy, dx, (dz), log(dh), log(dw), (log(dh))) + :param target_class_ids: (n_sampled_rois) + :return: loss: torch 1D tensor. + """ + if 0 not in torch.nonzero(target_class_ids > 0).size(): + positive_roi_ix = torch.nonzero(target_class_ids > 0)[:, 0] + positive_roi_class_ids = target_class_ids[positive_roi_ix].long() + target_bbox = mrcnn_target_deltas[positive_roi_ix, :].detach() + pred_bbox = mrcnn_pred_deltas[positive_roi_ix, positive_roi_class_ids, :] + loss = F.smooth_l1_loss(pred_bbox, target_bbox) + else: + loss = torch.FloatTensor([0]).cuda() + + return loss + + +def compute_mrcnn_mask_loss(target_masks, pred_masks, target_class_ids): + """ + :param target_masks: (n_sampled_rois, y, x, (z)) A float32 tensor of values 0 or 1. Uses zero padding to fill array. + :param pred_masks: (n_sampled_rois, n_classes, y, x, (z)) float32 tensor with values between [0, 1]. + :param target_class_ids: (n_sampled_rois) + :return: loss: torch 1D tensor. + """ + if 0 not in torch.nonzero(target_class_ids > 0).size(): + # Only positive ROIs contribute to the loss. And only + # the class specific mask of each ROI. 
+ positive_ix = torch.nonzero(target_class_ids > 0)[:, 0] + positive_class_ids = target_class_ids[positive_ix].long() + y_true = target_masks[positive_ix, :, :].detach() + y_pred = pred_masks[positive_ix, positive_class_ids, :, :] + loss = F.binary_cross_entropy(y_pred, y_true) + else: + loss = torch.FloatTensor([0]).cuda() + + return loss + + +############################################################ +# Helper Layers +############################################################ + +def proposal_layer(rpn_pred_probs, rpn_pred_deltas, proposal_count, anchors, cf): + """ + Receives anchor scores and selects a subset to pass as proposals + to the second stage. Filtering is done based on anchor scores and + non-max suppression to remove overlaps. It also applies bounding + box refinement deltas to anchors. + :param rpn_pred_probs: (b, n_anchors, 2) + :param rpn_pred_deltas: (b, n_anchors, (y, x, (z), log(h), log(w), (log(d)))) + :return: batch_normalized_boxes: Proposals in normalized coordinates + (b, proposal_count, (y1, x1, y2, x2, (z1), (z2))) + :return: batch_out_proposals: Box coords + RPN foreground scores + for monitoring/plotting (b, proposal_count, (y1, x1, y2, x2, (z1), (z2), score)) + """ + batch_scores = rpn_pred_probs[:, :, 1] + batch_deltas = rpn_pred_deltas + batch_anchors = anchors + batch_normalized_boxes = [] + batch_out_proposals = [] + + # loop over batch dimension. + for ix in range(batch_scores.shape[0]): + + scores = batch_scores[ix] + deltas = batch_deltas[ix] + anchors = batch_anchors.clone() + # norm deltas + std_dev = torch.from_numpy(cf.rpn_bbox_std_dev[None]).float().cuda() + deltas = deltas * std_dev + + # improve performance by trimming to top anchors by score + # and doing the rest on the smaller subset.
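`proposal_layer` refines each anchor with the predicted deltas before clipping and NMS; a 2D numpy sketch of that decoding (the shift-and-scale convention mirrors what `mutils.apply_box_deltas_2D` does, as far as the docstrings indicate):

```python
import numpy as np

def apply_box_deltas_2d(boxes, deltas):
    # boxes: (n, (y1, x1, y2, x2)); deltas: (n, (dy, dx, log(dh), log(dw)))
    h = boxes[:, 2] - boxes[:, 0]
    w = boxes[:, 3] - boxes[:, 1]
    cy = boxes[:, 0] + 0.5 * h + deltas[:, 0] * h   # shift center
    cx = boxes[:, 1] + 0.5 * w + deltas[:, 1] * w
    h = h * np.exp(deltas[:, 2])                    # scale extents
    w = w * np.exp(deltas[:, 3])
    return np.stack([cy - 0.5 * h, cx - 0.5 * w, cy + 0.5 * h, cx + 0.5 * w], axis=1)

box = np.array([[0.0, 0.0, 10.0, 10.0]])
```

Zero deltas leave a box unchanged; a log(2) height delta doubles its height around the same center.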
+ pre_nms_limit = min(cf.pre_nms_limit, anchors.size()[0])
+ scores, order = scores.sort(descending=True)
+ order = order[:pre_nms_limit]
+ scores = scores[:pre_nms_limit]
+ deltas = deltas[order, :]
+ anchors = anchors[order, :]
+
+ # apply deltas to anchors to get refined anchors and filter with non-maximum suppression.
+ if batch_deltas.shape[-1] == 4:
+ boxes = mutils.apply_box_deltas_2D(anchors, deltas)
+ boxes = mutils.clip_boxes_2D(boxes, cf.window)
+ keep = nms_2D(torch.cat((boxes, scores.unsqueeze(1)), 1), cf.rpn_nms_threshold)
+ norm = torch.from_numpy(cf.scale).float().cuda()
+
+ else:
+ boxes = mutils.apply_box_deltas_3D(anchors, deltas)
+ boxes = mutils.clip_boxes_3D(boxes, cf.window)
+ keep = nms_3D(torch.cat((boxes, scores.unsqueeze(1)), 1), cf.rpn_nms_threshold)
+ norm = torch.from_numpy(cf.scale).float().cuda()
+
+ keep = keep[:proposal_count]
+ boxes = boxes[keep, :]
+ rpn_scores = scores[keep][:, None]
+
+ # pad missing boxes with 0.
+ if boxes.shape[0] < proposal_count:
+ n_pad_boxes = proposal_count - boxes.shape[0]
+ zeros = torch.zeros([n_pad_boxes, boxes.shape[1]]).cuda()
+ boxes = torch.cat([boxes, zeros], dim=0)
+ zeros = torch.zeros([n_pad_boxes, rpn_scores.shape[1]]).cuda()
+ rpn_scores = torch.cat([rpn_scores, zeros], dim=0)
+
+ # concat box and score info for monitoring/plotting.
+ batch_out_proposals.append(torch.cat((boxes, rpn_scores), 1).cpu().data.numpy())
+ # normalize dimensions to range of 0 to 1.
+ normalized_boxes = boxes / norm
+ # add back batch dimension
+ batch_normalized_boxes.append(normalized_boxes.unsqueeze(0))
+
+ batch_normalized_boxes = torch.cat(batch_normalized_boxes)
+ batch_out_proposals = np.array(batch_out_proposals)
+ return batch_normalized_boxes, batch_out_proposals
+
+
+
+def pyramid_roi_align(feature_maps, rois, pool_size, pyramid_levels, dim):
+ """
+ Implements ROI Pooling on multiple levels of the feature pyramid. 
+ :param feature_maps: list of feature maps, each of shape (b, c, y, x, (z))
+ :param rois: proposals (normalized coords.) as returned by RPN. contain info about original batch element allocation.
+ (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ixs)
+ :param pool_size: list of poolsizes in dims: [x, y, (z)]
+ :param pyramid_levels: list. [0, 1, 2, ...]
+ :return: pooled: pooled feature map rois (n_proposals, c, poolsize_y, poolsize_x, (poolsize_z))
+
+ Output:
+ Pooled regions in the shape: [num_boxes, height, width, channels].
+ The width and height are those specified in pool_size.
+ """
+ boxes = rois[:, :dim*2]
+ batch_ixs = rois[:, dim*2]
+
+ # Assign each ROI to a level in the pyramid based on the ROI area.
+ if dim == 2:
+ y1, x1, y2, x2 = boxes.chunk(4, dim=1)
+ else:
+ y1, x1, y2, x2, z1, z2 = boxes.chunk(6, dim=1)
+
+ h = y2 - y1
+ w = x2 - x1
+
+ # Equation 1 in https://arxiv.org/abs/1612.03144. Account for
+ # the fact that our coordinates are normalized here.
+ # divide sqrt(h*w) by 1 instead of image_area.
+ roi_level = (4 + mutils.log2(torch.sqrt(h*w))).round().int().clamp(pyramid_levels[0], pyramid_levels[-1])
+ # if Pyramid contains additional level P6, adapt the roi_level assignment accordingly.
+ if len(pyramid_levels) == 5:
+ roi_level[h*w > 0.65] = 5
+
+ # Loop through levels and apply ROI pooling to each.
+ pooled = []
+ box_to_level = []
+ for level_ix, level in enumerate(pyramid_levels):
+ ix = roi_level == level
+ if not ix.any():
+ continue
+ ix = torch.nonzero(ix)[:, 0]
+ level_boxes = boxes[ix, :]
+ # re-assign rois to feature map of original batch element.
+ ind = batch_ixs[ix].int()
+
+ # Keep track of which box is mapped to which level
+ box_to_level.append(ix)
+
+ # Stop gradient propagation to ROI proposals
+ level_boxes = level_boxes.detach()
+
+ # Crop and Resize
+ # From Mask R-CNN paper: "We sample four regular locations, so
+ # that we can evaluate either max or average pooling. 
In fact,
+ # interpolating only a single value at each bin center (without
+ # pooling) is nearly as effective."
+ #
+ # Here we use the simplified approach of a single value per bin,
+ # which is how it is done in tf.crop_and_resize()
+ #
+ # Also fixed a bug from original implementation, reported in:
+ # https://hackernoon.com/how-tensorflows-tf-image-resize-stole-60-days-of-my-life-aba5eb093f35
+
+ if len(pool_size) == 2:
+ pooled_features = ra2D(pool_size[0], pool_size[1], 0)(feature_maps[level_ix], level_boxes, ind)
+ else:
+ pooled_features = ra3D(pool_size[0], pool_size[1], pool_size[2], 0)(feature_maps[level_ix], level_boxes, ind)
+
+ pooled.append(pooled_features)
+
+
+ # Pack pooled features into one tensor
+ pooled = torch.cat(pooled, dim=0)
+
+ # Pack box_to_level mapping into one array and add another
+ # column representing the order of pooled boxes
+ box_to_level = torch.cat(box_to_level, dim=0)
+
+ # Rearrange pooled features to match the order of the original boxes
+ _, box_to_level = torch.sort(box_to_level)
+ pooled = pooled[box_to_level, :, :]
+
+ return pooled
+
+
+
+def detection_target_layer(batch_proposals, batch_mrcnn_class_scores, batch_gt_class_ids, batch_gt_boxes, batch_gt_masks, cf):
+ """
+ Subsamples proposals for mrcnn losses and generates targets. Sampling is done per batch element, seems to have positive
+ effects on training, as opposed to sampling over entire batch. Negatives are sampled via stochastic-hard-example-mining
+ (SHEM), where a number of negative proposals are drawn from a larger pool of highest scoring proposals for stochasticity.
+ Scoring is obtained here as the max over all foreground probabilities as returned by mrcnn_classifier (worked better than
+ loss-based class balancing methods like "online-hard-example-mining" or "focal loss".)
+
+ :param batch_proposals: (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ixs).
+ boxes as proposed by RPN. n_proposals here is determined by batch_size * POST_NMS_ROIS. 
+ :param batch_mrcnn_class_scores: (n_proposals, n_classes)
+ :param batch_gt_class_ids: list over batch elements. Each element is a list over the corresponding roi target labels.
+ :param batch_gt_boxes: list over batch elements. Each element is a list over the corresponding roi target coordinates.
+ :param batch_gt_masks: list over batch elements. Each element is a binary mask of shape (n_gt_rois, y, x, (z), c)
+ :return: sample_indices: (n_sampled_rois) indices of sampled proposals to be used for loss functions.
+ :return: target_class_ids: (n_sampled_rois) containing target class labels of sampled proposals.
+ :return: target_deltas: (n_sampled_rois, 2 * dim) containing target deltas of sampled proposals for box refinement.
+ :return: target_masks: (n_sampled_rois, y, x, (z)) containing target masks of sampled proposals.
+ """
+ # normalization of target coordinates
+ if cf.dim == 2:
+ h, w = cf.patch_size
+ scale = torch.from_numpy(np.array([h, w, h, w])).float().cuda()
+ else:
+ h, w, z = cf.patch_size
+ scale = torch.from_numpy(np.array([h, w, h, w, z, z])).float().cuda()
+
+
+ positive_count = 0
+ negative_count = 0
+ sample_positive_indices = []
+ sample_negative_indices = []
+ sample_deltas = []
+ sample_masks = []
+ sample_class_ids = []
+
+ # loop over batch and get positive and negative sample rois.
+ for b in range(len(batch_gt_class_ids)):
+
+ gt_class_ids = torch.from_numpy(batch_gt_class_ids[b]).int().cuda()
+ gt_masks = torch.from_numpy(batch_gt_masks[b]).float().cuda()
+ if np.any(batch_gt_class_ids[b] > 0): # skip roi selection for no gt images.
+ gt_boxes = torch.from_numpy(batch_gt_boxes[b]).float().cuda() / scale
+ else:
+ gt_boxes = torch.FloatTensor().cuda()
+
+ # get proposals and indices of current batch element. 
+ proposals = batch_proposals[batch_proposals[:, -1] == b][:, :-1]
+ batch_element_indices = torch.nonzero(batch_proposals[:, -1] == b).squeeze(1)
+
+ # Compute overlaps matrix [proposals, gt_boxes]
+ if 0 not in gt_boxes.size():
+ if gt_boxes.shape[1] == 4:
+ overlaps = mutils.bbox_overlaps_2D(proposals, gt_boxes)
+ else:
+ overlaps = mutils.bbox_overlaps_3D(proposals, gt_boxes)
+
+ # Determine positive and negative ROIs
+ roi_iou_max = torch.max(overlaps, dim=1)[0]
+ # 1. Positive ROIs are those with an IoU >= 0.5 (2D) / 0.3 (3D) with a GT box.
+ positive_roi_bool = roi_iou_max >= (0.5 if cf.dim == 2 else 0.3)
+ # 2. Negative ROIs are those with an IoU < 0.1 (2D) / 0.01 (3D) with every GT box.
+ negative_roi_bool = roi_iou_max < (0.1 if cf.dim == 2 else 0.01)
+ else:
+ positive_roi_bool = torch.FloatTensor().cuda()
+ negative_roi_bool = torch.from_numpy(np.array([1]*proposals.shape[0])).cuda()
+
+ # Sample Positive ROIs
+ if 0 not in torch.nonzero(positive_roi_bool).size():
+ positive_indices = torch.nonzero(positive_roi_bool).squeeze(1)
+ positive_samples = int(cf.train_rois_per_image * cf.roi_positive_ratio)
+ rand_idx = torch.randperm(positive_indices.size()[0])
+ rand_idx = rand_idx[:positive_samples].cuda()
+ positive_indices = positive_indices[rand_idx]
+ positive_samples = positive_indices.size()[0]
+ positive_rois = proposals[positive_indices, :]
+ # Assign positive ROIs to GT boxes. 
+ positive_overlaps = overlaps[positive_indices, :] + roi_gt_box_assignment = torch.max(positive_overlaps, dim=1)[1] + roi_gt_boxes = gt_boxes[roi_gt_box_assignment, :] + roi_gt_class_ids = gt_class_ids[roi_gt_box_assignment] + + # Compute bbox refinement targets for positive ROIs + deltas = mutils.box_refinement(positive_rois, roi_gt_boxes) + std_dev = torch.from_numpy(cf.bbox_std_dev).float().cuda() + deltas /= std_dev + + # Assign positive ROIs to GT masks + roi_masks = gt_masks[roi_gt_box_assignment, :, :] + + # Compute mask targets + boxes = positive_rois + box_ids = torch.arange(roi_masks.size()[0]).int().cuda() + + if len(cf.mask_shape) == 2: + masks = ra2D(cf.mask_shape[0], cf.mask_shape[1], 0)(roi_masks.unsqueeze(1), boxes, box_ids) + else: + masks = ra3D(cf.mask_shape[0], cf.mask_shape[1], cf.mask_shape[2], 0)(roi_masks.unsqueeze(1), boxes, box_ids) + + masks = masks.squeeze(1) + # Threshold mask pixels at 0.5 to have GT masks be 0 or 1 to use with + # binary cross entropy loss. + masks = torch.round(masks) + + sample_positive_indices.append(batch_element_indices[positive_indices]) + sample_deltas.append(deltas) + sample_masks.append(masks) + sample_class_ids.append(roi_gt_class_ids) + positive_count += positive_samples + else: + positive_samples = 0 + + # Negative ROIs. Add enough to maintain positive:negative ratio, but at least 1. Sample via SHEM. 
+ if 0 not in torch.nonzero(negative_roi_bool).size(): + negative_indices = torch.nonzero(negative_roi_bool).squeeze(1) + r = 1.0 / cf.roi_positive_ratio + b_neg_count = np.max((int(r * positive_samples - positive_samples), 1)) + roi_probs_neg = batch_mrcnn_class_scores[batch_element_indices[negative_indices]] + raw_sampled_indices = mutils.shem(roi_probs_neg, b_neg_count, cf.shem_poolsize) + sample_negative_indices.append(batch_element_indices[negative_indices[raw_sampled_indices]]) + negative_count += raw_sampled_indices.size()[0] + + if len(sample_positive_indices) > 0: + target_deltas = torch.cat(sample_deltas) + target_masks = torch.cat(sample_masks) + target_class_ids = torch.cat(sample_class_ids) + + # Pad target information with zeros for negative ROIs. + if positive_count > 0 and negative_count > 0: + sample_indices = torch.cat((torch.cat(sample_positive_indices), torch.cat(sample_negative_indices)), dim=0) + zeros = torch.zeros(negative_count).int().cuda() + target_class_ids = torch.cat([target_class_ids, zeros], dim=0) + zeros = torch.zeros(negative_count, cf.dim * 2).cuda() + target_deltas = torch.cat([target_deltas, zeros], dim=0) + zeros = torch.zeros(negative_count, *cf.mask_shape).cuda() + target_masks = torch.cat([target_masks, zeros], dim=0) + elif positive_count > 0: + sample_indices = torch.cat(sample_positive_indices) + elif negative_count > 0: + sample_indices = torch.cat(sample_negative_indices) + zeros = torch.zeros(negative_count).int().cuda() + target_class_ids = zeros + zeros = torch.zeros(negative_count, cf.dim * 2).cuda() + target_deltas = zeros + zeros = torch.zeros(negative_count, *cf.mask_shape).cuda() + target_masks = zeros + else: + sample_indices = torch.LongTensor().cuda() + target_class_ids = torch.IntTensor().cuda() + target_deltas = torch.FloatTensor().cuda() + target_masks = torch.FloatTensor().cuda() + + return sample_indices, target_class_ids, target_deltas, target_masks + + 
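The SHEM negative-sampling scheme used by `detection_target_layer` above can be sketched in isolation. `shem_sketch` is a hypothetical, numpy-only stand-in for `mutils.shem` (whose exact signature is not shown in this diff); the idea is to draw the requested number of negatives at random from a pool of the highest-scoring candidates, trading pure hardness for some stochasticity:

```python
import numpy as np

def shem_sketch(neg_scores, n_samples, poolsize, rng=None):
    """Stochastic hard example mining (illustrative stand-in for mutils.shem).

    neg_scores: 1D array of foreground scores for negative candidates.
    Draws n_samples indices uniformly from the pool of the
    (n_samples * poolsize) highest-scoring candidates.
    """
    rng = np.random.default_rng() if rng is None else rng
    # hardest (highest-scoring) candidates first
    pool = np.argsort(neg_scores)[::-1][:n_samples * poolsize]
    # random draw from the hard pool provides the stochasticity
    pick = rng.choice(pool.shape[0], size=min(n_samples, pool.shape[0]), replace=False)
    return pool[pick]
```

With `poolsize=1` this degenerates to plain hard example mining (always the top-n); larger pools make the sampling progressively more random.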
+############################################################
+# Output Handler
+############################################################
+
+def refine_detections(rois, probs, deltas, batch_ixs, cf):
+ """
+ Refine classified proposals, filter overlaps and return final detections.
+
+ :param rois: (n_proposals, 2 * dim) normalized boxes as proposed by RPN. n_proposals = batch_size * POST_NMS_ROIS
+ :param probs: (n_proposals, n_classes) softmax probabilities for all rois as predicted by mrcnn classifier.
+ :param deltas: (n_proposals, n_classes, 2 * dim) box refinement deltas as predicted by mrcnn bbox regressor.
+ :param batch_ixs: (n_proposals) batch element assignment info for re-allocation.
+ :return: result: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score))
+ """
+ # class IDs per ROI. Since scores of all classes are of interest (not just max class), all are kept at this point.
+ class_ids = []
+ fg_classes = cf.head_classes - 1
+ # repeat vectors to fill in predictions for all foreground classes.
+ for ii in range(1, fg_classes + 1):
+ class_ids += [ii] * rois.shape[0]
+ class_ids = torch.from_numpy(np.array(class_ids)).cuda()
+
+ rois = rois.repeat(fg_classes, 1)
+ probs = probs.repeat(fg_classes, 1)
+ deltas = deltas.repeat(fg_classes, 1, 1)
+ batch_ixs = batch_ixs.repeat(fg_classes)
+
+ # get class-specific scores and bounding box deltas
+ idx = torch.arange(class_ids.size()[0]).long().cuda()
+ class_scores = probs[idx, class_ids]
+ deltas_specific = deltas[idx, class_ids]
+ batch_ixs = batch_ixs[idx]
+
+ # apply bounding box deltas. re-scale to image coordinates. 
+ std_dev = torch.from_numpy(np.reshape(cf.rpn_bbox_std_dev, [1, cf.dim * 2])).float().cuda()
+ scale = torch.from_numpy(cf.scale).float().cuda()
+ refined_rois = mutils.apply_box_deltas_2D(rois, deltas_specific * std_dev) * scale if cf.dim == 2 else \
+ mutils.apply_box_deltas_3D(rois, deltas_specific * std_dev) * scale
+
+ # round and cast to int since we're dealing with pixels now
+ refined_rois = mutils.clip_to_window(cf.window, refined_rois)
+ refined_rois = torch.round(refined_rois)
+
+ # filter out low confidence boxes
+ keep = idx
+ keep_bool = (class_scores >= cf.model_min_confidence)
+ if 0 not in torch.nonzero(keep_bool).size():
+
+ score_keep = torch.nonzero(keep_bool)[:, 0]
+ pre_nms_class_ids = class_ids[score_keep]
+ pre_nms_rois = refined_rois[score_keep]
+ pre_nms_scores = class_scores[score_keep]
+ pre_nms_batch_ixs = batch_ixs[score_keep]
+
+ for j, b in enumerate(mutils.unique1d(pre_nms_batch_ixs)):
+
+ bixs = torch.nonzero(pre_nms_batch_ixs == b)[:, 0]
+ bix_class_ids = pre_nms_class_ids[bixs]
+ bix_rois = pre_nms_rois[bixs]
+ bix_scores = pre_nms_scores[bixs]
+
+ for i, class_id in enumerate(mutils.unique1d(bix_class_ids)):
+
+ ixs = torch.nonzero(bix_class_ids == class_id)[:, 0]
+ # nms expects boxes sorted by score.
+ ix_rois = bix_rois[ixs]
+ ix_scores = bix_scores[ixs]
+ ix_scores, order = ix_scores.sort(descending=True)
+ ix_rois = ix_rois[order, :]
+
+ if cf.dim == 2:
+ class_keep = nms_2D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold)
+ else:
+ class_keep = nms_3D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold)
+
+ # map indices back. 
+ class_keep = keep[score_keep[bixs[ixs[order[class_keep]]]]] + # merge indices over classes for current batch element + b_keep = class_keep if i == 0 else mutils.unique1d(torch.cat((b_keep, class_keep))) + + # only keep top-k boxes of current batch-element + top_ids = class_scores[b_keep].sort(descending=True)[1][:cf.model_max_instances_per_batch_element] + b_keep = b_keep[top_ids] + + # merge indices over batch elements. + batch_keep = b_keep if j == 0 else mutils.unique1d(torch.cat((batch_keep, b_keep))) + + keep = batch_keep + + else: + keep = torch.tensor([0]).long().cuda() + + # arrange output + result = torch.cat((refined_rois[keep], + batch_ixs[keep].unsqueeze(1), + class_ids[keep].unsqueeze(1).float(), + class_scores[keep].unsqueeze(1)), dim=1) + + return result + + +def get_results(cf, img_shape, detections, detection_masks, box_results_list=None, return_masks=True): + """ + Restores batch dimension of merged detections, unmolds detections, creates and fills results dict. + :param img_shape: + :param detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score) + :param detection_masks: (n_final_detections, n_classes, y, x, (z)) raw molded masks as returned by mask-head. + :param box_results_list: None or list of output boxes for monitoring/plotting. + each element is a list of boxes per batch element. + :param return_masks: boolean. If True, full resolution masks are returned for all proposals (speed trade-off). + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, 1] only fg. vs. bg for now. + class-specific return of masks will come with implementation of instance segmentation evaluation. 
+ """ + detections = detections.cpu().data.numpy() + if cf.dim == 2: + detection_masks = detection_masks.permute(0, 2, 3, 1).cpu().data.numpy() + else: + detection_masks = detection_masks.permute(0, 2, 3, 4, 1).cpu().data.numpy() + + # restore batch dimension of merged detections using the batch_ix info. + batch_ixs = detections[:, cf.dim*2] + detections = [detections[batch_ixs == ix] for ix in range(img_shape[0])] + mrcnn_mask = [detection_masks[batch_ixs == ix] for ix in range(img_shape[0])] + + # for test_forward, where no previous list exists. + if box_results_list is None: + box_results_list = [[] for _ in range(img_shape[0])] + + seg_preds = [] + # loop over batch and unmold detections. + for ix in range(img_shape[0]): + + if 0 not in detections[ix].shape: + boxes = detections[ix][:, :2 * cf.dim].astype(np.int32) + class_ids = detections[ix][:, 2 * cf.dim + 1].astype(np.int32) + scores = detections[ix][:, 2 * cf.dim + 2] + masks = mrcnn_mask[ix][np.arange(boxes.shape[0]), ..., class_ids] + + # Filter out detections with zero area. Often only happens in early + # stages of training when the network weights are still a bit random. + if cf.dim == 2: + exclude_ix = np.where((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0] + else: + exclude_ix = np.where( + (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 5] - boxes[:, 4]) <= 0)[0] + + if exclude_ix.shape[0] > 0: + boxes = np.delete(boxes, exclude_ix, axis=0) + class_ids = np.delete(class_ids, exclude_ix, axis=0) + scores = np.delete(scores, exclude_ix, axis=0) + masks = np.delete(masks, exclude_ix, axis=0) + + # Resize masks to original image size and set boundary threshold. + full_masks = [] + permuted_image_shape = list(img_shape[2:]) + [img_shape[1]] + if return_masks: + for i in range(masks.shape[0]): + # Convert neural network mask to full size mask. 
+ full_masks.append(mutils.unmold_mask_2D(masks[i], boxes[i], permuted_image_shape)
+ if cf.dim == 2 else mutils.unmold_mask_3D(masks[i], boxes[i], permuted_image_shape))
+ # if masks are returned, take max over binary full masks of all predictions in this image.
+ # right now only binary masks for plotting/monitoring. for instance segmentation return all proposal masks.
+ final_masks = np.max(np.array(full_masks), 0) if len(full_masks) > 0 else np.zeros(
+ (*permuted_image_shape[:-1],))
+
+ # add final predictions to results.
+ if 0 not in boxes.shape:
+ for ix2, score in enumerate(scores):
+ box_results_list[ix].append({'box_coords': boxes[ix2], 'box_score': score,
+ 'box_type': 'det', 'box_pred_class_id': class_ids[ix2]})
+ else:
+ # pad with zero dummy masks.
+ final_masks = np.zeros(img_shape[2:])
+
+ seg_preds.append(final_masks)
+
+ # create and fill results dictionary.
+ results_dict = {'boxes': box_results_list,
+ 'seg_preds': np.round(np.array(seg_preds))[:, np.newaxis].astype('uint8')}
+
+ return results_dict
+
+
+############################################################
+# Mask R-CNN Class
+############################################################
+
+class net(nn.Module):
+
+
+ def __init__(self, cf, logger):
+
+ super(net, self).__init__()
+ self.cf = cf
+ self.logger = logger
+ self.build()
+
+ if self.cf.weight_init is not None:
+ logger.info("using pytorch weight init of type {}".format(self.cf.weight_init))
+ mutils.initialize_weights(self)
+ else:
+ logger.info("using default pytorch weight init")
+
+
+ def build(self):
+ """Build Mask R-CNN architecture."""
+
+ # Image size must be divisible by 2 multiple times.
+ h, w = self.cf.patch_size[:2]
+ if h / 2**5 != int(h / 2**5) or w / 2**5 != int(w / 2**5):
+ raise Exception("Image size must be divisible by 2 at least 5 times "
+ "to avoid fractions when downscaling and upscaling. "
+ "For example, use 256, 320, 384, 448, 512, ... etc. ")
+
+ # instantiate abstract multi dimensional conv class and backbone class.
+ conv = mutils.NDConvGenerator(self.cf.dim)
+ backbone = utils.import_module('bbone', self.cf.backbone_path)
+
+ # build Anchors, FPN, RPN, Classifier / Bbox-Regressor head, Mask-head
+ self.np_anchors = mutils.generate_pyramid_anchors(self.logger, self.cf)
+ self.anchors = torch.from_numpy(self.np_anchors).float().cuda()
+ self.fpn = backbone.FPN(self.cf, conv)
+ self.rpn = RPN(self.cf, conv)
+ self.classifier = Classifier(self.cf, conv)
+ self.mask = Mask(self.cf, conv)
+
+
+ def train_forward(self, batch, is_validation=False):
+ """
+ train method (also used for validation monitoring). wrapper around forward pass of network. prepares input data
+ for processing, computes losses, and stores outputs in a dictionary.
+ :param batch: dictionary containing 'data', 'seg', etc.
+ :return: results_dict: dictionary with keys:
+ 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary:
+ [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...]
+ 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, n_classes].
+ 'monitor_values': dict of values to be monitored.
+ """
+ img = batch['data']
+ gt_class_ids = batch['roi_labels']
+ gt_boxes = batch['bb_target']
+ axes = (0, 2, 3, 1) if self.cf.dim == 2 else (0, 2, 3, 4, 1)
+ gt_masks = [np.transpose(batch['roi_masks'][ii], axes=axes) for ii in range(len(batch['roi_masks']))]
+
+
+ img = torch.from_numpy(img).float().cuda()
+ batch_rpn_class_loss = torch.FloatTensor([0]).cuda()
+ batch_rpn_bbox_loss = torch.FloatTensor([0]).cuda()
+
+ # list of output boxes for monitoring/plotting. each element is a list of boxes per batch element.
+ box_results_list = [[] for _ in range(img.shape[0])]
+
+ # forward passes. 1. general forward pass, where no activations are saved in second stage (for performance
+ # monitoring and loss sampling). 2. 
second stage forward pass of sampled rois with stored activations for backprop. + rpn_class_logits, rpn_pred_deltas, proposal_boxes, detections, detection_masks = self.forward(img) + mrcnn_class_logits, mrcnn_pred_deltas, mrcnn_pred_mask, target_class_ids, mrcnn_target_deltas, target_mask, \ + sample_proposals = self.loss_samples_forward(gt_class_ids, gt_boxes, gt_masks) + + # loop over batch + for b in range(img.shape[0]): + if len(gt_boxes[b]) > 0: + + # add gt boxes to output list for monitoring. + for ix in range(len(gt_boxes[b])): + box_results_list[b].append({'box_coords': batch['bb_target'][b][ix], + 'box_label': batch['roi_labels'][b][ix], 'box_type': 'gt'}) + + # match gt boxes with anchors to generate targets for RPN losses. + rpn_match, rpn_target_deltas = mutils.gt_anchor_matching(self.cf, self.np_anchors, gt_boxes[b]) + + # add positive anchors used for loss to output list for monitoring. + pos_anchors = mutils.clip_boxes_numpy(self.np_anchors[np.argwhere(rpn_match == 1)][:, 0], img.shape[2:]) + for p in pos_anchors: + box_results_list[b].append({'box_coords': p, 'box_type': 'pos_anchor'}) + + else: + rpn_match = np.array([-1]*self.np_anchors.shape[0]) + rpn_target_deltas = np.array([0]) + + rpn_match = torch.from_numpy(rpn_match).cuda() + rpn_target_deltas = torch.from_numpy(rpn_target_deltas).float().cuda() + + # compute RPN losses. + rpn_class_loss, neg_anchor_ix = compute_rpn_class_loss(rpn_match, rpn_class_logits[b], self.cf.shem_poolsize) + rpn_bbox_loss = compute_rpn_bbox_loss(rpn_target_deltas, rpn_pred_deltas[b], rpn_match) + batch_rpn_class_loss += rpn_class_loss / img.shape[0] + batch_rpn_bbox_loss += rpn_bbox_loss / img.shape[0] + + # add negative anchors used for loss to output list for monitoring. 
+ neg_anchors = mutils.clip_boxes_numpy(self.np_anchors[np.argwhere(rpn_match == -1)][0, neg_anchor_ix], img.shape[2:]) + for n in neg_anchors: + box_results_list[b].append({'box_coords': n, 'box_type': 'neg_anchor'}) + + # add highest scoring proposals to output list for monitoring. + rpn_proposals = proposal_boxes[b][proposal_boxes[b, :, -1].argsort()][::-1] + for r in rpn_proposals[:self.cf.n_plot_rpn_props, :-1]: + box_results_list[b].append({'box_coords': r, 'box_type': 'prop'}) + + # add positive and negative roi samples used for mrcnn losses to output list for monitoring. + if 0 not in sample_proposals.shape: + rois = mutils.clip_to_window(self.cf.window, sample_proposals).cpu().data.numpy() + for ix, r in enumerate(rois): + box_results_list[int(r[-1])].append({'box_coords': r[:-1] * self.cf.scale, + 'box_type': 'pos_class' if target_class_ids[ix] > 0 else 'neg_class'}) + + batch_rpn_class_loss = batch_rpn_class_loss + batch_rpn_bbox_loss = batch_rpn_bbox_loss + + # compute mrcnn losses. + mrcnn_class_loss = compute_mrcnn_class_loss(target_class_ids, mrcnn_class_logits) + mrcnn_bbox_loss = compute_mrcnn_bbox_loss(mrcnn_target_deltas, mrcnn_pred_deltas, target_class_ids) + + # mrcnn can be run without pixelwise annotations available (Faster R-CNN mode). + # In this case, the mask_loss is taken out of training. + if not self.cf.frcnn_mode: + mrcnn_mask_loss = compute_mrcnn_mask_loss(target_mask, mrcnn_pred_mask, target_class_ids) + else: + mrcnn_mask_loss = torch.FloatTensor([0]).cuda() + + loss = batch_rpn_class_loss + batch_rpn_bbox_loss + mrcnn_class_loss + mrcnn_bbox_loss + mrcnn_mask_loss + + # monitor RPN performance: detection count = the number of correctly matched proposals per fg-class. + dcount = [list(target_class_ids.cpu().data.numpy()).count(c) for c in np.arange(self.cf.head_classes)[1:]] + + + + # run unmolding of predictions for monitoring and merge all results to one dictionary. 
+ return_masks = self.cf.return_masks_in_val if is_validation else False + results_dict = get_results(self.cf, img.shape, detections, detection_masks, + box_results_list, return_masks=return_masks) + + results_dict['torch_loss'] = loss + results_dict['monitor_values'] = {'loss': loss.item(), 'class_loss': mrcnn_class_loss.item()} + + results_dict['logger_string'] = \ + "loss: {0:.2f}, rpn_class: {1:.2f}, rpn_bbox: {2:.2f}, mrcnn_class: {3:.2f}, mrcnn_bbox: {4:.2f}, " \ + "mrcnn_mask: {5:.2f}, dcount {6}".format(loss.item(), batch_rpn_class_loss.item(), + batch_rpn_bbox_loss.item(), mrcnn_class_loss.item(), + mrcnn_bbox_loss.item(), mrcnn_mask_loss.item(), dcount) + + return results_dict + + + def test_forward(self, batch, return_masks=True): + """ + test method. wrapper around forward pass of network without usage of any ground truth information. + prepares input data for processing and stores outputs in a dictionary. + :param batch: dictionary containing 'data' + :param return_masks: boolean. If True, full resolution masks are returned for all proposals (speed trade-off). + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, n_classes] + """ + img = batch['data'] + img = torch.from_numpy(img).float().cuda() + _, _, _, detections, detection_masks = self.forward(img) + results_dict = get_results(self.cf, img.shape, detections, detection_masks, return_masks=return_masks) + return results_dict + + + def forward(self, img, is_training=True): + """ + :param img: input images (b, c, y, x, (z)). + :return: rpn_pred_logits: (b, n_anchors, 2) + :return: rpn_pred_deltas: (b, n_anchors, (y, x, (z), log(h), log(w), (log(d)))) + :return: batch_proposal_boxes: (b, n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ix)) only for monitoring/plotting. 
+ :return: detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score) + :return: detection_masks: (n_final_detections, n_classes, y, x, (z)) raw molded masks as returned by mask-head. + """ + # extract features. + fpn_outs = self.fpn(img) + rpn_feature_maps = [fpn_outs[i] for i in self.cf.pyramid_levels] + self.mrcnn_feature_maps = rpn_feature_maps + + # loop through pyramid layers and apply RPN. + layer_outputs = [] # list of lists + for p in rpn_feature_maps: + layer_outputs.append(self.rpn(p)) + + # concatenate layer outputs. + # convert from list of lists of level outputs to list of lists of outputs across levels. + # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]] + outputs = list(zip(*layer_outputs)) + outputs = [torch.cat(list(o), dim=1) for o in outputs] + rpn_pred_logits, rpn_pred_probs, rpn_pred_deltas = outputs + + # generate proposals: apply predicted deltas to anchors and filter by foreground scores from RPN classifier. + proposal_count = self.cf.post_nms_rois_training if is_training else self.cf.post_nms_rois_inference + batch_rpn_rois, batch_proposal_boxes = proposal_layer(rpn_pred_probs, rpn_pred_deltas, proposal_count, self.anchors, self.cf) + + # merge batch dimension of proposals while storing allocation info in coordinate dimension. + batch_ixs = torch.from_numpy(np.repeat(np.arange(batch_rpn_rois.shape[0]), batch_rpn_rois.shape[1])).float().cuda() + rpn_rois = batch_rpn_rois.view(-1, batch_rpn_rois.shape[2]) + self.rpn_rois_batch_info = torch.cat((rpn_rois, batch_ixs.unsqueeze(1)), dim=1) + + # this is the first of two forward passes in the second stage, where no activations are stored for backprop. + # here, all proposals are forwarded (with virtual_batch_size = batch_size * post_nms_rois.) + # for inference/monitoring as well as sampling of rois for the loss functions. + # processed in chunks of roi_chunk_size to re-adjust to gpu-memory. 
+ chunked_rpn_rois = self.rpn_rois_batch_info.split(self.cf.roi_chunk_size) + class_logits_list, bboxes_list = [], [] + with torch.no_grad(): + for chunk in chunked_rpn_rois: + chunk_class_logits, chunk_bboxes = self.classifier(self.mrcnn_feature_maps, chunk) + class_logits_list.append(chunk_class_logits) + bboxes_list.append(chunk_bboxes) + batch_mrcnn_class_logits = torch.cat(class_logits_list, 0) + batch_mrcnn_bbox = torch.cat(bboxes_list, 0) + self.batch_mrcnn_class_scores = F.softmax(batch_mrcnn_class_logits, dim=1) + + # refine classified proposals, filter and return final detections. + detections = refine_detections(rpn_rois, self.batch_mrcnn_class_scores, batch_mrcnn_bbox, batch_ixs, self.cf, ) + + # forward remaining detections through mask-head to generate corresponding masks. + scale = [img.shape[2]] * 4 + [img.shape[-1]] * 2 + scale = torch.from_numpy(np.array(scale[:self.cf.dim * 2] + [1])[None]).float().cuda() + + + detection_boxes = detections[:, :self.cf.dim * 2 + 1] / scale + with torch.no_grad(): + detection_masks = self.mask(self.mrcnn_feature_maps, detection_boxes) + + return [rpn_pred_logits, rpn_pred_deltas, batch_proposal_boxes, detections, detection_masks] + + + def loss_samples_forward(self, batch_gt_class_ids, batch_gt_boxes, batch_gt_masks): + """ + this is the second forward pass through the second stage (features from stage one are re-used). + samples few rois in detection_target_layer and forwards only those for loss computation. + :param batch_gt_class_ids: list over batch elements. Each element is a list over the corresponding roi target labels. + :param batch_gt_boxes: list over batch elements. Each element is a list over the corresponding roi target coordinates. + :param batch_gt_masks: list over batch elements. Each element is binary mask of shape (n_gt_rois, y, x, (z), c) + :return: sample_logits: (n_sampled_rois, n_classes) predicted class scores. 
+ :return: sample_boxes: (n_sampled_rois, n_classes, 2 * dim) predicted corrections to be applied to proposals for refinement. + :return: sample_mask: (n_sampled_rois, n_classes, y, x, (z)) predicted masks per class and proposal. + :return: sample_target_class_ids: (n_sampled_rois) target class labels of sampled proposals. + :return: sample_target_deltas: (n_sampled_rois, 2 * dim) target deltas of sampled proposals for box refinement. + :return: sample_target_masks: (n_sampled_rois, y, x, (z)) target masks of sampled proposals. + :return: sample_proposals: (n_sampled_rois, 2 * dim) RPN output for sampled proposals. only for monitoring/plotting. + """ + # sample rois for loss and get corresponding targets for all Mask R-CNN head network losses. + sample_ix, sample_target_class_ids, sample_target_deltas, sample_target_mask = \ + detection_target_layer(self.rpn_rois_batch_info, self.batch_mrcnn_class_scores, + batch_gt_class_ids, batch_gt_boxes, batch_gt_masks, self.cf) + + # re-use feature maps and RPN output from first forward pass. + sample_proposals = self.rpn_rois_batch_info[sample_ix] + if 0 not in sample_proposals.size(): + sample_logits, sample_boxes = self.classifier(self.mrcnn_feature_maps, sample_proposals) + sample_mask = self.mask(self.mrcnn_feature_maps, sample_proposals) + else: + sample_logits = torch.FloatTensor().cuda() + sample_boxes = torch.FloatTensor().cuda() + sample_mask = torch.FloatTensor().cuda() + + return [sample_logits, sample_boxes, sample_mask, sample_target_class_ids, sample_target_deltas, + sample_target_mask, sample_proposals] \ No newline at end of file diff --git a/models/retina_net.py b/models/retina_net.py new file mode 100644 index 0000000..81dff0a --- /dev/null +++ b/models/retina_net.py @@ -0,0 +1,508 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +""" +Retina Net. According to https://arxiv.org/abs/1708.02002 +Retina U-Net. According to https://arxiv.org/abs/1811.08661 +""" + +import utils.model_utils as mutils +import utils.exp_utils as utils +import sys +sys.path.append('../') +from cuda_functions.nms_2D.pth_nms import nms_gpu as nms_2D +from cuda_functions.nms_3D.pth_nms import nms_gpu as nms_3D + +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.utils + + +############################################################ +# Network Heads +############################################################ + +class Classifier(nn.Module): + + + def __init__(self, cf, conv): + """ + Builds the classifier sub-network. 
+ """ + super(Classifier, self).__init__() + self.dim = conv.dim + self.n_classes = cf.head_classes + n_input_channels = cf.end_filts + n_features = cf.n_rpn_features + n_output_channels = cf.n_anchors_per_pos * cf.head_classes + anchor_stride = cf.rpn_anchor_stride + + self.conv_1 = conv(n_input_channels, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_2 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_3 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_4 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_final = conv(n_features, n_output_channels, ks=3, stride=anchor_stride, pad=1, relu=None) + + + def forward(self, x): + """ + :param x: input feature map (b, in_c, y, x, (z)) + :return: class_logits (b, n_anchors, n_classes) + """ + x = self.conv_1(x) + x = self.conv_2(x) + x = self.conv_3(x) + x = self.conv_4(x) + class_logits = self.conv_final(x) + + axes = (0, 2, 3, 1) if self.dim == 2 else (0, 2, 3, 4, 1) + class_logits = class_logits.permute(*axes) + class_logits = class_logits.contiguous() + class_logits = class_logits.view(x.size()[0], -1, self.n_classes) + + return [class_logits] + + + +class BBRegressor(nn.Module): + + + def __init__(self, cf, conv): + """ + Builds the bb-regression sub-network. 
+ """ + super(BBRegressor, self).__init__() + self.dim = conv.dim + n_input_channels = cf.end_filts + n_features = cf.n_rpn_features + n_output_channels = cf.n_anchors_per_pos * self.dim * 2 + anchor_stride = cf.rpn_anchor_stride + + self.conv_1 = conv(n_input_channels, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_2 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_3 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_4 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_final = conv(n_features, n_output_channels, ks=3, stride=anchor_stride, + pad=1, relu=None) + + def forward(self, x): + """ + :param x: input feature map (b, in_c, y, x, (z)) + :return: bb_logits (b, n_anchors, dim * 2) + """ + x = self.conv_1(x) + x = self.conv_2(x) + x = self.conv_3(x) + x = self.conv_4(x) + bb_logits = self.conv_final(x) + + axes = (0, 2, 3, 1) if self.dim == 2 else (0, 2, 3, 4, 1) + bb_logits = bb_logits.permute(*axes) + bb_logits = bb_logits.contiguous() + bb_logits = bb_logits.view(x.size()[0], -1, self.dim * 2) + + return [bb_logits] + + +############################################################ +# Loss Functions +############################################################ + +def compute_class_loss(anchor_matches, class_pred_logits, shem_poolsize=20): + """ + :param anchor_matches: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors. + :param class_pred_logits: (n_anchors, n_classes). logits from classifier sub-network. + :param shem_poolsize: int. factor of top-k candidates to draw from per negative sample (online-hard-example-mining). + :return: loss: torch tensor. + :return: np_neg_ix: 1D array containing indices of the neg_roi_logits, which have been sampled for training. + """ + # Positive and Negative anchors contribute to the loss, + # but neutral anchors (match value = 0) don't. 
+ pos_indices = torch.nonzero(anchor_matches > 0)
+ neg_indices = torch.nonzero(anchor_matches == -1)
+
+ # get positive samples and calculate loss.
+ if 0 not in pos_indices.size():
+ pos_indices = pos_indices.squeeze(1)
+ roi_logits_pos = class_pred_logits[pos_indices]
+ targets_pos = anchor_matches[pos_indices]
+ pos_loss = F.cross_entropy(roi_logits_pos, targets_pos.long())
+ else:
+ pos_loss = torch.FloatTensor([0]).cuda()
+
+ # get negative samples, such that the amount matches the number of positive samples, but at least 1.
+ # get high scoring negatives by applying online-hard-example-mining.
+ if 0 not in neg_indices.size():
+ neg_indices = neg_indices.squeeze(1)
+ roi_logits_neg = class_pred_logits[neg_indices]
+ negative_count = np.max((1, pos_indices.size()[0]))
+ roi_probs_neg = F.softmax(roi_logits_neg, dim=1)
+ neg_ix = mutils.shem(roi_probs_neg, negative_count, shem_poolsize)
+ neg_loss = F.cross_entropy(roi_logits_neg[neg_ix], torch.LongTensor([0] * neg_ix.shape[0]).cuda())
+ # return the indices of negative samples, which contributed to the loss (for monitoring plots).
+ np_neg_ix = neg_ix.cpu().data.numpy()
+ else:
+ neg_loss = torch.FloatTensor([0]).cuda()
+ np_neg_ix = np.array([]).astype('int32')
+
+ loss = (pos_loss + neg_loss) / 2
+ return loss, np_neg_ix
+
+
+def compute_bbox_loss(target_deltas, pred_deltas, anchor_matches):
+ """
+ :param target_deltas: (b, n_positive_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd)))).
+ Uses 0 padding to fill in unused bbox deltas.
+ :param pred_deltas: predicted deltas from bbox regression head. (b, n_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd))))
+ :param anchor_matches: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors.
+ :return: loss: torch 1D tensor.
+ """
+ if 0 not in torch.nonzero(anchor_matches == 1).size():
+
+ indices = torch.nonzero(anchor_matches == 1).squeeze(1)
+ # Pick bbox deltas that contribute to the loss
+ pred_deltas = pred_deltas[indices]
+ # Trim target bounding box deltas to the same length as pred_deltas.
+ target_deltas = target_deltas[:pred_deltas.size()[0], :]
+ # Smooth L1 loss
+ loss = F.smooth_l1_loss(pred_deltas, target_deltas)
+ else:
+ loss = torch.FloatTensor([0]).cuda()
+
+ return loss
+
+
+############################################################
+# Output Handler
+############################################################
+
+def refine_detections(anchors, probs, deltas, batch_ixs, cf):
+ """
+ Refine classified proposals, filter overlaps and return final
+ detections. n_proposals here is typically a very large number: batch_size * n_anchors.
+ This function is hence optimized for trimming down n_proposals.
+ :param anchors: (n_anchors, 2 * dim)
+ :param probs: (n_proposals, n_classes) softmax probabilities for all rois as predicted by classifier head.
+ :param deltas: (n_proposals, n_classes, 2 * dim) box refinement deltas as predicted by bbox regressor head.
+ :param batch_ixs: (n_proposals) batch element assignment info for re-allocation.
+ :return: result: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score))
+ """
+ anchors = anchors.repeat(len(np.unique(batch_ixs)), 1)
+
+ # flatten foreground probabilities, sort and trim down to highest confidences by pre_nms limit.
+ fg_probs = probs[:, 1:].contiguous()
+ flat_probs, flat_probs_order = fg_probs.view(-1).sort(descending=True)
+ keep_ix = flat_probs_order[:cf.pre_nms_limit]
+ # reshape indices to 2D index array with shape like fg_probs.
+ keep_arr = torch.cat(((keep_ix / fg_probs.shape[1]).unsqueeze(1), (keep_ix % fg_probs.shape[1]).unsqueeze(1)), 1)
+
+ pre_nms_scores = flat_probs[:cf.pre_nms_limit]
+ pre_nms_class_ids = keep_arr[:, 1] + 1 # add background again.
+ pre_nms_batch_ixs = batch_ixs[keep_arr[:, 0]]
+ pre_nms_anchors = anchors[keep_arr[:, 0]]
+ pre_nms_deltas = deltas[keep_arr[:, 0]]
+ keep = torch.arange(pre_nms_scores.size()[0]).long().cuda()
+
+ # apply bounding box deltas. re-scale to image coordinates.
+ std_dev = torch.from_numpy(np.reshape(cf.rpn_bbox_std_dev, [1, cf.dim * 2])).float().cuda()
+ scale = torch.from_numpy(cf.scale).float().cuda()
+ refined_rois = mutils.apply_box_deltas_2D(pre_nms_anchors / scale, pre_nms_deltas * std_dev) * scale \
+ if cf.dim == 2 else mutils.apply_box_deltas_3D(pre_nms_anchors / scale, pre_nms_deltas * std_dev) * scale
+
+ # round and cast to int since we're dealing with pixels now
+ refined_rois = mutils.clip_to_window(cf.window, refined_rois)
+ pre_nms_rois = torch.round(refined_rois)
+ for j, b in enumerate(mutils.unique1d(pre_nms_batch_ixs)):
+
+ bixs = torch.nonzero(pre_nms_batch_ixs == b)[:, 0]
+ bix_class_ids = pre_nms_class_ids[bixs]
+ bix_rois = pre_nms_rois[bixs]
+ bix_scores = pre_nms_scores[bixs]
+
+ for i, class_id in enumerate(mutils.unique1d(bix_class_ids)):
+
+ ixs = torch.nonzero(bix_class_ids == class_id)[:, 0]
+ # nms expects boxes sorted by score.
+ ix_rois = bix_rois[ixs]
+ ix_scores = bix_scores[ixs]
+ ix_scores, order = ix_scores.sort(descending=True)
+ ix_rois = ix_rois[order, :]
+
+ if cf.dim == 2:
+ class_keep = nms_2D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold)
+ else:
+ class_keep = nms_3D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold)
+
+ # map indices back.
+ class_keep = keep[bixs[ixs[order[class_keep]]]]
+ # merge indices over classes for current batch element
+ b_keep = class_keep if i == 0 else mutils.unique1d(torch.cat((b_keep, class_keep)))
+
+ # only keep top-k boxes of current batch-element.
+ top_ids = pre_nms_scores[b_keep].sort(descending=True)[1][:cf.model_max_instances_per_batch_element]
+ b_keep = b_keep[top_ids]
+ # merge indices over batch elements.
+ batch_keep = b_keep if j == 0 else mutils.unique1d(torch.cat((batch_keep, b_keep)))
+
+ keep = batch_keep
+
+ # arrange output.
+ result = torch.cat((pre_nms_rois[keep],
+ pre_nms_batch_ixs[keep].unsqueeze(1).float(),
+ pre_nms_class_ids[keep].unsqueeze(1).float(),
+ pre_nms_scores[keep].unsqueeze(1)), dim=1)
+
+ return result
+
+
+def get_results(cf, img_shape, detections, seg_logits, box_results_list=None):
+ """
+ Restores batch dimension of merged detections, unmolds detections, creates and fills results dict.
+ :param img_shape: shape of the input image batch (b, c, y, x, (z)).
+ :param detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score))
+ :param box_results_list: None or list of output boxes for monitoring/plotting.
+ each element is a list of boxes per batch element.
+ :return: results_dict: dictionary with keys:
+ 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary:
+ [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...]
+ 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, ..., n_classes] for
+ retina_unet and dummy array for retina_net.
+ """
+ detections = detections.cpu().data.numpy()
+ batch_ixs = detections[:, cf.dim*2]
+ detections = [detections[batch_ixs == ix] for ix in range(img_shape[0])]
+
+ # for test_forward, where no previous list exists.
+ if box_results_list is None:
+ box_results_list = [[] for _ in range(img_shape[0])]
+
+ for ix in range(img_shape[0]):
+
+ if 0 not in detections[ix].shape:
+
+ boxes = detections[ix][:, :2 * cf.dim].astype(np.int32)
+ class_ids = detections[ix][:, 2 * cf.dim + 1].astype(np.int32)
+ scores = detections[ix][:, 2 * cf.dim + 2]
+
+ # Filter out detections with zero area. Often only happens in early
+ # stages of training when the network weights are still a bit random.
+ if cf.dim == 2:
+ exclude_ix = np.where((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0]
+ else:
+ exclude_ix = np.where(
+ (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 5] - boxes[:, 4]) <= 0)[0]
+
+ if exclude_ix.shape[0] > 0:
+ boxes = np.delete(boxes, exclude_ix, axis=0)
+ class_ids = np.delete(class_ids, exclude_ix, axis=0)
+ scores = np.delete(scores, exclude_ix, axis=0)
+
+ if 0 not in boxes.shape:
+ for ix2, score in enumerate(scores):
+ if score >= cf.model_min_confidence:
+ box_results_list[ix].append({'box_coords': boxes[ix2],
+ 'box_score': score,
+ 'box_type': 'det',
+ 'box_pred_class_id': class_ids[ix2]})
+
+ results_dict = {'boxes': box_results_list}
+ if seg_logits is None:
+ # output dummy segmentation for retina_net.
+ results_dict['seg_preds'] = np.zeros(img_shape)[:, 0][:, np.newaxis]
+ else:
+ # output label maps for retina_unet.
+ results_dict['seg_preds'] = F.softmax(seg_logits, 1).argmax(1).cpu().data.numpy()[:, np.newaxis].astype('uint8')
+
+ return results_dict
+
+
+############################################################
+# Retina (U-)Net Class
+############################################################
+
+
+class net(nn.Module):
+
+
+ def __init__(self, cf, logger):
+
+ super(net, self).__init__()
+ self.cf = cf
+ self.logger = logger
+ self.build()
+ if self.cf.weight_init is not None:
+ logger.info("using pytorch weight init of type {}".format(self.cf.weight_init))
+ mutils.initialize_weights(self)
+ else:
+ logger.info("using default pytorch weight init")
+
+ def build(self):
+ """
+ Build Retina Net architecture.
+ """
+
+ # Image size must be divisible by 2 multiple times.
+ h, w = self.cf.patch_size[:2]
+ if h / 2 ** 5 != int(h / 2 ** 5) or w / 2 ** 5 != int(w / 2 ** 5):
+ raise Exception("Image size must be divisible by 2 at least 5 times "
+ "to avoid fractions when downscaling and upscaling. "
+ "For example, use 256, 320, 384, 448, 512, ... etc. 
") + + # instanciate abstract multi dimensional conv class and backbone model. + conv = mutils.NDConvGenerator(self.cf.dim) + backbone = utils.import_module('bbone', self.cf.backbone_path) + + # build Anchors, FPN, Classifier / Bbox-Regressor -head + self.np_anchors = mutils.generate_pyramid_anchors(self.logger, self.cf) + self.anchors = torch.from_numpy(self.np_anchors).float().cuda() + self.Fpn = backbone.FPN(self.cf, conv, operate_stride1=self.cf.operate_stride1) + self.Classifier = Classifier(self.cf, conv) + self.BBRegressor = BBRegressor(self.cf, conv) + + + def train_forward(self, batch, **kwargs): + """ + train method (also used for validation monitoring). wrapper around forward pass of network. prepares input data + for processing, computes losses, and stores outputs in a dictionary. + :param batch: dictionary containing 'data', 'seg', etc. + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixelwise segmentation output (b, c, y, x, (z)) with values [0, .., n_classes]. + 'monitor_values': dict of values to be monitored. + """ + img = batch['data'] + gt_class_ids = batch['roi_labels'] + gt_boxes = batch['bb_target'] + var_seg_ohe = torch.FloatTensor(mutils.get_one_hot_encoding(batch['seg'], self.cf.num_seg_classes)).cuda() + var_seg = torch.LongTensor(batch['seg']).cuda() + + img = torch.from_numpy(img).float().cuda() + batch_class_loss = torch.FloatTensor([0]).cuda() + batch_bbox_loss = torch.FloatTensor([0]).cuda() + + # list of output boxes for monitoring/plotting. each element is a list of boxes per batch element. + box_results_list = [[] for _ in range(img.shape[0])] + detections, class_logits, pred_deltas, seg_logits = self.forward(img) + + # loop over batch + for b in range(img.shape[0]): + + # add gt boxes to results dict for monitoring. 
+ if len(gt_boxes[b]) > 0: + for ix in range(len(gt_boxes[b])): + box_results_list[b].append({'box_coords': batch['bb_target'][b][ix], + 'box_label': batch['roi_labels'][b][ix], 'box_type': 'gt'}) + + # match gt boxes with anchors to generate targets. + anchor_class_match, anchor_target_deltas = mutils.gt_anchor_matching( + self.cf, self.np_anchors, gt_boxes[b], gt_class_ids[b]) + + # add positive anchors used for loss to results_dict for monitoring. + pos_anchors = mutils.clip_boxes_numpy( + self.np_anchors[np.argwhere(anchor_class_match > 0)][:, 0], img.shape[2:]) + for p in pos_anchors: + box_results_list[b].append({'box_coords': p, 'box_type': 'pos_anchor'}) + + else: + anchor_class_match = np.array([-1]*self.np_anchors.shape[0]) + anchor_target_deltas = np.array([0]) + + anchor_class_match = torch.from_numpy(anchor_class_match).cuda() + anchor_target_deltas = torch.from_numpy(anchor_target_deltas).float().cuda() + + # compute losses. + class_loss, neg_anchor_ix = compute_class_loss(anchor_class_match, class_logits[b]) + bbox_loss = compute_bbox_loss(anchor_target_deltas, pred_deltas[b], anchor_class_match) + + # add negative anchors used for loss to results_dict for monitoring. 
+ neg_anchors = mutils.clip_boxes_numpy(
+ self.np_anchors[np.argwhere(anchor_class_match == -1)][0, neg_anchor_ix], img.shape[2:])
+ for n in neg_anchors:
+ box_results_list[b].append({'box_coords': n, 'box_type': 'neg_anchor'})
+
+ batch_class_loss += class_loss / img.shape[0]
+ batch_bbox_loss += bbox_loss / img.shape[0]
+
+ results_dict = get_results(self.cf, img.shape, detections, seg_logits, box_results_list)
+ loss = batch_class_loss + batch_bbox_loss
+ results_dict['torch_loss'] = loss
+ results_dict['monitor_values'] = {'loss': loss.item(), 'class_loss': batch_class_loss.item()}
+ results_dict['logger_string'] = "loss: {0:.2f}, class: {1:.2f}, bbox: {2:.2f}"\
+ .format(loss.item(), batch_class_loss.item(), batch_bbox_loss.item())
+
+ return results_dict
+
+
+ def test_forward(self, batch, **kwargs):
+ """
+ test method. wrapper around forward pass of network without usage of any ground truth information.
+ prepares input data for processing and stores outputs in a dictionary.
+ :param batch: dictionary containing 'data'
+ :return: results_dict: dictionary with keys:
+ 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary:
+ [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...]
+ 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, ..., n_classes] for
+ retina_unet and dummy array for retina_net.
+ """
+ img = batch['data']
+ img = torch.from_numpy(img).float().cuda()
+ detections, _, _, seg_logits = self.forward(img)
+ results_dict = get_results(self.cf, img.shape, detections, seg_logits)
+ return results_dict
+
+
+ def forward(self, img):
+ """
+ forward pass of the model.
+ :param img: input img (b, c, y, x, (z)).
+ :return: class_logits: (b, n_anchors, n_classes) raw logits from the classifier sub-network.
+ :return: bb_outputs: (b, n_anchors, dim * 2) predicted box deltas from the bb-regressor sub-network.
+ :return: detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score))
+ :return: seg_logits: None for retina_net (dummy placeholder); retina_unet returns pixel-wise segmentation logits here.
+ """
+ # Feature extraction
+ fpn_outs = self.Fpn(img)
+ seg_logits = None
+ selected_fmaps = [fpn_outs[i] for i in self.cf.pyramid_levels]
+
+ # Loop through pyramid layers
+ class_layer_outputs, bb_reg_layer_outputs = [], [] # list of lists
+ for p in selected_fmaps:
+ class_layer_outputs.append(self.Classifier(p))
+ bb_reg_layer_outputs.append(self.BBRegressor(p))
+
+ # Concatenate layer outputs
+ # Convert from list of lists of level outputs to list of lists
+ # of outputs across levels.
+ # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]]
+ class_logits = list(zip(*class_layer_outputs))
+ class_logits = [torch.cat(list(o), dim=1) for o in class_logits][0]
+ bb_outputs = list(zip(*bb_reg_layer_outputs))
+ bb_outputs = [torch.cat(list(o), dim=1) for o in bb_outputs][0]
+
+ # merge batch_dimension and store info in batch_ixs for re-allocation.
+ batch_ixs = torch.arange(class_logits.shape[0]).unsqueeze(1).repeat(1, class_logits.shape[1]).view(-1).cuda()
+ flat_class_softmax = F.softmax(class_logits.view(-1, class_logits.shape[-1]), 1)
+ flat_bb_outputs = bb_outputs.view(-1, bb_outputs.shape[-1])
+ detections = refine_detections(self.anchors, flat_class_softmax, flat_bb_outputs, batch_ixs, self.cf)
+
+ return detections, class_logits, bb_outputs, seg_logits
diff --git a/models/retina_unet.py b/models/retina_unet.py
new file mode 100644
index 0000000..e95089b
--- /dev/null
+++ b/models/retina_unet.py
@@ -0,0 +1,513 @@
+#!/usr/bin/env python
+# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ).
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +""" +Retina Net. According to https://arxiv.org/abs/1708.02002 +Retina U-Net. According to https://arxiv.org/abs/1811.08661 +""" + +import utils.model_utils as mutils +import utils.exp_utils as utils +import sys +sys.path.append('../') +from cuda_functions.nms_2D.pth_nms import nms_gpu as nms_2D +from cuda_functions.nms_3D.pth_nms import nms_gpu as nms_3D + +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.utils + + +############################################################ +# Network Heads +############################################################ + +class Classifier(nn.Module): + + + def __init__(self, cf, conv): + """ + Builds the classifier sub-network. 
+ """ + super(Classifier, self).__init__() + self.dim = conv.dim + self.n_classes = cf.head_classes + n_input_channels = cf.end_filts + n_features = cf.n_rpn_features + n_output_channels = cf.n_anchors_per_pos * cf.head_classes + anchor_stride = cf.rpn_anchor_stride + + self.conv_1 = conv(n_input_channels, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_2 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_3 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_4 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_final = conv(n_features, n_output_channels, ks=3, stride=anchor_stride, pad=1, relu=None) + + + def forward(self, x): + """ + :param x: input feature map (b, in_c, y, x, (z)) + :return: class_logits (b, n_anchors, n_classes) + """ + x = self.conv_1(x) + x = self.conv_2(x) + x = self.conv_3(x) + x = self.conv_4(x) + class_logits = self.conv_final(x) + + axes = (0, 2, 3, 1) if self.dim == 2 else (0, 2, 3, 4, 1) + class_logits = class_logits.permute(*axes) + class_logits = class_logits.contiguous() + class_logits = class_logits.view(x.size()[0], -1, self.n_classes) + + return [class_logits] + + + +class BBRegressor(nn.Module): + + + def __init__(self, cf, conv): + """ + Builds the bb-regression sub-network. 
+ """ + super(BBRegressor, self).__init__() + self.dim = conv.dim + n_input_channels = cf.end_filts + n_features = cf.n_rpn_features + n_output_channels = cf.n_anchors_per_pos * self.dim * 2 + anchor_stride = cf.rpn_anchor_stride + + self.conv_1 = conv(n_input_channels, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_2 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_3 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_4 = conv(n_features, n_features, ks=3, stride=anchor_stride, pad=1, relu=cf.relu) + self.conv_final = conv(n_features, n_output_channels, ks=3, stride=anchor_stride, + pad=1, relu=None) + + def forward(self, x): + """ + :param x: input feature map (b, in_c, y, x, (z)) + :return: bb_logits (b, n_anchors, dim * 2) + """ + x = self.conv_1(x) + x = self.conv_2(x) + x = self.conv_3(x) + x = self.conv_4(x) + bb_logits = self.conv_final(x) + + axes = (0, 2, 3, 1) if self.dim == 2 else (0, 2, 3, 4, 1) + bb_logits = bb_logits.permute(*axes) + bb_logits = bb_logits.contiguous() + bb_logits = bb_logits.view(x.size()[0], -1, self.dim * 2) + + return [bb_logits] + + +############################################################ +# Loss Functions +############################################################ + +def compute_class_loss(anchor_matches, class_pred_logits, shem_poolsize=20): + """ + :param anchor_matches: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors. + :param class_pred_logits: (n_anchors, n_classes). logits from classifier sub-network. + :param shem_poolsize: int. factor of top-k candidates to draw from per negative sample (online-hard-example-mining). + :return: loss: torch tensor. + :return: np_neg_ix: 1D array containing indices of the neg_roi_logits, which have been sampled for training. + """ + # Positive and Negative anchors contribute to the loss, + # but neutral anchors (match value = 0) don't. 
+ pos_indices = torch.nonzero(anchor_matches > 0)
+ neg_indices = torch.nonzero(anchor_matches == -1)
+
+ # get positive samples and calculate loss.
+ if 0 not in pos_indices.size():
+ pos_indices = pos_indices.squeeze(1)
+ roi_logits_pos = class_pred_logits[pos_indices]
+ targets_pos = anchor_matches[pos_indices]
+ pos_loss = F.cross_entropy(roi_logits_pos, targets_pos.long())
+ else:
+ pos_loss = torch.FloatTensor([0]).cuda()
+
+ # get negative samples, such that the amount matches the number of positive samples, but at least 1.
+ # get high scoring negatives by applying online-hard-example-mining.
+ if 0 not in neg_indices.size():
+ neg_indices = neg_indices.squeeze(1)
+ roi_logits_neg = class_pred_logits[neg_indices]
+ negative_count = np.max((1, pos_indices.size()[0]))
+ roi_probs_neg = F.softmax(roi_logits_neg, dim=1)
+ neg_ix = mutils.shem(roi_probs_neg, negative_count, shem_poolsize)
+ neg_loss = F.cross_entropy(roi_logits_neg[neg_ix], torch.LongTensor([0] * neg_ix.shape[0]).cuda())
+ # return the indices of negative samples, which contributed to the loss (for monitoring plots).
+ np_neg_ix = neg_ix.cpu().data.numpy()
+ else:
+ neg_loss = torch.FloatTensor([0]).cuda()
+ np_neg_ix = np.array([]).astype('int32')
+
+ loss = (pos_loss + neg_loss) / 2
+ return loss, np_neg_ix
+
+
+def compute_bbox_loss(target_deltas, pred_deltas, anchor_matches):
+ """
+ :param target_deltas: (b, n_positive_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd)))).
+ Uses 0 padding to fill in unused bbox deltas.
+ :param pred_deltas: predicted deltas from bbox regression head. (b, n_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd))))
+ :param anchor_matches: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors.
+ :return: loss: torch 1D tensor.
+ """
+ if 0 not in torch.nonzero(anchor_matches == 1).size():
+
+ indices = torch.nonzero(anchor_matches == 1).squeeze(1)
+ # Pick bbox deltas that contribute to the loss
+ pred_deltas = pred_deltas[indices]
+ # Trim target bounding box deltas to the same length as pred_deltas.
+ target_deltas = target_deltas[:pred_deltas.size()[0], :]
+ # Smooth L1 loss
+ loss = F.smooth_l1_loss(pred_deltas, target_deltas)
+ else:
+ loss = torch.FloatTensor([0]).cuda()
+
+ return loss
+
+
+############################################################
+# Output Handler
+############################################################
+
+def refine_detections(anchors, probs, deltas, batch_ixs, cf):
+ """
+ Refine classified proposals, filter overlaps and return final
+ detections. n_proposals here is typically a very large number: batch_size * n_anchors.
+ This function is hence optimized for trimming down n_proposals.
+ :param anchors: (n_anchors, 2 * dim)
+ :param probs: (n_proposals, n_classes) softmax probabilities for all rois as predicted by classifier head.
+ :param deltas: (n_proposals, n_classes, 2 * dim) box refinement deltas as predicted by bbox regressor head.
+ :param batch_ixs: (n_proposals) batch element assignment info for re-allocation.
+ :return: result: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score))
+ """
+ anchors = anchors.repeat(len(np.unique(batch_ixs)), 1)
+
+ # flatten foreground probabilities, sort and trim down to highest confidences by pre_nms limit.
+ fg_probs = probs[:, 1:].contiguous()
+ flat_probs, flat_probs_order = fg_probs.view(-1).sort(descending=True)
+ keep_ix = flat_probs_order[:cf.pre_nms_limit]
+ # reshape indices to 2D index array with shape like fg_probs.
+ keep_arr = torch.cat(((keep_ix / fg_probs.shape[1]).unsqueeze(1), (keep_ix % fg_probs.shape[1]).unsqueeze(1)), 1)
+
+ pre_nms_scores = flat_probs[:cf.pre_nms_limit]
+ pre_nms_class_ids = keep_arr[:, 1] + 1 # add background again.
+ pre_nms_batch_ixs = batch_ixs[keep_arr[:, 0]]
+ pre_nms_anchors = anchors[keep_arr[:, 0]]
+ pre_nms_deltas = deltas[keep_arr[:, 0]]
+ keep = torch.arange(pre_nms_scores.size()[0]).long().cuda()
+
+ # apply bounding box deltas. re-scale to image coordinates.
+ std_dev = torch.from_numpy(np.reshape(cf.rpn_bbox_std_dev, [1, cf.dim * 2])).float().cuda()
+ scale = torch.from_numpy(cf.scale).float().cuda()
+ refined_rois = mutils.apply_box_deltas_2D(pre_nms_anchors / scale, pre_nms_deltas * std_dev) * scale \
+ if cf.dim == 2 else mutils.apply_box_deltas_3D(pre_nms_anchors / scale, pre_nms_deltas * std_dev) * scale
+
+ # round and cast to int since we're dealing with pixels now
+ refined_rois = mutils.clip_to_window(cf.window, refined_rois)
+ pre_nms_rois = torch.round(refined_rois)
+ for j, b in enumerate(mutils.unique1d(pre_nms_batch_ixs)):
+
+ bixs = torch.nonzero(pre_nms_batch_ixs == b)[:, 0]
+ bix_class_ids = pre_nms_class_ids[bixs]
+ bix_rois = pre_nms_rois[bixs]
+ bix_scores = pre_nms_scores[bixs]
+
+ for i, class_id in enumerate(mutils.unique1d(bix_class_ids)):
+
+ ixs = torch.nonzero(bix_class_ids == class_id)[:, 0]
+ # nms expects boxes sorted by score.
+ ix_rois = bix_rois[ixs]
+ ix_scores = bix_scores[ixs]
+ ix_scores, order = ix_scores.sort(descending=True)
+ ix_rois = ix_rois[order, :]
+
+ if cf.dim == 2:
+ class_keep = nms_2D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold)
+ else:
+ class_keep = nms_3D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold)
+
+ # map indices back.
+ class_keep = keep[bixs[ixs[order[class_keep]]]]
+ # merge indices over classes for current batch element
+ b_keep = class_keep if i == 0 else mutils.unique1d(torch.cat((b_keep, class_keep)))
+
+ # only keep top-k boxes of current batch-element.
+ top_ids = pre_nms_scores[b_keep].sort(descending=True)[1][:cf.model_max_instances_per_batch_element] + b_keep = b_keep[top_ids] + # merge indices over batch elements. + batch_keep = b_keep if j == 0 else mutils.unique1d(torch.cat((batch_keep, b_keep))) + + keep = batch_keep + + # arrange output. + result = torch.cat((pre_nms_rois[keep], + pre_nms_batch_ixs[keep].unsqueeze(1).float(), + pre_nms_class_ids[keep].unsqueeze(1).float(), + pre_nms_scores[keep].unsqueeze(1)), dim=1) + + return result + + + +def get_results(cf, img_shape, detections, seg_logits, box_results_list=None): + """ + Restores batch dimension of merged detections, unmolds detections, creates and fills results dict. + :param img_shape: + :param detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score) + :param box_results_list: None or list of output boxes for monitoring/plotting. + each element is a list of boxes per batch element. + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, ..., n_classes] for + retina_unet and dummy array for retina_net. + """ + detections = detections.cpu().data.numpy() + batch_ixs = detections[:, cf.dim*2] + detections = [detections[batch_ixs == ix] for ix in range(img_shape[0])] + + # for test_forward, where no previous list exists. + if box_results_list is None: + box_results_list = [[] for _ in range(img_shape[0])] + + for ix in range(img_shape[0]): + + if 0 not in detections[ix].shape: + + boxes = detections[ix][:, :2 * cf.dim].astype(np.int32) + class_ids = detections[ix][:, 2 * cf.dim + 1].astype(np.int32) + scores = detections[ix][:, 2 * cf.dim + 2] + + # Filter out detections with zero area. Often only happens in early + # stages of training when the network weights are still a bit random. 
+            if cf.dim == 2:
+                exclude_ix = np.where((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0]
+            else:
+                exclude_ix = np.where(
+                    (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 5] - boxes[:, 4]) <= 0)[0]
+
+            if exclude_ix.shape[0] > 0:
+                boxes = np.delete(boxes, exclude_ix, axis=0)
+                class_ids = np.delete(class_ids, exclude_ix, axis=0)
+                scores = np.delete(scores, exclude_ix, axis=0)
+
+            if 0 not in boxes.shape:
+                for ix2, score in enumerate(scores):
+                    if score >= cf.model_min_confidence:
+                        box_results_list[ix].append({'box_coords': boxes[ix2],
+                                                     'box_score': score,
+                                                     'box_type': 'det',
+                                                     'box_pred_class_id': class_ids[ix2]})
+
+    results_dict = {'boxes': box_results_list}
+    if seg_logits is None:
+        # output dummy segmentation for retina_net.
+        results_dict['seg_preds'] = np.zeros(img_shape)[:, 0][:, np.newaxis]
+    else:
+        # output label maps for retina_unet.
+        results_dict['seg_preds'] = F.softmax(seg_logits, 1).argmax(1).cpu().data.numpy()[:, np.newaxis].astype('uint8')
+
+    return results_dict
+
+
+############################################################
+#  Retina (U-)Net Class
+############################################################
+
+
+class net(nn.Module):
+
+
+    def __init__(self, cf, logger):
+
+        super(net, self).__init__()
+        self.cf = cf
+        self.logger = logger
+        self.build()
+        if self.cf.weight_init is not None:
+            logger.info("using pytorch weight init of type {}".format(self.cf.weight_init))
+            mutils.initialize_weights(self)
+        else:
+            logger.info("using default pytorch weight init")
+
+    def build(self):
+        """
+        Build Retina Net architecture.
+        """
+
+        # Image size must be divisible by 2 multiple times.
+        h, w = self.cf.patch_size[:2]
+        if h / 2 ** 5 != int(h / 2 ** 5) or w / 2 ** 5 != int(w / 2 ** 5):
+            raise Exception("Image size must be divisible by 2 at least 5 times "
+                            "to avoid fractions when downscaling and upscaling. "
+                            "For example, use 256, 320, 384, 448, 512, ... etc. 
") + + # instanciate abstract multi dimensional conv class and backbone model. + conv = mutils.NDConvGenerator(self.cf.dim) + backbone = utils.import_module('bbone', self.cf.backbone_path) + + # build Anchors, FPN, Classifier / Bbox-Regressor -head + self.np_anchors = mutils.generate_pyramid_anchors(self.logger, self.cf) + self.anchors = torch.from_numpy(self.np_anchors).float().cuda() + self.Fpn = backbone.FPN(self.cf, conv, operate_stride1=self.cf.operate_stride1) + self.Classifier = Classifier(self.cf, conv) + self.BBRegressor = BBRegressor(self.cf, conv) + self.final_conv = conv(self.cf.end_filts, self.cf.num_seg_classes, ks=1, pad=0, norm=self.cf.norm, relu=None) + + + def train_forward(self, batch, **kwargs): + """ + train method (also used for validation monitoring). wrapper around forward pass of network. prepares input data + for processing, computes losses, and stores outputs in a dictionary. + :param batch: dictionary containing 'data', 'seg', etc. + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixelwise segmentation output (b, c, y, x, (z)) with values [0, .., n_classes]. + 'monitor_values': dict of values to be monitored. + """ + img = batch['data'] + gt_class_ids = batch['roi_labels'] + gt_boxes = batch['bb_target'] + var_seg_ohe = torch.FloatTensor(mutils.get_one_hot_encoding(batch['seg'], self.cf.num_seg_classes)).cuda() + var_seg = torch.LongTensor(batch['seg']).cuda() + + img = torch.from_numpy(img).float().cuda() + batch_class_loss = torch.FloatTensor([0]).cuda() + batch_bbox_loss = torch.FloatTensor([0]).cuda() + + # list of output boxes for monitoring/plotting. each element is a list of boxes per batch element. 
+ box_results_list = [[] for _ in range(img.shape[0])] + detections, class_logits, pred_deltas, seg_logits = self.forward(img) + + # loop over batch + for b in range(img.shape[0]): + + # add gt boxes to results dict for monitoring. + if len(gt_boxes[b]) > 0: + for ix in range(len(gt_boxes[b])): + box_results_list[b].append({'box_coords': batch['bb_target'][b][ix], + 'box_label': batch['roi_labels'][b][ix], 'box_type': 'gt'}) + + # match gt boxes with anchors to generate targets. + anchor_class_match, anchor_target_deltas = mutils.gt_anchor_matching( + self.cf, self.np_anchors, gt_boxes[b], gt_class_ids[b]) + + # add positive anchors used for loss to results_dict for monitoring. + pos_anchors = mutils.clip_boxes_numpy( + self.np_anchors[np.argwhere(anchor_class_match > 0)][:, 0], img.shape[2:]) + for p in pos_anchors: + box_results_list[b].append({'box_coords': p, 'box_type': 'pos_anchor'}) + + else: + anchor_class_match = np.array([-1]*self.np_anchors.shape[0]) + anchor_target_deltas = np.array([0]) + + anchor_class_match = torch.from_numpy(anchor_class_match).cuda() + anchor_target_deltas = torch.from_numpy(anchor_target_deltas).float().cuda() + + # compute losses. + class_loss, neg_anchor_ix = compute_class_loss(anchor_class_match, class_logits[b]) + bbox_loss = compute_bbox_loss(anchor_target_deltas, pred_deltas[b], anchor_class_match) + + # add negative anchors used for loss to results_dict for monitoring. 
+ neg_anchors = mutils.clip_boxes_numpy( + self.np_anchors[np.argwhere(anchor_class_match == -1)][0, neg_anchor_ix], img.shape[2:]) + for n in neg_anchors: + box_results_list[b].append({'box_coords': n, 'box_type': 'neg_anchor'}) + + batch_class_loss += class_loss / img.shape[0] + batch_bbox_loss += bbox_loss / img.shape[0] + + results_dict = get_results(self.cf, img.shape, detections, seg_logits, box_results_list) + seg_loss_dice = 1 - mutils.batch_dice(F.softmax(seg_logits, dim=1),var_seg_ohe) + seg_loss_ce = F.cross_entropy(seg_logits, var_seg[:, 0]) + loss = batch_class_loss + batch_bbox_loss + (seg_loss_dice + seg_loss_ce) / 2 + results_dict['torch_loss'] = loss + results_dict['monitor_values'] = {'loss': loss.item(), 'class_loss': batch_class_loss.item()} + results_dict['logger_string'] = \ + "loss: {0:.2f}, class: {1:.2f}, bbox: {2:.2f}, seg dice: {3:.3f}, seg ce: {4:.3f}, mean pix. pr.: {5:.5f}"\ + .format(loss.item(), batch_class_loss.item(), batch_bbox_loss.item(), seg_loss_dice.item(), + seg_loss_ce.item(), np.mean(results_dict['seg_preds'])) + + return results_dict + + + def test_forward(self, batch, **kwargs): + """ + test method. wrapper around forward pass of network without usage of any ground truth information. + prepares input data for processing and stores outputs in a dictionary. + :param batch: dictionary containing 'data' + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, ..., n_classes] for + retina_unet and dummy array for retina_net. + """ + img = batch['data'] + img = torch.from_numpy(img).float().cuda() + detections, _, _, seg_logits = self.forward(img) + results_dict = get_results(self.cf, img.shape, detections, seg_logits) + return results_dict + + + def forward(self, img): + """ + forward pass of the model. 
+ :param img: input img (b, c, y, x, (z)). + :return: rpn_pred_logits: (b, n_anchors, 2) + :return: rpn_pred_deltas: (b, n_anchors, (y, x, (z), log(h), log(w), (log(d)))) + :return: batch_proposal_boxes: (b, n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ix)) only for monitoring/plotting. + :return: detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score) + :return: detection_masks: (n_final_detections, n_classes, y, x, (z)) raw molded masks as returned by mask-head. + """ + # Feature extraction + fpn_outs = self.Fpn(img) + seg_logits = self.final_conv(fpn_outs[0]) + selected_fmaps = [fpn_outs[i + 1] for i in self.cf.pyramid_levels] + + # Loop through pyramid layers + class_layer_outputs, bb_reg_layer_outputs = [], [] # list of lists + for p in selected_fmaps: + class_layer_outputs.append(self.Classifier(p)) + bb_reg_layer_outputs.append(self.BBRegressor(p)) + + # Concatenate layer outputs + # Convert from list of lists of level outputs to list of lists + # of outputs across levels. + # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]] + class_logits = list(zip(*class_layer_outputs)) + class_logits = [torch.cat(list(o), dim=1) for o in class_logits][0] + bb_outputs = list(zip(*bb_reg_layer_outputs)) + bb_outputs = [torch.cat(list(o), dim=1) for o in bb_outputs][0] + + # merge batch_dimension and store info in batch_ixs for re-allocation. 
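The merge-and-remember bookkeeping described above can be shown with a tiny NumPy sketch (batch size and anchor count are made up): every flattened row keeps its originating batch index so detections can be re-allocated to their batch element later.

```python
import numpy as np

b, n_anchors = 2, 3  # illustrative sizes only
# equivalent of arange(b).unsqueeze(1).repeat(1, n_anchors).view(-1) in torch:
batch_ixs = np.repeat(np.arange(b), n_anchors)
```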
+ batch_ixs = torch.arange(class_logits.shape[0]).unsqueeze(1).repeat(1, class_logits.shape[1]).view(-1).cuda() + flat_class_softmax = F.softmax(class_logits.view(-1, class_logits.shape[-1]), 1) + flat_bb_outputs = bb_outputs.view(-1, bb_outputs.shape[-1]) + detections = refine_detections(self.anchors, flat_class_softmax, flat_bb_outputs, batch_ixs, self.cf) + + return detections, class_logits, bb_outputs, seg_logits diff --git a/models/ufrcnn.py b/models/ufrcnn.py new file mode 100644 index 0000000..a1dd68a --- /dev/null +++ b/models/ufrcnn.py @@ -0,0 +1,1019 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +""" +Parts are based on https://github.com/multimodallearning/pytorch-mask-rcnn +published under MIT license. 
+""" + +import utils.model_utils as mutils +import utils.exp_utils as utils +from cuda_functions.nms_2D.pth_nms import nms_gpu as nms_2D +from cuda_functions.nms_3D.pth_nms import nms_gpu as nms_3D +from cuda_functions.roi_align_2D.roi_align.crop_and_resize import CropAndResizeFunction as ra2D +from cuda_functions.roi_align_3D.roi_align.crop_and_resize import CropAndResizeFunction as ra3D + +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.utils + + +############################################################ +# Networks on top of backbone +############################################################ + +class RPN(nn.Module): + """ + Region Proposal Network. + """ + + def __init__(self, cf, conv): + + super(RPN, self).__init__() + self.dim = conv.dim + + self.conv_shared = conv(cf.end_filts, cf.n_rpn_features, ks=3, stride=cf.rpn_anchor_stride, pad=1, relu=cf.relu) + self.conv_class = conv(cf.n_rpn_features, 2 * len(cf.rpn_anchor_ratios), ks=1, stride=1, relu=None) + self.conv_bbox = conv(cf.n_rpn_features, 2 * self.dim * len(cf.rpn_anchor_ratios), ks=1, stride=1, relu=None) + + + def forward(self, x): + """ + :param x: input feature maps (b, in_channels, y, x, (z)) + :return: rpn_class_logits (b, 2, n_anchors) + :return: rpn_probs_logits (b, 2, n_anchors) + :return: rpn_bbox (b, 2 * dim, n_anchors) + """ + + # Shared convolutional base of the RPN. + x = self.conv_shared(x) + + # Anchor Score. (batch, anchors per location * 2, y, x, (z)). + rpn_class_logits = self.conv_class(x) + # Reshape to (batch, 2, anchors) + axes = (0, 2, 3, 1) if self.dim == 2 else (0, 2, 3, 4, 1) + rpn_class_logits = rpn_class_logits.permute(*axes) + rpn_class_logits = rpn_class_logits.contiguous() + rpn_class_logits = rpn_class_logits.view(x.size()[0], -1, 2) + + # Softmax on last dimension (fg vs. bg). + rpn_probs = F.softmax(rpn_class_logits, dim=2) + + # Bounding box refinement. 
(batch, anchors_per_location * 2 * dim, y, x, (z))
+        rpn_bbox = self.conv_bbox(x)
+
+        # Reshape to (batch, 2 * dim, anchors)
+        rpn_bbox = rpn_bbox.permute(*axes)
+        rpn_bbox = rpn_bbox.contiguous()
+        rpn_bbox = rpn_bbox.view(x.size()[0], -1, self.dim * 2)
+
+        return [rpn_class_logits, rpn_probs, rpn_bbox]
+
+
+
+class Classifier(nn.Module):
+    """
+    Head network for classification and bounding box refinement. Performs RoiAlign, processes the resulting features
+    through a shared convolutional base and finally branches off the classifier and bbox regressor heads.
+    """
+    def __init__(self, cf, conv):
+        super(Classifier, self).__init__()
+
+        self.dim = conv.dim
+        self.in_channels = cf.end_filts
+        self.pool_size = cf.pool_size
+        self.pyramid_levels = cf.pyramid_levels
+        # instance_norm does not work with spatial dims (1, 1, (1))
+        norm = cf.norm if cf.norm != 'instance_norm' else None
+
+        self.conv1 = conv(cf.end_filts, cf.end_filts * 4, ks=self.pool_size, stride=1, norm=norm, relu=cf.relu)
+        self.conv2 = conv(cf.end_filts * 4, cf.end_filts * 4, ks=1, stride=1, norm=norm, relu=cf.relu)
+        self.linear_class = nn.Linear(cf.end_filts * 4, cf.head_classes)
+        self.linear_bbox = nn.Linear(cf.end_filts * 4, cf.head_classes * 2 * self.dim)
+
+    def forward(self, x, rois):
+        """
+        :param x: input feature maps (b, in_channels, y, x, (z))
+        :param rois: normalized box coordinates as proposed by the RPN to be forwarded through
+        the second stage (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ix)). Proposals of all batch elements
+        have been merged to one vector, while the origin info has been stored for re-allocation.
+        :return: mrcnn_class_logits (n_proposals, n_head_classes)
+        :return: mrcnn_bbox (n_proposals, n_head_classes, 2 * dim) predicted corrections to be applied to proposals for refinement.
+ """ + x = pyramid_roi_align(x, rois, self.pool_size, self.pyramid_levels, self.dim) + x = self.conv1(x) + x = self.conv2(x) + x = x.view(-1, self.in_channels * 4) + mrcnn_class_logits = self.linear_class(x) + mrcnn_bbox = self.linear_bbox(x) + mrcnn_bbox = mrcnn_bbox.view(mrcnn_bbox.size()[0], -1, self.dim * 2) + + return [mrcnn_class_logits, mrcnn_bbox] + + + +class Mask(nn.Module): + """ + Head network for proposal-based mask segmentation. Performs RoiAlign, some convolutions and applies sigmoid on the + output logits to allow for overlapping classes. + """ + def __init__(self, cf, conv): + super(Mask, self).__init__() + self.pool_size = cf.mask_pool_size + self.pyramid_levels = cf.pyramid_levels + self.dim = conv.dim + self.conv1 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + self.conv2 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + self.conv3 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + self.conv4 = conv(cf.end_filts, cf.end_filts, ks=3, stride=1, pad=1, norm=cf.norm, relu=cf.relu) + if conv.dim == 2: + self.deconv = nn.ConvTranspose2d(cf.end_filts, cf.end_filts, kernel_size=2, stride=2) + else: + self.deconv = nn.ConvTranspose3d(cf.end_filts, cf.end_filts, kernel_size=2, stride=2) + + self.relu = nn.ReLU(inplace=True) if cf.relu == 'relu' else nn.LeakyReLU(inplace=True) + self.conv5 = conv(cf.end_filts, cf.head_classes, ks=1, stride=1, relu=None) + self.sigmoid = nn.Sigmoid() + + def forward(self, x, rois): + """ + :param x: input feature maps (b, in_channels, y, x, (z)) + :param rois: normalized box coordinates as proposed by the RPN to be forwarded through + the second stage (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ix). Proposals of all batch elements + have been merged to one vector, while the origin info has been stored for re-allocation. 
+ :return: x: masks (n_sampled_proposals (n_detections in inference), n_classes, y, x, (z)) + """ + x = pyramid_roi_align(x, rois, self.pool_size, self.pyramid_levels, self.dim) + x = self.conv1(x) + x = self.conv2(x) + x = self.conv3(x) + x = self.conv4(x) + x = self.relu(self.deconv(x)) + x = self.conv5(x) + x = self.sigmoid(x) + return x + + +############################################################ +# Loss Functions +############################################################ + +def compute_rpn_class_loss(rpn_match, rpn_class_logits, shem_poolsize): + """ + :param rpn_match: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors. + :param rpn_class_logits: (n_anchors, 2). logits from RPN classifier. + :param shem_poolsize: int. factor of top-k candidates to draw from per negative sample + (stochastic-hard-example-mining). + :return: loss: torch tensor + :return: np_neg_ix: 1D array containing indices of the neg_roi_logits, which have been sampled for training. + """ + + # filter out neutral anchors. + pos_indices = torch.nonzero(rpn_match == 1) + neg_indices = torch.nonzero(rpn_match == -1) + + # loss for positive samples + if 0 not in pos_indices.size(): + pos_indices = pos_indices.squeeze(1) + roi_logits_pos = rpn_class_logits[pos_indices] + pos_loss = F.cross_entropy(roi_logits_pos, torch.LongTensor([1] * pos_indices.shape[0]).cuda()) + else: + pos_loss = torch.FloatTensor([0]).cuda() + + # loss for negative samples: draw hard negative examples (SHEM) + # that match the number of positive samples, but at least 1. 
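The stochastic hard example mining referenced above (implemented in this repo as `mutils.shem`) can be approximated by the following standalone sketch; `shem_sketch` and its exact pool construction are illustrative assumptions, not the repository's implementation:

```python
import numpy as np

def shem_sketch(neg_fg_probs, n_keep, poolsize, rng=None):
    # Hypothetical stand-in for mutils.shem: build a pool of the
    # n_keep * poolsize "hardest" negatives (highest foreground
    # probability), then draw n_keep of them at random for stochasticity.
    rng = rng or np.random.default_rng(0)
    pool = np.argsort(neg_fg_probs)[::-1][:n_keep * poolsize]
    return rng.choice(pool, size=min(n_keep, pool.size), replace=False)

picked = shem_sketch(np.array([0.05, 0.9, 0.6, 0.1, 0.8]), n_keep=2, poolsize=2)
```

With `poolsize=2` and two positives, the pool here contains the four highest-scoring negatives, and two of them are sampled without replacement.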
+    if 0 not in neg_indices.size():
+        neg_indices = neg_indices.squeeze(1)
+        roi_logits_neg = rpn_class_logits[neg_indices]
+        negative_count = np.max((1, pos_indices.cpu().data.numpy().size))
+        roi_probs_neg = F.softmax(roi_logits_neg, dim=1)
+        neg_ix = mutils.shem(roi_probs_neg, negative_count, shem_poolsize)
+        neg_loss = F.cross_entropy(roi_logits_neg[neg_ix], torch.LongTensor([0] * neg_ix.shape[0]).cuda())
+        np_neg_ix = neg_ix.cpu().data.numpy()
+    else:
+        neg_loss = torch.FloatTensor([0]).cuda()
+        np_neg_ix = np.array([]).astype('int32')
+
+    loss = (pos_loss + neg_loss) / 2
+    return loss, np_neg_ix
+
+
+def compute_rpn_bbox_loss(rpn_target_deltas, rpn_pred_deltas, rpn_match):
+    """
+    :param rpn_target_deltas: (b, n_positive_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd)))).
+    Uses 0 padding to fill in unused bbox deltas.
+    :param rpn_pred_deltas: predicted deltas from RPN. (b, n_anchors, (dy, dx, (dz), log(dh), log(dw), (log(dd))))
+    :param rpn_match: (n_anchors). [-1, 0, 1] for negative, neutral, and positive matched anchors.
+    :return: loss: torch 1D tensor.
+    """
+    if 0 not in torch.nonzero(rpn_match == 1).size():
+
+        indices = torch.nonzero(rpn_match == 1).squeeze(1)
+        # Pick bbox deltas that contribute to the loss
+        rpn_pred_deltas = rpn_pred_deltas[indices]
+        # Trim target bounding box deltas to the same length as rpn_pred_deltas.
+        target_deltas = rpn_target_deltas[:rpn_pred_deltas.size()[0], :]
+        # Smooth L1 loss
+        loss = F.smooth_l1_loss(rpn_pred_deltas, target_deltas)
+    else:
+        loss = torch.FloatTensor([0]).cuda()
+
+    return loss
+
+
+def compute_mrcnn_class_loss(target_class_ids, pred_class_logits):
+    """
+    :param target_class_ids: (n_sampled_rois) batch dimension was merged into roi dimension.
+    :param pred_class_logits: (n_sampled_rois, n_classes)
+    :return: loss: torch 1D tensor.
+ """ + if 0 not in target_class_ids.size(): + loss = F.cross_entropy(pred_class_logits, target_class_ids.long()) + else: + loss = torch.FloatTensor([0.]).cuda() + + return loss + + +def compute_mrcnn_bbox_loss(mrcnn_target_deltas, mrcnn_pred_deltas, target_class_ids): + """ + :param mrcnn_target_deltas: (n_sampled_rois, (dy, dx, (dz), log(dh), log(dw), (log(dh))) + :param mrcnn_pred_deltas: (n_sampled_rois, n_classes, (dy, dx, (dz), log(dh), log(dw), (log(dh))) + :param target_class_ids: (n_sampled_rois) + :return: loss: torch 1D tensor. + """ + if 0 not in torch.nonzero(target_class_ids > 0).size(): + positive_roi_ix = torch.nonzero(target_class_ids > 0)[:, 0] + positive_roi_class_ids = target_class_ids[positive_roi_ix].long() + target_bbox = mrcnn_target_deltas[positive_roi_ix, :].detach() + pred_bbox = mrcnn_pred_deltas[positive_roi_ix, positive_roi_class_ids, :] + loss = F.smooth_l1_loss(pred_bbox, target_bbox) + else: + loss = torch.FloatTensor([0]).cuda() + + return loss + + +def compute_mrcnn_mask_loss(target_masks, pred_masks, target_class_ids): + """ + :param target_masks: (n_sampled_rois, y, x, (z)) A float32 tensor of values 0 or 1. Uses zero padding to fill array. + :param pred_masks: (n_sampled_rois, n_classes, y, x, (z)) float32 tensor with values between [0, 1]. + :param target_class_ids: (n_sampled_rois) + :return: loss: torch 1D tensor. + """ + if 0 not in torch.nonzero(target_class_ids > 0).size(): + # Only positive ROIs contribute to the loss. And only + # the class specific mask of each ROI. 
+        positive_ix = torch.nonzero(target_class_ids > 0)[:, 0]
+        positive_class_ids = target_class_ids[positive_ix].long()
+        y_true = target_masks[positive_ix, :, :].detach()
+        y_pred = pred_masks[positive_ix, positive_class_ids, :, :]
+        loss = F.binary_cross_entropy(y_pred, y_true)
+    else:
+        loss = torch.FloatTensor([0]).cuda()
+
+    return loss
+
+
+############################################################
+#  Helper Layers
+############################################################
+
+def proposal_layer(rpn_pred_probs, rpn_pred_deltas, proposal_count, anchors, cf):
+    """
+    Receives anchor scores and selects a subset to pass as proposals to the second stage. Filtering is done based on
+    anchor scores and non-max suppression to remove overlaps. It also applies bounding box refinement deltas to anchors.
+    :param rpn_pred_probs: (b, n_anchors, 2)
+    :param rpn_pred_deltas: (b, n_anchors, (y, x, (z), log(h), log(w), (log(d))))
+    :return: batch_normalized_boxes: Proposals in normalized coordinates
+    (b, proposal_count, (y1, x1, y2, x2, (z1), (z2)))
+    :return: batch_out_proposals: Box coords + RPN foreground scores
+    for monitoring/plotting (b, proposal_count, (y1, x1, y2, x2, (z1), (z2), score))
+    """
+    batch_scores = rpn_pred_probs[:, :, 1]
+    batch_deltas = rpn_pred_deltas
+    batch_anchors = anchors
+    batch_normalized_boxes = []
+    batch_out_proposals = []
+
+    # loop over batch dimension.
+    for ix in range(batch_scores.shape[0]):
+
+        scores = batch_scores[ix]
+        deltas = batch_deltas[ix]
+        anchors = batch_anchors.clone()
+        # norm deltas
+        std_dev = torch.from_numpy(cf.rpn_bbox_std_dev[None]).float().cuda()
+        deltas = deltas * std_dev
+
+        # improve performance by trimming to top anchors by score
+        # and doing the rest on the smaller subset.
+        pre_nms_limit = min(cf.pre_nms_limit, anchors.size()[0])
+        scores, order = scores.sort(descending=True)
+        order = order[:pre_nms_limit]
+        scores = scores[:pre_nms_limit]
+        deltas = deltas[order, :]
+        anchors = anchors[order, :]
+
+        # apply deltas to anchors to get refined anchors and filter with non-maximum suppression.
+        if batch_deltas.shape[-1] == 4:
+            boxes = mutils.apply_box_deltas_2D(anchors, deltas)
+            boxes = mutils.clip_boxes_2D(boxes, cf.window)
+            keep = nms_2D(torch.cat((boxes, scores.unsqueeze(1)), 1), cf.rpn_nms_threshold)
+            norm = torch.from_numpy(cf.scale).float().cuda()
+
+        else:
+            boxes = mutils.apply_box_deltas_3D(anchors, deltas)
+            boxes = mutils.clip_boxes_3D(boxes, cf.window)
+            keep = nms_3D(torch.cat((boxes, scores.unsqueeze(1)), 1), cf.rpn_nms_threshold)
+            norm = torch.from_numpy(cf.scale).float().cuda()
+
+        keep = keep[:proposal_count]
+        boxes = boxes[keep, :]
+        rpn_scores = scores[keep][:, None]
+
+        # pad missing boxes with 0.
+        if boxes.shape[0] < proposal_count:
+            n_pad_boxes = proposal_count - boxes.shape[0]
+            zeros = torch.zeros([n_pad_boxes, boxes.shape[1]]).cuda()
+            boxes = torch.cat([boxes, zeros], dim=0)
+            zeros = torch.zeros([n_pad_boxes, rpn_scores.shape[1]]).cuda()
+            rpn_scores = torch.cat([rpn_scores, zeros], dim=0)
+
+        # concat box and score info for monitoring/plotting.
+        batch_out_proposals.append(torch.cat((boxes, rpn_scores), 1).cpu().data.numpy())
+        # normalize dimensions to range of 0 to 1.
+        normalized_boxes = boxes / norm
+        # add back batch dimension
+        batch_normalized_boxes.append(normalized_boxes.unsqueeze(0))
+
+    batch_normalized_boxes = torch.cat(batch_normalized_boxes)
+    batch_out_proposals = np.array(batch_out_proposals)
+    return batch_normalized_boxes, batch_out_proposals
+
+
+
+def pyramid_roi_align(feature_maps, rois, pool_size, pyramid_levels, dim):
+    """
+    Implements ROI Pooling on multiple levels of the feature pyramid.
+    :param feature_maps: list of feature maps, each of shape (b, c, y, x, (z))
+    :param rois: proposals (normalized coords.) as returned by the RPN. contain info about original batch element allocation.
+    (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ixs))
+    :param pool_size: list of poolsizes in dims: [x, y, (z)]
+    :param pyramid_levels: list. [0, 1, 2, ...]
+    :return: pooled: pooled feature map rois (n_proposals, c, poolsize_y, poolsize_x, (poolsize_z))
+
+    Output:
+    Pooled regions in the shape: [num_boxes, height, width, channels].
+    The width and height are those specified in pool_size.
+    """
+    boxes = rois[:, :dim*2]
+    batch_ixs = rois[:, dim*2]
+
+    # Assign each ROI to a level in the pyramid based on the ROI area.
+    if dim == 2:
+        y1, x1, y2, x2 = boxes.chunk(4, dim=1)
+    else:
+        y1, x1, y2, x2, z1, z2 = boxes.chunk(6, dim=1)
+
+    h = y2 - y1
+    w = x2 - x1
+
+    # Equation 1 in https://arxiv.org/abs/1612.03144. Account for
+    # the fact that our coordinates are normalized here:
+    # divide sqrt(h*w) by 1 instead of image_area.
+    roi_level = (4 + mutils.log2(torch.sqrt(h*w))).round().int().clamp(pyramid_levels[0], pyramid_levels[-1])
+    # if the pyramid contains an additional level P6, adapt the roi_level assignment accordingly.
+    if len(pyramid_levels) == 5:
+        roi_level[h*w > 0.65] = 5
+
+    # Loop through levels and apply ROI pooling to each.
+    pooled = []
+    box_to_level = []
+    for level_ix, level in enumerate(pyramid_levels):
+        ix = roi_level == level
+        if not ix.any():
+            continue
+        ix = torch.nonzero(ix)[:, 0]
+        level_boxes = boxes[ix, :]
+        # re-assign rois to feature map of original batch element.
+        ind = batch_ixs[ix].int()
+
+        # Keep track of which box is mapped to which level
+        box_to_level.append(ix)
+
+        # Stop gradient propagation to ROI proposals
+        level_boxes = level_boxes.detach()
+
+        # Crop and Resize
+        # From Mask R-CNN paper: "We sample four regular locations, so
+        # that we can evaluate either max or average pooling.
In fact,
+        # interpolating only a single value at each bin center (without
+        # pooling) is nearly as effective."
+        #
+        # Here we use the simplified approach of a single value per bin,
+        # which is how it is done in tf.crop_and_resize()
+        #
+        # Also fixed a bug from the original implementation, reported in:
+        # https://hackernoon.com/how-tensorflows-tf-image-resize-stole-60-days-of-my-life-aba5eb093f35
+
+        if len(pool_size) == 2:
+            pooled_features = ra2D(pool_size[0], pool_size[1], 0)(feature_maps[level_ix], level_boxes, ind)
+        else:
+            pooled_features = ra3D(pool_size[0], pool_size[1], pool_size[2], 0)(feature_maps[level_ix], level_boxes, ind)
+
+        pooled.append(pooled_features)
+
+
+    # Pack pooled features into one tensor
+    pooled = torch.cat(pooled, dim=0)
+
+    # Pack box_to_level mapping into one array and add another
+    # column representing the order of pooled boxes
+    box_to_level = torch.cat(box_to_level, dim=0)
+
+    # Rearrange pooled features to match the order of the original boxes
+    _, box_to_level = torch.sort(box_to_level)
+    pooled = pooled[box_to_level, :, :]
+
+    return pooled
+
+
+
+def detection_target_layer(batch_proposals, batch_mrcnn_class_scores, batch_gt_class_ids, batch_gt_boxes, cf):
+    """
+    Subsamples proposals for mrcnn losses and generates targets. Sampling is done per batch element: this seems to have
+    positive effects on training, as opposed to sampling over the entire batch. Negatives are sampled via
+    stochastic hard example mining (SHEM), where a number of negative proposals is drawn from a larger pool of
+    highest-scoring proposals for stochasticity. Scoring is obtained here as the max over all foreground probabilities
+    as returned by the mrcnn classifier (this worked better than loss-based class balancing methods like
+    "online hard example mining" or "focal loss").
+
+    :param batch_proposals: (n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ixs)).
+    boxes as proposed by the RPN. n_proposals here is determined by batch_size * POST_NMS_ROIS.
+    :param batch_mrcnn_class_scores: (n_proposals, n_classes)
+    :param batch_gt_class_ids: list over batch elements. Each element is a list over the corresponding roi target labels.
+    :param batch_gt_boxes: list over batch elements. Each element is a list over the corresponding roi target coordinates.
+    :param batch_gt_masks: list over batch elements. Each element is a binary mask of shape (n_gt_rois, y, x, (z), c)
+    :return: sample_indices: (n_sampled_rois) indices of sampled proposals to be used for loss functions.
+    :return: target_class_ids: (n_sampled_rois) containing target class labels of sampled proposals.
+    :return: target_deltas: (n_sampled_rois, 2 * dim) containing target deltas of sampled proposals for box refinement.
+    :return: target_masks: (n_sampled_rois, y, x, (z)) containing target masks of sampled proposals.
+    """
+    # normalization of target coordinates
+    if cf.dim == 2:
+        h, w = cf.patch_size
+        scale = torch.from_numpy(np.array([h, w, h, w])).float().cuda()
+    else:
+        h, w, z = cf.patch_size
+        scale = torch.from_numpy(np.array([h, w, h, w, z, z])).float().cuda()
+
+
+    positive_count = 0
+    negative_count = 0
+    sample_positive_indices = []
+    sample_negative_indices = []
+    sample_deltas = []
+    sample_class_ids = []
+
+    # loop over batch and get positive and negative sample rois.
+    for b in range(len(batch_gt_class_ids)):
+
+        gt_class_ids = torch.from_numpy(batch_gt_class_ids[b]).int().cuda()
+        if np.any(batch_gt_class_ids[b] > 0):  # skip roi selection for no gt images.
+            gt_boxes = torch.from_numpy(batch_gt_boxes[b]).float().cuda() / scale
+        else:
+            gt_boxes = torch.FloatTensor().cuda()
+
+        # get proposals and indices of current batch element.
+ proposals = batch_proposals[batch_proposals[:, -1] == b][:, :-1]
+ batch_element_indices = torch.nonzero(batch_proposals[:, -1] == b).squeeze(1)
+
+ # Compute overlaps matrix [proposals, gt_boxes]
+ if 0 not in gt_boxes.size():
+ if gt_boxes.shape[1] == 4:
+ overlaps = mutils.bbox_overlaps_2D(proposals, gt_boxes)
+ else:
+ overlaps = mutils.bbox_overlaps_3D(proposals, gt_boxes)
+
+ # Determine positive and negative ROIs
+ roi_iou_max = torch.max(overlaps, dim=1)[0]
+ # 1. Positive ROIs are those with >= 0.5 IoU with a GT box
+ positive_roi_bool = roi_iou_max >= (0.5 if cf.dim == 2 else 0.3)
+ # 2. Negative ROIs are those with < 0.1 IoU with every GT box.
+ negative_roi_bool = roi_iou_max < (0.1 if cf.dim == 2 else 0.01)
+ else:
+ positive_roi_bool = torch.FloatTensor().cuda()
+ negative_roi_bool = torch.from_numpy(np.array([1]*proposals.shape[0])).cuda()
+
+ # Sample Positive ROIs
+ if 0 not in torch.nonzero(positive_roi_bool).size():
+ positive_indices = torch.nonzero(positive_roi_bool).squeeze(1)
+ positive_samples = int(cf.train_rois_per_image * cf.roi_positive_ratio)
+ rand_idx = torch.randperm(positive_indices.size()[0])
+ rand_idx = rand_idx[:positive_samples].cuda()
+ positive_indices = positive_indices[rand_idx]
+ positive_samples = positive_indices.size()[0]
+ positive_rois = proposals[positive_indices, :]
+ # Assign positive ROIs to GT boxes. 
+ positive_overlaps = overlaps[positive_indices, :] + roi_gt_box_assignment = torch.max(positive_overlaps, dim=1)[1] + roi_gt_boxes = gt_boxes[roi_gt_box_assignment, :] + roi_gt_class_ids = gt_class_ids[roi_gt_box_assignment] + + # Compute bbox refinement targets for positive ROIs + deltas = mutils.box_refinement(positive_rois, roi_gt_boxes) + std_dev = torch.from_numpy(cf.bbox_std_dev).float().cuda() + deltas /= std_dev + + sample_positive_indices.append(batch_element_indices[positive_indices]) + sample_deltas.append(deltas) + sample_class_ids.append(roi_gt_class_ids) + positive_count += positive_samples + else: + positive_samples = 0 + + # Negative ROIs. Add enough to maintain positive:negative ratio, but at least 1. Sample via SHEM. + if 0 not in torch.nonzero(negative_roi_bool).size(): + negative_indices = torch.nonzero(negative_roi_bool).squeeze(1) + r = 1.0 / cf.roi_positive_ratio + b_neg_count = np.max((int(r * positive_samples - positive_samples), 1)) + roi_probs_neg = batch_mrcnn_class_scores[batch_element_indices[negative_indices]] + raw_sampled_indices = mutils.shem(roi_probs_neg, b_neg_count, cf.shem_poolsize) + sample_negative_indices.append(batch_element_indices[negative_indices[raw_sampled_indices]]) + negative_count += raw_sampled_indices.size()[0] + + if len(sample_positive_indices) > 0: + target_deltas = torch.cat(sample_deltas) + target_class_ids = torch.cat(sample_class_ids) + + # Pad target information with zeros for negative ROIs. 
+ if positive_count > 0 and negative_count > 0:
+ sample_indices = torch.cat((torch.cat(sample_positive_indices), torch.cat(sample_negative_indices)), dim=0)
+ zeros = torch.zeros(negative_count).int().cuda()
+ target_class_ids = torch.cat([target_class_ids, zeros], dim=0)
+ zeros = torch.zeros(negative_count, cf.dim * 2).cuda()
+ target_deltas = torch.cat([target_deltas, zeros], dim=0)
+ elif positive_count > 0:
+ sample_indices = torch.cat(sample_positive_indices)
+ elif negative_count > 0:
+ sample_indices = torch.cat(sample_negative_indices)
+ zeros = torch.zeros(negative_count).int().cuda()
+ target_class_ids = zeros
+ zeros = torch.zeros(negative_count, cf.dim * 2).cuda()
+ target_deltas = zeros
+ else:
+ sample_indices = torch.LongTensor().cuda()
+ target_class_ids = torch.IntTensor().cuda()
+ target_deltas = torch.FloatTensor().cuda()
+
+ return sample_indices, target_class_ids, target_deltas
+
+
+############################################################
+# Output Handler
+############################################################
+
+def refine_detections(rois, probs, deltas, batch_ixs, cf):
+ """
+ Refine classified proposals, filter overlaps and return final detections.
+
+ :param rois: (n_proposals, 2 * dim) normalized boxes as proposed by RPN. n_proposals = batch_size * POST_NMS_ROIS
+ :param probs: (n_proposals, n_classes) softmax probabilities for all rois as predicted by mrcnn classifier.
+ :param deltas: (n_proposals, n_classes, 2 * dim) box refinement deltas as predicted by mrcnn bbox regressor.
+ :param batch_ixs: (n_proposals) batch element assignment info for re-allocation.
+ :return: result: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score))
+ """
+ # class IDs per ROI. Since scores of all classes are of interest (not just max class), all are kept at this point.
+ class_ids = []
+ fg_classes = cf.head_classes - 1
+ # repeat vectors to fill in predictions for all foreground classes. 
+ for ii in range(1, fg_classes + 1):
+ class_ids += [ii] * rois.shape[0]
+ class_ids = torch.from_numpy(np.array(class_ids)).cuda()
+
+ rois = rois.repeat(fg_classes, 1)
+ probs = probs.repeat(fg_classes, 1)
+ deltas = deltas.repeat(fg_classes, 1, 1)
+ batch_ixs = batch_ixs.repeat(fg_classes)
+
+ # get class-specific scores and bounding box deltas
+ idx = torch.arange(class_ids.size()[0]).long().cuda()
+ class_scores = probs[idx, class_ids]
+ deltas_specific = deltas[idx, class_ids]
+ batch_ixs = batch_ixs[idx]
+
+ # apply bounding box deltas. re-scale to image coordinates.
+ std_dev = torch.from_numpy(np.reshape(cf.rpn_bbox_std_dev, [1, cf.dim * 2])).float().cuda()
+ scale = torch.from_numpy(cf.scale).float().cuda()
+ refined_rois = mutils.apply_box_deltas_2D(rois, deltas_specific * std_dev) * scale if cf.dim == 2 else \
+ mutils.apply_box_deltas_3D(rois, deltas_specific * std_dev) * scale
+
+ # round and cast to int since we're dealing with pixels now
+ refined_rois = mutils.clip_to_window(cf.window, refined_rois)
+ refined_rois = torch.round(refined_rois)
+
+ # filter out low confidence boxes
+ keep = idx
+ keep_bool = (class_scores >= cf.model_min_confidence)
+ if 0 not in torch.nonzero(keep_bool).size():
+
+ score_keep = torch.nonzero(keep_bool)[:, 0]
+ pre_nms_class_ids = class_ids[score_keep]
+ pre_nms_rois = refined_rois[score_keep]
+ pre_nms_scores = class_scores[score_keep]
+ pre_nms_batch_ixs = batch_ixs[score_keep]
+
+ for j, b in enumerate(mutils.unique1d(pre_nms_batch_ixs)):
+
+ bixs = torch.nonzero(pre_nms_batch_ixs == b)[:, 0]
+ bix_class_ids = pre_nms_class_ids[bixs]
+ bix_rois = pre_nms_rois[bixs]
+ bix_scores = pre_nms_scores[bixs]
+
+ for i, class_id in enumerate(mutils.unique1d(bix_class_ids)):
+
+ ixs = torch.nonzero(bix_class_ids == class_id)[:, 0]
+ # nms expects boxes sorted by score. 
+ ix_rois = bix_rois[ixs] + ix_scores = bix_scores[ixs] + ix_scores, order = ix_scores.sort(descending=True) + ix_rois = ix_rois[order, :] + + if cf.dim == 2: + class_keep = nms_2D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold) + else: + class_keep = nms_3D(torch.cat((ix_rois, ix_scores.unsqueeze(1)), dim=1), cf.detection_nms_threshold) + + # map indices back. + class_keep = keep[score_keep[bixs[ixs[order[class_keep]]]]] + # merge indices over classes for current batch element + b_keep = class_keep if i == 0 else mutils.unique1d(torch.cat((b_keep, class_keep))) + + # only keep top-k boxes of current batch-element + top_ids = class_scores[b_keep].sort(descending=True)[1][:cf.model_max_instances_per_batch_element] + b_keep = b_keep[top_ids] + + # merge indices over batch elements. + batch_keep = b_keep if j == 0 else mutils.unique1d(torch.cat((batch_keep, b_keep))) + + keep = batch_keep + + else: + keep = torch.tensor([0]).long().cuda() + + # arrange output + result = torch.cat((refined_rois[keep], + batch_ixs[keep].unsqueeze(1), + class_ids[keep].unsqueeze(1).float(), + class_scores[keep].unsqueeze(1)), dim=1) + + return result + + +def get_results(cf, img_shape, detections, seg_logits, box_results_list=None): + """ + Restores batch dimension of merged detections, unmolds detections, creates and fills results dict. + :param img_shape: + :param detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score) + :param detection_masks: (n_final_detections, n_classes, y, x, (z)) raw molded masks as returned by mask-head. + :param box_results_list: None or list of output boxes for monitoring/plotting. + each element is a list of boxes per batch element. + :param return_masks: boolean. If True, full resolution masks are returned for all proposals (speed trade-off). + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. 
each box is a dictionary:
+ [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...]
+ 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, 1] only fg. vs. bg for now.
+ class-specific return of masks will come with implementation of instance segmentation evaluation.
+ """
+ detections = detections.cpu().data.numpy()
+
+ # restore batch dimension of merged detections using the batch_ix info.
+ batch_ixs = detections[:, cf.dim*2]
+ detections = [detections[batch_ixs == ix] for ix in range(img_shape[0])]
+
+ # for test_forward, where no previous list exists.
+ if box_results_list is None:
+ box_results_list = [[] for _ in range(img_shape[0])]
+
+ seg_preds = []
+ # loop over batch and unmold detections.
+ for ix in range(img_shape[0]):
+
+ if 0 not in detections[ix].shape:
+ boxes = detections[ix][:, :2 * cf.dim].astype(np.int32)
+ class_ids = detections[ix][:, 2 * cf.dim + 1].astype(np.int32)
+ scores = detections[ix][:, 2 * cf.dim + 2]
+
+ # Filter out detections with zero area. Often only happens in early
+ # stages of training when the network weights are still a bit random.
+ if cf.dim == 2:
+ exclude_ix = np.where((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0]
+ else:
+ exclude_ix = np.where(
+ (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 5] - boxes[:, 4]) <= 0)[0]
+
+ if exclude_ix.shape[0] > 0:
+ boxes = np.delete(boxes, exclude_ix, axis=0)
+ class_ids = np.delete(class_ids, exclude_ix, axis=0)
+ scores = np.delete(scores, exclude_ix, axis=0)
+
+ # add final predictions to results.
+ if 0 not in boxes.shape:
+ for ix2, score in enumerate(scores):
+ if score >= cf.model_min_confidence:
+ box_results_list[ix].append({'box_coords': boxes[ix2], 'box_score': score,
+ 'box_type': 'det', 'box_pred_class_id': class_ids[ix2]})
+
+ # create and fill results dictionary.
+ results_dict = {'boxes': box_results_list}
+ if seg_logits is None:
+ # output dummy segmentation for retina_net. 
+ results_dict['seg_preds'] = np.zeros(img_shape)[:, 0][:, np.newaxis]
+ else:
+ # output label maps for retina_unet.
+ results_dict['seg_preds'] = F.softmax(seg_logits, 1).argmax(1).cpu().data.numpy()[:, np.newaxis].astype('uint8')
+
+ return results_dict
+
+
+############################################################
+# Mask R-CNN Class
+############################################################
+
+class net(nn.Module):
+
+
+ def __init__(self, cf, logger):
+
+ super(net, self).__init__()
+ self.cf = cf
+ self.logger = logger
+ self.build()
+
+ if self.cf.weight_init is not None:
+ logger.info("using pytorch weight init of type {}".format(self.cf.weight_init))
+ mutils.initialize_weights(self)
+ else:
+ logger.info("using default pytorch weight init")
+
+
+ def build(self):
+ """Build Mask R-CNN architecture."""
+
+ # Image size must be divisible by 2 multiple times.
+ h, w = self.cf.patch_size[:2]
+ if h / 2**5 != int(h / 2**5) or w / 2**5 != int(w / 2**5):
+ raise Exception("Image size must be divisible by 2 at least 5 times "
+ "to avoid fractions when downscaling and upscaling. "
+ "For example, use 256, 320, 384, 448, 512, ... etc. ")
+
+ # instantiate abstract multi-dimensional conv class and backbone class.
+ conv = mutils.NDConvGenerator(self.cf.dim)
+ backbone = utils.import_module('bbone', self.cf.backbone_path)
+
+ # build Anchors, FPN, RPN, Classifier / Bbox-Regressor -head, Mask-head
+ self.np_anchors = mutils.generate_pyramid_anchors(self.logger, self.cf)
+ self.anchors = torch.from_numpy(self.np_anchors).float().cuda()
+ self.fpn = backbone.FPN(self.cf, conv, operate_stride1=True)
+ self.rpn = RPN(self.cf, conv)
+ self.classifier = Classifier(self.cf, conv)
+ self.mask = Mask(self.cf, conv)
+ self.final_conv = conv(self.cf.end_filts, self.cf.num_seg_classes, ks=1, pad=0, norm=self.cf.norm, relu=None)
+
+
+ def train_forward(self, batch, is_validation=False):
+ """
+ train method (also used for validation monitoring). 
wrapper around forward pass of network. prepares input data + for processing, computes losses, and stores outputs in a dictionary. + :param batch: dictionary containing 'data', 'seg', etc. + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, n_classes]. + 'torch_loss': 1D torch tensor for backprop. + 'class_loss': classification loss for monitoring. + """ + img = batch['data'] + gt_class_ids = batch['roi_labels'] + gt_boxes = batch['bb_target'] + axes = (0, 2, 3, 1) if self.cf.dim == 2 else (0, 2, 3, 4, 1) + var_seg_ohe = torch.FloatTensor(mutils.get_one_hot_encoding(batch['seg'], self.cf.num_seg_classes)).cuda() + var_seg = torch.LongTensor(batch['seg']).cuda() + + + img = torch.from_numpy(img).float().cuda() + batch_rpn_class_loss = torch.FloatTensor([0]).cuda() + batch_rpn_bbox_loss = torch.FloatTensor([0]).cuda() + + # list of output boxes for monitoring/plotting. each element is a list of boxes per batch element. + box_results_list = [[] for _ in range(img.shape[0])] + + #forward passes. 1. general forward pass, where no activations are saved in second stage (for performance + # monitoring and loss sampling). 2. second stage forward pass of sampled rois with stored activations for backprop. + rpn_class_logits, rpn_pred_deltas, proposal_boxes, detections, seg_logits = self.forward(img) + mrcnn_class_logits, mrcnn_pred_deltas, target_class_ids, mrcnn_target_deltas, \ + sample_proposals = self.loss_samples_forward(gt_class_ids, gt_boxes) + + # loop over batch + for b in range(img.shape[0]): + if len(gt_boxes[b]) > 0: + + # add gt boxes to output list for monitoring. 
+ for ix in range(len(gt_boxes[b])): + box_results_list[b].append({'box_coords': batch['bb_target'][b][ix], + 'box_label': batch['roi_labels'][b][ix], 'box_type': 'gt'}) + + # match gt boxes with anchors to generate targets for RPN losses. + rpn_match, rpn_target_deltas = mutils.gt_anchor_matching(self.cf, self.np_anchors, gt_boxes[b]) + + # add positive anchors used for loss to output list for monitoring. + pos_anchors = mutils.clip_boxes_numpy(self.np_anchors[np.argwhere(rpn_match == 1)][:, 0], img.shape[2:]) + for p in pos_anchors: + box_results_list[b].append({'box_coords': p, 'box_type': 'pos_anchor'}) + + else: + rpn_match = np.array([-1]*self.np_anchors.shape[0]) + rpn_target_deltas = np.array([0]) + + rpn_match = torch.from_numpy(rpn_match).cuda() + rpn_target_deltas = torch.from_numpy(rpn_target_deltas).float().cuda() + + # compute RPN losses. + rpn_class_loss, neg_anchor_ix = compute_rpn_class_loss(rpn_match, rpn_class_logits[b], self.cf.shem_poolsize) + rpn_bbox_loss = compute_rpn_bbox_loss(rpn_target_deltas, rpn_pred_deltas[b], rpn_match) + batch_rpn_class_loss += rpn_class_loss / img.shape[0] + batch_rpn_bbox_loss += rpn_bbox_loss / img.shape[0] + + # add negative anchors used for loss to output list for monitoring. + neg_anchors = mutils.clip_boxes_numpy(self.np_anchors[np.argwhere(rpn_match == -1)][0, neg_anchor_ix], img.shape[2:]) + for n in neg_anchors: + box_results_list[b].append({'box_coords': n, 'box_type': 'neg_anchor'}) + + # add highest scoring proposals to output list for monitoring. + rpn_proposals = proposal_boxes[b][proposal_boxes[b, :, -1].argsort()][::-1] + for r in rpn_proposals[:self.cf.n_plot_rpn_props, :-1]: + box_results_list[b].append({'box_coords': r, 'box_type': 'prop'}) + + # add positive and negative roi samples used for mrcnn losses to output list for monitoring. 
+ if 0 not in sample_proposals.shape: + rois = mutils.clip_to_window(self.cf.window, sample_proposals).cpu().data.numpy() + for ix, r in enumerate(rois): + box_results_list[int(r[-1])].append({'box_coords': r[:-1] * self.cf.scale, + 'box_type': 'pos_class' if target_class_ids[ix] > 0 else 'neg_class'}) + + batch_rpn_class_loss = batch_rpn_class_loss + batch_rpn_bbox_loss = batch_rpn_bbox_loss + + # compute mrcnn losses. + mrcnn_class_loss = compute_mrcnn_class_loss(target_class_ids, mrcnn_class_logits) + mrcnn_bbox_loss = compute_mrcnn_bbox_loss(mrcnn_target_deltas, mrcnn_pred_deltas, target_class_ids) + + # mrcnn can be run without pixelwise annotations available (Faster R-CNN mode). + # In this case, the mask_loss is taken out of training. + # if not self.cf.frcnn_mode: + # mrcnn_mask_loss = compute_mrcnn_mask_loss(target_mask, mrcnn_pred_mask, target_class_ids) + # else: + # mrcnn_mask_loss = torch.FloatTensor([0]).cuda() + + seg_loss_dice = 1 - mutils.batch_dice(F.softmax(seg_logits, dim=1), var_seg_ohe) + seg_loss_ce = F.cross_entropy(seg_logits, var_seg[:, 0]) + + loss = batch_rpn_class_loss + batch_rpn_bbox_loss + mrcnn_class_loss + mrcnn_bbox_loss + (seg_loss_dice + seg_loss_ce) / 2 + + # monitor RPN performance: detection count = the number of correctly matched proposals per fg-class. + dcount = [list(target_class_ids.cpu().data.numpy()).count(c) for c in np.arange(self.cf.head_classes)[1:]] + + # run unmolding of predictions for monitoring and merge all results to one dictionary. 
+ results_dict = get_results(self.cf, img.shape, detections, seg_logits, box_results_list) + results_dict['torch_loss'] = loss + results_dict['monitor_values'] = {'loss': loss.item(), 'class_loss': mrcnn_class_loss.item()} + results_dict['logger_string'] = "loss: {0:.2f}, rpn_class: {1:.2f}, rpn_bbox: {2:.2f}, mrcnn_class: {3:.2f}, " \ + "mrcnn_bbox: {4:.2f}, dice_loss: {5:.2f}, dcount {6}"\ + .format(loss.item(), batch_rpn_class_loss.item(), batch_rpn_bbox_loss.item(), mrcnn_class_loss.item(), + mrcnn_bbox_loss.item(), seg_loss_dice.item(), dcount) + + return results_dict + + + def test_forward(self, batch, return_masks=True): + """ + test method. wrapper around forward pass of network without usage of any ground truth information. + prepares input data for processing and stores outputs in a dictionary. + :param batch: dictionary containing 'data' + :param return_masks: boolean. If True, full resolution masks are returned for all proposals (speed trade-off). + :return: results_dict: dictionary with keys: + 'boxes': list over batch elements. each batch element is a list of boxes. each box is a dictionary: + [[{box_0}, ... {box_n}], [{box_0}, ... {box_n}], ...] + 'seg_preds': pixel-wise class predictions (b, 1, y, x, (z)) with values [0, n_classes] + """ + img = batch['data'] + img = torch.from_numpy(img).float().cuda() + _, _, _, detections, seg_logits = self.forward(img) + results_dict = get_results(self.cf, img.shape, detections, seg_logits) + return results_dict + + + def forward(self, img, is_training=True): + """ + :param img: input images (b, c, y, x, (z)). + :return: rpn_pred_logits: (b, n_anchors, 2) + :return: rpn_pred_deltas: (b, n_anchors, (y, x, (z), log(h), log(w), (log(d)))) + :return: batch_proposal_boxes: (b, n_proposals, (y1, x1, y2, x2, (z1), (z2), batch_ix)) only for monitoring/plotting. 
+ :return: detections: (n_final_detections, (y1, x1, y2, x2, (z1), (z2), batch_ix, pred_class_id, pred_score) + :return: detection_masks: (n_final_detections, n_classes, y, x, (z)) raw molded masks as returned by mask-head. + """ + # extract features. + fpn_outs = self.fpn(img) + seg_logits = self.final_conv(fpn_outs[0]) + rpn_feature_maps = [fpn_outs[i + 1] for i in self.cf.pyramid_levels] + self.mrcnn_feature_maps = rpn_feature_maps + + # loop through pyramid layers and apply RPN. + layer_outputs = [] # list of lists + for p in rpn_feature_maps: + layer_outputs.append(self.rpn(p)) + + # concatenate layer outputs. + # convert from list of lists of level outputs to list of lists of outputs across levels. + # e.g. [[a1, b1, c1], [a2, b2, c2]] => [[a1, a2], [b1, b2], [c1, c2]] + outputs = list(zip(*layer_outputs)) + outputs = [torch.cat(list(o), dim=1) for o in outputs] + rpn_pred_logits, rpn_pred_probs, rpn_pred_deltas = outputs + + # generate proposals: apply predicted deltas to anchors and filter by foreground scores from RPN classifier. + proposal_count = self.cf.post_nms_rois_training if is_training else self.cf.post_nms_rois_inference + batch_rpn_rois, batch_proposal_boxes = proposal_layer(rpn_pred_probs, rpn_pred_deltas, proposal_count, self.anchors, self.cf) + + # merge batch dimension of proposals while storing allocation info in coordinate dimension. + batch_ixs = torch.from_numpy(np.repeat(np.arange(batch_rpn_rois.shape[0]), batch_rpn_rois.shape[1])).float().cuda() + rpn_rois = batch_rpn_rois.view(-1, batch_rpn_rois.shape[2]) + self.rpn_rois_batch_info = torch.cat((rpn_rois, batch_ixs.unsqueeze(1)), dim=1) + + # this is the first of two forward passes in the second stage, where no activations are stored for backprop. + # here, all proposals are forwarded (with virtual_batch_size = batch_size * post_nms_rois.) + # for inference/monitoring as well as sampling of rois for the loss functions. 
+ # processed in chunks of roi_chunk_size to re-adjust to gpu-memory. + chunked_rpn_rois = self.rpn_rois_batch_info.split(self.cf.roi_chunk_size) + class_logits_list, bboxes_list = [], [] + with torch.no_grad(): + for chunk in chunked_rpn_rois: + chunk_class_logits, chunk_bboxes = self.classifier(self.mrcnn_feature_maps, chunk) + class_logits_list.append(chunk_class_logits) + bboxes_list.append(chunk_bboxes) + batch_mrcnn_class_logits = torch.cat(class_logits_list, 0) + batch_mrcnn_bbox = torch.cat(bboxes_list, 0) + self.batch_mrcnn_class_scores = F.softmax(batch_mrcnn_class_logits, dim=1) + + # refine classified proposals, filter and return final detections. + detections = refine_detections(rpn_rois, self.batch_mrcnn_class_scores, batch_mrcnn_bbox, batch_ixs, self.cf, ) + + return [rpn_pred_logits, rpn_pred_deltas, batch_proposal_boxes, detections, seg_logits] + + + def loss_samples_forward(self, batch_gt_class_ids, batch_gt_boxes): + """ + this is the second forward pass through the second stage (features from stage one are re-used). + samples few rois in detection_target_layer and forwards only those for loss computation. + :param batch_gt_class_ids: list over batch elements. Each element is a list over the corresponding roi target labels. + :param batch_gt_boxes: list over batch elements. Each element is a list over the corresponding roi target coordinates. + :param batch_gt_masks: list over batch elements. Each element is binary mask of shape (n_gt_rois, y, x, (z), c) + :return: sample_logits: (n_sampled_rois, n_classes) predicted class scores. + :return: sample_boxes: (n_sampled_rois, n_classes, 2 * dim) predicted corrections to be applied to proposals for refinement. + :return: sample_mask: (n_sampled_rois, n_classes, y, x, (z)) predicted masks per class and proposal. + :return: sample_target_class_ids: (n_sampled_rois) target class labels of sampled proposals. 
+ :return: sample_target_deltas: (n_sampled_rois, 2 * dim) target deltas of sampled proposals for box refinement. + :return: sample_target_masks: (n_sampled_rois, y, x, (z)) target masks of sampled proposals. + :return: sample_proposals: (n_sampled_rois, 2 * dim) RPN output for sampled proposals. only for monitoring/plotting. + """ + # sample rois for loss and get corresponding targets for all Mask R-CNN head network losses. + sample_ix, sample_target_class_ids, sample_target_deltas = \ + detection_target_layer(self.rpn_rois_batch_info, self.batch_mrcnn_class_scores, + batch_gt_class_ids, batch_gt_boxes, self.cf) + + # re-use feature maps and RPN output from first forward pass. + sample_proposals = self.rpn_rois_batch_info[sample_ix] + if 0 not in sample_proposals.size(): + sample_logits, sample_boxes = self.classifier(self.mrcnn_feature_maps, sample_proposals) + else: + sample_logits = torch.FloatTensor().cuda() + sample_boxes = torch.FloatTensor().cuda() + + return [sample_logits, sample_boxes, sample_target_class_ids, sample_target_deltas, sample_proposals] diff --git a/plotting.py b/plotting.py new file mode 100644 index 0000000..4e47646 --- /dev/null +++ b/plotting.py @@ -0,0 +1,266 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ==============================================================================
+
+import matplotlib
+matplotlib.use('Agg')
+import matplotlib.pyplot as plt
+import matplotlib.gridspec as gridspec
+import numpy as np
+import os
+from copy import deepcopy
+
+
+def plot_batch_prediction(batch, results_dict, cf, outfile=None):
+ """
+ plot the input images, ground truth annotations, and output predictions of a batch. If 3D batch, plots a 2D projection
+ of one randomly sampled element (patient) in the batch. Since plotting all slices of a patient volume is costly in
+ time and space, only a section containing a randomly sampled ground truth annotation is plotted.
+ :param batch: dict with keys: 'data' (input image), 'seg' (pixelwise annotations), 'pid'
+ :param results_dict: dict with key 'boxes': list over batch elements, where each element is a list of boxes
+ (prediction and ground truth) and every box is a dictionary containing box_coords, box_score and box_type.
+ """
+ if outfile is None:
+ outfile = os.path.join(cf.plot_dir, 'pred_example_{}.png'.format(cf.fold))
+
+ data = batch['data']
+ segs = batch['seg']
+ pids = batch['pid']
+ # for 3D, repeat pid over batch elements.
+ if len(set(pids)) == 1:
+ pids = [pids] * data.shape[0]
+
+ seg_preds = results_dict['seg_preds']
+ roi_results = deepcopy(results_dict['boxes'])
+
+ # Randomly sample one patient of the batch and project data into 2D slices for plotting.
+ if cf.dim == 3:
+ patient_ix = np.random.choice(data.shape[0])
+ data = np.transpose(data[patient_ix], axes=(3, 0, 1, 2))
+
+ # select interesting foreground section to plot. 
+ gt_boxes = [box['box_coords'] for box in roi_results[patient_ix] if box['box_type'] == 'gt'] + if len(gt_boxes) > 0: + z_cuts = [np.max((int(gt_boxes[0][4]) - 5, 0)), np.min((int(gt_boxes[0][5]) + 5, data.shape[0]))] + else: + z_cuts = [data.shape[0]//2 - 5, int(data.shape[0]//2 + np.min([10, data.shape[0]//2]))] + p_roi_results = roi_results[patient_ix] + roi_results = [[] for _ in range(data.shape[0])] + + # iterate over cubes and spread across slices. + for box in p_roi_results: + b = box['box_coords'] + # dismiss negative anchor slices. + slices = np.round(np.unique(np.clip(np.arange(b[4], b[5] + 1), 0, data.shape[0]-1))) + for s in slices: + roi_results[int(s)].append(box) + roi_results[int(s)][-1]['box_coords'] = b[:4] + + roi_results = roi_results[z_cuts[0]: z_cuts[1]] + data = data[z_cuts[0]: z_cuts[1]] + segs = np.transpose(segs[patient_ix], axes=(3, 0, 1, 2))[z_cuts[0]: z_cuts[1]] + seg_preds = np.transpose(seg_preds[patient_ix], axes=(3, 0, 1, 2))[z_cuts[0]: z_cuts[1]] + pids = [pids[patient_ix]] * data.shape[0] + + try: + # all dimensions except for the 'channel-dimension' are required to match + for i in [0, 2, 3]: + assert data.shape[i] == segs.shape[i] == seg_preds.shape[i] + except: + raise Warning('Shapes of arrays to plot not in agreement!' + 'Shapes {} vs. 
{} vs {}'.format(data.shape, segs.shape, seg_preds.shape)) + + + show_arrays = np.concatenate([data, segs, seg_preds, data[:, 0][:, None]], axis=1).astype(float) + approx_figshape = (4 * show_arrays.shape[0], 4 * show_arrays.shape[1]) + fig = plt.figure(figsize=approx_figshape) + gs = gridspec.GridSpec(show_arrays.shape[1] + 1, show_arrays.shape[0]) + gs.update(wspace=0.1, hspace=0.1) + for b in range(show_arrays.shape[0]): + for m in range(show_arrays.shape[1]): + + ax = plt.subplot(gs[m, b]) + ax.axis('off') + if m < show_arrays.shape[1]: + arr = show_arrays[b, m] + + if m < data.shape[1] or m == show_arrays.shape[1] - 1: + cmap = 'gray' + vmin = None + vmax = None + else: + cmap = None + vmin = 0 + vmax = cf.num_seg_classes - 1 + + if m == 0: + plt.title('{}'.format(pids[b][:10]), fontsize=20) + + plt.imshow(arr, cmap=cmap, vmin=vmin, vmax=vmax) + if m >= (data.shape[1]): + for box in roi_results[b]: + if box['box_type'] != 'patient_tn_box': # don't plot true negative dummy boxes. + coords = box['box_coords'] + if box['box_type'] == 'det': + # dont plot background preds or low confidence boxes. + if box['box_pred_class_id'] > 0 and box['box_score'] > 0.1: + plot_text = True + score = np.max(box['box_score']) + score_text = '{}|{:.0f}'.format(box['box_pred_class_id'], score*100) + # if prob detection: plot only boxes from correct sampling instance. + if 'sample_id' in box.keys() and int(box['sample_id']) != m - data.shape[1] - 2: + continue + # if prob detection: plot reconstructed boxes only in corresponding line. + if not 'sample_id' in box.keys() and m != data.shape[1] + 1: + continue + + score_font_size = 7 + text_color = 'w' + text_x = coords[1] + 10*(box['box_pred_class_id'] -1) #avoid overlap of scores in plot. 
+ text_y = coords[2] + 5 + else: + continue + elif box['box_type'] == 'gt': + plot_text = True + score_text = int(box['box_label']) + score_font_size = 7 + text_color = 'r' + text_x = coords[1] + text_y = coords[0] - 1 + else: + plot_text = False + + color_var = 'extra_usage' if 'extra_usage' in list(box.keys()) else 'box_type' + color = cf.box_color_palette[box[color_var]] + plt.plot([coords[1], coords[3]], [coords[0], coords[0]], color=color, linewidth=1, alpha=1) # up + plt.plot([coords[1], coords[3]], [coords[2], coords[2]], color=color, linewidth=1, alpha=1) # down + plt.plot([coords[1], coords[1]], [coords[0], coords[2]], color=color, linewidth=1, alpha=1) # left + plt.plot([coords[3], coords[3]], [coords[0], coords[2]], color=color, linewidth=1, alpha=1) # right + if plot_text: + plt.text(text_x, text_y, score_text, fontsize=score_font_size, color=text_color) + + try: + plt.savefig(outfile) + except: + raise Warning('failed to save plot.') + plt.close(fig) + + + +class TrainingPlot_2Panel(): + + + def __init__(self, cf): + + self.file_name = cf.plot_dir + '/monitor_{}'.format(cf.fold) + self.exp_name = cf.fold_dir + self.separate_values_dict = cf.assign_values_to_extra_figure + self.figure_list = [] + for n in range(cf.n_monitoring_figures): + self.figure_list.append(plt.figure(figsize=(10, 6))) + self.figure_list[-1].ax1 = plt.subplot(111) + self.figure_list[-1].ax1.set_xlabel('epochs') + self.figure_list[-1].ax1.set_ylabel('loss / metrics') + self.figure_list[-1].ax1.set_xlim(0, cf.num_epochs) + self.figure_list[-1].ax1.grid() + + self.figure_list[0].ax1.set_ylim(0, 1.5) + self.color_palette = ['b', 'c', 'r', 'purple', 'm', 'y', 'k', 'tab:gray'] + + def update_and_save(self, metrics, epoch): + + for figure_ix in range(len(self.figure_list)): + fig = self.figure_list[figure_ix] + detection_monitoring_plot(fig.ax1, metrics, self.exp_name, self.color_palette, epoch, figure_ix, self.separate_values_dict) + fig.savefig(self.file_name + '_{}'.format(figure_ix)) 
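The per-epoch averaging of batch-wise monitor values (as done for keys that are not stored per epoch) can be sketched in isolation. This is a minimal standalone example with a hypothetical `monitor_values` dict; `epoch_means` and the sample data are illustrative, not part of the toolkit:

```python
import numpy as np

def epoch_means(monitor_values, key, epochs):
    # monitor_values maps epoch -> list of per-batch record dicts, mirroring
    # metrics['train']['monitor_values']; average each epoch's records for one key.
    return [float(np.mean([rec[key] for rec in monitor_values[e]])) for e in epochs]

# hypothetical per-batch loss records for two epochs
monitor = {1: [{'loss': 1.0}, {'loss': 0.5}], 2: [{'loss': 0.5}, {'loss': 0.25}]}
print(epoch_means(monitor, 'loss', [1, 2]))  # [0.75, 0.375]
```

This yields one y-value per epoch, which is what gets plotted against `x = np.arange(1, epoch + 1)`.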
+
+
+def detection_monitoring_plot(ax1, metrics, exp_name, color_palette, epoch, figure_ix, separate_values_dict):
+
+ monitor_values_keys = metrics['train']['monitor_values'][1][0].keys()
+ separate_values = [v for fig_ix in separate_values_dict.values() for v in fig_ix]
+ if figure_ix == 0:
+ plot_keys = [ii for ii in monitor_values_keys if ii not in separate_values]
+ plot_keys += [k for k in metrics['train'].keys() if k != 'monitor_values']
+ else:
+ plot_keys = separate_values_dict[figure_ix]
+
+
+ x = np.arange(1, epoch + 1)
+
+ for kix, pk in enumerate(plot_keys):
+ if pk in metrics['train'].keys():
+ y_train = metrics['train'][pk][1:]
+ y_val = metrics['val'][pk][1:]
+ else:
+ y_train = [np.mean([er[pk] for er in metrics['train']['monitor_values'][e]]) for e in x]
+ y_val = [np.mean([er[pk] for er in metrics['val']['monitor_values'][e]]) for e in x]
+
+ ax1.plot(x, y_train, label='train_{}'.format(pk), linestyle='--', color=color_palette[kix])
+ ax1.plot(x, y_val, label='val_{}'.format(pk), linestyle='-', color=color_palette[kix])
+
+ if epoch == 1:
+ box = ax1.get_position()
+ ax1.set_position([box.x0, box.y0, box.width * 0.8, box.height])
+ ax1.legend(loc='center left', bbox_to_anchor=(1, 0.5))
+ ax1.set_title(exp_name)
+
+
+def plot_prediction_hist(label_list, pred_list, type_list, outfile):
+ """
+ plot histogram of predictions for a specific class.
+ :param label_list: list of 1s and 0s specifying whether prediction is a true positive match (1) or a false positive (0).
+ False negatives (missed ground truth objects) are artificially added predictions with score 0 and label 1.
+ :param pred_list: list of prediction-scores.
+ :param type_list: list of prediction-types for statistics info in the title. 
+ """ + preds = np.array(pred_list) + labels = np.array(label_list) + title = outfile.split('/')[-1] + ' count:{}'.format(len(label_list)) + plt.figure() + plt.yscale('log') + if 0 in labels: + plt.hist(preds[labels == 0], alpha=0.3, color='g', range=(0, 1), bins=50, label='false pos.') + if 1 in labels: + plt.hist(preds[labels == 1], alpha=0.3, color='b', range=(0, 1), bins=50, label='true pos. (false neg. @ score=0)') + + if type_list is not None: + fp_count = type_list.count('det_fp') + fn_count = type_list.count('det_fn') + tp_count = type_list.count('det_tp') + pos_count = fn_count + tp_count + title += ' tp:{} fp:{} fn:{} pos:{}'. format(tp_count, fp_count, fn_count, pos_count) + + plt.legend() + plt.title(title) + plt.xlabel('confidence score') + plt.ylabel('log n') + plt.savefig(outfile) + plt.close() + + +def plot_stat_curves(stats, outfile): + + for c in ['roc', 'prc']: + plt.figure() + for s in stats: + if s[c] is not None: + plt.plot(s[c][0], s[c][1], label=s['name'] + '_' + c) + plt.title(outfile.split('/')[-1] + '_' + c) + plt.legend(loc=3 if c == 'prc' else 4) + plt.xlabel('precision' if c == 'prc' else '1-spec.') + plt.ylabel('recall') + plt.savefig(outfile + '_' + c) + plt.close() diff --git a/predictor.py b/predictor.py new file mode 100644 index 0000000..0c32495 --- /dev/null +++ b/predictor.py @@ -0,0 +1,819 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+import os
+import numpy as np
+import torch
+from scipy.stats import norm
+from collections import OrderedDict
+from multiprocessing import Pool
+import pickle
+import pandas as pd
+
+
+class Predictor:
+    """
+    Prediction pipeline:
+    - receives a patched patient image (n_patches, c, y, x, (z)) from the patient data loader.
+    - forwards patches through the model in chunks of batch_size. (method: batch_tiling_forward)
+    - unmolds predictions (boxes and segmentations) to original patient coordinates. (method: spatial_tiling_forward)
+
+    Ensembling (mode == 'test'):
+    - for inference, forwards 4 mirrored versions of the image through the model and unmolds predictions afterwards
+      accordingly (method: data_aug_forward)
+    - for inference, loads multiple parameter sets of the trained model corresponding to different epochs. for each
+      parameter set, loops over the entire test set and runs the prediction pipeline for each patient. (method: predict_test_set)
+
+    Consolidation of predictions:
+    - consolidates a patient's predictions (boxes, segmentations) collected over patches, data-aug and temporal ensembling,
+      performs clustering and weighted averaging (external function: apply_wbc_to_patient) to obtain consistent outputs.
+    - for 2D networks, consolidates box predictions to 3D cubes via clustering (adaptation of non-maximum suppression).
+      (external function: merge_2D_to_3D_preds_per_patient)
+
+    Ground truth handling:
+    - dismisses any ground truth boxes returned by the model (happens in validation mode, patch-based ground truth)
+    - if provided by the data loader, adds 3D ground truth to the final predictions to be passed to the evaluator.
+ """ + def __init__(self, cf, net, logger, mode): + + self.cf = cf + self.logger = logger + + # mode is 'val' for patient-based validation/monitoring and 'test' for inference. + self.mode = mode + + # model instance. In validation mode, contains parameters of current epoch. + self.net = net + + # rank of current epoch loaded (for temporal averaging). this info is added to each prediction, + # for correct weighting during consolidation. + self.rank_ix = '0' + + # number of ensembled models. used to calculate the number of expected predictions per position + # during consolidation of predictions. Default is 1 (no ensembling, e.g. in validation). + self.n_ens = 1 + + if self.mode == 'test': + try: + self.epoch_ranking = np.load(os.path.join(self.cf.fold_dir, 'epoch_ranking.npy'))[:cf.test_n_epochs] + except: + raise RuntimeError('no epoch ranking file in fold directory. ' + 'seems like you are trying to run testing without prior training...') + self.n_ens = cf.test_n_epochs + if self.cf.test_aug: + self.n_ens *= 4 + + + def predict_patient(self, batch): + """ + predicts one patient. + called either directly via loop over validation set in exec.py (mode=='val') + or from self.predict_test_set (mode=='test). + in val mode: adds 3D ground truth info to predictions and runs consolidation and 2Dto3D merging of predictions. + in test mode: returns raw predictions (ground truth addition, consolidation, 2D to 3D merging are + done in self.predict_test_set, because patient predictions across several epochs might be needed + to be collected first, in case of temporal ensembling). + :return. results_dict: stores the results for one patient. dictionary with keys: + - 'boxes': list over batch elements. each element is a list over boxes, where each box is + one dictionary: [[box_0, ...], [box_n,...]]. batch elements are slices for 2D predictions + (if not merged to 3D), and a dummy batch dimension of 1 for 3D predictions. + - 'seg_preds': pixel-wise predictions. 
(b, 1, y, x, (z)) + - monitor_values (only in validation mode) + """ + self.logger.info('evaluating patient {} for fold {} '.format(batch['pid'], self.cf.fold)) + + # True if patient is provided in patches and predictions need to be tiled. + self.patched_patient = True if 'patch_crop_coords' in list(batch.keys()) else False + + # forward batch through prediction pipeline. + results_dict = self.data_aug_forward(batch) + + if self.mode == 'val': + for b in range(batch['patient_bb_target'].shape[0]): + for t in range(len(batch['patient_bb_target'][b])): + results_dict['boxes'][b].append({'box_coords': batch['patient_bb_target'][b][t], + 'box_label': batch['patient_roi_labels'][b][t], + 'box_type': 'gt'}) + + if self.patched_patient: + wcs_input = [results_dict['boxes'], 'dummy_pid', self.cf.class_dict, self.cf.wcs_iou, self.n_ens] + results_dict['boxes'] = apply_wbc_to_patient(wcs_input)[0] + + if self.cf.merge_2D_to_3D_preds: + merge_dims_inputs = [results_dict['boxes'], 'dummy_pid', self.cf.class_dict, self.cf.merge_3D_iou] + results_dict['boxes'] = merge_2D_to_3D_preds_per_patient(merge_dims_inputs)[0] + + return results_dict + + + def predict_test_set(self, batch_gen, return_results=True): + """ + wrapper around test method, which loads multiple (or one) epoch parameters (temporal ensembling), loops through + the test set and collects predictions per patient. Also flattens the results per patient and epoch + and adds optional ground truth boxes for evaluation. Saves out the raw result list for later analysis and + optionally consolidates and returns predictions immediately. + :return: (optionally) list_of_results_per_patient: list over patient results. each entry is a dict with keys: + - 'boxes': list over batch elements. each element is a list over boxes, where each box is + one dictionary: [[box_0, ...], [box_n,...]]. batch elements are slices for 2D predictions + (if not merged to 3D), and a dummy batch dimension of 1 for 3D predictions. 
+ - 'seg_preds': not implemented yet. todo for evaluation of instance/semantic segmentation. + """ + dict_of_patient_results = OrderedDict() + + # get paths of all parameter sets to be loaded for temporal ensembling. (or just one for no temp. ensembling). + weight_paths = [os.path.join(self.cf.fold_dir, '{}_best_params.pth'.format(epoch)) for epoch in self.epoch_ranking] + + for rank_ix, weight_path in enumerate(weight_paths): + + self.logger.info(('tmp ensembling over rank_ix:{} epoch:{}'.format(rank_ix, weight_path))) + self.net.load_state_dict(torch.load(weight_path)) + self.net.eval() + self.rank_ix = str(rank_ix) # get string of current rank for unique patch ids. + + with torch.no_grad(): + for _ in range(batch_gen['n_test']): + + batch = next(batch_gen['test']) + + # store batch info in patient entry of results dict. + if rank_ix == 0: + dict_of_patient_results[batch['pid']] = {} + dict_of_patient_results[batch['pid']]['results_list'] = [] + dict_of_patient_results[batch['pid']]['patient_bb_target'] = batch['patient_bb_target'] + dict_of_patient_results[batch['pid']]['patient_roi_labels'] = batch['patient_roi_labels'] + + # call prediction pipeline and store results in dict. + results_dict = self.predict_patient(batch) + dict_of_patient_results[batch['pid']]['results_list'].append(results_dict['boxes']) + + + self.logger.info('finished predicting test set. starting post-processing of predictions.') + list_of_results_per_patient = [] + + # loop over patients again to flatten results across epoch predictions. + # if provided, add ground truth boxes for evaluation. + for pid, p_dict in dict_of_patient_results.items(): + + tmp_ens_list = p_dict['results_list'] + results_dict = {} + # collect all boxes/seg_preds of same batch_instance over temporal instances. 
+ results_dict['boxes'] = [[item for d in tmp_ens_list for item in d[batch_instance]] + for batch_instance in range(len(tmp_ens_list[0]))] + + # TODO return for instance segmentation: + # results_dict['seg_preds'] = np.mean(results_dict['seg_preds'], 1)[:, None] + # results_dict['seg_preds'] = np.array([[item for d in tmp_ens_list for item in d['seg_preds'][batch_instance]] + # for batch_instance in range(len(tmp_ens_list[0]['boxes']))]) + + # add 3D ground truth boxes for evaluation. + for b in range(p_dict['patient_bb_target'].shape[0]): + for t in range(len(p_dict['patient_bb_target'][b])): + results_dict['boxes'][b].append({'box_coords': p_dict['patient_bb_target'][b][t], + 'box_label': p_dict['patient_roi_labels'][b][t], + 'box_type': 'gt'}) + + list_of_results_per_patient.append([results_dict['boxes'], pid]) + + # save out raw predictions. + out_string = 'raw_pred_boxes_hold_out_list' if self.cf.hold_out_test_set else 'raw_pred_boxes_list' + with open(os.path.join(self.cf.fold_dir, '{}.pickle'.format(out_string)), 'wb') as handle: + pickle.dump(list_of_results_per_patient, handle) + + if return_results: + + # consolidate predictions. + self.logger.info('applying wcs to test set predictions with iou = {} and n_ens = {}.'.format( + self.cf.wcs_iou, self.n_ens)) + pool = Pool(processes=6) + mp_inputs = [[ii[0], ii[1], self.cf.class_dict, self.cf.wcs_iou, self.n_ens] for ii in list_of_results_per_patient] + list_of_results_per_patient = pool.map(apply_wbc_to_patient, mp_inputs, chunksize=1) + pool.close() + pool.join() + + # merge 2D boxes to 3D cubes. 
(if model predicts 2D but evaluation is run in 3D)
+            if self.cf.merge_2D_to_3D_preds:
+                self.logger.info('applying 2D-to-3D merging to test set predictions with iou = {}.'.format(self.cf.merge_3D_iou))
+                pool = Pool(processes=6)
+                mp_inputs = [[ii[0], ii[1], self.cf.class_dict, self.cf.merge_3D_iou] for ii in list_of_results_per_patient]
+                list_of_results_per_patient = pool.map(merge_2D_to_3D_preds_per_patient, mp_inputs, chunksize=1)
+                pool.close()
+                pool.join()
+
+            return list_of_results_per_patient
+
+
+    def load_saved_predictions(self, apply_wbc=False):
+        """
+        loads raw predictions saved by self.predict_test_set. consolidates and merges 2D boxes to 3D cubes for evaluation.
+        (if model predicts 2D but evaluation is run in 3D)
+        :return: (optionally) list_of_results_per_patient: list over patient results. each entry is a dict with keys:
+                 - 'boxes': list over batch elements. each element is a list over boxes, where each box is
+                            one dictionary: [[box_0, ...], [box_n,...]]. batch elements are slices for 2D predictions
+                            (if not merged to 3D), and a dummy batch dimension of 1 for 3D predictions.
+                 - 'seg_preds': not implemented yet. todo for evaluation of instance/semantic segmentation.
+        """
+
+        # load predictions for a single test-set fold.
+        if not self.cf.hold_out_test_set:
+            with open(os.path.join(self.cf.fold_dir, 'raw_pred_boxes_list.pickle'), 'rb') as handle:
+                list_of_results_per_patient = pickle.load(handle)
+            da_factor = 4 if self.cf.test_aug else 1
+            n_ens = self.cf.test_n_epochs * da_factor
+            self.logger.info('loaded raw test set predictions with n_patients = {} and n_ens = {}'.format(
+                len(list_of_results_per_patient), n_ens))
+
+        # if the hold-out test set was predicted, aggregate predictions of all trained models
+        # corresponding to all CV-folds and flatten them.
+        else:
+            boxes_list = []
+            for fold in self.cf.folds:
+                fold_dir = os.path.join(self.cf.exp_dir, 'fold_{}'.format(fold))
+                with open(os.path.join(fold_dir, 'raw_pred_boxes_hold_out_list.pickle'), 'rb') as handle:
+                    fold_list = pickle.load(handle)
+                pids = [ii[1] for ii in fold_list]
+                boxes_list.append([ii[0] for ii in fold_list])
+            list_of_results_per_patient = [[[[box for fold_list in boxes_list for box in fold_list[pix][0]
+                                              if box['box_type'] == 'det']], pid] for pix, pid in enumerate(pids)]
+            da_factor = 4 if self.cf.test_aug else 1
+            n_ens = self.cf.test_n_epochs * da_factor * len(self.cf.folds)
+
+        # consolidate predictions.
+        if apply_wbc:
+            self.logger.info('applying wcs to test set predictions with iou = {} and n_ens = {}.'.format(
+                self.cf.wcs_iou, n_ens))
+            pool = Pool(processes=6)
+            mp_inputs = [[ii[0], ii[1], self.cf.class_dict, self.cf.wcs_iou, n_ens] for ii in list_of_results_per_patient]
+            list_of_results_per_patient = pool.map(apply_wbc_to_patient, mp_inputs, chunksize=1)
+            pool.close()
+            pool.join()
+
+        # merge 2D box predictions to 3D cubes (if model predicts 2D but evaluation is run in 3D)
+        if self.cf.merge_2D_to_3D_preds:
+            self.logger.info(
+                'applying 2D-to-3D merging to test set predictions with iou = {}.'.format(self.cf.merge_3D_iou))
+            pool = Pool(processes=6)
+            mp_inputs = [[ii[0], ii[1], self.cf.class_dict, self.cf.merge_3D_iou] for ii in list_of_results_per_patient]
+            list_of_results_per_patient = pool.map(merge_2D_to_3D_preds_per_patient, mp_inputs, chunksize=1)
+            pool.close()
+            pool.join()
+
+        return list_of_results_per_patient
+
+
+    def data_aug_forward(self, batch):
+        """
+        in val_mode: passes batch through to spatial_tiling method without data_aug.
+        in test_mode: if cf.test_aug is set in configs, creates 4 mirrored versions of the input image,
+        passes all of them to the next processing step (spatial_tiling method) and re-transforms returned predictions
+        to the original image version.
+        :return. results_dict: stores the results for one patient. dictionary with keys:
+                 - 'boxes': list over batch elements. each element is a list over boxes, where each box is
+                            one dictionary: [[box_0, ...], [box_n,...]]. batch elements are slices for 2D predictions,
+                            and a dummy batch dimension of 1 for 3D predictions.
+                 - 'seg_preds': pixel-wise predictions. (b, 1, y, x, (z))
+                 - monitor_values (only in validation mode)
+        """
+        patch_crops = batch['patch_crop_coords'] if self.patched_patient else None
+        results_list = [self.spatial_tiling_forward(batch, patch_crops)]
+        org_img_shape = batch['original_img_shape']
+
+        if self.mode == 'test' and self.cf.test_aug:
+
+            if self.patched_patient:
+                # apply mirror transformations to patch-crop coordinates, for correct tiling in spatial_tiling method.
+                mirrored_patch_crops = get_mirrored_patch_crops(patch_crops, batch['original_img_shape'])
+            else:
+                mirrored_patch_crops = [None] * 3
+
+            img = np.copy(batch['data'])
+
+            # first mirroring: y-axis.
+            batch['data'] = np.flip(img, axis=2).copy()
+            chunk_dict = self.spatial_tiling_forward(batch, mirrored_patch_crops[0], n_aug='1')
+            # re-transform coordinates.
+ for ix in range(len(chunk_dict['boxes'])): + for boxix in range(len(chunk_dict['boxes'][ix])): + coords = chunk_dict['boxes'][ix][boxix]['box_coords'].copy() + coords[0] = org_img_shape[2] - chunk_dict['boxes'][ix][boxix]['box_coords'][2] + coords[2] = org_img_shape[2] - chunk_dict['boxes'][ix][boxix]['box_coords'][0] + assert coords[2] >= coords[0], [coords, chunk_dict['boxes'][ix][boxix]['box_coords'].copy()] + assert coords[3] >= coords[1], [coords, chunk_dict['boxes'][ix][boxix]['box_coords'].copy()] + chunk_dict['boxes'][ix][boxix]['box_coords'] = coords + # re-transform segmentation predictions. + chunk_dict['seg_preds'] = np.flip(chunk_dict['seg_preds'], axis=2) + results_list.append(chunk_dict) + + # second mirroring: x-axis. + batch['data'] = np.flip(img, axis=3).copy() + chunk_dict = self.spatial_tiling_forward(batch, mirrored_patch_crops[1], n_aug='2') + # re-transform coordinates. + for ix in range(len(chunk_dict['boxes'])): + for boxix in range(len(chunk_dict['boxes'][ix])): + coords = chunk_dict['boxes'][ix][boxix]['box_coords'].copy() + coords[1] = org_img_shape[3] - chunk_dict['boxes'][ix][boxix]['box_coords'][3] + coords[3] = org_img_shape[3] - chunk_dict['boxes'][ix][boxix]['box_coords'][1] + assert coords[2] >= coords[0], [coords, chunk_dict['boxes'][ix][boxix]['box_coords'].copy()] + assert coords[3] >= coords[1], [coords, chunk_dict['boxes'][ix][boxix]['box_coords'].copy()] + chunk_dict['boxes'][ix][boxix]['box_coords'] = coords + # re-transform segmentation predictions. + chunk_dict['seg_preds'] = np.flip(chunk_dict['seg_preds'], axis=3) + results_list.append(chunk_dict) + + # third mirroring: y-axis and x-axis. + batch['data'] = np.flip(np.flip(img, axis=2), axis=3).copy() + chunk_dict = self.spatial_tiling_forward(batch, mirrored_patch_crops[2], n_aug='3') + # re-transform coordinates. 
+ for ix in range(len(chunk_dict['boxes'])): + for boxix in range(len(chunk_dict['boxes'][ix])): + coords = chunk_dict['boxes'][ix][boxix]['box_coords'].copy() + coords[0] = org_img_shape[2] - chunk_dict['boxes'][ix][boxix]['box_coords'][2] + coords[2] = org_img_shape[2] - chunk_dict['boxes'][ix][boxix]['box_coords'][0] + coords[1] = org_img_shape[3] - chunk_dict['boxes'][ix][boxix]['box_coords'][3] + coords[3] = org_img_shape[3] - chunk_dict['boxes'][ix][boxix]['box_coords'][1] + assert coords[2] >= coords[0], [coords, chunk_dict['boxes'][ix][boxix]['box_coords'].copy()] + assert coords[3] >= coords[1], [coords, chunk_dict['boxes'][ix][boxix]['box_coords'].copy()] + chunk_dict['boxes'][ix][boxix]['box_coords'] = coords + # re-transform segmentation predictions. + chunk_dict['seg_preds'] = np.flip(np.flip(chunk_dict['seg_preds'], axis=2), axis=3).copy() + results_list.append(chunk_dict) + + batch['data'] = img + + # aggregate all boxes/seg_preds per batch element from data_aug predictions. + results_dict = {} + results_dict['boxes'] = [[item for d in results_list for item in d['boxes'][batch_instance]] + for batch_instance in range(org_img_shape[0])] + results_dict['seg_preds'] = np.array([[item for d in results_list for item in d['seg_preds'][batch_instance]] + for batch_instance in range(org_img_shape[0])]) + if self.mode == 'val': + results_dict['monitor_values'] = results_list[0]['monitor_values'] + + return results_dict + + + def spatial_tiling_forward(self, batch, patch_crops=None, n_aug='0'): + """ + forwards batch to batch_tiling_forward method and receives and returns a dictionary with results. + if patch-based prediction, the results received from batch_tiling_forward will be on a per-patch-basis. + this method uses the provided patch_crops to re-transform all predictions to whole-image coordinates. 
+        Patch-origin information of all box predictions will be needed for consolidation, hence it is stored as
+        'patch_id', which is a unique string for each patch (it also takes current data aug and temporal epoch instances
+        into account). all box predictions get additional information about the number of overlapping patches at the
+        respective position (used for consolidation).
+        :return. results_dict: stores the results for one patient. dictionary with keys:
+                 - 'boxes': list over batch elements. each element is a list over boxes, where each box is
+                            one dictionary: [[box_0, ...], [box_n,...]]. batch elements are slices for 2D predictions,
+                            and a dummy batch dimension of 1 for 3D predictions.
+                 - 'seg_preds': pixel-wise predictions. (b, 1, y, x, (z))
+                 - monitor_values (only in validation mode)
+        """
+        if patch_crops is not None:
+
+            patches_dict = self.batch_tiling_forward(batch)
+
+            results_dict = {'boxes': [[] for _ in range(batch['original_img_shape'][0])]}
+
+            # instantiate segmentation output array. Will contain averages over patch predictions.
+            out_seg_preds = np.zeros(batch['original_img_shape'], dtype=np.float16)[:, 0][:, None]
+            # counts patch instances per pixel position.
+            patch_overlap_map = np.zeros_like(out_seg_preds, dtype='uint8')
+
+            # unmold segmentation outputs. loop over patches.
+            for pix, pc in enumerate(patch_crops):
+                if self.cf.dim == 3:
+                    out_seg_preds[:, :, pc[0]:pc[1], pc[2]:pc[3], pc[4]:pc[5]] += patches_dict['seg_preds'][pix][None]
+                    patch_overlap_map[:, :, pc[0]:pc[1], pc[2]:pc[3], pc[4]:pc[5]] += 1
+                else:
+                    out_seg_preds[pc[4]:pc[5], :, pc[0]:pc[1], pc[2]:pc[3]] += patches_dict['seg_preds'][pix]
+                    patch_overlap_map[pc[4]:pc[5], :, pc[0]:pc[1], pc[2]:pc[3]] += 1
+
+            # take mean in overlapping areas.
+            out_seg_preds[patch_overlap_map > 0] /= patch_overlap_map[patch_overlap_map > 0]
+            results_dict['seg_preds'] = out_seg_preds
+
+            # unmold box outputs. loop over patches.
+            for pix, pc in enumerate(patch_crops):
+                patch_boxes = patches_dict['boxes'][pix]
+
+                for box in patch_boxes:
+
+                    # add unique patch id for consolidation of predictions.
+                    box['patch_id'] = self.rank_ix + '_' + n_aug + '_' + str(pix)
+
+                    # boxes from the edges of a patch have a lower prediction quality than the ones at patch centers.
+                    # hence they will be downweighted for consolidation, using the 'box_patch_center_factor', which is
+                    # obtained from a normal distribution over positions in the patch and averaged over spatial dimensions.
+                    # Also the info 'box_n_overlaps' is stored for consolidation, which is the number of
+                    # overlapping patches at the box's position.
+                    c = box['box_coords']
+                    box_centers = np.array([(c[ii+2] - c[ii])/2 for ii in range(len(c)//2)])
+                    box['box_patch_center_factor'] = np.mean(
+                        [norm.pdf(bc, loc=pc, scale=pc * 0.8) * np.sqrt(2 * np.pi) * pc * 0.8 for bc, pc in
+                         zip(box_centers, np.array(self.cf.patch_size) / 2)])
+                    if self.cf.dim == 3:
+                        c += np.array([pc[0], pc[2], pc[0], pc[2], pc[4], pc[4]])
+                        int_c = [int(np.floor(ii)) if ix % 2 == 0 else int(np.ceil(ii)) for ix, ii in enumerate(c)]
+                        box['box_n_overlaps'] = np.mean(patch_overlap_map[:, :, int_c[1]:int_c[3], int_c[0]:int_c[2], int_c[4]:int_c[5]])
+                        results_dict['boxes'][0].append(box)
+                    else:
+                        c += np.array([pc[0], pc[2], pc[0], pc[2]])
+                        int_c = [int(np.floor(ii)) if ix % 2 == 0 else int(np.ceil(ii)) for ix, ii in enumerate(c)]
+                        box['box_n_overlaps'] = np.mean(patch_overlap_map[pc[4], :, int_c[1]:int_c[3], int_c[0]:int_c[2]])
+                        results_dict['boxes'][pc[4]].append(box)
+
+            if self.mode == 'val':
+                results_dict['monitor_values'] = patches_dict['monitor_values']
+
+        # if predictions are not patch-based:
+        # add patch-origin info to boxes (entire image is the same patch with overlap=1) and return results.
+        else:
+            results_dict = self.batch_tiling_forward(batch)
+            for b in results_dict['boxes']:
+                for box in b:
+                    box['box_patch_center_factor'] = 1
+                    box['box_n_overlaps'] = 1
+                    box['patch_id'] = self.rank_ix + '_' + n_aug
+
+        return results_dict
+
+
+    def batch_tiling_forward(self, batch):
+        """
+        calls the actual network forward method. in patch-based prediction, the batch dimension might be overloaded
+        with n_patches >> batch_size, which would exceed gpu memory. In this case, batches are processed in chunks of
+        batch_size. validation mode calls the train method to monitor losses (returned ground truth objects are discarded).
+        test mode calls the test forward method, no ground truth required / involved.
+        :return. results_dict: stores the results for one patient. dictionary with keys:
+                 - 'boxes': list over batch elements. each element is a list over boxes, where each box is
+                            one dictionary: [[box_0, ...], [box_n,...]]. batch elements are slices for 2D predictions,
+                            and a dummy batch dimension of 1 for 3D predictions.
+                 - 'seg_preds': pixel-wise predictions. (b, 1, y, x, (z))
+                 - monitor_values (only in validation mode)
+        """
+        self.logger.info('forwarding (patched) patient with shape: {}'.format(batch['data'].shape))
+
+        img = batch['data']
+
+        if img.shape[0] <= self.cf.batch_size:
+
+            if self.mode == 'val':
+                # call training method to monitor losses
+                results_dict = self.net.train_forward(batch, is_validation=True)
+                # discard returned ground-truth boxes (also training info boxes).
+ results_dict['boxes'] = [[box for box in b if box['box_type'] == 'det'] for b in results_dict['boxes']] + else: + results_dict = self.net.test_forward(batch, return_masks=self.cf.return_masks_in_test) + + else: + split_ixs = np.split(np.arange(img.shape[0]), np.arange(img.shape[0])[::self.cf.batch_size]) + chunk_dicts = [] + for chunk_ixs in split_ixs[1:]: # first split is elements before 0, so empty + b = {k: batch[k][chunk_ixs] for k in batch.keys() + if (isinstance(batch[k], np.ndarray) and batch[k].shape[0] == img.shape[0])} + if self.mode == 'val': + chunk_dicts += [self.net.train_forward(b, is_validation=True)] + else: + chunk_dicts += [self.net.test_forward(b, return_masks=self.cf.return_masks_in_test)] + + + results_dict = {} + # flatten out batch elements from chunks ([chunk, chunk] -> [b, b, b, b, ...]) + results_dict['boxes'] = [item for d in chunk_dicts for item in d['boxes']] + results_dict['seg_preds'] = np.array([item for d in chunk_dicts for item in d['seg_preds']]) + + if self.mode == 'val': + # estimate metrics by mean over batch_chunks. Most similar to training metrics. + results_dict['monitor_values'] = \ + {k:np.mean([d['monitor_values'][k] for d in chunk_dicts]) + for k in chunk_dicts[0]['monitor_values'].keys()} + # discard returned ground-truth boxes (also training info boxes). + results_dict['boxes'] = [[box for box in b if box['box_type'] == 'det'] for b in results_dict['boxes']] + + return results_dict + + + +def apply_wbc_to_patient(inputs): + """ + wrapper around prediction box consolidation: weighted cluster scoring (wcs). processes a single patient. + loops over batch elements in patient results (1 in 3D, slices in 2D) and foreground classes, + aggregates and stores results in new list. + :return. patient_results_list: list over batch elements. each element is a list over boxes, where each box is + one dictionary: [[box_0, ...], [box_n,...]]. 
batch elements are slices for 2D
+             predictions, and a dummy batch dimension of 1 for 3D predictions.
+    :return. pid: string. patient id.
+    """
+    in_patient_results_list, pid, class_dict, wcs_iou, n_ens = inputs
+    out_patient_results_list = [[] for _ in range(len(in_patient_results_list))]
+
+    for bix, b in enumerate(in_patient_results_list):
+
+        for cl in list(class_dict.keys()):
+
+            boxes = [(ix, box) for ix, box in enumerate(b) if (box['box_type'] == 'det' and box['box_pred_class_id'] == cl)]
+            box_coords = np.array([b[1]['box_coords'] for b in boxes])
+            box_scores = np.array([b[1]['box_score'] for b in boxes])
+            box_center_factor = np.array([b[1]['box_patch_center_factor'] for b in boxes])
+            box_n_overlaps = np.array([b[1]['box_n_overlaps'] for b in boxes])
+            box_patch_id = np.array([b[1]['patch_id'] for b in boxes])
+
+            if 0 not in box_scores.shape:
+                keep_scores, keep_coords = weighted_box_clustering(
+                    np.concatenate((box_coords, box_scores[:, None], box_center_factor[:, None],
+                                    box_n_overlaps[:, None]), axis=1), box_patch_id, wcs_iou, n_ens)
+
+                for boxix in range(len(keep_scores)):
+                    out_patient_results_list[bix].append({'box_type': 'det', 'box_coords': keep_coords[boxix],
+                                                          'box_score': keep_scores[boxix], 'box_pred_class_id': cl})
+
+        # add gt boxes back to new output list.
+        out_patient_results_list[bix].extend([box for box in b if box['box_type'] == 'gt'])
+
+    return [out_patient_results_list, pid]
+
+
+
+def merge_2D_to_3D_preds_per_patient(inputs):
+    """
+    wrapper around the 2D-to-3D merging operation. Processes a single patient. Takes 2D patient results (slices in batch dimension)
+    and returns 3D patient results (dummy batch dimension of 1). Applies an adaptation of non-maximum suppression
+    (detailed methodology is described in nms_2to3D).
+    :return. results_dict_boxes: list over batch elements (1 in 3D). each element is a list over boxes, where each box is
+             one dictionary: [[box_0, ...], [box_n,...]].
+    :return. pid: string. patient id.
+ """ + in_patient_results_list, pid, class_dict, merge_3D_iou = inputs + out_patient_results_list = [] + + for cl in list(class_dict.keys()): + boxes, slice_ids = [], [] + # collect box predictions over batch dimension (slices) and store slice info as slice_ids. + for bix, b in enumerate(in_patient_results_list): + det_boxes = [(ix, box) for ix, box in enumerate(b) if + (box['box_type'] == 'det' and box['box_pred_class_id'] == cl)] + boxes += det_boxes + slice_ids += [bix] * len(det_boxes) + + box_coords = np.array([b[1]['box_coords'] for b in boxes]) + box_scores = np.array([b[1]['box_score'] for b in boxes]) + slice_ids = np.array(slice_ids) + + if 0 not in box_scores.shape: + keep_ix, keep_z = nms_2to3D( + np.concatenate((box_coords, box_scores[:, None], slice_ids[:, None]), axis=1), merge_3D_iou) + else: + keep_ix, keep_z = [], [] + + # store kept predictions in new results list and add corresponding z-dimension info to coordinates. + for kix, kz in zip(keep_ix, keep_z): + out_patient_results_list.append({'box_type': 'det', 'box_coords': list(box_coords[kix]) + kz, + 'box_score': box_scores[kix], 'box_pred_class_id': cl}) + + out_patient_results_list += [box for b in in_patient_results_list for box in b if box['box_type'] == 'gt'] + out_patient_results_list = [out_patient_results_list] # add dummy batch dimension 1 for 3D. + + return [out_patient_results_list, pid] + + + +def weighted_box_clustering(dets, box_patch_id, thresh, n_ens): + """ + consolidates overlapping predictions resulting from patch overlaps, test data augmentations and temporal ensembling. + clusters predictions together with iou > thresh (like in NMS). Output score and coordinate for one cluster are the + average weighted by individual patch center factors (how trustworthy is this candidate measured by how centered + its position the patch is) and the size of the corresponding box. 
+    The number of expected predictions at a position is n_data_aug * n_temp_ens * n_overlaps_at_position
+    (1 prediction per unique patch). Missing predictions at a cluster position are defined as the number of unique
+    patches in the cluster which did not contribute any boxes.
+    :param dets: (n_dets, (y1, x1, y2, x2, (z1), (z2), score, box_pc_fact, box_n_ovs))
+    :param thresh: threshold for iou matching.
+    :param n_ens: number of ensembled models (-> number of expected predictions per position).
+    :return: keep_scores: (n_keep) new scores of boxes to be kept.
+    :return: keep_coords: (n_keep, (y1, x1, y2, x2, (z1), (z2))) new coordinates of boxes to be kept.
+    """
+    dim = 2 if dets.shape[1] == 7 else 3
+    y1 = dets[:, 0]
+    x1 = dets[:, 1]
+    y2 = dets[:, 2]
+    x2 = dets[:, 3]
+    scores = dets[:, -3]
+    box_pc_facts = dets[:, -2]
+    box_n_ovs = dets[:, -1]
+
+    areas = (y2 - y1 + 1) * (x2 - x1 + 1)
+
+    if dim == 3:
+        z1 = dets[:, 4]
+        z2 = dets[:, 5]
+        areas *= (z2 - z1 + 1)
+
+    # order maps rank to index, e.g. order[1] = 24 (rank 1 -> index 24).
+    order = scores.argsort()[::-1]
+
+    keep_scores = []
+    keep_coords = []
+
+    while order.size > 0:
+        i = order[0]  # highest scoring element
+        xx1 = np.maximum(x1[i], x1[order])
+        yy1 = np.maximum(y1[i], y1[order])
+        xx2 = np.minimum(x2[i], x2[order])
+        yy2 = np.minimum(y2[i], y2[order])
+
+        w = np.maximum(0.0, xx2 - xx1 + 1)
+        h = np.maximum(0.0, yy2 - yy1 + 1)
+        inter = w * h
+
+        if dim == 3:
+            zz1 = np.maximum(z1[i], z1[order])
+            zz2 = np.minimum(z2[i], z2[order])
+            d = np.maximum(0.0, zz2 - zz1 + 1)
+            inter *= d
+
+        # overlap (iou) between currently highest scoring box and all boxes.
+        ovr = inter / (areas[i] + areas[order] - inter)
+
+        # get all the predictions that match the current box to build one cluster.
+        matches = np.argwhere(ovr > thresh)
+
+        match_n_ovs = box_n_ovs[order[matches]]
+        match_pc_facts = box_pc_facts[order[matches]]
+        match_patch_id = box_patch_id[order[matches]]
+        match_ov_facts = ovr[matches]
+        match_areas = areas[order[matches]]
+        match_scores = scores[order[matches]]
+
+        # weight all scores in the cluster by patch factors and size.
+        match_score_weights = match_ov_facts * match_areas * match_pc_facts
+        match_scores *= match_score_weights
+
+        # for the weighted average, scores have to be divided by the total number of expected predictions at the
+        # position of the current cluster. 1 prediction per patch is expected. therefore, the number of ensembled
+        # models is multiplied by the mean overlaps of patches at this position (boxes of the cluster might partly be
+        # in areas of different overlaps).
+        n_expected_preds = n_ens * np.mean(match_n_ovs)
+
+        # the number of missing predictions is obtained as the number of patches
+        # which did not contribute any prediction to the current cluster.
+        n_missing_preds = np.max((0, n_expected_preds - np.unique(match_patch_id).shape[0]))
+
+        # missing preds are given the mean weighting
+        # (the expected prediction is the mean over all predictions in the cluster).
+        denom = np.sum(match_score_weights) + n_missing_preds * np.mean(match_score_weights)
+
+        # compute the weighted average score for the cluster.
+        avg_score = np.sum(match_scores) / denom
+
+        # compute the weighted average of coordinates for the cluster. only take existing
+        # predictions into account.
+        avg_coords = [np.sum(y1[order[matches]] * match_scores) / np.sum(match_scores),
+                      np.sum(x1[order[matches]] * match_scores) / np.sum(match_scores),
+                      np.sum(y2[order[matches]] * match_scores) / np.sum(match_scores),
+                      np.sum(x2[order[matches]] * match_scores) / np.sum(match_scores)]
+        if dim == 3:
+            avg_coords.append(np.sum(z1[order[matches]] * match_scores) / np.sum(match_scores))
+            avg_coords.append(np.sum(z2[order[matches]] * match_scores) / np.sum(match_scores))
+
+        # some clusters might have very low scores due to high amounts of missing predictions.
+        # filter those out with a conservative threshold to speed up evaluation.
+        if avg_score > 0.01:
+            keep_scores.append(avg_score)
+            keep_coords.append(avg_coords)
+
+        # get the indices of all elements that were not matched and discard all others.
+        inds = np.where(ovr <= thresh)[0]
+        order = order[inds]
+
+    return keep_scores, keep_coords
+
+
+
+def nms_2to3D(dets, thresh):
+    """
+    Merges 2D boxes to 3D cubes. To this end, boxes of all slices are projected into one slice. An adaptation of
+    non-maximum suppression is applied, where clusters are found (as in NMS) with the extra constraint that suppressed
+    boxes have to have 'connected' z-coordinates w.r.t. the core slice (cluster center, highest scoring box).
+    'connected' z-coordinates are those z-coordinates with predictions up to the first coordinate where no prediction
+    was found.
+
+    example: a cluster of predictions was found with overlap > iou thresh in xy (as in NMS). The z-coordinate of the
+    highest scoring box is 50. Other predictions have 23, 46, 48, 49, 51, 52, 53, 56, 57.
+    Only the coordinates connected with 50 are clustered into one cube: 48, 49, 51, 52, 53. (46 is not, because nothing
+    was found at 47, so 47 is a 'hole', which interrupts the connection.) Only the boxes corresponding to these
+    coordinates are suppressed. All others are kept for building further clusters.
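To make the weighting in `weighted_box_clustering` above concrete, here is a minimal, self-contained sketch of the score consolidation for one hypothetical cluster (all numbers invented; it re-derives the arithmetic with plain NumPy rather than calling into this module):

```python
import numpy as np

# Hypothetical cluster of three matched boxes.
match_scores = np.array([0.9, 0.8, 0.6])      # raw detection scores
match_pc_facts = np.array([1.0, 0.7, 0.5])    # patch-center factors (trust)
match_ov_facts = np.array([1.0, 0.8, 0.6])    # iou with the cluster's top box
match_areas = np.array([100., 90., 80.])      # box sizes
match_n_ovs = np.array([4, 4, 2])             # patch overlaps at each position
match_patch_id = np.array([0, 1, 1])          # only two unique patches contributed
n_ens = 1                                     # single model, no ensembling

weights = match_ov_facts * match_areas * match_pc_facts
weighted_scores = match_scores * weights

# expected predictions at this position vs. patches that actually contributed.
n_expected_preds = n_ens * np.mean(match_n_ovs)
n_missing_preds = max(0, n_expected_preds - np.unique(match_patch_id).shape[0])

# missing predictions enter the denominator with the mean weight, pulling the
# consolidated score down where many patches saw nothing.
denom = weights.sum() + n_missing_preds * weights.mean()
avg_score = weighted_scores.sum() / denom
print(round(float(avg_score), 3))
```

With roughly 1.3 of the expected ~3.3 predictions missing, the consolidated score drops well below the raw top score of 0.9.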
+
+    This algorithm works best with a certain minimum confidence of predictions, because low-confidence (e.g. noisy or
+    cluttered) predictions can break the relatively strong assumption of defining the cubes' z-boundaries at the first
+    'hole' in the cluster.
+
+    :param dets: (n_detections, (y1, x1, y2, x2, scores, slice_id))
+    :param thresh: iou matching threshold (as in NMS).
+    :return: keep: (n_keep) 1D tensor of indices to be kept.
+    :return: keep_z: (n_keep, [z1, z2]) z-coordinates to be added to the kept boxes in order to form cubes.
+    """
+    y1 = dets[:, 0]
+    x1 = dets[:, 1]
+    y2 = dets[:, 2]
+    x2 = dets[:, 3]
+    scores = dets[:, -2]
+    slice_id = dets[:, -1]
+
+    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
+    order = scores.argsort()[::-1]
+
+    keep = []
+    keep_z = []
+
+    while order.size > 0:  # order is the sorted index. maps rank to index, e.g. order[1] = 24 (rank 1 -> ix 24).
+        i = order[0]  # pop the highest scoring element
+        xx1 = np.maximum(x1[i], x1[order])
+        yy1 = np.maximum(y1[i], y1[order])
+        xx2 = np.minimum(x2[i], x2[order])
+        yy2 = np.minimum(y2[i], y2[order])
+
+        w = np.maximum(0.0, xx2 - xx1 + 1)
+        h = np.maximum(0.0, yy2 - yy1 + 1)
+        inter = w * h
+
+        ovr = inter / (areas[i] + areas[order] - inter)
+        matches = np.argwhere(ovr > thresh)  # get all the elements that match the current box and have a lower score.
+
+        slice_ids = slice_id[order[matches]]
+        core_slice = slice_id[int(i)]
+        upper_holes = [ii for ii in np.arange(core_slice, np.max(slice_ids)) if ii not in slice_ids]
+        lower_holes = [ii for ii in np.arange(np.min(slice_ids), core_slice) if ii not in slice_ids]
+        max_valid_slice_id = np.min(upper_holes) if len(upper_holes) > 0 else np.max(slice_ids)
+        min_valid_slice_id = np.max(lower_holes) if len(lower_holes) > 0 else np.min(slice_ids)
+        z_matches = matches[(slice_ids <= max_valid_slice_id) & (slice_ids >= min_valid_slice_id)]
+
+        z1 = np.min(slice_id[order[z_matches]]) - 1
+        z2 = np.max(slice_id[order[z_matches]]) + 1
+
+        keep.append(i)
+        keep_z.append([z1, z2])
+        order = np.delete(order, z_matches, axis=0)
+
+    return keep, keep_z
+
+
+
+def get_mirrored_patch_crops(patch_crops, org_img_shape):
+    """
+    applies 3 mirror transformations (x-axis, y-axis, x&y-axis)
+    to the given patch crop coordinates and returns the transformed coordinates.
+    Handles 2D and 3D coordinates.
+    :param patch_crops: list of crops: each element is a list of coordinates for a given crop [[y1, x1, ...], [y1, x1, ..]]
+    :param org_img_shape: shape of the patient volume used as world coordinates.
+    :return: list of mirrored patch crops: length = 3. each element is a list of transformed patch crops.
+    """
+    mirrored_patch_crops = []
+
+    # y-axis transform.
+    mirrored_patch_crops.append([[org_img_shape[2] - ii[1],
+                                  org_img_shape[2] - ii[0],
+                                  ii[2], ii[3]] if len(ii) == 4 else
+                                 [org_img_shape[2] - ii[1],
+                                  org_img_shape[2] - ii[0],
+                                  ii[2], ii[3], ii[4], ii[5]] for ii in patch_crops])
+
+    # x-axis transform.
+    mirrored_patch_crops.append([[ii[0], ii[1],
+                                  org_img_shape[3] - ii[3],
+                                  org_img_shape[3] - ii[2]] if len(ii) == 4 else
+                                 [ii[0], ii[1],
+                                  org_img_shape[3] - ii[3],
+                                  org_img_shape[3] - ii[2],
+                                  ii[4], ii[5]] for ii in patch_crops])
+
+    # y-axis and x-axis transform.
+ mirrored_patch_crops.append([[org_img_shape[2] - ii[1], + org_img_shape[2] - ii[0], + org_img_shape[3] - ii[3], + org_img_shape[3] - ii[2]] if len(ii) == 4 else + [org_img_shape[2] - ii[1], + org_img_shape[2] - ii[0], + org_img_shape[3] - ii[3], + org_img_shape[3] - ii[2], + ii[4], ii[5]] for ii in patch_crops]) + + return mirrored_patch_crops + + + diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..7e3d8a7 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,8 @@ +cffi==1.11.5 +matplotlib==3.0.0 +numpy==1.15.3 +pandas==0.23.4 +scikit-learn==0.20.0 +sklearn==0.0 +torch==0.4.1 + diff --git a/setup.py b/setup.py new file mode 100644 index 0000000..79ea0a1 --- /dev/null +++ b/setup.py @@ -0,0 +1,33 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================== + +from distutils.core import setup +from setuptools import find_packages + +req_file = "requirements.txt" + +def parse_requirements(filename): + lineiter = (line.strip() for line in open(filename)) + return [line for line in lineiter if line and not line.startswith("#")] + +install_reqs = parse_requirements(req_file) + +setup(name='model', + version='latest', + packages=find_packages(exclude=['test', 'test.*']), + install_requires=install_reqs, + dependency_links=[], + ) \ No newline at end of file diff --git a/utils/__pycache__/dataloader_utils.cpython-35.pyc b/utils/__pycache__/dataloader_utils.cpython-35.pyc new file mode 100644 index 0000000..7f3ab5b Binary files /dev/null and b/utils/__pycache__/dataloader_utils.cpython-35.pyc differ diff --git a/utils/__pycache__/dataloader_utils.cpython-36.pyc b/utils/__pycache__/dataloader_utils.cpython-36.pyc new file mode 100644 index 0000000..8d657a5 Binary files /dev/null and b/utils/__pycache__/dataloader_utils.cpython-36.pyc differ diff --git a/utils/__pycache__/exp_utils.cpython-35.pyc b/utils/__pycache__/exp_utils.cpython-35.pyc new file mode 100644 index 0000000..e1d8a6c Binary files /dev/null and b/utils/__pycache__/exp_utils.cpython-35.pyc differ diff --git a/utils/__pycache__/exp_utils.cpython-36.pyc b/utils/__pycache__/exp_utils.cpython-36.pyc new file mode 100644 index 0000000..a3a7f35 Binary files /dev/null and b/utils/__pycache__/exp_utils.cpython-36.pyc differ diff --git a/utils/__pycache__/model_utils.cpython-35.pyc b/utils/__pycache__/model_utils.cpython-35.pyc new file mode 100644 index 0000000..661a944 Binary files /dev/null and b/utils/__pycache__/model_utils.cpython-35.pyc differ diff --git a/utils/__pycache__/model_utils.cpython-36.pyc b/utils/__pycache__/model_utils.cpython-36.pyc new file mode 100644 index 0000000..5d5d3f1 Binary files /dev/null and b/utils/__pycache__/model_utils.cpython-36.pyc differ diff --git 
a/utils/__pycache__/mrcnn_utils.cpython-36.pyc b/utils/__pycache__/mrcnn_utils.cpython-36.pyc new file mode 100644 index 0000000..479538d Binary files /dev/null and b/utils/__pycache__/mrcnn_utils.cpython-36.pyc differ diff --git a/utils/dataloader_utils.py b/utils/dataloader_utils.py new file mode 100644 index 0000000..530f311 --- /dev/null +++ b/utils/dataloader_utils.py @@ -0,0 +1,277 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import numpy as np +import os +from multiprocessing import Pool + + + +def get_class_balanced_patients(class_targets, batch_size, num_classes, slack_factor=0.1): + ''' + samples patients towards equilibrium of classes on a roi-level. For highly imbalanced datasets, this might be a too strong requirement. + Hence a slack factor determines the ratio of the batch, that is randomly sampled, before class-balance is triggered. + :param class_targets: list of patient targets. where each patient target is a list of class labels of respective rois. + :param batch_size: + :param num_classes: + :param slack_factor: + :return: batch_ixs: list of indices referring to a subset in class_targets-list, sampled to build one batch. 
+    '''
+    batch_ixs = []
+    class_count = {k: 0 for k in range(num_classes)}
+    weakest_class = 0
+    for ix in range(batch_size):
+
+        keep_looking = True
+        while keep_looking:
+            # choose a random patient.
+            cand = np.random.choice(len(class_targets), 1)[0]
+            # check the least occurring class among this patient's rois.
+            tmp_weakest_class = np.argmin([class_targets[cand].count(ii) for ii in range(num_classes)])
+            # if the current batch is already bigger than the slack_factor ratio, check that the weakest class of this
+            # patient is not the weakest in the current batch (since that one needs to be boosted), and that at least
+            # one roi of this patient belongs to the weakest class. If True, keep the patient, else keep looking.
+            if (tmp_weakest_class != weakest_class and class_targets[cand].count(weakest_class) > 0) or ix < int(batch_size * slack_factor):
+                keep_looking = False
+
+        for c in range(num_classes):
+            class_count[c] += class_targets[cand].count(c)
+        weakest_class = np.argmin(([class_count[c] for c in range(num_classes)]))
+        batch_ixs.append(cand)
+
+    return batch_ixs
+
+
+
+class fold_generator:
+    """
+    generates splits of indices for a given length of a dataset to perform n-fold cross-validation.
+    splits each fold into 3 subsets for training, validation and testing.
+    This form of cross-validation uses an inner-loop test set, which is useful if test scores shall be reported on a
+    statistically reliable number of patients, despite the limited size of a dataset.
+    If a hold-out test set is provided and hence no inner-loop test set is needed, just add test_idxs to the training
+    data in the dataloader. This creates straightforward train-val splits.
+    :returns names_list: list of len n_splits. each element is a list of len 3 for train_ix, val_ix, test_ix.
+    """
+    def __init__(self, seed, n_splits, len_data):
+        """
+        :param seed: random seed for splits.
+        :param n_splits: number of splits, e.g. 5 splits for 5-fold cross-validation.
+        :param len_data: number of elements in the dataset.
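As a toy illustration of the balance check inside `get_class_balanced_patients` above (hypothetical targets; this re-implements only the per-patient counting, not the full sampling loop):

```python
import numpy as np

# Hypothetical roi-level targets: one list of class labels per patient.
class_targets = [[0, 0, 1], [0], [1, 1], [0, 1]]
num_classes = 2

# per-patient class counts, as used to find each patient's weakest class.
counts = [[t.count(c) for c in range(num_classes)] for t in class_targets]
weakest_per_patient = [int(np.argmin(c)) for c in counts]
print(weakest_per_patient)  # patient 1 has no class-1 rois -> its weakest class is 1
```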
+ """ + self.tr_ix = [] + self.val_ix = [] + self.te_ix = [] + self.slicer = None + self.missing = 0 + self.fold = 0 + self.len_data = len_data + self.n_splits = n_splits + self.myseed = seed + self.boost_val = 0 + + def init_indices(self): + + t = list(np.arange(self.l)) + # round up to next splittable data amount. + split_length = int(np.ceil(len(t) / float(self.n_splits))) + self.slicer = split_length + self.mod = len(t) % self.n_splits + if self.mod > 0: + # missing is the number of folds, in which the new splits are reduced to account for missing data. + self.missing = self.n_splits - self.mod + + self.te_ix = t[:self.slicer] + self.tr_ix = t[self.slicer:] + self.val_ix = self.tr_ix[:self.slicer] + self.tr_ix = self.tr_ix[self.slicer:] + + def new_fold(self): + + slicer = self.slicer + if self.fold < self.missing : + slicer = self.slicer - 1 + + temp = self.te_ix + + # catch exception mod == 1: test set collects 1+ data since walk through both roudned up splits. + # account for by reducing last fold split by 1. + if self.fold == self.n_splits-2 and self.mod ==1: + temp += self.val_ix[-1:] + self.val_ix = self.val_ix[:-1] + + self.te_ix = self.val_ix + self.val_ix = self.tr_ix[:slicer] + self.tr_ix = self.tr_ix[slicer:] + temp + + + def get_fold_names(self): + names_list = [] + rgen = np.random.RandomState(self.myseed) + cv_names = np.arange(self.len_data) + + rgen.shuffle(cv_names) + self.l = len(cv_names) + self.init_indices() + + for split in range(self.n_splits): + train_names, val_names, test_names = cv_names[self.tr_ix], cv_names[self.val_ix], cv_names[self.te_ix] + names_list.append([train_names, val_names, test_names, self.fold]) + self.new_fold() + self.fold += 1 + + return names_list + + + +def get_patch_crop_coords(img, patch_size, min_overlap=30): + """ + + _:param img (y, x, (z)) + _:param patch_size: list of len 2 (2D) or 3 (3D). + _:param min_overlap: minimum required overlap of patches. 
+ If too small, some areas are poorly represented only at edges of single patches. + _:return ndarray: shape (n_patches, 2*dim). crop coordinates for each patch. + """ + crop_coords = [] + for dim in range(len(img.shape)): + n_patches = int(np.ceil(img.shape[dim] / patch_size[dim])) + + # no crops required in this dimension, add image shape as coordinates. + if n_patches == 1: + crop_coords.append([(0, img.shape[dim])]) + continue + + # fix the two outside patches to coords patchsize/2 and interpolate. + center_dists = (img.shape[dim] - patch_size[dim]) / (n_patches - 1) + + if (patch_size[dim] - center_dists) < min_overlap: + n_patches += 1 + center_dists = (img.shape[dim] - patch_size[dim]) / (n_patches - 1) + + patch_centers = np.round([(patch_size[dim] / 2 + (center_dists * ii)) for ii in range(n_patches)]) + dim_crop_coords = [(center - patch_size[dim] / 2, center + patch_size[dim] / 2) for center in patch_centers] + crop_coords.append(dim_crop_coords) + + coords_mesh_grid = [] + for ymin, ymax in crop_coords[0]: + for xmin, xmax in crop_coords[1]: + if len(crop_coords) == 3 and patch_size[2] > 1: + for zmin, zmax in crop_coords[2]: + coords_mesh_grid.append([ymin, ymax, xmin, xmax, zmin, zmax]) + elif len(crop_coords) == 3 and patch_size[2] == 1: + for zmin in range(img.shape[2]): + coords_mesh_grid.append([ymin, ymax, xmin, xmax, zmin, zmin + 1]) + else: + coords_mesh_grid.append([ymin, ymax, xmin, xmax]) + return np.array(coords_mesh_grid).astype(int) + + + +def pad_nd_image(image, new_shape=None, mode="edge", kwargs=None, return_slicer=False, shape_must_be_divisible_by=None): + """ + one padder to pad them all. Documentation? Well okay. A little bit. by Fabian Isensee + + :param image: nd image. can be anything + :param new_shape: what shape do you want? new_shape does not have to have the same dimensionality as image. If + len(new_shape) < len(image.shape) then the last axes of image will be padded. 
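The per-dimension center interpolation performed by `get_patch_crop_coords` above can be sketched in isolation (hypothetical sizes; a standalone re-derivation, not a call into this module):

```python
import numpy as np

# Hypothetical 1D setting: image extent 300, patch size 128, min_overlap 30.
dim_len, patch_size, min_overlap = 300, 128, 30

n_patches = int(np.ceil(dim_len / patch_size))            # 3 patches
center_dists = (dim_len - patch_size) / (n_patches - 1)   # distance between patch centers
if (patch_size - center_dists) < min_overlap:             # overlap too small -> add a patch
    n_patches += 1
    center_dists = (dim_len - patch_size) / (n_patches - 1)

centers = np.round([patch_size / 2 + center_dists * i for i in range(n_patches)])
coords = [(c - patch_size / 2, c + patch_size / 2) for c in centers]
print(coords)  # outer patches sit flush with the image borders
```

Here the three patches cover [0, 128], [86, 214] and [172, 300], with 42 voxels of overlap between neighbors, above the requested minimum.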
If new_shape < image.shape in any of + the axes then we will not pad that axis, but also not crop! (interpret new_shape as new_min_shape) + Example: + image.shape = (10, 1, 512, 512); new_shape = (768, 768) -> result: (10, 1, 768, 768). Cool, huh? + image.shape = (10, 1, 512, 512); new_shape = (364, 768) -> result: (10, 1, 512, 768). + + :param mode: see np.pad for documentation + :param return_slicer: if True then this function will also return what coords you will need to use when cropping back + to original shape + :param shape_must_be_divisible_by: for network prediction. After applying new_shape, make sure the new shape is + divisibly by that number (can also be a list with an entry for each axis). Whatever is missing to match that will + be padded (so the result may be larger than new_shape if shape_must_be_divisible_by is not None) + :param kwargs: see np.pad for documentation + """ + if kwargs is None: + kwargs = {} + + if new_shape is not None: + old_shape = np.array(image.shape[-len(new_shape):]) + else: + assert shape_must_be_divisible_by is not None + assert isinstance(shape_must_be_divisible_by, (list, tuple, np.ndarray)) + new_shape = image.shape[-len(shape_must_be_divisible_by):] + old_shape = new_shape + + num_axes_nopad = len(image.shape) - len(new_shape) + + new_shape = [max(new_shape[i], old_shape[i]) for i in range(len(new_shape))] + + if not isinstance(new_shape, np.ndarray): + new_shape = np.array(new_shape) + + if shape_must_be_divisible_by is not None: + if not isinstance(shape_must_be_divisible_by, (list, tuple, np.ndarray)): + shape_must_be_divisible_by = [shape_must_be_divisible_by] * len(new_shape) + else: + assert len(shape_must_be_divisible_by) == len(new_shape) + + for i in range(len(new_shape)): + if new_shape[i] % shape_must_be_divisible_by[i] == 0: + new_shape[i] -= shape_must_be_divisible_by[i] + + new_shape = np.array([new_shape[i] + shape_must_be_divisible_by[i] - new_shape[i] % shape_must_be_divisible_by[i] for i in 
range(len(new_shape))]) + + difference = new_shape - old_shape + pad_below = difference // 2 + pad_above = difference // 2 + difference % 2 + pad_list = [[0, 0]]*num_axes_nopad + list([list(i) for i in zip(pad_below, pad_above)]) + res = np.pad(image, pad_list, mode, **kwargs) + if not return_slicer: + return res + else: + pad_list = np.array(pad_list) + pad_list[:, 1] = np.array(res.shape) - pad_list[:, 1] + slicer = list(slice(*i) for i in pad_list) + return res, slicer + + +############################# +# data packing / unpacking # +############################# + +def get_case_identifiers(folder): + case_identifiers = [i[:-4] for i in os.listdir(folder) if i.endswith("npz")] + return case_identifiers + + +def convert_to_npy(npz_file): + if not os.path.isfile(npz_file[:-3] + "npy"): + a = np.load(npz_file)['data'] + np.save(npz_file[:-3] + "npy", a) + + +def unpack_dataset(folder, threads=8): + case_identifiers = get_case_identifiers(folder) + p = Pool(threads) + npz_files = [os.path.join(folder, i + ".npz") for i in case_identifiers] + p.map(convert_to_npy, npz_files) + p.close() + p.join() + + +def delete_npy(folder): + case_identifiers = get_case_identifiers(folder) + npy_files = [os.path.join(folder, i + ".npy") for i in case_identifiers] + npy_files = [i for i in npy_files if os.path.isfile(i)] + for n in npy_files: + os.remove(n) \ No newline at end of file diff --git a/utils/exp_utils.py b/utils/exp_utils.py new file mode 100644 index 0000000..852f4ef --- /dev/null +++ b/utils/exp_utils.py @@ -0,0 +1,322 @@ +#!/usr/bin/env python +# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ). +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================== + +import numpy as np +import logging +import subprocess +import os +import torch +from collections import OrderedDict +import plotting +import sys +import importlib.util +import pandas as pd + + + +def get_logger(exp_dir): + """ + creates logger instance. writing out info to file and to terminal. + :param exp_dir: experiment directory, where exec.log file is stored. + :return: logger instance. + """ + + logger = logging.getLogger('medicaldetectiontoolkit') + logger.setLevel(logging.DEBUG) + log_file = exp_dir + '/exec.log' + hdlr = logging.FileHandler(log_file) + print('Logging to {}'.format(log_file)) + logger.addHandler(hdlr) + logger.addHandler(ColorHandler()) + logger.propagate = False + return logger + + + +def prep_exp(dataset_path, exp_path, server_env, use_stored_settings=True, is_training=True): + """ + I/O handling, creating of experiment folder structure. Also creates a snapshot of configs/model scripts and copies them to the exp_dir. + This way the exp_dir contains all info needed to conduct an experiment, independent to changes in actual source code. Thus, training/inference of this experiment can be started at anytime. Therefore, the model script is copied back to the source code dir as tmp_model (tmp_backbone). + Provides robust structure for cloud deployment. + :param dataset_path: path to source code for specific data set. (e.g. medicaldetectiontoolkit/lidc_exp) + :param exp_path: path to experiment directory. + :param server_env: boolean flag. 
pass to configs script for cloud deployment.
+    :param use_stored_settings: boolean flag. when starting training: if True, starts training from the snapshot in the
+    existing experiment directory, else creates the experiment directory on the fly using configs/model scripts from the source code.
+    :param is_training: boolean flag. distinguishes train vs. inference mode.
+    :return:
+    """
+
+    if is_training:
+
+        # the first process of an experiment creates the directories and copies the config to exp_path.
+        if not os.path.exists(exp_path):
+            os.mkdir(exp_path)
+            os.mkdir(os.path.join(exp_path, 'plots'))
+            subprocess.call('cp {} {}'.format(os.path.join(dataset_path, 'configs.py'), os.path.join(exp_path, 'configs.py')), shell=True)
+            subprocess.call('cp {} {}'.format('default_configs.py', os.path.join(exp_path, 'default_configs.py')), shell=True)
+
+
+        if use_stored_settings:
+            subprocess.call('cp {} {}'.format('default_configs.py', os.path.join(exp_path, 'default_configs.py')), shell=True)
+            cf_file = import_module('cf', os.path.join(exp_path, 'configs.py'))
+            cf = cf_file.configs(server_env)
+            # only the first process copies the model selected in configs to exp_path.
+            if not os.path.isfile(os.path.join(exp_path, 'model.py')):
+                subprocess.call('cp {} {}'.format(cf.model_path, os.path.join(exp_path, 'model.py')), shell=True)
+                subprocess.call('cp {} {}'.format(os.path.join(cf.backbone_path), os.path.join(exp_path, 'backbone.py')), shell=True)
+
+            # copy the snapshot model scripts from exp_dir back to the source_dir as tmp_model / tmp_backbone.
+ tmp_model_path = os.path.join(cf.source_dir, 'models', 'tmp_model.py') + tmp_backbone_path = os.path.join(cf.source_dir, 'models', 'tmp_backbone.py') + subprocess.call('cp {} {}'.format(os.path.join(exp_path, 'model.py'), tmp_model_path), shell=True) + subprocess.call('cp {} {}'.format(os.path.join(exp_path, 'backbone.py'), tmp_backbone_path), shell=True) + cf.model_path = tmp_model_path + cf.backbone_path = tmp_backbone_path + + else: + # run training with source code info and copy snapshot of model to exp_dir for later testing (overwrite scripts if exp_dir already exists.) + cf_file = import_module('cf', os.path.join(dataset_path, 'configs.py')) + cf = cf_file.configs(server_env) + subprocess.call('cp {} {}'.format(cf.model_path, os.path.join(exp_path, 'model.py')), shell=True) + subprocess.call('cp {} {}'.format(cf.backbone_path, os.path.join(exp_path, 'backbone.py')), shell=True) + subprocess.call('cp {} {}'.format('default_configs.py', os.path.join(exp_path, 'default_configs.py')), shell=True) + subprocess.call('cp {} {}'.format(os.path.join(dataset_path, 'configs.py'), os.path.join(exp_path, 'configs.py')), shell=True) + + else: + # for testing copy the snapshot model scripts from exp_dir back to the source_dir as tmp_model / tmp_backbone. 
+ cf_file = import_module('cf', os.path.join(exp_path, 'configs.py')) + cf = cf_file.configs(server_env) + if cf.hold_out_test_set: + cf.pp_data_path = cf.pp_test_data_path + cf.pp_name = cf.pp_test_name + tmp_model_path = os.path.join(cf.source_dir, 'models', 'tmp_model.py') + tmp_backbone_path = os.path.join(cf.source_dir, 'models', 'tmp_backbone.py') + subprocess.call('cp {} {}'.format(os.path.join(exp_path, 'model.py'), tmp_model_path), shell=True) + subprocess.call('cp {} {}'.format(os.path.join(exp_path, 'backbone.py'), tmp_backbone_path), shell=True) + cf.model_path = tmp_model_path + cf.backbone_path = tmp_backbone_path + + cf.exp_dir = exp_path + cf.test_dir = os.path.join(cf.exp_dir, 'test') + cf.plot_dir = os.path.join(cf.exp_dir, 'plots') + cf.experiment_name = exp_path.split("/")[-1] + cf.server_env = server_env + cf.created_fold_id_pickle = False + + return cf + + + +def import_module(name, path): + """ + correct way of importing a module dynamically in python 3. + :param name: name given to module instance. + :param path: path to module. + :return: module: returned module instance. + """ + spec = importlib.util.spec_from_file_location(name, path) + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + return module + + + +class ModelSelector: + ''' + saves a checkpoint after each epoch as 'last_state' (can be loaded to continue interrupted training). + saves the top-k (k=cf.save_n_models) ranked epochs. In inference, predictions of multiple epochs can be ensembled to improve performance. 
+    '''
+
+    def __init__(self, cf, logger):
+
+        self.cf = cf
+        self.saved_epochs = [-1] * cf.save_n_models
+        self.logger = logger
+
+    def run_model_selection(self, net, optimizer, monitor_metrics, epoch):
+
+        # take the mean over all selection criteria in each epoch.
+        non_nan_scores = np.mean(np.array([[0 if ii is None else ii for ii in monitor_metrics['val'][sc]] for sc in self.cf.model_selection_criteria]), 0)
+        print('non-None scores:', non_nan_scores)
+        epochs_scores = [ii for ii in non_nan_scores[1:]]
+        # ranking of epochs according to the model_selection_criteria.
+        epoch_ranking = np.argsort(epochs_scores)[::-1] + 1  # epochs start at 1
+        # if set in configs, epochs < min_save_thresh are discarded from the saving process.
+        epoch_ranking = epoch_ranking[epoch_ranking >= self.cf.min_save_thresh]
+
+        # check if the current epoch is among the top-k epochs.
+        if epoch in epoch_ranking[:self.cf.save_n_models]:
+            torch.save(net.state_dict(), os.path.join(self.cf.fold_dir, '{}_best_params.pth'.format(epoch)))
+            # save epoch_ranking to keep the info for inference.
+            np.save(os.path.join(self.cf.fold_dir, 'epoch_ranking'), epoch_ranking[:self.cf.save_n_models])
+            self.logger.info(
+                "saving current epoch {} at rank {}".format(epoch, np.argwhere(epoch_ranking == epoch)))
+            # delete the params of the epoch that just fell out of the top-k epochs.
+            for se in [int(ii.split('_')[0]) for ii in os.listdir(self.cf.fold_dir) if 'best_params' in ii]:
+                if se in epoch_ranking[self.cf.save_n_models:]:
+                    subprocess.call('rm {}'.format(os.path.join(self.cf.fold_dir, '{}_best_params.pth'.format(se))), shell=True)
+                    self.logger.info('deleting epoch {} at rank {}'.format(se, np.argwhere(epoch_ranking == se)))
+
+        state = {
+            'epoch': epoch,
+            'state_dict': net.state_dict(),
+            'optimizer': optimizer.state_dict(),
+        }
+
+        torch.save(state, os.path.join(self.cf.fold_dir, 'last_state.pth'))
+
+
+
+def load_checkpoint(checkpoint_path, net, optimizer):
+
+    checkpoint = torch.load(checkpoint_path)
+    net.load_state_dict(checkpoint['state_dict'])
+    optimizer.load_state_dict(checkpoint['optimizer'])
+    return checkpoint['epoch']
+
+
+
+def prepare_monitoring(cf):
+    """
+    creates the dictionaries where train/val metrics are stored.
+    """
+    metrics = {}
+    # first entry in the loss dict accounts for epochs starting at 1.
+    metrics['train'] = OrderedDict()
+    metrics['val'] = OrderedDict()
+    metric_classes = []
+    if 'rois' in cf.report_score_level:
+        metric_classes.extend([v for k, v in cf.class_dict.items()])
+    if 'patient' in cf.report_score_level:
+        metric_classes.extend(['patient'])
+    for cl in metric_classes:
+        metrics['train'][cl + '_ap'] = [None]
+        metrics['val'][cl + '_ap'] = [None]
+        if cl == 'patient':
+            metrics['train'][cl + '_auc'] = [None]
+            metrics['val'][cl + '_auc'] = [None]
+
+    metrics['train']['monitor_values'] = [[] for _ in range(cf.num_epochs + 1)]
+    metrics['val']['monitor_values'] = [[] for _ in range(cf.num_epochs + 1)]
+
+    # generate an instance of the monitor plot class.
+    TrainingPlot = plotting.TrainingPlot_2Panel(cf)
+
+    return metrics, TrainingPlot
+
+
+
+def create_csv_output(cf, logger, results_list):
+    """
+    Writes out test-set predictions to a .csv file. The output format is one line per patient:
+    PatientID score pred_class x y w h score pred_class x y w h .....
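The per-box serialization used by `create_csv_output` can be sketched for a single hypothetical 2D detection (invented numbers; variable names mirror the loop in the function body):

```python
# Hypothetical 2D detection in (y1, x1, y2, x2) order.
coords = [10.0, 20.0, 50.0, 80.0]
score, pred_class = 0.93, 1

x, y = coords[1], coords[0]
width, height = coords[3] - coords[1], coords[2] - coords[0]

# 2D format: 'score pred_class x y w h ' per box, concatenated per patient.
prediction_string = '{} {} {} {} {} {} '.format(score, pred_class, x, y, width, height)
print(repr(prediction_string))
```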
+ :param results_list: [[patient_results, patient_id], [patient_results, patient_id], ...] + """ + logger.info('creating csv output file at {}'.format(os.path.join(cf.exp_dir, 'output.csv'))) + submission_df = pd.DataFrame(columns=['patientID', 'PredictionString']) + for r in results_list: + pid = r[1] + prediction_string = '' + for box in r[0][0]: + coords = box['box_coords'] + score = box['box_score'] + pred_class = box['box_pred_class_id'] + + if score >= cf.min_det_thresh: + x = coords[1] #* cf.pp_downsample_factor + y = coords[0] #* cf.pp_downsample_factor + width = (coords[3] - coords[1]) #* cf.pp_downsample_factor + height = (coords[2] - coords[0]) #* cf.pp_downsample_factor + if len(coords) == 6: + z = coords[4] + depth = (coords[5] - coords[4]) + prediction_string += '{} {} {} {} {} {} {} {}'.format(score, pred_class, x, y, z, width, height, depth) + else: + prediction_string += '{} {} {} {} {} {} '.format(score, pred_class, x, y, width, height) + + if prediction_string == '': + prediction_string = None + submission_df.loc[len(submission_df)] = [pid, prediction_string] + submission_df.to_csv(os.path.join(cf.exp_dir, 'output.csv'), index=False) + + + +class _AnsiColorizer(object): + """ + A colorizer is an object that loosely wraps around a stream, allowing + callers to write text to the stream in a particular color. + + Colorizer classes must implement C{supported()} and C{write(text, color)}. + """ + _colors = dict(black=30, red=31, green=32, yellow=33, + blue=34, magenta=35, cyan=36, white=37, default=39) + + def __init__(self, stream): + self.stream = stream + + @classmethod + def supported(cls, stream=sys.stdout): + """ + A class method that returns True if the current platform supports + coloring terminal output using this method. Returns False otherwise. 
+        """
+        if not stream.isatty():
+            return False  # auto color only on TTYs
+        try:
+            import curses
+        except ImportError:
+            return False
+        else:
+            try:
+                try:
+                    return curses.tigetnum("colors") > 2
+                except curses.error:
+                    curses.setupterm()
+                    return curses.tigetnum("colors") > 2
+            except Exception:
+                # guess false in case of error
+                return False
+
+    def write(self, text, color):
+        """
+        Write the given text to the stream in the given color.
+
+        @param text: Text to be written to the stream.
+
+        @param color: A string label for a color. e.g. 'red', 'white'.
+        """
+        color = self._colors[color]
+        self.stream.write('\x1b[%sm%s\x1b[0m' % (color, text))
+
+
+
+class ColorHandler(logging.StreamHandler):
+
+    def __init__(self, stream=sys.stdout):
+        super(ColorHandler, self).__init__(_AnsiColorizer(stream))
+
+    def emit(self, record):
+        msg_colors = {
+            logging.DEBUG: "green",
+            logging.INFO: "default",
+            logging.WARNING: "red",
+            logging.ERROR: "red"
+        }
+        color = msg_colors.get(record.levelno, "blue")
+        self.stream.write(record.msg + "\n", color)
+
diff --git a/utils/model_utils.py b/utils/model_utils.py
new file mode 100644
index 0000000..a150a07
--- /dev/null
+++ b/utils/model_utils.py
@@ -0,0 +1,889 @@
+#!/usr/bin/env python
+# Copyright 2018 Division of Medical Image Computing, German Cancer Research Center (DKFZ).
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ==============================================================================
+
+"""
+Parts are based on https://github.com/multimodallearning/pytorch-mask-rcnn
+published under MIT license.
+"""
+
+import numpy as np
+import scipy.misc
+import scipy.ndimage
+import torch
+from torch.autograd import Variable
+import torch.nn as nn
+
+
+############################################################
+# Bounding Boxes
+############################################################
+
+
+def compute_iou_2D(box, boxes, box_area, boxes_area):
+    """Calculates IoU of the given box with the array of the given boxes.
+    box: 1D vector [y1, x1, y2, x2] THIS IS THE GT BOX
+    boxes: [boxes_count, (y1, x1, y2, x2)]
+    box_area: float. the area of 'box'
+    boxes_area: array of length boxes_count.
+
+    Note: the areas are passed in rather than calculated here for
+    efficiency. Calculate once in the caller to avoid duplicate work.
+    """
+    # Calculate intersection areas
+    y1 = np.maximum(box[0], boxes[:, 0])
+    y2 = np.minimum(box[2], boxes[:, 2])
+    x1 = np.maximum(box[1], boxes[:, 1])
+    x2 = np.minimum(box[3], boxes[:, 3])
+    intersection = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
+    union = box_area + boxes_area[:] - intersection[:]
+    iou = intersection / union
+
+    return iou
+
+
+
+def compute_iou_3D(box, boxes, box_volume, boxes_volume):
+    """Calculates IoU of the given box with the array of the given boxes.
+    box: 1D vector [y1, x1, y2, x2, z1, z2] (typically gt box)
+    boxes: [boxes_count, (y1, x1, y2, x2, z1, z2)]
+    box_volume: float. the volume of 'box'
+    boxes_volume: array of length boxes_count.
+
+    Note: the volumes are passed in rather than calculated here for
+    efficiency. Calculate once in the caller to avoid duplicate work.
+ """ + # Calculate intersection areas + y1 = np.maximum(box[0], boxes[:, 0]) + y2 = np.minimum(box[2], boxes[:, 2]) + x1 = np.maximum(box[1], boxes[:, 1]) + x2 = np.minimum(box[3], boxes[:, 3]) + z1 = np.maximum(box[4], boxes[:, 4]) + z2 = np.minimum(box[5], boxes[:, 5]) + intersection = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0) * np.maximum(z2 - z1, 0) + union = box_volume + boxes_volume[:] - intersection[:] + iou = intersection / union + + return iou + + + +def compute_overlaps(boxes1, boxes2): + """Computes IoU overlaps between two sets of boxes. + boxes1, boxes2: [N, (y1, x1, y2, x2)]. / 3D: (z1, z2)) + For better performance, pass the largest set first and the smaller second. + """ + # Areas of anchors and GT boxes + if boxes1.shape[1] == 4: + area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1]) + area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1]) + # Compute overlaps to generate matrix [boxes1 count, boxes2 count] + # Each cell contains the IoU value. + overlaps = np.zeros((boxes1.shape[0], boxes2.shape[0])) + for i in range(overlaps.shape[1]): + box2 = boxes2[i] #this is the gt box + overlaps[:, i] = compute_iou_2D(box2, boxes1, area2[i], area1) + return overlaps + + else: + # Areas of anchors and GT boxes + volume1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1]) * (boxes1[:, 5] - boxes1[:, 4]) + volume2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1]) * (boxes2[:, 5] - boxes2[:, 4]) + # Compute overlaps to generate matrix [boxes1 count, boxes2 count] + # Each cell contains the IoU value. + overlaps = np.zeros((boxes1.shape[0], boxes2.shape[0])) + for i in range(overlaps.shape[1]): + box2 = boxes2[i] # this is the gt box + overlaps[:, i] = compute_iou_3D(box2, boxes1, volume2[i], volume1) + return overlaps + + + +def box_refinement(box, gt_box): + """Compute refinement needed to transform box to gt_box. 
+    box and gt_box are [N, (y1, x1, y2, x2)] / 3D: (z1, z2))
+    """
+    height = box[:, 2] - box[:, 0]
+    width = box[:, 3] - box[:, 1]
+    center_y = box[:, 0] + 0.5 * height
+    center_x = box[:, 1] + 0.5 * width
+
+    gt_height = gt_box[:, 2] - gt_box[:, 0]
+    gt_width = gt_box[:, 3] - gt_box[:, 1]
+    gt_center_y = gt_box[:, 0] + 0.5 * gt_height
+    gt_center_x = gt_box[:, 1] + 0.5 * gt_width
+
+    dy = (gt_center_y - center_y) / height
+    dx = (gt_center_x - center_x) / width
+    dh = torch.log(gt_height / height)
+    dw = torch.log(gt_width / width)
+    result = torch.stack([dy, dx, dh, dw], dim=1)
+
+    if box.shape[1] > 4:
+        depth = box[:, 5] - box[:, 4]
+        center_z = box[:, 4] + 0.5 * depth
+        gt_depth = gt_box[:, 5] - gt_box[:, 4]
+        gt_center_z = gt_box[:, 4] + 0.5 * gt_depth
+        dz = (gt_center_z - center_z) / depth
+        dd = torch.log(gt_depth / depth)
+        result = torch.stack([dy, dx, dz, dh, dw, dd], dim=1)
+
+    return result
+
+
+
+def unmold_mask_2D(mask, bbox, image_shape):
+    """Converts a mask generated by the neural network into a format similar
+    to its original shape.
+    mask: [height, width] of type float. A small, typically 28x28 mask.
+    bbox: [y1, x1, y2, x2]. The box to fit the mask in.
+
+    Returns a binary mask with the same size as the original image.
+    """
+    y1, x1, y2, x2 = bbox
+    out_zoom = [y2 - y1, x2 - x1]
+    zoom_factor = [i / j for i, j in zip(out_zoom, mask.shape)]
+    mask = scipy.ndimage.zoom(mask, zoom_factor, order=1).astype(np.float32)
+
+    # Put the mask in the right location.
+    full_mask = np.zeros(image_shape[:2])
+    full_mask[y1:y2, x1:x2] = mask
+    return full_mask
+
+
+
+def unmold_mask_3D(mask, bbox, image_shape):
+    """Converts a mask generated by the neural network into a format similar
+    to its original shape.
+    mask: [height, width, depth] of type float. A small, low-resolution mask.
+    bbox: [y1, x1, y2, x2, z1, z2]. The box to fit the mask in.
+
+    Returns a binary mask with the same size as the original image.
+ """ + y1, x1, y2, x2, z1, z2 = bbox + out_zoom = [y2 - y1, x2 - x1, z2 - z1] + zoom_factor = [i/j for i,j in zip(out_zoom, mask.shape)] + mask = scipy.ndimage.zoom(mask, zoom_factor, order=1).astype(np.float32) + + # Put the mask in the right location. + full_mask = np.zeros(image_shape[:3]) + full_mask[y1:y2, x1:x2, z1:z2] = mask + return full_mask + + +############################################################ +# Anchors +############################################################ + +def generate_anchors(scales, ratios, shape, feature_stride, anchor_stride): + """ + scales: 1D array of anchor sizes in pixels. Example: [32, 64, 128] + ratios: 1D array of anchor ratios of width/height. Example: [0.5, 1, 2] + shape: [height, width] spatial shape of the feature map over which + to generate anchors. + feature_stride: Stride of the feature map relative to the image in pixels. + anchor_stride: Stride of anchors on the feature map. For example, if the + value is 2 then generate anchors for every other feature map pixel. 
+    """
+    # Get all combinations of scales and ratios
+    scales, ratios = np.meshgrid(np.array(scales), np.array(ratios))
+    scales = scales.flatten()
+    ratios = ratios.flatten()
+
+    # Enumerate heights and widths from scales and ratios
+    heights = scales / np.sqrt(ratios)
+    widths = scales * np.sqrt(ratios)
+
+    # Enumerate shifts in feature space
+    shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride
+    shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride
+    shifts_x, shifts_y = np.meshgrid(shifts_x, shifts_y)
+
+    # Enumerate combinations of shifts, widths, and heights
+    box_widths, box_centers_x = np.meshgrid(widths, shifts_x)
+    box_heights, box_centers_y = np.meshgrid(heights, shifts_y)
+
+    # Reshape to get a list of (y, x) and a list of (h, w)
+    box_centers = np.stack(
+        [box_centers_y, box_centers_x], axis=2).reshape([-1, 2])
+    box_sizes = np.stack([box_heights, box_widths], axis=2).reshape([-1, 2])
+
+    # Convert to corner coordinates (y1, x1, y2, x2)
+    boxes = np.concatenate([box_centers - 0.5 * box_sizes,
+                            box_centers + 0.5 * box_sizes], axis=1)
+    return boxes
+
+
+
+def generate_anchors_3D(scales_xy, scales_z, ratios, shape, feature_stride_xy, feature_stride_z, anchor_stride):
+    """
+    scales_xy: 1D array of in-plane anchor sizes in pixels. Example: [32, 64, 128]
+    scales_z: 1D array of anchor sizes along z in pixels.
+    ratios: 1D array of anchor ratios of width/height. Example: [0.5, 1, 2]
+    shape: [height, width, depth] spatial shape of the feature map over which
+           to generate anchors.
+    feature_stride_xy: In-plane stride of the feature map relative to the image in pixels.
+    feature_stride_z: Stride of the feature map along z relative to the image in pixels.
+    anchor_stride: Stride of anchors on the feature map. For example, if the
+                   value is 2 then generate anchors for every other feature map pixel.
+ """ + # Get all combinations of scales and ratios + + scales_xy, ratios_meshed = np.meshgrid(np.array(scales_xy), np.array(ratios)) + scales_xy = scales_xy.flatten() + ratios_meshed = ratios_meshed.flatten() + + # Enumerate heights and widths from scales and ratios + heights = scales_xy / np.sqrt(ratios_meshed) + widths = scales_xy * np.sqrt(ratios_meshed) + depths = np.tile(np.array(scales_z), len(ratios_meshed)//np.array(scales_z)[..., None].shape[0]) + + # Enumerate shifts in feature space + shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride_xy #translate from fm positions to input coords. + shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride_xy + shifts_z = np.arange(0, shape[2], anchor_stride) * (feature_stride_z) + shifts_x, shifts_y, shifts_z = np.meshgrid(shifts_x, shifts_y, shifts_z) + + # Enumerate combinations of shifts, widths, and heights + box_widths, box_centers_x = np.meshgrid(widths, shifts_x) + box_heights, box_centers_y = np.meshgrid(heights, shifts_y) + box_depths, box_centers_z = np.meshgrid(depths, shifts_z) + + # Reshape to get a list of (y, x, z) and a list of (h, w, d) + box_centers = np.stack( + [box_centers_y, box_centers_x, box_centers_z], axis=2).reshape([-1, 3]) + box_sizes = np.stack([box_heights, box_widths, box_depths], axis=2).reshape([-1, 3]) + + # Convert to corner coordinates (y1, x1, y2, x2, z1, z2) + boxes = np.concatenate([box_centers - 0.5 * box_sizes, + box_centers + 0.5 * box_sizes], axis=1) + + boxes = np.transpose(np.array([boxes[:, 0], boxes[:, 1], boxes[:, 3], boxes[:, 4], boxes[:, 2], boxes[:, 5]]), axes=(1, 0)) + return boxes + + +def generate_pyramid_anchors(logger, cf): + """Generate anchors at different levels of a feature pyramid. Each scale + is associated with a level of the pyramid, but each ratio is used in + all levels of the pyramid. + + from configs: + :param scales: cf.RPN_ANCHOR_SCALES , e.g. [4, 8, 16, 32] + :param ratios: cf.RPN_ANCHOR_RATIOS , e.g. 
[0.5, 1, 2] + :param feature_shapes: cf.BACKBONE_SHAPES , e.g. [array of shapes per feature map] [80, 40, 20, 10, 5] + :param feature_strides: cf.BACKBONE_STRIDES , e.g. [2, 4, 8, 16, 32, 64] + :param anchors_stride: cf.RPN_ANCHOR_STRIDE , e.g. 1 + :return anchors: (N, (y1, x1, y2, x2, (z1), (z2)). All generated anchors in one array. Sorted + with the same order of the given scales. So, anchors of scale[0] come first, then anchors of scale[1], and so on. + """ + scales = cf.rpn_anchor_scales + ratios = cf.rpn_anchor_ratios + feature_shapes = cf.backbone_shapes + anchor_stride = cf.rpn_anchor_stride + pyramid_levels = cf.pyramid_levels + feature_strides = cf.backbone_strides + + anchors = [] + logger.info("feature map shapes: {}".format(feature_shapes)) + logger.info("anchor scales: {}".format(scales)) + + expected_anchors = [np.prod(feature_shapes[ii]) * len(ratios) * len(scales['xy'][ii]) for ii in pyramid_levels] + + for lix, level in enumerate(pyramid_levels): + if len(feature_shapes[level]) == 2: + anchors.append(generate_anchors(scales['xy'][level], ratios, feature_shapes[level], + feature_strides['xy'][level], anchor_stride)) + else: + anchors.append(generate_anchors_3D(scales['xy'][level], scales['z'][level], ratios, feature_shapes[level], + feature_strides['xy'][level], feature_strides['z'][level], anchor_stride)) + + logger.info("level {}: built anchors {} / expected anchors {} ||| total build {} / total expected {}".format( + level, anchors[-1].shape, expected_anchors[lix], np.concatenate(anchors).shape, np.sum(expected_anchors))) + + out_anchors = np.concatenate(anchors, axis=0) + return out_anchors + + + +def apply_box_deltas_2D(boxes, deltas): + """Applies the given deltas to the given boxes. 
+ boxes: [N, 4] where each row is y1, x1, y2, x2 + deltas: [N, 4] where each row is [dy, dx, log(dh), log(dw)] + """ + # Convert to y, x, h, w + height = boxes[:, 2] - boxes[:, 0] + width = boxes[:, 3] - boxes[:, 1] + center_y = boxes[:, 0] + 0.5 * height + center_x = boxes[:, 1] + 0.5 * width + # Apply deltas + center_y += deltas[:, 0] * height + center_x += deltas[:, 1] * width + height *= torch.exp(deltas[:, 2]) + width *= torch.exp(deltas[:, 3]) + # Convert back to y1, x1, y2, x2 + y1 = center_y - 0.5 * height + x1 = center_x - 0.5 * width + y2 = y1 + height + x2 = x1 + width + result = torch.stack([y1, x1, y2, x2], dim=1) + return result + + + +def apply_box_deltas_3D(boxes, deltas): + """Applies the given deltas to the given boxes. + boxes: [N, 6] where each row is y1, x1, y2, x2, z1, z2 + deltas: [N, 6] where each row is [dy, dx, dz, log(dh), log(dw), log(dd)] + """ + # Convert to y, x, h, w + height = boxes[:, 2] - boxes[:, 0] + width = boxes[:, 3] - boxes[:, 1] + depth = boxes[:, 5] - boxes[:, 4] + center_y = boxes[:, 0] + 0.5 * height + center_x = boxes[:, 1] + 0.5 * width + center_z = boxes[:, 4] + 0.5 * depth + # Apply deltas + center_y += deltas[:, 0] * height + center_x += deltas[:, 1] * width + center_z += deltas[:, 2] * depth + height *= torch.exp(deltas[:, 3]) + width *= torch.exp(deltas[:, 4]) + depth *= torch.exp(deltas[:, 5]) + # Convert back to y1, x1, y2, x2 + y1 = center_y - 0.5 * height + x1 = center_x - 0.5 * width + z1 = center_z - 0.5 * depth + y2 = y1 + height + x2 = x1 + width + z2 = z1 + depth + result = torch.stack([y1, x1, y2, x2, z1, z2], dim=1) + return result + + + +def clip_boxes_2D(boxes, window): + """ + boxes: [N, 4] each col is y1, x1, y2, x2 + window: [4] in the form y1, x1, y2, x2 + """ + boxes = torch.stack( \ + [boxes[:, 0].clamp(float(window[0]), float(window[2])), + boxes[:, 1].clamp(float(window[1]), float(window[3])), + boxes[:, 2].clamp(float(window[0]), float(window[2])), + boxes[:, 3].clamp(float(window[1]), 
float(window[3]))], 1)
+    return boxes
+
+
+def clip_boxes_3D(boxes, window):
+    """
+    boxes: [N, 6] each col is y1, x1, y2, x2, z1, z2
+    window: [6] in the form y1, x1, y2, x2, z1, z2
+    """
+    boxes = torch.stack( \
+        [boxes[:, 0].clamp(float(window[0]), float(window[2])),
+         boxes[:, 1].clamp(float(window[1]), float(window[3])),
+         boxes[:, 2].clamp(float(window[0]), float(window[2])),
+         boxes[:, 3].clamp(float(window[1]), float(window[3])),
+         boxes[:, 4].clamp(float(window[4]), float(window[5])),
+         boxes[:, 5].clamp(float(window[4]), float(window[5]))], 1)
+    return boxes
+
+
+
+def clip_boxes_numpy(boxes, window):
+    """
+    boxes: [N, 4] each col is y1, x1, y2, x2 / [N, 6] in 3D.
+    window: image shape (y, x, (z))
+    """
+    # clip y-coords to the image height (window[0]), x-coords to the width (window[1]),
+    # and z-coords to the depth (window[2]).
+    if boxes.shape[1] == 4:
+        boxes = np.concatenate(
+            (np.clip(boxes[:, 0], 0, window[0])[:, None],
+             np.clip(boxes[:, 1], 0, window[1])[:, None],
+             np.clip(boxes[:, 2], 0, window[0])[:, None],
+             np.clip(boxes[:, 3], 0, window[1])[:, None]), 1
+        )
+
+    else:
+        boxes = np.concatenate(
+            (np.clip(boxes[:, 0], 0, window[0])[:, None],
+             np.clip(boxes[:, 1], 0, window[1])[:, None],
+             np.clip(boxes[:, 2], 0, window[0])[:, None],
+             np.clip(boxes[:, 3], 0, window[1])[:, None],
+             np.clip(boxes[:, 4], 0, window[2])[:, None],
+             np.clip(boxes[:, 5], 0, window[2])[:, None]), 1
+        )
+
+    return boxes
+
+
+
+def bbox_overlaps_2D(boxes1, boxes2):
+    """Computes IoU overlaps between two sets of boxes.
+    boxes1, boxes2: [N, (y1, x1, y2, x2)].
+    """
+    # 1. Tile boxes2 and repeat boxes1. This allows us to compare
+    # every boxes1 against every boxes2 without loops. Simulate an
+    # np.repeat() using repeat() and view().
+    boxes1_repeat = boxes2.size()[0]
+    boxes2_repeat = boxes1.size()[0]
+    boxes1 = boxes1.repeat(1, boxes1_repeat).view(-1, 4)
+    boxes2 = boxes2.repeat(boxes2_repeat, 1)
+
+    # 2.
Compute intersections
+    b1_y1, b1_x1, b1_y2, b1_x2 = boxes1.chunk(4, dim=1)
+    b2_y1, b2_x1, b2_y2, b2_x2 = boxes2.chunk(4, dim=1)
+    y1 = torch.max(b1_y1, b2_y1)[:, 0]
+    x1 = torch.max(b1_x1, b2_x1)[:, 0]
+    y2 = torch.min(b1_y2, b2_y2)[:, 0]
+    x2 = torch.min(b1_x2, b2_x2)[:, 0]
+    zeros = Variable(torch.zeros(y1.size()[0]), requires_grad=False)
+    if y1.is_cuda:
+        zeros = zeros.cuda()
+    intersection = torch.max(x2 - x1, zeros) * torch.max(y2 - y1, zeros)
+
+    # 3. Compute unions
+    b1_area = (b1_y2 - b1_y1) * (b1_x2 - b1_x1)
+    b2_area = (b2_y2 - b2_y1) * (b2_x2 - b2_x1)
+    union = b1_area[:, 0] + b2_area[:, 0] - intersection
+
+    # 4. Compute IoU and reshape to [boxes1, boxes2]
+    iou = intersection / union
+    overlaps = iou.view(boxes2_repeat, boxes1_repeat)
+    return overlaps
+
+
+
+def bbox_overlaps_3D(boxes1, boxes2):
+    """Computes IoU overlaps between two sets of boxes.
+    boxes1, boxes2: [N, (y1, x1, y2, x2, z1, z2)].
+    """
+    # 1. Tile boxes2 and repeat boxes1. This allows us to compare
+    # every boxes1 against every boxes2 without loops. Simulate an
+    # np.repeat() using repeat() and view().
+    boxes1_repeat = boxes2.size()[0]
+    boxes2_repeat = boxes1.size()[0]
+    boxes1 = boxes1.repeat(1, boxes1_repeat).view(-1, 6)
+    boxes2 = boxes2.repeat(boxes2_repeat, 1)
+
+    # 2. Compute intersections
+    b1_y1, b1_x1, b1_y2, b1_x2, b1_z1, b1_z2 = boxes1.chunk(6, dim=1)
+    b2_y1, b2_x1, b2_y2, b2_x2, b2_z1, b2_z2 = boxes2.chunk(6, dim=1)
+    y1 = torch.max(b1_y1, b2_y1)[:, 0]
+    x1 = torch.max(b1_x1, b2_x1)[:, 0]
+    y2 = torch.min(b1_y2, b2_y2)[:, 0]
+    x2 = torch.min(b1_x2, b2_x2)[:, 0]
+    z1 = torch.max(b1_z1, b2_z1)[:, 0]
+    z2 = torch.min(b1_z2, b2_z2)[:, 0]
+    zeros = Variable(torch.zeros(y1.size()[0]), requires_grad=False)
+    if y1.is_cuda:
+        zeros = zeros.cuda()
+    intersection = torch.max(x2 - x1, zeros) * torch.max(y2 - y1, zeros) * torch.max(z2 - z1, zeros)
+
+    # 3.
Compute unions + b1_volume = (b1_y2 - b1_y1) * (b1_x2 - b1_x1) * (b1_z2 - b1_z1) + b2_volume = (b2_y2 - b2_y1) * (b2_x2 - b2_x1) * (b2_z2 - b2_z1) + union = b1_volume[:,0] + b2_volume[:,0] - intersection + + # 4. Compute IoU and reshape to [boxes1, boxes2] + iou = intersection / union + overlaps = iou.view(boxes2_repeat, boxes1_repeat) + return overlaps + + + +def gt_anchor_matching(cf, anchors, gt_boxes, gt_class_ids=None): + """Given the anchors and GT boxes, compute overlaps and identify positive + anchors and deltas to refine them to match their corresponding GT boxes. + + anchors: [num_anchors, (y1, x1, y2, x2, (z1), (z2))] + gt_boxes: [num_gt_boxes, (y1, x1, y2, x2, (z1), (z2))] + gt_class_ids (optional): [num_gt_boxes] Integer class IDs for one stage detectors. in RPN case of Mask R-CNN, + set all positive matches to 1 (foreground) + + Returns: + anchor_class_matches: [N] (int32) matches between anchors and GT boxes. + 1 = positive anchor, -1 = negative anchor, 0 = neutral + anchor_delta_targets: [N, (dy, dx, (dz), log(dh), log(dw), (log(dd)))] Anchor bbox deltas. + """ + + anchor_class_matches = np.zeros([anchors.shape[0]], dtype=np.int32) + anchor_delta_targets = np.zeros((cf.rpn_train_anchors_per_image, 2*cf.dim)) + anchor_matching_iou = cf.anchor_matching_iou + + if gt_boxes is None: + anchor_class_matches = np.full(anchor_class_matches.shape, fill_value=-1) + return anchor_class_matches, anchor_delta_targets + + # for mrcnn: anchor matching is done for RPN loss, so positive labels are all 1 (foreground) + if gt_class_ids is None: + gt_class_ids = np.array([1] * len(gt_boxes)) + + # Compute overlaps [num_anchors, num_gt_boxes] + overlaps = compute_overlaps(anchors, gt_boxes) + + # Match anchors to GT Boxes + # If an anchor overlaps a GT box with IoU >= anchor_matching_iou then it's positive. + # If an anchor overlaps a GT box with IoU < 0.1 then it's negative. 
+    # Neutral anchors are those that don't match the conditions above,
+    # and they don't influence the loss function.
+    # However, don't keep any GT box unmatched (rare, but happens). Instead,
+    # match it to the closest anchor (even if its max IoU is < 0.1).
+
+    # 1. Set negative anchors first. They get overwritten below if a GT box is
+    # matched to them.
+    anchor_iou_argmax = np.argmax(overlaps, axis=1)
+    anchor_iou_max = overlaps[np.arange(overlaps.shape[0]), anchor_iou_argmax]
+    if anchors.shape[1] == 4:
+        anchor_class_matches[(anchor_iou_max < 0.1)] = -1
+    elif anchors.shape[1] == 6:
+        anchor_class_matches[(anchor_iou_max < 0.01)] = -1
+    else:
+        raise ValueError('anchor shape wrong {}'.format(anchors.shape))
+
+    # 2. Set an anchor for each GT box (regardless of IoU value).
+    gt_iou_argmax = np.argmax(overlaps, axis=0)
+    for ix, ii in enumerate(gt_iou_argmax):
+        anchor_class_matches[ii] = gt_class_ids[ix]
+
+    # 3. Set anchors with high overlap as positive.
+    above_thresh_ixs = np.argwhere(anchor_iou_max >= anchor_matching_iou)
+    anchor_class_matches[above_thresh_ixs] = gt_class_ids[anchor_iou_argmax[above_thresh_ixs]]
+
+    # Subsample to balance positive anchors.
+    ids = np.where(anchor_class_matches > 0)[0]
+    extra = len(ids) - (cf.rpn_train_anchors_per_image // 2)
+    if extra > 0:
+        # Reset the extra ones to neutral
+        ids = np.random.choice(ids, extra, replace=False)
+        anchor_class_matches[ids] = 0
+
+    # Leave all negative proposals negative for now and sample from them in online hard example mining.
+    # For positive anchors, compute shift and scale needed to transform them to match the corresponding GT boxes.
+    ids = np.where(anchor_class_matches > 0)[0]
+    ix = 0  # index into anchor_delta_targets
+    for i, a in zip(ids, anchors[ids]):
+        # closest gt box (it might have IoU < anchor_matching_iou)
+        gt = gt_boxes[anchor_iou_argmax[i]]
+
+        # convert coordinates to center plus width/height.
+ gt_h = gt[2] - gt[0] + gt_w = gt[3] - gt[1] + gt_center_y = gt[0] + 0.5 * gt_h + gt_center_x = gt[1] + 0.5 * gt_w + # Anchor + a_h = a[2] - a[0] + a_w = a[3] - a[1] + a_center_y = a[0] + 0.5 * a_h + a_center_x = a[1] + 0.5 * a_w + + if cf.dim == 2: + anchor_delta_targets[ix] = [ + (gt_center_y - a_center_y) / a_h, + (gt_center_x - a_center_x) / a_w, + np.log(gt_h / a_h), + np.log(gt_w / a_w), + ] + + else: + gt_d = gt[5] - gt[4] + gt_center_z = gt[4] + 0.5 * gt_d + a_d = a[5] - a[4] + a_center_z = a[4] + 0.5 * a_d + + anchor_delta_targets[ix] = [ + (gt_center_y - a_center_y) / a_h, + (gt_center_x - a_center_x) / a_w, + (gt_center_z - a_center_z) / a_d, + np.log(gt_h / a_h), + np.log(gt_w / a_w), + np.log(gt_d / a_d) + ] + + # normalize. + anchor_delta_targets[ix] /= cf.rpn_bbox_std_dev + ix += 1 + + return anchor_class_matches, anchor_delta_targets + + + +def clip_to_window(window, boxes): + """ + window: (y1, x1, y2, x2) / 3D: (z1, z2). The window in the image we want to clip to. + boxes: [N, (y1, x1, y2, x2)] / 3D: (z1, z2) + """ + boxes[:, 0] = boxes[:, 0].clamp(float(window[0]), float(window[2])) + boxes[:, 1] = boxes[:, 1].clamp(float(window[1]), float(window[3])) + boxes[:, 2] = boxes[:, 2].clamp(float(window[0]), float(window[2])) + boxes[:, 3] = boxes[:, 3].clamp(float(window[1]), float(window[3])) + + if boxes.shape[1] > 5: + boxes[:, 4] = boxes[:, 4].clamp(float(window[4]), float(window[5])) + boxes[:, 5] = boxes[:, 5].clamp(float(window[4]), float(window[5])) + + return boxes + + +############################################################ +# Pytorch Utility Functions +############################################################ + + +def unique1d(tensor): + if tensor.size()[0] == 0 or tensor.size()[0] == 1: + return tensor + tensor = tensor.sort()[0] + unique_bool = tensor[1:] != tensor [:-1] + first_element = Variable(torch.ByteTensor([True]), requires_grad=False) + if tensor.is_cuda: + first_element = first_element.cuda() + unique_bool = 
torch.cat((first_element, unique_bool), dim=0)
+    return tensor[unique_bool.data]
+
+
+
+def log2(x):
+    """Implementation of log2. PyTorch doesn't have a native implementation."""
+    ln2 = Variable(torch.log(torch.FloatTensor([2.0])), requires_grad=False)
+    if x.is_cuda:
+        ln2 = ln2.cuda()
+    return torch.log(x) / ln2
+
+
+
+def intersect1d(tensor1, tensor2):
+    aux = torch.cat((tensor1, tensor2), dim=0)
+    aux = aux.sort(descending=True)[0]
+    return aux[:-1][(aux[1:] == aux[:-1]).data]
+
+
+
+def shem(roi_probs_neg, negative_count, ohem_poolsize):
+    """
+    stochastic hard example mining: from a list of indices (referring to non-matched predictions),
+    determine a pool of the highest scoring predictions (worst false positives) of size negative_count * ohem_poolsize.
+    Then, sample n (= negative_count) predictions from this pool as negative examples for the loss.
+    :param roi_probs_neg: tensor of shape (n_predictions, n_classes).
+    :param negative_count: int.
+    :param ohem_poolsize: int.
+    :return: (negative_count). indices refer to the positions in roi_probs_neg. If the pool is smaller than expected due
+    to limited negative proposals available, this function will return fewer than negative_count sampled indices without
+    throwing an error.
+    """
+    # sort according to highest foreground score.
+    probs, order = roi_probs_neg[:, 1:].max(1)[0].sort(descending=True)
+    select = min(ohem_poolsize * int(negative_count), order.size()[0])
+    pool_indices = order[:select]
+    rand_idx = torch.randperm(pool_indices.size()[0])
+    return pool_indices[rand_idx[:negative_count].cuda()]
+
+
+
+def initialize_weights(net):
+    """
+    Initialize model weights. The current default in PyTorch (version 0.4.1) is initialization from a uniform distribution.
+    This is expected to change to kaiming_uniform in future versions.
+    """
+    init_type = net.cf.weight_init
+
+    for m in [module for module in net.modules() if type(module) in [nn.Conv2d, nn.Conv3d,
+                                                                     nn.ConvTranspose2d,
+                                                                     nn.ConvTranspose3d,
+                                                                     nn.Linear]]:
+        if init_type == 'xavier_uniform':
+            nn.init.xavier_uniform_(m.weight.data)
+            if m.bias is not None:
+                m.bias.data.zero_()
+
+        elif init_type == 'xavier_normal':
+            nn.init.xavier_normal_(m.weight.data)
+            if m.bias is not None:
+                m.bias.data.zero_()
+
+        elif init_type == "kaiming_uniform":
+            nn.init.kaiming_uniform_(m.weight.data, mode='fan_out', nonlinearity=net.cf.relu, a=0)
+            if m.bias is not None:
+                fan_in, fan_out = nn.init._calculate_fan_in_and_fan_out(m.weight.data)
+                bound = 1 / np.sqrt(fan_out)
+                nn.init.uniform_(m.bias, -bound, bound)
+
+        elif init_type == "kaiming_normal":
+            nn.init.kaiming_normal_(m.weight.data, mode='fan_out', nonlinearity=net.cf.relu, a=0)
+            if m.bias is not None:
+                fan_in, fan_out = nn.init._calculate_fan_in_and_fan_out(m.weight.data)
+                bound = 1 / np.sqrt(fan_out)
+                # normal_ takes (mean, std): center the bias at 0 with std = bound.
+                nn.init.normal_(m.bias, 0., bound)
+
+
+
+class NDConvGenerator(object):
+    """
+    generic wrapper around conv-layers to avoid distinguishing between 2D and 3D in code.
+    """
+    def __init__(self, dim):
+        self.dim = dim
+
+    def __call__(self, c_in, c_out, ks, pad=0, stride=1, norm=None, relu='relu'):
+        """
+        :param c_in: number of in_channels.
+        :param c_out: number of out_channels.
+        :param ks: kernel size.
+        :param pad: pad size.
+        :param stride: kernel stride.
+        :param norm: string specifying type of feature map normalization. If None, no normalization is applied.
+        :param relu: string specifying type of nonlinearity. If None, no nonlinearity is applied.
+        :return: conv layer (an nn.Sequential if normalization and/or nonlinearity are specified).
+ """ + if self.dim == 2: + conv = nn.Conv2d(c_in, c_out, kernel_size=ks, padding=pad, stride=stride) + if norm is not None: + if norm == 'instance_norm': + norm_layer = nn.InstanceNorm2d(c_out) + elif norm == 'batch_norm': + norm_layer = nn.BatchNorm2d(c_out) + else: + raise ValueError('norm type as specified in configs is not implemented...') + conv = nn.Sequential(conv, norm_layer) + + else: + conv = nn.Conv3d(c_in, c_out, kernel_size=ks, padding=pad, stride=stride) + if norm is not None: + if norm == 'instance_norm': + norm_layer = nn.InstanceNorm3d(c_out) + elif norm == 'batch_norm': + norm_layer = nn.BatchNorm3d(c_out) + else: + raise ValueError('norm type as specified in configs is not implemented... {}'.format(norm)) + conv = nn.Sequential(conv, norm_layer) + + if relu is not None: + if relu == 'relu': + relu_layer = nn.ReLU(inplace=True) + elif relu == 'leaky_relu': + relu_layer = nn.LeakyReLU(inplace=True) + else: + raise ValueError('relu type as specified in configs is not implemented...') + conv = nn.Sequential(conv, relu_layer) + + return conv + + + +def get_one_hot_encoding(y, n_classes): + """ + transform a numpy label array to a one-hot array of the same shape. + :param y: array of shape (b, 1, y, x, (z)). + :param n_classes: int, number of classes to unfold in one-hot encoding. + :return y_ohe: array of shape (b, n_classes, y, x, (z)) + """ + dim = len(y.shape) - 2 + if dim == 2: + y_ohe = np.zeros((y.shape[0], n_classes, y.shape[2], y.shape[3])).astype('int32') + if dim ==3: + y_ohe = np.zeros((y.shape[0], n_classes, y.shape[2], y.shape[3], y.shape[4])).astype('int32') + for cl in range(n_classes): + y_ohe[:, cl][y[:, 0] == cl] = 1 + return y_ohe + + + +def get_dice_per_batch_and_class(pred, y, n_classes): + ''' + computes dice scores per batch instance and class. + :param pred: prediction array of shape (b, 1, y, x, (z)) (e.g. 
softmax prediction with argmax over dim 1)
+    :param y: ground truth array of shape (b, 1, y, x, (z)) (contains int values in [0, ..., n_classes - 1])
+    :param n_classes: int
+    :return: dice scores of shape (b, c)
+    '''
+    pred = get_one_hot_encoding(pred, n_classes)
+    y = get_one_hot_encoding(y, n_classes)
+    axes = tuple(range(2, len(pred.shape)))
+    intersect = np.sum(pred * y, axis=axes)
+    denominator = np.sum(pred, axis=axes) + np.sum(y, axis=axes) + 1e-8
+    dice = 2.0 * intersect / denominator
+    return dice
+
+
+
+def sum_tensor(input, axes, keepdim=False):
+    axes = np.unique(axes)
+    if keepdim:
+        for ax in axes:
+            input = input.sum(ax, keepdim=True)
+    else:
+        for ax in sorted(axes, reverse=True):
+            input = input.sum(int(ax))
+    return input
+
+
+
+def batch_dice(pred, y, false_positive_weight=1.0, eps=1e-6):
+    '''
+    compute soft dice over the batch. this is a differentiable score and can be used as a loss function.
+    only dice scores of foreground classes are returned, since training typically
+    does not benefit from explicit background optimization. Pixels of the entire batch are considered a pseudo-volume
+    to compute dice scores of. This way, single patches with missing foreground classes can not produce faulty gradients.
+    :param pred: (b, c, y, x, (z)), softmax probabilities (network output).
+    :param y: (b, c, y, x, (z)), one-hot encoded segmentation mask.
+    :param false_positive_weight: float [0,1]. For weighting of imbalanced classes,
+    reduces the penalty for false-positive pixels. Can be beneficial sometimes in data with heavy fg/bg imbalances.
+    :return: soft dice score (float). This function discards the background score and returns the mean of foreground scores.
+    '''
+    if len(pred.size()) == 4:
+        axes = (0, 2, 3)
+        intersect = sum_tensor(pred * y, axes, keepdim=False)
+        denom = sum_tensor(false_positive_weight * pred + y, axes, keepdim=False)
+        return torch.mean((2 * intersect / (denom + eps))[1:])  # only fg dice here.
+
+    if len(pred.size()) == 5:
+        axes = (0, 2, 3, 4)
+        intersect = sum_tensor(pred * y, axes, keepdim=False)
+        denom = sum_tensor(false_positive_weight * pred + y, axes, keepdim=False)
+        return torch.mean((2 * intersect / (denom + eps))[1:])  # only fg dice here.
+
+    else:
+        raise ValueError('wrong input dimension in dice loss')
+
+
+
+def batch_dice_mask(pred, y, mask, false_positive_weight=1.0, eps=1e-6):
+    '''
+    compute soft dice over the batch, restricted to the pixels selected by mask. this is a differentiable score
+    and can be used as a loss function. only dice scores of foreground classes are returned, since training typically
+    does not benefit from explicit background optimization. Pixels of the entire batch are considered a pseudo-volume
+    to compute dice scores of. This way, single patches with missing foreground classes can not produce faulty gradients.
+    :param pred: (b, c, y, x, (z)), softmax probabilities (network output).
+    :param y: (b, c, y, x, (z)), one-hot encoded segmentation mask.
+    :param mask: (b, y, x, (z)), binary mask selecting the pixels to be scored.
+    :param false_positive_weight: float [0,1]. For weighting of imbalanced classes,
+    reduces the penalty for false-positive pixels. Can be beneficial sometimes in data with heavy fg/bg imbalances.
+    :return: soft dice score (float). This function discards the background score and returns the mean of foreground scores.
+    '''
+    # broadcast the mask over all classes (generalizes the former hard-coded 2-channel, 2D-only repeat).
+    mask = mask.unsqueeze(1).expand_as(y)
+
+    if len(pred.size()) == 4:
+        axes = (0, 2, 3)
+        intersect = sum_tensor(pred * y * mask, axes, keepdim=False)
+        denom = sum_tensor(false_positive_weight * pred * mask + y * mask, axes, keepdim=False)
+        return torch.mean((2 * intersect / (denom + eps))[1:])  # only fg dice here.
+
+    if len(pred.size()) == 5:
+        axes = (0, 2, 3, 4)
+        intersect = sum_tensor(pred * y * mask, axes, keepdim=False)
+        denom = sum_tensor(false_positive_weight * pred * mask + y * mask, axes, keepdim=False)
+        return torch.mean((2 * intersect / (denom + eps))[1:])  # only fg dice here.
+
+    else:
+        raise ValueError('wrong input dimension in dice loss')
\ No newline at end of file
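The patch ships several free-standing geometry utilities that are easy to sanity-check in isolation. The sketch below re-implements the 2D IoU computation in plain NumPy, mirroring `compute_iou_2D` above; `iou_2d` is an illustrative standalone name (not part of the patch), and it computes the areas inline rather than taking them as arguments:

```python
import numpy as np

def iou_2d(box, boxes):
    # box: [y1, x1, y2, x2] (the reference box); boxes: [N, (y1, x1, y2, x2)].
    # Areas are computed inline here; compute_iou_2D receives them precomputed instead.
    box_area = (box[2] - box[0]) * (box[3] - box[1])
    boxes_area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    y1 = np.maximum(box[0], boxes[:, 0])
    y2 = np.minimum(box[2], boxes[:, 2])
    x1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[3], boxes[:, 3])
    # clamp negative extents to zero so disjoint boxes yield zero intersection.
    intersection = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
    return intersection / (box_area + boxes_area - intersection)

gt = np.array([0, 0, 10, 10])
candidates = np.array([[0, 0, 10, 10],     # identical box -> IoU 1.0
                       [5, 5, 15, 15],     # partial overlap -> 25 / (100 + 100 - 25)
                       [20, 20, 30, 30]])  # disjoint -> IoU 0.0
print(iou_2d(gt, candidates))
```

The patched `compute_iou_2D` hoists the area computation to the caller because, during anchor matching, the same anchor areas are reused for every ground-truth box.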