
CTP gets stuck on large import
Closed, Resolved · Public

Description

Sending ~600 GB of images in one batch caused CTP to crash without restarting, resulting in a very large incoming folder and missing data in the PACS.

Related Objects

Status: Resolved · Assigned: schererj

Event Timeline

The CTP thread responsible for sending the data to the PACS crashed:

15:04:19 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
15:04:19 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
15:04:19 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
15:04:20 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
15:04:20 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
15:04:20 WARN  [AbstractExportService] DicomSTOWRSExportService Exporter Thread: Exception received
java.lang.NullPointerException
	at java.base/java.util.Arrays.sort(Arrays.java:1249)
	at org.rsna.ctp.pipeline.QueueManager.findFirstFile(QueueManager.java:242)
	at org.rsna.ctp.pipeline.QueueManager.findFirstFile(QueueManager.java:250)
	at org.rsna.ctp.pipeline.QueueManager.findFirstFile(QueueManager.java:250)
	at org.rsna.ctp.pipeline.QueueManager.dequeue(QueueManager.java:149)
	at org.rsna.ctp.pipeline.AbstractQueuedExportService.getNextFile(AbstractQueuedExportService.java:173)
	at org.rsna.ctp.pipeline.AbstractExportService$Exporter.run(AbstractExportService.java:156)
15:04:20 INFO  [AbstractExportService] DicomSTOWRSExportService Thread: Interrupt received; exporter thread stopped

So from this moment on, all data sent to the platform was stored in the queue.

In T27573 we discussed the problem of data transfer from CTP to DCM4CHEE.
This is basically a similar problem: a data transfer between CTP and DCM4CHEE fails.
CTP then either keeps the data in the queue (here) or moves it to the quarantine folder (T27573).
Either way, we are not notified that there is a problem.

I would consider three different variants:

  1. Keep CTP as receiver -> move the dcmsend from the CTP pipeline to Airflow (open issue: detecting the end of a transfer); see the sketch below
  2. Keep CTP as receiver and change the DICOMweb send pipeline step back to the old DIMSE step (possible issue with missing slices <-> really fast)
  3. Don't use CTP at all -> start a DICOM receiver in Airflow as a sensor and trigger DAGs on arrival (performance unknown)
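
To make variant 1 a bit more concrete, here is a rough sketch of the send step as a plain Airflow task (Airflow 1.10.x style, matching the logs further down; the DAG id, directory and port are made up, and it assumes DCMTK's dcmsend binary is available in the worker image):

from datetime import datetime
import subprocess

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def send_to_pacs(dicom_dir, host="dcm4chee-service.store.svc", port=11112):
    # dcmsend exits non-zero on failure, so a failed transfer surfaces
    # as a failed Airflow task (with retries and alerting) instead of
    # data silently piling up in a CTP queue or quarantine folder.
    subprocess.run(
        ["dcmsend", "--scan-directories", "--recurse", host, str(port), dicom_dir],
        check=True,
    )

with DAG("service-send-to-pacs", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    PythonOperator(
        task_id="dcmsend",
        python_callable=send_to_pacs,
        op_kwargs={"dicom_dir": "/data/incoming"},
    )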

@nolden @floca @gaoh feel free to add more variants or make comments on the approaches.

Thanks. Comments/questions:

  • Regarding 1: This is done to ensure that you are notified if a problem arises in the dcmsend step?
  • Regarding 2: Do we have a test to verify if the problem still exists before we go into production with this? @gaoh: Was this a problem on the CTP side or on the DCM4CHEE side? So, did other PACS like Orthanc have the same problem?
  • Regarding 3: I would only go down this road if we also have a test suite ready to benchmark the performance and to ensure that we don't run into the same problem. What kind of receiver would you use here?

Wouldn't it be possible to monitor CTP as well? @gaoh seemed to have a log that indicates whether there is a problem?
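
Even a simple log watcher on the CTP pod could surface such failures until a proper solution is in place; a minimal sketch (the log path is an assumption, the patterns are taken from the messages in this task):

import re
import sys
import time

# Failure patterns taken from the CTP messages seen in this task.
PATTERN = re.compile(r"export failed|OutOfMemoryError|exporter thread stopped", re.IGNORECASE)

def follow(path):
    # Tail the log file and report every matching line.
    with open(path) as logfile:
        logfile.seek(0, 2)  # jump to the current end of the file
        while True:
            line = logfile.readline()
            if not line:
                time.sleep(1)
                continue
            if PATTERN.search(line):
                print("CTP problem detected: " + line.strip(), file=sys.stderr)

follow("/opt/CTP/logs/ctp.log")  # assumed log location inside the container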

schererj triaged this task as Unbreak Now! priority. Nov 30 2020, 11:15 AM

Got similar issues with a "normal" data import.
So I just sent a couple of images and got many of the following errors in CTP:

10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:17 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:18 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:18 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server returned HTTP response code: 409 for URL: http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs/studies
10:07:18 WARN  [DicomSTOWRSExportService] DicomSTOWRSExportService: export failed: Server retu

So this is not related to large datasets only.

> Regarding 1: This is done to ensure that you are notified if a problem arises in the dcmsend step?

Yes, exactly, and we have full control over the binary and the parameters used for the data transfer (e.g. storescu, dcmsend or DICOMweb send etc.)

> Regarding 2: Do we have a test to verify if the problem still exists before we go into production with this?

Yes, we have to verify. Afaik we don't know why it stopped working (could also be DCM4Chee as stated before - we have updated it in the meantime, though).

> Regarding 3: I would only go down this road if we also have a test suite ready to benchmark the performance and to ensure that we don't run into the same problem. What kind of receiver would you use here?

I would start with the DCMTK DICOM receiver, because I think DCMTK is also quite a mature project (could be wrong though :) ).
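
A rough sketch of what variant 3's receiver could look like, wrapping DCMTK's storescp (the AE title is taken from the logs; the directory, port and trigger call are assumptions, with the trigger URL being the one visible in the KaapanaDagTrigger log further down):

import subprocess

# Run DCMTK's storescp as the DICOM receiver instead of CTP.
# Instances are sorted into one subdirectory per study; once a study
# has been idle for the timeout, a command is executed that could
# notify Airflow (request payload omitted in this sketch).
subprocess.run(
    [
        "storescp",
        "--aetitle", "KAAPANA",
        "--output-directory", "/data/incoming",  # assumed target directory
        "--sort-on-study-uid", "study",          # subdirectory prefix per study
        "--eostudy-timeout", "60",               # study considered complete after 60s idle
        "--exec-on-eostudy",
        "curl -s -X POST http://airflow-service.flow.svc:8080/flow/kaapana/api/trigger/service-process-incoming-dcm",
        "11112",                                 # assumed listen port
    ],
    check=True,
)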

Have you checked whether the dataset (or the slices) is already in the PACS? In my case (I got the error when sending data from Airflow to the PACS), the files causing the 409 were already in DCM4CHEE.
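
One way to check this is a QIDO-RS query against the same dcm4chee-arc endpoint that appears in the 409 logs above - a minimal sketch with a made-up series UID:

import requests

# Ask the archive which instances of a series are already stored.
# A 409 on STOW-RS typically indicates a conflict with data that is
# already present in the archive.
base = "http://dcm4chee-service.store.svc:8080/dcm4chee-arc/aets/KAAPANA/rs"
series_uid = "1.2.826.0.1.3680043.8.498.1"  # hypothetical series UID

resp = requests.get(base + "/instances", params={"SeriesInstanceUID": series_uid})
if resp.status_code == 204:  # QIDO-RS returns 204 when nothing matches
    print("no matching instances stored")
else:
    resp.raise_for_status()
    print(len(resp.json()), "matching instances already stored")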

No, the data had not been sent before.
There were only completely different images already stored.
Also, OHIF was not showing any of the newly sent data at all.

Edit:
I just saw that I had changed the filesystem permissions for all the mounted dirs.
So my issue could be related to this - but I'm wondering why it worked again after a restart of the CTP and DCM4Chee pods.

Thanks for the answers!

>> Regarding 1: This is done to ensure that you are notified if a problem arises in the dcmsend step?

> Yes, exactly, and we have full control over the binary and the parameters used for the data transfer (e.g. storescu, dcmsend or DICOMweb send etc.)

You mean more explicit control? I would guess that we also have control over which version of CTP and which configuration we use.

>> Regarding 2: Do we have a test to verify if the problem still exists before we go into production with this?

> Yes, we have to verify. Afaik we don't know why it stopped working (could also be DCM4Chee as stated before - we have updated it in the meantime, though).

Do you mean "yes, we have a unit test ready to check and verify", or do you "just" mean "yes, we have to verify it, but have no test set in place to check, benchmark and detect regressions"?
If the latter is the case, I think we should first establish a proper test set before touching that matter.

>> Regarding 3: I would only go down this road if we also have a test suite ready to benchmark the performance and to ensure that we don't run into the same problem. What kind of receiver would you use here?

> I would start with the DCMTK DICOM receiver, because I think DCMTK is also quite a mature project (could be wrong though :) ).

I guess it would currently also be my pick. But I still lean more towards the other option (though I cannot provide hard facts to support that gut feeling).

> You mean more explicit control? I would guess that we also have control over which version of CTP and which configuration we use.

We just use the default CTP stages for the DIMSE send etc. - currently we don't have any influence on the binaries/parameters used (and no insight when things go wrong).

> Do you mean "yes, we have a unit test ready to check and verify", or do you "just" mean "yes, we have to verify it, but have no test set in place to check, benchmark and detect regressions"?

No, we don't have a unit test. Getting test data to test it is not a problem. Since the current system is broken, we have to change one of these parts.

For me option 1 looks like a good solution.

We need a dataset that reproduces the problem in order to set up a test setup.

Just had another one of the out-of-memory issues:
-> Even with the newly increased memory!

We should definitely look into the CTP deployment in general!

Starting CTP
java -Xmx1024m -Xms512m -jar libraries/CTP.jar 
stderr: Exception in thread "DicomSTOWRSExportService Exporter" java.lang.OutOfMemoryError: Java heap space
stderr: 	at java.base/java.util.Arrays.copyOf(Arrays.java:3745)
stderr: 	at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:120)
stderr: 	at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
stderr: 	at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
stderr: 	at java.base/sun.net.www.http.PosterOutputStream.write(PosterOutputStream.java:78)
stderr: 	at org.rsna.util.ClientHttpRequest.pipe(ClientHttpRequest.java:153)
stderr: 	at org.rsna.util.ClientHttpRequest.addFilePart(ClientHttpRequest.java:208)
stderr: 	at org.rsna.ctp.stdstages.DicomSTOWRSExportService.export(DicomSTOWRSExportService.java:164)
stderr: 	at org.rsna.ctp.pipeline.AbstractExportService$Exporter.run(AbstractExportService.java:158)

At least the container should crash if this happens -> then it will be restarted.

The deployment right now just keeps running and nothing happens anymore.

Can you reproduce the error? If so, we can do something like: https://stackoverflow.com/questions/12096403/java-shutting-down-on-out-of-memory-error

-XX:+ExitOnOutOfMemoryError
-XX:+CrashOnOutOfMemoryError

The pod would crash and restart directly after.
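
Applied to the start command from the log above, that would look something like this (both flags exist since JDK 8u92, so they should be available in the container's JVM):

java -Xmx1024m -Xms512m -XX:+ExitOnOutOfMemoryError -jar libraries/CTP.jar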

This could be an option.
We should still try to solve the multiple issues with CTP.
I'm not sure if there is a single source of failure - but I have a lot of trouble with it.

Another issue I have regularly:

15:18:57 INFO  [AbstractExportService] DicomSTOWRSExportService: Exporter Thread: Started
15:21:00 ERROR [ServerImpl] java.io.EOFException
java.io.EOFException
        at org.dcm4cheri.net.UnparsedPDUImpl.readFully(UnparsedPDUImpl.java:115)
        at org.dcm4cheri.net.UnparsedPDUImpl.<init>(UnparsedPDUImpl.java:60)
        at org.dcm4cheri.net.FsmImpl.read(FsmImpl.java:502)
        at org.dcm4cheri.net.AssociationImpl.accept(AssociationImpl.java:287)
        at org.dcm4cheri.server.DcmHandlerImpl.handle(DcmHandlerImpl.java:248)
        at org.dcm4cheri.server.ServerImpl.run(ServerImpl.java:288)
        at org.dcm4cheri.util.LF_ThreadPool.join(LF_ThreadPool.java:174)
        at org.dcm4cheri.server.ServerImpl$1.run(ServerImpl.java:242)
        at java.base/java.lang.Thread.run(Thread.java:834)
15:21:18 WARN  [KaapanaDagTrigger] Send to Airflow seriesInstanceUID 1.2.826.0.1.3680043.8.498.34728877190970636564960936673618953800
15:21:18 WARN  [KaapanaDagTrigger] MetaExtraction: Triggering: service-process-incoming-dcm - 1.2.826.0.1.3680043.8.498.34728877190970636564960936673618953800
15:21:18 WARN  [KaapanaDagTrigger] Dicom Path: 1.2.826.0.1.3680043.8.498.34728877190970636564960936673618953800_20210223152118
15:21:24 WARN  [KaapanaDagTrigger] Final file-count: 3
15:21:24 WARN  [KaapanaDagTrigger] Dicom Folder send to airflow: 1.2.826.0.1.3680043.8.498.34728877190970636564960936673618953800_20210223152118
15:21:24 WARN  [KaapanaDagTrigger] MetaExtraction: URL: http://airflow-service.flow.svc:8080/flow/kaapana/api/trigger/service-process-incoming-dcm
15:21:24 WARN  [KaapanaDagTrigger] MetaExtraction: {"message":["service-process-incoming-dcm created!"]}

15:23:00 ERROR [ServerImpl] java.io.EOFException
java.io.EOFException
        at org.dcm4cheri.net.UnparsedPDUImpl.readFully(UnparsedPDUImpl.java:115)
        at org.dcm4cheri.net.UnparsedPDUImpl.<init>(UnparsedPDUImpl.java:60)
        at org.dcm4cheri.net.FsmImpl.read(FsmImpl.java:502)
        at org.dcm4cheri.net.AssociationImpl.accept(AssociationImpl.java:287)
        at org.dcm4cheri.server.DcmHandlerImpl.handle(DcmHandlerImpl.java:248)
        at org.dcm4cheri.server.ServerImpl.run(ServerImpl.java:288)
        at org.dcm4cheri.util.LF_ThreadPool.join(LF_ThreadPool.java:174)
        at org.dcm4cheri.server.ServerImpl$1.run(ServerImpl.java:242)
        at java.base/java.lang.Thread.run(Thread.java:834)
root@ctp-77f86d6c68-nk75q:/opt/CTP#

It corresponds to my model DICOMs - but dciodvfy is almost happy with them and I can upload them to DCM4Chee directly.

Can you send me a dataset to reproduce it? I could try to debug CTP, but it looks like it is not even a problem in CTP but in org.dcm4cheri.

This is one of my model DICOMs (not 100% sure whether this will reproduce it).
If not, I can try to create bigger parts.

Download-link from my instance

> This is one of my model DICOMs (not 100% sure whether this will reproduce it).
> If not, I can try to create bigger parts.
>
> Download-link from my instance

Just a remark: the download link does not work for me - I get "access forbidden". This is the case even if I am logged into the platform. But since there is only one folder in MinIO, I also found it without the link :)

I cannot reproduce the error. When sending the data to an instance, the data gets imported. When debugging CTP locally, there also seems to be no problem. To me it also looks like the problem is not directly in CTP but in the Java library (dcm4che). Did the files get triggered/sent to Airflow? Did they get stuck in one of CTP's quarantine folders, and if so, in which?

Not sure if this is related:

I did some experiments with sending ~300 GB as a bulk import from a local directory to Kaapana using dcmsend. After a few minutes the system load went above 30 (24-core machine) and the web interface of Kaapana failed with errors because of timeouts.

After the dcmsend had finished, I could log in again; Airflow was busy processing, but a few extract-metadata jobs failed. Example log for reference:

*** Reading local file: /root/airflow/logs/service-extract-metadata/dcmsend/2021-03-15T11:56:36.556587+00:00/3.log
[2021-03-15 13:39:02,064] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: service-extract-metadata.dcmsend 2021-03-15T11:56:36.556587+00:00 [queued]>
[2021-03-15 13:39:02,089] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: service-extract-metadata.dcmsend 2021-03-15T11:56:36.556587+00:00 [queued]>
[2021-03-15 13:39:02,089] {taskinstance.py:880} INFO - 
--------------------------------------------------------------------------------
[2021-03-15 13:39:02,089] {taskinstance.py:881} INFO - Starting attempt 3 of 3
[2021-03-15 13:39:02,089] {taskinstance.py:882} INFO - 
--------------------------------------------------------------------------------
[2021-03-15 13:39:02,207] {taskinstance.py:901} INFO - Executing <Task(DcmSendOperator): dcmsend> on 2021-03-15T11:56:36.556587+00:00
[2021-03-15 13:39:02,211] {standard_task_runner.py:54} INFO - Started process 26302 to run task
[2021-03-15 13:39:02,256] {standard_task_runner.py:77} INFO - Running: ['airflow', 'run', 'service-extract-metadata', 'dcmsend', '2021-03-15T11:56:36.556587+00:00', '--job_id', '1544', '--pool', 'MEMORY', '--raw', '-sd', 'DAGS_FOLDER/dag_service_extract_metadata.py', '--cfg_path', '/tmp/tmpa5xrvgq4']
[2021-03-15 13:39:02,256] {standard_task_runner.py:78} INFO - Job 1544: Subtask dcmsend
[2021-03-15 13:39:02,376] {logging_mixin.py:112} INFO - Running <TaskInstance: service-extract-metadata.dcmsend 2021-03-15T11:56:36.556587+00:00 [running]> on host airflow-7446c74d56-nm5v4
[2021-03-15 13:39:02,393] {logging_mixin.py:112} INFO - ++++++++++++++++++++++++++++++++++++++++++++++++ launch pod!
[2021-03-15 13:39:02,393] {logging_mixin.py:112} INFO - dcmsend
[2021-03-15 13:39:02,405] {pod_launcher.py:88} ERROR - Exception when attempting to create Namespaced Pod.
Traceback (most recent call last):
  File "/root/airflow/plugins/kaapana/kubetools/pod_launcher.py", line 83, in run_pod_async
    resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace)
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 6174, in create_namespaced_pod
    (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 6251, in create_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 340, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 172, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 382, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 272, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Mon, 15 Mar 2021 12:39:02 GMT', 'Content-Length': '436'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"dcmsend-a5a9a218\" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: \"/data\": must be unique","reason":"Invalid","details":{"name":"dcmsend-a5a9a218","kind":"Pod","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"/data\": must be unique","field":"spec.containers[0].volumeMounts[1].mountPath"}]},"code":422}


[2021-03-15 13:39:02,408] {taskinstance.py:1150} ERROR - (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Mon, 15 Mar 2021 12:39:02 GMT', 'Content-Length': '436'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"dcmsend-a5a9a218\" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: \"/data\": must be unique","reason":"Invalid","details":{"name":"dcmsend-a5a9a218","kind":"Pod","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"/data\": must be unique","field":"spec.containers[0].volumeMounts[1].mountPath"}]},"code":422}

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/airflow/models/taskinstance.py", line 979, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/root/airflow/plugins/kaapana/operators/HelperCaching.py", line 55, in wrapper
    x = func(self, *args, **kwargs)
  File "/root/airflow/plugins/kaapana/operators/KaapanaBaseOperator.py", line 402, in execute
    (result, message) = launcher.run_pod(
  File "/root/airflow/plugins/kaapana/kubetools/pod_launcher.py", line 101, in run_pod
    resp = self.run_pod_async(pod)
  File "/root/airflow/plugins/kaapana/kubetools/pod_launcher.py", line 83, in run_pod_async
    resp = self._client.create_namespaced_pod(body=req, namespace=pod.namespace)
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 6174, in create_namespaced_pod
    (data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api/core_v1_api.py", line 6251, in create_namespaced_pod_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 340, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 172, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/api_client.py", line 382, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 272, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.8/dist-packages/kubernetes/client/rest.py", line 231, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Mon, 15 Mar 2021 12:39:02 GMT', 'Content-Length': '436'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"dcmsend-a5a9a218\" is invalid: spec.containers[0].volumeMounts[1].mountPath: Invalid value: \"/data\": must be unique","reason":"Invalid","details":{"name":"dcmsend-a5a9a218","kind":"Pod","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"/data\": must be unique","field":"spec.containers[0].volumeMounts[1].mountPath"}]},"code":422}


[2021-03-15 13:39:02,408] {taskinstance.py:1187} INFO - Marking task as FAILED. dag_id=service-extract-metadata, task_id=dcmsend, execution_date=20210315T115636, start_date=20210315T123902, end_date=20210315T123902
[2021-03-15 13:39:02,408] {logging_mixin.py:112} INFO - ##################################################### ON FAILURE!
[2021-03-15 13:39:02,408] {logging_mixin.py:112} INFO - ## POD: dcmsend-a5a9a218
[2021-03-15 13:39:02,408] {logging_mixin.py:112} INFO - RESULT_MESSAGE: None
[2021-03-15 13:39:02,408] {logging_mixin.py:112} INFO - --> delete pod!
[2021-03-15 13:39:02,409] {logging_mixin.py:112} INFO - 
[2021-03-15 13:39:02,409] {pod_stopper.py:53} INFO - ################ Deleting Pod: dcmsend-a5a9a218
[2021-03-15 13:39:02,488] {pod_stopper.py:67} INFO - ################ Pod not found!
[2021-03-15 13:39:02,488] {logging_mixin.py:112} INFO - 
[2021-03-15 13:39:10,634] {local_task_job.py:102} INFO - Task exited with return code 1
gaoh removed gaoh as the assignee of this task. Mar 22 2021, 2:53 PM
gaoh moved this task from In Progress to Backlog on the Kaapana (internal) board.

After tests of the system, the problem might be in the Airflow part; this has to be tested.

schererj claimed this task.

So after I switched the DICOM send from CTP to Airflow, the issue should be solved.
I have sent multiple terabytes of data without any noteworthy issues.
I still think we could probably remove CTP completely (since it is only used as a DICOM receiver anyway and introduces a considerable amount of complexity to the system).
But it should work as it is right now, and the removal can be handled as future work.

@gaoh have we also tested it with our wDB stress test?

In the current setup, the transfer is handled from Airflow. This is also the case in the wDB gateway, so this stress test should perform the same way and work. I also have a different test with random data that works up to a limit. The system now has several recovery mechanisms and can therefore handle large (randomly sorted) datasets better. I guess there are still occasional errors, but since the system then restarts, no one notices them. But there are still limits (depending mainly on the server's RAM).
So I would also say this ticket is resolved for now. In the long run, changing the whole import process could remain a valid option.

gaoh closed subtask Restricted Maniphest Task as Resolved. Jan 25 2022, 12:57 PM