Page MenuHomePhabricator

DICOM with 'ä' in filename not readable on Windows
Closed, ResolvedPublic

Description

I just tried to open a DICOM MR image with the windows installer posted here (https://camic-dkfz.slack.com/archives/C0110LS9ADA/p1612512246049700) but there is no reader found. It works fine with the MITK Workbench v2018.04.2.

Test Data:

(Rename to have an 'ä' in the filename, only fails on Windows)

Capture.PNG (136×426 px, 4 KB)

Revisions and Commits

rMITK MITK
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision

Event Timeline

kislinsk renamed this task from DICOM not readable to DICOM with 'ä' in filename not readable on Windows.Feb 8 2021, 2:28 PM
kislinsk triaged this task as Unbreak Now! priority.
kislinsk updated the task description. (Show Details)
kislinsk updated the task description. (Show Details)
kislinsk edited projects, added MITK (v2021.02); removed MITK.
kislinsk added a parent task: T28000: MITK v2021.02.

Additional info: This is not a DICOM reader problem in the first place, as the filename string encoding seems to be unexpected (see screenshot above). Other readers work but the node still appears with a dummy symbol in the data node name.

Oh no, I can remember similar issues with German Ubuntu and "Arbeitsfläche" ... Basically, CI clients should probably always have a path compontent like /bäh / included to catch new issues like this one, and build system stuff, which is also a candidate.

This is a problem that becomes visible in gdcm and it is not the first time that we encouter it.
Currently the gdcm reader fails to open the file. I assume it is correlated with gdcm changes of how to handle filenames under windows in gdcm (gdcmReader.cxx 836ff)

void Reader::SetFileName(const char *uft8path)

Here some conversion between utf8 and utf16 was introduced and this string:

"C:/Users/floca/Downloads/sämple.IMA"

is changed to:

L"C:/Users/floca/Downloads/s�mple.IMA"

and for the later string the loading fails.

As I am no unicode expert I do not know, if this is a problem of how we handle the strings or if we hit some corner case in gdcm where its changes end up in a regression. But as a result we currently have (still) problems with any Umlaute or alike.

Ok, I debugged to SetFileName @floca mentioned, it gets called from GDCMImageIO::InternalReadImageInformation in ITK

The "string" at that point is definitively not utf8 . There is one _byte_ at the position of the "ä" and VS debugger shows -28 as value which is 228 unsigned which is an "ä" in latin1. In QmitkIOUtil::Load this string is created by a call to toLocal8Bit() . Unfortunately, changing this to toUtf8() also breaks the file loading, but differently. A pop up dialog from MITK tells me the file doesn't exist.

So, if GDCM expects an UTF8 encoded filename as the method signature suggests, I think the problem is not on GDCM side, since we call it with a Latin1 encoded filename.

Unfortunately since my fix doesn't work I assume there is some inconsistency about file name encodings and it will need further investigation ...

Ok, I did look around a bit more. To summarize, I guess going for the "local 8bit representation" seems to be the most feasible way, since ITK also assumes this and I think we have a lot of ITK file handling code.

There was a fix in GDCM about unicode filename handling on Windows but I'm not sure if I fully get it.

Ok, the GDCM unicode windows fix was introduced in the 3.0 series: https://github.com/malaterre/GDCM/commit/d4dfa144b941c6cc4da87f9795f7f5aba2bb7481

I have no good ideas right now how to continue here.

I also looked into this and one of the problems/differences to "back in the days" is that Qt nowadays assumes by default that C-strings in the constructor of QString are UTF-8 encoded. So, for example, Marco, if you change toLocal8Bit() in QmitkIOUtil to toUtf8(), you get working strings in the GUI (can be easily tested with a MessageBox::information()), but invalid strings for most of the readers, as they assume the local codepage. So we can have one or the other, but not both at the moment. The thing is that QmitkIOUtil works with QString, but mitk::IOUtil uses std::string in the structure that holds the file path (LoadInfo). This is then later used to be stuffed into StringProperty for the name property or the path property. Qt of course assumes these strings to be UTF-8 and hence we see the ?'s in the GUI.

So, to get this right, the cleanest solution would be to indeed replace toLocal8Bit() with toUtf8() and to check the different readers on how they actually deal with file names to convert accordingly. Do they use the ...W WinAPI functions based on UTF-16 or do they use the ...A WinAPI functions based on the current codepage? It's a bit of a mess since I guess different dependencies do it differently. The GDCM fix Marco posted above is for example half of what's necessary on Windows to convert a UTF-8 string to a "Local8Bit" string. It converts to UTF-16 to eventually call the ...W WinAPI functions. But you can then convert the wide string back to a narrow string encoded in in "Local8Bit". TL;DR we can convert back and forth between UTF-8 and Local8Bit without using Qt.

To make the mess complete, Microsoft kind of supports an UTF-8 code page locally set for applications since Windows 10 2019.03. You have to write a manifest file for the application with a certain setting and the application then has its code page set to 65001 (UTF-8). These applications should then NOT #define UNICODE anymore to again use the "old" ...A WinAPI functions that were discouraged for the past decade(s) or so. These ...A functions were extended to be able to work with UTF-8 if the codepage is set accordingly. Interestingly they just do what the GDCM fix does and forward the call to the ...W functions after a conversion. Sigh... Again, long story short, on Windows we should actually try to set the UNICODE compile definition to have all the API functions to be defined as the ...W functions.

But wait, it doesn't stop there. Apparently the Microsoft C++ Standard Library has/had(?) a non-standard overload of std::fstream() to accept an wchar_t* as filename to internally call the ...W WinAPI functions. To have it working platform independent and standard conform, you need C++ 17, which introduced for example std::ofstream out(std::filesystem::path(u8"ÄäÖöÜüß.txt")).

And then there is GDCM that already tries to use UTF-8 but I guess we run into an error condition before in MITK since we use itksys::SystemTools::... to check filenames before we pass them down. In theory, GDCM would probably work without modifications after we change toLocal8Bit() to toUtf8().

To disentangle all of this is probably too much pre-release but we should address it afterwards ASAP.

edit: And then there is Boost.NoWide. 😄

Uh, maybe I made it more dramatic than it is. We can also do the complete opposite and stay with toLocal8Bit(). That means, that we only need conversion when writing the path and name into the properties. I wrote a Local8BitToUtf8() function and it works in a first test. We can also use it to pass an UTF-8 string to the GDCM reader. Didn't check for all code locations where to do it but its late already and tomorrow is another day. :-)

If you are interested, here is the function(s) I came up with:

std::wstring Local8BitToUtf16(const std::string& local8BitStr)
{
  auto numChars = MultiByteToWideChar(CP_ACP, 0, local8BitStr.data(), local8BitStr.size(), nullptr, 0);

  if (0 >= numChars)
    mitkThrow() << "Failure to convert multi-byte character string to wide character string";

  std::wstring utf16Str;
  utf16Str.resize(numChars);

  MultiByteToWideChar(CP_ACP, 0, local8BitStr.data(), local8BitStr.size(), &utf16Str[0], static_cast<int>(utf16Str.size()));

  return utf16Str; 
}

std::string Utf16ToUtf8(const std::wstring& utf16Str)
{
  auto numChars = WideCharToMultiByte(CP_UTF8, 0, utf16Str.data(), utf16Str.size(), nullptr, 0, nullptr, nullptr);

  if (0 >= numChars)
    mitkThrow() << "Failure to convert wide character string to multi-byte character string";

  std::string utf8Str;
  utf8Str.resize(numChars);

  WideCharToMultiByte(CP_UTF8, 0, utf16Str.data(), utf16Str.size(), &utf8Str[0], static_cast<int>(utf8Str.size()), nullptr, nullptr);

  return utf8Str;
}

std::string Local8BitToUtf8(const std::string& local8BitStr)
{
  return Utf16ToUtf8(Local8BitToUtf16(local8BitStr));
}

That's a thorough dive in the topic ... Yes, toUtf8(), and nicely prints the correct Umlaut in the message box, telling you the file doesn't exist ;) I only had a quick glance around, and since a lot of stuff goes though ITK, and ITK decided to stay with local 8bit as far as I understand, I also think this would be the way to go. But I could really live with declaring this a "Known issue". Maybe something on the application user level would be nice, like

if toUtf8 != toLocal8Bit then show error

Hi you both, thanks for the digging.
I also tend too follow @nolden propsal 1) declaring it a kown issue and 2) returning a meaningfull error message, of invalid characters are used.

Then deal with the issue thouroughly post release.

The pragmatic way described above won't work. The reason is that the gdcm code that expects UTF8 and now does the conversion on windows is not directly accessed but covered by itk code, which does its own checks and cannot handle the strings converted to UTF8.
So I am not even sure if we would solve that issue if we change all our path strings to UTF8, if the ITK code does not play along. Might be that on windows the only possiblity to use GDCM within ITK is to patch GDCM in such a way that you can deactivate the win specific conversion from UTF8 if needed.

I did some Googling, it seems that starting with 2019 release, Windows 10 has a UTF8 mode. It's a bit difficult to enable: "old" control panel -> region -> Administrative settings -> Change System locale -> Enable UTF8

After checking this box, MITK loads the data without code changes, but DataNode name is just a single slash "/" ... ;)

kislinsk added a revision: Restricted Differential Revision.Feb 9 2021, 11:43 AM

I did some Googling, it seems that starting with 2019 release, Windows 10 has a UTF8 mode. It's a bit difficult to enable: "old" control panel -> region -> Administrative settings -> Change System locale -> Enable UTF8

After checking this box, MITK loads the data without code changes, but DataNode name is just a single slash "/" ... ;)

This is a nice workaround that we can communicate for the known issue of Umlaut DICOMs.

However, I fixed everything but the initially reported DICOM filename issue with my second proposal above. As @floca said, the DICOM filename thing will still be a known issue as ITK sits in between MITK and GDCM and expects strings to be encoded with the current code page and it would require to patch ITK. Happy to have a workaround now, though.

I created a draft Differential attached to this task to see if everything still compiles on Unix as well. I also added a unit test. The Differential contains all changes of the release branch, though, so I also pushed a bugfix branch for easy inspection if you guys are interested?

edit: The name / is unfortunately(?) expected as it is the patient name or something similar stored in the DICOM file. Not related to encoding. :)

edit2: The bugfix branch relies on the latest MITK-Data, so a superbuild is necessary or simply a git pull in MITK-superbuild/MITK-Data.

I did not dive into Windows APIs, so I just hope you know what you are doing ;)

The test looks reasonable and passes also on Linux.

One improvement could be to output the string (and maybe the hex version) in the WARNING output if the conversion fails.

And did you look into boost? There is also some conversion stuff

I did not dive into Windows APIs, so I just hope you know what you are doing ;)

The test looks reasonable and passes also on Linux.

One improvement could be to output the string (and maybe the hex version) in the WARNING output if the conversion fails.

And did you look into boost? There is also some conversion stuff

Cool, thank you! 👍

Developed many years with the WinAPI in the good old days. Had flashbacks. :-)

I see the potential for improvement of the warning and I already did it this way but removed it later, as it was not very helpful. The console window uses a different encoding somehow and you get cryptic symbols there for both UTF-8 strings and "Local8Bit" strings. Hex could really be an alternative but maybe a little over the top resp. even more cryptic as an output for a user for an edge case that is expected not to happen (and now it will because I said it :D).

I looked into Boost and the best option there is Boost.NoWide but it is a binary dependency and we strive to keep the core free from binary boost dependencies. It is also more focused to provide alternatives for std::fstream and so on...

kislinsk removed kislinsk as the assignee of this task.EditedFeb 9 2021, 12:49 PM
kislinsk edited projects, added MITK, Next Milestone; removed MITK (v2021.02).

Dear future assignee of this task,

the original issue is not resolved. DICOM files with Umlauts in the filename still cannot be read. We use our superbuild GDCM for it through ITK (MITK <-> ITK <-> GDCM). GDCM expects an UTF-8 encoded string and now we have methods for conversion in mitk::IOUtil but ITK in the middle cannot handle UTF-8 encoded strings for this purpose and will return an error before passing anything down to GDCM.

Current workarounds:

  • Rename the DICOM file, or
  • try the UTF-8 locale in Windows 10 available since 2019.03 (see comment of Marco above)
kislinsk lowered the priority of this task from Unbreak Now! to Normal.Feb 9 2021, 12:50 PM
kislinsk removed a parent task: T28000: MITK v2021.02.

Deleted branch from rMITK MITK: bugfix/T28285-Local8BitToUtf8.

kislinsk added a revision: Restricted Differential Revision.May 19 2022, 10:09 AM

Deleted branch from rMITK MITK: bugfix/T28285-Windows-UTF-8.