Unify handling of filename encoding in MITK
Closed, Wontfix, Public

Description

During T22602 @maleike dug deeper into the current way strings are handled in MITK, especially regarding their encoding and file I/O. He reported:

Observed problems

On a Windows machine with German_Germany.1252 locale (Latin1):

  • (without any of this task's patches) editing DataNode names in Data Manager does not accept "ö" or other special characters. Input is possible, but as soon as the in-place editor is closed, the Data Manager does not display it as "ö" but as "?" (missing character symbol)
  • when fixing Data Manager's display (by making it use QString<-->std::string conversions via default conversions), files named with special characters such as "ö.nrrd" will be loaded but their names will be displayed as "missing character"

String representation summary

What happens in Qt

  • QString contains UTF-16 internally (but we should not care)
  • QString offers convenient conversion from and to std::string via fromStdString() and toStdString(). The conversion logic is identical when constructing a QString from a const char*

What happens in MITK's Qt UI

  • Nearly all UI code converts using Qt's default conversion (toStdString/fromStdString). This seems reasonable. However, it changed its meaning with the migration to Qt 5. Before, MITK UI was assuming "std::string == Latin1", now MITK UI is assuming "std::string == UTF-8".
  • Exceptions were found in:
    • Qt-based models: were using QFile::encodeName(). This method uses the user's locale. Using encodeName() leads to broken strings when typing "ö" into views based on those models (when QString's default conversion assumes UTF-8 and QFile::encodeName assumes something else, e.g. Latin1).
    • QmitkIOUtil: also uses QFile::encodeName(). This leads to DataNode names encoded using the user's locale (say Latin1), which are then displayed in Data Manager as if they were UTF-8 (i.e. missing character symbols)
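
The mismatch described above can be reproduced without Qt. The following is a minimal sketch using only the C++ standard library; the function names LooksLikeUtf8 and Latin1ToUtf8 are hypothetical helpers for illustration and are not part of MITK or Qt. It shows that the Latin1 encoding of "ö" (the single byte 0xF6) is not a valid UTF-8 sequence, which is why a UTF-8 based display falls back to a missing character symbol.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Structural UTF-8 check: a lead byte must be followed by the right number
// of continuation bytes. Overlong forms and surrogates are NOT rejected.
bool LooksLikeUtf8(const std::string& bytes)
{
  std::size_t i = 0;
  while (i < bytes.size())
  {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    std::size_t extra = 0;
    if (lead < 0x80) extra = 0;
    else if ((lead & 0xE0) == 0xC0) extra = 1;
    else if ((lead & 0xF0) == 0xE0) extra = 2;
    else if ((lead & 0xF8) == 0xF0) extra = 3;
    else return false; // stray continuation byte or invalid lead byte

    if (extra > bytes.size() - i - 1) return false; // truncated sequence
    for (std::size_t k = 1; k <= extra; ++k)
    {
      const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
      if ((c & 0xC0) != 0x80) return false; // not a continuation byte
    }
    i += extra + 1;
  }
  return true;
}

// Re-encode Latin1 as UTF-8. Every Latin1 byte is one code point in
// U+0000..U+00FF, so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string Latin1ToUtf8(const std::string& latin1)
{
  std::string utf8;
  for (unsigned char c : latin1)
  {
    if (c < 0x80)
    {
      utf8.push_back(static_cast<char>(c));
    }
    else
    {
      utf8.push_back(static_cast<char>(0xC0 | (c >> 6)));
      utf8.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  assert(LooksLikeUtf8(utf8)); // postcondition of the conversion
  return utf8;
}
```

With these helpers, LooksLikeUtf8("\xF6.nrrd") is false, while Latin1ToUtf8("\xF6.nrrd") yields the bytes 0xC3 0xB6 followed by ".nrrd", which is what UTF-8 based UI code expects.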

What happens in MITK's I/O

mitk::IOUtil and probably other ITK-related file readers/writers use itksys::SystemTools::FileExists(). This method takes a const char* as input. Internally it uses a wrapped version of the locale-dependent mbstowcs (http://en.cppreference.com/w/cpp/string/multibyte) to create a wide-character representation of the input. On Windows this wide-character string is handed to GetFileAttributesW (https://msdn.microsoft.com/en-us/library/windows/desktop/aa364944(v=vs.85).aspx) to decide whether a file exists or not. GetFileAttributesW assumes Unicode (UTF-16) input.

Summary / Conclusion

  • Default conversions between QString and std::string seem the best choice for UI classes because it's Qt's default.
  • However, MITK's I/O classes (notably QmitkIOUtil) are built around "Local8Bit()" (hidden behind QFile::encodeName).
  • Calls to SystemTools::FileExists() have been very lucky so far: via QmitkIOUtil's QFile::encodeName() they got to see "locale" encoded C strings instead of UTF-8, which fits the further processing. If we fed UTF-8 into FileExists(), it would not work (on Latin1 systems).
  • Core question: What do we want to be in DataNode::m_Name? locale dependent strings or UTF-8? Currently we have a mix of assumptions in MITK that sometimes don't match.
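
Why feeding UTF-8 into a locale-dependent widening step fails on a Latin1 system can be illustrated with a small sketch. It uses only the C++ standard library; WidenAsLatin1 and WidenAsUtf8 are hypothetical names, and WidenAsLatin1 is only a stand-in for what a locale-dependent mbstowcs would do under a Latin1 locale, not the actual itksys code. On Windows the UTF-8 decoding would normally be done with MultiByteToWideChar and CP_UTF8; the hand-written decoder merely keeps the sketch portable.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Stand-in for a locale-dependent widening under a Latin1 locale:
// every input byte becomes exactly one UTF-16 code unit.
std::u16string WidenAsLatin1(const std::string& bytes)
{
  std::u16string wide;
  for (unsigned char c : bytes)
  {
    wide.push_back(static_cast<char16_t>(c));
  }
  return wide;
}

// Decode UTF-8 into UTF-16. Assumes well-formed input (no validation
// beyond a truncation check), which is enough for this illustration.
std::u16string WidenAsUtf8(const std::string& bytes)
{
  std::u16string wide;
  std::size_t i = 0;
  while (i < bytes.size())
  {
    const unsigned char lead = static_cast<unsigned char>(bytes[i]);
    char32_t cp = 0;
    std::size_t extra = 0;
    if (lead < 0x80) { cp = lead; extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else { cp = lead & 0x07; extra = 3; }

    assert(extra < bytes.size() - i); // truncated input is a caller error here
    for (std::size_t k = 1; k <= extra; ++k)
    {
      cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
    }
    i += extra + 1;

    if (cp < 0x10000)
    {
      wide.push_back(static_cast<char16_t>(cp));
    }
    else
    {
      // Code points above the BMP need a surrogate pair in UTF-16.
      cp -= 0x10000;
      wide.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
      wide.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
  }
  return wide;
}
```

For the UTF-8 bytes of "ö" (0xC3 0xB6), WidenAsUtf8 produces the single unit U+00F6, whereas WidenAsLatin1 produces the two units U+00C3 U+00B6 ("Ã¶"). A file lookup with the latter name would then search for a file that does not exist.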

We should define a standard way in which encoding is handled in MITK and provide a UTF-8 test which tries to read/write sample files with umlauts/Russian/Chinese characters in their names for each registered mime type. However, we currently do not have a way to create a sample file for each mime type, especially for those registered by a third party.

Event Timeline

You wrote:

we currently do not have a way to create a sample file for each mime type

Just to make sure we understand the problem in the same way: It had nothing to do with the encoding of file contents!

The problem was only about the file name and about the fact that (non-Qt) file readers expect file names in the locale's encoding. Such file names are stored to and potentially used from StringProperty ("name") in this locale's encoding. At the same time our Qt-based UI code expects that string properties are UTF-8 encoded --> collision

Very true. The "each mime type" is to ensure that every reader/writer is tested, not because of the content of the file.

My idea of an automated UTF-8 test is roughly as follows:

Query the IMimeTypeProvider for all registered mime types
for (MimeType : MimeTypes)
{
  SampleFile = MimeType->GenerateSampleFile (This needs to be implemented)
  for (Extension : MimeType->GetExtensions())
  {
    for (UTF8-FileName : VectorOfUTF8-FileNames)
    {
      Write(SampleFile, UTF8-FileName + Extension)
      Read(UTF8-FileName + Extension)
    }
  }
}
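
The loop above could look roughly like the following C++17 sketch. It does not use MITK: SampleType and CountFailedRoundTrips are hypothetical stand-ins for a registered mime type and for the test body, and the sampleContent string stands in for the not yet existing GenerateSampleFile(). Plain file streams take the place of the real readers/writers. File names are built with std::filesystem::u8path so that the UTF-8 bytes are interpreted independently of the locale (u8path is deprecated in C++20 in favor of char8_t based path construction).

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical stand-in for a registered mime type (not MITK's API).
struct SampleType
{
  std::vector<std::string> extensions; // e.g. ".txt"
  std::string sampleContent;           // stand-in for GenerateSampleFile()
};

// Writes and re-reads one file per (type, extension, name) combination
// inside 'dir' and returns the number of failed round trips.
int CountFailedRoundTrips(const fs::path& dir,
                          const std::vector<SampleType>& types,
                          const std::vector<std::string>& utf8BaseNames)
{
  assert(fs::is_directory(dir));
  int failures = 0;
  for (const SampleType& type : types)
  {
    for (const std::string& extension : type.extensions)
    {
      for (const std::string& baseName : utf8BaseNames)
      {
        // Interpret the name as UTF-8, regardless of the current locale.
        const fs::path file = dir / fs::u8path(baseName + extension);

        {
          std::ofstream out(file, std::ios::binary);
          out << type.sampleContent;
        } // stream closed here, content flushed to disk

        std::ifstream in(file, std::ios::binary);
        const bool opened = in.is_open();
        const std::string readBack{std::istreambuf_iterator<char>(in),
                                   std::istreambuf_iterator<char>()};
        if (!opened || readBack != type.sampleContent)
        {
          ++failures;
        }

        std::error_code ignored;
        fs::remove(file, ignored); // best-effort cleanup
      }
    }
  }
  return failures;
}
```

A test would call CountFailedRoundTrips with a temporary directory and a list of UTF-8 encoded names containing umlauts, Cyrillic and Chinese characters, and expect zero failures. In a real MITK test the stream operations would be replaced by the registered reader and writer for each mime type.
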
kislinsk claimed this task.
kislinsk added a project: Auto-closed.

Hi there! 🙂

This task was auto-closed according to our Task Lifecycle Management.
Please follow this link for more information and don't forget that you are encouraged to reasonably re-open tasks to revive them. 🚑

Best wishes,
The MITK devs

kislinsk removed kislinsk as the assignee of this task. May 26 2020, 12:05 PM
kislinsk removed a subscriber: kislinsk.