
SceneIORewrite: Incremental saving
Closed, Wontfix · Public

Description

Save only items that differ from items that are already part of a serialized scene.

Justification / Use-Case:
Given the nature of many applications, this would spare the majority of the (input) data from being serialized more than once. Often, the input (image) data never changes, so incremental saving would provide a major speedup from the second save on.

Implementation and questions:
There is already a patch by Rostislav Khlebnikov, based on ITK time-stamps. This solution could work nicely; the only downside is that, in its current version, it needs a “state”. Daniel Maleike thinks it would be more interesting to put information about time/date/version and modification times into the file(s) and to provide just an UpdateArchive() method that checks what needs updating.
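The core of such an UpdateArchive() check is a comparison of recorded modification times against current ones. A minimal sketch, using plain std::map in place of MITK types (all names and types here are illustrative, not the actual patch):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical record of per-item modification times: item name -> MTime.
using TimeStampMap = std::map<std::string, unsigned long>;

// Return the set of items whose current MTime is newer than the one
// recorded in the archive, i.e. the items an UpdateArchive() would rewrite.
std::set<std::string> ItemsNeedingUpdate(const TimeStampMap& archived,
                                         const TimeStampMap& current)
{
    std::set<std::string> stale;
    for (const auto& [name, mtime] : current)
    {
        auto it = archived.find(name);
        // New items, or items modified since the last save, need rewriting.
        if (it == archived.end() || mtime > it->second)
            stale.insert(name);
    }
    return stale;
}
```

Everything already in the archive with an unchanged MTime is skipped, which is where the speedup for unchanged input images would come from.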

Questions:
Daniel's suggestion could work efficiently if we did not compress our archives: compression makes reading the archive and determining which data needs an update a blocking task. This, however, is mainly due to the current “all or nothing” implementation. If we used a ZIP library to read only specific (small) index files, we could be faster.
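The index-file idea could be as simple as a small, cheaply readable entry inside the archive that maps stored files to their modification times, so only this entry has to be read to decide what needs updating. A hypothetical parser for such an index (the line-based format is invented here for illustration; MITK scenes actually keep their metadata in an index.xml):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Parse a tiny index of the form "filename mtime" (one entry per line)
// into a map. This stands in for reading a small index entry out of the
// ZIP archive without decompressing the whole scene.
std::map<std::string, unsigned long> ParseIndex(const std::string& text)
{
    std::map<std::string, unsigned long> index;
    std::istringstream in(text);
    std::string name;
    unsigned long mtime;
    while (in >> name >> mtime)
        index[name] = mtime;
    return index;
}
```

Combined with a per-item MTime comparison, this would make "what needs an update" a cheap question even for large archives.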

Event Timeline

Well, the UpdateArchive is conceptually already in my pull request - the code that compares the time stamps essentially does that. However, I agree there might be a better approach to store the "state" (the time stamps and the file name of the scene file). Given that this is basically per-application-instance information, we could either use a temporary external file instead of storing this state in the SceneIO itself, or, even better, use berry Preferences to store it.

I had some time to think about this in more depth yesterday. I am now convinced that just keeping track of MTimes during load and save is enough (i.e. exactly what Rostislav implemented).

There are improbable use cases that could profit from a more complicated solution (involving keeping track of the complete object state via hashes or similar), but I think the additional implementation effort is just not justified.

For a proper solution, I would propose the following:

  • remove the "currently open scene file" information from the interface; instead, keep track globally, for all instances of SceneIO (static member), of which BaseData with which MTimes are stored in which zip files. This way, the application level need not cover that topic and multiple SceneIO instances can share their information.
  • move the zip archive updater to a separate class (I already have an initial version of that locally)
  • verify the MTime comparisons. It seemed to me that one place takes DataNode times while another location compares against BaseData MTime
  • try to refactor the SaveScene method into smaller pieces. Currently it looks too long and complicated
  • important one: accompany the whole change with tests. We need tests to verify this feature, but also to have a basis for evaluating further ideas (see other bugs blocking T19248)
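The first point above, a globally shared record of what was stored where with which MTime, could be sketched as a small registry class. All names are illustrative (items are keyed by a string id here rather than a BaseData* for simplicity); this is a sketch of the proposal, not the actual MITK code:

```cpp
#include <cassert>
#include <map>
#include <string>

// Sketch of the proposed bookkeeping: a static map shared by all SceneIO
// instances, recording which item was stored with which MTime in which
// zip file.
class SceneSaveRegistry
{
public:
    static void RecordSave(const std::string& zipFile,
                           const std::string& itemId,
                           unsigned long mtime)
    {
        s_SavedMTimes[zipFile][itemId] = mtime;
    }

    // True if the item must be (re)written into the given archive.
    static bool NeedsSave(const std::string& zipFile,
                          const std::string& itemId,
                          unsigned long currentMTime)
    {
        const auto fileIt = s_SavedMTimes.find(zipFile);
        if (fileIt == s_SavedMTimes.end())
            return true; // archive never written by this process
        const auto itemIt = fileIt->second.find(itemId);
        return itemIt == fileIt->second.end() || currentMTime > itemIt->second;
    }

private:
    // zip file -> (item -> MTime at last save); shared across instances.
    static std::map<std::string, std::map<std::string, unsigned long>> s_SavedMTimes;
};

std::map<std::string, std::map<std::string, unsigned long>>
    SceneSaveRegistry::s_SavedMTimes;
```

Because the state is static, any SceneIO instance saving to the same file benefits from the bookkeeping, and the application layer can still build a "currently open project" notion on top.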

I can probably (needs internal discussion) volunteer to work on this topic and start quite soon.

Daniel, I agree with most of what you are saying, but here are some thoughts.

  1. Do you suggest replacing m_LoadedProjectFileName and m_NodeLoadTimeStamps with basically a map<filename, map<BaseData*, timestamp_at_load>>? This sounds more flexible - i.e. loading two scenes at a time and saving to either of them would allow incremental saving. I personally prefer to have the notion of a currently open file, which is the most widespread use case for my application (not many people load multiple scene files), and it allows the familiar "save" action, which doesn't ask which file to save to, alongside a "save as". But I guess it's a matter of preference and could be implemented on top of a more flexible underlying approach.
  2. The MTime comparison looks like a "bug" in my implementation. It works only because the data node timestamp is never older than that of the base data it contains (due to how mitk::DataNode::GetMTime() is implemented). Clearly, using the BaseData time stamp everywhere would make more sense. However, care should be taken if the property persistence service (http://bugs.mitk.org/show_bug.cgi?id=19209) actually involves the data node properties. I'm just not sure how it's implemented. If only BaseData properties can be used in this property persistence service, then using the BaseData timestamp should be good enough.

(In reply to Rostislav Khlebnikov from comment #3)

  Daniel, I agree with most of what you are saying, but here are some thoughts.

  1. Do you suggest replacing the m_LoadedProjectFileName and m_NodeLoadTimeStamps to basically a map<filename, map<BaseData*, timestamp_at_load>>? This sounds more flexible - i.e. loading two scenes at a time and saving to either of them would allow incremental saving. I personally prefer to have the notion of currently open file which is the most widespread use case for my application (not many people load multiple scene files) and it allows to have the familiar "save" action which doesn't ask which file to save to and a "save as". But I guess it's a matter of preference and could be implemented on top of a more flexible underlying approach.

Hi Rostislav,

thanks for your input. Yes, you correctly summarized what I proposed. I think this would still allow the "application layer" to keep the concept of a currently open project without forcing it to keep one instance of SceneIO alive.

  2. MTime comparison looks like a "bug" in my implementation. The fact that it works just comes from the fact that the data node timestamp is never older than that of the base data it contains (due to how mitk::DataNode::GetMTime() is implemented). Clearly, using BaseData time stamp everywhere would make more sense. However, care should be taken if the property persistence service (http://bugs.mitk.org/show_bug.cgi?id=19209) actually involves the data node properties. I'm just not sure how it's implemented. If only BaseData properties can be used in this property persistence service - then using BaseData timestamp should be good enough.

OK, thanks for confirming what I have seen. Pointing out T19209 is a good thing; I did not yet look too much into what is happening there, but I'll do that before starting here.

I'll keep you updated and would welcome having your opinion once I have a first implementation.

kislinsk claimed this task.
kislinsk added a subscriber: kislinsk.
This task was automatically closed because it hasn't been updated since July 2016 (over 2 years). Please re-open this task if you think that it is still relevant. This most probably means that you will resolve it.