Message boards : Number crunching : More checkpoint problems
Author | Message |
---|---|
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
For example, a current tusc work unit has been running for an hour without committing a single checkpoint. On some of my machines I just put the machine to sleep when I notice these things, but that is not an option on this machine. (Security rules.) I've noticed that a lot of these work units have weird annotations. In this example, the current workunit name includes the words 'tusc closed IGNORE THE REST', which suggests I should be able to nuke it without a scientific loss? Or not? It certainly looks like my machine loses an hour of work if I nuke it. P.S. In case it isn't obvious, I'd prefer the projects run without problems. These things most often call themselves to my attention when I see work units that are hung for several days, apparently restarting from zero each time the machine is booted. |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
"the current workunit name includes the words 'tusc closed IGNORE THE REST', which suggests I should be able to nuke it without a scientific loss? Or not?" Rosetta task names are normally incomprehensible to anyone not intimately familiar with protein science and the Rosetta system. In general terms, however, the names usually contain details of the protein being worked on and the investigation method used. I am not sure exactly what "Ignore the rest" means, but I suspect it is something like "save all results showing a particular characteristic and ignore the rest". You will also sometimes see "save all out" instead, which I suspect is a less specific option that records all data discovered in the task. If you post a link to the task page or give the full task name, there may be other elements that we can have a go at translating (assuming that one of the scientists doesn't jump in with the real answer). "These things most often call themselves to my attention when I see work units that are hung for several days, apparently restarting from zero each time the machine is booted." When you see one hanging around for a few days or constantly losing progress, please report it in the Minirosetta 3.52 thread (or the equivalent thread for a later version of Minirosetta). Giving the task number or task name will also help the scientists trace the problem and fix it in the next version update. |
shanen Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
Okay, just the ACK for now, but I'll try to keep my eyes open. I have discovered that the Properties button will tell you if there is a major discrepancy in the save time, but I suspect the deeper problem is that there is simply no progress being made... |
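The check described above (comparing a task's elapsed CPU time against its last checkpoint) could be automated roughly as follows. This is only a sketch under stated assumptions: the sample text imitates the style of `boinccmd --get_tasks` output, but the exact field names (`checkpoint CPU time`, `current CPU time`), the task names, and the 30-minute threshold are illustrative guesses, not something confirmed in this thread.

```python
import re

# Hypothetical sample imitating task-list output; the field names and
# task names are assumptions made for illustration only.
SAMPLE = """\
1) -----------
   name: task_a_0
   checkpoint CPU time: 0.000000
   current CPU time: 3600.000000
2) -----------
   name: task_b_0
   checkpoint CPU time: 1750.000000
   current CPU time: 1800.000000
"""


def parse_tasks(text):
    """Split the text into one dict of 'key: value' fields per task."""
    tasks = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if re.match(r"^\d+\) -+$", line):
            # A line such as "1) -----------" starts a new task record.
            current = {}
            tasks.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return tasks


def stale_tasks(tasks, max_gap_seconds=1800.0):
    """Return (name, gap) for tasks whose CPU time since the last
    checkpoint exceeds max_gap_seconds (an arbitrary threshold)."""
    stale = []
    for task in tasks:
        try:
            gap = float(task["current CPU time"]) - float(task["checkpoint CPU time"])
        except (KeyError, ValueError):
            continue  # skip records missing either field
        if gap > max_gap_seconds:
            stale.append((task.get("name", "?"), gap))
    return stale


if __name__ == "__main__":
    for name, gap in stale_tasks(parse_tasks(SAMPLE)):
        print(f"{name}: {gap:.0f} s of CPU time since last checkpoint")
```

A large gap only shows that no checkpoint has been written recently; it does not by itself show whether the task is making progress, so a flagged task is still worth reporting with its full name.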
©2024 University of Washington
https://www.bakerlab.org