Message boards : Number crunching : Odd wu/wu behaviour?
Author | Message |
---|---|
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
17128664 seemed to be going wrong. It was still under 2% finished after more than an hour of crunching. It was moving, because I'd noticed it at 1.87% and later saw it had crawled to 1.9%, so not a 1% issue. Thing is, next time I looked at it, ~5 minutes later, it was complete and ready to report. Note that despite a 7200 seconds preference, it ran for a lot less. This is total guesswork, but is the short run time because the first "decoy" took more than half the allotted time slot? Still odd with the progress bar. My current wu seems to be doing the same thing. It's been running for 47 minutes but is only 1.2% complete. The wu on my other machine has been running 1 hour 23 and showing 62% finished, which is much more like it. Both are running 4.98, (4.83 of course). *** EDIT *** That wu is now pre-empted at 59:53 and is showing 1.26% complete. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Have a look at my reply to some other fellow here. He asked: Here's an odd one... Rosetta 4.98, WU _largescale_large_fullatom_relax_dec7449_1_08_2.pdb_431_25_0 running with BOINC 5.2.13 on Windows XP 64-bit SP1 on an Athlon 64 3200+ with 512MB RAM. I also have SETI@home on that machine. Starts up, 50% done, 2 hours CPU time used, runs for about an hour, at the end of that time it's still about 50% done, but has 3 hours CPU time; swaps out... SETI runs for an hour and swaps out... and then Rosetta swaps in again, 50% done, 2 hours (!) CPU time used. Caught this one because the accepted protein shape is pretty uncommon (looks sort of like a lollipop). Shall I kill it or do you want me to keep watching it for a while? It's been on here for three days now, which means ballpark 36 hours, but I think I have only 2 hours credit for it... These *_largescale_large_fullatom_relax* WUs are very big WUs which take a loooong time per model. On P4s they take 2-4 HOURS PER MODEL, so unless you have "Leave in mem when pre-empted"=YES, the PC can't complete even 1 model in 2hr before Rosetta gets swapped out to run SETI and your PC starts the WU from 0 again... Solution: increase "time between swaps" to e.g. 4hr or IDEALLY (if your PC has enough RAM and/or runs few BOINC projects) set "leave in mem when preempted"=YES. I always choose leave in mem=YES. This very example is why Rosetta needs a BigWU flag in preferences IMHO... AMD_is_logical also explained it in a previous comment: Another problem is that the bug requiring "keep in memory" has been fixed. That means a lot of people are setting "keep in memory" to "no". There are places in some WUs that require more than an hour to get to the next checkpoint, so with the default switching time of one hour the WU will keep dropping back to the last checkpoint indefinitely. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
I didn't know the "leave in memory" problem was still an issue here, but have been a member for a long time, and habitually set "leave in memory" to true anyway. I've watched a couple of wu now, and there does seem to be a difference in the way my 2 machines here run. Foniks seems to have a working progress bar, Evesham does not. The current wu on Evesham has 1:59:47 time, (pre-empted now), but shows 1.50% complete. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
The "leave in memory" is no longer an issue, but Rosetta "check-points" at the end of each model (notice the Model/Step info in the screensaver). AMD_is_logical explained it very well. Most WUs are small proteins, which take only ~10min per model. Some recent ones are very big and take 2-4hr per model (on Pentium4!). So to finish such a WU, Rosetta needs to run on your PC for 4hrs, WITHOUT being unloaded from memory. If a PC unloads Rosetta every hour to run another project, it will never finish, as it'll start every time from scratch. The surest way to run Big WUs would be to set "leave in memory when preempted"=YES. IMHO this needs to be handled by the project, submitting big jobs only to PCs which run 24/7 and/or have leave-in-mem=yes. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
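The "never finishes" case described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the only checkpoint is at the end of a model, an unloaded task restarts the current model from scratch, and a task left in memory keeps its partial work. The function name `cpu_hours_to_finish` and its parameters are hypothetical and are not part of Rosetta or BOINC.

```python
def cpu_hours_to_finish(model_hours, switch_hours, leave_in_memory, max_slices=100):
    """Return the CPU hours spent before one model completes, or None if it never does.

    Assumptions (illustrative only):
    - the only checkpoint is at the end of a model;
    - every `switch_hours` the task is pre-empted by another project;
    - if the task is not left in memory, un-checkpointed work is lost.
    """
    progress = 0.0  # hours of work done on the current model
    spent = 0.0     # total CPU hours consumed so far
    for _ in range(max_slices):
        remaining = model_hours - progress
        if remaining <= switch_hours:
            # The model finishes inside this time slice.
            return spent + remaining
        spent += switch_hours
        if leave_in_memory:
            progress += switch_hours  # partial work survives pre-emption
        else:
            progress = 0.0            # restart the model from scratch
    return None  # no model completed within max_slices slices


# A 3-hour model with 1-hour switching and leave-in-memory off never completes.
print(cpu_hours_to_finish(3.0, 1.0, leave_in_memory=False))
# The same model completes if it stays in memory, or if the switch interval is 4 hours.
print(cpu_hours_to_finish(3.0, 1.0, leave_in_memory=True))
print(cpu_hours_to_finish(3.0, 4.0, leave_in_memory=False))
```

Under these assumptions, either raising the switch interval above the per-model time or keeping the task in memory is enough for the model to complete, which matches the two solutions suggested in the thread.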
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
As I said above, "Leave in Memory" is set to true. This is not the issue. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
"IMHO this needs to be handled by the project, submitting big jobs only to PCs which run 24/7 and/or have leave-in-mem=yes." Which will require a modification to Boinc to pass that information back to the Rosetta servers. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
"IMHO this needs to be handled by the project, submitting big jobs only to PCs which run 24/7 and/or have leave-in-mem=yes." The way BOINC works, global and per-project preferences are already stored on the project's (Rosetta's in this case) servers (unless one uses a local .xml file override, which virtually nobody knows about). Actually, you're correct in that our PC's BOINC client uses global BOINC settings from the most-recently updated profile of all BOINC projects we run. So if I run Rosetta+SETI+Einstein and make a change in the global settings of e.g. Einstein, then Rosetta won't know it. I think the idea was to use some BOINC "account manager" to sync this info. I don't know if any BOINC project is sending WUs customized to client PC's profile-preferences (BigWU=yes) and/or capabilities (RAM>512, fast CPU, 24/7 operation etc). A BigWU flag could be part of local-preferences (like the flags for WU-runtime and %-CPU-time-taken-by-screensaver), but I don't know if the BOINC server code supports customised feeding of WUs. Seems like this needs to be coded in BOINC's scheduler/feeder https://boinc.bakerlab.org/rosetta/rah_status.php Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
"17128664 seemed to be going wrong. It was still under 2% finished after more than an hour of crunching. It was moving, because I'd noticed it at 1.87% and later saw it had crawled to 1.9%, so not a 1% issue." The WU will start at 1%. Then there will be small increments as the WU goes through several stages of the first model (less than 1% total). After each model a larger jump in percentage is made based on how many models the rosetta app thinks it can do. If it doesn't think it can fit another model in without running too far over, the percent will jump directly to 100%. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
"17128664 seemed to be going wrong. It was still under 2% finished after more than an hour of crunching. It was moving, because I'd noticed it at 1.87% and later saw it had crawled to 1.9%, so not a 1% issue." That is the correct answer. We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
The "leave in memory" is no longer an issue, but Rosetta "check-points" at the end of each model (notice the Model/Step info in the screensaver). AMD_is_logical explained it very well. we'd like to be able to do this, but there is no mechanism currently in boinc that allows this. during casp which is coming up soon, we may ask participants to set leave in memory = yes as there are likely to be some larger proteins |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
Having "noticed" this behaviour, I have taken to watching that machine more carefully. I can now report that all the wu's I've watched have done the same thing. The wu progresses very slowly up to around 2% then finishes. This is not the same as wu's on this machine. Here, my current wu has 1:04:33 and 37.4% done, the box in question has 1:59:46 and 1.52% done. Both machines have work unit length set to 7200 seconds, as can be seen in the results. The progress bar is working differently on that machine for some reason. Both are Intel P-IV systems, on similar ASUS MoBo's, both run BOINC 5.2.x, both are returning valid results after about the same amount of time. The machine showing this behaviour has considerably less memory in it than the other, (256M v 1G), and is running NT4 rather than XP. I don't particularly care, but it does seem odd, and "odd" behaviour often is symptomatic of a deeper problem. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
Sorry, it won't let me edit! "The WU will start at 1%. Then there will be small increments as the WU goes through several stages of the first model (less than 1% total). After each model a larger jump in percentage is made based on how many models the rosetta app thinks it can do. If it doesn't think it can fit another model in without running too far over, the percent will jump directly to 100%." If this was the case, would not the details of my results show that only 1 structure had been produced? This is not as observed in the results for that machine. I will set the wu length up to 4 hours to see if the behaviour changes however. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Knorr Send message Joined: 18 Feb 06 Posts: 21 Credit: 373,953 RAC: 0 |
Could you post a link to the results you are thinking about? Because all the largescale_large_fullatom_relax WU's I've checked in your results actually have only 1 structure. And, as explained before, the percentage will hardly reach 2% before the first model is completed. Then the percentage will jump according to your CPU run time pref. If you have a largescale WU which would take about 3 hrs for one model and your run time is set at 2 hrs, then the WU is gonna complete that model, even if it's 1 hr above your settings. And the progress percentage will start at 1%, increase in small steps towards 2% all the way up to 3 hrs CPU time, and then jump to 100% and completion. - Knorr |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
This wu was crunched by Evesham, (the node which is showing this odd behaviour), and looks to me as if it has done 6. This one was crunched by the machine which is not showing the errant behaviour and appears to have only done 1. Right now, this machine has a wu in progress, at 2:16:48 it shows 49.32% complete, (I have changed the target run time to 4 hours). The wu on Evesham is pre-empted at the moment at 0:57:07 showing 1.53% complete. Evesham is a slower machine, but not by a vast amount, it is a 2.533GHz Northwood, whilst Foniks is a 3.2GHz Prescott. Evesham is only running BOINC whilst Foniks is running my production web server and database server, so is doing a little other work as well. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
I have one of those giants right now, and after I read this thread, I set it to stay in memory, but I had to close down my computer, so now it's back to zero! Both in time and in progress, so the almost 2 hours it ran seems to be lost. :-( Is there any possibility to let those giants write to disk more often by creating some checkpoints? Then it will only go back to the checkpoint and continue from there. "I'm trying to maintain a shred of dignity in this world." - Me |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
"Having 'noticed' this behaviour, I have taken to watching that machine more carefully. I can now report that all the wu's I've watched have done the same thing. The wu progresses very slowly up to around 2% then finishes. This is not the same as wu's on this machine." adrianxw, I think the "odd" %-progress behaviour you're seeing might be because the WUs running on your PC can be very different. It can be apples and oranges. A model on a HBLR* WU might take 10min and a *_largescale_large_fullatom_relax_* might take 3hr. Rosetta is very different than most other BOINC projects, which have more or less constant size WUs. WU %-progress might not increase linearly with time, as AMD_is_logical / Snake_Doc said. Especially if you're using very short WU runtime. The *_largescale_large_fullatom_relax_* WUs are very big and sometimes "Steps" remains at 0. Usually just one "Model" will fit in the 7200 seconds (2hr) timeframe, in which case the %-progress indicator may stay at e.g. 1.5% for 1-3 hours while computing the first model and then finish, realise that it can't run a second model per your WU-runtime settings (7200 sec might have been already exceeded), so it jumps to 100% and finishes. I use 8-hr WU-runtimes (instead of 2hr default) and a big WU taking 2hr per Model might jump 0% -> 25% -> 50% -> 75% -> 100% in BOINC progress. Hope this helps and I understood your questions correctly this time. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
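The jumpy progress described above can be sketched in a few lines. This is a rough illustration, not the actual Rosetta logic: it assumes every model takes the same time, that the app starts another model only if that model would still finish within the target runtime, and that the bar otherwise jumps straight to 100%. The exact rule the app uses for "running too far over" is not stated in the thread, and the function name `progress_after_each_model` is hypothetical. The small sub-1% increments inside the first model are ignored.

```python
def progress_after_each_model(model_hours, target_hours):
    """Return the percentages shown after each completed model.

    Assumptions (illustrative only):
    - every model takes exactly `model_hours`;
    - another model is started only if it would finish within `target_hours`;
    - otherwise progress jumps directly to 100% and the WU ends.
    """
    elapsed = 0.0
    shown = []
    while True:
        elapsed += model_hours  # one more model completed
        if elapsed + model_hours > target_hours:
            shown.append(100.0)  # no room for another model: finish
            return shown
        shown.append(100.0 * elapsed / target_hours)


# 2 hours per model with an 8-hour target: 25 -> 50 -> 75 -> 100.
print(progress_after_each_model(2.0, 8.0))
# 3 hours per model with a 2-hour target: a single jump to 100
# after the first (overlong) model completes.
print(progress_after_each_model(3.0, 2.0))
```

With these assumptions, a machine that happens to receive a WU with short models shows a smooth-looking bar, while one that receives a WU whose single model exceeds the target sits near 1-2% and then finishes, which is the difference reported between the two machines.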
Mikkie Send message Joined: 1 Apr 06 Posts: 9 Credit: 5,700 RAC: 0 |
"The *_largescale_large_fullatom_relax_* WUs are very big and sometimes 'Steps' remains at 0. Usually just one 'Model' will fit in the 7200 seconds (2hr) timeframe, in which case the %-progress indicator may stay at e.g. 1.5% for 1-3 hours while computing the first model and then finish, realise that it can't run a second model per your WU-runtime settings (7200 sec might have been already exceeded), so it jumps to 100% and finishes." Yeah right, ever thought about people who don't run 24/7 or aren't running power engines? There are still people over here who do it just for the fun. I had such a largescale wu but abandoned it because it was still busy crunching Model 1 after 9 hours at 1.4%. All I get at the moment are these things. They all get dumped. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
I'd set my wu runtime down to 1 hour because of the problems with application 4.97. I have set it back to 4 hours, and not noticed any real change in behaviour, although I have not had it like this for long yet. Certainly, the wu's I have running now should be with the new target time, and they are presenting VERY differently on screen as I look. Both are running largescale_large_fullatom wu's, on one machine it shows 01:11:44 and 1.58%, the other, 1:14:06 and 27.27%. I do understand the way the system works. I am well used to non-linear progress bars. As I've said above, I don't really care, but it struck me that my 2 systems are behaving very differently, and I'd like to understand that so I am happy that there is not a hidden issue here. I linked a couple of results further up, one crunched by the "weird" system which, to me at least, seems to show it ran several structures. Another from the machine that has a much more linear progress bar, which apparently only managed one. I'd appreciate someone who knows what those results show explaining it to me. It is not the first time I've had a problem. I had to switch off Leiden@Home on Evesham because after the recent upgrade to the science application, my results were being judged invalid. This was due to a reporting difference between NT4 and XP. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
"Certainly, the wu's I have running now should be with the new target time, and they are presenting VERY differently on screen as I look. Both are running largescale_large_fullatom wu's, on one machine it shows 01:11:44 and 1.58%, the other, 1:14:06 and 27.27%." On a dual-CPU PC (WinXP) in front of me I have two largescale_large_fullatom WUs running concurrently. Both WUs have been running about the same time, ~1+hr; one shows progress 1.3021%, the other about 25%, very similar to what you see. But mine are on the same PC. I think that the diff is in how/whether "steps" are incremented. In one case it stays at 0, in the other it increments as usual. Bottom line: I wouldn't worry about it! Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
I wasn't overly worried about it as an issue, I was simply afraid that there may be an underlying problem with Rosetta and NT4, (the latest Leiden@Home client does not work properly with NT4 for example). This morning however, I saw I had a wu on Evesham that was 01:08:01 and 28.43% complete. So I believe the "issue" to be a non-issue as explained. Cheers folks. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
©2024 University of Washington
https://www.bakerlab.org