Message boards : Number crunching : Target CPU run time 2 hours
Author | Message |
---|---|
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
But the machine has already been running the same WU (HB_BARCODE_30_2chf__351_46196_0) for 4 hours and is now at 60%. What is wrong? After a restart it is back to less than 2 hours. It's getting a bit annoying again. Do we perhaps have to delete all BARCODE WUs? Sometimes I think it's too bad our Stampede went over to R@H instead of F@H. |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
And again more than 2.5 hours. Delete all BARCODE WU's. Really getting a bit sick of it. If I wanted to babysit, it still wasn't gonna be computers. |
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
Since I can't see your computers (hidden) and you haven't given more info, I don't think anyone can help you. Remember that regardless of the CPU run time setting, a WU needs to complete AT LEAST ONE (1) MODEL of the particular protein being studied. A very fast computer might compute e.g. 10 models within those 2 hours, whereas a slower one might need 10 hours for just one model. It also depends on the protein and type of study: some models have tens of thousands of steps and others have hundreds of thousands. You can tell if your Rosetta WU is "hung" by looking at the graphics. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
Graphics show it is running and stuck at 60%, yet it is at model 3, step 44,444 and going. Still no explanation for exceeding the target CPU run time, or for why it ran for 4 hours the first time and, after a restart, the shown time jumped back to less than 2 hours. Do I have to keep restarting to get it back to 2 hours? And what is the use of setting the run time to 2 hours if it just keeps on running? Or am I missing the purpose of this parameter? You can just ignore it then. I just deleted all the BARCODE WUs, including the running one, and will wait to see what happens next. If something like this happens again, this machine will be switched over to another project. The same will happen with my other machines if needed. I pay enough for this crunching, so I won't waste any more money. EDIT: Running R@H only! And although it should make no difference, it is set to yes. |
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
The % complete is less useful, because Rosetta checkpoints infrequently (this is true for many projects). When you monitor a WU, just watch the model # and step ###. If they increment, it's going fine. There are other circumstances in which a WU might take long. Do you run other BOINC projects on that particular PC, switching between them every 1hr (default)? Do you have the "keep in memory when pre-empted" setting at "Yes" or "No"? PS: Personally, I have run Rosetta on 3x P4 PCs (along with other BOINC projects, but Rosetta is my main project) for the last 3+ months and so far I've only had ONE (1) WU stuck. And that was 2 months ago. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
The way the project works is that you are given the configuration option of setting the target runtime. The client then checks after each model run to see if the next run would put you over that time limit. If the WU runs longer, it just means you're getting more models processed. And you will get credit for more models processed. Each time you restart a WU, you are throwing away the time spent since the last checkpoint, and this has to be computed again when it restarts. They've added some additional checkpoints, and thus greatly reduced the amount of work that is lost. I wasn't clear on your description about restarting and it going back to 2hrs. Were you looking at the "CPU time" or the "Time to completion"? Here is the behavior that I would expect; see if you're seeing the same. I'd expect that I'd set my project preferences to 2hrs (as in your thread subject line), then I'd update to the project for the preference to take effect. Then (from what I'm seeing) any WUs in progress will, when they reach the end of the model they are presently processing, check if they're over that 2hrs, and if they are, they will not start another model. They'll mark the WU completed and send back the results for it, with the number of models you've processed. Now, the "time to completion", that's a longer story. I'm going to keep it short and just say "don't worry about it". It will take BOINC crunching a few WUs and updating to the project a few times to get the hang of the fact that the WUs are completing in the 2hrs. It will gradually see that the WUs are in fact taking 2hrs, and make the estimates start at 2hrs. In short, leave it alone for a few hours and see if it's working the way it's supposed to. It sounds like you're interrupting it just when it's nearing the end of a model, and losing valuable work because it's unable to complete normally. 
I'm thinking that you've got a WU with a long chain, it's checkpointed itself partway through that 3rd model, and it's just trying to complete model 3, and then will see it's exceeded the 2hr target, and complete the WU. Keep crunchin' Rosetta! Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
It was "CPU TIME". And before, they ran for no more than 2 hrs (even when time to completion was sometimes perhaps 12 hrs or more on a Cellie, it did not exceed 2 hrs), so when I see it running for 3 hrs or more I expect something's wrong. I've set it to 4 hrs now, but perhaps it's better to make that 16 hrs. Still, I don't understand why this machine downloaded a WU on which it had to crunch much longer than the TCRT set. Why have this parameter if it is applied randomly? |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
It was "CPU TIME". Please read the explanation for all of this in the FAQs. Any WU can run longer than the user-selected time if it takes longer to complete a model than the user's setting. All WUs will complete at least ONE model no matter how long that takes. By stopping and restarting the WU you are causing it to lose CPU time. When you stop the WU it will lose all CPU time since the last checkpoint. In your case you are causing the WUs to lose over two hours of processing time every time you reset them. The "Leave in memory" setting has nothing to do with your problem. If you let the WUs finish, the BOINC software will adjust to the WU run time parameters you have set. By aborting them you are preventing the system from adjusting, and causing the problem yourself. Moderator9 ROSETTA@home FAQ Moderator Contact |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
It was "CPU TIME". If I read it correctly you cannot stop/restart your computer without the chance of losing time. If so, the decision to change project is made very easy. I cannot remember having this much trouble running DF, FAD etc. And I just want to run a medical project without this much trouble. It has to be fun, not an annoyance. As far as I'm concerned just another 24 days. And if you had read Dimitris' last post you would have known that I was replying to his remark "Do you run other BOINC projects on that particular PC, switching between them every 1hr (default)? Do you have the "keep in memory when pre-empted" setting to "Yes" or "No"?". |
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
I cannot remember having this much trouble running DF, FAD etc. Perhaps you're right, BUT the projects you mentioned like FAD (and F@H etc) ran STAND-ALONE, whereas Rosetta runs via BOINC, which switches (unloads) tasks every X minutes (usually 60). It is very different from a technical perspective. Rosetta (and many other programs, e.g. AutoDOCK which is used by projects like FightAIDS@home) can't just "checkpoint" at a moment's notice, because then it'd have to "dump" (write) 250MBytes of data from RAM onto disk, so it can continue from the same point later. There are several ways to solve this issue for slower PCs and/or the occasional bigger WU: have BOINC switch projects every e.g. 2hr, or use "Leave in memory when pre-empted" = YES. Personally I use the latter (leave in mem=YES). So, with BOINC-powered projects we have some "advantages", i.e. we can contribute to MANY projects we like, not just ONE project per PC, but we need to do a little reading / tweaking occasionally. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Moderator9 Volunteer moderator Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
...If I read it correctly you cannot stop/restart your computer without the chance of losing time. You can stop and start the computer. It is more a matter of WHEN you start and stop. If you stop right after the application checkpoints, you will not lose work. If you stop it a long time after a checkpoint, you will lose all the work since that checkpoint. This is true of ALL BOINC projects, but most only lose a few seconds to a few minutes. CPDN and Rosetta do not checkpoint that often, so the loss is greater on these projects. The important point is that you MUST leave the system alone so that it can adjust itself as it runs the Workunits. By aborting them, and starting and stopping them frequently, you are preventing the BOINC client from adjusting the system to properly handle the user time setting. From what I can see, many of the Workunits you have been stopping are not hung. They are simply taking longer than you want them to take to complete. I am not sure I understand what difference it makes how long they take as long as they complete successfully. It is also important to remember that if you change your time setting, the change will not take effect until (unless) you update the project from the projects tab. In most cases all the WUs will adjust to the new run length. In rare cases a running WU will not adjust and will finish using the old time setting, but that does not happen often. During the high resolution phase of model generation you have to look very closely at the graphic display. Sometimes it can take from 45 seconds to over a minute to see if the WU is "Stepping" or not. During the high resolution search the steps take a long time, but as long as it is stepping it is not hung, and you should let it run. Also, the graphic will not move very much. During the low resolution search the graphic moves a lot, and people expect to see that. But that is not the case for the high resolution part of the model. 
As to keep in memory, the point I was trying to make is that if you restart BOINC or interrupt the Rosetta application, the WU will always start over from the last checkpoint. The keep in memory setting is only an issue for application switching or application suspend functions. If you are running only Rosetta, AND you never suspend the application or the WU, this setting does not matter. In your case the best advice would be to pick a "Time" setting, and leave it alone. Then let the system run undisturbed for a few days. The system will then be able to adjust to the run length of the WUs and the run time will become more accurate. IF you are certain that a WU is hung, then abort only that WU. Each WU is handled separately by your system. Just because one hangs, that does not mean that all of those with the same name will hang. Moderator9 ROSETTA@home FAQ Moderator Contact |
AMD_is_logical Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
If I read it correctly you cannot stop/restart your computer without the chance of loosing time. If so, it is made very easy to change project. Yes, with boinc it is very easy to change project. However, ALL boinc projects drop back to the last checkpoint when you stop/restart your computer. I cannot remember having this much trouble running DF, FAD etc. FaD ran smoothly for the most part, but DF was far worse than rosetta. It was just one problem after another, and near the end I gave up on it (as did most other crunchers) when I couldn't even upload results after the latest "fix" to the DF server made it unbelievably slow (and they kept saying it was running fine and refused to do anything about it). |
Angus Joined: 17 Sep 05 Posts: 412 Credit: 321,053 RAC: 0 |
The issue is the long intervals between checkpoints, and no visibility of checkpoints happening or other progress indicators WITHOUT going into the graphics. There is nothing available to BOINC Manager to show what is really happening. Can the client be modified to output the model # and step to a BOINC Manager message at frequent intervals, either timed or after a pre-set number of steps? Why can't the Rosetta client checkpoint at fixed, frequent intervals - like 15 minutes or less? Then the maximum you would lose from a re-boot would be limited. I think people would accept the write-to-disk penalty over losing hours of crunch time. In the case being discussed here, allowing the system to adjust the Duration Correction Factor doesn't seem to be the issue. It's a Run-Time preference of 2 hours that is being drastically exceeded. DCF will NOT affect run time - only the totally broken estimate of Time to Completion. You can stop and start the computer. It is more a matter of WHEN you start and stop. If you stop right after the application checkpoints, you will not lose work. If you wait and stop it after a long time after a checkpoint you will lose all the work since the last checkpoint. This is true of ALL Boinc projects, but most only lose a few seconds to a few min. CPDN and Rosetta do not checkpoint that often so the loss is more on these projects. Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :) "You can't fix stupid" (Ron White) |
Grutte Pier [Wa Oars]~MAB The Frisian Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
Dimitris' quote: Rosetta (and many other programs, e.g. AutoDOCK which is used by projects like FightAIDS@home) can't just "checkpoint" at a moment's notice, because then it'd have to "dump" (write) 250MBytes of data from RAM onto disk, so it can continue from the same point later. As Angus says: Why can't the Rosetta client checkpoint at fixed, frequent intervals - like 15 minutes or less? Then the maximum you would lose from a re-boot would be limited. I think people would accept the write-to-disk penalty over losing hours of crunch time. I prefer that. Quote of Moderator9: It is more a matter of WHEN you start and stop. If you stop right after the application checkpoints, you will not lose work. So can you check whether you are stopping at the right moment or not? Or is it just a gamble? Is it perhaps a 24/7 project without any interruptions? What if you have to restart because you just installed new definitions for your virus scanner, for instance? In some cases you're just unlucky? @ AMD is logical: I didn't mean another BOINC project (but F@H) and, as I stated before, I cannot remember having this many problems with DF or any other project. Perhaps I was lucky. |
Dimitris Hatzopoulos Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
My experience with Rosetta is that on an average P4/2.6GHz it takes between 10-25min to complete most models (I don't remember the stats of the HUGE 250MB WUs). A slower PC might take 2-5 times as long. So, if a "slower" PC has both "leave in mem when pre-empted"=NO (default) and relatively short "switch between projects every X min"=60min it MIGHT lose almost all its work during a short 60min "timeslot" if it gets swapped out just before it checkpoints. Personally, I use "leave in mem when pre-empted"=YES, which means I never lose ANY work due to BOINC switching projects. The next best setting would be to increase "switch between projects every" to 3 or 4 hours. Why can't the Rosetta client checkpoint at fixed, frequent intervals - like 15 minutes or less? Then the maximum you would lose from a re-boot would be limited. I think people would accept the write-to-disk penalty over losing hours of crunch time. I believe the reason is what I wrote earlier: "Rosetta (and many other programs, e.g. AutoDOCK which is used by projects like FightAIDS@home) can't just "checkpoint" at a moment's notice, because then it'd have to "dump" (write) 100-250MBytes of data from RAM onto disk, so it can continue from the same point later." Rosetta checkpoints at the end of every "model" (predicted protein structure). Basically, some applications are better suited to frequent checkpoints. If you go over at the WCG/FightAIDS@home forums, you'll see this subject being raised every day... Actually, they ask "if DOCK could be made to checkpoint as frequently as Rosetta" (used by WCG/HPF their other project) LOL.. I expect this to change when they start running Rosetta/HPF in "full atom relax" mode Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
DPC Bearhunter: Keep in mind that Boinc is still in a state of change, and Rosetta is still being worked on and modified. Coming from single client projects like DF, FaD, F@H, etc, this has its own set of pitfalls and traps during setup. If you set up 200 megs of diskspace for Boinc that is shared between two projects that require 100 megs of diskspace each, and give one project more than 50% of the cpu time, the other project will complain about not having 100 megs of diskspace allotted to it. It's not a setup that makes sense to me - having multiple settings that determine the disk space allowed per project: the Boinc master setting, the project (Rosetta) setting, and the run time percentage determining how much of the Boinc master setting the project is allowed. At least with the instructions the DPC team(s) have been giving out, you don't have to worry about that kind of problem since you're giving 100% of the cpu time to one project. Everyone so far in this thread has touched on how things work with the Rosetta client, but has not touched on WHY Rosetta works the way it does. When I started running Rosetta, I was downloading about 50 megs a day for the first 5 days. By the end of 2 weeks, I'd downloaded 1 gigabyte. The Rosetta client would download a WU, and run however many models the WU was programmed for. It was usually 10 models, although the big WUs may have been programmed to run fewer than 10 models. Really small, fast WUs would finish in about 15 mins, and when every system tried sending back results and asking for new WUs, we overloaded the Rosetta servers. The change made communications with the Rosetta servers much less frequent, and allowed the user to choose between 2 hours and 4 days of run time for each WU. This dramatically reduced the bandwidth usage for those with dialup connections or broadband connections with caps. We were then given the max cpu time setting with a default setting of 8 hours. 
Due to the rate of problems with clients before Windows version 4.83, the default was dropped to 2 hours to make sure that if people ran into problems, they'd lose less than 2 hours of cpu time. If you ran into a problem that corrupted the data for your workunit at hour 4, you basically lost all your cpu time. The maximum setting for max cpu time was recently reduced to 24 hours to limit the damage done by the 1% bug, although other changes in the 4.83 Windows client have made the "1% bug" much less common. When Boinc now connects to the Rosetta server, it sends the data from the WU or WUs that we've completed, and then asks for enough WUs to fill up the cache. (Mine's set for 3 days.) It divides the total time of the cache (3 days for me) by the average time for a WU (it's figured out that mine finish in around 24 hours) and comes up with the fact that there should be 3 WUs in my cache. It then asks the Rosetta server for enough WUs to get a total of 3 WUs in my cache. Boinc knows nothing about your max cpu time setting. So if you change your max cpu time setting from the default 2 hours to a larger number like 4, 8, 12 or 24, it'll take a couple of days before it figures out that your WUs are now taking longer than an average of 2 hours to finish. Once the average WU time matches your max cpu time, the number of WUs in the cache will be roughly appropriate for your cache time setting. One side effect of Boinc not knowing about your max cpu time setting, and of it not passing enough information back to the Rosetta server, is that we can't send requests like "I want 2 hours of work for a system that is performing with a RAC of 255" and only get WUs that will take less than 2 hours on our machine. So when we've been sent a large or really slow WU, it'll finish 1 model regardless of our max cpu time settings. 
Until Boinc is updated to pass back information like this, we're stuck with the current setup which keeps confusing people when they notice a WU taking longer than their Max Cpu Time. (There's a few threads about this issue, and one this week talking about requesting that Boinc allow clients to send various program flags back to the project servers.) Hopefully, this helps explain why the current setup doesn't make sense. If my explanation was clear enough, feel free to translate it for any of the other DPC that are having problems with this issue. We may get Boinc changed in the near future, and this issue may disappear by the next time you return to (or just TRY TO) turn FozzieBear into roadkill on this project. |
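
The two rules described in the posts above can be put into a small sketch, which may help when reasoning about why a WU overruns the target. It is only an illustration of the behavior as the posters describe it, not the actual Rosetta or BOINC code: the function names, parameters and sample numbers are all hypothetical, and the stopping rule assumed here is "always finish the first model, then start no further model once the CPU time used has reached the target".

```python
import math


def work_units_to_request(cache_hours, avg_wu_hours, already_queued=0):
    """Estimate how many WUs are needed to fill the cache.

    Mirrors the description in the thread: cache length divided by the
    average WU run time seen so far. Hypothetical helper, not a BOINC API.
    """
    if avg_wu_hours <= 0:
        raise ValueError("avg_wu_hours must be positive")
    wanted = math.ceil(cache_hours / avg_wu_hours)
    return max(0, wanted - already_queued)


def run_work_unit(model_hours, target_hours):
    """Simulate one WU under a target CPU run time.

    model_hours: hypothetical CPU hours each successive model would need.
    Returns (models_completed, cpu_hours_used).
    Assumption: the target is only checked between models.
    """
    used = 0.0
    done = 0
    for hours in model_hours:
        # At least one model always runs; later models start only
        # while the time used so far is still under the target.
        if done >= 1 and used >= target_hours:
            break
        used += hours
        done += 1
    return done, used


if __name__ == "__main__":
    # Fast machine: half-hour models, 2 hour target.
    print(run_work_unit([0.5] * 10, 2))
    # Slow machine or huge protein: a single model takes 10 hours.
    print(run_work_unit([10.0], 2))
    # 3 day cache with WUs averaging 24 hours.
    print(work_units_to_request(72, 24))
```

With these assumptions, a machine whose single model needs 10 hours still reports one completed model after 10 hours despite the 2 hour target, which matches the behavior discussed in the thread. Likewise, a client that still believes WUs average 2 hours would ask for far more work for a 3 day cache than one that has learned they take 24 hours.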
©2024 University of Washington
https://www.bakerlab.org