Discussion on increasing the default run time

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,277,018
RAC: 1,575
Message 63043 - Posted: 25 Aug 2009, 23:48:35 UTC - in response to Message 63032.  

I'd happily change my run-time prefs so that computers that are on lots have a high run-time and the others have a low run-time but I find this really difficult as they're tied to the BOINC work/home/school settings (which I think are poor, but not the project's fault ;) ).

I also use BAM but that doesn't allow changes to the run-time, so I'm left with the default. Being able to select a run-time preference per machine would be useful, but probably only for a minority, I guess...

(just noticed the project haven't posted on this for a while!)


I've noticed that the World Community Grid project lets you make some machine-specific settings through their web site, but then several other settings do not propagate through to that machine if changed in other ways. I don't use BAM, so I don't know if this is compatible with BAM.

However, it looks like I may soon need to switch managers so that I can control BOINC on my two desktops from my laptop, which doesn't appear to have enough power to run longer workunits well, so could you tell me if BAM seems suitable for that purpose?

Warped
Joined: 15 Jan 06
Posts: 48
Credit: 1,788,185
RAC: 0
Message 63068 - Posted: 28 Aug 2009, 17:05:24 UTC - in response to Message 63042.  

I live in a bandwidth-impoverished part of the world, with high prices and low speed. Consequently, I have selected 16 hours run time.

However, I find this thread as well as the others discussing long-running models to be of little interest when I have work units running for about 4 hours. Is the preferred run time really applied?


I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home site of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time.


The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip?

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 63069 - Posted: 28 Aug 2009, 18:08:22 UTC - in response to Message 63068.  
Last modified: 28 Aug 2009, 18:28:50 UTC


I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home site of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time.


The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip?


I've noticed this too. As far as I can tell, it's a "lucky-dip" as you so accurately describe it. Also known as a crap-shoot in other parts of the world. ;)

In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct.

That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly.

cenit
Joined: 1 Apr 07
Posts: 13
Credit: 1,630,287
RAC: 0
Message 63070 - Posted: 28 Aug 2009, 21:34:07 UTC - in response to Message 63069.  


I have noticed that on my faster machine, the limit of 99 decoys is usually reached before the 12-hour expected runtime I've requested. You might want to check the report visible on the Rosetta@home site of how well the workunit succeeded to see if your workunits also often stop at the 99 decoys limit instead of near the requested run time.


The workunits ending before the selected run-time get to a stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip?


I've noticed this too. As far as I can tell, it's a "lucky-dip" as you so accurately describe it. Also known as a crap-shoot in other parts of the world. ;)

In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct.

That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly.


"Maximum number of decoys" at 99 was introduced some months ago when Rosetta@home was in "debug mode" (I think around v1.50: no new features, only bug fixes). It was used as an easy way to work around some bugs that arose with large uploads (if I remember correctly, they didn't even investigate whether the problem was in BOINC or somewhere in their own code, because this trick easily solved the bug). I don't think that, atm, it's so important to solve this problem drastically; anyway, it would be interesting to know if they have any problem with server load now...

dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 63073 - Posted: 29 Aug 2009, 9:55:27 UTC - in response to Message 63070.  


Snip ...

In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct.

That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines, willing to do long run times it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly.


"Maximum number of decoys" at 99 was introduced some months ago when Rosetta@home was in "debug mode" (I think around v1.50: no new features, only bug fixes). It was used as an easy way to work around some bugs that arose with large uploads (if I remember correctly, they didn't even investigate whether the problem was in BOINC or somewhere in their own code, because this trick easily solved the bug). I don't think that, atm, it's so important to solve this problem drastically; anyway, it would be interesting to know if they have any problem with server load now...


Interesting. If anyone is looking for fairly reliable repro steps on getting uploads to fail, try the following.

Set up a machine, and adjust the maximum upload rate to 2 kbytes/sec on the advanced preferences page. Grab yourself a task like this one:

276073593

let it complete, and then try to upload it. The key section of the name appears to be the "ddg_predictions" string. I've seen a few of these guys going by, and they seem to produce very large result files. I've had two that are in excess of 7 MB and one that was over 11 MB.

It's worth noting that if I temporarily adjust the upload speed to something over my connection's max (384 kbits/sec, i.e. ~ 40 kbytes/sec), the transfer will then go through without problems.

However, it's a bit of a pain doing this. I'm about at the point where, if I find another of these WUs that's stuck uploading, I'm going to force that upload through and then abort any of these jobs that I see in the queue.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 63076 - Posted: 29 Aug 2009, 13:44:35 UTC

So yes, Warped, you are seeing what I would expect. When your preference of 16hrs is near, the tasks end. And if 99 models are reached prior to that, the task will end at that time (at least in the "mini" application).

The amount of data reported back on the uploads varies by the type of protein and type of study being done. But the primary factor or multiplier on the size is the number of models. At one point there were batches of WUs that were running 20 models an hour. The upload size and potential for hitting the maximum outfile size were very large for long runtime preferences. 99 models was just a way to strike a compromise between giving the desired runtime, and having a predictable and reasonable upload size.
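To make that trade-off concrete, here is a minimal sketch of how a model cap interacts with the runtime preference. The 20 models an hour and the 99-model cap are the figures mentioned in this thread; the function name and the 0.3 MB per model are assumptions picked purely for illustration, not project numbers.

```python
def task_outcome(runtime_pref_hours, models_per_hour, model_cap=99, mb_per_model=0.3):
    """Estimate when a task stops and roughly how much it uploads.

    mb_per_model is a made-up illustrative value, not a measured one.
    """
    # Models the task could produce if only the runtime preference applied.
    uncapped_models = int(runtime_pref_hours * models_per_hour)
    # The task ends at whichever limit is reached first.
    models = min(uncapped_models, model_cap)
    hours_run = models / models_per_hour
    return {
        "models": models,
        "hours_run": hours_run,
        "upload_mb": models * mb_per_model,
        "stopped_by": "model cap" if uncapped_models > model_cap else "runtime preference",
    }


fast = task_outcome(runtime_pref_hours=16, models_per_hour=20)
slow = task_outcome(runtime_pref_hours=16, models_per_hour=5.5)
print(fast["stopped_by"], fast["models"], round(fast["hours_run"], 2))
print(slow["stopped_by"], slow["models"], round(slow["hours_run"], 2))
```

Under these assumptions, a machine producing 20 models an hour hits the cap after about 5 hours despite a 16-hour preference, while one producing 5.5 models an hour runs the full 16 hours and returns 88 models.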

dgnuff, from what you are describing, it sounds like the only issue with any given type of work unit is the size of the result file. Any time you have a large file that must move over a very limited bandwidth, there is a conflict to be resolved. The BOINC client can do partial file transfers and continue where it left off. But I believe it also times out connections that are still actively moving data. I've seen connections ended after 5 minutes and then restarted, at least on downloads. I presume uploads are similar. I am not sure why Berkeley made the client work that way. Seems to me that an active connection that is still successfully moving data should be left alone.

So when you say you can get an upload to "fail", do you mean a retry occurs? Or do you mean that so many retries occur that... well, the WU you linked looks like it arrived intact. So, eventually the upload was completed. I am unclear what you mean about the upload being "stuck". I think what you are seeing sounds normal for a connection with very limited bandwidth. And the client will keep working on the upload and get it sent all by itself.
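For a rough sense of scale, here is a back-of-the-envelope sketch using the numbers from this thread (an 11 MB result file, a 2 kbytes/sec cap versus roughly 40 kbytes/sec). The 5-minute cutoff and the resume-where-it-left-off model are only what I believe I have observed, so treat both as assumptions rather than documented BOINC behaviour; the function names are hypothetical.

```python
import math


def upload_seconds(size_mb, rate_kbytes_per_sec):
    """Time to move size_mb megabytes at a given rate (1 MB taken as 1024 kB)."""
    return size_mb * 1024 / rate_kbytes_per_sec


def connections_needed(size_mb, rate_kbytes_per_sec, timeout_sec=300):
    """Connections required if each one is cut off after timeout_sec and the
    next one resumes where the last stopped (an assumed model)."""
    return math.ceil(upload_seconds(size_mb, rate_kbytes_per_sec) / timeout_sec)


for rate in (2, 40):
    secs = upload_seconds(11, rate)
    print(f"{rate} kB/s: {secs / 60:.1f} min, {connections_needed(11, rate)} connection(s)")
```

On these assumptions, the throttled upload needs well over an hour and many resumed connections, while the unthrottled one fits inside a single connection, which would be consistent with the transfer going through once the limit is raised.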

This is part of why they decided to limit to 99 models too. The uploads on tasks that produced many many models were approaching 100MB, which is large enough to cause difficulty in many environments.
Rosetta Moderator: Mod.Sense

thatoneguy
Joined: 8 Jun 06
Posts: 3
Credit: 2,636,731
RAC: 0
Message 64601 - Posted: 26 Dec 2009, 3:45:42 UTC - in response to Message 56932.  

Back to the main issue: what would be the best way to transition to an increased run time?

If it is possible to do so, temporarily decrease the amount of work that can be downloaded. I think it is possible to fudge the report deadline so that computers don't ask for more work, but still receive credit for past due WUs. Following the change, simply increasing the deadline would ease almost all problems stemming from long run-times. The problem remains of course that WUs may take a long time to return to the server.
As long as credit is given for the late work, I think most people won't care about the change (except for the few people who have their computer on so seldom that they won't be able to complete any work on time).

S_Koss
Joined: 7 Jan 10
Posts: 4
Credit: 37,252
RAC: 0
Message 64914 - Posted: 11 Jan 2010, 16:07:22 UTC

I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3 hour WU that was 99% done. But I was tired and did not want to wait 5 - 10 or 15 minutes for it to finish so I exited BOINC and went to bed. This morning the said WU restarted, but from 0%. I lost 3 hours of work on just this WU, not taking into consideration the other WUs that also restarted. I find that unacceptable and turned the default time down to 1 hour. If you are going to change the default time to a minimum of 3 hours then I will be changing projects because I will not continue to lose uncountable hours of work.

S_Koss
Joined: 7 Jan 10
Posts: 4
Credit: 37,252
RAC: 0
Message 64915 - Posted: 11 Jan 2010, 16:32:14 UTC

On second thought, you can do whatever you want. I am outa here.............

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64917 - Posted: 11 Jan 2010, 17:12:46 UTC

Steve, what you are describing is a very new issue that has turned up with a new type of work unit that seems to be having some checkpointing issues.

Transient, I don't think an overclock would be needed to cause the symptoms he's reporting. I've asked Sarel to look into it.

Steve, I'm curious how the runtime of a task is affecting your user experience (other than loss of work, which I clearly already understand). You appear to have racked up 25,000 credits in just 4 days, so clearly you have machines running 24x7. So how does running one task for 3 hours have a disadvantage over running 3 tasks for an hour each?
Rosetta Moderator: Mod.Sense

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 64918 - Posted: 11 Jan 2010, 18:24:48 UTC - in response to Message 64914.  

I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3 hour WU that was 99% done. But I was tired and did not want to wait 5 - 10 or 15 minutes for it to finish so I exited BOINC and went to bed. This morning the said WU restarted, but from 0%. I lost 3 hours of work on just this WU, not taking into consideration the other WUs that also restarted. I find that unacceptable and turned the default time down to 1 hour. If you are going to change the default time to a minimum of 3 hours then I will be changing projects because I will not continue to lose uncountable hours of work.



Can you give us more information? What was the job ID? Can you link us to your job information?

DK

S_Koss
Joined: 7 Jan 10
Posts: 4
Credit: 37,252
RAC: 0
Message 64921 - Posted: 11 Jan 2010, 20:57:40 UTC

Hi, so let me try to explain this better. If you have 4 or 8 or 12 WUs in varying degrees of completion and you shut down for the night (because I shut 2 computers of 3 down at night), the average loss will be higher than with 1-hour WUs. When you restart the next morning and you lose everything that you did the night before, it gets frustrating, and it has been so for the past 4 days. That is why I am not really interested in your project. I have since detached from your project so I cannot give you WU numbers.

Thank you.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64924 - Posted: 12 Jan 2010, 4:21:20 UTC

Steve, most Rosetta work units will save a checkpoint every 15 minutes or so. This gives a balance between losing CPU effort and keeping checkpoint overhead and disk writes low (even in your case where you power off each day, 90% of the checkpoints are never actually needed). So, on average you should see that less than 7.5 min. of CPU time is lost when powering off.
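The arithmetic behind that average can be sketched as follows, assuming the shutdown is equally likely at any moment of the run and that a restart resumes from the most recent checkpoint. The function names are hypothetical, and the no-checkpoint case is included only for comparison with the problem reported above.

```python
def lost_minutes(elapsed_min, checkpoint_interval_min=None):
    """CPU minutes lost if the task is stopped after elapsed_min of work.

    With no checkpointing (None), everything since the start is lost.
    """
    if checkpoint_interval_min is None:
        return elapsed_min
    return elapsed_min % checkpoint_interval_min


def average_loss(task_length_min, checkpoint_interval_min=None):
    """Average loss over shutdown instants spread evenly through the task."""
    instants = [m + 0.5 for m in range(task_length_min)]
    return sum(lost_minutes(t, checkpoint_interval_min) for t in instants) / len(instants)


print(average_loss(180, 15))  # 3-hour task, checkpoint every 15 minutes
print(average_loss(180))      # 3-hour task, no checkpoints
print(average_loss(60))       # 1-hour task, no checkpoints
```

With working checkpoints the average loss depends only on the checkpoint interval, not on the task length; without them it grows with the run time, which is why a 3-hour task that never checkpoints feels so much worse than a 1-hour one.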

Sarel is making the needed changes (posted here) so this will be true for his new type of work units as well. I just don't want to see you leave a very worthwhile project for the wrong reasons. Mad Max's Post on Saturday the 9th was one of the first posts that was specific enough to identify the problem, and yours then confirmed the issue. And here we are Monday the 11th, and the problem is being addressed.
Rosetta Moderator: Mod.Sense

DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,250,162
RAC: 0
Message 64939 - Posted: 12 Jan 2010, 20:23:25 UTC

Even if you ignore Steve's experience with this project, I hope you recognize that one point has been made clear repeatedly. Checkpoints are a critical feature of BOINC applications. If you need to make checkpoints work within a single decoy's generation, then make it happen.

Given that, there's nothing wrong with doubling the default/minimum run times.

Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,954,001
RAC: 13,616
Message 64957 - Posted: 14 Jan 2010, 1:51:55 UTC - in response to Message 64924.  
Last modified: 14 Jan 2010, 2:18:58 UTC

Steve, most Rosetta work units will save a checkpoint every 15 minutes or so. This gives a balance between losing CPU effort and keeping checkpoint overhead and disk writes low (even in your case where you power off each day, 90% of the checkpoints are never actually needed). So, on average you should see that less than 7.5 min. of CPU time is lost when powering off.


From my observations (after facing a similar problem, I watched the Rosetta application's disk writes for some time), the majority of WUs wrote checkpoints much more often: about once every 1-2 minutes (I think according to the setting in BOINC, which defaults to 60 seconds).
The exceptions were two types of WUs: one did not write checkpoints at all (as you noted, this problem has already been pinned down, and a fix for it should be included in the new version, Rosetta mini 2.05), and another wrote checkpoints as usual but, after restarting for any reason, could not use them (or did not try at all).
If a job of the second type comes my way again, I will try to catch it.
I think an indirect sign of such tasks should be a bad ratio between "claimed credit" and "granted credit" (relative to what is normal for the particular computer). As in this case: https://boinc.bakerlab.org/rosetta/result.php?resultid=309578283

I think that, with the complete server statistics, it would probably be possible to sort tasks by this ratio and see which types of tasks have bad ratios more often.
By this criterion, tasks having one (or both) of the following disadvantages should "emerge":
1. Problems with checkpoints mechanism
2. Bad optimisation (executed more slowly in comparison with the others)
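
A minimal sketch of that sorting idea, assuming "bad" means granted credit well below claimed credit and that each result can be labelled with a task type. The function name, the tuple layout and the sample data are all hypothetical; this is not how the server statistics are actually stored.

```python
from collections import defaultdict
from statistics import median


def rank_task_types(results):
    """Group results by task type and sort by median granted/claimed ratio.

    results: iterable of (task_type, claimed_credit, granted_credit) tuples.
    Lower ratios come first, so the most suspicious task types lead the list.
    """
    ratios = defaultdict(list)
    for task_type, claimed, granted in results:
        if claimed > 0:  # skip results with nothing claimed
            ratios[task_type].append(granted / claimed)
    return sorted((median(r), t) for t, r in ratios.items())


# Entirely made-up sample data; the task type names are hypothetical.
sample = [
    ("type_a", 60.0, 57.0),
    ("type_a", 40.0, 42.0),
    ("type_b", 80.0, 20.0),
    ("type_b", 50.0, 15.0),
]
for ratio, task_type in rank_task_types(sample):
    print(f"{task_type}: median granted/claimed = {ratio:.2f}")
```

Using the median per task type rather than individual results keeps one slow or overloaded computer from flagging an otherwise healthy type.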

But so far I have no idea how to separate one from the other...

P.S.
I am impressed by the speed of response (only a few days between the "bug report" and the fix for it). Compared with many other projects, that is very fast feedback.

Link
Joined: 4 May 07
Posts: 356
Credit: 382,349
RAC: 0
Message 64966 - Posted: 14 Jan 2010, 14:33:13 UTC
Last modified: 14 Jan 2010, 14:35:02 UTC

@S_Koss: why don't you just hibernate the systems instead of shutting them down? Works perfectly for me and I don't lose even one second of work.

BTT: no problem for me if the default run times are increased, I run WUs for 12-24 hours.
.

Rabinovitch
Joined: 28 Apr 07
Posts: 28
Credit: 5,439,728
RAC: 0
Message 64973 - Posted: 14 Jan 2010, 17:07:41 UTC - in response to Message 56932.  

We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers.


Nice idea.

And what about increasing the maximum crunching time? I am ready to crunch even for several days if necessary, or to crunch until all models have been processed. What about a checkbox like "Work till the end"? :-)

Mad_Max
Joined: 31 Dec 09
Posts: 209
Credit: 25,954,001
RAC: 13,616
Message 64981 - Posted: 14 Jan 2010, 21:44:43 UTC - in response to Message 64957.  
Last modified: 14 Jan 2010, 21:53:40 UTC


From my observations (after facing a similar problem, I watched the Rosetta application's disk writes for some time), the majority of WUs wrote checkpoints much more often: about once every 1-2 minutes (I think according to the setting in BOINC, which defaults to 60 seconds).
The exceptions were two types of WUs: one did not write checkpoints at all (as you noted, this problem has already been pinned down, and a fix for it should be included in the new version, Rosetta mini 2.05), and another wrote checkpoints as usual but, after restarting for any reason, could not use them (or did not try at all).
If a job of the second type comes my way again, I will try to catch it.
I think an indirect sign of such tasks should be a bad ratio between "claimed credit" and "granted credit"


I did not have to wait long: it seems I got one of these tasks just now. I will post the "report" in the appropriate topic a bit later: minirosetta 2.03

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 64982 - Posted: 14 Jan 2010, 21:46:36 UTC - in response to Message 64973.  

And what about increasing the maximum crunching time? I am ready to crunch even for several days if necessary, or to crunch until all models have been processed. What about a checkbox like "Work till the end"? :-)


There is no end. There are literally trillions of trillions of possible models. The current 24hr maximum attempts to strike a balance between getting results back to the project with a fast turnaround time, and minimizing burden on servers and bandwidth for downloads. Originally the maximum was 4 days, but just think if a problem arose and you ran for 4 days before the watchdog realized it and kicked in to end the task.
Rosetta Moderator: Mod.Sense

Nuadormrac
Joined: 27 Sep 05
Posts: 37
Credit: 202,469
RAC: 0
Message 66124 - Posted: 15 May 2010, 0:54:02 UTC - in response to Message 56983.  

This also brings up another issue with such a possible increase; though it's a credit related one, so might not take the same precedence as... And yet depending how the units are treated, it might affect the science also.

If the processing time is increased, and the unit deadlocks, hangs, or in some way crashes after the initial model(s) had been successfully processed, it will, after whatever time is spent hanging, error out. And yet not everything in the WU was bad. Now because the units don't have the time involved of a CPDN unit, it's unlikely that trickles would be introduced.

However, an effect of lengthening the runtime can also be that a unit that does error later on will have a higher chance to error out; and if this occurs then any science which was accumulated prior to the model within the WU that did error could be lost, and the credits for those models which were completed without error most assuredly would be, unless something along the line of trickles or partial validation/crediting could be implemented to allow the successfully processed models within the unit to be validated and counted as such.

I understand completely the motivation behind increasing the default run time and if I only received Rosetta Beta 5.98 WUs I'm sure I'd hold to that default successfully.

But as I report here (and previously) I get Mini Rosetta WUs constantly crashing out with "Can't acquire lockfile - exiting" error messages - maybe 60% failure rate with a 3-hour runtime, reducing to 40% failure rate with a 2-hour run time.

I've seen this reported by several other people running a 64-bit OS - not just on Vista or with an AMD machine. That said, I don't know how widespread it is. Perhaps you can analyse results at your end.

As stated in the post linked above, I get no errors at all with Rosetta Beta, so I'm inclined to think it's not some aberration with my machine. I'd really like to see some feedback on this issue and some assurance it's being investigated in some way.

I'd ask that a minimum run time of 2 hours is allowed (I can just about handle that) or some mechanism that allows me to reject all Mini Rosetta WUs. If not, I'm prepared to abort all Mini Rosetta WUs before they run. It's really a waste of time me receiving them if 60% are going to crash out on me anyway.

I've commented on this before here, here, here and first of all and more extensively here - see follow-up messages in that thread.

No such issues arose for me with my old AMD single core XPSP2 machine - only when I got this new AMD quad-core Vista64 machine.

Any advice appreciated. It's a very big Rosetta issue for me, so while I'm sure you'll save a whole load of bandwidth if you go ahead with the proposed changes I just hope some allowance can be made for people in my situation.





©2024 University of Washington
https://www.bakerlab.org