Message boards : Number crunching : no work units
Author | Message |
---|---|
Bernd Schnitker Send message Joined: 2 Jan 09 Posts: 10 Credit: 62,009 RAC: 0 |
Well, I am still having problems getting work, and have for the last 3 days. Mac Mini 2010, 8 GB RAM, OS X 10.6.4, BOINC 6.19.58 |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
At your suggestion, I have increased my runtime from 4 hours to 12 hours. I increased my buffer from .5 days to 1.5 days. In the past, I had a 24-hour runtime, and had problems. I shortened it to 4 hours when I upgraded from Windows XP to Windows 7, and reinstalled BOINC. Is there an optimum? If so, what is it, and why?

There really is no optimum, as there are too many variables between different computers to declare a one-size-fits-all approach. The ideal position is to select the longest run time your system can cope with while still returning work within the deadline.

While some projects have a fixed amount of time per work unit, Rosetta will just repeat the unit with slightly different variables for as long as your run time allows. For example, one project gives you a file and lets you crunch it for five hours; once completed, any further crunching would only repeat what you have already done. Rosetta, on the other hand, sends you a file and lets you crunch it for an hour to produce one model, but once that is complete it begins the process again and starts to generate a second model with slightly different variables from the first. Once the second model is complete, Rosetta keeps repeating until it has run out of time. If you have a 5-hour run time you could generate 5 models, while a 12-hour run time generates 12 models. The strain on the server is the same, but you get more work done. That is a fairly simplified example, as some work units are quicker and you can sometimes generate over 50 models in an hour on some tasks, but the principle still holds: a longer run time generates more results with less server strain.

Don't you believe that it might be insulting to tell contributors that there really are no problems when they report that they are not receiving work?

I don't think anyone has said that there are no problems. Can you point to an example? |
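As a rough illustration of the point above, here is a minimal sketch (not project code) of how the target runtime affects the number of models produced from a single downloaded work unit. The one-hour-per-model figure is just the example used in the post; real tasks vary widely.

```python
# Sketch only: models produced per downloaded work unit for a given target
# runtime, assuming a hypothetical fixed time per model.

def models_per_workunit(target_runtime_hours, hours_per_model=1.0):
    # Rosetta keeps starting new models until the target runtime is used up.
    return int(target_runtime_hours // hours_per_model)

for runtime in (4, 12, 24):
    print(f"{runtime:>2} h runtime -> ~{models_per_workunit(runtime)} models from one download")
```

The server load per download is the same in every case; only the amount of science returned per download changes.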
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
I guess the real question is would larger buffers be more beneficial to the project, or does it really not matter? For the individual user, having backup projects to run serves the same purpose, I suppose. Or just giving the processors a rest, when the project isn't feeding them with work.

Large buffers give no significant advantage or disadvantage to the project. They are an option mainly intended to keep crunchers busy during times when they can't connect to the server for whatever reason. To look at it one way, if the spare work unit wasn't in your buffer and had remained on the server, another user would probably have picked it up during this work shortage, so the project gets an answer no matter whether it is with you or someone else.

Longer run times are an advantage to the project, as you get more work done per download and so reduce strain on the distribution servers. See my explanation in the post above. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2220 Credit: 42,304,766 RAC: 24,305 |
Yes, I have been receiving WUs for the past three days. When I originally posted on August 27, however, I had received none at all for about 36 hours. Is this not the proper place to report problems and ask questions? Don't you believe that it might be insulting to tell contributors that there really are no problems when they report that they are not receiving work?

5 days ago, yes, that was quite right. That's why I said precisely that 7 days ago. Since then, the situation has changed. "No work" is false, as you confirm above. Increasing runtime with no work would make no difference to anything. It's because there IS work now that changing the runtime makes a difference.

As I said, my problem has been resolved (by Rosetta and/or BOINC). What is being gained by continuing to assert that there were no interruptions in the assignment of Work Units when there clearly was an interruption, even though it might now have been rectified? I guess I don't see the point.

Presumably you gleaned this from me writing "No-one is trying to convince anyone there are no problems". The first sentence in my previous post still applies.

So, what did you increase your runtime to? That's all you really needed to respond to. It's the only thing we can control and contribute from our end. At least this is worth discussing. An optimum depends on the circumstances - if you've tried a long runtime and it caused problems on your system, while a shorter runtime worked, you need to tweak it down - Murasaki's advice seems good.

If you average 10 WUs on a dual core at 4 hours runtime each, your buffer holds 20 CPU hours for each core, which covers your 0.5-day buffer. If you increase your runtime to 12 hours you'll have 2.5 days covered, but more importantly, while there's a shortage of new tasks you'll have 12 hours to get a replacement WU (for each core) instead of just 4 hours, and you won't be calling for more tasks at all for a day or two, allowing those with no tasks at all to get something to work on.

Normally I wouldn't suggest increasing your buffer as well as your runtime, but 0.5 days is pretty low and 1.5 days is not especially high. As a rule, though, until we're more certain of the resupply of tasks, I'd suggest reducing buffers until everyone's got their share and only increasing again when there are plenty to go round. A lot of the complaints recently have been about having insufficient tasks to completely fill a buffer, not insufficient tasks to actually crunch.

The purpose of a buffer is principally to give you some leeway when you have local connection issues (accidental or deliberate), but it can also be used when resupply is an issue, like we have at the moment. If the buffer isn't flexed for either reason, there's little point to it at all and it may as well stay at BOINC's default of 0.25 days. |
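A minimal sketch of the buffer arithmetic in the post above, using the post's own example numbers (10 buffered tasks on a dual core); BOINC does not present the calculation in this form, this is just the reasoning spelled out:

```python
# Sketch of the buffer arithmetic above (example figures from the post).

def buffer_coverage_days(tasks_buffered, runtime_hours, cores):
    # How long the buffered tasks keep every core busy, in days.
    cpu_hours_per_core = tasks_buffered * runtime_hours / cores
    return cpu_hours_per_core / 24.0

for runtime in (4, 12):
    days = buffer_coverage_days(tasks_buffered=10, runtime_hours=runtime, cores=2)
    print(f"10 tasks at {runtime} h runtime on 2 cores -> about {days:.1f} days of work buffered")
```

With a 4-hour runtime that is roughly 0.8 days of work in hand; at 12 hours it grows to 2.5 days, which is what gives each core a 12-hour window to fetch a replacement during a shortage.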
Bernd Schnitker Send message Joined: 2 Jan 09 Posts: 10 Credit: 62,009 RAC: 0 |
I got one WU overnight. Since I keep a very small buffer, that is about what I expect for a full cache. I keep a 0.01-day cache. |
Michael Gould Send message Joined: 3 Feb 10 Posts: 39 Credit: 15,727,161 RAC: 6,755 |
Can keeping a large buffer significantly increase the amount of time a project must wait to get an individual result back? That was always my concern in the past. I understand that increasing the run time does the same thing, but in return Rosetta gets more results for that WU. I guess I can now see the rationale behind Sid's assertion that increasing the run time is the only really helpful thing users can do for a project during problems. Does Rosetta have the capability of increasing the default run time when it is having problems? Users with "No preference" for the target CPU run time default to 3 hours, I believe. Increasing that during server problems might be a useful trick. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2220 Credit: 42,304,766 RAC: 24,305 |
Can keeping a large buffer significantly increase the amount of time a project must wait to get an individual result back? That was always my concern in the past. I understand that increasing the run time does the same thing, but in return Rosetta gets more results for that WU.

Of course, yes, but it doesn't matter as long as they go back before the deadline...

...except during CASP, when some tasks had to be crunched and returned within 18 hours (0.75 days) and it was impossible to tell which ones. On that occasion, either there was no mechanism to change the deadline at source or they didn't use one, which was why I kept buffer + runtime < 0.70 days, then moved back to a 2.0-day buffer plus my usual runtime when CASP ended. Lucky for me, I guessed right and only slipped into WCG work when there was nothing at all for a few days last week. |
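As a rough illustration of that rule of thumb (a sketch only; the 0.70-day cut-off is the figure from the post above, and the buffer/runtime values below are made-up examples):

```python
# Sketch: does (buffer + one task's runtime) fit inside a short CASP-style cut-off?

def fits_deadline(buffer_days, runtime_hours, max_total_days=0.70):
    # Keep buffered work plus a single task's runtime under the cut-off, in days.
    return buffer_days + runtime_hours / 24.0 <= max_total_days

print(fits_deadline(buffer_days=0.4, runtime_hours=6))   # True:  0.40 + 0.25 = 0.65
print(fits_deadline(buffer_days=2.0, runtime_hours=6))   # False: 2.00 + 0.25 = 2.25
```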
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Right, a long buffer helps you complete more work overall, because you are still working even through an outage. However, your average time to return a result increases as well. During CASP the Project Team said that a very short turnaround time (12 hours or so) was desirable. But now that CASP is behind us, I'd suggest that if you have full-time internet access, about a 1-day cache is a good compromise between working through server problems (which in the past have averaged just a few hours) and reporting back completed results in a timely manner.

Yes, a larger cache and longer runtime tend to reduce the load on the servers. And yes, all a longer runtime does is complete more models against the same work unit. No, it would not be feasible for the project to attempt to manipulate the default runtimes during outages; in fact it would cause considerable problems for the BOINC client in understanding how much work to request. Rosetta Moderator: Mod.Sense |
Michael Gould Send message Joined: 3 Feb 10 Posts: 39 Credit: 15,727,161 RAC: 6,755 |
Okay, thanks for the answers! I'll set a one day buffer from now on, until CASP10 starts! |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Sid Celery said:
Jochen said:
Evan said:
Michael Gould said:
Chris Holvenstot said:
Chilean said: (implying that work interruptions must not be real ... otherwise why say what was said?)
Sid Celery said:
and also:

These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions. This was in spite of the fact that numerous crunchers were reporting that they were not receiving Work Units. Some of these posts appear contradictory to me, and perhaps might explain my apparent confusion.

What does it mean when some contributors stop receiving work, while others apparently continue to receive plenty? I'm asking because I want to know, and because I believe that it is a legitimate question.

In the meantime, it is an absolute fact that my computer, which runs 24/7 for Rosetta@Home, received absolutely no Work Units for a period of about two or three days beginning on August 25 or 26. The buffer was emptied, and no crunching at all was accomplished during this time. I interpret that as a stoppage, not a slowdown. If I had tried to explain a "crash" of our company's ERP system to the V.P. of Operations as a "slowdown," I would have been bounced all the way across the parking lot to my car. :-|

deesy |
Jochen Send message Joined: 6 Jun 06 Posts: 133 Credit: 3,847,433 RAC: 0 |
I actually wrote: "There's nothing wrong and absolutely nothing you could do. Rosetta just generates only a limited number of WUs right now." How do you know there is a problem? The limited number of WUs could just as well be a side-effect of an update (software or hardware). We simply don't know. |
Teck7 Send message Joined: 23 Aug 10 Posts: 2 Credit: 198,527 RAC: 0 |
The very problem is exactly what was just said: we DON'T KNOW. It would be nice to get an update as to what's going on, an expected timeline for resolution, or just any kind of actual input from the Baker Lab team. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
The very problem is exactly what was just said: we DON'T KNOW. It would be nice to get an update as to what's going on, an expected timeline for resolution, or just any kind of actual input from the Baker Lab team.

They must have heard you. There is an update on the homepage. |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions.

I don't see anything wrong with those statements, nor anyone denying that Rosetta is performing below its usual capacity. There is indeed some contradictory phrasing, as each quote is the opinion of a different volunteer. Different people will always have a different perspective on an issue.

What does it mean when some contributors stop receiving work, while others apparently continue to receive plenty? I'm asking because I want to know, and because I believe that it is a legitimate question.

Normally Rosetta puts out an average of tens of thousands of work units an hour, but at the moment it appears to be generating several thousand instead. When those work units become available it is a bit of pot luck as to who gets them. If your BOINC manager calls the server just as work is released then you will get some; if your manager calls the server a couple of minutes later there might not be any left. If you had a two-day cache of work units prior to the slowdown then you would have had two days in which to ask the server for more work, raising the chance of seeing no interruption. People with no cache have no spare capacity, so they either got a new task when the last one finished or they didn't.

Your question also revolves around the issue of "plenty". I think it must be a very rare individual who has not had a reduced amount of work from Rosetta in the last week, but many supplement Rosetta with other projects to keep their cores busy. It also depends on scale: if someone with 1 computer with 1 core says, "I didn't get any work", but someone with 10 computers with multiple cores says, "I got some work but less than usual", is it any surprise that the person with more cores got more work?

In the meantime, it is an absolute fact that my computer, which runs 24/7 for Rosetta@Home, received absolutely no Work Units for a period of about two or three days beginning on August 25 or 26. The buffer was emptied, and no crunching at all was accomplished during this time. I interpret that as a stoppage, not a slowdown. If I had tried to explain a "crash" of our company's ERP system to the V.P. of Operations as a "slowdown," I would have been bounced all the way across the parking lot to my car. :-|

This is a question of perspective. For you there was a stoppage. For the project there was a slowdown (processing speed dropping to half, then a third, of normal). In the perspective of a supercomputer, which BOINC simulates, your account happens to be one set of cores in the supercomputer that didn't get work for a time; an unfortunate slowdown of the project, but hardly a complete crash of a system. Every BOINC project breaks down or experiences difficulties now and then, which is why many people have a backup project. I am not sure why you choose not to have a backup (such as World Community Grid's Human Proteome Folding project, which is directly related to Rosetta), but periods of complete inactivity are an unfortunate but inevitable consequence. |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
They just had the entire Rosetta database down for maintenance, and now that it is back I see a rather healthy 21,798 tasks ready to send. Hopefully that is a sign things are getting back to normal. |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions.

Thanks! Your information is lucid and helpful. At this time, I contribute only to Rosetta because it appears to me that this project is closest to the sort of applied science that might help to find a cure for a deadly disease that took a very close loved one several years ago. I have a good GPU, and I used it on the Collatz Conjecture Project for a while. Then I realized that the conjecture could never be proved, no matter how many years were spent trying. I "folded" for the "other guys" for several years, but that work appears to me to be more theoretical science, and less applied science. Credits mean little to me, except as a measure of how much my machine is contributing. I'm trying to assist the science, not engage in a competition.

deesy |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2220 Credit: 42,304,766 RAC: 24,305 |
These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions. This was in spite of the fact that numerous crunchers were reporting that they were not receiving Work Units. Some of these posts appear contradictory to me, and perhaps might explain my apparent confusion.

Well, you could define your terms better, but I'll give you a scenario. Say you run one task at a time and you want another spare task made available to you as soon as you start the first one. Think of it as a Kanban in ERP terms. You call for it and you get it. Great. But if you don't get it, you call for it again 10 minutes later, then again and again, and finally after an hour, let's say (and 6 'failures'), it arrives. Is that 6 failures, or not a problem at all because your first job takes 3 hours to run and the new one arrived well before the in-progress one finished? It's a failure because it doesn't meet your criteria, but if your criteria are set to allow for 20+ failures without facing a real problem then it's a success for your criteria and for your 'production'. This, I believe, is the crux of your misinterpretation. You make no distinction when the distinction is actually everything.

In the meantime, it is an absolute fact that my computer, which runs 24/7 for Rosetta@Home, received absolutely no Work Units for a period of about two or three days beginning on August 25 or 26. The buffer was emptied, and no crunching at all was accomplished during this time. I interpret that as a stoppage, not a slowdown. If I had tried to explain a "crash" of our company's ERP system to the V.P. of Operations as a "slowdown" I would have been bounced all the way across the parking lot to my car. :-|

In my experience, having a VP who wasn't a bit of an idiot is a rarity. Sounds like you have one of the usual ones. First, though, you'd need to distinguish between the 'crash' over a week ago and the slow-down during all the time since. If you hadn't already assessed the difference between trivial ups and downs and built in some margin for that, then you'd deserve that walk to the car. Don't sweat the small stuff, and especially don't trouble the big guys with every bump and squeak; they expect you to handle that yourself. If it's something major that requires more heavyweight intervention, you can tweak some stuff (runtimes, in our case here, to provide more time for the big solution) or fall back on the final contingency of having a back-up project altogether. We're lucky that our machinery (BOINC) can run anything else with no changeover time.

The point being, set a safety margin that covers the small stuff and stop worrying if your safety margin is being eaten into - that's precisely what it's for. How you set your safety level depends on your situation. I'm away for half of each week, so I usually keep 2 days. 1 day may be better for you if you check things each night. If your safety margin is close to being exhausted, then as long as you're sure solutions are being worked on by TPTB and you've tweaked as much as you can, it's out of our hands. At the end of the day, the loss is theirs. If you're prepared to keep a back-up project you can stay productive and return when the problem's properly solved (looking better now, and the slow validator issue seems to have gone too). Hope that helps. |
. Send message Joined: 8 Aug 10 Posts: 2 Credit: 485,829 RAC: 0 |
It seems that since yesterday the team has fixed the problem with giving out work units. I have a full buffer now (48 hours of WUs). Also, production in teraFLOPS is up from 32.000 to about 69.000 at this moment. It would be nice to hear confirmation that everybody is receiving new work. |
Warped Send message Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0 |
You can safely assume that the work flow is back to normal again:

1. As you have observed, the teraFLOPS reported has increased dramatically and is fast approaching the normal levels.
2. The server status page is reporting in excess of 20k workunits ready to send.
3. All the moans about no work have ceased.
4. Those of us who have required work and commented have received it immediately.

I trust the cause has been identified and will not recur. |