Message boards : Number crunching : no work units
Author | Message |
---|---|
Bernd Schnitker Send message Joined: 2 Jan 09 Posts: 10 Credit: 62,009 RAC: 0 |
Well, I am still having problems getting work, and have for the last 3 days. Mac Mini 2010, 8 GB RAM, OS X 10.6.4, BOINC 6.19.58 |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
At your suggestion, I have increased my runtime from 4 hours to 12 hours. I increased my buffer from .5 days to 1.5 days. In the past, I had a 24-hour runtime, and had problems. I shortened it to 4 hours when I upgraded from Windows XP to Windows 7, and reinstalled BOINC. Is there an optimum? If so, what is it, and why?

There really is no optimum, as there are too many variables between different computers to declare a one-size-fits-all approach. The ideal position is to select the longest run time your system can cope with while still returning work within the deadline.

While some projects have a fixed amount of time per work unit, Rosetta will just repeat the unit with slightly different variables for as long as your run time allows. For example, one project gives you a file and lets you crunch it for five hours; once completed, any further crunching would only repeat what you have already done. Rosetta, on the other hand, sends you a file and lets you crunch it for an hour to produce one model, but once that is complete it begins the process again and starts to generate a second model with slightly different variables from the first. Once the second model is complete, Rosetta keeps repeating until it has run out of time. If you have a 5-hour run time you could generate 5 models, while a 12-hour run time generates 12 models. The strain on the server is the same, but you get more work done. That is a fairly simplified example, as some work units are quicker and you can sometimes generate over 50 models in an hour on some tasks, but the principle still holds: a longer run time generates more results with less server strain.

Don't you believe that it might be insulting to tell contributors that there really are no problems when they report that they are not receiving work?

I don't think anyone has said that there are no problems. Can you point to an example? |
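As a rough illustration of the point above, here is a minimal sketch (not project code) of how the target runtime affects the number of models produced from a single downloaded work unit. The one-hour-per-model figure is just the example used in the post; real tasks vary widely.

```python
# Sketch only: models produced per downloaded work unit for a given target
# runtime, assuming a hypothetical fixed time per model.

def models_per_workunit(target_runtime_hours, hours_per_model=1.0):
    # Rosetta keeps starting new models until the target runtime is used up.
    return int(target_runtime_hours // hours_per_model)

for runtime in (4, 12, 24):
    print(f"{runtime:>2} h runtime -> ~{models_per_workunit(runtime)} models from one download")
```

The server load per download is the same in every case; only the amount of science returned per download changes.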
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
I guess the real question is would larger buffers be more beneficial to the project, or does it really not matter? For the individual user, having backup projects to run serves the same purpose, I suppose. Or just giving the processors a rest, when the project isn't feeding them with work.

Large buffers give no significant advantage or disadvantage to the project. They are an option mainly intended to keep crunchers busy during times when they can't connect to the server for whatever reason. To look at it one way, if the spare work unit wasn't in your buffer and had remained on the server, another user would probably have picked it up during this work shortage, so the project gets an answer no matter whether it is with you or someone else.

Longer run times are an advantage to the project, as you get more work done per download and so reduce strain on the distribution servers. See my explanation in the post above. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2220 Credit: 42,304,766 RAC: 24,305 |
Yes, I have been receiving WUs for the past three days. When I originally posted on August 27, however, I had received none at all for about 36 hours. Is this not the proper place to report problems and ask questions? Don't you believe that it might be insulting to tell contributors that there really are no problems when they report that they are not receiving work?

5 days ago, yes, that was quite right. That's why I said precisely that 7 days ago. Since then, the situation has changed. "No work" is false, as you confirm above. Increasing runtime with no work would make no difference to anything. It's because there IS work now that changing the runtime makes a difference.

As I said, my problem has been resolved (by Rosetta and/or BOINC). What is being gained by continuing to assert that there were no interruptions in the assignment of Work Units when there clearly was an interruption, even though it might now have been rectified? I guess I don't see the point.

Presumably you gleaned this from me writing "No-one is trying to convince anyone there are no problems". The first sentence in my previous post still applies.

So, what did you increase your runtime to? That's all you really needed to respond to. It's the only thing we can control and contribute from our end. At least this is worth discussing. An optimum depends on the circumstances - if you've tried a long runtime and it caused problems on your system, while a shorter runtime worked, you need to tweak it down - Murasaki's advice seems good.

If you average 10 WUs on a dual core at 4 hours runtime each, your buffer holds 20 CPU hours for each core, which covers your 0.5-day buffer. If you increase your runtime to 12 hours you'll have 2.5 days covered, but more importantly, while there's a shortage of new tasks you'll have 12 hours to get a replacement WU (for each core) instead of just 4 hours, and you won't be calling for more tasks at all for a day or two, allowing those with no tasks at all to get something to work on.

Normally I wouldn't suggest increasing your buffer as well as your runtime, but 0.5 days is pretty low and 1.5 days is not especially high. As a rule, though, until we're more certain of the resupply of tasks, I'd suggest reducing buffers until everyone's got their share and only increasing again when there are plenty to go round. A lot of the complaints recently have been about having insufficient tasks to completely fill a buffer, not insufficient tasks to actually crunch.

The purpose of a buffer is principally to give you some leeway when you have local connection issues (accidental or deliberate), but it can also be used when resupply is an issue, like we have at the moment. If the buffer isn't flexed for either reason, there's little point to it at all and it may as well stay at BOINC's default of 0.25 days. |
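A minimal sketch of the buffer arithmetic in the post above, using the post's own example numbers (10 buffered tasks on a dual core); BOINC does not present the calculation in this form, this is just the reasoning spelled out:

```python
# Sketch of the buffer arithmetic above (example figures from the post).

def buffer_coverage_days(tasks_buffered, runtime_hours, cores):
    # How long the buffered tasks keep every core busy, in days.
    cpu_hours_per_core = tasks_buffered * runtime_hours / cores
    return cpu_hours_per_core / 24.0

for runtime in (4, 12):
    days = buffer_coverage_days(tasks_buffered=10, runtime_hours=runtime, cores=2)
    print(f"10 tasks at {runtime} h runtime on 2 cores -> about {days:.1f} days of work buffered")
```

With a 4-hour runtime that is roughly 0.8 days of work in hand; at 12 hours it grows to 2.5 days, which is what gives each core a 12-hour window to fetch a replacement during a shortage.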
Bernd Schnitker Send message Joined: 2 Jan 09 Posts: 10 Credit: 62,009 RAC: 0 |
I got one WU overnight. Since I keep a very small buffer, that is about what I expect for a full cache. I keep a 0.01-day cache. |
Michael Gould Send message Joined: 3 Feb 10 Posts: 39 Credit: 15,727,161 RAC: 6,755 |
Can keeping a large buffer significantly increase the amount of time a project must wait to get an individual result back? That was always my concern in the past. I understand that increasing the run time does the same thing, but in return Rosetta gets more results for that WU. I guess I can now see the rationale behind Sid's assertion that increasing the run time is the only really helpful thing users can do for a project during problems. Does Rosetta have the capability of increasing the default run time when it is having problems? Users with "No preference" for the target CPU run time default to 3 hours, I believe. Increasing that during server problems might be a useful trick. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2220 Credit: 42,304,766 RAC: 24,305 |
Can keeping a large buffer significantly increase the amount of time a project must wait to get an individual result back? That was always my concern in the past. I understand that increasing the run time does the same thing, but in return Rosetta gets more results for that WU.

Of course, yes, but it doesn't matter as long as they go back before the deadline...

...except during CASP, when some tasks had to be crunched and returned within 18 hours (0.75 days) and it was impossible to tell which ones. On that occasion, either there was no mechanism to change the deadline at source or they didn't use one, which was why I kept buffer + runtime < 0.70 days, then moved back to a 2.0-day buffer plus my usual runtime when CASP ended. Lucky for me, I guessed right and only slipped into WCG work when there was nothing at all for a few days last week. |
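As a rough illustration of that rule of thumb (a sketch only; the 0.70-day cut-off is the figure from the post above, and the buffer/runtime values below are made-up examples):

```python
# Sketch: does (buffer + one task's runtime) fit inside a short CASP-style cut-off?

def fits_deadline(buffer_days, runtime_hours, max_total_days=0.70):
    # Keep buffered work plus a single task's runtime under the cut-off, in days.
    return buffer_days + runtime_hours / 24.0 <= max_total_days

print(fits_deadline(buffer_days=0.4, runtime_hours=6))   # True:  0.40 + 0.25 = 0.65
print(fits_deadline(buffer_days=2.0, runtime_hours=6))   # False: 2.00 + 0.25 = 2.25
```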
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Right, a long buffer helps you complete more work overall, because you are still working even through an outage. However, your average time to return a result increases as well. During CASP the Project Team said that a very short turnaround time (12 hours or so) was desirable. But now that CASP is behind us, I'd suggest that if you have full-time internet access, about a 1-day cache is a good compromise between working through server problems (which in the past have averaged just a few hours) and reporting back completed results in a timely manner.

Yes, a larger cache and longer runtime tend to reduce the load on the servers. And yes, all a longer runtime does is complete more models against the same work unit. No, it would not be feasible for the project to attempt to manipulate the default runtimes during outages; in fact it would cause considerable problems for the BOINC client in understanding how much work to request. Rosetta Moderator: Mod.Sense |
Michael Gould Send message Joined: 3 Feb 10 Posts: 39 Credit: 15,727,161 RAC: 6,755 |
Okay, thanks for the answers! I'll set a one day buffer from now on, until CASP10 starts! |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
Sid Celery said:
Jochen said:
Evan said:
Michael Gould said:
Chris Holvenstot said:
Chilean said: (implying that work interruptions must not be real ... otherwise why say what was said?)
Sid Celery said:
and also:

These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions. This was in spite of the fact that numerous crunchers were reporting that they were not receiving Work Units. Some of these posts appear contradictory to me, and perhaps might explain my apparent confusion.

What does it mean when some contributors stop receiving work, while others apparently continue to receive plenty? I'm asking because I want to know, and because I believe that it is a legitimate question.

In the meantime, it is an absolute fact that my computer, which runs 24/7 for Rosetta@Home, received absolutely no Work Units for a period of about two or three days beginning on August 25 or 26. The buffer was emptied, and no crunching at all was accomplished during this time. I interpret that as a stoppage, not a slowdown. If I had tried to explain a "crash" of our company's ERP system to the V.P. of Operations as a "slowdown," I would have been bounced all the way across the parking lot to my car. :-|

deesy |
Jochen Send message Joined: 6 Jun 06 Posts: 133 Credit: 3,847,433 RAC: 0 |
I actually wrote: "There's nothing wrong and absolutely nothing you could do. Rosetta just generates only a limited number of WUs right now." How do you know there is a problem? The limited number of WUs could just as well be a side-effect of an update (software or hardware). We simply don't know. |
Teck7 Send message Joined: 23 Aug 10 Posts: 2 Credit: 198,527 RAC: 0 |
The very problem is exactly what was just said: we DON'T KNOW. It would be nice to get an update as to what's going on, an expected timeline for resolution, or just any kind of actual input from the Baker Lab team. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
The very problem is exactly what was just said: we DON'T KNOW. It would be nice to get an update as to what's going on, an expected timeline for resolution, or just any kind of actual input from the Baker Lab team.

They must have heard you. There is an update on the homepage. |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions.

I don't see anything wrong with those statements, nor anyone denying that Rosetta is performing below its usual capacity. There is indeed some contradictory phrasing, as each quote is the opinion of a different volunteer. Different people will always have a different perspective on an issue.

What does it mean when some contributors stop receiving work, while others apparently continue to receive plenty? I'm asking because I want to know, and because I believe that it is a legitimate question.

Normally Rosetta puts out an average of tens of thousands of work units an hour, but at the moment it appears to be generating several thousand instead. When those work units become available it is a bit of pot luck as to who gets them. If your BOINC manager calls the server just as work is released then you will get some; if your manager calls the server a couple of minutes later there might not be any left. If you had a two-day cache of work units prior to the slowdown then you would have had two days in which to ask the server for more work, raising the chance of seeing no interruption. People with no cache have no spare capacity, so they either got a new task when the last one finished or they didn't.

Your question also revolves around the issue of "plenty". I think it must be a very rare individual who has not had a reduced amount of work from Rosetta in the last week, but many supplement Rosetta with other projects to keep their cores busy. It also depends on scale: if someone with 1 computer with 1 core says, "I didn't get any work", but someone with 10 computers with multiple cores says, "I got some work but less than usual", is it any surprise that the person with more cores got more work?

In the meantime, it is an absolute fact that my computer, which runs 24/7 for Rosetta@Home, received absolutely no Work Units for a period of about two or three days beginning on August 25 or 26. The buffer was emptied, and no crunching at all was accomplished during this time. I interpret that as a stoppage, not a slowdown. If I had tried to explain a "crash" of our company's ERP system to the V.P. of Operations as a "slowdown," I would have been bounced all the way across the parking lot to my car. :-|

This is a question of perspective. For you there was a stoppage. For the project there was a slowdown (processing speed dropping to half, then a third, of normal). In the perspective of a supercomputer, which BOINC simulates, your account happens to be one set of cores in the supercomputer that didn't get work for a time; an unfortunate slowdown of the project, but hardly a complete crash of a system. Every BOINC project breaks down or experiences difficulties now and then, which is why many people have a backup project. I am not sure why you choose not to have a backup (such as World Community Grid's Human Proteome Folding project, which is directly related to Rosetta), but periods of complete inactivity are an unfortunate but inevitable consequence. |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
They just had the entire Rosetta database down for maintenance, and now that it is back I see a rather healthy 21,798 tasks ready to send. Hopefully that is a sign things are getting back to normal. |
deesy58 Send message Joined: 20 Apr 10 Posts: 75 Credit: 193,831 RAC: 0 |
These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions.

Thanks! Your information is lucid and helpful. At this time, I contribute only to Rosetta because it appears to me that this project is closest to the sort of applied science that might help to find a cure for a deadly disease that took a very close loved one several years ago. I have a good GPU, and I used it on the Collatz Conjecture Project for a while. Then I realized that the conjecture could never be proved, no matter how many years were spent trying. I "folded" for the "other guys" for several years, but that work appears to me to be more theoretical science, and less applied science. Credits mean little to me, except as a measure of how much my machine is contributing. I'm trying to assist the science, not engage in a competition.

deesy |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2220 Credit: 42,304,766 RAC: 24,305 |
These are a few posts that I (perhaps incorrectly) interpreted to mean that at least some posters were implying that there were no problems, and no interruptions. This was in spite of the fact that numerous crunchers were reporting that they were not receiving Work Units. Some of these posts appear contradictory to me, and perhaps might explain my apparent confusion.

Well, you could define your terms better, but I'll give you a scenario. Say you run one task at a time and you want another spare task made available to you as soon as you start the first one. Think of it as a Kanban in ERP terms. You call for it and you get it. Great. But if you don't get it, you call for it again 10 minutes later, then again and again, and finally after an hour, let's say (and 6 'failures'), it arrives. Is that 6 failures, or not a problem at all because your first job takes 3 hours to run and the new one arrived well before the in-progress one finished? It's a failure because it doesn't meet your criteria, but if your criteria are set to allow for 20+ failures without facing a real problem then it's a success for your criteria and for your 'production'. This, I believe, is the crux of your misinterpretation. You make no distinction when the distinction is actually everything.

In the meantime, it is an absolute fact that my computer, which runs 24/7 for Rosetta@Home, received absolutely no Work Units for a period of about two or three days beginning on August 25 or 26. The buffer was emptied, and no crunching at all was accomplished during this time. I interpret that as a stoppage, not a slowdown. If I had tried to explain a "crash" of our company's ERP system to the V.P. of Operations as a "slowdown" I would have been bounced all the way across the parking lot to my car. :-|

In my experience, having a VP who wasn't a bit of an idiot is a rarity. Sounds like you have one of the usual ones. First, though, you'd need to distinguish between the 'crash' over a week ago and the slow-down during all the time since. If you hadn't already assessed the difference between trivial ups and downs and built in some margin for that, then you'd deserve that walk to the car. Don't sweat the small stuff, and especially don't trouble the big guys with every bump and squeak; they expect you to handle that yourself. If it's something major that requires more heavyweight intervention, you can tweak some stuff (runtimes, in our case here, to provide more time for the big solution) or fall back on the final contingency of having a back-up project altogether. We're lucky that our machinery (BOINC) can run anything else with no changeover time.

The point being, set a safety margin that covers the small stuff and stop worrying if your safety margin is being eaten into - that's precisely what it's for. How you set your safety level depends on your situation. I'm away for half of each week, so I usually keep 2 days. 1 day may be better for you if you check things each night. If your safety margin is close to being exhausted, then as long as you're sure solutions are being worked on by TPTB and you've tweaked as much as you can, it's out of our hands. At the end of the day, the loss is theirs. If you're prepared to keep a back-up project you can stay productive and return when the problem's properly solved (looking better now, and the slow validator issue seems to have gone too). Hope that helps. |
. Send message Joined: 8 Aug 10 Posts: 2 Credit: 485,829 RAC: 0 |
It seems that since yesterday the team has fixed the problem with giving out work units. I have a full buffer now (48 hours of WUs). Also, production in teraFLOPS is up from 32.000 to about 69.000 at this moment. It would be nice to hear confirmation that everybody is receiving new work. |
Warped Send message Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0 |
You can safely assume that the work flow is back to normal again:

1. As you have observed, the teraFLOPS reported has increased dramatically and is fast approaching the normal levels.
2. The server status page is reporting in excess of 20k workunits ready to send.
3. All the moans about no work have ceased.
4. Those of us who have required work and commented have received it immediately.

I trust the cause has been identified and will not recur. |