Message boards : Number crunching : Credits Granted
Author | Message |
---|---|
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
But to me, and I assume to others as well, it (the "Technical News" statement) looked like that was all they were going to do (rather incomplete) and that made me rather grumpy. Reading it again, it could sound that way; _I_ took it more as "here is what we have done", and didn't give up on more to come. I'm stubborn, or optimistic, or something - until I hear a flat "no", I assume something is either "yes" or "maybe". :-) |
Deamiter Send message Joined: 9 Nov 05 Posts: 26 Credit: 3,793,650 RAC: 0 |
Am I the only one who assumes to have a certain "loss rate" in my crunching? There are dozens if not hundreds of factors, but the biggest are probably in computer resets (particularly for the highly-mobile laptop and borged boxes). Of course there are also network outages, power outages, cycle loss due to the OS etc... Somewhere in there there are losses due to project problems. But there's a big reason I'm working on alpha projects -- quite simply, I strongly feel I'm getting more scientific value for my CPU time by running the projects that are less popular! Of course that also means they're much less stable.

I guess I just don't complain if my RAC is down 50 points for the day because a WU got lost in the shuffle. Maybe one of the PCs in my lab was left off for the night, or maybe my home router needs a reset, or MAYBE one of my WUs was bad. I guess I'm very content that the systematic problems are being worked on. Yeah, I'll probably lose some credit here and there for participating in the pre-release projects. The time I DO donate, however, is worth so much more to a project like Rosetta than to the overloaded SETI that I feel it more than makes up for such intermittent troubles and a slightly attenuated credit rate. |
Los Alcoholicos~La Muis Send message Joined: 4 Nov 05 Posts: 34 Credit: 1,041,724 RAC: 0 |
Am I the only one who assumes to have a certain "loss rate" in my crunching? There are dozens if not hundreds of factors, but the biggest are probably in computer resets (particularly for the highly-mobile laptop and borged boxes). Of course there are also network outages, power outages, cycle loss due to the OS etc...

I quite agree with you; the possibility of helping achieve some of the goals of Rosetta is why I joined this project. And yes, a starting project deserves lots of understanding and courtesy... but (there has to be a "but" somewhere) I value my cpu-time a lot. When cpu-time is wasted due to project problems I can live with it, but it still hurts, because I go through quite some effort to gather as much computing power and time as I can for this project. That is why I expect some understanding from the project staff in return, and the way of showing that is keeping us informed, involving us with the problems, and granting us credits (as a matter of fact I don't give a sh*t about credits, but I keep telling myself (and my wife) that there will come a time when the electricity company will accept them as payment for their bills). I think others go through the same amount of trouble to keep their farms crunching, and that is why they like to have their lost cpu-time rewarded. By the way, imho the Rosetta project staff is doing a great job so far... thanks. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
A lot of what we intend to do is based on feedback from users like you. For instance, we are now discussing whether we should grant credit to all "Time exceeded" errors. My vote is, rather than worrying about past lost credit, to spend time on tracking down and fixing the cause. We have an important question to answer about these errors: are they due to stuck jobs (i.e. 1% errors), or are they being terminated prematurely due to the rsc_fpops_bound being set too low on our end? I have not seen evidence yet suggesting the latter in general. The bound is set conservatively. We are definitely going to try to fix this issue in the next app update. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
A lot of what we intend to do is based on feedback from users like you. For instance, we are now discussing whether we should grant credit to all "Time exceeded" errors. My vote is, rather than worrying about past lost credit, to spend time on tracking down and fixing the cause. We have an important question to answer about these errors...

You have losses on all projects for one reason or another. Perhaps rather than hunting around, just give a "flat rate" bonus to the people on the project. Much simpler, easier, less time, ... But, I also would much rather you spent the time moving forward on the fixes and improvements ... |
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
But, I also would much rather you spent the time moving forward on the fixes and improvements ... I agree! :) Regards, Bob P. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
Are they due to stuck jobs (i.e. 1% errors), or are they being terminated prematurely due to the rsc_fpops_bound being set too low on our end?

The ones that I looked into were definitely not "stuck" at any point, and they actually ran a fairly "normal" amount of time for that WU type - at the long end, but not unreasonable.

I believe that the problem is that when calculating "is the boundary exceeded", BOINC uses the DCF as well as the benchmarks. Example... Let's say (for simplicity of math) the bound is 100,000, and the benchmark is 100. You would expect this result to hit the boundary after 1000 seconds. If the host normally finishes an "average" result in 500 seconds (estimated fpops is 50,000), the setting is quite conservative; you're allowing this result to run up to twice the normal expected time.

But, it seems _all_ Rosetta results have the same _ESTIMATED_ time, when in reality, the actual times vary quite a bit. DCF is lowered by short (quicker than estimated) results, and raised by long (longer than estimated) ones; if this host happens to get a handful of very short WUs (say 250 seconds) immediately before getting _this_ result, then the DCF could be, for example, 0.5 when this result starts. Multiply the bound by that DCF, you're suddenly down to 50,000, or 500 seconds - and if this result runs even ONE SECOND longer than the "average result", it's exceeded the boundary.

Now, in general, the DCF is a very good thing; it keeps the cache filled with the correct amount of work, it lets the "to completion" times be reasonably accurate, etc. But the accuracy of the DCF itself depends _entirely_ on the accuracy of the project's estimates of "how long" a result will take. THAT is, I believe, the source of this problem; Rosetta simply isn't very accurate on those estimates, making DCF a matter of "luck" - the order in which a host did what type of results. I have seen DCF vary IN ONE DAY, on one of my machines, by a factor of 2x, and that was _without_ any "short error" WUs. (Error WUs shouldn't lower the DCF, but I haven't been able to prove that they don't...)

I don't know what your estimated fpops is or your boundary fpops, so I don't know what "conservative" means - 2x? 3x? I'm guessing it's a bit more than 2x, or we would see many more "max cpu exceeded" errors. I've had a result with an original estimated time of 10 hours take 21 on a slow Mac and not blow up, but I'd bet it was getting pretty close.

The long term solution is simple - use the alpha project or internal machines to run a hundred of each new WU type before releasing them to the project, and set the estimated fpops for that WU type based on the times you see. The short term solution may well require temporarily raising the boundary - I _don't_ think it is helping with the 1% problem, it sounds like some of those have still been "stuck" well after I think they should have gotten the "max cpu exceeded" error. |
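A minimal sketch of the mechanism described above, using the same made-up numbers (bound 100,000 fpops, benchmark 100 fpops/s); this is illustrative only and not the actual BOINC client code:

    // Hypothetical illustration of a DCF-scaled cutoff vs. a plain cutoff.
    // Numbers follow the example above: bound = 100,000 fpops, benchmark = 100 fpops/s.
    #include <cstdio>

    int main() {
        double rsc_fpops_bound = 100000.0;  // per-result limit set by the project
        double p_fpops = 100.0;             // host benchmark, fpops per second
        double dcf = 0.5;                   // duration correction factor after a run of short WUs

        double cutoff_plain = rsc_fpops_bound / p_fpops;        // 1000 s
        double cutoff_scaled = dcf * rsc_fpops_bound / p_fpops; // 500 s, the hypothesized behavior

        printf("cutoff without DCF: %.0f s\n", cutoff_plain);
        printf("cutoff if DCF were applied: %.0f s\n", cutoff_scaled);
        // If the scaled cutoff applied, any result running past 500 s (the host's
        // "average" time in this example) would hit "Maximum CPU time exceeded".
        return 0;
    }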
nasher Send message Joined: 5 Nov 05 Posts: 98 Credit: 618,288 RAC: 0 |
The time exceeding WU have not been granted credit yet. The "Bad Random Number Seeds" WU usually error out in less than 20 CPU seconds and are a different kind of WU problem.

Regards Phil
Well, about those 20 second WU's: just curious, how much credit do you think should be granted? Most of my jobs that run 5000 seconds get about 14 credits. This may be low or high for the average, but anyway, assuming 5000 seconds = 14 credits, then 1 credit is about 357 seconds (~6 min). So, um, I hate saying it, but it probably isn't worth the effort to worry about losing a 20 second job (unless you pay for your UL/DL bandwidth). Yes, I love to see credits for work done and credits if an error occurs beyond your control... but personally, every time I reboot one of my computers it loses back to the last benchmark (probably minutes or more of work + 3-6 min to reboot), so I expect to lose credits now and then. Sorry for the soapbox lecture. |
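For scale, the arithmetic in the post above works out to a small fraction of a credit for a lost 20-second job (a rough sketch using the quoted figures):

    // Rough arithmetic using the figures above: ~5000 s of CPU ~= 14 credits.
    #include <cstdio>

    int main() {
        double secs_per_credit = 5000.0 / 14.0;       // ~357 s, roughly 6 minutes per credit
        double lost_credit = 20.0 / secs_per_credit;  // credit value of one 20-second failed WU
        printf("~%.0f seconds per credit; a 20 s job is worth ~%.3f credits\n",
               secs_per_credit, lost_credit);
        return 0;
    }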
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I believe that the problem is that when calculating "is the boundary exceeded", BOINC uses the DCF as well as the benchmarks.

Can anyone confirm this by showing me where to find it in the boinc client code? I do not see this and, in fact, I see what I thought was the way it is calculated in client/app.C:

    max_cpu_time = rp->wup->rsc_fpops_bound/gstate.host_info.p_fpops;

so it depends on your benchmark (p_fpops) and the rsc_fpops_bound that we set for the work unit, as far as I can tell. If the benchmarks are off (p_fpops too big), then there could be a chance that a result can be terminated prematurely. Also, due to the random nature of the calculations, a particular work unit may need more fpops (floating-point operations) to finish, but it would have to be quite a bit more since our bound is rather conservative. Currently we use:

    rsc_fpops_est = 2e13
    rsc_fpops_bound = 9e13 (so 4.5x)

On one of my computers the benchmark currently gives 1333470000 fpops/s, and successful results have completed in less than 3 hours (10800 sec), so that is a total of 1.44e13, which is not too far off the estimate. My understanding is that the fpops_est primarily affects the run time estimates shown on the client and how many work units to download given the communication interval with the server. |
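Plugging the figures quoted above into that same formula gives a sense of where the cutoff sits on that particular machine (a sketch; the benchmark value differs per host):

    // Sketch using the numbers above: rsc_fpops_bound = 9e13,
    // p_fpops = 1.33347e9 (the benchmark quoted for one machine).
    #include <cstdio>

    int main() {
        double rsc_fpops_bound = 9e13;
        double p_fpops = 1.33347e9;

        double max_cpu_time = rsc_fpops_bound / p_fpops;  // same formula as client/app.C
        printf("max_cpu_time ~= %.0f s (~%.1f hours)\n", max_cpu_time, max_cpu_time / 3600.0);
        // Roughly 67,500 s (about 18.7 hours), well above the ~3 hour successful runs
        // on that host, so a premature kill there would point at something other than
        // the bound itself (e.g. an inflated benchmark or a much longer-than-usual WU).
        return 0;
    }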
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
DELETED DOUBLE POST We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
The time exceeding WU have not been granted credit yet. The "Bad Random Number Seeds" WU usually error out in less than 20 CPU seconds and are a different kind of WU problem.

I agree 100% that the WUs that error at 20 seconds do not amount to anything in terms of credit, and that it is a waste of valuable project time and resources to credit those. Unfortunately, almost everyone got a few of these and people have been screaming their heads off about it on the boards. The "squeaky wheel" theory has now come into play and the project has (to their credit) responded to the demand that credits be awarded.

But the most significant loss of credit is occurring on the WUs that error for "Max CPU time exceeded." These frequently error after 80% simply because they run longer than the system expects them to run. I, and others, have had a number of these amounting to a few thousand credits over the last month or so. While I would like to see the credit for those awarded, I would prefer to see a fix for the problem. Some of us have implemented a "patch" by increasing the DCF in BOINC to allow longer run times. At best this is temporary and requires a lot of monitoring of the system to keep things running. It is for that reason that I pointed out that the "Random number" problem and the "Max time" problem are not the same thing.

If we are really concerned about loss of project resources then the effort should be focused on the Max time issue. If I have one WU that fails at 80% complete for Max time, that represents a loss of 5-8 hours of time for the project every time it happens. It would take more than a few hundred "20 second" failures to equal that single failed WU. I think the project team has a handle on the Random number problem. It may take a while to implement the fix, but it is at hand. They should not waste time awarding credit for these.

It is simple math. The project should concentrate its limited resources on the problems that slow the production of science the most. These would not necessarily be the issues that make the most noise in the user community. In this case it is not WUs that error in 20 seconds. I lose more CPU cycles in rounding errors than I ever lost on the 20 second failures.

Regards Phil

We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
David,

For what it is worth, I confirm your analysis. The place where that boundary is used to abort the task and emit the message uses the number unmodified.
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
I believe that the problem is that when calculating "is the boundary exceeded", BOINC uses the DCF as well as the benchmarks.

Clearly you are in a much better position to assess the cause of this problem than I. That said, there seems to be more at play here. These Max time failures (at least for me) started about a week after I upgraded to BOINC 5.2.13. The systems ran ok for that first week, then I started seeing a number of the Max time errors. I have just completed a WU on one system that ran over 20 hours. This is 4 times the normal run time. If I had not manually adjusted the DCF it would have errored at around 7 hours.

It seems to me that part of the problem is the wide variation in WU size. Typically BOINC expects to see WUs of similar size, perhaps with a variation of 10% one way or the other. But R@H WUs can vary by 200% or more. Since the boundaries are for all practical purposes fixed, this causes a problem.

I for one do not need the system to decide if a WU should be aborted because it is taking a long time. If it is progressing, I would prefer to let it complete. I have seen R@H WUs run as long as 35 hours and complete successfully. The most recent releases of BOINC will not allow that unless it is manually adjusted for long run times.

Regards Phil

We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
Some of us have implemented a "patch" by increasing the DCF in BOINC to allow longer run times.

While the code may say it's not using the DCF, _something_ is causing this condition; why would one host get a "max CPU time exceeded" error on a WU that ran _less_ time than ones shortly before and after it that were successful? In the case I investigated, the only difference I could see was a string of "short" WUs immediately before the "max CPU" one, which would strongly indicate DCF... |
Snake Doctor Send message Joined: 17 Sep 05 Posts: 182 Credit: 6,401,938 RAC: 0 |
Some of us have implemented a "patch" by increasing the DCF in BOINC to allow longer run times.

I agree, Bill; if the DCF were not involved then raising it would have no effect on the outcome. Clearly it does. Now it may also affect the number of WUs a particular machine can download at once, but that is a different issue. If the DCF is raised sufficiently, all WUs seem to complete successfully irrespective of the CPU time they take. This implies that the DCF IS used in the calculations for Max time. However, I have only seen the system make very small changes in the DCF over time, and they have always been to make it less.

Regards Phil

We Must look for intelligent life on other planets as, it is becoming increasingly apparent we will not find any on our own. |
Grutte Pier [Wa Oars]~MAB The Frisian Send message Joined: 6 Nov 05 Posts: 87 Credit: 497,588 RAC: 0 |
I do not care about these 20-second WU's I didn't get credits granted for. It's just that there were a lot of these WU's that got credits, and I found it a bit strange that I didn't get anything, so I wondered what was wrong with the ones I had uploaded. Just curiosity. Too much time gets wasted on these 0.00xxxx credits, but if you've had a lot of them it might count. |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 7 |
But, it seems _all_ Rosetta results have the same _ESTIMATED_ time, when in reality, the actual times vary quite a bit.

Just had a good example of this; a WU that was estimated at 20+ hours just finished in 4:11:22. Remaining WUs in the queue dropped to 18:47 estimates. If you look at this host here you'll see a _6x_ range of CPU times... the DCF was set very high by the 26-hour one and is just now back _down_ to 2.33... There are only eight completed results for that host, which makes it very easy to see what's going on.

I think the "Increase_cycles" WUs should have been issued with at least double the estimate they got, and the "no_sim_anneal" ones possibly half the estimate. However it's done, the estimates should definitely not be the same on them. |
Los Alcoholicos~La Muis Send message Joined: 4 Nov 05 Posts: 34 Credit: 1,041,724 RAC: 0 |
But, it seems _all_ Rosetta results have the same _ESTIMATED_ time, when in reality, the actual times vary quite a bit.

After another "maximum cpu time exceeded" error I suspended network activity on a dual G5 2GHz with 2.5 GB RAM (boinc 5.2.13, no other projects). I have the following (not yet uploaded) queue of results:

cpu-time - status
12:38:43 - maximum cpu time exceeded
02:32:24 - finished
03:47:37 - finished
03:12:34 - finished
12:38:43 - maximum cpu time exceeded
08:49:47 - finished
06:19:48 - finished
03:46:29 - finished
07:42:59 - finished
03:17:35 - finished
06:08:48 - finished
05:53:45 - finished
03:05:26 - finished
02:28:05 - finished
02:22:14 - finished
12:38:45 - maximum cpu time exceeded
06:47:13 - finished
04:38:56 - finished
08:04:36 - finished
04:44:13 - finished
03:14:39 - finished
01:41:28 - finished
08:02:37 - finished
06:02:18 - finished
04:58:23 - finished
05:28:48 - 70%
01:53:23 - 50%

So far I haven't kept track of the variations in the estimated_time (at the moment: 07:44:12). Although there is a sequence of 3 short wu's before an error, I don't think that that's the real cause. As you can see, some wu's just take too much time to finish (one was at 80%, the other at 90% when they errored out). And I can't recall seeing an estimated_time on this machine greater than 12:00:00. Unless the max_cpu_time is increased these wu's will never finish. |
AMD_is_logical Send message Joined: 20 Dec 05 Posts: 299 Credit: 31,460,681 RAC: 0 |
I notice that some of the people who see a DCF dependence have computers with *very* high benchmark scores. I assume that's because they're using an "optimized" version of boinc. Some of those boinc versions may have been modified to include DCF in the max_cpu_time calculation. In fact, they would pretty much have to do something of the sort because otherwise the extremely high benchmark scores would result in a very short max_cpu_time, and all work units would time out. |
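To illustrate why an inflated benchmark would bite, here is the same formula with a benchmark assumed to be several times higher than the ~1.33e9 fpops/s quoted earlier in the thread (the 5x factor is purely hypothetical):

    // Hypothetical: a benchmark inflated roughly 5x by an "optimized" client.
    #include <cstdio>

    int main() {
        double rsc_fpops_bound = 9e13;
        double inflated_p_fpops = 6.7e9;  // assumed inflated benchmark, fpops per second

        double max_cpu_time = rsc_fpops_bound / inflated_p_fpops;
        printf("max_cpu_time ~= %.0f s (~%.1f hours)\n", max_cpu_time, max_cpu_time / 3600.0);
        // About 13,400 s (~3.7 hours), shorter than many of the legitimate run times
        // reported in this thread, so without some compensating factor (such as DCF)
        // many work units on such a host would hit the limit.
        return 0;
    }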
Los Alcoholicos~La Muis Send message Joined: 4 Nov 05 Posts: 34 Credit: 1,041,724 RAC: 0 |
I notice that some of the people who see a DCF dependence have computers with *very* high benchmark scores. I assume that's because they're using an "optimized" version of boinc. Some of those boinc versions may have been modified to include DCF in the max_cpu_time calculation.

I don't think optimized clients cause the problem. The standard version of Boinc has "maximum cpu time exceeded" errors as well. My G4 with the standard (recommended) version of boinc (5.2.13) had 3 out of 13 wu's with the "maximum cpu time exceeded" errors in 10 days, just like my G4 Powerbook with the 4.44 superbench client, where 4 out of 15 wu's errored out. Besides that, Rosetta uses the Boinc platform, where the use of an optimized client is quite common (e.g. Seti), so Rosetta should meet those multi_project requirements. It would be very odd to ask people to change their Boinc clients for every other project, wouldn't it? |