Message boards : Number crunching : Please abort WUs with
Author | Message |
---|---|
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
Unfortunately, we seem to have had some problems with our latest batch of Work Units. The biggest one is that we inadvertently instructed each WU to make 1000 structures instead of 10. This is clearly not possible before the deadlines for these WUs. So to make sure you don't lose any credit, and we don't lose any results, please ABORT any WUs whose names start with "DEFAULT_....._205_...." You will also notice that the percentage resolution is higher on these WUs. The percentage is based on the fraction of target structures made, so 1000 structures means that you can have a 0.1% resolution. There also seems to be a problem with some other WUs exiting quickly. It is likely due to another mistake on our command line that we can fix quickly. We are looking into it. The message is that it's always dangerous to release a bunch of new stuff just before the holidays... :) We appreciate your patience as we work through these issues. The newly queued WUs should work better. |
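For anyone sorting through a long task list, the naming rule and the percentage arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes the dots in "DEFAULT_....._205_...." stand for arbitrary characters in underscore-separated fields, and the helper names `should_abort` and `percent_resolution` are made up here, not part of BOINC or Rosetta.

```python
import re

# Assumption: affected names look like DEFAULT_<something>_205_<rest>,
# with fields separated by underscores.
BAD_BATCH_PATTERN = re.compile(r"^DEFAULT_[^_]+_205_")


def should_abort(wu_name: str) -> bool:
    """Return True if the WU name matches the batch 205 pattern."""
    return BAD_BATCH_PATTERN.match(wu_name) is not None


def percent_resolution(target_structures: int) -> float:
    """Smallest progress step, in percent, when progress is the
    fraction of target structures completed."""
    return 100.0 / target_structures
```

With 10 target structures the smallest step would be 10%; with 1000 it is 0.1%, which matches the finer progress display described above.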
JChojnacki Joined: 17 Sep 05 Posts: 71 Credit: 10,630,763 RAC: 5,713 |
Appreciate the update! No problem being patient through the issues, as long as we remain informed. And since you guys there do such a great job at communicating, well, as I said before, no problem. :-) Thanks. |
Grutte Pier [Wa Oars]~GP500 Joined: 30 Nov 05 Posts: 14 Credit: 432,089 RAC: 0 |
Funny; we don't get to see here first whether the WUs are good. I would find it logical that I get credit for the work I did. https://boinc.bakerlab.org/rosetta/result.php?resultid=4421107 Some hours of work went into that one, and it was a fault on your side, so some credit for the work we did would be appreciated, even if you can't use the result we produced. 173.56 is more than half of my day's production, almost 40,000 sec. PS: good luck :) |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
We'll look into giving those who ran batch 205 credit. It is apparent that a local test system should be in place instead of sending test batches to the production server, particularly after this mistake was made for the very first batch submitted through our automated work generator. We'll be working hard to prevent future problems like this one. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=4421107 The _good_ news on these is that they apparently are failing with a 'CPU time exceeded' error at around 11 hours, and not just running for days... |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"We'll look into giving those who ran batch 205 credit." David, this is in part a repeat of what I said in another thread, but it's worth repeating if it saves you re-inventing the wheel. When Einstein made a mistake like this, they managed to give everyone credit for the aborted WU. If you ask them nicely, they may still have the script handy. (If they are not sure when this was, it was when they issued WUs whose names differed only by upper-vs-lower case, which confused the Windows machines.) On the other hand, the script might be so simple that it's easier to re-write it. I don't know enough to say, but the E@h script proves it is possible to do, and the user response proved it was a worthwhile PR move. People initially got 0 credit, but a few days later, after the script was run, everyone ended up getting what their client had claimed for the result. And the message for participants is to abort the WU and let it report; the people who still lost out on the Einstein blunder were those who'd reset their project. River~~ |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
It should be pretty easy to write a script to do this. I will be gone for the holidays for a week starting tomorrow, but when I get back I will give people credit for this batch. That should be enough time for most to have errored out, for those who did not get a chance to abort them. |
The Pirate Joined: 22 Sep 05 Posts: 20 Credit: 7,090,933 RAC: 0 |
|
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
I agree that I don't care if credit is granted or not, but I do ask that the staff be careful here. Saying "we will try" is fine, or "it should be pretty easy" - but until you _know_ that you can grant the credit, don't give out a blanket "you will get credit" statement. SETI said "you will get credit" for WUs that were late due to their latest outage. MOST did, but a certain set of those that were due the first or second day of it did NOT, and there are a lot of people upset. SZTAKI had a 0-credit problem three times; the first two times, they granted credit. The third, they _said_ they would grant credit, and every time they were reminded "you still haven't", they said "we will, we just haven't gotten to it". Then they deleted all the results before they "got to it", and went silent on it. No comment at all since. Don't care about the credit, 50-60 points out of 5000+, but don't lie to me! Bye, SZTAKI! While "we gave you credit" would be great, it's far better to say "we'll try" and then "we're sorry, we tried but couldn't give you credit" than it is to say "we will give you credit" and then "oops we can't". |
hans jørn enevoldsen Joined: 18 Dec 05 Posts: 3 Credit: 404 RAC: 0 |
Please abort WUs with "DEFAULT_xxxxx_207_... |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
Please abort WUs with "DEFAULT_xxxxx_207_... Why? I can find no evidence that there even IS a "default_xxxxx_207" yet. Please provide some justification or explanation. |
hans jørn enevoldsen Joined: 18 Dec 05 Posts: 3 Credit: 404 RAC: 0 |
All is stopping after 10 minutes |
Webmaster Yoda Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
All is stopping after 10 minutes Like Bill, I have not seen any WUs with names that start with "DEFAULT_xxxxx_207_" (replacing xxxxx with a number). However, I have had a relatively high number of work units in the 204 and 207 batches crash after a few minutes. Some have been OK. I will persist with Rosetta for the day, but if it gets out of hand I will suspend Rosetta on most of my computers (keeping an eye on the remaining one). The (minimal) time taken and lack of credit is not a major concern, but why waste bandwidth to download work units that are going to crash? EDIT 1: I just had three in a row crash, minutes after posting this message. All in the 207 batch: 1ogw__topology_sample_207_10103_1, 1hz6A_topology_sample_207_7644_1, 1ogw__topology_sample_207_14401_0. All with error 0xC0000005, in a matter of 10-30 seconds. EDIT 2: 4 more crashed, on two computers, since writing the above (a few minutes ago). Two of them were batch 208, two of them batch 207. EDIT 3: Minutes later, another 3. It is getting out of hand - Rosetta has been set to "no new work" on all my computers pending a fix. *** Join BOINC@Australia today *** |
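One rough way to keep track of which batches are producing the quick crashes is to tally failed names by batch number. The sketch below assumes, based only on the three names listed above, that the batch number is the third field from the end when the name is split on underscores; that layout is a guess and may not hold for other WU names. The helper names `batch_of` and `crashes_per_batch` are invented for this example.

```python
from collections import Counter


def batch_of(wu_name: str) -> str:
    # Assumption: names end in <batch>_<number>_<number>.
    return wu_name.split("_")[-3]


def crashes_per_batch(crashed_names):
    """Count crashed work units per (assumed) batch number."""
    return Counter(batch_of(name) for name in crashed_names)


crashed = [
    "1ogw__topology_sample_207_10103_1",
    "1hz6A_topology_sample_207_7644_1",
    "1ogw__topology_sample_207_14401_0",
]
print(crashes_per_batch(crashed))
```

A tally like this only shows where crashes have been seen; as noted elsewhere in the thread, some WUs from the same batch run fine.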
Scribe Joined: 2 Nov 05 Posts: 284 Credit: 157,359 RAC: 0 |
Please abort WUs with "DEFAULT_xxxxx_207_... See This thread towards the end. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
|
pb Joined: 30 Nov 05 Posts: 6 Credit: 65,632 RAC: 0 |
Hello! I've aborted this one: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3760695 Will I get the promised credit for it, as stated in the News? Thanks. |
Jack Schonbrun Joined: 1 Nov 05 Posts: 115 Credit: 5,954 RAC: 0 |
2am Seattle time, and I've found the source of the problem for the quick crashing jobs. It's amazing how distributed computing puts one's code to the test. David Kim's work-around should make things okay until we fix the code. Unfortunately, I think the bad work units will have to error out to be removed from the queue. Again, we appreciate your patience. |
Nothing But Idle Time Joined: 28 Sep 05 Posts: 209 Credit: 139,545 RAC: 0 |
... Hate to be this way but, let them error out on U of W Housing and Food Services computers not mine. Rosetta is currently suspended and EaH is merrily computing double quota, perhaps until after Christmas. I'm a retired computer programmer myself, and the concept of "work around" doesn't appeal to me. |
Grutte Pier [Wa Oars]~GP500 Joined: 30 Nov 05 Posts: 14 Credit: 432,089 RAC: 0 |
I see 207 jobs crash too. I can't tell which ones will and won't crash; some 207s do compute correctly. New stuff before the Christmas holiday isn't a good idea. Maybe revert to the previous calculation methods, and research before experimenting. This isn't meant to be annoying or cynical; I just can't see what is being done. Good luck. If we can help or you need feedback, ask us. These are the bumps in the road for a new project. |
Tern Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
From the Technical News: Batch 206 and greater are okay, and should not be aborted. There appear to be two separate problems - the "DEFAULT_xxxxx_205" WUs which run for about 11 hours and then fail (and should be aborted) and an application(?) problem that is causing _various_ WUs to fail very quickly. There is no need to abort any WUs other than the specified ones; you can't tell by the name if a WU other than those is going to fail quickly, and a quick failure is not a big problem to either the project or the participants. The "short" failures shouldn't add up to more than a minute or two on average for everyone, and there is little point in granting 0.x credits - those who have spent up to 11 hours on one of the "DEFAULT 205"s, the project has said _will_ get credit for the time spent, after they have been "flushed", and after the holidays. If credit isn't being lost, there really is no reason _not_ to be running Rosetta right now. If you see one of the long ones after it starts and abort it, you'll get credit for however much time _was_ spent. If you don't see it and it runs until it gets the CPU time limit error, you'll get credit for it. If you see and abort one before it starts, you've lost nothing other than a few seconds downloading it. If a few "short" ones hit your computer and error out, well, so what? Meanwhile, most people are getting mostly "good" WUs, so the work continues. Suspending Rosetta just slows the process. Any "bad" WU that you don't get will just be done by someone else, but maybe a day later. Anyone who knows my postings, on this and other boards, will know that I don't cut the projects very much slack. When they screw up, I tell them about it. If you read the "How to have the best BOINC project" thread, my points 6, 7, and 8 will tell you that I'm not terribly happy about this present situation. But this is a "young" project, and not all of the safeguards are in place yet. 
They're posting on these boards actively, and trying to do the right things. |