Message boards : Number crunching : Please abort WUs with
Author | Message |
---|---|
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
I just read a post that says Windows ME is not supported by Rosetta. I have ME on my computer and Rosetta was doing fine with it until the 18th. Does it depend on the type of WU you receive as to working with ME? As I understand it, Rosetta officially only *aims* to have its apps working on Win2k and later versions of Windows. Even so, I run one ME box on this project without any OS-specific problems. The latest app runs as well for me as the earlier ones did, once I exclude the huge numbers of bad jobs we've had recently, which would not have run on any OS. However, ME is officially unsupported, so in future an app may come along which won't run under ME, and then me & thee will have no right to moan at the project programmers, because it is us that are 'out of spec', not them. Have another project, or a Linux dual boot, ready for the day that happens! I've got both lined up, me... River~~ |
O&O Send message Joined: 11 Dec 05 Posts: 25 Credit: 66,900 RAC: 0 |
A) ...
On 24/12/2005 02:01: Started downloading aa1dis2_09_5.400_v1_3.gz (3.08MB)
Since then, I received: "rosetta@home|Temporarily failed download of aa1di2_09_05.200_v1_3.gz: error 500 X" ... about 24 times!
Nevertheless, ... On 24/12/2005 03:27: Finished downloading the last piece of it, which was WU: DEFAULT_1di2_205_78_5
Q1) I have "suspended" this WU while it is in the "Ready to run" state; should I still "abort" it?
Q2) 25 times of "error 500", are they "normal" communication problems between your "server" and my PC?
B) ....
24/12/2005 01:13:48|rosetta@home|Starting result 1hz6A_topology_sample_207_14685_8 using rosetta version 481
24/12/2005 01:13:49|rosetta@home|Starting result 1hz6A_topology_sample_207_15735_8 using rosetta version 481
24/12/2005 01:14:34|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_14685_8 ( - exit code -1073741819 (0xc0000005))
24/12/2005 01:14:34|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_15735_8 ( - exit code -1073741819 (0xc0000005))
24/12/2005 03:38:50|rosetta@home|Resuming result 1hz6A_topology_sample_207_10781_7 using rosetta version 481
24/12/2005 03:39:25|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_10781_7 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:40:11|rosetta@home|Starting result 1hz6A_topology_sample_207_11598_1 using rosetta version 481
24/12/2005 04:40:23|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_9040_6 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:40:23|rosetta@home|Computation for result 1ogw__topology_sample_207_9040_6 finished
24/12/2005 04:40:24|rosetta@home|Starting result 1hz6A_topology_sample_207_9621_7 using rosetta version 481
24/12/2005 04:40:50|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_11598_1 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:40:50|rosetta@home|Computation for result 1hz6A_topology_sample_207_11598_1 finished
24/12/2005 04:40:51|rosetta@home|Starting result 1ogw__topology_sample_207_12440_5 using rosetta version 481
24/12/2005 04:41:04|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_9621_7 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:41:04|rosetta@home|Computation for result 1hz6A_topology_sample_207_9621_7 finished
24/12/2005 04:41:05|rosetta@home|Starting result 1ogw__topology_sample_207_9064_3 using rosetta version 481
24/12/2005 04:41:38|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_12440_5 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:41:38|rosetta@home|Computation for result 1ogw__topology_sample_207_12440_5 finished
24/12/2005 04:41:38|rosetta@home|Starting result 1ogw__topology_sample_204_2061_5 using rosetta version 481
24/12/2005 04:41:52|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_9064_3 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:41:53|rosetta@home|Computation for result 1ogw__topology_sample_207_9064_3 finished
24/12/2005 04:41:53|rosetta@home|Starting result 1ogw__topology_sample_207_14480_7 using rosetta version 481
24/12/2005 04:42:29|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_204_2061_5 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:42:29|rosetta@home|Computation for result 1ogw__topology_sample_204_2061_5 finished
24/12/2005 04:42:29|rosetta@home|Starting result 1ogw__topology_sample_207_9063_5 using rosetta version 481
24/12/2005 04:42:41|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_14480_7 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:42:41|rosetta@home|Computation for result 1ogw__topology_sample_207_14480_7 finished
24/12/2005 04:42:41|rosetta@home|Starting result 1hz6A_topology_sample_207_5164_6 using rosetta version 481
24/12/2005 04:43:15|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_9063_5 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:43:15|rosetta@home|Computation for result 1ogw__topology_sample_207_9063_5 finished
24/12/2005 04:43:15|rosetta@home|Starting result 1hz6A_topology_sample_207_15831_9 using rosetta version 481
24/12/2005 04:43:22|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_5164_6 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:43:22|rosetta@home|Computation for result 1hz6A_topology_sample_207_5164_6 finished
24/12/2005 04:43:22|rosetta@home|Starting result 1ogw__topology_sample_207_16196_6 using rosetta version 481
24/12/2005 04:43:56|rosetta@home|Unrecoverable error for result 1hz6A_topology_sample_207_15831_9 ( - exit code -1073741819 (0xc0000005))
24/12/2005 04:43:56|rosetta@home|Computation for result 1hz6A_topology_sample_207_15831_9 finished
24/12/2005 04:43:56|rosetta@home|Starting result 1dtj__abrelax_rand_len10_jit02_omega_sim_23401_1 using rosetta version 481
24/12/2005 04:44:10|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_16196_6 ( - exit code -1073741819 (0xc0000005))
Q3) Are they related to your announced "problems"?
Q4) In the future, should my PC expect more of such WUs (which took more than 2 hours to download), only to have them end with "Computational errors" in fractions of a second?
Thank you. O&O (UTC +3, Dial-up) |
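The exit code in the log above is printed twice, once as the signed decimal -1073741819 and once as hexadecimal 0xc0000005. Both are the same 32-bit value, and on Windows 0xC0000005 is the status code for an access violation. A minimal sketch of pulling the failed results out of such a log follows; it assumes the message format shown in this post, and the function names are illustrative only, not part of BOINC.

```python
import re

# Matches message-log lines of the form seen in this thread, e.g.
# ...|Unrecoverable error for result NAME ( - exit code -1073741819 (0xc0000005))
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<name>\S+) "
    r"\( - exit code (?P<code>-?\d+) \(0x[0-9a-fA-F]+\)\)"
)


def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def failed_results(log_text):
    """Return (result name, unsigned exit code) for every error line found."""
    return [
        (m.group("name"), to_unsigned32(int(m.group("code"))))
        for m in ERROR_RE.finditer(log_text)
    ]


# Two lines copied from the log above.
sample = (
    "24/12/2005 01:13:48|rosetta@home|Starting result "
    "1hz6A_topology_sample_207_14685_8 using rosetta version 481\n"
    "24/12/2005 01:14:34|rosetta@home|Unrecoverable error for result "
    "1hz6A_topology_sample_207_14685_8 ( - exit code -1073741819 (0xc0000005))\n"
)

for name, code in failed_results(sample):
    print(name, hex(code))  # 1hz6A_topology_sample_207_14685_8 0xc0000005
```

Lines such as "(aborted by user)" carry no exit code and are deliberately not matched by this pattern.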
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
DEFAULT_1di2_205_78_5
Yes... OR... given you are on dial-up, you _could_ let this result run, and then you will get credit for the time spent on it. It is "good" in terms of structure; it is only "bad" in that it runs extremely long, and eventually trips the "maximum CPU time" error. You will get 0 credit at first, then after the holidays, they will grant the credit manually.
Q2) 25 times of "error 500", are they "normal" communication problems between your "server" and my PC?
That does not sound normal at all. Error "500" is the "generic fallback" error that is reported when BOINC doesn't have a real error message. The servers have not been overloaded, from what I've seen.
24/12/2005 04:44:10|rosetta@home|Unrecoverable error for result 1ogw__topology_sample_207_16196_6 ( - exit code -1073741819 (0xc0000005))
Yes, these are examples of the "short WUs" that error out quickly due to a random-number problem. They are supposed to be "almost gone" at this point, but we have been advising people on dial-up that they should Suspend Rosetta for a day or two, or at least until they can check these boards and verify that all the bad ones are gone. Given the length of time it took you to get these, I would personally run the DEFAULT_205 thing, _or_ suspend Rosetta and work on another project for a couple of days. There is no point in your spending that much time downloading, only to return no valid results and get no credit for it either. |
O&O Send message Joined: 11 Dec 05 Posts: 25 Credit: 66,900 RAC: 0 |
Thank you BM for your swift response ... much appreciated. One more question if you can answer ... please ... Q5) Not mentioning the time it took me to download'em, I have in a "Ready to Run" status, 14 Default_xxxx_219_xxxx_x WUs, 9 Default_xxxx_218_xxxx_x and 1 Default_xxxx_221_xxxx_x. Should I "abort" them? Edit: And this ... rather silly but I was wondering ... Q6) I have one WU with the name ... BARCODE_FRAG_30_1n0u_221_42_0 ..., is it related to the research into the 3-dimensional shapes of proteins, to find cures for some major human diseases? Regards, O&O |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
Q5) Not mentioning the time it took me to download'em, I have in a "Ready to Run" status, 14 Default_xxxx_219_xxxx_x WUs, 9 Default_xxxx_218_xxxx_x and 1 Default_xxxx_221_xxxx_x. See the very first message in this thread: 'please ABORT any WUs whose names start with "DEFAULT_....._205_...." ' The ones you mention are not from the 205 batch, so they don't need to be aborted. *** Join BOINC@Australia today *** |
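The rule quoted above (abort only names that start with "DEFAULT_....._205_....") can be checked mechanically. The sketch below is an illustration rather than a project tool: it assumes the second field of the name never contains an underscore, as in the DEFAULT names posted in this thread, and `should_abort` is a made-up helper name.

```python
import re

# Assumption: the name is DEFAULT_<one underscore-free field>_205_...
# Case is ignored because the names are also written as "Default_..." above.
BATCH_205_RE = re.compile(r"^DEFAULT_[^_]+_205_", re.IGNORECASE)


def should_abort(wu_name):
    """True only for names from the DEFAULT 205 batch."""
    return BATCH_205_RE.match(wu_name) is not None


for name in (
    "DEFAULT_1di2_205_78_5",              # from this thread: 205 batch
    "DEFAULT_2reb_205_39_3",              # from this thread: 205 batch
    "BARCODE_FRAG_30_1n0u_221_42_0",      # from this thread: not DEFAULT
    "1hz6A_topology_sample_207_14685_8",  # from this thread: not DEFAULT
):
    print(name, should_abort(name))
```

Names from the 218, 219 and 221 batches fall through as `False`, matching the advice that they do not need to be aborted.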
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
Q5) Not mentioning the time it took me to download'em, I have in a "Ready to Run" status, 14 Default_xxxx_219_xxxx_x WUs, 9 Default_xxxx_218_xxxx_x and 1 Default_xxxx_221_xxxx_x. The "DEFAULT_xxx_218" (and up) WUs should be good, unless one happens to be a "short WU" (which is unlikely; I think that problem was fixed by batch 218). The "Barcode" part, I have no idea about. The WU names are sometimes discussed in the Science forums, but I haven't seen that one. |
divyab Send message Joined: 20 Oct 05 Posts: 6 Credit: 0 RAC: 0 |
(In the future, science questions like this will probably be more promptly addressed on one of the science threads... but since it seems like the WUs are stabilizing, I'll answer here....) Barcode refers to a particular method we use when we try to accurately predict the protein's structure, as you guessed above. Basically, we use this as a way to make sure that we are not missing some particular "features" when we are searching for the correct structure. A "barcode" might be for some particular feature (let's say, a kink in the chain), and has different "flavors" (kink at the beginning, kink in the middle, kink at the end, all 3, etc.). In the runs that say "barcode", we spread our search so that all the different flavors of certain features are evaluated before making our predictions. |
AKH54 Send message Joined: 8 Dec 05 Posts: 4 Credit: 1,812,208 RAC: 0 |
Does this mean you have to abort all WUs starting with DEFAULT, or just the ones with 205? I have been crunching for a few days now, and I have noticed I have 12 client errors. What does this mean, and is it normal to get so many errors? It might explain why my graph in the statistics tab is flatlined. Also, some people have a little database of all their WUs for different BOINC projects in their replies. How can I get the same for my WUs? Many thanks, Alan |
Charles Dennett Send message Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 513 |
Does this mean you have to abort all WUs starting with DEFAULT, or just the ones with 205?
Just abort the DEFAULT 205 workunits. I thought all those were purged from the system by now so you should not see any of those. The problem with those, if I recall, was that they would keep on running and running (for 100 times longer than normal I think) and would hit a cpu time limit that each workunit has built in and then abort themselves. Even if you let it run you'd get no credit.
As for others, yes, some of those will abort themselves after running for only a short while. Do not abort those. It will help clean them out. When a WU is reported as a failure, as these are, the BOINC servers send it back out up to 10 times before finally giving up. Letting them run and abort will help clear them out of the system. I've got a bunch in my queue that will abort when they reach the top of the queue later today.
(How do I know that, you ask? A workunit name ends in _X where X is some single digit. The first time a WU is sent out, X=0. If it fails for whatever reason and needs to be sent out again, then the X is changed to a 1. If it fails again and is sent out a third time, X is set to 2. See the pattern? I've got several where X is 5 or 6 or 7 or 8. I know those will abort after running for only a few seconds. But, and this is important, just because X is greater than 0 does not necessarily mean a WU will abort. For example, if a WU is sent out and is not returned by the deadline, it is sent out again with X=1. It could be a perfectly good WU. Also, if someone aborts a WU manually or resets a project, the WUs in question will need to be resent and therefore the value of X for these will be greater than 0. So, just let all these go and the system will do the right thing for you.)
Oh, I should say, if you are on a broadband connection you should have no problem. If, on the other hand, you are on dial-up, it may be best to simply suspend RAH and let BOINC process some other project(s). With all the traffic involved with uploading and downloading files for WUs that only run a short time, dial-up would be very inefficient (and expensive if you pay for connection time and/or number of bytes transferred). The admins for the project are off for the holidays but will work on these problems when they get back. They've been very responsive so far and I have no reason to doubt them.
As for the database of WUs some people have in their replies, there are several sites that collect the stats file from the various projects and make nice tables and graphs out of them. They also supply graphics for signatures. You end up adding a URL that points to one of these sites and specifies your particular user id. I get mine from boincstats.com. They tell you how to set it up for their site in their FAQ at http://www.boincstats.com/page/faq.php#3. I'm sure other sites have similar instructions. The URL for your signature is added to your forum preferences. Click on "your account" on the main page and then on "view or edit forum preferences". Hope all this helps. Charlie -Charlie |
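The trailing _X described above can be read off a result name with a one-line split. This is only a sketch of that naming convention as Charlie describes it; `issue_number` is a hypothetical helper, and a high value is a hint, not proof, that a WU is bad.

```python
def issue_number(result_name):
    """Return the trailing _X of a result name (0 means first time sent out)."""
    # rpartition splits on the LAST underscore, so names that contain
    # double underscores (e.g. 1ogw__topology_...) are handled as well.
    _, _, suffix = result_name.rpartition("_")
    return int(suffix)


# Names taken from earlier posts in this thread.
for name in (
    "BARCODE_FRAG_30_1n0u_221_42_0",
    "DEFAULT_2reb_205_39_3",
    "1ogw__topology_sample_207_9040_6",
    "1hz6A_topology_sample_207_15831_9",
):
    print(name, "-> sent out", issue_number(name) + 1, "time(s)")
```

A name without a numeric suffix raises `ValueError`, which seems preferable to silently guessing.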
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
Just abort the DEFAULT 205 workunits. I thought all those were purged from the system by now so you should not see any of those. The problem with those, if I recall, was that they would keep on running and running (for 100 times longer than normal I think) and would hit a cpu time limit that each workunit has built in and then abort themselves. Even if you let it run you'd get no credit. One very minor addition to the excellent information Charlie has provided: While the DEFAULT_xxxx_205 WUs will report "error" and "0 credit", whether aborted or allowed to run, the project staff has said that when they return from the holidays and all of these have been 'flushed through' the system, they will go back and AWARD credit for any time you have spent on these before aborting or failing. |
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
I just aborted one:
12/25/2005 9:14:31 PM|rosetta@home|Unrecoverable error for result DEFAULT_2reb_205_39_3 (aborted by user)
12/25/2005 9:14:31 PM||Rescheduling CPU: result op
12/25/2005 9:14:32 PM||Rescheduling CPU: process exited
12/25/2005 9:14:32 PM|rosetta@home|Computation for result DEFAULT_2reb_205_39_3 finished
12/25/2005 9:14:36 PM||Rescheduling CPU: result op
after 2 hours and 0.7% finished (Was watching Monsters Inc. on tv! :-D )
"I'm trying to maintain a shred of dignity in this world." - Me |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
I just aborted one Fuzzy, can you give me a link to that one? I'm writing up a bunch of stuff for David when he returns, and I had thought all the 205's had flushed out by now... I know there are still _some_ of the "short WUs" around, because I just had two of them today. (Much lower percentage than a couple of days ago, of course.) |
AMDave Send message Joined: 16 Dec 05 Posts: 35 Credit: 12,576,896 RAC: 0 |
I just aborted one Bill: I just aborted one as well. It hasn't begun processing and remains in my queue. Is this a link you can use? https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3760755 |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
I just aborted one as well. It hasn't begun processing and remains in my queue. Is this a link you can use? https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3760755 Perfect. Bad news though. The first two people both let that one run completely, taking four days to get to you, you aborted it, that still leaves 8 more people to do it before it's "flushed". Large caches kill us on things like this. The guy who got it first has 48 results on his system. Nothing _wrong_ with that, it just sure slows down getting bad WUs flushed through quick. If everybody else takes 4 days to get to this one, we'll be into February. So yes, the staff needs to "kill" these, can't rely on them being gone by the time they get back. That's the question I was trying to answer, it's just not the answer I hoped for. :-( |
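The estimate above is simple arithmetic: 8 more hosts at about 4 days each is roughly 32 days. A small sketch of that calculation follows. The start date of 26 December 2005 is an assumption (the day after the aborts logged earlier in the thread), and `estimated_flush_date` is an illustrative name.

```python
from datetime import date, timedelta


def estimated_flush_date(start, remaining_hosts, days_per_host):
    """Date by which a WU has passed through all the remaining hosts in turn."""
    return start + timedelta(days=remaining_hosts * days_per_host)


# Figures from the post above: 8 more people, about 4 days each.
# The start date is assumed, not stated in the thread.
print(estimated_flush_date(date(2005, 12, 26), 8, 4))  # 2006-01-27
```

That lands in late January, so any extra delay on even a couple of hosts pushes it into February, in line with the estimate in the post.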
rbpeake Send message Joined: 25 Sep 05 Posts: 168 Credit: 247,828 RAC: 0 |
If everybody else takes 4 days to get to this one, we'll be into February. So yes, the staff needs to "kill" these, can't rely on them being gone by the time they get back. That's the question I was trying to answer, it's just not the answer I hoped for. :-( They are like zombies, they can't seem to be killed and they keep rising up..... (Sorry, my daughter gave me "The Zombie Survival Guide" for Christmas, so I have a bad case of zombies on the mind right now.... ;) Regards, Bob P. |
Webmaster Yoda Send message Joined: 17 Sep 05 Posts: 161 Credit: 162,253 RAC: 0 |
If everybody else takes 4 days to get to this one, we'll be into February. So yes, the staff needs to "kill" these, can't rely on them being gone by the time they get back. That's the question I was trying to answer, it's just not the answer I hoped for. :-( Yeah, the 10 errors allowed per WU is going to keep these (and the other bad WUs) circulating for some time. If this can't be fixed in the scheduler, perhaps it can be fixed by deleting (or renaming) the directories on the project's server that these bad WUs are stored in. It would result in download errors, but it would reduce bandwidth usage, stop anyone crunching them and get them flushed out of the system quickly (assuming download errors add to the error count on the WU). *** Join BOINC@Australia today *** |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Just an aside, I am, and have been doing a number of work units with the graphics beta application ... :) |
River~~ Send message Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
... The first two people both let that one run completely, taking four days to get to you, you aborted it, that still leaves 8 more people to do it before it's "flushed". In hindsight, with the project staff being away and with a high replication factor, better advice would be to suspend the WU for now (not the project, the individual result), and only abort it after the project tell us they have deleted the files from the server. Suspending would delay reissue; deleting the files would then prevent it. In the event that there are still some out there, can I ask people to suspend for now, until the project people get back? People will have to make their own mind up whether to follow this or to go with Jack's request - after all, as a project scientist he does rank me! Also in hindsight, the fact that people were aborting these wholesale for a few hours explains why there were so many of the things around for a few hours. That's the question I was trying to answer, it's just not the answer I hoped for. :-( If you have your answer, do you still need reports of aborts, Bill? I aborted a couple on Christmas day - one that hadn't run and one that had clocked over 24hr before I noticed it! - I also clicked the wrong option on BOINCview and aborted several good WUs from the cache on the same box :-( >doh< |
Fuzzy Hollynoodles Send message Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
I just aborted one Yeah, sure! :-) https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3760991 With the result: https://boinc.bakerlab.org/rosetta/result.php?resultid=5090070
Very good idea! :-) "I'm trying to maintain a shred of dignity in this world." - Me |
Tern Send message Joined: 25 Oct 05 Posts: 576 Credit: 4,695,362 RAC: 9 |
I don't think any information is left to be gained on these... so no, _I_ certainly don't need to see anything. I don't see any reason the project would either, but I can't be 100% sure of that. I have a couple of examples now (thanks Fuzzy!) for my email to DK. River, your suggestion is great; suspending the WU until the staff returns, rather than aborting it, would at least keep it from going to someone who hasn't read the boards, etc... I think Jack's "just abort them" was to prevent us from wasting our time crunching them. Suspending it accomplishes that, AND keeps someone else from wasting the time. Now if I could only get one... :-( |
©2024 University of Washington
https://www.bakerlab.org