Message boards : Number crunching : (reached daily quota of 200 results)
Nite Owl | Joined: 2 Nov 05 | Posts: 87 | Credit: 3,019,449 | RAC: 0
I have been getting the following message for the past two days:

[quote]
12/31/2005 8:12:07 PM|rosetta@home|Requesting 518400 seconds of new work
12/31/2005 8:12:22 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
12/31/2005 8:12:22 PM|rosetta@home|Message from server: No work sent
12/31/2005 8:12:22 PM|rosetta@home|Message from server: (reached daily quota of 200 results)
12/31/2005 8:12:22 PM|rosetta@home|No work from project
[/quote]

The machine is an A64 X2 4400+ with 1 GB of memory... BTW - it's not only DEFAULT_XXXXX_205_ that's failing at a record pace, and there are more failures than successes on my 30 machines...

Join the Teddies@WCG
Tern | Joined: 25 Oct 05 | Posts: 576 | Credit: 4,695,359 | RAC: 13
Nite Owl, the 4400 looks like it has indeed downloaded 200 results on the 31st, at 9 AM UTC; what is the problem? Did you recently increase your cache size greatly? Did these results not actually arrive? And as at least 99 other threads here say, the DEFAULT_xxxx_205 problem is NOT the one causing rapid failure. That is a separate issue, one that (unless you still have some in very large caches) should be flushed out by now. I haven't had a single failure in the last two days (with a 0.25-day cache). Unless of course you have "leave in memory" set to "no", in which case you could have random failures on _any_ results.
Divide Overflow | Joined: 17 Sep 05 | Posts: 82 | Credit: 921,382 | RAC: 0
It looks like that machine has downloaded almost 200 WUs to its queue. Does BOINC Manager show that many on that system? The other WUs that fail with client errors very quickly are a known issue and should be corrected when the project staff returns from their holiday break.
Nite Owl | Joined: 2 Nov 05 | Posts: 87 | Credit: 3,019,449 | RAC: 0
[quote]Nite Owl, the 4400 looks like it has indeed downloaded 200 results on the 31st, at 9 AM UTC; what is the problem? Did you recently increase your cache size greatly? Did these results not actually arrive?[/quote]

If you'll read my post CAREFULLY you'll note I just said that "DEFAULT_xxxx_205" wasn't the only job failing! I had Rosetta set to a one-day cache so BOINC wouldn't have to download so often (same as the other 29 machines). I've had "leave in memory" set to "YES" since day one... And, I don't have ANY cache on this and several other machines, which I have since shut down due to lack of viable work.

Join the Teddies@WCG
Nite Owl | Joined: 2 Nov 05 | Posts: 87 | Credit: 3,019,449 | RAC: 0
BTW - I have not received anything from Rosetta on this machine, as you can see...

[quote]2005-12-30 17:54:22 [rosetta@home] Message from server: No work sent[/quote]
Nite Owl | Joined: 2 Nov 05 | Posts: 87 | Credit: 3,019,449 | RAC: 0
David Knittle wrote:
[quote]It looks like that machine has downloaded almost 200 WUs to its queue. Does BOINC Manager show that many on that system?[/quote]

No.... See above ^
Divide Overflow | Joined: 17 Sep 05 | Posts: 82 | Credit: 921,382 | RAC: 0
[quote]And, I don't have ANY cache on this and several other machines, which I have since shut down due to lack of viable work.[/quote]

Sounds like you're the victim of quite a lot of ghost WUs. (The project thinks it's sent you work, but it never makes it across to your host.) I noticed that at least one of your machines had a large number of download errors. Are you having some network problems on your end that could be interfering with getting new work? If you leave the machines up, they should be able to make another attempt within 24 hours. Hopefully the work will actually make it across to you this time!
Tern | Joined: 25 Oct 05 | Posts: 576 | Credit: 4,695,359 | RAC: 13
It sounds like you have "ghost" WUs - there is no reason you should ever hit the quota, or even ask for anything _near_ the quota, even on a 4400. I suspect the run of "bad" WUs you had lowered the estimated time per result to the point where you requested a ton of results, and then some communications glitch caused them to not arrive. The fact that you are getting "500" errors makes this very likely. (That has been identified, by the way; there is a misbehaving router at some west-coast US ISP - if you happen to be routed through it, the connection will fail. No luck so far on getting the ISP to fix it.)

The only solution to this quota problem, other than waiting it out, is (when you're out of work) to detach from the project and reattach. This will create a new host ID for this computer and get you a reasonable amount of work. Once your estimated times are back to something reasonable, you can merge the two hosts together, or (even better...) wait until all of the "old" host's results are past deadline and have been deleted, then just delete that host.

Edit: Looks like David and I were typing at the same time and he's faster.
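To put rough numbers on the "estimated time per result" effect, here is a sketch of the arithmetic. This is not the actual BOINC scheduler code; only the 518400-second request and the 200-result quota come from the log above, and the per-result times are made up for illustration.

```python
import math

# Rough illustration of why a collapsed time estimate can push a host
# straight into the daily quota. NOT the real BOINC scheduler logic.

def results_to_send(requested_seconds, est_seconds_per_result, daily_quota=200):
    wanted = math.ceil(requested_seconds / est_seconds_per_result)
    return min(wanted, daily_quota)

# The log above shows a request for 518400 seconds of new work.
# With a sane estimate of roughly 3 hours per result:
print(results_to_send(518400, 3 * 3600))   # 48 results

# After a run of instantly-erroring results drags the estimate down to ~15 minutes:
print(results_to_send(518400, 15 * 60))    # 200 results -- daily quota in one request
```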
Nite Owl | Joined: 2 Nov 05 | Posts: 87 | Credit: 3,019,449 | RAC: 0
Bill wrote:
[quote]a misbehaving router at some west-coast US ISP[/quote]

Bill, do you happen to know what the IP address is for that site on the west coast? I'd like to see if it's a problem for me... Thanks

David asked:
[quote]Are you having some network problems on your end that could be interfering with getting new work?[/quote]

Nope... It's the same network that worked fine at UD-Grid, Find-a-Drug, Seti, Predictor and SIMAP, so I suspect that's not the problem... Shrug
Tern | Joined: 25 Oct 05 | Posts: 576 | Credit: 4,695,359 | RAC: 13
[quote]Bill, do you happen to know what the IP address is for that site on the west coast? I'd like to see if it's a problem for me... Thanks[/quote]

It hasn't been identified for sure; the general feeling points at Cogent, just because the "500" errors hit SETI the hardest, and that's their ISP. Getting decent traceroutes was hard as the problem kept coming and going. Here is an article on ways to get around the problem. Since you showed some of these errors in your log, my guess is that you're routed through it at least some of the time. You might also look up the "113" error in the Wiki; I don't know what that points to.
Nothing But Idle Time | Joined: 28 Sep 05 | Posts: 209 | Credit: 139,545 | RAC: 0
[quote]...It hasn't been identified for sure; the general feeling points at Cogent, just because the "500" errors hit SETI the hardest, and that's their ISP.... Since you showed some of these errors in your log, my guess is that you're routed ...[/quote]

OK, then what is the explanation for the error 500s and associated ghost WUs that I get from LHC and Einstein? Do I also have a misbehaving router somewhere close to me? I get ghost WU errors with frequency in descending order: Rosetta, LHC, Einstein; I'm in northeast Ohio. The most ghost WUs I ever got at one time was 14 with Rosetta. That's one reason why I keep my queue size at 0.01.
Tern | Joined: 25 Oct 05 | Posts: 576 | Credit: 4,695,359 | RAC: 13
[quote]OK, then what is the explanation for the error 500s and associated ghost WUs that I get from LHC and Einstein? Do I also have a misbehaving router somewhere close to me? I get ghost WU errors with frequency in descending order: Rosetta, LHC, Einstein; I'm in northeast Ohio. The most ghost WUs I ever got at one time was 14 with Rosetta. That's one reason why I keep my queue size at 0.01.[/quote]

There are really two issues: "ghost WUs" and "500s".

The ghost WU problem has been around forever, and happens any time there is a communication failure between the host and the server during a transfer. V4.45 and later have a retry mechanism that reduces the number of these, but obviously doesn't eliminate it. Einstein (and possibly others) have made a change to the server code that causes a "resync" if the client's number of results doesn't match what the server says it should have, so "ghosts" are resent the next time the host connects. This can cause "too much work" to be downloaded if the host connected asking for more work... and the load on the server to run the resync is considerably more than for a standard connection. SETI has chosen _not_ to run the resync because their servers would be overloaded. Rosetta, I believe, will be turning the resync on during the next major maintenance cycle.

The "500s" appeared in any quantity a bit over a month ago. The BOINC code has error messages for just about everything, but anything _missed_ gets the "default 500" error. So a totally unexpected error falls through and gives this - it could be from the "black hole router", or from any other unhandled exception. This has made it very difficult to troubleshoot... the network gurus came up with the fact that changing the MTU would make many of these errors go away. Traceroutes and pings gave the clue that packets were being "split" unexpectedly, so smaller packets would do better. (Not a good solution, but an emergency workaround.)

At first it seemed to be a SETI-only problem, so the betting was on some hardware _at_ UCB. This was supported by the fact that the people receiving the errors could be anywhere in the world, but it was also countered by the fact that someone right "next door" to someone with a _continuous_ problem had no trouble at all. But then the problem started showing up at Einstein, and other sites, pretty much in the same ratio as the size of the project. The only common factor was that the routings went "through" the west coast - but that is true of probably 25% of all internet traffic.

Several ISPs had fights about charging for traffic passing through their routers during this time, and connections were disabled, and traffic rerouted or completely blocked. (So much for the "automatic rerouting" of the internet that wasn't supposed to allow that to happen...) DNS servers wound up with some pointing to one 'next hop' for a connection and others pointing to a different 'next hop'. (I'm no network guru, so if I've got some terminology wrong here, I'm working off what I understood from others' conversations - no guarantees.) As this has settled down, the frequency of "500s" has dropped off, even with nothing being done on the BOINC end.

At one point a month or so ago I did a traceroute from here to SETI, just out of curiosity since I wasn't having errors - and found myself going through Memphis, TN, almost exactly in the _wrong_ physical direction, definitely not between here and California. So the concept of "close to you" when it comes to a router is pretty iffy.
A couple of years ago at work I saw the network folks using some software that "mapped" a traceroute physically; it literally drew the lines on a map showing your connection path. Hopefully someone is using something like this, or just making a list of common points, to track some of the routes for the people who are still consistently getting the errors and haven't been able to apply the fixes noted in the article I linked to earlier. I really don't know the status of all this, however; it's "fallen off" the message boards. I don't know what can be done even if they _do_ identify the problem specifically - how do you call Cogent or some other huge company and say "hey guys, your router, IP address xyz, is hosed"? WHO do you even call?

Oh, also, the problem is not limited to WU transfer; at the same time all of that started, many of us (even those like me who never saw the "500s") started having web page 'locks' - where one site or another, seemingly at random, either times out without responding or just sits there for a minute or two before finally connecting. Oddly enough, it happens with some _browsers_ and not with others. I have actually had to run MSIE on the PC for a couple of days to get to THESE pages. Blegh.

I still think it's gremlins.
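For what it's worth, the "resync" idea above boils down to a set comparison. Here is a toy sketch of the shape of it - this is not Einstein's or BOINC's actual server code, and the result names are made up to look like the ones in this thread:

```python
# Toy sketch of a server-side "resync": anything the server believes the host
# holds, but the host does not report back, is a "ghost" and gets marked for resend.
server_thinks_host_has = {"DEFAULT_1234_205_0", "DEFAULT_1235_205_0", "DEFAULT_1236_205_0"}
host_reports_having = {"DEFAULT_1234_205_0"}   # the other two never arrived

ghosts = server_thinks_host_has - host_reports_having
for name in sorted(ghosts):
    print(f"marking lost result for resend: {name}")
```

And if you want to check whether the MTU workaround is even relevant on your route, a rough probe like the one below (assuming a Linux "ping"; on Windows the equivalent flags are "-f -l <size>") finds the largest packet that gets through without being split. The host name is just an example; point it at whatever server you're having trouble reaching.

```python
import subprocess

HOST = "boinc.bakerlab.org"   # example target; use the host you're having trouble with

def fits(payload):
    """True if a don't-fragment ping with this payload size gets through."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", "-M", "do", "-s", str(payload), HOST],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

payload = 1472                # 1472 bytes of data + 28 bytes of headers = the usual 1500 MTU
while payload > 500 and not fits(payload):
    payload -= 8
print(f"largest unfragmented payload: {payload} bytes (path MTU about {payload + 28})")
```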
Divide Overflow | Joined: 17 Sep 05 | Posts: 82 | Credit: 921,382 | RAC: 0
Best explanation I've heard so far. (OK, your other stuff was pretty good too...) ;)
Nothing But Idle Time | Joined: 28 Sep 05 | Posts: 209 | Credit: 139,545 | RAC: 0
Yeah, I agree that your explanation was good. As you can guess, it doesn't give me a warm-n-fuzzy feeling knowing that I'll continue to experience ghost WUs. But hey, I don't like paying taxes either, and guess what time of year it is?!?!