Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
So what is being done to "future" proof this project and prevent such overloads from happening? Is there a way to allow new accounts to be added, but not crash the system like this? What if anything can you guys do differently that will prevent this from happening again? Is there some sort of automation sequence you can write or more hardware or modern hardware to install that could allow for such a big wave of new users to enter the system without crashing it? Credits to me don't matter, just that my BOINC manager was getting clogged up with Rosie's problems and I have other projects that use my system that also fill the screen, so it was turning into a mess. Plus, since the tasks could not be reported or uploaded, BOINC manager did not seem to know how to allocate resources. Also, we have been asking for years now for someone to keep the main page up to date with info about problems or other news. Like now, the only main page post about this problem is some technical stuff by KEL. Nothing saying it has been solved. Though I guess you can get all that from here. Communication to the outside non-scientific world has always been a challenge for this project and I have suggested before that hiring or getting a volunteer from the communications department student pool would be a plus. They could be the project's PR/spokesperson. Also it seems internal communication is a problem. Yes one has a right to dump their phone and computer and return to the basics, but it would be nice if there was a backup person that knew about things like this charity organization and then could tell the others that a big clump of new users could be coming online. Anyway, I hope you guys learned something from this and will improve things for the future. Happy Crunching... |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Greg_BE wrote: but it would be nice if there was a backup person that knew about things like this charity organization and then could tell the others that a big clump of new users could be coming online You can only have a backup person who knows about a new surge of users if the primary contact knows themselves. From David's comments I am surmising that this surge came entirely without warning. David E K wrote: We were not warned of the spike, do not know the cause yet, and are not prepared to serve the large executable and database files currently. David E K wrote: I was told by Matthew Blumberg at Gridrepublic that the new users are real crunchers and that they "started a new marketing campaign via charityengine.com." So I re-enabled the account creation for these users. Our servers may get sluggish again but hopefully things will settle down as the new user rates decrease. And hopefully optimizing the connections on our servers will help. In the future, we hope to get more servers. Greg_BE wrote: What if anything can you guys do differently that will prevent this from happening again? Is there some sort of automation sequence you can write or more hardware or modern hardware to install that could allow for such a big wave of new users to enter the system without crashing it? Partially answered by David E K and krypton above... krypton wrote: We will be getting more servers, to prevent this from happening in the future. In terms of the main page update, the source of the problem was identified late Saturday/early Sunday depending on the timezone. Hopefully the promised main page update will occur during normal working hours on Monday. |
Gallstone Send message Joined: 31 May 12 Posts: 3 Credit: 443,740 RAC: 0 |
Puuuuh, my four overdue tasks have uploaded now. It dragged, it really dragged, but finally it worked. In three of the four cases I still got points because I uploaded the task before my successor. Only in one case did my successor get in ahead of me, and I got no points. Interesting. OK, everybody learned something out of it, hopefully. One piece of advice for the technical staff: if possible, please treat incoming data with higher priority than outgoing data. Just like in normal life, give higher priority to older unfinished processes/tasks/jobs/duties before caring about newer ones. Alternatively, leave a certain number of network connections reserved for incoming (result) data. Now, I want to apologize for my hefty statement lately, I just was a little pissed yesterday. |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
Homepage puts R@H at 291 TFLOPS, considering this is all x86 horsepower (no GPU clients) that is incredibly impressive! It would be nice if this becomes the new norm, as I'm sure it would have a tangible impact on experiment Turn-Around-Time and scientific progress. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Unfortunately, time will reveal that the 200+ TFLOPS was a blip due to so many people being unable to send back their results for several days. So several days of results have uploaded and had credit issued here in a single day. But there were about 65,000 active hosts returning work last week, and now there are close to 80,000. If they all keep crunching, it would be reasonable to hope to see the project TFLOPS increase over 20%! If you factor in that many of those new 15,000 have a better than average chance of being newer machines, perhaps their average capacity to do work is a bit ahead of the previous average as well. Hopefully this temporary logjam was an investment to bring on a sustainable, larger base of crunching machines. Rosetta Moderator: Mod.Sense |
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Timo wrote: Homepage puts R@H at 291 TFLOPS, considering this is all x86 horsepower (no GPU clients) that is incredibly impressive! It would be nice if this becomes the new norm, as I'm sure it would have a tangible impact on experiment Turn-Around-Time and scientific progress. It is up to 330 TFLOPS now. Of course that will also include clearing the backlog of uploads, so the normal figure will drop to somewhere lower. Based on the graphs at BOINC stats there are almost twice as many active users at the moment as in June, when Rosetta was running at 130 TFLOPS, so 260 TFLOPS would be a reasonable estimate. There are a couple of other factors that will affect things - how many charity engine clients still need to connect to Rosetta? And how often will charity engine clients be connected? The CE site says the clients only run BOINC projects when there is no paid computation work available. I expect the next hurdle for the scientists will be having enough work units ready to issue. Edit: Looking at the Host graphs, the proportionate increase in active hosts is lower than the proportionate increase in active users. That would suggest that Charity Engine has close to a 1:1 ratio for users and hosts while many native BOINC users have multiple clients. Based on the current data on new hosts, perhaps an increase of 25%? That would put the new speed at around 162 TFLOPS. It will be interesting to see how it plays out in reality. |
shanen Send message Joined: 16 Apr 14 Posts: 195 Credit: 12,662,308 RAC: 0 |
<snip> Well, I stopped by to learn something about what went wrong and how it was fixed, but certainly didn't. Maybe I just failed to find the right comment in this thread, but we already knew there was a problem. We already knew that it was mostly fixed over the weekend, though I didn't see anything about the ongoing problem, which fortunately seems to be relatively minor. I'd describe it in detail, but it's really hard for me to believe they {the project managers} will suddenly develop improved communication skills to explain what is still wrong. I'm basically willing to assume they will eventually get the last wrinkles worked out. I'm even willing to believe that the new server behaviors might be closer to the proper ones than before. There is one apparent change I've noticed that might represent a reasonable optimization... However, I'm not sure where I stand on my rumor of the cause being an NSA (or CIA or Mossad) intrusion, except that the random incompetence theory militates against it. Too much risk of a clumsy monkey stumbling over something. I'm replying to this particular comment because I strongly disagree with the second paragraph quoted above. From the perspective of a donor to the project, I strongly prefer to continue donating, and therefore I think the downloads of tasks should have priority. I don't see any problem with storing the pending work on my machine, subject to the condition that the upload delays don't kill donations for the sake of still meaningless-to-the-donors deadlines. (If it isn't an absolute deadline but just a discount time, then that's another communication failure on the part of the project managers...) Back to the communications topic: The "News" entry on the main project webpage has already been mentioned--but negatively as in not being used effectively. There are also two other existing communication channels that should be considered. One is the "Notices" tab of the BOINC Manager. If anyone attempted to use it, they certainly didn't get any message out. The second poorly used communication channel is the "Server Status Page", which was not helpful. Specifically, I think that the Server Status Page needs to be moved to an external server so that it can also report on the status of communications to the servers from an outside perspective. The obvious solution is as a reciprocal arrangement with other BOINC projects. This would only be a minor back-scratching mechanism, but hopefully it would lead to stronger back scratching. Seems pretty unlikely that Rosetta is the first BOINC project to encounter and fix this particular problem, whatever it was. By the way, an earlier reply "jumped on me" about the exact timing of the bandwidth-wasting 80-meg "Computation Error" tasks. That criticism was apparently based on my comment posted here several months after I noticed and started investigating those tasks. Also, the critic was confused about when they stopped, but mostly I just consider it as more evidence of highly amateurish project management in that the bandwidth was wasted for so many months. Not exactly a defense, but I'm not proud of the quality of all of my own work when I was in graduate school, so I feel like cutting them some slack on that point. I'm rather more concerned as to whether the sloppiness extends to the research results derived from the Rosetta calculations... Y'all have a rather large supercomputer here, and you seem to be taking it for granted, so to speak. (I might start looking for another project that appreciates my donations more, except that I've already participated in a couple of projects and discovered that none of them were perfect... Also, some of the researchers I support are doing some collaboration with another department of your university.) |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
A quick recap of what happened, for anyone like Shanen who missed it since it was buried in this thread: - It appears that it was not a network / switch setting, nor was it any kind of hack/NSA intrusion (that rumor was started by someone as a joke) - The cause appears to be a very large spike of new users (20k+) joining the project all at once and, as is necessary when first attaching to Rosetta@home, all requesting to download the main Rosetta database file (~250MB if memory serves, which would mean transferring ~4,880GB of data all at once from the R@H servers). - This large swath of users looks to be attributed to the CharityEngine project, which is built on the BOINC platform and attaches to a couple of BOINC projects (Rosetta being one of them) to keep their workers busy when there is no CE work to do. The joining of this large pool of users was not communicated to Rosetta staff/management and they had no forewarning to take any measures to prepare for it. - Incredibly bad timing of this entire incident compounded the issue, as most of the Rosetta team was out of town to attend a conference, while another key person was on a camping trip without any phone reception / internet access. - Most of this logjam is now cleared and work has resumed as normal. |
JimWOC Send message Joined: 27 Dec 05 Posts: 2 Credit: 6,179,797 RAC: 0 |
My backlog of uploads has cleared, but I am still getting a lot of Computation Errors. I have 32 shown in just a few minutes and the list is growing. |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
Can you post a log? My backlog of uploads has cleared, but I am still getting a lot of Computation Errors. I have 32 shown in just a few minutes and the list is growing. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Jim apparently has a host that is throwing everything back https://boinc.bakerlab.org/rosetta/results.php?hostid=1801946 https://boinc.bakerlab.org/rosetta/result.php?resultid=678868361 <core_client_version>7.2.42</core_client_version> <![CDATA[ <message> finish file present too long </message> ...the rest of the log is rather extensive, but includes Unhandled Exception Detected... - Unhandled Exception Record - Reason: Breakpoint Encountered (0x80000003) at address 0x760D3226 Engaging BOINC Windows Runtime Debugger... And it has been doing this for days, while the reassigned task gets completed OK https://boinc.bakerlab.org/rosetta/workunit.php?wuid=614155164 Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Jim, it seems as though one of the core files used for crunching may have been corrupted on that machine. Simple way to reset things (especially when it does not appear you have any work in progress) is to go to the projects tab, select R@h and click the button to reset the project. This will abort all work in progress, remove all of the project programs and files, and start from scratch with downloading new copies of everything. One way files may get corrupted is by anti-virus software. So if the problem persists after a project reset, that would be another thing to check. Rosetta Moderator: Mod.Sense |
TJ Send message Joined: 29 Mar 09 Posts: 127 Credit: 4,799,890 RAC: 0 |
Yep, I'm currently optimizing the number of connections on all our servers. Looks like they can keep up without too much load/memory usage so far. These servers are pretty old and I'm sure we'll upgrade soon hopefully. Not only new server but also new server code. The code running at the moment is very outdated. Greetings, TJ. |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
Just in case no one else mentions this, although work cleared and new work came down last night, I'm now getting this when requesting new work:
|
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
Can you be more specific about which code you are referring to? Yep, I'm currently optimizing the number of connections on all our servers. Looks like they can keep up without too much load/memory usage so far. These servers are pretty old and I'm sure we'll upgrade soon hopefully. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
They are referring to the actual BOINC Server code. R@h has not done a refresh for many years. Newer versions have reformatted the webpages for hosts and tasks and other feature additions that people grow to expect, but they are not available on R@h. Rosetta Moderator: Mod.Sense |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1995 Credit: 9,636,221 RAC: 6,537 |
Can you be more specific about which code you are referring to? On other BOINC projects, the "server status page" shows the server software version (for example, 24848 at Poem@home), but Rosetta's does not. So we don't know how old the server code is. Some volunteers speculate that Rosetta's admins don't update the server because of the deep customization of the code. But no admin has confirmed it.... |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,722,686 RAC: 3,377 |
8/4/2014 4:50:35 PM | rosetta@home | Reporting 9 completed tasks 8/4/2014 4:50:35 PM | rosetta@home | Requesting new tasks for CPU and NVIDIA 8/4/2014 4:50:57 PM | rosetta@home | Scheduler request failed: Couldn't connect to server 8/4/2014 4:51:01 PM | | Project communication failed: attempting access to reference site 8/4/2014 4:51:03 PM | | Internet access OK - project servers may be temporarily down. |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,722,686 RAC: 3,377 |
Just in case no one else mentions this, although work cleared and new work came down last night, I'm now getting this when requesting new work: Just got the same message. |
krypton Volunteer moderator Project developer Project scientist Send message Joined: 16 Nov 11 Posts: 108 Credit: 2,164,309 RAC: 0 |
I just disabled new users from charityengine until our servers can catch up with download demand. The number of downloads that happened last week nearly doubled, today alone. |
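
For anyone who wants to check the rough numbers quoted in this thread, the sketch below redoes the arithmetic from Timo's recap (about 20,000 new hosts each pulling a ~250MB database), Mod.Sense's host counts (65,000 to 80,000), and Murasaki's TFLOPS estimates (130 TFLOPS baseline). All inputs are the posters' own estimates, not measured server data, and the function names are made up for illustration.

```python
# Back-of-envelope checks for figures quoted in the thread.
# Every input is a poster's estimate; the function names are hypothetical.

def initial_download_gb(new_hosts: int, database_mb: float) -> float:
    """Total first-attach download volume in GB, using 1 GB = 1024 MB."""
    return new_hosts * database_mb / 1024


def percent_increase(before: float, after: float) -> float:
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100


def scaled_tflops(baseline_tflops: float, growth_percent: float) -> float:
    """Throughput if capacity grows in proportion to the host or user count."""
    return baseline_tflops * (1 + growth_percent / 100)


if __name__ == "__main__":
    # Timo: ~20k new hosts, ~250MB database each.
    print(f"database transfer: {initial_download_gb(20_000, 250):,.0f} GB")
    # Mod.Sense: ~65,000 active hosts last week, close to 80,000 now.
    print(f"host growth: {percent_increase(65_000, 80_000):.1f}%")
    # Murasaki: 130 TFLOPS in June, with users doubled or hosts up 25%.
    print(f"users doubled: {scaled_tflops(130, 100):.1f} TFLOPS")
    print(f"hosts up 25%:  {scaled_tflops(130, 25):.1f} TFLOPS")
```

The two TFLOPS figures differ so much because one scales with active users and the other with active hosts, which is the distinction Murasaki draws in the edit to that post.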
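
Gallstone's suggestion of leaving a certain number of network connections reserved for incoming result data can be illustrated with a small admission-control sketch. This is not how the BOINC server or the R@h servers handle connections; the class name, the slot counts, and the "upload"/"download" labels are assumptions chosen only to show the idea.

```python
import threading


class ConnectionBudget:
    """Hypothetical slot counter: uploads may use any free slot, while
    downloads are capped so that some slots always stay free for uploads."""

    def __init__(self, total_slots: int, reserved_for_uploads: int) -> None:
        if not 0 <= reserved_for_uploads <= total_slots:
            raise ValueError("reserved slots must be between 0 and total slots")
        self._lock = threading.Lock()
        self._total = total_slots
        self._download_cap = total_slots - reserved_for_uploads
        self._uploads = 0
        self._downloads = 0

    def try_acquire(self, kind: str) -> bool:
        """Return True and take a slot if a connection of this kind is allowed."""
        if kind not in ("upload", "download"):
            raise ValueError(f"unknown connection kind: {kind}")
        with self._lock:
            if self._uploads + self._downloads >= self._total:
                return False  # every slot is busy
            if kind == "upload":
                self._uploads += 1
                return True
            if self._downloads >= self._download_cap:
                return False  # remaining slots are held back for uploads
            self._downloads += 1
            return True

    def release(self, kind: str) -> None:
        """Give back a slot previously taken with try_acquire."""
        with self._lock:
            if kind == "upload" and self._uploads > 0:
                self._uploads -= 1
            elif kind == "download" and self._downloads > 0:
                self._downloads -= 1
            else:
                raise ValueError(f"no active {kind} connection to release")


if __name__ == "__main__":
    budget = ConnectionBudget(total_slots=4, reserved_for_uploads=2)
    print([budget.try_acquire("download") for _ in range(3)])
    print([budget.try_acquire("upload") for _ in range(2)])
```

With four slots and two reserved, the third download request is refused even though slots are free, so returning results can still get through during a download surge. shanen argues the opposite priority in this thread, and the same counter would cover that case by capping uploads instead.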
©2024 University of Washington
https://www.bakerlab.org