Problems and Technical Issues with Rosetta@home

loftwyr

Joined: 23 Dec 07
Posts: 1
Credit: 775,197
RAC: 0
Message 77163 - Posted: 31 Jul 2014, 0:59:22 UTC - in response to Message 77162.  

So it turns out the servers are working... but extremely slow to respond (we are still trying to figure out why).

<snip>

It seems to have fixed it... Some of my jobs that were on hold..have uploaded!
Can someone else try this to confirm that it works (maybe I was just lucky).


Didn't work for me. I'm still failing.
ID: 77163 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 77164 - Posted: 31 Jul 2014, 1:21:38 UTC
Last modified: 31 Jul 2014, 1:28:15 UTC

The suggested cc_config didn't work for me either. I forced BOINC Manager to reread the config files and retried the upload. Interestingly, it seems to fail in 8 seconds, which is less than the timeout specified.

I'm on BOINC 7.2.42 on Windows.
Rosetta Moderator: Mod.Sense
ID: 77164 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 77165 - Posted: 31 Jul 2014, 1:22:20 UTC - in response to Message 77162.  
Last modified: 31 Jul 2014, 1:23:24 UTC

Added the config file. No apparent effect here.
ID: 77165 · Rating: 0 · rate: Rate + / Rate - Report as offensive
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77166 - Posted: 31 Jul 2014, 1:24:57 UTC
Last modified: 31 Jul 2014, 1:25:17 UTC

=[
Thanks for checking!
ID: 77166 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 77167 - Posted: 31 Jul 2014, 5:48:36 UTC - in response to Message 77165.  
Last modified: 31 Jul 2014, 5:51:08 UTC

Finally started downloading some fresh units to work on, but the servers definitely seem to be thrashing most horribly. LOTS of retries and no pattern, except for the obvious feature that smaller files seem to succeed more often.

Uploads still seem to be blocked almost completely. It seems possible that one of my completed tasks may have been reported, but most of them are not even trying anymore.

How about a DoS attack? The quasi-pattern reminds me of socket failures, but they can be induced.
ID: 77167 · Rating: 0 · rate: Rate + / Rate - Report as offensive
TJ

Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 77168 - Posted: 31 Jul 2014, 6:53:57 UTC

Uploads have still not been working for me for about 70 hours now.
This may be a problem at the university where the servers are maintained, but the server code is still from the 90's. It has been outdated for years already; that is also due to the low credits.
Greetings,
TJ.
ID: 77168 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 77169 - Posted: 31 Jul 2014, 6:56:01 UTC - in response to Message 77161.  
Last modified: 31 Jul 2014, 6:57:23 UTC

I don't believe I am the only person who noticed all those 80-Meg "Computation Error" tasks that the system was broadcasting for many months before (and at least a couple of months after) I commented on them in these discussions.


I think you are exaggerating there just a little. Your post about that issue was 30 days ago so a "couple of months after" is a little bit wide of the mark.

If I recall correctly the problem started a couple of weeks before your post and finished a week or two afterwards. It would have been best if the scientist in charge of that experiment had responded earlier but it was definitely not as long lasting as you suggest.

How about a DoS attack? The quasi-pattern reminds me of socket failures, but they can be induced.


As there are 501k tasks currently in progress (and most probably finished by now) the recovery phase after this outage will probably share some of the characteristics of a DoS attack as every system tries to upload at once.
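
The usual way to soften this kind of "everyone retries at once" surge is exponential backoff with random jitter, so that clients spread their retries out instead of hitting the server in lockstep. The sketch below only illustrates that idea; the function name, the 60-second base and the one-hour cap are assumptions made for the example and are not BOINC's actual retry policy.

```python
import random


def backoff_delay(attempt, base=60.0, cap=3600.0, rng=random):
    """Full-jitter exponential backoff (illustrative values, not BOINC's).

    Returns a delay in seconds drawn uniformly from
    [0, min(cap, base * 2**attempt)].
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    # Show how the upper bound on the wait grows with each failed attempt.
    for attempt in range(8):
        ceiling = min(3600.0, 60.0 * 2 ** attempt)
        delay = backoff_delay(attempt)
        print(f"attempt {attempt}: wait {delay:.0f}s (at most {ceiling:.0f}s)")
```

Because each client picks its own random delay, half a million hosts retrying after the same outage would arrive spread over the window rather than all at the same instant.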
ID: 77169 · Rating: 0 · rate: Rate + / Rate - Report as offensive
jareeq

Joined: 28 Apr 12
Posts: 2
Credit: 4,149,828
RAC: 0
Message 77170 - Posted: 31 Jul 2014, 8:27:27 UTC - in response to Message 77168.  

Uploads have still not been working for me for about 70 hours now.
This may be a problem at the university where the servers are maintained, but the server code is still from the 90's. It has been outdated for years already; that is also due to the low credits.


OK, now I see why all of my results are frozen in upload. Do you have any information about when it will be fixed?
ID: 77170 · Rating: 0 · rate: Rate + / Rate - Report as offensive
JohnH

Joined: 25 Mar 13
Posts: 43
Credit: 2,319,355
RAC: 0
Message 77171 - Posted: 31 Jul 2014, 10:40:04 UTC

I just finished one of my 9 queued uploads. At about the same time a bunch of downloads started. They're all stuck at 0% complete.
I ran Wireshark on my Win 8.1 machine on net 128.95.160.0/24, which I think includes all the Rosetta servers. It shows lots of TCP anomalies such as three-way handshake timeouts, duplicate ACKs, TCP retransmissions, etc. Now, I'm a long way from UW both geographically and in network terms, but it smells to me like some layer 3 switch or router specific to that subnet is in trouble. Dunno if that helps or not, but hey...
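
For anyone who wants to check whether an address seen in a capture falls inside the subnet mentioned above, here is a minimal sketch using Python's standard `ipaddress` module. Only the 128.95.160.0/24 network comes from the post; the helper name and the two sample addresses are made up for illustration and are not known Rosetta server addresses.

```python
import ipaddress

# The /24 that the post reports capturing traffic on.
ROSETTA_SUBNET = ipaddress.ip_network("128.95.160.0/24")


def in_rosetta_subnet(address):
    """Return True if the given IPv4 address lies inside 128.95.160.0/24."""
    return ipaddress.ip_address(address) in ROSETTA_SUBNET


if __name__ == "__main__":
    # Sample addresses are hypothetical, chosen only to show both outcomes.
    for sample in ("128.95.160.10", "128.95.161.10"):
        print(sample, in_rosetta_subnet(sample))
```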
ID: 77171 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 77172 - Posted: 31 Jul 2014, 11:08:36 UTC - in response to Message 77170.  

Uploads have still not been working for me for about 70 hours now.
This may be a problem at the university where the servers are maintained, but the server code is still from the 90's. It has been outdated for years already; that is also due to the low credits.


OK, now I see why all of my results are frozen in upload. Do you have any information about when it will be fixed?


Whenever KEL (IT for Rosetta) can get the UW computing guys to help him figure out what went wrong. It is typical in these situations for the project to be out of service anywhere from a few days to a week. Just depends. Find another project to keep your system busy in the meantime.
ID: 77172 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Daedalus

Joined: 1 Aug 08
Posts: 39
Credit: 10,107,163
RAC: 301
Message 77177 - Posted: 31 Jul 2014, 16:33:49 UTC

One of my results partially uploaded before getting stuck. So -as mentioned- the server is probably throttled but not dead.

I have 54 entries waiting to be uploaded...
ID: 77177 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile The_Saint_(LDS)

Joined: 12 Aug 10
Posts: 6
Credit: 10,076,132
RAC: 0
Message 77178 - Posted: 31 Jul 2014, 17:37:35 UTC - in response to Message 77177.  

One of my results partially uploaded before getting stuck. So -as mentioned- the server is probably throttled but not dead.

I have 54 entries waiting to be uploaded...


I'm glad I had a queue on most of my little farm...only one is out of work right now but all of them have almost 3 days worth to upload
(some with partial uploads)...somewhere over 300 results to upload now.

Oh well...here's hoping they get to the bottom of this soon.
ID: 77178 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile robertmiles

Joined: 16 Jun 08
Posts: 1232
Credit: 14,281,662
RAC: 943
Message 77182 - Posted: 31 Jul 2014, 21:45:00 UTC
Last modified: 31 Jul 2014, 21:54:57 UTC

Did you read the news message on the home page, which says that the problem is now known to be that the connection between the university's network and the rest of the internet is running much slower than usual? There is nothing yet on why, or when it will be fixed, though.

That should mean that anyone on their campus should get faster than usual response from their server, due to less competition from the rest of the world.

Many of you currently participating only in Rosetta@Home might want to add World Community Grid to your list of BOINC projects, but with a 0% resource share so that workunits will only be downloaded from WCG when none are available from Rosetta@Home.

http://www.worldcommunitygrid.org/
ID: 77182 · Rating: 0 · rate: Rate + / Rate - Report as offensive
premier

Joined: 30 Dec 05
Posts: 14
Credit: 23,872,868
RAC: 0
Message 77188 - Posted: 1 Aug 2014, 4:58:05 UTC - in response to Message 77182.  

Guys, you suck. I have never seen a network failure that couldn't be repaired within 12 hours or less (I manage large networks). It's the 4th day without the ability to upload/download anything. Come on guys, I have been supporting you since 2005, and I always thought of R@H as the best of the best projects. But for some time I have been considering leaving it because:

1. You do not want to share the source code - how the hell can I be sure I am not part of a Bitcoin botnet or some other strange project?
2. There is a large number of errors in WUs
3. Current project status for me is DOWN.

Guys, do something or you will lose a lot of compute power.
ID: 77188 · Rating: 0 · rate: Rate + / Rate - Report as offensive
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77189 - Posted: 1 Aug 2014, 6:00:20 UTC - in response to Message 77188.  
Last modified: 1 Aug 2014, 6:10:56 UTC

Hi premier,

Thank you for your support these last 5 years! I'm sorry about the current events...

1) The source code is free and available online for download; all you have to do is agree not to use it for profit (it's really easy to get): https://c4c.uwc4c.com/express_license_technologies/rosetta

2) Working on it...

3) Most of the Rosetta Community is out of town for a conference... Won't be back at the university till this weekend. I was not able to repair it myself, and have to wait till the experts are back (or at least till they have access to the internet). =[

Guys, you suck. I have never seen a network failure that couldn't be repaired within 12 hours or less (I manage large networks). It's the 4th day without the ability to upload/download anything. Come on guys, I have been supporting you since 2005, and I always thought of R@H as the best of the best projects. But for some time I have been considering leaving it because:

1. You do not want to share the source code - how the hell can I be sure I am not part of a Bitcoin botnet or some other strange project?
2. There is a large number of errors in WUs
3. Current project status for me is DOWN.

Guys, do something or you will lose a lot of compute power.
ID: 77189 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1995
Credit: 9,643,672
RAC: 6,759
Message 77192 - Posted: 1 Aug 2014, 8:19:17 UTC

Strange. Admins say it's a university network problem. But Ralph@home runs OK (upload/download) and, I think, it is on the same network...
ID: 77192 · Rating: 0 · rate: Rate + / Rate - Report as offensive
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77194 - Posted: 1 Aug 2014, 9:58:39 UTC - in response to Message 77192.  
Last modified: 1 Aug 2014, 10:26:14 UTC

Yeah, some servers were unaffected... for example try

curl -v --connect-timeout 0 boinc.bakerlab.org
vs
curl -v --connect-timeout 0 srv2.bakerlab.org

(both should show the same content). The latter takes 2 mins to load.
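
The same comparison can be scripted. The sketch below is a rough Python equivalent of the two curl calls: it times a full HTTP fetch of each URL and reports either the elapsed seconds or the error raised. The function names and the 300-second timeout are assumptions made for this example; only the two hostnames come from the post.

```python
import time
import urllib.request


def time_fetch(url, timeout=300.0, opener=urllib.request.urlopen):
    """Fetch url once and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    with opener(url, timeout=timeout) as response:
        response.read()  # pull the whole body so slow transfers are counted
    return time.monotonic() - start


def compare(urls, timeout=300.0, opener=urllib.request.urlopen):
    """Return {url: seconds} or {url: exception} for each URL, in order."""
    results = {}
    for url in urls:
        try:
            results[url] = time_fetch(url, timeout, opener)
        except OSError as exc:  # URLError and socket timeouts are OSErrors
            results[url] = exc
    return results
```

Calling `compare(["http://boinc.bakerlab.org/", "http://srv2.bakerlab.org/"])` should show the gap described above if the second server is still being slowed down.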

The BOINC Manager uses the curl library, which by default times out if the connection is idle for a certain number of seconds (hence the failures...)

Each job is assigned one of the 5 servers srv[1..5].bakerlab.org (each has its own IP address).

Our working hypothesis is that UW imposed some kind of bandwidth throttling on the high-traffic IP addresses... Ralph@home has its own server... which has not yet been "flagged" as high traffic.

One temporary fix I was considering would be to modify the hosts file to redirect srv[1..5] to the boinc.bakerlab.org IP address... but this could cause boinc.bakerlab.org to get flagged and killed. =[
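
To make the hosts-file idea concrete, here is a small sketch that only generates the lines such a redirect would need; it does not touch any file. The helper name is made up, and 192.0.2.10 is a reserved documentation address used as a placeholder, not the real boinc.bakerlab.org address. Given the risk described above, treat this as an illustration of the idea rather than a recommended fix.

```python
from typing import List


def hosts_lines(target_ip: str, count: int = 5,
                domain: str = "bakerlab.org") -> List[str]:
    """Build hosts-file entries mapping srv1..srv<count> to target_ip."""
    return [f"{target_ip}\tsrv{i}.{domain}" for i in range(1, count + 1)]


if __name__ == "__main__":
    # Placeholder IP from the 192.0.2.0/24 documentation range.
    for line in hosts_lines("192.0.2.10"):
        print(line)
```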

Strange. Admins say it's a university network problem. But Ralph@home runs OK (upload/download) and, I think, it is on the same network...
ID: 77194 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Gallstone

Joined: 31 May 12
Posts: 3
Credit: 443,740
RAC: 0
Message 77195 - Posted: 1 Aug 2014, 10:34:34 UTC

Uploads aren't possible for me either. Hence my question:

How about deadlines? I have 4 completed tasks ready for upload, but deadline is Aug 2, 16:09 UTC. If the problem isn't solved by then, will deadlines be extended? Or will those tasks receive scores even if uploaded beyond deadline?
ID: 77195 · Rating: 0 · rate: Rate + / Rate - Report as offensive
biodoc

Joined: 19 Feb 06
Posts: 14
Credit: 30,717,792
RAC: 0
Message 77196 - Posted: 1 Aug 2014, 10:35:58 UTC

Perhaps the Baker lab's use of university network bandwidth is affecting student access to Facebook, YouTube and Netflix?

Sad, science no longer rules these days.
ID: 77196 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Sid Celery

Joined: 11 Feb 08
Posts: 2126
Credit: 41,256,771
RAC: 8,070
Message 77197 - Posted: 1 Aug 2014, 10:44:18 UTC - in response to Message 77189.  

3) Most of the Rosetta Community is out of town for a conference... Won't be back at the university till this weekend. I was not able to repair it myself, and have to wait till the experts are back (or at least till they have access to the internet). =[

Obviously this isn't the news we wanted, but it's important you've said it because we can adjust our expectations (and processing) accordingly.

It's disappointing you've been put in this position and the IT staff haven't supported you by calling over expert help from elsewhere in the faculty. Thanks for trying.

ID: 77197 · Rating: 0 · rate: Rate + / Rate - Report as offensive



©2024 University of Washington
https://www.bakerlab.org