Problems and Technical Issues with Rosetta@home

ArcSedna

Joined: 23 Oct 11
Posts: 16
Credit: 71,462,581
RAC: 87,530
Message 110095 - Posted: 4 Dec 2024, 0:32:56 UTC

I've been getting transient HTTP errors recently.
According to my client log with the http_debug flag enabled, some of the download servers might have an SSL certificate problem.

Rosetta@home 2024/12/04 09:16 [http] HTTP_OP::init_get(): https://boinc-files.bakerlab.org/rosetta/download/294/8a_hal_x_hal_8aa_4jp9719_d196_0001_1.flags
Rosetta@home 2024/12/04 09:16 Started download of 8a_hal_x_hal_8aa_4jp9719_d196_0001_1.flags
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: Hostname in DNS cache was stale, zapped
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: Host boinc-files.bakerlab.org:443 was resolved.
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: IPv6: (none)
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: IPv4: 128.95.160.135, 128.95.160.134
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: Trying 128.95.160.135:443...
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: ALPN: curl offers h2,http/1.1
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: TLSv1.3 (OUT), TLS handshake, Client hello (1):
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: TLSv1.3 (IN), TLS handshake, Server hello (2):
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: TLSv1.2 (IN), TLS handshake, Certificate (11):
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: TLSv1.2 (OUT), TLS alert, certificate expired (557):
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: SSL certificate problem: certificate has expired
Rosetta@home 2024/12/04 09:16 [http] [ID#4306] Info: closing connection #4040
Rosetta@home 2024/12/04 09:16 [http] HTTP error: SSL peer certificate or SSH remote key was not OK
2024/12/04 09:16 Project communication failed: attempting access to reference site
2024/12/04 09:16 [http] HTTP_OP::init_get(): https://www.google.com/
Rosetta@home 2024/12/04 09:16 Temporarily failed download of 8a_hal_x_hal_8aa_4jp9719_d196_0001_1.flags: transient HTTP error
Rosetta@home 2024/12/04 09:16 Backing off 04:10:21 on download of 8a_hal_x_hal_8aa_4jp9719_d196_0001_1.flags
2024/12/04 09:16 Internet access OK - project servers may be temporarily down.
Jean-David Beyer

Joined: 2 Nov 05
Posts: 195
Credit: 6,613,600
RAC: 9,094
Message 110096 - Posted: 4 Dec 2024, 3:42:08 UTC - in response to Message 110093.  

Would you prefer that they cancel workunits after they start? They obviously want one copy to finish soon, and may not have information on whether the first one will ever finish.


I prefer they not send me the work unit at all if another instance of it is presumably running. Since they want only one result, they should wait until they decide the current unit has timed out (or failed) before sending me the new one. Then they save the network cost of sending me the work unit and, later, the cost of telling my BOINC client to cancel the one they sent me.
Sid Celery

Joined: 11 Feb 08
Posts: 2137
Credit: 41,518,559
RAC: 15,775
Message 110097 - Posted: 4 Dec 2024, 4:21:15 UTC - in response to Message 110096.  

Would you prefer that they cancel workunits after they start? They obviously want one copy to finish soon, and may not have information on whether the first one will ever finish.

I prefer they not send me the work unit at all if another instance of it is presumably running. Since they want only one result, they should wait until they decide the current unit has timed out (or failed) before sending me the new one. Then they save the network cost of sending me the work unit and later, the cost of telling my Boinc client to cancel the one they sent me.

The sole criterion is that the task has passed its deadline.
If the previous host completes the task late, but before you've started it, the server will ask your system to abort the task it sent you.
If you've already started the task, the server won't abort it while it's running, and you get this situation.
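
The three cases described above can be written out as a small decision function. This is only a sketch of the behaviour as described in this post, not BOINC's actual server code; the function name and the outcome strings are hypothetical.

```python
def resend_outcome(deadline_passed: bool,
                   original_returned_late: bool,
                   resend_started: bool) -> str:
    """Illustrative model of what happens to a resent copy of a task.

    All names here are assumptions made for illustration; they do not
    correspond to identifiers in the BOINC server code.
    """
    if not deadline_passed:
        # The sole criterion for a resend is a missed deadline.
        return "no resend"
    if not original_returned_late:
        # The original never came back, so the resend is the only copy.
        return "resend runs normally"
    if resend_started:
        # A task that is already running is not aborted by the server.
        return "resend keeps running (duplicate work)"
    # Original came back late, resend still waiting in the cache.
    return "server asks client to abort resend"
```

The last branch is the one that shows up as a cancelled Task on hosts with a large cache.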
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1722
Credit: 18,356,357
RAC: 25,250
Message 110098 - Posted: 4 Dec 2024, 5:25:05 UTC
Last modified: 4 Dec 2024, 5:26:28 UTC

It is a case of poor BOINC server configuration. Ideally, it would be configured with a grace period after the deadline, so that Tasks that have missed the deadline but are still being processed can be returned before they are resent.
There would still be some resends that are cancelled when the original finally comes in late, but far fewer than there are now, which would reduce the load on the servers.
Also, a Task cancelled by the project really shouldn't be classed as an error.
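
To make the grace-period idea concrete, here is a minimal sketch. It assumes a hypothetical `grace` parameter added on top of the deadline; it is not an actual BOINC server option, and the function name is made up for illustration.

```python
from datetime import datetime, timedelta


def resend_due(now: datetime,
               deadline: datetime,
               grace: timedelta = timedelta(0)) -> bool:
    """Return True once a Task should be resent to another host.

    With grace == 0 the Task is resent as soon as the deadline passes.
    A non-zero grace gives a late host extra time to report first.
    (Hypothetical helper, not part of BOINC.)
    """
    return now > deadline + grace
```

With a deadline of midnight and a result that is six hours late, `resend_due` is true with no grace period but false with a 12-hour grace period, so the resend (and its later cancellation) would never be issued.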



And the easiest & best way to avoid having this occur? Run with no cache at all.
If you've got a Task and it is being processed, then it won't be cancelled by the server. The larger your cache is, the longer it takes to start processing work you have downloaded, and the more likely it is that Tasks will be cancelled by the Project.

No cache, no cancelled Tasks; large cache, lots of cancelled Tasks.
Your choice.
Grant
Darwin NT
ArcSedna

Joined: 23 Oct 11
Posts: 16
Credit: 71,462,581
RAC: 87,530
Message 110099 - Posted: 4 Dec 2024, 5:26:44 UTC - in response to Message 110095.  
Last modified: 4 Dec 2024, 5:31:38 UTC

It seems that the DNS server of my ISP resolves boinc-files.bakerlab.org to 128.95.160.135 or 128.95.160.134, which have expired SSL certs.
I manually added an alternative IP to my local /etc/hosts, like 128.95.160.156 boinc-files.bakerlab.org, and every download has been working fine so far.
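
For anyone who wants to script the hosts-file workaround, here is a minimal sketch. It only transforms the text of a hosts file; reading and writing /etc/hosts (or the Windows equivalent) needs administrator rights and is left out. The function name is hypothetical, and the IP is simply the one reported as working in this post, which may stop being valid at any time.

```python
def with_hosts_override(hosts_text: str, ip: str, hostname: str) -> str:
    """Return hosts-file text in which `hostname` maps only to `ip`.

    Existing mappings for `hostname` are removed; other names that share
    a line with it are kept (any trailing comment on that line is dropped).
    """
    kept = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and hostname in fields[1:]:
            others = [name for name in fields[1:] if name != hostname]
            if others:
                kept.append(f"{fields[0]} {' '.join(others)}")
            continue
        kept.append(line)
    kept.append(f"{ip} {hostname}")
    return "\n".join(kept) + "\n"
```

An override like this is a temporary workaround: it should be removed again once the project's servers present valid certificates, otherwise the client will keep using a pinned address even if the servers move.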

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1722
Credit: 18,356,357
RAC: 25,250
Message 110100 - Posted: 4 Dec 2024, 7:53:15 UTC
Last modified: 4 Dec 2024, 8:09:45 UTC

Just checked my Event log, and downloads are instantly timing out.
Tried ArcSedna's suggestion- Success! Thanks for that info.


Edit: there could be other issues occurring. I'm getting ghost Tasks.
One system requested work and the log says it got 2 new Tasks, but no Tasks were downloaded.
Grant
Darwin NT
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1722
Credit: 18,356,357
RAC: 25,250
Message 110101 - Posted: 4 Dec 2024, 10:31:46 UTC

And to add to the download issues, boinc-process has died again.
Grant
Darwin NT
Dr Who Fan
Joined: 28 May 06
Posts: 79
Credit: 273,880
RAC: 361
Message 110102 - Posted: 4 Dec 2024, 11:55:22 UTC

Download issues with 3 tasks on 3 devices; one task has been trying to download for over 12 hours now.

Error log from one device showing the SSL EXPIRED CERTIFICATE MESSAGE:
Rosetta@home 12/4/2024 05:29:27 [http] HTTP_OP::init_get(): https://boinc-files.bakerlab.org/rosetta/download/a4/flags_rb_12_04_647237_640782__t000__0_C1_robetta
Rosetta@home 12/4/2024 05:29:27 Started download of flags_rb_12_04_647237_640782__t000__0_C1_robetta
Rosetta@home 12/4/2024 05:29:27 [http] HTTP_OP::init_get(): https://boinc-files.bakerlab.org/rosetta/download/20f/input_rb_12_04_647237_640782__t000__0_C1_robetta.zip
Rosetta@home 12/4/2024 05:29:27 Started download of input_rb_12_04_647237_640782__t000__0_C1_robetta.zip
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20615] Info: Hostname in DNS cache was stale, zapped
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: Found bundle for host: 0x1931793540 [serially]
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20615] Info: Host boinc-files.bakerlab.org:443 was resolved.
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20615] Info: IPv6: (none)
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20615] Info: IPv4: 128.95.160.135, 128.95.160.134
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20615] Info: Trying 128.95.160.135:443...
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: Hostname 'boinc-files.bakerlab.org' was found in DNS cache
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: Trying 128.95.160.135:443...
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: Connected to boinc-files.bakerlab.org (128.95.160.135) port 443
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: schannel: disabled automatic use of client certificate
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: ALPN: curl offers http/1.1
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: schannel: next InitializeSecurityContext failed: SEC_E_CERT_EXPIRED (0x80090328) - The received certificate has expired.
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: Closing connection
Rosetta@home 12/4/2024 05:29:28 [http] [ID#20616] Info: schannel: shutting down SSL/TLS connection with boinc-files.bakerlab.org port 443
Rosetta@home 12/4/2024 05:29:28 [http] HTTP error: SSL connect error
Rosetta@home 12/4/2024 05:29:29 Temporarily failed download of input_rb_12_04_647237_640782__t000__0_C1_robetta.zip: transient HTTP error

Rosetta@home 12/4/2024 05:29:29 Backing off 01:03:47 on download of input_rb_12_04_647237_640782__t000__0_C1_robetta.zip
Rosetta@home 12/4/2024 05:29:31 [http] [ID#20615] Info: Connected to boinc-files.bakerlab.org (128.95.160.135) port 443
Rosetta@home 12/4/2024 05:29:31 [http] [ID#20615] Info: schannel: disabled automatic use of client certificate
Rosetta@home 12/4/2024 05:29:31 [http] [ID#20615] Info: ALPN: curl offers http/1.1
Rosetta@home 12/4/2024 05:29:31 [http] [ID#20615] Info: schannel: next InitializeSecurityContext failed: SEC_E_CERT_EXPIRED (0x80090328) - The received certificate has expired.
Rosetta@home 12/4/2024 05:29:31 [http] [ID#20615] Info: Closing connection
Rosetta@home 12/4/2024 05:29:31 [http] [ID#20615] Info: schannel: shutting down SSL/TLS connection with boinc-files.bakerlab.org port 443
Rosetta@home 12/4/2024 05:29:31 [http] HTTP error: SSL connect error
Rosetta@home 12/4/2024 05:29:32 Temporarily failed download of flags_rb_12_04_647237_640782__t000__0_C1_robetta: transient HTTP error

Rosetta@home 12/4/2024 05:29:32 Backing off 00:30:29 on download of flags_rb_12_04_647237_640782__t000__0_C1_robetta
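
With http_debug output like the above, the address that served the expired certificate can be picked out mechanically. The sketch below is an assumption-laden helper, not a BOINC tool: it pairs each connection ID's "Trying <ip>:<port>" line with a later expiry message for the same ID, matching both the "certificate has expired" wording and the schannel SEC_E_CERT_EXPIRED wording seen in this thread. Other TLS backends may word the error differently.

```python
import re

# "[ID#4306] Info: Trying 128.95.160.135:443..."
TRY_RE = re.compile(r"\[ID#(\d+)\] Info:\s+Trying (\d+(?:\.\d+){3}):\d+")
# Matches the OpenSSL-style and schannel-style expiry messages.
EXPIRED_RE = re.compile(
    r"\[ID#(\d+)\] Info:.*(?:certificate has expired|SEC_E_CERT_EXPIRED)"
)


def expired_cert_ips(log_lines):
    """Return the set of IPv4 addresses whose connection reported an expired cert."""
    tried = {}       # connection ID -> last IP tried on that connection
    expired = set()
    for line in log_lines:
        match = TRY_RE.search(line)
        if match:
            tried[match.group(1)] = match.group(2)
            continue
        match = EXPIRED_RE.search(line)
        if match and match.group(1) in tried:
            expired.add(tried[match.group(1)])
    return expired
```

Feeding it the saved Event Log lines (for example, from a text file split into lines) gives a short list of addresses to avoid or report.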
Bryn Mawr

Joined: 26 Dec 18
Posts: 398
Credit: 12,294,748
RAC: 9,249
Message 110103 - Posted: 4 Dec 2024, 15:21:21 UTC - in response to Message 110099.  

Many thanks, downloads fixed, tasks running :-)
BobbyB

Joined: 25 Apr 20
Posts: 2
Credit: 2,088,662
RAC: 27,165
Message 110104 - Posted: 4 Dec 2024, 16:04:12 UTC
Last modified: 4 Dec 2024, 16:06:19 UTC

12 tasks hung for 15 hours. Is it me or something else? I aborted them 10 minutes ago.
Running headless.

Example log for 1:
2024-12-03 19:04:54 | Rosetta@home | Started download of 8a_hal_v_hal_8aa_4jp8289_d247_0001_1.flags
2024-12-03 19:04:55 |  | Internet access OK - project servers may be temporarily down.
2024-12-03 19:04:55 | Rosetta@home | Temporarily failed download of 8a_hal_v_hal_8aa_4jp8289_d247_0001_1.flags: transient HTTP error
2024-12-03 19:04:55 | Rosetta@home | Backing off 00:02:02 on download of 8a_hal_v_hal_8aa_4jp8289_d247_0001_1.flags
...
...
2024-12-04 10:22:51 | Rosetta@home | Temporarily failed download of 8a_hal_v_hal_8aa_4jp8289_d247_0001_1.flags: transient HTTP error
2024-12-04 10:22:51 | Rosetta@home | Backing off 01:46:39 on download of 8a_hal_v_hal_8aa_4jp8289_d247_0001_1.flags
2024-12-04 10:22:52 |  | Internet access OK - project servers may be temporarily down.
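
The "Backing off" lines show how long the client will wait before retrying each download (about 2 minutes at first in this log, and over 1 hour 46 minutes by the next morning). A small sketch for turning those lines into seconds, so the delays can be compared or summed; the function name is hypothetical and the pattern is based only on the log lines quoted in this thread.

```python
import re

# e.g. "Backing off 01:46:39 on download of <file>"
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d{2}):(\d{2}) on download of (\S+)")


def backoff_seconds(line: str):
    """Return the back-off delay in seconds, or None if the line has none."""
    match = BACKOFF_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(match.group(i)) for i in (1, 2, 3))
    return hours * 3600 + minutes * 60 + seconds
```

A long back-off means a download will not retry on its own for a while even after the server is fixed, so a manual retry of pending transfers may be needed at that point.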
Jean-David Beyer

Joined: 2 Nov 05
Posts: 195
Credit: 6,613,600
RAC: 9,094
Message 110105 - Posted: 4 Dec 2024, 16:23:45 UTC - in response to Message 110104.  

12 tasks hung for 15 hours. Is it me or something else. I aborted them 10 minutes ago.
Running headless.

I get similar results to yours.

Wed 04 Dec 2024 11:16:24 AM EST | Rosetta@home | Started download of 8a_hal_w_hal_8aa_4jp5150_d23_0001_1.zip
Wed 04 Dec 2024 11:16:26 AM EST |  | Project communication failed: attempting access to reference site
Wed 04 Dec 2024 11:16:26 AM EST | Rosetta@home | Temporarily failed download of 8a_hal_w_hal_8aa_4jp5150_d23_0001_1.zip: transient HTTP error
Wed 04 Dec 2024 11:16:26 AM EST | Rosetta@home | Backing off 03:28:48 on download of 8a_hal_w_hal_8aa_4jp5150_d23_0001_1.zip
Wed 04 Dec 2024 11:16:28 AM EST |  | Internet access OK - project servers may be temporarily down.


But my machine continues to run the Rosetta tasks that are already running and starts those I still have.

I am not sure why your machine was hung. I seem to be able to upload results. They do seem to be having server problems...
kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 110106 - Posted: 4 Dec 2024, 16:25:01 UTC

Chrome accepts the connection to https://boinc-files.bakerlab.org, but curl does not.
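
One possible explanation, going by the earlier reports in this thread, is that the hostname resolves to several addresses and only some of them serve an expired certificate, so two programs can see different results. A rough way to check is to attempt a verified TLS handshake against each address individually while still sending the real hostname. The sketch below uses only the Python standard library; `probe_tls` is a hypothetical helper name, and the result strings are made up for illustration.

```python
import socket
import ssl


def probe_tls(ip: str, hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Try a certificate-verified TLS handshake with one specific address.

    `hostname` is used for SNI and certificate checking, while the TCP
    connection goes to `ip`, bypassing DNS.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((ip, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=hostname):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        # verify_message is e.g. "certificate has expired"
        return f"certificate rejected: {exc.verify_message}"
    except OSError as exc:
        # Covers refused connections, timeouts and other TLS errors.
        return f"connection failed: {exc}"
```

Calling `probe_tls("128.95.160.135", "boinc-files.bakerlab.org")`, and likewise for 128.95.160.134 and 128.95.160.156, would show whether the addresses mentioned in this thread behave differently. The result also depends on the local CA store, so it may not match what the BOINC client's bundled curl sees.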
kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 110107 - Posted: 4 Dec 2024, 16:38:41 UTC

kotenok2000
Joined: 22 Feb 11
Posts: 271
Credit: 507,897
RAC: 496
Message 110108 - Posted: 4 Dec 2024, 17:39:13 UTC

Now it works.
Dr Who Fan
Joined: 28 May 06
Posts: 79
Credit: 273,880
RAC: 361
Message 110109 - Posted: 4 Dec 2024, 18:48:58 UTC

I now have 3 tasks stuck in download with many retries, 2 of them on Android phones.
Manual retries fail instantly.
Dr Who Fan
Joined: 28 May 06
Posts: 79
Credit: 273,880
RAC: 361
Message 110110 - Posted: 4 Dec 2024, 19:51:46 UTC

Now 4 tasks are stuck on downloads, all on Android phones.
mmonnin

Joined: 2 Jun 16
Posts: 61
Credit: 25,390,629
RAC: 47,239
Message 110111 - Posted: 4 Dec 2024, 20:51:24 UTC

Because of all the failed downloads, I ran out of work on multiple PCs, even with a 1+ day queue set. I haven't been able to get tasks consistently since the 1.7m tasks went up. With that amount of work I could have gotten my 25m goal either Thurs or Friday.
Jean-David Beyer

Joined: 2 Nov 05
Posts: 195
Credit: 6,613,600
RAC: 9,094
Message 110112 - Posted: 4 Dec 2024, 21:07:52 UTC
Last modified: 4 Dec 2024, 21:10:58 UTC

It is not working for me. I have one task trying to download two files. I checked and they will time out Friday afternoon, so they had better be delivered early enough before then for me to get them done.

The web site shows the download server as green, though a lot of the others are red.
BobbyB

Joined: 25 Apr 20
Posts: 2
Credit: 2,088,662
RAC: 27,165
Message 110113 - Posted: 4 Dec 2024, 21:13:21 UTC

Still not working. Download stuck.

2024-12-04 16:01:39 | Rosetta@home | Temporarily failed download of input_rb_12_01_646399_639907__t000__0_C1_robetta.zip: transient HTTP error
2024-12-04 16:01:39 | Rosetta@home | Backing off 00:06:14 on download of input_rb_12_01_646399_639907__t000__0_C1_robetta.zip
2024-12-04 16:01:40 | | Internet access OK - project servers may be temporarily down.

This would be a really good time for Rosetta to get their stuff working properly. Around Dec 6 or 7 World Community Grid will be offline for about 1 month. All those people and their machines will be looking for work. I have 76 cores which will be hungry.
mmonnin

Joined: 2 Jun 16
Posts: 61
Credit: 25,390,629
RAC: 47,239
Message 110114 - Posted: 4 Dec 2024, 21:45:27 UTC

The hosts file update worked for me on Win10 and Linux. Most clients needed a restart, as they thought there was a file stuck downloading (even though none was shown), so nothing else would download. The client restart cleared that up.



©2024 University of Washington
https://www.bakerlab.org