Problems and Technical Issues with Rosetta@home

BarryAZ

Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 81494 - Posted: 18 Apr 2017, 20:25:46 UTC - in response to Message 81491.  

It seems to me that the upload problem has been solved. At least, all my stuck WUs have been uploaded now.



Yes, same here. Glad whatever the problem was has been resolved.

ID: 81494 · Rating: 0
amgthis

Joined: 25 Mar 06
Posts: 81
Credit: 203,879,282
RAC: 0
Message 81495 - Posted: 18 Apr 2017, 22:42:57 UTC

All fixed! At least all the backed up stuff is now cleared out.

Atta boy Rosetta team, we knew you could do it!

Have a great week.

Cheers!

/M
ID: 81495 · Rating: 0
J. Ritchie Morrow

Joined: 4 Nov 05
Posts: 5
Credit: 341,049
RAC: 0
Message 81519 - Posted: 3 May 2017, 14:42:56 UTC

I keep getting the message that 'Task XX exited with zero status but no finished file. If this happens repeatedly you may need to reset the project.' I have reset the project but continue to get the error. Is this an issue on my end or the project's end? Thanks!
ID: 81519 · Rating: 0
Snags

Joined: 22 Feb 07
Posts: 198
Credit: 2,888,320
RAC: 0
Message 81521 - Posted: 4 May 2017, 10:50:06 UTC - in response to Message 81519.  

I keep getting the message that 'Task XX exited with zero status but no finished file. If this happens repeatedly you may need to reset the project.' I have reset the project but continue to get the error. Is this an issue on my end or the project's end? Thanks!

Copied and pasted from an earlier answer:

On Rosetta this is usually solved by increasing the "use at most xxx% of CPU time" setting to 100. You may then want to reduce the "on multiprocessors, use at most xxx% of the processors" setting to something lower than its current value. Most people find this handles the temperature regulation concerns (which the CPU throttling was designed to address) perfectly.
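
For anyone who would rather set these two values locally than through the preferences page, here is a minimal sketch. It assumes a BOINC client that reads a `global_prefs_override.xml` file from its data directory, and that the two settings correspond to the `cpu_usage_limit` and `max_ncpus_pct` elements. The function name and the 75% figure are illustrative only and do not come from this thread.

```python
import xml.etree.ElementTree as ET


def build_prefs_override(cpu_time_pct: float, processors_pct: float) -> str:
    """Return XML text for a BOINC preferences override (assumed element names)."""
    for value in (cpu_time_pct, processors_pct):
        if not 0 < value <= 100:
            raise ValueError("percentages must be in the range (0, 100]")
    root = ET.Element("global_preferences")
    # "use at most xxx% of CPU time": 100 means no throttling
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_time_pct:.1f}"
    # "on multiprocessors, use at most xxx% of the processors"
    ET.SubElement(root, "max_ncpus_pct").text = f"{processors_pct:.1f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 100% CPU time as suggested above; 75% of processors is a hypothetical value.
    print(build_prefs_override(100, 75))
```

The printed text would be saved as the override file, and the client would need to re-read its preferences before the change takes effect.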

Another possible cause is a virus scanner; most folks exclude BOINC from those scans or set the scanner to run only when BOINC isn't active.

An explanation and more possible causes can be found here: BOINC FAQ Service

Please know that this only becomes a fatal error when it occurs 100 times for a particular task; at that point BOINC assumes the task will never be able to finish and gives up on it, ending it as a client error. If you see this message only occasionally, it is safe to ignore it.
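
To check how close any task is to that limit, one option is to count the message per task in the client's event log. The sketch below is only an illustration. The pattern is taken from the message quoted in this thread, and the exact wording and layout of a real log line are assumptions, as are the helper names and the warning threshold of half the limit.

```python
import re
from collections import Counter
from typing import Iterable, List

# Pattern based on the message quoted above; real log lines may differ (assumption).
NO_FINISHED_FILE = re.compile(r"Task (\S+) exited with zero status but no")

FATAL_AFTER = 100  # per the explanation above, BOINC gives up at 100 occurrences


def count_restarts(log_lines: Iterable[str]) -> Counter:
    """Count how often each task reported the 'no finished file' message."""
    counts: Counter = Counter()
    for line in log_lines:
        match = NO_FINISHED_FILE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def tasks_at_risk(counts: Counter, warn_at: int = FATAL_AFTER // 2) -> List[str]:
    """Return task names that have reached the (hypothetical) warning threshold."""
    return sorted(task for task, n in counts.items() if n >= warn_at)
```

A task that shows up only a handful of times in the counts is in the "safe to ignore" range described above.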


Best,
Snags
ID: 81521 · Rating: 0
Batschlach

Joined: 7 May 17
Posts: 3
Credit: 307,527
RAC: 0
Message 81529 - Posted: 13 May 2017, 14:22:09 UTC

Hey,

I've received some work units which couldn't be finished due to a compute error. Interestingly, the second host that ran the same WU also ended in a compute error:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=825566938
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=825557307
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=825537629 (still pending)

Is this common behaviour? What has happened there?

Best regards
ID: 81529 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81530 - Posted: 13 May 2017, 21:18:03 UTC

Moved the details from Batschlach. These are Android WUs.
Rosetta Moderator: Mod.Sense
ID: 81530 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81532 - Posted: 14 May 2017, 3:13:03 UTC - in response to Message 81529.  

Hey,

I've received some work units which couldn't be finished due to a compute error. Interestingly, the second host that ran the same WU also ended in a compute error:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=825566938
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=825557307
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=825537629 (still pending)

Is this common behaviour? What has happened there?

Best regards



This was a bad batch that a researcher accidentally sent out.
ID: 81532 · Rating: 0
Batschlach

Joined: 7 May 17
Posts: 3
Credit: 307,527
RAC: 0
Message 81533 - Posted: 14 May 2017, 10:00:11 UTC - in response to Message 81532.  

This was a bad batch that a researcher accidentally sent out.

Oh, I see. Thanks for your answer. And thanks for moving my post into the right thread @Mod.Sense!
ID: 81533 · Rating: 0
Skillz

Joined: 24 May 17
Posts: 3
Credit: 5,914,356
RAC: 7,445
Message 81538 - Posted: 27 May 2017, 18:03:46 UTC

Why am I having such problems getting work units?

I have over 250 cores that can be crunching but I only have, at the time of this post, 59 slots filled. This is the only project I am running so those other cores are sitting idle.
ID: 81538 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 81539 - Posted: 27 May 2017, 20:45:41 UTC - in response to Message 81538.  

Why am I having such problems getting work units?

I have over 250 cores that can be crunching but I only have, at the time of this post, 59 slots filled. This is the only project I am running so those other cores are sitting idle.


The Server Status page is a sea of red: clearly there are problems of some sort.
ID: 81539 · Rating: 0
mmonnin

Joined: 2 Jun 16
Posts: 59
Credit: 24,317,585
RAC: 63,126
Message 81540 - Posted: 27 May 2017, 20:52:12 UTC

The only one that needs to be up all the time is the scheduler, which I've seen up; the ones listed below it have been up and down. Not everything needs to be running 100% of the time for the project to function.

Set a longer queue.
ID: 81540 · Rating: 0
xii5ku

Joined: 29 Nov 16
Posts: 22
Credit: 13,815,783
RAC: 86
Message 81541 - Posted: 28 May 2017, 10:58:11 UTC - in response to Message 81538.  

Why am I having such problems getting work units?

I have over 250 cores that can be crunching but I only have, at the time of this post, 59 slots filled. This is the only project I am running so those other cores are sitting idle.


For a dual- or quad-socket machine, a "Target CPU run time" setting below 4 hours is not sustainable, IME.
ID: 81541 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 81543 - Posted: 29 May 2017, 3:20:11 UTC - in response to Message 81541.  

Why am I having such problems getting work units?

I have over 250 cores that can be crunching but I only have, at the time of this post, 59 slots filled. This is the only project I am running so those other cores are sitting idle.


For a dual- or quad-socket machine, a "Target CPU run time" setting below 4 hours is not sustainable, IME.

Correct. Runtime cannot be the minimum 1 hour - especially when you have 250 cores (which is great btw).

Default run-time is 8 hours. Compared with 1 hour, each task then does 8 times the work and receives 8 times the credit, but uses only 1/8th of the bandwidth - better for you and the project. It is also likely to reduce the occasions when you have unused cores, which answers your question.

BUT! You shouldn't change directly from 1hr to 8hrs, otherwise your tasks will miss deadlines. Change up to 2hrs first, until your buffer stockpile is reduced and starts asking for more tasks. Then 3hrs - same process. Then 4hrs etc until you get to a practical level you're happy with - ideally the default 8hrs.
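
The arithmetic behind the 1/8th figure, and the one-hour-at-a-time step-up, can be sketched as follows. This is a rough model, not project code. It assumes every core runs continuously, each task takes about its target run time, and the transfer cost per task is roughly constant. The function names are made up for illustration.

```python
from typing import List


def tasks_per_day(cores: int, runtime_hours: float) -> float:
    """Approximate number of tasks a host must fetch per day to keep all cores busy."""
    if cores <= 0 or runtime_hours <= 0:
        raise ValueError("cores and runtime_hours must be positive")
    return cores * 24 / runtime_hours


def step_up_schedule(current_hours: int, target_hours: int = 8) -> List[int]:
    """Run-time settings to pass through, one hour at a time, up to the target."""
    return list(range(current_hours + 1, target_hours + 1))


if __name__ == "__main__":
    one_hour = tasks_per_day(250, 1)
    eight_hours = tasks_per_day(250, 8)
    print(one_hour, eight_hours, eight_hours / one_hour)
    print(step_up_schedule(1))
```

Under these assumptions a 250-core host needs about 6,000 tasks a day at a 1-hour run time but only about 750 at 8 hours, so it makes one eighth as many downloads and uploads.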
ID: 81543 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1995
Credit: 9,633,537
RAC: 7,232
Message 81544 - Posted: 30 May 2017, 10:11:51 UTC - in response to Message 81539.  

The Server Status page is a sea of red: clearly there are problems of some sort.


Still.... :-(
ID: 81544 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81545 - Posted: 30 May 2017, 18:54:42 UTC - in response to Message 81544.  

The Server Status page is a sea of red: clearly there are problems of some sort.


Still.... :-(



Sorry for the errors in the status page. I'll take a look. Everything is running as normal so you can ignore the page for now.
ID: 81545 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 81549 - Posted: 6 Jun 2017, 11:53:21 UTC

2 long-running tasks with a long time since the last checkpoint:

b21_ncst_0601.282._relax_SAVE_ALL_OUT_486708_29_0
Last checkpoint: 7:51:46
CPU Time: 11:56:21

b22_1_0603.18._relax_SAVE_ALL_OUT_486983_40_1
Last checkpoint: 2:54:51
CPU Time: 8:23:20

Both have a default 8 hour runtime and I'm anticipating the watchdog being the only thing that stops them running.

I've got 2 more b21 tasks in my queue. Should I abort them? Thinking I will.
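
For reference, here is a small sketch of how far each task's CPU time has run past the 8-hour target. It only does the subtraction on the figures quoted above. The helper names are hypothetical, and it makes no assumption about when the watchdog actually steps in.

```python
from datetime import timedelta

TARGET = timedelta(hours=8)  # default target run time mentioned above


def parse_hms(text: str) -> timedelta:
    """Parse an 'H:MM:SS' string as shown in the task properties."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def overrun(cpu_time: str, target: timedelta = TARGET) -> timedelta:
    """Return how far CPU time exceeds the target (zero if it does not)."""
    return max(parse_hms(cpu_time) - target, timedelta(0))


if __name__ == "__main__":
    for name, cpu in (("b21 task", "11:56:21"), ("b22 task", "8:23:20")):
        print(name, "is over target by", overrun(cpu))
```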
ID: 81549 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 81550 - Posted: 6 Jun 2017, 12:16:10 UTC - in response to Message 81549.  

The 1st one has just completed and given full credit (and more) for the extra runtime. Maybe I should just let them complete after all. Let me see what the other one does.
2 long-running tasks with a long time since the last checkpoint:

b21_ncst_0601.282._relax_SAVE_ALL_OUT_486708_29_0
Last checkpoint: 7:51:46
CPU Time: 11:56:21

b22_1_0603.18._relax_SAVE_ALL_OUT_486983_40_1
Last checkpoint: 2:54:51
CPU Time: 8:23:20

Both have a default 8 hour runtime and I'm anticipating the watchdog being the only thing that stops them running.

I've got 2 more b21 tasks in my queue. Should I abort them? Thinking I will.


ID: 81550 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 81553 - Posted: 7 Jun 2017, 1:59:55 UTC - in response to Message 81550.  

2nd one not so generous, but both acknowledged the full runtime and validated properly.
The 1st one has just completed and given full credit (and more) for the extra runtime. Maybe I should just let them complete after all. Let me see what the other one does.
2 long-running tasks with a long time since the last checkpoint:

b21_ncst_0601.282._relax_SAVE_ALL_OUT_486708_29_0
Last checkpoint: 7:51:46
CPU Time: 11:56:21

b22_1_0603.18._relax_SAVE_ALL_OUT_486983_40_1
Last checkpoint: 2:54:51
CPU Time: 8:23:20

Both have a default 8 hour runtime and I'm anticipating the watchdog being the only thing that stops them running.

I've got 2 more b21 tasks in my queue. Should I abort them? Thinking I will.


ID: 81553 · Rating: 0
boinc127

Joined: 23 Jan 12
Posts: 3
Credit: 281,019
RAC: 0
Message 81555 - Posted: 7 Jun 2017, 17:39:26 UTC

I've got a b22 task that is wrapping up, but it is crawling to completion at 99.399%. It's slowly creeping at 0.01% a minute or so; it's in fast relax on model 292, step 7205.

I may just abort that task as well...
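
A back-of-the-envelope estimate of the time left at that pace is sketched below. It assumes the progress figure keeps advancing at a constant rate, which may well not hold for a task near the end of its run, so treat the result as a rough guess. The function name is made up for illustration.

```python
def minutes_remaining(progress_pct: float, rate_pct_per_min: float) -> float:
    """Estimate minutes to 100% assuming a constant progress rate (an assumption)."""
    if rate_pct_per_min <= 0:
        raise ValueError("rate must be positive")
    return max(100.0 - progress_pct, 0.0) / rate_pct_per_min


if __name__ == "__main__":
    # Figures quoted in the post above: 99.399% done, about 0.01% per minute.
    print(round(minutes_remaining(99.399, 0.01)), "minutes, roughly")
```

On those figures the task would need about another hour.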
ID: 81555 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1995
Credit: 9,633,537
RAC: 7,232
Message 81559 - Posted: 8 Jun 2017, 7:43:11 UTC

920412254

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00EECEF9 read attempt to address 0x11DE2000

Engaging BOINC Windows Runtime Debugger...

ID: 81559 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org