Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Sid Celery

Joined: 11 Feb 08
Posts: 2130
Credit: 41,424,155
RAC: 16,102
Message 89781 - Posted: 26 Oct 2018, 22:12:39 UTC - in response to Message 89778.  

Where did all the WUs go? There were loads to download the last time I looked. Now none.

Cancel that (maybe). 6k+ just came back

Went up to 14k tasks, then all gone again. Something weird happening.
robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,324,975
RAC: 3,637
Message 89782 - Posted: 26 Oct 2018, 23:34:09 UTC - in response to Message 89614.  

If you look at the top ten computers https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=0, the first 4 places are occupied by [DPC] Nifhack with AMD:

[snip]

Looks like the main limitation of CPUs with this many processors is not the number of processors, but the speed of the memory that all the processors in the same package share.

If so, some of these processors could even be beyond the point where arbitrating which processor gets the next memory access takes up enough of the run time to cause a significant slowdown.

You might also look up the cache size inside each of these CPUs - competing for cache space could also cause a significant slowdown.
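A rough back-of-the-envelope version of the point above, offered only as a sketch: if every core in a package is busy, each one gets at best an even slice of the shared memory bandwidth and shared cache. The helper name and the numbers in the usage note are made-up placeholders, not measurements of any CPU on the top-hosts list.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: even split of a shared resource (memory bandwidth in
// GB/s, or last-level cache in MB) across the cores that are busy.
// Real hardware does not share perfectly evenly; this is a first estimate only.
double share_per_core(double shared_total, int busy_cores) {
    if (busy_cores <= 0) return 0.0;  // nothing running, nothing to share
    return shared_total / busy_cores;
}
```

With an assumed 80 GB/s of package bandwidth, share_per_core(80.0, 32) is 2.5 GB/s per core, against 10 GB/s when only 8 cores are busy. The same division applies to cache size per running task.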
Sam

Joined: 9 Mar 06
Posts: 3
Credit: 3,343,043
RAC: 1,698
Message 89794 - Posted: 28 Oct 2018, 15:46:57 UTC - in response to Message 89779.  


I'm getting an error, "<message>finish file present too long</message>", after a WU has completed.


Hi Franko,

I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' on time for BOINC. You can just ignore it, because most of the time your workunits are fine.

Sjmielh
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 89795 - Posted: 28 Oct 2018, 16:32:27 UTC - in response to Message 89794.  

I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' on time for Boinc.

That is an interesting thought. I have not seen that error for a long time, and I now use only SSDs on all my machines.
Also, I usually use a write-cache (or ramdisk), so most of my writes and even reads are from main memory. I think that does it.
fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89805 - Posted: 30 Oct 2018, 17:00:35 UTC - in response to Message 89795.  

This machine uses two SSDs in RAID 0 on a Dell PERC H710 RAID card with 1 GB of RAM (which could be the source of the problem), with the write policy set to "Write Back", which is defined as, "In Write Back mode the controller sends a data transfer completion signal to the host when the controller cache has received all of the data in a transaction."

For some reason, Windows Explorer (Exploder?) hangs when this machine is NOT under load, AND I have several windows explorer windows open.

Is there a way to increase this timeout to accommodate this machine's peculiarities?

Thanks!!

Franko

rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 23,170,646
RAC: 10,441
Message 89812 - Posted: 1 Nov 2018, 4:11:01 UTC - in response to Message 89805.  

The error message is displayed by the BOINC Client.
I think it is just a BOINC Client timing issue that they have declared "fixed" several times.
I don't think it is ever a problem, just annoying.

client/app_control.cpp

// Check for finish files every 10 sec.
// If we already found a finish file, abort the app;
// it must be hung somewhere in boinc_finish();
//
static double last_finish_check_time = 0;
if (gstate.clock_change || gstate.now - last_finish_check_time > 10) {
    last_finish_check_time = gstate.now;
    for (i=0; i<active_tasks.size(); i++) {
        ACTIVE_TASK* atp = active_tasks[i];
        if (atp->task_state() == PROCESS_UNINITIALIZED) continue;
        if (atp->finish_file_time) {
            // process is still there 10 sec after it wrote finish file.
            // abort the job
            atp->abort_task(EXIT_ABORTED_BY_CLIENT, "finish file present too long");  // <<< line 140
        } else if (atp->finish_file_present()) {
            atp->finish_file_time = gstate.now;
        }
    }
}
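The excerpt boils down to a two-pass check: the first pass that sees a finish file only records the time, and a later pass aborts the task if its process is still around. Below is a minimal standalone sketch of that logic, not the BOINC code itself. The Task struct and the grace parameter are hypothetical; the excerpt shows the interval only as a literal 10 in the source, with no tunable.

```cpp
#include <cassert>
#include <string>

// Simplified stand-in for BOINC's ACTIVE_TASK. The finish_file_* names mirror
// the excerpt; everything else here is hypothetical.
struct Task {
    bool process_running = true;      // is the science app process still alive?
    bool finish_file_present = false; // has the app written its finish file?
    double finish_file_time = 0;      // 0 means "finish file not seen yet"
    bool aborted = false;
    std::string abort_reason;
};

// One pass of the periodic check for a single task. `now` is the current time
// in seconds and must be > 0, because 0 is used as the "not seen" marker.
// `grace` is an assumed extra allowance (seconds) that the excerpt does not have.
void check_finish_file(Task& t, double now, double grace = 0.0) {
    if (!t.process_running || t.aborted) return;
    if (t.finish_file_time != 0) {
        // Finish file was seen on an earlier pass and the process is still
        // here: treat the app as hung once the grace period has elapsed.
        if (now - t.finish_file_time >= grace) {
            t.aborted = true;
            t.abort_reason = "finish file present too long";
        }
    } else if (t.finish_file_present) {
        // First sighting: only remember when the file was seen.
        t.finish_file_time = now;
    }
}
```

Calling this once at t=100 and again at t=110 on a task whose process never exits reproduces the error. If the process exits between the two passes, nothing is aborted, which is why a machine that stalls for several seconds at task exit would be the one to hit this.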
fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89830 - Posted: 3 Nov 2018, 23:45:08 UTC - in response to Message 89812.  

Thanks, but after looking at the affected tasks, it looks like the result was discarded & no credit granted.

That said, it's looking more & more like this was a problem with my Dell PERC H710P RAID card. The machine was sluggish as hell with the disk cache write-back enabled & everything really went south (machine became unbootable) after I tried a backup. I fiddled with it for days, then finally pulled the backup battery off the card, which disabled the cache, & let it sit overnight. Next morning I reinstalled the card, and it was back on go. Jacked my "use at most" CPUs setting back up to 100% & the machine is still snappy. Back to Munching & Crunching ;)

Thanks for looking this up for me, if I run into problems again, I will try increasing this timeout.

Franko


fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89834 - Posted: 4 Nov 2018, 9:12:58 UTC - in response to Message 89812.  

Dang it, I'm still getting the same error.

I tried to find the file app_control.cpp, but couldn't find it - is this a file I can edit?

Thanks!!

Franko

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,324,975
RAC: 3,637
Message 89837 - Posted: 5 Nov 2018, 2:19:58 UTC - in response to Message 89834.  

Dang it, I'm still getting the same error.

I tried to find the file app_control.cpp, but couldn't find it - is this a file I can edit?

Thanks!!

Franko

[snip]

Files with the .cpp extension are usually C++ source files, which can be edited. However, doing so is not useful unless:

1. You have a copy of the file. Most BOINC downloads do not include the source files - you have to know where to find the source files and download the entire package of source files.

2. You know enough C++ to make useful edits.

3. You have all of the compilers installed to compile the entire program for your operating system.

4. You have the instructions to compile all source files needed, and then link them into a new version of the program.

5. You know how to substitute the new version of the program for the old version.
fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89845 - Posted: 6 Nov 2018, 17:48:05 UTC - in response to Message 89837.  

Got it, thanks!!

I spent some more time with this machine running at 100% (32 Rosetta tasks + 1 SETI task on the GPU) & it DID hang occasionally, which would explain this error.

As this is also my daily driver, I backed the "Use at most CPU's" option down to 93.75% (30 of 32 threads) & I haven't seen the problem since.

Problem resolved.
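For anyone wanting to pick a percentage for a given thread count, the arithmetic behind "93.75% = 30 of 32 threads" can be sketched as below. The function name is hypothetical, and the rounding (truncate, but never below 1) is an assumption; how the BOINC client rounds internally is not confirmed here.

```cpp
#include <cassert>

// Threads allowed by a "use at most N% of the CPUs" setting.
// Assumption: the product is truncated toward zero and clamped to at least 1.
int usable_threads(int total_threads, double percent) {
    int n = static_cast<int>(total_threads * percent / 100.0);
    return n < 1 ? 1 : n;
}
```

usable_threads(32, 93.75) gives 30, leaving two threads free for desktop use. Going the other way, the percentage for k of n threads is 100.0 * k / n.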

Thanks!!

Franko

anklab

Joined: 1 Jun 10
Posts: 1
Credit: 9,599,886
RAC: 342
Message 89926 - Posted: 24 Nov 2018, 14:05:41 UTC

Hi!
Recently I have noticed that WUs that run for a long time are granted about the same credit as WUs that run for a short time.
For example, my computers:
Intel Core2Duo E8500 and Intel Core i5-2500.

The E8500 gets WUs with 4 hours of crunching, the i5-2500 with 24 hours. It is strange that different tasks with different amounts of work are granted nearly equal credit.

Core i5-2500 // 24 hours // granted 160.33
======================================================
DONE :: 1 starting structures 86255.3 cpu seconds
This process generated 174 decoys from 174 attempts
======================================================

E8500 // 4 hours // granted 152.93
======================================================
DONE :: 1 starting structures 13805.2 cpu seconds
This process generated 22 decoys from 22 attempts
======================================================


Much earlier, the i5-2500 received approximately 800~850 credits for each completed WU.
What can I do?
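To put the two results on a common scale, the granted credit can be divided by CPU hours or by decoys. This is only a worked-arithmetic sketch using the figures quoted in the post; the function names are hypothetical and it says nothing about how the project actually computes credit.

```cpp
#include <cassert>
#include <cmath>

// Credits granted per CPU hour, from the numbers printed in a task report.
double credit_per_cpu_hour(double credit, double cpu_seconds) {
    return credit / (cpu_seconds / 3600.0);
}

// Credits granted per decoy (model) produced.
double credit_per_decoy(double credit, int decoys) {
    return credit / decoys;
}
```

With the numbers above, the i5-2500 task works out to about 6.7 credits per CPU hour (160.33 over 86255.3 s) and about 0.92 per decoy, while the E8500 task works out to about 39.9 per CPU hour (152.93 over 13805.2 s) and about 6.95 per decoy.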
LarryMajor

Joined: 1 Apr 16
Posts: 22
Credit: 31,533,212
RAC: 0
Message 89930 - Posted: 25 Nov 2018, 19:52:03 UTC - in response to Message 89926.  

Much earlier, i5-2500 received for each completed WU approximately 800~850 credits.
What can i do?


I'd do nothing for a few days. It appears to have been the recent WUs/scoring that caused a big drop. Mine started to look more typical in the past 24 hours.
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 42
Credit: 1,258,039
RAC: 0
Message 89949 - Posted: 2 Dec 2018, 12:32:51 UTC - in response to Message 89930.  

Hi

I have tasks erroring after 10 hours of calculation

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
finish file present too long</message>
<stderr_txt>
command: rosetta_4.09_x86_64-apple-darwin -run:protocol jd2_scripting @flags_rb_12_01_955_1018__t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_12_01_955_1018__t000__0_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3814287
Starting watchdog...
Watchdog active.
======================================================
DONE :: 43 starting structures 28348 cpu seconds
This process generated 43 decoys from 43 attempts
======================================================
BOINC :: WS_max 5.21523e+08

BOINC :: Watchdog shutting down...
12:42:37 (98417): called boinc_finish(0)

</stderr_txt>
]]>


A few did succeed from the same lot after the same amount of calculation time

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<stderr_txt>
command: rosetta_4.09_x86_64-apple-darwin -run:protocol jd2_scripting @flags_rb_12_01_948_1013__t000__1_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_12_01_948_1013__t000__1_C1_robetta.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3810154
Starting watchdog...
Watchdog active.
======================================================
DONE :: 7 starting structures 28105.7 cpu seconds
This process generated 7 decoys from 7 attempts
======================================================
BOINC :: WS_max 9.90781e+08

BOINC :: Watchdog shutting down...
12:37:56 (98460): called boinc_finish(0)

</stderr_txt>
]]>
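The two reports differ in one line only: both show the app reaching boinc_finish(0), but the failed one also carries the client's "finish file present too long" message. A small sketch of telling the cases apart by text search follows. The enum and function names are hypothetical and this is not part of BOINC.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of a pasted task report.
enum class Outcome { Ok, AppFinishedButClientAborted, Other };

Outcome classify_report(const std::string& report) {
    // The science app logs this line when it finishes with exit status 0.
    const bool app_finished_cleanly =
        report.find("called boinc_finish(0)") != std::string::npos;
    // The client adds this message when it aborts a task at exit time.
    const bool client_timeout =
        report.find("finish file present too long") != std::string::npos;

    if (app_finished_cleanly && client_timeout) {
        return Outcome::AppFinishedButClientAborted;
    }
    if (app_finished_cleanly) return Outcome::Ok;
    return Outcome::Other;
}
```

On the first report above this returns AppFinishedButClientAborted, which suggests the computation itself completed and the failure happened during shutdown of the task.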

wshadw

Joined: 3 Dec 18
Posts: 1
Credit: 0
RAC: 0
Message 89955 - Posted: 3 Dec 2018, 21:31:24 UTC - in response to Message 80621.  

I am getting a message of "Abandoned by Project" on too many workunits. With 8 hour workunits this is unacceptable and since I compute in the Gridcoin pool I cannot change my settings.
robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,324,975
RAC: 3,637
Message 89956 - Posted: 3 Dec 2018, 22:52:34 UTC - in response to Message 89955.  

I am getting a message of "Abandoned by Project" on too many workunits. With 8 hour workunits this is unacceptable and since I compute in the Gridcoin pool I cannot change my settings.


Could this mean that your computer is so slow that two other computers have finished the workunit before yours does?

Does your computer finish workunits before their deadlines?
Arnav Sood

Joined: 20 Aug 18
Posts: 2
Credit: 11,782,086
RAC: 0
Message 89984 - Posted: 11 Dec 2018, 17:25:27 UTC

Have been unable to upload work units since yesterday (two have timed out). Keeps telling me "project backoff."

I'm on an iMac Pro 2017 running macOS 10.14 Mojave and BOINC 7.12
fcbrants
Joined: 25 Mar 13
Posts: 13
Credit: 3,933,177
RAC: 0
Message 89985 - Posted: 11 Dec 2018, 17:55:31 UTC - in response to Message 89984.  

I just checked my logs back to 12/10 15:00 CST & it looks like I've been uploading continuously, uninterrupted. Win64 Boinc 7.12.1.

Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 90000 - Posted: 14 Dec 2018, 19:35:41 UTC
Last modified: 14 Dec 2018, 20:10:20 UTC

I was away from home (of course), and Rosetta took out my i7-4770. Everything was frozen up. I have never seen that before for Rosetta.

Apparently it was this work unit:
https://boinc.bakerlab.org/result.php?resultid=1046921926

<core_client_version>7.12.0</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu @foldit_2006238_0004_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_foldit_2006238_0004_data.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2498717
Starting watchdog...
Watchdog active.
Starting watchdog...
Watchdog active.

ERROR: Unable to open database file for dun10 rotamer library: minirosetta_database/rotamer/shapovalov/StpDwn_0-0-0/cys.bbdep.rotamers.lib
ERROR:: Exit from: src/core/pack/dunbrack/RotamerLibrary.cc line: 1085
BACKTRACE:
[0xe8ca514]
[0xca17443]
[0xca178ce]
[0xca92145]
[0xc90133c]
[0xc9ef641]
[0xd019a4b]
[0xd3e6e18]
[0xd3eb9ce]
[0xc96b2d1]
[0xc963eb2]
[0xb7fef3f]
[0xac8f844]
[0x9404246]
[0x9299a6c]
[0xc232777]
[0xc234a84]
[0xc2f46c0]
[0xc2f323b]
[0x929e531]
[0x8054670]
[0xedcf791]
[0xedcf98d]
[0x8266087]
BOINC:: Error reading and gzipping output datafile: default.out
14:21:38 (2187): called boinc_finish(1)

</stderr_txt>
]]>

Rosetta is the only project I have running on that machine (limited to six cores, with two cores free); I don't even have a GPU installed.
It probably won't happen again, but once is enough.

EDIT: I updated Ubuntu 16.04, and upon reboot, picked up this in my BOINC log. I have never seen it before, and have no idea what it means.

6	Rosetta@home	12/14/2018 2:51:39 PM	[error] App version has unsupported platform i686-pc-linux-gnu; changing to x86_64-pc-linux-gnu	
7	Rosetta@home	12/14/2018 2:51:39 PM	[error] State file error: duplicate app version: minirosetta x86_64-pc-linux-gnu 378 	
8	Rosetta@home	12/14/2018 2:51:39 PM	[error] App version has unsupported platform i686-pc-linux-gnu; changing to x86_64-pc-linux-gnu	


But everything appears to be back to normal, and Rosetta is running OK now.
Killersocke@rosetta

Joined: 13 Nov 06
Posts: 29
Credit: 2,579,125
RAC: 0
Message 90001 - Posted: 14 Dec 2018, 20:37:32 UTC

To my surprise, my profile on the website shows 24 tasks sent to my PC.
In reality I have 10 in BOINC Manager.
What's going on there?

Application Rosetta 4.07
Name foldit_2006238_0005_fold_and_dock_SAVE_ALL_OUT_707998_5433
Status Suspended by user
Received

Application Rosetta 4.07
Name foldit_2006238_0002_fold_and_dock_SAVE_ALL_OUT_707992_5434
Status Suspended by user
Received

Application Rosetta 4.07
Name foldit_2006254_0004_fold_and_dock_SAVE_ALL_OUT_708044_5432
Status Suspended by user
Received
slots/2

Application Rosetta 4.07
Name foldit_2006238_0003_fold_and_dock_SAVE_ALL_OUT_707994_5434
Status Suspended by user
Received
slots/7

Application Rosetta 4.07
Name foldit_2006238_1059_fold_and_dock_SAVE_ALL_OUT_708020_5431
Status Suspended by user
Received
slots/5

Application Rosetta 4.07
Name foldit_2006238_1059_fold_and_dock_SAVE_ALL_OUT_708020_4988
Status Suspended by user
Received
slots/4

Application Rosetta 4.07
Name foldit_2006254_0002_fold_and_dock_SAVE_ALL_OUT_708040_5432
Status Suspended by user
Received
slots/3

Application Rosetta 4.07
Name foldit_2006254_0003_fold_and_dock_SAVE_ALL_OUT_708042_5432
Status Active
Received
slots/6

Application Rosetta 4.07
Name foldit_2006238_0004_fold_and_dock_SAVE_ALL_OUT_707996_5434
Status Active
Received
slots/11

Application Rosetta 4.07
Name foldit_2006238_0005_fold_and_dock_SAVE_ALL_OUT_707998_5434
Status Active
Received
slots/13
jjch

Joined: 10 Nov 13
Posts: 14
Credit: 441,016,712
RAC: 34,620
Message 90006 - Posted: 16 Dec 2018, 18:38:59 UTC - in response to Message 90001.  
Last modified: 16 Dec 2018, 18:41:29 UTC

I think I may be experiencing a similar issue.

Recently I noticed that the work-in-progress value appeared to be approximately double the normal number of work units I have running at a time.

To troubleshoot this, I set Rosetta to no new tasks and let them run out. Checking Boincstats, I no longer have any work left on any host.

According to Rosetta, I currently have a total of 1709 tasks in progress. For example, host 1770544 is not running any Rosetta tasks, yet its in-progress count is 216.

https://boinc.bakerlab.org/rosetta/results.php?hostid=1770544&offset=0&show_names=0&state=1&appid=

I did try resetting the project on that host, but it didn't make any difference. My impression is that there is a problem on the Rosetta server side and it isn't updating the task status properly.

I think we need the Rosetta programming team to look into this further.
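One way to pin down which tasks are affected is to compare the server's in-progress list for a host with what the client actually holds. The sketch below assumes both lists have been collected by hand as plain task names (for example from the results page and from BOINC Manager). The function name is hypothetical and it does not talk to any server.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Tasks the server lists as "in progress" for a host that the local client
// does not have. std::set keeps both inputs sorted, as set_difference requires.
std::vector<std::string> missing_on_client(
        const std::set<std::string>& server_in_progress,
        const std::set<std::string>& client_tasks) {
    std::vector<std::string> ghosts;
    std::set_difference(server_in_progress.begin(), server_in_progress.end(),
                        client_tasks.begin(), client_tasks.end(),
                        std::back_inserter(ghosts));
    return ghosts;
}
```

A non-empty result for a host that has run dry, as with the 216 tasks reported for host 1770544, would be consistent with the server-side counts being stale rather than the client hiding work.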

©2024 University of Washington
https://www.bakerlab.org