Message boards : Number crunching : Rosetta 4.1+ and 4.2+
Author | Message |
---|---|
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Same with the only DNAN task I have had so far: 1270191608. The examples here have all failed within seconds of starting, so no great loss… |
Kompakki Joined: 14 Jul 14 Posts: 3 Credit: 19,316,100 RAC: 19,796 |
I want to inform the software developers of R@H and also need some help with tasks which failed. During the last few days one of my hosts has failed about 500 tasks. For example, tasks bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s_1013915_3_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=1271705979) and cd28_1yjd_graft_v1_SAVE_ALL_OUT_IGNORE_THE_REST_0qk3jo8r_1013410_2_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=1271706673) both failed with error: Computation error. One of the stderr messages looks like: <core_client_version>7.9.3</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--covid_at3_design_boinc_v1.xml @bmpr2_att3_flags -in:file:silent bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s.zip @bmpr2_att3_9_SAVE_ALL_OUT_IGNORE_THE_REST_2fq8je7s.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1429430 Using database: database_357d5d93529_n_methyl/minirosetta_database </stderr_txt> ]]> Host details: AMD Phenom(tm) II X6 1090T, Linux Ubuntu, Ubuntu 18.04.5 LTS [5.4.0-48-generic|libc 2.27 (Ubuntu GLIBC 2.27-3ubuntu1.2)]. Other Linux hosts have successfully completed tasks with names starting like bmpr2.... or cd28.... What's wrong with that one computer? Why does it fail tasks? |
Falconet Joined: 9 Mar 09 Posts: 353 Credit: 1,227,479 RAC: 1,836 |
"process got signal 11" Could be your RAM. Maybe clean any dust, reseat the RAM sticks and run memtest. |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,154,260 RAC: 4,107 |
You are using a REALLY old version of Boinc, is that by design or you just haven't updated? That works. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
“What's wrong with that one computer? Why does it fail tasks?” With your computer hidden, it's difficult to even guess. Overclocked too much? Overvolted too much? Not enough RAM? Faulty RAM module? Faulty power supply? Overheating? All are possible causes. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
“Just had a new Task crash out. I've already completed several others of the same type without issue, so I'm waiting to see if my Wingman has the same problem with this Work Unit as well.” Nice to know it wasn't me; the WU crashed out in seconds for my Wingman as well. Grant Darwin NT |
Bill F Joined: 29 Jan 08 Posts: 44 Credit: 1,561,577 RAC: 647 |
“You are using a REALLY old version of Boinc, is that by design or you just haven't updated?” “BOINC is just the manager (it's the project's applications that actually process work), and if something isn't broken, then don't fix it.” You may want to read some of the release notes from 7.6.22 forward to current again. The BOINC manager has also upgraded the libraries that the applications use and the GPU tables for newer GPUs, as well as improving Task scheduling and Task time estimates. It may not be broke but it might be improvable. Bill F |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
“You may want to read some of the release notes 7.6.22 forward to current again. The BOINC manager has also upgraded Library's that the applications use and GPU tables for newer GPU's as well as improvements in Task scheduling and Task time estimates. It may not be broke but it might be improvable.” I'm a one-project cruncher, so better Scheduling, GPU support and time estimates aren't an issue for me here at Rosetta. But getting rid of the "Old URL" message every time the Scheduler is contacted may yet be reason enough. Grant Darwin NT |
James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
“Name: DNANX53C_DnaN_53C_refine_26_stripped_relax_-1_-1_3_45942938_4mers_0002_SAVE_ALL_OUT_1014073_330_0 Errors: Too many errors (may have bug) Too many total results.” Well, my wingman errored out in approximately the same time with the same errors. It appears this type of task requires something many hosts are missing. I do note that at least 1 or 2 of these tasks DID validate on my system(s). Not too many of them rec'd, thank goodness. |
JLDun Joined: 31 May 08 Posts: 7 Credit: 68,063 RAC: 447 |
One of my (few) errors. drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e_1016145_3_0 command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_arm-android-linux-gnu -run:protocol jd2_scripting -parser:protocol c2_design.xml @flags_drhicks1 -in:file:silent drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip drhicks1_derroids_torricks_all_1_SAVE_ALL_OUT_IGNORE_THE_REST_1fm0nm1e.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3387230 |
James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Name: CLPPPJS8_255_ClpP1P2_stub_justPHE_0001_SAVE_ALL_OUT_1018533_362_1 Application: Rosetta v4.20 windows_x86_64 Device: 1759960 Task: 1284630939. WU: 1150838856 Status: Error while computing Exit status: -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION Errors: Too many errors (may have bug). Too many total results. Stderr output: (unknown error) - exit code -1073741819 (0xc0000005) I was wingman on this new-to-me type of WU, and we both got the same error. Run time was about 15 sec., so no big loss. I don't see any more like this in my current queue. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
“One of my (few) errors.” I had a similar error on one of those Tasks as well. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
“Name: CLPPPJS8_255_ClpP1P2_stub_justPHE_0001_SAVE_ALL_OUT_1018533_362_1” Same here. Grant Darwin NT |
äxl Joined: 30 Dec 08 Posts: 11 Credit: 497,080 RAC: 0 |
So many "Error while computing". Should I detach this computer? https://boinc.bakerlab.org/rosetta/results.php?userid=294942 Running BOINC because: 1) I'm using 100% green energy (no certificates or other non-sense) 2) My computer runs mostly anyway (due to BT and other non-sense) 3) To help |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
“So many "Error while computing". Should I detach this computer?” Yep. You need to figure out what's wrong with it, then re-attach to the project. Signal 11 errors indicate a memory problem, but the cause can also be overheating (CPU, PSU, motherboard), faulty hardware (RAM, PSU, motherboard), too much overclocking (memory, CPU), etc, etc, etc... Grant Darwin NT |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,555,377 RAC: 6,312 |
A lot of errors with "miniprotein_relax" WUs: 1305402903, 1305403039, etc. command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax7_SAVE_ALL_OUT_IGNORE_THE_REST_8cs2zu5j.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax7_SAVE_ALL_OUT_IGNORE_THE_REST_8cs2zu5j.zip @miniprotein_relax7_SAVE_ALL_OUT_IGNORE_THE_REST_8cs2zu5j.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3798651 |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1677 Credit: 17,755,824 RAC: 22,866 |
“A lot of errors of "miniprotein_relax" wus” Just had a look and all of mine so far have resulted in computation errors in 50 min or less. No Valid results yet. - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00007FF614BE8316 read attempt to address 0xFFFFFFFF - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x000000000000010A Grant Darwin NT |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
Same here; several have failed with an access violation after a little over an hour. I’ve got some more that have been running for 5 hours so far; let’s see whether they manage to complete… |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
“let’s see whether they manage to complete…” They did. (Example.) The failed ones might just have been certain input values exposing a bug in an algorithm. |
Brian Nixon Joined: 12 Apr 20 Posts: 293 Credit: 8,432,366 RAC: 0 |
And of course all the failed ones get resent… I’ve just received a couple of dozen. Debating whether to abort them all. |
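
Two of the error codes quoted in this thread can be decoded mechanically: "process got signal 11" from the Linux hosts and "exit code -1073741819 (0xc0000005)" from the Windows hosts. The following is only a minimal Python sketch, not part of BOINC or Rosetta; the helper names `describe_signal` and `to_ntstatus` are hypothetical. It assumes the signal number is looked up on the same kind of OS that reported it (signal numbering is platform-specific; on Linux, 11 is SIGSEGV, a segmentation fault) and that the Windows exit code is a signed 32-bit view of an NTSTATUS value.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number is not a known signal on this platform.
        return f"unknown signal {signum}"


def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(describe_signal(11))       # name of signal 11 on this platform
print(to_ntstatus(-1073741819))  # the exit status quoted for the Windows tasks
```

Both codes describe the same class of failure, an invalid memory access by the application. On their own they do not say whether the cause is faulty hardware or a bug in the task, which is why comparing against the wingman's result, as several posters did, is useful.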
©2024 University of Washington
https://www.bakerlab.org