Questions and Answers : Unix/Linux : R@H works, but all COVID-19 tasks fail
Author | Message |
---|---|
sspseudoo Joined: 4 Mar 20 Posts: 7 Credit: 23,843 RAC: 0 |
Hello everyone, I use Rosetta@home on Fedora 31 and it works. See https://boinc.bakerlab.org/rosetta/results.php?userid=2083373 My problem is that as soon as there is a COVID-19 task, the calculation fails. This is the output in the command line: [...] 26-Mar-2020 10:09:04 [Rosetta@home] Starting task 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0 [... nothing regarding COVID-19] 26-Mar-2020 10:10:28 [Rosetta@home] Computation for task 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0 finished 26-Mar-2020 10:10:28 [Rosetta@home] Output file 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0_r336008625_0 for task 2uc6gr8g_jhr_design1_COVID-19_SAVE_ALL_OUT_903414_1_0 absent [...] Command line output when I start BOINC: $ boinc 26-Mar-2020 10:08:31 [---] Starting BOINC client version 7.16.1 for x86_64-pc-linux-gnu 26-Mar-2020 10:08:31 [---] log flags: file_xfer, sched_ops, task 26-Mar-2020 10:08:31 [---] Libraries: libcurl/7.66.0 OpenSSL/1.1.1d-fips zlib/1.2.11 brotli/1.0.7 libidn2/2.3.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 26-Mar-2020 10:08:31 [---] Data directory: /home/x 26-Mar-2020 10:08:32 [---] OpenCL CPU: pthread-AMD Athlon(tm) II X4 620 Processor (OpenCL driver vendor: The pocl project, driver version 1.5-pre, device version OpenCL 1.2 pocl HSTR: pthread-x86_64-unknown-linux-gnu-amdfam10) 26-Mar-2020 10:08:32 [---] No usable GPUs found 26-Mar-2020 10:08:32 [---] [libc detection] gathered: 2.30, GNU libc 26-Mar-2020 10:08:32 [---] Host name: x-2017-1.local 26-Mar-2020 10:08:32 [---] Processor: 4 AuthenticAMD AMD Athlon(tm) II X4 620 Processor [Family 16 Model 5 Stepping 2] 26-Mar-2020 10:08:32 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save 26-Mar-2020 10:08:32 [---] OS: Linux Fedora: Fedora release 31 (Thirty One) [5.5.10-200.fc31.x86_64|libc 2.30 (GNU libc)] 26-Mar-2020 10:08:32 [---] Memory: 3.59 GB physical, 3.74 GB virtual 26-Mar-2020 10:08:32 [---] Disk: 82.67 GB total, 3.70 GB free 26-Mar-2020 10:08:32 [---] Local time is UTC +1 hours 26-Mar-2020 10:08:32 [---] No general preferences found - using defaults 26-Mar-2020 10:08:32 [---] Reading preferences override file 26-Mar-2020 10:08:32 [---] Preferences: 26-Mar-2020 10:08:32 [---] max memory usage when active: 1837.91 MB 26-Mar-2020 10:08:32 [---] max memory usage when idle: 3308.23 MB 26-Mar-2020 10:08:32 [---] max disk usage: 6.90 GB 26-Mar-2020 10:08:32 [---] don't use GPU while active 26-Mar-2020 10:08:32 [---] suspend work if non-BOINC CPU load exceeds 25% 26-Mar-2020 10:08:32 [---] (to change preferences, visit a project web site or select Preferences in the Manager) 26-Mar-2020 10:08:32 [---] Setting up project and slot directories 26-Mar-2020 10:08:32 [---] Checking active tasks 26-Mar-2020 10:08:32 [Rosetta@home] URL https://boinc.bakerlab.org/rosetta/; Computer ID 3780697; resource share 100 26-Mar-2020 10:08:32 [---] Setting up GUI RPC socket 26-Mar-2020 10:08:32 [---] Checking presence of 29 project files 26-Mar-2020 10:08:32 Initialization completed 26-Mar-2020 10:08:40 [Rosetta@home] project resumed by user [...] What can I do to debug this problem? Which log flags should I enable? https://boinc.berkeley.edu/wiki/Client_configuration#Logging_flags Thanks in advance! |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The report from your WU, such as this one shows: Stderr output <core_client_version>7.16.1</core_client_version> <![CDATA[ <message> process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 2uc6gr8g_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 2uc6gr8g_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1372451 Starting watchdog... Watchdog active. </stderr_txt> ]]> Are you overclocking this machine? Is the memory stable in this machine? Given that it runs for less than a minute, I wouldn't think that is enough time for the task to have filled its own memory space. Rosetta Moderator: Mod.Sense |
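The signal number in the `<message>` element above can be translated into a name with the shell's built-in `kill -l`. This is only a minimal sketch, assuming a Linux host (signal numbering can differ on other systems); the helper name `signal_name` is made up for illustration and is not part of BOINC or Rosetta.

```shell
#!/bin/sh
# signal_name NUMBER
# Hypothetical helper: prints the name of a signal number as the
# current shell's kill built-in reports it.
signal_name() {
    kill -l "$1"
}

# On Linux x86_64, signal 11 is a segmentation fault (SIGSEGV).
signal_name 11
```

The name only says how the process died, not why; the stderr section of the task report is still needed to narrow down the cause.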
sspseudoo Joined: 4 Mar 20 Posts: 7 Credit: 23,843 RAC: 0 |
Thank you for the fast answer! I do not overclock and the machine is stable: there are usually no crashes, and most other tasks complete without problems. But there is only 4 GB of RAM in total and I'm also running Firefox with many tabs. There is also not much disk space left on my home partition (only 4 GB left, on an old but good Samsung SSD). Could that be a problem? Is there a way to find out whether RAM or disk space is the limiting factor? Would logging flags help? |
Alexey Vazhnov Joined: 26 Mar 20 Posts: 2 Credit: 0 RAC: 0 |
Hello! I have the same problem: all tasks work fine except tasks like 3xu2pj0j_jhr_design1_COVID-19_SAVE_ALL_OUT_903165_1 , which fail with "Computation error". Xubuntu 19.10 amd64 Linux kernel 5.3.0-42 Boinc 7.16.3 installed from official Ubuntu repository CPU: AMD Phenom II X6 1055T RAM: 8GB Space for Boinc = 20GB Tasks here: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3825844 Message is the same: process got signal 11</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 4il2au3a_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 4il2au3a_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3254684 Starting watchdog... Watchdog active. |
ToKamaK Joined: 27 Mar 20 Posts: 4 Credit: 5,600,147 RAC: 2,916 |
Greetings, I witnessed similar crashes when I joined the project yesterday evening. My configuration includes an AMD Phenom(tm) II X4 945 Processor, which interestingly is also of the amdfam10 architecture. Also, I'm running Debian Sid, which currently ships glibc 2.30, if that matters. Here are my observations: * Affected jobs were of type COVID-19, running with the Rosetta engine version 4.08, crashing at around 1% execution; * I have a COVID-19 job currently past 20% execution, but running with Rosetta 4.07; * I saw non-COVID-19 jobs in the task list that ran successfully with Rosetta 4.08; * All other kinds of jobs apparently run successfully. Hope this helps |
sspseudoo Joined: 4 Mar 20 Posts: 7 Credit: 23,843 RAC: 0 |
I just upgraded my RAM to 12 GB and made some space free on my SSD (now 39 GB free space), and the COVID-tasks still crash. I checked my CPU with mprime torture test and my RAM with memtest86, both without problems, everything seems stable. Apart from those COVID-tasks no crashes occur when using the computer. [...] 27-Mar-2020 15:51:51 [---] [libc detection] gathered: 2.30, GNU libc 27-Mar-2020 15:51:51 [---] Host name: x-2017-1.local 27-Mar-2020 15:51:51 [---] Processor: 4 AuthenticAMD AMD Athlon(tm) II X4 620 Processor [Family 16 Model 5 Stepping 2] 27-Mar-2020 15:51:51 [---] Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save 27-Mar-2020 15:51:51 [---] OS: Linux Fedora: Fedora release 31 (Thirty One) [5.5.10-200.fc31.x86_64|libc 2.30 (GNU libc)] 27-Mar-2020 15:51:51 [---] Memory: 11.45 GB physical, 3.74 GB virtual 27-Mar-2020 15:51:51 [---] Disk: 82.67 GB total, 39.07 GB free [...] Is it possible to make a backtrace of the crash and would it be helpful? Thanks in advance! |
sspseudoo Joined: 4 Mar 20 Posts: 7 Credit: 23,843 RAC: 0 |
Is this the same problem? https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13658 And is it a good idea to add ralph@home in the BOINC client now to help testing the new adjusted binaries, when they are available? https://ralph.bakerlab.org/ |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Is this the same problem? That is certainly one cause of a COVID-19 task failing. But there seem to be others, such as machines that run for a longer period of time and then report out-of-memory errors. Yes, it is a good time to have Ralph active. The work will only be available once in a while, but just let BOINC keep trying to get work. Rosetta Moderator: Mod.Sense |
ToKamaK Joined: 27 Mar 20 Posts: 4 Credit: 5,600,147 RAC: 2,916 |
Good day, I confirm that SSSE3 is not supported, at least by the AMD Phenom II. Trying to assemble and execute the following code triggers an Illegal instruction error: $ cat ssse3-test.S .globl main main: pshufb %xmm1,%xmm0 ret $ gcc -o ssse3-test ssse3-test.S $ ./ssse3-test Illegal instruction I registered with ralph@home in the hope that this helps with further testing. Kind Regards Edited to add that a CPU supporting SSSE3 might still see this small program crash due to a Segmentation fault. But if said CPU triggers this error instead of Illegal instruction, then it means that it supports SSSE3. |
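The same question can also be answered without assembling anything, by looking for the `ssse3` flag in the kernel's CPU feature list. The following is a minimal sketch, assuming a Linux host that exposes `/proc/cpuinfo`; the function name `has_cpu_flag` is made up for illustration. The flag is matched as a whole word so that `sse3` and `ssse3` are not confused (on Linux, SSE3 itself appears as `pni`).

```shell
#!/bin/sh
# has_cpu_flag FLAG [FLAGS_STRING]
# Hypothetical helper: succeeds if FLAG appears as a whole word in
# FLAGS_STRING. Without FLAGS_STRING, the first "flags" line of
# /proc/cpuinfo is used (Linux only).
has_cpu_flag() {
    flag=$1
    if [ "$#" -ge 2 ]; then
        flags=$2
    else
        flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
    fi
    # Unquoted echo collapses tabs and repeated spaces into single spaces.
    case " $(echo $flags) " in
        *" $flag "*) return 0 ;;
        *) return 1 ;;
    esac
}

if has_cpu_flag ssse3; then
    echo "ssse3: listed"
else
    echo "ssse3: not listed"
fi
```

The "Processor features" line in the BOINC startup log posted earlier in this thread carries the same information: for the Athlon II X4 620 it lists `pni` and `sse4a` but no `ssse3`.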
Alexey Vazhnov Joined: 26 Mar 20 Posts: 2 Credit: 0 RAC: 0 |
@ToKamaK, thank you very much for the investigation! |
ToKamaK Joined: 27 Mar 20 Posts: 4 Credit: 5,600,147 RAC: 2,916 |
It is ShimmerFairy who deserves the thanks; identifying this particular issue required quite some patience. I only double-checked this particular CPU family with hardware I had at hand. :) |
ToKamaK Joined: 27 Mar 20 Posts: 4 Credit: 5,600,147 RAC: 2,916 |
Greetings, I just wanted to confirm my first batch of Rosetta 4.12 jobs are reported to have completed and validated successfully. Kind Regards. |