R@H Scientists/Coders: An analysis of the Rosetta binaries...

Author	Message
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2149 Credit: 12,838,075 RAC: 6,146	Message 78874 - Posted: 1 Oct 2015, 5:44:12 UTC - in response to Message 78873. Last modified: 1 Oct 2015, 5:46:14 UTC
Hi, as you may have noticed we're only talking to ourselves here. :-( The latest post from an admin on this thread was on 17 Jul 2015. I hope they read the forum and are "taking inspiration" from other projects for app optimization.

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 78875 - Posted: 1 Oct 2015, 15:22:47 UTC
Crunch3r and Sesef seem to almost enjoy optimizing code. If the Rosetta admins could just send them the code for "academic purposes", let them play with it (keeping correct results in mind, of course), and then have it sent back to double-check its validity, we could actually be getting somewhere. In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals.

Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0	Message 78878 - Posted: 2 Oct 2015, 23:35:22 UTC - in response to Message 78875.
> In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals.
They are amazed by it.
DENIS Optimized app
Note that Sesef did the Windows version, while Crunch3r added versions for Linux and OSX (and an alternative Windows version). But that is a newer project, and whether the same tricks work for the Rosetta code, which has been around for a while, is another question.

Betting Slip Joined: 26 Sep 05 Posts: 71 Credit: 5,702,246 RAC: 0	Message 78879 - Posted: 3 Oct 2015, 22:16:29 UTC - in response to Message 78878.
>> In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals.
> They are amazed by it.
> DENIS Optimized app
> Note that Sesef did the Windows version, while Crunch3r added versions for Linux and OSX (and an alternative Windows version). But that is a newer project, and whether the same tricks work for the Rosetta code, which has been around for a while, is another question.
The last time David replied to this thread was in July; that in itself speaks volumes.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2149 Credit: 12,838,075 RAC: 6,146	Message 78900 - Posted: 12 Oct 2015, 15:02:51 UTC - in response to Message 78879.
> The last time David replied to this thread was in July; that in itself speaks volumes.
But he replied on Ralph's forum 3 days ago:
> I've been too busy to look into optimizations. We do have one volunteer helping us out however. I'll keep you all posted if anything develops.

Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0	Message 78901 - Posted: 12 Oct 2015, 15:19:21 UTC
He also just updated the Linux binary to 64-bit. IDK if I already mentioned it, but it seems that David is doing all the work... Science/BOINC/Public Relations.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2149 Credit: 12,838,075 RAC: 6,146	Message 78902 - Posted: 12 Oct 2015, 18:32:05 UTC - in response to Message 78901.
> IDK if I already mentioned it, but it seems that David is doing all the work... Science/BOINC/Public Relations.
That's strange, because there is a large "ecosystem" around Rosetta@home, like BakerLabs, RosettaCommons and Rosetta Design Group... but I don't know how they are involved in the development of the BOINC application.

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2149 Credit: 12,838,075 RAC: 6,146	Message 79612 - Posted: 24 Feb 2016, 9:21:29 UTC - in response to Message 78901.
> He also just updated the Linux binary to 64-bit.
Is it really 64-bit or, like on Windows, is it a "simple rename"? It's a pity this thread is almost dead.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 79616 - Posted: 24 Feb 2016, 13:57:22 UTC - in response to Message 79612.
>> He also just updated the Linux binary to 64-bit.
> Is really 64 bit or, like windows, is a "simple rename"? It's a pity this thread is almost dead
On Linux there is normally little room for bluff, as the 'file' command is available in just about every Linux distribution:
    # file minirosetta_graphics_3.71_x86_64-pc-linux-gnu
    minirosetta_graphics_3.71_x86_64-pc-linux-gnu: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=3567ba7fc3a0583e4aefb1e58607a065a8128568, stripped
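For anyone without the 'file' command at hand, the same two facts (32-bit vs 64-bit, target machine) can be read directly from the first 20 bytes of the binary. The following is only a minimal sketch: the function names describe_elf_header and describe_elf_file are made up for illustration, and it only knows the two machine codes relevant to this thread (3 for x86, 62 for x86-64).

```python
import struct

ELF_MAGIC = b"\x7fELF"
EI_CLASS_NAMES = {1: "32-bit", 2: "64-bit"}   # e_ident[4], EI_CLASS
E_MACHINE_NAMES = {3: "x86", 62: "x86-64"}    # e_machine values seen on PCs


def describe_elf_header(header: bytes) -> str:
    """Describe an ELF file from its first 20 bytes (hypothetical helper)."""
    if len(header) < 20 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    elf_class = EI_CLASS_NAMES.get(header[4], "unknown class")
    # e_ident[5] (EI_DATA): 1 means little-endian, 2 means big-endian.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    machine_name = E_MACHINE_NAMES.get(machine, "machine %d" % machine)
    return "ELF %s, %s" % (elf_class, machine_name)


def describe_elf_file(path: str) -> str:
    """Read just enough of the file at `path` to describe it."""
    with open(path, "rb") as binary:
        return describe_elf_header(binary.read(20))
```

Calling describe_elf_file on the downloaded minirosetta binary should agree with the 'file' output above if the binary really is a 64-bit build; a renamed 32-bit executable would report "ELF 32-bit, x86" regardless of its file name.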

Sebastian M. Bobrecki Joined: 9 Oct 05 Posts: 4 Credit: 6,286,377 RAC: 0	Message 79645 - Posted: 27 Feb 2016, 18:47:05 UTC - in response to Message 79616. Last modified: 27 Feb 2016, 18:49:48 UTC
> ...
>     # file minirosetta_graphics_3.71_x86_64-pc-linux-gnu
>     minirosetta_graphics_3.71_x86_64-pc-linux-gnu: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=3567ba7fc3a0583e4aefb1e58607a065a8128568, stripped
And uses SSE2:
    ...
    491a1a: f2 0f 10 54 24 30       movsd  0x30(%rsp),%xmm2
    491a20: f2 44 0f 10 5c 24 40    movsd  0x40(%rsp),%xmm11
    491a27: f2 0f 5c e8             subsd  %xmm0,%xmm5
    491a2b: f2 0f 58 74 24 50       addsd  0x50(%rsp),%xmm6
    491a31: f2 44 0f 59 fa          mulsd  %xmm2,%xmm15
    491a36: f2 44 0f 59 c2          mulsd  %xmm2,%xmm8
    491a3b: f2 45 0f 5c dc          subsd  %xmm12,%xmm11
    491a40: f2 0f 59 da             mulsd  %xmm2,%xmm3
    491a44: f2 0f 59 d1             mulsd  %xmm1,%xmm2
    491a48: f2 0f 11 7c 24 48       movsd  %xmm7,0x48(%rsp)
    491a4e: f2 44 0f 11 7c 24 68    movsd  %xmm15,0x68(%rsp)
    ...

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 79648 - Posted: 28 Feb 2016, 5:46:31 UTC - in response to Message 79645.
> ...
>     # file minirosetta_graphics_3.71_x86_64-pc-linux-gnu
>     minirosetta_graphics_3.71_x86_64-pc-linux-gnu: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=3567ba7fc3a0583e4aefb1e58607a065a8128568, stripped
> And uses SSE2:
>     ...
>     491a1a: f2 0f 10 54 24 30       movsd  0x30(%rsp),%xmm2
>     491a20: f2 44 0f 10 5c 24 40    movsd  0x40(%rsp),%xmm11
>     491a27: f2 0f 5c e8             subsd  %xmm0,%xmm5
>     491a2b: f2 0f 58 74 24 50       addsd  0x50(%rsp),%xmm6
>     491a31: f2 44 0f 59 fa          mulsd  %xmm2,%xmm15
>     491a36: f2 44 0f 59 c2          mulsd  %xmm2,%xmm8
>     491a3b: f2 45 0f 5c dc          subsd  %xmm12,%xmm11
>     491a40: f2 0f 59 da             mulsd  %xmm2,%xmm3
>     491a44: f2 0f 59 d1             mulsd  %xmm1,%xmm2
>     491a48: f2 0f 11 7c 24 48       movsd  %xmm7,0x48(%rsp)
>     491a4e: f2 44 0f 11 7c 24 68    movsd  %xmm15,0x68(%rsp)
>     ...
It may be SSE2+ code but it is pretty ugly stuff. The "sd" ending of the instructions means that those instructions are SCALAR, DOUBLE precision. They only use the lower half of the XMM registers. 4 of the 11 instructions are movsd instructions reading or writing temporaries on the stack (the (%rsp) operands). Each of those instructions takes longer than the actual computation. This code fragment spends more than 50% of its time reading/saving temporaries.
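The instruction counts in that analysis can be checked mechanically against the quoted listing. The sketch below is only a static tally: it parses the eleven objdump-style lines and counts scalar-double mnemonics and stack operands. It says nothing about how long each instruction takes, and the helper names parse_instruction and summarize are invented for this illustration.

```python
import re

LISTING = """\
491a1a: f2 0f 10 54 24 30 movsd 0x30(%rsp),%xmm2
491a20: f2 44 0f 10 5c 24 40 movsd 0x40(%rsp),%xmm11
491a27: f2 0f 5c e8 subsd %xmm0,%xmm5
491a2b: f2 0f 58 74 24 50 addsd 0x50(%rsp),%xmm6
491a31: f2 44 0f 59 fa mulsd %xmm2,%xmm15
491a36: f2 44 0f 59 c2 mulsd %xmm2,%xmm8
491a3b: f2 45 0f 5c dc subsd %xmm12,%xmm11
491a40: f2 0f 59 da mulsd %xmm2,%xmm3
491a44: f2 0f 59 d1 mulsd %xmm1,%xmm2
491a48: f2 0f 11 7c 24 48 movsd %xmm7,0x48(%rsp)
491a4e: f2 44 0f 11 7c 24 68 movsd %xmm15,0x68(%rsp)
"""

HEX_BYTE = re.compile(r"[0-9a-f]{2}")


def parse_instruction(line):
    """Split one objdump-style line into (mnemonic, operands)."""
    fields = line.split()[1:]  # drop the leading "address:" field
    index = 0
    while index < len(fields) and HEX_BYTE.fullmatch(fields[index]):
        index += 1  # skip the raw opcode bytes
    return fields[index], " ".join(fields[index + 1:])


def summarize(listing):
    """Count instruction kinds in the listing (static counts, not timings)."""
    instructions = [parse_instruction(line)
                    for line in listing.splitlines() if line.strip()]
    return {
        "total": len(instructions),
        # Mnemonics ending in "sd" operate on a single double per register.
        "scalar_double": sum(m.endswith("sd") for m, _ in instructions),
        # movsd with a (%rsp) operand: a temporary loaded from or stored to the stack.
        "stack_moves": sum(m == "movsd" and "(%rsp)" in ops
                           for m, ops in instructions),
        # Any instruction touching the stack, including addsd with a memory operand.
        "stack_operands": sum("(%rsp)" in ops for _, ops in instructions),
    }


print(summarize(LISTING))
```

On this fragment the tally matches the "4 of the 11" figure for movsd stack moves; the addsd with a 0x50(%rsp) operand is a fifth instruction that reads from the stack.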

dcdc Joined: 3 Nov 05 Posts: 1835 Credit: 124,950,919 RAC: 11,199	Message 79650 - Posted: 28 Feb 2016, 10:16:22 UTC - in response to Message 79648.
> It may be SSE2+ code but it is pretty ugly stuff. The "sd" ending of the instructions means that those instructions are SCALAR, DOUBLE precision. They only use the lower half of the XMM registers. 4 of the 11 instructions are reading and writing (movsd (%rsp) ) temporaries from/to the stack. Each of those instructions takes longer than the actual computation. This code fragment spends more than 50% of its time reading/saving temporaries.
Is it easy to improve that code?

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 79655 - Posted: 28 Feb 2016, 19:05:23 UTC - in response to Message 79650.
>> It may be SSE2+ code but it is pretty ugly stuff. The "sd" ending of the instructions means that those instructions are SCALAR, DOUBLE precision. They only use the lower half of the XMM registers. 4 of the 11 instructions are reading and writing (movsd (%rsp) ) temporaries from/to the stack. Each of those instructions takes longer than the actual computation. This code fragment spends more than 50% of its time reading/saving temporaries.
> Is it easy to improve that code?
Yes. A newer version of the compiler and different compile-time options should give a 30%-40% improvement without much trouble. Beyond that, it probably means digging deeper into the Rosetta algorithms.
I am still trying to figure out how to tell what a "better" Rosetta run looks like. If I solve the same Rosetta "problem" with two different binaries, I get different results (not unusual for floating-point code) ... so which "answer" is "better"? 8-)
The Rosetta algorithms seem unstable, and I am seeing a 10%-20% variation in performance depending on the initial seed value that is applied. David gave me a test problem, and I am building the non-BOINC standalone static binary version. If I vary the -jran seed (12345, 12346, 12347, 12348, ...), the compute times vary noticeably on the same idling computer. There may be some opportunity to SIEVE problems: run quick tests to "qualify" or "eliminate" work units and spend compute cycles on the most promising candidates.
I am currently using 2 compilers to generate static binaries:
1.) gcc 4.9.1 from the devtoolset-3 Linux Software Collections repository
2.) Intel icc 2016
David has been very good about finding time to look at results. I would be happy to work with other developers too and share my findings. Rosetta uploaded a new source tree and I am just starting to look at it.
Humorously, the fastest Linux binary so far seems to be a GENERIC 32-bit binary built with icc and the options: -O3 -mtune=generic -march=core2
The current gcc binaries with the complex compile-time options solve the test problem in 15,000 to 18,000 seconds. The GENERIC icc 32-bit binary (and several others) take about 12,000 seconds. It is easy to "over-tune" because few developers have the courage to remove "optimizations", since it is hard to verify that they are no longer needed. I always start by removing them. 8-)
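To put the quoted timings in perspective, the arithmetic can be spelled out. The sketch below uses only the three figures from the post above (15,000 and 18,000 seconds for the gcc builds, about 12,000 seconds for the generic icc build); the function names are illustrative, and because run time also depends on the seed, these are rough bounds and not a controlled benchmark.

```python
def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def time_saved_percent(baseline_seconds, candidate_seconds):
    """Share of the baseline run time that the candidate saves, in percent."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# Figures quoted in the post above (same test problem, different binaries).
GCC_RANGE_SECONDS = (15_000, 18_000)
ICC_GENERIC_SECONDS = 12_000

for gcc_seconds in GCC_RANGE_SECONDS:
    print(
        "gcc %d s vs icc %d s: %.2fx faster, %.1f%% less time"
        % (
            gcc_seconds,
            ICC_GENERIC_SECONDS,
            speedup(gcc_seconds, ICC_GENERIC_SECONDS),
            time_saved_percent(gcc_seconds, ICC_GENERIC_SECONDS),
        )
    )
```

So the generic icc binary comes out somewhere between 1.25x and 1.5x faster on this one test problem, which is larger than the 10%-20% seed-to-seed variation mentioned in the same post.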

Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,662,635 RAC: 7	Message 79660 - Posted: 29 Feb 2016, 22:24:02 UTC
@ RJS5 - Was looking through the IPD's YouTube channel and found this video lecture that introduces a bunch of the idiosyncrasies of the Rosetta code stack. A lot of it is really basic, but there may be some nuggets of useful info in here.
Video link here: https://www.youtube.com/watch?v=Cyk6W6YtWUQ

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2149 Credit: 12,838,075 RAC: 6,146	Message 79665 - Posted: 1 Mar 2016, 11:01:49 UTC - in response to Message 79660.
> Video link here: https://www.youtube.com/watch?v=Cyk6W6YtWUQ
VERY interesting!

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 79679 - Posted: 1 Mar 2016, 21:05:34 UTC
rjs5: Glad you are making some headway. The seed, and the specific value used for the first model computed, determine how it runs. So you would need the same seed and value for each run to expect comparable results. That is part of what makes issuing credits difficult for R@h. The same protein with a different run value may take more periods of diving deeper and take longer to run. I should think the best solutions are the ones that keep thinking there may be more to milk out of the model they are working on at the time.
To draw an analogy, just based on my outside understanding... If I'm navigating through a maze of one-way and dead-end streets in a city, trying to get as close as possible to a water tower, some starting routes may progress gradually closer and closer and so merit continuing to try more variations. Other starting routes will rather quickly take you away from the tower, to the point where the decision is made to cut bait and try the next model instead of pursuing the current one further. So that second one may finish significantly sooner than the first. So you really have to compare runs from exactly the same starting point.
Rosetta Moderator: Mod.Sense
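The water-tower analogy can be mimicked with a toy search. This is emphatically not Rosetta's algorithm, only an assumed stand-in: a seeded random walk on a line that tries to get close to zero and gives up after a fixed number of steps without improvement. The name toy_model_run and the parameters max_steps and patience are invented for this sketch.

```python
import random


def toy_model_run(seed, max_steps=10_000, patience=50):
    """Seeded random walk toward 0 that stops when it stops improving.

    Returns (steps_taken, best_distance). A toy stand-in, not Rosetta.
    """
    rng = random.Random(seed)              # all randomness comes from the seed
    position = rng.uniform(-100.0, 100.0)  # the "starting route"
    best_distance = abs(position)
    steps = 0
    steps_without_improvement = 0
    while steps < max_steps and steps_without_improvement < patience:
        candidate = position + rng.uniform(-1.0, 1.0)
        steps += 1
        if abs(candidate) < best_distance:
            # Getting closer to the tower: keep going from here.
            position = candidate
            best_distance = abs(candidate)
            steps_without_improvement = 0
        else:
            # No progress; after `patience` such steps we "cut bait".
            steps_without_improvement += 1
    return steps, best_distance


if __name__ == "__main__":
    for seed in range(12345, 12349):
        steps, best = toy_model_run(seed)
        print("seed %d: %d steps, best distance %.4f" % (seed, steps, best))
```

Running the same seed twice gives identical step counts, while neighbouring seeds generally stop after different numbers of steps, which is why timing comparisons between two binaries only make sense when both are run from exactly the same seed.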

[VENETO] boboviz Joined: 1 Dec 05 Posts: 2149 Credit: 12,838,075 RAC: 6,146	Message 79689 - Posted: 4 Mar 2016, 10:20:46 UTC - in response to Message 79679.
> If I'm navaigating through a maze of one way and dead end streets in a city, trying to get as close as possible to a water tower, some starting routes may progress gradually closer and closer and so merit continuing to try more variations. Other starting routes will rather quickly take you away from the tower, to the point where the decision is made to cut bait and try the next model instead of pursuing the current one further. So that second one may finish significantly sooner than the first. So you really have to compare runs of exactly the same starting point.
This is the reason for Fold.it: humans are better at finding solutions to this kind of problem.

sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0	Message 79691 - Posted: 4 Mar 2016, 13:59:02 UTC - in response to Message 79655.
"rjs5" wrote:
> Humorously, the fastest Linux binary so far seems to be a GENERIC 32-bit binary built with icc and the options: -O3 -mtune=generic -march=core2
> The current gcc binaries with the complex compile time options solve the test problem in 15,000 to 18,000 seconds. The GENERIC icc 32-bit binary (and several others) take about 12,000 seconds.
I'm wondering whether it may have something to do with the on-chip L1 and L2 caches. 32-bit code is smaller, and if something fits well within the cache lines, it may be significantly faster than if everything is fetched from memory.

rjs5 Joined: 22 Nov 10 Posts: 274 Credit: 23,730,845 RAC: 0	Message 79692 - Posted: 4 Mar 2016, 15:52:44 UTC - in response to Message 79691.
>> "rjs5" wrote: Humorously, the fastest Linux binary so far seems to be a GENERIC 32-bit binary built with icc and the options: -O3 -mtune=generic -march=core2
>> The current gcc binaries with the complex compile time options solve the test problem in 15,000 to 18,000 seconds. The GENERIC icc 32-bit binary (and several others) take about 12,000 seconds.
> i'm thinking if it may have something to do with the on-chip L1 & L2 cache. 32 bits codes are smaller and if something fits well within the cachelines, it may be significantly faster than if everything is fetched from memory
I am still playing, and I have several compute years invested in the process. Faster performance will not matter if the quality of the result is low. Waiting on David for the good/bad news.
The video link that Timo posted was encouraging. (THANKS!) Since Rosetta is well structured, they may be able to "encourage" SSE/AVX parallel operation by redefining the low-level vectors. They frequently use an XYZ spatial coordinate. They might be able to change that XYZ vector definition to a larger XYZd (d = don't care), so that the processing of XYZ values individually could be changed to "pairs" (SSE/AVX) or "quad" (AVX2).
I have posted code clips where the 3 scalar XYZ loads, operations and stores are conducted serially. If they added a bogus "4th dimension" onto their vector, the compiler might be able to generate SSE/AVX/AVX2 loads/operations/stores that would be faster.
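The layout side of the XYZd idea can be illustrated without touching Rosetta's source. The sketch below uses ctypes to compare an assumed three-double XYZ record with a padded four-double XYZd record; both class definitions are hypothetical and are not taken from the Rosetta code. It shows sizes and array strides only, and does not prove that a compiler would actually emit packed instructions for the padded type.

```python
import ctypes


class XYZ(ctypes.Structure):
    """Hypothetical 3-component coordinate: three doubles, 24 bytes."""
    _fields_ = [("x", ctypes.c_double),
                ("y", ctypes.c_double),
                ("z", ctypes.c_double)]


class XYZd(ctypes.Structure):
    """Hypothetical padded coordinate: x, y, z plus a "don't care" slot."""
    _fields_ = [("x", ctypes.c_double),
                ("y", ctypes.c_double),
                ("z", ctypes.c_double),
                ("d", ctypes.c_double)]


def element_offsets(struct_type, count):
    """Byte offsets of the first `count` elements in a packed array."""
    return [index * ctypes.sizeof(struct_type) for index in range(count)]


def all_aligned(offsets, boundary):
    """True if every offset is a multiple of `boundary` bytes."""
    return all(offset % boundary == 0 for offset in offsets)


SSE_REGISTER_BYTES = 16  # one XMM register holds two doubles
AVX_REGISTER_BYTES = 32  # one YMM register holds four doubles

for struct_type in (XYZ, XYZd):
    offsets = element_offsets(struct_type, 4)
    print(struct_type.__name__,
          "size:", ctypes.sizeof(struct_type),
          "offsets:", offsets,
          "16-byte aligned:", all_aligned(offsets, SSE_REGISTER_BYTES),
          "32-byte aligned:", all_aligned(offsets, AVX_REGISTER_BYTES))
```

With the padded record every element of an array starts on a 32-byte multiple, so one point fills exactly two XMM registers or one YMM register; with the unpadded record the elements straddle those boundaries. The cost is one extra double per point, a 33% larger memory footprint, which could work against the cache argument raised earlier in the thread.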

Computing for Humanity (Account) Joined: 8 Jan 16 Posts: 2 Credit: 492,878,894 RAC: 25,852	Message 79714 - Posted: 8 Mar 2016, 4:19:44 UTC - in response to Message 78425.
> Would anyone want to or would know someone who would want to help optimize the Rosetta software, at the compiler level or even at the code level? It is freely available through an Academic license but we can also provide it to individuals under the same license agreement. This offers a great opportunity for you all to contribute directly and have direct positive impact for all researchers who use Rosetta and the optimizations would carry over to Rosetta@home.
> David K
We might be able to help. Perhaps better to discuss via PM, some details under NDA.