R@H Scientists/Coders: An analysis of the Rosetta binaries...

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,563,789
RAC: 6,764
Message 78874 - Posted: 1 Oct 2015, 5:44:12 UTC - in response to Message 78873.  
Last modified: 1 Oct 2015, 5:46:14 UTC

Hi, as you may have noticed we're only talking to ourselves here.


:-(
The latest post from an admin on this thread was on 17 Jul 2015.
I hope they read the forum and are "taking inspiration" from other projects for app optimization.
ID: 78874
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 78875 - Posted: 1 Oct 2015, 15:22:47 UTC

Crunch3r and Sesef seem to almost enjoy optimizing code. If the Rosetta admins could just send them the code for "academic purposes", let them play with it (keeping correct results in mind, of course), and then have it sent back to double-check its validity, we could actually be getting somewhere.
In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals.
ID: 78875
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 78878 - Posted: 2 Oct 2015, 23:35:22 UTC - in response to Message 78875.  

In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals.

They are amazed by it.
DENIS Optimized app

Note that Sesef did the Windows version, while Crunch3r added versions for Linux and OSX (and an alternative Windows version). But that is a newer project, and whether the same tricks work for the Rosetta code, which has been around for a while, is another question.
ID: 78878
Betting Slip
Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 78879 - Posted: 3 Oct 2015, 22:16:29 UTC - in response to Message 78878.  

In DENIS... their admins didn't do a thing and are VERY grateful for the work of these two individuals.

They are amazed by it.
DENIS Optimized app

Note that Sesef did the Windows version, while Crunch3r added versions for Linux and OSX (and an alternative Windows version). But that is a newer project, and whether the same tricks work for the Rosetta code, which has been around for a while, is another question.


The last time David replied to this thread was in July; that in itself speaks volumes.
ID: 78879
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,563,789
RAC: 6,764
Message 78900 - Posted: 12 Oct 2015, 15:02:51 UTC - in response to Message 78879.  

The last time David replied to this thread was in July; that in itself speaks volumes.


But he replied on Ralph's forum 3 days ago:
I've been too busy to look into optimizations. We do have one volunteer helping us out however. I'll keep you all posted if anything develops.

ID: 78900
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 78901 - Posted: 12 Oct 2015, 15:19:21 UTC

He also just updated the Linux binary to 64-bit.
IDK if I already mentioned it, but it seems that David is doing all the work... Science/BOINC/Public Relations.
ID: 78901
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,563,789
RAC: 6,764
Message 78902 - Posted: 12 Oct 2015, 18:32:05 UTC - in response to Message 78901.  

IDK if I already mentioned it, but it seems that David is doing all the work... Science/BOINC/Public Relations.


That's strange, 'cause there is a large "ecosystem" around Rosetta@home, like the Baker Lab, RosettaCommons and Rosetta Design Group... but I don't know how they are involved in the BOINC application's development.
ID: 78902
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,563,789
RAC: 6,764
Message 79612 - Posted: 24 Feb 2016, 9:21:29 UTC - in response to Message 78901.  

He also just updated the Linux binary to 64-bit.


Is it really 64-bit or, like Windows, is it a "simple rename"?
It's a pity this thread is almost dead.
ID: 79612
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 79616 - Posted: 24 Feb 2016, 13:57:22 UTC - in response to Message 79612.  

He also just updated the Linux binary to 64-bit.


Is it really 64-bit or, like Windows, is it a "simple rename"?
It's a pity this thread is almost dead.


On Linux there is normally little room for bluffing, as the 'file' command is available in nearly every Linux distribution:

# file minirosetta_graphics_3.71_x86_64-pc-linux-gnu
minirosetta_graphics_3.71_x86_64-pc-linux-gnu: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=3567ba7fc3a0583e4aefb1e58607a065a8128568, stripped
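
For anyone curious what 'file' is looking at here: the 32-bit/64-bit distinction comes from the EI_CLASS byte in the ELF header (byte 4, right after the 0x7f 'E' 'L' 'F' magic). Below is a minimal C++ sketch of just that check. The function names elf_bits and elf_bits_of_file are made up for illustration, and this inspects only the first five bytes; it is not a reimplementation of 'file'.

```cpp
#include <array>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>

// Returns 32 or 64 for an ELF image, or 0 if the bytes do not look like ELF.
// Bytes 0-3 are the magic 0x7f 'E' 'L' 'F'; byte 4 (EI_CLASS) is
// 1 for ELFCLASS32 and 2 for ELFCLASS64.
int elf_bits(const unsigned char* header, std::size_t size) {
    if (size < 5) return 0;
    if (header[0] != 0x7f || header[1] != 'E' ||
        header[2] != 'L' || header[3] != 'F') {
        return 0;
    }
    if (header[4] == 1) return 32;
    if (header[4] == 2) return 64;
    return 0;
}

// Hypothetical helper: reads the first five bytes of `path` and classifies
// them. An unreadable or too-short file yields 0.
int elf_bits_of_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::array<unsigned char, 5> header{};
    in.read(reinterpret_cast<char*>(header.data()),
            static_cast<std::streamsize>(header.size()));
    return elf_bits(header.data(), static_cast<std::size_t>(in.gcount()));
}
```

Calling elf_bits_of_file on the minirosetta binary above should agree with what 'file' reports, since both read the same header byte. A renamed 32-bit binary would still report 32.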
ID: 79616
Sebastian M. Bobrecki
Joined: 9 Oct 05
Posts: 4
Credit: 6,286,377
RAC: 0
Message 79645 - Posted: 27 Feb 2016, 18:47:05 UTC - in response to Message 79616.  
Last modified: 27 Feb 2016, 18:49:48 UTC

...
# file minirosetta_graphics_3.71_x86_64-pc-linux-gnu
minirosetta_graphics_3.71_x86_64-pc-linux-gnu: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=3567ba7fc3a0583e4aefb1e58607a065a8128568, stripped

And uses SSE2:
...
  491a1a:   f2 0f 10 54 24 30       movsd  0x30(%rsp),%xmm2
  491a20:   f2 44 0f 10 5c 24 40    movsd  0x40(%rsp),%xmm11
  491a27:   f2 0f 5c e8             subsd  %xmm0,%xmm5
  491a2b:   f2 0f 58 74 24 50       addsd  0x50(%rsp),%xmm6
  491a31:   f2 44 0f 59 fa          mulsd  %xmm2,%xmm15
  491a36:   f2 44 0f 59 c2          mulsd  %xmm2,%xmm8
  491a3b:   f2 45 0f 5c dc          subsd  %xmm12,%xmm11
  491a40:   f2 0f 59 da             mulsd  %xmm2,%xmm3
  491a44:   f2 0f 59 d1             mulsd  %xmm1,%xmm2
  491a48:   f2 0f 11 7c 24 48       movsd  %xmm7,0x48(%rsp)
  491a4e:   f2 44 0f 11 7c 24 68    movsd  %xmm15,0x68(%rsp)
...

ID: 79645
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,002,200
RAC: 5,721
Message 79648 - Posted: 28 Feb 2016, 5:46:31 UTC - in response to Message 79645.  

...
# file minirosetta_graphics_3.71_x86_64-pc-linux-gnu
minirosetta_graphics_3.71_x86_64-pc-linux-gnu: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.18, BuildID[sha1]=3567ba7fc3a0583e4aefb1e58607a065a8128568, stripped

And uses SSE2:
...
  491a1a:   f2 0f 10 54 24 30       movsd  0x30(%rsp),%xmm2
  491a20:   f2 44 0f 10 5c 24 40    movsd  0x40(%rsp),%xmm11
  491a27:   f2 0f 5c e8             subsd  %xmm0,%xmm5
  491a2b:   f2 0f 58 74 24 50       addsd  0x50(%rsp),%xmm6
  491a31:   f2 44 0f 59 fa          mulsd  %xmm2,%xmm15
  491a36:   f2 44 0f 59 c2          mulsd  %xmm2,%xmm8
  491a3b:   f2 45 0f 5c dc          subsd  %xmm12,%xmm11
  491a40:   f2 0f 59 da             mulsd  %xmm2,%xmm3
  491a44:   f2 0f 59 d1             mulsd  %xmm1,%xmm2
  491a48:   f2 0f 11 7c 24 48       movsd  %xmm7,0x48(%rsp)
  491a4e:   f2 44 0f 11 7c 24 68    movsd  %xmm15,0x68(%rsp)
...


It may be SSE2+ code but it is pretty ugly stuff.

The "sd" ending of the instructions means that those instructions are SCALAR, DOUBLE precision. They only use the lower half of the XMM registers.

4 of the 11 instructions are movsd loads/stores moving temporaries from/to the stack via (%rsp), and the addsd also reads a stack operand. Each of those memory accesses can take as long as or longer than the actual computation, so by my estimate this code fragment spends more than 50% of its time reading/saving temporaries.
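
To make the scalar-versus-packed point concrete, here is a small sketch using the SSE2 intrinsics from <emmintrin.h>. The wrapper names mul_low_lane_only and mul_both_lanes are made up for illustration; _mm_mul_sd corresponds to the mulsd instruction in the listing and _mm_mul_pd to its packed counterpart mulpd. It assumes an x86 compiler with SSE2 enabled (the default on x86-64).

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar double multiply (mulsd): only the low 64-bit lane is multiplied.
// The high lane of the result is simply copied from `a`.
void mul_low_lane_only(const double a[2], const double b[2], double out[2]) {
    _mm_storeu_pd(out, _mm_mul_sd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}

// Packed double multiply (mulpd): both 64-bit lanes are multiplied
// by a single instruction.
void mul_both_lanes(const double a[2], const double b[2], double out[2]) {
    _mm_storeu_pd(out, _mm_mul_pd(_mm_loadu_pd(a), _mm_loadu_pd(b)));
}
```

With a = {2, 3} and b = {4, 5}, the scalar version produces {8, 3} and the packed version produces {8, 15}. Whether a compiler can use the packed form in Rosetta depends on how the data is laid out and on whether it can prove the operations are independent.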


ID: 79648
dcdc
Joined: 3 Nov 05
Posts: 1831
Credit: 119,536,094
RAC: 6,149
Message 79650 - Posted: 28 Feb 2016, 10:16:22 UTC - in response to Message 79648.  



It may be SSE2+ code but it is pretty ugly stuff.

The "sd" ending of the instructions means that those instructions are SCALAR, DOUBLE precision. They only use the lower half of the XMM registers.

4 of the 11 instructions are reading and writing (movsd (%rsp) ) temporaries from/to the stack. Each of those instructions takes longer than the actual computation. This code fragment spends more than 50% of its time reading/saving temporaries.

Is it easy to improve that code?

ID: 79650
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,002,200
RAC: 5,721
Message 79655 - Posted: 28 Feb 2016, 19:05:23 UTC - in response to Message 79650.  



It may be SSE2+ code but it is pretty ugly stuff.

The "sd" ending of the instructions means that those instructions are SCALAR, DOUBLE precision. They only use the lower half of the XMM registers.

4 of the 11 instructions are reading and writing (movsd (%rsp) ) temporaries from/to the stack. Each of those instructions takes longer than the actual computation. This code fragment spends more than 50% of its time reading/saving temporaries.

Is it easy to improve that code?


Yes.
A newer version of the compiler and different compile-time options should give a 30%-40% improvement without much trouble. Beyond that, it probably means digging deeper into the Rosetta algorithms.

I am still trying to figure out how to tell what a "better" Rosetta run looks like. If I solve the same Rosetta "problem" with two different binaries, I get different results (not unusual for floating-point code) ... so which "answer" is "better"? 8-)

The Rosetta algorithms seem unstable, and I am seeing a 10%-20% variation in run time depending on the initial seed value. David gave me a test problem, and I am building the non-BOINC standalone static binary version. If I vary the -jran seed through 12345, 12346, 12347, 12348, ... the compute times vary noticeably on the same otherwise idle computer. There may be an opportunity to SIEVE problems: run quick tests to "qualify" or "eliminate" work units, then spend compute cycles on the most promising candidates.


I am currently using 2 compilers to generate static binaries.
1.) gcc 4.9.1 from the devtoolset-3 Software Collections repository and
2.) Intel icc 2016



David has been very good about finding time to look at results. I would be happy to work with other developers too and share my findings. Rosetta uploaded a new source tree and I am just starting to look at it.



Humorously, the fastest Linux binary so far seems to be a GENERIC 32-bit binary built with icc and the options:
-O3 -mtune=generic -march=core2

The current gcc binaries with the complex compile-time options solve the test problem in 15,000 to 18,000 seconds. The GENERIC icc 32-bit binary (and several others) takes about 12,000 seconds, i.e. roughly 20%-33% less time.

It is easy to "over tune" because few developers have the courage to remove "optimizations" since it is hard to verify that they are no longer needed. I always start by removing them. 8-)








ID: 79655
Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 79660 - Posted: 29 Feb 2016, 22:24:02 UTC

@ RJS5 - I was looking through the IPD's YouTube channel and found this video lecture that introduces a bunch of the idiosyncrasies of the Rosetta code stack. A lot of it is really basic, but there may be some nuggets of useful info in here:


Video link here: https://www.youtube.com/watch?v=Cyk6W6YtWUQ
ID: 79660
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,563,789
RAC: 6,764
Message 79665 - Posted: 1 Mar 2016, 11:01:49 UTC - in response to Message 79660.  

Video link here: https://www.youtube.com/watch?v=Cyk6W6YtWUQ


VERY interesting!
ID: 79665
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 79679 - Posted: 1 Mar 2016, 21:05:34 UTC

rjs5: Glad you are making some headway. The seed, and the specific value used for the first model computed, determine how it runs. So you would need the same seed and value on each run to expect comparable results. That is part of what makes issuing credits difficult for R@h. The same protein with a different run value may take more periods of diving deeper and so take longer to run. I should think the best solutions are the ones where it keeps thinking there may be more to milk out of the model it is working on at the time.

To draw an analogy, just based on my outside understanding: if I'm navigating through a maze of one-way and dead-end streets in a city, trying to get as close as possible to a water tower, some starting routes may progress gradually closer and closer and so merit continuing to try more variations. Other starting routes will rather quickly take you away from the tower, to the point where the decision is made to cut bait and try the next model instead of pursuing the current one further. So that second one may finish significantly sooner than the first. So you really have to compare runs from exactly the same starting point.
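
A toy illustration of why the seed changes the run time, written as a C++ sketch. This is not Rosetta's algorithm: the function steps_until_abandoned, the "patience" rule, and all parameter values are invented for the example. It only shows that a search which gives up after a fixed number of non-improving steps does a seed-dependent amount of work, while the same seed always reproduces the same amount.

```cpp
#include <cstdint>
#include <random>

// Toy stand-in for one "model": draw pseudo-random candidate scores and keep
// the lowest seen so far. Give up after `patience` consecutive candidates
// that fail to improve on the best, or after `max_steps` in total.
// Returns the number of candidates examined, i.e. the "work" done.
std::uint64_t steps_until_abandoned(std::uint32_t seed, int patience = 50,
                                    std::uint64_t max_steps = 1000000) {
    std::mt19937 rng(seed);
    std::mt19937::result_type best = std::mt19937::max();
    std::uint64_t steps = 0;
    int stale = 0;  // consecutive candidates without improvement
    while (stale < patience && steps < max_steps) {
        ++steps;
        const std::mt19937::result_type candidate = rng();
        if (candidate < best) {
            best = candidate;  // still getting closer to the "water tower"
            stale = 0;
        } else {
            ++stale;           // no progress on this try
        }
    }
    return steps;
}
```

Running this for a range of seeds (for example 12345, 12346, 12347, ...) and comparing the returned counts mirrors the -jran experiment described earlier in the thread. A fair timing comparison between two binaries therefore needs the same seed on both sides.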
Rosetta Moderator: Mod.Sense
ID: 79679
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,563,789
RAC: 6,764
Message 79689 - Posted: 4 Mar 2016, 10:20:46 UTC - in response to Message 79679.  

If I'm navigating through a maze of one-way and dead-end streets in a city, trying to get as close as possible to a water tower, some starting routes may progress gradually closer and closer and so merit continuing to try more variations. Other starting routes will rather quickly take you away from the tower, to the point where the decision is made to cut bait and try the next model instead of pursuing the current one further. So that second one may finish significantly sooner than the first. So you really have to compare runs from exactly the same starting point.


This is the reason for Fold.it.
Humans are better at finding solutions to this kind of problem.
ID: 79689
sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 79691 - Posted: 4 Mar 2016, 13:59:02 UTC - in response to Message 79655.  

"rjs5" wrote:

Humorously, the fastest Linux binary so far seems to be a GENERIC 32-bit binary built with icc and the options:
-O3 -mtune=generic -march=core2

The current gcc binaries with the complex compile time options solve the test problem in 15,000 to 18,000 seconds. The GENERIC icc 32-bit binary (and several others) take about 12,000 seconds.


I wonder whether it has something to do with the on-chip L1 and L2 caches.
32-bit code is smaller, and if something fits well within the cache lines, it may be significantly faster than if everything is fetched from memory.
ID: 79691
rjs5
Joined: 22 Nov 10
Posts: 273
Credit: 23,002,200
RAC: 5,721
Message 79692 - Posted: 4 Mar 2016, 15:52:44 UTC - in response to Message 79691.  

"rjs5" wrote:

Humorously, the fastest Linux binary so far seems to be a GENERIC 32-bit binary built with icc and the options:
-O3 -mtune=generic -march=core2

The current gcc binaries with the complex compile time options solve the test problem in 15,000 to 18,000 seconds. The GENERIC icc 32-bit binary (and several others) take about 12,000 seconds.


I wonder whether it has something to do with the on-chip L1 and L2 caches.
32-bit code is smaller, and if something fits well within the cache lines, it may be significantly faster than if everything is fetched from memory.



I am still playing, and I have several compute-years invested in the process.
Faster performance will not matter if the quality of the result is low. Waiting on David for the good/bad news.

The video link that Timo posted was encouraging. (THANKS!)

Since Rosetta is well structured, they may be able to "encourage" SSE/AVX parallel operation by redefining the low-level vectors. They frequently use an XYZ spatial coordinate. They might be able to change that XYZ vector definition to a larger XYZd (d = don't care), so that the XYZ values, now processed individually, could be processed in "pairs" (SSE2, two doubles per 128-bit register) or "quads" (AVX/AVX2, four doubles per 256-bit register).

I have posted code clips where the 3 scalar XYZ loads, operations, and stores are carried out serially.

If they added a bogus "4th dimension" onto their vector, the compiler might be able to generate SSE/AVX/AVX2 loads/operations/stores that would be faster.
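
A rough C++ sketch of what such a padded coordinate type could look like. Everything here is hypothetical: XYZd, make_xyz, sub and dot3 are illustrative names, not Rosetta's actual vector class, and whether the compiler really emits packed SSE/AVX instructions for the loop depends on the compiler and the options used.

```cpp
#include <cstddef>

// Hypothetical padded coordinate: x, y, z plus one unused "don't care" lane,
// aligned so that all four doubles fit one 256-bit (AVX) register or two
// 128-bit (SSE2) registers.
struct alignas(32) XYZd {
    double v[4];
};

static_assert(sizeof(XYZd) == 32, "four doubles, no extra padding expected");

inline XYZd make_xyz(double x, double y, double z) {
    return XYZd{{x, y, z, 0.0}};  // keep the spare lane at a harmless value
}

// Fixed trip count of 4 over contiguous, aligned doubles. An optimizing
// compiler may turn this into packed subtract instructions, but that is
// not guaranteed.
inline XYZd sub(const XYZd& a, const XYZd& b) {
    XYZd r;
    for (std::size_t i = 0; i < 4; ++i) {
        r.v[i] = a.v[i] - b.v[i];
    }
    return r;
}

// Reductions must still ignore the spare lane.
inline double dot3(const XYZd& a, const XYZd& b) {
    return a.v[0] * b.v[0] + a.v[1] * b.v[1] + a.v[2] * b.v[2];
}
```

The trade-off is that every coordinate grows from 24 to 32 bytes, so arrays of coordinates take a third more memory and cache space. That would need to be measured against any gain from packed arithmetic.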








ID: 79692
Computing for Humanity (Account)
Joined: 8 Jan 16
Posts: 2
Credit: 479,911,526
RAC: 67,696
Message 79714 - Posted: 8 Mar 2016, 4:19:44 UTC - in response to Message 78425.  

Would anyone want to or would know someone who would want to help optimize the Rosetta software, at the compiler level or even at the code level? It is freely available through an Academic license but we can also provide it to individuals under the same license agreement.

This offers a great opportunity for you all to contribute directly and have direct positive impact for all researchers who use Rosetta and the optimizations would carry over to Rosetta@home.

David K


We might be able to help. It is perhaps better to discuss via PM, as some details are under NDA.
ID: 79714



©2024 University of Washington
https://www.bakerlab.org