Message boards : Number crunching : SSE/SSE2, and models.
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0
I did a search for SSE on the message boards and did not find anything relevant, so I decided to post. I was wondering whether the Rosetta application makes use of SSE instructions. I figured it did not, since some CPUs might not support them, and there seems to be only one application for each operating system. I attached to the process in a debugging session, but was unable to find any SSE instructions in the idle threads' opcodes while all threads were halted.

I am no expert and have very little knowledge of SSE. However, its major use seems to be parallel processing of floating-point numbers. The problem is that if you are doing a basic floating-point computation and the next computation needs the result, SSE provides no help (I could be wrong). Only when you have two basic computations with unrelated results does SSE become very powerful. So, with a little programming background, I was still curious: since the nature of the simulation seems to be exploring in the dark, using random numbers to seed the exploration, maybe more than one model (maybe four) could be run on the same thread (not multiple threads) and execute in parallel using the SSE instruction set, to gain maybe four times the speed.

Now, thinking about that, I am sure the Rosetta application is not that straightforward. I am pretty sure it contains lots of special conditions and evaluations to determine whether this atom and that atom can be next to each other, so attempting to run another simulation on the same thread in parallel to take advantage of the SSE instructions could cause some chaos, not to mention the mad method it would take to avoid introducing new bugs without splitting the Rosetta application into two completely separate source trees just for an SSE and a non-SSE application. So it boils down to: 1. I really don't care that much, but if anyone thinks there is a lead in my two cents, here is a thread for it. =) rofl

[edit] Just to give a little more insight to anyone who knows even a little C/C++:

    #include <xmmintrin.h>

    int main(void)
    {
        // built-in SSE type: four packed single-precision floats
        __m128 a = {0.0f, 1.0f, 2.0f, 3.0f};
        __m128 b = {0.0f, 1.0f, 2.0f, 3.0f};
        __m128 c;

        // this is an intrinsic function; when compiled it makes no function call.
        // four single-precision values are added using one instruction.
        c = _mm_add_ps(a, b);

        // the same work with no SSE extensions used: four scalar additions.
        float x[4] = {0.0f, 1.0f, 2.0f, 3.0f};
        float y[4] = {0.0f, 1.0f, 2.0f, 3.0f};
        float z[4];
        z[0] = x[0] + y[0];
        z[1] = x[1] + y[1];
        z[2] = x[2] + y[2];
        z[3] = x[3] + y[3];
        return 0;
    }
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0
#include "stdafx.h" #include <xmmintrin.h> __m128 mtf[3]; int mtfop = 0; int mtfcnt = 0; template <int THREADID, int THREADCNT> __forceinline float mtfadd(float a, float b) { if(mtfop < 1) { mtfop = 1; mtf[0].m128_f32[THREADID] = a; mtf[1].m128_f32[THREADID] = b; mtfcnt = 1; // pause thread quickly (wait) (keep low cpu cycles). (need a atomic instruction) // when woken, it returns its own result. return mtf[3].m128_f32[THREADID]; } if(mtfop == 1) { mtf[0].m128_f32[THREADID] = a; mtf[1].m128_f32[THREADID] = b; if(mtfcnt == (THREADCNT-1)) { // execute operation using SSE instructions mtf[3] = _mm_add_ps(mtf[0], mtf[1]); // wake paused threads quickly (keep low cpu cycles).(need a atomic instruction) return mtf[3].m128_f32[THREADID]; }else{ mtfcnt++; // pause thread quickly (wait) (keep low cpu cycles). (need a atomic instruction) // when woken, it returns its own result. return mtf[3].m128_f32[THREADID]; } } // just do its operation the old style. return a+b; } template <int THREADID, int THREADCNT> class xfloat { public: float p; __forceinline void operator=(const float in) { p = in; } __forceinline xfloat& operator+(const xfloat &in) { p = mtfadd<THREADID,THREADCNT>(p, in.p); return *this; } }; #define float xfloat<0,4> int _tmain(int argc, _TCHAR* argv[]) { float a, b, c; a = 0.0; b = 1.0; c = 2.0; a = b + c; return 0; } Tell compiler to optimize code by enabling inline, __forceinline makes the compiler do it weather it wants to or not. Another way to not break the source tree too much, would be at critical areas of the computation. Implement a thread switching mechanism independant of the operating system. When ever the simulation for a virtual thread reached a critical point, switch to another thread, until all three threads have reached that point in the calculation then start combining operations. |
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0
I tried the code above, and it did not work as expected. Nevertheless, too bad the models could not use SSE, I suppose? It was extremely slow, due to all the semaphores I had to use to keep the threads from stepping on each other.
MikeMarsUK Send message Joined: 15 Jan 06 Posts: 121 Credit: 2,637,872 RAC: 0
I can't speak for Rosetta ('cause I haven't the foggiest!), but CPDN can't use SSE instructions because the precision is too low; it may be the same problem with Rosetta. If a project can use SSE/SSE2, then it's often simply a case of flipping compiler switches rather than making code changes, and the compiler will generate code which can run on both kinds of processor transparently. The problem with SSE/SSE2 is that it's designed for graphics rather than high-precision computation.
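For the curious, these are the sorts of switches involved. This list is from memory rather than from any Rosetta build, so treat it as illustrative and check your compiler's manual:

    # GCC: emit SSE2 and use it for scalar floating-point math
    gcc -O2 -msse2 -mfpmath=sse app.c

    # Microsoft Visual C++: allow the code generator to use SSE2
    cl /O2 /arch:SSE2 app.cpp

    # Intel compilers: -ax<code> builds a CPU-specific path plus a generic
    # fallback and picks between them at run time (the "transparent" case)
    icc -O2 -axW app.c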
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
Recently the project staff expressed their interest in optimizing the Rosetta app for different processors, but they don't have the time and expertise to do so. I read somewhere that Rosetta mostly uses single-precision floats, so it should be able to take advantage of SSE. However, akosf, the guy who did wonders for the app on Einstein, showed no interest in optimizing Rosetta. I'm sure you will receive the source code if you ask for it to try taking advantage of SSE. However, the app is in constant development and updated often, so it has to be considered how to apply optimizations while still allowing constant code change.
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0
MikeMarsUK, well, it will not run transparently on its own. If an application is compiled to use SSE/SSE2 instructions and runs on a processor that does not support them, it will raise an invalid-opcode exception, unless the compiler and linker produce separate versions of the functions, one using the processor features and one not. I have heard of such a thing, and I think it was Intel's C++ compiler?

I was wondering why the Rosetta@Home team has not compiled an application with SSE support and/or another with SSE2 support. It could be that SSE did not make a large enough improvement, but that could be because the compiler is not the god of optimizations, for lack of a better word. I don't think you are going to beat the compiler at optimizing register usage (at least not enough to be worthwhile), ordering instructions, placing data and code on page boundaries, or contracting and expanding loops. But you can optimize code in ways the compiler cannot, which may be more along the lines of what Rosetta@Home could do: use a little SSE in a smarter way if the compiler flag alone did not increase performance. Then again, their algorithms may just be too linear to put SSE to any use?

Yeah, Rosetta@Home only uses single-precision floats, which was one reason they had no need to produce a 64-bit binary; but the fact that only single precision is needed does make me interested in SSE's potential. I read an article a little while back about how lots of applications these days never get optimized to use these new processor features and instead run with legacy-type performance (I do not know this for sure). I did run through the Rosetta@Home application with a debugger, but who knows, maybe I just missed the SSE instructions, lol, and they already use them somewhere, somehow. =?)

I just stumbled onto this thread too: http://72.14.209.104/search?q=cache:f09Ekg0IOigJ:boinc.bakerlab.org/rosetta/forum_thread.php%3Fid%3D1084+SSE&hl=en&gl=us&ct=clnk&cd=1
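To make the non-transparent case concrete, here is roughly what a hand-rolled run-time check could look like. This is only a sketch I put together, not anything from Rosetta; the two add4 functions are stand-ins:

    #include <intrin.h>     // __cpuid (Microsoft compiler)
    #include <xmmintrin.h>  // SSE intrinsics
    #include <stdio.h>

    // CPUID leaf 1 reports SSE in EDX bit 25 (SSE2 is bit 26).
    int cpu_has_sse(void)
    {
        int info[4];
        __cpuid(info, 1);
        return (info[3] >> 25) & 1;
    }

    // The same work in scalar and SSE form.
    void add4_scalar(const float *a, const float *b, float *c)
    {
        for (int i = 0; i < 4; i++)
            c[i] = a[i] + b[i];
    }

    void add4_sse(const float *a, const float *b, float *c)
    {
        // unaligned loads/stores so the caller needs no special alignment
        _mm_storeu_ps(c, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
    }

    int main(void)
    {
        float a[4] = {0, 1, 2, 3}, b[4] = {4, 5, 6, 7}, c[4];
        // choose the path once; an old CPU never executes the SSE opcodes,
        // so no invalid-opcode exception is raised
        if (cpu_has_sse())
            add4_sse(a, b, c);
        else
            add4_scalar(a, b, c);
        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }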
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0
You aren't counting "instructions" the way the CPU must. Instruction pipelining does not let you perform 4 identical operations in a row. What it does is let you perform 4 DIFFERENT operations at once (i.e. one floating point, one integer, one comparison... I'm not exactly sure which labels are on the 4 pipes). And the compiler may reorder execution where it does not change the outcome of the program. So if it could pull some integer calculations from the surrounding code and intersperse them with the floats, then it would be taking advantage of the pipelining. Here is a reference; unfortunately it doesn't really fully discuss the causes of "bubbles" in the pipeline. A small illustration of the interleaving idea follows below.

Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/
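To make the "different operations" point concrete, here is a minimal illustration of why independent work helps the pipeline. This is my own sketch, nothing to do with Rosetta's actual code:

    // Summing with one accumulator: each add must wait for the previous one.
    float sum_one_chain(const float *x, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    // Two independent accumulators: the two dependency chains do not touch,
    // so the CPU can overlap them instead of stalling.
    float sum_two_chains(const float *x, int n)
    {
        float s0 = 0.0f, s1 = 0.0f;
        for (int i = 0; i + 1 < n; i += 2) {
            s0 += x[i];
            s1 += x[i + 1];
        }
        if (n & 1)
            s0 += x[n - 1];   // odd element left over
        return s0 + s1;
    }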
MikeMarsUK Send message Joined: 15 Jan 06 Posts: 121 Credit: 2,637,872 RAC: 0
MikeMarsUK, well, it will not run transparently on its own. If an application is compiled to use SSE/SSE2 instructions and runs on a processor that does not support them, it will raise an invalid-opcode exception, unless the compiler and linker produce separate versions of the functions, one using the processor features and one not. I have heard of such a thing, and I think it was Intel's C++ compiler?

The CPDN compiler is Intel Fortran, and I think that's exactly what it does: it keeps two versions of a block of code, one for SSE/SSE2 and the other for non-SSE. If Rosetta only needs single-precision computation then there may be scope for this optimisation; it probably depends on which language, compiler, etc. they are using.
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0
I know what you are saying, but let me compile something quickly:

    float a[4], b[4], result[4];

        result[0] = a[0] + b[0];
    00401736 D9 45 F0    fld   dword ptr [a]
    00401739 D8 45 E0    fadd  dword ptr [b]
    0040173C D9 5D D0    fstp  dword ptr [result]
        result[1] = a[1] + b[1];
    0040173F D9 45 F4    fld   dword ptr [ebp-0Ch]
    00401742 D8 45 E4    fadd  dword ptr [ebp-1Ch]
    00401745 D9 5D D4    fstp  dword ptr [ebp-2Ch]
        result[2] = a[2] + b[2];
    00401748 D9 45 F8    fld   dword ptr [ebp-8]
    0040174B D8 45 E8    fadd  dword ptr [ebp-18h]
    0040174E D9 5D D8    fstp  dword ptr [ebp-28h]
        result[3] = a[3] + b[3];
    00401751 D9 45 FC    fld   dword ptr [ebp-4]
    00401754 D8 45 EC    fadd  dword ptr [ebp-14h]
    00401757 D9 5D DC    fstp  dword ptr [ebp-24h]

That's twelve instructions.

    __m128 a, b, result;

        result = _mm_add_ps(a,b);
        // 30h - 20h is 10h, i.e. 16 bytes = 128 bits = 4 32-bit floats
    00401753 0F 28 45 E0    movaps xmm0, xmmword ptr [ebp-20h]
    00401757 0F 28 4D D0    movaps xmm1, xmmword ptr [ebp-30h]
    0040175B 0F 58 C8       addps  xmm1, xmm0
    0040175E 0F 29 4D C0    movaps xmmword ptr [ebp-40h], xmm1

That's four instructions. Also, we get xmm0-xmm7, 128 bits each, to store more floating-point values in, instead of constantly reading/writing/swapping through memory. SSE2 adds more instructions and flexibility, but my processor only supports SSE, not SSE2. SSE3 is another extension that adds some more functionality. The processor pipeline reads that one addps instruction; it really does not do four separate add operations. =) It is presumably doing some sort of shifting around internally, making one big calculation whose outcome is four values in one 128-bit register.

That makes sense about the Fortran compiler. I am just shooting in the dark, but a good explanation of how it could work is run-time function linking by the executable: it waits to link the processor-specific functions until the program is executed, and this is hidden from the developer. The only bad side is that it could *really* bloat the code. If you had a large function, the compiler would be forced to make two copies, or break it down into separate functions with call instructions, which costs CPU; but I am not debunking the idea at all, it depends on how much bloat. =?) You could also do manually what the Fortran compiler does, or, as seems a little more likely, the developers could just hand-write two separate functions, using SSE intrinsics in one and none in the other.

http://www.tc.cornell.edu/Services/Education/Topics/Optimization/SingleProcessor/2.7+DAXPY,+SAXPY.htm
http://www.intel80386.com/simd/mmx2-doc.html

Something common is vector math, which Rosetta@Home may use? Here is the same vector normalization compiled both ways:
SSE:

        __m128 Atom1 = {5.0,5.0,5.0,0.0}, Atom2 = {0.0,0.0,0.0,0.0};
    00401009 F3 0F 10 05 F4 20 40 00    movss  xmm0, dword ptr [__real@40a00000 (4020F4h)]
    00401011 D9 05 F4 20 40 00          fld    dword ptr [__real@40a00000 (4020F4h)]
    00401017 D9 14 24                   fst    dword ptr [esp]
    0040101A F3 0F 11 44 24 08          movss  dword ptr [esp+8], xmm0
    00401020 0F 57 C0                   xorps  xmm0, xmm0
    00401023 D9 5C 24 04                fstp   dword ptr [esp+4]
    00401027 F3 0F 11 44 24 10          movss  dword ptr [esp+10h], xmm0
    0040102D F3 0F 11 44 24 14          movss  dword ptr [esp+14h], xmm0
    00401033 F3 0F 11 44 24 18          movss  dword ptr [esp+18h], xmm0
    00401039 F3 0F 11 44 24 1C          movss  dword ptr [esp+1Ch], xmm0
        Atom1 = _mm_sub_ps(Atom1, Atom2);
    0040103F 0F 28 4C 24 10             movaps xmm1, xmmword ptr [esp+10h]
    00401044 F3 0F 11 44 24 0C          movss  dword ptr [esp+0Ch], xmm0
    0040104A 0F 28 04 24                movaps xmm0, xmmword ptr [esp]
    0040104E 0F 5C C1                   subps  xmm0, xmm1
    00401051 0F 29 04 24                movaps xmmword ptr [esp], xmm0
        Atom2.m128_f32[0] = Atom2.m128_f32[1] = Atom2.m128_f32[2] =
            sqrt(Atom1.m128_f32[0] * Atom1.m128_f32[0] +
                 Atom1.m128_f32[1] * Atom1.m128_f32[1] +
                 Atom1.m128_f32[2] * Atom1.m128_f32[2]);
    00401055 D9 44 24 08                fld    dword ptr [esp+8]
    00401059 DC C8                      fmul   st(0), st
    0040105B D9 44 24 04                fld    dword ptr [esp+4]
    0040105F DC C8                      fmul   st(0), st
    00401061 DE C1                      faddp  st(1), st
    00401063 D9 04 24                   fld    dword ptr [esp]
    00401066 DC C8                      fmul   st(0), st
    00401068 DE C1                      faddp  st(1), st
    0040106A D9 FA                      fsqrt
    0040106C D9 54 24 18                fst    dword ptr [esp+18h]
    00401070 D9 54 24 14                fst    dword ptr [esp+14h]
    00401074 D9 5C 24 10                fstp   dword ptr [esp+10h]
        Atom1 = _mm_div_ps(Atom1, Atom2);
    00401078 0F 28 4C 24 10             movaps xmm1, xmmword ptr [esp+10h]
    0040107D 0F 5E C1                   divps  xmm0, xmm1
    00401080 0F 29 04 24                movaps xmmword ptr [esp], xmm0
    00401084 F3 0F 10 44 24 0C          movss  xmm0, dword ptr [esp+0Ch]
    0040108A F3 0F 58 44 24 08          addss  xmm0, dword ptr [esp+8]

Thirty-three instructions.

No SSE:

        __m128 Atom1 = {5.0,5.0,5.0,0.0}, Atom2 = {0.0,0.0,0.0,0.0};
    00401783 D9 05 F4 20 40 00          fld    dword ptr [__real@40a00000 (4020F4h)]
    00401789 D9 5D E0                   fstp   dword ptr [ebp-20h]
    0040178C D9 05 F4 20 40 00          fld    dword ptr [__real@40a00000 (4020F4h)]
    00401792 D9 5D E4                   fstp   dword ptr [ebp-1Ch]
    00401795 D9 05 F4 20 40 00          fld    dword ptr [__real@40a00000 (4020F4h)]
    0040179B D9 5D E8                   fstp   dword ptr [ebp-18h]
    0040179E D9 EE                      fldz
    004017A0 D9 5D EC                   fstp   dword ptr [ebp-14h]
    004017A3 D9 EE                      fldz
    004017A5 D9 5D D0                   fstp   dword ptr [ebp-30h]
    004017A8 D9 EE                      fldz
    004017AA D9 5D D4                   fstp   dword ptr [ebp-2Ch]
    004017AD D9 EE                      fldz
    004017AF D9 5D D8                   fstp   dword ptr [ebp-28h]
    004017B2 D9 EE                      fldz
    004017B4 D9 5D DC                   fstp   dword ptr [ebp-24h]
        Atom1.m128_f32[0] = Atom1.m128_f32[0] - Atom2.m128_f32[0];
    004017B7 D9 45 E0                   fld    dword ptr [ebp-20h]
    004017BA D8 65 D0                   fsub   dword ptr [ebp-30h]
    004017BD D9 5D E0                   fstp   dword ptr [ebp-20h]
        Atom1.m128_f32[1] = Atom1.m128_f32[1] - Atom2.m128_f32[1];
    004017C0 D9 45 E4                   fld    dword ptr [ebp-1Ch]
    004017C3 D8 65 D4                   fsub   dword ptr [ebp-2Ch]
    004017C6 D9 5D E4                   fstp   dword ptr [ebp-1Ch]
        Atom1.m128_f32[2] = Atom1.m128_f32[2] - Atom2.m128_f32[2];
    004017C9 D9 45 E8                   fld    dword ptr [ebp-18h]
    004017CC D8 65 D8                   fsub   dword ptr [ebp-28h]
    004017CF D9 5D E8                   fstp   dword ptr [ebp-18h]
        Atom2.m128_f32[0] = sqrt(Atom1.m128_f32[0] * Atom1.m128_f32[0] +
                                 Atom1.m128_f32[1] * Atom1.m128_f32[1] +
                                 Atom1.m128_f32[2] * Atom1.m128_f32[2]);
    004017D2 D9 45 E0                   fld    dword ptr [ebp-20h]
    004017D5 D8 4D E0                   fmul   dword ptr [ebp-20h]
    004017D8 D9 45 E4                   fld    dword ptr [ebp-1Ch]
    004017DB D8 4D E4                   fmul   dword ptr [ebp-1Ch]
    004017DE DE C1                      faddp  st(1), st
    004017E0 D9 45 E8                   fld    dword ptr [ebp-18h]
    004017E3 D8 4D E8                   fmul   dword ptr [ebp-18h]
    004017E6 DE C1                      faddp  st(1), st
    004017E8 D9 5D CC                   fstp   dword ptr [ebp-34h]
    004017EB D9 45 CC                   fld    dword ptr [ebp-34h]
    004017EE 51                         push   ecx
    004017EF D9 1C 24                   fstp   dword ptr [esp]
    004017F2 E8 29 F8 FF FF             call   sqrt (401020h)
    004017F7 83 C4 04                   add    esp, 4
    004017FA D9 5D D0                   fstp   dword ptr [ebp-30h]
        Atom1.m128_f32[0] = Atom1.m128_f32[0] / Atom2.m128_f32[0];
    004017FD D9 45 E0                   fld    dword ptr [ebp-20h]
    00401800 D8 75 D0                   fdiv   dword ptr [ebp-30h]
    00401803 D9 5D E0                   fstp   dword ptr [ebp-20h]
        Atom1.m128_f32[1] = Atom1.m128_f32[1] / Atom2.m128_f32[0];
    00401806 D9 45 E4                   fld    dword ptr [ebp-1Ch]
    00401809 D8 75 D0                   fdiv   dword ptr [ebp-30h]
    0040180C D9 5D E4                   fstp   dword ptr [ebp-1Ch]
        Atom1.m128_f32[2] = Atom1.m128_f32[2] / Atom2.m128_f32[0];
    0040180F D9 45 E8                   fld    dword ptr [ebp-18h]
    00401812 D8 75 D0                   fdiv   dword ptr [ebp-30h]
    00401815 D9 5D E8                   fstp   dword ptr [ebp-18h]

Fifty-three, roughly, minus my mistake of not letting the compiler use fsqrt instead of calling the sqrt function.
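[edit] If the values stay in SSE registers the whole time, the normalization can skip the x87 sqrt call entirely. A rough sketch (SSE1 only; my own toy code, not anything from Rosetta):

    #include <xmmintrin.h>

    // Normalize the xyz vector in v (w lane assumed 0, length assumed nonzero).
    __m128 normalize3(__m128 v)
    {
        __m128 sq = _mm_mul_ps(v, v);                             // x*x  y*y  z*z  0
        __m128 xx = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(0,0,0,0)); // broadcast x*x
        __m128 yy = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(1,1,1,1)); // broadcast y*y
        __m128 zz = _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2,2,2,2)); // broadcast z*z
        __m128 len = _mm_sqrt_ps(_mm_add_ps(_mm_add_ps(xx, yy), zz)); // |v| in all lanes
        return _mm_div_ps(v, len);                                // v / |v|; 0/len = 0 in w
    }

There is also _mm_rsqrt_ps, which replaces the square root and divide with a single approximate reciprocal square root, but it is only accurate to about 12 bits, so whether it is usable depends on how sensitive the energy calculation is to precision.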
tralala Send message Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0
Why discuss this hypothetically? If you guys understand SSE, ask David Baker for the source and make some suggestions. I'm sure he is interested in the possible use of SSE 1/2/3, etc. dabaker@u.washington.edu
Leonard Kevin Mcguire Jr. Send message Joined: 13 Jun 06 Posts: 29 Credit: 14,903 RAC: 0
He is not going to release that source code to an unknown, I feel. I appreciate the support, and so do Feet1st and MikeMarsUK, but we are really only meager enthusiasts, hoping that Dr. Baker's development staff have already examined the use of SSE or will be reminded by this thread of its potential. And of course SSE may not even be of much use.

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1180
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=937
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1126

lol, I just now found all of that. I'm not sending an e-mail. =)
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0
He is not going to release that source code to an unknown, I feel. I appreciate the support, and so do Feet1st and MikeMarsUK, but we are really only meager enthusiasts, hoping that Dr. Baker's development staff have already examined the use of SSE or will be reminded by this thread of its potential.

Indeed. It would be completely pointless to consider SSE if the Rosetta algorithm does not lend itself to the parallelization SSE offers. The previous DC project I worked on was Find-a-Drug, and this issue was raised from time to time. The bottom line from THINK (Keith Davis) was that use of SSE just didn't help appreciably.

http://www.find-a-drug.org/forums/viewtopic.php?t=5281&highlight=sse

And yes, that is my rather lengthy post at the bottom, speculating on why SSE did not do anything to help. I don't know if the same logic applies to Rosetta or not. But the bottom line is that SSE is NOT a panacea that can optimize every single floating-point algorithm ever written.