SSE/SSE2, and models.

Message boards : Number crunching : SSE/SSE2, and models.

Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 18872 - Posted: 18 Jun 2006, 2:24:50 UTC
Last modified: 18 Jun 2006, 2:48:08 UTC

I did a search for SSE on the message boards and did not find anything relevant, so I have decided to post.

I was wondering if the Rosetta application makes use of SSE instructions. I figured it did not, since some CPUs might not support them, and it seems there is only one application for each operating system.

I attached to the process in a debugging session, but was unable to find any SSE instructions in the idle threads' opcodes while I had all threads halted.

I am no expert and have very little knowledge of SSE. However, its major use seems to be parallel processing of floating-point numbers. The problem is that if you are doing a basic fp computation and the next fp computation needs the result, SSE provides no help (I could be wrong). Only if you have two basic computations whose results are unrelated does SSE become very powerful.

So, with a little programming background, I was still curious. Because of the nature of the simulation, which seems to be exploring in the dark using random numbers to seed the exploration, maybe more than one model (maybe four) could be run on the same thread (not multiple threads) and execute in parallel with each other using the SSE instruction set, to gain maybe four times the speed.

Alright, now thinking about that, I am sure the Rosetta application is not that straightforward. I am pretty sure it contains lots of special conditions and evaluations to determine whether this atom and that atom can be next to each other, so attempting to run another simulation on the same thread in parallel to take advantage of the SSE instructions could cause some chaos, not to mention the mad method it would take to avoid introducing new bugs without splitting the Rosetta application into two completely separate source trees just for an SSE and a non-SSE application.

So it boils down to this:
1. I really don't care that much, but if anyone thinks there is a lead in my two cents, here is a thread for it. =) rofl


[edit]
Just to give a little more insight to anyone who knows even a little C/C++:

#include <xmmintrin.h>
int main(void)
{
// SSE built-in type: four packed single-precision floats.
__m128 a = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
__m128 b = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
__m128 c;

// This is an intrinsic function; when compiled it makes no function call.
c = _mm_add_ps(a, b);

// Four single-precision fp values are added using one instruction (above).

// The same work with no SSE extensions used:
float sa[4] = {0.0f, 1.0f, 2.0f, 3.0f};
float sb[4] = {0.0f, 1.0f, 2.0f, 3.0f};
float sc[4];
sc[0] = sa[0] + sb[0]; sc[1] = sa[1] + sb[1];
sc[2] = sa[2] + sb[2]; sc[3] = sa[3] + sb[3];
return 0;
}
ID: 18872 · Rating: 1
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 18873 - Posted: 18 Jun 2006, 3:13:38 UTC
Last modified: 18 Jun 2006, 3:45:23 UTC

#include "stdafx.h"
#include <xmmintrin.h>

__m128 mtf[3];   // [0] = operands a, [1] = operands b, [2] = results
int mtfop = 0;
int mtfcnt = 0;
template <int THREADID, int THREADCNT> __forceinline float mtfadd(float a, float b)
{
	if(mtfop < 1)
	{
		mtfop = 1;
		mtf[0].m128_f32[THREADID] = a;
		mtf[1].m128_f32[THREADID] = b;
		mtfcnt = 1;
		// pause thread quickly (wait, keeping cpu cycles low); needs an atomic instruction.
		// when woken, it returns its own result.
		return mtf[2].m128_f32[THREADID];
	}
	if(mtfop == 1)
	{
		mtf[0].m128_f32[THREADID] = a;
		mtf[1].m128_f32[THREADID] = b;
		if(mtfcnt == (THREADCNT-1))
		{
			// last thread to arrive executes the operation using SSE instructions
			mtf[2] = _mm_add_ps(mtf[0], mtf[1]);
			// wake paused threads quickly (keeping cpu cycles low); needs an atomic instruction.
			return mtf[2].m128_f32[THREADID];
		}else{
			mtfcnt++;
			// pause thread quickly (wait, keeping cpu cycles low); needs an atomic instruction.
			// when woken, it returns its own result.
			return mtf[2].m128_f32[THREADID];
		}
	}
	// otherwise just do the operation the old style.
	return a+b;
}

template <int THREADID, int THREADCNT> class xfloat
{
public:
	float p;
	__forceinline void operator=(const float in)
	{
		p = in;
	}
	__forceinline xfloat& operator+(const xfloat &in)
	{
		p = mtfadd<THREADID,THREADCNT>(p, in.p);
		return *this;
	}
};
// let existing code written with plain "float" pick up the wrapper
#define float xfloat<0,4>

int _tmain(int argc, _TCHAR* argv[])
{
	float a, b, c;
	a = 0.0; b = 1.0; c = 2.0;
	a = b + c;
	
	return 0;
}



Tell the compiler to optimize the code by enabling inlining; __forceinline makes the compiler inline it whether it wants to or not.


Another way to not break the source tree too much would be at critical areas of the computation: implement a thread-switching mechanism independent of the operating system. Whenever the simulation for a virtual thread reached a critical point, switch to another thread, until all the threads have reached that point in the calculation; then start combining operations.
ID: 18873 · Rating: 0
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 18879 - Posted: 18 Jun 2006, 8:35:22 UTC
Last modified: 18 Jun 2006, 8:36:07 UTC

I tried the code from my previous post, and it did not work as expected. Nevertheless, too bad the models could not use SSE, I suppose?

It was extremely slow, due to all the semaphores I had to use to keep the threads from trampling on each other.
ID: 18879 · Rating: 0
MikeMarsUK

Joined: 15 Jan 06
Posts: 121
Credit: 2,637,872
RAC: 0
Message 18880 - Posted: 18 Jun 2006, 8:52:58 UTC
Last modified: 18 Jun 2006, 8:55:36 UTC

I can't speak for Rosetta ('cause I haven't the foggiest!), but CPDN can't use SSE instructions because the precision is too low; it may be the same problem with Rosetta.

If a project can use SSE/SSE2, then it's often simply a case of flipping compiler switches rather than making code changes, and the compiler will generate code which can run on both kinds of platform transparently. The problem with SSE/SSE2 is that it's designed for graphics rather than high-precision computation.

ID: 18880 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 18882 - Posted: 18 Jun 2006, 9:12:23 UTC

Recently the project staff expressed interest in optimizing the Rosetta app for different processors, but they don't have the time and expertise to do so. I read somewhere that Rosetta uses mostly single-precision floating point, so it should be able to take advantage of SSE. However, akosf, the guy who did wonders for the app on Einstein, showed no interest in optimizing Rosetta. I'm sure you will receive the source code if you ask for it, to try taking advantage of SSE. However, the app is in constant development and updated often, so it has to be considered how to apply optimizations while still allowing constant code change.
ID: 18882 · Rating: 0
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 18883 - Posted: 18 Jun 2006, 12:55:58 UTC
Last modified: 18 Jun 2006, 13:36:26 UTC

MikeMarsUK, well, it will not run transparently. If an application compiled to use SSE/SSE2 instructions runs on a processor that does not support them, it will raise an invalid-opcode exception, unless the compiler and linker produce separate versions of the functions, one using the processor features and one not. I have heard of such a thing, and I think it was Intel's C++ compiler?

I was wondering why the Rosetta@Home team has not compiled one application with SSE support and/or another with SSE2 support. It could be that SSE did not make a large enough percentage improvement, but that could be related to the fact that a compiler is not the GOD of optimizations, for lack of a better word. I don't think you are going to beat the compiler, in any worthwhile way, at optimizing register usage, ordering instructions, placing data and code on page boundaries, or even contracting or expanding loops.

But you can optimize code in ways the compiler is unable to, which may be more along the lines of what Rosetta@Home could do: use a little SSE in a smarter way, if the compiler flag did not increase performance. Then again, their algorithms may be just too linear to put SSE to any use?

Yeah, Rosetta@Home only uses single-precision floats, which was one reason they had no need to produce a 64-bit binary, but the need for only single-precision floats does make me interested in SSE's potential. I read an article a little while back on how lots of applications these days never get optimized to use these new processor features and instead run with legacy-type performance (I do not know this for sure).

I did run through the Rosetta@Home application with a debugger, but who knows, maybe I just missed the SSE instructions, lol, and they already use them somewhere, somehow. =?)

I just stumbled onto this thread too:
http://72.14.209.104/search?q=cache:f09Ekg0IOigJ:boinc.bakerlab.org/rosetta/forum_thread.php%3Fid%3D1084+SSE&hl=en&gl=us&ct=clnk&cd=1

ID: 18883 · Rating: 0
Profile Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 18891 - Posted: 18 Jun 2006, 16:38:59 UTC - in response to Message 18872.  


// This is an intrinsic function; when compiled it makes no function call.
c = _mm_add_ps(a, b);

// Four single-precision fp values are added using one instruction (above).

// The same work with no SSE extensions used:
float sa[4] = {0.0f, 1.0f, 2.0f, 3.0f};
float sb[4] = {0.0f, 1.0f, 2.0f, 3.0f};
float sc[4];
sc[0] = sa[0] + sb[0]; sc[1] = sa[1] + sb[1];
sc[2] = sa[2] + sb[2]; sc[3] = sa[3] + sb[3];
return 0;
}


You aren't counting "instructions" the way the CPU must. Instruction pipelining does not allow you to perform 4 identical operations in a row. What it does is allow you to perform 4 DIFFERENT operations (i.e. one floating point, one integer, one comparison... I'm not exactly sure which labels are on the 4 pipes). And the compiler may reorder the execution when it does not change the outcome of the program. So if it could pull some integer calculations from the surrounding code and intersperse them with the floats, then it would be taking advantage of the pipelining.

Here is a reference. Unfortunately it doesn't really fully discuss the causes of "bubbles" in the pipeline.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 18891 · Rating: 0
MikeMarsUK

Joined: 15 Jan 06
Posts: 121
Credit: 2,637,872
RAC: 0
Message 18893 - Posted: 18 Jun 2006, 16:57:44 UTC - in response to Message 18883.  

MikeMarsUK, well, it will not run transparently. If an application compiled to use SSE/SSE2 instructions runs on a processor that does not support them, it will raise an invalid-opcode exception, unless the compiler and linker produce separate versions of the functions, one using the processor features and one not. I have heard of such a thing, and I think it was Intel's C++ compiler?
...


The CPDN compiler is Intel Fortran, and I think that's exactly what it does: it has two versions of a block of code, one for SSE/SSE2 and the other for non-SSE.

If Rosetta only needs single-precision computation then there may be scope for this optimisation; it probably depends on which language, compiler, etc. they are using.

ID: 18893 · Rating: 0
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 18896 - Posted: 18 Jun 2006, 17:38:04 UTC - in response to Message 18891.  
Last modified: 18 Jun 2006, 18:34:11 UTC


You aren't counting "instructions" the way the CPU must. Instruction pipelining does not allow you to perform 4 identical operations in a row. What it does is allow you to perform 4 DIFFERENT operations (i.e. one floating point, one integer, one comparison... I'm not exactly sure which labels are on the 4 pipes). And the compiler may reorder the execution when it does not change the outcome of the program. So if it could pull some integer calculations from the surrounding code and intersperse them with the floats, then it would be taking advantage of the pipelining.

Here is a reference. Unfortunately it doesn't really fully discuss the causes of "bubbles" in the pipeline.


I know what you are saying, but let me compile something quickly:

	float a[4], b[4], result[4];
	result[0] = a[0] + b[0];
00401736 D9 45 F0         fld         dword ptr [a] 
00401739 D8 45 E0         fadd        dword ptr [b] 
0040173C D9 5D D0         fstp        dword ptr [result] 
	result[1] = a[1] + b[1];
0040173F D9 45 F4         fld         dword ptr [ebp-0Ch] 
00401742 D8 45 E4         fadd        dword ptr [ebp-1Ch] 
00401745 D9 5D D4         fstp        dword ptr [ebp-2Ch] 
	result[2] = a[2] + b[2];
00401748 D9 45 F8         fld         dword ptr [ebp-8] 
0040174B D8 45 E8         fadd        dword ptr [ebp-18h] 
0040174E D9 5D D8         fstp        dword ptr [ebp-28h] 
	result[3] = a[3] + b[3];
00401751 D9 45 FC         fld         dword ptr [ebp-4] 
00401754 D8 45 EC         fadd        dword ptr [ebp-14h] 
00401757 D9 5D DC         fstp        dword ptr [ebp-24h] 


That's twelve instructions.

	__m128 a, b, result;
	result = _mm_add_ps(a,b);
// 30h - 20h is 10h as in 16 bytes or 128bits or 4 32bit floats.
00401753 0F 28 45 E0      movaps      xmm0,xmmword ptr [ebp-20h] 
00401757 0F 28 4D D0      movaps      xmm1,xmmword ptr [ebp-30h] 
0040175B 0F 58 C8         addps       xmm1,xmm0 
0040175E 0F 29 4D C0      movaps      xmmword ptr [ebp-40h],xmm1 


That's four instructions.

Also, we get xmm0-xmm7, each 128 bits, to store more fp values in, instead of constantly reading/writing/swapping through memory.

SSE2 adds more capability and flexibility (double-precision and 128-bit integer operations), but my processor only supports SSE, not SSE2. SSE3 is another one that adds yet more functionality.

The processor pipeline reads that one instruction; it really does not issue four separate add operations. =) It presumably shifts things around internally and makes one big calculation, with the outcome of four values contained in one 128-bit register.

That makes sense about the Fortran compiler. I am just shooting in the dark here, but one good explanation of how it could work well is run-time function linking by the executable: it waits to link the processor-specific functions until it is executed, but this is hidden from the developer. The only bad side is that it could *really* bloat the code. If you had a large function, the compiler would be forced to make two copies, or break it down into separate functions with call instructions, which cost CPU. But I am not debunking the idea at all; it depends on how much bloat. =?)

You could also do manually what the Fortran compiler does, or, as seems a little more likely, the developers may just hand-write two separate functions: one using SSE intrinsics and one without.

http://www.tc.cornell.edu/Services/Education/Topics/Optimization/SingleProcessor/2.7+DAXPY,+SAXPY.htm
http://www.intel80386.com/simd/mmx2-doc.html

Something common is vector math, which Rosetta@Home may use?

	__m128 Atom1 = {5.0,5.0,5.0,0.0}, Atom2 = {0.0,0.0,0.0,0.0};
00401009 F3 0F 10 05 F4 20 40 00 movss       xmm0,dword ptr [__real@40a00000 (4020F4h)] 
00401011 D9 05 F4 20 40 00 fld         dword ptr [__real@40a00000 (4020F4h)] 
00401017 D9 14 24         fst         dword ptr [esp] 
0040101A F3 0F 11 44 24 08 movss       dword ptr [esp+8],xmm0 
00401020 0F 57 C0         xorps       xmm0,xmm0 
00401023 D9 5C 24 04      fstp        dword ptr [esp+4] 
00401027 F3 0F 11 44 24 10 movss       dword ptr [esp+10h],xmm0 
0040102D F3 0F 11 44 24 14 movss       dword ptr [esp+14h],xmm0 
00401033 F3 0F 11 44 24 18 movss       dword ptr [esp+18h],xmm0 
00401039 F3 0F 11 44 24 1C movss       dword ptr [esp+1Ch],xmm0 
	Atom1 = _mm_sub_ps(Atom1, Atom2);
0040103F 0F 28 4C 24 10   movaps      xmm1,xmmword ptr [esp+10h] 
00401044 F3 0F 11 44 24 0C movss       dword ptr [esp+0Ch],xmm0 
0040104A 0F 28 04 24      movaps      xmm0,xmmword ptr [esp] 
0040104E 0F 5C C1         subps       xmm0,xmm1 
00401051 0F 29 04 24      movaps      xmmword ptr [esp],xmm0 
	Atom2.m128_f32[0] = Atom2.m128_f32[1] = Atom2.m128_f32[2] =
	 sqrt(Atom1.m128_f32[0] * Atom1.m128_f32[0] +
							 Atom1.m128_f32[1] * Atom1.m128_f32[1] +
							 Atom1.m128_f32[2] * Atom1.m128_f32[2]);
00401055 D9 44 24 08      fld         dword ptr [esp+8] 
00401059 DC C8            fmul        st(0),st 
0040105B D9 44 24 04      fld         dword ptr [esp+4] 
0040105F DC C8            fmul        st(0),st 
00401061 DE C1            faddp       st(1),st 
00401063 D9 04 24         fld         dword ptr [esp] 
00401066 DC C8            fmul        st(0),st 
00401068 DE C1            faddp       st(1),st 
0040106A D9 FA            fsqrt            
0040106C D9 54 24 18      fst         dword ptr [esp+18h] 
00401070 D9 54 24 14      fst         dword ptr [esp+14h] 
00401074 D9 5C 24 10      fstp        dword ptr [esp+10h] 
	Atom1 = _mm_div_ps(Atom1, Atom2);
00401078 0F 28 4C 24 10   movaps      xmm1,xmmword ptr [esp+10h] 
0040107D 0F 5E C1         divps       xmm0,xmm1 
00401080 0F 29 04 24      movaps      xmmword ptr [esp],xmm0 
00401084 F3 0F 10 44 24 0C movss       xmm0,dword ptr [esp+0Ch] 
0040108A F3 0F 58 44 24 08 addss       xmm0,dword ptr [esp+8] 


Thirty-three instructions.


No SSE
	__m128 Atom1 = {5.0,5.0,5.0,0.0}, Atom2 = {0.0,0.0,0.0,0.0};
00401783 D9 05 F4 20 40 00 fld         dword ptr [__real@40a00000 (4020F4h)] 
00401789 D9 5D E0         fstp        dword ptr [ebp-20h] 
0040178C D9 05 F4 20 40 00 fld         dword ptr [__real@40a00000 (4020F4h)] 
00401792 D9 5D E4         fstp        dword ptr [ebp-1Ch] 
00401795 D9 05 F4 20 40 00 fld         dword ptr [__real@40a00000 (4020F4h)] 
0040179B D9 5D E8         fstp        dword ptr [ebp-18h] 
0040179E D9 EE            fldz             
004017A0 D9 5D EC         fstp        dword ptr [ebp-14h] 
004017A3 D9 EE            fldz             
004017A5 D9 5D D0         fstp        dword ptr [ebp-30h] 
004017A8 D9 EE            fldz             
004017AA D9 5D D4         fstp        dword ptr [ebp-2Ch] 
004017AD D9 EE            fldz             
004017AF D9 5D D8         fstp        dword ptr [ebp-28h] 
004017B2 D9 EE            fldz             
004017B4 D9 5D DC         fstp        dword ptr [ebp-24h] 
	Atom1.m128_f32[0] = Atom1.m128_f32[0] - Atom2.m128_f32[0];
004017B7 D9 45 E0         fld         dword ptr [ebp-20h] 
004017BA D8 65 D0         fsub        dword ptr [ebp-30h] 
004017BD D9 5D E0         fstp        dword ptr [ebp-20h] 
	Atom1.m128_f32[1] = Atom1.m128_f32[1] - Atom2.m128_f32[1];
004017C0 D9 45 E4         fld         dword ptr [ebp-1Ch] 
004017C3 D8 65 D4         fsub        dword ptr [ebp-2Ch] 
004017C6 D9 5D E4         fstp        dword ptr [ebp-1Ch] 
	Atom1.m128_f32[2] = Atom1.m128_f32[2] - Atom2.m128_f32[2];
004017C9 D9 45 E8         fld         dword ptr [ebp-18h] 
004017CC D8 65 D8         fsub        dword ptr [ebp-28h] 
004017CF D9 5D E8         fstp        dword ptr [ebp-18h] 
	Atom2.m128_f32[0] = sqrt(Atom1.m128_f32[0] * Atom1.m128_f32[0] +
							 Atom1.m128_f32[1] * Atom1.m128_f32[1] +
							 Atom1.m128_f32[2] * Atom1.m128_f32[2]);
004017D2 D9 45 E0         fld         dword ptr [ebp-20h] 
004017D5 D8 4D E0         fmul        dword ptr [ebp-20h] 
004017D8 D9 45 E4         fld         dword ptr [ebp-1Ch] 
004017DB D8 4D E4         fmul        dword ptr [ebp-1Ch] 
004017DE DE C1            faddp       st(1),st 
004017E0 D9 45 E8         fld         dword ptr [ebp-18h] 
004017E3 D8 4D E8         fmul        dword ptr [ebp-18h] 
004017E6 DE C1            faddp       st(1),st 
004017E8 D9 5D CC         fstp        dword ptr [ebp-34h] 
004017EB D9 45 CC         fld         dword ptr [ebp-34h] 
004017EE 51               push        ecx  
004017EF D9 1C 24         fstp        dword ptr [esp] 
004017F2 E8 29 F8 FF FF   call        sqrt (401020h) 
004017F7 83 C4 04         add         esp,4 
004017FA D9 5D D0         fstp        dword ptr [ebp-30h] 
	Atom1.m128_f32[0] = Atom1.m128_f32[0] / Atom2.m128_f32[0];
004017FD D9 45 E0         fld         dword ptr [ebp-20h] 
00401800 D8 75 D0         fdiv        dword ptr [ebp-30h] 
00401803 D9 5D E0         fstp        dword ptr [ebp-20h] 
	Atom1.m128_f32[1] = Atom1.m128_f32[1] / Atom2.m128_f32[0];
00401806 D9 45 E4         fld         dword ptr [ebp-1Ch] 
00401809 D8 75 D0         fdiv        dword ptr [ebp-30h] 
0040180C D9 5D E4         fstp        dword ptr [ebp-1Ch] 
	Atom1.m128_f32[2] = Atom1.m128_f32[2] / Atom2.m128_f32[0];
0040180F D9 45 E8         fld         dword ptr [ebp-18h] 
00401812 D8 75 D0         fdiv        dword ptr [ebp-30h] 
00401815 D9 5D E8         fstp        dword ptr [ebp-18h] 


Fifty-three, roughly, minus my mistake of not letting the compiler use fsqrt instead of calling the sqrt function.
ID: 18896 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 18897 - Posted: 18 Jun 2006, 18:27:15 UTC

Why discuss this hypothetically? If you guys understand SSE, ask David Baker for the source and make some suggestions. I'm sure he is interested in the possible use of SSE 1/2/3 etc.

dabaker@u.washington.edu


ID: 18897 · Rating: 1
Profile Leonard Kevin Mcguire Jr.

Joined: 13 Jun 06
Posts: 29
Credit: 14,903
RAC: 0
Message 18898 - Posted: 18 Jun 2006, 18:52:10 UTC

He is not going to release that source code to an unknown, I feel. I appreciate the support, and so do Feet1st and MikeMarsUK, but we are really only meager enthusiasts, hoping that Dr. Baker's development staff have already examined the use of SSE or will be reminded by this thread of its potential.

And, of course, SSE may not even be of much use.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1180
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=937
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1126

lol, I just now found all of that. I'm not sending an e-mail. =)
ID: 18898 · Rating: 0
Profile dgnuff
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 18941 - Posted: 19 Jun 2006, 17:42:02 UTC - in response to Message 18898.  

He is not going to release that source code to an unknown, I feel. I appreciate the support, and so do Feet1st and MikeMarsUK, but we are really only meager enthusiasts, hoping that Dr. Baker's development staff have already examined the use of SSE or will be reminded by this thread of its potential.

And, of course, SSE may not even be of much use.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1180
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=937
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=1126

lol, I just now found all of that. I'm not sending an e-mail. =)


Indeed. It would be completely pointless to consider SSE if the Rosetta algorithm does not lend itself to the parallelization offered by SSE.

The previous DC project I worked on was Find-a-Drug, and this issue was raised from time to time. The bottom line from THINK (Keith Davis) was that use of SSE just didn't help appreciably.

http://www.find-a-drug.org/forums/viewtopic.php?t=5281&highlight=sse

And yes, that is my rather lengthy post at the bottom, speculating on why SSE did not do anything to help. I don't know if the same logic applies to Rosetta or not. But the bottom line is that SSE is NOT a panacea that can optimize every single floating-point algorithm ever written.
ID: 18941 · Rating: 1




©2024 University of Washington
https://www.bakerlab.org