Ever get the feeling that there are factions inside Apple that aren't happy about the upcoming move to Intel?
Specifically, check out some of these highlights from the performance tips on migrating from AltiVec to SSE:
Don't bother synthesizing constants on the fly like you did for AltiVec. Most of the time, you won't have register space to keep those constants in register. You also don't have vec_splat_*, so synthesizing constants takes a lot longer.
Just like AltiVec, denormal stalls can be very expensive. Unlike AltiVec, you are much more likely to encounter them.
Reduce or eliminate your need for the permute unit. It is not as strong as on AltiVec. You could find yourself spending all your CPU time solving permute problems rather than doing actual work.
While translating code from AltiVec to SSE, pay attention to the expense of each translation. Some AltiVec instructions translate directly to a single SSE equivalent, while another potentially very similar instruction may take a dozen SSE instructions to do.
AltiVec is a rich ISA. This gives you a lot of freedom. There are frequently three ways to do anything, one of which is highly unintuitive but delivers a miracle in two instructions. SSE is smaller.
SSE involves destructive instructions most of the time. If you can phrase your algorithm in terms of destructive logic, you can probably save some unnecessary copies, and possibly some register spillage. (This will probably preclude software pipelining. However, software pipelining may not be necessary because the Intel processors are highly out-of-order.)
Heed cautions and tips in the Intel Processor Optimization Reference Manual.
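For the curious, here is roughly what the first tip (about constants) looks like in practice. This is only a minimal sketch, not anything taken from Apple's guide: `halve_floats` is a name I made up, and I'm assuming a compiler that defines `__SSE__` and ships `xmmintrin.h`. The constant is simply stated with `_mm_set1_ps` and the compiler is left to decide where it lives (typically a load from read-only memory), rather than being built up with a splat-and-shift sequence the way one would on AltiVec.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper (not from Apple's guide): multiply n floats by 0.5 in place. */
void halve_floats(float *data, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* State the constant directly instead of synthesizing it;
       the compiler chooses how to materialize it. */
    const __m128 half = _mm_set1_ps(0.5f);
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);            /* unaligned load of 4 floats */
        _mm_storeu_ps(data + i, _mm_mul_ps(v, half)); /* multiply and store back */
    }
#endif
    /* Scalar loop handles the tail, or everything when SSE is unavailable. */
    for (; i < n; ++i)
        data[i] *= 0.5f;
}
```

The unaligned loads keep the sketch safe for any pointer; real code that controls its own allocation would presumably use aligned loads instead.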
Yeah, I'm really thrilled about this, too. Sounds to me like a well-tweaked AltiVec-enabled app on the G5 will still blow the doors off of a tweaked SSE2 app on a P5. Don't even get me started on their marginalization of SSE3:
SSE3 adds a small series of instructions mostly geared to making complex floating point arithmetic work better in some data layouts. However, since it is possible to get the same or better performance by repacking data as uniform vectors rather than non-uniform vectors ahead of time, it is not expected that most developers will need to rely on this feature.
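As I read it, "repacking data as uniform vectors" means keeping the real parts and the imaginary parts in separate arrays, instead of interleaving them as re, im, re, im. That reading is my assumption, and the sketch below is mine rather than Apple's; `complex_mul_split` is a made-up name. With the split layout, a complex multiply becomes plain element-wise multiplies, adds, and subtracts, which map onto ordinary SSE arithmetic without any of the SSE3 horizontal or add/subtract instructions.

```c
#include <stddef.h>

/* Hypothetical helper: out = a * b for n complex numbers stored in
   split layout (separate arrays for real and imaginary parts). */
void complex_mul_split(const float *a_re, const float *a_im,
                       const float *b_re, const float *b_im,
                       float *out_re, float *out_im, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* (a_re + i*a_im) * (b_re + i*b_im) */
        float re = a_re[i] * b_re[i] - a_im[i] * b_im[i];
        float im = a_re[i] * b_im[i] + a_im[i] * b_re[i];
        out_re[i] = re;
        out_im[i] = im;
    }
}
```

Every lane does the same operation on the same kind of value, so a vectorizing compiler (or a hand-written intrinsic version) can process four elements at a time with no shuffling inside the loop. The cost is the one-time repacking step, which the quoted tip assumes you can do ahead of time.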
Hmm. Perhaps in 2006-2007, it will be time to hoard dual-processor Xserve cluster nodes, especially for those of us who have cause to work in bioinformatics or other HPC environs?
