Clear register 5 (the running sum).
for i = 0 ... N-1 in steps of 2 (N assumed even)
   1. Load h[i] and h[i+1] with a single 32-bit load starting at memory address &h+i, placing h[i] in the lower half of register 1 and h[i+1] in the upper half of register 1.
   2. Load x[i] and x[i+1] with a single 32-bit load starting at memory address &x+i, placing x[i] in the lower half of register 2 and x[i+1] in the upper half of register 2.
   3. Multiply the lower half of register 1 by the lower half of register 2 and place the 32-bit result in register 3.
   4. Multiply the upper half of register 1 by the upper half of register 2 and place the 32-bit result in register 4.
   5. Add the contents of register 3 to the running sum stored in register 5.
   6. Add the contents of register 4 to the running sum stored in register 5.
end

Example 3: Vector dot product using packed data.
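
The loop in Example 3 can be modeled in portable C. This is only a sketch of the idea, not code for any particular processor: the function name dot_packed is hypothetical, the elements are assumed to be 16-bit signed integers, N is assumed to be even, and the running sum is widened to 64 bits (an assumption, so that the sum cannot overflow in the model). The two elements are packed into a 32-bit word with shifts rather than a real 32-bit load, which keeps the sketch independent of byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of Example 3. Assumes n is even and that converting an
   out-of-range value to int16_t wraps (two's complement), as it does on
   common compilers. */
int64_t dot_packed(const int16_t *h, const int16_t *x, size_t n)
{
    int64_t sum = 0;                                   /* "register 5" */

    for (size_t i = 0; i < n; i += 2) {
        /* Steps 1-2: pack two 16-bit elements into one 32-bit "register",
           element i in the lower half, element i+1 in the upper half. */
        uint32_t r1 = (uint32_t)(uint16_t)h[i] | ((uint32_t)(uint16_t)h[i + 1] << 16);
        uint32_t r2 = (uint32_t)(uint16_t)x[i] | ((uint32_t)(uint16_t)x[i + 1] << 16);

        /* Step 3: lower half times lower half, 32-bit result ("register 3"). */
        int32_t r3 = (int32_t)(int16_t)(r1 & 0xFFFFu) * (int32_t)(int16_t)(r2 & 0xFFFFu);

        /* Step 4: upper half times upper half, 32-bit result ("register 4"). */
        int32_t r4 = (int32_t)(int16_t)(r1 >> 16) * (int32_t)(int16_t)(r2 >> 16);

        /* Steps 5-6: accumulate both products into the running sum. */
        sum += r3;
        sum += r4;
    }
    return sum;
}
```

On a processor with packed-data support, the packing in steps 1 and 2 would be a single 32-bit load each, and the two multiplies could be issued in the same cycle; the C model only reproduces the arithmetic, not the timing.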
