for i = 0 ... N-1 in steps of 2
1. Load h[i] and h[i+1], starting from memory address &h+i, placing h[i] in the lower half of register 1 and h[i+1] in the upper half of register 1.
2. Load x[i] and x[i+1], starting from memory address &x+i, placing x[i] in the lower half of register 2 and x[i+1] in the upper half of register 2.
3. Multiply the lower half of register 1 by the lower half of register 2 and place the 32-bit result in register 3.
4. Multiply the upper half of register 1 by the upper half of register 2 and place the 32-bit result in register 4.
5. Add contents of register 3 to running sum stored in register 5.
6. Add contents of register 4 to running sum stored in register 5.
end
Example 3: Vector dot product using packed data.
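
The steps in Example 3 can be sketched in portable C. This is only an illustration under stated assumptions, not the code for any particular processor: the function name dot_packed is hypothetical, the operands are assumed to be 16-bit signed values, N is assumed to be even, and the running sum is assumed to start at zero and to fit in 32 bits. The variables r1 through r5 stand in for registers 1 through 5; a real DSP would perform each packed load in a single instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack two 16-bit values into one 32-bit "register",
   with lo in the lower half and hi in the upper half. */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Extract the lower or upper half as a signed 16-bit value.
   The unsigned-to-signed conversion assumes two's complement. */
static int16_t lower_half(uint32_t r) { return (int16_t)(uint16_t)(r & 0xFFFFu); }
static int16_t upper_half(uint32_t r) { return (int16_t)(uint16_t)(r >> 16); }

/* Dot product of h and x, processing two elements per iteration.
   Assumes n is even and that the sum does not overflow 32 bits. */
int32_t dot_packed(const int16_t *h, const int16_t *x, size_t n)
{
    int32_t r5 = 0;                                   /* running sum */
    for (size_t i = 0; i + 1 < n; i += 2) {
        uint32_t r1 = pack16(h[i], h[i + 1]);         /* step 1 */
        uint32_t r2 = pack16(x[i], x[i + 1]);         /* step 2 */
        int32_t r3 = (int32_t)lower_half(r1) * lower_half(r2); /* step 3 */
        int32_t r4 = (int32_t)upper_half(r1) * upper_half(r2); /* step 4 */
        r5 += r3;                                     /* step 5 */
        r5 += r4;                                     /* step 6 */
    }
    return r5;
}
```

For example, with h = {1, 2, 3, 4} and x = {5, 6, 7, 8}, dot_packed(h, x, 4) returns 1*5 + 2*6 + 3*7 + 4*8 = 70. Building the packed word with shifts rather than a raw 32-bit memory read keeps the sketch independent of byte order.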