I remember hearing somebody talk about programming hot loops for either the PS3 or PS2 in Excel, to get a good handle on the concurrency question by having assembler in multiple columns next to each other.
That would be the PS2’s VUs, which had an upper and a lower pipe, and it was easier to write the instructions for each in separate columns. Then in one SDK we received a program called vcl, which took a single list of instructions and did all the pipelining for you, as well as optimizing loops and assigning registers automatically. It was a godsend.