Metal should expose shuffle instructions
Warp/lane shuffle operations are essential for register-level blocking of computation done collaboratively across threads/work items. They can be challenging to use in the general case, but they are critical to extracting peak performance on very important workloads such as dense matrix-matrix multiply (and many neural net layer types), as well as local stream compaction for fine-grained dynamic collaboration between threads.
As such, all other compute APIs expose shuffle operations. Metal should as well. Without them, it is impossible to write competitive GEMM, conv layer, and other important kernels on most current GPU architectures.
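To make the request concrete, here is a minimal CPU-side sketch of what a shuffle does. It is an emulation in plain C++, not Metal or GPU code. The names `kWarpSize`, `Warp`, `shuffle_down`, and `warp_reduce_sum` are hypothetical, and the 32-lane width and the out-of-range behavior (a lane keeps its own value) are assumptions modeled on CUDA's shuffle-down semantics. Each array slot stands in for one lane's register, and the reduction shows lanes combining register values directly, without a round trip through threadgroup/shared memory.

```cpp
#include <array>
#include <cstddef>

// Assumption: 32 lanes per warp/SIMD group; real hardware widths vary.
constexpr std::size_t kWarpSize = 32;

// One register value per lane; index i stands in for lane i's register.
using Warp = std::array<float, kWarpSize>;

// Emulates a "shuffle down": every lane reads the register held by
// lane (lane + delta). Lanes whose source would fall outside the warp
// keep their own value (assumed here, following CUDA's convention).
Warp shuffle_down(const Warp& regs, std::size_t delta) {
    Warp out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        const std::size_t src = lane + delta;
        out[lane] = (src < kWarpSize) ? regs[src] : regs[lane];
    }
    return out;
}

// Tree reduction built only from shuffles: after log2(32) = 5 steps,
// lane 0 holds the sum of all 32 original register values.
float warp_reduce_sum(Warp regs) {
    for (std::size_t delta = kWarpSize / 2; delta > 0; delta /= 2) {
        const Warp other = shuffle_down(regs, delta);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            regs[lane] += other[lane];  // on a GPU, all lanes do this at once
        }
    }
    return regs[0];  // only lane 0 is guaranteed to hold the full sum
}
```

On a GPU, each inner loop over lanes would collapse into a single instruction executed by all lanes in lockstep. That is why the same pattern built on threadgroup memory and barriers tends to be slower. A caller would fill a `Warp` with one value per lane and read the total back from `warp_reduce_sum`.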