Experiments with Byte Matrix Multiplication (github.com)
dkhudia 276 days ago [-]
> It's quite common in machine learning operations to multiply a matrix of unsigned byte by a matrix of signed byte. Don't ask me why, but that's the case.

Overflow is the reason. Intel's vpmaddubsw multiplies uint8_t values by int8_t values and produces int16_t results. If both operands were unsigned, 255 * 255 = 65025 would be out of range for int16_t (−32,768 to +32,767), which is likely why the instruction is designed to take one int8_t and one uint8_t operand. With one signed and one unsigned operand, the extreme products -128 * 255 = -32640 and 127 * 255 = 32385 always fit in int16_t. Overflow (or rather saturation, with this instruction) can still occur because it sums adjacent products. See my comment in PyTorch. https://github.com/pytorch/pytorch/blob/a37db5ae3978010e1bb7...
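
The range argument above can be checked with a few lines of Python. This is only a rough emulation of a single 16-bit output lane using plain integers, not the intrinsic itself; the helper names `saturate_int16` and `maddubs_lane` are made up for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_int16(x: int) -> int:
    """Clamp x into the signed 16-bit range (signed saturation)."""
    return max(INT16_MIN, min(INT16_MAX, x))


def maddubs_lane(u0: int, s0: int, u1: int, s1: int) -> int:
    """Emulate one output lane: u0*s0 + u1*s1, saturated to int16.

    u0, u1 are unsigned bytes (0..255); s0, s1 are signed bytes (-128..127).
    """
    for u in (u0, u1):
        if not 0 <= u <= 255:
            raise ValueError("unsigned operand out of uint8 range")
    for s in (s0, s1):
        if not -128 <= s <= 127:
            raise ValueError("signed operand out of int8 range")
    return saturate_int16(u0 * s0 + u1 * s1)


# A single unsigned * signed product always fits in int16.
print(255 * -128)   # -32640
print(255 * 127)    # 32385

# The sum of two adjacent products may not fit, so it saturates.
print(maddubs_lane(255, 127, 255, 127))    # 64770 clamps to 32767
print(maddubs_lane(255, -128, 255, -128))  # -65280 clamps to -32768
```

So the single products are safe, and it is only the pairwise sum that can hit the saturation limits.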

atq2119 276 days ago [-]
This doesn't feel like a convincing argument. If you wanted to multiply uint8 * uint8, you'd naturally use an unsigned multiply with a uint16 result. That doesn't overflow either.

I believe a better argument is to appeal to the structure of neural networks. Activation inputs into a matrix multiply come out of a non-linear function, and ReLU is a popular function which causes activation inputs to be unsigned. Weights then need to be signed so that the matrix multiplication can have negative outputs -- without negative outputs, you would lose the non-linearity of ReLU.
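
A toy illustration of that structure, as a sketch only: the quantization scale, the sample values, and the helper names (`relu`, `quantize_uint8`, `dot_int32`) are all assumptions chosen for the example, not taken from any particular framework.

```python
def relu(x: float) -> float:
    """ReLU clamps negative inputs to zero, so its output is never negative."""
    return max(0.0, x)


def quantize_uint8(x: float, scale: float) -> int:
    """Map a non-negative float to an unsigned byte (hypothetical scheme)."""
    return max(0, min(255, round(x / scale)))


def dot_int32(acts_u8, weights_s8) -> int:
    """Integer dot product of uint8 activations with int8 weights."""
    return sum(a * w for a, w in zip(acts_u8, weights_s8))


pre_activations = [-1.5, 0.25, 2.0]
scale = 1 / 64  # assumed scale, chosen so the values quantize exactly

# After ReLU everything is >= 0, so an unsigned type loses nothing.
acts = [quantize_uint8(relu(v), scale) for v in pre_activations]
print(acts)  # [0, 16, 128]

# Signed weights let the next layer's pre-activation go negative...
print(dot_int32(acts, [3, -100, 10]))  # -320

# ...whereas non-negative weights can never produce a negative output.
print(dot_int32(acts, [3, 100, 10]))   # 2880
```

With only non-negative weights the following ReLU would never clip anything, which is the loss of non-linearity described above.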

dkhudia 276 days ago [-]
This is true but the instruction already existed and it doesn't support uint16_t accumulation. For the reason you mention, activations are uint8_t and weights are int8_t so it worked out well for neural networks.

gok 276 days ago [-]
Curious how this compares with, say, the implementation of gemm_s8s8s32 in Intel's MKL / OneAPI.