
I need to do some math with vars from an attitude sensor (accelerometers and gyroscopes) on an esp8266. int32_t math with those vars does not have enough range, and the float math operations are abysmally slow.

I'm hoping to use a fixed-point format with 24 bits on the left of the decimal point and 8 bits on the right. So a number would be stored in an int32_t as 0xffffff.ff. A typical operation would be a multiply like ((i64) x * y) >> 8.

So 64-bit math is needed. Would this be significantly slower than a float multiply?
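
To make the format concrete, here is roughly what I have in mind (i32/i64 are just shorthand for int32_t/int64_t; the names are placeholders, not final code):

typedef int32_t i32;
typedef int64_t i64;

#define FIX_FRAC 8                                    // Q24.8: 8 fraction bits
#define fixMul(a, b)  ((i32)(((i64)(a) * (b)) >> FIX_FRAC))
#define toFix(d)      ((i32)((d) * 256.0))            // 256 = 2^8
#define toFloat(f)    ((f) / 256.0)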

mark-hahn
  • do 10,000 operations of each ... time the results – jsotola Apr 10 '22 at 19:58
  • By the way, if I was downvoted because the answer seemed obvious then they should read the rest of this thread. – mark-hahn Apr 11 '22 at 04:41
  • If you need fast single-precision float operations, consider the ESP32 instead of the ESP8266. It clocks in at 240MHz instead of 160MHz, has two cores, and has hardware single-precision FPU. Especially if you want to use the WiFi or anything else that depends on interrupts you can do your DSP on the second core and not have to worry about your calculations getting sidelined by whatever else is going on in your main loop. – J... Apr 11 '22 at 19:14
  • Yes, I've used the esp32 before with this attitude chip. I just made every var a float and it breezed along. But this is for a cheap toy and the extra $2 for the esp32 is too much. – mark-hahn Apr 11 '22 at 20:05

2 Answers


Measuring the relative performance of 64-bit integer versus floating-point multiplication is a little more difficult than it might appear.

It's easy to time a loop that does thousands of calculations, but unless you do something with the result the compiler will happily optimize them away. It's tempting to multiply constants, but the compiler will also happily optimize that away.

To keep the code as simple as possible and not introduce anything into the loop that would take more time, you need to disable or bypass the compiler's optimizations. An easy way to do that is to use the volatile modifier on the variables. volatile tells the compiler that it can't assume the variable hasn't changed, so the compiler will disable optimizations like caching the variable in a register or noticing that the same calculation is being done over and over again and just doing it once.

Using this technique, plus a couple of control tests to confirm that the compiler really does optimize away non-volatile and constant-only calculations, I get these results:

non-volatile uint32_t microseconds 1
constants uint32_t microseconds 5
uint16_t microseconds 162513
uint32_t microseconds 137513
uint64_t microseconds 1287533
int16_t microseconds 337513
int32_t microseconds 300023
int64_t microseconds 1287513
float  microseconds 912529

64-bit multiplication, unsigned or signed, takes about 1.4x as long as floating-point multiplication.

The first test verified that non-volatile variables are optimized out. The second test verified that multiplying constants is also optimized out.

Turning off WiFi also helps keep the test results consistent.

#include <Arduino.h>

#ifdef ESP32
#include <WiFi.h>
#else
#include <ESP8266WiFi.h>
#endif

#define TIMES_TO_LOOP 1000000

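// volatile: the compiler must actually perform every load and store,
// so the multiplications in the timing loops below can't be optimized away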
volatile uint16_t ux16, uy16, uresult16;
volatile uint32_t ux32, uy32, uresult32;
volatile uint64_t ux64, uy64, uresult64;

volatile int16_t x16, y16, result16;
volatile int32_t x32, y32, result32;
volatile int64_t x64, y64, result64;

volatile float xf, yf, resultf;

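// deliberately non-volatile: used in the control test that should be optimized away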
uint32_t x32n, y32n, result32n;

uint16_t seed16() {
  return random(0, 0xffff);
}

uint32_t seed32() {
  return random(0, 0xffffffff);
}

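// note: only the low 32 bits end up randomized; the operands still occupy 64-bit variables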
uint64_t seed64() {
  return seed32();
}

float seedfloat() {
  float x, y;

  x = seed32();
  y = seed32();

  return x / y;
}

void setup() {
  uint32_t i;
  uint64_t micros_start, micros_end;

  Serial.begin(115200);
  Serial.println("hello world");

  WiFi.mode( WIFI_OFF );

  delay(1000);

  x32n = seed32();
  y32n = seed32();

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    result32n = x32n * y32n;
  micros_end = micros();

  Serial.print("non-volatile uint32_t microseconds ");
  Serial.println(micros_end - micros_start);

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    result32n = 540000 * 15;
  micros_end = micros();

  Serial.print("constants uint32_t microseconds ");
  Serial.println(micros_end - micros_start);


  ux16 = seed16();
  uy16 = seed16();
  x16 = ux16;
  y16 = uy16;

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    uresult16 = ux16 * uy16;
  micros_end = micros();

  Serial.print("uint16_t microseconds ");
  Serial.println(micros_end - micros_start);
    
  ux32 = seed32();
  uy32 = seed32();
  x32 = ux32;
  y32 = uy32;

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    uresult32 = ux32 * uy32;
  micros_end = micros();

  Serial.print("uint32_t microseconds ");
  Serial.println(micros_end - micros_start);

  ux64 = seed64();
  uy64 = seed64();
  x64 = ux64;
  y64 = uy64;

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    uresult64 = ux64 * uy64;
  micros_end = micros();

  Serial.print("uint64_t microseconds ");
  Serial.println(micros_end - micros_start);

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    result16 = x16 * y16;
  micros_end = micros();

  Serial.print("int16_t microseconds ");
  Serial.println(micros_end - micros_start);
    
  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    result32 = x32 * y32;
  micros_end = micros();

  Serial.print("int32_t microseconds ");
  Serial.println(micros_end - micros_start);

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    result64 = x64 * y64;
  micros_end = micros();

  Serial.print("int64_t microseconds ");
  Serial.println(micros_end - micros_start);

  xf = seedfloat();
  yf = seedfloat();

  micros_start = micros();
  for(i = 0; i < TIMES_TO_LOOP; i++)
    resultf = xf * yf;
  micros_end = micros();

  Serial.print("float  microseconds ");
  Serial.println(micros_end - micros_start);
}

void loop() {
}
romkey
  • I ran your code and the results matched mine. Float *is* faster than 64-bit: int64_t 650010 microseconds, float 456266 microseconds. – mark-hahn Apr 11 '22 at 04:49
  • Do note that 64 x 64 -> 64 bit multiply is typically 4x slower than the 32 x 32 -> 64 bit multiply that was in the original question. – jpa Apr 11 '22 at 16:43
  • I don't doubt this, but it *is* kind of surprising. For example: why is `int16_t` so slow compared to `uint16_t`? Multiplication is the same operation for those two types. Similarly for `int32_t` and `uint32_t`. Really bizarre. Is it possible that `volatile` is a bit too pessimistic in the sense that it also prevents strength reduction optimizations, whereas in a real app those would be available? – Daniel Wagner Apr 12 '22 at 00:37
  • @jpa -- If you look closely you will see the multiplication is 64x64-bit, not 32x32. – mark-hahn Apr 13 '22 at 02:21
  • @mark-hahn Then you have a different way of implementing fixed point math than I'm used to. C compilers are smart enough to know that if you have a 32-bit variable and cast it to 64 bits just before multiplication, the upper bits are zero and don't need to be handled. – jpa Apr 13 '22 at 06:32

OK, I took the trouble to code fixed-point routines and test them. Here is the code ...


// fixed-point routines for 20-bit integer / 12-bit fraction
// format 0xfffff.fff, max value is about +-524287.9998

typedef int32_t i32;   // shorthand used below
typedef int64_t i64;

#define FP_I 20     // integer bits,  e.g. 0xfffff.0
#define FP_F 12     // fraction bits, e.g. 0x0.fff
#define fpMul(_x,_y) ( (i32)( ((i64) _x * _y) >> FP_F) )
#define fpDiv(_x,_y) ( (i32)(  (((i64) _x) <<   32)      \
                             / (((i64) _y) >> FP_I) ) )
#define decToFp(_dec) ((i32) (_dec * 4096))   // e.g. decToFp(-1.2)
#define fpToI32(_fp)  ((i32) (_fp  / 4096))
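
A quick sanity check of the macros (the values are just illustrative, not part of my timing test):

i32 a = decToFp(1.5);        // 0x1800
i32 b = decToFp(2.0);        // 0x2000
i32 c = fpMul(a, b);         // 0x3000, i.e. 3.0 in 20.12 format
Serial.println(fpToI32(c));  // prints 3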

I ran 1000 loops and used the loop counter as one of the operands, so each pass was one operation plus one floating-point increment. I did this for fixed-point multiply, floating-point multiply, fixed-point divide, and floating-point divide.
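
Roughly the shape of the loop I timed (illustrative only, not my exact code; the float counter is one operand, so the multiply can't be hoisted out of the loop):

uint32_t t0 = micros();
volatile i32 acc = 0;
for (float k = 1.0f; k <= 1000.0f; k += 1.0f)   // the "one floating-point increment"
  acc = fpMul(decToFp(k), decToFp(0.5));        // one fixed-point multiply per pass
uint32_t t1 = micros();
Serial.print("fpMul microseconds ");
Serial.println(t1 - t0);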

The results were shocking. The fixed-point multiply took almost exactly the same time as the floating-point. The fixed-point divide took 60% longer than the floating-point. I went over my code and tried everything I could think of. I built it in debug and release mode. I tried 80MHz and 160MHz. I found floating-point speeds on the web and my floating-point tests closely matched those.

I am at a loss as to how the 64-bit math can be as slow as, or slower than, the floating-point. Both are implemented in software. I'm sure the floating-point is highly optimized with some tricks and hand-coded in assembly, but my fixed-point code is just shifts and 64-bit multiplies/divides; I don't think there are any tricks that could be used for it.

At least this makes the decision easy. I guess I'll use floating-point as sparingly as possible and cross my fingers that I can get my app to work fast enough.

I hope this is helpful to some googlers passing by; I couldn't find anything when I googled it.

mark-hahn
  • Hey, this is not a forum where we play ping pong with answers. When you write an answer it should contain a solution to the problem stated in the question. No back and forth please. Just questions and self-contained answers. If you think you solved your own problem you should merge all your answers. If not, then you can add the additional info to your question. And please make sure that others can still understand your question/answer, so they too can learn from it. – chrisl Apr 11 '22 at 06:12
  • I didn't turn off wifi but I did use all the other recommendations you have here. I have played with this all day and it was hard to get right but I'm pretty confident in the results. Especially since my float results matched the results on the web from several other people who did a lot of work on this stuff. If anyone wants to insert my fixed-point code into their speed tests then let me know if your findings match mine. I've already burned up more time than I should have on this. – mark-hahn Apr 11 '22 at 04:38
  • And, welcome to Arduino Stack Exchange! Thanks for the update. – Nick Gammon Apr 11 '22 at 07:02
  • I don't understand. Every forum I've used before was all ping-pong. Where should we be having this discussion? – mark-hahn Apr 11 '22 at 20:07
  • Have you considered implementing 32-bit multiplication as multiplications of 2-digit base 2^16 numbers? Something like [this](https://gist.github.com/dmwit/c0abc5995739bc750b8e2d85c7166b4e). It may be faster than converting both 32-bit numbers to 64-bit numbers before multiplying. Signed multiplication is also possible with two extra conditionals, if that's really important (I think it's not?). – Daniel Wagner Apr 12 '22 at 01:21
  • ...and if multiplying two 30-bit numbers to get a 60-bit number is enough for your application, you can drop the 4 cheap multiplies from my previous link to just 3 using the Karatsuba multiplication algorithm. – Daniel Wagner Apr 12 '22 at 01:32
  • @mark-hahn: SE *isn't* a forum, it's a Q&A site. You'll note that the interface is structured differently, and the social conventions differ too. The [site tour](/tour) highlights some of the differences, as does our help center page on [how to write a good answer](/help/how-to-answer). Anyway, for issues that do need some back-and-forth discussion, the appropriate place for that is either here in the comments or (especially if the discussion gets very long or drifts off topic) in [chat](https://chat.stackexchange.com/?tab=site&host=arduino.stackexchange.com). – Ilmari Karonen Apr 12 '22 at 07:12
  • BTW, your `fpDiv` implementation looks a bit wonky to me: e.g. `fpDiv(decToFp(10), decToFp(2))` gives a division by zero error. Normally you'd want to do the right shift _after_ the division, I think. (Also, I don't know if that's the case on the esp8266, but at least for some platforms / compilers / optimization settings, setting `FP_F` to a multiple of the/a "native" word size like 8, 16 or 32 bits _might_ yield somewhat faster code. It could at least be worth a try.) – Ilmari Karonen Apr 12 '22 at 07:42