Occasionally, when system’s built-in precision or range is insufficient for specific tasks, numerical techniques can extend their capabilities. Extending precision is not as daunting as it might seem, given familiarity with the underlying algorithms—multiplication being a prime example, a process taught as early as grade school. This page presents a ‘perfectly’ optimized implementation (minimal operations required for the task) of the Big Multiplication algorithm, achieving peak efficiency.

Below, the source code is showcased in C++ using templates, providing a practical demonstration through the ‘pudding’ of Big Multiplication.

Traditionally, multiplication on paper involves multiplying the top number by every digit of the bottom number and summing the results. Similarly, when implemented on a computer, each digit of the bottom number multiplies with the top, and the results accumulate in a sum register. This page illustrates the ultimate optimization of this process, using only a minimal set of variables and an accumulation register, thereby optimizing both computational speed and memory usage.

When implementing this algorithm, I started with the knowledge that multiplying two numbers produces a result with as many digits as the sum of the digits of the two numbers. Each pair of digits multiplies to yield a result with at most two digits. Armed with a theory that, given the ‘shape’ of numbers, I could multiply in place to save time and memory across any radix, I enlisted the help of AI to articulate how it works (text in blue).

To explain why in-place summation works in this big multiplication algorithm, and its universal applicability across any radix—including binary systems like 2^32—we start by understanding the fundamental principles of arithmetic in various numeral systems, especially focusing on the concept of “carry”.

Understanding Carry in Different Radices:
When multiplying numbers, each digit can potentially produce a value that exceeds the base of the numeral system (or radix). This excess is what we commonly refer to as “carry”. The key insight is that the carry from any single digit multiplication in any numeral system can be accommodated by the digits directly adjacent to it.

Here’s how it can be broken down:

  1. Single Digit Multiplication: In any radix 𝑅, a single digit multiplied by another single digit yields at most (𝑅−1)×(𝑅−1).
  2. Maximum Product: This product is 𝑅^2−2𝑅+1.
  3. Handling the Product:
    • The maximum value that needs to be accommodated in a single position (or “place”) in the result is 𝑅−1 from direct multiplication, 𝑅−1 from a possible existing value in the register, and an additional potential 𝑅−1 from a carry from the previous lower position.
    • Therefore, the maximum sum before placing and carrying to the next higher position is 3𝑅−3.
    • This sum splits into a maximum carry of 𝑅−1 to the next higher position, and the maximum value 𝑅−1 that remains in the current position.

Decimal Example (Radix 10):

  • Single Digit Multiplication: 9×9=81.
  • Product Handling: 81 can be split as 80+1 where 1 is in the units place, and 80 (or 8×10) is the carry to the tens place. With the maximal value of 9 in the accumulation register, the maximum sum is 90.

Binary Example (Radix 2^32):

  • Single Digit Multiplication: The maximum product of two 32-bit numbers is
    (2^32−1)×(2^32−1) = 2^64 − 2^33 + 1.
  • Product Handling: This value requires 64 bits to represent. The lower 32 bits represent the product in the current position.
    The upper 32 bits represent the carry to the next higher position in the result array.

Universality Across Radices:

  • The maximum product of two digits is predictable and bounded by the square of the radix.
  • The carry required is bounded by the radix itself.

Thus, by using a carry that is simply the upper half of the product (after accommodating the radix of the current digit), you can always manage to sum in place without needing additional temporary storage. This is why the algorithm uses the upper bits as carry, ensuring efficient use of space and operations. This explanation highlights the mathematical robustness of your algorithm and explains why it successfully handles such operations in place for any given radix. This universality is rooted in the nature of number systems and the properties of multiplication.

Here is the full source for the proof of concept on GitHub


#include <typeinfo>
#include <cstdio>
#include <cstdint>
#include <cstdlib>

//Note, this function is not optimized and is for illustration purposes
//  see void multiply_any(int digits, LowType *a, LowType *b, LowType *c) which handles zeros more like multiplication in actual practice
void multiply_uint32_t(int digits, uint32_t *a, uint32_t *b, uint32_t *c)
{
    uint32_t carry;

    for (int place_c=0; place_c < digits*2; place_c++) c[place_c] = 0;

    //for each digit in a
    for (int place_a=0; place_a<digits; place_a++) {
        int place_b;

        carry = 0;

        //multiply with the digits in b
        for (place_b = 0; place_b < digits; place_b++) {
            uint64_t v = carry + a[place_a] * (uint64_t) b[place_b] + c[place_a + place_b];

            uint32_t low = (uint32_t) v; //store the low half
            uint32_t high = (uint32_t) v>>32; //store the high half;

            c[place_a + place_b] = low;
            carry = high;
        }

        //process the carry
        for (; carry != 0 && place_b < digits * 2; place_b++)
        {
            uint64_t v = carry + c[place_a + place_b];

            uint32_t low = (uint32_t) v; //store the low half
            uint32_t high = (uint32_t) v>>32; //store the high half;

            c[place_a + place_b] = low;
            carry = high;
        }
    }
}


template<typename LowType, typename HighType>
void multiply_any(int digits, LowType *a, LowType *b, LowType *c)
{
    LowType carry;

    for (int place_c=0; place_c < digits*2; place_c++) c[place_c] = 0;

    //for each digit in a
    for (int place_a=0; place_a<digits; place_a++) {
        int place_b;

        carry = 0;

        if (a[place_a] ==0) continue;

        //multiply with the digits in b
        for (place_b = 0; place_b < digits; place_b++) {
            HighType v;

            if (b[place_b] != 0) {
                v = carry + a[place_a] * (HighType) b[place_b] + c[place_a + place_b];;
            } else {
                v = carry + c[place_a + place_b];
            }
            carry = (LowType) (v >> (sizeof(LowType)*8));
            c[place_a + place_b] = (LowType) v;

        }

        //process the carry
        for (; carry != 0 && place_b < digits * 2; place_b++)
        {
            HighType v = carry + c[place_a + place_b];

            c[place_a + place_b] = (LowType) v;
            carry = (LowType) (v >> (sizeof(LowType)*8));

        }
    }
}

class Node
{
public:
    const std::type_info *ti;
    char name[1024];

    Node *Next = nullptr;

    Node(const std::type_info *ti, const char *name)
    {
        this->ti = ti;

        int i;
        for (i=0; i<1023; i++)
        {
            if (name[i] == 0) break;
            this->name[i] = name[i];
        }

        this->name[i] = 0;
    }

    const char *GetName(const std::type_info *t)
    {
        for (Node *N = this; N != nullptr; N = N->Next)
            if (N->ti == t) return N->name;
        return nullptr;
    }
};

Node Head(NULL, "");

template<typename LowType, typename HighType>
void multiply_uint64_t(uint64_t a, uint64_t b)
{
    const int digits = sizeof(uint64_t) / sizeof(LowType);
    LowType _a[digits];
    LowType _b[digits];
    LowType _c[digits*2];

    uint64_t a_q = a;
    uint64_t b_q = b;

    const int shift = sizeof(LowType)<<3;

    for (int digit=0; digit < digits; digit++)
    {
        _a[digit] = (LowType) a_q;
        _b[digit] = (LowType) b_q;

        a_q >>= shift;
        b_q >>= shift;
    }

    multiply_any<LowType,HighType>(digits, _a, _b, _c);

    char FormatString[1024];

    sprintf(FormatString, "%s%ldX", "%0", sizeof(LowType)*2);

    printf("%s: ", Head.GetName(&typeid(LowType)));
    fflush(stdout);

    for (int i=digits-1; i>=0; i--)
    {
        printf(FormatString, (int) _a[i]);
        fflush(stdout);
    }

    printf(" * ");
    fflush(stdout);

    for (int i=digits-1; i>=0; i--)
    {
        printf(FormatString, (int) _b[i]);
        fflush(stdout);
    }

    printf(" = ");
    fflush(stdout);

    for (int i=(digits<<1)-1; i>=0; i--)
    {
        printf(FormatString, (int) _c[i]);
        fflush(stdout);
    }

    printf("\n");
    fflush(stdout);

}

void test(uint64_t a, uint64_t b)
{
    multiply_uint64_t<uint8_t, uint16_t>(a, b);
    multiply_uint64_t<uint16_t, uint32_t>(a, b);
    multiply_uint64_t<uint32_t, uint64_t>(a, b);

}

uint64_t Getuint64_t()
{
    uint64_t R = 0;
    for (int i=0; i < 8; i++)
    {
        ((unsigned char *)&R)[i] = rand() % 0xFF;
    }

    return R;
}

int main(int argc, char **argv) {
    printf("%X\n", 0xFF * 0xFF);

    uint64_t a = 0x1000000010000001ul;
    uint64_t b = 0x2000000000000002ul;

    Node *Cur = &Head;

    Node *q;

    q = new Node(&typeid(uint8_t), "uint8_t");
    Cur = Cur->Next = q;
    q = new Node(&typeid(uint16_t), "uint16_t");
    Cur = Cur->Next = q;
    q = new Node(&typeid(uint32_t), "uint32_t");
    Cur = Cur->Next = q;
    q = new Node(&typeid(uint64_t), "uint64_t");
    Cur = Cur->Next = q;


    test(a,b);

    srand(0);

    printf("run carry\n");
    test(0xFFFFFFFFFFFFFFFFul, 0xFFFFFFFFFFFFFFFFul);

    for (int i=0; i<4; i++) {
        printf("run %d\n", i);
        a = Getuint64_t();
        b = Getuint64_t();

        test(a, b);
    }

    return 0;
}
		

Hear is the output

Hear the output shows perfect consistency at different widths with various edge cases.


FE01
uint8_t: 1000000010000001 * 2000000000000002 = 02000000020000004000000020000002
uint16_t: 1000000010000001 * 2000000000000002 = 02000000020000004000000020000002
uint32_t: 1000000010000001 * 2000000000000002 = 02000000020000004000000020000002
run carry
uint8_t: FFFFFFFFFFFFFFFF * FFFFFFFFFFFFFFFF = FFFFFFFFFFFFFFFE0000000000000001
uint16_t: FFFFFFFFFFFFFFFF * FFFFFFFFFFFFFFFF = FFFFFFFFFFFFFFFE0000000000000001
uint32_t: FFFFFFFFFFFFFFFF * FFFFFFFFFFFFFFFF = FFFFFFFFFFFFFFFE0000000000000001
run 0
uint8_t: FCF1BE5355A297A3 * E2E91314526B79F9 = E033B4E5D0E5293D33AA53CBBBF2888B
uint16_t: FCF1BE5355A297A3 * E2E91314526B79F9 = E033B4E5D0E5293D33AA53CBBBF2888B
uint32_t: FCF1BE5355A297A3 * E2E91314526B79F9 = E033B4E5D0E5293D33AA53CBBBF2888B
run 1
uint8_t: 275708561F8E512D * 75778252D0D405A7 = 120D244B28A86994EB9FBD0626B0D55B
uint16_t: 275708561F8E512D * 75778252D0D405A7 = 120D244B28A86994EB9FBD0626B0D55B
uint32_t: 275708561F8E512D * 75778252D0D405A7 = 120D244B28A86994EB9FBD0626B0D55B
run 2
uint8_t: 526A3D58ED4A991B * 95AD8E684A24D536 = 302FB558B8439796120709E874EBC2B2
uint16_t: 526A3D58ED4A991B * 95AD8E684A24D536 = 302FB558B8439796120709E874EBC2B2
uint32_t: 526A3D58ED4A991B * 95AD8E684A24D536 = 302FB558B8439796120709E874EBC2B2
run 3
uint8_t: 6CDD8CC4B5353C5F * CF59B4B4E23D3211 = 582D4AB1B1DFF16658F5303BAEF6904F
uint16_t: 6CDD8CC4B5353C5F * CF59B4B4E23D3211 = 582D4AB1B1DFF16658F5303BAEF6904F
uint32_t: 6CDD8CC4B5353C5F * CF59B4B4E23D3211 = 582D4AB1B1DFF16658F5303BAEF6904F

Process finished with exit code 0
            

Reference to self, as html added to GitHub