I have code in which I am computing x % 25. x always takes a positive value, but its dynamic range is large.

I found out that this particular piece of code computing x % 25 is taking a large number of cycles. I need to optimize it.

A pre-computed lookup table is ruled out due to the possibly large memory size of the table.

As a second approach I coded the fragment below (C code):

```
int mod(int a, int b)
{
    int r = a;
    while (r >= b)
    {
        r = r - b;
    }
    return r;
}
```

1.) How can I optimize this code further for cycles (squeeze it to the max)?

2.) Is there an entirely different, more optimized way to achieve x % 25? (I know it's not a common operation, but I'm still looking for clever inputs people might have used in their experience which might help me.)

Thank you.

-AD

EDIT:

I think the native modulo operator % in C internally uses a division operation (/), which is costly on the processor I am using (it has no div instruction). Hence I am trying to see if a custom implementation can beat the inherent computation using the % operator.

-AD

Why can't you just use the operator `%`? If this is C code, and the numbers are ordinary "native" `int`s, then that should be the fastest way, by far.

If you don't like the `%` operator:

```
int mod(int a, int b) {
    int integral = a / b;
    return a - (b * integral);
}
```

Is there a reason why you can't use C's built-in modulus operator?

```
int a = x % 25;
```

Following your edit:

If your processor does not have built-in modulo support then I would still use the % operator, for the simple reason that your compiler will know that the processor in question doesn't have a native % instruction, and will likely produce asm code that emulates it optimally.

Put it this way: I'd be fascinated if you could come up with a general algorithm that outperforms whatever the compiler produces from the built-in operator, notwithstanding specific cases (such as simply taking the two lowest digits for modulo 100, etc.).

I find it pretty strange that the operation `x % 25` takes such a long time (if you are using the built-in `%` operator, that is). Most modern processors should do this in a single instruction. I'd look for other reasons why this code takes so long.

EDIT: Here's an algorithm that might at least give some ideas:

256 ≡ 6 (mod 25)

This means that if we write a number `x` as bytes `x3 x2 x1 x0`, we have that `x = 6^3*x3 + 6^2*x2 + 6*x1 + x0` (mod 25).

This gives an algorithm for reducing the size of `x`:

```
int x0 = x & 0xFF, x1 = (x>>8) & 0xFF, x2 = (x>>16) & 0xFF, x3 = (x>>24) & 0xFF;
int y = x3;
y = (y << 2) + (y << 1) + x2;
y = (y << 2) + (y << 1) + x1;
y = (y << 2) + (y << 1) + x0;
```

(here `(y << 2) + (y << 1) = 4*y + 2*y = 6*y`)

After this, `y` will have the same remainder as `x` mod 25. Iterating this 1, 2 or 3 times will make `y` a 17-, 11-, or 9-bit number, respectively. One of these sizes might be small enough to make a lookup table of.

I SERIOUSLY doubt that this would be faster than the built-in `%` operator, though.
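
For what it's worth, here is a runnable sketch of the byte-folding idea above. The 512 cut-off and the final subtraction loop are illustrative choices of mine, not from the answer: each fold replaces x by a strictly smaller value with the same remainder mod 25 (because 256 ≡ 6 mod 25), so folding until the value drops below 512 and then subtracting 25s always terminates.

```
/* Fold the four bytes with Horner evaluation in base 6; the result
 * has the same remainder mod 25 as the input, but is much smaller. */
static unsigned fold25(unsigned x)
{
    unsigned y = (x >> 24) & 0xFF;
    y = 6 * y + ((x >> 16) & 0xFF);
    y = 6 * y + ((x >> 8) & 0xFF);
    y = 6 * y + (x & 0xFF);
    return y;
}

unsigned mod25_fold(unsigned x)
{
    while (x >= 512)    /* fold until the value is small... */
        x = fold25(x);
    while (x >= 25)     /* ...then finish off by subtraction */
        x -= 25;
    return x;
}
```

Instead of the final subtraction loop you could index a 512-entry table, as the answer suggests.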

If you are only considering the number 25, you can use the fact that 25 divides an integer if and only if the last two decimal digits of the integer are 00, 25, 50 or 75. So to get the modulo, you take the last two digits and subtract the largest of 00, 25, 50 or 75 that does not exceed them.
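
A hedged sketch of that idea, assuming you already have the last two decimal digits available as a value in 0..99 (extracting them otherwise would itself need a division by 100):

```
/* Given the last two decimal digits of x (0..99), subtract the
 * largest of 0, 25, 50, 75 that fits; the result equals x % 25. */
int mod25_from_digits(int last_two)
{
    if (last_two >= 75) return last_two - 75;
    if (last_two >= 50) return last_two - 50;
    if (last_two >= 25) return last_two - 25;
    return last_two;
}
```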

If you know that `b` will be a power of 2, you could use a bitwise `AND` instead of the modulo operator. However, the Wikipedia page for modulo seems to indicate that any C compiler would notice this and optimize the modulo away anyway.
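
A minimal illustration of the power-of-two case (not applicable to 25, of course):

```
/* For non-negative x and b a power of two, x % b == x & (b - 1),
 * since the low bits are exactly the remainder. */
unsigned mod_pow2(unsigned x, unsigned b)
{
    return x & (b - 1);
}
```

E.g. `mod_pow2(x, 32)` computes `x % 32` with a single AND.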

If your C compiler is targeting a CPU with no divide instruction, you can modify your code as follows:

```
int mod(int a, int b) {
    int s = b + b + b + b;
    int r = a;
    while (r >= s) {
        r -= s;
    }
    while (r >= b) {
        r -= b;
    }
    return r;
}
```

This works by subtracting the value in chunks of four rather than one, right up until the last chunk, at which point it switches to subtracting in chunks of one.

This should make your code run about four times as fast (assuming `4*b` isn't outside the range of your integers). You could even insert more loops (say, an `8*b` one) before the `4*b` one for even more speed.

Other than that, hand-coding assembler may help but I think you'll find quite a boost from the above code without it.

If you know more detail about the way you'll be using the mod call, you can optimize it for your particular cases. For example, if you only want modulo 25 of a 16-bit integer, the following code will be much faster than a simplistic loop with a variable denominator.

```
int mod25(int a) {                     // a has maximum value of 2^15-1 = 32767
    while (a >= 15625) a -= 15625;     // at most 2 times
    while (a >= 625) a -= 625;         // at most 24 times
    while (a >= 25) a -= 25;           // at most 24 times
    return a;
}
```

Running a test, I find that you have to do 10 million iterations before a noticeable difference appears between that modulo code and the use of the `%` operator (2 seconds vs. 0 seconds). Up until that point, they were both 0 seconds, although that was run on a fast machine (better for `mod25`) and *with* a `div` instruction (better for the `%` operator), so you'd need to benchmark it on your own hardware.

This is about as fast as you're likely to get without making your code unreadable (although even that shouldn't stop you if you're willing to add lots of comments explaining how it works).

A more general solution for any denominator is to first double the denominator (with bit shifts for speed) as far as possible so that the ensuing subtractions are minimized. Then, as the numerator reduces below the increased denominator, halve the denominator and keep going (until the denominator is back at the start).

```
int mod(int n, int d) {
    /* dx is the adjusted denominator; don't let it overflow though. */
    int dx = d;
    while (((dx << 1) >> 1) == dx)
        dx <<= 1;
    /* This loop processes the dx values until they get too small. */
    while (dx >= d) {
        /* This loop subtracts the large dx value. */
        while (n >= dx)
            n -= dx;
        dx >>= 1;
    }
    return n;
}
```

This actually performs on par with the optimized version of `mod25` above, while providing a more general solution.

The problem with your loop is that it's O(n) - it'll be very slow for large values of r. I'd suggest something like this:

```
for (int s = MAX_SHIFT; s >= 0; s--)
    if (r >= (b << s)) r -= (b << s);  /* >= , not > , or exact multiples of b would be missed */
```

But I doubt that your compiler is doing anything much more expensive than that.

Since you want the modulus by a constant, you can probably beat it by using reciprocal multiplication. This paper shows how you can divide by a constant in such a manner, and towards the end, how to get the remainder from it.
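
A sketch of what that looks like for 25 specifically. The constant below is ceil(2^35 / 25) = 1374389535; I believe this magic-number/shift pair is exact for all 32-bit unsigned inputs (it is what typical compilers emit), but verify it against the paper for your word size:

```
/* Divide by 25 via multiplication by a precomputed reciprocal,
 * then recover the remainder from the quotient. */
unsigned mod25_recip(unsigned n)
{
    unsigned q = (unsigned)(((unsigned long long)n * 1374389535u) >> 35);
    return n - q * 25;
}
```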

How about:

```
int y = 0, x = (x & 0x7f);
while (x > 25) { x -= 25; y++; }
```

Update: it's pretty wrong :) But the idea is there.

Oh my <deity of choice>. I can't believe some of these answers.

First thing: repeated subtraction, even Pax's version, will never, ever be optimal. Consider the following:

```
20 % 25
```

that's easy and fast using repeated subtraction, but:

```
65535 % 25
```

will be horribly slow: over 2,600 iterations. That's an average of around 1,300 iterations across the 16-bit range. As for 32-bit numbers, well, just don't even go there.

The fastest way to do this is to use long division. See Niki's answer.

But, this is what the compiler will be generating anyway, at least, one would hope it is what the compiler is generating. It's always best to check if you're using a compiler for a niche processor.

The best way to speed this up is to not do the modulus in the first place. Why do you need the modulus? Can you re-factor the code or algorithm to avoid it, or at least make it trivial?
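
One common refactoring of that kind, sketched here under the assumption (mine, not the question's) that the value being reduced advances by one each step: maintain the remainder incrementally instead of recomputing it.

```
/* Sum i % 25 over i = 0..n-1 without ever evaluating %: keep a
 * rolling remainder that wraps back to 0 when it reaches 25. */
long sum_mod25(int n)
{
    long total = 0;
    int r = 0;                  /* invariant: r == i % 25 */
    for (int i = 0; i < n; i++) {
        total += r;
        if (++r == 25) r = 0;
    }
    return total;
}
```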

I was inspired by Pax's answer and made a more general purpose algorithm.

```
int mod(int a, int b) {
    int s = b;
    while (s <= a) {
        s <<= 1;
    }
    int r = a;
    while (r >= b) {
        s >>= 1;
        if (s <= r) {
            r -= s;
        }
    }
    return r;
}
```

This subtracts power-of-two multiples of `b` from `a` until the result is found.

EDIT: added the `if` condition to make it work properly.

As an example, if this is doing 100 % 7, it first works out that 7 * 2 * 2 * 2 * 2 = 112. Then it divides 112 (`s`) by 2 and subtracts it from 100 (`r`) whenever `s <= r`, and continually does this until the modulo is found. Therefore,

```
s = 112 / 2 = 56, r = 100 - 56 = 44
s = 56 / 2 = 28, r = 44 - 28 = 16
s = 28 / 2 = 14, r = 16 - 14 = 2
```

therefore, 100 % 7 = 2

Possibly not the fastest, but reasonably efficient. I haven't had time to test, but use a lookup table of (powers of 2) * 25 up to half the maximum range, then do a loop. E.g. a range up to 3199 needs 7 iterations.

```
static int pow[] = {25, 50, 100, 200, 400, 800, 1600};

int mod25(int x)
{
    int i = sizeof pow / sizeof pow[0];
    while (i--)
    {
        if (x >= pow[i])
            x -= pow[i];
    }
    return x;
}
```

If you have a very large range but low values are more common, then it might be worthwhile using a binary chop to find the starting point.
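
That might look something like this (a sketch reusing the same table of powers-of-2 times 25; the binary search finds the largest entry not exceeding `x`, so small inputs skip all the big entries):

```
static const int pow25[] = {25, 50, 100, 200, 400, 800, 1600};

int mod25_chop(int x)
{
    /* Binary-search for one past the largest index with pow25[i] <= x. */
    int lo = 0, hi = sizeof pow25 / sizeof pow25[0];
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (pow25[mid] <= x) lo = mid + 1; else hi = mid;
    }
    /* Subtract downwards from that starting point. */
    while (lo--) {
        if (x >= pow25[lo])
            x -= pow25[lo];
    }
    return x;
}
```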

On many processors, integer multiplication is faster than integer division. This blog post shows how to replace a constant integer division with a constant integer multiplication. By rearranging the maths a bit you can get the remainder instead of the quotient. Note, however, that if you are using a moderately sophisticated compiler, then this is already done for you: you just write `x % 25` and the compiler works out the rest. You should check the generated assembly code, verifying that the compiler has not done this already, before doing this optimisation in C. Also, you should measure (profile) the performance before and after to ensure that you really are making things faster.

Looping will be far slower than doing the division using the native instruction for reasonably large operands.

Edit: see also this paper.

I suggest reading Hacker's Delight. It describes very fast remainder algorithms for constant divisors. They would almost certainly beat a general algorithm.

Update: Here is some example code... It can probably be reworked to avoid the temporary long long.

```
unsigned mod25(unsigned n)
{
    unsigned reciprocal = 1374389535; // 2^35 / 25, rounded up
    unsigned div25 = ((unsigned long long)n * reciprocal) >> 35;
    return n - div25 * 25;
}
```

Here's the best I could come up with:

```
int mod25(int x)
{
    while ((x = (x & 31) + 7 * (x >> 5)) >= 25)
        x -= 25;
    return x;
}
```

It approximates `x % 25` with `x % 32 + 7 * (x/32)`. The value will only ever overshoot by a multiple of `25`, which allows the same step to be applied again.

Performance seems to be adequate: a value of `x = 2147483647` (aka `INT_MAX`) needs 11 iterations.

If you kept your numbers in BCD or a byte array of digits, this would be pretty easy. Unfortunately, I have no idea what else you're doing in your program with these numbers. Sometimes it pays to look at how you represent your data rather than just bang away at algorithms.

```
int mod25(int x) {
    static int divisors[] = {2147483625, 244140625, 9765625, 390625, 15625, 625, 25};
    int i;
    for (i = 0; i < sizeof(divisors) / sizeof(int); i++) {
        int divisor = divisors[i];
        while (x >= divisor) {
            x -= divisor;
        }
    }
    return x;
}
```

How it works: We want to decrement `x` by large multiples of 25 to reduce the value as fast as possible. When the divisor is too big we switch to a smaller multiple of 25. Once the divisor is down to 25 itself, we're done.

You could try experimenting with different divisors. You just want to make sure that:

- they're descending
- they're all multiples of 25
- the last value is 25

In the code above I used the largest signed-32-bit multiple of 25 plus the powers of 25, which seems reasonable, though I have to admit that I'm not sure that it's optimal.

(BTW: if your compiler doesn't do constant folding, which would be *very* surprising, then you might want to replace the upper limit of `i` with a hard-coded constant.)

Here's another solution I came up with:

```
int mod25(int x) {
    /* 25 * (all powers of 2 <= INT_MAX), descending */
    if (x >= 1677721600) x -= 1677721600;
    if (x >= 838860800) x -= 838860800;
    if (x >= 419430400) x -= 419430400;
    if (x >= 209715200) x -= 209715200;
    if (x >= 104857600) x -= 104857600;
    if (x >= 52428800) x -= 52428800;
    if (x >= 26214400) x -= 26214400;
    if (x >= 13107200) x -= 13107200;
    if (x >= 6553600) x -= 6553600;
    if (x >= 3276800) x -= 3276800;
    if (x >= 1638400) x -= 1638400;
    if (x >= 819200) x -= 819200;
    if (x >= 409600) x -= 409600;
    if (x >= 204800) x -= 204800;
    if (x >= 102400) x -= 102400;
    if (x >= 51200) x -= 51200;
    if (x >= 25600) x -= 25600;
    if (x >= 12800) x -= 12800;
    if (x >= 6400) x -= 6400;
    if (x >= 3200) x -= 3200;
    if (x >= 1600) x -= 1600;
    if (x >= 800) x -= 800;
    if (x >= 400) x -= 400;
    if (x >= 200) x -= 200;
    if (x >= 100) x -= 100;
    if (x >= 50) x -= 50;
    if (x >= 25) x -= 25;
    return x;
}
```

This doesn't use divides or multiplies, just 27 comparisons and a maximum of 27 subtractions.

It's a little hard to convince yourself that this works, but it does (at least for non-negative values of x).

The above code is really an unrolled version of this:

```
int mod25(int x) {
    for (int divisor = 1677721600; divisor >= 25; divisor >>= 1) {
        if (x >= divisor) x -= divisor;
    }
    return x;
}
```

By unrolling it we avoid doing the loop comparison and also the shifts at the expense of larger code. You could even partially unroll it using Duff's device if you felt so inclined, but with only 27 iterations total, and such a tiny bit of code per-iteration, I'd be inclined to just unroll it all the way.

Here's how it works: Every non-negative integer x can be expressed as (n * 25) + k where n is a non-negative integer and k is an integer from 0 to 24. k also happens to be the result we want, so if we could compute x - (n * 25) we'd get our answer. We want to be able to do this without knowing n up-front, though.

Think about n in binary. If we could turn off each of the 1 bits we'd get 0. One way to do this is to start at large powers of 2 and work our way down, subtracting each power of 2 only if the current value of n is greater than or equal to that power of 2.

Since we're dealing with (n * 25) we actually need descending powers of 2 times 25. Since k is strictly less than 25, and the smallest divisor we ever consider is 25, this works even when we're dealing with (n * 25) + k.

So each comparison + subtraction is zeroing out one bit of n, and at the end we're left with k, the remainder.

Here's an idea:

```
static int table0[256];
static int table1[256];
static int table2[256];
static int table3[256];

// run just once to initialize the tables
void initialMod25Tables() {
    for (int i = 0; i < 256; ++i) {
        table0[i] = i % 25;
    }
    for (int i = 0; i < 256; ++i) {
        table1[i] = (i << 8) % 25;
    }
    for (int i = 0; i < 256; ++i) {
        table2[i] = (i << 16) % 25;
    }
    for (int i = 0; i < 256; ++i) {
        // cast to unsigned so i << 24 can't overflow a signed int
        table3[i] = (int)(((unsigned)i << 24) % 25);
    }
}

int mod25(int x) {
    int y = table0[x & 0xFF];
    x >>= 8;
    y += table1[x & 0xFF];
    x >>= 8;
    y += table2[x & 0xFF];
    x >>= 8;
    y += table3[x & 0xFF];
    y = table0[y];   // y is at most 96 here, so one final lookup finishes it
    return y;
}
```

Please engage some common sense.

If you could write C code that calculated x % 25 faster than the compiler can, then the compiler would use that faster method.

The original poster made this fantastic assumption that the compiler would use a division. No compiler that I've used in the last ten years would be doing that. It's a multiplication by a constant close to (2^32 / 25) plus some bit twiddling that you won't be able to improve by hand.

There is a remote possibility that you can produce faster code than the compiler to find out whether x % 25 == 0, because you don't actually need code that will calculate x % 25 correctly, only code that calculates x % 25 correctly if it is 0 and doesn't produce a 0 if x % 25 != 0. Savings will probably be sub-nanosecond.
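
If you do go down that road, one known trick (not from this thread, so treat the constants as assumptions to verify, and note it assumes 32-bit unsigned arithmetic) multiplies by the modular inverse of 25 mod 2^32; multiples of 25 wrap around to small values while non-multiples land above a threshold:

```
/* 3264175145 is the inverse of 25 modulo 2^32
 * (25 * 3264175145 == 19 * 2^32 + 1), and
 * 171798691 == (2^32 - 1) / 25. */
int is_multiple_of_25(unsigned x)
{
    return x * 3264175145u <= 171798691u;
}
```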

"How do I calculate x % c optimally for various constants c" is a nice puzzle. Compiler writers like nice puzzles. And they are better at solving nice puzzles like this than you are. Especially since they only need a solution that works for *one* machine where you would have to produce a general solution.
