Using MPP CAS Multi-Threading to boost prime number computing speed using the brute force method

2 Likes

Prime numbers (numbers which are only divisible by one or themselves) are used in encryption, mechanical engineering, random number generation, and a variety of other applications. The way to generate a list of prime numbers is simple in concept, but gets progressively more difficult for computers as you travel up along the number line. For this reason, it’s useful as an experiment to see how Multi-Threading can be used in CAS. More importantly, it shows us how we can optimize the use of our threads to get the most speed.

Foundational Material

This blog builds upon the ideas laid out in Simon William’s blog: Base SAS + SAS/CONNECT – A simple method to generate load on any number of licensed cores

To modify Simon’s Base SAS prime number generation code in order to load it in CAS and keep track of individual threads, I’ve used portions of code from Mike Henderson’s blog: Easy Multi-Threading with SAS – STATMIKE

Both blogs are also worth a read if you’d like to better understand aspects of SAS multi-threading and/or parallel processing.

Methods for computing primes

For this experiment we’ll be looking to compute every prime number between 1 and a given integer (1000, 10000, 100000, 500000). There are a variety of algorithms for doing so: and the big three are Fermat’s “little theorem”, the Sieve of Eratosthenes, and the brute force method.

Fermat’s little theorem does not get more complex as numbers get larger, which seemingly makes it ideal for this blog. However, it essentially involves running a series of “prime tests” on a given number. The issue with these tests is that they involve exponents of incredibly large numbers, which any given coding platform will need to be specially equipped to handle beyond a certain integer size.

For example: 3^318 is an extraordinarily large integer. To illustrate what I mean, if you open up your calculator app on your machine it will tell you your input is invalid.

Furthermore, Fermat’s little theorem is more like a statistical guess instead of a definitive answer as to whether a number is prime. The guess is usually correct and the accuracy can be tuned at the cost of more computations per checked number, but there are also special numbers called Carmichael numbers that will pass the test despite not being prime.

The Sieve of Eratosthenes involves eliminating all multiples of a prime number on the number line to eventually leave yourself with just the primes.

For example: to find the prime numbers in the span of numbers between 1 to 100 you would mark all the numbers divisible by 2, then all the numbers divisible by 3, then 5, then 7, and so on.

While the Sieve of Eratosthenes and other similar prime number sieve algorithms start out very fast, they use a lot of memory to maintain a record of primes and non-primes. Due to this, they are bottlenecked by the speed at which your machine can read/write to and from memory… and so will be less useful to us at large scales when compared to the final and most simplistic method: brute force.

The brute force method is exactly as the name suggests. It is easy to implement and is a direct check of each number to see if it is divisible by any of the numbers below it.

For example: to check if the number 7 is prime, your program would check if it was divisible by 2, then 3, then 4, then 5, then 6. Because 7 is not divisible by any of the numbers before it, the number is marked as prime. To check if the number 9 is prime, your program would check if it was divisible by 2, then 3. Because 9 is divisible by 3 it is marked as not prime.

For this experiment, brute force is the method of choice. Originally I would have liked to try the Sieve of Eratosthenes algorithm, but converting it to a multithreaded approach would be a little bit more complicated because of how threads would have to share pieces of information (perhaps it will be a topic for another blog). The complexity of brute force is O(n) – which is linear – and is easier to follow for experiments. Essentially, as you search more numbers the time to compute the primes should go up relatively consistently.

Code & Approach

Code to compute prime numbers was ran on a clean deployment and both the single thread and multi-threaded approach were carried out on the same machine to ensure accuracy when comparing them.

Code for computing prime numbers on a single thread:

%LET searchspace=1000;

/* initialize an ordered table of numbers 1 to searchspace */
data work.initiallist;

do until(i=&searchspace);
i+1;
output;
end;

run;

data randomprime;
set initiallist;

if _N_ <= &searchspace then do;
L=0;
do j=1 to i;
k=0;

if (mod(i , j)= 0) then do;
k+1;
a=mod(i , j);
end;

if K=1 then L+1;
end;

If L=2 then do;
output;
end;
end;
run;

This code creates a list of numbers, in the above example it would be 1 to 1000. Then, the program will then use the brute force algorithm to determine if each number in the list is prime, and output all prime numbers to a new table.

Code for computing prime numbers on multiple threads:

Initialization/Randomization step:

/* setup a cas session */
cas mysess;
libname mycas cas sessref=mysess;

/* declare number of threads and max number searched for primes */
%LET threads = 24;
%LET searchspace=500000;

/* initialize an ordered table of numbers 1 to searchspace */
data initiallist;

do until(i=&searchspace);
i+1;
output;
end;
run;

/* randomize the order of the numbers from initiallist using a uniform distribution */
PROC sql;
create table MYCAS.randomizedlist as
select * from work.initiallist
order by rand(‘uniform’);
QUIT;

Computation step:

/* use parallel processing in CAS to identify prime numbers using the brute force method */
proc cas;
dscode=”

data randomprimeMPP;
set randomizedlist;
part_size = int(&searchspace/(&threads)); drop part_size;
thread=_threadid_;

uniqueRowID = _n_ + (_threadid_ * 1000);

p = (thread – 1) * part_size;
s = p + part_size; drop s;
if thread = &threads then s = &searchspace;

row_id = _N_;

if uniqueRowID <= (s * 1000) then do;
L=0;
do j=1 to i;
k=0;

if (mod(i , j)= 0) then do;
k+1;
a=mod(i , j);
end;

if K=1 then L+1;
end;

If L=2 then do;
output;
end;
end;
run;”;

datastep.runcode / code=dscode single=’no’;

This code creates a list of numbers of a desired length and randomizes their placement in that list. Then, the program runs through different equally-sized portions of the list simultaneously using parallel processing. In this environment there are 3 CAS worker nodes with 8 threads each, for a total of 24 individual threads working on computing prime numbers independently.

In a single threaded approach, the program checks each number one at a time and it will take significantly more time as the list of numbers grows. Larger numbers are harder to calculate as they have more checks needed to ensure they are or aren’t prime. Without randomization of number placement for the multi-threaded approach, thread #1 would finish significantly faster than thread #24. The randomization essentially equalizes the time each thread should take to finish, ensuring the computational load is spread evenly to avoid later threads bottlenecking the program’s speed.

Findings

Speeds for single threaded approach:

Primes from 1 to 1000 - 0.01 seconds

1 to 10,000 – 0.91 seconds

1 to 100,000 – 1 minute 37 seconds

1 to 500,000 – 40 minutes 47 seconds

Times jump for the single threaded approach drastically as the number line gets longer. If we compare to CAS multi-threaded approach with our randomization we see a substantial speed increase:

Primes from 1 to 1000 – 0.42 seconds

1 to 10,000 – 0.49 seconds

1 to 100,000 – 16.25 seconds

1 to 500,000 – 2 minutes 7 seconds

The times shown on both series of tests are for each program’s respective prime number search procedure, and their initialization/randomization steps are not taken into consideration because the differences are negligible. There seems to be an initial setup step for CAS that raises the initial times for 1000 and 10,000 prime calculations when compared to the single threaded approach, but multi-threaded prime calculating is more than worth it as the volume of processed numbers increases.

Taking a look at the load on each of the 24 threads, we see across two separate experiments that the randomization leads to relatively even distribution of primes.

Test #1 with a search space of 500,000:

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

Test #2 with a search space of 500,000:

There are 41,538 prime numbers between 1 and 500,000. The program accounts for all of them and spreads the work consistently. However, several repeated experiments confirm that threads 8, 16, and 24 generally contain slightly less primes than average. This is likely related to them being the last thread in their respective worker nodes, though the underlying reason may be due to my own code rather than a quirk in CAS.

Potential Modifications

It’s possible to further modify the code to eliminate all even values from the randomized list, which predictably cuts the computation time in half (we can do this because any even number is divisible by 2 and must not be prime). The following is an approach using the randomization step’s SQL procedure:

/* randomize the order of the numbers from initiallist using a uniform distribution */
PROC sql;
create table MYCAS.randomizedlist as
select * from work.initiallist
WHERE MOD (i, 2) = 1
order by rand(‘uniform’);
QUIT;

When searching this way for primes in the numbers 1 to 500,000, the program takes around 66 seconds, or 1 minute 6 seconds. This approach is essentially a fusion between the brute force and the Sieve of Eratosthenes algorithms and could be an interesting exercise to explore in the future. It would also be possible to remove some more numbers before prime calculation: every 3^rd, every 5^th, etc. This modification has diminishing returns for computation speed once we start removing multiples of larger numbers, but removing multiples of smaller numbers would give significant benefits.

Tweaking the type of distribution for the randomization step in the MPP CAS version of the code could allow for more even spreading of prime numbers across each thread. The distribution model the PROC SQL step currently follows is a uniform distribution, but the nature of primes being more common early in the number line would mean that a weighted distribution method would suit the data better. It's unlikely that this would lead to any significant gains in computation speed, but it’s worth noting nonetheless.

It's also possible to modify the code to only check if a given number is divisible by numbers up to its square root, not up to itself. For example, when checking the number 195 we can take the square root and get 13.9, which rounds up to 14. any number combinations that multiply to 195 that contain a number beyond 14 will have already been “checked”, as numbers greater than 14 multiplied by another number greater than 14 necessarily will yield numbers greater than 195. I didn’t have the time to test this in my environment to see the exact benefits to computation speed, but it is worth noting for future experiments. In this specific scenario I was more concerned with simplicity for testing and chose to use the most basic brute force algorithm with the least number of modifications, as improvements made by CAS are the primary focus and any other prime number-specific modifications to the code could affect the results of the comparison between single thread and multi-threaded processing.

Takeaway

Computing prime numbers, in this case, is a simple exercise that could stand in for any business application in SAS that can be spread across multiple threads. Much like the randomization step in this exercise, it’s worth looking at your own data and understanding what sorts of modifications could be made to get the most out of parallel processing in CAS.