Foresight: Using Genetic Algorithms for parameter tuning a chess engine

I have been working on building a chess engine, and a key part of evaluating a chess position is assigning values to the individual chess pieces.

For reference, the architecture of my chess engine (like that of most conventional chess engines) works as follows: the engine evaluates a chess position and assigns it a score, and to do that it needs the values of the different chess pieces along with the other parameters the engine uses.


One of the first rules taught to beginners learning chess is that the pieces are worth roughly the following, measured in pawns:

Queen ≈ 9 Pawns 
Rook ≈ 5 Pawns
Bishop ≈ 3 Pawns
Knight ≈ 3 Pawns

But, but, but: how do we know that these ratios are correct? And if they really are the optimal ratios, how can we prove it?

Also, my chess engine uses many other parameters, such as:
1. Piece mobility for each piece type
2. Doubled pawns, isolated pawns and passed pawns
3. King safety and center control

Our task is cut out for us: find the optimal values of all these parameters.

Genetic Algorithms

The core idea of evolution is that organisms with favorable traits for that environment are more likely to survive and pass those traits to future generations.

A single chromosome can be defined as a combination of values for all the chess engine parameters.
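As a concrete illustration, here is a minimal sketch of what such a chromosome could look like in Java. The gene count, ordering and value range are my own assumptions for this sketch, not the engine's actual layout:

// A hypothetical chromosome: one gene per tunable engine parameter
// (piece values, mobility weights, pawn-structure terms, king safety, center control).
static final int NUM_GENES = 14;
static final java.util.Random RNG = new java.util.Random();

static int[] randomChromosome() {
    int[] genes = new int[NUM_GENES];
    for (int i = 0; i < NUM_GENES; ++i) {
        genes[i] = RNG.nextInt(1000);   // purely random start: no chess knowledge baked in
    }
    return genes;
}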

We will initialize a population of N chromosomes and make every chromosome play every other chromosome. Counting each pairing only once, chromosome C1 plays (N-1) games against the others, C2 plays (N-2) further games, and so on.

So there will be N*(N-1)/2 games in total.

For each match between two chromosomes, we assign a score of 1 to the winner, -1 to the loser, and 0 to both players if the game is a draw. After all pairings have been played, we end up with a ranking table of chromosomes sorted by total score.
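A rough sketch of this round-robin scoring; playGame is a hypothetical hook that runs one engine-vs-engine game with the two parameter sets and is not shown here:

// Round-robin tournament: every chromosome plays every other one once,
// i.e. n*(n-1)/2 games in total.
static int[] playTournament(int[][] population) {
    int n = population.length;
    int[] score = new int[n];
    for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {
            int result = playGame(population[i], population[j]);
            score[i] += result;          // +1 to the winner, -1 to the loser, 0 each on a draw
            score[j] -= result;
        }
    }
    return score;
}

// Hypothetical hook into the engine: plays one game between two parameter
// sets and returns 1, -1 or 0 from the first chromosome's point of view.
static int playGame(int[] c1, int[] c2) {
    throw new UnsupportedOperationException("wraps an actual engine-vs-engine game");
}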

Now, we keep the top S chromosomes and discard the remaining N-S for the next iteration. Having “selected” the S relatively stronger chromosomes, we perform a crossover operation among them to generate N-S new chromosomes, so that we again have N chromosomes for the next iteration.
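Concretely, one generation step could be sketched like this; breed(parent1, parent2) applies the crossover and mutation rules explained next, and the helper names are mine:

// Build the next generation: keep the top S chromosomes by tournament score
// and refill the population back to size N with offspring of the survivors.
static int[][] nextGeneration(int[][] population, int[] score, int s) {
    int n = population.length;
    Integer[] order = new Integer[n];
    for (int i = 0; i < n; ++i) order[i] = i;
    java.util.Arrays.sort(order, (a, b) -> score[b] - score[a]);   // sort indices by descending score

    int[][] next = new int[n][];
    for (int i = 0; i < s; ++i) {
        next[i] = population[order[i]];        // survivors carried over unchanged
    }
    for (int i = s; i < n; ++i) {
        int p1 = RNG.nextInt(s);               // pick two random survivors as parents
        int p2 = RNG.nextInt(s);
        next[i] = breed(next[p1], next[p2]);   // crossover + mutation, sketched below
    }
    return next;
}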

How do we cross over two chromosomes to generate a new one? The simplest way is to pick two chromosomes C1 and C2 and, for each parameter, take the value from one of them at random:

Cmix(Bishop) = Random(0,1) > 0.5 ? C1(Bishop) : C2(Bishop)

To ensure that our population maintains diversity and doesn’t overfit, we also apply mutation: with a small probability, a parameter is replaced by a completely random value.

Cmix(Bishop) = Random(0,1) < MUTATION_RATIO ? Random(0,1000) : Cmix(Bishop)
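Putting both rules together, here is a minimal sketch of the breed helper used above; it assumes gene values in the range [0, 1000) and MUTATION_RATIO expressed as a fraction:

// Breed one child from two parents: uniform crossover followed by mutation.
static final double MUTATION_RATIO = 0.25;     // 25% chance that any given gene mutates

static int[] breed(int[] c1, int[] c2) {
    int[] child = new int[c1.length];
    for (int g = 0; g < c1.length; ++g) {
        // Crossover: take each gene from one parent chosen at random.
        child[g] = RNG.nextDouble() > 0.5 ? c1[g] : c2[g];
        // Mutation: occasionally replace the gene with a completely random value.
        if (RNG.nextDouble() < MUTATION_RATIO) {
            child[g] = RNG.nextInt(1000);
        }
    }
    return child;
}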

Results

I ran this experiment with a population of N=10, S=4 (the top 4 chromosomes preserved in each iteration) and MUTATION_RATIO=0.25 (a 25% chance that any given gene mutates). For all games, I gave the engine 2 seconds to think per move.

The experiment was run on an 8-core virtual server so that 8 games could be played simultaneously, and each iteration took ~1.5 hours for its 45 games. After 2-3 days of running this simulation, I got the following results:

[Ranking tables after iteration 1 and iteration 41]

Parameter               Iteration 0   Iteration 20   Iteration 40
Pawn                    50            50             50
Rook                    40            170            580
Knight                  140           360            290
Bishop                  360           440            390
Queen                   640           440            930
Bishop Mobility         8             7              7
Rook Mobility           3             6              0
Knight Mobility         6             6              6
Queen Mobility          0             6              6
Doubled Pawn Penalty    2             9              6
Isolated Pawn Penalty   1             6              1
Passed Pawn Bonus       2             9              6
King Safety             2             9              6
Center Control          7             9              2

We can clearly observe that by iteration 40 the piece values have settled into the conventional ordering (queen > rook > bishop > knight > pawn). An interesting observation is that bishops end up valued higher than knights.


The key observation is that we produced these results without any prior chess knowledge or biases built into the parameter-tuning mechanism, just by using Genetic Algorithms!

Efficiently Finding Primes: Java’s Multithreading Approach

This article discusses using multithreading in Java to compute the largest prime number less than 10 million. It explores a primality check algorithm, sequential versus parallel calculations, and the efficiency gains from using a thread pool. Despite improvements, the performance plateaus due to task synchronization overhead. Further optimizations are suggested for exploration.

Let's get started

The purpose of this article is to use multithreading in Java to compute the largest prime number less than 1e7. We will observe some interesting things as we go along.

How do we know if a number is prime?

We will use an intuitive, good-enough primality check algorithm.

This algorithm has O(√n) time complexity, where n is the number to be tested. There are faster primality-testing algorithms; we can explore them in the future, but they are beyond the scope of this article.

static boolean isPrime(long number) {
    if (number < 2) {
        return false;               // 0 and 1 are not prime
    }
    // Trial division by every candidate up to and including sqrt(number).
    for (long i = 2; i * i <= number; ++i) {
        if (number % i == 0) {
            return false;
        }
    }
    return true;
}

How do we find the largest prime number less than 1e7?

public static void sequentialPrimeCalculation() {
    long maxPrime = -1;
    long startTime = System.currentTimeMillis();
    // Check every number below 1e7, one after another, on a single thread.
    for (long i = 2; i < 1e7; ++i) {
        if (isPrime(i)) {
            maxPrime = Math.max(i, maxPrime);
        }
    }
    System.out.println("Largest prime is: " + maxPrime);
    System.out.println("Computation took: " + (System.currentTimeMillis() - startTime) / 1000.0 + " s");
}

Output:

Largest prime is: 9999991

Computation took: 15.036 s

We now have a baseline for the sequential, single-threaded version.

Let's start with multithreading now

We can see that each individual primality test in the sequential code is independent. This makes it a suitable task to be parallelized.

Each individual task will perform the primality test and store the result in an AtomicLong variable, which is thread-safe and hence can be updated safely from every task.

static AtomicLong largestPrime = new AtomicLong(2);

static class PrimeTask implements Runnable {
    long number;

    PrimeTask(long n) {
        this.number = n;
    }

    @Override
    public void run() {
        boolean isPrime = Main.isPrime(this.number);
        if (isPrime) {
            // Atomically keep the maximum prime seen so far across all threads.
            largestPrime.updateAndGet(current -> Math.max(current, this.number));
        }
    }
}

We will use a thread pool, which internally maintains a queue onto which submitted tasks are pushed.

When the thread pool has an idle worker thread, it pops a task from the queue and runs it on that thread. Once the task finishes, the thread becomes available for the next task in the queue.

public static void parallelPrimeCalculation(int numberOfThreads) {
    final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
    long startTime = System.currentTimeMillis();
    // Submit one task per number; the pool spreads them across its worker threads.
    for (long i = 2; i < 1e7; ++i) {
        exec.submit(new PrimeTask(i));
    }
    try {
        // Wait for all submitted tasks to finish.
        exec.shutdown();
        exec.awaitTermination(100, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }
    System.out.println("Number of threads: " + numberOfThreads);
    System.out.println("Largest prime is: " + largestPrime.get());
    System.out.println("Computation took: " + (System.currentTimeMillis() - startTime) / 1000.0 + " s");
}

Output:

Number of threads: 2

Largest prime is: 9999991

Computation took: 11.282 s

The execution time is better than before, although it is not twice as fast as the sequential code.

Let's plot the time taken for the prime computation against the thread count:

The above plot is a classic case of diminishing returns, where the computation time plateaus after about 6 threads.

To understand why this happens, we need to revisit our earlier assumption that the tasks are independent of each other. If all our tasks were completely independent, we would expect the computation time to drop roughly linearly as we add cores, not plateau.

We can divide a single instance of PrimeTask into two parts: Block 1 and Block 2.

We can see that Block 2 of the PrimeTask accesses a global AtomicLong variable, which is a point of contention. Block 2 of both threads tries to update largestPrime, but only one update can succeed at a time, so the other thread has to wait its turn, which makes this part of our program effectively sequential.
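For reference, here is the run() method from earlier with the two blocks annotated; the Block 1 / Block 2 labels are just annotations for this discussion:

@Override
public void run() {
    // Block 1: the primality test. Completely independent, touches no shared state.
    boolean isPrime = Main.isPrime(this.number);
    if (isPrime) {
        // Block 2: the shared update. Every thread competes for the same
        // AtomicLong, so these updates become a synchronization point.
        largestPrime.updateAndGet(current -> Math.max(current, this.number));
    }
}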

We can improve our algorithm by eliminating Block 2 of the PrimeTask.

We will divide the search space [2, 1e7) among the threads, and each thread will keep track of the largest prime it has encountered so far. Once all the threads have finished, we combine their individual results and take the maximum.

We have redesigned the PrimeTask as follows:

static class PrimePoolTask implements Callable<Long> {
    long start;
    long offset;
    long end;

    PrimePoolTask(long start, long offset, long end) {
        this.start = start;
        this.offset = offset;
        this.end = end;
    }

    @Override
    public Long call() {
        long maxNum = -1;
        // Check every offset-th number starting from `start`, keeping a local
        // maximum so that no shared state is touched inside the loop.
        for (long i = start; i < end; i += offset) {
            boolean isPrime = Main.isPrime(i);
            if (isPrime) {
                maxNum = Math.max(i, maxNum);
            }
        }
        return maxNum;
    }
}

We will now divide the search space and call the PrimePoolTask to find the largest prime number:

public static void parallelPrimeImprovedCalculation(int numberOfThreads) {
    final ExecutorService exec = Executors.newFixedThreadPool(numberOfThreads);
    long startTime = System.currentTimeMillis();
    List<Future<Long>> maxPrimePerThread = new ArrayList<>();
    // One task per thread: thread i checks i, i + numberOfThreads, i + 2*numberOfThreads, ...
    for (int i = 1; i <= numberOfThreads; ++i) {
        PrimePoolTask task = new PrimePoolTask(i, numberOfThreads, (long) 1e7);
        Future<Long> future = exec.submit(task);
        maxPrimePerThread.add(future);
    }
    exec.shutdown();    // no new tasks; let the pool wind down once the work is done

    long maximumPrime = -1;
    // Combine the per-thread maxima into the overall answer.
    for (Future<Long> future : maxPrimePerThread) {
        try {
            maximumPrime = Math.max(maximumPrime, future.get());
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }

    System.out.println("Number of threads: " + numberOfThreads);
    System.out.println("Largest prime is: " + maximumPrime);
    System.out.println("Computation took: " + (System.currentTimeMillis() - startTime) / 1000.0 + " s");
    compTime.add((System.currentTimeMillis() - startTime) / 1000.0);    // compTime: static list used to record timings for the plot
}

Let's look at the results of this modified algorithm and how it compares with the old one.

We can see that it is much faster, although it also plateaus. In a perfect world, if a single-core run takes ~16 seconds, a 16-core run should take ~1 second, but we converge at ~2 seconds.

There are several overheads when we do multi-threading which can lead to performance degradation.

For now we can close this discussion; one day we can do a deeper dive into further optimizing this algorithm.

Hope this was informative.

Note: Of course, the fastest way to do this is not to run the primality check upwards from 2 to 1e7 but downwards from 1e7 and stop at the first prime found; however, the purpose of the article is to demonstrate multithreading in Java, not the semantics of the problem. 🙂