How to Bypass the GIL for Parallel Processing

In this tutorial, we will learn about the parallel processing in Python by passing the GIL. GIL is an important concept in Python which prevents multiple threads from executing Python bytecodes in parallel within the same process. This means that even on multi-core processors, Python threads are not able to fully utilize all available CPU cores for CPU-bound tasks due to the GIL. However it is also comes with the some limitations. We will also discuss how to run Python threads in parallel on multiple CPU cores, how we can avoid the data serialization overhead of multiprocessing. Before diving deep into this topic, let's recall the concept of parallel processing.

Introduction to Parallel Processing

Parallel processing in Python refers to the simultaneous execution of multiple tasks or processes in order to improve performance and take advantage of multi-core processors. It allows you to perform multiple computations concurrently, which can significantly reduce the time it takes to complete CPU-intensive or time-consuming tasks.

Various factors can constrain the rate at which concurrent tasks make progress. It's crucial to identify these limitations before determining whether parallel processing is suitable for your requirements and how to leverage it effectively.

The speed at which a task executes heavy computations is primarily determined by your CPU's clock rate. This clock rate directly relates to the number of machine code instructions completed per unit of time. In simpler terms, the faster your processor can operate, the more work it can accomplish within the same timeframe. When a task's performance is restricted by your processor's capabilities, we describe that task as being CPU-bound.

When dealing exclusively with CPU-bound tasks, you can enhance performance by executing them concurrently on several processor cores. Nevertheless, this approach has its limits. Beyond a certain point, your tasks will vie for the limited resources available, and the overhead linked to context switching will become problematic. To prevent performance degradation, it's advisable not to exceed the number of CPU-bound tasks being run concurrently beyond the count of available CPU cores.

Imagine an I/O-bound task as like playing chess against a single opponent. You need to make your move and then allow the other player to do the same. During their thinking time, you have the choice to either wait patiently or use that time productively. For instance, you can resume playing another game with a different player or attend to an urgent phone call.

Hence, there's no need to execute I/O-bound tasks concurrently to run them in parallel. This characteristic eliminates the constraint on the maximum number of simultaneous tasks. Unlike CPU-bound tasks, which are restricted by the count of physical CPU cores, I/O-bound tasks are not subject to such limitations. Your application can accommodate as many I/O-bound tasks as your available memory permits. It's not unusual to encounter hundreds or even thousands of such tasks.

Leveraging the Potential of Multiple CPU Cores

There are several methods reveal the parallel capabilities of modern CPUs, each with its own set of trade-offs. As an example, you have the option to execute piece of your code within distinct system processes. This approach delivers robust resource isolation and data consistency, albeit with the drawback of requiring costly data serialization.

Processes are relatively less complicated because they typically require minimal coordination or synchronization. However, their creation and interprocess communication (IPC) come at a relatively high cost, so you should limit the number you create to avoid diminishing returns. It's advisable to refrain from transferring substantial data between processes due to the serialization overhead, which can outweigh the advantages of parallelization in such cases.

To execute a large volume of concurrent tasks efficiently, you can use coroutines. These are even more lightweight execution units than threads. Unlike threads and processes, they operate using cooperative multitasking, voluntarily pausing their execution at specific points instead of relying on a preemptive task scheduler. This approach has its own set of advantages and disadvantages.

Compare Multithreading in Python and Other Languages

Multithreading typically involves dividing tasks, sharing the workload across the available CPUs, managing the individual workers, ensuring they access shared resources safely, and combining their partial results. However, for this illustration, we won't go through any of these steps. To highlight the issue with threads in Python, we'll simply call the same function on all the CPU cores at the same time, without paying attention to the return values.

Solve I/O Problem using the Java Threads

Java threads can be used to address both CPU-bound and I/O-bound problems. Let's illustrate this with examples for each:

Solving CPU-Bound Problems with Java Threads

CPU-bound problems involve tasks that are computationally intensive. Java threads can help by parallelizing these tasks across multiple CPU cores. Here's a simple example of calculating the sum of a large array of numbers using Java threads:

Example -

public class SumCalculator extends Thread {
    private int[] numbers;
    private int start;
    private int end;
    private long partialSum;

    public SumCalculator(int[] numbers, int start, int end) {
        this.numbers = numbers;
        this.start = start;
        this.end = end;
    }

    public long getPartialSum() {
        return partialSum;
    }

    @Override
    public void run() {
        for (int i = start; i < end; i++) {
            partialSum += numbers[i];
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] numbers = new int[1000000];
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = i + 1;
        }

        int numThreads = 4; // Number of threads
        int chunkSize = numbers.length / numThreads;

        SumCalculator[] calculators = new SumCalculator[numThreads];
        for (int i = 0; i < numThreads; i++) {
            int start = i * chunkSize;
            int end = (i == numThreads - 1) ? numbers.length : (i + 1) * chunkSize;
            calculators[i] = new SumCalculator(numbers, start, end);
            calculators[i].start();
        }

        long totalSum = 0;
        for (int i = 0; i < numThreads; i++) {
            calculators[i].join();
            totalSum += calculators[i].getPartialSum();
        }

        System.out.println("Total Sum: " + totalSum);
    }
}

In this example, we divide the array into chunks and assign each chunk to a separate thread. Each thread calculates the partial sum of its assigned chunk, and then the results are combined to get the total sum.

Solving I/O-Bound Problems with Java Threads

For I/O-bound problems, such as making multiple network requests simultaneously, Java threads can also be used to improve efficiency. Here's a simple example of making parallel HTTP requests using Java threads.

Example -

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpRequester extends Thread {
    private String url;

    public HttpRequester(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        try {
            HttpURLConnection connection = (HttpURLConnection) new URL(url).openConnection();
            connection.setRequestMethod("GET");

            BufferedReader reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
            String line;
            StringBuilder response = new StringBuilder();

            while ((line = reader.readLine()) != null) {
                response.append(line);
            }

            reader.close();
            System.out.println("Response from " + url + ": " + response.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://example.com",
            "https://www.google.com",
            "https://www.openai.com"
        };

        for (String url : urls) {
            HttpRequester requester = new HttpRequester(url);
            requester.start();
        }
    }
}

In this example, we create separate threads for making HTTP requests to multiple URLs simultaneously, which can significantly reduce the time it takes to fetch data from various sources.

These examples shows how Java threads can be employed to tackle both CPU-bound and I/O-bound problems by parallelizing tasks and leveraging the capabilities of multi-core processors.

Python Threads Only Solve I/O Problem

In Python, due to the Global Interpreter Lock (GIL), threads are often more suitable for solving I/O-bound problems rather than CPU-bound problems. Here's an example illustrating how Python threads can be used to solve an I/O-bound problem, specifically making parallel HTTP requests:

Example -

import threading
import requests

# Define a function to make an HTTP request
def make_request(url):
    response = requests.get(url)
    print(f"Response from {url}: Status Code {response.status_code}")

# List of URLs to make requests to
urls = [
    "https://example.com",
    "https://www.google.com",
    "https://www.openai.com",
]

# Create and start a thread for each URL
threads = []
for url in urls:
    thread = threading.Thread(target=make_request, args=(url,))
    threads.append(thread)
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()
print("All requests completed.")

In this example, we create a separate thread for each URL, and each thread makes an HTTP request. Since making HTTP requests involves waiting for external responses (I/O-bound), using threads allows us to perform these operations concurrently, improving efficiency.

While Python threads can help with I/O-bound tasks like this one, for CPU-bound tasks, you may want to consider using the `multiprocessing` module to fully utilize multiple CPU cores because of the limitations imposed by the Global Interpreter Lock (GIL) in Python.

The Global Interpreter Lock (GIL) in Python Restricts Concurrent Execution of Threads

Python threads come with unique characteristics. They are proper threads managed by the operating system, but they also utilize cooperative multitasking, which is somewhat uncommon. In contrast, most modern systems typically employ time-sharing multitasking with a preemptive scheduler. This approach ensures equitable distribution of CPU time among threads, preventing greedy or poorly designed threads from depriving others of resources.

Internally, the Python interpreter relies on the operating system's threads, which are made accessible through libraries like POSIX threads. However, there's a key restriction: only the currently active thread holding the Global Interpreter Lock (GIL) is allowed to execute. This necessitates that threads voluntarily yield the GIL periodically. As mentioned earlier, whenever there's an I/O operation, the thread automatically releases the GIL. Threads that are not involved in such operations will still release the GIL after a certain predefined time interval.

In Python versions before 3.2, the interpreter had a mechanism where it would release the Global Interpreter Lock (GIL) after executing a set number of bytecode instructions. This was done to allow other threads to have an opportunity to run, especially when there were no pending I/O operations. Since the scheduling of threads is managed by the operating system, it often resulted in the same thread, which had just released the GIL, regaining it immediately.

This approach resulted in inefficient and inequitable context switching. Furthermore, it was unpredictable because a single bytecode instruction in Python could correspond to a variable number of machine-code instructions, each with different associated execution costs. For instance, invoking a C function could take significantly more time than a simple task like printing a newline, despite both being represented as a single instruction.

Since then, Python threads have adopted a different approach. Instead of counting bytecodes, they release the Global Interpreter Lock (GIL) after a designated switching interval, which is set to a default value of five milliseconds. It's worth noting that this timing is not precise, and the release of the GIL occurs only when other threads signal their intent to acquire the GIL for their own execution.

Utilize Process-Based Parallelism in Place of Multithreading

The conventional method for achieving parallelism in Python has been to execute the interpreter in multiple instances by using distinct system processes. This method is relatively straightforward and serves as a means to bypass the Global Interpreter Lock (GIL). However, it does come with certain limitations that might render it unsuitable for specific scenarios. We will now explore two modules from the standard library that can assist in implementing this form of parallelism.

multiprocessing: Low-Level Control Over Processes

The multiprocessing module was purposely designed to closely resemble the threading module, accepting its familiar building blocks and interface. This intentional similarity makes it remarkably convenient to transition your code from a thread-based approach to a process-based one, and vice versa. In certain situations, these modules can seamlessly substitute one another.

futures: High-Level Interface for Running Concurrent Tasks

In Python 3.2, a new module was introduced in the standard library, inspired by a Java API. This module, known as concurrent.futures, takes its cues from early versions of Java, particularly the java.util.concurrent.Future interface and the Executor framework. Its new addition provides a unified and user-friendly way to handle pools of threads or processes, simplifying the execution of asynchronous tasks in the background.

When compared to the multiprocessing module, the elements within concurrent.futures offer a more straightforward, though somewhat limited, interface. It abstracts the complexities of managing and coordinating individual workers. It's worth noting that this package is built on top of `multiprocessing` but separates the submission of concurrent work from the collection of results. These results are represented by "future" objects. With `concurrent.futures`, there's no longer a need to manually exchange data using queues or pipes.

Enabling Parallel Execution of Python Threads

In this section, we will dive into different methods for circumventing Python's Global Interpreter Lock (GIL). You will gain insights into utilizing alternative runtime environments, employing GIL-resistant libraries like NumPy, creating and utilizing C extension modules, use the power of Cython, and making calls to external functions. At the conclusion of each subsection, we will provide the advantages and disadvantages of each approach, aiding you in making well-informed choices tailored to your particular requirements.

The Python interpreter serves as the underlying mechanism responsible for interpreting and executing Python code, essentially powering the execution of your programs. Complementing this interpreter, the standard library encompasses a repository of modules, functions, and objects, typically integrated into the interpreter. Within this library, you'll discover built-in functions like print(), as well as modules facilitating interaction with the host operating system, such as the os module.

The CPython is the default and widely used popular Python interpreter, which is implemented in the C programming language. It includes the Global Interpreter Lock (GIL) and reference counting for automated memory management. Consequently, its internal memory structure is made visible to extension modules via the Python/C API, ensuring they are cognizant of the presence of the GIL.

Fortunately, Python code isn't limited to just CPython as the interpreter. There are alternative implementations that leverage external runtime environments like the Java Virtual Machine (JVM) or the Common Language Runtime (CLR) for .NET applications. These implementations enable you to utilize Python to interact with the respective standard libraries, work with their native data types, and adhere to the runtime execution guidelines. However, it's worth noting that they might lack certain Python features.

Choosing for an alternative Python interpreter in place of CPython is often the simplest approach to circumventing the GIL, typically necessitating no alterations to your existing codebase whatsoever.

Let's write the C extension module that release the GIL.

Creating a C extension module for Python that releases the Global Interpreter Lock (GIL) requires writing C code that explicitly manages the GIL. Releasing the GIL allows multiple threads to execute Python code concurrently. Below is an example of a simple C extension module that demonstrates how to release and reacquire the GIL:

Create a C source file (e.g., `my_extension.c`):

#include <Python.h>
#include <thread>

// A simple function that releases and reacquires the GIL.
static PyObject* my_extension_run_concurrently(PyObject* self, PyObject* args) {
    // Release the GIL to allow other Python threads to run concurrently.
    Py_BEGIN_ALLOW_THREADS;

    // Simulate some CPU-bound work in a separate thread.
    std::thread([]() {
        // Perform some CPU-bound work here.

        // Reacquire the GIL before returning to Python code.
        PyGILState_STATE gstate;
        gstate = PyGILState_Ensure();

        // You can safely call Python API functions here.

        // Release the GIL when done with Python API calls.
        PyGILState_Release(gstate);
    }).join();

    // Reacquire the GIL before returning to Python code.
    Py_END_ALLOW_THREADS;

    Py_RETURN_NONE;
}

// Method definition table
static PyMethodDef MyExtensionMethods[] = {
    {"run_concurrently", my_extension_run_concurrently, METH_NOARGS, "Run concurrently and release the GIL."},
    {NULL, NULL, 0, NULL} // Sentinel
};

// Module initialization function
static struct PyModuleDef my_extension_module = {
    PyModuleDef_HEAD_INIT,
    "my_extension", // Module name
    NULL,           // Module documentation
    -1,
    MyExtensionMethods
};

// Module initialization function for Python 3
PyMODINIT_FUNC PyInit_my_extension(void) {
    return PyModule_Create(&my_extension_module);
}

Compile the C extension module:

You can compile the C code into a shared library using a C compiler. For example, if you have `my_extension.c`, you can compile it on Unix-like systems using:

You must make sure to adjust the include path and output filename according to your Python version and platform.

Using the C extension module in Python:

Now, you can use your C extension module in Python code like this:

import my_extension
# Call the C function that releases and reacquires the GIL.
my_extension.run_concurrently()

In this example, the Py_BEGIN_ALLOW_THREADS and `Py_END_ALLOW_THREADS` macros are used to release and reacquire the GIL, allowing the C++ thread to perform CPU-bound work without blocking other Python threads. Remember to be cautious when using this technique, as it can introduce thread-safety concerns. It's essential to use appropriate synchronization mechanisms when sharing data between threads in such scenarios.

Conclusion

This tutorial explored the concept of parallel processing in Python and the challenges posed by the Global Interpreter Lock (GIL). The GIL restricts the concurrent execution of Python threads within the same process, limiting the utilization of multi-core processors for CPU-bound tasks. We discussed the differences between CPU-bound and I/O-bound tasks, highlighting the benefits of parallelization for CPU-bound tasks.

We also examined various methods to enable parallelism in Python, including process-based parallelism using the multiprocessing module and high-level concurrent processing with concurrent.futures. Additionally, we learned about alternative Python interpreters like Jython and IronPython, which can bypass the GIL and provide access to external runtime environments.

We also concluded with a practical example of creating a C extension module that releases and reacquires the GIL, allowing for concurrent execution of CPU-bound tasks. While this technique can enhance performance, it should be approached with caution to ensure thread safety.

In summary, understanding the GIL and using the appropriate parallel processing techniques can significantly improve the performance of Python applications, especially when dealing with CPU-intensive workloads.

Next TopicHow to Catch Multiple Exceptions in Python

← prev next →