Data-Oriented Programming in Python

In this tutorial, we will learn about data-oriented programming (DOP) in Python as an alternative to object-oriented programming (OOP). As the name suggests, DOP is a programming approach that puts data first.

We can achieve this by following four principles. These principles are language-agnostic: they can be applied in OOP languages (Java, C++, etc.), functional programming (FP) languages (Clojure, etc.), or general-purpose languages (Python, JavaScript). Let's look at each principle in turn.

Principle - 1: Separate code from Data

According to the first principle, we write functions so that their behaviour doesn't depend on data encapsulated in the function's context. A natural way to apply this principle in Python is to use top-level functions together with data classes that only have fields for data. Let's understand the following example.

Example -

from dataclasses import dataclass


@dataclass
class AuthorData:
    """Class for keeping track of an author in the system"""

    first_name: str
    last_name: str
    n_books: int


def calculate_name(first_name: str, last_name: str) -> str:
    return f"{first_name} {last_name}"


def is_prolific(n_books: int) -> bool:
    return n_books > 100


author_data = AuthorData("Graham", "Mathew", 200)
calculate_name(author_data.first_name, author_data.last_name)

The calculate_name() function is not tied to authors; it works for users, librarians, or anyone with a first name and a last name. Because the code that calculates the full name is separate from the code that creates author data, the function can be reused wherever a full name is needed.

The benefit of this approach is that code can be reused in different contexts, as the following user data shows.

@dataclass
class UserData:
    """Class for keeping track of a user in the system"""

    first_name: str
    last_name: str
    email: str


user_data = UserData("John", "Doe", "john.doe@gmail.com")
calculate_name(user_data.first_name, user_data.last_name)

Now let's see an example that does not follow Principle 1.

Example -

class Contact:
    def __init__(self, first_name: str, last_name: str, email: str, phone: str):
        self.first_name = first_name
        self.last_name = last_name
        self.email = email
        self.phone = phone

    def send_email(self, message: str):
        # Code for sending an email to the contact's email address
        pass


class Customer:
    def __init__(self, first_name: str, last_name: str, account_balance: float,
                 contact: Contact):
        self.first_name = first_name
        self.last_name = last_name
        self.account_balance = account_balance
        self.contact = contact

    @property
    def full_name(self):
        return f"{self.first_name} {self.last_name}"

    @property
    def is_loyal(self):
        return self.account_balance > 10000
contact_info = Contact("John", "Doe", "johndoe@example.com", "555-1234")
customer = Customer("Alice", "Smith", 50000, contact_info)
assert customer.full_name == "Alice Smith"

Explanation -

In the above code, full_name is a property of the Customer class. To test it, we must first instantiate a Contact object, which requires assigning a value to every attribute, including the email and phone. This is an unnecessarily complex and tedious setup just to test a single property.

In the DOP version, on the other hand, we can test calculate_name() in isolation by creating only the data that is passed into the function.
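
As a minimal sketch of that isolation, the checks below exercise calculate_name() with plain strings only; no AuthorData, Contact, or Customer object is built. The function is repeated here so that the snippet runs on its own.

```python
def calculate_name(first_name: str, last_name: str) -> str:
    """Return the full name built from a first and a last name."""
    return f"{first_name} {last_name}"


# The test data is just two strings; no object setup is needed.
assert calculate_name("Alice", "Smith") == "Alice Smith"
assert calculate_name("John", "Doe") == "John Doe"
```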

Principle - 2: Represent data with generic data structure

According to this principle, data is represented with generic data structures, such as maps (or dictionaries) and arrays (or lists). In this article we use dataclass, which is closer to OOP than to pure DOP, because Python's dataclass bridges the gap between the two approaches. Compared with dictionaries and tuples, a dataclass is less susceptible to typos, is more readable thanks to type hints, and represents nested structures more simply. When needed, dataclass instances can also be converted into dictionaries or tuples.

By utilizing dataclass, we can leverage the benefits of OOP, such as encapsulation and code organization, while taking advantage of the concise and descriptive nature of data structures. The type hinting support in dataclass helps ensure better code quality and facilitates easier debugging and maintenance. Furthermore, the inherent ability to convert dataclass instances into dictionaries or tuples allows for seamless interoperability with other parts of the codebase or external systems that expect these data structures.

Let's understand the following example.

Example -

from dataclasses import dataclass, asdict

@dataclass
class AuthorData:
    """Class for keeping track of an author in the system"""

    first_name: str
    last_name: str
    n_books: int

author_data = AuthorData("Robert", "Downey", 500)
asdict(author_data)

Output:

{'first_name': 'Robert', 'last_name': 'Downey', 'n_books': 500}
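
The conversion also works in other directions. The sketch below, which repeats the AuthorData definition so that it runs on its own, uses astuple() from the same dataclasses module and rebuilds an instance from a dictionary by unpacking it into the constructor.

```python
from dataclasses import dataclass, asdict, astuple


@dataclass
class AuthorData:
    """Class for keeping track of an author in the system"""

    first_name: str
    last_name: str
    n_books: int


author_data = AuthorData("Robert", "Downey", 500)

# Convert the instance to a plain tuple of its field values.
as_tuple = astuple(author_data)
print(as_tuple)  # ('Robert', 'Downey', 500)

# Convert to a dict and back; dataclasses generate __eq__, so the
# rebuilt instance compares equal to the original.
rebuilt = AuthorData(**asdict(author_data))
print(rebuilt == author_data)  # True
```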

Generic structures such as this let us manipulate data with the rich set of built-in Python functions available on dict, list, tuple, etc.

author = {"first_name": "Isaac", "last_name": "Asimov", "n_books": 500}
# Access dict values
author.get("first_name")
# Add new field to dict
author["alive"] = False
# Update existing field
author["n_books"] = 703

Generic data structures shield us from the intricacies of individual class methods, so we can focus on the core functionality. Code that manipulates plain dictionaries and lists is less tied to a particular library version and is less affected by changes to class definitions, which helps stability, reuse, and maintenance.

Data defined in a generic form is also flexible: we can modify its shape when needed.

In the following example, the dictionaries do not all have the same keys. The second dictionary contains extra keys.

names = []
names.append({"first_name": "Isaac", "last_name": "Asimov"})
names.append({"first_name": "Jane", "last_name": "Doe", 
              "suffix": "III", "age": 70})

In Python, the performance difference between retrieving the value of a class member and accessing a value associated with a key in a dictionary is minimal. Unlike languages such as Java, Python does not have a compilation step that enables compiler optimizations specifically for class member access. Consequently, the performance characteristics of these operations are generally comparable.

Dictionary access in Python is itself highly efficient because dictionaries are implemented as hash tables. Retrieving a value by key is a fast lookup that is optimized for performance.

Sets and dictionaries in Python offer more efficient lookups than lists and tuples. Hashing allows direct access to the data, so a lookup takes constant time on average. Lists and tuples, by contrast, require a linear search, so a lookup takes time proportional to the number of elements.
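
The difference can be observed with the standard timeit module. This is only a rough sketch: the collection size and the repetition count are arbitrary choices, and the measured times depend on the machine, so no timings are shown.

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: the last element

# Both containers give the same answer; only the lookup cost differs.
assert (target in as_list) == (target in as_set)

list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)
print(f"list lookup: {list_time:.6f} s, set lookup: {set_time:.6f} s")
```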

When data is created as instances of a class, the class definition contains information about the data's structure. This means that the expected data shape can be easily identified by examining the class. The presence of a data schema at the class level facilitates the discovery and understanding of the data's expected structure.

On the other hand, when data is represented using generic data structures such as dictionaries or lists, the data schema is not inherently included as part of the data representation. The structure and expected shape of the data must be inferred or documented separately, making it potentially less apparent or discoverable.

For example, the data shape of FullName is easy to read off its class definition. Let's see the following example.

Example -

class FullName:
    def __init__(self, first_name, last_name, suffix):
        self.first_name = first_name
        self.last_name = last_name
        self.suffix = suffix

The class also enforces the data shape it expects. Suppose we mistype the field that stores the first name as fist_name. We would get an error such as TypeError: __init__() got an unexpected keyword argument 'fist_name' (the exact wording varies slightly between Python versions).

class FullName:
    def __init__(self, first_name, last_name, suffix):
        self.first_name = first_name
        self.last_name = last_name
        self.suffix = suffix

FullName(fist_name="Chris", last_name="Prat", suffix="II")

On the other hand, if we use a generic data structure and mistype the field name, no error or exception may be raised. Instead, the first name is silently omitted from the result.
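
A minimal sketch of that situation follows. The dictionary and the full_name() helper are illustrative names, not part of the earlier examples; the key is deliberately misspelled as fist_name.

```python
def full_name(person: dict) -> str:
    """Join first and last name, reading both keys with dict.get()."""
    return f"{person.get('first_name')} {person.get('last_name')}"


# The first key is misspelled, yet no exception is raised.
person = {"fist_name": "Chris", "last_name": "Prat", "suffix": "II"}
print(full_name(person))
```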

We will get the following output -

None Prat

Principle - 3: Data is Immutable

According to this principle, data should never change; in other words, we avoid mutable data. To apply this principle, we make our data class frozen.

@dataclass(frozen=True)
class StudentData:
    """Class for keeping track of a student in the system"""

    first_name: str
    last_name: str
    roll_nu: int

The immutable built-in data types in Python include int, float, bool, str, tuple and range; decimal.Decimal from the standard library is immutable as well. Note that dict, list and set are mutable.

The benefit of immutable data is that we can pass it around and access it with confidence. Mutable data, in contrast, must be handled with caution when it is passed as an argument to a function, because the function may modify it.

In the following example, an empty list is used as the default argument of a function. Python evaluates a default argument only once, when the function is defined, so every call that omits the argument shares the same list. Because lists are mutable, each call modifies that list, and subsequent calls start from a different default value.

Let's understand the following example -

Example -

def append_to_list(el, list1=[]):
    list1.append(el)
    return list1


print(append_to_list(1))
print(append_to_list(2))
print(append_to_list(3))

Output:

[1]
[1, 2]
[1, 2, 3]

To avoid this, we can use None as the default value and create a new list inside the function.

Example -

def append_to_list(el, list1=None):
    if list1 is None:
        list1 = []
    list1.append(el)
    return list1


print(append_to_list(1))
print(append_to_list(2))
print(append_to_list(3))

Output:

[1]
[2]
[3]

We get the expected result because the default value None is immutable and a new list is created on every call. Immutable data can be passed to any function without hesitation because it never changes.

It helps to get predictable code behaviour. Let's understand the following example.

Example -

from datetime import date
dict1 = {"age": 30}
if date.today().day % 2 == 0:
    dict1["age"] = 40

The age value of dict1 is not predictable. It depends on whether we run the code on an even or odd day.

However, with immutable data, it is guaranteed that data never changes. Let's see the following example.

Example -

student_data = StudentData("Rodric", "Asma", 500)
if date.today().day % 2 == 0:
    student_data.roll_nu = 100

If the above code runs on an even day, the assignment raises dataclasses.FrozenInstanceError: cannot assign to field 'roll_nu'. The frozen data class won't allow student_data.roll_nu to change, so its value is the same whether it's an even or odd day.
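
When a changed value is genuinely needed, a common pattern is to build a new instance instead of mutating the existing one. The sketch below uses dataclasses.replace() from the standard library for this and repeats the StudentData definition so that it runs on its own; the new roll number 100 is an arbitrary value.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class StudentData:
    """Class for keeping track of a student in the system"""

    first_name: str
    last_name: str
    roll_nu: int


student_data = StudentData("Rodric", "Asma", 500)

# Direct assignment is rejected on a frozen dataclass.
try:
    student_data.roll_nu = 100
except FrozenInstanceError as error:
    print(f"Rejected: {error}")

# replace() returns a new instance; the original is left untouched.
updated = replace(student_data, roll_nu=100)
print(student_data.roll_nu, updated.roll_nu)  # 500 100
```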

Another benefit is fast equality checks. Python provides two operators that are used to determine if two objects are equal: "is" and "==". The "is" operator compares the identity of the objects by checking if they reside at the same memory address, while the "==" operator compares the equality of their values by examining the actual content stored within the objects.

Example -

# String is immutable
x = "javatpoint"
z = "javatpoint"
# Note that the identity of `x` and `z` is the same in CPython,
# which reuses the object created for identical string literals
print(id(x))
# 139669244330992
print(id(z))
# 139669244330992
print(x == z)
# True
print(x is z)
# True

# List is mutable
y = [1, 2, 3]
w = [1, 2, 3]

# Note that the identity of `y` and `w` is different
print(id(y))
# 140110790605632
print(id(w))
# a different number

print(y == w)
# True
print(y is w)
# False

As demonstrated in the example above, the "is" and "==" operators give the same result here for an immutable data type like a string, whereas they differ for a mutable data type like a list. Note that the shared identity of equal strings is an implementation detail of CPython rather than a language guarantee.

The "is" operator checks the objects' memory addresses, which is a reliable way to determine whether two names refer to the same object in memory. The "==" operator, on the other hand, examines the actual content stored within the objects to assess their equality.

In terms of performance, the "is" operator is generally faster than the "==" operator, because comparing object addresses is cheaper than comparing all the individual fields within the objects. Because immutable data never changes, a check that two references point to the same object is enough to know that their content is the same, so the content comparison can be skipped.

In a multi-threaded environment, when data is mutable, it can lead to potential race condition failures. A race condition occurs when two or more threads attempt to access and modify the same data concurrently, resulting in unpredictable outcomes.

For instance, let's consider a scenario where two threads are simultaneously trying to access and modify the value of a variable called "x" by adding or subtracting 10 to/from it. In such a situation, due to the non-deterministic interleaving of thread execution, race conditions can arise. The threads might read the value of "x" at the same time, perform their respective operations, and then overwrite each other's changes, leading to incorrect results or unexpected behavior.
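
The scenario can be sketched with the standard threading module. In this sketch the shared value is kept in a dictionary named state (a name chosen for illustration) and is protected by a Lock so that each read-modify-write step happens as one unit; the iteration count is arbitrary. Without the lock, the final value would not be guaranteed.

```python
import threading


def change_x(state: dict, lock: threading.Lock, delta: int, times: int) -> None:
    """Add delta to state["x"] repeatedly, one locked step at a time."""
    for _ in range(times):
        with lock:  # only one thread may update the value at a time
            state["x"] += delta


state = {"x": 0}
lock = threading.Lock()

adder = threading.Thread(target=change_x, args=(state, lock, 10, 100_000))
subtractor = threading.Thread(target=change_x, args=(state, lock, -10, 100_000))
adder.start()
subtractor.start()
adder.join()
subtractor.join()

print(state["x"])  # 0, because the additions and subtractions cancel out
```

With immutable data this coordination is not needed for reads, since no thread can change the value another thread is looking at.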

Principle - 4: Separate Data Schema from Data Representation

According to this principle, the expected shape of the data (the schema) is defined separately from the data itself. Below is a simple schema in the style of JSON Schema, which is essentially a dictionary, that specifies the structure of the data represented as another dictionary. The schema outlines the required fields and their respective data types, while the data is represented using a generic data structure.

Example -

schema = {
    "required": ["first_name", "last_name"],
    "properties": {
        "first_name": {"type": str},
        "last_name": {"type": str},
        "books": {"type": int},
    }
}

data = {
    "valid": {
        "first_name": "Isaac",
        "last_name": "Asimov",
        "books": 500
    },
    "invalid1": {
        "fist_name": "Isaac",
        "last_name": "Asimov",
    },
    "invalid2": {
        "first_name": "Isaac",
        "last_name": "Asimov",
        "books": "five hundred"
    }
}

Data validation functions (or libraries) can be used to check whether a piece of data conforms to a data schema.

Example -

def validate(data):
    assert set(schema["required"]).issubset(set(data.keys())), \
        f"Data must have following fields: {schema['required']}"

    for k in data:
        if k in schema["properties"].keys():
            assert type(data[k]) == schema["properties"][k]["type"], \
                f"Field {k} must be of type {str(schema['properties'][k]['type'])}"

The validate() function raises an AssertionError with details when the data is invalid; otherwise it returns without error.

validate(data["valid"])
# No error
validate(data["invalid1"])
# AssertionError: Data must have following fields: ['first_name', 'last_name']
validate(data["invalid2"])
# AssertionError: Field books must be of type <class 'int'>

One benefit of separating the schema from the data representation is that fields can be optional. In Python, however, a class member can also be optional, so this benefit is not as strong. For example, we can set the default argument of roll_number to None to indicate that the field is optional.

Example -

from typing import Optional


class Student:
    def __init__(self, first_name: str, last_name: str,
                 roll_number: Optional[int] = None):
        self.first_name = first_name
        self.last_name = last_name
        self.roll_number = roll_number

    @property
    def fullname(self):
        return f"{self.first_name} {self.last_name}"

    @property
    def is_prolific(self):
        # Returns None when roll_number is missing
        if self.roll_number:
            return self.roll_number > 100


student = Student("Isaac", "Asimov")

This principle enables data validation at runtime. It also allows validation conditions that go beyond the type of a field.

Compared with the earlier schema, the following schema defines more properties for each field.

schema = {
    "required": ["first_name", "last_name"],
    "properties": {
        "first_name": {
            "type": str,
            "max_length": 100,
        },
        "last_name": {
            "type": str,
            "max_length": 100
        },
        "books": {
            "type": int,
            "min": 0,
            "max": 10000,
        },
    }
}
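
A validation function for such a schema could look like the sketch below. The function name validate_extended() and the way the max_length, min, and max keys are interpreted are assumptions made for this illustration; the schema is repeated so that the snippet runs on its own.

```python
schema = {
    "required": ["first_name", "last_name"],
    "properties": {
        "first_name": {"type": str, "max_length": 100},
        "last_name": {"type": str, "max_length": 100},
        "books": {"type": int, "min": 0, "max": 10000},
    },
}


def validate_extended(data: dict, schema: dict) -> list:
    """Return a list of error messages; an empty list means valid data."""
    errors = []
    for field in schema["required"]:
        if field not in data:
            errors.append(f"Missing required field: {field}")
    for field, value in data.items():
        rules = schema["properties"].get(field)
        if rules is None:
            continue  # fields not described by the schema are ignored
        if type(value) is not rules["type"]:
            errors.append(f"Field {field} must be of type {rules['type'].__name__}")
            continue  # skip range checks when the type is already wrong
        if "max_length" in rules and len(value) > rules["max_length"]:
            errors.append(f"Field {field} is longer than {rules['max_length']}")
        if "min" in rules and value < rules["min"]:
            errors.append(f"Field {field} is below {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"Field {field} is above {rules['max']}")
    return errors


valid = {"first_name": "Isaac", "last_name": "Asimov", "books": 500}
negative = {"first_name": "Isaac", "last_name": "Asimov", "books": -1}
print(validate_extended(valid, schema))     # []
print(validate_extended(negative, schema))  # ['Field books is below 0']
```

Returning a list of messages instead of raising on the first failure is a design choice of this sketch; it lets the caller see every problem in one pass.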

By adopting the principles and techniques of Data-Oriented Programming (DOP), Python developers can enhance their ability to produce code that is easier to maintain and scale, thereby unleashing the complete potential of their data.
