Do Data Engineers Need to Know Object Oriented Programming?

I wanted to write this post about whether Data Engineers need to know about Object Orientated Programming for two reasons:

Object Orientated Programming (OOP) is a topic of interest to me.
I like to think and learn about how OOP can benefit, or has benefitted, data engineers.

So here’s the thing. As a Data Engineer, you can write code using either Functional Programming (FP) techniques or Object Orientated Programming (OOP) techniques. Functional Programming, as it says on the tin, relies on functions, whereas Object Orientated Programming relies on classes and objects.

Functions take an input – say, in this case, a number. They then do something with it. Let’s say they take that number and multiply it by 20%, which happens to be the UK VAT rate. They then return an output. This is the “answer” or outcome of the function. You can say that functions “are designed to perform a specific task.”

If you were to write a simple function like this, you would be guaranteed that it always returns the same result for the same input. Later on, if that task needed altering, say the UK decided to reduce VAT on all applicable purchases, the function could be changed in one place and the change would be reflected everywhere the function is referenced in the code.
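
As a minimal sketch of that idea (the names VAT_RATE and add_vat are my own, purely for illustration), the rate lives in one place and every caller picks up a change to it:

```python
# Hypothetical names for illustration; the rate is defined in exactly one place.
VAT_RATE = 0.20  # the 20% UK VAT rate mentioned above


def add_vat(price, rate=VAT_RATE):
    """Return the price with VAT added, rounded to two decimal places."""
    return round(price * (1 + rate), 2)


# The same input always gives the same output.
print(add_vat(50.0))
print(add_vat(50.0))
```

If the rate ever changed, editing VAT_RATE would be the only change needed, assuming every caller goes through add_vat.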

Object Orientated Programming involves the author writing classes, which can be used as “blueprints” for objects in the code. Each class can have properties (also known as attributes or variables) as well as methods, which are functions that belong to the objects. An example of a class may be a product. The product can have properties such as size, type, colour and price.

Let’s just demonstrate with some simple examples of the above:

Functional Programming – Creating a Function in Python

def process_product(size, color, product_type, price):
    # Calculate the price with 20% added
    price_plus_20_percent = price + (price * 0.20)
    
    # Create a dictionary to store the inputs and the calculated value
    product_info = {
        "size": size,
        "color": color,
        "type": product_type,
        "price": price,
        "price_plus_20_percent": price_plus_20_percent
    }
    
    return product_info

# Example usage:
size = "Medium"
color = "Blue"
product_type = "Shirt"
price = 50.0

product_data = process_product(size, color, product_type, price)
print("Product Information:")
for key, value in product_data.items():
    print(f"{key}: {value}")

The above script returns the following:

Product Information:
size: Medium
color: Blue
type: Shirt
price: 50.0
price_plus_20_percent: 60.0

Object Orientated Programming – Creating a Class and Method in Python

class Product:
    def __init__(self, size, color, gender, product_type, price):
        self.size = size
        self.color = color
        self.gender = gender
        self.product_type = product_type
        self.price = price

    def add_20_percent_to_price(self):
        self.price += self.price * 0.20

    def __str__(self):
        return f"Size: {self.size}, Color: {self.color}, Gender: {self.gender}, Type: {self.product_type}, Price: {self.price}"

# Example usage:
product = Product("Medium", "Blue", "Mens", "Shirt", 50.0)
print("Original Product:")
print(product)

# Add 20% to the price
product.add_20_percent_to_price()
print("\nProduct After Adding 20% to the Price:")
print(product)

The example demonstrated returns the following:

Original Product:
Size: Medium, Color: Blue, Gender: Mens, Type: Shirt, Price: 50.0

Product After Adding 20% to the Price:
Size: Medium, Color: Blue, Gender: Mens, Type: Shirt, Price: 60.0

The advantage of the OOP approach here is that the data and the behaviour that acts on it live together: I can modify a property of an object after it has been created and call its methods again. The function, on the other hand, just gives me an output. Strictly speaking I could still edit the product_data dictionary and change the size, but a derived value such as price_plus_20_percent would not update unless I called the function again. With the object I can simply do something like:

product.size = "Large"
print(product)

Returns:

Size: Large, Color: Blue, Gender: Mens, Type: Shirt, Price: 60.0

So where am I going with this? How does this affect a Data Engineer in their work?

Data Engineers source, clean, transform and make data available for business stakeholders to consume. I can see use cases for OOP, but FP can be simpler to implement.

I can think of many an occasion when I have used FP and not even considered OOP.

Some examples:

Open a CSV file for importing – FP
Connect to a DB – FP
Transform some data – FP
Retrieve some data from a DB – FP
Add some data to a DB – FP
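
To give a flavour of the first and third items, here is a minimal sketch using only the standard library. The function names, the column names and the in-memory sample data are all my own assumptions for illustration:

```python
import csv
import io


def read_rows(csv_file):
    """Read a CSV file-like object into a list of dictionaries."""
    return list(csv.DictReader(csv_file))


def add_price_with_vat(row, rate=0.20):
    """Return a new row with a VAT-inclusive price; the input row is left untouched."""
    price = float(row["price"])
    return {**row, "price_with_vat": round(price * (1 + rate), 2)}


# An in-memory CSV stands in for a real file so the sketch runs anywhere.
sample = io.StringIO("product,price\nShirt,50.0\nJeans,80.0\n")
rows = [add_price_with_vat(row) for row in read_rows(sample)]
print(rows)
```

Each function does one job and returns a new value, which is why tasks like these rarely need a class.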

However, in other situations OOP may be the preferred choice.

I once wrote a number of pipelines for an organisation that operated in the classified advertising space. In that instance, I had to retrieve the adverts that were new or had been modified since the pipeline last ran. This was the essence of the extract. For the transformation part, each advert needed to be converted into a JSON format for the consumer, which happened to be an API. The JSON itself was nested in places.

A simple example of what I mean would be something like this:

{
"unique_reference" : 123,
"date_added" : "2023-07-01",
"price" : 39.99,
"advert_type" : "car",
"properties" : { "doors" : 4, "colour" : "blue", "gearbox" : "manual"}
}

My ETL pipeline looked like this:

Retrieve data from MySQL database (EXTRACT)
Convert data into JSON (TRANSFORM)
Load data to destination API (LOAD)
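
The three steps can be sketched roughly as follows. This is not the original pipeline: an in-memory SQLite database stands in for MySQL, a plain list stands in for the destination API, and the table name and the date_modified column are assumptions of mine.

```python
import json
import sqlite3


def extract(conn, since):
    """EXTRACT: fetch adverts modified after the given date."""
    cursor = conn.execute(
        "SELECT unique_reference, date_added, price, advert_type "
        "FROM adverts WHERE date_modified > ?",
        (since,),
    )
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


def transform(rows):
    """TRANSFORM: convert each row into a JSON string."""
    return [json.dumps(row) for row in rows]


def load(payloads, sink):
    """LOAD: hand each payload to the destination (a list stands in for the API)."""
    sink.extend(payloads)


# In-memory SQLite stands in for the MySQL source so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adverts (unique_reference INTEGER, date_added TEXT, "
    "date_modified TEXT, price REAL, advert_type TEXT)"
)
conn.executemany(
    "INSERT INTO adverts VALUES (?, ?, ?, ?, ?)",
    [
        (123, "2023-07-01", "2023-07-01", 39.99, "car"),
        (124, "2023-06-01", "2023-06-01", 120.0, "furniture"),
    ],
)

sent = []
load(transform(extract(conn, since="2023-06-15")), sent)
print(sent)
```

Only the advert modified after the cut-off date is picked up, which mirrors the "since the last run" behaviour described above.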

What I was looking to do was avoid repeating any code across the different advert types. Some properties were shared, such as the unique reference, the date/time the advert was added and the price it was advertised for. Where the advert types varied was in their more specific properties.

A piece of furniture doesn’t have doors or a gearbox type, for example, but it is being sold and will therefore have a price, a date/time and a unique reference. I wrote classes for the shared properties and bespoke classes for the different advert types, which inherited from the shared class and picked up its properties.

In the example above, the advert_type determined which class was used. That class output the appropriate specific properties, which were then combined with the shared properties.
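
A minimal sketch of that design might look like the following. It is a reconstruction for illustration rather than the original code: the class names, the build_advert helper and the furniture "material" property are all my own assumptions.

```python
import json


class Advert:
    """Shared properties that every advert has (hypothetical base class)."""

    advert_type = "generic"

    def __init__(self, unique_reference, date_added, price):
        self.unique_reference = unique_reference
        self.date_added = date_added
        self.price = price

    def specific_properties(self):
        """Subclasses override this to return their type-specific properties."""
        return {}

    def to_dict(self):
        """Combine the shared properties with the type-specific ones."""
        return {
            "unique_reference": self.unique_reference,
            "date_added": self.date_added,
            "price": self.price,
            "advert_type": self.advert_type,
            "properties": self.specific_properties(),
        }

    def to_json(self):
        return json.dumps(self.to_dict())


class CarAdvert(Advert):
    advert_type = "car"

    def __init__(self, unique_reference, date_added, price, doors, colour, gearbox):
        super().__init__(unique_reference, date_added, price)
        self.doors = doors
        self.colour = colour
        self.gearbox = gearbox

    def specific_properties(self):
        return {"doors": self.doors, "colour": self.colour, "gearbox": self.gearbox}


class FurnitureAdvert(Advert):
    advert_type = "furniture"

    def __init__(self, unique_reference, date_added, price, material):
        super().__init__(unique_reference, date_added, price)
        self.material = material  # assumed property, for illustration only

    def specific_properties(self):
        return {"material": self.material}


# Map each advert_type value to the class that knows how to handle it.
ADVERT_CLASSES = {"car": CarAdvert, "furniture": FurnitureAdvert}


def build_advert(row):
    """Pick the class for the row's advert_type and build the object."""
    advert_class = ADVERT_CLASSES[row["advert_type"]]
    fields = {key: value for key, value in row.items() if key != "advert_type"}
    return advert_class(**fields)


car_row = {
    "unique_reference": 123, "date_added": "2023-07-01", "price": 39.99,
    "advert_type": "car", "doors": 4, "colour": "blue", "gearbox": "manual",
}
print(build_advert(car_row).to_json())
```

Adding a new advert type then means writing one small subclass and registering it in ADVERT_CLASSES, without touching the shared code.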

It worked perfectly.

Every type of advert had a place in the project, making it easy to read and maintain, and I found myself not duplicating anything. OK, I could maybe have done it via functional programming, but I don’t think it would have been as neat or as easy to maintain.