Why Data Engineers, Data Analysts and Data Scientists Should Learn to use Pandas for Python

In this post, I talk about Pandas. Not the furry bears from China that munch on bamboo (sorry, couldn’t resist a bad joke), but the fantastic Python library that a Data Engineer, Data Analyst or Data Scientist can do SO MUCH with.

What is Pandas for Python?

The official documentation describes it like this:

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

Yep, that’s it in a nutshell, but my word is it powerful.

How to Install the Pandas Library

Simple:

pip3 install pandas

How to Use the Pandas Library

The library is vast, so I’ll keep this relatively high level. You can get started like this:

import pandas as pd

Then you can reference the library through the pd alias:

df = pd.DataFrame(data=[1, 2, 3, 4, 5], columns=["a"])
print(df.head(5))

This prints a simple DataFrame with a single column, labelled “a”, holding the values passed in the list. Other than teaching some syntax, that isn’t of great use, so let’s talk about some more real-world uses.

Use Cases for Pandas in Python

What makes this library so powerful is that you can load data from many different sources into it. The data can then be cleaned as required and either analysed or forwarded to another destination. What’s also awesome is that data from different sources can be blended together.
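To give a feel for the cleaning step, here is a minimal sketch. The column names and values are made up for illustration, and the three fixes shown (dropping rows with a missing value, removing duplicates, converting text to numbers) are just common examples rather than a complete recipe.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing customer,
# and amounts that arrived as text instead of numbers.
raw = pd.DataFrame(
    {
        "customer": ["alice", "bob", "bob", None],
        "amount": ["10.5", "20", "20", "7"],
    }
)

cleaned = (
    raw.dropna(subset=["customer"])  # drop rows with no customer
    .drop_duplicates()  # remove exact duplicate rows
    .assign(amount=lambda d: pd.to_numeric(d["amount"]))  # text -> numbers
    .reset_index(drop=True)  # tidy up the row labels
)

print(cleaned)
```

Chaining the steps like this keeps the original raw frame untouched, which is handy when you want to compare before and after.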

The blend could be likened to a JOIN in SQL, where two or more tables are combined into one data set. It is also possible to perform a UNION-like operation, whereby data sets that share the same schema are stitched together into a single set of rows.
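As a rough sketch of both ideas, the snippet below uses merge for the JOIN-like blend and pd.concat for the UNION-like one. The tables and column names are invented for the example.

```python
import pandas as pd

# Hypothetical tables that share a customer_id key.
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Ann", "Ben", "Cat"]}
)
orders = pd.DataFrame(
    {"customer_id": [1, 1, 3], "total": [25.0, 40.0, 15.0]}
)

# JOIN-like: keep only customers that have at least one order.
joined = customers.merge(orders, on="customer_id", how="inner")
print(joined)

# UNION-like: two frames with the same columns stacked into one.
jan = pd.DataFrame({"customer_id": [1], "total": [25.0]})
feb = pd.DataFrame({"customer_id": [3], "total": [15.0]})
stacked = pd.concat([jan, feb], ignore_index=True)
print(stacked)
```

Note that pd.concat keeps duplicate rows, so it behaves more like UNION ALL. Calling drop_duplicates() on the result gets you closer to a plain UNION.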

As a data engineer, I find myself using these functions a lot in my pipelines.

For example, you could load a CSV file and merge it with a table from a relational database. You can merge data from different relational database systems, such as SQL Server and MySQL. You can load NoSQL data in too. Want to analyse data from MongoDB? No problem. Want to analyse or merge a spreadsheet with another source? No problem – Microsoft Excel is a popular format.
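Here is a small sketch of the CSV plus database case. To keep it runnable anywhere, I’m assuming an in-memory SQLite database (from Python’s standard library) in place of SQL Server or MySQL, and an in-memory string in place of a CSV file on disk. The table, columns and values are all hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical sales data).
csv_text = "product_id,units\n1,3\n2,5\n"
sales = pd.read_csv(io.StringIO(csv_text))

# Stand-in for a relational database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Widget"), (2, "Gadget")],
)
products = pd.read_sql_query("SELECT product_id, name FROM products", conn)
conn.close()

# Blend the two sources on their shared key.
report = sales.merge(products, on="product_id", how="left")
print(report)
```

Once both sources are DataFrames, the merge doesn’t care where the data came from, which is really the point. For other database systems you would pass a connection suited to that system instead of the SQLite one.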

This is just a taster, but in later posts I’ll provide some real examples.
