In this post, I talk about Pandas: not the furry, bamboo-munching bears from China that we all know and love (sorry, couldn’t resist a crappy joke), but the fantastic Python library that a Data Engineer, Data Analyst or Data Scientist can do SO MUCH with.
What is Pandas for Python?
The official documentation describes it like this:
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Yep, that’s it in a nutshell but my word is it powerful.
How to Install the Pandas Library
Simple:
pip3 install pandas
How to Use the Pandas Library
Keeping this relatively high level as it is so vast, you can get started like this:
import pandas as pd
Then you can reference the library through the alias pd:
df = pd.DataFrame(data=[1, 2, 3, 4, 5], columns=["a"])
print(df.head(5))
This prints the first five rows of a simple DataFrame: one column, labelled “a”, holding the values passed in the list. Other than teaching some syntax, that isn’t of great use, so let’s talk about some more real world uses.
Use Cases for Pandas in Python
What makes this library so powerful is that you can load data from different sources into it. The data can then be cleaned as required and either analysed or forwarded to another destination. What’s also awesome is that data from different sources can be blended together.
The blend could be likened to a JOIN in SQL, where two or more tables are combined into one data set. It is also possible to perform a UNION-like operation, whereby data sets that share the same schema are stitched together to form a single set of rows.
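To make the JOIN and UNION comparison concrete, here is a minimal sketch using merge and pd.concat. The tables, column names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
customers = pd.DataFrame(
    {"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]}
)
orders = pd.DataFrame(
    {"customer_id": [1, 1, 3], "amount": [10.0, 25.5, 7.0]}
)

# JOIN-like: match rows from both frames on the shared key column
joined = customers.merge(orders, on="customer_id", how="inner")

# UNION-like: stack two frames that share the same columns
january = pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 5.0]})
february = pd.DataFrame({"customer_id": [3], "amount": [7.0]})
stacked = pd.concat([january, february], ignore_index=True)

print(joined)
print(stacked)
```

Note that pd.concat keeps duplicate rows, so it behaves more like UNION ALL. Chaining .drop_duplicates() onto the result gets you closer to a plain UNION.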
As a data engineer, I find myself using these functions a lot in my pipelines.
For example, you could load a CSV file and merge it with a table from a relational database. You can merge data from different relational database systems, SQL Server and MySQL for example. You can load NoSQL data in too. Want to analyse data from MongoDB? No problem. Want to analyse a spreadsheet or merge it with another source? No problem: Microsoft Excel is a popular format that Pandas can read.
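As a rough sketch of the CSV plus database case, the snippet below reads some CSV data and merges it with a table queried from a database. To keep it self-contained, an in-memory string stands in for the CSV file and an in-memory SQLite database stands in for the relational database. The table, columns and values are all hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical data)
csv_text = "product_id,units_sold\n1,5\n2,3\n"
sales = pd.read_csv(io.StringIO(csv_text))

# In-memory SQLite database standing in for a real relational database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, product_name TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Widget"), (2, "Gadget")],
)
conn.commit()

# Load the table into a DataFrame with a plain SQL query
products = pd.read_sql_query(
    "SELECT product_id, product_name FROM products", conn
)
conn.close()

# Blend the two sources on the shared key column
report = sales.merge(products, on="product_id", how="left")
print(report)
```

With a real file you would pass its path to pd.read_csv, and for SQL Server or MySQL you would swap the SQLite connection for one to that database, which typically needs an extra driver package installed.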
This is just a taster, but in later posts I’ll provide some real examples.