What is a Data Engineer?

In today’s world, a Data Engineer will typically build “data pipelines”. I say “typically” because this varies from organisation to organisation and may include additional responsibilities such as designing data warehouse schemas and building reports.

At a minimum, a data pipeline will move data from source (Extract) to destination (Load). There may also be some requirement to turn (Transform) that data into something useful that the business can glean greater insight and value from.
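
To make the three stages concrete, here is a minimal sketch in Python using only the standard library. The source data, the table name and the pence-to-pounds transformation are all made up for illustration; a real pipeline would read from a file, database or API and write to a proper destination.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file, database or API.
SOURCE_CSV = "order_id,amount_pence\n1,1250\n2,399\n"


def extract(text):
    """Extract: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and turn pence into pounds for reporting."""
    return [(int(r["order_id"]), int(r["amount_pence"]) / 100) for r in rows]


def load(conn, rows):
    """Load: write the transformed rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_gbp REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()


def run_pipeline(conn, text=SOURCE_CSV):
    """Run the stages in order: extract, then transform, then load."""
    load(conn, transform(extract(text)))


conn = sqlite3.connect(":memory:")  # in-memory database stands in for the destination
run_pipeline(conn)
print(conn.execute("SELECT order_id, amount_gbp FROM orders ORDER BY order_id").fetchall())
```

Keeping each stage in its own function is just one way to structure it, but it makes each step easy to test and swap out on its own.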

I’ve been working with databases since 1997 and since then, I have built pipelines in various forms.

These have included simple extract-to-CSV, FTP and load scripts written on the command line; SQL stored procedures which extract, transform and load from one database to another via linked servers; SSIS packages that pull in files and database data together; and scripts written in Python that make use of external APIs.

In my view, the Data Engineer of today is essentially the latest iteration of positions that have done similar tasks in the past, now using the latest methods to process data and turn it into something useful that helps the business make money.

These positions might have also been known as Data Developers, Data Warehouse Developers, BI Developers and other similar titles.

What skills does a Data Engineer need?

I believe that the following core skills and attributes are required if you want to become a data engineer. Beyond these, you will need more specific skills associated with the systems you work with, e.g. Azure, GCP, AWS, Hadoop, Kafka etc.

  • Good communication and collaboration skills
  • SQL
  • Relational database design
  • Knowledge of JSON and XML (especially JSON)
  • Data warehousing
  • Data Lakes
  • An understanding of ETL and ELT
  • Regular Expressions
  • Knowledge of a programming language such as Python or Java
  • Linux or Windows administration and scripting, in particular the terminal/command window
  • Patience

Communication and collaboration

I placed communication and collaboration at the top of this list deliberately. If you cannot communicate well, then it will be harder for you to succeed. Depending on the size of the organisation you work for, you will likely be required to meet with stakeholders to understand their business function and ascertain requirements. It is important that you are a good listener and are equally good at asking questions.

Establishing good relationships with stakeholders through effective communication is a key ingredient in your success as a data engineer. Communication is a skill, and if you feel that it isn’t one of your strong points then do what you can to become a better communicator. There are a ton of resources out there, some free and some paid, that can help you. Type “How to be a better communicator” into both Google and YouTube to get you started.

SQL

If a Data Engineer’s skills could be added to a Swiss Army Knife, then SQL would be the blade you use most frequently.

A lot of data lives in databases and requires extracting. If you’re not extracting it from a database and instead pulling it via an API for example, it is likely you will be interacting with a database at the destination.

You have to know SQL and know it well. A simple understanding of CRUD commands won’t cut it in most cases. You will need to know aggregations and GROUP BY, window functions, CTEs, CASE and functions that enable you to transform data.
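
As a small taste of what that looks like, the sketch below runs one query that combines a CTE, GROUP BY, a window function and CASE. I have used SQLite through Python’s built-in sqlite3 module so that it runs anywhere; the sales table and its values are invented for the example, and window functions need SQLite 3.25 or newer, which I am assuming your Python ships with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, sale_date TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('North', '2024-01-01', 100.0),
        ('North', '2024-01-02', 300.0),
        ('South', '2024-01-01', 50.0),
        ('South', '2024-01-02', 75.0);
""")

QUERY = """
WITH region_totals AS (               -- CTE: a named subquery
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region                   -- aggregation per region
)
SELECT
    region,
    total,
    RANK() OVER (ORDER BY total DESC) AS sales_rank,        -- window function
    CASE WHEN total >= 200 THEN 'high' ELSE 'low' END AS band
FROM region_totals
ORDER BY sales_rank;
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The same query should work with little or no change on most other databases, although function names and syntax details do vary between them.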

Learn SQL, learn it well and keep learning it!

Some resources to get you started

Codecademy – https://www.codecademy.com/learn/learn-sql

W3schools – https://www.w3schools.com/sql/

Relational Database Design

An understanding of the concepts that RDBMS are built around is important to help you discover and ultimately extract data from a source.

Learn data modelling, normalization, primary and foreign keys. Practice modelling and building your own relational databases in a development environment of some kind.
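
One easy way to practice is with SQLite, which needs no server. The sketch below assumes a made-up two-table model (customer and customer_order) and shows a primary key, a foreign key and what happens when you try to insert an order for a customer that does not exist. Note that SQLite only enforces foreign keys once you switch them on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer (customer_id),
        total       REAL NOT NULL
    );
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, 25.0)")  # valid: customer 1 exists

try:
    # Customer 99 does not exist, so the foreign key should reject this row.
    conn.execute("INSERT INTO customer_order VALUES (11, 99, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print("Orphan order rejected:", orphan_rejected)
```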

JSON, XML and CSV

A lot of data is now available via APIs. A call to an API will likely return a response in either JSON (most common) or XML. Your code will then need to transform that response and load it somewhere.

Learn how these formats differ and practice writing your own documents in each. Paste them into online validators to check whether you have written them correctly.
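
Here is a rough sketch of the “transform that response” step for JSON, using Python’s built-in json module. The response body and its field names are hypothetical; a real API will have its own structure, so treat this as a pattern rather than a recipe.

```python
import json

# Hypothetical API response body; the field names are made up for illustration.
RESPONSE_BODY = """
{
  "results": [
    {"id": 1, "name": "Ada", "address": {"city": "London", "postcode": "N1 9GU"}},
    {"id": 2, "name": "Grace", "address": {"city": "Leeds"}}
  ]
}
"""


def flatten_users(body):
    """Turn the nested JSON response into flat rows ready to load into a table."""
    payload = json.loads(body)
    rows = []
    for user in payload["results"]:
        address = user.get("address", {})
        rows.append({
            "id": user["id"],
            "name": user["name"],
            "city": address.get("city"),
            "postcode": address.get("postcode"),  # None when the key is absent
        })
    return rows


rows = flatten_users(RESPONSE_BODY)
for row in rows:
    print(row)
```

Using .get() for optional fields means a missing key becomes None rather than crashing the pipeline, which is usually what you want when loading into a nullable column.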

CSV files: you’ll likely process these at some stage in your pipelines. Practice creating, viewing and updating CSV files, and learn about delimiters and how to escape delimiting characters that would otherwise cause the programs reading the file to see more columns than they should.
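
The delimiter problem is easy to see for yourself. In this small sketch (the sample row is invented) Python’s csv module quotes a value that contains a comma, and a naive split on commas then sees more columns than a proper CSV reader does.

```python
import csv
import io

rows = [
    ["id", "name", "comment"],
    [1, "Smith, John", 'Said "hello"'],  # contains the delimiter and a quote character
]

# Write the rows; the csv module quotes fields that need it and doubles embedded quotes.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
text = buffer.getvalue()
print(text)

data_line = text.splitlines()[1]

# Naive approach: splitting on every comma breaks the quoted name into two pieces.
naive = data_line.split(",")

# Proper approach: the csv reader understands the quoting rules.
parsed = list(csv.reader(io.StringIO(text)))[1]

print(len(naive), "columns from a naive split")
print(len(parsed), "columns from csv.reader:", parsed)
```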

Data Warehousing

A Data Engineer helps a business turn its data into insight that can generate revenue or reduce operational costs. It is highly likely that you will be expected to load data into a data warehouse, and it is possible that you will be expected to design and build the schema for it.

Learn about data modelling, facts and dimensions, the different variations of slowly changing dimensions (SCD), star and snowflake schemas, Inmon v Kimball, denormalisation and data marts and when you might use these instead of or to complement a data warehouse.

Data Lakes

There are differences between data warehouses and data lakes. A warehouse will contain processed data, whereas a lake will contain data in its raw form. Learn when you might use one over the other as part of a data strategy.

EL, ETL and ELT

“Extract and Load”, “Extract, Transform, Load” and more recently “Extract, Load and Transform”. These are the high-level stages involved in processing data from source to insight. You will be expected to know the differences and be skilled at each stage.
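
The difference with ELT is where the transform happens: the raw data is loaded first and then transformed inside the destination, usually with SQL. Below is a minimal sketch of that idea using SQLite as a stand-in for a warehouse; the table names, columns and values are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the destination warehouse

# Extract + Load: land the source rows unchanged (all text) in a raw table.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount_pence TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "1250", "PAID"), ("2", "399", "cancelled"), ("3", "5000", "paid")],
)

# Transform: done afterwards, inside the destination, using SQL.
conn.execute("""
    CREATE TABLE paid_orders AS
    SELECT CAST(order_id AS INTEGER)             AS order_id,
           CAST(amount_pence AS INTEGER) / 100.0 AS amount_gbp
    FROM raw_orders
    WHERE LOWER(status) = 'paid'
""")

paid = conn.execute(
    "SELECT order_id, amount_gbp FROM paid_orders ORDER BY order_id"
).fetchall()
print(paid)
```

Because the raw table is kept, the transform can be changed and re-run later without going back to the source, which is one of the main attractions of ELT.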

Regular Expressions

This is the skill of writing patterns that extract matching values from text data. Regular expressions are not easy initially but, like anything, practice makes you better, and they are a very powerful tool.
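
As a small example, the sketch below uses Python’s built-in re module to pull a date, a log level and an order id out of some log lines. The log format and the order id style (a letter, a dash, then digits) are assumptions made up for the example.

```python
import re

# Hypothetical log lines; the format is invented for illustration.
LOG_LINES = [
    "2024-03-01 ERROR order=A-1042 payment declined",
    "2024-03-01 INFO  order=B-7 shipped",
    "malformed line with no order id",
]

# Named groups: a date at the start, a log level, then an order id such as A-1042.
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>[A-Z]+)\s+order=(?P<order_id>[A-Z]-\d+)"
)


def parse_line(line):
    """Return the captured values as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


parsed = [parse_line(line) for line in LOG_LINES]
for item in parsed:
    print(item)
```

Returning None for lines that do not match lets the calling code decide whether to skip, log or reject bad rows, which comes up a lot with messy data.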

Knowledge of a programming language

I referred to Python as an option – this is very popular and widely used in data engineering. I personally love it and use it all the time. It’s easy to learn and very versatile. There are lots of resources and free tools available that can help you learn this quickly, including this one 🙂

Linux or Windows

You’ll need some knowledge of how to do things within either of these operating systems, in particular the command line/terminal and scripting.

Patience

You need to be patient. It’s not an easy role. It can be frustrating. You can spend lots of time trying to make sense of bad data, encountering issue after issue that requires some tweak in your code to handle it. You will be rewarded for your patience, however, and those who depend on your skills to deliver the data, such as analytics engineers, data analysts and data scientists, will be extremely grateful for your efforts as it will make their jobs a whole lot easier.

How to Become a Data Engineer?

Definitely invest your time in learning and polishing up on those skills I mentioned but what else could you do?

Courses and certificates

There is lots of training material out there. All of the cloud providers provide introductory courses and more advanced material including certification. Just go out there and search for AWS courses, Google cloud courses, Azure courses, data engineer courses and there will be plenty of material coming up.

Create an online portfolio

An important one this – for prospective employers checking you out, get your name online and open a GitHub repository to showcase projects you have worked on.

What projects can you work on for your online portfolio?

Just search for “data engineering project ideas” and a whole host of articles will appear to get you started. Don’t be shy to take inspiration from someone else’s lightbulb moment, but make sure that the code is your own work and is unique.

Read blogs

Find well established data engineers who are actively writing, tweeting or making videos about the subject matter and follow them.

Start a blog

If you like writing then it’s a great way to present your knowledge to others for their learning as well as your own. Pick topics to study and then write about them. You can set up your own site or contribute to one of the larger platforms such as medium.com, which has lots of traffic. You could monetise your own blog using AdSense or something similar to help pay for the hosting. Prioritise the content above all else, however. If you do it well, then the rest will take care of itself.

Start a social media channel

YouTube is a nice one – how to videos are always popular. Again, it could be monetised later if you grow an audience. Like I said in the last paragraph, focus on the content and making the experience worthwhile for the viewer.

Credits

Photo by Christian Bass on Unsplash
