About

Who Is a Data Engineer?

A data engineer transforms data into a useful format for analysis.

A data engineer needs to be good at:

  • Architecting distributed systems
  • Creating reliable pipelines
  • Combining data sources
  • Architecting data stores
  • Collaborating with data science teams and building the right solutions for them

Roughly, the operations in a data pipeline consist of the following phases:

  • Ingestion: gathering the data you need from its sources.
  • Processing: transforming that data to get the end results you want.
  • Storage: storing the end results so they can be retrieved quickly.
  • Access: enabling a tool or user to reach the end results of the pipeline.
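
To make the four phases concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the sample CSV data, the user_totals table, and the function names are hypothetical, and a real pipeline would read from actual sources and write to a real data store rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, an API, or a queue.
RAW_CSV = """user_name,amount
alice,10.5
bob,4.0
alice,2.5
"""


def ingest(raw_text):
    """Ingestion: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def process(rows):
    """Processing: aggregate the total amount per user."""
    totals = {}
    for row in rows:
        name = row["user_name"]
        totals[name] = totals.get(name, 0.0) + float(row["amount"])
    return totals


def store(conn, totals):
    """Storage: persist the results in a keyed table for fast lookup."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_totals "
        "(user_name TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO user_totals VALUES (?, ?)",
        list(totals.items()),
    )
    conn.commit()


def access(conn, user_name):
    """Access: let a tool or user query the stored results."""
    row = conn.execute(
        "SELECT total FROM user_totals WHERE user_name = ?", (user_name,)
    ).fetchone()
    return row[0] if row else None


def run_pipeline(raw_text):
    """Run ingestion, processing, and storage, and return the connection."""
    conn = sqlite3.connect(":memory:")
    store(conn, process(ingest(raw_text)))
    return conn
```

Calling run_pipeline(RAW_CSV) and then access(conn, "alice") walks one record set through all four phases. Keeping each phase in its own function is one possible design; it makes it easier to swap a single phase (for example, a different storage backend) without touching the others.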

Data Engineering Roles

Although data engineers need the skills listed above, their day-to-day work varies depending on the type of company they work for.

Broadly, you can classify data engineers into a few categories:

  • Generalist
  • Pipeline-centric
  • Database-centric

Generalist

A generalist data engineer typically works on a small team. When a data engineer is the only data-focused person at a company, they usually end up doing more end-to-end work. For example, a generalist data engineer may have to do everything from ingesting the data to processing it to doing the final analysis. This requires more data science skill than most data engineers have. However, it also requires less systems architecture knowledge: small teams and companies don’t have many users, so engineering for scale isn’t as important.

Pipeline-centric

Pipeline-centric data engineers tend to be necessary in mid-sized companies that have complex data science needs. A pipeline-centric data engineer will work with teams of data scientists to transform data into a useful format for analysis. This entails in-depth knowledge of distributed systems and computer science.

Database-centric

A database-centric data engineer focuses on setting up and populating analytics databases. The role includes some pipeline work, mainly ETL (extract, transform, load) jobs that move data into warehouses, but more of the effort goes into designing table schemas and tuning databases for fast analysis. This type of data engineer is usually found at larger companies with many data analysts, where data is distributed across multiple databases.
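
As a rough illustration of that kind of work, the sketch below defines a table schema, adds an index to speed up a common lookup, and runs a small ETL step. It assumes Python's built-in sqlite3 module as a stand-in for an analytics database; the orders table, its columns, and the sample records are all hypothetical.

```python
import sqlite3

# Hypothetical source records, standing in for rows extracted from an operational system.
SOURCE_ROWS = [
    {"order_id": "1", "customer": " Alice ", "amount": "19.99", "day": "2024-01-01"},
    {"order_id": "2", "customer": "bob", "amount": "5.00", "day": "2024-01-01"},
    {"order_id": "3", "customer": "ALICE", "amount": "n/a", "day": "2024-01-02"},
]


def create_schema(conn):
    """Create the analytics table and an index for per-customer queries."""
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, "
        "customer TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL, "
        "day TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")


def transform(row):
    """Normalize one source row; return None if the amount cannot be parsed."""
    try:
        cents = round(float(row["amount"]) * 100)
    except ValueError:
        return None
    return (int(row["order_id"]), row["customer"].strip().lower(), cents, row["day"])


def load(conn, rows):
    """Transform the rows, insert the valid ones, and return how many were loaded."""
    clean = [t for t in map(transform, rows) if t is not None]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", clean)
    conn.commit()
    return len(clean)


def total_cents_for(conn, customer):
    """Example analyst query that can use the customer index."""
    row = conn.execute(
        "SELECT SUM(amount_cents) FROM orders WHERE customer = ?", (customer,)
    ).fetchone()
    return row[0]
```

Storing amounts as integer cents and normalizing customer names during the transform step are design choices made for this sketch, not requirements; the point is that schema and index decisions are made with the analysts' queries in mind.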