Data Engineering Fundamentals: Building Pipelines That Power African Businesses

Every dashboard you admire, every machine learning model that predicts customer churn, every report that lands on a CEO's desk — none of it exists without a data engineer working behind the scenes. Data engineers are the builders of the data world. They design and maintain the systems that collect, move, transform, and store data so that analysts and scientists can actually use it.
In Africa, where businesses are digitising rapidly and data volumes are growing faster than the teams managing them, data engineering skills are in enormous demand. From fintech startups in Lagos to logistics companies in Nairobi and telecom giants in Johannesburg, every data-driven organisation needs people who can build reliable data infrastructure.
This guide introduces you to the core concepts of data engineering, the tools used in real pipelines, and how to start building the skills that will make you one of the most valuable people in any data team.
What Does a Data Engineer Actually Do?
A data engineer's primary job is to make sure the right data gets to the right place in the right format at the right time. That sounds simple, but in practice it involves:
- Ingesting data from dozens of different sources — APIs, databases, flat files, streaming platforms
- Transforming raw messy data into clean, structured formats ready for analysis
- Loading that data into storage systems like data warehouses or data lakes
- Orchestrating all of these steps to run automatically on a schedule
- Monitoring pipelines to catch failures before they affect downstream teams
This process is commonly referred to as ETL — Extract, Transform, Load — and it sits at the heart of everything a data engineer builds.
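Stripped to its skeleton, ETL is just three functions composed in order. Here is a minimal, hypothetical sketch with placeholder data (the real extract, transform, and load steps are covered in detail below; the toy values here are invented purely for illustration):

```python
import pandas as pd

def extract():
    # Placeholder source: a real pipeline would query an API or database here
    return pd.DataFrame({
        "transaction_id": [1, 2, 2],          # note the duplicate from retry logic
        "amount": [5000.0, 1200.0, 1200.0],
        "currency": ["NGN", "KES", "KES"],
    })

def transform(df):
    # Drop retry duplicates and keep only the columns downstream teams need
    return df.drop_duplicates(subset=["transaction_id"])[
        ["transaction_id", "amount", "currency"]
    ]

def load(df):
    # Stand-in for writing to a warehouse table
    print(f"Loaded {len(df)} rows")
    return len(df)

# The whole pipeline is just the composition: load(transform(extract()))
rows_loaded = load(transform(extract()))
```

Everything else in a data engineer's toolkit (schedulers, warehouses, monitoring) exists to run this composition reliably, at scale, on real sources.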
How Data Engineering Differs From Data Analysis
It is worth being clear about the distinction, because many people starting out conflate the two roles:
| Responsibility | Data Analyst | Data Engineer |
|---|---|---|
| Primary focus | Insights from data | Moving and storing data |
| Core tools | SQL, Python, Power BI | Python, Spark, Airflow |
| Output | Reports, dashboards | Pipelines, warehouses |
| Works with | Clean, ready data | Raw, unstructured data |
| Coding depth | Moderate | Heavy |
In small organisations, one person often does both. But as a company grows, these roles separate quickly.
The ETL Process in Detail
Extract — Getting Data From the Source
Data lives everywhere. A typical African e-commerce business might have:
- Order data in a MySQL database
- Customer interactions logged in Firebase
- Payment records from Paystack or Flutterwave APIs
- Inventory data in a spreadsheet shared on Google Drive
- Social media engagement pulled from Meta's Graph API
Extracting data means connecting to all of these sources and pulling the records you need. Here is a simple example of extracting data from a REST API using Python:
```python
import requests
import pandas as pd

def extract_from_api(endpoint, api_key):
    headers = {"Authorization": f"Bearer {api_key}"}
    # A timeout stops the pipeline from hanging if the API stops responding
    response = requests.get(endpoint, headers=headers, timeout=30)
    if response.status_code == 200:
        data = response.json()
        return pd.DataFrame(data["results"])
    else:
        raise Exception(f"API call failed with status {response.status_code}")

df = extract_from_api(
    "https://api.example.com/transactions",
    api_key="your_api_key_here"
)
```

Transform — Cleaning and Reshaping
This is where the real engineering work happens. Raw data from production systems is almost never analysis-ready. Transformation steps might include:
- Removing duplicate records created by retry logic
- Standardising date formats across sources
- Converting currencies to a single base (NGN, KES, GHS all to USD)
- Joining data from multiple sources on a common key like a customer ID
- Aggregating transaction-level data to daily or monthly summaries
Here is a transformation function that handles several of these at once:
```python
def transform_transactions(df):
    # Remove duplicates
    df = df.drop_duplicates(subset=["transaction_id"])
    # Standardise date format
    df["transaction_date"] = pd.to_datetime(df["transaction_date"])
    # Convert all amounts to USD (rates here are illustrative snapshots)
    exchange_rates = {"NGN": 0.00065, "KES": 0.0077, "GHS": 0.083}
    df["amount_usd"] = df.apply(
        lambda row: row["amount"] * exchange_rates.get(row["currency"], 1),
        axis=1
    )
    # Add derived columns
    df["month"] = df["transaction_date"].dt.to_period("M").astype(str)
    df["is_high_value"] = df["amount_usd"] > 500
    return df[["transaction_id", "customer_id", "transaction_date",
               "amount_usd", "month", "is_high_value"]]
```

Load — Writing to the Destination
Once transformed, data is loaded into a storage system. The most common destinations are:
- Data warehouses like BigQuery, Snowflake, or Amazon Redshift — optimised for fast analytical queries
- Data lakes like AWS S3 or Google Cloud Storage — for storing raw files at massive scale cheaply
- Relational databases like PostgreSQL — for operational data that applications read from directly
Here is a simple load function using SQLAlchemy and Pandas:

```python
from sqlalchemy import create_engine

def load_to_postgres(df, table_name, connection_string):
    engine = create_engine(connection_string)
    df.to_sql(
        name=table_name,
        con=engine,
        if_exists="append",
        index=False
    )
    print(f"Loaded {len(df)} rows into {table_name}")

load_to_postgres(
    df=transformed_df,
    table_name="transactions_clean",
    connection_string="postgresql://user:password@localhost:5432/analytics_db"
)
```

Orchestrating Pipelines With Apache Airflow
Running your ETL script once manually is fine for testing. But in production, you need pipelines to run automatically — daily, hourly, or even every few minutes — and you need to know immediately when something breaks.
Apache Airflow is the industry-standard tool for this. It lets you define pipelines as code using Python and schedule them to run on any cadence. Here is what a simple Airflow DAG looks like:
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "datafrik",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-team@datafrik.co"]
}

with DAG(
    dag_id="daily_transactions_pipeline",
    default_args=default_args,
    schedule_interval="0 6 * * *",  # cron syntax: every day at 06:00
    start_date=datetime(2024, 1, 1),
    catchup=False
) as dag:
    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract_from_api,
        op_kwargs={"endpoint": "...", "api_key": "..."}
    )
    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform_transactions
    )
    load_task = PythonOperator(
        task_id="load",
        python_callable=load_to_postgres
    )

    extract_task >> transform_task >> load_task
```

The `>>` operator defines the order — extract runs first, then transform, then load. If any step fails, Airflow retries it and sends an alert to your team. (This sketch focuses on the dependency structure; in a real DAG, the transform and load tasks would receive their input via XCom or an intermediate storage location rather than being called with no arguments.)
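One practical detail those retries raise: the load step must be idempotent, because re-running a day's pipeline with `if_exists="append"` would duplicate every row. A common pattern is delete-then-insert for the partition being loaded. Here is a minimal sketch of that pattern, using the stdlib `sqlite3` module as a stand-in for Postgres (table and column names are invented for illustration):

```python
import sqlite3

def load_idempotent(conn, rows, load_date):
    """Replace one day's rows so re-running the pipeline never duplicates data."""
    with conn:  # wraps the delete and insert in a single transaction
        conn.execute(
            "DELETE FROM transactions_clean WHERE transaction_date = ?",
            (load_date,),
        )
        conn.executemany(
            "INSERT INTO transactions_clean "
            "(transaction_id, transaction_date, amount_usd) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions_clean "
    "(transaction_id INTEGER, transaction_date TEXT, amount_usd REAL)"
)

day = "2024-06-01"
rows = [(1, day, 12.5), (2, day, 80.0)]
load_idempotent(conn, rows, day)
load_idempotent(conn, rows, day)  # simulate an Airflow retry: still only 2 rows

count = conn.execute("SELECT COUNT(*) FROM transactions_clean").fetchone()[0]
print(count)  # 2
```

The transaction matters: if the insert fails halfway, the delete rolls back too, so a retry always starts from a clean state.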
Key Tools Every African Data Engineer Should Know
The data engineering ecosystem is large, but you do not need to learn everything at once. Focus on these in order:
- SQL — the foundation of everything. You cannot be a data engineer without strong SQL skills
- Python — for writing pipeline logic, transformations, and automation scripts
- Apache Airflow — for scheduling and orchestrating pipelines
- dbt (data build tool) — for writing transformations inside your data warehouse using SQL
- Apache Spark — for processing datasets too large to fit in memory on a single machine
- Docker — for packaging your pipelines so they run consistently across environments
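Since SQL sits at the top of that list, here is the kind of query data engineers write constantly: keeping only the latest version of each record using a window function. This is a sketch against an in-memory SQLite database (SQLite supports window functions from version 3.25; the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "pending",   "2024-03-01"),
        (1, "delivered", "2024-03-05"),  # a later version of order 1
        (2, "pending",   "2024-03-02"),
    ],
)

# Rank each order's rows by recency, then keep only the newest per order_id
latest = conn.execute("""
    SELECT order_id, status
    FROM (
        SELECT order_id, status,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY updated_at DESC
               ) AS rn
        FROM raw_orders
    )
    WHERE rn = 1
    ORDER BY order_id
""").fetchall()

print(latest)  # [(1, 'delivered'), (2, 'pending')]
```

The same `ROW_NUMBER() OVER (PARTITION BY ...)` pattern works in BigQuery, Snowflake, Redshift, and Postgres, which is exactly why window functions are worth mastering early.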
A Realistic Learning Path
- Master SQL — joins, window functions, CTEs, aggregations
- Get comfortable with Python and Pandas
- Learn how to connect Python to databases and APIs
- Build your first end-to-end ETL script from scratch
- Deploy it on a schedule using Airflow or a simple cron job
- Learn dbt for warehouse transformations
- Explore cloud services — covered in our companion post on Cloud Computing
Why Data Engineering Matters for Africa
The data infrastructure gap in Africa is real. Many organisations are sitting on years of valuable data stored in disconnected systems, spreadsheets, and legacy databases — but they lack the engineering capacity to unlock it.
A well-built data pipeline can:
- Give a microfinance institution in Ghana real-time visibility into loan repayment rates
- Help a logistics company in Nigeria optimise delivery routes by analysing historical traffic patterns
- Enable a public health authority in Kenya to track disease outbreaks as they develop rather than weeks after the fact
Data engineering is not just a technical discipline — it is infrastructure for decision-making. And Africa urgently needs better infrastructure at every level.
The opportunity for data engineers on the continent is significant. Salaries are competitive, remote work is increasingly available, and the problems worth solving are genuinely important.
Start Building Today
The best way to learn data engineering is to build something. Pick a publicly available African dataset — from the World Bank Open Data, the African Development Bank, or your national statistics bureau — and build a pipeline around it. Extract it, clean it, load it into a local PostgreSQL database, and schedule it to refresh weekly.
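As a concrete starting point, the World Bank serves indicator data as JSON over plain HTTP. The sketch below builds a request URL for its v2 API; the URL format matches the World Bank's published documentation at the time of writing, but verify it against the current docs before building on it:

```python
BASE = "https://api.worldbank.org/v2"

def worldbank_url(country_code, indicator, per_page=100):
    """Build a World Bank API v2 URL for one indicator and country."""
    return f"{BASE}/country/{country_code}/indicator/{indicator}?format=json&per_page={per_page}"

# Nigeria's GDP in current US dollars (indicator code NY.GDP.MKTP.CD)
url = worldbank_url("NGA", "NY.GDP.MKTP.CD")
print(url)

# To fetch for real (requires the requests package), the response is a
# two-element JSON array of [metadata, records]:
# import requests
# records = requests.get(url, timeout=30).json()[1]
```

From there, the rest of the project is the pipeline described in this guide: transform the records with Pandas, load them into PostgreSQL, and schedule a weekly refresh.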
That one project, done properly and documented on GitHub, is worth more than any certification in a job interview.
At DatAfrik, we walk you through exactly this kind of hands-on project work in our Data Engineering bootcamp. Everything is built around African data, African use cases, and the kinds of problems you will actually face working in this industry.