Data Engineering Fundamentals: Building Pipelines That Power African Businesses

Every dashboard you admire, every machine learning model that predicts customer churn, every report that lands on a CEO's desk — none of it exists without a data engineer working behind the scenes. Data engineers are the builders of the data world. They design and maintain the systems that collect, move, transform, and store data so that analysts and scientists can actually use it.
In Africa, where businesses are digitising rapidly and data volumes are growing faster than the teams managing them, data engineering skills are in enormous demand. From fintech startups in Lagos to logistics companies in Nairobi and telecom giants in Johannesburg, every data-driven organisation needs people who can build reliable data infrastructure.
This guide introduces you to the core concepts of data engineering, the tools used in real pipelines, and how to start building the skills that will make you one of the most valuable people in any data team.
What Does a Data Engineer Actually Do?
A data engineer's primary job is to make sure the right data gets to the right place in the right format at the right time. That sounds simple, but in practice it involves:
- Ingesting data from dozens of different sources — APIs, databases, flat files, streaming platforms
- Transforming raw messy data into clean, structured formats ready for analysis
- Loading that data into storage systems like data warehouses or data lakes
- Orchestrating all of these steps to run automatically on a schedule
- Monitoring pipelines to catch failures before they affect downstream teams
This process is commonly referred to as ETL — Extract, Transform, Load — and it sits at the heart of everything a data engineer builds.
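Stripped to its skeleton, ETL is just three functions composed in order. Here is a minimal, hypothetical sketch with placeholder data (the real extract, transform, and load steps are covered in detail below; the toy values here are invented purely for illustration):

```python
import pandas as pd

def extract():
    # Placeholder source: a real pipeline would query an API or database here
    return pd.DataFrame({
        "transaction_id": [1, 2, 2],          # note the duplicate from retry logic
        "amount": [5000.0, 1200.0, 1200.0],
        "currency": ["NGN", "KES", "KES"],
    })

def transform(df):
    # Drop retry duplicates and keep only the columns downstream teams need
    return df.drop_duplicates(subset=["transaction_id"])[
        ["transaction_id", "amount", "currency"]
    ]

def load(df):
    # Stand-in for writing to a warehouse table
    print(f"Loaded {len(df)} rows")
    return len(df)

# The whole pipeline is just the composition: load(transform(extract()))
rows_loaded = load(transform(extract()))
```

Everything else in a data engineer's toolkit (schedulers, warehouses, monitoring) exists to run this composition reliably, at scale, on real sources.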
How Data Engineering Differs From Data Analysis
It is worth being clear about the distinction, because many people starting out conflate the two roles:
| Responsibility | Data Analyst | Data Engineer |
|---|---|---|
| Primary focus | Insights from data | Moving and storing data |
| Core tools | SQL, Python, Power BI | Python, Spark, Airflow |
| Output | Reports, dashboards | Pipelines, warehouses |
| Works with | Clean, ready data | Raw, unstructured data |
| Coding depth | Moderate | Heavy |
In small organisations, one person often does both. But as a company grows, these roles separate quickly.
The ETL Process in Detail
Extract — Getting Data From the Source
Data lives everywhere. A typical African e-commerce business might have:
- Order data in a MySQL database
- Customer interactions logged in Firebase
- Payment records from Paystack or Flutterwave APIs
- Inventory data in a spreadsheet shared on Google Drive
- Social media engagement pulled from Meta's Graph API
Extracting data means connecting to all of these sources and pulling the records you need. Here is a simple example of extracting data from a REST API using Python:
```python
import requests
import pandas as pd

def extract_from_api(endpoint, api_key):
    headers = {"Authorization": f"Bearer {api_key}"}
    # A timeout stops the pipeline from hanging if the API stops responding
    response = requests.get(endpoint, headers=headers, timeout=30)
    if response.status_code == 200:
        data = response.json()
        return pd.DataFrame(data["results"])
    else:
        raise Exception(f"API call failed with status {response.status_code}")

df = extract_from_api(
    "https://api.example.com/transactions",
    api_key="your_api_key_here"
)
```

Transform — Cleaning and Reshaping
This is where the real engineering work happens. Raw data from production systems is almost never analysis-ready. Transformation steps might include:
- Removing duplicate records created by retry logic
- Standardising date formats across sources
- Converting currencies to a single base (NGN, KES, GHS all to USD)
- Joining data from multiple sources on a common key like a customer ID
- Aggregating transaction-level data to daily or monthly summaries
Here is a transformation function that handles several of these at once:
```python
def transform_transactions(df):
    # Remove duplicates
    df = df.drop_duplicates(subset=["transaction_id"])
    # Standardise date format
    df["transaction_date"] = pd.to_datetime(df["transaction_date"])
    # Convert all amounts to USD (rates here are illustrative snapshots)
    exchange_rates = {"NGN": 0.00065, "KES": 0.0077, "GHS": 0.083}
    df["amount_usd"] = df.apply(
        lambda row: row["amount"] * exchange_rates.get(row["currency"], 1),
        axis=1
    )
    # Add derived columns
    df["month"] = df["transaction_date"].dt.to_period("M").astype(str)
    df["is_high_value"] = df["amount_usd"] > 500
    return df[["transaction_id", "customer_id", "transaction_date",
               "amount_usd", "month", "is_high_value"]]
```

Load — Writing to the Destination
Once transformed, data is loaded into a storage system. The most common destinations are:
- Data warehouses like BigQuery, Snowflake, or Amazon Redshift — optimised for fast analytical queries
- Data lakes like AWS S3 or Google Cloud Storage — for storing raw files at massive scale cheaply
- Relational databases like PostgreSQL — for operational data that applications read from directly
Here is a simple load function using SQLAlchemy and Pandas:

```python
from sqlalchemy import create_engine

def load_to_postgres(df, table_name, connection_string):
    engine = create_engine(connection_string)
    df.to_sql(
        name=table_name,
        con=engine,
        if_exists="append",
        index=False
    )
    print(f"Loaded {len(df)} rows into {table_name}")

load_to_postgres(
    df=transformed_df,
    table_name="transactions_clean",
    connection_string="postgresql://user:password@localhost:5432/analytics_db"
)
```

Orchestrating Pipelines With Apache Airflow
Running your ETL script once manually is fine for testing. But in production, you need pipelines to run automatically — daily, hourly, or even every few minutes — and you need to know immediately when something breaks.
Apache Airflow is the industry-standard tool for this. It lets you define pipelines as code using Python and schedule them to run on any cadence. Here is what a simple Airflow DAG looks like:
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    "owner": "datafrik",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-team@datafrik.co"]
}

with DAG(
    dag_id="daily_transactions_pipeline",
    default_args=default_args,
    schedule_interval="0 6 * * *",  # cron syntax: every day at 06:00
    start_date=datetime(2024, 1, 1),
    catchup=False
) as dag:
    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract_from_api,
        op_kwargs={"endpoint": "...", "api_key": "..."}
    )
    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform_transactions
    )
    load_task = PythonOperator(
        task_id="load",
        python_callable=load_to_postgres
    )

    extract_task >> transform_task >> load_task
```

The `>>` operator defines the order — extract runs first, then transform, then load. If any step fails, Airflow retries it and sends an alert to your team. (This sketch focuses on the dependency structure; in a real DAG, the transform and load tasks would receive their input via XCom or an intermediate storage location rather than being called with no arguments.)
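One practical detail those retries raise: the load step must be idempotent, because re-running a day's pipeline with `if_exists="append"` would duplicate every row. A common pattern is delete-then-insert for the partition being loaded. Here is a minimal sketch of that pattern, using the stdlib `sqlite3` module as a stand-in for Postgres (table and column names are invented for illustration):

```python
import sqlite3

def load_idempotent(conn, rows, load_date):
    """Replace one day's rows so re-running the pipeline never duplicates data."""
    with conn:  # wraps the delete and insert in a single transaction
        conn.execute(
            "DELETE FROM transactions_clean WHERE transaction_date = ?",
            (load_date,),
        )
        conn.executemany(
            "INSERT INTO transactions_clean "
            "(transaction_id, transaction_date, amount_usd) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions_clean "
    "(transaction_id INTEGER, transaction_date TEXT, amount_usd REAL)"
)

day = "2024-06-01"
rows = [(1, day, 12.5), (2, day, 80.0)]
load_idempotent(conn, rows, day)
load_idempotent(conn, rows, day)  # simulate an Airflow retry: still only 2 rows

count = conn.execute("SELECT COUNT(*) FROM transactions_clean").fetchone()[0]
print(count)  # 2
```

The transaction matters: if the insert fails halfway, the delete rolls back too, so a retry always starts from a clean state.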
Key Tools Every African Data Engineer Should Know
The data engineering ecosystem is large, but you do not need to learn everything at once. Focus on these in order:
- SQL — the foundation of everything. You cannot be a data engineer without strong SQL skills
- Python — for writing pipeline logic, transformations, and automation scripts
- Apache Airflow — for scheduling and orchestrating pipelines
- dbt (data build tool) — for writing transformations inside your data warehouse using SQL
- Apache Spark — for processing datasets too large to fit in memory on a single machine
- Docker — for packaging your pipelines so they run consistently across environments
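Since SQL sits at the top of that list, here is the kind of query data engineers write constantly: keeping only the latest version of each record using a window function. This is a sketch against an in-memory SQLite database (SQLite supports window functions from version 3.25; the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "pending",   "2024-03-01"),
        (1, "delivered", "2024-03-05"),  # a later version of order 1
        (2, "pending",   "2024-03-02"),
    ],
)

# Rank each order's rows by recency, then keep only the newest per order_id
latest = conn.execute("""
    SELECT order_id, status
    FROM (
        SELECT order_id, status,
               ROW_NUMBER() OVER (
                   PARTITION BY order_id ORDER BY updated_at DESC
               ) AS rn
        FROM raw_orders
    )
    WHERE rn = 1
    ORDER BY order_id
""").fetchall()

print(latest)  # [(1, 'delivered'), (2, 'pending')]
```

The same `ROW_NUMBER() OVER (PARTITION BY ...)` pattern works in BigQuery, Snowflake, Redshift, and Postgres, which is exactly why window functions are worth mastering early.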
A Realistic Learning Path
- Master SQL — joins, window functions, CTEs, aggregations
- Get comfortable with Python and Pandas
- Learn how to connect Python to databases and APIs
- Build your first end-to-end ETL script from scratch
- Deploy it on a schedule using Airflow or a simple cron job
- Learn dbt for warehouse transformations
- Explore cloud services — covered in our companion post on Cloud Computing
Why Data Engineering Matters for Africa
The data infrastructure gap in Africa is real. Many organisations are sitting on years of valuable data stored in disconnected systems, spreadsheets, and legacy databases — but they lack the engineering capacity to unlock it.
A well-built data pipeline can:
- Give a microfinance institution in Ghana real-time visibility into loan repayment rates
- Help a logistics company in Nigeria optimise delivery routes by analysing historical traffic patterns
- Enable a public health authority in Kenya to track disease outbreaks as they develop rather than weeks after the fact
Data engineering is not just a technical discipline — it is infrastructure for decision-making. And Africa urgently needs better infrastructure at every level.
The opportunity for data engineers on the continent is significant. Salaries are competitive, remote work is increasingly available, and the problems worth solving are genuinely important.
Start Building Today
The best way to learn data engineering is to build something. Pick a publicly available African dataset — from the World Bank Open Data, the African Development Bank, or your national statistics bureau — and build a pipeline around it. Extract it, clean it, load it into a local PostgreSQL database, and schedule it to refresh weekly.
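As a concrete starting point, the World Bank serves indicator data as JSON over plain HTTP. The sketch below builds a request URL for its v2 API; the URL format matches the World Bank's published documentation at the time of writing, but verify it against the current docs before building on it:

```python
BASE = "https://api.worldbank.org/v2"

def worldbank_url(country_code, indicator, per_page=100):
    """Build a World Bank API v2 URL for one indicator and country."""
    return f"{BASE}/country/{country_code}/indicator/{indicator}?format=json&per_page={per_page}"

# Nigeria's GDP in current US dollars (indicator code NY.GDP.MKTP.CD)
url = worldbank_url("NGA", "NY.GDP.MKTP.CD")
print(url)

# To fetch for real (requires the requests package), the response is a
# two-element JSON array of [metadata, records]:
# import requests
# records = requests.get(url, timeout=30).json()[1]
```

From there, the rest of the project is the pipeline described in this guide: transform the records with Pandas, load them into PostgreSQL, and schedule a weekly refresh.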
That one project, done properly and documented on GitHub, is worth more than any certification in a job interview.
At DatAfrik, we walk you through exactly this kind of hands-on project work in our Data Engineering bootcamp. Everything is built around African data, African use cases, and the kinds of problems you will actually face working in this industry.