
Python for Data Analysis: A Practical Guide for African Analysts


Data is being generated across Africa at an unprecedented rate — from mobile money transactions in Kenya to agricultural yield records in Nigeria and public health data in Ghana. The challenge is no longer collecting data; it's knowing what to do with it.

Python has emerged as the go-to language for data analysts worldwide, and for good reason. It is beginner-friendly, incredibly powerful, and has a rich ecosystem of libraries purpose-built for working with data. Whether you are a fresh graduate trying to break into the industry or a business analyst looking to level up your skills, this guide will walk you through the fundamentals of data analysis with Python using real African-context examples.

By the end of this guide, you will understand how to load, clean, explore, and visualise a dataset — the four core steps every data analyst follows on every project.


Setting Up Your Environment

Before writing a single line of code, you need the right tools installed. The fastest way to get started is with Anaconda, a Python distribution that bundles everything you need — Python, Jupyter Notebook, and all the major data libraries — into one installer.

Installing Anaconda

Head to anaconda.com and download the version for your operating system. Once installed, open Jupyter Notebook from the Anaconda Navigator. This is where you will write and run your code interactively, one cell at a time.

Alternatively, if you prefer working in the browser without installing anything, Google Colab is an excellent free option that runs Python notebooks in the cloud.

The Libraries You Need

For data analysis, three libraries do most of the heavy lifting:

  • Pandas — for loading, cleaning, and transforming data
  • NumPy — for numerical operations and array handling
  • Matplotlib / Seaborn — for creating charts and visualisations

Install them with a single command if they are not already available:

pip install pandas numpy matplotlib seaborn

Once installed, import them at the top of every notebook you create:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading Your Dataset

The most common data format you will encounter is the CSV file — a plain text file where each row is a record and columns are separated by commas. Pandas makes loading one trivially easy.

Reading a CSV File

Suppose you have a dataset of mobile internet subscription rates across African countries. Loading it looks like this:

df = pd.read_csv("africa_internet_subscriptions.csv")
df.head()

df.head() shows you the first five rows — always the first thing you should do when loading a new dataset. It gives you an immediate feel for the structure: what columns exist, what the values look like, and whether anything seems off at a glance.

Quick Data Overview

After loading, run these three commands to understand your dataset before touching anything:

df.shape        # (rows, columns)
df.dtypes       # data type of each column
df.describe()   # summary statistics for numeric columns

df.describe() is particularly powerful — it gives you the count, mean, min, max, and quartile values for every numeric column in one shot, letting you immediately spot things like unusually large values or suspiciously low counts that might indicate missing data.


Cleaning Messy Data

Real-world data — especially data sourced from African government portals, NGO reports, or scraped from the web — is almost never clean. You will encounter missing values, inconsistent formatting, duplicate rows, and columns with the wrong data type. Cleaning this is not glamorous work, but it is the most important step in any analysis.

Handling Missing Values

Check for missing values across your entire dataset:

df.isnull().sum()

This returns a count of missing values per column. From here you have a few options depending on the context:

  • Drop rows where a critical column is missing: df.dropna(subset=["subscription_rate"])
  • Fill with the column mean for numeric data: df["subscription_rate"] = df["subscription_rate"].fillna(df["subscription_rate"].mean())
  • Fill with a placeholder for categorical data: df["region"] = df["region"].fillna("Unknown")

Note that assigning the result back to the column is preferred over passing inplace=True to fillna on a single column — recent versions of Pandas warn against the inplace pattern because it operates on a temporary copy.

There is no universal right answer — the best approach depends on how much data is missing and how important that column is to your analysis.
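To make the options concrete, here is a minimal sketch using a small made-up dataset (the country names and figures are illustrative, not real statistics):

```python
import pandas as pd

# Hypothetical mini-dataset with deliberate gaps
df = pd.DataFrame({
    "country": ["Kenya", "Ghana", "Nigeria", "Senegal"],
    "region": ["East Africa", "West Africa", "West Africa", None],
    "subscription_rate": [42.0, None, 36.0, 30.0],
})

# Option 1: drop rows where the critical column is missing
dropped = df.dropna(subset=["subscription_rate"])
print(dropped.shape)  # (3, 3) — Ghana's row is gone

# Option 2: fill numeric gaps with the column mean
df["subscription_rate"] = df["subscription_rate"].fillna(df["subscription_rate"].mean())
print(df["subscription_rate"].tolist())  # [42.0, 36.0, 36.0, 30.0]

# Option 3: fill categorical gaps with a placeholder
df["region"] = df["region"].fillna("Unknown")
print(df["region"].tolist())
```

Notice that option 2 computes the mean from the non-missing values only — (42 + 36 + 30) / 3 = 36 — which is exactly the behaviour you want.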

Fixing Data Types

A common issue is numeric columns being stored as strings because of formatting — for example, a population column that contains values like "12,500,000". Pandas reads the comma as a character, making the column a string instead of an integer.

Fix it like this:

df["population"] = df["population"].str.replace(",", "").astype(int)
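You can verify the conversion worked by checking the dtype before and after. A quick self-contained sketch with made-up values:

```python
import pandas as pd

# Hypothetical example: population stored as formatted strings
df = pd.DataFrame({"population": ["12,500,000", "3,200,000"]})
print(df["population"].dtype)  # object — Pandas sees strings

df["population"] = df["population"].str.replace(",", "").astype(int)
print(df["population"].dtype)  # int64
print(df["population"].sum())  # 15700000 — arithmetic now works
```

One caveat: .astype(int) will raise an error if the column still contains missing values, so handle those first (or use pd.to_numeric with errors="coerce").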

Removing Duplicates

Always check for and remove duplicate rows, especially if your data was merged from multiple sources:

df.drop_duplicates(inplace=True)

Exploring and Analysing the Data

With a clean dataset, you can start asking real questions. This stage — called Exploratory Data Analysis (EDA) — is where the insights begin to emerge.

Grouping and Aggregating

One of the most useful operations in Pandas is groupby. Suppose you want to find the average internet subscription rate by region:

regional_avg = df.groupby("region")["subscription_rate"].mean().reset_index()
regional_avg.sort_values("subscription_rate", ascending=False)

This tells you at a glance which regions are leading in connectivity and which are lagging — a critical insight for anyone working in telecoms, policy, or digital education across Africa.

Filtering Data

You can filter your dataframe just like a spreadsheet. To look at only East African countries:

east_africa = df[df["region"] == "East Africa"]

Or countries with subscription rates above 50%:

high_connectivity = df[df["subscription_rate"] > 50]

Combine conditions using & (and) or | (or):

df[(df["region"] == "West Africa") & (df["subscription_rate"] > 30)]
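The parentheses around each condition are required — & and | bind more tightly than comparisons like == and >, so leaving them out raises an error. A runnable sketch with an illustrative mini-dataset:

```python
import pandas as pd

# Hypothetical sample, for illustration only
df = pd.DataFrame({
    "country": ["Ghana", "Nigeria", "Kenya", "Mali"],
    "region": ["West Africa", "West Africa", "East Africa", "West Africa"],
    "subscription_rate": [48.0, 36.0, 44.0, 22.0],
})

# Wrap each condition in parentheses before combining with & or |
west_above_30 = df[(df["region"] == "West Africa") & (df["subscription_rate"] > 30)]
print(west_above_30["country"].tolist())  # ['Ghana', 'Nigeria']
```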

Visualising Your Findings

Numbers in a table rarely tell a compelling story on their own. Visualisation is what turns your analysis into something stakeholders can actually understand and act on.

Bar Chart — Comparing Countries

plt.figure(figsize=(12, 6))
sns.barplot(data=df, x="country", y="subscription_rate", hue="country", palette="viridis", legend=False)
plt.title("Mobile Internet Subscription Rates by Country")
plt.xlabel("Country")
plt.ylabel("Subscription Rate (%)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

If your dataset includes a year column, you can plot how subscription rates have changed over time for a specific country:

nigeria = df[df["country"] == "Nigeria"]
plt.plot(nigeria["year"], nigeria["subscription_rate"], marker="o")
plt.title("Nigeria Internet Subscription Rate Over Time")
plt.xlabel("Year")
plt.ylabel("Subscription Rate (%)")
plt.grid(True)
plt.show()

Correlation Heatmap

Want to understand how your numeric variables relate to each other? A heatmap of the correlation matrix gives you that at a glance:

plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Matrix")
plt.show()

A Real-World Example: African Internet Data

Here is a summary of what a cleaned dataset on African internet penetration might look like across regions:

Region            Countries Analysed   Avg Subscription Rate   Highest Performer
North Africa      6                    61.4%                   Morocco
Southern Africa   10                   48.2%                   South Africa
East Africa       13                   34.7%                   Kenya
West Africa       16                   29.3%                   Ghana
Central Africa    9                    14.1%                   Cameroon

This kind of summary table — produced in just a few lines of Pandas — is exactly the sort of output you would include in a report for a client, an NGO, or a government ministry.
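As a sketch of how such a table comes together, here is one way to build it with groupby and named aggregations. The dataset below is made up for illustration (two regions, four countries — the real analysis would cover all of them):

```python
import pandas as pd

# Hypothetical country-level data; real figures will differ
df = pd.DataFrame({
    "country": ["Morocco", "Egypt", "Kenya", "Tanzania"],
    "region": ["North Africa", "North Africa", "East Africa", "East Africa"],
    "subscription_rate": [64.0, 58.8, 44.0, 25.4],
})

summary = df.groupby("region").agg(
    countries_analysed=("country", "nunique"),
    avg_subscription_rate=("subscription_rate", "mean"),
    highest_performer=("subscription_rate", "idxmax"),
)
# idxmax returns row labels; map them back to country names
summary["highest_performer"] = df.loc[summary["highest_performer"], "country"].values

print(summary.sort_values("avg_subscription_rate", ascending=False))
```

The named-aggregation syntax (column=("source_column", "function")) keeps the output columns clearly labelled, which matters when the table is headed straight into a report.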


What to Learn Next

Once you are comfortable with the basics, these are the natural next steps on your data analysis journey:

  • SQL — for querying data directly from databases, which is how most enterprise data is stored
  • Power BI or Tableau — for building interactive dashboards without writing code
  • Scikit-learn — for moving into machine learning and predictive modelling
  • Statistics fundamentals — understanding hypothesis testing, confidence intervals, and regression will make you a far more credible analyst

The Most Important Habit

The single best thing you can do to grow as a data analyst is to work with real data consistently. Find a dataset that interests you — football statistics, crop prices, public health records, election results — and analyse it. Share your findings on LinkedIn or GitHub. Build a portfolio.

Africa needs analysts who understand its data. The opportunity is enormous, and the barrier to entry has never been lower.

The goal is to turn data into information, and information into insight. Start with one dataset, one question, and one chart. Everything else follows.

The tools are free. The data is available. The only thing standing between you and a career in data is starting.


Keep Learning with DatAfrik

At DatAfrik, we are building the resources, courses, and community that African data professionals need to thrive. From beginner Python tutorials to advanced machine learning content, everything we create is designed with the African context in mind — our datasets, our problems, our opportunities.

Explore our learning paths, browse our cheat sheets, or join our community to connect with other analysts across the continent. The data revolution in Africa is already underway — and you belong in it.

datafrik.co

Copyright © 2024 Datafrik.co