Introduction
In the following report, I will be investigating and examining the
untold story behind seemingly transparent records pertaining to
worldwide COVID cases and death tolls. While the data has already been
cleaned and contains fairly simple information, this project aims to
investigate the disparities between countries, their COVID death tolls,
COVID recovery rates, and the correlation between those statistics and
their financial standing and population size. In simpler terms, does a
country having a high death toll inherently imply that they handled the
outbreak worse than another? What does the total number of cases in
relation to the number of recovered patients and deaths tell us about
that nation’s pandemic response? Did active participation in COVID
testing actually have any noticeable effect on the total number of cases
and, in turn, deaths? These are just some of the questions I hope to
explore in this writing.
Background
This project was inspired in part by my previous research on
non-substance use disorder treatment services offered in Baltimore, MD,
which primarily consisted of COVID-related services. It was also
inspired by general happenstance: I found a very nice data set on
Kaggle and felt it could offer an interesting take on seemingly dull
data. As mentioned, this data set was found here,
last updated 2 years ago (2022) by Mrityunjay Pathak.
The data set consists of various reported statistics regarding
Coronavirus (COVID-19), an infectious disease caused by the SARS-CoV-2
virus. Most of those infected by the virus will experience mild to
moderate respiratory symptoms and recover without requiring any particular
medical treatment, albeit taking a bit longer than with most other diseases.
However, some individuals may become seriously ill and require medical
attention and even hospitalization. The elderly and those with existing
medical conditions such as cardiovascular disease, chronic respiratory
illness, cancer, and other such conditions are more likely to develop
serious symptoms. That being said, anyone, regardless
of age or medical history, could grow seriously ill and die as a result
of COVID-19 infection.
Prevention is the primary defense against the spread of the disease.
Understanding how the virus spreads is important, as protecting oneself
from the illness is the best way to protect others as well. At the peak
of the pandemic, it was advised that individuals stay at least 1 meter
apart from each other, wear properly fitted medical-grade masks, and wash
hands or use hand sanitizer frequently. Once the vaccines were released,
it was also advised by the CDC to get vaccinated to start building up
herd immunity.
The virus can spread through liquid particles associated with
coughing, sneezing, speaking, singing, or breathing. The obvious avenue
for this is through the mouth or nose of the sick individual. This can
be spread through larger droplets to smaller aerosols. Individuals are
encouraged to practice proper respiratory etiquette, stay home when
sick, and to self-isolate for at least 1 week after symptoms have
ceased.
This data has been compiled through reports from the onset of the
pandemic (2020) up until its last update in 2022. The data does not
distinguish between variants of the virus; however, it does imply that
it accounts for the original variant, Delta, and Omicron along with its
subvariants BA.4 and BA.5.
Data
As already described, this data set was retrieved from Kaggle and had
already been mostly cleaned for analysis. However,
there are some changes I wanted to make to the data structure, missing
values/empty spaces, and superfluous characters found throughout the
data set. The following code and outputs demonstrate the changes I’ve
made to allow for smoother data analysis:
# shows the original data structure
str(og_covid_world_dat)
## 'data.frame': 231 obs. of 8 variables:
## $ Serial.Number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Country : chr "USA" "India" "France" "Germany" ...
## $ Total.Cases : chr "104,196,861" "44,682,784" "39,524,311" "37,779,833" ...
## $ Total.Deaths : chr "1,132,935" "530,740" "164,233" "165,711" ...
## $ Total.Recovered: chr "101,322,779" "44,150,289" "39,264,546" "37,398,100" ...
## $ Active.Cases : chr "1,741,147" "1,755" "95,532" "216,022" ...
## $ Total.Test : chr "1,159,832,679" "915,265,788" "271,490,188" "122,332,384" ...
## $ Population : chr "334,805,269" "1,406,631,776" "65,584,518" "83,883,596" ...
# converts all N/A chr values to actual missing values and fills in empty spaces with missing values
covid_world_dat <- og_covid_world_dat |>
mutate(across(where(is.character), ~ na_if(., "N/A") |> na_if("")))
# convert column titles to snake_case
covid_world_dat <- clean_names(covid_world_dat)
# replace commas in all chr columns to avoid issue with conversion to num data type
covid_world_dat <- covid_world_dat |>
mutate(across(where(is.character), ~ str_replace_all(., ",", "")))
# converts data structure to more appropriate data types
covid_world_dat <- covid_world_dat |>
mutate(country = as.factor(country),
across(-c(country, serial_number), as.numeric), # converts to num (not int) to account for math that results in decimals
total_test_taken = total_test) |> # makes a column name clearer to understand
relocate(total_test_taken, .before = population) |>
select(-total_test) # removes that original column
# shows the updated data structure
str(covid_world_dat)
## 'data.frame': 231 obs. of 8 variables:
## $ serial_number : int 1 2 3 4 5 6 7 8 9 10 ...
## $ country : Factor w/ 231 levels "Afghanistan",..: 221 95 73 79 27 105 172 102 218 170 ...
## $ total_cases : num 104196861 44682784 39524311 37779833 36824580 ...
## $ total_deaths : num 1132935 530740 164233 165711 697074 ...
## $ total_recovered : num 101322779 44150289 39264546 37398100 35919372 ...
## $ active_cases : num 1741147 1755 95532 216022 208134 ...
## $ total_test_taken: num 1159832679 915265788 271490188 122332384 63776166 ...
## $ population : num 334805269 1406631776 65584518 83883596 215353593 ...
# demonstrates a portion of our updated data set
head(covid_world_dat)
| serial_number | country | total_cases | total_deaths | total_recovered | active_cases | total_test_taken | population |
|---|---|---|---|---|---|---|---|
| 1 | USA | 104196861 | 1132935 | 101322779 | 1741147 | 1159832679 | 334805269 |
| 2 | India | 44682784 | 530740 | 44150289 | 1755 | 915265788 | 1406631776 |
| 3 | France | 39524311 | 164233 | 39264546 | 95532 | 271490188 | 65584518 |
| 4 | Germany | 37779833 | 165711 | 37398100 | 216022 | 122332384 | 83883596 |
| 5 | Brazil | 36824580 | 697074 | 35919372 | 208134 | 63776166 | 215353593 |
| 6 | Japan | 32588442 | 68399 | 21567425 | 10952618 | 92144639 | 125584838 |
As noted before, the data was collected across 231 countries (231
rows), along with 6 particular statistics associated with each country.
The columns are described as follows:
- serial_number
  - a unique but arbitrary ID given to each row of data to serve as a
    key column
- country
  - name of the country or territory the row describes
- total_cases
  - total number of cases of COVID-19 reported at that point in the
    country
- total_deaths
  - total number of deaths from COVID-19 reported at that point in the
    country
- total_recovered
  - total number of patients who recovered from COVID-19 at that point
    in the country
- active_cases
  - currently ongoing cases reported at the moment the survey was
    conducted
- total_test_taken
  - total number of COVID-19 tests administered and reported at that
    point in the country
- population
  - total number of people living in the country at that point of the
    survey
Exploratory Analysis
Death Rate vs. Testing Rate
First, we will be investigating two metrics that aren’t given to us
in the actual data set:
The test rate refers to how extensively a country
has tested its population for COVID-19. A higher rate would suggest that
a country is conducting more tests relative to its population size,
which also implies more thorough effort in identifying and tackling
cases. Conversely, a lower test rate suggests the opposite: less of the
population is tested thus resulting in more overlooked cases.
- Formula: test_rate = total_test_taken / population
The death rate represents the fatality rate of
COVID-19 in a country in relation to total reported cases (often called
the case fatality rate). It offers an indication of how deadly the
disease is among the people who contract it. A higher death rate
suggests a larger proportion of infected individuals are dying; a lower
rate suggests the opposite. The death rate can be influenced by a variety
of factors, including healthcare quality, the age distribution of the
population, the presence of underlying health conditions, and the
effectiveness of medical intervention. Higher death rates don’t always
mean that the virus is more deadly; they can also be a sign of
insufficient healthcare, testing, or reporting.
- Formula: death_rate = total_deaths / total_cases
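To make the two formulas concrete, here is a small worked sketch using the USA row from the head() output shown earlier. The usa_* variable names are illustrative only and are not part of the original analysis.

```r
# Values copied from the USA row of the cleaned data shown earlier
usa_total_cases      <- 104196861
usa_total_deaths     <- 1132935
usa_total_test_taken <- 1159832679
usa_population       <- 334805269

# test_rate: tests administered per person
usa_test_rate <- usa_total_test_taken / usa_population

# death_rate: deaths per reported case
usa_death_rate <- usa_total_deaths / usa_total_cases

round(usa_test_rate, 2)         # about 3.46 tests per person
round(100 * usa_death_rate, 2)  # about 1.09 percent of reported cases
```

A test rate above 1 is possible because the same person can be tested many times.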
This relationship is plotted below, where each dot represents a
country. The first graph uses the raw numerical values given in the data
set. This results in a poor graph, as the points are too close to each
other to draw any meaningful conclusions.

The following graph is a more effective version that is fit to log
scale to separate the points out and allow us to visualize the data more
clearly:
# scatter plot showing the correlation between a population's death rate and testing rate per population fit by log scale
covid_world_dat |>
mutate(death_rate = total_deaths / total_cases,
test_rate = total_test_taken / population) |>
ggplot(aes(x = test_rate, y = death_rate)) +
geom_point(aes(color = country), size = 3, alpha = 0.7) +
scale_x_log10() + scale_y_log10() + # log scales spread the points out for a more effective visualization
geom_smooth(method = "lm", se = FALSE, color = "black") + # line of best fit (linear model)
labs(title = "Death Rate vs. Testing Rate (Log Scale)",
x = "Tests per Population",
y = "Death Rate") +
theme_minimal() +
theme(legend.position = "none")

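As the plot shows, countries that administered more tests per person
tend to have a lower death rate. This is a correlation rather than proof
that testing lowers deaths: more testing also uncovers more mild cases,
which raises total_cases and therefore lowers deaths per case.
- Higher testing = lower death rate
- Lower testing = higher death rate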
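One way to put a number on this trend is a rank correlation, which does not depend on the log scaling used in the plot. The sketch below is an illustration only: it rebuilds just the six rows shown in the head() output (the data frame name six_rows is made up here), so it says nothing definitive about all 231 countries. With the full data you would run the same cor() call on covid_world_dat after dropping missing values.

```r
# Six rows copied from the head() output; a stand-in for the full data set
six_rows <- data.frame(
  country          = c("USA", "India", "France", "Germany", "Brazil", "Japan"),
  total_cases      = c(104196861, 44682784, 39524311, 37779833, 36824580, 32588442),
  total_deaths     = c(1132935, 530740, 164233, 165711, 697074, 68399),
  total_test_taken = c(1159832679, 915265788, 271490188, 122332384, 63776166, 92144639),
  population       = c(334805269, 1406631776, 65584518, 83883596, 215353593, 125584838)
)

# Same derived metrics as in the plot code
six_rows$death_rate <- six_rows$total_deaths / six_rows$total_cases
six_rows$test_rate  <- six_rows$total_test_taken / six_rows$population

# Spearman's rho: negative values mean higher testing goes with lower death rate
rho <- cor(six_rows$test_rate, six_rows$death_rate, method = "spearman")
rho
```

Spearman's rho is used here instead of the default Pearson correlation because it only looks at the ordering of countries, so a few extreme values do not dominate the result.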
Highest and Lowest death-to-test Ratio
Following on from the previous analysis, I calculated which countries
have the highest death-to-test ratio (the points that
are furthest to the left and up on the graph). This ratio is the death
rate divided by the test rate, so it is high when a country has many
deaths per case but few tests per person.
# calculation of the countries with the highest death-to-test ratio
high_death_to_test_ratio <- covid_world_dat |>
mutate(death_rate = total_deaths / total_cases,
test_rate = total_test_taken / population,
death_to_test_ratio = death_rate / test_rate) |>
arrange(desc(death_to_test_ratio)) |>
select(country, death_rate, test_rate, death_to_test_ratio) |>
head(10) # shows the top 10 countries
On the other hand, the countries with the lowest
death-to-test ratio are listed below. Similarly, this denotes
which points were the furthest to the right and down on the graph.
# calculation of the countries with the lowest death-to-test ratio
low_death_to_test_ratio <- covid_world_dat |>
mutate(death_rate = total_deaths / total_cases,
test_rate = total_test_taken / population,
death_to_test_ratio = death_rate / test_rate) |>
arrange(death_to_test_ratio) |>
select(country, death_rate, test_rate, death_to_test_ratio) |>
head(10) # shows the 10 countries with the lowest ratio
Top 10 Countries by Total Cases, Deaths, and Recoveries
The following graphs visualize the 10 countries with the
highest total of cases, deaths, and recoveries.
I separated the bar plots because the disparity between the total
number of cases and deaths is so large that the deaths are hard to see.
When we include the recoveries, you cannot even see the death count for
countries as large as the USA. This gives us a relative comparison that
is easy to read. This isn’t meant to discredit the heavy toll COVID had
on the first world; a million people still died. However, it must be
understood that 1,000,000 is only about 1% of the ~100,000,000
recoveries made in the US alone (a population of over 300,000,000). With
that consideration, we can come to appreciate the excellent response
that the USA and similar countries had under the pressure of overfilled
hospitals and understaffed medical teams. The height of the bars alone
shows how heavy a burden COVID-19 placed on each country. The ratio of
recoveries to deaths can indicate how well a country was able to respond
to the cases appearing within its borders.
- Due to the size of the count, R forces the count to be in scientific
notation. Keep in mind that the largest count to the right end of the
X-axis amounts to larger than 100,000,000 people.
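If the scientific notation is a distraction, the axis labels can be formatted by hand. The helper below is a hypothetical sketch in base R (the name comma_label is made up here); ggplot2's scale_y_continuous() accepts such a function through its labels argument, which is what the commented line assumes.

```r
# Hypothetical helper: format large counts with thousands separators
# instead of scientific notation
comma_label <- function(x) {
  format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
}

comma_label(c(104196861, 1132935))
# "104,196,861" "1,132,935"

# Assumed usage in the plots below (the count is the y aesthetic, even
# though coord_flip() draws it horizontally):
#   ... + scale_y_continuous(labels = comma_label)
```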
# overlaid bar plot of total cases and deaths for the 10 countries with the most COVID-19 cases (each geom_bar is its own layer, so the bars are drawn on top of each other rather than stacked)
covid_world_dat |>
arrange(desc(total_cases)) |>
head(10) |>
ggplot(aes(x = reorder(country, total_cases))) +
geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity", position = "dodge") +
geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity", position = "dodge") +
coord_flip() +
labs(title = "Top 10 Countries with the Most COVID-19 Cases and Deaths",
x = "Country",
y = "Count",
fill = "Category") +
theme_minimal()

# overlaid bar plot of total cases, recovered patients, and deaths for the 10 countries with the most COVID-19 cases
# layers are drawn from largest to smallest count so the smaller bars are not hidden behind the larger ones
covid_world_dat |>
arrange(desc(total_cases)) |>
head(10) |>
ggplot(aes(x = reorder(country, total_cases))) +
geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity") +
geom_bar(aes(y = total_recovered, fill = "Recovered"), stat = "identity") +
geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity") +
coord_flip() +
labs(title = "Top 10 Countries with the Most COVID-19 Cases, Deaths, and Recovery",
x = "Country",
y = "Count",
fill = "Category") +
theme_minimal()

The following graphs visualize the 10 countries with the
lowest total of cases, deaths, and recoveries.
Once again, I separated the plots to allow for easier
visualization. However, as you may have noticed, the countries with the
lowest number of cases also happen to have next to no reported deaths.
We are talking about countries such as Tuvalu, with a population of
about 12,000. Though a large proportion of their population was
affected by the virus, there were no reported deaths. With that said, we
see that the top two countries also had next to no reported recoveries
(Saint Helena had 2).
What does this tell us about those two countries?
- Though they may not have done an excellent job at isolation, either
1) those who got sick received great treatment, or 2) very few at-risk
and elderly individuals got sick.
We see these values simply decrease with the population itself.
Tokelau has a population of a tiny 1,378 and a total of 5 reported cases
(of which were active at the time this survey was recorded). However, it
is difficult to quantify results so small seeing as these countries may
struggle to simply report an accurate count of all these instances. Who
knows, maybe we should go pay a visit to the Vatican!
# overlaid bar plot of total cases and deaths for the 10 countries with the fewest COVID-19 cases (each geom_bar is its own layer, so the bars are drawn on top of each other rather than stacked)
covid_world_dat |>
arrange(total_cases) |>
head(10) |>
ggplot(aes(x = reorder(country, total_cases))) +
geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity", position = "dodge") +
geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity", position = "dodge") +
coord_flip() +
labs(title = "Top 10 Countries with the Least COVID-19 Cases and Deaths",
x = "Country",
y = "Count",
fill = "Category") +
theme_minimal()

# overlaid bar plot of total cases, recovered patients, and deaths for the 10 countries with the fewest COVID-19 cases
# layers are drawn from largest to smallest count so the smaller bars are not hidden behind the larger ones
covid_world_dat |>
arrange(total_cases) |>
head(10) |>
ggplot(aes(x = reorder(country, total_cases))) +
geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity") +
geom_bar(aes(y = total_recovered, fill = "Recovered"), stat = "identity") +
geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity") +
coord_flip() +
labs(title = "Top 10 Countries with the Least COVID-19 Cases, Deaths, and Recovery",
x = "Country",
y = "Count",
fill = "Category") +
theme_minimal()

Visualizing Correlations between Key Metrics
The following analysis is a bit different from what I’m used to, and I
do think it has the potential to be a bit inaccurate or, more to the
point, misleading. A correlation matrix is a table that shows the
correlation coefficients between many variables. Each cell within the
table displays the correlation between 2 variables, ranging from -1 to
1. These coefficients indicate the strength and direction of a linear
relationship between two variables.
How do we read the matrix?
- +1: Perfect positive correlation (both variables move in the same
direction)
  - As one variable increases, the other increases
- -1: Perfect negative correlation (variables move in opposite
directions)
  - As one variable increases, the other decreases
- 0: No linear correlation
  - No predictable linear relationship between the variables (a
    non-linear relationship may still exist)
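The three reference values can be reproduced with a few lines of base R. This is a toy sketch on made-up vectors, not the COVID data; the variable names are illustrative.

```r
x <- 1:10

# Perfect positive correlation: y rises in a straight line with x
cor_positive <- cor(x, 2 * x + 3)

# Perfect negative correlation: y falls in a straight line as x rises
cor_negative <- cor(x, -x)

# No linear correlation: a symmetric U-shape is a strong relationship,
# but it is not linear, so the coefficient is (numerically) zero
cor_none <- cor(x, (x - 5.5)^2)

c(cor_positive, cor_negative, cor_none)
```

The last case is the reason a coefficient near 0 should be read as "no linear relationship" rather than "no relationship at all".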
With that in mind, I see no reason to really explain the correlations
in words when the matrix does a better job at doing so.
# calculate the correlation matrix
corr_matrix <- covid_world_dat |>
select(total_cases, total_deaths, total_recovered, active_cases, total_test_taken, population) |>
na.omit() |> # remove NAs to avoid conflict
cor()
# create the correlation plot with improved labels
corr_matrix |>
corrplot(method = "circle",
type = "upper",
tl.col = "black",
tl.srt = 45,
tl.cex = 0.8,
mar = c(0,0,1,0),
addCoef.col = "black",
number.cex = 0.7,
diag = FALSE)

Predictive Modeling
After viewing and interpreting different statistical measures of
the COVID data, I think the best next step is to use what we learned
from the correlations between variables to try to predict the most
important metric in this data: total deaths.
The following consists of the painstaking steps to produce a linear
regression model and make predictions on the total number of deaths
found in this data set.
1) Preprocessing the Data
Before building a model, it is best to preprocess the data in order
to clear out any missing values and create any relevant features we
may need to allow for more specific modeling. In this case:
- deaths_per_million
- tells us how many deaths per million people in a country
- recovery_rate
- proportion of recovered cases compared to total cases which may
indicate healthcare effectiveness
- active_cases_ratio
- proportion of active cases to total cases; a higher number could
mean a more serious situation
# preparing the data
model_covid_world_dat <- covid_world_dat |>
mutate(
deaths_per_million = total_deaths / (population / 1e6), # deaths per million
recovery_rate = total_recovered / total_cases, # recovery rate
active_cases_ratio = active_cases / total_cases # active cases ratio
) |>
na.omit() # remove rows with NA values to avoid conflicts
To avoid redundancy and clutter, I will not be including any output,
just the code above.
2) Training the Linear Regression Model
We will now build a linear regression model to predict total_deaths.
Note that the formula below uses the raw columns (total_cases,
total_recovered, active_cases, and population) as predictors rather than
the engineered features from step 1. This linear regression should help
us understand the relation between the response (total_deaths) and those
predictors.
# train a linear regression model to predict total deaths
model <- lm(total_deaths ~ total_cases + total_recovered + active_cases + population, data = model_covid_world_dat)
# view model summary
summary(model)
##
## Call:
## lm(formula = total_deaths ~ total_cases + total_recovered + active_cases +
## population, data = model_covid_world_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.000000032779 -0.000000000524 0.000000000359 0.000000001223 0.000000048890
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 0.000000010184390607263 0.000000000456701745159
## total_cases 1.000000000000053512750 0.000000000000008702001
## total_recovered -1.000000000000053734794 0.000000000000008785060
## active_cases -1.000000000000053068661 0.000000000000008615763
## population -0.000000000000000077293 0.000000000000000004788
## t value Pr(>|t|)
## (Intercept) 22.30 <0.0000000000000002 ***
## total_cases 114916092899285.61 <0.0000000000000002 ***
## total_recovered -113829621331453.84 <0.0000000000000002 ***
## active_cases -116066326687905.08 <0.0000000000000002 ***
## population -16.14 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.000000006005 on 190 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.703e+28 on 4 and 190 DF, p-value: < 0.00000000000000022
The output summary of the model tells us:
- Coefficients
  - Tell us how much each variable affects the response variable
    (total_deaths)
  - Example: if total_cases has a coefficient of 0.05, it means that for
    every increase of 1 in total_cases, total_deaths is expected to
    increase by 0.05, holding the other predictors constant
- P-value
  - Tells us whether the relationship between a predictor and the
    response is statistically significant
  - Values lower than 0.05 typically mean the predictor is
    significant
- R-squared
  - Measures how well the model fits the data
  - An R-squared value closer to 1 means the model explains most of the
    variability in total_deaths
3) Model Evaluation
We will now proceed to evaluate the model according to Root Mean
Squared Error (RMSE). This tells us, on average, how far off our
predictions are.
# predict total_deaths using the trained model
predictions <- predict(model, newdata = model_covid_world_dat)
# calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mean((predictions - model_covid_world_dat$total_deaths)^2))
print(paste("RMSE:", round(rmse, 2)))
## [1] "RMSE: 0"
A lower RMSE indicates better model performance. For example, an RMSE
of 500 means that, on average, the model’s predictions are off by 500
deaths.
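As a quick illustration of how RMSE behaves, here is a toy calculation on made-up numbers (the vectors below are not from the COVID data, and the function name rmse_of is hypothetical). Every prediction is off by exactly 10, so the RMSE comes out to 10.

```r
# Hypothetical helper: root mean squared error between two numeric vectors
rmse_of <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}

toy_actual    <- c(100, 200, 300)
toy_predicted <- c(110, 190, 310)  # each prediction misses by 10

toy_rmse <- rmse_of(toy_actual, toy_predicted)
toy_rmse  # 10
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones would.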
4) Model Interpretation
The coefficients tell us how strongly each predictor variable affects
total_deaths.
# view the coefficients
coef(model)
## (Intercept) total_cases
## 0.00000001018439060726309 1.00000000000005351274979
## total_recovered active_cases
## -1.00000000000005373479439 -1.00000000000005306866058
## population
## -0.00000000000000007729323
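For example, if total_cases has a large positive coefficient, this
means that as the number of cases increases, so does the number of
deaths, holding the other predictors fixed. If population has a negative
coefficient, this could suggest that countries with larger populations
have relatively fewer deaths compared to smaller countries, depending on
other factors. In this model, though, the coefficients are almost
exactly 1 for total_cases and -1 for both total_recovered and
active_cases, while the population coefficient is effectively zero. In
other words, the model has recovered the bookkeeping identity
total_deaths = total_cases - total_recovered - active_cases, which is
why R-squared is 1 and the RMSE is 0. It describes how the columns were
defined rather than predicting deaths from independent information.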
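A quick way to check whether the predictors determine total_deaths exactly is to recompute deaths by subtraction and compare. The sketch below does this for the six rows shown in the head() output only (the data frame name head_rows is made up here); running the same comparison on model_covid_world_dat would be the natural next step.

```r
# Six rows copied from the head() output; a stand-in for the full data set
head_rows <- data.frame(
  country         = c("USA", "India", "France", "Germany", "Brazil", "Japan"),
  total_cases     = c(104196861, 44682784, 39524311, 37779833, 36824580, 32588442),
  total_deaths    = c(1132935, 530740, 164233, 165711, 697074, 68399),
  total_recovered = c(101322779, 44150289, 39264546, 37398100, 35919372, 21567425),
  active_cases    = c(1741147, 1755, 95532, 216022, 208134, 10952618)
)

# Deaths implied by the other three columns
head_rows$implied_deaths <- with(head_rows,
                                 total_cases - total_recovered - active_cases)

# TRUE for these six rows: deaths can be computed without any model
all(head_rows$implied_deaths == head_rows$total_deaths)
```

When a check like this passes, a perfect fit is expected and says little about predictive power. A more informative model would leave out at least one of the columns that enter the identity.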
5) Model Tuning
Something many people, including myself, often skip is validating the
model on data it has not seen, which helps detect overfitting (a model
that performs well on the training data but not very well on unseen
data). Here this is done with a single train/test split: the data is
divided into a training set used to fit the model and a test set used to
check whether the model generalizes well to other data.
# create a train/test split
set.seed(123) # fixes the random seed so the split is reproducible
train_index <- createDataPartition(model_covid_world_dat$total_deaths, p = 0.8, list = FALSE)
train_data <- model_covid_world_dat[train_index, ] # used to train the model
test_data <- model_covid_world_dat[-train_index, ] # used to test the model after it has been trained
# train the model on the training set
model_cv <- lm(total_deaths ~ total_cases + total_recovered + active_cases + population, data = train_data)
# predict on the test set
test_predictions <- predict(model_cv, newdata = test_data)
# calculate RMSE on the test set
rmse_cv <- sqrt(mean((test_predictions - test_data$total_deaths)^2))
print(paste("Test RMSE:", round(rmse_cv, 2)))
## [1] "Test RMSE: 0"
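The test RMSE came out the same as the training RMSE, so the model
generalizes to the held-out countries. An RMSE of exactly 0 is a warning
sign, though: it usually means the predictors determine the response
exactly rather than that the model is predicting well.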
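The split above is a single hold-out evaluation. Full k-fold cross-validation repeats the idea so that every row is used for testing exactly once. The sketch below shows the mechanics in base R on simulated data (sim_dat, fold_id, and fold_rmse are made-up names, and the simulated relationship is an assumption for illustration); the same loop could be pointed at model_covid_world_dat with the model formula from step 2.

```r
set.seed(123)  # reproducible simulation and fold assignment

# Simulated stand-in data: y depends linearly on x plus noise
n <- 100
sim_dat <- data.frame(x = runif(n, 0, 10))
sim_dat$y <- 3 + 2 * sim_dat$x + rnorm(n, sd = 1)

# Randomly assign each row to one of k folds of (nearly) equal size
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, then score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- sim_dat[fold_id != i, ]
  test  <- sim_dat[fold_id == i, ]
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((pred - test$y)^2))
})

fold_rmse        # one RMSE per fold
mean(fold_rmse)  # cross-validated RMSE
```

Averaging over folds gives a steadier estimate than one split, which matters with only a couple hundred rows.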
6) Visualization of Predictions vs. Actuals
Finally, let’s visualize how well our model’s predictions match the
actual total_deaths.
# create a data frame with the actual and predicted values
comparison <- data.frame(
Actual = model_covid_world_dat$total_deaths,
Predicted = predictions
)
# plot the actual vs. predicted deaths
comparison |>
ggplot(aes(x = Actual, y = Predicted)) +
geom_point(color = "red") +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
labs(title = "Actual vs. Predicted Total COVID-19 Deaths", x = "Actual Deaths", y = "Predicted Deaths") +
theme_minimal()

The above scatter plot demonstrates the model’s performance in its
predictions.
- Each point is one country, placed at its actual (x) and predicted (y)
total deaths
- The dashed line marks where predicted equals actual (i.e. perfect
predictions)
The closer the points are to the dashed line, the better the model’s
performance. If the points are spread far from the line, the model may
need improvement.
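The model turned out almost too perfect. A fit this exact suggests that
total_deaths can be computed directly from the predictors, so it is
better read as a sanity check on the data than as a real forecast. This
also took me quite a few hours to get down with the help of Stack
Overflow and ChatGPT, so shout out to them.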
Conclusion
My thoughts and interpretations are spread throughout this document, so
there isn’t much left to share here other than the point that there is
much to be learned from what might seem like even the dullest of data.
Data does not need to be super complex to be worth analyzing. Thinking
creatively about how to look at the data, and about what kinds of
predictions you might be able to make with it, is just as important.
COVID-19 data offers great insight into how the world handles large
medical crises and whether it handled this one well. The same can be
said for many other kinds of data, so be on the lookout and think
creatively! You never know what you may find. I also didn’t sleep all
night in the process of writing this entire script, so I’m calling it
good now. Thank you for reading!
