Return to Homepage

Introduction

In this report, I investigate the untold story behind seemingly transparent records of worldwide COVID-19 cases and death tolls. While the data has already been cleaned and contains fairly simple information, this project aims to investigate the disparities between countries, their COVID death tolls, COVID recovery rates, and the correlation between those statistics and their financial standing and population size. In simpler terms, does a country having a high death toll inherently imply that it handled the outbreak worse than another? What does the total number of cases in relation to the number of recovered patients and deaths tell us about that nation’s pandemic response? Did the amount of COVID testing have any noticeable effect on the total number of cases and, in turn, deaths? These are just some of the questions I hope to explore in this writing.

Background

This project was inspired in part by my previous research on non-substance use disorder treatment services offered in Baltimore, MD, which primarily consisted of COVID-related services. It was also inspired by general happenstance: I found a very nice data set on Kaggle and felt I could offer an interesting take on a seemingly dull data set. The data set was found here, as updated 2 years ago (2022) by Mrityunjay Pathak.

The data set consists of various reported statistics regarding Coronavirus (COVID-19), an infectious disease caused by the SARS-CoV-2 virus. Most of those infected by the virus will experience mild to moderate respiratory symptoms and recover without requiring any particular medical treatment, albeit taking a bit longer than with most other diseases. However, some individuals may become seriously ill and require medical attention and even hospitalization. The elderly and those with existing medical conditions such as cardiovascular disease, chronic respiratory illness, cancer, and other such conditions are more likely to develop serious symptoms. That being said, anyone, regardless of age or medical history, could grow seriously ill and die as a result of COVID-19 infection.

Prevention is the primary defense against the spread of the disease. Understanding how the virus spreads is important, as protecting oneself from the illness is the best way to protect others as well. At the peak of the pandemic, it was advised that individuals stay at least 1 meter apart from each other, wear properly fitted medical-grade masks, and wash hands or use hand sanitizer frequently. Once the vaccines were released, it was also advised by the CDC to get vaccinated to start building up herd immunity.

The virus can spread through liquid particles associated with coughing, sneezing, speaking, singing, or breathing. The obvious avenue for this is through the mouth or nose of the sick individual. These particles range from larger droplets to smaller aerosols. Individuals are encouraged to practice proper respiratory etiquette, stay home when sick, and self-isolate for at least 1 week after symptoms have ceased.

This data has been compiled through reports from the onset of the pandemic (2020) up until its last update in 2022. The data does not distinguish between variants of the virus; however, it does imply that it accounts for the original variant, Delta, and Omicron, including its subvariants BA.4 and BA.5.

Data

As already described, this data set was retrieved from Kaggle and had already been mostly cleaned for analysis. However, there are some changes I wanted to make to the data structure, missing values/empty spaces, and superfluous characters found throughout the data set. The following code and outputs demonstrate the changes I’ve made to allow for smoother data analysis:

# shows the original data structure 
str(og_covid_world_dat)
## 'data.frame':    231 obs. of  8 variables:
##  $ Serial.Number  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Country        : chr  "USA" "India" "France" "Germany" ...
##  $ Total.Cases    : chr  "104,196,861" "44,682,784" "39,524,311" "37,779,833" ...
##  $ Total.Deaths   : chr  "1,132,935" "530,740" "164,233" "165,711" ...
##  $ Total.Recovered: chr  "101,322,779" "44,150,289" "39,264,546" "37,398,100" ...
##  $ Active.Cases   : chr  "1,741,147" "1,755" "95,532" "216,022" ...
##  $ Total.Test     : chr  "1,159,832,679" "915,265,788" "271,490,188" "122,332,384" ...
##  $ Population     : chr  "334,805,269" "1,406,631,776" "65,584,518" "83,883,596" ...
# converts all N/A chr values to actual missing values and fills in empty spaces with missing values 
covid_world_dat <- og_covid_world_dat |>
  mutate(across(where(is.character), ~ na_if(., "N/A") |> na_if("")))

# convert column titles to snake_case 
covid_world_dat <- clean_names(covid_world_dat)

# replace commas in all chr columns to avoid issue with conversion to num data type 
covid_world_dat <- covid_world_dat |>
  mutate(across(where(is.character), ~ str_replace_all(., ",", "")))

# converts data structure to more appropriate data types 
covid_world_dat <- covid_world_dat |>
  mutate(country = as.factor(country),
         across(-c(country, serial_number), as.numeric), # converts to num (not int) to account for math that results in decimals 
         total_test_taken = total_test) |> # makes a column name clearer to understand 
  relocate(total_test_taken, .before = population) |>
  select(-total_test) # removes that original column 

# shows the updated data structure
str(covid_world_dat)
## 'data.frame':    231 obs. of  8 variables:
##  $ serial_number   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ country         : Factor w/ 231 levels "Afghanistan",..: 221 95 73 79 27 105 172 102 218 170 ...
##  $ total_cases     : num  104196861 44682784 39524311 37779833 36824580 ...
##  $ total_deaths    : num  1132935 530740 164233 165711 697074 ...
##  $ total_recovered : num  101322779 44150289 39264546 37398100 35919372 ...
##  $ active_cases    : num  1741147 1755 95532 216022 208134 ...
##  $ total_test_taken: num  1159832679 915265788 271490188 122332384 63776166 ...
##  $ population      : num  334805269 1406631776 65584518 83883596 215353593 ...
# demonstrates a portion of our updated data set 
head(covid_world_dat)
serial_number country total_cases total_deaths total_recovered active_cases total_test_taken population
1 USA 104196861 1132935 101322779 1741147 1159832679 334805269
2 India 44682784 530740 44150289 1755 915265788 1406631776
3 France 39524311 164233 39264546 95532 271490188 65584518
4 Germany 37779833 165711 37398100 216022 122332384 83883596
5 Brazil 36824580 697074 35919372 208134 63776166 215353593
6 Japan 32588442 68399 21567425 10952618 92144639 125584838

As the output above shows, the data was collected across 231 countries (231 rows) along with 6 particular statistics associated with each country. The columns are described as follows:

  • serial_number
    • a unique but arbitrary ID given to each row of data to serve as a key column
  • country
    • the country in question
  • total_cases
    • total number of cases of COVID-19 reported at that point in the country
  • total_deaths
    • total number of deaths from COVID-19 reported at that point in the country
  • total_recovered
    • total number of patients who recovered from COVID-19 at that point in the country
  • active_cases
    • currently on-going cases reported at the moment the survey was conducted
  • total_test_taken
    • total number of COVID-19 tests administered and reported at that point in the country
  • population
    • total number of people living in the country at that point of the survey

Exploratory Analysis

Death Rate vs. Testing Rate

First, we will be investigating two metrics that aren’t given to us in the actual data set:

  • Test rate

  • Death rate

The test rate refers to how extensively a country has tested its population for COVID-19. A higher rate would suggest that a country is conducting more tests relative to its population size, which also implies more thorough effort in identifying and tackling cases. Conversely, a lower test rate suggests the opposite: less of the population is tested thus resulting in more overlooked cases.

  • Formula: test_rate = total_test_taken / population

The death rate represents the fatality rate of COVID-19 in a country in relation to total cases. It offers an indication of how deadly the disease is among the people contracting it. A higher death rate suggests that a larger proportion of infected individuals are dying; a lower rate suggests the opposite. The death rate can be influenced by a variety of factors including healthcare quality, the age distribution of the population, presence of underlying health conditions, and effectiveness of medical intervention. Higher death rates don’t always mean that the virus is more deadly; they can also be a sign of insufficient healthcare, testing, or reporting.

  • Formula: death_rate = total_deaths / total_cases
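As a quick worked example of the two formulas, here is a minimal base R sketch using the first two rows (USA and India) from the head() output shown earlier. The data frame name `dat` is only for illustration and is not part of the original script.

```r
# First two rows of the head() output shown earlier in this report
dat <- data.frame(
  country          = c("USA", "India"),
  total_cases      = c(104196861, 44682784),
  total_deaths     = c(1132935, 530740),
  total_test_taken = c(1159832679, 915265788),
  population       = c(334805269, 1406631776)
)

# Tests administered per resident (can exceed 1 when people are tested repeatedly)
dat$test_rate <- dat$total_test_taken / dat$population

# Share of reported cases that ended in death
dat$death_rate <- dat$total_deaths / dat$total_cases

print(dat[, c("country", "test_rate", "death_rate")])
```

For the USA this works out to roughly 3.5 tests per person and a death rate of about 1.1%; for India, about 0.65 tests per person and a death rate of about 1.2%.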

We can see this correlation plotted below, where each dot represents a country. The first graph is based on the raw numerical values given in the data set. This results in a poor graph, as the points are too close to each other to draw any meaningful conclusions.

The following graph is a more effective version that is fit to log scale to separate the points out and allow us to visualize the data more clearly:

# scatter plot showing the correlation between a population's death rate and testing rate per population fit by log scale 
covid_world_dat |>
  mutate(death_rate = total_deaths / total_cases, 
         test_rate = total_test_taken / population) |>
  ggplot(aes(x = test_rate, y = death_rate)) +
  geom_point(aes(color = country), size = 3, alpha = 0.7) +
  scale_x_log10() + scale_y_log10() +  # log scales on both axes for a more effective visualization 
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # line of best fit (linear model)
  labs(title = "Death Rate vs. Testing Rate (Log Scale)",
       x = "Tests per Population", 
       y = "Death Rate") +
  theme_minimal() +
  theme(legend.position = "none")

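As the line of best fit shows, countries that administered more tests per person tend to report a lower death rate. Keep in mind that this is a correlation across countries, not proof that testing by itself lowers deaths.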

  • Higher testing = lower death rate
  • Lower testing = higher death rate
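To put a number on the trend rather than reading it off the plot, one option is to compute the correlation and the fitted slope on the log-transformed rates, which is what the line of best fit on log axes represents. The sketch below is only an illustration: it uses the six rows from the head() output (the data frame `six` is a stand-in I made up for this example), so its result says nothing about the full 231-country data set. In the real analysis the same calls would be applied to covid_world_dat after dropping missing values.

```r
# Six rows copied from the head() output; far too few for a real conclusion
six <- data.frame(
  country          = c("USA", "India", "France", "Germany", "Brazil", "Japan"),
  total_cases      = c(104196861, 44682784, 39524311, 37779833, 36824580, 32588442),
  total_deaths     = c(1132935, 530740, 164233, 165711, 697074, 68399),
  total_test_taken = c(1159832679, 915265788, 271490188, 122332384, 63776166, 92144639),
  population       = c(334805269, 1406631776, 65584518, 83883596, 215353593, 125584838)
)

six$death_rate <- six$total_deaths / six$total_cases
six$test_rate  <- six$total_test_taken / six$population

# Pearson correlation of the log-transformed rates (matches the log-log plot)
log_cor <- cor(log10(six$test_rate), log10(six$death_rate))

# Slope of the straight line a linear model fits on the log-log scale
log_fit <- lm(log10(death_rate) ~ log10(test_rate), data = six)
slope   <- unname(coef(log_fit)[2])

print(c(correlation = log_cor, slope = slope))
```

A negative correlation and slope would support the "higher testing, lower death rate" reading; values near zero would suggest the trend is weak.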

Highest and Lowest death-to-test Ratio

In conjunction with the previous analysis, I calculated which countries have the highest death-to-test ratio (the points that are furthest to the left and up on the graph). This ratio divides a country's death rate by its testing rate, so it is highest where a large share of cases end in death while few tests are conducted per person.

# calculation of the countries with the highest death-to-test ratio 
high_death_to_test_ratio <- covid_world_dat |>
  mutate(death_rate = total_deaths / total_cases, 
         test_rate = total_test_taken / population,
         death_to_test_ratio = death_rate / test_rate) |>
  arrange(desc(death_to_test_ratio)) |>
  select(country, death_rate, test_rate, death_to_test_ratio) |>
  head(10)  # shows the top 10 countries

On the other hand, the countries with the lowest death-to-test ratio are listed below. Similarly, this denotes which points were the furthest to the right and down on the graph.

# calculation of the countries with the lowest death-to-test ratio 
low_death_to_test_ratio <- covid_world_dat |>
  mutate(death_rate = total_deaths / total_cases, 
         test_rate = total_test_taken / population,
         death_to_test_ratio = death_rate / test_rate) |>
  arrange(death_to_test_ratio) |>
  select(country, death_rate, test_rate, death_to_test_ratio) |>
  head(10)  # shows the 10 countries with the lowest ratio 
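Since both rankings repeat the same three calculations, they could be wrapped in a small helper. The following is a sketch in base R under my own naming (`add_death_to_test_ratio` and `six` are hypothetical, not part of the original script), again using the six rows from the head() output as stand-in data.

```r
# Hypothetical helper: adds the two rates and their ratio to a data frame
add_death_to_test_ratio <- function(dat) {
  dat$death_rate          <- dat$total_deaths / dat$total_cases
  dat$test_rate           <- dat$total_test_taken / dat$population
  dat$death_to_test_ratio <- dat$death_rate / dat$test_rate
  dat
}

# Six rows copied from the head() output, for illustration only
six <- data.frame(
  country          = c("USA", "India", "France", "Germany", "Brazil", "Japan"),
  total_cases      = c(104196861, 44682784, 39524311, 37779833, 36824580, 32588442),
  total_deaths     = c(1132935, 530740, 164233, 165711, 697074, 68399),
  total_test_taken = c(1159832679, 915265788, 271490188, 122332384, 63776166, 92144639),
  population       = c(334805269, 1406631776, 65584518, 83883596, 215353593, 125584838)
)

# Rank once, then read both ends of the same ordering
ranked  <- add_death_to_test_ratio(six)
ranked  <- ranked[order(ranked$death_to_test_ratio, decreasing = TRUE), ]
highest <- head(ranked, 3)  # largest ratios first
lowest  <- tail(ranked, 3)  # smallest ratio last

print(highest[, c("country", "death_to_test_ratio")])
print(lowest[, c("country", "death_to_test_ratio")])
```

Among these six rows, Brazil has the highest ratio and France the lowest. Note that order() places missing ratios last by default, so countries with missing test counts would end up in tail() unless they are filtered out first.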

Top 10 Countries by Total Cases, Deaths, and Recoveries

The following graphs visualize the 10 countries with the highest total of cases, deaths, and recoveries.

I separated the bar plots because the disparity between the total number of cases and deaths is so large that the deaths can hardly be seen. When we include the recoveries, you cannot even see the death count for countries as large as the USA. Splitting the plots gives us a relative comparison that is easier to read. Now this isn’t meant to discredit the heavy toll COVID had on the first world; over a million people still died in the USA. However, it must be understood that 1,000,000 is only about 1% of the ~100,000,000 recoveries made in the US alone (a population of over 300,000,000). With that consideration, we can come to appreciate the excellent response that the USA and similar countries had under the pressure of overfilled hospitals and understaffed medical teams. This is especially noticeable by simply looking at the height of the bars, which implies a heavy burden set onto the country due to COVID-19. The ratio of recoveries to deaths can indicate how well a country was able to respond to the cases appearing within its borders.

  • Due to the size of the count, R forces the count to be in scientific notation. Keep in mind that the largest count to the right end of the X-axis amounts to larger than 100,000,000 people.
# overlaid bar plot (one layer per metric) of both the total cases and deaths for the countries with the most COVID-19 cases and deaths
covid_world_dat |>
  arrange(desc(total_cases)) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) +
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity", position = "dodge") +
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Top 10 Countries with the Most COVID-19 Cases and Deaths", 
       x = "Country", 
       y = "Count", 
       fill = "Category") +
  theme_minimal()

# overlaid bar plot (one layer per metric) of total cases, deaths, and recovered patients for the countries with the most COVID-19 cases and deaths
covid_world_dat |>
  arrange(desc(total_cases)) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) + 
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity") + 
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity") + 
  geom_bar(aes(y = total_recovered, fill = "Recovered"), stat = "identity") + 
  coord_flip() + 
  labs(title = "Top 10 Countries with the Most COVID-19 Cases, Deaths, and Recovery", 
       x = "Country", 
       y = "Count", 
       fill = "Category") + 
  theme_minimal()
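In the plots above, each metric is drawn as its own geom_bar layer on the same axis, so the layers sit on top of one another rather than side by side, which is part of why the deaths are so hard to see. An alternative worth considering is to reshape the data to long format and give each metric its own panel with a free axis. The sketch below assumes the tidyr and ggplot2 packages (both part of the tidyverse already loaded in this report) and uses the six head() rows as stand-in data; the names `six`, `long`, and `p` are my own.

```r
library(tidyr)
library(ggplot2)

# Six rows copied from the head() output, for illustration only
six <- data.frame(
  country      = c("USA", "India", "France", "Germany", "Brazil", "Japan"),
  total_cases  = c(104196861, 44682784, 39524311, 37779833, 36824580, 32588442),
  total_deaths = c(1132935, 530740, 164233, 165711, 697074, 68399)
)

# One row per country/metric pair instead of one column per metric
long <- pivot_longer(
  six,
  cols      = c(total_cases, total_deaths),
  names_to  = "category",
  values_to = "count"
)

# One panel per metric, each with its own x-axis range, so deaths stay visible
p <- ggplot(long, aes(x = count, y = reorder(country, count), fill = category)) +
  geom_col() +
  facet_wrap(~ category, scales = "free_x") +
  labs(title = "Cases and Deaths by Country (separate scales)",
       x = "Count",
       y = "Country",
       fill = "Category") +
  theme_minimal()

print(p)
```

The trade-off is that bar lengths are no longer comparable across panels, so this layout suits comparing countries within a metric rather than comparing cases against deaths.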

The following graphs visualize the 10 countries with the lowest total of cases, deaths, and recoveries.

Once again, I also separated the plots to allow for easier visualization. However, as you may have noticed, the countries with the lowest number of cases also happen to have next to no reported deaths as a result. We are talking about countries such as Tuvalu with a population of 12,000. Though a large proportion of their population was affected by the virus, there were no reported deaths. With that said, we see that the top two countries also had next to no reported recoveries (Saint Helena had 2).

What does this tell us about those two countries?

  • Though they may not have done an excellent job at isolation, those who got sick 1) received great treatment, or 2) had very few at-risk and elderly individuals sick.

We see these values simply decrease with the population itself. Tokelau has a population of a tiny 1,378 and a total of 5 reported cases (of which were active at the time this survey was recorded). However, it is difficult to quantify results so small seeing as these countries may struggle to simply report an accurate count of all these instances. Who knows, maybe we should go pay a visit to the Vatican!

# overlaid bar plot (one layer per metric) of both the total cases and deaths for the countries with the least COVID-19 cases and deaths
covid_world_dat |>
  arrange(total_cases) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) +
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity", position = "dodge") +
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Top 10 Countries with the Least COVID-19 Cases and Deaths", 
       x = "Country", 
       y = "Count", 
       fill = "Category") +
  theme_minimal()

# overlaid bar plot (one layer per metric) of total cases, deaths, and recovered patients for the countries with the least COVID-19 cases and deaths
covid_world_dat |>
  arrange(total_cases) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) + 
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity") + 
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity") + 
  geom_bar(aes(y = total_recovered, fill = "Recovered"), stat = "identity") + 
  coord_flip() + 
  labs(title = "Top 10 Countries with the Least COVID-19 Cases, Deaths, and Recovery", 
       x = "Country", 
       y = "Count", 
       fill = "Category") + 
  theme_minimal()

Visualizing Correlations between Key Metrics

The following analysis is a bit different from what I’m used to, and I do think it has the potential to be a bit inaccurate, or rather misleading. A correlation matrix is a table that shows the correlation coefficients between many variables. Each cell within the table displays the correlation between 2 variables, ranging from -1 to 1. These coefficients indicate the strength and direction of a linear relation between two variables.

How do we read the matrix?

  • +1: Perfect positive correlation (both variables move in the same direction)
    • As one variable increases, the other increases
  • -1: Perfect negative correlation (variables move in opposite directions)
    • As one variable increases, the other decreases
  • 0: No linear correlation
    • No predictable linear relationship between the variables, although a non-linear one may still exist
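The three reference values can be reproduced with a few made-up vectors. This is a toy sketch only (none of these numbers come from the COVID data); it uses base R's cor(), the same function applied to the data set below.

```r
x <- c(1, 2, 3, 4, 5)

same_direction     <- 2 * x + 1    # rises whenever x rises
opposite_direction <- 10 - 3 * x   # falls whenever x rises
u_shaped           <- (x - 3)^2    # depends on x, but not linearly

cor(x, same_direction)      # perfect positive correlation: 1
cor(x, opposite_direction)  # perfect negative correlation: -1
cor(x, u_shaped)            # no linear correlation: 0
```

The last case is the one to watch out for when reading the matrix: a coefficient of 0 rules out a straight-line relationship, not every relationship.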

With that in mind, I see no reason to really explain the correlations in words when the matrix does a better job at doing so.

# calculate the correlation matrix
corr_matrix <- covid_world_dat |>
  select(total_cases, total_deaths, total_recovered, active_cases, total_test_taken, population) |>
  na.omit() |> # remove NAs to avoid conflict 
  cor()

# create the correlation plot with improved labels
corr_matrix |>
  corrplot(method = "circle", 
           type = "upper", 
           tl.col = "black",
           tl.srt = 45,                
           tl.cex = 0.8,               
           mar = c(0,0,1,0),           
           addCoef.col = "black",      
           number.cex = 0.7,           
           diag = FALSE)

Predictive Modeling

After having viewed and interpreted different statistical measures of the COVID data, I feel the analysis is best served by using what we learned from the correlations between variables to try to predict the most important metric in these surveys: total deaths.

The following consists of the painstaking steps to produce a linear regression model and make predictions on the total number of deaths found in this data set.

1) Preprocessing the Data

Before building a model, it is best to preprocess the data in order to clear out any missing values and create any relevant features we may need to allow for more specific modeling. In this case:

  • deaths_per_million
    • tells us how many deaths per million people in a country
  • recovery_rate
    • proportion of recovered cases compared to total cases which may indicate healthcare effectiveness
  • active_cases_ratio
    • proportion of active cases to total cases; a higher number could mean a more serious situation
# preparing the data 
model_covid_world_dat <- covid_world_dat |>
  mutate(
    deaths_per_million = total_deaths / (population / 1e6), # deaths per million
    recovery_rate = total_recovered / total_cases, # recovery rate
    active_cases_ratio = active_cases / total_cases # active cases ratio
  ) |>
  na.omit() # remove rows with NA values to avoid conflicts 

To avoid redundancy and clutter, I will not be including any output, just the code above.
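For a sense of scale, here is a minimal sketch of the three derived features for a single row, using the USA values from the head() output shown earlier. The list name `usa` is only for illustration.

```r
# USA row from the head() output shown earlier in this report
usa <- list(
  total_cases     = 104196861,
  total_deaths    = 1132935,
  total_recovered = 101322779,
  active_cases    = 1741147,
  population      = 334805269
)

# Deaths per one million residents
deaths_per_million <- usa$total_deaths / (usa$population / 1e6)

# Share of reported cases that have recovered
recovery_rate <- usa$total_recovered / usa$total_cases

# Share of reported cases that are still active
active_cases_ratio <- usa$active_cases / usa$total_cases

print(c(deaths_per_million = deaths_per_million,
        recovery_rate      = recovery_rate,
        active_cases_ratio = active_cases_ratio))
```

This gives roughly 3,384 deaths per million, a recovery rate of about 97.2%, and an active cases ratio of about 1.7%.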

2) Training the Linear Regression Model

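We will now build a linear regression model to predict total_deaths. Note that the model below uses the raw counts (total_cases, total_recovered, active_cases, and population) as predictors rather than the derived features created above. This linear regression should help us understand the relation between the response (total_deaths) and those predictor values.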

# train a linear regression model to predict total deaths
model <- lm(total_deaths ~ total_cases + total_recovered + active_cases + population, data = model_covid_world_dat)

# view model summary
summary(model)
## 
## Call:
## lm(formula = total_deaths ~ total_cases + total_recovered + active_cases + 
##     population, data = model_covid_world_dat)
## 
## Residuals:
##             Min              1Q          Median              3Q             Max 
## -0.000000032779 -0.000000000524  0.000000000359  0.000000001223  0.000000048890 
## 
## Coefficients:
##                                 Estimate               Std. Error
## (Intercept)      0.000000010184390607263  0.000000000456701745159
## total_cases      1.000000000000053512750  0.000000000000008702001
## total_recovered -1.000000000000053734794  0.000000000000008785060
## active_cases    -1.000000000000053068661  0.000000000000008615763
## population      -0.000000000000000077293  0.000000000000000004788
##                             t value            Pr(>|t|)    
## (Intercept)                   22.30 <0.0000000000000002 ***
## total_cases      114916092899285.61 <0.0000000000000002 ***
## total_recovered -113829621331453.84 <0.0000000000000002 ***
## active_cases    -116066326687905.08 <0.0000000000000002 ***
## population                   -16.14 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.000000006005 on 190 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.703e+28 on 4 and 190 DF,  p-value: < 0.00000000000000022

The output summary of the model tells us:

  • Coefficients
    • Tell us how much each variable affects the response variable (total_deaths)
      • Example: if total_cases has a coefficient of 0.05, it means for every 1 increase in total_cases, total_deaths will increase by 0.05
  • P-value
    • Tells us if the relationship between a predictor and a target is statistically significant
      • Values lower than 0.05 typically mean the predictor is significant
  • R-squared
    • Measures how well the model fits the data
      • An R-squared value closer to 1 means the model explains most of the variability in total_deaths
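An R-squared of exactly 1 with coefficients of 1, -1, and -1 is worth a second look, because it suggests the response may be an exact combination of the predictors. The sketch below checks that idea on the six rows from the head() output (the data frame `six` is my own stand-in, and this only confirms the pattern for those six rows, not for the whole data set).

```r
# Six rows copied from the head() output, for illustration only
six <- data.frame(
  country         = c("USA", "India", "France", "Germany", "Brazil", "Japan"),
  total_cases     = c(104196861, 44682784, 39524311, 37779833, 36824580, 32588442),
  total_deaths    = c(1132935, 530740, 164233, 165711, 697074, 68399),
  total_recovered = c(101322779, 44150289, 39264546, 37398100, 35919372, 21567425),
  active_cases    = c(1741147, 1755, 95532, 216022, 208134, 10952618)
)

# If every case is either recovered, still active, or a death, then
# deaths can be computed directly without any model
six$implied_deaths <- six$total_cases - six$total_recovered - six$active_cases

print(six[, c("country", "total_deaths", "implied_deaths")])
all(six$implied_deaths == six$total_deaths)
```

For these rows the implied and reported deaths match exactly, which means the regression is recovering an accounting identity rather than discovering a predictive relationship. A model meant to predict deaths on new data would need predictors that do not already contain the answer.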

3) Model Evaluation

We will now proceed to evaluate the model according to Root Mean Squared Error (RMSE). This tells us, on average, how far off our predictions are.

# predict total_deaths using the trained model
predictions <- predict(model, newdata = model_covid_world_dat)

# calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mean((predictions - model_covid_world_dat$total_deaths)^2))
print(paste("RMSE:", round(rmse, 2)))
## [1] "RMSE: 0"

A lower RMSE indicates better model performance. For example, an RMSE of 500 means that, on average, the model’s predictions are off by 500 deaths.
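To make the formula concrete, here is a tiny worked example with made-up numbers (they are not from the data set). The helper name `rmse_of` is my own.

```r
# Root Mean Squared Error: square the errors, average them, take the square root
rmse_of <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}

actual    <- c(100, 200, 300)
predicted <- c(110, 190, 320)

# Errors are 10, -10, 20; squared: 100, 100, 400; mean: 200
rmse_of(actual, predicted)  # sqrt(200), about 14.14
```

Because the errors are squared before averaging, a single large miss (the 20 here) pulls the RMSE up more than several small ones would.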

4) Model Interpretation

The coefficients tell us how strongly each predictor variable affects total_deaths.

# view the coefficients
coef(model)
##                (Intercept)                total_cases 
##  0.00000001018439060726309  1.00000000000005351274979 
##            total_recovered               active_cases 
## -1.00000000000005373479439 -1.00000000000005306866058 
##                 population 
## -0.00000000000000007729323

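For example, if total_cases has a large positive coefficient, this means that as the number of cases increases, so does the number of deaths. If population has a negative coefficient, this could suggest that countries with larger populations have relatively fewer deaths compared to smaller countries, depending on other factors. Here, the coefficients on total_cases, total_recovered, and active_cases are almost exactly 1, -1, and -1, and the population coefficient is effectively zero. In other words, the model has recovered the bookkeeping relation total_deaths = total_cases - total_recovered - active_cases, which holds in the rows shown earlier (for the USA, 104,196,861 - 101,322,779 - 1,741,147 = 1,132,935).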

5) Model Tuning

Something many people, including myself, often skip is validating the model on data it has not seen.

We can use this kind of validation to check for overfitting (where a model performs well on the training data but not very well on unseen data). This is done by dividing the data into a training set and a test set to check whether the model generalizes well to other data. The code below uses a single 80/20 train/test split, which is a simple hold-out version of the idea rather than full cross-validation.

# create a train/test split
set.seed(123) # fix the random seed so the split is reproducible 
train_index <- createDataPartition(model_covid_world_dat$total_deaths, p = 0.8, list = FALSE)
train_data <- model_covid_world_dat[train_index, ] # used to train the model 
test_data <- model_covid_world_dat[-train_index, ] # used to test the model after it has been trained 

# train the model on the training set
model_cv <- lm(total_deaths ~ total_cases + total_recovered + active_cases + population, data = train_data)

# predict on the test set
test_predictions <- predict(model_cv, newdata = test_data)

# calculate RMSE on the test set
rmse_cv <- sqrt(mean((test_predictions - test_data$total_deaths)^2))
print(paste("Test RMSE:", round(rmse_cv, 2)))
## [1] "Test RMSE: 0"

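Fantastic! The RMSE on the held-out test set came out the same as the previous one. That is what we should expect here, since the coefficients show that total_deaths is an exact combination of the predictors (cases minus recoveries minus active cases).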
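A single split gives one estimate of the test error; k-fold cross-validation repeats the exercise so that every row is used for testing exactly once. Below is a minimal base R sketch of 5-fold cross-validation. To keep it self-contained it uses R's built-in mtcars data and an arbitrary formula (mpg ~ wt + hp) as stand-ins; swapping in model_covid_world_dat and the deaths formula is an assumption about how it would be applied here.

```r
set.seed(123)  # fix the random seed so the fold assignment is reproducible

k   <- 5
dat <- mtcars  # built-in stand-in data, not the COVID data

# Randomly assign each row to one of k folds of near-equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(dat)))

fold_rmse <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[fold_id != i, ]  # k - 1 folds used for fitting
  test  <- dat[fold_id == i, ]  # the remaining fold used for evaluation

  fit  <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = test)

  fold_rmse[i] <- sqrt(mean((pred - test$mpg)^2))
}

# Average test error across the k folds
cv_rmse <- mean(fold_rmse)
print(round(fold_rmse, 2))
print(round(cv_rmse, 2))
```

The spread of the per-fold values is informative too: if they vary a lot, the model's performance depends heavily on which rows it happened to see.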

6) Visualization of Predictions vs. Actuals

Finally, let’s visualize how well our model’s predictions match the actual total_deaths:

# create a data frame with the actual and predicted values
comparison <- data.frame(
  Actual = model_covid_world_dat$total_deaths,
  Predicted = predictions
)

# plot the actual vs. predicted deaths
comparison |>
  ggplot(aes(x = Actual, y = Predicted)) +
  geom_point(color = "red") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Actual vs. Predicted Total COVID-19 Deaths", x = "Actual Deaths", y = "Predicted Deaths") +
  theme_minimal()

The above scatter plot demonstrates the model’s performance in its prediction.

  • Points represent the model’s predictions
  • Dashed line represents the actual values (i.e. perfect predictions)

The closer the points are to the dashed line, the better the model’s performance. If the points are spread too far, the model may need improvement.

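It’s actually insane how perfect this model turned out, though it is less surprising once you notice that the fitted coefficients amount to deaths = cases - recoveries - active cases. This also took me quite a few hours to get down with the help of Stack Overflow and ChatGPT so shout out to them.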

Conclusion

All my thoughts and interpretations have been strewn across this entire document, so there isn’t much for me to share here other than the point that there is much to be learned from what might seem like even the dullest of data. Data does not need to be super complex for it to be analyzed. Thinking creatively about how to look at the data and what types of predictions you might be able to make with it is just as important. COVID-19 offers great insight into how the world handles large medical crises and whether it handled this one well. That can be said for many other forms of data, so be on the lookout and think creatively! You never know what you may find. I also didn’t sleep all night in the process of writing this entire script so I’m calling it good now, thank you for reading!


---
title: "An Exploratory Analysis of Worldwide COVID-19 Cases and Deaths"
date: "December 5, 2024"
output: 
    html_document:
        theme: paper
        highlight: tango
        toc: true
        toc_float:
            collapsed: true
        number_sections: false
        code_download: true
        df_print: kable
        code_folding: show
        mode: selfcontained
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE, message = FALSE)
```

[Return to Homepage](../index.html)

# **Introduction**
In this report, I investigate the untold story behind seemingly transparent records of worldwide COVID-19 cases and death tolls. While the data has already been cleaned and contains fairly simple information, this project aims to investigate the disparities between countries, their COVID death tolls, COVID recovery rates, and the correlation between those statistics and their financial standing and population size. In simpler terms, does a country having a high death toll inherently imply that it handled the outbreak worse than another? What does the total number of cases in relation to the number of recovered patients and deaths tell us about that nation's pandemic response? Did the amount of COVID testing have any noticeable effect on the total number of cases and, in turn, deaths? These are just some of the questions I hope to explore in this writing. 

# **Background**
This project was inspired in part by my previous research on non-substance use disorder treatment services offered in Baltimore, MD, which primarily consisted of COVID-related services. It was also inspired by general happenstance: I found a very nice data set on Kaggle and felt I could offer an interesting take on a seemingly dull data set. The data set was found [here](https://www.kaggle.com/datasets/themrityunjaypathak/covid-cases-and-deaths-worldwide/data), as updated 2 years ago (2022) by Mrityunjay Pathak.

The data set consists of various reported statistics regarding Coronavirus (COVID-19), an infectious disease caused by the SARS-CoV-2 virus. Most of those infected by the virus will experience mild to moderate respiratory symptoms and recover without requiring any particular medical treatment, albeit taking a bit longer than with most other diseases. However, some individuals may become seriously ill and require medical attention and even hospitalization. The elderly and those with existing medical conditions such as cardiovascular disease, chronic respiratory illness, cancer, and other such conditions are more likely to develop serious symptoms. That being said, **anyone**, regardless of age or medical history, could grow seriously ill and die as a result of COVID-19 infection. 

Prevention is the primary defense against the spread of the disease. Understanding how the virus spreads is important, as protecting oneself from the illness is the best way to protect others as well. At the peak of the pandemic, it was advised that individuals stay at least 1 meter apart from each other, wear properly fitted medical-grade masks, and wash hands or use hand sanitizer frequently. Once the vaccines were released, it was also advised by the CDC to get vaccinated to start building up herd immunity. 

The virus can spread through liquid particles associated with coughing, sneezing, speaking, singing, or breathing. The obvious avenue for this is through the mouth or nose of the sick individual. These particles range from larger droplets to smaller aerosols. Individuals are encouraged to practice proper respiratory etiquette, stay home when sick, and self-isolate for at least 1 week after symptoms have ceased. 

This data has been compiled through reports from the onset of the pandemic (2020) up until its last update in 2022. The data does not distinguish between variants of the virus; however, it does imply that it accounts for the original variant, Delta, and Omicron, including its subvariants BA.4 and BA.5. 

# **Data**
```{r, echo = FALSE, results = 'hide'}
# load all necessary libraries 
library(tidyverse)
library(broom)
library(janitor)
library(reactable)
library(corrplot)
library(caret)

# converts all numbers to standard notation 
options(scipen = 999)

# reading in kaggle data set 
og_covid_world_dat <- read.csv("covid_worldwide.csv")

```
As already described, this data set was retrieved from Kaggle and had already been **mostly** cleaned for analysis. However, there are some changes I wanted to make to the data structure, missing values/empty spaces, and superfluous characters found throughout the data set. The following code and outputs demonstrate the changes I've made to allow for smoother data analysis: 

```{r}
# shows the original data structure 
str(og_covid_world_dat)

# converts all N/A chr values to actual missing values and fills in empty spaces with missing values 
covid_world_dat <- og_covid_world_dat |>
  mutate(across(where(is.character), ~ na_if(., "N/A") |> na_if("")))

# convert column titles to snake_case 
covid_world_dat <- clean_names(covid_world_dat)

# replace commas in all chr columns to avoid issue with conversion to num data type 
covid_world_dat <- covid_world_dat |>
  mutate(across(where(is.character), ~ str_replace_all(., ",", "")))

# converts data structure to more appropriate data types 
covid_world_dat <- covid_world_dat |>
  mutate(country = as.factor(country),
         across(-c(country, serial_number), as.numeric), # converts to num (not int) to account for math that results in decimals 
         total_test_taken = total_test) |> # makes a column name clearer to understand 
  relocate(total_test_taken, .before = population) |>
  select(-total_test) # removes that original column 

# shows the updated data structure
str(covid_world_dat)

# demonstrates a portion of our updated data set 
head(covid_world_dat)

```

As noted before, the data was collected across 231 countries (231 rows) along with 6 particular statistics associated with each country. The columns are described as follows: 

- **serial_number**
  - a unique but arbitrary ID given to each row of data to serve as a key column 
  
- **country** 
  - the country in question 
  
- **total_cases**
  - total number of cases of COVID-19 reported at that point in the country 
  
- **total_deaths**
  - total number of deaths from COVID-19 reported at that point in the country 
  
- **total_recovered** 
  - total number of patients who recovered from COVID-19 at that point in the country 
  
- **active_cases**
  - currently on-going cases reported at the moment the survey was conducted 
  
- **total_test_taken**
  - total number of COVID-19 tests administered and reported at that point in the country 
  
- **population**
  - total number of people living in the country at that point of the survey
  
# **Exploratory Analysis**
## Death Rate vs. Testing Rate 
First, we will be investigating two metrics that aren't given to us in the actual data set: 

- Test rate 

- Death rate 

The **test rate** refers to how extensively a country has tested its population for COVID-19. A higher rate would suggest that a country is conducting more tests relative to its population size, which also implies more thorough effort in identifying and tackling cases. Conversely, a lower test rate suggests the opposite: less of the population is tested thus resulting in more overlooked cases. 

- **Formula:** test_rate = total_test_taken / population

The **death rate** represents the fatality rate of COVID-19 in a country in relation to total cases. In other words, it offers an indication of how deadly the disease is among the people who contract it. A higher death rate suggests a larger proportion of infected individuals are dying; a lower rate suggests the opposite. The death rate can be influenced by a variety of factors, including healthcare quality, the age distribution of the population, the presence of underlying health conditions, and the effectiveness of medical intervention. Higher death rates don't always mean that the virus is more deadly; they can also be a sign of insufficient healthcare, testing, or reporting. 

- **Formula:** death_rate = total_deaths / total_cases
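
As a quick worked example of the two formulas, here is a minimal sketch on made-up numbers. The `toy_rates` data frame and all of its values are hypothetical and are not taken from the Kaggle data.

```{r}
# hypothetical toy data (not from the real data set)
toy_rates <- data.frame(
  country = c("A", "B"),
  total_cases = c(1000, 5000),
  total_deaths = c(20, 50),
  total_test_taken = c(4000, 100000),
  population = c(10000, 50000)
)

# apply the two formulas defined above
toy_rates$test_rate <- toy_rates$total_test_taken / toy_rates$population
toy_rates$death_rate <- toy_rates$total_deaths / toy_rates$total_cases

toy_rates[, c("country", "test_rate", "death_rate")]
```

Country B ends up with a test rate of 2, which is a reminder that the test rate can exceed 1 because one person can be tested more than once.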

This relationship is plotted below, where each dot represents a country. The first graph uses the raw values on linear axes. This results in a poor graph, as the points are too close to each other to draw any meaningful conclusions. 

```{r, echo = FALSE}
# poorly represented scatter plot of the death rate correlated to the test rate by population (no log scale)
covid_world_dat |>
  mutate(death_rate = total_deaths / total_cases, 
         test_rate = total_test_taken / population) |>
  ggplot(aes(x = test_rate, y = death_rate)) +
  geom_point(aes(color = country), size = 3, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  labs(title = "Death Rate vs. Testing Rate",
       x = "Tests per Population", 
       y = "Death Rate") +
  theme_minimal() +
  theme(legend.position = "none")

```

The following graph is a more effective version that is fit to log scale to separate the points out and allow us to visualize the data more clearly: 

```{r}
# scatter plot showing the correlation between a population's death rate and testing rate per population fit by log scale 
covid_world_dat |>
  mutate(death_rate = total_deaths / total_cases, 
         test_rate = total_test_taken / population) |>
  ggplot(aes(x = test_rate, y = death_rate)) +
  geom_point(aes(color = country), size = 3, alpha = 0.7) +
  scale_x_log10() + scale_y_log10() +  # log scales on both axes for a more effective visualization 
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # line of best fit (linear model)
  labs(title = "Death Rate vs. Testing Rate (Log Scale)",
       x = "Tests per Population", 
       y = "Death Rate") +
  theme_minimal() +
  theme(legend.position = "none")


```

As the fitted line shows, **the more testing a country has administered to its citizens, the lower the death rate tends to be.**

- Higher testing = lower death rate 
- Lower testing = higher death rate 
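
To put a number on a trend like this rather than reading it off the plot, one option is to compute the correlation and the fitted slope on the log10 scale. The sketch below uses a hypothetical `toy_log` data frame with made-up rates; the real analysis would also need to drop rows where either rate is zero or missing, since `log10()` cannot handle them.

```{r}
# hypothetical toy rates (not from the real data set)
toy_log <- data.frame(
  test_rate = c(0.01, 0.1, 1, 10),
  death_rate = c(0.05, 0.03, 0.02, 0.005)
)

# correlation on the log10 scale, matching the axes of the log-scale plot
toy_log_cor <- cor(log10(toy_log$test_rate), log10(toy_log$death_rate))
toy_log_cor

# slope of the straight line fitted to the log-transformed values
toy_log_fit <- lm(log10(death_rate) ~ log10(test_rate), data = toy_log)
coef(toy_log_fit)
```

A negative correlation and a negative slope would both point the same way as the plot: more testing goes with a lower death rate.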

## Highest and Lowest death-to-test Ratio
In conjunction with the previous analysis, I calculated which countries have the **highest death-to-test ratio** (the points that are furthest to the left and up on the graph). This ratio divides each country's death rate by its test rate, so it is high when deaths per case are high relative to how much testing was done. 

```{r}
# calculation of the countries with the highest death-to-test ratio 
high_death_to_test_ratio <- covid_world_dat |>
  mutate(death_rate = total_deaths / total_cases, 
         test_rate = total_test_taken / population,
         death_to_test_ratio = death_rate / test_rate) |>
  arrange(desc(death_to_test_ratio)) |>
  select(country, death_rate, test_rate, death_to_test_ratio) |>
  head(10)  # shows the top 10 countries

```

```{r, echo = FALSE}
# using reactable to make the data prettier
high_death_to_test_ratio |>
  reactable(columns = list(
    country = colDef(name = "Country"),
    death_rate = colDef(name = "Death Rate", format = colFormat(digits = 2)),
    test_rate = colDef(name = "Test Rate", format = colFormat(digits = 2)),
    death_to_test_ratio = colDef(name = "Death-to-Test Ratio", format = colFormat(digits = 2))
    ),
    defaultPageSize = 10,
    highlight = TRUE, 
    striped = TRUE,
    bordered = TRUE,
    showPageSizeOptions = TRUE,
    pagination = TRUE)

```

On the other hand, the countries with the **lowest death-to-test ratio** are listed below. Similarly, this denotes which points were the furthest to the right and down on the graph. 

```{r}
# calculation of the countries with the lowest death-to-test ratio 
low_death_to_test_ratio <- covid_world_dat |>
  mutate(death_rate = total_deaths / total_cases, 
         test_rate = total_test_taken / population,
         death_to_test_ratio = death_rate / test_rate) |>
  arrange(death_to_test_ratio) |>
  select(country, death_rate, test_rate, death_to_test_ratio) |>
  head(10)  # shows the 10 countries with the lowest ratio 

```

```{r, echo = FALSE}
# using reactable to make the data prettier
low_death_to_test_ratio |>
  reactable(columns = list(
    country = colDef(name = "Country"),
    death_rate = colDef(name = "Death Rate", format = colFormat(digits = 2)),
    test_rate = colDef(name = "Test Rate", format = colFormat(digits = 2)),
    death_to_test_ratio = colDef(name = "Death-to-Test Ratio", format = colFormat(digits = 2))
    ),
    defaultPageSize = 10,
    highlight = TRUE,
    striped = TRUE,
    bordered = TRUE,
    showPageSizeOptions = TRUE,
    pagination = TRUE)

```

## Top 10 Countries by Total Cases, Deaths, and Recoveries
The following graphs visualize the 10 countries with the **highest total of cases, deaths, and recoveries**. 

I separated the bar plots because the disparity between the total number of cases and deaths is so high that you can hardly see the deaths. When we include the recoveries, you cannot even see the death count for countries as large as the USA. Splitting them offers a relative comparison that is easier to read. Now this isn't meant to discredit the heavy toll COVID had on the first world; a million people still died. However, it must be understood that 1,000,000 is only 1% of the ~100,000,000 recoveries made in the US alone (a population of over 300,000,000). With that consideration, we can come to appreciate the excellent response that the USA and similar countries had under the pressure of overfilled hospitals and understaffed medical facilities. The burden COVID-19 placed on each country is especially noticeable from the height of the bars alone. The ratio of recoveries to deaths can indicate how well a country was able to respond to the cases appearing within its borders. 

  - Due to the size of the counts, R may display the axis in scientific notation. Keep in mind that the largest count at the right end of the horizontal axis is larger than 100,000,000 people. 

```{r}
# overlaid bar plot (each layer is drawn on top of the last) of total cases and deaths for the countries with the most COVID-19 cases
covid_world_dat |>
  arrange(desc(total_cases)) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) +
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity", position = "dodge") +
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Top 10 Countries with the Most COVID-19 Cases and Deaths", 
       x = "Country", 
       y = "Count", 
       fill = "Category") +
  theme_minimal()

```

```{r}
# overlaid bar plot of total cases, deaths, and recovered patients for the countries with the most COVID-19 cases
covid_world_dat |>
  arrange(desc(total_cases)) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) + 
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity") + 
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity") + 
  geom_bar(aes(y = total_recovered, fill = "Recovered"), stat = "identity") + 
  coord_flip() + 
  labs(title = "Top 10 Countries with the Most COVID-19 Cases, Deaths, and Recovery", 
       x = "Country", 
       y = "Count", 
       fill = "Category") + 
  theme_minimal()

```
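
Because each `geom_bar()` layer in the chunks above is drawn on top of the previous one, the bars overlap rather than sit side by side. As a sketch of one alternative (not the approach used in this report), the data can be reshaped to long format so that a single layer with `position = "dodge"` places the categories next to each other. The `toy_counts` tibble and its values are hypothetical.

```{r}
library(tidyverse)

# hypothetical toy counts (not from the real data set)
toy_counts <- tibble(
  country = c("A", "B", "C"),
  total_cases = c(900, 500, 200),
  total_deaths = c(30, 10, 5),
  total_recovered = c(850, 480, 190)
)

# reshape to one row per country and category
toy_counts_long <- toy_counts |>
  pivot_longer(cols = c(total_cases, total_deaths, total_recovered),
               names_to = "category",
               values_to = "count")

# a single geom_col() layer with position = "dodge" draws the categories side by side
toy_counts_long |>
  ggplot(aes(x = reorder(country, count), y = count, fill = category)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "Country", y = "Count", fill = "Category") +
  theme_minimal()
```

With the real data the death bars would still be tiny next to the case bars, so this changes the layout but not the underlying disparity.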

The following graphs visualize the 10 countries with the **lowest total of cases, deaths, and recoveries**. 

Once again, I also separated the plots to allow for easier visualization. However, as you may have noticed, the countries with the lowest number of cases also happen to have next to no reported deaths as a result. We are talking about countries such as Tuvalu with a population of 12,000. Though a large proportion of their population was affected by the virus, there were no reported deaths. With that said, we see that the top two countries also had next to no reported recoveries (Saint Helena had 2). 

What does this tell us about those two countries? 

- Though they may not have done an excellent job at isolation, those who got sick 1) received great treatment, or 2) had very few at-risk and elderly individuals sick.

We see these values simply decrease with the population itself. Tokelau has a population of a tiny 1,378 and a total of 5 reported cases (of which were active at the time this survey was recorded). However, it is difficult to quantify results so small seeing as these countries may struggle to simply report an accurate count of all these instances. Who knows, maybe we should go pay a visit to the Vatican!

```{r}
# overlaid bar plot of total cases and deaths for the countries with the fewest COVID-19 cases
covid_world_dat |>
  arrange(total_cases) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) +
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity", position = "dodge") +
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Top 10 Countries with the Least COVID-19 Cases and Deaths", 
       x = "Country", 
       y = "Count", 
       fill = "Category") +
  theme_minimal()

```
```{r}
# overlaid bar plot of total cases, deaths, and recovered patients for the countries with the fewest COVID-19 cases
covid_world_dat |>
  arrange(total_cases) |>
  head(10) |>
  ggplot(aes(x = reorder(country, total_cases))) + 
  geom_bar(aes(y = total_cases, fill = "Cases"), stat = "identity") + 
  geom_bar(aes(y = total_deaths, fill = "Deaths"), stat = "identity") + 
  geom_bar(aes(y = total_recovered, fill = "Recovered"), stat = "identity") + 
  coord_flip() + 
  labs(title = "Top 10 Countries with the Least COVID-19 Cases, Deaths, and Recovery", 
       x = "Country", 
       y = "Count", 
       fill = "Category") + 
  theme_minimal()

```

## Visualizing Correlations between Key Metrics 
The following analysis is a bit different than what I'm used to, and I do think it has the potential of being a bit inaccurate, or rather, misleading. A correlation matrix is a table that shows the correlation coefficients between many variables. Each cell within the table displays the correlation between 2 variables, ranging from -1 to 1. These coefficients indicate the strength and direction of a linear relationship between two variables. 

How do we read the matrix? 

- +1: Perfect positive correlation (both variables move in the same direction)
  - As one variable increases, the other increases 
- -1: Perfect negative correlation (variables move in opposite directions)
  - As one variable increases, the other decreases 
- 0: No linear correlation
  - No predictable linear relationship between the variables 
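
The three cases in the list above can be reproduced with a minimal sketch. All of the `toy_` vectors are hypothetical and exist only to illustrate how `cor()` behaves.

```{r}
# hypothetical toy vectors (not from the real data set)
toy_x <- 1:5
toy_up <- 2 * toy_x + 1          # rises in a straight line with toy_x
toy_down <- -3 * toy_x + 10      # falls in a straight line as toy_x rises
toy_none <- c(1, -1, 0, -1, 1)   # varies, but with no linear trend against toy_x

cor(toy_x, toy_up)
cor(toy_x, toy_down)
cor(toy_x, toy_none)
```

These return 1, -1, and 0 respectively. Note that `toy_none` is not constant, so a coefficient of 0 only rules out a linear relationship, not every kind of relationship.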

With that in mind, I see no reason to really explain the correlations in words when the matrix does a better job at doing so. 

```{r}
# calculate the correlation matrix
corr_matrix <- covid_world_dat |>
  select(total_cases, total_deaths, total_recovered, active_cases, total_test_taken, population) |>
  na.omit() |> # remove NAs to avoid conflict 
  cor()

# create the correlation plot with improved labels
corr_matrix |>
  corrplot(method = "circle", 
           type = "upper", 
           tl.col = "black",
           tl.srt = 45,                
           tl.cex = 0.8,               
           mar = c(0,0,1,0),           
           addCoef.col = "black",      
           number.cex = 0.7,           
           diag = FALSE)

```

# **Predictive Modeling**
After viewing and interpreting different statistical measures of the COVID data, I think the best next step is to use what we learned from the correlations between variables to try to predict the most important metric in these surveys: **total deaths**.

The following consists of the painstaking steps to produce a linear regression model and make predictions on the total number of deaths found in this data set. 

## 1) Preprocessing the Data 
Before building a model, it is best to preprocess the data in order to clear out any missing values and create any relevant features we may need to allow for more specific modeling. In this case: 

- deaths_per_million
  - tells us how many deaths per million people in a country 
- recovery_rate 
  - proportion of recovered cases compared to total cases which may indicate healthcare effectiveness 
- active_cases_ratio 
  - proportion of active cases to total cases; a higher number could mean a more serious situation 

```{r, results = 'hide'}
# preparing the data 
model_covid_world_dat <- covid_world_dat |>
  mutate(
    deaths_per_million = total_deaths / (population / 1e6), # deaths per million
    recovery_rate = total_recovered / total_cases, # recovery rate
    active_cases_ratio = active_cases / total_cases # active cases ratio
  ) |>
  na.omit() # remove rows with NA values to avoid conflicts 

```

To avoid redundancy and clutter, I will not be including any output, just the code above. 

## 2) Training the Linear Regression Model 
We will now build a linear regression model to predict total_deaths from total_cases, total_recovered, active_cases, and population. This linear regression should help us understand the relationship between the response (total_deaths) and the predictor variables. Note that the engineered features from step 1 are not used as predictors in this first model. 

```{r}
# train a linear regression model to predict total deaths
model <- lm(total_deaths ~ total_cases + total_recovered + active_cases + population, data = model_covid_world_dat)

# view model summary
summary(model)

```

The output summary of the model tells us: 

- Coefficients 
  - Tell us how much each variable affects the response variable (total_deaths)
    - Example: if total_cases has a coefficient of 0.05, it means for every increase of 1 in total_cases, total_deaths is expected to increase by 0.05, holding the other predictors constant

- P-value 
  - Tells us if the relationship between a predictor and a target is statistically significant
    - Values lower than 0.05 typically mean the predictor is significant 
    
- R-squared
  - Measures how well the model fits the data
    - An R-squared value closer to 1 means the model explains most of the variability in total_deaths 
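
Each of the three quantities above can also be pulled out of the summary object directly. The sketch below fits a small model on hypothetical data (the `toy_fit_dat` data frame and its `cases` and `deaths` columns are made up) and uses only base R.

```{r}
# hypothetical toy data (not from the real data set)
set.seed(1)
toy_fit_dat <- data.frame(cases = seq(100, 1000, by = 100))
toy_fit_dat$deaths <- 0.05 * toy_fit_dat$cases + rnorm(10, sd = 2)

toy_fit <- lm(deaths ~ cases, data = toy_fit_dat)
toy_summary <- summary(toy_fit)

# coefficient table with columns Estimate, Std. Error, t value, Pr(>|t|)
toy_summary$coefficients

# slope estimate and its p-value for the `cases` predictor
toy_summary$coefficients["cases", "Estimate"]
toy_summary$coefficients["cases", "Pr(>|t|)"]

# share of the variance in deaths explained by the model
toy_summary$r.squared
```

Since the toy deaths were generated as roughly 0.05 times the cases, the slope estimate should land close to 0.05.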

## 3) Model Evaluation 
We will now proceed to evaluate the model according to Root Mean Squared Error (RMSE). This tells us, on average, how far off our predictions are. 

```{r}
# predict total_deaths using the trained model
predictions <- predict(model, newdata = model_covid_world_dat)

# calculate RMSE (Root Mean Squared Error)
rmse <- sqrt(mean((predictions - model_covid_world_dat$total_deaths)^2))
print(paste("RMSE:", round(rmse, 2)))

```

A lower RMSE indicates better model performance. For example, an RMSE of 500 means that, on average, the model's predictions are off by 500 deaths. 
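
Written out, the calculation is $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$, where $\hat{y}_i$ is the predicted value and $y_i$ the actual value. As a small sketch, the same calculation can be wrapped in a helper function; `toy_rmse` is a hypothetical name and the numbers are made up.

```{r}
# hypothetical helper that mirrors the RMSE calculation used above
toy_rmse <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}

# made-up values: the errors are 2, -2, and 3
toy_rmse(actual = c(10, 20, 30), predicted = c(12, 18, 33))
```

Because the errors are squared before averaging, large misses count for more than small ones, so "off by this much on average" is a rough reading rather than an exact one.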

## 4) Model Interpretation 
The coefficients tell us how strongly each predictor variable affects total_deaths.

```{r}
# view the coefficients
coef(model)

```

For example, if total_cases has a large positive coefficient, this means that as the number of cases increases, so does the number of deaths. If population has a negative coefficient, this could suggest that countries with larger populations have relatively fewer deaths compared to smaller countries, depending on other factors.

## 5) Model Tuning 
Something many people, including myself, often skip is validating the model on data it has not seen, which helps detect overfitting (performing well on the training data but not very well on unseen data). 

This is done here by dividing the data into a training set and a test set to check whether the model generalizes well to other data. Strictly speaking, a single split like this is hold-out validation rather than full cross-validation, but it serves the same purpose.

```{r}
# create a train/test split
set.seed(123) # fixes the random seed so the split is reproducible 
train_index <- createDataPartition(model_covid_world_dat$total_deaths, p = 0.8, list = FALSE)
train_data <- model_covid_world_dat[train_index, ] # used to train the model 
test_data <- model_covid_world_dat[-train_index, ] # used to test the model after it has been trained 

# train the model on the training set
model_cv <- lm(total_deaths ~ total_cases + total_recovered + active_cases + population, data = train_data)

# predict on the test set
test_predictions <- predict(model_cv, newdata = test_data)

# calculate RMSE on the test set
rmse_cv <- sqrt(mean((test_predictions - test_data$total_deaths)^2))
print(paste("Test RMSE:", round(rmse_cv, 2)))

```

Fantastic! The RMSE on the held-out test set came out the same as the in-sample RMSE from step 3. 
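
The chunk above relies on one random 80/20 split. For comparison, here is a minimal base R sketch of k-fold cross-validation, where every row is held out exactly once. It is an illustration on hypothetical data (`toy_cv_dat`, with made-up `cases` and `deaths` columns), not the procedure used in this report; `caret` also offers `train()` with `trainControl(method = "cv")` for the same idea.

```{r}
# hypothetical toy data (not from the real data set)
set.seed(123)
toy_cv_dat <- data.frame(cases = runif(50, min = 100, max = 1000))
toy_cv_dat$deaths <- 0.05 * toy_cv_dat$cases + rnorm(50, sd = 2)

# randomly assign each row to one of 5 folds
toy_k <- 5
toy_folds <- sample(rep(1:toy_k, length.out = nrow(toy_cv_dat)))

# for each fold: train on the other folds, then score on the held-out fold
toy_fold_rmse <- sapply(1:toy_k, function(fold) {
  train <- toy_cv_dat[toy_folds != fold, ]
  test <- toy_cv_dat[toy_folds == fold, ]
  fit <- lm(deaths ~ cases, data = train)
  sqrt(mean((predict(fit, newdata = test) - test$deaths)^2))
})

# average held-out RMSE across the folds
mean(toy_fold_rmse)
```

Averaging over several folds makes the error estimate less dependent on which rows happen to land in a single test set.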

## 6) Visualization of Predictions vs. Actuals
Finally, let's visualize how well our model's predictions match the actual total_deaths.

```{r}
# create a data frame with the actual and predicted values
comparison <- data.frame(
  Actual = model_covid_world_dat$total_deaths,
  Predicted = predictions
)

# plot the actual vs. predicted deaths
comparison |>
  ggplot(aes(x = Actual, y = Predicted)) +
  geom_point(color = "red") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Actual vs. Predicted Total COVID-19 Deaths", x = "Actual Deaths", y = "Predicted Deaths") +
  theme_minimal()

```

The above scatter plot demonstrates the model's performance in its prediction. 

- Each point is one country, positioned by its actual deaths (x) and the model's predicted deaths (y) 
- Dashed line marks where the predicted value equals the actual value (i.e. perfect predictions)

The closer the points are to the dashed line, the better the model's performance. If the points are spread too far, the model may need improvement. 

It's actually insane how perfect this model turned out. This also took me quite a few hours to get down with the help of Stack Overflow and ChatGPT so shout out to them. 
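
One possible explanation for a near-perfect fit, offered here as an assumption rather than something verified against this data set, is that tallies of this kind are often built so that total cases equal deaths plus recoveries plus active cases. If that holds, total_deaths can be computed exactly from the predictors and the regression has nothing left to estimate. The sketch below demonstrates the effect on hypothetical data (`toy_id_dat`) where the identity is true by construction.

```{r}
# hypothetical toy data where cases = deaths + recovered + active by construction
set.seed(42)
toy_id_dat <- data.frame(
  total_deaths = round(runif(30, 0, 500)),
  total_recovered = round(runif(30, 1000, 50000)),
  active_cases = round(runif(30, 0, 5000))
)
toy_id_dat$total_cases <- with(toy_id_dat, total_deaths + total_recovered + active_cases)

# the same kind of model as above, minus population
toy_id_fit <- lm(total_deaths ~ total_cases + total_recovered + active_cases,
                 data = toy_id_dat)

# the slopes come out at about 1, -1, and -1, and the residuals are essentially zero
round(coef(toy_id_fit), 6)
max(abs(residuals(toy_id_fit)))
```

To check whether this applies to the real data, one could compare `total_cases - total_recovered - active_cases` with `total_deaths` row by row in `covid_world_dat`.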

# **Conclusion** 
All my thoughts and interpretations have been strewn across this entire document, so there isn't much for me to share here other than the point that there is much to be learned from what might seem like even the dullest of data. Data does not need to be super complex for it to be analyzed. A creative thought process about how to look at the data and what types of predictions you might be able to make with it is just as important too. COVID-19 offers great insight into how the world can handle large medical crises and whether it handled this one well. That can be said for many other forms of data, so be on the lookout and think creatively! You'll never know what you may find. I also didn't sleep all night in the process of writing this entire script so I'm calling it good now, thank you for reading! 

[Return to Homepage](../index.html)
