R Coding Exercise

#load dslabs and tidyverse package
library(dslabs)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#get an overview of data structure
str(gapminder)

'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ population      : num  1636054 11124892 5270844 54681 20619075 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...

#get a summary of data
summary(gapminder)

                country           year      infant_mortality life_expectancy
 Albania            :   57   Min.   :1960   Min.   :  1.50   Min.   :13.20  
 Algeria            :   57   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
 Angola             :   57   Median :1988   Median : 41.50   Median :67.54  
 Antigua and Barbuda:   57   Mean   :1988   Mean   : 55.31   Mean   :64.81  
 Argentina          :   57   3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
 Armenia            :   57   Max.   :2016   Max.   :276.90   Max.   :83.90  
 (Other)            :10203                  NA's   :1453                    
   fertility       population             gdp               continent   
 Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Africa  :2907  
 1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Americas:2052  
 Median :3.750   Median :5.009e+06   Median :7.794e+09   Asia    :2679  
 Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11   Europe  :2223  
 3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10   Oceania : 684  
 Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                  
 NA's   :187     NA's   :185         NA's   :2972                       
             region    
 Western Asia   :1026  
 Eastern Africa : 912  
 Western Africa : 912  
 Caribbean      : 741  
 South America  : 684  
 Southern Europe: 684  
 (Other)        :5586

#determine the type of object gapminder is
class(gapminder)

[1] "data.frame"

Processing Data

Write code that assigns only the African countries to a new object/variable called africadata. Run str and summary on the new object you created. You should now have 2907 observations, down from 10545. Depending on how you do this, you might also notice that all the different categories are still kept in the continent (and other) variables, but show 0. R does not automatically remove categories of what in R is called a factor variable (a categorical variable) even if they are empty. We don’t have to worry about that just now, but something to keep in mind, it can sometimes lead to strange behavior.

Take the africadata object and create two new objects (name them whatever you want), one that contains only infant_mortality and life_expectancy and one that contains only population and life_expectancy. You should have two new objects/variables with 2907 rows and two columns. Use the str, and summary commands to take a look at both. Make sure you add comments into your code to explain what each line of code is doing, and as needed, also add additional explanatory Markdown text to your Quarto file.

#filter only African countries
africadata <- gapminder %>%
  filter(continent == "Africa")

#check the data structure of the object `africadata`
str(africadata)

'data.frame':   2907 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
 $ population      : num  11124892 5270844 2431620 524029 4829291 ...
 $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

#get a summary of the new data
summary(africadata)

         country          year      infant_mortality life_expectancy
 Algeria     :  57   Min.   :1960   Min.   : 11.40   Min.   :13.20  
 Angola      :  57   1st Qu.:1974   1st Qu.: 62.20   1st Qu.:48.23  
 Benin       :  57   Median :1988   Median : 93.40   Median :53.98  
 Botswana    :  57   Mean   :1988   Mean   : 95.12   Mean   :54.38  
 Burkina Faso:  57   3rd Qu.:2002   3rd Qu.:124.70   3rd Qu.:60.10  
 Burundi     :  57   Max.   :2016   Max.   :237.40   Max.   :77.60  
 (Other)     :2565                  NA's   :226                     
   fertility       population             gdp               continent   
 Min.   :1.500   Min.   :    41538   Min.   :4.659e+07   Africa  :2907  
 1st Qu.:5.160   1st Qu.:  1605232   1st Qu.:8.373e+08   Americas:   0  
 Median :6.160   Median :  5570982   Median :2.448e+09   Asia    :   0  
 Mean   :5.851   Mean   : 12235961   Mean   :9.346e+09   Europe  :   0  
 3rd Qu.:6.860   3rd Qu.: 13888152   3rd Qu.:6.552e+09   Oceania :   0  
 Max.   :8.450   Max.   :182201962   Max.   :1.935e+11                  
 NA's   :51      NA's   :51          NA's   :637                        
                       region   
 Eastern Africa           :912  
 Western Africa           :912  
 Middle Africa            :456  
 Northern Africa          :342  
 Southern Africa          :285  
 Australia and New Zealand:  0  
 (Other)                  :  0

#Create the first object containing only infant_mortality and life_expectancy
infant_mortality_life_expectancy <- africadata %>%
  select(infant_mortality, life_expectancy)

#check the data structure of the new object infant_mortality_life_expectancy
str(infant_mortality_life_expectancy)

'data.frame':   2907 obs. of  2 variables:
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...

#check the summary of the new object infant_mortality_life_expectancy
summary(infant_mortality_life_expectancy)

 infant_mortality life_expectancy
 Min.   : 11.40   Min.   :13.20  
 1st Qu.: 62.20   1st Qu.:48.23  
 Median : 93.40   Median :53.98  
 Mean   : 95.12   Mean   :54.38  
 3rd Qu.:124.70   3rd Qu.:60.10  
 Max.   :237.40   Max.   :77.60  
 NA's   :226

#Create the second object containing only population and life_expectancy
population_life_expectancy <- africadata %>% 
  select(population, life_expectancy)

#check the data structure of the new object population_life_expectancy
str(population_life_expectancy)

'data.frame':   2907 obs. of  2 variables:
 $ population     : num  11124892 5270844 2431620 524029 4829291 ...
 $ life_expectancy: num  47.5 36 38.3 50.3 35.2 ...

#check the summary of the new object population_life_expectancy
summary(population_life_expectancy)

   population        life_expectancy
 Min.   :    41538   Min.   :13.20  
 1st Qu.:  1605232   1st Qu.:48.23  
 Median :  5570982   Median :53.98  
 Mean   : 12235961   Mean   :54.38  
 3rd Qu.: 13888152   3rd Qu.:60.10  
 Max.   :182201962   Max.   :77.60  
 NA's   :51

Plotting

Using the new variables you created, plot life expectancy as a function of infant mortality and as a function of population size. Make two separate plots. Plot the data as points. For the plot with population size on the x-axis, set the x-axis to a log scale.

plot1 <- ggplot(infant_mortality_life_expectancy, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point() +
  labs(title = "Life Expectancy vs Infant Mortality",
       x = "Infant Mortality",
       y = "Life Expectancy") +
  theme_get()

print(plot1)

Warning: Removed 226 rows containing missing values or values outside the scale range
(`geom_point()`).

plot2 <- ggplot(population_life_expectancy, aes(x = population, y = life_expectancy)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "Life Expectancy vs Population Size (Log Scale)",
       x = "Population Size (Log Scale)",
       y = "Life Expectancy") +
  theme_get()

print(plot2)

Warning: Removed 51 rows containing missing values or values outside the scale range
(`geom_point()`).

Plotting cont’d

You should see a negative correlation between infant mortality and life expectancy, which makes sense. You should also see a positive correlation between population size and life expectancy. In both plots, especially the second one, you will see ‘streaks’ of data that seem to go together. Can you figure out what is going on here? Take another look at the africadata data we generated, which should give you a hint of what’s happening. Add descriptive text into your Quarto file to explain what you see and why.

head(africadata)

       country year infant_mortality life_expectancy fertility population
1      Algeria 1960            148.2           47.50      7.65   11124892
2       Angola 1960            208.0           35.98      7.32    5270844
3        Benin 1960            186.9           38.29      6.28    2431620
4     Botswana 1960            115.5           50.34      6.62     524029
5 Burkina Faso 1960            161.3           35.21      6.29    4829291
6      Burundi 1960            145.1           40.58      6.95    2786740
          gdp continent          region
1 13828152297    Africa Northern Africa
2          NA    Africa   Middle Africa
3   621797131    Africa  Western Africa
4   124460933    Africa Southern Africa
5   596612183    Africa  Western Africa
6   341126765    Africa  Eastern Africa

str(africadata)

'data.frame':   2907 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ infant_mortality: num  148 208 187 116 161 ...
 $ life_expectancy : num  47.5 36 38.3 50.3 35.2 ...
 $ fertility       : num  7.65 7.32 6.28 6.62 6.29 6.95 5.65 6.89 5.84 6.25 ...
 $ population      : num  11124892 5270844 2431620 524029 4829291 ...
 $ gdp             : num  1.38e+10 NA 6.22e+08 1.24e+08 5.97e+08 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

Upon further research into the new objects, we can surmise the reason for the ‘streaks’ in our plots is due to the grouping of country and year.

More data processing

Let’s pick only one year and see what patterns we find. We want a year for which we have the most data. You might have noticed that in africadata, there are 226 NA (i.e., missing values) for infant mortality. Write code that figures out which years have missing data for infant mortality. You should find that there is missing up to 1981 and then again for 2016. So we’ll avoid those years and go with 2000 instead. Create a new object by extracting only the data for the year 2000 from the africadata object. You should end up with 51 observations and 9 variables. Check it with str() and summary().

#Create new object to determine which years have missing data for infant mortality
missing_infant_mortality <- africadata %>% 
  filter(is.na(infant_mortality)) %>% 
  group_by(year) %>% 
  summarize(missing_count = n())

print(missing_infant_mortality)

# A tibble: 23 × 2
    year missing_count
   <int>         <int>
 1  1960            10
 2  1961            17
 3  1962            16
 4  1963            16
 5  1964            15
 6  1965            14
 7  1966            13
 8  1967            11
 9  1968            11
10  1969             7
# ℹ 13 more rows

#Create new object with data from the year 2000
year_2000 <- africadata %>% 
  filter(year=="2000")

#Verify data by using head() and str() fucntinons
head(year_2000)

       country year infant_mortality life_expectancy fertility population
1      Algeria 2000             33.9            73.3      2.51   31183658
2       Angola 2000            128.3            52.3      6.84   15058638
3        Benin 2000             89.3            57.2      5.98    6949366
4     Botswana 2000             52.4            47.6      3.41    1736579
5 Burkina Faso 2000             96.2            52.6      6.59   11607944
6      Burundi 2000             93.4            46.7      7.06    6767073
          gdp continent          region
1 54790058957    Africa Northern Africa
2  9129180361    Africa   Middle Africa
3  2254838685    Africa  Western Africa
4  5632391130    Africa Southern Africa
5  2610945549    Africa  Western Africa
6   835334807    Africa  Eastern Africa

str(year_2000)

'data.frame':   51 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 2 3 18 22 26 27 29 31 32 33 ...
 $ year            : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
 $ infant_mortality: num  33.9 128.3 89.3 52.4 96.2 ...
 $ life_expectancy : num  73.3 52.3 57.2 47.6 52.6 46.7 54.3 68.4 45.3 51.5 ...
 $ fertility       : num  2.51 6.84 5.98 3.41 6.59 7.06 5.62 3.7 5.45 7.35 ...
 $ population      : num  31183658 15058638 6949366 1736579 11607944 ...
 $ gdp             : num  5.48e+10 9.13e+09 2.25e+09 5.63e+09 2.61e+09 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 11 10 20 17 20 5 10 20 10 10 ...

More Plotting

Let’s make the same plots as above again, this time only for the year 2000. Based on those plots, there seems to still be a negative correlation between infant mortality and life expectancy, and no noticeable correlation between population size and life expectancy. Let’s apply some statistical model to this data.

#Create a plot showing life expectancy as a function of infant mortality
plot1_2000 <- ggplot(year_2000, aes(x = infant_mortality, y = life_expectancy)) +
  geom_point() +
  labs(title = "Life Expectancy vs Infant Mortality (Year 2000)",
       x = "Infant Mortality",
       y = "Life Expectancy") +
  theme_get()

# Display Plot 1
print(plot1_2000)

#Create a plot showing life expectancy as a function of population
plot2_2000 <- ggplot(year_2000, aes(x = population, y = life_expectancy)) +
  geom_point() +
  scale_x_log10() +
  labs(title = "Life Expectancy vs Population Size (Log Scale, Year 2000)",
       x = "Population (Log Scale)",
       y = "Life Expectancy") +
  theme_get()

# Display Plot 1
print(plot2_2000)

Simple model fits

Use the lm() function and fit life expectancy as the outcome, and infant mortality as the predictor. Then repeat, now with the population size as the predictor variable. (Use the data from 2000 only.) Save the result from the two fits into two objects (e.g. fit1 and fit2) and apply the summary command to both, which will print various fit results to the screen. Based on the p-values for each fit, what do you conclude? Add comments into your quarto file to explain what you did and found.

# Fit 1: Life Expectancy as the outcome and Infant Mortality as the predictor
fit1 <- lm(life_expectancy ~ infant_mortality, year_2000)

# Fit 2: Life Expectancy as the outcome and Population Size as the predictor
fit2 <- lm(life_expectancy ~ population, year_2000)

summary(fit1)


Call:
lm(formula = life_expectancy ~ infant_mortality, data = year_2000)

Residuals:
     Min       1Q   Median       3Q      Max 
-22.6651  -3.7087   0.9914   4.0408   8.6817 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      71.29331    2.42611  29.386  < 2e-16 ***
infant_mortality -0.18916    0.02869  -6.594 2.83e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.221 on 49 degrees of freedom
Multiple R-squared:  0.4701,    Adjusted R-squared:  0.4593 
F-statistic: 43.48 on 1 and 49 DF,  p-value: 2.826e-08

summary(fit2)


Call:
lm(formula = life_expectancy ~ population, data = year_2000)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.429  -4.602  -2.568   3.800  18.802 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 5.593e+01  1.468e+00  38.097   <2e-16 ***
population  2.756e-08  5.459e-08   0.505    0.616    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.524 on 49 degrees of freedom
Multiple R-squared:  0.005176,  Adjusted R-squared:  -0.01513 
F-statistic: 0.2549 on 1 and 49 DF,  p-value: 0.6159

Results

Based on the p-values for each fit:

Model 1 (Life Expectancy vs. Infant Mortality): The p-value for the infant mortality coefficient is < 0.05, indicating a statistically significant negative relationship between infant mortality and life expectancy. This means that as infant mortality increases, life expectancy decreases, which is expected.

Model 2 (Life Expectancy vs. Population Size): The p-value for the population size coefficient is > 0.05, indicating a lack of statistical significance. This suggests that there is no strong evidence of a linear relationship between population size and life expectancy based on this dataset.

These results help us understand the factors affecting life expectancy in African countries in the year 2000. While infant mortality shows a clear and significant impact, population size does not appear to have a significant linear effect.

Assignment Part 2

This section contributed by Austin Thrash - FFK221

More data exploration

Pick another dataset from dslabs, whichever one you want. Unfortunately, the dslabs package doesn't have a nice website. But you can go to its offical CRAN entryLinks to an external site. and click on Reference Manual. The pdf lists the various other datasets and for each provides a brief explanation.

Once you chose one of the datasets, write R code to go through similar steps as above. Specifically, do the following:

Explore the dataset.
Do any processing/cleaning you want to do.
Make a few exploratory figures. Optionally, also some tables.
Run some simple statistical model(s). Your choice.
Report the results from the model(s).
For each step, add plenty comments to the code and explanatory Markdown text to the Quarto file.

Explore the dataset

# Printing structure and summary of 'mice_weights' dataset

str(mice_weights)

'data.frame':   780 obs. of  7 variables:
 $ body_weight : num  27.6 23 28.7 32.6 28.6 ...
 $ bone_density: num  0.616 0.769 0.684 0.644 0.53 ...
 $ percent_fat : num  7.26 4.95 6.02 9.54 6.99 ...
 $ sex         : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ diet        : Factor w/ 2 levels "chow","hf": 1 1 1 1 1 1 1 1 1 1 ...
 $ gen         : Factor w/ 5 levels "4","7","8","9",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ litter      : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...

summary(mice_weights)

  body_weight     bone_density     percent_fat     sex       diet     gen     
 Min.   :18.13   Min.   :0.2708   Min.   : 2.552   F:398   chow:394   4 : 97  
 1st Qu.:28.09   1st Qu.:0.4888   1st Qu.: 5.566   M:382   hf  :386   7 :195  
 Median :32.98   Median :0.5643   Median : 8.276                      8 :193  
 Mean   :34.08   Mean   :0.5697   Mean   : 8.594                      9 : 97  
 3rd Qu.:39.37   3rd Qu.:0.6373   3rd Qu.:10.926                      11:198  
 Max.   :65.15   Max.   :0.9980   Max.   :22.154                              
                 NA's   :4        NA's   :4                                   
 litter 
 1:442  
 2:338

Data Processing/Cleaning

# Removing NA's and checking to see if there are any missing values, there are 8. 4 in 'bone_density' and 4 in 'percent_fat' as seen above

mice <- mice_weights %>% drop_na()

sum(is.na(mice))

[1] 0

Exploratory Figures

# Looking at distribution of sex

plot_1 <- ggplot(mice, aes(x = sex)) +
  geom_histogram(stat = "count")  +
  labs(title = "Sex Distribution",
       x = "Sex",
       y = "Count") +
  theme_get()

Warning in geom_histogram(stat = "count"): Ignoring unknown parameters:
`binwidth`, `bins`, and `pad`

print(plot_1)

# Plotting bone density as a function of body weight to identify if there is any relationship between the two

plot_2 <- ggplot(mice, aes(x = body_weight, y = bone_density)) +
  geom_point()  +
  labs(title = "Body Weight vs Bone Density",
       x = "Body Weight",
       y = "Bone Density") +
  theme_get()

print(plot_2)

# Plotting percent fat as a function of body weight to identify if there is any relationship between the two

plot_3 <- ggplot(mice, aes(x = body_weight, y = percent_fat)) +
  geom_point()  +
  labs(title = "Body Weight vs Percent Fat",
       x = "Body Weight",
       y = "Percent Fat") +
  theme_get()

print(plot_3)

As we can see from the models above, it looks as if there is a positive linear relationship between percent fat/bone density and body weight (which is to be expected). Therefore we will use these features during statistical modeling.

Statistical Models

# Fit 1: Body Weight as the outcome and Bone Density as the predictor
fit_1 <- lm(body_weight ~ bone_density, mice)

# Fit 2: Body Weight as the outcome and Percent Fat as the predictor
fit_2 <- lm(body_weight ~ percent_fat, mice)

summary(fit_1)


Call:
lm(formula = body_weight ~ bone_density, data = mice)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.0564  -5.5995  -0.8832   5.1219  29.5985 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    19.159      1.401   13.67   <2e-16 ***
bone_density   26.139      2.414   10.83   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.538 on 774 degrees of freedom
Multiple R-squared:  0.1316,    Adjusted R-squared:  0.1305 
F-statistic: 117.3 on 1 and 774 DF,  p-value: < 2.2e-16

summary(fit_2)


Call:
lm(formula = body_weight ~ percent_fat, data = mice)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.2594  -4.7686  -0.7525   4.5298  18.9356 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.51034    0.55808   38.54   <2e-16 ***
percent_fat  1.45918    0.05976   24.42   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.079 on 774 degrees of freedom
Multiple R-squared:  0.4351,    Adjusted R-squared:  0.4344 
F-statistic: 596.1 on 1 and 774 DF,  p-value: < 2.2e-16

Results

We can see from the results above that both bone density and percent fat have a statistically significant relationship with body fat; both predictors have a p-value less than 0.05. The first results show that for every one unit increase in bone_density, we can expect to see a 26.139 increase in body_weight. The second results show that for every one unit increase in percent_fat, we expect to see a 1.45918 increase in weight.