Chapter 8 07-Wage Predictors

8.1 Intro

Today our analysis will help to determine whether demographic, age, marital status, employment, and health characteristics can be used to predict whether someone earns a (comparably) higher or lower wage. Creating a binary wage outcome based on the median wage with these different variables, conducting classical statistical tests (t-tests, ANOVA, and chi-square), and building a logistic regression model will help us draw conclusions on what factors can predict one’s income.

8.2

data("Wage")


str(Wage)

## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

wage_median <- median(Wage$wage, na.rm = TRUE)
wage_median

## [1] 104.9215

Wage <- Wage %>%
  mutate(
    WageCategory = case_when(
      wage > wage_median ~ "High",
      wage < wage_median ~ "Low",
      TRUE ~ NA_character_    # rows exactly equal to median
    )
  )

Wage <- Wage %>% filter(!is.na(WageCategory))

Wage$WageCategory <- factor(Wage$WageCategory, levels = c("Low", "High"))

knitr::kable(table(Wage$WageCategory),
             caption = "Number of observations classified as low versus high wage earners.")

Table 7.1: Number of observations classified as low versus high wage earners.
Var1	Freq
Low	1447
High	1483

clean_factor <- function(x) {
  x <- as.character(x)
  x <- str_replace(x, "^[0-9]+\\.\\s*", "")
  factor(x)
}

Wage <- Wage %>%
  mutate(
    race       = clean_factor(race),
    education  = clean_factor(education),
    jobclass   = clean_factor(jobclass),
    health     = clean_factor(health),
    health_ins = clean_factor(health_ins),
    maritl     = clean_factor(maritl)
  )

summary(Wage[, c("race", "education", "jobclass", "health", "health_ins", "maritl")])

##     race                education          jobclass            health    
##  Asian: 187   < HS Grad      :264   Industrial :1507   <=Good     : 841  
##  Black: 280   Advanced Degree:423   Information:1423   >=Very Good:2089  
##  Other:  36   College Grad   :661                                        
##  White:2427   HS Grad        :949                                        
##               Some College   :633                                        
##  health_ins           maritl    
##  No : 901   Divorced     : 200  
##  Yes:2029   Married      :2019  
##             Never Married: 639  
##             Separated    :  55  
##             Widowed      :  17

8.3

Wage %>%
  group_by(WageCategory) %>%
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    sd_age   = sd(age, na.rm = TRUE),
    n        = n()
  ) %>%
  knitr::kable(
    caption = "Mean age and standard deviation by wage category."
  )

Table 8.1: Mean age and standard deviation by wage category.
WageCategory	mean_age	sd_age	n
Low	40.01106	12.669373	1447
High	44.68510	9.820245	1483

age_ttest <- t.test(age ~ WageCategory, data = Wage)
age_ttest

## 
##  Welch Two Sample t-test
## 
## data:  age by WageCategory
## t = -11.143, df = 2724.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
##  -5.496555 -3.851525
## sample estimates:
##  mean in group Low mean in group High 
##           40.01106           44.68510

A Welch two-sample t-test compared the mean age of “high” vs. “low” income levels. *Results showed a significant difference, t(2724.7) = −11.14, p < .001.

Mean age (Low): 40.01 years

Mean age (High): 44.69 years

Mean difference: ≈ 4.7 years

8.4

Interpretation

We can interpret these findings as higher income earners are on average older than lower income earners. Age is positively associated with income.

wage_edu_aov <- aov(wage ~ education, data = Wage)
summary(wage_edu_aov)

##               Df  Sum Sq Mean Sq F value Pr(>F)    
## education      4 1242703  310676   228.5 <2e-16 ***
## Residuals   2925 3976086    1359                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An ANOVA tested whether income differs by education level:

F(4, 2925) = 228.5, p < .001

We can interpret these findings as a strong effect on education, meaning the highest wages made the most money, compared to lowest incomes(across the five income brackets). As education level went up, so did income/ wage level.

8.5

wage_maritl_table <- table(Wage$WageCategory, Wage$maritl)
knitr::kable(
  wage_maritl_table,
  caption = "Contingency table of wage category by marital status."
)

Table 8.2: Contingency table of wage category by marital status.
	Divorced	Married	Never Married	Separated	Widowed
Low	114	817	470	37	9
High	86	1202	169	18	8

chisq_wage_maritl <- chisq.test(wage_maritl_table)
chisq_wage_maritl

## 
##  Pearson's Chi-squared test
## 
## data:  wage_maritl_table
## X-squared = 225.33, df = 4, p-value < 2.2e-16

chisq_wage_maritl$expected

##       
##         Divorced   Married Never Married Separated  Widowed
##   Low   98.77133  997.0966      315.5744  27.16212 8.395563
##   High 101.22867 1021.9034      323.4256  27.83788 8.604437

cramer_v <- CramerV(wage_maritl_table)
cramer_v

## [1] 0.2773195

A chi-square test evaluated whether wage category is associated with marital status:

χ²(4) = 225.33, p < .001

Cramer’s V = 0.277

We can interpret the chi-square test with a moderate association between marital status and wage category. However, married participants are over-represented in the high wage group and un-married individuals are over-represented in the low wage group, which can skew the data inaccurately.

set.seed(123)
split_flag <- sample.split(Wage$WageCategory, SplitRatio = 0.7)

train_data <- subset(Wage, split_flag == TRUE)
test_data  <- subset(Wage, split_flag == FALSE)

nrow(train_data); nrow(test_data)

## [1] 2051

## [1] 879

8.6

logit_model <- glm(
  WageCategory ~ age + education + jobclass + maritl + health + health_ins,
  data = train_data,
  family = binomial
)

summary(logit_model)

## 
## Call:
## glm(formula = WageCategory ~ age + education + jobclass + maritl + 
##     health + health_ins, family = binomial, data = train_data)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -3.906039   0.410689  -9.511  < 2e-16 ***
## age                       0.016967   0.005301   3.201  0.00137 ** 
## educationAdvanced Degree  2.878189   0.272729  10.553  < 2e-16 ***
## educationCollege Grad     2.356742   0.243963   9.660  < 2e-16 ***
## educationHS Grad          0.654597   0.231391   2.829  0.00467 ** 
## educationSome College     1.399887   0.238876   5.860 4.62e-09 ***
## jobclassInformation       0.128700   0.110463   1.165  0.24398    
## maritlMarried             0.816820   0.209849   3.892 9.92e-05 ***
## maritlNever Married      -0.480261   0.245800  -1.954  0.05072 .  
## maritlSeparated          -0.213755   0.474737  -0.450  0.65252    
## maritlWidowed             0.191127   0.599294   0.319  0.74979    
## health>=Very Good         0.325054   0.121269   2.680  0.00735 ** 
## health_insYes             1.372473   0.123232  11.137  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2843.0  on 2050  degrees of freedom
## Residual deviance: 2115.4  on 2038  degrees of freedom
## AIC: 2141.4
## 
## Number of Fisher Scoring iterations: 4

odds_ratios <- exp(coef(logit_model))
odds_ratios

##              (Intercept)                      age educationAdvanced Degree 
##               0.02012005               1.01711169              17.78204218 
##    educationCollege Grad         educationHS Grad    educationSome College 
##              10.55650582               1.92436665               4.05474320 
##      jobclassInformation            maritlMarried      maritlNever Married 
##               1.13734877               2.26329134               0.61862205 
##          maritlSeparated            maritlWidowed        health>=Very Good 
##               0.80754605               1.21061266               1.38410575 
##            health_insYes 
##               3.94509346

conf_int <- exp(confint(logit_model))

## Waiting for profiling to be done...

conf_int

##                                 2.5 %      97.5 %
## (Intercept)               0.008880154  0.04447506
## age                       1.006623973  1.02776935
## educationAdvanced Degree 10.544345226 30.76750089
## educationCollege Grad     6.618263517 17.25899258
## educationHS Grad          1.236224160  3.07016179
## educationSome College     2.566408373  6.56196069
## jobclassInformation       0.915418699  1.41171317
## maritlMarried             1.503694264  3.42685266
## maritlNever Married       0.382337937  1.00301632
## maritlSeparated           0.305655850  1.99075661
## maritlWidowed             0.368417266  3.95933760
## health>=Very Good         1.091532922  1.75622946
## health_insYes             3.104800954  5.03437155

test_probs <- predict(logit_model, newdata = test_data, type = "response")

test_pred_class <- ifelse(test_probs > 0.5, "High", "Low")
test_pred_class <- factor(test_pred_class, levels = c("Low", "High"))

conf_mat <- confusionMatrix(
  data      = test_pred_class,
  reference = test_data$WageCategory,
  positive  = "High"
)
conf_mat

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Low High
##       Low  304  119
##       High 130  326
##                                           
##                Accuracy : 0.7167          
##                  95% CI : (0.6857, 0.7463)
##     No Information Rate : 0.5063          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.4332          
##                                           
##  Mcnemar's Test P-Value : 0.5263          
##                                           
##             Sensitivity : 0.7326          
##             Specificity : 0.7005          
##          Pos Pred Value : 0.7149          
##          Neg Pred Value : 0.7187          
##              Prevalence : 0.5063          
##          Detection Rate : 0.3709          
##    Detection Prevalence : 0.5188          
##       Balanced Accuracy : 0.7165          
##                                           
##        'Positive' Class : High            
##

8.7

Model classification performance on the test set

roc_obj <- roc(response = test_data$WageCategory,
               predictor = test_probs,
               levels = c("Low", "High"))  # "High" is the event

## Setting direction: controls < cases

plot(roc_obj, main = "ROC Curve for WageCategory Model")

Figure 7.2: high versus low wage earners

auc_value <- auc(roc_obj)
auc_value

## Area under the curve: 0.8063

8.8

Our analysis shows that high earners were older on average, and education had one of the strongest effects. Incomes increased with each higher level of education. Married workers seemed more likely to be high earners and never-married workers seemed more likely to fall into the low-wage group, but the representations was not equally distributed.The logistic regression also confirmed these patterns. Education was the biggest predictor, with people with advanced degrees having higher odds of being high earners. Age, being married, better health, and having health insurance also increased the likelihood of earning a higher wage, while job class didn’t add much once other factors were included.

The model performed well on new data, with an accuracy of about 71.7%, which is much better than guessing. If we were to continue this analysis, it would be interesting to adjust some variables and explore additional factors like childhood family income, parental involvement, family structure, or early life stability(such as parental divorce, death in family, etc). These kinds of variables might help explain how early experiences shape income later in adulthood. Overall, this project highlights the characteristics most connected to current wage levels and points toward meaningful directions for future research on how upbringing might influence economic outcomes.