Factor score regression vs simulations

Jun 22, 2023

Let’s say we want to estimate Jerry’s IQ. We have an IQ test with a reliability of 1.00 that gave him a score of 113, and his SAT score converts to an IQ of 118. Intuitively, Jerry’s IQ is 113, because a test with a reliability of 1 measures it without error. However, if you do factor score regression in the order IQ → SAT, you would falsely estimate his IQ to be 117.
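The 117 comes from treating the IQ test score as the starting estimate and then regressing the SAT score toward it. A minimal sketch of that arithmetic, assuming an SAT g-loading of 0.8 (an assumption here, borrowed from the value used in the simulations in this post) and an SAT score already converted to the IQ scale:

```r
# Sequential update as described above. All variable names are illustrative.
iq_test  <- 113   # score from the perfectly reliable IQ test
sat_iq   <- 118   # SAT score expressed on the IQ scale
sat_load <- 0.8   # ASSUMED g-loading of the SAT (value used in the simulations)

# Start from the IQ-test estimate, then move toward the SAT score
# in proportion to the SAT's loading.
sequential_estimate <- iq_test + (sat_iq - iq_test) * sat_load
print(sequential_estimate)  # 117
```

The update ignores that the first estimate was already error-free, which is why it drifts away from 113.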

Given that factor score regression can clearly fail, we need to determine when it can be applied and when it cannot. Fortunately, this can be checked by simulation: generate a distribution of IQ, generate traits that correlate with it (e.g. income, educational attainment), and then compute the average IQ of the simulated people who have a given set of traits. The spread of IQ within that same group gives a standard error for the estimate in addition to the mean.
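The simulation approach can be sketched as a small reusable function. This is a minimal sketch, not the code used for the results in this post: the function name `conditional_g`, its arguments, and the default sample size are all hypothetical, and it assumes every trait has unit variance and depends on g only through its loading.

```r
# Hypothetical helper (not from the original post): simulate g and a set of
# traits with given loadings, then summarise g among the simulated people
# whose traits all fall within half_width of the target values.
conditional_g <- function(loadings, targets, half_width = 0.05, n = 200000) {
  g <- rnorm(n)
  keep <- rep(TRUE, n)
  for (k in seq_along(loadings)) {
    # Each trait has unit variance: loading * g plus independent noise.
    trait <- loadings[k] * g + rnorm(n) * sqrt(1 - loadings[k]^2)
    keep <- keep & abs(trait - targets[k]) < half_width
  }
  c(mean = mean(g[keep]), sd = sd(g[keep]), n = sum(keep))
}

set.seed(42)  # arbitrary seed, only for reproducibility
res <- conditional_g(loadings = 0.8, targets = 1)
print(res)
```

With several traits and extreme targets, very few simulated people pass every filter, so `n` has to grow quickly to keep the means stable.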

The summary of these cases is that factor score regression generates scores identical to the simulations if you use 1 indicator, similar to them if you use 2, and somewhat different from them if you use 3.

Case 1: one SAT score. SAT ranges from -3 to 3 SD relative to the mean.

Yes. The simulated estimates and the factor scores concord at 0.99999.

library(epiR)  # provides epi.ccc() (concordance correlation coefficient)
simu <- rep(0, 61)
fac <- rep(0, 61)
qwe <- data.frame(simu, fac)
qwe$se = 0
betas <- seq(from=-3, to=3, by=0.1)

for(i in 1:61) {
  g <- rnorm(10000)
  SAT <- 0.8*g + rnorm(10000)*0.6
  subby1 <- data.frame(g, SAT)
  lower = betas[i] - 0.05
  higher = betas[i] + 0.05
  
  subby2 <- subset(subby1, subby1$SAT > lower & subby1$SAT < higher)
  qwe$simu[i] = mean(subby2$g)
  qwe$fac[i] = betas[i]*0.8  # factor score: SAT regressed toward the mean of 0
  qwe$se[i] = sd(subby2$g)
  
}
epi.ccc(qwe$simu, qwe$fac)
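The near-perfect agreement is expected: with a single indicator, the factor score formula is just the regression of g on SAT. A quick check, assuming the same model as the simulation above (the seed and sample size are arbitrary choices):

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100000
g <- rnorm(n)
SAT <- 0.8 * g + rnorm(n) * 0.6  # same model as in the simulation above

# Regress g on SAT: the slope should be close to the loading (0.8), and the
# residual SD close to sqrt(1 - 0.8^2) = 0.6.
fit <- lm(g ~ SAT)
slope <- coef(fit)[["SAT"]]
resid_sd <- sd(residuals(fit))
print(c(slope = slope, resid_sd = resid_sd))
```

The residual SD is the quantity the `se` column approximates within each SAT window.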

Case 2: one SAT score and race (Black mean set at 85, Jewish mean set at 110, etc.)

Yes. The simulated estimates and the factor scores still concord at 0.996 in a population whose average IQ is 0.7 SD above the mean. SAT ranges from -3 to 3 SD relative to the mean.

library(epiR)  # provides epi.ccc() (concordance correlation coefficient)
simu <- rep(0, 61)
fac <- rep(0, 61)
qwe <- data.frame(simu, fac)
qwe$se = 0
betas <- seq(from=-3, to=3, by=0.1)

for(i in 1:61) {
  g <- rnorm(10000, mean=0.7)
  SAT <- 0.8*g + rnorm(10000)*0.6
  
  subby1 <- data.frame(g, SAT)
  lower = betas[i] - 0.05
  higher = betas[i] + 0.05
  
  subby2 <- subset(subby1, subby1$SAT > lower & subby1$SAT < higher)
  qwe$simu[i] = mean(subby2$g)
  qwe$fac[i] = (betas[i]-0.7)*0.8 + 0.7  # factor score: SAT regressed toward the group mean of 0.7
  qwe$se[i] = sd(subby2$g)
  
}
epi.ccc(qwe$simu, qwe$fac)
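One likely reason the concordance dips slightly here, offered as a reading of the code above rather than a verified explanation: the simulated SAT has a mean of 0.8 × 0.7 = 0.56, while the factor score formula centers SAT at 0.7. A minimal sketch comparing the two centerings (variable names are illustrative):

```r
loading <- 0.8        # SAT loading on g, as in the simulation
group_mean_g <- 0.7   # group mean of g, as in the simulation
sat <- seq(from = -3, to = 3, by = 0.1)

# Factor score formula from the post: SAT is centered at 0.7.
fac <- (sat - group_mean_g) * loading + group_mean_g

# Regression of g on SAT within the simulated group: SAT is centered at its
# simulated mean, loading * group_mean_g = 0.56.
sat_mean <- loading * group_mean_g
exact <- group_mean_g + loading * (sat - sat_mean)

gap <- exact - fac
print(range(gap))
```

Under these assumptions the two lines are parallel and differ by a constant 0.112 SD, which is small next to the 6 SD range of SAT and so costs little concordance.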

Case 3: SAT score + income → income is regressed first, then SAT. Income is always 2 SD above the mean to make the simulation simpler. SAT ranges from -2 to 2 SD relative to the mean.

Somewhat acceptable concordance of 0.98, with a mean difference (delta) of 0.13 SD, because the factor scores underestimate IQ.

library(epiR)  # provides epi.ccc() (concordance correlation coefficient)
simu <- rep(0, 41)
fac <- rep(0, 41)
qwe <- data.frame(simu, fac)
qwe$se = 0
betas <- seq(from=-2, to=2, by=0.1)

for(i in 1:41) {
  g <- rnorm(1000000, mean=0)
  SAT <- 0.8*g + rnorm(1000000)*0.6
  inc <- 0.35*g + rnorm(1000000)*0.93674
  
  subby1 <- data.frame(g, SAT)
  subby1$inc = inc
  lower = betas[i] - 0.05
  higher = betas[i] + 0.05
  
  subby2 <- subset(subby1, (subby1$SAT > lower & subby1$SAT < higher) & (subby1$inc > 1.95 & subby1$inc < 2.05))
  qwe$simu[i] = mean(subby2$g)
  qwe$fac[i] = 0.7 + (betas[i]-0.7)*0.8  # income of 2 SD gives 2*0.35 = 0.7; SAT is then regressed toward 0.7
  qwe$se[i] = sd(subby2$g)
  
}
epi.ccc(qwe$simu, qwe$fac)
qwe
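One way to see where the gap comes from: in the simulated model, g and both indicators are jointly normal, so the expected g given SAT and income is a linear regression on both at once. A minimal sketch, assuming that model and rounding the income variance to 1 (in the simulation it is 0.35² + 0.93674² ≈ 1); variable names are illustrative:

```r
# Loadings of each indicator on g (values taken from the simulation above).
loadings <- c(SAT = 0.8, inc = 0.35)

# Implied correlation matrix of the indicators under a one-factor model:
# off-diagonals are products of loadings, diagonals are 1 (assumed).
Sigma <- outer(loadings, loadings)
diag(Sigma) <- 1

# Regression weights for predicting g from both indicators at once.
weights <- setNames(as.vector(solve(Sigma, loadings)), names(loadings))

sat <- seq(from = -2, to = 2, by = 0.1)
joint <- weights[["SAT"]] * sat + weights[["inc"]] * 2  # income fixed at 2 SD
sequential <- 0.7 + (sat - 0.7) * 0.8                    # formula used in the post

mid <- which.min(abs(sat))  # index of SAT = 0
gap_at_zero <- joint[mid] - sequential[mid]
print(round(c(weights, gap_at_zero = gap_at_zero), 3))
```

Under these assumptions the joint weights are about 0.76 for SAT and 0.14 for income, and at an average SAT the sequential formula comes out about 0.13 SD lower, which is in line with the delta reported above.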

Case 3 (income at 4 SD): SAT score + income → income regressed first, then SAT. Income is always 4 SD above the mean to make the simulation simpler. SAT ranges from -2 to 2 SD relative to the mean.

Now it’s starting to get worse. Concordance of only 0.96, with the factor score method generating estimates that are too low:

          simu   fac        se
 1         NaN -1.32        NA
 2         NaN -1.24        NA
 3         NaN -1.16        NA
 4 -0.48644140 -1.08        NA
 5 -1.12832229 -1.00        NA
 6 -1.36321700 -0.92        NA
 7 -0.27098809 -0.84        NA
 8 -0.47041321 -0.76 0.8066435
 9 -0.42014081 -0.68 0.4891348
10  0.03035648 -0.60 0.1412493
11 -0.20208548 -0.52 0.7536335
12 -0.16769194 -0.44 0.5794042
13  0.13132777 -0.36 0.5669211
14 -0.02122053 -0.28 0.6106637
15 -0.15125467 -0.20 0.7449515
16 -0.03591396 -0.12 0.4325467
17  0.25970717 -0.04 0.4563195
18  0.27417868  0.04 0.4320809
19  0.25399757  0.12 0.6661855
20  0.51768787  0.20 0.5925620
21  0.60862911  0.28 0.6266576
22  0.54326775  0.36 0.6502169
23  0.67199985  0.44 0.5359795
24  0.77977855  0.52 0.6036107
25  0.99037592  0.60 0.5659904
26  1.14622026  0.68 0.5403669
27  0.87258937  0.76 0.5607741
28  1.17781746  0.84 0.5376491
29  1.07216502  0.92 0.5364575
30  1.22595049  1.00 0.6634900
31  1.38507905  1.08 0.4525702
32  1.49358776  1.16 0.5759996
33  1.44870443  1.24 0.5755129
34  1.59389938  1.32 0.5899371
35  1.71151076  1.40 0.5651431
36  1.61588737  1.48 0.7290326
37  1.85292791  1.56 0.6976653
38  1.93692609  1.64 0.5745176
39  1.73194444  1.72 0.5094257
40  1.91957242  1.80 0.5538069
41  2.18908287  1.88 0.5238415

library(epiR)  # provides epi.ccc() (concordance correlation coefficient)
simu <- rep(0, 41)
fac <- rep(0, 41)
qwe <- data.frame(simu, fac)
qwe$se = 0
betas <- seq(from=-2, to=2, by=0.1)

for(i in 1:41) {
  g <- rnorm(20000000, mean=0)
  SAT <- 0.8*g + rnorm(20000000)*0.6
  inc <- 0.35*g + rnorm(20000000)*0.93674
  
  subby1 <- data.frame(g, SAT)
  subby1$inc = inc
  lower = betas[i] - 0.05
  higher = betas[i] + 0.05
  
  subby2 <- subset(subby1, (subby1$SAT > lower & subby1$SAT < higher) & (subby1$inc > 3.8 & subby1$inc < 4.2))
  qwe$simu[i] = mean(subby2$g)
  qwe$fac[i] = 1.4 + (betas[i]-1.4)*0.8  # income of 4 SD gives 4*0.35 = 1.4; SAT is then regressed toward 1.4
  qwe$se[i] = sd(subby2$g)
  
}
epi.ccc(qwe$simu, qwe$fac)

Case 4: SAT score + skill in game + income → income regressed first, then skill, then SAT. Income is always 2 SD above the mean to make the simulation simpler. Skill in game is set at 3 SD above the mean in every simulation. SAT ranges from -2 to 2 SD relative to the mean.

It comes tumbling down. Concordance of 0.88, with the factor score regression method underestimating scores, especially at lower SAT scores.

          simu    fac        se
 1         NaN -1.253        NA
 2         NaN -1.173        NA
 3 -0.13084961 -1.093        NA
 4         NaN -1.013        NA
 5         NaN -0.933        NA
 6 -0.86851849 -0.853 0.5015578
 7  0.39011515 -0.773        NA
 8         NaN -0.693        NA
 9 -0.36473355 -0.613 0.4521529
10  0.39538262 -0.533 0.3944783
11  0.01679933 -0.453 0.6589419
12  0.23557488 -0.373 0.4834243
13  0.37234340 -0.293 0.5122159
14  0.18675995 -0.213 0.7202369
15  0.48873863 -0.133 0.6132729
16  0.25970710 -0.053 0.4374590
17  0.46679357  0.027 0.6584412
18  0.64776243  0.107 0.6150856
19  0.54753745  0.187 0.5058257
20  0.63397843  0.267 0.5435534
21  0.70424561  0.347 0.5772798
22  0.85121176  0.427 0.5412052
23  0.91588284  0.507 0.5714374
24  0.96585050  0.587 0.5696521
25  1.06461351  0.667 0.5615047
26  1.12121771  0.747 0.5548811
27  1.15041147  0.827 0.5634120
28  1.22498621  0.907 0.5870118
29  1.28158277  0.987 0.6009292
30  1.41739135  1.067 0.5826199
31  1.47757181  1.147 0.5760538
32  1.58126221  1.227 0.5699387
33  1.63804572  1.307 0.5855638
34  1.66971712  1.387 0.5148785
35  1.75406733  1.467 0.5504398
36  1.76784068  1.547 0.6016072
37  1.87247347  1.627 0.5555837
38  1.98934159  1.707 0.5948008
39  1.99170886  1.787 0.5315359
40  2.09278895  1.867 0.5772159
41  2.15102625  1.947 0.5868213

library(epiR)  # provides epi.ccc() (concordance correlation coefficient)
simu <- rep(0, 41)
fac <- rep(0, 41)
qwe <- data.frame(simu, fac)
qwe$se = 0
betas <- seq(from=-2, to=2, by=0.1)

for(i in 1:41) {
  g <- rnorm(30000000, mean=0)
  SAT <- 0.8*g + rnorm(30000000)*0.6
  inc <- 0.35*g + rnorm(30000000)*0.93674
  skill <- 0.45*g + rnorm(30000000)*0.893028
  
  subby1 <- data.frame(g, SAT)
  subby1$inc = inc
  subby1$skill = skill
  
  lower = betas[i] - 0.05
  higher = betas[i] + 0.05
  
  subby2 <- subset(subby1, (subby1$SAT > lower & subby1$SAT < higher) & (subby1$skill > 2.7 & subby1$skill < 3.3) & (subby1$inc > 1.7 & subby1$inc < 2.3))
  qwe$simu[i] = mean(subby2$g)
  qwe$fac[i] = 1.735 + (betas[i]-1.735)*0.8  # income (2 SD) then skill (3 SD): 0.7 + (3-0.7)*0.45 = 1.735; SAT is regressed toward that
  qwe$se[i] = sd(subby2$g)
  
}
epi.ccc(qwe$simu, qwe$fac)
qwe
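Part of the problem with three indicators is that the sequential method depends on the order in which scores are regressed, while a joint regression on all indicators does not. A minimal sketch for a SAT of 1 SD with income at 2 SD and skill at 3 SD, assuming the one-factor model from the simulation with unit-variance indicators; the function names are hypothetical:

```r
# Sequential factor score regression: start at the population mean (0) and
# regress each new score toward the running estimate, weighted by its loading.
sequential_estimate <- function(scores, loadings) {
  est <- 0
  for (k in seq_along(scores)) {
    est <- est + (scores[[k]] - est) * loadings[[k]]
  }
  est
}

# Joint linear regression of g on all indicators at once (order-free).
joint_estimate <- function(scores, loadings) {
  Sigma <- outer(loadings, loadings)  # implied correlations between indicators
  diag(Sigma) <- 1                    # assumed unit variances
  weights <- as.vector(solve(Sigma, loadings))
  sum(weights * scores)
}

loadings <- c(inc = 0.35, skill = 0.45, SAT = 0.8)
scores   <- c(inc = 2,    skill = 3,    SAT = 1)

income_first <- sequential_estimate(scores, loadings)            # order used in the post
sat_first    <- sequential_estimate(rev(scores), rev(loadings))  # reversed order
joint        <- joint_estimate(scores, loadings)
print(round(c(income_first = income_first, sat_first = sat_first, joint = joint), 3))
```

Under these assumptions the income-first order gives 1.147 (the `fac` value in the table at SAT = 1), the reversed order gives about 1.86, and the joint regression gives about 1.49, close to the simulated 1.48 in the same row.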
