Factor score regression vs simulations
Let’s say we want to estimate Jerry’s IQ. He scored 113 on an IQ test with a reliability of 1.00, and he got the equivalent of 118 on the SAT. Intuitively, Jerry’s IQ is obviously 113, because a test with a reliability of 1 measures it without error. However, if you do factor score regression in the order IQ → SAT, you would falsely estimate his IQ to be 117.
Since factor score regression can clearly fail, we need to determine when it can be applied and when it cannot. Fortunately, this can be checked by simulation: generate a distribution of IQ, generate traits that correlate with it (e.g. income, educational attainment), and then take the average IQ of the simulated people who have a given set of traits. The same subset also gives a standard error for the estimate (the spread of IQ among those people), not just the mean.
In summary: factor score regression generates scores essentially identical to the simulations if you use 1 indicator, similar to them if you use 2, and somewhat different from them if you use 3.
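The arithmetic behind the 117 can be reproduced with a short sketch. It assumes the sequential update that the code later in this post implies (new estimate = current estimate + loading × (score − current estimate)), a starting point at the population mean of 100, and an SAT loading of 0.8. `sequential_estimate` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: fold indicators in one at a time, each time moving the
# running estimate toward the new score by that indicator's loading.
sequential_estimate <- function(scores, loadings, start = 100) {
  est <- start
  for (k in seq_along(scores)) {
    est <- est + loadings[k] * (scores[k] - est)
  }
  est
}

# Jerry: IQ test first (loading 1.0, score 113), then SAT (assumed loading 0.8, score 118).
jerry_iq_first <- sequential_estimate(c(113, 118), c(1.0, 0.8))

# Reversed order: the perfectly reliable IQ test is applied last.
jerry_sat_first <- sequential_estimate(c(118, 113), c(0.8, 1.0))

print(c(iq_first = jerry_iq_first, sat_first = jerry_sat_first))
```

Under these assumptions the estimate depends on the order in which the indicators are entered, which is the failure the example describes: only the order that ends with the perfectly reliable test recovers 113.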
Case 1: one SAT score. SAT ranges from -3 to +3 SD relative to the mean.
Yes, it works. The simulated means and the factor scores have a concordance of 0.99999.
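library(epiR)  # provides epi.ccc() (Lin's concordance correlation coefficient)

simu <- rep(0, 61)
fac <- rep(0, 61)
qwe <- data.frame(simu, fac)
qwe$se <- 0
betas <- seq(from=-3, to=3, by=0.1)  # SAT z-scores to evaluate
for(i in 1:61) {
  # g is latent IQ in SD units; SAT loads on g at 0.8 and has unit variance
  g <- rnorm(10000)
  SAT <- 0.8*g + rnorm(10000)*0.6
  subby1 <- data.frame(g, SAT)
  # keep the simulated people whose SAT is within 0.05 SD of the target
  lower <- betas[i] - 0.05
  higher <- betas[i] + 0.05
  subby2 <- subset(subby1, subby1$SAT > lower & subby1$SAT < higher)
  qwe$simu[i] <- mean(subby2$g)  # simulation estimate
  qwe$fac[i] <- betas[i]*0.8     # factor score estimate
  qwe$se[i] <- sd(subby2$g)      # spread of g among people in the bin
}
epi.ccc(qwe$simu, qwe$fac)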
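One way to see why Case 1 agrees so closely, assuming the model exactly as coded (standard normal g, independent standard normal noise): for jointly normal variables the conditional mean and spread have closed forms.

```latex
\begin{aligned}
\mathrm{Var}(\mathrm{SAT}) &= 0.8^2 + 0.6^2 = 1, \qquad \mathrm{Cov}(g, \mathrm{SAT}) = 0.8,\\
\mathbb{E}[g \mid \mathrm{SAT} = s] &= \frac{\mathrm{Cov}(g, \mathrm{SAT})}{\mathrm{Var}(\mathrm{SAT})}\, s = 0.8\,s,\\
\mathrm{SD}(g \mid \mathrm{SAT} = s) &= \sqrt{1 - 0.8^2} = 0.6 .
\end{aligned}
```

So with a single indicator the factor score `betas[i]*0.8` is the exact conditional mean, and the `se` column should hover around 0.6 wherever a bin holds enough simulated people.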
Case 2: one SAT score and race (Black mean set at 85, Jewish mean set at 110, etc.)
Yes. The simulated means and the factor scores still have a concordance of 0.996 in a population whose average IQ is 0.7 SD above the general mean. SAT again ranges from -3 to +3 SD.
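library(epiR)  # provides epi.ccc()

simu <- rep(0, 61)
fac <- rep(0, 61)
qwe <- data.frame(simu, fac)
qwe$se <- 0
betas <- seq(from=-3, to=3, by=0.1)  # SAT z-scores to evaluate
for(i in 1:61) {
  # this group's latent IQ is centred 0.7 SD above the general mean
  g <- rnorm(10000, mean=0.7)
  SAT <- 0.8*g + rnorm(10000)*0.6
  subby1 <- data.frame(g, SAT)
  # keep the simulated people whose SAT is within 0.05 SD of the target
  lower <- betas[i] - 0.05
  higher <- betas[i] + 0.05
  subby2 <- subset(subby1, subby1$SAT > lower & subby1$SAT < higher)
  qwe$simu[i] <- mean(subby2$g)             # simulation estimate
  qwe$fac[i] <- (betas[i]-0.7)*0.8 + 0.7    # factor score: regress SAT toward the group mean
  qwe$se[i] <- sd(subby2$g)                 # spread of g among people in the bin
}
epi.ccc(qwe$simu, qwe$fac)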
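A possible reading of why concordance dips slightly in Case 2, sketched under the assumption that the simulation model is taken literally (g ~ N(0.7, 1), SAT = 0.8·g + noise): in that model the group’s mean SAT is 0.8 × 0.7 = 0.56 rather than 0.7, so the exact conditional mean and the factor score formula differ by a constant. The variable names below are illustrative.

```r
# Model assumed from the Case 2 simulation: g ~ N(0.7, 1), SAT = 0.8*g + 0.6*noise.
group_mean <- 0.7
loading <- 0.8
sat <- seq(from = -3, to = 3, by = 0.1)

# Under this model the group's mean SAT is loading * group_mean, not group_mean.
sat_mean <- loading * group_mean

# Exact conditional mean of g given SAT (Cov(g, SAT) = 0.8, Var(SAT) = 1).
exact <- group_mean + loading * (sat - sat_mean)

# Factor score formula used in the Case 2 code: regress SAT toward the group mean of g.
fac <- group_mean + loading * (sat - group_mean)

# The two lines are parallel, so the gap is the same at every SAT value.
gap <- exact - fac
print(range(gap))
```

If this reading is right, the gap is a constant of about 0.11 SD at every SAT value. That is small relative to the 6 SD range of scores, which would be consistent with concordance staying very high while no longer being essentially perfect.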
Case 3: SAT score + income → income is regressed first, then SAT. Income is fixed at 2 SD above the mean to keep the simulation simple. SAT ranges from -2 to +2 SD relative to the mean.
Somewhat acceptable: concordance of 0.98, with a delta of 0.13, because the factor scores underestimate IQ.
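library(epiR)  # provides epi.ccc()

simu <- rep(0, 41)
fac <- rep(0, 41)
qwe <- data.frame(simu, fac)
qwe$se <- 0
betas <- seq(from=-2, to=2, by=0.1)  # SAT z-scores to evaluate
for(i in 1:41) {
  # SAT loads on g at 0.8, income at 0.35; both have unit variance
  g <- rnorm(1000000, mean=0)
  SAT <- 0.8*g + rnorm(1000000)*0.6
  inc <- 0.35*g + rnorm(1000000)*0.93674
  subby1 <- data.frame(g, SAT)
  subby1$inc <- inc
  # keep people whose SAT is within 0.05 SD of the target and whose income is about 2 SD
  lower <- betas[i] - 0.05
  higher <- betas[i] + 0.05
  subby2 <- subset(subby1, (subby1$SAT > lower & subby1$SAT < higher) & (subby1$inc > 1.95 & subby1$inc < 2.05))
  qwe$simu[i] <- mean(subby2$g)            # simulation estimate
  qwe$fac[i] <- 0.7 + (betas[i]-0.7)*0.8   # income first (2*0.35 = 0.7), then SAT regressed toward it
  qwe$se[i] <- sd(subby2$g)                # spread of g among people in the bin
}
epi.ccc(qwe$simu, qwe$fac)
qwe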
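For comparison, the simulated model in Case 3 also has a closed-form answer. The sketch below assumes the loadings from the simulation code (0.8 for SAT, 0.35 for income), unit-variance indicators, and joint normality. It uses the standard multivariate-normal regression weights, which is a swapped-in technique for illustration and not the post’s factor score method.

```r
# Loadings taken from the Case 3 simulation: SAT = 0.8*g + ..., inc = 0.35*g + ...
lambda <- c(0.8, 0.35)

# Implied correlation matrix of the indicators: off-diagonals are products of loadings.
R <- outer(lambda, lambda)
diag(R) <- 1

# Regression weights for E[g | SAT, inc] in a standardized multivariate normal model.
w <- as.numeric(solve(R, lambda))

sat <- seq(from = -2, to = 2, by = 0.1)
income <- 2
exact <- w[1] * sat + w[2] * income

# Sequential factor score formula used in the Case 3 code.
fac <- 0.7 + (sat - 0.7) * 0.8

print(round(w, 4))
print(round(mean(exact - fac), 3))
```

Under these assumptions the average gap between the exact conditional mean and the sequential formula comes out at roughly 0.13 SD, in line with the delta reported for this case.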
Case 3 (income at 4 SD): SAT score + income → income is regressed first, then SAT. Income is fixed at 4 SD above the mean to keep the simulation simple. SAT ranges from -2 to +2 SD relative to the mean.
Now it’s starting to get worse. Concordance of only 0.96, with the factor score method generating estimates that are too low. Rows with NaN or NA are bins that contained too few simulated people to compute a mean or a standard deviation:
simu fac se
1 NaN -1.32 NA
2 NaN -1.24 NA
3 NaN -1.16 NA
4 -0.48644140 -1.08 NA
5 -1.12832229 -1.00 NA
6 -1.36321700 -0.92 NA
7 -0.27098809 -0.84 NA
8 -0.47041321 -0.76 0.8066435
9 -0.42014081 -0.68 0.4891348
10 0.03035648 -0.60 0.1412493
11 -0.20208548 -0.52 0.7536335
12 -0.16769194 -0.44 0.5794042
13 0.13132777 -0.36 0.5669211
14 -0.02122053 -0.28 0.6106637
15 -0.15125467 -0.20 0.7449515
16 -0.03591396 -0.12 0.4325467
17 0.25970717 -0.04 0.4563195
18 0.27417868 0.04 0.4320809
19 0.25399757 0.12 0.6661855
20 0.51768787 0.20 0.5925620
21 0.60862911 0.28 0.6266576
22 0.54326775 0.36 0.6502169
23 0.67199985 0.44 0.5359795
24 0.77977855 0.52 0.6036107
25 0.99037592 0.60 0.5659904
26 1.14622026 0.68 0.5403669
27 0.87258937 0.76 0.5607741
28 1.17781746 0.84 0.5376491
29 1.07216502 0.92 0.5364575
30 1.22595049 1.00 0.6634900
31 1.38507905 1.08 0.4525702
32 1.49358776 1.16 0.5759996
33 1.44870443 1.24 0.5755129
34 1.59389938 1.32 0.5899371
35 1.71151076 1.40 0.5651431
36 1.61588737 1.48 0.7290326
37 1.85292791 1.56 0.6976653
38 1.93692609 1.64 0.5745176
39 1.73194444 1.72 0.5094257
40 1.91957242 1.80 0.5538069
41 2.18908287 1.88 0.5238415
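library(epiR)  # provides epi.ccc()

simu <- rep(0, 41)
fac <- rep(0, 41)
qwe <- data.frame(simu, fac)
qwe$se <- 0
betas <- seq(from=-2, to=2, by=0.1)  # SAT z-scores to evaluate
for(i in 1:41) {
  # 20 million draws per iteration: income near 4 SD is extremely rare
  g <- rnorm(20000000, mean=0)
  SAT <- 0.8*g + rnorm(20000000)*0.6
  inc <- 0.35*g + rnorm(20000000)*0.93674
  subby1 <- data.frame(g, SAT)
  subby1$inc <- inc
  # SAT within 0.05 SD of the target; income bin widened to 3.8-4.2 SD
  lower <- betas[i] - 0.05
  higher <- betas[i] + 0.05
  subby2 <- subset(subby1, (subby1$SAT > lower & subby1$SAT < higher) & (subby1$inc > 3.8 & subby1$inc < 4.2))
  qwe$simu[i] <- mean(subby2$g)            # simulation estimate
  qwe$fac[i] <- 1.4 + (betas[i]-1.4)*0.8   # income first (4*0.35 = 1.4), then SAT regressed toward it
  qwe$se[i] <- sd(subby2$g)                # spread of g among people in the bin
}
epi.ccc(qwe$simu, qwe$fac)
qwe  # prints the table shown above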
Case 4: SAT score + skill in game + income → income is regressed first, then skill, then SAT. Income is fixed at 2 SD above the mean to keep the simulation simple, and skill in game is fixed at 3 SD above the mean in every simulation. SAT ranges from -2 to +2 SD relative to the mean.
It comes tumbling down. Concordance of 0.88, with the factor score regression method underestimating scores, especially at lower SAT scores.
simu fac se
1 NaN -1.253 NA
2 NaN -1.173 NA
3 -0.13084961 -1.093 NA
4 NaN -1.013 NA
5 NaN -0.933 NA
6 -0.86851849 -0.853 0.5015578
7 0.39011515 -0.773 NA
8 NaN -0.693 NA
9 -0.36473355 -0.613 0.4521529
10 0.39538262 -0.533 0.3944783
11 0.01679933 -0.453 0.6589419
12 0.23557488 -0.373 0.4834243
13 0.37234340 -0.293 0.5122159
14 0.18675995 -0.213 0.7202369
15 0.48873863 -0.133 0.6132729
16 0.25970710 -0.053 0.4374590
17 0.46679357 0.027 0.6584412
18 0.64776243 0.107 0.6150856
19 0.54753745 0.187 0.5058257
20 0.63397843 0.267 0.5435534
21 0.70424561 0.347 0.5772798
22 0.85121176 0.427 0.5412052
23 0.91588284 0.507 0.5714374
24 0.96585050 0.587 0.5696521
25 1.06461351 0.667 0.5615047
26 1.12121771 0.747 0.5548811
27 1.15041147 0.827 0.5634120
28 1.22498621 0.907 0.5870118
29 1.28158277 0.987 0.6009292
30 1.41739135 1.067 0.5826199
31 1.47757181 1.147 0.5760538
32 1.58126221 1.227 0.5699387
33 1.63804572 1.307 0.5855638
34 1.66971712 1.387 0.5148785
35 1.75406733 1.467 0.5504398
36 1.76784068 1.547 0.6016072
37 1.87247347 1.627 0.5555837
38 1.98934159 1.707 0.5948008
39 1.99170886 1.787 0.5315359
40 2.09278895 1.867 0.5772159
41 2.15102625 1.947 0.5868213
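library(epiR)  # provides epi.ccc()

simu <- rep(0, 41)
fac <- rep(0, 41)
qwe <- data.frame(simu, fac)
qwe$se <- 0
betas <- seq(from=-2, to=2, by=0.1)  # SAT z-scores to evaluate
for(i in 1:41) {
  # loadings on g: SAT 0.8, income 0.35, skill 0.45; all indicators have unit variance
  g <- rnorm(30000000, mean=0)
  SAT <- 0.8*g + rnorm(30000000)*0.6
  inc <- 0.35*g + rnorm(30000000)*0.93674
  skill <- 0.45*g + rnorm(30000000)*0.893028
  subby1 <- data.frame(g, SAT)
  subby1$inc <- inc
  subby1$skill <- skill
  # SAT within 0.05 SD of the target; skill and income bins are 0.3 SD either side of 3 and 2
  lower <- betas[i] - 0.05
  higher <- betas[i] + 0.05
  subby2 <- subset(subby1, (subby1$SAT > lower & subby1$SAT < higher) & (subby1$skill > 2.7 & subby1$skill < 3.3) & (subby1$inc > 1.7 & subby1$inc < 2.3))
  qwe$simu[i] <- mean(subby2$g)                # simulation estimate
  # income first (0.7), then skill (0.7 + (3-0.7)*0.45 = 1.735), then SAT
  qwe$fac[i] <- 1.735 + (betas[i]-1.735)*0.8
  qwe$se[i] <- sd(subby2$g)                    # spread of g among people in the bin
}
epi.ccc(qwe$simu, qwe$fac)
qwe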
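The pattern in Case 4 (larger underestimates at lower SAT scores) can be checked against the closed-form conditional mean of the simulated model. This is a sketch that assumes the loadings in the simulation code (0.8, 0.35, 0.45), unit-variance indicators, joint normality, and income and skill fixed at exactly 2 and 3 SD. The simulation itself uses bins around those values, so its numbers will not match exactly.

```r
# Loadings from the Case 4 simulation, in the order SAT, income, skill.
lambda <- c(0.8, 0.35, 0.45)

# Implied correlation matrix of the three indicators under a one-factor model.
R <- outer(lambda, lambda)
diag(R) <- 1

# Regression weights for E[g | SAT, income, skill] under multivariate normality.
w <- as.numeric(solve(R, lambda))

sat <- c(-2, -1, 0, 1, 2)
exact <- w[1] * sat + w[2] * 2 + w[3] * 3

# Sequential factor score formula used in the Case 4 code.
fac <- 1.735 + (sat - 1.735) * 0.8

comparison <- data.frame(sat, exact, fac, gap = exact - fac)
print(round(comparison, 3))
```

Under these assumptions the gap is positive at every SAT value and shrinks as SAT rises, which matches the direction of the discrepancy in the table above.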