SAS: How to compare regression coefficients across groups

From: http://statistics.ats.ucla.edu/stat/sas/faq/compreg3.htm


Sometimes your research may predict that the size of a regression coefficient varies across groups. For example, you might believe that the regression coefficient of height predicting weight differs across three age groups (young, middle aged, senior citizen). Below, we have a data file with 10 fictional young people, 10 fictional middle aged people, and 10 fictional senior citizens, along with their height in inches and their weight in pounds. The variable age indicates the age group and is coded 1 for young people, 2 for middle aged, and 3 for senior citizens.

DATA htwt;
  INPUT id age height weight ;
CARDS;
 1 1  56 140   
 2 1  60 155   
 3 1  64 143   
 4 1  68 161   
 5 1  72 139   
 6 1  54 159   
 7 1  62 138   
 8 1  65 121   
 9 1  65 161   
10 1  70 145   
11 2  56 117   
12 2  60 125   
13 2  64 133   
14 2  68 141   
15 2  72 149   
16 2  54 109   
17 2  62 128   
18 2  65 131   
19 2  65 131   
20 2  70 145   
21 3  64 211   
22 3  68 223   
23 3  72 235   
24 3  76 247   
25 3  80 259   
26 3  62 201   
27 3  69 228   
28 3  74 245   
29 3  75 241   
30 3  82 269   
;
RUN; 

We analyze the data for each age group separately using proc reg with a by statement, as shown below.

PROC REG DATA=htwt;
   BY age ;
   MODEL weight = height ;
RUN; 
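The by statement expects the input data set to be sorted by the by variable. The htwt data above are already entered in age order, so the step runs as written. For data in arbitrary order, a sort step along these lines would be needed first (a minimal sketch that is not part of the original example):

```sas
/* Sort by the BY variable before any BY-group processing */
PROC SORT DATA=htwt;
  BY age;
RUN;
```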

The parameter estimates (coefficients) for the young, middle aged, and senior citizens are shown below. The results seem to suggest that height is a stronger predictor of weight for seniors (3.19) than for the middle aged (2.10). The results also seem to suggest that height does not predict weight as strongly for the young (-0.38) as for the middle aged and seniors. However, we would need to perform specific significance tests to be able to make claims about the differences among these regression coefficients.

AGE=1
                 Parameter      Standard    T for H0:               
Variable  DF      Estimate         Error   Parameter=0    Prob > |T|
INTERCEP   1    170.166445   49.43018216         3.443        0.0088
HEIGHT     1     -0.376831    0.77433413        -0.487        0.6396
 
AGE=2
                 Parameter      Standard    T for H0:               
Variable  DF      Estimate         Error   Parameter=0    Prob > |T|
INTERCEP   1     -2.397470    7.05327189        -0.340        0.7427
HEIGHT     1      2.095872    0.11049098        18.969        0.0001
 
AGE=3
                 Parameter      Standard    T for H0:               
Variable  DF      Estimate         Error   Parameter=0    Prob > |T|
INTERCEP   1      5.601677    8.93019669         0.627        0.5480
HEIGHT     1      3.189727    0.12323669        25.883        0.0001
 

We can compare the regression coefficients among these three age groups to test the null hypothesis

Ho: B1 = B2 = B3

where B1 is the regression coefficient for the young, B2 is the regression coefficient for the middle aged, and B3 is the regression coefficient for senior citizens. To do this analysis, we first make a dummy variable called age1 that is coded 1 if young (age=1), 0 otherwise, and a dummy variable called age2 that is coded 1 if middle aged (age=2), 0 otherwise. Seniors (age=3) are therefore the reference group. We also create age1ht, which is age1 times height, and age2ht, which is age2 times height.

data htwt2;
  set htwt; 
 
  age1 = . ;
  age2 = . ;
  IF age = 1 then age1 = 1; ELSE age1 = 0 ;
  IF age = 2 then age2 = 1; ELSE age2 = 0 ;
 
  age1ht = age1*height ;
  age2ht = age2*height ;
 
RUN; 
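The same dummy variables can be built more compactly, because a comparison in a SAS expression evaluates to 1 when true and 0 when false. The sketch below is an alternative to the data step above, not a replacement for it; the data set name htwt2_alt is made up for illustration.

```sas
/* Alternative sketch: htwt2_alt is a hypothetical name, not used elsewhere */
DATA htwt2_alt;
  SET htwt;
  /* A comparison evaluates to 1 when true and 0 when false */
  age1 = (age = 1);
  age2 = (age = 2);
  /* Interaction terms: dummy variable times height */
  age1ht = age1*height;
  age2ht = age2*height;
RUN;
```

As in the IF/ELSE version, an observation with a missing age would get 0 for both dummy variables.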

We can now use age1, age2, height, age1ht, and age2ht as predictors in the regression equation in proc reg below. In the proc reg we use the

 TEST age1ht=0, age2ht=0; 

statement to test the null hypothesis

Ho: B1 = B2 = B3

This test will have two degrees of freedom because it compares among three regression coefficients.
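To see why testing age1ht and age2ht is the same as testing equal slopes, it may help to write the model out. The symbols b0 through b5 below are notation introduced here for illustration only; they do not appear in the SAS output.

```latex
\begin{align*}
\text{weight} &= b_0 + b_1\,\text{age1} + b_2\,\text{age2} + b_3\,\text{height}
               + b_4\,\text{age1ht} + b_5\,\text{age2ht} + e \\[4pt]
B_1 &= b_3 + b_4 && \text{(young: age1 = 1, age2 = 0)} \\
B_2 &= b_3 + b_5 && \text{(middle aged: age1 = 0, age2 = 1)} \\
B_3 &= b_3       && \text{(seniors: age1 = 0, age2 = 0)} \\[4pt]
H_0 : B_1 = B_2 = B_3 \;&\Longleftrightarrow\; b_4 = 0 \ \text{and}\ b_5 = 0
\end{align*}
```

Equality of the three slopes therefore imposes two restrictions on the model, which is where the two degrees of freedom come from.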

PROC REG DATA=htwt2 ;
  MODEL weight = age1 age2 height age1ht age2ht ;
  TEST age1ht=0, age2ht=0 ;
RUN;

The output below shows that the null hypothesis

Ho: B1 = B2 = B3

can be rejected (F=17.29, p=0.0001). This means that the regression coefficients between height and weight do indeed differ significantly across the 3 age groups (young, middle aged, senior citizen).

Model: MODEL1
Dependent Variable: WEIGHT

    Analysis of Variance

                         Sum of         Mean
Source          DF      Squares       Square      F Value       Prob>F

Model            5  69595.35464  13919.07093      220.261       0.0001
Error           24   1516.64536     63.19356
C Total         29  71112.00000

    Root MSE       7.94944     R-square       0.9787
    Dep Mean     171.00000     Adj R-sq       0.9742
    C.V.           4.64879

    Parameter Estimates

                 Parameter      Standard    T for H0:
Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1      5.601677   29.48853690         0.190        0.8509
AGE1       1    164.564768   41.55490307         3.960        0.0006
AGE2       1     -7.999147   41.55490307        -0.192        0.8490
HEIGHT     1      3.189727    0.40694172         7.838        0.0001
AGE1HT     1     -3.566558    0.61316088        -5.817        0.0001
AGE2HT     1     -1.093855    0.61316088        -1.784        0.0871

Dependent Variable: WEIGHT
Test:          Numerator:   1092.7718  DF:    2   F value:  17.2925
               Denominator:  63.19356  DF:   24   Prob>F:    0.0001
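As a check on the output above, the coefficients of the combined model can be recombined by hand to recover the separate regressions reported earlier, and the F value can be recomputed from the numerator and denominator mean squares. The arithmetic below uses only numbers printed in the output.

```latex
\begin{align*}
\text{slope, seniors}          &= 3.189727 \\
\text{slope, young}            &= 3.189727 + (-3.566558) = -0.376831 \\
\text{slope, middle aged}      &= 3.189727 + (-1.093855) = 2.095872 \\
\text{intercept, seniors}      &= 5.601677 \\
\text{intercept, young}        &= 5.601677 + 164.564768 = 170.166445 \\
\text{intercept, middle aged}  &= 5.601677 + (-7.999147) = -2.397470 \\
F &= \frac{1092.7718}{63.19356} \approx 17.29
\end{align*}
```

The standard errors differ from those of the separate regressions because the combined model pools the error variance across the three groups.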

It is also possible to run such an analysis in proc glm, using syntax as shown below. Instead of using a test statement, the contrast statement is used to test the null hypothesis

Ho: B1 = B2 = B3

The contrast statement uses the comma to join together what would have been two separate one degree of freedom tests into a single two degree of freedom test that tests the null hypothesis above.

PROC GLM DATA=htwt2 ;
  CLASS age ;
  MODEL weight = age height age*height / SOLUTION ;
  CONTRAST 'test equal slopes' age*height 1 -1  0,
                               age*height 0  1 -1 ;
RUN;

If you compare the contrast output from proc glm (labeled test equal slopes, found below) with the output of the test statement from proc reg above, you will see that the F values and p values are the same. This is because these two tests are equivalent.

General Linear Models Procedure
Class Level Information

Class    Levels    Values

AGE           3    1 2 3


Number of observations in data set = 30

General Linear Models Procedure

Dependent Variable: WEIGHT
                                     Sum of            Mean
Source                  DF          Squares          Square   F Value     Pr > F

Model                    5     69595.354644    13919.070929    220.26     0.0001

Error                   24      1516.645356       63.193557

Corrected Total         29     71112.000000

                  R-Square             C.V.        Root MSE          WEIGHT Mean

                  0.978672         4.648794       7.9494375            171.00000

Source                  DF        Type I SS     Mean Square   F Value     Pr > F

AGE                      2     64350.600000    32175.300000    509.15     0.0001
HEIGHT                   1      3059.211075     3059.211075     48.41     0.0001
HEIGHT*AGE               2      2185.543569     1092.771784     17.29     0.0001

Source                  DF      Type III SS     Mean Square   F Value     Pr > F

AGE                      2     1395.9046778     697.9523389     11.04     0.0004
HEIGHT                   1     2597.0189017    2597.0189017     41.10     0.0001
HEIGHT*AGE               2     2185.5435689    1092.7717845     17.29     0.0001

Contrast                DF      Contrast SS     Mean Square   F Value     Pr > F

test equal slopes        2     2185.5435689    1092.7717845     17.29     0.0001

                                        T for H0:    Pr > |T|   Std Error of
Parameter                  Estimate    Parameter=0                Estimate

INTERCEPT                 5.6016771 B         0.19     0.8509    29.48853690
AGE        1            164.5647676 B         3.96     0.0006    41.55490307
           2             -7.9991472 B        -0.19     0.8490    41.55490307
           3              0.0000000 B          .        .          .
HEIGHT                    3.1897275 B         7.84     0.0001     0.40694172
HEIGHT*AGE 1             -3.5665584 B        -5.82     0.0001     0.61316088
           2             -1.0938553 B        -1.78     0.0871     0.61316088
           3              0.0000000 B          .        .          .

      NOTE: The X'X matrix has been found to be singular and a generalized inverse
      was used to solve the normal equations.   Estimates followed by the
      letter 'B' are biased, and are not unique estimators of the parameters.
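In the solution above, age 3 (seniors) is the reference level, so HEIGHT is the seniors' slope and the HEIGHT*AGE rows are differences from that slope. One way to have proc glm report each group's slope directly is the estimate statement. The sketch below is an addition to the original example, and the labels in quotes are arbitrary.

```sas
PROC GLM DATA=htwt2 ;
  CLASS age ;
  MODEL weight = age height age*height ;
  /* Slope for a group = height coefficient + that group's age*height coefficient */
  ESTIMATE 'slope, young'       height 1 age*height 1 0 0 ;
  ESTIMATE 'slope, middle aged' height 1 age*height 0 1 0 ;
  ESTIMATE 'slope, senior'      height 1 age*height 0 0 1 ;
RUN;
```

These estimates should equal the slopes from the separate regressions (-0.376831, 2.095872, and 3.189727), although their standard errors are based on the pooled error term of the combined model.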

You might notice that the null hypothesis that we are testing

Ho: B1 = B2 = B3

is similar to the null hypothesis that you might test using ANOVA to compare the means of the three groups,

Ho: Mu1 = Mu2 = Mu3

In ANOVA, you can get an overall F test of the null hypothesis. In addition to that overall test, you could perform planned comparisons among the three groups. So far we have seen how to do an overall test of the equality of the three regression coefficients, and now we will test planned comparisons among the regression coefficients. Below, we show how you can perform two such tests using the contrast statement in proc glm. The first contrast compares the regression coefficients of the middle aged vs. seniors.

Ho: B2 = B3

The second contrast compares the regression coefficients of the young vs. middle aged and seniors.

Ho: B1 = (B2 + B3)/2

PROC GLM DATA=htwt2 ;
  CLASS age ;
  MODEL weight = age height age*height ;
  CONTRAST 'Mid Age vs. Sen.  ' age*height  0  1 -1 ;
  CONTRAST 'Yng vs (Mid & Sen)' age*height -2  1  1 ;
RUN;
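The contrast statement reports only a test. If the size of each difference is also of interest, the estimate statement accepts the same coefficients and prints the estimated value with a t test. The sketch below is an addition to the original example; the DIVISOR=2 option rescales the second set of coefficients so that the estimate is (B2 + B3)/2 - B1.

```sas
PROC GLM DATA=htwt2 ;
  CLASS age ;
  MODEL weight = age height age*height ;
  /* Estimates B2 - B3 */
  ESTIMATE 'Mid Age vs. Sen.  ' age*height  0  1 -1 ;
  /* Estimates (B2 + B3)/2 - B1: the -2 1 1 coefficients divided by 2 */
  ESTIMATE 'Yng vs (Mid & Sen)' age*height -2  1  1 / DIVISOR=2 ;
RUN;
```

From the separate regressions, the first estimate should be 2.095872 - 3.189727 = -1.093855 and the second should be about 3.02. Rescaling a contrast changes the estimate but not its t value, so the squared t values should agree with the F values from the contrast statements.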

The output from the contrast statements indicates that the regression coefficients for the middle aged and seniors do not differ significantly (F=3.18, p=0.0871). The second contrast was significant (F=29.96, p=0.0001), indicating that the regression coefficient for the young differs from the average of the coefficients for the middle aged and seniors.

General Linear Models Procedure
Class Level Information

Class    Levels    Values

AGE           3    1 2 3

Number of observations in data set = 30

General Linear Models Procedure

Dependent Variable: WEIGHT
                                     Sum of            Mean
Source                  DF          Squares          Square   F Value     Pr > F

Model                    5     69595.354644    13919.070929    220.26     0.0001

Error                   24      1516.645356       63.193557

Corrected Total         29     71112.000000

                  R-Square             C.V.        Root MSE          WEIGHT Mean

                  0.978672         4.648794       7.9494375            171.00000

Source                  DF        Type I SS     Mean Square   F Value     Pr > F

AGE                      2     64350.600000    32175.300000    509.15     0.0001
HEIGHT                   1      3059.211075     3059.211075     48.41     0.0001
HEIGHT*AGE               2      2185.543569     1092.771784     17.29     0.0001

Source                  DF      Type III SS     Mean Square   F Value     Pr > F

AGE                      2     1395.9046778     697.9523389     11.04     0.0004
HEIGHT                   1     2597.0189017    2597.0189017     41.10     0.0001
HEIGHT*AGE               2     2185.5435689    1092.7717845     17.29     0.0001

Contrast                DF      Contrast SS     Mean Square   F Value     Pr > F

Mid Age vs. Sen.         1      201.1146303     201.1146303      3.18     0.0871
Yng vs (Mid & Sen)       1     1893.2074903    1893.2074903     29.96     0.0001

We can do the exact same analysis in proc reg by coding age1 and age2 like the coding shown in the contrast statements above. We will create age1 that will be:

0 for young
1 for middle age
-1 for senior

and we will create age2 that will be:

-2 for young
1 for middle age
1 for senior

The significance tests in proc reg below for age1ht and age2ht will correspond to the contrast statements we used in proc glm above.

data htwt3;
  set htwt; 
 
  age1 = . ;
  age2 = . ;
  IF age = 1 then age1 =  0;
  IF age = 2 then age1 =  1;
  IF age = 3 then age1 = -1;
 
  IF age = 1 then age2 = -2;
  IF age = 2 then age2 =  1;
  IF age = 3 then age2 =  1;
 
  age1ht = age1*height ;
  age2ht = age2*height ;
 
RUN;
 
PROC REG DATA=htwt3 ;
  MODEL weight = age1 age2 height age1ht age2ht ;
RUN;

The results below correspond to the proc glm results above, except that the proc glm results are reported as F values and the proc reg results are reported as t values. We can square the t values to make them comparable to the F values. Indeed, for the comparison of middle aged vs. seniors, the t value of -1.784 when squared becomes 3.183, matching the F value of 3.18 from proc glm. Likewise, for the comparison of young vs. middle aged and seniors, the t value from proc reg is 5.473 and when squared becomes 29.954, matching the F value of 29.96 from proc glm up to rounding.

Model: MODEL1
Dependent Variable: WEIGHT

Analysis of Variance

                         Sum of         Mean
Source          DF      Squares       Square      F Value       Prob>F

Model            5  69595.35464  13919.07093      220.261       0.0001
Error           24   1516.64536     63.19356
C Total         29  71112.00000

    Root MSE       7.94944     R-square       0.9787
    Dep Mean     171.00000     Adj R-sq       0.9742
    C.V.           4.64879

Parameter Estimates

                 Parameter      Standard    T for H0:
Variable  DF      Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1     57.790217   16.94450462         3.411        0.0023
AGE1       1     -3.999574   20.77745154        -0.192        0.8490
AGE2       1    -56.188114   11.96726393        -4.695        0.0001
HEIGHT     1      1.636256    0.25524084         6.411        0.0001
AGE1HT     1     -0.546928    0.30658044        -1.784        0.0871
AGE2HT     1      1.006544    0.18389498         5.473        0.0001
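With this coding, the group slopes can again be recovered from the output by hand: each slope is the HEIGHT coefficient plus the AGE1HT and AGE2HT coefficients weighted by that group's codes. The worked arithmetic below uses only the printed estimates and is accurate up to rounding in the last digit.

```latex
\begin{align*}
\text{slope}_g &= b_{\text{height}} + b_{\text{age1ht}}\,\text{age1}_g + b_{\text{age2ht}}\,\text{age2}_g \\[4pt]
B_1 &= 1.636256 + (0)(-0.546928) + (-2)(1.006544) \approx -0.3768 \\
B_2 &= 1.636256 + (1)(-0.546928) + (1)(1.006544) \approx 2.0959 \\
B_3 &= 1.636256 + (-1)(-0.546928) + (1)(1.006544) \approx 3.1897 \\[4pt]
(-1.784)^2 &\approx 3.183, \qquad (5.473)^2 \approx 29.954
\end{align*}
```

The HEIGHT coefficient in this model (1.636256) is the unweighted average of the three group slopes, which is why it differs from the HEIGHT coefficient in the dummy-coded model.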

