Tuesday, February 28, 2006

Re: RE: st: behavior of -areg-

In response to Radu's <raduban@gmail.com> query on why -areg- was not dropping collinear variables, Scott <smerryman@kc.rr.com> made an interesting demonstration on how machine precision can affect whether -_rmcoll- drops variables:

> You replicate this with the grunfeld dataset and the mvalue variable. > > sysuse grunfeld,clear > egen double t1 = total(mva), by(com) > egen double t2 = total(kstock), by(com) > > //t1 not dropped > areg invest kstock t1, ab(com) > > //t2 dropped > areg invest mvalu t2, ab(com) > > replace t1 = t1/100 > //Now, t1 is dropped > areg invest kstock t1, ab(com) > > replace t2 = t2*10000 > //Now, t2 is not dropped > areg invest mvalu t2, ab(com)

I would like to add to this demonstration where I show the dangers of having single precision variables (float) and not promoting them to double precision with caution.

Note that single precision will have only about 7 digits of precision while double has about 16 corresponding to 24 and 53 bits for the mantissia:

. di c(epsfloat) 1.192e-07

. di c(epsdouble) 2.220e-16

. di 2^(1-24) 1.192e-07

. di 2^(1-53) 2.220e-16

machine epsilon is 2^(1-p) where p is the number of bits reserved for the mantissia of floating point numbers.

Now take a look at the storage type for each of the variables in the grunfeld dataset.

. webuse grunfeld,clear

. describe

Contains data from http://www.stata-press.com/data/r9/grunfeld.dta obs: 200 vars: 6 3 Mar 2005 20:27 size: 5,600 (99.5% of memory free) ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------- company float %9.0g year float %9.0g invest float %9.0g mvalue float %9.0g kstock float %9.0g time float %9.0g

. list in 1/20

+--------------------------------------------------+ | company year invest mvalue kstock time | |--------------------------------------------------| 1. | 1 1935 317.6 3078.5 2.8 1 | 2. | 1 1936 391.8 4661.7 52.6 2 | 3. | 1 1937 410.6 5387.1 156.9 3 | 4. | 1 1938 257.7 2792.2 209.2 4 | 5. | 1 1939 330.8 4313.2 203.4 5 | |--------------------------------------------------| 6. | 1 1940 461.2 4643.9 207.2 6 | 7. | 1 1941 512 4551.2 255.2 7 | 8. | 1 1942 448 3244.1 303.7 8 | 9. | 1 1943 499.6 4053.7 264.1 9 | 10. | 1 1944 547.5 4379.3 201.6 10 | |--------------------------------------------------| 11. | 1 1945 561.2 4840.9 265 11 | 12. | 1 1946 688.1 4900.9 402.2 12 | 13. | 1 1947 568.9 3526.5 761.5 13 | 14. | 1 1948 529.2 3254.7 922.4 14 | 15. | 1 1949 555.1 3700.2 1020.1 15 | |--------------------------------------------------| 16. | 1 1950 642.9 3755.6 1099 16 | 17. | 1 1951 755.9 4833 1207.7 17 | 18. | 1 1952 891.2 4924.9 1430.5 18 | 19. | 1 1953 1304.4 6241.7 1777.3 19 | 20. | 1 1954 1486.7 5593.6 2226.3 20 | +--------------------------------------------------+

We can extend the display format of mvalue and kstock and you will see what you are going to get beyond the 7-th digit when you promote these variables to double precision.

. format %20.15g kstock mvalue

. gen double fvalue = mvalue

. gen double fstock = kstock

. format %20.15g fvalue fstock

. list fvalue mvalue fstock kstock in 1/20

+---------------------------------------------------------------------------+ | fvalue mvalue fstock kstock | |---------------------------------------------------------------------------| 1.| 3078.5 3078.5 2.79999995231628 2.79999995231628 | 2.| 4661.7001953125 4661.7001953125 52.5999984741211 52.5999984741211 | 3.| 5387.10009765625 5387.10009765625 156.899993896484 156.899993896484 | 4.| 2792.19995117188 2792.19995117188 209.199996948242 209.199996948242 | 5.| 4313.2001953125 4313.2001953125 203.399993896484 203.399993896484 | |---------------------------------------------------------------------------| 6.| 4643.89990234375 4643.89990234375 207.199996948242 207.199996948242 | 7.| 4551.2001953125 4551.2001953125 255.199996948242 255.199996948242 | 8.| 3244.10009765625 3244.10009765625 303.700012207031 303.700012207031 | 9.| 4053.69995117188 4053.69995117188 264.100006103516 264.100006103516 | 10.| 4379.2998046875 4379.2998046875 201.600006103516 201.600006103516 | |---------------------------------------------------------------------------| 11 | 4840.89990234375 4840.89990234375 265 265 | 12.| 4900.89990234375 4900.89990234375 402.200012207031 402.200012207031 | 13.| 3526.5 3526.5 761.5 761.5 | 14.| 3254.69995117188 3254.69995117188 922.400024414063 922.400024414063 | 15.| 3700.19995117188 3700.19995117188 1020.09997558594 1020.09997558594 | |---------------------------------------------------------------------------| 16.| 3755.60009765625 3755.60009765625 1099 1099 | 17.| 4833 4833 1207.69995117188 1207.69995117188 | 18.| 4924.89990234375 4924.89990234375 1430.5 1430.5 | 19.| 6241.7001953125 6241.7001953125 1777.30004882813 1777.30004882813 | 20.| 5593.60009765625 5593.60009765625 2226.30004882813 2226.30004882813 | +---------------------------------------------------------------------------+

. egen double t1 = total(mva), by(com)

. egen double t2 = total(kstock), by(com)

. format %20.15g t1 t2

. list t1 t2 in 1/20

+------------------------------------+ | t1 t2 | |------------------------------------| 1. | 86676.900390625 12968.7000625134 | 2. | 86676.900390625 12968.7000625134 | 3. | 86676.900390625 12968.7000625134 | 4. | 86676.900390625 12968.7000625134 | 5. | 86676.900390625 12968.7000625134 | |------------------------------------| 6. | 86676.900390625 12968.7000625134 | 7. | 86676.900390625 12968.7000625134 | 8. | 86676.900390625 12968.7000625134 | 9. | 86676.900390625 12968.7000625134 | 10. | 86676.900390625 12968.7000625134 | |------------------------------------| 11. | 86676.900390625 12968.7000625134 | 12. | 86676.900390625 12968.7000625134 | 13. | 86676.900390625 12968.7000625134 | 14. | 86676.900390625 12968.7000625134 | 15. | 86676.900390625 12968.7000625134 | |------------------------------------| 16. | 86676.900390625 12968.7000625134 | 17. | 86676.900390625 12968.7000625134 | 18. | 86676.900390625 12968.7000625134 | 19. | 86676.900390625 12968.7000625134 | 20. | 86676.900390625 12968.7000625134 | +------------------------------------+

You can see here that even though we have made the variables t1 and t2 double precision, we really only have done single precision numerics (actually a little better than single precision) and the variables have only 9 digits that are any good. We gained some precsion, but we did not do as well as we would have if we started with full double precision. This garbage beyond the 9-th digit can be enough for -_rmcoll- to not detect variable collinearity.

Now if we promote mvalue and kstock to double carefully we can get full precision.

. scalar digits = round(log10(abs(c(epsfloat))),1)+1

. gen double dvalue = round(mvalue,10^(digits+ceil(log10(abs(mvalue)))))

. gen double dstock = round(kstock,10^(digits+ceil(log10(abs(kstock)))))

. format %20.15g dvalue dstock

. list dvalue mvalue dstock kstock in 1/20

+-------------------------------------------------------+ | dvalue mvalue dstock kstock | |-------------------------------------------------------| 1. | 3078.5 3078.5 2.8 2.79999995231628 | 2. | 4661.7 4661.7001953125 52.6 52.5999984741211 | 3. | 5387.1 5387.10009765625 156.9 156.899993896484 | 4. | 2792.2 2792.19995117188 209.2 209.199996948242 | 5. | 4313.2 4313.2001953125 203.4 203.399993896484 | |-------------------------------------------------------| 6. | 4643.9 4643.89990234375 207.2 207.199996948242 | 7. | 4551.2 4551.2001953125 255.2 255.199996948242 | 8. | 3244.1 3244.10009765625 303.7 303.700012207031 | 9. | 4053.7 4053.69995117188 264.1 264.100006103516 | 10. | 4379.3 4379.2998046875 201.6 201.600006103516 | |-------------------------------------------------------| 11. | 4840.9 4840.89990234375 265 265 | 12. | 4900.9 4900.89990234375 402.2 402.200012207031 | 13. | 3526.5 3526.5 761.5 761.5 | 14. | 3254.7 3254.69995117188 922.4 922.400024414063 | 15. | 3700.2 3700.19995117188 1020.1 1020.09997558594 | |-------------------------------------------------------| 16. | 3755.6 3755.60009765625 1099 1099 | 17. | 4833 4833 1207.7 1207.69995117188 | 18. | 4924.9 4924.89990234375 1430.5 1430.5 | 19. | 6241.7 6241.7001953125 1777.3 1777.30004882813 | 20. | 5593.6 5593.60009765625 2226.3 2226.30004882813 | +-------------------------------------------------------+

. egen double d1 = total(dvalue), by(com)

. egen double d2 = total(dstock), by(com)

. format %20.15g d1 d2

. list d1 t1 d2 t2 in 1/20

+--------------------------------------------------------+ | d1 t1 d2 t2 | |--------------------------------------------------------| 1. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 2. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 3. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 4. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 5. | 86676.9 86676.900390625 12968.7 12968.7000625134 | |--------------------------------------------------------| 6. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 7. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 8. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 9. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 10. | 86676.9 86676.900390625 12968.7 12968.7000625134 | |--------------------------------------------------------| 11. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 12. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 13. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 14. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 15. | 86676.9 86676.900390625 12968.7 12968.7000625134 | |--------------------------------------------------------| 16. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 17. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 18. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 19. | 86676.9 86676.900390625 12968.7 12968.7000625134 | 20. | 86676.9 86676.900390625 12968.7 12968.7000625134 | +--------------------------------------------------------+

Now we do Scott's experiment in double precision:

. gen double dinvest = round(invest,10^(digits+ceil(log10(abs(invest)))))

. recast long company

. areg dinvest dstock d1, ab(com)

Linear regression, absorbing indicators Number of obs = 200 F( 1, 189) = 366.45 Prob > F = 0.0000 R-squared = 0.9184 Adj R-squared = 0.9141 Root MSE = 63.566

------------------------------------------------------------------------------ dinvest | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- dstock | .3707496 .0193676 19.14 0.000 .3325452 .4089541 d1 | (dropped) _cons | 43.62499 6.984315 6.25 0.000 29.84777 57.40222 -------------+---------------------------------------------------------------- company | F(9, 189) = 36.975 0.000 (10 categories)

. areg dinvest dvalue d2, ab(com)

Linear regression, absorbing indicators Number of obs = 200 F( 1, 189) = 111.35 Prob > F = 0.0000 R-squared = 0.8491 Adj R-squared = 0.8411 Root MSE = 86.444

------------------------------------------------------------------------------ dinvest | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- dvalue | .1898776 .0179944 10.55 0.000 .1543819 .2253733 d2 | (dropped) _cons | -59.42872 20.40144 -2.91 0.004 -99.6725 -19.18494 -------------+---------------------------------------------------------------- company | F(9, 189) = 15.973 0.000 (10 categories)

So my advice is to be very wary of single precision floating point variable.

-Rich * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/


Tag:


Links to this post:

Create a Link



<< Home

This page is powered by Blogger. Isn't yours?