I was running the following (simple) code to construct an indicator variable that equals 1 when share_rate is greater than or equal to 0.05.
data work.institutional_ownership_V4;
set work.institutional_ownership_V3;
If share_rate >= 0.05 then BlockHolder = 1; Else BlockHolder = 0;
run;
For unknown reasons, the code does not work properly. In particular, when I run the following code I can count how many observations I should have:
data work.try;
set work.institutional_ownership_V3;
if share_rate >= 0.05;
run;
The first code yields about 1,000 observations where BlockHolder=1, but the second code yields more than 100,000 observations. I am not sure why such an error occurs.
I recall @Rick_SAS stating that "Using the DATA step is cumbersome and prone to errors, so I much prefer using the SAS procedures that can create dummy variables." I was hoping to use the GLMMOD procedure as Rick suggested in his blog post, but I noticed that this procedure creates dummy variables for each categorical variable. Since I am not using a categorical variable but a continuous one, I wasn't sure how I could make use of the GLMMOD procedure or any other non-DATA-step procedure. I couldn't find any post referring to an alternative to the DATA step.
Any tips / help will be much appreciated.
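(Side note: a comparison in SAS already evaluates to 1 or 0, so an indicator like this can be built without IF/THEN at all, either in a DATA step or, if a non-DATA-step route is preferred, with PROC SQL. This is only a minimal sketch using the dataset and variable names from the post, not the GLMMOD approach from Rick's article:)
data work.institutional_ownership_V4;
    set work.institutional_ownership_V3;
    BlockHolder = (share_rate >= 0.05);   /* Boolean expression returns 1 or 0 */
run;
proc sql;   /* equivalent result via PROC SQL instead of a DATA step */
    create table work.institutional_ownership_V4 as
    select *, (share_rate >= 0.05) as BlockHolder
    from work.institutional_ownership_V3;
quit;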
Those should be giving you identical results. So either:
1. Rounding issues
2. You're doing something wrong.
Given the scope I'm leaning towards #2.
I don't recall Rick ever saying a DATA step is cumbersome and prone to errors; curious where you got that from?
Run a proc freq on your first results and post the output and log.
Try rounding and see if that resolves it, but given the numbers you've stated I don't think so. Post the results and log from below.
I suspect something else is the issue here.
data GT LT both;
set work.institutional_ownership_V3;
If share_rate >= 0.05 then output GT;
Else output LT;
if share_rate >= 0.05 then cat=1; else cat=0;
if round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
output;
run;
proc freq data=both;
table cat;
table cat*cat_round;
run;
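Another quick sanity check, sketched here with the dataset and variable names assumed from the thread, is to print the raw values sitting near the boundary with a wide numeric format, so a display format doesn't hide small differences in the stored values:
proc print data=work.institutional_ownership_V3 (obs=20);
    where share_rate >= 0.045 and share_rate < 0.055;   /* values near the 0.05 cutoff */
    var share_rate;
    format share_rate best32.;   /* show the full stored value */
run;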
Thanks for your helpful reply, @Reeza. I was referring to the comment Rick made in the blog post titled "Create dummy variables in SAS." See the comments below the post.
Please see below the log and the proc freq result:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
61
62 data GT LT both;
63 set work.institutional_ownership_V3;
64 If share_rate >= 0.05 then output GT;
65 Else output LT;
66 if share_rate >= 0.05 then cat=1; else cat=0; if
66 ! round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
67 output;
68 run;
NOTE: There were 66812164 observations read from the data set
WORK.INSTITUTIONAL_OWNERSHIP_V3.
NOTE: The data set WORK.GT has 66813283 observations and 25 variables.
NOTE: The data set WORK.LT has 133623209 observations and 25 variables.
NOTE: The data set WORK.BOTH has 66812164 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 2:51.99
cpu time 51.55 seconds
69
70 proc freq data=both;
71 table cat;table cat*cat_round;
72 run;
NOTE: There were 66812164 observations read from the data set WORK.BOTH.
NOTE: PROCEDURE FREQ used (Total process time):
real time 54.99 seconds
cpu time 6.96 seconds
73
74 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
87
I'm guessing (but haven't really dug into this) that the ROUND function is using different rules than share_rate >= 0.05 is using.
Sorry, small mistake in the test script. That last OUTPUT should be output BOTH;
data GT LT both;
set work.institutional_ownership_V3;
If share_rate >= 0.05 then output GT;
Else output LT;
if share_rate >= 0.05 then cat=1; else cat=0; if round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
output BOTH;
run;
proc freq data=both;
table cat;table cat*cat_round;
run;
If you use round(share_rate, 0.01) >= 0.05, guess what? A value of 0.049 gets rounded to 0.05 and is then greater than or equal to 0.05. However, the value 0.049 will not pass the test share_rate >= 0.05.
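A one-observation sketch of that boundary effect (the 0.049 value is only an illustration):
data _null_;
    share_rate = 0.049;
    raw_flag   = (share_rate >= 0.05);               /* 0: 0.049 fails the raw comparison */
    round_flag = (round(share_rate, 0.01) >= 0.05);  /* 1: 0.049 rounds to 0.05 first */
    put raw_flag= round_flag=;
run;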
@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.
In fact, the crosstab from the incorrect results shows this: fewer than 1,000 records fall off the diagonal, so the two counts should be much more similar than indicated. I suspect the 'correct' test will not show an issue anyway; the OP most likely has another issue that isn't shown here.
@Reeza wrote:
@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.
Without the ability to see the actual numbers (which the original poster has), I can't agree with this. If there are lots of 0.049 in the data ... well ... who knows. You can't assume everything is uniformly distributed.
If I was the original poster, I would look at the dataset BOTH using viewtable (or some other data editor) and examine carefully the records that don't match, turning off any formats so you can see the raw data. That ought to clear up the matter.
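One way to do that, sketched with the cat/cat_round flags from the test step above, is to keep only the records where the two flags disagree and print them with a wide format so an attached display format doesn't mask the stored value:
data mismatch;
    set both;
    if cat ne cat_round;   /* keep only records where the two flags disagree */
run;
proc print data=mismatch (obs=50);
    var share_rate cat cat_round;
    format share_rate best32.;   /* show the full stored value, not a display format */
run;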
@PaigeMiller wrote:
@Reeza wrote:
@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.
Without the ability to see the actual numbers (which the original poster has), I can't agree with this. If there are lots of 0.049 in the data ... well ... who knows. You can't assume everything is uniformly distributed.
If I was the original poster, I would look at the dataset BOTH using viewtable (or some other data editor) and examine carefully the records that don't match, turning off any formats so you can see the raw data. That ought to clear up the matter.
I agree with the testing part, but I think the PROC FREQ output above agrees with my conclusion. If there were larger counts in the top-right cell, I might agree with you. I can't guarantee it, but I'd bet on it 😉
@Reeza, this is the log of your updated code and the result is right below it:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
61
62 data GT LT both;
63 set work.institutional_ownership_V3;
64 If share_rate >= 0.05 then output GT;
65 Else output LT;
66 if share_rate >= 0.05 then cat=1; else cat=0; if
66 ! round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
67 output BOTH;
68 run;
NOTE: There were 66812164 observations read from the data set
WORK.INSTITUTIONAL_OWNERSHIP_V3.
NOTE: The data set WORK.GT has 1119 observations and 25 variables.
NOTE: The data set WORK.LT has 66811045 observations and 25 variables.
NOTE: The data set WORK.BOTH has 66812164 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 1:46.14
cpu time 29.21 seconds
69
70 proc freq data=both;
71 table cat;table cat*cat_round;
72 run;
NOTE: There were 66812164 observations read from the data set WORK.BOTH.
NOTE: PROCEDURE FREQ used (Total process time):
real time 5.73 seconds
cpu time 5.44 seconds
73
74 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
87
Following @PaigeMiller's suggestion, I took a look at the BOTH dataset and at how observations around 0.049 were treated.
But that wasn't my suggestion. I suggested you look at all of the mismatches and see if the cause of the mismatch could be determined. My suggestion had nothing to do with 0.049.
@PaigeMiller, okay, thanks. I see the point of your suggestion. Following it, I was able to figure out the mistake.
As @Reeza suggested, the inconsistency was due to referring to the non-standardized dataset (i.e., the one prior to proc stdize ... method=RANGE). I fixed the reference and now both approaches give the same result.
Thanks so much to both of you, and especially @Reeza for pointing out both a possible mistake and the rounding issue. One little typo in referencing a dataset (work.institutional_ownership_V2 rather than work.institutional_ownership_V3) caused an unnecessary headache.
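For completeness, a hypothetical sketch of the range standardization step referred to above; the VAR list is an assumption, since the actual statement was not posted in the thread:
proc stdize data=work.institutional_ownership_V2
            out=work.institutional_ownership_V3
            method=range;
    var share_rate;   /* assumed variable list; the real one was not shown */
run;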