Yegen
Pyrite | Level 9

I was running the following (simple) code to construct an indicator variable that equals 1 when share_rate is greater than or equal to 0.05.

data work.institutional_ownership_V4;
    set work.institutional_ownership_V3;
    if share_rate >= 0.05 then BlockHolder = 1;
    else BlockHolder = 0;
run;
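(As an aside, a comparison in SAS already evaluates to 1 for true and 0 for false, so the same indicator can be written as a single assignment. This behaves identically to the IF/ELSE version above, including for missing share_rate, which compares as less than 0.05:)

data work.institutional_ownership_V4;
    set work.institutional_ownership_V3;
    /* the comparison itself returns 1 (true) or 0 (false) */
    BlockHolder = (share_rate >= 0.05);
run;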

For unknown reasons, the code does not work properly. In particular, the following code lets me count how many observations should have BlockHolder = 1:

data work.try;
    set work.institutional_ownership_V3;
    /* subsetting IF keeps only the records that satisfy the condition */
    if share_rate >= 0.05;
run;

The first step yields about 1,000 observations where BlockHolder = 1, but the second yields more than 100,000 observations. I am not sure why this discrepancy occurs.

I recall @Rick_SAS stating that "Using the DATA step is cumbersome and prone to errors, so I much prefer using the SAS procedures that can create dummy variables." I was hoping to use the GLMMOD procedure as Rick suggested in his blog post, but I noticed that this procedure creates dummy variables from categorical variables. Since I am working with a continuous variable rather than a categorical one, I am not sure how to make use of PROC GLMMOD or any other procedure-based approach. I couldn't find any post referring to an alternative to the DATA step for this case.
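(For what it's worth, one procedure-style route for a continuous cut-point is to bin the variable with a user-defined format, so that procedures treat it as categorical. A minimal sketch; the format name blockfmt is made up for illustration:)

proc format;
    value blockfmt low -< 0.05 = '0'   /* below the 5% threshold */
                   0.05 - high = '1';  /* blockholder            */
run;

proc freq data=work.institutional_ownership_V3;
    format share_rate blockfmt.;
    tables share_rate;
run;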


Any tips / help will be much appreciated. 


1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Those should be giving you identical results. So either:

1. Rounding issues
2. You're doing something wrong.

Given the scope, I'm leaning towards #2.

I don't recall Rick ever saying a DATA step is cumbersome and prone to errors; curious where you got that from?

Run a PROC FREQ on your first results and post the output and log.
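(Something like this; BlockHolder and the V4 dataset are from your first step:)

proc freq data=work.institutional_ownership_V4;
    tables BlockHolder;
run;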


Try rounding and see if that resolves it, but given the numbers you've stated I don't think so. Post the results and log from the step below.

I suspect something else is the issue here.

data GT LT both;
    set work.institutional_ownership_V3;
    /* route each record by the raw test */
    if share_rate >= 0.05 then output GT;
    else output LT;

    /* flag with the raw test */
    if share_rate >= 0.05 then cat = 1;
    else cat = 0;

    /* flag with the rounded test */
    if round(share_rate, 0.01) >= 0.05 then cat_round = 1;
    else cat_round = 0;
    output;
run;

proc freq data=both;
    table cat;
    table cat*cat_round;
run;

View solution in original post

11 REPLIES
Yegen
Pyrite | Level 9

Thanks for your helpful reply, @Reeza. I was referring to the comment Rick made in the blog post titled "Create dummy variables in SAS." See the comments below the post. 

Please see the log and the PROC FREQ result below:

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 61         
 62         data GT LT both;
 63         set work.institutional_ownership_V3;
 64         If share_rate >= 0.05 then output GT;
 65          Else output LT;
 66             if share_rate >= 0.05 then cat=1; else cat=0;    if
 66       ! round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
 67         output;
 68         run;
 
 NOTE: There were 66812164 observations read from the data set 
       WORK.INSTITUTIONAL_OWNERSHIP_V3.
 NOTE: The data set WORK.GT has 66813283 observations and 25 variables.
 NOTE: The data set WORK.LT has 133623209 observations and 25 variables.
 NOTE: The data set WORK.BOTH has 66812164 observations and 25 variables.
 NOTE: DATA statement used (Total process time):
       real time           2:51.99
       cpu time            51.55 seconds
       
 
 69         
 70         proc freq data=both;
 71         table cat;table cat*cat_round;
 72         run;
 
 NOTE: There were 66812164 observations read from the data set WORK.BOTH.
 NOTE: PROCEDURE FREQ used (Total process time):
       real time           54.99 seconds
       cpu time            6.96 seconds
       
 
 73         
 74         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 87         


[Attachment: freq.png — PROC FREQ tables for cat and the cat*cat_round crosstab]

PaigeMiller
Diamond | Level 26

I'm guessing (but haven't really dug into this) that the ROUND function is applying a different rule than the plain share_rate >= 0.05 comparison.

--
Paige Miller
Reeza
Super User

Sorry, small mistake in the test script. That last OUTPUT should be output BOTH;

data GT LT both;
    set work.institutional_ownership_V3;
    /* route each record by the raw test */
    if share_rate >= 0.05 then output GT;
    else output LT;

    /* flag with the raw test */
    if share_rate >= 0.05 then cat = 1;
    else cat = 0;

    /* flag with the rounded test */
    if round(share_rate, 0.01) >= 0.05 then cat_round = 1;
    else cat_round = 0;
    output BOTH;   /* corrected: write every record to BOTH */
run;

proc freq data=both;
    table cat;
    table cat*cat_round;
run;
PaigeMiller
Diamond | Level 26

If you use round(share_rate, 0.01) >= 0.05, guess what? A value of 0.049 gets rounded to 0.05 and then passes the greater-than-or-equal test. However, 0.049 will not pass the test share_rate >= 0.05.
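A quick way to see this boundary effect (a minimal check; 0.049 is just the example value from above):

data _null_;
    share_rate = 0.049;
    raw_flag   = (share_rate >= 0.05);               /* 0: the raw value fails  */
    round_flag = (round(share_rate, 0.01) >= 0.05);  /* 1: 0.049 rounds to 0.05 */
    put share_rate= raw_flag= round_flag=;
run;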

--
Paige Miller
Reeza
Super User

@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.

In fact, the crosstab from the incorrect results shows this: fewer than 1,000 records are off the diagonal, so the numbers should be much more similar than indicated. I suspect the 'correct' test will not show an issue anyway; the OP most likely has another issue that really isn't shown here.

PaigeMiller
Diamond | Level 26

@Reeza wrote:

@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.

Without the ability to see the actual numbers (which the original poster has), I can't agree with this. If there are lots of 0.049 in the data ... well ... who knows. You can't assume everything is uniformly distributed.

If I were the original poster, I would look at the dataset BOTH using VIEWTABLE (or some other data viewer) and examine carefully the records that don't match, turning off any formats so you can see the raw data. That ought to clear up the matter.
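In code, pulling the mismatches might look something like this (a sketch; it assumes the cat and cat_round flags from the test step above):

data mismatches;
    set both;
    where cat ne cat_round;   /* records where the raw and rounded tests disagree */
    format _all_;             /* clear formats so the raw stored values display   */
run;

proc print data=mismatches(obs=20);
    var share_rate cat cat_round;
run;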

--
Paige Miller
Reeza
Super User

@PaigeMiller wrote:

@Reeza wrote:

@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.

Without the ability to see the actual numbers (which the original poster has), I can't agree with this. If there are lots of 0.049 in the data ... well ... who knows. You can't assume everything is uniformly distributed.

If I were the original poster, I would look at the dataset BOTH using VIEWTABLE (or some other data viewer) and examine carefully the records that don't match, turning off any formats so you can see the raw data. That ought to clear up the matter.

I agree with the testing part, but I think the PROC FREQ output above agrees with my conclusion. If there were larger counts in the top-right cell, I might agree with you. I can't guarantee it, but I'd bet on it 😉

Yegen
Pyrite | Level 9

@Reeza, this is the log of your updated code, and the result is right below it:

1          OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 61         
 62         data GT LT both;
 63         set work.institutional_ownership_V3;
 64         If share_rate >= 0.05 then output GT;
 65          Else output LT;
 66             if share_rate >= 0.05 then cat=1; else cat=0;    if
 66       ! round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
 67         output BOTH;
 68         run;
 
 NOTE: There were 66812164 observations read from the data set 
       WORK.INSTITUTIONAL_OWNERSHIP_V3.
 NOTE: The data set WORK.GT has 1119 observations and 25 variables.
 NOTE: The data set WORK.LT has 66811045 observations and 25 variables.
 NOTE: The data set WORK.BOTH has 66812164 observations and 25 variables.
 NOTE: DATA statement used (Total process time):
       real time           1:46.14
       cpu time            29.21 seconds
       
 
 69         
 70         proc freq data=both;
 71         table cat;table cat*cat_round;
 72         run;
 
 NOTE: There were 66812164 observations read from the data set WORK.BOTH.
 NOTE: PROCEDURE FREQ used (Total process time):
       real time           5.73 seconds
       cpu time            5.44 seconds
       
 
 73         
 74         OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
 87         

[Attachment: Screen Shot 2017-05-28 at 7.31.47 PM.png — PROC FREQ tables for cat and the cat*cat_round crosstab from the corrected run]


Following @PaigeMiller's suggestion, I took a look at the BOTH dataset. Here is an example of how observations around 0.049 were treated:

[Attachment: Screen Shot 2017-05-28 at 7.31.22 PM.png — rows from BOTH with share_rate values near 0.049]

PaigeMiller
Diamond | Level 26

But that wasn't my suggestion. I suggested you look at all of the mismatches and see if the cause of the mismatch could be determined. My suggestion had nothing to do with 0.049.

--
Paige Miller
Yegen
Pyrite | Level 9

@PaigeMiller, okay, thanks. I see the point of your suggestion, and following it I was able to figure out the mistake.

As @Reeza suggested, the inconsistency was due to referring to the non-standardized dataset (i.e., the one prior to proc stdize ... method=RANGE). I fixed the issue and got the same result from both approaches.

Thanks so much to both of you, especially @Reeza, for pointing out the possible mistake and the rounding-effect issue. One little typo in referencing a dataset (work.institutional_ownership_V2 rather than work.institutional_ownership_V3) caused an unnecessary headache.
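(For context, the standardization step referred to above has roughly this shape; treat it as a sketch, since the actual VAR list isn't shown in this thread:)

proc stdize data=work.institutional_ownership_V2
            out=work.institutional_ownership_V3
            method=RANGE;
    var share_rate;
run;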

