I was running the following (simple) code to construct an indicator variable that equals 1 when share_rate is greater than or equal to 0.05.
data work.institutional_ownership_V4;
set work.institutional_ownership_V3;
If share_rate >= 0.05 then BlockHolder = 1; Else BlockHolder = 0;
run;
For unknown reasons, the code does not work properly. In particular, when I run the following code I can count how many observations I should have:
data work.try;
set work.institutional_ownership_V3;
if share_rate >= 0.05;
run;
The first code yields about 1,000 observations where BlockHolder=1, but the second code yields more than 100,000 observations. I am not sure why such an error occurs.
I recall @Rick_SAS stating that "Using the DATA step is cumbersome and prone to errors, so I much prefer using the SAS procedures that can create dummy variables." I was hoping to use the GLMMOD procedure as Rick suggested in his blog post, but I noticed that this procedure creates dummy variables for each categorical variable. Since I am not using a categorical variable but a continuous one, I wasn't sure how I could make use of the GLMMOD procedure or any other non-DATA-step procedure. I couldn't find any post referring to an alternative to the DATA step.
Any tips / help will be much appreciated.
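(Side note: a comparison in SAS already evaluates to 1 or 0, so an indicator like this can be built without IF/THEN at all, either in a DATA step or, if a non-DATA-step route is preferred, with PROC SQL. This is only a minimal sketch using the dataset and variable names from the post, not the GLMMOD approach from Rick's article:)
data work.institutional_ownership_V4;
    set work.institutional_ownership_V3;
    BlockHolder = (share_rate >= 0.05);   /* Boolean expression returns 1 or 0 */
run;
proc sql;   /* equivalent result via PROC SQL instead of a DATA step */
    create table work.institutional_ownership_V4 as
    select *, (share_rate >= 0.05) as BlockHolder
    from work.institutional_ownership_V3;
quit;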
Those should be giving you identical results. So either:
1. Rounding issues
2. You're doing something wrong.
Given the scope I'm leaning towards #2.
I don't recall Rick ever saying a DATA step is cumbersome and prone to errors; curious where you got that from?
Run a proc freq on your first results and post the output and log.
Try rounding and see if that resolves it, but given the numbers you've stated I don't think so. Post the results and log from below.
I suspect something else is the issue here.
data GT LT both;
set work.institutional_ownership_V3;
If share_rate >= 0.05 then output GT;
Else output LT;
if share_rate >= 0.05 then cat=1; else cat=0;
if round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
output;
run;
proc freq data=both;
table cat;
table cat*cat_round;
run;
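Another quick sanity check, sketched here with the dataset and variable names assumed from the thread, is to print the raw values sitting near the boundary with a wide numeric format, so a display format doesn't hide small differences in the stored values:
proc print data=work.institutional_ownership_V3 (obs=20);
    where share_rate >= 0.045 and share_rate < 0.055;   /* values near the 0.05 cutoff */
    var share_rate;
    format share_rate best32.;   /* show the full stored value */
run;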
Thanks for your helpful reply, @Reeza. I was referring to the comment Rick made in the blog post titled "Create dummy variables in SAS." See the comments below the post.
Please see below the log and the proc freq result:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
61
62 data GT LT both;
63 set work.institutional_ownership_V3;
64 If share_rate >= 0.05 then output GT;
65 Else output LT;
66 if share_rate >= 0.05 then cat=1; else cat=0; if
66 ! round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
67 output;
68 run;
NOTE: There were 66812164 observations read from the data set
WORK.INSTITUTIONAL_OWNERSHIP_V3.
NOTE: The data set WORK.GT has 66813283 observations and 25 variables.
NOTE: The data set WORK.LT has 133623209 observations and 25 variables.
NOTE: The data set WORK.BOTH has 66812164 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 2:51.99
cpu time 51.55 seconds
69
70 proc freq data=both;
71 table cat;table cat*cat_round;
72 run;
NOTE: There were 66812164 observations read from the data set WORK.BOTH.
NOTE: PROCEDURE FREQ used (Total process time):
real time 54.99 seconds
cpu time 6.96 seconds
73
74 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
87
I'm guessing (but haven't really dug into this) that the ROUND function is using different rules than share_rate >= 0.05 is using.
Sorry, small mistake in the test script. That last OUTPUT should be output BOTH;
data GT LT both;
set work.institutional_ownership_V3;
If share_rate >= 0.05 then output GT;
Else output LT;
if share_rate >= 0.05 then cat=1; else cat=0; if round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
output BOTH;
run;
proc freq data=both;
table cat;table cat*cat_round;
run;
If you use round(share_rate, 0.01) >= 0.05, guess what? A value of 0.049 gets rounded to 0.05 and is then greater than or equal to 0.05. However, the value 0.049 will not pass the test share_rate >= 0.05.
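A one-observation sketch of that boundary effect (the 0.049 value is only an illustration):
data _null_;
    share_rate = 0.049;
    raw_flag   = (share_rate >= 0.05);               /* 0: 0.049 fails the raw comparison */
    round_flag = (round(share_rate, 0.01) >= 0.05);  /* 1: 0.049 rounds to 0.05 first */
    put raw_flag= round_flag=;
run;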
@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.
In fact, the crosstab from the incorrect results shows this: fewer than 1,000 records fall off the diagonal, so the two counts should be much more similar than indicated. I suspect the 'correct' test will not show an issue anyway; the OP most likely has another issue that isn't shown here.
@Reeza wrote:
@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.
Without the ability to see the actual numbers (which the original poster has), I can't agree with this. If there are lots of 0.049 in the data ... well ... who knows. You can't assume everything is uniformly distributed.
If I was the original poster, I would look at the dataset BOTH using viewtable (or some other data editor) and examine carefully the records that don't match, turning off any formats so you can see the raw data. That ought to clear up the matter.
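One way to do that, sketched with the cat/cat_round flags from the test step above, is to keep only the records where the two flags disagree and print them with a wide format so an attached display format doesn't mask the stored value:
data mismatch;
    set both;
    if cat ne cat_round;   /* keep only records where the two flags disagree */
run;
proc print data=mismatch (obs=50);
    var share_rate cat cat_round;
    format share_rate best32.;   /* show the full stored value, not a display format */
run;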
@PaigeMiller wrote:
@Reeza wrote:
@PaigeMiller Agreed, however having the numbers off by several hundred thousand at a boundary seems incorrect to me. So there's something else going on.
Without the ability to see the actual numbers (which the original poster has), I can't agree with this. If there are lots of 0.049 in the data ... well ... who knows. You can't assume everything is uniformly distributed.
If I was the original poster, I would look at the dataset BOTH using viewtable (or some other data editor) and examine carefully the records that don't match, turning off any formats so you can see the raw data. That ought to clear up the matter.
I agree with the testing part, but I think the PROC FREQ output above agrees with my conclusion. If there were larger counts in the top-right cell, I might agree with you. I can't guarantee it, but I'd bet on it 😉
@Reeza, this is the log of your updated code and the result is right below it:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
61
62 data GT LT both;
63 set work.institutional_ownership_V3;
64 If share_rate >= 0.05 then output GT;
65 Else output LT;
66 if share_rate >= 0.05 then cat=1; else cat=0; if
66 ! round(share_rate, 0.01) >= 0.05 then cat_round=1; else cat_round=0;
67 output BOTH;
68 run;
NOTE: There were 66812164 observations read from the data set
WORK.INSTITUTIONAL_OWNERSHIP_V3.
NOTE: The data set WORK.GT has 1119 observations and 25 variables.
NOTE: The data set WORK.LT has 66811045 observations and 25 variables.
NOTE: The data set WORK.BOTH has 66812164 observations and 25 variables.
NOTE: DATA statement used (Total process time):
real time 1:46.14
cpu time 29.21 seconds
69
70 proc freq data=both;
71 table cat;table cat*cat_round;
72 run;
NOTE: There were 66812164 observations read from the data set WORK.BOTH.
NOTE: PROCEDURE FREQ used (Total process time):
real time 5.73 seconds
cpu time 5.44 seconds
73
74 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
87
Following @PaigeMiller's suggestion, I took a look at the BOTH dataset and at how observations around 0.049 were treated.
But that wasn't my suggestion. I suggested you look at all of the mismatches and see if the cause of the mismatch could be determined. My suggestion had nothing to do with 0.049.
@PaigeMiller, okay, thanks. I see the point of your suggestion. Following it, I was able to figure out the mistake.
As @Reeza suggested, the inconsistency was due to referring to the non-standardized dataset (i.e., the one prior to proc stdize ... method=RANGE). I fixed the reference and now both approaches give the same result.
Thanks so much to both of you, and especially @Reeza for pointing out both a possible mistake and the rounding issue. One little typo in referencing a dataset (work.institutional_ownership_V2 rather than work.institutional_ownership_V3) caused an unnecessary headache.
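For completeness, a hypothetical sketch of the range standardization step referred to above; the VAR list is an assumption, since the actual statement was not posted in the thread:
proc stdize data=work.institutional_ownership_V2
            out=work.institutional_ownership_V3
            method=range;
    var share_rate;   /* assumed variable list; the real one was not shown */
run;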