SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
NKormanik
Barite | Level 11

Terrific code from @ballardw, to create 'dummy variables':

Bottom 10% of a certain variable, code dummy variable as 1.

Otherwise, code the dummy variable as 0.

But I've encountered a problem....

 

proc summary data=&etf..&etf._combined;
var
i_50401_Z
i_50402_Z
i_50403_Z
i_50404_Z
i_50405_Z
i_50408_Z
;

output out=&etf..&etf._combined_temp (drop= _:) p10= /autoname;
run;

Proc sql;
create table &etf..&etf._combined_2 as
select a.*,
(a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
(a.i_50402_Z <= b.i_50402_Z_P10) as i_50402_Z_bottom10pct ,
(a.i_50403_Z <= b.i_50403_Z_P10) as i_50403_Z_bottom10pct ,
(a.i_50404_Z <= b.i_50404_Z_P10) as i_50404_Z_bottom10pct ,
(a.i_50405_Z <= b.i_50405_Z_P10) as i_50405_Z_bottom10pct ,
(a.i_50408_Z <= b.i_50408_Z_P10) as i_50408_Z_bottom10pct

from &etf..&etf._combined as a,  &etf..&etf._combined_temp as b;

quit;

 

The problem is, there is some missing data.  Blank cells.

 

And the above code is dummy coding these blank cells as 1.

 

So, I have to somehow tell the code above to only code for the lowest 10% of actual numerical values.  Ignore the blanks.

 

Any help greatly appreciated.

 

Nicholas Kormanik

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

How about

select a.*,
( . < a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
etc

 

View solution in original post

6 REPLIES 6
PaigeMiller
Diamond | Level 26
(a.i_50401_Z <= b.i_50401_Z_P10) and not missing(a.i_50401_Z) as i_50401_Z_bottom10pct

or maybe even better

 

case when missing(a.i_50401_Z) then . else a.i_50401_Z <= b.i_50401_Z_P10 end as i_50401_Z_bottom10pct

 

IMHO, this whole idea of finding the bottom 10pct would be better done in PROC RANK and arrays, rather than by tediously coding SQL conditions for each variable. In addition, PROC RANK has several methods of handling ties.

--
Paige Miller
sbxkoenk
SAS Super FREQ

Agreeing with @PaigeMiller.

 

See also these 3 blogs (especially the first one) :

 

Selecting the top n% and bottom n% of observations from a data set
By Kathryn McLawhorn on SAS Users July 21, 2017
https://blogs.sas.com/content/sgf/2017/07/21/selecting-the-top-n-and-bottom-n-of-observations-from-a...

 

4 ways to find the k smallest and largest data values in SAS
By Rick Wicklin on The DO Loop January 26, 2022
https://blogs.sas.com/content/iml/2022/01/26/k-smallest-largest-data.html

 

An easy way to make a "Top 10" table and bar chart in SAS
By Rick Wicklin on The DO Loop June 4, 2018
https://blogs.sas.com/content/iml/2018/06/04/top-10-table-bar-chart.html

 

Koen

ChrisNZ
Tourmaline | Level 20

How about

select a.*,
( . < a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
etc

 

NKormanik
Barite | Level 11

@ChrisNZ @PaigeMiller @sbxkoenk @ballardw

 

Both PaigeMiller and ChrisNZ solutions work.  Good job guys.  They work in the sense that the blank cells are no longer coded 1, as they were in the original code configuration.  They are coded 0.

 

That, though, does raise another matter, but for now will overlook it.  The matter is that blank cells are being coded 0.

 

If Logistical Regression is looking for optimizing 1, no worries, I suppose.

 

If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.

 

In fairness to ballardw, his code (found via Google search) was for highest percent of variable, not lowest -- highest percent coded to 1.

 

I've looked at using the Proc Rank approach and feel ballardw's coding is much easier to use by a long shot.

 

 

sbxkoenk
SAS Super FREQ

@NKormanik wrote:

That, though, does raise another matter, but for now will overlook it.  The matter is that blank cells are being coded 0.

If Logistical Regression is looking for optimizing 1, no worries, I suppose.

If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.

 

Regardless of what your target event of interest is (1 versus 0), the fact that blanks are coded as 0 does matter for your final model of course!

 

Cheers,

Koen

PaigeMiller
Diamond | Level 26

@NKormanik wrote:

@ChrisNZ @PaigeMiller @sbxkoenk @ballardw

 

Both PaigeMiller and ChrisNZ solutions work.  Good job guys.  They work in the sense that the blank cells are no longer coded 1, as they were in the original code configuration.  They are coded 0.

 

That, though, does raise another matter, but for now will overlook it.  The matter is that blank cells are being coded 0.


Incorrect. I gave you two different ways to code your SQL, the second of which produces a missing when the variable is missing.

 

I've looked at using the Proc Rank approach and feel ballardw's coding is much easier to use by a long shot.

 

I certainly disagree. By a long shot.

 

 

 

If Logistical Regression is looking for optimizing 1, no worries, I suppose.

 

If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.

 

Are you planning to use these new variables that have a 1 when you are in the bottom 10% as the response variable in Logistic Regression? Wouldn't it better to treat the continuous variable as continuous and do linear regression instead of arbitrarily turning these into 0s and 1s and doing Logistic regression?

 

Why do you not explain the real problem, that's almost always helpful. Focusing on the mechanics of obtaining these variables loses the big picture, and it may be that what you are planning to do with these new variables is not a good idea. This appears to be another example of the XY Problem in action.

 

 

--
Paige Miller

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 987 views
  • 11 likes
  • 4 in conversation