Solved: Re: Code Bottom 10%, But Not Including Blank Cells!

NKormanik · Posted 02-28-2022 05:20 AM

Terrific code from @ballardw, to create 'dummy variables':

Bottom 10% of a certain variable, code dummy variable as 1.

Otherwise, code the dummy variable as 0.

But I've encountered a problem....

proc summary data=&etf..&etf._combined;
var
i_50401_Z
i_50402_Z
i_50403_Z
i_50404_Z
i_50405_Z
i_50408_Z
;

output out=&etf..&etf._combined_temp (drop= _:) p10= /autoname;
run;

Proc sql;
create table &etf..&etf._combined_2 as
select a.*,
(a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
(a.i_50402_Z <= b.i_50402_Z_P10) as i_50402_Z_bottom10pct ,
(a.i_50403_Z <= b.i_50403_Z_P10) as i_50403_Z_bottom10pct ,
(a.i_50404_Z <= b.i_50404_Z_P10) as i_50404_Z_bottom10pct ,
(a.i_50405_Z <= b.i_50405_Z_P10) as i_50405_Z_bottom10pct ,
(a.i_50408_Z <= b.i_50408_Z_P10) as i_50408_Z_bottom10pct

from &etf..&etf._combined as a,  &etf..&etf._combined_temp as b;

quit;

The problem is, there is some missing data. Blank cells.

And the above code is dummy coding these blank cells as 1.

So, I have to somehow tell the code above to only code for the lowest 10% of actual numerical values. Ignore the blanks.

Any help greatly appreciated.

Nicholas Kormanik

ChrisNZ · Posted 03-01-2022 01:02 AM

How about

select a.*,
( . < a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
etc

High-Performance SAS Coding - Third Edition

View solution in original post

PaigeMiller · Posted 02-28-2022 05:54 AM

(a.i_50401_Z <= b.i_50401_Z_P10) and not missing(a.i_50401_Z) as i_50401_Z_bottom10pct

or maybe even better

case when missing(a.i_50401_Z) then . else a.i_50401_Z <= b.i_50401_Z_P10 end as i_50401_Z_bottom10pct

IMHO, this whole idea of finding the bottom 10pct would be better done in PROC RANK and arrays, rather than by tediously coding SQL conditions for each variable. In addition, PROC RANK has several methods of handling ties.

--
Paige Miller

sbxkoenk · Posted 02-28-2022 11:36 AM

Agreeing with @PaigeMiller.

See also these 3 blogs (especially the first one) :

Selecting the top n% and bottom n% of observations from a data set
By Kathryn McLawhorn on SAS Users July 21, 2017
https://blogs.sas.com/content/sgf/2017/07/21/selecting-the-top-n-and-bottom-n-of-observations-from-a...

4 ways to find the k smallest and largest data values in SAS
By Rick Wicklin on The DO Loop January 26, 2022
https://blogs.sas.com/content/iml/2022/01/26/k-smallest-largest-data.html

An easy way to make a "Top 10" table and bar chart in SAS
By Rick Wicklin on The DO Loop June 4, 2018
https://blogs.sas.com/content/iml/2018/06/04/top-10-table-bar-chart.html

Koen

ChrisNZ · Posted 03-01-2022 01:02 AM

How about

select a.*,
( . < a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
etc

High-Performance SAS Coding - Third Edition

NKormanik · Posted 03-01-2022 01:55 AM

@ChrisNZ @PaigeMiller @sbxkoenk @ballardw

Both PaigeMiller and ChrisNZ solutions work. Good job guys. They work in the sense that the blank cells are no longer coded 1, as they were in the original code configuration. They are coded 0.

That, though, does raise another matter, but for now will overlook it. The matter is that blank cells are being coded 0.

If Logistical Regression is looking for optimizing 1, no worries, I suppose.

If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.

In fairness to ballardw, his code (found via Google search) was for highest percent of variable, not lowest -- highest percent coded to 1.

I've looked at using the Proc Rank approach and feel ballardw's coding is much easier to use by a long shot.

sbxkoenk · Posted 03-01-2022 04:12 AM

@NKormanik wrote:

That, though, does raise another matter, but for now will overlook it. The matter is that blank cells are being coded 0.

If Logistical Regression is looking for optimizing 1, no worries, I suppose.

If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.

Regardless of what your target event of interest is (1 versus 0), the fact that blanks are coded as 0 does matter for your final model of course!

Cheers,

Koen

PaigeMiller · Posted 03-01-2022 06:07 AM

@NKormanik wrote:

@ChrisNZ @PaigeMiller @sbxkoenk @ballardw

Both PaigeMiller and ChrisNZ solutions work. Good job guys. They work in the sense that the blank cells are no longer coded 1, as they were in the original code configuration. They are coded 0.

That, though, does raise another matter, but for now will overlook it. The matter is that blank cells are being coded 0.

Incorrect. I gave you two different ways to code your SQL, the second of which produces a missing when the variable is missing.

I've looked at using the Proc Rank approach and feel ballardw's coding is much easier to use by a long shot.

I certainly disagree. By a long shot.

If Logistical Regression is looking for optimizing 1, no worries, I suppose.

If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.

Are you planning to use these new variables that have a 1 when you are in the bottom 10% as the response variable in Logistic Regression? Wouldn't it better to treat the continuous variable as continuous and do linear regression instead of arbitrarily turning these into 0s and 1s and doing Logistic regression?

Why do you not explain the real problem, that's almost always helpful. Focusing on the mechanics of obtaining these variables loses the big picture, and it may be that what you are planning to do with these new variables is not a good idea. This appears to be another example of the XY Problem in action.

--
Paige Miller

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away