- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Terrific code from @ballardw, to create 'dummy variables':
Bottom 10% of a certain variable, code dummy variable as 1.
Otherwise, code the dummy variable as 0.
But I've encountered a problem....
proc summary data=&etf..&etf._combined; var i_50401_Z i_50402_Z i_50403_Z i_50404_Z i_50405_Z i_50408_Z ; output out=&etf..&etf._combined_temp (drop= _:) p10= /autoname; run; Proc sql; create table &etf..&etf._combined_2 as select a.*, (a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct , (a.i_50402_Z <= b.i_50402_Z_P10) as i_50402_Z_bottom10pct , (a.i_50403_Z <= b.i_50403_Z_P10) as i_50403_Z_bottom10pct , (a.i_50404_Z <= b.i_50404_Z_P10) as i_50404_Z_bottom10pct , (a.i_50405_Z <= b.i_50405_Z_P10) as i_50405_Z_bottom10pct , (a.i_50408_Z <= b.i_50408_Z_P10) as i_50408_Z_bottom10pct from &etf..&etf._combined as a, &etf..&etf._combined_temp as b; quit;
The problem is, there is some missing data. Blank cells.
And the above code is dummy coding these blank cells as 1.
So, I have to somehow tell the code above to only code for the lowest 10% of actual numerical values. Ignore the blanks.
Any help greatly appreciated.
Nicholas Kormanik
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How about
select a.*,
( . < a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
etc
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
(a.i_50401_Z <= b.i_50401_Z_P10) and not missing(a.i_50401_Z) as i_50401_Z_bottom10pct
or maybe even better
case when missing(a.i_50401_Z) then . else a.i_50401_Z <= b.i_50401_Z_P10 end as i_50401_Z_bottom10pct
IMHO, this whole idea of finding the bottom 10pct would be better done in PROC RANK and arrays, rather than by tediously coding SQL conditions for each variable. In addition, PROC RANK has several methods of handling ties.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Agreeing with @PaigeMiller.
See also these 3 blogs (especially the first one) :
Selecting the top n% and bottom n% of observations from a data set
By Kathryn McLawhorn on SAS Users July 21, 2017
https://blogs.sas.com/content/sgf/2017/07/21/selecting-the-top-n-and-bottom-n-of-observations-from-a...
4 ways to find the k smallest and largest data values in SAS
By Rick Wicklin on The DO Loop January 26, 2022
https://blogs.sas.com/content/iml/2022/01/26/k-smallest-largest-data.html
An easy way to make a "Top 10" table and bar chart in SAS
By Rick Wicklin on The DO Loop June 4, 2018
https://blogs.sas.com/content/iml/2018/06/04/top-10-table-bar-chart.html
Koen
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How about
select a.*,
( . < a.i_50401_Z <= b.i_50401_Z_P10) as i_50401_Z_bottom10pct ,
etc
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@ChrisNZ @PaigeMiller @sbxkoenk @ballardw
Both PaigeMiller and ChrisNZ solutions work. Good job guys. They work in the sense that the blank cells are no longer coded 1, as they were in the original code configuration. They are coded 0.
That, though, does raise another matter, but for now will overlook it. The matter is that blank cells are being coded 0.
If Logistical Regression is looking for optimizing 1, no worries, I suppose.
If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.
In fairness to ballardw, his code (found via Google search) was for highest percent of variable, not lowest -- highest percent coded to 1.
I've looked at using the Proc Rank approach and feel ballardw's coding is much easier to use by a long shot.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@NKormanik wrote:
That, though, does raise another matter, but for now will overlook it. The matter is that blank cells are being coded 0.
If Logistical Regression is looking for optimizing 1, no worries, I suppose.
If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.
Regardless of what your target event of interest is (1 versus 0), the fact that blanks are coded as 0 does matter for your final model of course!
Cheers,
Koen
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@NKormanik wrote:
@ChrisNZ @PaigeMiller @sbxkoenk @ballardw
Both PaigeMiller and ChrisNZ solutions work. Good job guys. They work in the sense that the blank cells are no longer coded 1, as they were in the original code configuration. They are coded 0.
That, though, does raise another matter, but for now will overlook it. The matter is that blank cells are being coded 0.
Incorrect. I gave you two different ways to code your SQL, the second of which produces a missing when the variable is missing.
I've looked at using the Proc Rank approach and feel ballardw's coding is much easier to use by a long shot.
I certainly disagree. By a long shot.
If Logistical Regression is looking for optimizing 1, no worries, I suppose.
If LR, or other, is focusing on 0 values, then all the blank cells coded as 0 may cause problems.
Are you planning to use these new variables that have a 1 when you are in the bottom 10% as the response variable in Logistic Regression? Wouldn't it better to treat the continuous variable as continuous and do linear regression instead of arbitrarily turning these into 0s and 1s and doing Logistic regression?
Why do you not explain the real problem, that's almost always helpful. Focusing on the mechanics of obtaining these variables loses the big picture, and it may be that what you are planning to do with these new variables is not a good idea. This appears to be another example of the XY Problem in action.
Paige Miller