I have a large dataset that is too big to open and filter within SAS or SAS Enterprise Guide.
My current code spits out the top 25k results, previously sorted by Field 1:
data RESULTS.&top_set;
set top_set (obs=25000);
run;
This is repeated after sorting Field 1 in descending order:
data RESULTS.&bottom_set;
set bottom_set (obs=25000);
run;
Combined, yields the top/bottom results:
data top_bottom;
set RESULTS.&top_set RESULTS.&bottom_set;
run;
What I want to generate is the top 1,000 and bottom 1,000 results, but for each group.
There is another field "Countries", which captures the # of countries used to create the result in FIELD 1.
So, what I really want in the output is the top and bottom results from within each unique Countries value.
So, the top and bottom for Countries =57, plus the the top and bottom for Countries =56, the top and bottom for Countries =55, etc....
For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.
Thanks
So, how do we know what constitutes a "group"? A variable? A combination of variables? Telepathy? Please provide a clearer definition and best is terms of actual variables.
Will you state that every group has at least 2000 records? If not you need to provide a description of what you desire for smaller groups as there will be overlaps between upper and lower. And if you ever have fewer than 1000 records in a group then you have yet another problem to describe the desired behavior.
For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.
Likely the easier part of this.
Here is one approach but does not attempt to include any fix for fewer records involved. This assumes that each level of the variable Sex defines a "group".
proc sort data=sashelp.class out=work.class;
by sex;
run;
proc summary data=work.class nway;
class sex ;
output out=work.sexcount(drop=_type_);
run;
data temp;
merge work.class
work.sexcount
;
by sex;
if first.sex then counter=.;
counter+1;
/* picking top/bottom 3 when _freq_ is at least 6*/
if counter le 3 or (_freq_-counter) le 2 ;
run;
If you really know that you have at least 2000 records in each "group" that last line would be
if counter le 1000 or (_freq_-counter) le 999 ;
How big are the data sets?
Can you show an example of your input and expected output?
Or can you illustrate what you want as output using sashelp.heart or sashelp.cars as your input data set.
The input and output are SAS datasets and have identical structure with just two fields. What I want is results that have been filtered for the top x and bottom x from within each Countries value. Here is an example of the data. In this limited dataset, there are only 2 values in the Countries field (19, 20). The full database has 10s of millions of rows and has Countries values that range from 5 to 100:
bluneconomic Countries
-0.695725305 19
-0.695386876 19
-0.695117345 19
-0.694779503 20
-0.693231578 19
-0.692606254 20
-0.692243075 20
-0.691876134 20
-0.691733435 19
-0.691376988 19
-0.69006756 20
-0.690016635 19
-0.689842044 20
-0.689760414 19
-0.689543252 19
-0.689053336 20
-0.689001225 19
ods select none;
ods output ExtremeValues=want;
proc univariate data=yourData nextrval=10;
class countries;
var bluneconomic;
run;
ods select all;
Thanks Reeza, but I do not want ODS output. I want a SAS datafile. Thanks
So, how do we know what constitutes a "group"? A variable? A combination of variables? Telepathy? Please provide a clearer definition and best is terms of actual variables.
Will you state that every group has at least 2000 records? If not you need to provide a description of what you desire for smaller groups as there will be overlaps between upper and lower. And if you ever have fewer than 1000 records in a group then you have yet another problem to describe the desired behavior.
For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.
Likely the easier part of this.
Here is one approach but does not attempt to include any fix for fewer records involved. This assumes that each level of the variable Sex defines a "group".
proc sort data=sashelp.class out=work.class;
by sex;
run;
proc summary data=work.class nway;
class sex ;
output out=work.sexcount(drop=_type_);
run;
data temp;
merge work.class
work.sexcount
;
by sex;
if first.sex then counter=.;
counter+1;
/* picking top/bottom 3 when _freq_ is at least 6*/
if counter le 3 or (_freq_-counter) le 2 ;
run;
If you really know that you have at least 2000 records in each "group" that last line would be
if counter le 1000 or (_freq_-counter) le 999 ;
I thought I was clear. The group is the field Countries. I want the output dataset to be filtered by the field called Countries, using the other field Bluneconomic to determine to highest and lowest values.
In the supplied data set, above, Countries has two values 19 and 20.
Lets say I want to top and bottom 2 results. So the output dataset would have 8 rows: the top 2 Bluneconomic values where Countries =19, the top 2 Bluneconomic values where Countries =20,the bottom 2 Bluneconomic values where Countries =19, and the bottom 2 Bluneconomic values where Countries =20.
Your original post does not include the word "Bluneconomics"
I have a large dataset that is too big to open and filter within SAS or SAS Enterprise Guide.
My current code spits out the top 25k results, previously sorted by Field 1:
data RESULTS.&top_set;
set top_set (obs=25000);
run;
This is repeated after sorting Field 1 in descending order:
data RESULTS.&bottom_set;
set bottom_set (obs=25000);
run;
Combined, yields the top/bottom results:
data top_bottom;
set RESULTS.&top_set RESULTS.&bottom_set;
run;
What I want to generate is the top 1,000 and bottom 1,000 results, but for each group.
There is another field "Countries", which captures the # of countries used to create the result in FIELD 1.
So, what I really want in the output is the top and bottom results from within each unique Countries value.
So, the top and bottom for Countries =57, plus the the top and bottom for Countries =56, the top and bottom for Countries =55, etc....
For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.
Thanks
So you would sort by Country Bluneconomics in the first sort to get increasing values by row, or by Country descending Blunecomnics; to get decreasing values of the economics within each country value.
Thanks. Your suggested code works well, even when there are fewer results than the counter setting. If I only wanted the top results rather than both the top and bottom, how would the code be modded? Thanks
April 27 – 30 | Gaylord Texan | Grapevine, Texas
Walk in ready to learn. Walk out ready to deliver. This is the data and AI conference you can't afford to miss.
Register now and lock in 2025 pricing—just $495!
Still thinking about your presentation idea? The submission deadline has been extended to Friday, Nov. 14, at 11:59 p.m. ET.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.