BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
texasmfp
Lapis Lazuli | Level 10

I have a large dataset that is too big to open and filter within SAS or SAS Enterprise Guide.

 

My current code spits out the top 25k results, previously sorted by Field 1:

data RESULTS.&top_set;
set top_set (obs=25000);
run;

 

This is repeated after sorting Field 1 in descending order:

data RESULTS.&bottom_set;
set bottom_set (obs=25000);
run;

 

Combined, yields the top/bottom results:

data top_bottom;
set RESULTS.&top_set RESULTS.&bottom_set;
run;

 

What I want to generate is the top 1,000 and bottom 1,000 results, but for each group.

There is another field "Countries", which captures the # of countries used to create the result in FIELD 1.

So, what I really want in the output is the top and bottom results from within each unique Countries value.

So, the top and bottom for Countries =57, plus the the top and bottom for Countries =56, the top and bottom for Countries =55, etc....

For each generation of runs, the range of Countries varies.  So, it needs to be scalable rather than hard code for a specific list of Countries =values.

 

Thanks

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

So, how do we know what constitutes a "group"? A variable? A combination of variables? Telepathy? Please provide a clearer definition and best is terms of actual variables.

 

Will you state that every group has at least 2000 records? If not you need to provide a description of what you desire for smaller groups as there will be overlaps between upper and lower. And if you ever have fewer than 1000 records in a group then you have yet another problem to describe the desired behavior.

 

For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.

Likely the easier part of this.

 

Here is one approach but does not attempt to include any fix for fewer records involved. This assumes that each level of the variable Sex defines a "group".

proc sort data=sashelp.class out=work.class;
   by sex;
run;

proc summary data=work.class nway;
   class sex ;
   output out=work.sexcount(drop=_type_);
run;

data temp;
  merge work.class
        work.sexcount
  ;
  by sex;
  if first.sex then counter=.;
   counter+1;
  /* picking top/bottom 3 when _freq_ is at least 6*/
  if counter le 3 or (_freq_-counter) le 2 ; 
run;

If you really know that you have at least 2000 records in each "group" that last line would be

if counter le 1000 or (_freq_-counter) le 999 ;

 

View solution in original post

9 REPLIES 9
Reeza
Super User

How big are the data sets?

 

Can you show an example of your input and expected output?
Or can you illustrate what you want as output using sashelp.heart or sashelp.cars as your input data set.

texasmfp
Lapis Lazuli | Level 10

The input and output are SAS datasets and have identical structure with just two fields.  What I want is results that have been filtered for the top x and bottom x from within each Countries value.  Here is an example of the data.  In this limited dataset, there are only 2 values in the Countries field (19, 20).  The full database has 10s of millions of rows and has Countries values that range from 5 to 100:

 

bluneconomic	Countries
-0.695725305	19
-0.695386876	19
-0.695117345	19
-0.694779503	20
-0.693231578	19
-0.692606254	20
-0.692243075	20
-0.691876134	20
-0.691733435	19
-0.691376988	19
-0.69006756	    20
-0.690016635	19
-0.689842044	20
-0.689760414	19
-0.689543252	19
-0.689053336	20
-0.689001225	19
Reeza
Super User
ods select none;
ods output ExtremeValues=want;
proc univariate data=yourData nextrval=10;
class countries;
   var bluneconomic;
run;
ods select all;

Note nextrval is your 'x'

Example 4.3 Identifying Extreme Observations and Extreme Values

https://documentation.sas.com/?docsetId=procstat&docsetTarget=procstat_univariate_examples03.htm&doc...

texasmfp
Lapis Lazuli | Level 10

Thanks Reeza, but I do not want ODS output.  I want a SAS datafile.  Thanks

Reeza
Super User
You're aware that ODS OUTPUT creates a SAS data set? Did you try it and it didn't work?

If so, what is a data file?
ballardw
Super User

So, how do we know what constitutes a "group"? A variable? A combination of variables? Telepathy? Please provide a clearer definition and best is terms of actual variables.

 

Will you state that every group has at least 2000 records? If not you need to provide a description of what you desire for smaller groups as there will be overlaps between upper and lower. And if you ever have fewer than 1000 records in a group then you have yet another problem to describe the desired behavior.

 

For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.

Likely the easier part of this.

 

Here is one approach but does not attempt to include any fix for fewer records involved. This assumes that each level of the variable Sex defines a "group".

proc sort data=sashelp.class out=work.class;
   by sex;
run;

proc summary data=work.class nway;
   class sex ;
   output out=work.sexcount(drop=_type_);
run;

data temp;
  merge work.class
        work.sexcount
  ;
  by sex;
  if first.sex then counter=.;
   counter+1;
  /* picking top/bottom 3 when _freq_ is at least 6*/
  if counter le 3 or (_freq_-counter) le 2 ; 
run;

If you really know that you have at least 2000 records in each "group" that last line would be

if counter le 1000 or (_freq_-counter) le 999 ;

 

texasmfp
Lapis Lazuli | Level 10

I thought I was clear.  The group is the field Countries.  I want the output dataset to be filtered by the field called Countries, using the other field Bluneconomic to determine to highest and lowest values.

 

In the supplied data set, above, Countries has two values 19 and 20.

 

Lets say I want to top and bottom 2 results.  So the output dataset would have 8 rows: the top 2 Bluneconomic values where Countries =19, the top 2 Bluneconomic values where Countries =20,the bottom 2 Bluneconomic values where Countries =19, and the bottom 2 Bluneconomic values where Countries =20.

ballardw
Super User

Your original post does not include the word "Bluneconomics"

I have a large dataset that is too big to open and filter within SAS or SAS Enterprise Guide.

 

My current code spits out the top 25k results, previously sorted by Field 1:

data RESULTS.&top_set;
set top_set (obs=25000);
run;

 

This is repeated after sorting Field 1 in descending order:

data RESULTS.&bottom_set;
set bottom_set (obs=25000);
run;

 

Combined, yields the top/bottom results:

data top_bottom;
set RESULTS.&top_set RESULTS.&bottom_set;
run;

 

What I want to generate is the top 1,000 and bottom 1,000 results, but for each group.

There is another field "Countries", which captures the # of countries used to create the result in FIELD 1.

So, what I really want in the output is the top and bottom results from within each unique Countries value.

So, the top and bottom for Countries =57, plus the the top and bottom for Countries =56, the top and bottom for Countries =55, etc....

For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.

 

Thanks

So you would sort by Country Bluneconomics in the first sort to get increasing values by row, or by Country descending Blunecomnics; to get decreasing values of the economics within each country value.

 

 

texasmfp
Lapis Lazuli | Level 10

Thanks.  Your suggested code works well, even when there are fewer results than the counter setting.  If I only wanted the top results rather than both the top and bottom, how would the code be modded?  Thanks

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1951 views
  • 0 likes
  • 3 in conversation