Solved: Re: Help with grouped results

texasmfp · Posted 07-15-2020 03:57 PM

I have a large dataset that is too big to open and filter within SAS or SAS Enterprise Guide.

My current code spits out the top 25k results, previously sorted by Field 1:

data RESULTS.&top_set;
set top_set (obs=25000);
run;

This is repeated after sorting Field 1 in descending order:

data RESULTS.&bottom_set;
set bottom_set (obs=25000);
run;

Combined, yields the top/bottom results:

data top_bottom;
set RESULTS.&top_set RESULTS.&bottom_set;
run;

What I want to generate is the top 1,000 and bottom 1,000 results, but for each group.

There is another field "Countries", which captures the # of countries used to create the result in FIELD 1.

So, what I really want in the output is the top and bottom results from within each unique Countries value.

So, the top and bottom for Countries =57, plus the the top and bottom for Countries =56, the top and bottom for Countries =55, etc....

For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.

Thanks

ballardw · Posted 07-15-2020 04:32 PM

So, how do we know what constitutes a "group"? A variable? A combination of variables? Telepathy? Please provide a clearer definition and best is terms of actual variables.

Will you state that every group has at least 2000 records? If not you need to provide a description of what you desire for smaller groups as there will be overlaps between upper and lower. And if you ever have fewer than 1000 records in a group then you have yet another problem to describe the desired behavior.

For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.

Likely the easier part of this.

Here is one approach but does not attempt to include any fix for fewer records involved. This assumes that each level of the variable Sex defines a "group".

proc sort data=sashelp.class out=work.class;
   by sex;
run;

proc summary data=work.class nway;
   class sex ;
   output out=work.sexcount(drop=_type_);
run;

data temp;
  merge work.class
        work.sexcount
  ;
  by sex;
  if first.sex then counter=.;
   counter+1;
  /* picking top/bottom 3 when _freq_ is at least 6*/
  if counter le 3 or (_freq_-counter) le 2 ; 
run;

If you really know that you have at least 2000 records in each "group" that last line would be

if counter le 1000 or (_freq_-counter) le 999 ;

View solution in original post

Reeza · Posted 07-15-2020 04:00 PM

How big are the data sets?

Can you show an example of your input and expected output?
Or can you illustrate what you want as output using sashelp.heart or sashelp.cars as your input data set.

texasmfp · Posted 07-15-2020 04:11 PM

The input and output are SAS datasets and have identical structure with just two fields. What I want is results that have been filtered for the top x and bottom x from within each Countries value. Here is an example of the data. In this limited dataset, there are only 2 values in the Countries field (19, 20). The full database has 10s of millions of rows and has Countries values that range from 5 to 100:

bluneconomic	Countries
-0.695725305	19
-0.695386876	19
-0.695117345	19
-0.694779503	20
-0.693231578	19
-0.692606254	20
-0.692243075	20
-0.691876134	20
-0.691733435	19
-0.691376988	19
-0.69006756	    20
-0.690016635	19
-0.689842044	20
-0.689760414	19
-0.689543252	19
-0.689053336	20
-0.689001225	19

Reeza · Posted 07-15-2020 04:17 PM

ods select none;
ods output ExtremeValues=want;
proc univariate data=yourData nextrval=10;
class countries;
   var bluneconomic;
run;
ods select all;

Note nextrval is your 'x'

Example 4.3 Identifying Extreme Observations and Extreme Values

https://documentation.sas.com/?docsetId=procstat&docsetTarget=procstat_univariate_examples03.htm&doc...

texasmfp · Posted 07-15-2020 04:22 PM

Thanks Reeza, but I do not want ODS output. I want a SAS datafile. Thanks

Reeza · Posted 07-15-2020 04:58 PM

You're aware that ODS OUTPUT creates a SAS data set? Did you try it and it didn't work?

If so, what is a data file?

ballardw · Posted 07-15-2020 04:32 PM

So, how do we know what constitutes a "group"? A variable? A combination of variables? Telepathy? Please provide a clearer definition and best is terms of actual variables.

Will you state that every group has at least 2000 records? If not you need to provide a description of what you desire for smaller groups as there will be overlaps between upper and lower. And if you ever have fewer than 1000 records in a group then you have yet another problem to describe the desired behavior.

For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.

Likely the easier part of this.

Here is one approach but does not attempt to include any fix for fewer records involved. This assumes that each level of the variable Sex defines a "group".

proc sort data=sashelp.class out=work.class;
   by sex;
run;

proc summary data=work.class nway;
   class sex ;
   output out=work.sexcount(drop=_type_);
run;

data temp;
  merge work.class
        work.sexcount
  ;
  by sex;
  if first.sex then counter=.;
   counter+1;
  /* picking top/bottom 3 when _freq_ is at least 6*/
  if counter le 3 or (_freq_-counter) le 2 ; 
run;

If you really know that you have at least 2000 records in each "group" that last line would be

if counter le 1000 or (_freq_-counter) le 999 ;

texasmfp · Posted 07-15-2020 04:41 PM

I thought I was clear. The group is the field Countries. I want the output dataset to be filtered by the field called Countries, using the other field Bluneconomic to determine to highest and lowest values.

In the supplied data set, above, Countries has two values 19 and 20.

Lets say I want to top and bottom 2 results. So the output dataset would have 8 rows: the top 2 Bluneconomic values where Countries =19, the top 2 Bluneconomic values where Countries =20,the bottom 2 Bluneconomic values where Countries =19, and the bottom 2 Bluneconomic values where Countries =20.

ballardw · Posted 07-15-2020 08:29 PM

Your original post does not include the word "Bluneconomics"

I have a large dataset that is too big to open and filter within SAS or SAS Enterprise Guide.

My current code spits out the top 25k results, previously sorted by Field 1:

data RESULTS.&top_set;
set top_set (obs=25000);
run;

This is repeated after sorting Field 1 in descending order:

data RESULTS.&bottom_set;
set bottom_set (obs=25000);
run;

Combined, yields the top/bottom results:

data top_bottom;
set RESULTS.&top_set RESULTS.&bottom_set;
run;

What I want to generate is the top 1,000 and bottom 1,000 results, but for each group.

There is another field "Countries", which captures the # of countries used to create the result in FIELD 1.

So, what I really want in the output is the top and bottom results from within each unique Countries value.

So, the top and bottom for Countries =57, plus the the top and bottom for Countries =56, the top and bottom for Countries =55, etc....

For each generation of runs, the range of Countries varies. So, it needs to be scalable rather than hard code for a specific list of Countries =values.

Thanks

So you would sort by Country Bluneconomics in the first sort to get increasing values by row, or by Country descending Blunecomnics; to get decreasing values of the economics within each country value.

texasmfp · Posted 07-15-2020 09:35 PM

Thanks. Your suggested code works well, even when there are fewer results than the counter setting. If I only wanted the top results rather than both the top and bottom, how would the code be modded? Thanks

Note nextrval is your 'x'

Example 4.3 Identifying Extreme Observations and Extreme Values

SAS Training: Just a Click Away