proc tabulate or other procedure to sort by analysis variable

Reply
Occasional Contributor
Posts: 14

proc tabulate or other procedure to sort by analysis variable

I have a dataset containing two categorical variables (C1, C2) and one analysis variable (A). C1 has about 60 categories, and C2 has about 50 categories. I want to get count and sum of A for each combination of C1 and C2. This is an easy job for proc tabulate. However since I have so many combinations, I want to focus on top 10 of C1 and, within each C1, top 10 of C2, based on the count of A. Other C1 and C2 will be group as 'other'. For example, outside top 10 C1, all other C1 are group as 'other'. For each of top 10 C1 or 'other', top 10 C2 will be kept. other C2 will be grouped together as 'other'.

For a specific C1, if the number of C2 is less than 10, then all C2 will be kept for that group.

I guess this could be done with proc tabulate or some procedure, but I don't use these kind of procedure a lot so I need some suggestion/guidance here. If an macro can be added to control top n (n=5,10, or other number), that would be great.

SAS Super FREQ
Posts: 8,866

Re: proc tabulate or other procedure to sort by analysis variable

Hi,

  PROC TABULATE will create rows for all the data that you pass to it. If you need to select off the top 10 or the top 5 for each group by frequency count, you may have to "pre-process" the data or make a pass with PROC SQL before TABULATE to select only the rows you want. You can always create a user-defined format to make categories for the smaller groups by using a dataset from the above pre-processing or SQL query to make the formats on the fly.

  You cannot ask TABULATE to summarize and then limit the report based on the summarized values.

cynthia

Super Contributor
Posts: 644

Re: proc tabulate or other procedure to sort by analysis variable

Your requirements involve first preselecting the best values for C1, then reformatting the data and summing again, and then reformatting the data again to select the best C2 values and summing the rest.  I use Proc means because it is easier to code.  It is not clear from your requirements whether you want a sum of A values or a count of A values:  if the latter replace sum = with n =

If your C1 and C2 variables already have length > $ 8 then omit the Length statements.

Note: this is untested code.  The descending key word may be in the wrong position!

%Let SelectC1 = 10 ;

%Let SelectC2 = 10 ;

/* Preselect the best values of C1 */

Proc means

  data = have ;

  Class C1 ;

  Var A ;

  Output out = CountC1

  sum = C1Tot

  ;

Run ;

/* Descending sort, best values to the top of the table */

Proc Sort

  Data = CountC1 (Keep = C1 C1Tot) ;

  By Descending C1Tot ;

Run ;

/* Create a format control table with the best + 1 rows */

Data TopC1 ;

  Length c1 $ 8 ;

  Set CountC1 (Obs = %SUM(&SelectC1 + 1)) ;

  By Descending C1Tot ;

  Retain Fmtname 'BestC1'

  Type 'I'

  ;

  Rename C1 = Start

  C1Tot = Label

  ;

  If _N_ > &SelectC1

  then

  do ;

  C1 = 'Other' ;

  C1Tot = . ;

  end ;

Run ;

/* Create an informat which will translate values of C1 into the total for C1 */

Proc Format

  CntlIn = BestC1  ;

Quit ;

/* Update the raw data with C1Tot (missing if C1 is not in the best set) and set C1 to other if it misses the cut */

Data have_now ;

  Length c1 $ 8 ;

  Set have ;

  C1Tot = Input (C1, BestC1.) ;

  If C1Tot = .

  then C1 = 'Other' ;

Run ;

/* Summarise again over C1 and C2 */

Proc means missing

  data = have_now ;

  Class C1 C2 C1Tot ;

  Var A ;

  Output out = CountC1C2

  sum = C1C2Tot

  ;

Run ;

/* Sort by descending values to get the best ones to the top */

Proc Sort

  Data = CountC1C2 (Keep = C1 C2 C1Tot C1C2Tot) ;

  By Descending C1Tot C1C2Tot ;

Run ;

/* Finally aggregate sums for C1 or C2 out of the best sets */

Data want ;

  Length C1 C2 $ 8 ;

  Set CountC1C2 ;

  By Descending C1Tot C1C2Tot ;

  Retain check sum ;

  If first.C1

  then

  do ;

  sum = 0 ;

  check = 0 ;

  end ;

  check + 1 ;

  If C1 = 'Other'

  or check > &SelectC2

  then

  do ;

  C2 = 'Other' ;

  sum + C1C2Tot ;

  If last.C1

  then

  do ;

  C1C2Tot = sum ;

  output ;

  end ;

  end ;

  else output ;

  Drop check sum ;

Run ;

Richard

Occasional Contributor
Posts: 14

Re: proc tabulate or other procedure to sort by analysis variable

Folks, thanks a lot!

Super Contributor
Posts: 644

Re: proc tabulate or other procedure to sort by analysis variable

A much simpler way to achieve what I think you want is to do the summary, then sort descending on the analysis variable.  You can then, in a following datastep, limit the output to the first 100 rows; or until you reach 90% of the total.

Ask a Question
Discussion stats
  • 4 replies
  • 648 views
  • 6 likes
  • 3 in conversation