Solved: need help creating a summary

Elliott · Posted 07-15-2022 07:31 PM

I have data that I need to convert to a summary with counts for rule (the number or records in each rule by store) and also a % of match by the bucket value.

I am do not know how to convert to summary like the example below. I have tried proc tabulate and proc transpose but have not been able to get the desired results.

any assistance will be greatly appreciated.

Thanks, Elliott

I need it to look like this with counts for rule level and number mismatch and total

the on the right the percent of bucket aging match/mismatch

Want:

	Count by matching Rule Level								% of match by aging bucket
Store#	1	2	3	4	5	6	Mis-match	Total	0-5	6-15	16-30	31-60	>60	Mismatch
00070
00144
00145
00163
00175
00251
00301

Data I have looks like this:

Store Rule bktdays match bucket
00004 NOMATCH 0-5
00004 2 3 MATCH 0-5
00011 4 -75 MATCH 0-5
00014 4 -10 MATCH 0-5
00014 1 8 MATCH 6-15
00014 4 -42 MATCH 0-5
00014 4 -45 MATCH 0-5
00014 4 -47 MATCH 0-5
00058 3 0 MATCH 0-5
00070 5 -1 MATCH 0-5
00070 3 0 MATCH 0-5
00070 4 -2 MATCH 0-5
00070 3 0 MATCH 0-5
00070 1 0 MATCH 0-5
00070 4 -9 MATCH 0-5
00070 1 5 MATCH 0-5
00070 4 -14 MATCH 0-5
00070 2 2 MATCH 0-5
00070 4 -18 MATCH 0-5
00070 3 0 MATCH 0-5
00070 4 -21 MATCH 0-5
00070 1 0 MATCH 0-5
00070 4 -25 MATCH 0-5
00070 4 -25 MATCH 0-5
00070 3 3 MATCH 0-5
00070 1 1 MATCH 0-5
00070 4 -28 MATCH 0-5
00070 4 -30 MATCH 0-5
00070 3 2 MATCH 0-5
00070 4 0 MATCH 0-5
00070 4 -42 MATCH 0-5
00070 4 -48 MATCH 0-5
00070 1 0 MATCH 0-5

data have;
infile datalines dlm=',' dsd;
length Store $5 Rule 1. bktdays8. match $10 bucket $5;
input store match bucket $ rule 1. bktdays 8.;
datalines;
00004,,,,NOMATCH,0-5
00004,,2,3,MATCH,0-5
00011,,4,-75,MATCH,0-5
00014,,4,-10,MATCH,0-5
00014,,1,8,MATCH,6-15
00014,,4,82,MATCH,61-90
00014,,4,45,MATCH,31-60
00014,4,47,MATCH,31-60
00058,3,0,MATCH,0-5
00070,5,-1,MATCH,0-5
00070,3,0,MATCH,0-5
00070,4,-2,MATCH,0-5
00070,3,0,MATCH,0-5
00070,1,0,MATCH,0-5
00070,4,-9,MATCH,0-5
00070,1,5,MATCH,0-5
00070,,-14,NOMATCH,0-5
00070,2,2,MATCH,0-5
00070,4,-18,MATCH,0-5
00072,3,0,MATCH,0-5
00072,4,21,MATCH,16-30
00073,1,0,MATCH,0-5
00073,4,25,MATCH,16-30
00073,4,25,MATCH,16-30
00074,3,3,MATCH,0-5
00074,1,1,MATCH,0-5
00074,4,58,MATCH,31-60
00075,4,30,MATCH,16-30
00075,3,2,MATCH,0-5
00075,4,0,MATCH,0-5
00076,4,92,MATCH,>90
00076,4,48,MATCH,31-60
00076,1,0,MATCH,0-5
00077,4,52,MATCH,31-60
;
run;

mkeintz · Posted 07-17-2022 09:28 PM

The below does what I think you want, except for the rightmost column.

Assuming that there is a 1:1 equivalence between RULE being missing and a Mismatch status, then just make rule=99 for mis-match cases. To make rule 99's print out as "mismatch" in your table, just assign a user-define format for RULE.

The match percentage by age bucket can be seen as the average of a variable MATCH_PCT which takes on only two values: 1 (for matches) and 0 (mis-matches). Ask tabulate to report the mean of those 1's and 0's, and apply then PERCENT. format to that mean.

What is the rightmost column? It's listed as part of a column group titled "% of Match by aging Bucket", yet that column is not an age group. And it's not even a Match %, since it is headed "mismatch".

data need / view=need;
  set work.summary_data ;

  /*Put a leading space in front of '0-5' and '6-15' so that
    lexicographic order replicates the intended age group order*/
  if bucket='0-5' then bucket=' 0-5';
  else if bucket='6-15' then bucket=' 6-15';

  if rule=. then do;
    rule=99 ; /*Use for mismatch*/
    match_pct=0;
  end;
  else match_pct=1;

  /*Right justify STORE variable to get expected sort order*/
  store=right(store); 
run ;


proc format ;
  value rulefmt   99='MisMatch'  other=[1.] ;
run;

proc tabulate data =need;
  class rule store bucket ;
  var match_pct;
  table   store
        , rule*n=' '*f=comma6.0
          all='Total'*n=' '*f=comma6.0
          match_pct='% of match by aging bucket'*bucket=' '*mean=' '*f=percent6.4 ;
  format rule rulefmt. ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 07-15-2022 08:00 PM

Please revise your DATA HAVE step into a working piece of code. Among other issues, your datalines do not all have the same number of csv value (i.e. they don't have the same number of commas). There are other errors too that make it a bit ambiguous as to what advice to give you.

Please fix so that it runs.

And then post it via the "running man" icon.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Elliott · Posted 07-16-2022 06:50 AM

here is the corrected data lines:

data have;

infile datalines dlm=',' dsd;

length Store $5 Rule $1 bktdays 8. match $10 bucket $5;

input store rule $ bktdays match $ bucket $ ;

datalines;

00004,,0,NOMATCH,0-5

00004,2,3,MATCH,0-5

00011,4,-75,MATCH,0-5

00014,4,-10,MATCH,0-5

00014,1,8,MATCH,6-15

00014,4,82,MATCH,61-90

00014,4,45,MATCH,31-60

00014,4,47,MATCH,31-60

00058,3,0,MATCH,0-5

00070,5,-1,MATCH,0-5

00070,3,0,MATCH,0-5

00070,4,-2,MATCH,0-5

00070,3,0,MATCH,0-5

00070,1,0,MATCH,0-5

00070,4,-9,MATCH,0-5

00070,1,5,MATCH,0-5

00070,,-14,NOMATCH,0-5

00070,2,2,MATCH,0-5

00070,4,-18,MATCH,0-5

00072,3,0,MATCH,0-5

00072,4,21,MATCH,16-30

00073,1,0,MATCH,0-5

00073,4,25,MATCH,16-30

00073,4,25,MATCH,16-30

00074,3,3,MATCH,0-5

00074,1,1,MATCH,0-5

00074,4,58,MATCH,31-60

00075,4,30,MATCH,16-30

00075,3,2,MATCH,0-5

00075,4,0,MATCH,0-5

00076,4,92,MATCH,>90

00076,4,48,MATCH,31-60

00076,1,0,MATCH,0-5

00077,4,52,MATCH,31-60

;

run;

PaigeMiller · Posted 07-16-2022 06:57 AM

I just tried your code, it does not work. Please test the code and confirm that it works.

--
Paige Miller

Elliott · Posted 07-16-2022 08:02 AM

update to code has already been posted, sorry for the inconvenience

mkeintz · Posted 07-16-2022 07:35 AM

What does "mismatch" mean?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Elliott · Posted 07-16-2022 08:01 AM

the initial data before it was reduced to this data was to match record from 2 different data sets, the match/nomatch indicates if a match was found.

ballardw · Posted 07-16-2022 11:21 AM

You need to provide a bit of information as to exactly what "TOTAL" and "% of match by the bucket value" means. I might guess that Total is a count of records for the store but in case it is something else. But the requested percentage needs a clear definition of what each numerator and denominator used should be especially since you seem to imply that MISMATCH is a category under Bucket when it is a different variable.

I suspect that your data is in for some significant reformatting and/or additional variables to get a report like that.

Elliott · Posted 07-16-2022 06:46 PM

here are some more details that may help, this is an image of the sample the report requestor provided:

Column H: is the number of mismatch or nomatch records for that store number.

Column J: %mismatch =H18/I18

Column I: Total of (both match and mismatched) is the number of records for 1-6 + #of mismatch

Starting at column K:

% of matched by aging bucket? - 7 matched out of 35 (sum of 1-6=7)/total? 7/35=.2 or 20%

In the original sample I provided the format was a little different. but I think they just moved the mismatch% to the end of the buckets.

Thank you all for your help with this

Elliott · Posted 07-17-2022 06:40 PM

I have worked on this all day and am making some progress but still need some help.

data work.summary_data;
  infile datalines dlm=',' dsd truncover;
  length PendingNo $9 CrUnit $5 Store $5 match $10 bucket $5 Rulechar $1 Rule 3 ;
  input PendingNo CrUnit Store match bucket Rulechar $ Rule 1.  ;
  datalines;
253864481,4730,4,MATCH,0-5,2,2
212777089,4730,4,NOMATCH,0-5,,
301172842,4730,11,MATCH,0-5,4,4
281739414,4730,14,MATCH,6-15,1,1
261300392,4730,14,MATCH,0-5,4,4
288475892,4730,14,MATCH,0-5,4,4
313076925,4730,14,MATCH,0-5,4,4
23131926,4730,14,MATCH,0-5,4,4
273107812,4730,58,MATCH,0-5,3,3
226270010,4730,70,MATCH,0-5,1,1
276177164,4730,70,MATCH,0-5,1,1
92644582,4730,70,MATCH,0-5,1,1
137788903,4730,70,MATCH,0-5,1,1
63797089,4730,70,MATCH,0-5,1,1
114514092,4730,70,MATCH,6-15,1,1
159810023,4730,70,MATCH,0-5,1,1
239028597,4730,70,MATCH,6-15,1,1
291638142,4730,70,MATCH,0-5,1,1
306576926,4730,70,MATCH,0-5,2,2
184156646,4730,70,MATCH,0-5,3,3
214132768,4730,70,MATCH,0-5,3,3
51683419,4730,70,MATCH,0-5,3,3
137563134,4730,70,MATCH,0-5,3,3
217461698,4730,70,MATCH,0-5,3,3
109113011,4730,70,MATCH,0-5,3,3
149173652,4730,70,MATCH,0-5,3,3
154079135,4730,70,MATCH,0-5,3,3
189377914,4730,70,MATCH,0-5,3,3
204766847,4730,70,MATCH,0-5,3,3
189066497,4730,70,MATCH,0-5,4,4
251583641,4730,70,MATCH,0-5,4,4
306428966,4730,70,MATCH,0-5,4,4
46074769,4730,70,MATCH,0-5,4,4
72116021,4730,70,MATCH,0-5,4,4
112755683,4730,70,MATCH,0-5,4,4
112963673,4730,70,MATCH,0-5,4,4
147893714,4730,70,MATCH,0-5,4,4
162142576,4730,70,MATCH,0-5,4,4
233991350,4730,70,MATCH,0-5,4,4
283368603,4730,70,MATCH,0-5,4,4
38519097,4730,70,MATCH,0-5,4,4
78059359,4730,70,MATCH,0-5,4,4
104160371,4730,130,MATCH,0-5,4,4
119392631,4730,130,MATCH,0-5,4,4
184920636,4730,130,MATCH,0-5,4,4
81757207,4730,130,MATCH,0-5,4,4
96275766,4730,130,MATCH,0-5,4,4
96633756,4730,130,MATCH,0-5,4,4
214350928,4730,139,MATCH,0-5,1,1
13148566,4730,139,MATCH,0-5,3,3
164847505,4730,139,MATCH,0-5,4,4
184356706,4730,139,MATCH,0-5,4,4
67873620,4730,139,MATCH,0-5,4,4
72534031,4730,139,MATCH,0-5,4,4
102649143,4730,139,MATCH,0-5,4,4
182921667,4730,139,MATCH,0-5,4,4
207836158,4730,139,MATCH,0-5,4,4
238375000,4730,139,MATCH,0-5,4,4
278454232,4730,139,MATCH,0-5,4,4
298747613,4730,139,MATCH,0-5,4,4
251074531,4730,139,NOMATCH,0-5,,
293309254,4730,139,NOMATCH,0-5,,
204006727,4730,139,NOMATCH,0-5,,
106813447,4730,139,NOMATCH,0-5,,
43030128,4730,140,MATCH,6-15,1,1
291264232,4730,140,MATCH,0-5,1,1
296210900,4730,140,MATCH,0-5,2,2
99259380,4730,140,MATCH,16-30,3,3
51328329,4730,143,MATCH,16-30,1,1
;
run;

Proc tabulate data=summary_data;
title  "Rule Analysis";
   class store  /missing;
   class PendingNo rulechar bucket /missing;
   var rule;
   table store, rulechar * bucket
       (all='Total Records' rule=''*(n="# of Matches"))
       /misstext=' '
   ;
run;

This produces part of what I need, I am still needing to know how to create the % of mismatches and also I would like Rules 1 - 6 on the left then total then buckets then %mismatch.

This is what the current output looks like now, but I would like to have it look like what is below
This is what the query creates, but.
This is what I am trying to get to

Elliott · Posted 07-17-2022 06:44 PM

here are images of what I have and below is what I am trying to get to

.current query produceswhere I am trying to get

mkeintz · Posted 07-17-2022 09:28 PM

The below does what I think you want, except for the rightmost column.

Assuming that there is a 1:1 equivalence between RULE being missing and a Mismatch status, then just make rule=99 for mis-match cases. To make rule 99's print out as "mismatch" in your table, just assign a user-define format for RULE.

The match percentage by age bucket can be seen as the average of a variable MATCH_PCT which takes on only two values: 1 (for matches) and 0 (mis-matches). Ask tabulate to report the mean of those 1's and 0's, and apply then PERCENT. format to that mean.

What is the rightmost column? It's listed as part of a column group titled "% of Match by aging Bucket", yet that column is not an age group. And it's not even a Match %, since it is headed "mismatch".

data need / view=need;
  set work.summary_data ;

  /*Put a leading space in front of '0-5' and '6-15' so that
    lexicographic order replicates the intended age group order*/
  if bucket='0-5' then bucket=' 0-5';
  else if bucket='6-15' then bucket=' 6-15';

  if rule=. then do;
    rule=99 ; /*Use for mismatch*/
    match_pct=0;
  end;
  else match_pct=1;

  /*Right justify STORE variable to get expected sort order*/
  store=right(store); 
run ;


proc format ;
  value rulefmt   99='MisMatch'  other=[1.] ;
run;

proc tabulate data =need;
  class rule store bucket ;
  var match_pct;
  table   store
        , rule*n=' '*f=comma6.0
          all='Total'*n=' '*f=comma6.0
          match_pct='% of match by aging bucket'*bucket=' '*mean=' '*f=percent6.4 ;
  format rule rulefmt. ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Elliott · Posted 07-18-2022 07:15 AM

Thank you this is wonderful, as for the last column it is supposed to be the % of mismatches. I need to add that.

Also, how would the code change if they actually wanted a count on buckets instead of %match?

mkeintz · Posted 07-18-2022 10:48 AM

@Elliott wrote:

Thank you this is wonderful, as for the last column it is supposed to be the % of mismatches. I need to add that.

Then why is it included as a subheading under the group heading titled "% of match by aging bucket"?

If that is just a layout error, then in the DATA NEED step, make a new dummy variable "mismatch_dummy", with a one for mismatch, and zero otherwise. In proc tabulate, add the mismatch_dummy as an analysis variable, and ask for its mean with a PERCENT display.

Also, how would the code change if they actually wanted a count on buckets instead of %match?

Since the %match values are nothing more than the mean of a bunch of 1's (match) and 0's for the analysis variable (not class variable) match_dummy (which I called match_pct in my earlier note), then to get a "count" of matches instead, replace the code requesting a mean to a request for a sum. The sum of all those 1's will be the same as a count of those 1's. Don't forget to change the associated format request.

This is a common "trick" to use with proc tabulate. If match_dummy had been declared a class variable - which might have been the intuitive instinct, then proc tabulate would generate two columns (or rows) for it - one for the 0's and one for the 1's. But you want only one column. Using match_dummy as an analysis variable, and coding it only as 1's and 0's allows you to replicate a count (which is the normal use of a class variable) with a sum.

Of course, the same was done with the mismatch_dummy variable, except you want a mean instead of a sum.

To make the code clearer, take away all the quoted elements, and maybe the formats - then you'll see the operational code.

data work.summary_data;
  infile datalines dlm=',' dsd truncover;
  length PendingNo $9 CrUnit $5 Store $5 match $10 bucket $5 Rulechar $1 Rule 3 ;
  input PendingNo CrUnit Store match bucket Rulechar $ Rule 1.  ;
datalines;
253864481,4730,4,MATCH,0-5,2,2
212777089,4730,4,NOMATCH,0-5,,
301172842,4730,11,MATCH,0-5,4,4
281739414,4730,14,MATCH,6-15,1,1
261300392,4730,14,MATCH,0-5,4,4
288475892,4730,14,MATCH,0-5,4,4
313076925,4730,14,MATCH,0-5,4,4
23131926,4730,14,MATCH,0-5,4,4
273107812,4730,58,MATCH,0-5,3,3
226270010,4730,70,MATCH,0-5,1,1
276177164,4730,70,MATCH,0-5,1,1
92644582,4730,70,MATCH,0-5,1,1
137788903,4730,70,MATCH,0-5,1,1
63797089,4730,70,MATCH,0-5,1,1
114514092,4730,70,MATCH,6-15,1,1
159810023,4730,70,MATCH,0-5,1,1
239028597,4730,70,MATCH,6-15,1,1
291638142,4730,70,MATCH,0-5,1,1
306576926,4730,70,MATCH,0-5,2,2
184156646,4730,70,MATCH,0-5,3,3
214132768,4730,70,MATCH,0-5,3,3
51683419,4730,70,MATCH,0-5,3,3
137563134,4730,70,MATCH,0-5,3,3
217461698,4730,70,MATCH,0-5,3,3
109113011,4730,70,MATCH,0-5,3,3
149173652,4730,70,MATCH,0-5,3,3
154079135,4730,70,MATCH,0-5,3,3
189377914,4730,70,MATCH,0-5,3,3
204766847,4730,70,MATCH,0-5,3,3
189066497,4730,70,MATCH,0-5,4,4
251583641,4730,70,MATCH,0-5,4,4
306428966,4730,70,MATCH,0-5,4,4
46074769,4730,70,MATCH,0-5,4,4
72116021,4730,70,MATCH,0-5,4,4
112755683,4730,70,MATCH,0-5,4,4
112963673,4730,70,MATCH,0-5,4,4
147893714,4730,70,MATCH,0-5,4,4
162142576,4730,70,MATCH,0-5,4,4
233991350,4730,70,MATCH,0-5,4,4
283368603,4730,70,MATCH,0-5,4,4
38519097,4730,70,MATCH,0-5,4,4
78059359,4730,70,MATCH,0-5,4,4
104160371,4730,130,MATCH,0-5,4,4
119392631,4730,130,MATCH,0-5,4,4
184920636,4730,130,MATCH,0-5,4,4
81757207,4730,130,MATCH,0-5,4,4
96275766,4730,130,MATCH,0-5,4,4
96633756,4730,130,MATCH,0-5,4,4
214350928,4730,139,MATCH,0-5,1,1
13148566,4730,139,MATCH,0-5,3,3
164847505,4730,139,MATCH,0-5,4,4
184356706,4730,139,MATCH,0-5,4,4
67873620,4730,139,MATCH,0-5,4,4
72534031,4730,139,MATCH,0-5,4,4
102649143,4730,139,MATCH,0-5,4,4
182921667,4730,139,MATCH,0-5,4,4
207836158,4730,139,MATCH,0-5,4,4
238375000,4730,139,MATCH,0-5,4,4
278454232,4730,139,MATCH,0-5,4,4
298747613,4730,139,MATCH,0-5,4,4
251074531,4730,139,NOMATCH,0-5,,
293309254,4730,139,NOMATCH,0-5,,
204006727,4730,139,NOMATCH,0-5,,
106813447,4730,139,NOMATCH,0-5,,
43030128,4730,140,MATCH,6-15,1,1
291264232,4730,140,MATCH,0-5,1,1
296210900,4730,140,MATCH,0-5,2,2
99259380,4730,140,MATCH,16-30,3,3
51328329,4730,143,MATCH,16-30,1,1
run;


data need / view=need;
  set work.summary_data ;

  /*Put a leading space in front of '0-5' and '6-15' so that
    lexicographic order replicates the intended age group order*/
  if bucket='0-5' then bucket=' 0-5';
  else if bucket='6-15' then bucket=' 6-15';

  if rule=. then do;
    rule=99 ; /*Use for mismatch*/
    match_dummy=0;
  end;
  else match_dummy=1;

  mismatch_dummy=1-match_dummy;

  /*Right justify STORE variable to get expected sort order*/
  store=right(store); 
run ;


proc format ;
  value rulefmt   99='MisMatch'  other=[1.] ;
run;

proc tabulate data =need;
  class rule store bucket ;
  var match_dummy mismatch_dummy;
  table   store
        , rule*n=' '*f=comma6.0
          all='Total'*N=''*f=comma6.0
          match_dummy='% of Match by Aging Bucket'*bucket=' '*sum=' '*f=comma6.0
          mismatch_dummy='MisMatch %'*mean=' '*f=percent5.;
  format rule rulefmt. ;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

need help creating a summary

Re: need help creating a summary - Updated information

Re: need help creating a summary

Re: need help creating a summary

Re: need help creating a summary

Re: need help creating a summary

Re: need help creating a summary

Re: need help creating a summary

Re: need help creating a summary

Re: need help creating a summary

need help creating a summary - Updated information

Re: need help creating a summary - Updated information

Re: need help creating a summary - Updated information

Re: need help creating a summary - Updated information

Re: need help creating a summary - Updated information

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!