How to count/add occurrences of variable

deleted_user · Posted 07-06-2010 09:22 PM

Does anyone have an idea how I would add up the number of occurrences of a variable and put them into a group? Here's a sample of the dataset I'm working with:

DATE CASENUMBER CODE STATUS MEM NAM TIM

100629 CASE1665 CPL AVER ABC AAA 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.3
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1661 RLT AVER GHI CCC 0.2
100629 CASE1661 RLT AVER GHI AAA 0.2
100629 CASE1661 RLT AVER GHI DDD 0.2
100629 CASE1666 RLT ASOL JKL AAA 2.0
100629 CASE1666 RLT ASOL JKL EEE 1.0
100629 CASE1666 RLT ASOL JKL EEE 5.0
100630 CASE1667 RLT AANS MNL AAA 0.5
100630 CASE1667 RLT AANS MNL AAA 0.2
100630 CASE1667 RLT AANS MNL CCC 0.2
100630 CASE1668 RLT ASOL DE1 BBB 1.0
100630 CASE1668 RLT ASOL DE1 AAA 0.3
100630 CASE1668 RLT ASOL DE1 AAA 0.2
100630 CASE1669 RLT ADOC XEL AAA 0.5

What I want is a sum of unique occurrence of the code field based on CASENUMBER. I've tried PROC FREQ as well as counters and using nested if statements and LAST.CODE but I always end up with an addition of all occurrences. For example using counters or PROC FREQ i end up with RLT = 13 but what i'm looking for is RLT = 4 because there are only 4 unique cases. Any suggestions?

Cynthia_sas · Posted 07-06-2010 09:58 PM

Hi:
Just looking at your data, I wonder why RLT is not equal to 5. I see that RLT occurs in the following cases: 1661, 1666, 1667, 1668 and 1669. I can see how PROC FREQ would come up with RLT=13, because the code, RLT, occurs a total of 13 times in the entire file. Is there another criteria other than CASENUMBER and CODE that determines what consititutes a "unique" case??

cynthia

Patrick · Posted 07-07-2010 02:48 AM

Having made a lot of assumptions below may be what you want:

proc sql;
create view V_DCS as
select Distinct CODE, STATUS
from have
;
quit;

options missing=' ';
proc tabulate data=V_DCS noseps format=8.;
class Code Status;
keylabel n=' ' all='Total';
table Code ,Status all;
run;

HTH
Patrick

Ksharp · Posted 07-07-2010 08:36 AM

Hi.Patrick
there is only one sql statement can do this.
Assuming the last obs is 1668.
So code:

[pre]
data temp;
input (DATE CASENUMBER CODE STATUS MEM NAM TIM) ($);

datalines;
100629 CASE1665 CPL AVER ABC AAA 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.3
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1661 RLT AVER GHI CCC 0.2
100629 CASE1661 RLT AVER GHI AAA 0.2
100629 CASE1661 RLT AVER GHI DDD 0.2
100629 CASE1666 RLT ASOL JKL AAA 2.0
100629 CASE1666 RLT ASOL JKL EEE 1.0
100629 CASE1666 RLT ASOL JKL EEE 5.0
100630 CASE1667 RLT AANS MNL AAA 0.5
100630 CASE1667 RLT AANS MNL AAA 0.2
100630 CASE1667 RLT AANS MNL CCC 0.2
100630 CASE1668 RLT ASOL DE1 BBB 1.0
100630 CASE1668 RLT ASOL DE1 AAA 0.3
100630 CASE1668 RLT ASOL DE1 AAA 0.2
100630 CASE1668 RLT ADOC XEL AAA 0.5
;
run;
proc sql feedback;
select code,count(distinct CASENUMBER) label='Freq'
from temp
group by CODE
;
quit;
[/pre]

Ksharp

Patrick · Posted 07-08-2010 10:19 PM

Hi Ksharp
Thanks!
Cheers
Patrick

deleted_user · Posted 07-07-2010 11:16 AM

Hi Cynthia - I'm sorry yes you're right RLT should be 5, DYS should be 1 and CPL should 1. The only unique identifier is the CASENUMBER. I was hoping to be able to use some kind of counter and FIRST.PRODUCT and LAST.PRODUCT with some IF statements but I'm not sure if that's do-able.

Hi Ksharp - That's great that's actually the results that I'm looking for. I've never used proc sql before though and was wondering if there's a way to group together codes into 2 'buckets'? For example CPL and RLT into bucket 1 and DYS into bucket 2 so I would end up with bucket 1 = 6 and bucket 2 = 1

Also is there a reason that you said 'Assuming the last obs is 1668.' ? Would the last obs "100630 CASE1669 RLT ADOC XEL AAA 0.5" cause an issue?

Thanks everyone,
japena

Cynthia_sas · Posted 07-07-2010 05:37 PM

Hi:
It is do-able with FIRST. processing...but not exactly the way you think. Conceptually, it would go like this:
1) create a dataset with the unique combos of CaseNumber and Code (use either DATA step program or PROC SQL -- essentially get just one obs for each CaseNumber/Code combo)
2) pass this newly created dataset to PROC FREQ to get your counts (or if you ONLY want counts, then use a PROC SQL step to just select distnct)

To answer your question about how to set buckets, I would do this in a user-defined format (although you could do this in the DATA step or PROC SQL, too).

I show these techniques in the program below.

cynthia
[pre]

** read the data;
data cases;
infile datalines;
input DATE :yymmdd6. CASENUMBER $ CODE $ STATUS $ MEM $ NAM $ TIM;
return;
datalines;
100629 CASE1665 CPL AVER ABC AAA 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.3
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1663 DYS ADOC DEF BBB 0.2
100629 CASE1661 RLT AVER GHI CCC 0.2
100629 CASE1661 RLT AVER GHI AAA 0.2
100629 CASE1661 RLT AVER GHI DDD 0.2
100629 CASE1666 RLT ASOL JKL AAA 2.0
100629 CASE1666 RLT ASOL JKL EEE 1.0
100629 CASE1666 RLT ASOL JKL EEE 5.0
100630 CASE1667 RLT AANS MNL AAA 0.5
100630 CASE1667 RLT AANS MNL AAA 0.2
100630 CASE1667 RLT AANS MNL CCC 0.2
100630 CASE1668 RLT ASOL DE1 BBB 1.0
100630 CASE1668 RLT ASOL DE1 AAA 0.3
100630 CASE1668 RLT ASOL DE1 AAA 0.2
100630 CASE1669 RLT ADOC XEL AAA 0.5
;
run;

proc sort data=cases;
by casenumber code;
run;

** create output dataset with unique occurrences of casenumber and code;
data countcode(keep=casenumber code);
set cases;
by casenumber code;
if first.code then output;
run;

/*
** Make dataset with PROC SQL instead of FIRST. logic;
ods listing;
proc sql;
create table countcode2 as
select distinct casenumber, code
from cases
order by CODE ;
quit;

proc print data=countcode2;
title 'This is the same data as in CountCode dataset';
ods listing close;
*/

ods listing close;
ods html file='c:\temp\countcode.html' style=sasweb;
** show what is in dataset created above;
proc print data=countcode;
title 'What is in Dataset CountCode -- Unique combos of CaseNumber and Code';
run;

** now use proc freq on dataset created above;
proc freq data=countcode;
title 'PROC FREQ on unique values of CaseNumber and Code';
tables code;
run;

** Create Buckets Label using PROC FORMAT;
proc format;
value $bckt 'CPL', 'RLT' = 'Bucket 1'
'DYS' = 'Bucket 2'
other = 'Bucket Unknown';
run;

** Use $bckt format with PROC FREQ;
proc freq data=countcode;
title 'PROC FREQ with $BCKT format';
tables code;
format code $bckt.;
run;
ods html close;

[/pre]

deleted_user · Posted 07-07-2010 09:49 PM

Cynthia,

Thanks a lot exactly what I was looking for. I'll have to tweak it and play around it with it a little bit but this should get me a good start. I'm assuming that I can obtain the same output with listing instead of html?

On a sort of related question regarding counters and LAST. I'm trying to get a variable with the total sum of TIM, I used a counter with an IF LAST THEN variable equal to tmp but it's not working so well for me. Something like this:

DATA TIME;
SET TTIME;
RETAIN TMP 0;
TOTALTIME = 0;
IF NOT MISSING(TIM) THEN TMP = TMP + TIM;
IF LAST.CASENUMBER THEN TOTALTIME=TMP;

PROC PRINT NOOBS;
TITLE 'TOTAL TIME';
VAR TOTALTIME;

All I get is a bunch of "." vertically which represent each of the observations, if i print out the counter which is TMP in this case I can see the time been added up correctly except I'm only interested in the final value and printing the final value. Is there a more straight forward way to do this?

Cynthia_sas · Posted 07-07-2010 10:28 PM

Hi:
Well, without a BY statement, your logic will never work correctly. But if you ONLY want the final value (or grand total of TIM) to show, then why not use one of the SAS reporting procedures instead of using a DATA step program. There's no need to add everything up yourself if all you want is a grand total or even a breakdown by CASENUMBER and CODE.

However, my tendency would be to use either PROC MEANS, PROC REPORT or PROC TABULATE to get the total of the TIM variable. By default, missing values are ignored in the calculation of statistics, so the test for NOT MISSING is already taken care of when you use SAS procedures to generate statistics, such as summing.

In the code below, I'm showing the PROC REPORT and PROC TABULATE and PROC MEANS methods. All 3 procedures will create and output dataset. PROC MEANS uses the OUTPUT statement, while PROC REPORT and PROC TABULATE both use the the OUT= option on the PROC statement. The reason I prefer PROC REPORT and PROC TABULATE over PROC MEANS is that I can get subtotals and grand totals in the same report with either REPORT and TABULATE and with PROC MEANS, a bit more processing would be required to get the CASENUMBER or CASENUMBER/CODE breakdown along with the GRAND TOTAL.

There is a lot of good documentation on all 3 of these procedures. I hope that the examples below will get you started.

cynthia

[pre]
** Uses CASES data, as shown in my previous posting;
ods listing;

proc report data=cases nowd;
title '1a) PROC REPORT get grand total only';
column tim;
define tim / sum "Time";
run;

proc report data=cases nowd;
title '1b) PROC REPORT get case breakdown';
column casenumber tim;
define casenumber / group "Case";
define tim / sum "Time";
rbreak after / summarize;
run;

proc report data=cases nowd;
title '1c) PROC REPORT get case and code';
column casenumber code tim;
define casenumber / group "Case";
define code / group "Code";
define tim / sum "Time";
rbreak after / summarize;
run;

proc report data=cases nowd;
title '1d) PROC REPORT get code breakdown only';
column code tim;
define code / group "Code";
define tim / sum "Time";
rbreak after / summarize;
run;

proc tabulate data=cases;
title '2a) PROC TABULATE get grand total only';
var tim;
table tim*sum;
label tim="Time";
run;

proc tabulate data=cases;
title '2b) PROC TABULATE get casenumber breakdown';
class casenumber;
var tim;
table casenumber all,
tim*sum;
label tim="Time"
casenumber="Case";
run;

proc tabulate data=cases;
title '2c) PROC TABULATE get casenumber and code breakdown';
class casenumber code;
var tim;
table casenumber*code all,
tim*sum;
label tim="Time"
casenumber="Case"
code="Code";
run;

proc tabulate data=cases;
title '2d) PROC TABULATE get code breakdown only';
class code;
var tim;
table code all,
tim*sum;
label tim="Time"
code="Code";
run;

proc means data=cases sum;
title '3a) Grand Total of TIM only';
var tim;
run;

proc means data=cases sum;
title '3b) PROC MEANS get casenumber breakdown';
class casenumber;
var tim;
run;

proc means data=cases sum;
title '3c) PROC MEANS get casenumber and code breakdown';
class casenumber code;
var tim;
run;

proc means data=cases sum;
title '3d) PROC MEANS get code breakdown';
class code;
var tim;
run;

[/pre]

deleted_user · Posted 07-09-2010 04:16 PM

Cynthia,

This should get me going thank you so much!

How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Re: How to count/add occurrences of variable

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!