Solved: Re: Condensing observations by variables and number of events

mcalde33 · Posted 10-22-2022 12:56 PM

I need to get my data to turn out like table B (below) , my code so far used is this:

OPTION PS= 58 LS= 72 NODATE NONUMBER;

 DATA Events;
      INPUT EmpID 1-4  @6 EV1 $ @10 EV2 $ @14 EV3 $ EV4 $ 18-20 EV5 $22-24 EV6 $;
      
DATALINES;
1001 AB	 AT	 BTR S   XYZ AT
1001 AT	 MY	 LOV .   LOV LOV
1001 .   MY	 .   S   MY	 S
1001 AB	 BTR LOV BTR AT	 AT
1001 AT	 LOV KUL MY	 .   S
1002 MY	 LOV .   AT	 AT	 BTR
1002 MY	 AT	 BTR .   AT	 MY
1003 BTR MY	 .   XYZ MY	 MY
1003 .   MY	 MY	 MY	 MY	 BTR
1003 BTR AB	 MY	 .   BTR .
1004 S   .   MY	 MY	 .   XYZ
1004 XYZ AT	 KUL AB	 KUL S
1004 .   .   BTR LOV MY	 MY
1004 MY	 KUL AT	 KUL XYZ AB
;
RUN;




PROC PRINT data=events NOOBS;
     Title 'TableB';
RUN;

Table A

Emp Id	Ev1	Ev2	Ev3	Ev4	Ev5	Ev6
1001	AB	AT	BTR	S	XYZ	AT
1001	AT	MY	LOV	.	LOV	LOV
1001	.	MY	.	S	MY	S
1001	AB	BTR	LOV	BTR	AT	AT
1001	AT	LOV	KUL	MY	.	S
1002	MY	LOV	.	AT	AT	BTR
1002	MY	AT	BTR	.	AT	MY
1003	BTR	MY	.	XYZ	MY	MY
1003	.	MY	MY	MY	MY	BTR
1003	BTR	AB	MY	.	BTR	.
1004	S	.	MY	MY	.	XYZ
1004	XYZ	AT	KUL	AB	KUL	S
1004	.	.	BTR	LOV	MY	MY
1004	MY	KUL	AT	KUL	XYZ	AB

Table B

Ev	N
AB	3
AT	3
BTR	4
S	2
XYZ	3
MY	4
LOV	3
KUL	2

Tom · Posted 10-22-2022 04:25 PM

The code you posted to re-create your dataset probably will not work. That is because it seems to have tab characters randomly inserted into it. (Yet another reason why you should not be putting tab characters into program files.).

Also in your dataset there are multiple observations for the same EMPID without any variable to distinguish them.

That will make it more difficult to work with the file (and makes one wonder what the multiple observations are supposed to represent).

That is easy enough to fix by adding a counter. Either one overall counter or on that resets to one when you start a new EMPID.

DATA Events;
  length EmpID 8 EV1-EV6 $3 ;
  input EmpID EV1-EV6 ;
  row+1;
DATALINES;
1001 AB  AT  BTR S   XYZ AT
1001 AT  MY  LOV .   LOV LOV
1001 .   MY  .   S   MY  S
1001 AB  BTR LOV BTR AT  AT
1001 AT  LOV KUL MY  .   S
1002 MY  LOV .   AT  AT  BTR
1002 MY  AT  BTR .   AT  MY
1003 BTR MY  .   XYZ MY  MY
1003 .   MY  MY  MY  MY  BTR
1003 BTR AB  MY  .   BTR .
1004 S   .   MY  MY  .   XYZ
1004 XYZ AT  KUL AB  KUL S
1004 .   .   BTR LOV MY  MY
1004 MY  KUL AT  KUL XYZ AB
;

Now you can use PROC TRANSPOSE to convert to a more useful structure.

proc transpose data=events out=tall(rename=(col1=event));
  by row empid;
  var ev1-ev6 ;
run;

Where it is easy enough to count how many times a particular event occurs for each EMPID.

proc freq data=tall ;
  tables event*empid / noprint out=counts ;
run;

Which can then be used to get a count of how many EMPID values there are per event.

proc freq data=counts ;
  tables event ;
run;

Result

View solution in original post

PeterClemmensen · Posted 10-22-2022 01:40 PM

I think this is what you want to do

data temp(drop = EV1 - EV6);
   set Events;
   array e EV:;
   do over e;
      ev = e;
      ouptut;
   end;
run;

proc sql;
   create table want as
   select EV
        , count(distinct EmpID) as N
   from temp
   where ev ne ''
   group by EV
   ;
quit;

The DATA to DATA Step Macro
Blog: SASnrd

PeterClemmensen · Posted 10-22-2022 01:57 PM

Also, you can do this in a single pass with the hash object.

data _null_;
   dcl hash h (ordered: "A");
   h.definekey ("ev");
   h.definedata ("ev", "N");
   h.definedone ();
 
   dcl hash u ();
   u.definekey ("EmpID", "ev");
   u.definedone ();
 
   do until (z);
      set Events end = z;
      array e EV:;

      do over e;
         ev = e;

         if h.find() ne 0 then call missing (N);
         if u.check() ne 0 then do;
            N = sum (N, 1);
            u.add();
         end;
         h.replace();
      end;
   end;
 
   h.output (dataset: "count_distinct");
   stop;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Reeza · Posted 10-22-2022 04:06 PM

OPTION PS= 58 LS= 72;

Those options are for the listing destination. If you're using SAS EG, Viya, Studio or Foundation with HTML they're likely not doing what you expect.

What are you expecting from those options?

mcalde33 · Posted 10-23-2022 04:50 PM

Not sure, I guess it was a habit when I was using it before. I thought it was standard, If theres no use for it in SAS studio then i'll remove it.

Tom · Posted 10-24-2022 08:42 AM

It will still impact how the SAS log is formatted.

Tom · Posted 10-22-2022 04:25 PM

The code you posted to re-create your dataset probably will not work. That is because it seems to have tab characters randomly inserted into it. (Yet another reason why you should not be putting tab characters into program files.).

Also in your dataset there are multiple observations for the same EMPID without any variable to distinguish them.

That will make it more difficult to work with the file (and makes one wonder what the multiple observations are supposed to represent).

That is easy enough to fix by adding a counter. Either one overall counter or on that resets to one when you start a new EMPID.

DATA Events;
  length EmpID 8 EV1-EV6 $3 ;
  input EmpID EV1-EV6 ;
  row+1;
DATALINES;
1001 AB  AT  BTR S   XYZ AT
1001 AT  MY  LOV .   LOV LOV
1001 .   MY  .   S   MY  S
1001 AB  BTR LOV BTR AT  AT
1001 AT  LOV KUL MY  .   S
1002 MY  LOV .   AT  AT  BTR
1002 MY  AT  BTR .   AT  MY
1003 BTR MY  .   XYZ MY  MY
1003 .   MY  MY  MY  MY  BTR
1003 BTR AB  MY  .   BTR .
1004 S   .   MY  MY  .   XYZ
1004 XYZ AT  KUL AB  KUL S
1004 .   .   BTR LOV MY  MY
1004 MY  KUL AT  KUL XYZ AB
;

Now you can use PROC TRANSPOSE to convert to a more useful structure.

proc transpose data=events out=tall(rename=(col1=event));
  by row empid;
  var ev1-ev6 ;
run;

Where it is easy enough to count how many times a particular event occurs for each EMPID.

proc freq data=tall ;
  tables event*empid / noprint out=counts ;
run;

Which can then be used to get a count of how many EMPID values there are per event.

proc freq data=counts ;
  tables event ;
run;

Result

mcalde33 · Posted 10-23-2022 04:39 PM

I adjusted the code to prevent the Cumulative column and the percentage column from printing but somehow the rows AB, AT, MY get repeated?

I also want to prevent the row at the bottom from being printed "Frequency Missing=4" as well as the "The Freq procedure" title from being printed.

see the code used and see the output I am getting:



 DATA Events;
      length EmpID 8 EV1-EV6 $3 ;
   	INPUT EmpID EV1-EV6 ;
  row+1;
     
      
DATALINES;
1001 AB	 AT	 BTR S   XYZ AT
1001 AT	 MY	 LOV .   LOV LOV
1001 .   MY	 .   S   MY	 S
1001 AB	 BTR LOV BTR AT	 AT
1001 AT	 LOV KUL MY	 .   S
1002 MY	 LOV .   AT	 AT	 BTR
1002 MY	 AT	 BTR .   AT	 MY
1003 BTR MY	 .   XYZ MY	 MY
1003 .   MY	 MY	 MY	 MY	 BTR
1003 BTR AB	 MY	 .   BTR .
1004 S   .   MY	 MY	 .   XYZ
1004 XYZ AT	 KUL AB	 KUL S
1004 .   .   BTR LOV MY	 MY
1004 MY	 KUL AT	 KUL XYZ AB
;
RUN;

/*Using PROC TRANSPOSE to convert to a more useful structure*/
proc transpose data=events out=tall(rename=(col1=Event));
  by row empid;
  var ev1-ev6 ;
run;

/*Formating to count how many times a particular event occurs for each EMPID*/
proc freq data=tall;
  tables Event*empid / noprint out=counts;
run;

/*Formating to get a count of how many EMPID values there are per event*/
proc freq data=counts ;
  tables event /NOCUM NOPERCENT;
  Title 'Table B';
run;

Table B

The FREQ Procedure

Event	Frequency

AB	3
AB	1
AT	3
AT	1
BTR	4
KUL	2
LOV	3
MY	4
MY	3
S	2
XYZ	3

Frequency Missing = 4

Tom · Posted 10-24-2022 08:49 AM

You could eliminate the missing values at many places in that code.

For example by adding a WHERE statement in the last step:

proc freq data=counts ;
  where not missing(event);
  tables event /NOCUM NOPERCENT;
  Title 'Table B';
run;

mcalde33 · Posted 10-24-2022 04:25 PM

how do I rename the "frequency" column to "N". I have tried the rename option but it gave me an error.

Tom · Posted 10-24-2022 05:09 PM

@mcalde33 wrote:
how do I rename the "frequency" column to "N". I have tried the rename option but it gave me an error.

Do you mean the COUNT variable?

tables event / out=event_count(rename=(count=N));

Or do you want to just make your own print out, in which case you can use any column header you want.

proc freq data=counts;
  tables event / noprint out=event_count;
run;

proc print data=event_count label ;
  var event count;
  label count='N' ;
run;

mcalde33 · Posted 10-24-2022 07:38 PM

I got it to work thanks!
Now I noticed the observations dont match how the desired outcome should be:

what im getting:
Table B

Ev N
AB 3
AB 1
AT 3
AT 1
BTR 4
KUL 2
LOV 3
MY 4
MY 3
S 2
XYZ 3

What i should be getting:
Table B

Ev N
AB 3
AT 3
BTR 4
S 2
XYZ 3
MY 4
LOV 3
KUL 2

Here is the code I am using:

DATA Events;
length EmpID 8 EV1-EV6 $3 ;
INPUT EmpID EV1-EV6 ;
row+1;

DATALINES;
1001 AB AT BTR S XYZ AT
1001 AT MY LOV . LOV LOV
1001 . MY . S MY S
1001 AB BTR LOV BTR AT AT
1001 AT LOV KUL MY . S
1002 MY LOV . AT AT BTR
1002 MY AT BTR . AT MY
1003 BTR MY . XYZ MY MY
1003 . MY MY MY MY BTR
1003 BTR AB MY . BTR .
1004 S . MY MY . XYZ
1004 XYZ AT KUL AB KUL S
1004 . . BTR LOV MY MY
1004 MY KUL AT KUL XYZ AB
;
RUN;

/*Using PROC TRANSPOSE to convert to a more useful structure*/
proc transpose data=events out=tall(rename=(col1=Ev));
by row empid;
var ev1-ev6 ;
run;

/*Formating to count how many times a particular event occurs for each EMPID*/
proc freq data=tall;
tables Ev*empid / noprint out=counts;
RUN;

/*Formating to get a count of how many EMPID values there are per event*/
ods noproctitle;
proc freq data=counts;
where not missing(ev);
tables ev /noprint out=event_count NOCUM NOPERCENT;
Title 'Table B';
run;

proc print data=event_count label NOOBS ;
var ev count;
label count='N' ;
run;

Reeza · Posted 10-25-2022 12:10 PM

DATA Events;
length EmpID 8 EV1-EV6 $3 ;
INPUT EmpID EV1-EV6 ;
row+1;
DATALINES;
1001 AB AT BTR S XYZ AT
1001 AT MY LOV . LOV LOV
1001 . MY . S MY S
1001 AB BTR LOV BTR AT AT
1001 AT LOV KUL MY . S
1002 MY LOV . AT AT BTR
1002 MY AT BTR . AT MY
1003 BTR MY . XYZ MY MY
1003 . MY MY MY MY BTR
1003 BTR AB MY . BTR .
1004 S . MY MY . XYZ
1004 XYZ AT KUL AB KUL S
1004 . . BTR LOV MY MY
1004 MY KUL AT KUL XYZ AB
;
RUN;



proc transpose data=events out=tall(rename=(col1=Ev));
by row empid;
var ev1-ev6 ;
run;

data tall;
set tall;
ev=trim(compress(ev, , 's')); *remove any extra spaces;
run;

/*Formating to count how many times a particular event occurs for each EMPID*/
proc freq data=tall;
tables Ev/  out=counts;
RUN;

Tom · Posted 10-25-2022 12:27 PM

You have two different "AB" values.

Probably related the TAB characters that messed up your original posting.

If you are using SAS to edit your program there should be options in the editor to replace tabs with spaces. I really recommend turning that option on so that it is much harder to accidentally get tab characters inserted into your file. For example in the SAS/Studio version available in SAS ODA here is what the preferences setting looks like:

But it might be other invisible characters.

Could just be leading spaces.

But could be other things like non-breaking spaces 'A0'x or Linefeed '0A'x or CarriageReturn '0D'x or really anthing that is not a normal letter or digit or punctuation symbol.

NOTE: The ODS output system HIDES leading spaces. You can make them more visible by using the $QUOTE. format with the variable. But even then it is hard to see spaces when looking at text displayed with a proportional font. If you can produce plain old text output (ODS LISTING is new terminology) then the leading spaces are much easier to see.

ballardw · Posted 10-25-2022 01:33 AM

@mcalde33 wrote:

I adjusted the code to prevent the Cumulative column and the percentage column from printing but somehow the rows AB, AT, MY get repeated?

I also want to prevent the row at the bottom from being printed "Frequency Missing=4" as well as the "The Freq procedure" title from being printed.

see the code used and see the output I am getting:
 DATA Events;
      length EmpID 8 EV1-EV6 $3 ;
   	INPUT EmpID EV1-EV6 ;
  row+1;
     
      
DATALINES;
1001 AB	 AT	 BTR S   XYZ AT
1001 AT	 MY	 LOV .   LOV LOV
1001 .   MY	 .   S   MY	 S
1001 AB	 BTR LOV BTR AT	 AT
1001 AT	 LOV KUL MY	 .   S
1002 MY	 LOV .   AT	 AT	 BTR
1002 MY	 AT	 BTR .   AT	 MY
1003 BTR MY	 .   XYZ MY	 MY
1003 .   MY	 MY	 MY	 MY	 BTR
1003 BTR AB	 MY	 .   BTR .
1004 S   .   MY	 MY	 .   XYZ
1004 XYZ AT	 KUL AB	 KUL S
1004 .   .   BTR LOV MY	 MY
1004 MY	 KUL AT	 KUL XYZ AB
;
RUN;

/*Using PROC TRANSPOSE to convert to a more useful structure*/
proc transpose data=events out=tall(rename=(col1=Event));
  by row empid;
  var ev1-ev6 ;
run;

/*Formating to count how many times a particular event occurs for each EMPID*/
proc freq data=tall;
  tables Event*empid / noprint out=counts;
run;

/*Formating to get a count of how many EMPID values there are per event*/
proc freq data=counts ;
  tables event /NOCUM NOPERCENT;
  Title 'Table B';
run;
Table B

The FREQ Procedure

Event

Frequency

AB

3

AB

1

AT

3

AT

1

BTR

4

KUL

2

LOV

3

MY

4

MY

3

S

2

XYZ

3

Frequency Missing = 4

This is a symptom that you have either a leading space or more than one with some of the values. Proc Freq and most of the table generated output will left justify and drop the leading spaces.

See if this code shows the leading space(s);

proc tabulate data=counts ;
  class event ;
  classlev event / style=[asis=on];
  Title 'Table B';
  table event, n;
run;

The CLASSLEV statement provides options on how row and column header variable values display. The ASIS=on preserves such things as leading spaces when present.

Registration is open

SAS Training: Just a Click Away