Re: Crosstabulate distinct frequencies and totals of binary variables

SarahDew · Posted 12-21-2018 04:40 AM

Sample data:

data companies;
 length ID $ 3;
 input ID $ score1 attribute1 attribute2 attribute3;
 datalines;
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
001 0.3 1 0 1
003 0.3 1 1 1
;

I would like to produce a table giving me the total of companies with a certain score that have attribute 1, 2 or 3. And the total of companies with a certain attribute. It should look something like what I get with this query:

PROC SQL;
 CREATE TABLE Summary AS 
 SELECT DISTINCT score1, 
 (COUNT(attribute1)) AS attribute1, 
 (COUNT(attribute2)) AS attribute2, 
 (COUNT(attribute3)) AS attribute3, 
 (COUNT(DISTINCT(ID))) AS Total
 FROM COMPANIES
 GROUP BY score1;
QUIT;

But then with only counting the 1-values for attributes, and only counting this for distinct companies, and adding a total row at the bottom. Below the result I want (from the whole dataset), made by hand.

Score	Attr1	Attr2	Attr3	Attr4	Total
0.3	3	182	32	0	217
0.4	0	26	0	26	52
0.6	5	3	8	0	16
0.9	2	2	2	0	6
1	1	1	1	1	4
Total	11	214	43	27	295

mansour_ib_sas · Posted 12-21-2018 04:58 AM

Hello,

I propose this solution:

data companies;
 length ID $ 3;
 input ID $ score1 attribute1 attribute2 attribute3;
 datalines;
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
001 0.3 1 0 1
003 0.3 1 1 1
;
run;


PROC SQL;
 CREATE TABLE Summary AS 
 SELECT DISTINCT put (score1,5.) as score1, 
 (COUNT(attribute1)) AS attribute1, 
 (COUNT(attribute2)) AS attribute2, 
 (COUNT(attribute3)) AS attribute3, 
 (COUNT(DISTINCT(ID))) AS Total
 FROM COMPANIES
 GROUP BY score1;
QUIT;

proc report data =summary out=summary1(drop=_BREAK_) noprint  ;
columns  score1 attribute1 attribute2 attribute3 total;
define score1/DISPLAY;
define attribute1 /analysis sum;
define attribute2 /analysis sum;
define attribute3 /analysis sum;
define total /analysis sum;

 rbreak after / summarize ;
 compute after;
 score1='TOTAL';
 endcomp;
 run;

ballardw · Posted 12-21-2018 11:28 AM

@SarahDew wrote:

Sample data:
data companies;
 length ID $ 3;
 input ID $ score1 attribute1 attribute2 attribute3;
 datalines;
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
001 0.3 1 0 1
003 0.3 1 1 1
;
I would like to produce a table giving me the total of companies with a certain score that have attribute 1, 2 or 3. And the total of companies with a certain attribute. It should look something like what I get with this query:
PROC SQL;
 CREATE TABLE Summary AS 
 SELECT DISTINCT score1, 
 (COUNT(attribute1)) AS attribute1, 
 (COUNT(attribute2)) AS attribute2, 
 (COUNT(attribute3)) AS attribute3, 
 (COUNT(DISTINCT(ID))) AS Total
 FROM COMPANIES
 GROUP BY score1;
QUIT;
But then with only counting the 1-values for attributes, and only counting this for distinct companies, and adding a total row at the bottom. Below the result I want (from the whole dataset), made by hand.

Score Attr1 Attr2 Attr3 Attr4 Total

0.3 3 182 32 0 217

0.4 0 26 0 26 52

0.6 5 3 8 0 16

0.9 2 2 2 0 6

1 1 1 1 1 4

Total 11 214 43 27 295

Can you show what that result table would look like for the given example data? Since the example only has 5 rows of data it is really hard to see where the result could come from.

And what actual role does the ID variable have? Is that a company indicator or something else?

SarahDew · Posted 12-24-2018 07:25 AM

Thanks,

The ID is indeed the company indicator. For the given example the produced table should look like this:

Score	Attr1	Attr2	Attr3	Total
0.3	2	1	2	2
0.5	0	1	1	1
Total	2	2	3	3

mkeintz · Posted 12-24-2018 07:00 PM

For each SCORE1, you want the variable TOTAL to be the number of unique patterns of attribute1-attribute3. And you also want the number of patterns in which each attribute is present.

So the problem is that you don't want to count the duplicate patterns that may appear within an ID/SCORE1 pair. Now it appears that instances of duplicate patterns appear in consecutive records, within an ID.

data companies;
 length ID $ 3;
 input ID $ score1 attribute1 attribute2 attribute3;
 datalines;
001 0.3 1 0 1
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
003 0.3 1 1 1
run;

/* Remove duplicates */
data vneed / view=vneed;
  set companies;
  by id score1 attribute1-attribute3  notsorted;
  if first.attribute3;
run;

/* PROC SUMMARY: Sum ATTRIBUTE values for each SCORE1, and also  */
/* use automatic variable _FREQ_ for getting total N of patterns */
proc summary data=vneed;
  class score1;
  vars attribute1-attribute3 ;
  output out=want (drop=_type_ rename=(_freq_=total)) sum= / autoname;
run;

The data set VNEED is a data set VIEW, not a data set FILE on disk. This means the data is not written to disk. In fact, not data is processed until the view VNEED is accessed later in the proc summary. At that point the data is read from companies and streamed directly to the proc summary. This saves disk input/output.
The BY statement ends with NOTSORTED, allowing the data to be grouped by id/score/attribute1-attribute3 pattern even if the patterns are not in sorted order. The "if first.attribute3" dummy tells you when the record-in-hand is the first one for a given id/score1/attribute1/attribute2/attribute3 pattern. Because there is no "THEN" clause, it is a subsetting IF statement, i.e. a filter.
The proc summary (see also its twin proc means)
1. The class statement says to summarize for each score1 value.
2. The VARS statement identifies analysys variables.
3. The output statement says to get the SUM of each analysis variable.

The is a good case for weaning yourself off of proc sql. PROC SUMMARY was made for this type of task, and will almost certainly be much faster for a large data set.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SarahDew · Posted 01-03-2019 07:00 AM

Thanks, with the provided code I am almost there. What was probably important to mention is that the companies might appear more than once in the list but their patterns for the attributes will be the same. They differ on other variables that are not mentioned in my example and which are not needed for this analysis. I have now applied the following, and am trying to figure out how to get the dot for total to appear as 'Total', and on the bottom row.

data companies;
 length ID $ 3;
 input ID $ score1 attribute1 attribute2 attribute3;
 datalines;
001 0.3 1 0 1
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
003 0.3 1 1 1
run;

/* Remove duplicates */
data vneed / view=vneed;
 set companies;
 by ID notsorted;
 if first.ID;
run;

/* PROC SUMMARY: Sum ATTRIBUTE values for each SCORE1, and also */
/* use automatic variable _FREQ_ for getting total N of patterns */
proc summary data=vneed;
 class score1;
 vars attribute1 attribute2 attribute3;
 output out=want (drop=_type_ rename=(_freq_=Total)) sum=attribute1 attribute2 attribute3;
run;

data want2;
retain score1 attribute1 attribute2 attribute3 Total;
set want;
run;

mkeintz · Posted 01-03-2019 09:57 PM

If a company (i.e. an ID value) appears more than once, will the duplicates always be consecutive? If not this code will keep duplicate ID's if they are separated in the data set.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SarahDew · Posted 01-04-2019 04:12 AM

I checked and the duplicates are consecutive in the dataset so it should work correctly I presume?

mkeintz · Posted 01-04-2019 11:19 AM

You want the total row at the bottom, but it's current at the top. What this means is that you want to concatenate two data sets:

All the observations from want1 starting with the second obs. (firstobs=2)
The single first obs from want1 (obs=1)

The set statement does concatenation:

data want2;
  set want1 (firstobs=2)   want1 (obs=1);
run;

And "dot for total"? I don't see how you can get that. You can get a dot for the class variable (score1). But it can never have the value "TOTAL" because score1 is a numeric variable, and "TOTAL" is a character value.   You can make a new character variable from SCORE1, such that when score1=dot, the new var gets "TOTAL". And all other values of score1 are stored in the new var as character values.  I.e.
    if score1=. then newvar="Total";
    else newvar=put(score1,5.1);

You can look up the PUT function to see how it works

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away