Sample data:
data companies;
length ID $ 3;
input ID $ score1 attribute1 attribute2 attribute3;
datalines;
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
001 0.3 1 0 1
003 0.3 1 1 1
;
I would like to produce a table giving me the total of companies with a certain score that have attribute 1, 2 or 3. And the total of companies with a certain attribute. It should look something like what I get with this query:
PROC SQL;
CREATE TABLE Summary AS
SELECT DISTINCT score1,
(COUNT(attribute1)) AS attribute1,
(COUNT(attribute2)) AS attribute2,
(COUNT(attribute3)) AS attribute3,
(COUNT(DISTINCT(ID))) AS Total
FROM COMPANIES
GROUP BY score1;
QUIT;
But then with only counting the 1-values for attributes, and only counting this for distinct companies, and adding a total row at the bottom. Below the result I want (from the whole dataset), made by hand.
Score | Attr1 | Attr2 | Attr3 | Attr4 | Total |
0.3 | 3 | 182 | 32 | 0 | 217 |
0.4 | 0 | 26 | 0 | 26 | 52 |
0.6 | 5 | 3 | 8 | 0 | 16 |
0.9 | 2 | 2 | 2 | 0 | 6 |
1 | 1 | 1 | 1 | 1 | 4 |
Total | 11 | 214 | 43 | 27 | 295 |
Hello,
I propose this solution:
data companies;
length ID $ 3;
input ID $ score1 attribute1 attribute2 attribute3;
datalines;
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
001 0.3 1 0 1
003 0.3 1 1 1
;
run;
PROC SQL;
CREATE TABLE Summary AS
SELECT DISTINCT put (score1,5.) as score1,
(COUNT(attribute1)) AS attribute1,
(COUNT(attribute2)) AS attribute2,
(COUNT(attribute3)) AS attribute3,
(COUNT(DISTINCT(ID))) AS Total
FROM COMPANIES
GROUP BY score1;
QUIT;
proc report data =summary out=summary1(drop=_BREAK_) noprint ;
columns score1 attribute1 attribute2 attribute3 total;
define score1/DISPLAY;
define attribute1 /analysis sum;
define attribute2 /analysis sum;
define attribute3 /analysis sum;
define total /analysis sum;
rbreak after / summarize ;
compute after;
score1='TOTAL';
endcomp;
run;
@SarahDew wrote:
Sample data:
data companies;
length ID $ 3;
input ID $ score1 attribute1 attribute2 attribute3;
datalines;
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
001 0.3 1 0 1
003 0.3 1 1 1
;I would like to produce a table giving me the total of companies with a certain score that have attribute 1, 2 or 3. And the total of companies with a certain attribute. It should look something like what I get with this query:
PROC SQL;
CREATE TABLE Summary AS
SELECT DISTINCT score1,
(COUNT(attribute1)) AS attribute1,
(COUNT(attribute2)) AS attribute2,
(COUNT(attribute3)) AS attribute3,
(COUNT(DISTINCT(ID))) AS Total
FROM COMPANIES
GROUP BY score1;
QUIT;But then with only counting the 1-values for attributes, and only counting this for distinct companies, and adding a total row at the bottom. Below the result I want (from the whole dataset), made by hand.
Score Attr1 Attr2 Attr3 Attr4 Total 0.3 3 182 32 0 217 0.4 0 26 0 26 52 0.6 5 3 8 0 16 0.9 2 2 2 0 6 1 1 1 1 1 4 Total 11 214 43 27 295
Can you show what that result table would look like for the given example data? Since the example only has 5 rows of data it is really hard to see where the result could come from.
And what actual role does the ID variable have? Is that a company indicator or something else?
Thanks,
The ID is indeed the company indicator. For the given example the produced table should look like this:
Score | Attr1 | Attr2 | Attr3 | Total |
0.3 | 2 | 1 | 2 | 2 |
0.5 | 0 | 1 | 1 | 1 |
Total | 2 | 2 | 3 | 3 |
For each SCORE1, you want the variable TOTAL to be the number of unique patterns of attribute1-attribute3. And you also want the number of patterns in which each attribute is present.
So the problem is that you don't want to count the duplicate patterns that may appear within an ID/SCORE1 pair. Now it appears that instances of duplicate patterns appear in consecutive records, within an ID.
data companies;
length ID $ 3;
input ID $ score1 attribute1 attribute2 attribute3;
datalines;
001 0.3 1 0 1
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
003 0.3 1 1 1
run;
/* Remove duplicates */
data vneed / view=vneed;
set companies;
by id score1 attribute1-attribute3 notsorted;
if first.attribute3;
run;
/* PROC SUMMARY: Sum ATTRIBUTE values for each SCORE1, and also */
/* use automatic variable _FREQ_ for getting total N of patterns */
proc summary data=vneed;
class score1;
vars attribute1-attribute3 ;
output out=want (drop=_type_ rename=(_freq_=total)) sum= / autoname;
run;
The is a good case for weaning yourself off of proc sql. PROC SUMMARY was made for this type of task, and will almost certainly be much faster for a large data set.
Thanks, with the provided code I am almost there. What was probably important to mention is that the companies might appear more than once in the list but their patterns for the attributes will be the same. They differ on other variables that are not mentioned in my example and which are not needed for this analysis. I have now applied the following, and am trying to figure out how to get the dot for total to appear as 'Total', and on the bottom row.
data companies;
length ID $ 3;
input ID $ score1 attribute1 attribute2 attribute3;
datalines;
001 0.3 1 0 1
001 0.3 1 0 1
002 0.5 0 1 1
003 0.3 1 1 1
003 0.3 1 1 1
run;
/* Remove duplicates */
data vneed / view=vneed;
set companies;
by ID notsorted;
if first.ID;
run;
/* PROC SUMMARY: Sum ATTRIBUTE values for each SCORE1, and also */
/* use automatic variable _FREQ_ for getting total N of patterns */
proc summary data=vneed;
class score1;
vars attribute1 attribute2 attribute3;
output out=want (drop=_type_ rename=(_freq_=Total)) sum=attribute1 attribute2 attribute3;
run;
data want2;
retain score1 attribute1 attribute2 attribute3 Total;
set want;
run;
If a company (i.e. an ID value) appears more than once, will the duplicates always be consecutive? If not this code will keep duplicate ID's if they are separated in the data set.
You want the total row at the bottom, but it's current at the top. What this means is that you want to concatenate two data sets:
The set statement does concatenation:
data want2;
set want1 (firstobs=2) want1 (obs=1);
run;
And "dot for total"? I don't see how you can get that. You can get a dot for the class variable (score1). But it can never have the value "TOTAL" because score1 is a numeric variable, and "TOTAL" is a character value. You can make a new character variable from SCORE1, such that when score1=dot, the new var gets "TOTAL". And all other values of score1 are stored in the new var as character values. I.e.
if score1=. then newvar="Total";
else newvar=put(score1,5.1);
You can look up the PUT function to see how it works
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.