Hi,
I have a dataset with more than 20 columns and I want to get the sum of each column without writing sum for over 20 times.
I made a sample dataset as follows
(the actual dataset will have much more columns, and there's not a particular order of column names):
********************************************
data sample;
infile datalines delimiter=',';
input Email $ code1 code3 codeA;
datalines;
E1,1,0,1
E2,1,1,0
E3,0,1,1
E4,0,0,1
E5,0,0,0
E6,1,1,1
;
run;
********************************************
what I want to get,
code1 | code3 | codeA |
3 | 3 | 4 |
********************************************
I know I can do:
proc sql;
select sum(code1) as Code1, sum(code3) as Code3, sum(codeA) as CodeA
from sample;
quit;
But when there are so many columns, can I do some looping in sas in order to make it more efficient?
I also tried:
data OUT;
set sample;
array code code1 -- codeA;
do over code;
code = sum(code);
end;
run;
But it still displays every email, and give me the same dataset as sample. Don' know how I can modify the code to make it work.
data sample;
infile datalines delimiter=',';
input Email $ code1 code3 codeA;
datalines;
E1,1,0,1
E2,1,1,0
E3,0,1,1
E4,0,0,1
E5,0,0,0
E6,1,1,1
;
run;
proc summary data=sample ;
var code:;
output out=want(keep=code:) sum=;
run;
The SAS recommendation is that, when the requirement is to summarize, the 1st place to look is procedures summary/freq etc
data sample;
infile datalines delimiter=',';
input Email $ code1 code3 codeA;
datalines;
E1,1,0,1
E2,1,1,0
E3,0,1,1
E4,0,0,1
E5,0,0,0
E6,1,1,1
;
run;
proc summary data=sample ;
var code:;
output out=want(keep=code:) sum=;
run;
The SAS recommendation is that, when the requirement is to summarize, the 1st place to look is procedures summary/freq etc
Hi @novinosrin ,
Thanks for the reply!!!
My follow up question is, what if there's no common part of the columns? They don't start with 'code'. For example, column names are: AA, BB,SJ, EWK.
Hmm, that's a pain. If you do know the start to end sequence present in the dataset, you could list like
var aa--ewk;
But the above assumes the list is in continuous sequence from left to right in terms of position of the variable.
If there is no such a thing and all of these are completely random names and in any random order/position, i'm afraid there isn't much one can do about it. 😞
You could use _NUMERIC_, assuming you want all numeric variables to be summed.
If it is a huge data set, and you don't really need ALL of the numeric variables to be summed, for example there's no point in summing zip codes or dates, etc. then you are trading off execution time against programming/typing time.
Guru @data_null__ Honestly did think of it. To be honest, I once made a terrible mistake at work by using _numeric_ without knowing there were some variables(numeric) of no interest, though rectified it later but that's the problem. Of course I agree, should you have all vars of interest and only those vars in your dataset, _numeric_ is very comfy.
Then you can modify the code as follows:
proc summary data=sample ;
var gorilla giraffe octopus cobra elephant oryx lion jackal;
output out=want sum=;
run;
where you type the variable names in the VAR statement.
If they are consecutive columns in the data set named SAMPLE, then it gets even easier
proc summary data=sample ;
var gorilla--jackal;
output out=want sum=;
run;
Here is a reference that illustrates how to refer to variables and datasets in a short cut list:
https://blogs.sas.com/content/iml/2018/05/29/6-easy-ways-to-specify-a-list-of-variables-in-sas.html
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.