Hi,
suppose I have the following data:
var1 | var2 | var3   | var4 | var5
abc  | d45g | abc    | 1    | 2
a    | c    | rrrrrr | g9   | g9
var1 - var3 is one range, and var4 - var5 is another.
For each row, I would like to count the number of distinct values for each range of variables. So for the above case, here is the count:
var1-var3 | var4-var5
2         | 2
3         | 1
Thanks!
Here is a brute force approach:
data want;
   set have;
   array v1(*) $ var1-var3;
   array v2(*) $ var4-var5;
   call sortc(of v1(*));
   do i=1 to dim(v1);
      if not missing(v1(i)) then do;
         if i lt dim(v1) then do;
            if v1(i) ne v1(i+1) then var1_var3=sum(var1_var3,1);
         end;
         else var1_var3=sum(var1_var3,1);
      end;
   end;
   call sortc(of v2(*));
   do i=1 to dim(v2);
      if not missing(v2(i)) then do;
         if i lt dim(v2) then do;
            if v2(i) ne v2(i+1) then var4_var5=sum(var4_var5,1);
         end;
         else var4_var5=sum(var4_var5,1);
      end;
   end;
run;
HTH,
Art, CEO, AnalystFinder.com
@art297 I thought you might enjoy this Rube Goldberg machine with implicit and explicit arrays, arrays of implicit arrays and temporary arrays.
data t;
array a $8 var1-var3;
array b $8 var4-var5;
array d[2] _temporary_;
d[1]=dim(a); d[2]=dim(b);
array c(j) a b;
input (var:)(:);
array t[10] $8 _temporary_;
array count[2];
do over c;
   call missing(k, of t[*]);
   do _i_ = 1 to d[j];
      if c in t then continue;
      k + 1;
      t[k] = c;
   end;
   count[j] = k;
end;
drop j k;
cards;
abc d45g abc 1 2
a c rrrrrr g9 g9
;;;;
run;
proc print;
run;
Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.
If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!
@LaurieF wrote:
Can you please provide more data? My initial generic response (using proc transpose) may or may not be viable, depending on the number of observations.
If the source data are small enough, transpose would be quite good. If it's over 32k observations, it won't!
@LaurieF this is the second or third time @ilikesas has asked a similar question. If @ilikesas normalized their data, they could easily deal with this type of problem and more, without resorting to ARRAYs and other clunky DATA step code.
<nods enthusiastically>
Hi data_null_,
could you please elaborate on what you mean by "normalizing" the data? It is indeed true that arrays and do loops tend to be messy and complicated (at least for me)...
Thanks!
The long answer is: Wikipedia Data Normalisation
The short answer: SAS works inherently better when it's processing very long, very narrow tables. One of the primary functions of proc transpose, for example, is to rotate data so that columns become rows and vice versa. That way you can do queries over columns, which proc sql naturally supports.
You can, of course, use data steps to walk along arrays, and that can work very well. But if you can manage to pre-process your data so that you can avoid that, the long term solution will tend to be more effective.
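To make the transpose-then-query idea concrete, here is a minimal sketch using the var1-var5 layout from the original question. The dataset name long, the rowid variable, and the range labels are my own illustrative choices, not from the thread:

```sas
/* Rotate the wide data so each value becomes its own row,          */
/* then a distinct count per row and per range is a one-line query. */
data long;
   set have;
   rowid = _n_;                       /* remember the source row    */
   length range $9 value $8;
   array r1(*) $ var1-var3;
   array r2(*) $ var4-var5;
   do i = 1 to dim(r1);
      range = 'var1_var3'; value = r1(i); output;
   end;
   do i = 1 to dim(r2);
      range = 'var4_var5'; value = r2(i); output;
   end;
   keep rowid range value;
run;

proc sql;
   create table counts as
   select rowid, range, count(distinct value) as n_distinct
   from long
   where not missing(value)
   group by rowid, range;
quit;
```

Once the data are long, the array bookkeeping disappears and any new grouping is just a different GROUP BY.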
Without a little more feel for your data, it's hard to say what normalization you need to do.
If you had survey data, for example with each survey question a variable (column) and each survey responder an observation (row) like:
           | ques1 | ques2 | ques3
responder1 |       |       |
responder2 |       |       |
responder3 |       |       |
that would be a pretty typical format in my experience. For me (and I think for many), this sort of layout lends itself more readily to comparing responders on a particular question than to comparing questions for a particular responder, although both are possible.
I think the suggestion is that perhaps a look at the data/database design/scheme and what you want to do with it could save you some work in the long run.
Hi HB,
thanks for the insight. In my case, my data consists of code names (such as abc, gf4). Would that layout work for this?
Hi LaurieF,
here is some data that I prepared:
data have;
input (var1-var5) (:$4.);
datalines;
abc abc def g55 g56
abc def ttt hhh hhh
ddd ddd ddd ddd ddd
111 111 112 111 111
a a a a a
ab ab ac ac ac
;
run;
I guess that in order to transpose, more datalines than variables are needed. Here I also made all variables character type with length 4, but I guess this specification is not very important when the data is imported from an external source such as Excel.
Thanks!
Again, that's all very well, but it still doesn't define the problem very well: it's all a bit theoretical. Do you have some real-world data to play with?
actually I don't have real data, but I am rather thinking of building a dataset which will consist of names or name codes (such as abc, gf5, etc.)
I'd recommend, then, constructing your data so that you can avoid using arrays, and store the values in observations instead. You're making a rod for your own back otherwise.
It's a bit hard to give you advice without any context and without knowing how you plan to use this data.
On a generic level: You could go for a key/value pair construct with one column holding the name of your variables and a second column holding the value. With such a structure it's then quite simple to implement distinct counts. You could even use such data to construct SAS Formats (via Proc Format, cntlin).
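As a rough sketch of that key/value idea (all dataset and variable names here are illustrative, not from the post):

```sas
/* One row per (key, value) pair instead of a wide table. */
data lookup;
   length key $8 value $8;
   input key $ value $;
   datalines;
item1 abc
item1 gf4
item1 abc
item2 g9
item2 g9
;
run;

/* Distinct counts then become a simple aggregation. */
proc sql;
   select key, count(distinct value) as n_distinct
   from lookup
   group by key;
quit;
```

The same long table could also be turned into a CNTLIN= dataset for PROC FORMAT, as noted above, by adding the FMTNAME/START/LABEL columns it expects.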