Hi, I've stumbled on a problem that I'm having trouble figuring out how to approach. At the title suggests, I have a variable "XCOMBINED" that contains a string of values that were concatenated from previously 15 distinct variables (X1-X15) with each numeric value being delineated by a space. Therefore, in the "XCOMBINE" variable there could be anywhere from 1 to 15 numeric values that contain up to 5 characters each. Is there a way to count or graph which of these values occurs the most frequently in the observations?
Alternatively, I also have all 15 variables separate. Is there a way to determine the frequency of values in 15 different variables simultaneously? Obviously I can plot X1 or X2 frequency individually, but I can't figure out a way to count the frequency of multiple variables in one graph.
Thanks for your help!
Need to have some grouping variable/s<id> try this:
proc transpose data=x out=y;
by <id>;
var x1-x15;
run;
proc freq data=y;
tables <id>;
by _NAME_;
where nmiss(col1)=0;
run;
Thanks for replying, but I'm a bit confused what <id> is meant to represent. Can you explain? The log is telling me it's causing a syntax error.
<id> is any variable/s you need to create counts by (grouping )
FYI if you post sample data you're more likely to get code that works for you.
The key here is to transpose your data from wide to long format. Then you can run a proc freq on all the results in one pass. I think proc freq even generates some standard graphs these days.
Reeza has the best answer here, normalise the data then aggregate it. I would, just for fun, show an example of using arrays on the string you have:
data have;
length xcombined $2000;
infile datalines dlm="|";
input xcombined $;
datalines;
1234 567 1324 8723 56 56
673674 1235 1235 1235 2277 4564
;
run;
data want;
set have;
array elements{7,2} $20;
curr_element=1;
do while (scan(xcombined,1," ") ne "");
elements{curr_element,1}=scan(xcombined,1," ");
elements{curr_element,2}=strip(put(count(xcombined,scan(xcombined,1," ")),best.));
xcombined=strip(tranwrd(xcombined,scan(xcombined,1," "),""));
curr_element=curr_element+1;
end;
run;
This keeps taking the first word from the string, counting all occurences, and putting the output in a 2 dimensional array, i.e. value, count. Then that word is removed from the string and we keep going until there is nothing left.
First of all, thank you so much. This is a surprisingly helpful community, especially for someone with zero experience with coding. I posted a part of the example data I spoke about before in case I did a poor job explaining it. This is obviously "dxcombine" instead of "xcombine" but I'm trying to accomplish the same thing.
I tried the example you posted, but I'm getting the following error:
581 data have;
582
583 length dxcombine $2000;
584
585 infile datalines dlm="|";
586
587 input dxcombine $;
588
589 datalines;
NOTE: The data set WORK.HAVE has 5 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
595 ;
596
597 run;
598
599 data want;
600
601 set have;
602
603 array elements{7,2} $20;
604
605 curr_element=1;
606
607 do while (scan(dxcombine,1," ") ne "");
608
609 elements{curr_element,1}=scan(dxcombine,1," ");
610
611 elements{curr_element,2}=strip(put(count(dxcombine,scan(dxcombine,1," ")),best.));
612
613 xcombined=strip(tranwrd(dxcombine,scan(dxcombine,1," "),""));
614
615 curr_element=curr_element+1;
616
617 end;
618
619 run;
ERROR: Array subscript out of range at line 609 column 5.
dxcombine=1234 567 1324 8723 56 56 elements1=1234 elements2=1 elements3=1234 elements4=1
elements5=1234 elements6=1 elements7=1234 elements8=1 elements9=1234 elements10=1
elements11=1234 elements12=1 elements13=1234 elements14=1 curr_element=8
xcombined=567 1324 8723 56 56 _ERROR_=1 _N_=2
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 1
observations and 17 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.03 seconds
This is down to the value I hardcoded in this row:
603 array elements{7,2} $20;
Basically I am assuming there will be no more than 7 distinct elements in the string, this is just the test data I have at the time. Set it to elements{20,2} and that should cover it. I does however illustrate neatly why Reeza's answer is the better of the two. If you normalise your data:
ROW DX
1 5849
1 42822
...
2 41071
...
Then you can very easily run a freq/means procedure, or some simple code to get your result:
proc sq;
create table WANT as
select ROW,
DX,
count(DX) as RESULT
from HAVE
group by ROW,
DX;
quit;
Now to get the normalised data you can just do:
data inter (keep=row dx);
set have;
i=1;
do until (scan(dxcombine,i,' ') = '');
row=_n_;
dx=scan(dxcombine,i,' ');
output;
i=i+1;
end;
run;
This outputs one row per scanned word.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.