Re: Counting the frequency of a values in a string

out_of_my_wheelhouse · Posted 08-18-2015 10:17 AM

Hi, I've stumbled on a problem that I'm having trouble figuring out how to approach. At the title suggests, I have a variable "XCOMBINED" that contains a string of values that were concatenated from previously 15 distinct variables (X1-X15) with each numeric value being delineated by a space. Therefore, in the "XCOMBINE" variable there could be anywhere from 1 to 15 numeric values that contain up to 5 characters each. Is there a way to count or graph which of these values occurs the most frequently in the observations?

Alternatively, I also have all 15 variables separate. Is there a way to determine the frequency of values in 15 different variables simultaneously? Obviously I can plot X1 or X2 frequency individually, but I can't figure out a way to count the frequency of multiple variables in one graph.

Thanks for your help!

ndp · Posted 08-18-2015 11:02 AM

Need to have some grouping variable/s<id> try this:

proc transpose data=x out=y;
by <id>;
var x1-x15;
run;

proc freq data=y;
tables <id>;
by _NAME_;
where nmiss(col1)=0;
run;

out_of_my_wheelhouse · Posted 08-18-2015 12:26 PM

Thanks for replying, but I'm a bit confused what <id> is meant to represent. Can you explain? The log is telling me it's causing a syntax error.

ndp · Posted 08-18-2015 07:10 PM

<id> is any variable/s you need to create counts by (grouping )

Reeza · Posted 08-18-2015 09:48 PM

FYI if you post sample data you're more likely to get code that works for you.

The key here is to transpose your data from wide to long format. Then you can run a proc freq on all the results in one pass. I think proc freq even generates some standard graphs these days.

RW9 · Posted 08-19-2015 05:24 AM

Reeza has the best answer here, normalise the data then aggregate it. I would, just for fun, show an example of using arrays on the string you have:

data have;

length xcombined $2000;

infile datalines dlm="|";

input xcombined $;

datalines;

1234 567 1324 8723 56 56

673674 1235 1235 1235 2277 4564

;

run;

data want;

set have;

array elements{7,2} $20;

curr_element=1;

do while (scan(xcombined,1," ") ne "");

elements{curr_element,1}=scan(xcombined,1," ");

elements{curr_element,2}=strip(put(count(xcombined,scan(xcombined,1," ")),best.));

xcombined=strip(tranwrd(xcombined,scan(xcombined,1," "),""));

curr_element=curr_element+1;

end;

run;

This keeps taking the first word from the string, counting all occurences, and putting the output in a 2 dimensional array, i.e. value, count. Then that word is removed from the string and we keep going until there is nothing left.

out_of_my_wheelhouse · Posted 08-19-2015 07:59 AM

First of all, thank you so much. This is a surprisingly helpful community, especially for someone with zero experience with coding. I posted a part of the example data I spoke about before in case I did a poor job explaining it. This is obviously "dxcombine" instead of "xcombine" but I'm trying to accomplish the same thing.

I tried the example you posted, but I'm getting the following error:

581 data have;

582

583 length dxcombine $2000;

584

585 infile datalines dlm="|";

586

587 input dxcombine $;

588

589 datalines;

NOTE: The data set WORK.HAVE has 5 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.00 seconds

595 ;

596

597 run;

598

599 data want;

600

601 set have;

602

603 array elements{7,2} $20;

604

605 curr_element=1;

606

607 do while (scan(dxcombine,1," ") ne "");

608

609 elements{curr_element,1}=scan(dxcombine,1," ");

610

611 elements{curr_element,2}=strip(put(count(dxcombine,scan(dxcombine,1," ")),best.));

612

613 xcombined=strip(tranwrd(dxcombine,scan(dxcombine,1," "),""));

614

615 curr_element=curr_element+1;

616

617 end;

618

619 run;

ERROR: Array subscript out of range at line 609 column 5.

dxcombine=1234 567 1324 8723 56 56 elements1=1234 elements2=1 elements3=1234 elements4=1

elements5=1234 elements6=1 elements7=1234 elements8=1 elements9=1234 elements10=1

elements11=1234 elements12=1 elements13=1234 elements14=1 curr_element=8

xcombined=567 1324 8723 56 56 _ERROR_=1 _N_=2

NOTE: The SAS System stopped processing this step because of errors.

NOTE: There were 2 observations read from the data set WORK.HAVE.

WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 1

observations and 17 variables.

WARNING: Data set WORK.WANT was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

real time 0.02 seconds

cpu time 0.03 seconds

RW9 · Posted 08-19-2015 08:53 AM

This is down to the value I hardcoded in this row:

603 array elements{7,2} $20;

Basically I am assuming there will be no more than 7 distinct elements in the string, this is just the test data I have at the time. Set it to elements{20,2} and that should cover it. I does however illustrate neatly why Reeza's answer is the better of the two. If you normalise your data:

ROW DX

1 5849

1 42822

...

2 41071

...

Then you can very easily run a freq/means procedure, or some simple code to get your result:

proc sq;

create table WANT as

select ROW,

DX,

count(DX) as RESULT

from HAVE

group by ROW,

DX;

quit;

Now to get the normalised data you can just do:

data inter (keep=row dx);

set have;

i=1;

do until (scan(dxcombine,i,' ') = '');

row=_n_;

dx=scan(dxcombine,i,' ');

output;

i=i+1;

end;

run;

This outputs one row per scanned word.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away