DATA Step, Macro, Functions and more

Counting the frequency of a values in a string

Reply
New Contributor
Posts: 4

Counting the frequency of a values in a string

Hi, I've stumbled on a problem that I'm having trouble figuring out how to approach.  At the title suggests, I have a variable "XCOMBINED" that contains a string of values that were concatenated from previously 15 distinct variables (X1-X15) with each numeric value being delineated by a space.  Therefore, in the "XCOMBINE" variable there could be anywhere from 1 to 15 numeric values that contain up to 5 characters each.  Is there a way to count or graph which of these values occurs the most frequently in the observations?

Alternatively, I also have all 15 variables separate.  Is there a way to determine the frequency of values in 15 different variables simultaneously?  Obviously I can plot X1 or X2 frequency individually, but I can't figure out a way to count the frequency of multiple variables in one graph.

Thanks for your help!

Contributor ndp
Contributor
Posts: 61

Re: Counting the frequency of a values in a string

Need to have some grouping variable/s<id> try this:

proc transpose data=x out=y;
by <id>;
var x1-x15;
run;

proc freq data=y;
tables <id>;
by _NAME_;
where nmiss(col1)=0;
run;

New Contributor
Posts: 4

Re: Counting the frequency of a values in a string

Thanks for replying, but I'm a bit confused what <id> is meant to represent.  Can you explain?   The log is telling me it's causing a syntax error. 

Contributor ndp
Contributor
Posts: 61

Re: Counting the frequency of a values in a string

<id> is any variable/s you need to create counts by (grouping )

Super User
Posts: 17,865

Re: Counting the frequency of a values in a string

FYI if you post sample data you're more likely to get code that works for you.

The key here is to transpose your data from wide to long format. Then you can run a proc freq on all the results in one pass. I think proc freq even generates some standard graphs these days.

Super User
Super User
Posts: 7,407

Re: Counting the frequency of a values in a string

Reeza has the best answer here, normalise the data then aggregate it.  I would, just for fun, show an example of using arrays on the string you have:

data have; 

  length xcombined $2000;

  infile datalines dlm="|";

  input xcombined $;

datalines;

1234 567 1324 8723 56 56

673674 1235 1235 1235 2277 4564

;

run;

data want;

  set have;

  array elements{7,2} $20;

  curr_element=1;

  do while (scan(xcombined,1," ") ne "");

    elements{curr_element,1}=scan(xcombined,1," ");

    elements{curr_element,2}=strip(put(count(xcombined,scan(xcombined,1," ")),best.));

    xcombined=strip(tranwrd(xcombined,scan(xcombined,1," "),""));

    curr_element=curr_element+1;

  end;

run;

This keeps taking the first word from the string, counting all occurences, and putting the output in a 2 dimensional array, i.e. value, count.  Then that word is removed from the string and we keep going until there is nothing left.

New Contributor
Posts: 4

Re: Counting the frequency of a values in a string

First of all, thank you so much.  This is a surprisingly helpful community, especially for someone with zero experience with coding.  I posted a part of the example data I spoke about before in case I did a poor job explaining it.  This is obviously "dxcombine" instead of "xcombine" but I'm trying to accomplish the same thing.

https://lh3.googleusercontent.com/HlmCrfIMzdijFduxMWCRMUHAdsAbh2ZtI1ePQcl_sNM=w204-h413-no

I tried the example you posted, but I'm getting the following error:

581  data have;

582

583    length dxcombine $2000;

584

585    infile datalines dlm="|";

586

587    input dxcombine $;

588

589  datalines;

NOTE: The data set WORK.HAVE has 5 observations and 1 variables.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.00 seconds

595  ;

596

597  run;

598

599  data want;

600

601    set have;

602

603    array elements{7,2} $20;

604

605    curr_element=1;

606

607    do while (scan(dxcombine,1," ") ne "");

608

609      elements{curr_element,1}=scan(dxcombine,1," ");

610

611      elements{curr_element,2}=strip(put(count(dxcombine,scan(dxcombine,1," ")),best.));

612

613      xcombined=strip(tranwrd(dxcombine,scan(dxcombine,1," "),""));

614

615      curr_element=curr_element+1;

616

617    end;

618

619  run;

ERROR: Array subscript out of range at line 609 column 5.

dxcombine=1234 567 1324 8723 56 56 elements1=1234 elements2=1 elements3=1234 elements4=1

elements5=1234 elements6=1 elements7=1234 elements8=1 elements9=1234 elements10=1

elements11=1234 elements12=1 elements13=1234 elements14=1 curr_element=8

xcombined=567 1324 8723 56 56 _ERROR_=1 _N_=2

NOTE: The SAS System stopped processing this step because of errors.

NOTE: There were 2 observations read from the data set WORK.HAVE.

WARNING: The data set WORK.WANT may be incomplete.  When this step was stopped there were 1

         observations and 17 variables.

WARNING: Data set WORK.WANT was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

      real time           0.02 seconds

      cpu time            0.03 seconds

Super User
Super User
Posts: 7,407

Re: Counting the frequency of a values in a string

This is down to the value I hardcoded in this row:

603    array elements{7,2} $20;

Basically I am assuming there will be no more than 7 distinct elements in the string, this is just the test data I have at the time.  Set it to elements{20,2} and that should cover it.  I does however illustrate neatly why Reeza's answer is the better of the two.  If you normalise your data:

ROW          DX

1                 5849

1                 42822

...

2                41071

...

Then you can very easily run a freq/means procedure, or some simple code to get your result:

proc sq;

     create table WANT as

     select     ROW,

                   DX,

                   count(DX) as RESULT

     from       HAVE

     group by ROW,

                   DX;

quit;

Now to get the normalised data you can just do:

data inter (keep=row dx);

     set have;

     i=1;

     do until (scan(dxcombine,i,' ') = '');

         row=_n_;

         dx=scan(dxcombine,i,' ');

          output;

          i=i+1;

     end;

run;

This outputs one row per scanned word.

Ask a Question
Discussion stats
  • 7 replies
  • 324 views
  • 0 likes
  • 4 in conversation