Re: Counting occurences across variables

AnnaKM · Posted 11-30-2015 02:33 PM

Good evening

I have a quite large dataset where I do need some very specific counts. To illustrate what my problem is I have written this small piece of code.

data have;

infile datalines missover;

input IDnumber $ Week1-Week16;

datalines;

1 11 11 11 . . . . . 11 11 11 . . . . .

2 11 . . . . . . . . . . 11 . . . .

3 . . . . . 11 . . . . . 11 11 11 11 11

4 11 11 . . . . . . . . . . . 11 11 11

5 . . . . . . 11. . . . . . . . . .

6 . . . . 11 11 . . . . . . 11 13 13 13

7 11 . . . . . 11 11 . . . . . 11 11 14

8 14 . . . 11 14 . . . 11 11 . . . . .

9 11 11 14 14 14 13 11 11 11. . . . 11 11 . .

run;

proc print data=have;

run;

The thing is that I would like to count all the observations saying 11 and not the observations with another value or missing. In fact I would like to start a new count when missing values or another values occur. So for the first dataline it would be c1 3 c2 3 c3 .

Or maybe more descripive:

data want;

infile datalines missover;

input IDnumber $ Week1-Week16 c1 c2 c3;

datalines;

1 11 11 11 . . . . . 11 11 11 . . . . . 3 3 .

2 11 . . . . . . . . . . 11 . . . . 1 1 .

3 . . . . . 11 . . . . . 11 11 11 11 11 1 5 .

4 11 11 . . . . . . . . . . . 11 11 11 2 3 .

5 . . . . . . 11. . . . . . . . . . 1 . .

6 . . . . 11 11 . . . . . . 11 13 13 13 2 1 .

7 11 . . . . . 11 11 . . . . . 11 11 14 1 2 2

8 14 . . . 11 14 . . . 11 11 . . . . . 1 2 .

9 11 11 14 14 14 13 11 11 11. . . . 11 11 . . 2 3 2

;

run;

proc print data=want;

run;

Can anyone help me with that?

Thank you in advance!

FreelanceReinh · Posted 11-30-2015 04:45 PM

Good evening

Try this:

data want;
set have;
array Week[16];
array c[8];
length _s $16;
do i=1 to 16;
  substr(_s,i)=ifc(Week[i]=11, 'X', ' ');
end;
do i=1 by 1;
  _c=lengthn(scan(_s,i));
  if ~_c then return;
  c[i]=_c;
end;
drop i _s _c;
run;

Astounding · Posted 11-30-2015 10:46 PM

You can't INPUT fields that don't actually exist within the incoming data. You will have to add statements to your DATA step to compute c1, c2, and c3. So you'll be using your first INPUT statement:

input IDnumber $ week1-week16;

Here are some statements you might add at that point.

array week {16};

array c {3};

c_subscript=0;

count_11=0;

do J=1 to 16;

if week{J}=11 then count_11 + 1;

else do;

if count_11 > 0 then do;

c_subscript + 1;

if (1 <= c_subscript <= 3) then c{c_subscript} = count_11;

end;

count_11=0;

end;

c_subscript + 1;

if count_11 > 0 and (1 <= c_subscript <= 3) then c{c_subscript} = count_11;

You might want to drop some of these counters, but this gives you the approach you will need.

FreelanceReinh · Posted 12-01-2015 04:02 AM

@Astounding: I think the purpose of her "input ... c1 c2 c3" was just to create a sample dataset to exemplify the specifications. I found this helpful for checking my solution.

@AnnaKM: I included 8 counters c1 - c8 in my solution, because without further restrictions there could be up to 8 "blocks" of consecutive 11s in Week1 - Week16 (example: 11,22,11,22,11,22,11,22,11,22,11,22,11,22,11,11). If you know for sure in advance that there can't be more than 3 (even taking the possibility of data errors into account), please feel free to change "array c[8];" to "array c[3];".

AnnaKM · Posted 12-06-2015 02:06 PM

It's been a week and I just realised i forgot to write back to you.

I would like to say thank you to both of you with coming up with a solution for my problem. Both of the solutions worked really well when I tried it on my test dataset but unfortunately when I tried it on my real dataset FreelanceReinhards solution did not work. SAS simply stopped processing after 28 observations. But luckily Astoundings solution worked on the whole dataset so I kept working with that solution.

Again, I want to thank both of you.

FreelanceReinh · Posted 12-07-2015 12:18 PM

Dear @AnnaKM,

Many thanks for your valuable feedback. I’m glad to read that you were able to accomplish your specific counting task, but sad to learn that my code did not work for your real data. I am very interested in the root cause of this issue. Could you please provide a little more information?

What kind of error message did you get, if any?
Did you use the code in a different way than
data want;
set have;
...
Was there anything special about the 28th observation that you mentioned?

Meanwhile I’ve successfully tested my algorithm on several simulated datasets comprising a total of more than 220 million observations (incl. all possible combinations of 16 values in {., 10, 11}). In each case I compared the results to those obtained by @Astounding’s algorithm (generalized from 3 to 8 counters, as required for certain input data).

The only difference I observed was that @Astounding’s algorithm is about 2.3 times faster than mine (apparently because my algorithm uses character functions). The results, however, were the same in all cases. Full details can be found below and in the attachment.

Best regards,

Reinhard

Here is the code I used for the tests:

/* Macros to create test data */

%macro create(out=have, values=, seed=);
  %local i;

  /* All n**16 possible combinations of Week1, ..., Week16 values in the n-element set {&VALUES}
     in lexicographic order */

  data &out;
  call streaminit(&seed);
  length IDnumber $10;
  %do i=1 %to 16;
    do Week&i = &values;
  %end;
  seqno+1;
  r=rand('UNIFORM');
  IDnumber=put(seqno,10.);
  output;
  %do i=1 %to 16;
    end;
  %end;
  drop seqno;
  run;
%mend create;

%macro sort(data=have, out=have, order=);
  %local i;

  /* reverse lexicographic order */
  %if %upcase(&order)=REVLEX %then %do;
    proc sort data=&data out=&out;
    by 
    %do i=1 %to 16;
      descending Week&i
    %end;;
    run;
  %end;

  /* random order */
  %else %if %upcase(&order)=RANDOM %then %do;
    proc sort data=&data out=&out;
    by r;
    run;
  %end;
%mend sort;


/* Macro to apply algorithms and compare results */

%macro comp(data=have);

  /* Algorithm proposed by FreelanceReinhard, unchanged */

  data wantR;
  set &data;
  array Week[16];
  array c[8];
  length _s $16;
  do i=1 to 16;
    substr(_s,i)=ifc(Week[i]=11, 'X', ' ');
  end;
  do i=1 by 1;
    _c=lengthn(scan(_s,i));
    if ~_c then return;
    c[i]=_c;
  end;
  drop i _s _c;
  run;

  /* Reference algorithm proposed by Astounding, with two minor modifications:
     1) Increased dimension of array C from 3 to 8 in order to allow for all possible
        input data vectors, adapted inequalities for C_SUBSCRIPT accordingly (3 --> 8).
     2) Inserted DROP statement for temporary variables.
  */

  data wantA;
  set &data;
  array week {16};
  array c {8};
  c_subscript=0;
  count_11=0;
  do J=1 to 16;
     if week{J}=11 then count_11 + 1;
     else do;
        if count_11 > 0 then do;
          c_subscript + 1;
          if (1 <= c_subscript <= 8) then c{c_subscript} = count_11;
        end;
        count_11=0;
     end;
  end;
  c_subscript + 1;
  if count_11 > 0 and (1 <= c_subscript <= 8) then c{c_subscript} = count_11;
  drop c_subscript count_11 j;
  run;

  /* Compare result datasets */

  proc compare data=wantR c=wantA;
  run;
%mend comp;

options nodate nonumber;

title 'All 2**16=65536 possible combinations of Week1, ..., Week16 values in {., 11}';
title2 'in lexicographic order';
%create(values=%str(., 11), seed=271828)
%comp;

title 'All 2**16 combinations of {., 11} in reverse lexicographic order';
%sort(order=revlex)
%comp;

title 'All 2**16 combinations of {., 11} in random order';
%sort(order=random)
%comp;

title 'All 3**16=43,046,721 possible combinations of Week1, ..., Week16 values in {., 10, 11}';
title2 'in lexicographic order';
%create(values=%str(., 10, 11), seed=314159) /* CAUTION: Dataset HAVE is 6.10 GB large. */
%comp; /* Datasets wantR and wantA are 8.67 GB each! */

title 'All 3**16 combinations of {., 10, 11} in reverse lexicographic order';
%sort(order=revlex)
%comp;

title 'All 3**16 combinations of {., 10, 11} in random order';
%sort(order=random) 
%comp;

title '100,000,000 random vectors (Week1, ..., Week16) with components in {., 11, 12}';
data have;
call streaminit(577219);
length IDnumber $10;
array v[3] _temporary_ (. 11 12);
array Week[16];
do i=1 to 1e8;
  IDnumber=put(i,10.);
  do j=1 to 16;
    Week[j]=v[rand('TABLE',1/3,1/3,1/3)];
  end;
  output;
end;
drop i j;
run; /* Dataset HAVE is 13.4 GB large! */

/* Check for consecutive duplicate records */
data dups;
set have;
by week1-week16 notsorted;
if ~(first.week16 & last.week16);
run; /* 8 obs. (4 pairs) */

%comp; /* Datasets wantR and wantA are 19.4 GB each! */

title;

The attached .zip file contains the SAS log (Comparisons.log, 13 KB) and the output (Comparisons.lst, 7 KB) created by the above program.

Counting occurences across variables