turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Data Management
- /
- Forum
- /
- Counting occurences across variables

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

11-30-2015 02:33 PM

Good evening

I have a quite large dataset where I do need some very specific counts. To illustrate what my problem is I have written this small piece of code.

**data** have;

infile datalines missover;

input IDnumber $ Week1-Week16;

datalines;

1 11 11 11 . . . . . 11 11 11 . . . . .

2 11 . . . . . . . . . . 11 . . . .

3 . . . . . 11 . . . . . 11 11 11 11 11

4 11 11 . . . . . . . . . . . 11 11 11

5 . . . . . . 11. . . . . . . . . .

6 . . . . 11 11 . . . . . . 11 13 13 13

7 11 . . . . . 11 11 . . . . . 11 11 14

8 14 . . . 11 14 . . . 11 11 . . . . .

9 11 11 14 14 14 13 11 11 11. . . . 11 11 . .

**run**;

**proc** **print** data=have;

**run**;

The thing is that I would like to count all the observations saying 11 and not the observations with another value or missing. In fact I would like to start a new count when missing values or another values occur. So for the first dataline it would be c1 3 c2 3 c3 .

Or maybe more descripive:

**data** want;

infile datalines missover;

input IDnumber $ Week1-Week16 c1 c2 c3;

datalines;

1 11 11 11 . . . . . 11 11 11 . . . . . 3 3 .

2 11 . . . . . . . . . . 11 . . . . 1 1 .

3 . . . . . 11 . . . . . 11 11 11 11 11 1 5 .

4 11 11 . . . . . . . . . . . 11 11 11 2 3 .

5 . . . . . . 11. . . . . . . . . . 1 . .

6 . . . . 11 11 . . . . . . 11 13 13 13 2 1 .

7 11 . . . . . 11 11 . . . . . 11 11 14 1 2 2

8 14 . . . 11 14 . . . 11 11 . . . . . 1 2 .

9 11 11 14 14 14 13 11 11 11. . . . 11 11 . . 2 3 2

;

**run**;

**proc** **print** data=want;

**run**;

Can anyone help me with that?

Thank you in advance!

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to AnnaKM

11-30-2015 04:45 PM

Good evening

Try this:

```
data want;
set have;
array Week[16];
array c[8];
length _s $16;
do i=1 to 16;
substr(_s,i)=ifc(Week[i]=11, 'X', ' ');
end;
do i=1 by 1;
_c=lengthn(scan(_s,i));
if ~_c then return;
c[i]=_c;
end;
drop i _s _c;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to AnnaKM

11-30-2015 10:46 PM

You can't INPUT fields that don't actually exist within the incoming data. You will have to add statements to your DATA step to compute c1, c2, and c3. So you'll be using your first INPUT statement:

input IDnumber $ week1-week16;

Here are some statements you might add at that point.

array week {16};

array c {3};

c_subscript=0;

count_11=0;

do J=1 to 16;

if week{J}=11 then count_11 + 1;

else do;

if count_11 > 0 then do;

c_subscript + 1;

if (1 <= c_subscript <= 3) then c{c_subscript} = count_11;

end;

count_11=0;

end;

end;

c_subscript + 1;

if count_11 > 0 and (1 <= c_subscript <= 3) then c{c_subscript} = count_11;

You might want to drop some of these counters, but this gives you the approach you will need.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Astounding

12-01-2015 04:02 AM

@Astounding: I think the purpose of her "input ... c1 c2 c3" was just to create a sample dataset to exemplify the specifications. I found this helpful for checking my solution.

@AnnaKM: I included 8 counters c1 - c8 in my solution, because without further restrictions there could be up to 8 "blocks" of consecutive 11s in Week1 - Week16 (example: 11,22,11,22,11,22,11,22,11,22,11,22,11,22,11,11). If you know for sure in advance that there can't be more than 3 (even taking the possibility of data errors into account), please feel free to change "array c[8];" to "array c[3];".

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to AnnaKM

12-06-2015 02:06 PM

It's been a week and I just realised i forgot to write back to you.

I would like to say thank you to both of you with coming up with a solution for my problem. Both of the solutions worked really well when I tried it on my test dataset but unfortunately when I tried it on my real dataset FreelanceReinhards solution did not work. SAS simply stopped processing after 28 observations. But luckily Astoundings solution worked on the whole dataset so I kept working with that solution.

Again, I want to thank both of you.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

Posted in reply to AnnaKM

12-07-2015 12:18 PM

Dear @AnnaKM,

Many thanks for your valuable feedback. I’m glad to read that you were able to accomplish your specific counting task, but sad to learn that my code did not work for your real data. I am *very* interested in the root cause of this issue. Could you please provide a little more information?

- What kind of error message did you get, if any?
- Did you use the code in a different way than

data want;

set have;

... - Was there anything special about the 28th observation that you mentioned?

Meanwhile I’ve successfully tested my algorithm on several simulated datasets comprising a total of more than **220 million observations** (incl. *all* *possible* combinations of 16 values in {., 10, 11}). In each case I compared the results to those obtained by @Astounding’s algorithm (generalized from 3 to 8 counters, as required for certain input data).

The only difference I observed was that **@Astounding’s algorithm is about 2.3 times faster** than mine (apparently because my algorithm uses character functions). **The results, however, were the same in all cases.** Full details can be found below and in the attachment.

Best regards,

Reinhard

Here is the code I used for the tests:

```
/* Macros to create test data */
%macro create(out=have, values=, seed=);
%local i;
/* All n**16 possible combinations of Week1, ..., Week16 values in the n-element set {&VALUES}
in lexicographic order */
data &out;
call streaminit(&seed);
length IDnumber $10;
%do i=1 %to 16;
do Week&i = &values;
%end;
seqno+1;
r=rand('UNIFORM');
IDnumber=put(seqno,10.);
output;
%do i=1 %to 16;
end;
%end;
drop seqno;
run;
%mend create;
%macro sort(data=have, out=have, order=);
%local i;
/* reverse lexicographic order */
%if %upcase(&order)=REVLEX %then %do;
proc sort data=&data out=&out;
by
%do i=1 %to 16;
descending Week&i
%end;;
run;
%end;
/* random order */
%else %if %upcase(&order)=RANDOM %then %do;
proc sort data=&data out=&out;
by r;
run;
%end;
%mend sort;
/* Macro to apply algorithms and compare results */
%macro comp(data=have);
/* Algorithm proposed by FreelanceReinhard, unchanged */
data wantR;
set &data;
array Week[16];
array c[8];
length _s $16;
do i=1 to 16;
substr(_s,i)=ifc(Week[i]=11, 'X', ' ');
end;
do i=1 by 1;
_c=lengthn(scan(_s,i));
if ~_c then return;
c[i]=_c;
end;
drop i _s _c;
run;
/* Reference algorithm proposed by Astounding, with two minor modifications:
1) Increased dimension of array C from 3 to 8 in order to allow for all possible
input data vectors, adapted inequalities for C_SUBSCRIPT accordingly (3 --> 8).
2) Inserted DROP statement for temporary variables.
*/
data wantA;
set &data;
array week {16};
array c {8};
c_subscript=0;
count_11=0;
do J=1 to 16;
if week{J}=11 then count_11 + 1;
else do;
if count_11 > 0 then do;
c_subscript + 1;
if (1 <= c_subscript <= 8) then c{c_subscript} = count_11;
end;
count_11=0;
end;
end;
c_subscript + 1;
if count_11 > 0 and (1 <= c_subscript <= 8) then c{c_subscript} = count_11;
drop c_subscript count_11 j;
run;
/* Compare result datasets */
proc compare data=wantR c=wantA;
run;
%mend comp;
options nodate nonumber;
title 'All 2**16=65536 possible combinations of Week1, ..., Week16 values in {., 11}';
title2 'in lexicographic order';
%create(values=%str(., 11), seed=271828)
%comp;
title 'All 2**16 combinations of {., 11} in reverse lexicographic order';
%sort(order=revlex)
%comp;
title 'All 2**16 combinations of {., 11} in random order';
%sort(order=random)
%comp;
title 'All 3**16=43,046,721 possible combinations of Week1, ..., Week16 values in {., 10, 11}';
title2 'in lexicographic order';
%create(values=%str(., 10, 11), seed=314159) /* CAUTION: Dataset HAVE is 6.10 GB large. */
%comp; /* Datasets wantR and wantA are 8.67 GB each! */
title 'All 3**16 combinations of {., 10, 11} in reverse lexicographic order';
%sort(order=revlex)
%comp;
title 'All 3**16 combinations of {., 10, 11} in random order';
%sort(order=random)
%comp;
title '100,000,000 random vectors (Week1, ..., Week16) with components in {., 11, 12}';
data have;
call streaminit(577219);
length IDnumber $10;
array v[3] _temporary_ (. 11 12);
array Week[16];
do i=1 to 1e8;
IDnumber=put(i,10.);
do j=1 to 16;
Week[j]=v[rand('TABLE',1/3,1/3,1/3)];
end;
output;
end;
drop i j;
run; /* Dataset HAVE is 13.4 GB large! */
/* Check for consecutive duplicate records */
data dups;
set have;
by week1-week16 notsorted;
if ~(first.week16 & last.week16);
run; /* 8 obs. (4 pairs) */
%comp; /* Datasets wantR and wantA are 19.4 GB each! */
title;
```

The attached .zip file contains the SAS log (Comparisons.log, 13 KB) and the output (Comparisons.lst, 7 KB) created by the above program.