Re: Using an ARRAY to count number of records with certain conditions?

JonathanWarrick · Posted 05-31-2011 05:04 PM

Hi all,

I have ten columns and thousands of rows of data that contain a number between 1 and 71 OR a letter between A and J, each representing a different characteristic of an observation. Each observation can have multiple characteristics, so I've already separated these out into "characteristic" columns to separate each number/letter (example: if an observation originally had characteristics 1, 4, 7, and J, it would have been shown as "N: 1; 4; 7; J" I have separated this out into four columns (CHAR_1, CHAR_2, CHAR_3, and CHAR_4) to separate each characteristic out, so now there are four columns, each of which would contain one of the four number/letter codes).

My goal is to use an ARRAY (or some other function if more appropriate) to search across every single observation and column to get a count of every single observation that has each number/letter and output it into a table that lists every single number and letter with a count of each (for instance, if code "11" popped up 6 times in CHAR_1, 4 times in CHAR_2, 83 times in CHAR_3, and 0 times in CHAR_4, the table would have two columns: CHAR would be 11 and COUNT would be 93. This would be performed for each code, 1 - 71 and A - J).

Bonus points if someone has any advice on how to encorporate an INDEX or SCAN function into this to eliminate the need for separating each characteristic out into its own column before performing the array.
Thanks in advance!

ballardw · Posted 05-31-2011 06:51 PM

My approach to this, assuming that the other or orignal variables aren't of much concern would be instead of parsing one column out to 4 variables as in your example but to create only one variable with the CHAR value but output it 4 times.

Something like:
data charset (keep=char);
set datasetname;
array vars var1-var10;/* your columns you want to scan*/
do i= 1 to dim (vars);
j=1;
Char = scan(var,j);
do while char ne '';
output;
j=j+1;
char=scan(var,j);
end;
end;
run;

proc freq data=charset;
table char;
run;

Reeza · Posted 05-31-2011 07:16 PM

BallardW answer is perfectly correct.

Another data struture that might be useful is
var1 var2 --- vara varb
1 0 ---1 0

Basically a column for each possible answer and a 1 or 0 if that occurs.

You can do this from ballardw dataset by adding an observation 1 and transposing the data using proc transpose.

Patrick · Posted 06-01-2011 05:41 AM

Hi Jonathan

In case ballardw's proposition doesn't suit the layout of your source data I would suggest that you post a data step creating some sample data and then give us an example how the result should look like.

Cheers
Patrick

JonathanWarrick · Posted 06-01-2011 10:32 AM

Thanks for the help so far. BallardW's solution is not exactly what I'm looking for because there are situations in which I don't want the CHAR to be counted.

Here is some sample, simplified code that can be used to show exactly what I'm looking to do:

data test;
input num CHAR_1 $2. CHAR_2 $2. CHAR_3 $2. CHAR_4 $2. CHAR_5 $2. CHAR_6 $2.;
DATALINES;
1 f 1 33 52 a b
2 Y
3 33
4 1
5 33 a b c
6 Y
7 18 19 20
8 f 33
9 a j
10 1 44
;
run;

The result would look like this:

CHAR COUNT
1 3
2 0
3 0
...
18 1
19 1
20 1
...
33 4
...
44 1
...
52 1
...
a 3
b 2
c 1
...
f 2
...
j 1

In my original post, I referred to giving bonus points to someone who could take out a few steps for me (parsing the CHAR var into separate variables). I didn't show it in my sample data set, but the way the original, un-parsed CHAR variable looks (for row 1) is: N:f; 1; 33; 52; a; b. If the CHAR variable does NOT pass the test, it spits out an "N:" first and then lists the corresponding codes it violated. If it passes all tests, like in rows 2 and 6, it will simply spit out a "Y" and nothing else. I do not care about the "Y" responses and I need to always drop the first two characters of the "N" responses (the "N:" would always be dropped).

Thinking about things a little further, I'd like to do a "Y" or "N" flag approach (or 1/0) for each possible CHAR code (1 - 71, a - j) so I can merge this back into the larger dataset and do analysis on certain combinations.

Hopefully this extra information helps! Thanks again!

Peter_C · Posted 06-01-2011 09:31 AM

using ARRAY for this is difficult because the number of counters is practically unknown. Beginning in SAS9 we have dynamic(defined during runtime) storage like array, but called hash.
With the 10 columns of the original data, index a hash table incrementing a (data item in the hash) counter for each of the thousands of rows of original data. Once filled, save the hash to a data set. That provides the counters for all combinations present in your data.
almost an array (at one time it was referred to as an associative array, but hash was deemed a suitable alternate name and sticks to the concept).
peterC

JonathanWarrick · Posted 06-01-2011 10:33 AM

Peter - can you please explain a little further or provide me with a link to some documentation where I can read about this to see if it would be appropriate?

polingjw · Posted 06-01-2011 12:49 PM

Maybe I don’t understand your problem exactly, but it seems to me that the array/hash solution is a bit more complicated that necessary. Will this solution work? [pre]
data test;
input @1 num @4 CHAR_1 $2. @7 CHAR_2 $2. @10 CHAR_3 $2. @13 CHAR_4 $2. @16 CHAR_5 $2. @19 CHAR_6 $2.;
DATALINES;
1 f 1 33 52 a b
2 Y
3 33
4 1
5 33 a b c
6 Y
7 18 19 20
8 f 33
9 a j
10 1 44
;
run;

data all_chars(keep=char);
length char $ 2;
do i=1 to 71;
char=put(i,2.);
output;
end;
do j= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
char=j;
output;
end;
run;

data test1(keep=val);
set test;
array char{*} char_:;
do i=1 to dim(char);
if char{i} ne '' then do;
val=char{i};
output;
end;
end;
run;

proc sql;
create table counts as
select char, coalesce(count, 0) as count
from all_chars left join
(select distinct val as val, count(*) as count
from test1
group by val) as counts
on trim(left(char))=trim(left(val))
order by char;
quit;

[/pre]

Peter_C · Posted 06-01-2011 06:34 PM

here is a link to the doc for the statement that writes a hash table to a data set
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002588845.htm
it comes with very useful examples

Ksharp · Posted 06-01-2011 10:21 PM

[pre]
data temp;
infile datalines length=len;
input char_var $varying200. len;
datalines4;
N:f; j; 33; 52; a; b
Y
N:g; a; 33; 22; a; b
N:f; b; 45; c; a; f
N:c; f; 33; 52; a; b
N:e; j; 33; 52; a; b
N:f; a; e; 52; a; b
;;;;
run;
data op;
set temp;
length pass char $ 8;
pass=scan(strip(char_var),1,' ;:');
i=2;
char=scan(strip(char_var),i,' ;:');
do while(not missing(char));
output;
i+1;
char=scan(strip(char_var),i,' ;:');
end;
drop i char_var;
run;
proc freq data=op noprint;
tables char / nocum nopercent out=_want(drop=percent);
run;
data all_chars(keep=char);
length char $ 8;
do i=1 to 71;
char=strip(i);
output;
end;
do j= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
char=j;
output;
end;
run;

proc sort data=all_chars;
by char;
run;
options missing=0;
data want;
merge all_chars _want;
by char;
run;

[/pre]

Ksharp Message was edited by: Ksharp

polingjw · Posted 06-02-2011 08:24 AM

If you would rather use a hash solution as Peter suggested, I came up with the following code. While it’s a little harder to digest, it’s probably more efficient than my original code. [pre]
data test;
input @1 num @4 CHAR_1 $2. @7 CHAR_2 $2. @10 CHAR_3 $2. @13 CHAR_4 $2. @16 CHAR_5 $2. @19 CHAR_6 $2.;
DATALINES;
1 f 1 33 52 a b
2 Y
3 33
4 1
5 33 a b c
6 Y
7 18 19 20
8 f 33
9 a j
10 1 44
;
run;

data counts(keep=char counts);
length char $ 2;
declare hash all_chars(suminc: "count");
all_chars.definekey('char');
all_chars.definedone();
do i=1 to 71;
char=strip(i);
all_chars.add();
end;
do char= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
all_chars.add();
end;
count=1;
do while (not eof);
set test end=eof;
array char_{*} char_:;
do i=1 to dim(char_) while (char_{i} ne '');
char=char_{i};
rc=all_chars.find();
end;
end;
do i=1 to 71;
char=strip(i);
rc=all_chars.sum(sum: counts);
output;
end;
do char= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
rc=all_chars.sum(sum: counts);
output;
end;
stop;
run;

[/pre]

Peter_C · Posted 06-03-2011 10:03 AM

jw

thank you for doing the hard work - I should pay more respect and appreciation
I had this simpler implementation in mind for counts of incidence of any characteristic;
(assumes a table with ten columns c1 - c10 )
data view/view=view ;
set orig_data( keep= c1-c10) ;
array c(10) ;
do i=1 to 10 ;
characteristic = c(i) ;
output ;
end ;
keep characteristic ;
run ;
proc summary data= view nway ;
class characteristic ;
output out= counts( drop = _type_ ) ;
run ;
********************************************************
More interesting : I originally thought of multi combinations without rationalising that these will occur in any order and A with X is enough the same as X with A that they sould be counted together.
So, assuming 10 (string-type) columns in the data named c1-c10 here is my attempt to count all the combinations that occur;
data v2/view=v2 ;
set orig_data ;
call sortc( of c1-c10 ) ;
* that puts the values of the 10 columns into some consistent order ;
keep c1-c10 ;
run ;
proc summary nway data= v2 ;
class c1-c10 ;
output out= combinations_counted( drop= _type_ );
run ;
* that provides a table of each combination (of up to 10 characteristics) that occurs in the original data, with _freq_ counting the occurrence;
Although some of the combinations wil feature just two characteristics (when c3-c10 will be empty) a full report of two-way combinations probably should count also the occurrence of those two columns within 3-way to 10-way combinations. I find that a bigger challenge to analyse;

regards
peterC

chang_y_chung_hotmail_com · Posted 06-03-2011 02:48 PM

If you end up having to use arrays a lot, then this may be a telltail of the data structured less than excellently. Here is one way that does not involve any arrays at all.

   /* test data */
   data one;
      id=1; result="N:f;1; 3; 52; a; b"; output;
      id=2; result="Y"; output;
      id=3; result="N:33"; output;
      id=4; result="N:1"; output;
      id=5; result="N:a;b;c"; output;
   run;

   /* the 81 tests */
   data tests;
      do n = 1 to 81;
         if n <= 71 then test = put(n, 2.-l);
         else test = substr("abcdefghij", n-71, 1);
         output;
      end;
      keep test;
   run;

   /* make a long-shaped data */
   data long;
      set one;
      result = compress(lowcase(result),"0123456789abcdefghij;","k");
      passedAll = ifn(missing(result), 1, 0);
      do point = 1 to 81;
         set tests point=point;
         failedThis = (indexw(trimn(result), trimn(test), ";")>0);
         passed = (passedAll or not failedThis);
         output;
      end;
      keep id test passed;
   run;

   /* usage example: count how many people passed each test */
   proc freq data=long;
      table test/list missing;
      where passed;
   run;

   /* reshape to wide -- it takes just one proc */
   proc transpose data=long out=wide(drop=_:) prefix=passed;
     by id;
     var passed;
   run;

   /* check */
   proc print data=wide;
     var passed1-passed5;
   run;
   /* on lst
   Obs    passed1    passed2    passed3    passed4    passed5
    1        0          1          0          1          1
    2        1          1          1          1          1
    3        1          1          1          1          1
    4        0          1          1          1          1
    5        1          1          1          1          1
   */

Registration is open

SAS Training: Just a Click Away