BookmarkSubscribeRSS Feed
JonathanWarrick
Calcite | Level 5
Hi all,

I have ten columns and thousands of rows of data that contain a number between 1 and 71 OR a letter between A and J, each representing a different characteristic of an observation. Each observation can have multiple characteristics, so I've already separated these out into "characteristic" columns to separate each number/letter (example: if an observation originally had characteristics 1, 4, 7, and J, it would have been shown as "N: 1; 4; 7; J" I have separated this out into four columns (CHAR_1, CHAR_2, CHAR_3, and CHAR_4) to separate each characteristic out, so now there are four columns, each of which would contain one of the four number/letter codes).

My goal is to use an ARRAY (or some other function if more appropriate) to search across every single observation and column to get a count of every single observation that has each number/letter and output it into a table that lists every single number and letter with a count of each (for instance, if code "11" popped up 6 times in CHAR_1, 4 times in CHAR_2, 83 times in CHAR_3, and 0 times in CHAR_4, the table would have two columns: CHAR would be 11 and COUNT would be 93. This would be performed for each code, 1 - 71 and A - J).

Bonus points if someone has any advice on how to encorporate an INDEX or SCAN function into this to eliminate the need for separating each characteristic out into its own column before performing the array.
Thanks in advance!
12 REPLIES 12
ballardw
Super User
My approach to this, assuming that the other or orignal variables aren't of much concern would be instead of parsing one column out to 4 variables as in your example but to create only one variable with the CHAR value but output it 4 times.

Something like:
data charset (keep=char);
set datasetname;
array vars var1-var10;/* your columns you want to scan*/
do i= 1 to dim (vars);
j=1;
Char = scan(var,j);
do while char ne '';
output;
j=j+1;
char=scan(var,j);
end;
end;
run;

proc freq data=charset;
table char;
run;
Reeza
Super User
BallardW answer is perfectly correct.

Another data struture that might be useful is
var1 var2 --- vara varb
1 0 ---1 0

Basically a column for each possible answer and a 1 or 0 if that occurs.

You can do this from ballardw dataset by adding an observation 1 and transposing the data using proc transpose.
Patrick
Opal | Level 21
Hi Jonathan

In case ballardw's proposition doesn't suit the layout of your source data I would suggest that you post a data step creating some sample data and then give us an example how the result should look like.

Cheers
Patrick
JonathanWarrick
Calcite | Level 5
Thanks for the help so far. BallardW's solution is not exactly what I'm looking for because there are situations in which I don't want the CHAR to be counted.

Here is some sample, simplified code that can be used to show exactly what I'm looking to do:

data test;
input num CHAR_1 $2. CHAR_2 $2. CHAR_3 $2. CHAR_4 $2. CHAR_5 $2. CHAR_6 $2.;
DATALINES;
1 f 1 33 52 a b
2 Y
3 33
4 1
5 33 a b c
6 Y
7 18 19 20
8 f 33
9 a j
10 1 44
;
run;

The result would look like this:

CHAR COUNT
1 3
2 0
3 0
...
18 1
19 1
20 1
...
33 4
...
44 1
...
52 1
...
a 3
b 2
c 1
...
f 2
...
j 1


In my original post, I referred to giving bonus points to someone who could take out a few steps for me (parsing the CHAR var into separate variables). I didn't show it in my sample data set, but the way the original, un-parsed CHAR variable looks (for row 1) is: N:f; 1; 33; 52; a; b. If the CHAR variable does NOT pass the test, it spits out an "N:" first and then lists the corresponding codes it violated. If it passes all tests, like in rows 2 and 6, it will simply spit out a "Y" and nothing else. I do not care about the "Y" responses and I need to always drop the first two characters of the "N" responses (the "N:" would always be dropped).

Thinking about things a little further, I'd like to do a "Y" or "N" flag approach (or 1/0) for each possible CHAR code (1 - 71, a - j) so I can merge this back into the larger dataset and do analysis on certain combinations.

Hopefully this extra information helps! Thanks again!
Peter_C
Rhodochrosite | Level 12
using ARRAY for this is difficult because the number of counters is practically unknown. Beginning in SAS9 we have dynamic(defined during runtime) storage like array, but called hash.
With the 10 columns of the original data, index a hash table incrementing a (data item in the hash) counter for each of the thousands of rows of original data. Once filled, save the hash to a data set. That provides the counters for all combinations present in your data.
almost an array (at one time it was referred to as an associative array, but hash was deemed a suitable alternate name and sticks to the concept).
peterC
JonathanWarrick
Calcite | Level 5
Peter - can you please explain a little further or provide me with a link to some documentation where I can read about this to see if it would be appropriate?
polingjw
Quartz | Level 8
Maybe I don’t understand your problem exactly, but it seems to me that the array/hash solution is a bit more complicated that necessary. Will this solution work? [pre]
data test;
input @1 num @4 CHAR_1 $2. @7 CHAR_2 $2. @10 CHAR_3 $2. @13 CHAR_4 $2. @16 CHAR_5 $2. @19 CHAR_6 $2.;
DATALINES;
1 f 1 33 52 a b
2 Y
3 33
4 1
5 33 a b c
6 Y
7 18 19 20
8 f 33
9 a j
10 1 44
;
run;

data all_chars(keep=char);
length char $ 2;
do i=1 to 71;
char=put(i,2.);
output;
end;
do j= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
char=j;
output;
end;
run;

data test1(keep=val);
set test;
array char{*} char_:;
do i=1 to dim(char);
if char{i} ne '' then do;
val=char{i};
output;
end;
end;
run;

proc sql;
create table counts as
select char, coalesce(count, 0) as count
from all_chars left join
(select distinct val as val, count(*) as count
from test1
group by val) as counts
on trim(left(char))=trim(left(val))
order by char;
quit;

[/pre]
Peter_C
Rhodochrosite | Level 12
here is a link to the doc for the statement that writes a hash table to a data set
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002588845.htm
it comes with very useful examples
Ksharp
Super User
[pre]
data temp;
infile datalines length=len;
input char_var $varying200. len;
datalines4;
N:f; j; 33; 52; a; b
Y
N:g; a; 33; 22; a; b
N:f; b; 45; c; a; f
N:c; f; 33; 52; a; b
N:e; j; 33; 52; a; b
N:f; a; e; 52; a; b
;;;;
run;
data op;
set temp;
length pass char $ 8;
pass=scan(strip(char_var),1,' ;:');
i=2;
char=scan(strip(char_var),i,' ;:');
do while(not missing(char));
output;
i+1;
char=scan(strip(char_var),i,' ;:');
end;
drop i char_var;
run;
proc freq data=op noprint;
tables char / nocum nopercent out=_want(drop=percent);
run;
data all_chars(keep=char);
length char $ 8;
do i=1 to 71;
char=strip(i);
output;
end;
do j= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
char=j;
output;
end;
run;

proc sort data=all_chars;
by char;
run;
options missing=0;
data want;
merge all_chars _want;
by char;
run;




[/pre]


Ksharp Message was edited by: Ksharp
polingjw
Quartz | Level 8
If you would rather use a hash solution as Peter suggested, I came up with the following code. While it’s a little harder to digest, it’s probably more efficient than my original code. [pre]
data test;
input @1 num @4 CHAR_1 $2. @7 CHAR_2 $2. @10 CHAR_3 $2. @13 CHAR_4 $2. @16 CHAR_5 $2. @19 CHAR_6 $2.;
DATALINES;
1 f 1 33 52 a b
2 Y
3 33
4 1
5 33 a b c
6 Y
7 18 19 20
8 f 33
9 a j
10 1 44
;
run;

data counts(keep=char counts);
length char $ 2;
declare hash all_chars(suminc: "count");
all_chars.definekey('char');
all_chars.definedone();
do i=1 to 71;
char=strip(i);
all_chars.add();
end;
do char= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
all_chars.add();
end;
count=1;
do while (not eof);
set test end=eof;
array char_{*} char_:;
do i=1 to dim(char_) while (char_{i} ne '');
char=char_{i};
rc=all_chars.find();
end;
end;
do i=1 to 71;
char=strip(i);
rc=all_chars.sum(sum: counts);
output;
end;
do char= 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j';
rc=all_chars.sum(sum: counts);
output;
end;
stop;
run;

[/pre]
Peter_C
Rhodochrosite | Level 12
jw

thank you for doing the hard work - I should pay more respect and appreciation
I had this simpler implementation in mind for counts of incidence of any characteristic;
(assumes a table with ten columns c1 - c10 )
data view/view=view ;
set orig_data( keep= c1-c10) ;
array c(10) ;
do i=1 to 10 ;
characteristic = c(i) ;
output ;
end ;
keep characteristic ;
run ;
proc summary data= view nway ;
class characteristic ;
output out= counts( drop = _type_ ) ;
run ;
********************************************************
More interesting : I originally thought of multi combinations without rationalising that these will occur in any order and A with X is enough the same as X with A that they sould be counted together.
So, assuming 10 (string-type) columns in the data named c1-c10 here is my attempt to count all the combinations that occur;
data v2/view=v2 ;
set orig_data ;
call sortc( of c1-c10 ) ;
* that puts the values of the 10 columns into some consistent order ;
keep c1-c10 ;
run ;
proc summary nway data= v2 ;
class c1-c10 ;
output out= combinations_counted( drop= _type_ );
run ;
* that provides a table of each combination (of up to 10 characteristics) that occurs in the original data, with _freq_ counting the occurrence;
Although some of the combinations wil feature just two characteristics (when c3-c10 will be empty) a full report of two-way combinations probably should count also the occurrence of those two columns within 3-way to 10-way combinations. I find that a bigger challenge to analyse;

regards
peterC
chang_y_chung_hotmail_com
Obsidian | Level 7

If you end up having to use arrays a lot, then this may be a telltail of the data structured less than excellently. Here is one way that does not involve any arrays at all.

   /* test data */
   data one;
      id=1; result="N:f;1; 3; 52; a; b"; output;
      id=2; result="Y"; output;
      id=3; result="N:33"; output;
      id=4; result="N:1"; output;
      id=5; result="N:a;b;c"; output;
   run;

   /* the 81 tests */
   data tests;
      do n = 1 to 81;
         if n <= 71 then test = put(n, 2.-l);
         else test = substr("abcdefghij", n-71, 1);
         output;
      end;
      keep test;
   run;

   /* make a long-shaped data */
   data long;
      set one;
      result = compress(lowcase(result),"0123456789abcdefghij;","k");
      passedAll = ifn(missing(result), 1, 0);
      do point = 1 to 81;
         set tests point=point;
         failedThis = (indexw(trimn(result), trimn(test), ";")>0);
         passed = (passedAll or not failedThis);
         output;
      end;
      keep id test passed;
   run;

   /* usage example: count how many people passed each test */
   proc freq data=long;
      table test/list missing;
      where passed;
   run;

   /* reshape to wide -- it takes just one proc */
   proc transpose data=long out=wide(drop=_:) prefix=passed;
     by id;
     var passed;
   run;

   /* check */
   proc print data=wide;
     var passed1-passed5;
   run;
   /* on lst
   Obs    passed1    passed2    passed3    passed4    passed5
    1        0          1          0          1          1
    2        1          1          1          1          1
    3        1          1          1          1          1
    4        0          1          1          1          1
    5        1          1          1          1          1
   */

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 5615 views
  • 0 likes
  • 8 in conversation