Solved: Re: select on unique character codes from a column and generate a repo...

WendyT · Posted 09-07-2012 12:24 PM

SAS folks-

I need your help again - I'm stuck. I'm working on a report where I need to generate counts and statistics based on the value of codes within a dataset.

However, the codes may be combined in the data, so I need to find a way to first generate, then report on the unique codes. All codes are single characters, but to make it more fun, there is no order to the codes, they may repeat in an observation, and there are special characters (< and ? that I know of).

Here are some examples of what might occur looking at 12 observations with possible codes of null, I, J, Q > (may be up to 20 different codes depending on the dataset involved, last two years for one project is 80K obs, so it will get big):

obs code

1 I

2 J

3 QJ

4

5

6 JQ

7 IJ>Q

8

9 >

10

11 JIJ

12 QJI

a basic start to my report might be something like this:

code count nobs %

J 6 12 50%

Q 4 12 33%

> 2 12 17%

I 4 12 33%

no code 4 12 33%

I would like to use this solution for different datasets without knowing anything about the codes in advance.

Also, any tips on how to generate the report would also be very welcome - I'm playing with TABULATE.

Thanks for your help!

WendyT

LarryWorley · Posted 09-07-2012 03:45 PM

Wendy,

Here is another solution which uses some sql and a variation in the initial data step.

data x;

input code : $10. @@;
drop i_char code;

obs = _n_ ;
do i_char = 1 to length(code) ;
single_char = substr(code,i_char,1) ;
output ;
end ;
datalines;

I J QJ . . JQ IJ>Q . > . JIJ QJI

;

proc sql ;
create table x_short as
select distinct obs
                 ,single_char
    from x
    ;
select max(obs) into :num_obs
    from x
    ;
select single_char as code
        ,count(*)    as count
        ,&num_obs    as nobs
        ,(calculated count
         /&num_obs)     as percent
            format percent5.0
    from x_short
    group by single_char
    ;
quit ;

                            code      count      nobs percent

                                               4        12     33%
                            >                 2        12     17%
                            I                  4        12     33%
                            J                 6        12     50%
                            Q                4        12     33%

View solution in original post

Cynthia_sas · Posted 09-07-2012 12:45 PM

Hi:

Just curious, how are you counting observation 11 with a value of JIJ, where the J is repeated? It looks to me like you are only counting the J one time, even though it appears twice. You show a count of 6 for J, but there are actually 7 occurences of J in the data.

cynthia

WendyT · Posted 09-07-2012 12:51 PM

HI Cynthia-

Sorry to be unclear. Repeats of codes in a single observation only count as one. There really shouldn't be repeats, but it does happen sometimes.

Thanks!

Wendy

MikeZdeb · Posted 09-07-2012 02:17 PM

hi ... this gives you the table you posted (I'm sure that there will be lots of ideas on this one) ... a couple things make this interesting

given this "Repeats of codes in a single observation only count as one." you have to have some way of de-duplicating characters within a code

also, given your denominator is the original number of observations rather than the number of codes, not sure how you could get the correct percentages in either TABULATE or FREQ once you have a data set of single character codes to count

*your data;

data x;

input code : $10. @@;

datalines;

I J QJ . . JQ IJ>Q . > . JIJ QJI

;

data counts (keep=code count nobs pct);

array ch(32:126) _temporary_ (95*.);

set x end=last nobs=obs;

if length(code) eq 1 then ch(rank(code))+1;

else

do j=1 to length(code);

ch(rank(char(code,j))) + (^find(code,trim(char(code,j)),j+1));

end;

if last then do j=32 to 126;

nobs = obs;

if ^missing(ch(j)) then do;

count = ch(j);

pct = ch(j) / obs;

code = byte(j);

output;

end;

format pct percent.;

run;

code nobs count pct

12 4 33%

> 12 2 17%

I 12 4 33%

J 12 6 50%

Q 12 4 33%

Message was edited by: Mike Zdeb ... 10 minutes more thought, less SAS code

WendyT · Posted 09-07-2012 03:24 PM

MikeZdeb-

Thanks so much for your solution! I was able to run it on my whole dataset, but it's going to take me quite awhile to figure out how it works.

I've been noodling around with TABULATE, as this report will have to be sliced into various categories, and it looks like I'm going to have to do this in pieces to get the percentages the way I want them.

Thanks again!

WendyT

MikeZdeb · Posted 09-07-2012 05:41 PM

Hi ... FYI ... every character has a number associated with it ... http://www.albany.edu/~msz03/epi514/notes/appendix_B.pdf.

You can find that number with the RANK function. The numbers for printable characters range from 32 (space) to 126 (tilde, ~). So, the data step sets up an array (CH) that ranges from CH(32) to CH(126) and that array is used to keep track of how many times each character is encountered. Since you wanted characters de-dupped within codes, there's a loop that checks for duplicates (without that consideration, the code would be simpler).

When the counting is done (last observation read), another loop computes the percentages and writes observations to a data set. There's another function that returns the character associated with each number, the BYTE function, and it is used in the loop to add the character (CODE) to the data set.

WendyT · Posted 09-10-2012 10:15 AM

Everyone-

Thanks so much for taking the time to help me out - I have some new avenues to explore. It's wonderful to see different ways to solve a problem.

I have a meeting with our computer folks to discuss (and hopefully schedule) our upgrade to 9.3 later today.

With grateful thanks...

WendyT

joehinson · Posted 09-07-2012 03:18 PM

Hello Wendy, here is a hash solution.

First, as observations are read in, codes are stripped from their texts. A smaller hash object ("rep") then checks for repeats. If a code is repeated in an observation then count is suppressed. Another hash object ("cod") then accumulates unique codes and stores counts per codes: (I haven't dealt with the final output table that you want because that should be easy).

The nice thing about this hash approach is that you need not know anything in advance about the codes, the number and sizes of the texts, the repeats, etc:

data have;

infile datalines missover;

input seq codetext $;

datalines;

1 I

2 J

3 QJ

4

5

6 JQ

7 IJ>Q

8

9 >

10

11 JIJ

12 QJI

;

run;

data _null_;

length code $1 codecount coderepeat codelen 8;

if(1=2) then set have;

declare hash cod(ordered:"a");

cod.defineKey("code");

cod.defineData("code","codecount");

cod.defineDone();

declare hash rep(ordered:"a");

rep.defineKey("code");

rep.defineDone();

code=".";

codecount=0;

k=0;

coderepeat=0;

do until (done);

set have end=done;

rx=rep.clear();

codelen=length(codetext);

k=k+1;

codecount=1;

do x=1 to codelen;

code=substr(codetext,x,1);

rp=rep.add();

coderepeat=countc(codetext,code,"i");

rc=cod.find();

codecount=codecount+(rc=0)*(rp=0);

rca=cod.replace();

codecount=0;

end;

cod.output(dataset:"codetable");

stop;

run;

OUTPUT:

	code	codecount
1		4
2	>	1
3	I	4
4	J	6
5	Q	4

joehinson · Posted 09-07-2012 03:31 PM

Hmmmm, why is ">" giving a count of 1 instead of 2? Will fix and get back!

LarryWorley · Posted 09-07-2012 03:45 PM

Wendy,

Here is another solution which uses some sql and a variation in the initial data step.

data x;

input code : $10. @@;
drop i_char code;

obs = _n_ ;
do i_char = 1 to length(code) ;
single_char = substr(code,i_char,1) ;
output ;
end ;
datalines;

I J QJ . . JQ IJ>Q . > . JIJ QJI

;

proc sql ;
create table x_short as
select distinct obs
                 ,single_char
    from x
    ;
select max(obs) into :num_obs
    from x
    ;
select single_char as code
        ,count(*)    as count
        ,&num_obs    as nobs
        ,(calculated count
         /&num_obs)     as percent
            format percent5.0
    from x_short
    group by single_char
    ;
quit ;

                            code      count      nobs percent

                                               4        12     33%
                            >                 2        12     17%
                            I                  4        12     33%
                            J                 6        12     50%
                            Q                4        12     33%

WendyT · Posted 09-07-2012 04:27 PM

LarryWorley-

Thanks so much - this one I understand! (so you get the "Correct Answer" award) I didn't know that one SQL step could contain multiple selects and table creations (I saved the output as a table when I applied it to my big dataset).

I will be able to apply this in so many places.

WendyT

Message was edited by: Wendy Tweedale

joehinson · Posted 09-07-2012 03:50 PM

Here is the fixed hash solution (SAS 9.2 or 9.3).

As I said, with the hash approach, one does not need to know in advance the sizes of arrays, etc.:

data have;

infile datalines missover;

input seq codetext $;

datalines;

1 I

2 J

3 QJ

4

5

6 JQ

7 IJ>Q

8

9 >

10

11 JIJ

12 QJI

;

run;

data _null_;

length code $1 codecount coderepeat codelen 8;

if(1=2) then set have;

declare hash cod(ordered:"a");

cod.defineKey("code");

cod.defineData("code","codecount");

cod.defineDone();

declare hash rep(ordered:"a");

rep.defineKey("code");

rep.defineDone();

code=".";

codecount=1;

do until (done);

set have end=done;

rx=rep.clear();

codelen=length(codetext);

codecount=1;

do x=1 to codelen;

code=substr(codetext,x,1);

rp=rep.add();

rc=cod.find();

codecount=codecount+((rc=0)*(rp=0));

rca=cod.replace();

codecount=1;

end;

cod.output(dataset:"codetable");

stop;

run;

OUTPUT:

	code	codecount
1		4
2	>	2
3	I	4
4	J	6
5	Q	4

WendyT · Posted 09-07-2012 04:05 PM

joehinson-

Thanks so much for the hash solution- I don't understand how it works, but it runs beautifully.

I have so much to learn...

WendyT

joehinson · Posted 09-07-2012 05:03 PM

This one is just for future considerations. Proc Tabulate is not going to go away anytime soon, but new tools are in the horizon.

SAS 9.3 comes with an experimental tool: ODS Report Writing Interface, which also employs objects, to create pretty output tables.

ODS objects play very well with hash objects, as shown below: (The output is PDF, and the location must be specified in the ODS statement):

data _null_;

length code $1 codecount datasize percents codelen 8;

if(1=2) then set have nobs=nsize;

declare hash cod(ordered:"a");

cod.defineKey("code");

cod.defineData("code","codecount", "datasize", "percents");

cod.defineDone();

declare hiter hicod("cod");

declare hash rep(ordered:"a");

rep.defineKey("code");

rep.defineDone();

code=".";

codecount=1;

datasize=0;

percents=0;

do until (done);

set have end=done;

rx=rep.clear();

codelen=length(codetext);

codecount=1;

do x=1 to codelen;

code=substr(codetext,x,1);

rp=rep.add();

rc=cod.find();

datasize=nsize;

codecount=codecount+((rc=0)*(rp=0));

percents=int(100*(codecount/datasize));

rca=cod.replace();

codecount=1;

end;

*========USING THE SAS 9.3 ODS REPORT WRITING INTERFACE===========================================;

options nonumber nodate nocenter;

ods listing close;

ods pdf file="C:\Documents and Settings\hinsonj\Desktop\SASoutputs\WendysTable.pdf" notoc; *<------(CHANGE TO YOUR OWN LOCATION);

declare odsout summ();

summ.title(text:"Wendy's Desired Table of Code Frequencies " ||" %sysfunc(today(),mmddyy10.)");

summ.layout_gridded(columns:1);

summ.region();

summ.table_start();

summ.row_start();

summ.format_cell(inhibit:"TLR", text: "CODE", overrides:"cellwidth=4cm just=center");

summ.format_cell(inhibit:"TLR", text: "COUNTS", overrides:"cellwidth=4cm just=center");

summ.format_cell(inhibit:"TLR", text: "NOBS", overrides:"cellwidth=4cm just=center");

summ.format_cell(inhibit:"TLR", text: "PERCENTS", overrides:"cellwidth=4cm just=center");

summ.row_end();

summ.table_end();

summ.table_start();

summ.line();

do while(hicod.next() eq 0);

summ.row_start();

summ.format_cell(inhibit:"BTLR", text: code, overrides:"cellwidth=4cm just=center");

summ.format_cell(inhibit:"BTLR", text: codecount, overrides:"cellwidth=4cm just=center");

summ.format_cell(inhibit:"BTLR", text: datasize, overrides:"cellwidth=4cm just=center");

summ.format_cell(inhibit:"BTLR", text: percents, overrides:"cellwidth=4cm just=center");

summ.row_end();

end;

summ.table_end();

summ.line();

summ.region();

summ.layout_end();

stop;

run;

ods pdf close;

Andre · Posted 09-11-2012 04:32 AM

Joe,

I was surprised by your code, as it is completely new for me!

but when i tried it , i encounter difficulties even after adding a data _null_; in the head

(i use 9.3.0 under windows)

What is the correction, just to see the final production.

Andre

see further:

112 options nonumber nodate nocenter;

113 ods listing close;

114 ods pdf file="d\notes\test\WTable.pdf" notoc;

NOTE: Writing ODS PDF output to DISK destination

"d:\My SAS Files\9.3\d\notes\test\WTable.pdf", printer "PDF".

115 data _null_;

116 declare odsout summ();

117 summ.title(text:"Wendy's Desired Table of Code Frequencies " ||"

117! %sysfunc(today(),mmddyy10.)");

118 summ.layout_gridded(columns:1);

119 summ.region();

120 summ.table_start();

121 summ.row_start();

122 summ.format_cell(inhibit:"TLR", text: "CODE",

122! overrides:"cellwidth=4cm just=center");

123 summ.format_cell(inhibit:"TLR", text: "COUNTS",

123! overrides:"cellwidth=4cm just=center");

124 summ.format_cell(inhibit:"TLR", text: "NOBS",

124! overrides:"cellwidth=4cm just=center");

125 summ.format_cell(inhibit:"TLR", text: "PERCENTS",

125! overrides:"cellwidth=4cm just=center");

126 summ.row_end();

127 summ.table_end();

128 summ.table_start();

129 summ.line();

130 do while(hicod.next() eq 0);

----------

557

ERROR: DATA STEP Component Object failure. Aborted during the COMPILATION phase.

ERROR 557-185: Variable hicod is not an object.

NOTE: The SAS System stopped processing this step because of errors.

NOTE: L'étape DATA used (Total process time):

real time 0.08 secondes

cpu time 0.03 secondes

131 summ.row_start();

132 summ.format_cell(inhibit:"BTLR", text: code,

132! overrides:"cellwidth=4cm just=center");

133 summ.format_cell(inhibit:"BTLR", text: codecount,

133! overrides:"cellwidth=4cm just=center");

134 summ.format_cell(inhibit:"BTLR", text: datasize,

134! overrides:"cellwidth=4cm just=center");

135 summ.format_cell(inhibit:"BTLR", text: percents,

135! overrides:"cellwidth=4cm just=center");

136 summ.row_end();

137 end;

138 summ.table_end();

139 summ.line();

140 summ.region();

141 summ.layout_end();

142 stop;

143 run;

144 ods pdf close;

NOTE: ODS PDF printed no output.

(This sometimes results from failing to place a RUN statement before the ODS PDF CLOSE

statement.)

select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Re: select on unique character codes from a column and generate a report

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away