SAS Programming

mh04 · Posted 08-24-2023 05:22 PM

I have a long dataset with a 36-char STRING variable by ID and CLASS level. Some ID's CLASS values have multiple STRING values and I need to collapse the STRING by unique ID-CLASS group. Each character in the string represents a month and values should be collapsed as follows: if month X is 0 across all STRING within an ID-CLASS group, then month X should be 0 in final string; if month X is a 1 in any STRING within an ID-CLASS group, then month X should be 1; and if there are only 2s or 2s and 0s in month X in all STRING within an ID-CLASS group, then month X should be 2.

So for ID #1002, the final data should contain one row for CLASS #973, with the same info as before, and only one row for CLASS #934 where the STRING would be 010122111110000000000000000000000000.

ID	CLASS	STRING
10002	973	000000000000011111000000000000111110
10002	934	010122121110000000000000000000000000
10002	934	020222111100000000000000000000000000

I tried creating 36 month-level indicators based on the string and using them below. It's working in some cases, but not in others. I think when an earlier string contains a 2 and a later one has a 1, it updates to 1 correctly. But when a 1 appears in an earlier string and a 2 in a later string, it changes the 1 to 2.

data dsn;

length newstring $36;

set dsn;

by id class;

array mstr (36) STR01-STR36;

array nstr (36) NSTR01-NSTR36;

do i=1 to 36;

if first.unitid then nstr[i] = 0;

end;

retain nstr01--nstr36;

do i=1 to 36 ;

if mstr[i] > 0 then nstr[i] = min(of mstr[i]);

end;

newstring = cats(of nstr:);

run;

I'd appreciate any tips or suggestions. Thanks.

ballardw · Posted 08-24-2023 07:44 PM

In one respect I would say that your rules mean that the values picked for the 0,1, and 2 were chosen poorly as there is no natural order of the shown value compared to your apparent concern for the value.

This approach works for the shown values by 1) treating 1 as the most important value by changing it to a numeric 9, then using the max summary statistic after separating out the columns of the value. Then reassigning the 9 back to a 1 when rebuilding the string. Though I would submit that the 36 character string is an awkward bulky and unnatural way to handle 36 values (or 3 or 15 or 125) values. I would set a coding scheme that makes sense in a more general term and likely deal with 36 variables.

Note use of custom informat to create numeric values from the character and the a custom format to allow simpler put of values as desired into a single string.

data have;
input ID  CLASS STRING :$36.;
datalines;
10002 973 000000000000011111000000000000111110
10002 934 010122121110000000000000000000000000
10002 934 020222111100000000000000000000000000
;

proc format;
invalue stupidorder 
1=9
0=0
2=2;
value stupidorder
9='1'
0='0'
2='2'
;
run;

data working;
   set have;
   array ns (36) ;
   do i=1 to 36;
      ns[i]= input(substr(string,i,1),stupidorder.);
   end;
run;  


proc summary data=working nway;
   /* this ASSUMES all of the ID and Class value
      combinations are grouped in the data
      if that is not the case SORT the data first.
   */
   by id class notsorted;
   var ns: ;
   output out=summary (drop=_:) max=;
run;

data want;
   set summary;
   array ns (*) ns:;
   length string $ 36;
   do i=1 to dim(ns);
      string=cats(string,put(ns[i],stupidorder. -L));
   end;
   drop i ns: ;
run;

For future questions please provide in the form of a data step such as shown. Or at least paste stuff into a text box opened on the forum with the </> icon above the main message window. Whatever created that "table" of values is truly obnoxious to work with and any larger sample of data I would have either skipped this question entirely or reduced it just a couple of values.

Paste code into a text box as well. That will preserve formatting, if you use such, and makes reading code much easier as well as separating code out from discussion text.

An approach using RETAIN (or LAG functions) would likely only work for at most two rows of data per ID/Class combination. When you have rules like "any of the rows has 1" keeping track of having a 1 for multiple rows at the same time as trying ti keep the 0 or 2 straight gets to be pretty obnoxious.

Note: if 0 is a "no occurence" you might have used a missing, 0(not important) and 1 (important) coding scheme (and in 36 numeric variables) as there are quite a lot of things that can be told with numeric statistics from 0/1 coded variables.

View solution in original post

ballardw · Posted 08-24-2023 07:44 PM

In one respect I would say that your rules mean that the values picked for the 0,1, and 2 were chosen poorly as there is no natural order of the shown value compared to your apparent concern for the value.

This approach works for the shown values by 1) treating 1 as the most important value by changing it to a numeric 9, then using the max summary statistic after separating out the columns of the value. Then reassigning the 9 back to a 1 when rebuilding the string. Though I would submit that the 36 character string is an awkward bulky and unnatural way to handle 36 values (or 3 or 15 or 125) values. I would set a coding scheme that makes sense in a more general term and likely deal with 36 variables.

Note use of custom informat to create numeric values from the character and the a custom format to allow simpler put of values as desired into a single string.

data have;
input ID  CLASS STRING :$36.;
datalines;
10002 973 000000000000011111000000000000111110
10002 934 010122121110000000000000000000000000
10002 934 020222111100000000000000000000000000
;

proc format;
invalue stupidorder 
1=9
0=0
2=2;
value stupidorder
9='1'
0='0'
2='2'
;
run;

data working;
   set have;
   array ns (36) ;
   do i=1 to 36;
      ns[i]= input(substr(string,i,1),stupidorder.);
   end;
run;  


proc summary data=working nway;
   /* this ASSUMES all of the ID and Class value
      combinations are grouped in the data
      if that is not the case SORT the data first.
   */
   by id class notsorted;
   var ns: ;
   output out=summary (drop=_:) max=;
run;

data want;
   set summary;
   array ns (*) ns:;
   length string $ 36;
   do i=1 to dim(ns);
      string=cats(string,put(ns[i],stupidorder. -L));
   end;
   drop i ns: ;
run;

For future questions please provide in the form of a data step such as shown. Or at least paste stuff into a text box opened on the forum with the </> icon above the main message window. Whatever created that "table" of values is truly obnoxious to work with and any larger sample of data I would have either skipped this question entirely or reduced it just a couple of values.

Paste code into a text box as well. That will preserve formatting, if you use such, and makes reading code much easier as well as separating code out from discussion text.

An approach using RETAIN (or LAG functions) would likely only work for at most two rows of data per ID/Class combination. When you have rules like "any of the rows has 1" keeping track of having a 1 for multiple rows at the same time as trying ti keep the 0 or 2 straight gets to be pretty obnoxious.

Note: if 0 is a "no occurence" you might have used a missing, 0(not important) and 1 (important) coding scheme (and in 36 numeric variables) as there are quite a lot of things that can be told with numeric statistics from 0/1 coded variables.

mh04 · Posted 08-28-2023 08:53 AM

Thank you and sorry about my formatting!

mkeintz · Posted 08-24-2023 09:35 PM

And what if a position in the string has a zero, a one, AND a two?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 08-24-2023 09:56 PM

I think I am correctly interpreting your rule by saying:

If a character in a string ever contains a 1, the final character should be a 1
Else if that character ever contains a 2, the final character should be a 2
Otherwise it is a zero.

If that is correct, and if your data are already grouped by ID/CLASS, then:

data have;
input ID  CLASS STRING :$36.;
datalines;
10002 973 000000000000011111000000000000111110
10002 934 010122121110000000000000000000000000
10002 934 020222111100000000000000000000000000
;

data want (drop=i);
  set have;
  by id class notsorted;
  array matrix{0:2,36} _temporary_;
  if first.class=1 then call missing(of matrix{*});

  /* If this is not a single obs, then update the 3-row matrix */
  if not (first.class=1 and last.class=1) then do i=1 to 36 ;
    matrix(input(char(string,i),1.),i)+1;
  end;

  if last.class;
  /* If this is not a single obs, then interpret the 3-row matrix*/
  if first.id=0 then do i=1 to length(string);
    if      matrix(1,i)>0 then substr(string,i,1)='1'; 
    else if matrix(2,i)>0 then substr(string,i,1)='2'; 
    else                       substr(string,i,1)='0';
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mh04 · Posted 08-28-2023 08:53 AM

Yes, your interpretation is correct. Thank you!

Tom · Posted 08-24-2023 10:13 PM

It is going to be much easier if you transpose that data into one observation per month.

data tall;
  set have ;
  do month=1 to length(string);
    char=char(string,month);
    if char ne '0' then output;
  end;
  drop string;
run;

If you are worried about performance make it a view instead of physical copy of the data.

Then collapse based on your rules.

proc sql ;
 create table collapsed as
 select id, class 
      , month
      , case when (max(char='1')) then '1'
             when (max(char='2')) then '2'
             else '0'
        end as char
 from tall
 group by id, class, month
 order by id, class, month
 ;
quit;

If you need to you can rebuild the string.

data want;
  do until(last.class);
    set collapsed ;
    by id class ;
    length string $36 ;
    if first.class then string=repeat('0',35);
    substr(string,month,1)=char;
  end;
  drop month char;
run;

Results

Obs     id      class                   string

 1     10002     934     010122111110000000000000000000000000
 2     10002     973     000000000000011111000000000000111110

mh04 · Posted 08-28-2023 08:54 AM

Thank you!

FreelanceReinh · Posted 08-25-2023 10:19 AM

Hello @mh04,

The BOR function lends itself to implement your collapsing rule. Sadly, it cannot deal with 72-bit strings, so some disassembling and reassembling is needed, which makes the code longer:

data want(drop=b: c i);
length c $72;
array b[0:2];
do until(last.class);
  set have;
  by id class notsorted;
  c=put(translate(string,'000110'x,'012'),$hex72.);
  do i=0 to 2;
    b[i]=bor(max(0,b[i]),input(substr(c,24*i+1),binary24.));
  end;
end;
c=put(b1,binary24.)||put(b2,binary24.)||put(b3,binary24.);
string=translate(input(c,$hex72.),'0121','00011011'x);
run;

mh04 · Posted 08-28-2023 08:59 AM

Thank you! This is a bit over my head but I look forward to testing it and learning new things.

SAS Programming

Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Re: Collapsing string values within by-groups

Result by group

Concat string in group by

Trouble with BY-group processing using GLM parameterization

getting max and min value by group

proc sql count non missing observations in variable by group

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...