BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
mh04
Obsidian | Level 7

I have a long dataset with a 36-char STRING variable by ID and CLASS level. Some ID's CLASS values have multiple STRING values and I need to collapse the STRING by unique ID-CLASS group. Each character in the string represents a month and values should be collapsed as follows: if month X is 0 across all STRING within an ID-CLASS group, then month X should be 0 in final string; if month X is a 1 in any STRING within an ID-CLASS group, then month X should be 1; and if there are only 2s or 2s and 0s in month X in all STRING within an ID-CLASS group, then month X should be 2. 

 

So for ID #1002, the final data should contain one row for CLASS #973, with the same info as before, and only one row for CLASS #934 where the STRING would be 010122111110000000000000000000000000. 

 

ID

CLASS

STRING

10002

973

000000000000011111000000000000111110

10002

934

010122121110000000000000000000000000

10002

934

020222111100000000000000000000000000

 

I tried creating 36 month-level indicators based on the string and using them below. It's working in some cases, but not in others. I think when an earlier string contains a 2 and a later one has a 1, it updates to 1 correctly. But when a 1 appears in an earlier string and a 2 in a later string, it changes the 1 to 2.

 

data dsn;

length newstring $36;

set dsn;

by id class;

 

array mstr (36) STR01-STR36;

array nstr (36) NSTR01-NSTR36;

do i=1 to 36;

if first.unitid then nstr[i] = 0;

end;

 

retain nstr01--nstr36;

do i=1 to 36 ;

if mstr[i] > 0 then nstr[i] = min(of mstr[i]);

end;

 

newstring = cats(of nstr:);

run;

 

I'd appreciate any tips or suggestions. Thanks.

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

In one respect I would say that your rules mean that the values picked for the 0,1, and 2 were chosen poorly as there is no natural order of the shown value compared to your apparent concern for the value.

 

This approach works for the shown values by 1) treating 1 as the most important value by changing it to a numeric 9, then using the max summary statistic after separating out the columns of the value. Then reassigning the 9 back to a 1 when rebuilding the string. Though I would submit that the 36 character string is an awkward bulky and unnatural way to handle 36 values (or 3 or 15 or 125) values. I would set a coding scheme that makes sense in a more general term and likely deal with 36 variables.

 

Note use of custom informat to create numeric values from the character and the a custom format to allow simpler put of values as desired into a single string.

data have;
input ID  CLASS STRING :$36.;
datalines;
10002 973 000000000000011111000000000000111110
10002 934 010122121110000000000000000000000000
10002 934 020222111100000000000000000000000000
;

proc format;
invalue stupidorder 
1=9
0=0
2=2;
value stupidorder
9='1'
0='0'
2='2'
;
run;

data working;
   set have;
   array ns (36) ;
   do i=1 to 36;
      ns[i]= input(substr(string,i,1),stupidorder.);
   end;
run;  


proc summary data=working nway;
   /* this ASSUMES all of the ID and Class value
      combinations are grouped in the data
      if that is not the case SORT the data first.
   */
   by id class notsorted;
   var ns: ;
   output out=summary (drop=_:) max=;
run;

data want;
   set summary;
   array ns (*) ns:;
   length string $ 36;
   do i=1 to dim(ns);
      string=cats(string,put(ns[i],stupidorder. -L));
   end;
   drop i ns: ;
run;

For future questions please provide in the form of a data step such as shown. Or at least paste stuff into a text box opened on the forum with the </> icon above the main message window. Whatever created that "table" of values is truly obnoxious to work with and any larger sample of data I would have either skipped this question entirely or reduced it just a couple of values.

 

Paste code into a text box as well. That will preserve formatting, if you use such, and makes reading code much easier as well as separating code out from discussion text.

 

An approach using RETAIN (or LAG functions) would likely only work for at most two rows of data per ID/Class combination. When you have rules like "any of the rows has 1" keeping track of having a 1 for multiple rows at the same time as trying ti keep the 0 or 2 straight gets to be pretty obnoxious.

 

Note: if 0 is a "no occurence" you might have used a missing, 0(not important) and 1 (important) coding scheme (and in 36 numeric variables) as there are quite a lot of things that can be told with numeric statistics from 0/1 coded variables.

View solution in original post

9 REPLIES 9
ballardw
Super User

In one respect I would say that your rules mean that the values picked for the 0,1, and 2 were chosen poorly as there is no natural order of the shown value compared to your apparent concern for the value.

 

This approach works for the shown values by 1) treating 1 as the most important value by changing it to a numeric 9, then using the max summary statistic after separating out the columns of the value. Then reassigning the 9 back to a 1 when rebuilding the string. Though I would submit that the 36 character string is an awkward bulky and unnatural way to handle 36 values (or 3 or 15 or 125) values. I would set a coding scheme that makes sense in a more general term and likely deal with 36 variables.

 

Note use of custom informat to create numeric values from the character and the a custom format to allow simpler put of values as desired into a single string.

data have;
input ID  CLASS STRING :$36.;
datalines;
10002 973 000000000000011111000000000000111110
10002 934 010122121110000000000000000000000000
10002 934 020222111100000000000000000000000000
;

proc format;
invalue stupidorder 
1=9
0=0
2=2;
value stupidorder
9='1'
0='0'
2='2'
;
run;

data working;
   set have;
   array ns (36) ;
   do i=1 to 36;
      ns[i]= input(substr(string,i,1),stupidorder.);
   end;
run;  


proc summary data=working nway;
   /* this ASSUMES all of the ID and Class value
      combinations are grouped in the data
      if that is not the case SORT the data first.
   */
   by id class notsorted;
   var ns: ;
   output out=summary (drop=_:) max=;
run;

data want;
   set summary;
   array ns (*) ns:;
   length string $ 36;
   do i=1 to dim(ns);
      string=cats(string,put(ns[i],stupidorder. -L));
   end;
   drop i ns: ;
run;

For future questions please provide in the form of a data step such as shown. Or at least paste stuff into a text box opened on the forum with the </> icon above the main message window. Whatever created that "table" of values is truly obnoxious to work with and any larger sample of data I would have either skipped this question entirely or reduced it just a couple of values.

 

Paste code into a text box as well. That will preserve formatting, if you use such, and makes reading code much easier as well as separating code out from discussion text.

 

An approach using RETAIN (or LAG functions) would likely only work for at most two rows of data per ID/Class combination. When you have rules like "any of the rows has 1" keeping track of having a 1 for multiple rows at the same time as trying ti keep the 0 or 2 straight gets to be pretty obnoxious.

 

Note: if 0 is a "no occurence" you might have used a missing, 0(not important) and 1 (important) coding scheme (and in 36 numeric variables) as there are quite a lot of things that can be told with numeric statistics from 0/1 coded variables.

mh04
Obsidian | Level 7
Thank you and sorry about my formatting!
mkeintz
PROC Star

And what if a position in the string  has a zero, a one, AND a two?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mkeintz
PROC Star

I think I am correctly interpreting your rule by saying:

  1. If a character in a string ever contains a 1, the final character should be a 1
  2. Else if that character ever contains a 2, the final character should be a 2
  3. Otherwise it is a zero.

 

If that is correct, and if your data are already grouped by ID/CLASS, then:

data have;
input ID  CLASS STRING :$36.;
datalines;
10002 973 000000000000011111000000000000111110
10002 934 010122121110000000000000000000000000
10002 934 020222111100000000000000000000000000
;

data want (drop=i);
  set have;
  by id class notsorted;
  array matrix{0:2,36} _temporary_;
  if first.class=1 then call missing(of matrix{*});

  /* If this is not a single obs, then update the 3-row matrix */
  if not (first.class=1 and last.class=1) then do i=1 to 36 ;
    matrix(input(char(string,i),1.),i)+1;
  end;

  if last.class;
  /* If this is not a single obs, then interpret the 3-row matrix*/
  if first.id=0 then do i=1 to length(string);
    if      matrix(1,i)>0 then substr(string,i,1)='1'; 
    else if matrix(2,i)>0 then substr(string,i,1)='2'; 
    else                       substr(string,i,1)='0';
  end;
run;

 

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
mh04
Obsidian | Level 7
Yes, your interpretation is correct. Thank you!
Tom
Super User Tom
Super User

It is going to be much easier if you transpose that data into one observation per month.

data tall;
  set have ;
  do month=1 to length(string);
    char=char(string,month);
    if char ne '0' then output;
  end;
  drop string;
run;

If you are worried about performance make it a view instead of physical copy of the data.

Then collapse based on your rules.

proc sql ;
 create table collapsed as
 select id, class 
      , month
      , case when (max(char='1')) then '1'
             when (max(char='2')) then '2'
             else '0'
        end as char
 from tall
 group by id, class, month
 order by id, class, month
 ;
quit;

If you need to you can rebuild the string.

data want;
  do until(last.class);
    set collapsed ;
    by id class ;
    length string $36 ;
    if first.class then string=repeat('0',35);
    substr(string,month,1)=char;
  end;
  drop month char;
run;

Results

Obs     id      class                   string

 1     10002     934     010122111110000000000000000000000000
 2     10002     973     000000000000011111000000000000111110
mh04
Obsidian | Level 7
Thank you!
FreelanceReinh
Jade | Level 19

Hello @mh04,

 

The BOR function lends itself to implement your collapsing rule. Sadly, it cannot deal with 72-bit strings, so some disassembling and reassembling is needed, which makes the code longer:

data want(drop=b: c i);
length c $72;
array b[0:2];
do until(last.class);
  set have;
  by id class notsorted;
  c=put(translate(string,'000110'x,'012'),$hex72.);
  do i=0 to 2;
    b[i]=bor(max(0,b[i]),input(substr(c,24*i+1),binary24.));
  end;
end;
c=put(b1,binary24.)||put(b2,binary24.)||put(b3,binary24.);
string=translate(input(c,$hex72.),'0121','00011011'x);
run;
mh04
Obsidian | Level 7
Thank you! This is a bit over my head but I look forward to testing it and learning new things.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1472 views
  • 4 likes
  • 5 in conversation