- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a long dataset with a 36-char STRING variable by ID and CLASS level. Some ID's CLASS values have multiple STRING values and I need to collapse the STRING by unique ID-CLASS group. Each character in the string represents a month and values should be collapsed as follows: if month X is 0 across all STRING within an ID-CLASS group, then month X should be 0 in final string; if month X is a 1 in any STRING within an ID-CLASS group, then month X should be 1; and if there are only 2s or 2s and 0s in month X in all STRING within an ID-CLASS group, then month X should be 2.
So for ID #1002, the final data should contain one row for CLASS #973, with the same info as before, and only one row for CLASS #934 where the STRING would be 010122111110000000000000000000000000.
ID |
CLASS |
STRING |
10002 |
973 |
000000000000011111000000000000111110 |
10002 |
934 |
010122121110000000000000000000000000 |
10002 |
934 |
020222111100000000000000000000000000 |
I tried creating 36 month-level indicators based on the string and using them below. It's working in some cases, but not in others. I think when an earlier string contains a 2 and a later one has a 1, it updates to 1 correctly. But when a 1 appears in an earlier string and a 2 in a later string, it changes the 1 to 2.
data dsn;
length newstring $36;
set dsn;
by id class;
array mstr (36) STR01-STR36;
array nstr (36) NSTR01-NSTR36;
do i=1 to 36;
if first.unitid then nstr[i] = 0;
end;
retain nstr01--nstr36;
do i=1 to 36 ;
if mstr[i] > 0 then nstr[i] = min(of mstr[i]);
end;
newstring = cats(of nstr:);
run;
I'd appreciate any tips or suggestions. Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In one respect I would say that your rules mean that the values picked for the 0,1, and 2 were chosen poorly as there is no natural order of the shown value compared to your apparent concern for the value.
This approach works for the shown values by 1) treating 1 as the most important value by changing it to a numeric 9, then using the max summary statistic after separating out the columns of the value. Then reassigning the 9 back to a 1 when rebuilding the string. Though I would submit that the 36 character string is an awkward bulky and unnatural way to handle 36 values (or 3 or 15 or 125) values. I would set a coding scheme that makes sense in a more general term and likely deal with 36 variables.
Note use of custom informat to create numeric values from the character and the a custom format to allow simpler put of values as desired into a single string.
data have; input ID CLASS STRING :$36.; datalines; 10002 973 000000000000011111000000000000111110 10002 934 010122121110000000000000000000000000 10002 934 020222111100000000000000000000000000 ; proc format; invalue stupidorder 1=9 0=0 2=2; value stupidorder 9='1' 0='0' 2='2' ; run; data working; set have; array ns (36) ; do i=1 to 36; ns[i]= input(substr(string,i,1),stupidorder.); end; run; proc summary data=working nway; /* this ASSUMES all of the ID and Class value combinations are grouped in the data if that is not the case SORT the data first. */ by id class notsorted; var ns: ; output out=summary (drop=_:) max=; run; data want; set summary; array ns (*) ns:; length string $ 36; do i=1 to dim(ns); string=cats(string,put(ns[i],stupidorder. -L)); end; drop i ns: ; run;
For future questions please provide in the form of a data step such as shown. Or at least paste stuff into a text box opened on the forum with the </> icon above the main message window. Whatever created that "table" of values is truly obnoxious to work with and any larger sample of data I would have either skipped this question entirely or reduced it just a couple of values.
Paste code into a text box as well. That will preserve formatting, if you use such, and makes reading code much easier as well as separating code out from discussion text.
An approach using RETAIN (or LAG functions) would likely only work for at most two rows of data per ID/Class combination. When you have rules like "any of the rows has 1" keeping track of having a 1 for multiple rows at the same time as trying ti keep the 0 or 2 straight gets to be pretty obnoxious.
Note: if 0 is a "no occurence" you might have used a missing, 0(not important) and 1 (important) coding scheme (and in 36 numeric variables) as there are quite a lot of things that can be told with numeric statistics from 0/1 coded variables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
In one respect I would say that your rules mean that the values picked for the 0,1, and 2 were chosen poorly as there is no natural order of the shown value compared to your apparent concern for the value.
This approach works for the shown values by 1) treating 1 as the most important value by changing it to a numeric 9, then using the max summary statistic after separating out the columns of the value. Then reassigning the 9 back to a 1 when rebuilding the string. Though I would submit that the 36 character string is an awkward bulky and unnatural way to handle 36 values (or 3 or 15 or 125) values. I would set a coding scheme that makes sense in a more general term and likely deal with 36 variables.
Note use of custom informat to create numeric values from the character and the a custom format to allow simpler put of values as desired into a single string.
data have; input ID CLASS STRING :$36.; datalines; 10002 973 000000000000011111000000000000111110 10002 934 010122121110000000000000000000000000 10002 934 020222111100000000000000000000000000 ; proc format; invalue stupidorder 1=9 0=0 2=2; value stupidorder 9='1' 0='0' 2='2' ; run; data working; set have; array ns (36) ; do i=1 to 36; ns[i]= input(substr(string,i,1),stupidorder.); end; run; proc summary data=working nway; /* this ASSUMES all of the ID and Class value combinations are grouped in the data if that is not the case SORT the data first. */ by id class notsorted; var ns: ; output out=summary (drop=_:) max=; run; data want; set summary; array ns (*) ns:; length string $ 36; do i=1 to dim(ns); string=cats(string,put(ns[i],stupidorder. -L)); end; drop i ns: ; run;
For future questions please provide in the form of a data step such as shown. Or at least paste stuff into a text box opened on the forum with the </> icon above the main message window. Whatever created that "table" of values is truly obnoxious to work with and any larger sample of data I would have either skipped this question entirely or reduced it just a couple of values.
Paste code into a text box as well. That will preserve formatting, if you use such, and makes reading code much easier as well as separating code out from discussion text.
An approach using RETAIN (or LAG functions) would likely only work for at most two rows of data per ID/Class combination. When you have rules like "any of the rows has 1" keeping track of having a 1 for multiple rows at the same time as trying ti keep the 0 or 2 straight gets to be pretty obnoxious.
Note: if 0 is a "no occurence" you might have used a missing, 0(not important) and 1 (important) coding scheme (and in 36 numeric variables) as there are quite a lot of things that can be told with numeric statistics from 0/1 coded variables.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
And what if a position in the string has a zero, a one, AND a two?
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I think I am correctly interpreting your rule by saying:
- If a character in a string ever contains a 1, the final character should be a 1
- Else if that character ever contains a 2, the final character should be a 2
- Otherwise it is a zero.
If that is correct, and if your data are already grouped by ID/CLASS, then:
data have;
input ID CLASS STRING :$36.;
datalines;
10002 973 000000000000011111000000000000111110
10002 934 010122121110000000000000000000000000
10002 934 020222111100000000000000000000000000
;
data want (drop=i);
set have;
by id class notsorted;
array matrix{0:2,36} _temporary_;
if first.class=1 then call missing(of matrix{*});
/* If this is not a single obs, then update the 3-row matrix */
if not (first.class=1 and last.class=1) then do i=1 to 36 ;
matrix(input(char(string,i),1.),i)+1;
end;
if last.class;
/* If this is not a single obs, then interpret the 3-row matrix*/
if first.id=0 then do i=1 to length(string);
if matrix(1,i)>0 then substr(string,i,1)='1';
else if matrix(2,i)>0 then substr(string,i,1)='2';
else substr(string,i,1)='0';
end;
run;
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It is going to be much easier if you transpose that data into one observation per month.
data tall;
set have ;
do month=1 to length(string);
char=char(string,month);
if char ne '0' then output;
end;
drop string;
run;
If you are worried about performance make it a view instead of physical copy of the data.
Then collapse based on your rules.
proc sql ;
create table collapsed as
select id, class
, month
, case when (max(char='1')) then '1'
when (max(char='2')) then '2'
else '0'
end as char
from tall
group by id, class, month
order by id, class, month
;
quit;
If you need to you can rebuild the string.
data want;
do until(last.class);
set collapsed ;
by id class ;
length string $36 ;
if first.class then string=repeat('0',35);
substr(string,month,1)=char;
end;
drop month char;
run;
Results
Obs id class string 1 10002 934 010122111110000000000000000000000000 2 10002 973 000000000000011111000000000000111110
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @mh04,
The BOR function lends itself to implement your collapsing rule. Sadly, it cannot deal with 72-bit strings, so some disassembling and reassembling is needed, which makes the code longer:
data want(drop=b: c i);
length c $72;
array b[0:2];
do until(last.class);
set have;
by id class notsorted;
c=put(translate(string,'000110'x,'012'),$hex72.);
do i=0 to 2;
b[i]=bor(max(0,b[i]),input(substr(c,24*i+1),binary24.));
end;
end;
c=put(b1,binary24.)||put(b2,binary24.)||put(b3,binary24.);
string=translate(input(c,$hex72.),'0121','00011011'x);
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content