Hi I have a data that looks like this:
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
I want to parse X in such a way that I get a resulting data that looks like this:
id x1 start_ndx end_ndx start_date end_date
1 1 1 5 01/01/2006 01/05/2006
1 0 6 9 01/06/2006 01/09/2006
1 1 10 11 01/10/2006 01/11/2006
1 0 12 12 01/12/2006 01/12/2006
1 A 13 15 01/13/2006 01/15/2006
1 B 16 16 01/16/2006 01/16/2006
Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.
Thanks for the help
I added an array, but this probably isn't the most efficient way to do this.
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
proc print data=have;
title 'Test case';
run;
data want(drop=position: start length_of_x i x x1_tracker);
set have;
length_of_x=length(x);
array long_string[*] $ position1 - position4000;
do i=1 to length_of_x;
long_string[i] = substr(x,i,1);
end;
x1_tracker='';
do i=1 to length_of_x;
x1=substr(x,i,1);
if x1 ne x1_tracker then
do;
start_ndx=i;
end_ndx=i;
start_date=start-1+i;
x1_tracker=x1;
end;
else
do;
end_ndx=i;
end;
end_date=start_date+end_ndx-start_ndx;
if i le length_of_x and long_string[i] ne long_string[i+1] then output;
end;
run;
proc print data=want noobs;
format start_date end_date mmddyy10. ;
title 'Result for test case';
run;
Will the dates vary from record to record? Can you show how this may work for another ID so that the solution is general enough to work on your data.
@arorata wrote:
Hi I have a data that looks like this:
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
I want to parse X in such a way that I get a resulting data that looks like this:
id x1 start_ndx end_ndx start_date end_date
1 1 1 5 01/01/2006 01/05/2006
1 0 6 9 01/06/2006 01/09/2006
1 1 10 11 01/10/2006 01/11/2006
1 0 12 12 01/12/2006 01/12/2006
1 A 13 15 01/13/2006 01/15/2006
1 B 16 16 01/16/2006 01/16/2006
Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.
Thanks for the help
Thanks for responding. Yes both, value of X and start can be different for different ID. For example ID=2 can have data looking like this:
X="AAABB000111AABB"
start=03/24/2010;
Cheers!!
I think the FINDC function is helpful here. Also the length of X needs to be at least one character longer than the longest string, otherwise the last run will not be output.
data have;
id=1;
length x $128;
X="111110000110AAAB";
start="01jan2006"d;
output;
id=2;
X="11111 0000110AAAB111110000110A";
start="06jan2006"d;
output;
run;
data want;
set have;
i = 1;
do j = 1 by 1 while(i ne 0);
t = char(x,i);
s = findc(x,t,i,'k');
if s then do;
start_ndx = i;
end_ndx = s-1;
start_date = start + start_ndx -1;
end_date = start + end_ndx -1;
output;
end;
i = s;
end;
drop i s;
format start start_date end_date date9.;
run;
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
proc print data=have;
title 'Test case';
run;
data want;
set have;
length_of_x=length(x);
x1_tracker='';
do i=1 to length_of_x;
x1=substr(x,i,1);
if x1 ne x1_tracker then
do;
start_ndx=i;
end_ndx=i;
start_date=start-1+i;
x1_tracker=x1;
end;
else
do;
end_ndx=i;
end;
end_date=start_date+end_ndx-start_ndx;
output;
end;
run;
proc print data=want;
title 'Check results of logic';
run;
data want(drop=x start length_of_x x1_tracker i);
set want;
by id x1 notsorted;
if last.x1;
run;
proc print data=want noobs;
format start_date end_date mmddyy10. ;
title 'Result for test case';
run;
Hi Suzanne, thanks for your reply. Your code works but the only issue is that in my actual data the variable "X" pertains to daily medication history of subjects over a period of 10 years. Actual string is of length 3652. There are 300,000 subjects so what this does is create an intermediate "person-day" file which has approximately 800 million records. The final data is manageable though. Is there a way to avoid outputting all those intermediate records? Appreciate any help.
Cheers
Here's pretty close to what you want. I couldn't get the last record to output correctly so you'll need to play with the IF conditions a bit to get that last record - need to head out for a bit.
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
data want;
set have;
n_loop=lengthn(x);
start_char=substr(x, 1, 1);
start_date=start;
end_date=start;
do i=2 to n_loop;
x_now=substr(x, i, 1);
if start_char=x_now then
do;
end_date+1;
end;
else if start_char ne x_now and i ne n_loop
then do;
output;
start_char=x_now;
start_date=end_date + 1;
end_date=start_date;
end;
else do;
end_date + 1;
output;
end;
end;
format start_date end_date date9.;
run;
Thanks @SuzanneDorinski for the data step to mock up a solution.
Hope this helps you get a little further at least.
Thanks Reeza,
I will try to play around with it and see if I can figure it out. Appreciate all the help.
I added an array, but this probably isn't the most efficient way to do this.
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
proc print data=have;
title 'Test case';
run;
data want(drop=position: start length_of_x i x x1_tracker);
set have;
length_of_x=length(x);
array long_string[*] $ position1 - position4000;
do i=1 to length_of_x;
long_string[i] = substr(x,i,1);
end;
x1_tracker='';
do i=1 to length_of_x;
x1=substr(x,i,1);
if x1 ne x1_tracker then
do;
start_ndx=i;
end_ndx=i;
start_date=start-1+i;
x1_tracker=x1;
end;
else
do;
end_ndx=i;
end;
end_date=start_date+end_ndx-start_ndx;
if i le length_of_x and long_string[i] ne long_string[i+1] then output;
end;
run;
proc print data=want noobs;
format start_date end_date mmddyy10. ;
title 'Result for test case';
run;
Hey everyone,
Thanks for all your responses. These were very helpful! I used parts from most of these and came up with following:
data want;
set have;
format start_date end_date start mmddyy10.;
str_end=length(x);
do i=1 to str_end-1;
char=substr(x, i, 1);
if char ne "0" and substr(x, i+1, 1)="0" then
do j=i+1 to min(i, str_end) while(substr(x, j, 1)="0");
substr(x, j, 1)=char;
i=j;
end;
end;
start_ndx=1;
start_date=start+start_ndx-1;
end_ndx = 0;
end_date=start+end_ndx-1;
x1=substr(x,1,1);
do k=2 to str_end;
if substr(x,k,1) ^= x1 then
do;
end_ndx = k-1;
end_date=start+end_ndx-1;
output;
start_ndx = k;
start_date=start+start_ndx-1;
x1 = substr(x,k,1);
end;
end;
end_ndx = length(x);
end_date=start+end_ndx-1;
output;
drop i j k char;
run;
proc print data=want;var x x1 start start_ndx end_ndx start_date end_date;run
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
data temp;
set have;
do i=1 to length(x);
x1=char(x,i);output;
end;
drop x;
run;
proc summary data=temp;
by id start x1 notsorted;
output out=temp1;
run;
data want;
set temp1;
by id;
retain lag_cum;
if first.id then do;cum=0; cum+_freq_;start_ndx=1; end_ndx=_freq_;end;
else do; cum+_freq_;start_ndx=lag_cum+1; end_ndx=cum; end;
lag_cum=cum;
format start date9.;
drop _:;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.