Hi I have a data that looks like this:
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
I want to parse X in such a way that I get a resulting data that looks like this:
id x1 start_ndx end_ndx start_date end_date
1 1 1 5 01/01/2006 01/05/2006
1 0 6 9 01/06/2006 01/09/2006
1 1 10 11 01/10/2006 01/11/2006
1 0 12 12 01/12/2006 01/12/2006
1 A 13 15 01/13/2006 01/15/2006
1 B 16 16 01/16/2006 01/16/2006
Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.
Thanks for the help
I added an array, but this probably isn't the most efficient way to do this.
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
proc print data=have;
title 'Test case';
run;
data want(drop=position: start length_of_x i x x1_tracker);
set have;
length_of_x=length(x);
array long_string[*] $ position1 - position4000;
do i=1 to length_of_x;
long_string[i] = substr(x,i,1);
end;
x1_tracker='';
do i=1 to length_of_x;
x1=substr(x,i,1);
if x1 ne x1_tracker then
do;
start_ndx=i;
end_ndx=i;
start_date=start-1+i;
x1_tracker=x1;
end;
else
do;
end_ndx=i;
end;
end_date=start_date+end_ndx-start_ndx;
if i le length_of_x and long_string[i] ne long_string[i+1] then output;
end;
run;
proc print data=want noobs;
format start_date end_date mmddyy10. ;
title 'Result for test case';
run;
Will the dates vary from record to record? Can you show how this may work for another ID so that the solution is general enough to work on your data.
@arorata wrote:
Hi I have a data that looks like this:
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
I want to parse X in such a way that I get a resulting data that looks like this:
id x1 start_ndx end_ndx start_date end_date
1 1 1 5 01/01/2006 01/05/2006
1 0 6 9 01/06/2006 01/09/2006
1 1 10 11 01/10/2006 01/11/2006
1 0 12 12 01/12/2006 01/12/2006
1 A 13 15 01/13/2006 01/15/2006
1 B 16 16 01/16/2006 01/16/2006
Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.
Thanks for the help
Thanks for responding. Yes both, value of X and start can be different for different ID. For example ID=2 can have data looking like this:
X="AAABB000111AABB"
start=03/24/2010;
Cheers!!
I think the FINDC function is helpful here. Also the length of X needs to be at least one character longer than the longest string, otherwise the last run will not be output.
data have;
id=1;
length x $128;
X="111110000110AAAB";
start="01jan2006"d;
output;
id=2;
X="11111 0000110AAAB111110000110A";
start="06jan2006"d;
output;
run;
data want;
set have;
i = 1;
do j = 1 by 1 while(i ne 0);
t = char(x,i);
s = findc(x,t,i,'k');
if s then do;
start_ndx = i;
end_ndx = s-1;
start_date = start + start_ndx -1;
end_date = start + end_ndx -1;
output;
end;
i = s;
end;
drop i s;
format start start_date end_date date9.;
run;
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
proc print data=have;
title 'Test case';
run;
data want;
set have;
length_of_x=length(x);
x1_tracker='';
do i=1 to length_of_x;
x1=substr(x,i,1);
if x1 ne x1_tracker then
do;
start_ndx=i;
end_ndx=i;
start_date=start-1+i;
x1_tracker=x1;
end;
else
do;
end_ndx=i;
end;
end_date=start_date+end_ndx-start_ndx;
output;
end;
run;
proc print data=want;
title 'Check results of logic';
run;
data want(drop=x start length_of_x x1_tracker i);
set want;
by id x1 notsorted;
if last.x1;
run;
proc print data=want noobs;
format start_date end_date mmddyy10. ;
title 'Result for test case';
run;
Hi Suzanne, thanks for your reply. Your code works but the only issue is that in my actual data the variable "X" pertains to daily medication history of subjects over a period of 10 years. Actual string is of length 3652. There are 300,000 subjects so what this does is create an intermediate "person-day" file which has approximately 800 million records. The final data is manageable though. Is there a way to avoid outputting all those intermediate records? Appreciate any help.
Cheers
Here's pretty close to what you want. I couldn't get the last record to output correctly so you'll need to play with the IF conditions a bit to get that last record - need to head out for a bit.
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
data want;
set have;
n_loop=lengthn(x);
start_char=substr(x, 1, 1);
start_date=start;
end_date=start;
do i=2 to n_loop;
x_now=substr(x, i, 1);
if start_char=x_now then
do;
end_date+1;
end;
else if start_char ne x_now and i ne n_loop
then do;
output;
start_char=x_now;
start_date=end_date + 1;
end_date=start_date;
end;
else do;
end_date + 1;
output;
end;
end;
format start_date end_date date9.;
run;
Thanks @SuzanneDorinski for the data step to mock up a solution.
Hope this helps you get a little further at least.
Thanks Reeza,
I will try to play around with it and see if I can figure it out. Appreciate all the help.
I added an array, but this probably isn't the most efficient way to do this.
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
proc print data=have;
title 'Test case';
run;
data want(drop=position: start length_of_x i x x1_tracker);
set have;
length_of_x=length(x);
array long_string[*] $ position1 - position4000;
do i=1 to length_of_x;
long_string[i] = substr(x,i,1);
end;
x1_tracker='';
do i=1 to length_of_x;
x1=substr(x,i,1);
if x1 ne x1_tracker then
do;
start_ndx=i;
end_ndx=i;
start_date=start-1+i;
x1_tracker=x1;
end;
else
do;
end_ndx=i;
end;
end_date=start_date+end_ndx-start_ndx;
if i le length_of_x and long_string[i] ne long_string[i+1] then output;
end;
run;
proc print data=want noobs;
format start_date end_date mmddyy10. ;
title 'Result for test case';
run;
Hey everyone,
Thanks for all your responses. These were very helpful! I used parts from most of these and came up with following:
data want;
set have;
format start_date end_date start mmddyy10.;
str_end=length(x);
do i=1 to str_end-1;
char=substr(x, i, 1);
if char ne "0" and substr(x, i+1, 1)="0" then
do j=i+1 to min(i, str_end) while(substr(x, j, 1)="0");
substr(x, j, 1)=char;
i=j;
end;
end;
start_ndx=1;
start_date=start+start_ndx-1;
end_ndx = 0;
end_date=start+end_ndx-1;
x1=substr(x,1,1);
do k=2 to str_end;
if substr(x,k,1) ^= x1 then
do;
end_ndx = k-1;
end_date=start+end_ndx-1;
output;
start_ndx = k;
start_date=start+start_ndx-1;
x1 = substr(x,k,1);
end;
end;
end_ndx = length(x);
end_date=start+end_ndx-1;
output;
drop i j k char;
run;
proc print data=want;var x x1 start start_ndx end_ndx start_date end_date;run
data have;
id=1;
X="111110000110AAAB";
start="01jan2006"d;
output;
run;
data temp;
set have;
do i=1 to length(x);
x1=char(x,i);output;
end;
drop x;
run;
proc summary data=temp;
by id start x1 notsorted;
output out=temp1;
run;
data want;
set temp1;
by id;
retain lag_cum;
if first.id then do;cum=0; cum+_freq_;start_ndx=1; end_ndx=_freq_;end;
else do; cum+_freq_;start_ndx=lag_cum+1; end_ndx=cum; end;
lag_cum=cum;
format start date9.;
drop _:;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.