Solved: How to split a row into multiple records each time a character in a st...

arorata · Posted 09-09-2018 11:21 AM

Hi I have a data that looks like this:

data have;

id=1;

X="111110000110AAAB";

start="01jan2006"d;

output;

run;

I want to parse X in such a way that I get a resulting data that looks like this:

id x1 start_ndx end_ndx start_date end_date

1 1 1 5 01/01/2006 01/05/2006

1 0 6 9 01/06/2006 01/09/2006

1 1 10 11 01/10/2006 01/11/2006

1 0 12 12 01/12/2006 01/12/2006

1 A 13 15 01/13/2006 01/15/2006

1 B 16 16 01/16/2006 01/16/2006

Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.

Thanks for the help

SuzanneDorinski · Posted 09-09-2018 03:21 PM

I added an array, but this probably isn't the most efficient way to do this.

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

proc print data=have;
  title 'Test case';
run;

data want(drop=position: start length_of_x i x x1_tracker);
  set have;
  length_of_x=length(x);
  array long_string[*] $ position1 - position4000;
  do i=1 to length_of_x;
    long_string[i] = substr(x,i,1);
  end;
  x1_tracker='';
  do i=1 to length_of_x;
    x1=substr(x,i,1);
    if x1 ne x1_tracker then
      do;
        start_ndx=i;
        end_ndx=i;
        start_date=start-1+i;
        x1_tracker=x1;
      end;
    else
      do;
        end_ndx=i;
      end;
    end_date=start_date+end_ndx-start_ndx;
    if i le length_of_x and long_string[i] ne long_string[i+1] then output;
  end;
run;

proc print data=want noobs;
  format start_date end_date mmddyy10. ;
  title 'Result for test case';
run;

View solution in original post

Reeza · Posted 09-09-2018 12:51 PM

Will the dates vary from record to record? Can you show how this may work for another ID so that the solution is general enough to work on your data.

@arorata wrote:

Hi I have a data that looks like this:

data have;

   id=1;

   X="111110000110AAAB";

   start="01jan2006"d;

   output;

run;

I want to parse X in such a way that I get a resulting data that looks like this:

id      x1 start_ndx    end_ndx   start_date           end_date

1        1       1                5            01/01/2006         01/05/2006

1        0       6                9            01/06/2006         01/09/2006

1        1     10               11         01/10/2006         01/11/2006

1        0     12               12           01/12/2006         01/12/2006

1        A     13               15           01/13/2006         01/15/2006

1        B     16               16           01/16/2006         01/16/2006

Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.

Thanks for the help

arorata · Posted 09-09-2018 01:48 PM

Thanks for responding. Yes both, value of X and start can be different for different ID. For example ID=2 can have data looking like this:

X="AAABB000111AABB"

start=03/24/2010;

Cheers!!

data_null__ · Posted 09-09-2018 05:38 PM

I think the FINDC function is helpful here. Also the length of X needs to be at least one character longer than the longest string, otherwise the last run will not be output.

data have;
   id=1;
   length x $128;
   X="111110000110AAAB";
   start="01jan2006"d;
   output;
   id=2;
   X="11111 0000110AAAB111110000110A";
   start="06jan2006"d;
   output;
   run;
data want;
   set have;
   i = 1;
   do j = 1 by 1 while(i ne 0);
      t = char(x,i);
      s = findc(x,t,i,'k');
      if s then do;
         start_ndx  = i;
         end_ndx    = s-1;
         start_date = start + start_ndx -1;
         end_date   = start + end_ndx   -1;
         output;
         end;
      i = s;
      end;
   drop i s;
   format start start_date end_date date9.;
   run;

SuzanneDorinski · Posted 09-09-2018 01:19 PM

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

proc print data=have;
  title 'Test case';
run;

data want;
  set have;
  length_of_x=length(x);
  x1_tracker='';
  do i=1 to length_of_x;
    x1=substr(x,i,1);
    if x1 ne x1_tracker then
      do;
        start_ndx=i;
        end_ndx=i;
        start_date=start-1+i;
        x1_tracker=x1;
      end;
    else
      do;
        end_ndx=i;
      end;
    end_date=start_date+end_ndx-start_ndx;
    output;
  end;
run;

proc print data=want;
  title 'Check results of logic';
run;

data want(drop=x start length_of_x x1_tracker i);
  set want;
  by id x1 notsorted;
  if last.x1;
run;

proc print data=want noobs;
  format start_date end_date mmddyy10. ;
  title 'Result for test case';
run;

arorata · Posted 09-09-2018 02:07 PM

Hi Suzanne, thanks for your reply. Your code works but the only issue is that in my actual data the variable "X" pertains to daily medication history of subjects over a period of 10 years. Actual string is of length 3652. There are 300,000 subjects so what this does is create an intermediate "person-day" file which has approximately 800 million records. The final data is manageable though. Is there a way to avoid outputting all those intermediate records? Appreciate any help.

Cheers

Reeza · Posted 09-09-2018 03:02 PM

Here's pretty close to what you want. I couldn't get the last record to output correctly so you'll need to play with the IF conditions a bit to get that last record - need to head out for a bit.

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

data want;
	set have;
	n_loop=lengthn(x);
	start_char=substr(x, 1, 1);
	start_date=start;
	end_date=start;

	do i=2 to n_loop;
		x_now=substr(x, i, 1);

		if start_char=x_now then
			do;
				end_date+1;
			end;
		else if start_char ne x_now and i ne n_loop
			then do;
				output;
				start_char=x_now;
				start_date=end_date + 1;
				end_date=start_date;
			end;
		else do;
		    end_date + 1;
			output;
	    end;
	end;
	format start_date end_date date9.;
run;

Thanks @SuzanneDorinski for the data step to mock up a solution.

Hope this helps you get a little further at least.

arorata · Posted 09-09-2018 03:10 PM

Thanks Reeza,

I will try to play around with it and see if I can figure it out. Appreciate all the help.

SuzanneDorinski · Posted 09-09-2018 03:21 PM

I added an array, but this probably isn't the most efficient way to do this.

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

proc print data=have;
  title 'Test case';
run;

data want(drop=position: start length_of_x i x x1_tracker);
  set have;
  length_of_x=length(x);
  array long_string[*] $ position1 - position4000;
  do i=1 to length_of_x;
    long_string[i] = substr(x,i,1);
  end;
  x1_tracker='';
  do i=1 to length_of_x;
    x1=substr(x,i,1);
    if x1 ne x1_tracker then
      do;
        start_ndx=i;
        end_ndx=i;
        start_date=start-1+i;
        x1_tracker=x1;
      end;
    else
      do;
        end_ndx=i;
      end;
    end_date=start_date+end_ndx-start_ndx;
    if i le length_of_x and long_string[i] ne long_string[i+1] then output;
  end;
run;

proc print data=want noobs;
  format start_date end_date mmddyy10. ;
  title 'Result for test case';
run;

arorata · Posted 09-10-2018 07:43 AM

Hey everyone,

Thanks for all your responses. These were very helpful! I used parts from most of these and came up with following:

data want;
   set have;
   format start_date end_date start mmddyy10.;
   str_end=length(x);
   do i=1 to str_end-1;
       char=substr(x, i, 1);
           if char ne "0" and substr(x, i+1, 1)="0" then
               do j=i+1 to min(i, str_end) while(substr(x, j, 1)="0");
                   substr(x, j, 1)=char;
                   i=j;
           end;
   end;
   start_ndx=1;
   start_date=start+start_ndx-1;
   end_ndx = 0;
   end_date=start+end_ndx-1;
   x1=substr(x,1,1);
   do k=2 to str_end;
       if substr(x,k,1) ^= x1 then
           do;
               end_ndx = k-1;
               end_date=start+end_ndx-1;
               output;
               start_ndx = k;
               start_date=start+start_ndx-1;
               x1 = substr(x,k,1);
       end;
   end;
   end_ndx = length(x);
   end_date=start+end_ndx-1;
   output;
   drop i j k char;
run;

proc print data=want;var x x1 start start_ndx end_ndx start_date end_date;run

Ksharp · Posted 09-10-2018 10:25 AM

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;
data temp;
 set have;
 do i=1 to length(x);
  x1=char(x,i);output;
 end;
 drop x;
run;
proc summary data=temp;
 by id start x1 notsorted;
 output out=temp1;
run;
data want;
 set temp1;
 by id;
 retain lag_cum;
 if first.id then do;cum=0; cum+_freq_;start_ndx=1; end_ndx=_freq_;end;
  else do; cum+_freq_;start_ndx=lag_cum+1; end_ndx=cum;  end;
 lag_cum=cum;
format start date9.;
drop _:;
run;

How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Re: How to split a row into multiple records each time a character in a string changes

Registration is open

SAS Training: Just a Click Away