BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
arorata
Fluorite | Level 6

Hi I have a data that looks like this:

data have;

   id=1;

   X="111110000110AAAB";

   start="01jan2006"d;

   output;

run;

I want to parse X in such a way that I get a resulting data that looks like this:

 id      x1  start_ndx    end_ndx   start_date           end_date

 1        1       1                5            01/01/2006         01/05/2006

 1        0       6                9            01/06/2006         01/09/2006

 1        1     10               11           01/10/2006         01/11/2006

 1        0     12               12           01/12/2006         01/12/2006

 1        A     13               15           01/13/2006         01/15/2006

 1        B     16               16           01/16/2006         01/16/2006

 

Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.

Thanks for the help

1 ACCEPTED SOLUTION

Accepted Solutions
SuzanneDorinski
Lapis Lazuli | Level 10

I added an array, but this probably isn't the most efficient way to do this.

 

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

proc print data=have;
  title 'Test case';
run;

data want(drop=position: start length_of_x i x x1_tracker);
  set have;
  length_of_x=length(x);
  array long_string[*] $ position1 - position4000;
  do i=1 to length_of_x;
    long_string[i] = substr(x,i,1);
  end;
  x1_tracker='';
  do i=1 to length_of_x;
    x1=substr(x,i,1);
    if x1 ne x1_tracker then
      do;
        start_ndx=i;
        end_ndx=i;
        start_date=start-1+i;
        x1_tracker=x1;
      end;
    else
      do;
        end_ndx=i;
      end;
    end_date=start_date+end_ndx-start_ndx;
    if i le length_of_x and long_string[i] ne long_string[i+1] then output;
  end;
run;

proc print data=want noobs;
  format start_date end_date mmddyy10. ;
  title 'Result for test case';
run;

 

View solution in original post

10 REPLIES 10
Reeza
Super User

Will the dates vary from record to record? Can you show how this may work for another ID so that the solution is general enough to work on your data. 

 


@arorata wrote:

Hi I have a data that looks like this:

data have;

   id=1;

   X="111110000110AAAB";

   start="01jan2006"d;

   output;

run;

I want to parse X in such a way that I get a resulting data that looks like this:

 id      x1  start_ndx    end_ndx   start_date           end_date

 1        1       1                5            01/01/2006         01/05/2006

 1        0       6                9            01/06/2006         01/09/2006

 1        1     10               11           01/10/2006         01/11/2006

 1        0     12               12           01/12/2006         01/12/2006

 1        A     13               15           01/13/2006         01/15/2006

 1        B     16               16           01/16/2006         01/16/2006

 

Note that in the example the initial start is 01/01/2006 so it aligns with the start_ndx and end_ndx. This is not always the case for my data. Start could have any value (For example: 01/18/2006) and the code should still work.

Thanks for the help


 

arorata
Fluorite | Level 6

Thanks for responding. Yes both, value of X and start can be different for different ID. For example ID=2 can have data looking like this:

X="AAABB000111AABB"

start=03/24/2010;

Cheers!!

data_null__
Jade | Level 19

I think the FINDC function is helpful here.  Also the length of X needs to be at least one character longer than the longest string, otherwise the last run will not be output.

 

data have;
   id=1;
   length x $128;
   X="111110000110AAAB";
   start="01jan2006"d;
   output;
   id=2;
   X="11111 0000110AAAB111110000110A";
   start="06jan2006"d;
   output;
   run;
data want;
   set have;
   i = 1;
   do j = 1 by 1 while(i ne 0);
      t = char(x,i);
      s = findc(x,t,i,'k');
      if s then do;
         start_ndx  = i;
         end_ndx    = s-1;
         start_date = start + start_ndx -1;
         end_date   = start + end_ndx   -1;
         output;
         end;
      i = s;
      end;
   drop i s;
   format start start_date end_date date9.;
   run;

 

 

Capture.PNG

SuzanneDorinski
Lapis Lazuli | Level 10
data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

proc print data=have;
  title 'Test case';
run;

data want;
  set have;
  length_of_x=length(x);
  x1_tracker='';
  do i=1 to length_of_x;
    x1=substr(x,i,1);
    if x1 ne x1_tracker then
      do;
        start_ndx=i;
        end_ndx=i;
        start_date=start-1+i;
        x1_tracker=x1;
      end;
    else
      do;
        end_ndx=i;
      end;
    end_date=start_date+end_ndx-start_ndx;
    output;
  end;
run;

proc print data=want;
  title 'Check results of logic';
run;

data want(drop=x start length_of_x x1_tracker i);
  set want;
  by id x1 notsorted;
  if last.x1;
run;

proc print data=want noobs;
  format start_date end_date mmddyy10. ;
  title 'Result for test case';
run;

 

arorata
Fluorite | Level 6

Hi Suzanne, thanks for your reply. Your code works but the only issue is that in my actual data the variable "X" pertains to daily medication history of subjects over a period of 10 years. Actual string is of length 3652. There are 300,000 subjects so what this does is create an intermediate "person-day" file which has approximately 800 million records. The final data is manageable though. Is there a way to avoid outputting all those intermediate records? Appreciate any help.

Cheers

Reeza
Super User

Here's pretty close to what you want. I couldn't get the last record to output correctly so you'll need to play with the IF conditions a bit to get that last record - need to head out for a bit.

 

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

data want;
	set have;
	n_loop=lengthn(x);
	start_char=substr(x, 1, 1);
	start_date=start;
	end_date=start;

	do i=2 to n_loop;
		x_now=substr(x, i, 1);

		if start_char=x_now then
			do;
				end_date+1;
			end;
		else if start_char ne x_now and i ne n_loop
			then do;
				output;
				start_char=x_now;
				start_date=end_date + 1;
				end_date=start_date;
			end;
		else do;
		    end_date + 1;
			output;
	    end;
	end;
	format start_date end_date date9.;
run;

Thanks @SuzanneDorinski for the data step to mock up a solution. 

 

Hope this helps you get a little further at least. 

arorata
Fluorite | Level 6

Thanks Reeza,

I will try to play around with it and see if I can figure it out. Appreciate all the help.

SuzanneDorinski
Lapis Lazuli | Level 10

I added an array, but this probably isn't the most efficient way to do this.

 

data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;

proc print data=have;
  title 'Test case';
run;

data want(drop=position: start length_of_x i x x1_tracker);
  set have;
  length_of_x=length(x);
  array long_string[*] $ position1 - position4000;
  do i=1 to length_of_x;
    long_string[i] = substr(x,i,1);
  end;
  x1_tracker='';
  do i=1 to length_of_x;
    x1=substr(x,i,1);
    if x1 ne x1_tracker then
      do;
        start_ndx=i;
        end_ndx=i;
        start_date=start-1+i;
        x1_tracker=x1;
      end;
    else
      do;
        end_ndx=i;
      end;
    end_date=start_date+end_ndx-start_ndx;
    if i le length_of_x and long_string[i] ne long_string[i+1] then output;
  end;
run;

proc print data=want noobs;
  format start_date end_date mmddyy10. ;
  title 'Result for test case';
run;

 

arorata
Fluorite | Level 6

Hey everyone,

Thanks for all your responses. These were very helpful! I used parts from most of these and came up with following:

 

data want;
    set have;
    format start_date end_date start mmddyy10.;
    str_end=length(x);
    do i=1 to str_end-1;
        char=substr(x, i, 1);
            if char ne "0" and substr(x, i+1, 1)="0" then
                do j=i+1 to min(i, str_end) while(substr(x, j, 1)="0");
                    substr(x, j, 1)=char;
                    i=j;
            end;
    end;
    start_ndx=1;
    start_date=start+start_ndx-1;
    end_ndx = 0;
    end_date=start+end_ndx-1;
    x1=substr(x,1,1);
    do k=2 to str_end;
        if substr(x,k,1) ^= x1 then
            do;
                end_ndx = k-1;
                end_date=start+end_ndx-1;
                output;
                start_ndx = k;
                start_date=start+start_ndx-1;
                x1 = substr(x,k,1);
        end;
    end;
    end_ndx = length(x);
    end_date=start+end_ndx-1;
    output;
    drop i j k char;
run;

proc print data=want;var x x1 start start_ndx end_ndx start_date end_date;run

 

 

Ksharp
Super User
data have;
	id=1;
	X="111110000110AAAB";
	start="01jan2006"d;
	output;
run;
data temp;
 set have;
 do i=1 to length(x);
  x1=char(x,i);output;
 end;
 drop x;
run;
proc summary data=temp;
 by id start x1 notsorted;
 output out=temp1;
run;
data want;
 set temp1;
 by id;
 retain lag_cum;
 if first.id then do;cum=0; cum+_freq_;start_ndx=1; end_ndx=_freq_;end;
  else do; cum+_freq_;start_ndx=lag_cum+1; end_ndx=cum;  end;
 lag_cum=cum;
format start date9.;
drop _:;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 969 views
  • 1 like
  • 5 in conversation