Rodcjones
Obsidian | Level 7

I have a scenario that's probably easier to illustrate through the HAVE and WANT datasets below. But here is the scenario: Individual ID is on a journey and her starting location POINT1 and starting date START are recorded, as are her ending location POINT2 and date END. ID 1 in the example below had a different POINT2 for each value of END, but ID 2 got held up at POINT B, so several entries have repeat values of POINT2 despite distinct values of END.

 

To conduct the necessary analysis, I want to reduce the dataset to rows where POINT1 <> POINT2, but capture the START value from the first row for a given value of POINT1 and, in the same row, the POINT2 and END values from the first subsequent row where POINT1 <> POINT2.

 

Like I said, the datasets probably show it better!

 

data have;
input id $ point1 $ START :mmddyy8. POINT2 $ END: MMDDYY8.;
format START MMDDYY8. END MMDDYY8.;
datalines;
1 Z 2/1/20 X 2/2/20
1 X 2/2/20 W 2/3/20
1 W 2/3/20 V 2/5/20
2 A 2/1/20 B 2/2/20
2 B 2/2/20 B 2/3/20
2 B 2/3/20 B 2/6/20
2 B 2/6/20 B 2/7/20
2 B 2/7/20 C 2/10/20
2 C 2/10/20 D 2/11/20
2 D 2/11/20 E 2/12/20
;
data WANT;
input id $ point1 $ START :mmddyy8. POINT2 $ END: MMDDYY8.;
format START MMDDYY8. END MMDDYY8.;
datalines;
1 Z 2/1/20 X 2/2/20
1 X 2/2/20 W 2/3/20
1 W 2/3/20 V 2/5/20
2 A 2/1/20 B 2/2/20
2 B 2/2/20 C 2/10/20
2 C 2/10/20 D 2/11/20
2 D 2/11/20 E 2/12/20
;

 

I use LAG to get the actual dataset into this shape, so I tried it for this task too but came up short. I also tried RETAIN, but that seems impossible or at least inefficient (i.e., requiring multiple DO loops where the number of loops depends on attributes of the data, with lots of extra steps). Any ideas? Thanks!

 

Using SAS EG 7.1 (SAS 9.4)

Accepted Solution
ChrisNZ
Tourmaline | Level 20

Like this?

 

data WANT;   
  retain _DUP _START _POINT1;
  drop _:;
  set HAVE;

  %* Hold up found, save start info;
  if POINT1=POINT2 & ^_DUP then do; _POINT1=POINT1; _START=START; _DUP=1; return; end;

  %* Hold up continues, skip;
  if POINT1=POINT2 & _DUP then return; 

  %* Hold up ends, fetch start info;
  if POINT1 ne POINT2 & _DUP then do; POINT1=_POINT1; START=_START; _DUP=0; end;

  output;
run;   
ID POINT1 START POINT2 END
1 Z 01FEB2020 X 02FEB2020
1 X 02FEB2020 W 03FEB2020
1 W 03FEB2020 V 05FEB2020
2 A 01FEB2020 B 02FEB2020
2 B 02FEB2020 C 10FEB2020
2 C 10FEB2020 D 11FEB2020
2 D 11FEB2020 E 12FEB2020

 


6 REPLIES
Quentin
Super User

I think RETAIN with by-group processing should help.

 

Sounds like you want to output one record for each unique value of ID-Point1, where the value of START is the first value and the value of END is the last value.  Can you show the code you tried with RETAIN?  You shouldn't need DO loops.

 

Another way to approach it would be PROC SQL.
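
A minimal sketch of the RETAIN with by-group idea, assuming HAVE is already in journey order within each ID and that consecutive rows sharing the same POINT1 belong to one stay. The variable _START is a hypothetical helper name, not something from the original post.

```sas
data WANT;
  set HAVE;
  /* NOTSORTED: groups are runs of consecutive rows, no sort order required */
  by ID POINT1 notsorted;
  retain _START;
  drop _START;

  /* First row of a run of the same POINT1: remember when the stay began */
  if first.POINT1 then _START = START;

  /* Last row of the run: keep its POINT2 and END, with the saved START */
  if last.POINT1 then do;
    START = _START;
    output;
  end;
run;
```

With the sample HAVE above, this should collapse the four POINT1=B rows of ID 2 into the single row 2 B 2/2/20 C 2/10/20. If an ID's data ends while the traveller is still held up, the row written for that stay will have POINT1 equal to POINT2, so an extra filter may be needed in that case.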

PGStats
Opal | Level 21

Here is one solution:

 

data have;
input id $ point1 $ START :mmddyy8. POINT2 $ END: MMDDYY8.;
format START END yymmdd10.;
datalines;
1 Z 2/1/20 X 2/2/20
1 X 2/2/20 W 2/3/20
1 W 2/3/20 V 2/5/20
2 A 2/1/20 B 2/2/20
2 B 2/2/20 B 2/3/20
2 B 2/3/20 B 2/6/20
2 B 2/6/20 B 2/7/20
2 B 2/7/20 C 2/10/20
2 C 2/10/20 D 2/11/20
2 D 2/11/20 E 2/12/20
;

data want;
do until(last.id);
	set have; by id;
	if point1 = point2 and point1 = p2 then do;
		p2 = point2;
		e = end;
		end;
	else do;
		if not missing(p2) then output;
		p1 = point1;
		s = start;
		p2 = point2;
		e = end;
		end;
	end;
output;
format s e yymmdd10.;
drop point1 start point2 end;
rename p1=point1 p2=point2 s=start e=end;
run;

proc print; var id point1 start point2 end; run;
PG
Rodcjones
Obsidian | Level 7

Thanks all for the ideas and valuable code snippets. ChrisNZ's solution is the one that I ended up going with. There are parts of it that I would like to understand better, so if you can help with these questions I'd be very appreciative.

 

1. The code has no problem keeping the IDs straight/not mixing the rows of ID 1 and 2. It's not clear to me why? Is it because the dataset is presorted? But even so, how does it know not to analyze these two rows because they pertain to different IDs?


1 W 2/3/20 V 2/5/20
2 A 2/1/20 B 2/2/20

 

2. This is probably just a vocabulary thing, but I didn't follow what "holp up" referred to in the comments, e.g.,

 %* Holp up found, save start info;

 

Thanks again.

ChrisNZ
Tourmaline | Level 20

1. Good point. If the data contained

1 W 2/3/20 W 2/5/20
2 A 2/1/20 B 2/2/20

then the logic would fail as it only checks whether the points are the same, and not the ID.

If this condition can arise, a further check is warranted, using either a BY statement or lag(ID); the latter is possibly a tad faster.
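
One way the BY-statement check might look, as a sketch only: it assumes HAVE is sorted by ID and simply resets the hold-up flag at the start of each ID. Everything else is the accepted solution unchanged.

```sas
data WANT;
  retain _DUP _START _POINT1;
  drop _:;
  set HAVE;
  by ID;  /* assumption: HAVE is sorted by ID */

  /* New ID: forget any hold-up still pending from the previous ID */
  if first.ID then _DUP=0;

  /* Hold up found, save start info */
  if POINT1=POINT2 & ^_DUP then do; _POINT1=POINT1; _START=START; _DUP=1; return; end;

  /* Hold up continues, skip */
  if POINT1=POINT2 & _DUP then return;

  /* Hold up ends, fetch start info */
  if POINT1 ne POINT2 & _DUP then do; POINT1=_POINT1; START=_START; _DUP=0; end;

  output;
run;
```

Note that with this reset, a hold-up that is still open on the last row of an ID is dropped without producing an output row, rather than leaking its START and POINT1 into the next ID.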

 

2. Argh. Ugly typo! I fixed it. I realise I am slowly becoming blind to typos. 😞

Tom
Super User

Probably easier if you don't eliminate the time they spend stopped.

proc sort data=have;
 by id start end ;
run;

data want;
do until(last.point2);
  set have;
  by id point1 point2 notsorted;
  s=min(s,start);
  e=max(e,end);
end;
start=s;
end=e;
drop s e;
run;
Obs    id    point1         START    POINT2           END

 1     1       Z       2020-02-01      X       2020-02-02
 2     1       X       2020-02-02      W       2020-02-03
 3     1       W       2020-02-03      V       2020-02-05
 4     2       A       2020-02-01      B       2020-02-02
 5     2       B       2020-02-02      B       2020-02-07
 6     2       B       2020-02-07      C       2020-02-10
 7     2       C       2020-02-10      D       2020-02-11
 8     2       D       2020-02-11      E       2020-02-12
