Solved: Merging rows based on date and location

cbagdon-cox · Posted 08-11-2020 02:33 PM

Trying to merge rows in which an individual was at the same location but has multiple date ranges resulting in multiple rows. Want to have one row per location with the first start date and last end date. Having a blank end date means they are still at that location and that would need to be kept.

I don't even know where to start with this merge. Open to DATA and PROC SQL steps. This data has approximately 500 rows currently for approximately 60 individuals. The reason the data populates like this is because there are separate bed assignments for each row, that level of data is not needed for this output, just location and the dates.

What my data currently looks like:

FULLNAME	START	END	LOCATION
SMITH, JOHN	14Sep2010	15Sep2010	CEDAR
SMITH, JOHN	15Sep2010	28Sep2010	CEDAR
SMITH, JOHN	28Sep2010	27Oct2010	CEDAR
SMITH, JOHN	28Oct2010	20Apr2011	OAK
SMITH, JOHN	20Apr2011	02May2011	OAK
SMITH, JOHN	02May2011	17May2011	OAK
SMITH, JOHN	17May2011	18Jul2011	PINE-A
SMITH, JOHN	18Jul2011	03Oct2011	PINE-A
SMITH, JOHN	03Oct2011	13Dec2011	PINE-A
SMITH, JOHN	13Dec2011	31Jan2012	OAK
SMITH, JOHN	31Jan2012	08Feb2012	OAK
SMITH, JOHN	08Feb2012	08Aug2012	OAK
SMITH, JOHN	08Aug2012	17Dec2012	OAK
MILLER, SALLY	21Aug2015	26Aug2015	PINE-B
MILLER, SALLY	26Aug2015	22Sep2015	CEDAR
MILLER, SALLY	22Sep2015	14Oct2015	CEDAR
MILLER, SALLY	14Oct2015	22Nov2015	CEDAR
MILLER, SALLY	22Nov2015	04Dec2015	CEDAR
MILLER, SALLY	04Dec2015	10Dec2015	CEDAR
MILLER, SALLY	10Dec2015	10Mar2016	OAK
MILLER, SALLY	10Mar2016	29Dec2017	OAK
MILLER, SALLY	29Dec2017	10May2018	OAK
MILLER, SALLY	10May2018	01Nov2018	OAK
MILLER, SALLY	01Nov2018		OAK
JOHNSON, ROBERT	16Apr2017	17Apr2017	MAPLE-A
JOHNSON, ROBERT	17Apr2017	08May2017	MAPLE-A
JOHNSON, ROBERT	08May2017	22May2017	MAPLE-B
JOHNSON, ROBERT	22May2017	31May2017	WILLOW
JOHNSON, ROBERT	31May2017	08Jun2017	WILLOW
JOHNSON, ROBERT	08Jun2017	02Aug2017	WILLOW
JOHNSON, ROBERT	02Aug2017	07Aug2017	MAPLE-C
JOHNSON, ROBERT	07Aug2017	11Sep2017	WILLOW
JOHNSON, ROBERT	11Sep2017	03Apr2018	WILLOW

What I want my data to look like:

FULLNAME	START	END	LOCATION
SMITH, JOHN	14Sep2010	27Oct2010	CEDAR
SMITH, JOHN	28Oct2010	17May2011	OAK
SMITH, JOHN	17May2011	13Dec2011	PINE-A
SMITH, JOHN	13Dec2011	17Dec2012	OAK
MILLER, SALLY	21Aug2015	26Aug2015	PINE-B
MILLER, SALLY	26Aug2015	10Dec2015	CEDAR
MILLER, SALLY	10Dec2015		OAK
JOHNSON, ROBERT	16Apr2017	08May2017	MAPLE-A
JOHNSON, ROBERT	08May2017	22May2017	MAPLE-B
JOHNSON, ROBERT	22May2017	02Aug2017	WILLOW
JOHNSON, ROBERT	02Aug2017	07Aug2017	MAPLE-C

cbagdon-cox · Posted 08-12-2020 12:47 PM

Here was the code that ended up working for this. Essentially had to make a new ID for each stay and then sort by min start date and max end date based on those IDs. Also a step to account for the missing value. Thanks for the assistance!

data want1;
set have;
retain ConsecID 0;
ConsecID= ifn(lag(fullname)^=fullname or lag(location)^=location, ConsecID+1, ConsecID);
run;

PROC SQL;
CREATE TABLE want2 AS
SELECT DISTINCT 
  FULLNAME
, min(start) as STARTDATE format=date9.
, case when (max(coalesce(end, '01JAN2099'd)))='01JAN2099'd 
	then . 
	else (max(end)) 
	end as NEWEND format=date9.
, location
FROM WORK.want1 AS A
	group by consecID
	order by fullname
;quit;

View solution in original post

novinosrin · Posted 08-11-2020 02:37 PM

proc sql;
create table want as
select fullname. min(start) as start format=date9.,max(end) as end format=date9.,
location
from your_table
group by location,fullname;
quit;

cbagdon-cox · Posted 08-11-2020 06:02 PM

Almost worked but not quite. Two issues:

1. Doesn't account for blank end dates.

2. It merges multiple stays at a location into one with the first stays start date and the last stays ends date. I need to know the start and end date for each stay at a location.

For example, John Smith stayed at Oak on two occasions, I need the start and end date for each of those stays as opposed to the start date of the first stay and the end date of the last stay.

ballardw · Posted 08-11-2020 07:07 PM

PLEASE provide data in the form of a data step. I have made a short bit from your post but I really don't like having to spend a lot time manually fixing posted data.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

See if this matches your expectation:

data have;
infile datalines dlm='|';
informat fullname $25. start end date9. location $10.;
format start end date9.;
input FULLNAME	START	END	LOCATION;
datalines;
SMITH, JOHN|14Sep2010|15Sep2010|CEDAR
SMITH, JOHN|15Sep2010|28Sep2010|CEDAR
SMITH, JOHN|28Sep2010|27Oct2010|CEDAR
SMITH, JOHN|28Oct2010|20Apr2011|OAK
SMITH, JOHN|20Apr2011|02May2011|OAK
SMITH, JOHN|02May2011|17May2011|OAK
SMITH, JOHN|17May2011|18Jul2011|PINE-A
SMITH, JOHN|18Jul2011|03Oct2011|PINE-A
SMITH, JOHN|03Oct2011|13Dec2011|PINE-A
SMITH, JOHN|13Dec2011|31Jan2012|OAK
SMITH, JOHN|31Jan2012|08Feb2012|OAK
SMITH, JOHN|08Feb2012|08Aug2012|OAK
SMITH, JOHN|08Aug2012|17Dec2012|OAK
;

proc sort data=have;
  by fullname location start;
run;

data want;
   set have;
   by fullname location;
   lend= lag(end);
   retain locstart;
   if first.location then locstart=start;
   else if start ne lend then do;
   /* break in sequence*/
      output;
      locstart = start;
   end;
   else if last.location then output;
   drop lend locstart;
run;

Because of the requirements you state you may need to sort the output Want data set differently afterwards.

LAG lets you keep a value from the last record.

Retain keeps a specified variable value across records until reset.

The BY statement creates automatic variables First. and Last to indicate if the current record is the first or last of a group (can be both if there is only one) so you can conditionally reset values or do execute other statements such as the output to the data set.

cbagdon-cox · Posted 08-11-2020 07:28 PM

I apologize for the data display. I am working with protected data and therefore cannot provide the actual data I am working with, the names and locations have been changed.

While the code ran, the output still had issues. In some cases stays at locations were omitted completely and in others there were still multiple rows for a single stay at a location. This also does not account for the blank end dates that need to be kept.

ballardw · Posted 08-11-2020 07:41 PM

@cbagdon-cox wrote:

I apologize for the data display. I am working with protected data and therefore cannot provide the actual data I am working with, the names and locations have been changed.

While the code ran, the output still had issues. In some cases stays at locations were omitted completely and in others there were still multiple rows for a single stay at a location. This also does not account for the blank end dates that need to be kept.

Code is just as good as example data and clearly stated requirements.

You now have an example of how to provide data. Provide some that matches your data better.

Since you did not provide any missing ends there was nothing to test.

An appropriate something like this may be needed:

if missing (end) then do;

output;

call missing(locstart);

/* and maybe other stuff in here*/

end;

But without an example this a guess just like any other condition that was not provided in example input and output.

cbagdon-cox · Posted 08-11-2020 07:48 PM

Please see original post and example (though incorrectly formatted) data. Missing end dates are mentioned in the question and shown in the example.

I have no code as I didn't know where to start with this task.

First time poster just trying to get some help, not a scolding.

novinosrin · Posted 08-11-2020 08:00 PM

Hi @cbagdon-cox Please try this-

data have;
input FULLNAME & $30.	(START	END) (:date9.)	LOCATION $;
format START	END date9.;
cards;
SMITH, JOHN	  14Sep2010	15Sep2010	CEDAR
SMITH, JOHN	  15Sep2010	28Sep2010	CEDAR
SMITH, JOHN	  28Sep2010	27Oct2010	CEDAR
SMITH, JOHN	  28Oct2010	20Apr2011	OAK
SMITH, JOHN	  20Apr2011	02May2011	OAK
SMITH, JOHN	  02May2011	17May2011	OAK
SMITH, JOHN	  17May2011	18Jul2011	PINE-A
SMITH, JOHN	  18Jul2011	03Oct2011	PINE-A
SMITH, JOHN	  03Oct2011	13Dec2011	PINE-A
SMITH, JOHN	  13Dec2011	31Jan2012	OAK
SMITH, JOHN	  31Jan2012	08Feb2012	OAK
SMITH, JOHN	  08Feb2012	08Aug2012	OAK
SMITH, JOHN	  08Aug2012	17Dec2012	OAK
MILLER, SALLY	21Aug2015	26Aug2015	PINE-B
MILLER, SALLY	26Aug2015	22Sep2015	CEDAR
MILLER, SALLY	22Sep2015	14Oct2015	CEDAR
MILLER, SALLY	14Oct2015	22Nov2015	CEDAR
MILLER, SALLY	22Nov2015	04Dec2015	CEDAR
MILLER, SALLY	04Dec2015	10Dec2015	CEDAR
MILLER, SALLY	10Dec2015	10Mar2016	OAK
MILLER, SALLY	10Mar2016	29Dec2017	OAK
MILLER, SALLY	29Dec2017	10May2018	OAK
MILLER, SALLY	10May2018	01Nov2018	OAK
MILLER, SALLY	01Nov2018	 .	OAK
JOHNSON, ROBERT	 16Apr2017	17Apr2017	MAPLE-A
JOHNSON, ROBERT	 17Apr2017	08May2017	MAPLE-A
JOHNSON, ROBERT	 08May2017	22May2017	MAPLE-B
JOHNSON, ROBERT	 22May2017	31May2017	WILLOW
JOHNSON, ROBERT	 31May2017	08Jun2017	WILLOW
JOHNSON, ROBERT	 08Jun2017	02Aug2017	WILLOW
JOHNSON, ROBERT	 02Aug2017	07Aug2017	MAPLE-C
JOHNSON, ROBERT	 07Aug2017	11Sep2017	WILLOW
JOHNSON, ROBERT	  11Sep2017	03Apr2018	WILLOW
;

data  want;
 do  until(last.location);
  set have;
  by fullname location notsorted;
  if first.location then _n_=start;
 end;
 start=_n_;
run;

cbagdon-cox · Posted 08-12-2020 12:47 PM

Here was the code that ended up working for this. Essentially had to make a new ID for each stay and then sort by min start date and max end date based on those IDs. Also a step to account for the missing value. Thanks for the assistance!

data want1;
set have;
retain ConsecID 0;
ConsecID= ifn(lag(fullname)^=fullname or lag(location)^=location, ConsecID+1, ConsecID);
run;

PROC SQL;
CREATE TABLE want2 AS
SELECT DISTINCT 
  FULLNAME
, min(start) as STARTDATE format=date9.
, case when (max(coalesce(end, '01JAN2099'd)))='01JAN2099'd 
	then . 
	else (max(end)) 
	end as NEWEND format=date9.
, location
FROM WORK.want1 AS A
	group by consecID
	order by fullname
;quit;

Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

Re: Merging rows based on date and location

SAS Innovate 2025: Call for Content