Trying to merge rows in which an individual was at the same location but has multiple date ranges resulting in multiple rows. Want to have one row per location with the first start date and last end date. Having a blank end date means they are still at that location and that would need to be kept.
I don't even know where to start with this merge. Open to DATA and PROC SQL steps. This data has approximately 500 rows currently for approximately 60 individuals. The reason the data populates like this is because there are separate bed assignments for each row, that level of data is not needed for this output, just location and the dates.
What my data currently looks like:
FULLNAME | START | END | LOCATION |
SMITH, JOHN | 14Sep2010 | 15Sep2010 | CEDAR |
SMITH, JOHN | 15Sep2010 | 28Sep2010 | CEDAR |
SMITH, JOHN | 28Sep2010 | 27Oct2010 | CEDAR |
SMITH, JOHN | 28Oct2010 | 20Apr2011 | OAK |
SMITH, JOHN | 20Apr2011 | 02May2011 | OAK |
SMITH, JOHN | 02May2011 | 17May2011 | OAK |
SMITH, JOHN | 17May2011 | 18Jul2011 | PINE-A |
SMITH, JOHN | 18Jul2011 | 03Oct2011 | PINE-A |
SMITH, JOHN | 03Oct2011 | 13Dec2011 | PINE-A |
SMITH, JOHN | 13Dec2011 | 31Jan2012 | OAK |
SMITH, JOHN | 31Jan2012 | 08Feb2012 | OAK |
SMITH, JOHN | 08Feb2012 | 08Aug2012 | OAK |
SMITH, JOHN | 08Aug2012 | 17Dec2012 | OAK |
MILLER, SALLY | 21Aug2015 | 26Aug2015 | PINE-B |
MILLER, SALLY | 26Aug2015 | 22Sep2015 | CEDAR |
MILLER, SALLY | 22Sep2015 | 14Oct2015 | CEDAR |
MILLER, SALLY | 14Oct2015 | 22Nov2015 | CEDAR |
MILLER, SALLY | 22Nov2015 | 04Dec2015 | CEDAR |
MILLER, SALLY | 04Dec2015 | 10Dec2015 | CEDAR |
MILLER, SALLY | 10Dec2015 | 10Mar2016 | OAK |
MILLER, SALLY | 10Mar2016 | 29Dec2017 | OAK |
MILLER, SALLY | 29Dec2017 | 10May2018 | OAK |
MILLER, SALLY | 10May2018 | 01Nov2018 | OAK |
MILLER, SALLY | 01Nov2018 | OAK | |
JOHNSON, ROBERT | 16Apr2017 | 17Apr2017 | MAPLE-A |
JOHNSON, ROBERT | 17Apr2017 | 08May2017 | MAPLE-A |
JOHNSON, ROBERT | 08May2017 | 22May2017 | MAPLE-B |
JOHNSON, ROBERT | 22May2017 | 31May2017 | WILLOW |
JOHNSON, ROBERT | 31May2017 | 08Jun2017 | WILLOW |
JOHNSON, ROBERT | 08Jun2017 | 02Aug2017 | WILLOW |
JOHNSON, ROBERT | 02Aug2017 | 07Aug2017 | MAPLE-C |
JOHNSON, ROBERT | 07Aug2017 | 11Sep2017 | WILLOW |
JOHNSON, ROBERT | 11Sep2017 | 03Apr2018 | WILLOW |
What I want my data to look like:
FULLNAME | START | END | LOCATION |
SMITH, JOHN | 14Sep2010 | 27Oct2010 | CEDAR |
SMITH, JOHN | 28Oct2010 | 17May2011 | OAK |
SMITH, JOHN | 17May2011 | 13Dec2011 | PINE-A |
SMITH, JOHN | 13Dec2011 | 17Dec2012 | OAK |
MILLER, SALLY | 21Aug2015 | 26Aug2015 | PINE-B |
MILLER, SALLY | 26Aug2015 | 10Dec2015 | CEDAR |
MILLER, SALLY | 10Dec2015 | OAK | |
JOHNSON, ROBERT | 16Apr2017 | 08May2017 | MAPLE-A |
JOHNSON, ROBERT | 08May2017 | 22May2017 | MAPLE-B |
JOHNSON, ROBERT | 22May2017 | 02Aug2017 | WILLOW |
JOHNSON, ROBERT | 02Aug2017 | 07Aug2017 | MAPLE-C |
Here was the code that ended up working for this. Essentially had to make a new ID for each stay and then sort by min start date and max end date based on those IDs. Also a step to account for the missing value. Thanks for the assistance!
data want1;
set have;
retain ConsecID 0;
ConsecID= ifn(lag(fullname)^=fullname or lag(location)^=location, ConsecID+1, ConsecID);
run;
PROC SQL;
CREATE TABLE want2 AS
SELECT DISTINCT
FULLNAME
, min(start) as STARTDATE format=date9.
, case when (max(coalesce(end, '01JAN2099'd)))='01JAN2099'd
then .
else (max(end))
end as NEWEND format=date9.
, location
FROM WORK.want1 AS A
group by consecID
order by fullname
;quit;
proc sql;
create table want as
select fullname. min(start) as start format=date9.,max(end) as end format=date9.,
location
from your_table
group by location,fullname;
quit;
Almost worked but not quite. Two issues:
1. Doesn't account for blank end dates.
2. It merges multiple stays at a location into one with the first stays start date and the last stays ends date. I need to know the start and end date for each stay at a location.
For example, John Smith stayed at Oak on two occasions, I need the start and end date for each of those stays as opposed to the start date of the first stay and the end date of the last stay.
PLEASE provide data in the form of a data step. I have made a short bit from your post but I really don't like having to spend a lot time manually fixing posted data.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.
See if this matches your expectation:
data have; infile datalines dlm='|'; informat fullname $25. start end date9. location $10.; format start end date9.; input FULLNAME START END LOCATION; datalines; SMITH, JOHN|14Sep2010|15Sep2010|CEDAR SMITH, JOHN|15Sep2010|28Sep2010|CEDAR SMITH, JOHN|28Sep2010|27Oct2010|CEDAR SMITH, JOHN|28Oct2010|20Apr2011|OAK SMITH, JOHN|20Apr2011|02May2011|OAK SMITH, JOHN|02May2011|17May2011|OAK SMITH, JOHN|17May2011|18Jul2011|PINE-A SMITH, JOHN|18Jul2011|03Oct2011|PINE-A SMITH, JOHN|03Oct2011|13Dec2011|PINE-A SMITH, JOHN|13Dec2011|31Jan2012|OAK SMITH, JOHN|31Jan2012|08Feb2012|OAK SMITH, JOHN|08Feb2012|08Aug2012|OAK SMITH, JOHN|08Aug2012|17Dec2012|OAK ; proc sort data=have; by fullname location start; run; data want; set have; by fullname location; lend= lag(end); retain locstart; if first.location then locstart=start; else if start ne lend then do; /* break in sequence*/ output; locstart = start; end; else if last.location then output; drop lend locstart; run;
Because of the requirements you state you may need to sort the output Want data set differently afterwards.
LAG lets you keep a value from the last record.
Retain keeps a specified variable value across records until reset.
The BY statement creates automatic variables First. and Last to indicate if the current record is the first or last of a group (can be both if there is only one) so you can conditionally reset values or do execute other statements such as the output to the data set.
I apologize for the data display. I am working with protected data and therefore cannot provide the actual data I am working with, the names and locations have been changed.
While the code ran, the output still had issues. In some cases stays at locations were omitted completely and in others there were still multiple rows for a single stay at a location. This also does not account for the blank end dates that need to be kept.
@cbagdon-cox wrote:
I apologize for the data display. I am working with protected data and therefore cannot provide the actual data I am working with, the names and locations have been changed.
While the code ran, the output still had issues. In some cases stays at locations were omitted completely and in others there were still multiple rows for a single stay at a location. This also does not account for the blank end dates that need to be kept.
Code is just as good as example data and clearly stated requirements.
You now have an example of how to provide data. Provide some that matches your data better.
Since you did not provide any missing ends there was nothing to test.
An appropriate something like this may be needed:
if missing (end) then do;
output;
call missing(locstart);
/* and maybe other stuff in here*/
end;
But without an example this a guess just like any other condition that was not provided in example input and output.
Hi @cbagdon-cox Please try this-
data have;
input FULLNAME & $30. (START END) (:date9.) LOCATION $;
format START END date9.;
cards;
SMITH, JOHN 14Sep2010 15Sep2010 CEDAR
SMITH, JOHN 15Sep2010 28Sep2010 CEDAR
SMITH, JOHN 28Sep2010 27Oct2010 CEDAR
SMITH, JOHN 28Oct2010 20Apr2011 OAK
SMITH, JOHN 20Apr2011 02May2011 OAK
SMITH, JOHN 02May2011 17May2011 OAK
SMITH, JOHN 17May2011 18Jul2011 PINE-A
SMITH, JOHN 18Jul2011 03Oct2011 PINE-A
SMITH, JOHN 03Oct2011 13Dec2011 PINE-A
SMITH, JOHN 13Dec2011 31Jan2012 OAK
SMITH, JOHN 31Jan2012 08Feb2012 OAK
SMITH, JOHN 08Feb2012 08Aug2012 OAK
SMITH, JOHN 08Aug2012 17Dec2012 OAK
MILLER, SALLY 21Aug2015 26Aug2015 PINE-B
MILLER, SALLY 26Aug2015 22Sep2015 CEDAR
MILLER, SALLY 22Sep2015 14Oct2015 CEDAR
MILLER, SALLY 14Oct2015 22Nov2015 CEDAR
MILLER, SALLY 22Nov2015 04Dec2015 CEDAR
MILLER, SALLY 04Dec2015 10Dec2015 CEDAR
MILLER, SALLY 10Dec2015 10Mar2016 OAK
MILLER, SALLY 10Mar2016 29Dec2017 OAK
MILLER, SALLY 29Dec2017 10May2018 OAK
MILLER, SALLY 10May2018 01Nov2018 OAK
MILLER, SALLY 01Nov2018 . OAK
JOHNSON, ROBERT 16Apr2017 17Apr2017 MAPLE-A
JOHNSON, ROBERT 17Apr2017 08May2017 MAPLE-A
JOHNSON, ROBERT 08May2017 22May2017 MAPLE-B
JOHNSON, ROBERT 22May2017 31May2017 WILLOW
JOHNSON, ROBERT 31May2017 08Jun2017 WILLOW
JOHNSON, ROBERT 08Jun2017 02Aug2017 WILLOW
JOHNSON, ROBERT 02Aug2017 07Aug2017 MAPLE-C
JOHNSON, ROBERT 07Aug2017 11Sep2017 WILLOW
JOHNSON, ROBERT 11Sep2017 03Apr2018 WILLOW
;
data want;
do until(last.location);
set have;
by fullname location notsorted;
if first.location then _n_=start;
end;
start=_n_;
run;
Here was the code that ended up working for this. Essentially had to make a new ID for each stay and then sort by min start date and max end date based on those IDs. Also a step to account for the missing value. Thanks for the assistance!
data want1;
set have;
retain ConsecID 0;
ConsecID= ifn(lag(fullname)^=fullname or lag(location)^=location, ConsecID+1, ConsecID);
run;
PROC SQL;
CREATE TABLE want2 AS
SELECT DISTINCT
FULLNAME
, min(start) as STARTDATE format=date9.
, case when (max(coalesce(end, '01JAN2099'd)))='01JAN2099'd
then .
else (max(end))
end as NEWEND format=date9.
, location
FROM WORK.want1 AS A
group by consecID
order by fullname
;quit;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.