I have a list of the countries where people lived. If they changed adres within a country this appears on several lines, however, I would like to make it one line. Consider the following example:
data sample;
input ID $ Country $ Startdate :date9. Enddate :date9. Time;
format Startdate Enddate date9.;
datalines;
A France 05NOV2006 03OCT2012 6.1
A France 04OCT2012 05SEP2015 3.0
A France 06SEP2015 01JUN2016 0.8
A US 02JUN2016 18SEP2019 3.4
B France 17DEC2006 09MAY2007 0.4
B France 10MAY2007 01FEB2014 6.9
B Germany 02FEB2014 02FEB2015 1.0
B Germany 03FEB2015 02JUL2017 2.5
B France 03JUL2017 05APR2018 0.8
B US 06APR2018 18SEP2019 1.5
;run;
I want the result to create one line per block of continuous living in one country and the sum of the time spent in that country:
data result;
input ID $ Country $ Startdate :date9. Enddate :date9. Time;
format Startdate Enddate date9.;
datalines;
A France 05NOV2006 01JUN2016 9.9
A US 02JUN2016 18SEP2019 3.4
B France 17DEC2006 01FEB2014 7.3
B Germany 02FEB2014 02JUL2017 3.5
B France 03JUL2017 05APR2018 0.8
B US 06APR2018 18SEP2019 1.5
;run;
This is my basic attempt, but I can't figure out how to only get the startdate from the first line, and the enddate from the last line, and how to make sure I only combine continuous periods in one country, not when there was another country in between. Possibly something with "if first.startdate..." in a data step?
proc sql;
create table combined as
select distinct ID, Country, Startdate, Enddate, sum(Time)
from sample
group by ID, Country;
quit;
Problems like yours can be solved with a single data-step using first/last:
data work.want;
set work.sample;
by ID Country notsorted;
retain sumTime firstStartDate;
drop sumTime firstStartDate;
if first.Country then do;
sumTime = 0;
firstStartDate = StartDate;
end;
sumTime = sumTime + Time;
if last.Country then do;
StartDate = firstStartDate;
Time = sumTime;
output;
end;
run;
EDIT: Just read that non-continuous periods have to be treated in a special way, but you don't have them in the data you posted, so they aren't incorporated in my "solution", too.
Just figured I can use min and max to get the right startdate, just need to figure out how to distinguish non-continuous periods:
proc sql;
create table combined as
select distinct ID, Country, min(Startdate) as Startdate, max(Enddate) as Enddate, sum(Time) as time
from sample
group by ID, Country;
quit;
Problems like yours can be solved with a single data-step using first/last:
data work.want;
set work.sample;
by ID Country notsorted;
retain sumTime firstStartDate;
drop sumTime firstStartDate;
if first.Country then do;
sumTime = 0;
firstStartDate = StartDate;
end;
sumTime = sumTime + Time;
if last.Country then do;
StartDate = firstStartDate;
Time = sumTime;
output;
end;
run;
EDIT: Just read that non-continuous periods have to be treated in a special way, but you don't have them in the data you posted, so they aren't incorporated in my "solution", too.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.