Hi All,
I have a dataset test1 which has 6 columns like below and need output test2
Test1
Credit_type | Location | Ptype | Price | St_date | end_date |
IC | South | AC | 10 | 1/1/2020 | 1/31/2020 |
IC | South | AC | 20 | 2/1/2020 | 2/28/2020 |
RC | RC | SC | 5 | 1/1/2024 | 1/31/2024 |
RC | RC | SC | 6 | 2/1/2024 | 2/28/2024 |
RC | RC | SC | 5 | 3/1/2024 | 3/31/2024 |
RC | RC | SC | 5 | 4/1/2024 | 4/30/2024 |
RC | RC | SC | 5 | 5/1/2024 | 5/31/2024 |
First check from Credit_type , Location , Ptype , Price are same from one row to another row .
if they are different no need to compress like row1,row2,row3 and row4 .
But when we check row 5 to row 6 it is same from Credit_type to Price
Now check end date of row 5 which is 3/31/2024 to start date of row 6 which is 4/1/2024 where the difference is one day if yes then change end date of row 5 to 4/30/2024 and delete row 6
Now again check row 5 to row 7 it is same from Credit_type to Price
Now check end date of row 5 which is 4/30/2024 to start date of row 6 which is 5/1/2024 where the difference is one day if yes then change end date of row 5 to 5/31/2024 and delete row 7
and process goes on and create output like below
Can anyone please help ?
Thank you
Test2
Credit_type | Location | Ptype | Price | St_date | end_date |
IC | South | AC | 10 | 1/1/2020 | 1/31/2020 |
IC | South | AC | 20 | 2/1/2020 | 2/28/2020 |
RC | RC | SC | 5 | 1/1/2024 | 1/31/2024 |
RC | RC | SC | 6 | 2/1/2024 | 2/28/2024 |
RC | RC | SC | 5 | 3/1/2024 | 5/31/2024 |
Assuming the data are already grouped, then proc summary will provide the minimum st_date and maximum end_date with a minimum of code:
data have;
input Credit_type $ Location $ Ptype $ Price St_date :mmddyy. end_date :mmddyy.;
format st_date end_date yymmdd10.;
cards;
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/1/2024 4/30/2024
RC RC SC 5 5/1/2024 5/31/2024
run;
proc summary data=have ;
by credit_type location ptype price notsorted;
output out=want (drop=_type_ _freq_) min(st_date)=st_date max(end_date)=end_date;
run;
Editted note: Commonly, instead of the BY statement, one might identify the grouping variables in a CLASS statement. But that would collapse all matching records, even if they are not contiguous in the data set. It would combine row 3 with rows 4, 5, and 6 5, 6, and 7, which does not match your sample output.
Before even writing any code you need to provide some information about your date variables such as what TYPE are they? If you are not sure then run proc contents on your data set and show us the result.
I ask this because about half the questions on this forum manipulating or using dates turn out to be using character values or something that is not an actual SAS date value. There are many functions that work with dates to allow comparisons or computations between dates but if the values are not dates then first thing is typically to add the proper type variables.
Dates are numeric mmddyy10.
What is the purpose of this exercise? Are you just collapsing the contiguous intervals?
Looks like perhaps your data is sorted by these variables?
data have;
input Credit_type $ Location $ Ptype $ Price St_date :mmddyy. end_date :mmddyy.;
format st_date end_date yymmdd10.;
cards;
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/1/2024 4/30/2024
RC RC SC 5 5/1/2024 5/31/2024
;
proc sort;
by credit_type location ptype st_date end_date;
run;
But you want to process it as if it was sorted by just
by credit_type location ptype price;
So use the NOTSORTED keyword.
data want;
set have;
by credit_type location ptype price notsorted ;
retain start end;
if first.price then call missing(start,end);
if st_date - 1 > end then do;
if not first.price and not last.price then output;
start=min(start,st_date);
end=max(end,end_date);
end;
if last.price then output;
format start end yymmdd10.;
drop st_date end_date;
rename start=st_date end=end_date;
run;
Result:
Credit_ Obs type Location Ptype Price st_date end_date 1 IC South AC 10 2020-01-01 2020-01-31 2 IC South AC 20 2020-02-01 2020-02-28 3 RC RC SC 5 2024-01-01 2024-01-31 4 RC RC SC 6 2024-02-01 2024-02-28 5 RC RC SC 5 2024-03-01 2024-05-31
PS If you display dates as digits in either MDY order or DMY order you will confuse 50% of your audience. Use DATE or YYMMDD format to avoid confusion.
Assuming the data are already grouped, then proc summary will provide the minimum st_date and maximum end_date with a minimum of code:
data have;
input Credit_type $ Location $ Ptype $ Price St_date :mmddyy. end_date :mmddyy.;
format st_date end_date yymmdd10.;
cards;
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/1/2024 4/30/2024
RC RC SC 5 5/1/2024 5/31/2024
run;
proc summary data=have ;
by credit_type location ptype price notsorted;
output out=want (drop=_type_ _freq_) min(st_date)=st_date max(end_date)=end_date;
run;
Editted note: Commonly, instead of the BY statement, one might identify the grouping variables in a CLASS statement. But that would collapse all matching records, even if they are not contiguous in the data set. It would combine row 3 with rows 4, 5, and 6 5, 6, and 7, which does not match your sample output.
Hi @jhh197
This is a common problem when working with slowly changing dimension tables. You have a table with some information that is valid in a given time period from a Start Date to an End Date. This information is split over several rows, and you want to compress the rows into one row with the total time period. But gaps between time periods should be preserved, so the compression should only be done on rows with consecutive time periods.
It is difficult, if not impossible, to handle with SQL steps and SAS procedures, so the good old Data Step is the Swiss Army Knife here. The following example data + code is taken from a lecture I gave recently, and my native tongue is Danish, but the code should be understandable anyway, Afdeling is Departmenti English. It is based on an SQL step to order data and a Data Step with look-ahead through two Set statements to perform the compression.
* Testdata;
data medarbejder_historik;
infile datalines;
length ID 8 PersonID 8 Personnavn $34 Afdeling $60 Dato_Start 8 Dato_Slut 8;
informat Dato_Start Dato_Slut ddmmyy10.;
format Dato_Start Dato_Slut ddmmyyd10.;
input ID @7 PersonID @14 Personnavn $char16. @30 Afdeling $char42. @73 Dato_Start Dato_Slut;
datalines;
227 3144 Max Goldman Data Warehouse Team 01-06-2010 14-09-2010
493 3144 Max Goldman Data Warehouse 01-01-2020 31-03-2021
495 3144 Max Goldman Data Warehouse 01-04-2021 31-12-9999
497 6731 Bruce Banner Team 2 Økonomisk Styring og Analyse 01-06-2012 30-06-2012
665 6921 Raymond Babbitt Data Warehouse 17-11-2008 31-05-2010
671 3144 Max Goldman Data Warehouse Team 01-07-2011 31-07-2012
1032 6731 Bruce Banner ØSA Baglandet 01-09-2013 31-12-2014
1245 6921 Raymond Babbitt Data og Analyse 01-04-2013 31-08-2013
1355 7693 John Gustafson Data Warehouse 01-04-2021 31-12-9999
1665 7693 John Gustafson Data og Analyse 01-09-2013 28-02-2014
1680 3144 Max Goldman Data og Analyse 01-09-2013 28-02-2014
1761 6921 Raymond Babbitt Data Warehouse 01-01-2016 30-09-2016
1849 6731 Bruce Banner Team Økonomistyring Øst 01-01-2015 31-12-2015
1995 3144 Max Goldman Data og Analyse 01-04-2013 31-07-2013
2265 7693 John Gustafson Data Warehouse 01-04-2018 31-12-2019
2593 6731 Bruce Banner OKÆ Økonomi 25-05-1972 04-05-2009
2726 6731 Bruce Banner Data Warehouse 01-04-2018 31-12-2019
2872 6921 Raymond Babbitt Data og Analyse 01-09-2014 31-12-2015
2971 3144 Max Goldman Data Warehouse Team 01-08-2012 31-03-2013
2972 6921 Raymond Babbitt Data Warehouse 05-05-2017 10-05-2017
2987 3144 Max Goldman Data Warehouse Team 15-09-2010 30-06-2011
3433 6921 Raymond Babbitt Data Warehouse Team 01-06-2010 14-09-2010
3770 7693 John Gustafson Data Warehouse 01-02-2017 04-05-2017
3910 3144 Max Goldman Data Warehouse 05-05-2017 10-05-2017
3998 6731 Bruce Banner OKÆ Økonomi 05-05-2009 19-05-2009
4048 7693 John Gustafson Data Warehouse 01-01-2016 31-01-2017
4127 3144 Max Goldman Data og Analyse 01-03-2014 31-08-2014
4332 6921 Raymond Babbitt Data Warehouse 30-11-2016 31-01-2017
4398 6921 Raymond Babbitt Data Warehouse 01-01-2020 31-03-2021
4459 7693 John Gustafson IT-udvikling og løsninger 01-01-1970 16-11-2008
4534 3144 Max Goldman Data og Analyse 01-08-2013 31-08-2013
4724 7693 John Gustafson Data Warehouse 05-05-2017 10-05-2017
4757 3144 Max Goldman Data og Analyse 01-09-2014 31-12-2015
5066 6921 Raymond Babbitt Data Warehouse 11-05-2017 31-03-2018
5113 6921 Raymond Babbitt Data Warehouse 28-11-2016 29-11-2016
5223 7693 John Gustafson Data Warehouse 01-01-2020 31-03-2021
5238 7693 John Gustafson Data Warehouse 17-11-2008 31-05-2010
5316 6731 Bruce Banner Økonomi, Team Data og Ledelsesinformation 01-01-2017 30-04-2017
5429 6731 Bruce Banner Team 2 01-01-2012 31-05-2012
5585 7693 John Gustafson Data og Analyse 01-03-2014 31-08-2014
5687 6921 Raymond Babbitt Data Warehouse 01-04-2021 31-12-9999
5748 7693 John Gustafson Data og Analyse 01-04-2013 31-08-2013
5825 3144 Max Goldman Data Warehouse 01-02-2017 04-05-2017
5896 6731 Bruce Banner Data Warehouse 05-05-2017 10-05-2017
5903 6921 Raymond Babbitt Data Warehouse 01-02-2017 04-05-2017
6345 6731 Bruce Banner Data Warehouse 01-05-2017 04-05-2017
6650 3144 Max Goldman Data Warehouse 01-01-2016 31-01-2017
6795 6921 Raymond Babbitt Data og Analyse 01-09-2013 28-02-2014
6882 6921 Raymond Babbitt Data og Analyse 01-03-2014 31-08-2014
6899 6921 Raymond Babbitt Data Warehouse 01-04-2018 31-12-2019
7128 6921 Raymond Babbitt IT-udvikling og løsninger 01-01-1970 16-11-2008
7128 7693 John Gustafson Data Warehouse Team 01-06-2010 14-09-2010
7283 6731 Bruce Banner Data Warehouse 11-05-2017 31-03-2018
7384 6731 Bruce Banner Data Warehouse 01-01-2020 31-03-2021
8193 3144 Max Goldman Data Warehouse 17-11-2008 31-05-2010
8449 7693 John Gustafson Data og Analyse 01-09-2014 31-12-2015
8533 6921 Raymond Babbitt Data Warehouse Team 01-07-2011 31-03-2013
8711 6921 Raymond Babbitt Data Warehouse Team 15-09-2010 30-06-2011
9004 6731 Bruce Banner Data Warehouse 01-04-2021 31-12-9999
9216 6731 Bruce Banner Team Effekt og Analyse 01-01-2016 31-12-2016
9291 7693 John Gustafson Data Warehouse 11-05-2017 31-03-2018
9312 3144 Max Goldman Data Warehouse 11-05-2017 31-03-2018
9346 7693 John Gustafson Data Warehouse Team 15-09-2010 30-06-2011
9370 6731 Bruce Banner Økonomiteam 01-12-2009 31-12-2011
9469 3144 Max Goldman IT-udvikling og løsninger 01-01-1970 16-11-2008
9570 7693 John Gustafson Data Warehouse Team 01-07-2011 31-03-2013
9692 6731 Bruce Banner Team 2 Økonomisk Styring og Analyse 01-07-2012 31-08-2013
9700 6731 Bruce Banner OKÆ Økonomi og Borger 20-05-2009 30-11-2009
9767 3144 Max Goldman Data Warehouse 01-04-2018 31-12-2019
9767 3144 Max Goldman Data Warehouse 01-04-2022 30-07-2022
;
run;
/******************************************************************************/
/* Komprimering af disse testdata til et velformet historikdatasæt */
/******************************************************************************/
proc sql;
create table w1 as
select PersonID, Personnavn, Afdeling, Dato_Start, Dato_Slut
from medarbejder_historik
order by PersonID, Personnavn, Afdeling, Dato_Start, Dato_Slut;
quit;
data person_afdeling_renset (drop = thisgroupstart thisgroupslut NextID NextAfd NextStart NextSlut);
set w1 end=end_of_data;
retain thisgroupstart thisgroupslut;
by PersonID Personnavn Afdeling;
if not end_of_data then
set w1 (
firstobs=2
drop= Personnavn
rename=(PersonID=NextID Afdeling=NextAfd Dato_Start=NextStart Dato_Slut=NextSlut)
);
if first.Afdeling then do;
thisgroupstart = Dato_Start;
thisgroupslut = Dato_Slut;
end;
if not last.Afdeling then do;
if Dato_Slut < NextStart - 1 then do;
Dato_Start = thisgroupstart;
output;
thisgroupstart = NextStart;
end;
thisgroupslut = max(Dato_Slut, thisgroupslut);
end;
if last.Afdeling then do;
dato_start = thisgroupstart;
dato_slut = max(Dato_Slut, thisgroupslut);
output;
end;
run;
proc sort data=person_afdeling_renset; by PersonID Dato_Start;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.