Solved: SAS Row by row compare

jhh197 · Posted 12-09-2022 04:31 PM

Hi All,

I have a dataset test1 which has 6 columns like below and need output test2
Test1

Credit_type	Location	Ptype	Price	St_date	end_date
IC	South	AC	10	1/1/2020	1/31/2020
IC	South	AC	20	2/1/2020	2/28/2020
RC	RC	SC	5	1/1/2024	1/31/2024
RC	RC	SC	6	2/1/2024	2/28/2024
RC	RC	SC	5	3/1/2024	3/31/2024
RC	RC	SC	5	4/1/2024	4/30/2024
RC	RC	SC	5	5/1/2024	5/31/2024

First check from Credit_type , Location , Ptype , Price are same from one row to another row .
if they are different no need to compress like row1,row2,row3 and row4 .
But when we check row 5 to row 6 it is same from Credit_type to Price
Now check end date of row 5 which is 3/31/2024 to start date of row 6 which is 4/1/2024 where the difference is one day if yes then change end date of row 5 to 4/30/2024 and delete row 6
Now again check row 5 to row 7 it is same from Credit_type to Price
Now check end date of row 5 which is 4/30/2024 to start date of row 6 which is 5/1/2024 where the difference is one day if yes then change end date of row 5 to 5/31/2024 and delete row 7
and process goes on and create output like below

Can anyone please help ?

Thank you
Test2

Credit_type	Location	Ptype	Price	St_date	end_date
IC	South	AC	10	1/1/2020	1/31/2020
IC	South	AC	20	2/1/2020	2/28/2020
RC	RC	SC	5	1/1/2024	1/31/2024
RC	RC	SC	6	2/1/2024	2/28/2024
RC	RC	SC	5	3/1/2024	5/31/2024

mkeintz · Posted 12-09-2022 11:48 PM

Assuming the data are already grouped, then proc summary will provide the minimum st_date and maximum end_date with a minimum of code:

data have;
  input Credit_type $ Location $ Ptype $ Price St_date :mmddyy. end_date :mmddyy.;
  format st_date end_date yymmdd10.;
cards;
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/1/2024 4/30/2024
RC RC SC 5 5/1/2024 5/31/2024
run;

proc summary data=have   ;
    by credit_type location ptype price    notsorted;
    output out=want (drop=_type_ _freq_) min(st_date)=st_date max(end_date)=end_date;
run;

Editted note: Commonly, instead of the BY statement, one might identify the grouping variables in a CLASS statement. But that would collapse all matching records, even if they are not contiguous in the data set. It would combine row 3 with rows ~~4, 5, and 6~~ 5, 6, and 7, which does not match your sample output.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ballardw · Posted 12-09-2022 05:54 PM

Before even writing any code you need to provide some information about your date variables such as what TYPE are they? If you are not sure then run proc contents on your data set and show us the result.

I ask this because about half the questions on this forum manipulating or using dates turn out to be using character values or something that is not an actual SAS date value. There are many functions that work with dates to allow comparisons or computations between dates but if the values are not dates then first thing is typically to add the proper type variables.

jhh197 · Posted 12-09-2022 06:12 PM

Dates are numeric mmddyy10.

Tom · Posted 12-09-2022 11:15 PM

What is the purpose of this exercise? Are you just collapsing the contiguous intervals?

Looks like perhaps your data is sorted by these variables?

data have;
  input Credit_type $ Location $ Ptype $ Price St_date :mmddyy. end_date :mmddyy.;
  format st_date end_date yymmdd10.;
cards;
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/1/2024 4/30/2024
RC RC SC 5 5/1/2024 5/31/2024
;

proc sort;
  by credit_type location ptype st_date end_date;
run;

But you want to process it as if it was sorted by just

  by credit_type location ptype price;

So use the NOTSORTED keyword.

data want;
  set have;
  by credit_type location ptype price notsorted ;
  retain start end;
  if first.price then call missing(start,end);
  if st_date - 1 > end then do;
     if not first.price and not last.price then output;
     start=min(start,st_date);
     end=max(end,end_date);
  end;
  if last.price then output;
  format start end yymmdd10.;
  drop st_date end_date;
  rename start=st_date end=end_date;
run;

Result:

       Credit_
Obs     type      Location    Ptype    Price       st_date      end_date

 1       IC        South       AC        10     2020-01-01    2020-01-31
 2       IC        South       AC        20     2020-02-01    2020-02-28
 3       RC        RC          SC         5     2024-01-01    2024-01-31
 4       RC        RC          SC         6     2024-02-01    2024-02-28
 5       RC        RC          SC         5     2024-03-01    2024-05-31

PS If you display dates as digits in either MDY order or DMY order you will confuse 50% of your audience. Use DATE or YYMMDD format to avoid confusion.

jhh197 · Posted 12-10-2022 01:11 AM

Thank you so much for all your help . I will try both the solutions and update .

mkeintz · Posted 12-09-2022 11:48 PM

Assuming the data are already grouped, then proc summary will provide the minimum st_date and maximum end_date with a minimum of code:

data have;
  input Credit_type $ Location $ Ptype $ Price St_date :mmddyy. end_date :mmddyy.;
  format st_date end_date yymmdd10.;
cards;
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/1/2024 4/30/2024
RC RC SC 5 5/1/2024 5/31/2024
run;

proc summary data=have   ;
    by credit_type location ptype price    notsorted;
    output out=want (drop=_type_ _freq_) min(st_date)=st_date max(end_date)=end_date;
run;

Editted note: Commonly, instead of the BY statement, one might identify the grouping variables in a CLASS statement. But that would collapse all matching records, even if they are not contiguous in the data set. It would combine row 3 with rows ~~4, 5, and 6~~ 5, 6, and 7, which does not match your sample output.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jhh197 · Posted 12-10-2022 01:10 AM

Thank you so much for your response . Thanks a lot . I will try this and update

jhh197 · Posted 12-11-2022 04:41 PM

Hi ,
Code is working . Can you please help me when my data is like below
Credit_type Location Ptype Price St_date end_date
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/5/2024 4/30/2024
RC RC SC 5 5/1/2024 5/31/2024

For example row 5 and row 6 the price is same but the difference between enddate of row5 and start date of row 6 is greater than 1 day then the output should be like below
Credit_type Location Ptype Price St_date end_date
IC South AC 10 1/1/2020 1/31/2020
IC South AC 20 2/1/2020 2/28/2020
RC RC SC 5 1/1/2024 1/31/2024
RC RC SC 6 2/1/2024 2/28/2024
RC RC SC 5 3/1/2024 3/31/2024
RC RC SC 5 4/52024 5/31/2024

Thank you

ErikLund_Jensen · Posted 12-12-2022 05:29 AM

Hi @jhh197

This is a common problem when working with slowly changing dimension tables. You have a table with some information that is valid in a given time period from a Start Date to an End Date. This information is split over several rows, and you want to compress the rows into one row with the total time period. But gaps between time periods should be preserved, so the compression should only be done on rows with consecutive time periods.

It is difficult, if not impossible, to handle with SQL steps and SAS procedures, so the good old Data Step is the Swiss Army Knife here. The following example data + code is taken from a lecture I gave recently, and my native tongue is Danish, but the code should be understandable anyway, Afdeling is Departmenti English. It is based on an SQL step to order data and a Data Step with look-ahead through two Set statements to perform the compression.

* Testdata;
data medarbejder_historik;
  infile datalines;
  length ID 8 PersonID 8 Personnavn $34 Afdeling $60 Dato_Start 8 Dato_Slut 8;
  informat Dato_Start Dato_Slut ddmmyy10.;
  format Dato_Start Dato_Slut ddmmyyd10.;
  input ID @7 PersonID @14 Personnavn $char16. @30 Afdeling $char42. @73 Dato_Start Dato_Slut;
  datalines;
227   3144   Max Goldman     Data Warehouse Team                        01-06-2010 14-09-2010
493   3144   Max Goldman     Data Warehouse                             01-01-2020 31-03-2021
495   3144   Max Goldman     Data Warehouse                             01-04-2021 31-12-9999
497   6731   Bruce Banner    Team 2 Økonomisk Styring og Analyse        01-06-2012 30-06-2012
665   6921   Raymond Babbitt Data Warehouse                             17-11-2008 31-05-2010
671   3144   Max Goldman     Data Warehouse Team                        01-07-2011 31-07-2012
1032  6731   Bruce Banner    ØSA Baglandet                              01-09-2013 31-12-2014
1245  6921   Raymond Babbitt Data og Analyse                            01-04-2013 31-08-2013
1355  7693   John Gustafson  Data Warehouse                             01-04-2021 31-12-9999
1665  7693   John Gustafson  Data og Analyse                            01-09-2013 28-02-2014
1680  3144   Max Goldman     Data og Analyse                            01-09-2013 28-02-2014
1761  6921   Raymond Babbitt Data Warehouse                             01-01-2016 30-09-2016
1849  6731   Bruce Banner    Team Økonomistyring Øst                    01-01-2015 31-12-2015
1995  3144   Max Goldman     Data og Analyse                            01-04-2013 31-07-2013
2265  7693   John Gustafson  Data Warehouse                             01-04-2018 31-12-2019
2593  6731   Bruce Banner    OKÆ Økonomi                                25-05-1972 04-05-2009
2726  6731   Bruce Banner    Data Warehouse                             01-04-2018 31-12-2019
2872  6921   Raymond Babbitt Data og Analyse                            01-09-2014 31-12-2015
2971  3144   Max Goldman     Data Warehouse Team                        01-08-2012 31-03-2013
2972  6921   Raymond Babbitt Data Warehouse                             05-05-2017 10-05-2017
2987  3144   Max Goldman     Data Warehouse Team                        15-09-2010 30-06-2011
3433  6921   Raymond Babbitt Data Warehouse Team                        01-06-2010 14-09-2010
3770  7693   John Gustafson  Data Warehouse                             01-02-2017 04-05-2017
3910  3144   Max Goldman     Data Warehouse                             05-05-2017 10-05-2017
3998  6731   Bruce Banner    OKÆ Økonomi                                05-05-2009 19-05-2009
4048  7693   John Gustafson  Data Warehouse                             01-01-2016 31-01-2017
4127  3144   Max Goldman     Data og Analyse                            01-03-2014 31-08-2014
4332  6921   Raymond Babbitt Data Warehouse                             30-11-2016 31-01-2017
4398  6921   Raymond Babbitt Data Warehouse                             01-01-2020 31-03-2021
4459  7693   John Gustafson  IT-udvikling og løsninger                  01-01-1970 16-11-2008
4534  3144   Max Goldman     Data og Analyse                            01-08-2013 31-08-2013
4724  7693   John Gustafson  Data Warehouse                             05-05-2017 10-05-2017
4757  3144   Max Goldman     Data og Analyse                            01-09-2014 31-12-2015
5066  6921   Raymond Babbitt Data Warehouse                             11-05-2017 31-03-2018
5113  6921   Raymond Babbitt Data Warehouse                             28-11-2016 29-11-2016
5223  7693   John Gustafson  Data Warehouse                             01-01-2020 31-03-2021
5238  7693   John Gustafson  Data Warehouse                             17-11-2008 31-05-2010
5316  6731   Bruce Banner    Økonomi, Team Data og Ledelsesinformation  01-01-2017 30-04-2017
5429  6731   Bruce Banner    Team 2                                     01-01-2012 31-05-2012
5585  7693   John Gustafson  Data og Analyse                            01-03-2014 31-08-2014
5687  6921   Raymond Babbitt Data Warehouse                             01-04-2021 31-12-9999
5748  7693   John Gustafson  Data og Analyse                            01-04-2013 31-08-2013
5825  3144   Max Goldman     Data Warehouse                             01-02-2017 04-05-2017
5896  6731   Bruce Banner    Data Warehouse                             05-05-2017 10-05-2017
5903  6921   Raymond Babbitt Data Warehouse                             01-02-2017 04-05-2017
6345  6731   Bruce Banner    Data Warehouse                             01-05-2017 04-05-2017
6650  3144   Max Goldman     Data Warehouse                             01-01-2016 31-01-2017
6795  6921   Raymond Babbitt Data og Analyse                            01-09-2013 28-02-2014
6882  6921   Raymond Babbitt Data og Analyse                            01-03-2014 31-08-2014
6899  6921   Raymond Babbitt Data Warehouse                             01-04-2018 31-12-2019
7128  6921   Raymond Babbitt IT-udvikling og løsninger                  01-01-1970 16-11-2008
7128  7693   John Gustafson  Data Warehouse Team                        01-06-2010 14-09-2010
7283  6731   Bruce Banner    Data Warehouse                             11-05-2017 31-03-2018
7384  6731   Bruce Banner    Data Warehouse                             01-01-2020 31-03-2021
8193  3144   Max Goldman     Data Warehouse                             17-11-2008 31-05-2010
8449  7693   John Gustafson  Data og Analyse                            01-09-2014 31-12-2015
8533  6921   Raymond Babbitt Data Warehouse Team                        01-07-2011 31-03-2013
8711  6921   Raymond Babbitt Data Warehouse Team                        15-09-2010 30-06-2011
9004  6731   Bruce Banner    Data Warehouse                             01-04-2021 31-12-9999
9216  6731   Bruce Banner    Team Effekt og Analyse                     01-01-2016 31-12-2016
9291  7693   John Gustafson  Data Warehouse                             11-05-2017 31-03-2018
9312  3144   Max Goldman     Data Warehouse                             11-05-2017 31-03-2018
9346  7693   John Gustafson  Data Warehouse Team                        15-09-2010 30-06-2011
9370  6731   Bruce Banner    Økonomiteam                                01-12-2009 31-12-2011
9469  3144   Max Goldman     IT-udvikling og løsninger                  01-01-1970 16-11-2008
9570  7693   John Gustafson  Data Warehouse Team                        01-07-2011 31-03-2013
9692  6731   Bruce Banner    Team 2 Økonomisk Styring og Analyse        01-07-2012 31-08-2013
9700  6731   Bruce Banner    OKÆ Økonomi og Borger                      20-05-2009 30-11-2009
9767  3144   Max Goldman     Data Warehouse                             01-04-2018 31-12-2019
9767  3144   Max Goldman     Data Warehouse                             01-04-2022 30-07-2022
;
run;

/******************************************************************************/
/* Komprimering af disse testdata til et velformet historikdatasæt            */
/******************************************************************************/
  
proc sql;
    create table w1 as
    select PersonID, Personnavn, Afdeling, Dato_Start, Dato_Slut
    from medarbejder_historik
    order by PersonID, Personnavn, Afdeling, Dato_Start, Dato_Slut;
quit;

data person_afdeling_renset (drop = thisgroupstart thisgroupslut NextID NextAfd NextStart NextSlut);
  set w1 end=end_of_data;
  retain thisgroupstart thisgroupslut;

  by PersonID Personnavn Afdeling;

  if not end_of_data then 
    set w1 (
      firstobs=2 
      drop= Personnavn
      rename=(PersonID=NextID Afdeling=NextAfd Dato_Start=NextStart Dato_Slut=NextSlut)
    );

  if first.Afdeling then do;
    thisgroupstart = Dato_Start;
    thisgroupslut = Dato_Slut;
  end;

  if not last.Afdeling then do;
    if Dato_Slut < NextStart - 1 then do;
      Dato_Start = thisgroupstart;
      output;
      thisgroupstart = NextStart;
    end;
    thisgroupslut = max(Dato_Slut, thisgroupslut);
  end;

  if last.Afdeling then do;
    dato_start = thisgroupstart;
    dato_slut = max(Dato_Slut, thisgroupslut);
    output; 
  end;
run;

proc sort data=person_afdeling_renset; by PersonID Dato_Start;
run;

SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Re: SAS Row by row compare

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!