BookmarkSubscribeRSS Feed
CynthiaWei
Obsidian | Level 7

Hi SAS Pros,

I have a dataset like showed below.

data Have;
input ID first_date :mmddyy10. last_date :mmddyy10. edu class;
format first_date mmddyy10.;
format last_date mmddyy10.;
datalines;
1 02-01-2015  02-01-2015  3  3 
1 02-01-2015  02-12-2015  3  1
1 02-01-2015  02-16-2015  3  3
1 02-01-2015  03-01-2015  3  3
1 08-20-2015  09-01-2015  3  3
1 10-06-2015  10-09-2015  2  1
2 10-19-2017  10-19-2017  1  1 
2 10-19-2017  10-21-2017  3  1
2 10-19-2017  11-05-2017  2  2
2 10-19-2017  11-11-2017  1  1
2 10-19-2017  12-22-2017  1  1
2 12-25-2017  12-25-2017  3  2
;
run;

I would like to recode the last dates if the records under the same ID match on ID, first_date, edu, and class, with only last dates are different. One of these set of records has a first date and last date are the same (i.e. the first record for each ID). The I would like to enter the date of the record with same first and last dates (i.e. 02-01-2015 for ID=1 and 10-19-2017 for ID=2) for the last dates for those records match on ID, first_date, edu, and class under the same ID.

 

If the records under the same ID are different in any of ID, first_date, edu, and class, then just leave as them own, regardless first date is different or the same as the last date.

 

The Want data is also post below.

1 02-01-2015 02-01-2015 3 3

1 02-01-2015 02-12-2015 3 1

1 02-01-2015 02-01-2015 3 3

1 02-01-2015 02-01-2015 3 3

1 08-20-2015 09-01-2015 3 3

1 10-06-2015 10-09-2015 2 1

2 10-19-2017 10-19-2017 1 1

2 10-19-2017 10-21-2017 3 1

2 10-19-2017 11-05-2017 2 2

2 10-19-2017 10-19-2017 1 1

2 10-19-2017 10-19-2017 1 1

2 12-25-2017 12-25-2017 3 2

 

Thank you very much for any help!

 

Best regards,

 

C

6 REPLIES 6
Patrick
Opal | Level 21

If you just want to use the first last_date per key based on sort order of your source data then below one way to go.

data Have;
  input ID first_date :mmddyy10. last_date :mmddyy10. edu class;
  format first_date mmddyy10.;
  format last_date mmddyy10.;
  datalines;
1 02-01-2015  02-01-2015  3  3 
1 02-01-2015  02-12-2015  3  1
1 02-01-2015  02-16-2015  3  3
1 02-01-2015  03-01-2015  3  3
1 08-20-2015  09-01-2015  3  3
1 10-06-2015  10-09-2015  2  1
2 10-19-2017  10-19-2017  1  1 
2 10-19-2017  10-21-2017  3  1
2 10-19-2017  11-05-2017  2  2
2 10-19-2017  11-11-2017  1  1
2 10-19-2017  12-22-2017  1  1
2 12-25-2017  12-25-2017  3  2
;

data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'have(keep=id first_date edu class last_date)');
      h1.defineKey('id','first_date','edu','class');
      h1.defineData('last_date');
      h1.defineDone();
    end;

  set have;
  h1.find();
run;

data expected;
  input ID first_date :mmddyy10. last_date :mmddyy10. edu class;
  format first_date mmddyy10.;
  format last_date mmddyy10.;
  datalines;
1 02-01-2015 02-01-2015 3 3
1 02-01-2015 02-12-2015 3 1
1 02-01-2015 02-01-2015 3 3
1 02-01-2015 02-01-2015 3 3
1 08-20-2015 09-01-2015 3 3
1 10-06-2015 10-09-2015 2 1
2 10-19-2017 10-19-2017 1 1
2 10-19-2017 10-21-2017 3 1
2 10-19-2017 11-05-2017 2 2
2 10-19-2017 10-19-2017 1 1
2 10-19-2017 10-19-2017 1 1
2 12-25-2017 12-25-2017 3 2
;

proc compare
  data=want
  comp=expected
  ;
run;

While loading the SAS hash table without option "multidata" SAS will only pick the first instance of a unique key combination and that's what the lookup using the find() method will return.

CynthiaWei
Obsidian | Level 7

Hi,

 

Thanks for the codes. However, they just happened to be the first records. It is not necessary to be the first record with the same first and last date. I just list them like that.

 

What I want is to meet the conditions I mentioned in the first thread.

 

Thanks,

C

Patrick
Opal | Level 21

The condition for changing last_date is met and the code returns the desired data as posted by you.

The last_date used will be taken from the first occurrence of the key combination encountered in the data. 

mkeintz
PROC Star

I understand your task as follows: if a given combination of id*first_date*edu*class has one record do nothing, i.e. leave the last_date value as is.  But if that combination has more than one record, then assign last_date=first_date.

 

This can be done by passing through each ID twice.  The first pass to count the frequency of a given id*first_date*edu*class, and the second pass to output the record with last_date modified if needed:

 

data Have;
  input ID first_date :mmddyy10. last_date :mmddyy10. edu class;
  format first_date mmddyy10.;
  format last_date mmddyy10.;
  datalines;
1 02-01-2015  02-01-2015  3  3 
1 02-01-2015  02-12-2015  3  1
1 02-01-2015  02-16-2015  3  3
1 02-01-2015  03-01-2015  3  3
1 08-20-2015  09-01-2015  3  3
1 10-06-2015  10-09-2015  2  1
2 10-19-2017  10-19-2017  1  1 
2 10-19-2017  10-21-2017  3  1
2 10-19-2017  11-05-2017  2  2
2 10-19-2017  11-11-2017  1  1
2 10-19-2017  12-22-2017  1  1
2 12-25-2017  12-25-2017  3  2
;

data want;
  set have (in=firstpass)  have (in=secondpass);
  by id;

  array _n_fdates {%sysevalf("01jan2015"d):%sysevalf("31dec2017"d),3,3} _temporary_;
  if first.id then call missing(of _n_fdates{*});

  if firstpass then _n_fdates{first_date,edu,class}+1;
  if secondpass;
  if _n_fdates{first_date,edu,class}>1 then last_date=first_date;
run;

The array _N_FDATES is a 3-dimensional array.  The first dimension has a lower bound of "01jan2015"d and an upper bound of "31dec2017"d, to cover the expected range of first_date values.  Change it to cover the actual domain of your data.  The second dimension goes from 1 to 3 to cover EDU values, and the third also goes from 1 to 3 to cover CLASS values.  They can be changed to other integer ranges too, if needed.  

 

This program assumes the data are sorted by ID, but can be any order within ID.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
CynthiaWei
Obsidian | Level 7

Hi,

 

Thank you so much for the code! I was not able to work on this project in the past weeks so sorry for my late reply.

 

I think your understanding is quite close to my task. I think there is one criterion needs to add here: if a given combination of id*first_date*edu*class has one record do nothing, i.e. leave the last_date value as is.  But if that combination has more than one record and one record has a date difference of 0, then assign last_date=first_date for the rest. So, if a given combination of id*first_date*edu*class has more than one record but each record has a date difference >0 and none of the records has a date difference of 0 within this ID, then leave them. Examples like follows:(there is no record like "5   8-19-2017  8-19-2017    1         1" for this person)

ID  First_date   Last_date    EDU   Class

5   8-19-2017  11-11-2017    1         1
5   8-19-2017  12-22-2017    1         1

 

Could you please advice me what the code look like when incorporating the above situation. I apologize for not specifying this before.

 

And if the variables that are going to be involved in a given combination are varA, varB, varC, varD, varE, varF and varG, I just need to put all of them in the "{}" in the if then statement, right?

 

Thank you so much!

 

Best regards,

mkeintz
PROC Star

@CynthiaWei wrote:

Hi,

 

Thank you so much for the code! I was not able to work on this project in the past weeks so sorry for my late reply.

 

I think your understanding is quite close to my task. I think there is one criterion needs to add here: if a given combination of id*first_date*edu*class has one record do nothing, i.e. leave the last_date value as is.  But if that combination has more than one record and one record has a date difference of 0, then assign last_date=first_date for the rest. So, if a given combination of id*first_date*edu*class has more than one record but each record has a date difference >0 and none of the records has a date difference of 0 within this ID, then leave them. Examples like follows:(there is no record like "5   8-19-2017  8-19-2017    1         1" for this person)

ID  First_date   Last_date    EDU   Class

5   8-19-2017  11-11-2017    1         1
5   8-19-2017  12-22-2017    1         1

Could you please advice me what the code look like when incorporating the above situation. I apologize for not specifying this before.

 

So instead of assigning last_date=first_date for all combinations that have multiple records, as I had programmed before, just do it for all combinations that have multiple records, of which at least one already has last_date=first_date:

 

In that case then during the first pass, the program should still count records by combination.  But in addition, if any record has first_date=last_date, then add (say) 1,000 to the count.  I choose 1,000 because I assume no combination is likely to have near 1,000 records.

 

This means:

  1. A single record would have count=1 or 1,001
  2. Multiple records would have counts 2,3,4,...    or 1,002+
  3. Therefore, during the second pass, only count>1,001 requires assigning last_date=first_date

 

 

data want;
  set have (in=firstpass)  have (in=secondpass);
  by id;

  array _n_fdates {%sysevalf("01jan2015"d):%sysevalf("31dec2017"d),3,3} _temporary_;
  if first.id then call missing(of _n_fdates{*});

  if firstpass then do;
    _n_fdates{first_date,edu,class}+1;
    if first_date=last_date then _n_fdates{first_date,edu,class}+1000;
  end;
  if secondpass;
  if _n_fdates{first_date,edu,class}>1001 then last_date=first_date;
run;

 

 

 

And if the variables that are going to be involved in a given combination are varA, varB, varC, varD, varE, varF and varG, I just need to put all of them in the "{}" in the if then statement, right?

In short yes.  Make provision for each variable in the array statement, and then use those variable names as the array indexes in the subsequent statements.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 461 views
  • 0 likes
  • 3 in conversation