BookmarkSubscribeRSS Feed
NanZ
Calcite | Level 5

Hello SAS Users,

 

I am trying to clean a longitudinal data set for people with multiple visits. The recorded date however involves mistakes/unreasonable dates. For example, the same check-in (datin) and check-out date (datout). I tried to add one day to the check-out date or remove one day to the check-in date, depending on the date of visit before the same check-in/out date, with using lead and lag function. The removing one day code works but not the adding one day code.

 

Thanks for any suggestions.

 

Merry Christmas!

 

 

 

The resulted data should look like:

 

Data WANT
1 1 2
1 2 3
1 3 4
1 4 5
(deleted, because the check-out date in this period is later than the check-in date of the next period , 20>6)
1 5 6 (or 6 7 but should not beyond 9, the next check in date)
1 9 10
1 10 11
1 11 12
1 12 13
(deleted)
1 14 15  (or 13 14 but should not cross with previous check-out date or next check in date)
1 15 16    (or 14 15 but should not cross with previous check-out date or next check in date)
1 16 17

 

 

 

 

** Example data and tried code ;

data HAVE;
input clientid datin datout ;
datalines;
1 1 2
1 2 2
1 3 3
1 4 4
1 5 20
1 6 6
1 9 10
1 11 11
1 12 12
1 13 13
1 14 16
1 15 15
1 16 16
1 16 17
;

 

proc expand data= h1 out= h1;
 convert datin = lead_datin /transformout = (lead 1);
 convert datout = lead_datout /transformout = (lead 1);
 by clientid;
run;

 

* Create lag date;
 data h1 ;
 format clientid flag_inout_sameday datin datout date_in date_out lead_datin lead_datout lag_datin lag_datout time  ;
 set h1 ;
 
 date_in=datin;
 date_out=datout;

 

  * The same day in/out;
 if datin=datout and datin~=. then flag_inout_sameday = 1;

 

 * unreasonable datout ;
 if datout>lead_datout and lead_datout~=. then delete ;

 

  by clientid;
 lag_datin = ifn(first.clientid, . , lag(datin));
 lag_datout = ifn(first.clientid, . , lag(datout));

 

 

 *** Clean records with Same day in/out ***;
 if    not first.clientid and flag_inout_sameday = 1 and datin>lag_datout then date_in = datin - 1;
  else if not first.clientid and flag_inout_sameday = 1 and datin<lead_datin then date_out = datout + 1;


run;

8 REPLIES 8
novinosrin
Tourmaline | Level 20

Hi @NanZ  I am not quite getting the intuition of your exercise as the sample suggests only 1 day interval between datein and dateout? Hmm such lucky patients or hotel customers who occupy just for a day.

 

Anyway FWIW


data h1;
input clientid datin datout ;
datalines;
1 1 2
1 2 2
1 3 3
1 4 4
1 5 7
1 6 6
1 9 10
1 11 11
1 12 12
1 13 13
1 14 16
1 15 15
1 16 16
1 16 17
;
 

data want;
 do until(last.clientid);
  set h1(drop=datout);
  by clientid;
  _min=min(_min,datin);
  _max=max(_max,datin);
 end;
 do _n_=_min to _max;
  datin=_n_;
  dateout=datin+1;
  output;
 end;
 drop _:;
run;

A little more comprehensive information would help

 

NanZ
Calcite | Level 5

Thanks, for your quick response.  

 

The synthetic data is shelter records but not hospital ones. Some shelters require people to register every day, some may not do so. A shelter may require daily registration in some specific periods (for example, winter) but not other periods. Because of that, the data contains the combination of the two types of stays across people/shelters.

 

For the same day check-in/out record, it could be a person came to the door at midnight, a mistake to input the date, or some other unknow reasons. Unless these records are apparently incorrect, all obs are planned to be kept.

 

Actually, my final dataset should look like the following - the start and end date for a continuous stay

 

Final data format:

id datin datout

1 1 6
1 9 13

1 14 17

 

Corresponding to the "cleaned" data:

 

clientid datin datout
1 1 2
1 2 3
1 3 4
1 4 5
(deleted) (A mistake in my original post. Revision has been done)

1 5 6
1 9 10
1 10 11
1 11 12
1 12 13
(deleted)
1 14 15  
1 15 16   
1 16 17

 

 


@novinosrin wrote:

Hi @NanZ  I am not quite getting the intuition of your exercise as the sample suggests only 1 day interval between datein and dateout? Hmm such lucky patients or hotel customers who occupy just for a day.

 

Anyway FWIW

data h1;
input clientid datin datout ;
datalines;
1 1 2
1 2 2
1 3 3
1 4 4
1 5 7
1 6 6
1 9 10
1 11 11
1 12 12
1 13 13
1 14 16
1 15 15
1 16 16
1 16 17
;
 

data want;
 do until(last.clientid);
  set h1(drop=datout);
  by clientid;
  _min=min(_min,datin);
  _max=max(_max,datin);
 end;
 do _n_=_min to _max;
  datin=_n_;
  dateout=datin+1;
  output;
 end;
 drop _:;
run;

A little more comprehensive information would help

 


 

 

novinosrin
Tourmaline | Level 20

Hmm, still not able to comprehend properly.

 

Okay, To keep it simple, can you provide the following 

 

1. A sample data of what you HAVE (your input dataset)

2. A sample data of what you WANT( your output dataset for the corresponding input)

3. A brief explanation/narrative of the logic to derive the WANT relevant to 1 and 2. 

 

Should i assume  

1. is 

Final data format:

id datin datout

1 1 6
1 9 13

1 14 17

 

and 

 

 2 is 

Corresponding to the "cleaned" data:

 

clientid datin datout
1 1 2
1 2 3
1 3 4
1 4 5
(deleted) (A mistake in my original post. Revision has been done)

1 5 6
1 9 10
1 10 11
1 11 12
1 12 13
(deleted)
1 14 15  
1 15 16   
1 16 17

 

If yes, then

 


data have;
input id datin datout;
cards;
1 1 6
1 9 13
1 14 17
;

data want;
 set have;
 do datin=datin to datout-1;
  datout=datin+1;
  output;
 end;
run;
NanZ
Calcite | Level 5

Let me try but I am not sure I can make it clearer.

 

1. HAVE

clientid datin datout

1 1 2
1 2 2
1 3 3
1 4 4
1 5 20
1 6 6
1 9 10
1 11 11
1 12 12
1 13 13
1 14 16
1 15 15
1 16 16
1 16 17
;

 

2. WANT

 

clientid datin datout

1 1 2
1 2 3
1 3 4
1 4 5
(deleted)
1 5 6
1 9 10
1 10 11
1 11 12
1 12 13
(deleted)
1 14 15 
1 15 16 
1 16 17

 

Logic:

Change the same day in/out to sequential days in/out, base on the previous/next record.

 

For example

1. For the second record. The original check in/out date is 2 2. I would like to change it to 2 3, because the previous record is 1 2. Or the check in date in the second record should not before the previous check out date. Consequently, the 3rd record, 3 3, will become 3 4 and so on

 

To summarize, adding one day to the check out date because of the record with check in/out date 1 2

 

2. Change the second record to the last from 16 16 (check in/out) to 15 16 due to the last record has value 16 17 (check in/out).  Because the check out date in the second to the last record should not after the check in date of the last record. Consequently, change 15 15 to 14 15 and so on.

 

To summarize, removing one day from the check in date because of the record with check in/out date 16 17

 

The record with value of  5 20 (check in/out) will be deleted because of the record after this one (6 6) shows he checked in again before he checked out. The check out date on the 20th most likely be a mistake.

 

Hope the explanation makes things clear but not adds confusion.  

novinosrin
Tourmaline | Level 20

Hi @NanZ   See if this helps?

 


data HAVE;
input clientid datin datout ;
datalines;
1 1 2
1 2 2
1 3 3
1 4 4
1 5 20
1 6 6
1 9 10
1 11 11
1 12 12
1 13 13
1 14 16
1 15 15
1 16 16
1 16 17
;


data want;
 do _n_=1 by 1 until(last.clientid);
  set have;
  by clientid;
  if datin<_t and _t then _f=1;else _f=0;
  _t=datout;
  datout=datin+1;
  if not _f then output;
 end;
 drop _:;
run;

If the above seems ok, and still the 1 16 17 duplicate bothers you, then the following addresses that piece too

 


data want;
 if _n_=1 then do;
  dcl hash H () ;
  h.definekey  ("datin","datout") ;
  h.definedata ("datin","datout") ;
  h.definedone () ;
 end; 
 do _n_=1 by 1 until(last.clientid);
  set have;
  by clientid;
  _f= datin<_t and _t;
  _t=datout;
  datout=datin+1;
  if not _f and h.check() ne 0 then output;
  _rc=h.add();
 end;
 h.clear();
 drop _:;
run;
NanZ
Calcite | Level 5

Thanks, @novinosrin, for your time and great help.

 

I checked the results generated from your code and found that both deleted record 6 and 12, instead of 5 and 11. I have to check the code on the original data whether the they generate desired results or not. But I would prefer to delete the right one and correct the ones planned, i.e., record 6 and 12.

 

Also, the second code deleted one additional record - 14.

novinosrin
Tourmaline | Level 20

Hi @NanZ , Thank you for clarifying further and your patience. Please try this and see if this meets ?

 


data HAVE;
input clientid datin datout ;
datalines;
1 1 2
1 2 2
1 3 3
1 4 4
1 5 20
1 6 6
1 9 10
1 11 11
1 12 12
1 13 13
1 14 16
1 15 15
1 16 16
1 16 17
;


data want;
 do _n_=1 by 1 until(last.clientid);
  set have;
  by clientid;
  array t(9999) _temporary_;
  if datin<_t  then t(_n_-1)=1;
  _t=datout;
 end;
 do _n_=1 to _n_;
  set have;
  if t(_n_) then continue;
  datout=datin+1;
  output;
 end;
 call missing(of t(*));
 drop _:;
run;
NanZ
Calcite | Level 5

Thanks, @novinosrin, for your great patience too and great help.

 

The newest code deleted the right records, 5 and 11. But the results for the last two records are the same, 16 and 17.

 

Also, I am not familiar with array but the data has millions of records.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 1061 views
  • 0 likes
  • 2 in conversation