BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Bal23
Lapis Lazuli | Level 10

PROC SQL;
CREATE TABLE ta1 AS 
SELECT DISTINCT id, icd,   feq, date 

group by id , date,icd, 
order by id, date, icd;


QUIT;

Now if I want to remove some records if there are two records on same day, what should I do

I can not simply drop freq=2, because I do want to keep those have two visits on different days with same icd code

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

Assuming data HAVE is sorted by id/date, you can use the DIF function (where DIF(X) is defined as X-LAG(X)).

 

data want;
  set have;
  by id ;
  if first.id or dif(date)>7;
run;

 

 

Note: If FIRST.ID is true, it is logically superfluous to evaluate the DIF function, but SAS evaluates it anyhow-(which is good, because the DIF function needs to be evaluated for every record).  I did a test to demonstrate that it works as desired.  But if you're really nervous, the subsetting IF statement could be modified to

  if dif(date)>7  or first.id;

 

 

which would insure that DIF  is always evaluated (and the underlying lag queues are always updated.

 

Regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

12 REPLIES 12
LinusH
Tourmaline | Level 20

So what defines a duplicate? Id and date alone?

Removing duplicates usually makes the data less informative. Aggregation is preferred when possible (then you could have count column telling how many original records that contributed).

 

But, PROC SORT does this (remove duplicates using BY) that you is almost impossible to do in SQL.

Data never sleeps
Bal23
Lapis Lazuli | Level 10

for this project, if there are two records with the same icd code, on the same day for the same patient, it might be a duplicate record that is not what I am interested in

But I want to include those records with two visits, freq=2, with different dates, with same id

sort is fine, then is there a function, to delete some records with same id, same icd, on the same day?

Shmuel
Garnet | Level 18

Your code ask for   

DISTINCT id, icd,   feq, date

as much as I understand you will get duplicates only if ICD or FEQ differs between rows.

In such case, which one you want to delete ?

 

Run next code and check the output dataset (contains the duplicates) :

 

proc sort data=ta1; by ID DATE; run;

data dup;

  set ta1;

       by id  date;

           if not (first.date and last.date);

run;

 

 

Bal23
Lapis Lazuli | Level 10

 I ran it, i still have same day records

 

please see table below, here only patient id 4, these two records are what i need, because these two visists are on two different dates, for id2  , because they have same icd code, same date, same patient id, i need to remove

 

by looking at the data, i notice, i also need to remove id 3, the billing date is 2 day difference, id5, one day difference, it might still be about the same visit, this is NOT i am interested in, so i need to remove those records with same id, same icd, same date or two billing date difference less than one week

 

thanks.

id

icd

freq

amount

billingdate

2

v234

2

78.91

1/1/2015

2

v234

2

96.05

1/1/2015

3

v44

2

43.74

2/3/2015

3

v44

2

22.04

2/5/2015

4

e34

2

52.26

3/6/2015

4

e34

2

11.27

9/5/2015

5

j23

2

379.08

4/1/2015

5

j23

2

217.12

4/2/2015

Shmuel
Garnet | Level 18

Of course you get more than on row with same date.

I wanted to show you the dupliacte date records.

 

Look at ID=2. On 1st row amount=78.91 while on 2nd row amount=96.05.

 

Which row you prefer to delete ? or maybe you want to sum the amounts per ID ICD BILLINGDATE ? 

Bal23
Lapis Lazuli | Level 10

Thank you and I am sorry I have not made it clear.

 

when the bill date difference is within 7 days, even it has three or records, I want to consider it as one visit only, there fore, I only want to keep the first day record, and delete all other billing records, which are after the first record, but not later than 7 days,

I need to create a variable, timelapase=time2-time1, and if timelapse > 0 but <=7, then all those later records will be deleted, only the first record will be kept

here, the record with 78.91  will be recorded

i sort it by id, date, and icd

then if first.id=1 and first.date=1 and first.icd=1, then timelaspse=0, time=date;

else timelapse=date(the next record)-time,, or date3(the records after that)=date-time

 

but I do not know how to do that

Shmuel
Garnet | Level 18

Then try next code:

data want;
  set have;  
       by ID DATE TCD;   /* assumed data is sorted */
            retain prev_date; drop prev_date;
           
           if first.ID then do;
              prev_date = date;
              output;
           end; else
           if date - prev_date > 7 then do;
              prev_date = date;
              output;
           end; else delete;
run;

 

Bal23
Lapis Lazuli | Level 10

Thanks. I did try your code, and I deleted all records after the first record for the same patient.

I do not know where the problem is and how to fix, I still want to keep later records, which are after 7 days

Shmuel
Garnet | Level 18

@Bal23 - what code have you run ?

 

I have preceded a step to read your data and run next code:

data test;
  infile datalines dlm=',' truncover;
  input id icd $ amoumt date mmddyy10.;
  format date ddmmyy10.;
datalines;
2,v234,78.91,01/01/2015
2,v234,96.05,01/01/2015
3,v44,43.74,02/03/2015
3,v44,22.04,02/05/2015
4,e34,52.26,03/06/2015
4,e34,11.27,09/05/2015
5,j23,379.08,04/01/2015
5,j23,217.12,04/02/2015
; run;

proc sort data=test; by id icd date; run;

data want;
 set test;
  by id;
     retain prev_date; drop prev_date;
     if first.id then do;
        prev_date = date;
        output;
     end; else
     if date - prev_date > 7 then do;
        prev_date = date;
        output;
     end; else delete;
run;

then output contains next lines  (date format mmddyy10.):

2    v234   78.91  01/01/2015

3    v44     43.74  02/03/2015                        

4    e34     52.26  03/06/2015

4    e34     11.27  09/05/2015

5    j23    379.08  04/01/2015

 

 

 

mkeintz
PROC Star

Assuming data HAVE is sorted by id/date, you can use the DIF function (where DIF(X) is defined as X-LAG(X)).

 

data want;
  set have;
  by id ;
  if first.id or dif(date)>7;
run;

 

 

Note: If FIRST.ID is true, it is logically superfluous to evaluate the DIF function, but SAS evaluates it anyhow-(which is good, because the DIF function needs to be evaluated for every record).  I did a test to demonstrate that it works as desired.  But if you're really nervous, the subsetting IF statement could be modified to

  if dif(date)>7  or first.id;

 

 

which would insure that DIF  is always evaluated (and the underlying lag queues are always updated.

 

Regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Shmuel
Garnet | Level 18

If someone is visiting twice a week, then the 3rd or 4th visit will be after more than 7 days since first visit.

Would you like to delete it or save it ?

 

Using filter like  if first.id or dif(date) > 7 will delete the 3rd and the 4th visit,

 while 

data want;
  set have;  
       by ID DATE TCD;   /* assumed data is sorted */
            retain prev_date; drop prev_date;
           
           if first.ID then do;
              prev_date = date;
              output;
           end; else
           if date - prev_date > 7 then do;
              prev_date = date;
              output;
           end; else delete;
run;

will preserve, at least the 4th visit.

 

 

mkeintz
PROC Star

And the avoidance of DIF in the alternative would also get the 7th, 10th, 13th dates if it was a multi-week series at two dates per week.  But it would not get visits 8,9,11,12.

 

In such a case a more likely objective would be to determine the first date (BEGDATE), last date (ENDDATE) and number of dates in each series - where a series is defined as a sequence of  dates with no internal  gaps over 7 days.  The example below produces such results using two DIF functions and a merge statement of a HAVE with itself (with the 2nd have using FIRSTOBS=2).  Note the merge does not have an associated BY statement.  If it did, the program would fail:

 

 

 

data want (drop=next_:);
  format begdate yymmddn8.;

  do ndates=1 by 1 until (dif(next_id)>0 or dif(next_date)>7);
    merge have (rename=(date=enddate))
          have (firstobs=2 keep=id date rename=(id=next_id date=next_date));
    if begdate=. then begdate=date;
  end;

run;

 

Of course,  in the original problem there are other variables that are being ignored in this new program.  But my response here is part of my campaign to make folks more informed, and more comfortable with the LAG and DIF functions.  I think they are often unnecessarily avoided.

 

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 5421 views
  • 1 like
  • 4 in conversation