data step to merge for a date range

Accepted Solution Solved
Reply
Regular Contributor
Posts: 161
Accepted Solution

data step to merge for a date range

Dear All:

Could you tell me what is the DATA STEP equivalent for the following SQL statement?  When I run the following code, SAS runs extremely slow on a high-end workstation.  So I think the sort sort merge approach may help in this situation.  Thank you ~

proc sql;

     create table want as

          select a.*, coalesce(a.condition, b.condition) as condition

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and 0 <= abs(a.visit_date - b.visit_date) <= 1 and a.type = b.type;

quit;


Accepted Solutions
Solution
‎07-24-2014 04:06 AM
Contributor
Posts: 25

Re: data step to merge for a date range

Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN.  One of the better reasons for converting an SQL join into a data step merge is get better control over like that.

However, my second offering could be easily fixed:

proc sql;

     create table want as

   select  c.* from

          (select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type) as c

   where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;

quit;

or, you could try this variation to the response by stat@sas:

data want;

   merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));

  by patient_id type;

  if a and abs(visit_date - b_visit_date) <= 1;  /* the only result less than 0 for abs() is missing */

run;

View solution in original post


All Replies
Super User
Super User
Posts: 7,430

Re: data step to merge for a date range

Not sure if this would speed it up but:

proc sql;

     create table WANT as

     select     A.*,     /* Do you really need all variables? */

                   COALESCE(A.CONDITION,B.CONDITION) as CONDITION

     from       (select *,datepart(VISIT_DATE) as DTE1) HAVE1 A    

/* I assume you use abs() as datetime variable? */

     left join   (select *,datepart(VISIT_DATE) as DTE2) HAVE2 B

     on          A.PATIENT_ID=B.PATIENT_ID

     and        A.DTE - B.DTE in (0,1);

quit;

As for datastep, well you would need to sort both of them.  Then:

data want;

     merge have1 (rename=(visit_date=dte1)) have2 (rename=(visit_date=dte2) keep=patient_id condition visit_date);

     by patient_id;

     if 0 <= (datepart(dte1) - datepart(dte2)) <= 1 then output;

run;

Regular Contributor
Posts: 161

Re: data step to merge for a date range

Thank you, RW9!

The abs() function is there to allow for difference in date within 1 day because sometimes the database has date difference.  And 1 day difference is okay.

Super User
Posts: 9,691

Re: data step to merge for a date range

Post your sample data and  the output you need .

Regular Contributor
Posts: 161

Re: data step to merge for a date range

Hi, Ksharp:

The data is not too big.  Only 6 GB in total and much smaller if I keep only a few variables.  But it take extremely long time to run on a PC with plenty of system resources.  Hardware is definitely not a constraint.  It is possible that the algorithm deployed by SQL is not efficient.  I used

_method buffersize=1G to no avail.

Trusted Advisor
Posts: 1,204

Re: data step to merge for a date range

Hi,

Sample data would give a better picture of the problem. Based on the above sql query, I'll try data step in this way

proc sort data=have1;
by patient_id type;
run;

proc sort data=have2;
by patient_id type;
run;

data want;

merge have1(in=a rename=(visit_date=dte1)) have2(in=b rename=(visit_date=dte2));

by patient_id type;

if a and 0<=abs(dte1 - dte2)<=1;

run;

Regular Contributor
Posts: 161

Re: data step to merge for a date range

I'll give it a try to see if this method works better on the PC. 

Valued Guide
Posts: 2,175

Re: data step to merge for a date range

Clearly we don't see the problem -

which involves the physical size of your data

and system capacity on the utility areas that proc sql would use for sorting

and we see none of the opportunity to improve run-time that might be possible depending

on database platforms holding your data

and indexes that might be available

so we cannot offer much reliable guidance until we know more

except here are some rules and ideas

since you demand a.dte and b.dte both be present for your join clause, why not use an inner join

If you have no indexes and your large data are stored in standard sas datasets

consider pre- sorting using tagsort option, selecting just the columns you need for the join

If you keep the original row numbers you can return later to the large data and pick out just the rows you need by POINT-ing to the required rows

 

consider streamlining the sql join especially if you can remove that ABS() function

For example

select {whatever}

from have1 a, have2 b

Where a.dte = ( b.dte-1)

and a.id= b.id and a.type=b.type

To collect the other cases (of the results from abs() function) add a "union all" of a similar SELECT but reversing the DTE calculation in the where clause.

Most databases including sas sql struggle to optimise joins and where clauses involving OR conditions which you are hiding in that ABS()

If these ideas seem alien check out the papers on optimising your sql and data steps, while you wait for your code to run

Regular Contributor
Posts: 161

Re: data step to merge for a date range

Will pre-sorting actually help in the SQL?  It seems to me that proc sql is a different econsystem that don't use the sorting in the data to make runtime shorter.  Instead, it just does its own thing.  That is why I'm thinking use DATA STEP to force SAS to use the sorting I've done.

Sorting the data takes only about 30 seconds.

Inner join is necessary here so that I can track how many missing variables or the quality of the matching.  If I use inner join, I won't be able to know if there are duplicate merge and etc.

In fact, I tried to use group by and having statement to make the matching even more robust but when I run that code, I found the disk activity is several KB per second, which is abnormally low given the system resource that SAS can deploy.

Does changing some of the system option help?

Valued Guide
Posts: 2,175

Re: data step to merge for a date range

proc sql _method ;

persuades sql to reveal the join techniques that it has adopted.

(often  databases spool one of the tables of a left join leading to performance problems  because they don't like using null in joins

Since your data sort quickly

A datastep merge would only cause problems where you have more than one row per ID and TYPE as it doesn't do cartesians

Regular Contributor
Posts: 161

Re: data step to merge for a date range

Thank you for telling me the caveat for the data step!

Contributor
Posts: 25

Re: data step to merge for a date range

Hi,

I think the problem is that using a function with arguments from both tables is resolved by creating a cartesian product before the function is evaluated.  Fortunately, simple comparisons usually perform much better.  Try this:

proc sql;

     create table want as

          select a.*, coalesce(a.condition, b.condition) as condition

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type

           where -1 <= (a.visit_date - b.visit_date) <= 1;

quit;


Placing the 'inexact' matching of dates in separate where clause may allow the SQL optimiser to evaluate this condition after it has evaluated the 'exact' match.  Alternatively, try:


proc sql;

     create table want as

   select  c.* from

          (select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type) as c

   where -1 <= (visit_date - b_visit_date) <= 1;

quit;


Valued Guide
Posts: 2,175

Re: data step to merge for a date range

Doesn't the use of b_dte in the where clause, eliminate rows selected by the Left side of the join which have no matching rows from the right side?

Solution
‎07-24-2014 04:06 AM
Contributor
Posts: 25

Re: data step to merge for a date range

Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN.  One of the better reasons for converting an SQL join into a data step merge is get better control over like that.

However, my second offering could be easily fixed:

proc sql;

     create table want as

   select  c.* from

          (select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type) as c

   where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;

quit;

or, you could try this variation to the response by stat@sas:

data want;

   merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));

  by patient_id type;

  if a and abs(visit_date - b_visit_date) <= 1;  /* the only result less than 0 for abs() is missing */

run;

Regular Contributor
Posts: 161

Re: data step to merge for a date range

Thank you, guys.  I'll implement this too see if this procedure get the problems resolved.  Smiley Happy

Really appreciate the help !

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 1779 views
  • 6 likes
  • 6 in conversation