Solved: Re: data step to merge for a date range

caveman529 · Posted 07-23-2014 04:43 AM

Dear All:

Could you tell me what is the DATA STEP equivalent for the following SQL statement? When I run the following code, SAS runs extremely slow on a high-end workstation. So I think the sort sort merge approach may help in this situation. Thank you ~

proc sql;

create table want as

select a.*, coalesce(a.condition, b.condition) as condition

from have1 as a left join have2 as b

on a.patient_id = b.patient_id and 0 <= abs(a.visit_date - b.visit_date) <= 1 and a.type = b.type;

quit;

DaveBirch · Posted 07-24-2014 04:06 AM

Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN. One of the better reasons for converting an SQL join into a data step merge is get better control over like that.

However, my second offering could be easily fixed:

proc sql;

create table want as

select c.* from

(select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

from have1 as a left join have2 as b

on a.patient_id = b.patient_id and a.type = b.type) as c

where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;

quit;

or, you could try this variation to the response by stat@sas:

data want;

merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));

by patient_id type;

if a and abs(visit_date - b_visit_date) <= 1; /* the only result less than 0 for abs() is missing */

run;

View solution in original post

RW9 · Posted 07-23-2014 05:05 AM

Not sure if this would speed it up but:

proc sql;

create table WANT as

select A.*, /* Do you really need all variables? */

COALESCE(A.CONDITION,B.CONDITION) as CONDITION

from (select *,datepart(VISIT_DATE) as DTE1) HAVE1 A

/* I assume you use abs() as datetime variable? */

left join (select *,datepart(VISIT_DATE) as DTE2) HAVE2 B

on A.PATIENT_ID=B.PATIENT_ID

and A.DTE - B.DTE in (0,1);

quit;

As for datastep, well you would need to sort both of them. Then:

data want;

merge have1 (rename=(visit_date=dte1)) have2 (rename=(visit_date=dte2) keep=patient_id condition visit_date);

by patient_id;

if 0 <= (datepart(dte1) - datepart(dte2)) <= 1 then output;

run;

caveman529 · Posted 07-23-2014 04:45 PM

Thank you, RW9!

The abs() function is there to allow for difference in date within 1 day because sometimes the database has date difference. And 1 day difference is okay.

Ksharp · Posted 07-23-2014 09:00 AM

Post your sample data and the output you need .

caveman529 · Posted 07-23-2014 04:47 PM

Hi, Ksharp:

The data is not too big. Only 6 GB in total and much smaller if I keep only a few variables. But it take extremely long time to run on a PC with plenty of system resources. Hardware is definitely not a constraint. It is possible that the algorithm deployed by SQL is not efficient. I used

_method buffersize=1G to no avail.

stat_sas · Posted 07-23-2014 12:41 PM

Hi,

Sample data would give a better picture of the problem. Based on the above sql query, I'll try data step in this way

proc sort data=have1;
by patient_id type;
run;

proc sort data=have2;
by patient_id type;
run;

data want;

merge have1(in=a rename=(visit_date=dte1)) have2(in=b rename=(visit_date=dte2));

by patient_id type;

if a and 0<=abs(dte1 - dte2)<=1;

run;

caveman529 · Posted 07-23-2014 04:53 PM

I'll give it a try to see if this method works better on the PC.

Peter_C · Posted 07-23-2014 03:00 PM

Clearly we don't see the problem -

which involves the physical size of your data

and system capacity on the utility areas that proc sql would use for sorting

and we see none of the opportunity to improve run-time that might be possible depending

on database platforms holding your data

and indexes that might be available

so we cannot offer much reliable guidance until we know more

except here are some rules and ideas

since you demand a.dte and b.dte both be present for your join clause, why not use an inner join

If you have no indexes and your large data are stored in standard sas datasets

consider pre- sorting using tagsort option, selecting just the columns you need for the join

If you keep the original row numbers you can return later to the large data and pick out just the rows you need by POINT-ing to the required rows

consider streamlining the sql join especially if you can remove that ABS() function

For example

select {whatever}

from have1 a, have2 b

Where a.dte = ( b.dte-1)

and a.id= b.id and a.type=b.type

To collect the other cases (of the results from abs() function) add a "union all" of a similar SELECT but reversing the DTE calculation in the where clause.

Most databases including sas sql struggle to optimise joins and where clauses involving OR conditions which you are hiding in that ABS()

If these ideas seem alien check out the papers on optimising your sql and data steps, while you wait for your code to run

caveman529 · Posted 07-23-2014 04:49 PM

Will pre-sorting actually help in the SQL? It seems to me that proc sql is a different econsystem that don't use the sorting in the data to make runtime shorter. Instead, it just does its own thing. That is why I'm thinking use DATA STEP to force SAS to use the sorting I've done.

Sorting the data takes only about 30 seconds.

Inner join is necessary here so that I can track how many missing variables or the quality of the matching. If I use inner join, I won't be able to know if there are duplicate merge and etc.

In fact, I tried to use group by and having statement to make the matching even more robust but when I run that code, I found the disk activity is several KB per second, which is abnormally low given the system resource that SAS can deploy.

Does changing some of the system option help?

Peter_C · Posted 07-23-2014 05:08 PM

proc sql _method ;

persuades sql to reveal the join techniques that it has adopted.

(often databases spool one of the tables of a left join leading to performance problems because they don't like using null in joins

Since your data sort quickly

A datastep merge would only cause problems where you have more than one row per ID and TYPE as it doesn't do cartesians

caveman529 · Posted 07-26-2014 08:07 PM

Thank you for telling me the caveat for the data step!

DaveBirch · Posted 07-24-2014 01:32 AM

Hi,

I think the problem is that using a function with arguments from both tables is resolved by creating a cartesian product before the function is evaluated. Fortunately, simple comparisons usually perform much better. Try this:

proc sql;

create table want as

select a.*, coalesce(a.condition, b.condition) as condition

from have1 as a left join have2 as b

on a.patient_id = b.patient_id and a.type = b.type

where -1 <= (a.visit_date - b.visit_date) <= 1;

quit;

Placing the 'inexact' matching of dates in separate where clause may allow the SQL optimiser to evaluate this condition after it has evaluated the 'exact' match. Alternatively, try:

proc sql;

create table want as

select c.* from

(select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

from have1 as a left join have2 as b

on a.patient_id = b.patient_id and a.type = b.type) as c

where -1 <= (visit_date - b_visit_date) <= 1;

quit;

Peter_C · Posted 07-24-2014 02:59 AM

Doesn't the use of b_dte in the where clause, eliminate rows selected by the Left side of the join which have no matching rows from the right side?

DaveBirch · Posted 07-24-2014 04:06 AM

Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN. One of the better reasons for converting an SQL join into a data step merge is get better control over like that.

However, my second offering could be easily fixed:

proc sql;

create table want as

select c.* from

(select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

from have1 as a left join have2 as b

on a.patient_id = b.patient_id and a.type = b.type) as c

where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;

quit;

or, you could try this variation to the response by stat@sas:

data want;

merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));

by patient_id type;

if a and abs(visit_date - b_visit_date) <= 1; /* the only result less than 0 for abs() is missing */

run;

caveman529 · Posted 07-26-2014 08:12 PM

Thank you, guys. I'll implement this too see if this procedure get the problems resolved.

Really appreciate the help !

Registration is open

SAS Training: Just a Click Away