BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
caveman529
Calcite | Level 5

Dear All:

Could you tell me what is the DATA STEP equivalent for the following SQL statement?  When I run the following code, SAS runs extremely slow on a high-end workstation.  So I think the sort sort merge approach may help in this situation.  Thank you ~

proc sql;

     create table want as

          select a.*, coalesce(a.condition, b.condition) as condition

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and 0 <= abs(a.visit_date - b.visit_date) <= 1 and a.type = b.type;

quit;

1 ACCEPTED SOLUTION

Accepted Solutions
DaveBirch
Obsidian | Level 7

Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN.  One of the better reasons for converting an SQL join into a data step merge is get better control over like that.

However, my second offering could be easily fixed:

proc sql;

     create table want as

   select  c.* from

          (select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type) as c

   where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;

quit;

or, you could try this variation to the response by stat@sas:

data want;

   merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));

  by patient_id type;

  if a and abs(visit_date - b_visit_date) <= 1;  /* the only result less than 0 for abs() is missing */

run;

View solution in original post

14 REPLIES 14
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Not sure if this would speed it up but:

proc sql;

     create table WANT as

     select     A.*,     /* Do you really need all variables? */

                   COALESCE(A.CONDITION,B.CONDITION) as CONDITION

     from       (select *,datepart(VISIT_DATE) as DTE1) HAVE1 A    

/* I assume you use abs() as datetime variable? */

     left join   (select *,datepart(VISIT_DATE) as DTE2) HAVE2 B

     on          A.PATIENT_ID=B.PATIENT_ID

     and        A.DTE - B.DTE in (0,1);

quit;

As for datastep, well you would need to sort both of them.  Then:

data want;

     merge have1 (rename=(visit_date=dte1)) have2 (rename=(visit_date=dte2) keep=patient_id condition visit_date);

     by patient_id;

     if 0 <= (datepart(dte1) - datepart(dte2)) <= 1 then output;

run;

caveman529
Calcite | Level 5

Thank you, RW9!

The abs() function is there to allow for difference in date within 1 day because sometimes the database has date difference.  And 1 day difference is okay.

Ksharp
Super User

Post your sample data and  the output you need .

caveman529
Calcite | Level 5

Hi, Ksharp:

The data is not too big.  Only 6 GB in total and much smaller if I keep only a few variables.  But it take extremely long time to run on a PC with plenty of system resources.  Hardware is definitely not a constraint.  It is possible that the algorithm deployed by SQL is not efficient.  I used

_method buffersize=1G to no avail.

stat_sas
Ammonite | Level 13

Hi,

Sample data would give a better picture of the problem. Based on the above sql query, I'll try data step in this way

proc sort data=have1;
by patient_id type;
run;

proc sort data=have2;
by patient_id type;
run;

data want;

merge have1(in=a rename=(visit_date=dte1)) have2(in=b rename=(visit_date=dte2));

by patient_id type;

if a and 0<=abs(dte1 - dte2)<=1;

run;

caveman529
Calcite | Level 5

I'll give it a try to see if this method works better on the PC. 

Peter_C
Rhodochrosite | Level 12

Clearly we don't see the problem -

which involves the physical size of your data

and system capacity on the utility areas that proc sql would use for sorting

and we see none of the opportunity to improve run-time that might be possible depending

on database platforms holding your data

and indexes that might be available

so we cannot offer much reliable guidance until we know more

except here are some rules and ideas

since you demand a.dte and b.dte both be present for your join clause, why not use an inner join

If you have no indexes and your large data are stored in standard sas datasets

consider pre- sorting using tagsort option, selecting just the columns you need for the join

If you keep the original row numbers you can return later to the large data and pick out just the rows you need by POINT-ing to the required rows

 

consider streamlining the sql join especially if you can remove that ABS() function

For example

select {whatever}

from have1 a, have2 b

Where a.dte = ( b.dte-1)

and a.id= b.id and a.type=b.type

To collect the other cases (of the results from abs() function) add a "union all" of a similar SELECT but reversing the DTE calculation in the where clause.

Most databases including sas sql struggle to optimise joins and where clauses involving OR conditions which you are hiding in that ABS()

If these ideas seem alien check out the papers on optimising your sql and data steps, while you wait for your code to run

caveman529
Calcite | Level 5

Will pre-sorting actually help in the SQL?  It seems to me that proc sql is a different econsystem that don't use the sorting in the data to make runtime shorter.  Instead, it just does its own thing.  That is why I'm thinking use DATA STEP to force SAS to use the sorting I've done.

Sorting the data takes only about 30 seconds.

Inner join is necessary here so that I can track how many missing variables or the quality of the matching.  If I use inner join, I won't be able to know if there are duplicate merge and etc.

In fact, I tried to use group by and having statement to make the matching even more robust but when I run that code, I found the disk activity is several KB per second, which is abnormally low given the system resource that SAS can deploy.

Does changing some of the system option help?

Peter_C
Rhodochrosite | Level 12

proc sql _method ;

persuades sql to reveal the join techniques that it has adopted.

(often  databases spool one of the tables of a left join leading to performance problems  because they don't like using null in joins

Since your data sort quickly

A datastep merge would only cause problems where you have more than one row per ID and TYPE as it doesn't do cartesians

caveman529
Calcite | Level 5

Thank you for telling me the caveat for the data step!

DaveBirch
Obsidian | Level 7

Hi,

I think the problem is that using a function with arguments from both tables is resolved by creating a cartesian product before the function is evaluated.  Fortunately, simple comparisons usually perform much better.  Try this:

proc sql;

     create table want as

          select a.*, coalesce(a.condition, b.condition) as condition

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type

           where -1 <= (a.visit_date - b.visit_date) <= 1;

quit;


Placing the 'inexact' matching of dates in separate where clause may allow the SQL optimiser to evaluate this condition after it has evaluated the 'exact' match.  Alternatively, try:


proc sql;

     create table want as

   select  c.* from

          (select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type) as c

   where -1 <= (visit_date - b_visit_date) <= 1;

quit;


Peter_C
Rhodochrosite | Level 12

Doesn't the use of b_dte in the where clause, eliminate rows selected by the Left side of the join which have no matching rows from the right side?

DaveBirch
Obsidian | Level 7

Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN.  One of the better reasons for converting an SQL join into a data step merge is get better control over like that.

However, my second offering could be easily fixed:

proc sql;

     create table want as

   select  c.* from

          (select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date

              from have1 as a left join have2 as b

                   on a.patient_id = b.patient_id and a.type = b.type) as c

   where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;

quit;

or, you could try this variation to the response by stat@sas:

data want;

   merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));

  by patient_id type;

  if a and abs(visit_date - b_visit_date) <= 1;  /* the only result less than 0 for abs() is missing */

run;

caveman529
Calcite | Level 5

Thank you, guys.  I'll implement this too see if this procedure get the problems resolved.  Smiley Happy

Really appreciate the help !

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 6486 views
  • 6 likes
  • 6 in conversation