Dear All:
Could you tell me what is the DATA STEP equivalent for the following SQL statement? When I run the following code, SAS runs extremely slow on a high-end workstation. So I think the sort sort merge approach may help in this situation. Thank you ~
proc sql;
create table want as
select a.*, coalesce(a.condition, b.condition) as condition
from have1 as a left join have2 as b
on a.patient_id = b.patient_id and 0 <= abs(a.visit_date - b.visit_date) <= 1 and a.type = b.type;
quit;
Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN. One of the better reasons for converting an SQL join into a data step merge is get better control over like that.
However, my second offering could be easily fixed:
proc sql;
create table want as
select c.* from
(select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date
from have1 as a left join have2 as b
on a.patient_id = b.patient_id and a.type = b.type) as c
where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;
quit;
or, you could try this variation to the response by stat@sas:
data want;
merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));
by patient_id type;
if a and abs(visit_date - b_visit_date) <= 1; /* the only result less than 0 for abs() is missing */
run;
Not sure if this would speed it up but:
proc sql;
create table WANT as
select A.*, /* Do you really need all variables? */
COALESCE(A.CONDITION,B.CONDITION) as CONDITION
from (select *,datepart(VISIT_DATE) as DTE1) HAVE1 A
/* I assume you use abs() as datetime variable? */
left join (select *,datepart(VISIT_DATE) as DTE2) HAVE2 B
on A.PATIENT_ID=B.PATIENT_ID
and A.DTE - B.DTE in (0,1);
quit;
As for datastep, well you would need to sort both of them. Then:
data want;
merge have1 (rename=(visit_date=dte1)) have2 (rename=(visit_date=dte2) keep=patient_id condition visit_date);
by patient_id;
if 0 <= (datepart(dte1) - datepart(dte2)) <= 1 then output;
run;
Thank you, RW9!
The abs() function is there to allow for difference in date within 1 day because sometimes the database has date difference. And 1 day difference is okay.
Post your sample data and the output you need .
Hi, Ksharp:
The data is not too big. Only 6 GB in total and much smaller if I keep only a few variables. But it take extremely long time to run on a PC with plenty of system resources. Hardware is definitely not a constraint. It is possible that the algorithm deployed by SQL is not efficient. I used
_method buffersize=1G to no avail.
Hi,
Sample data would give a better picture of the problem. Based on the above sql query, I'll try data step in this way
proc sort data=have1;
by patient_id type;
run;
proc sort data=have2;
by patient_id type;
run;
data want;
merge have1(in=a rename=(visit_date=dte1)) have2(in=b rename=(visit_date=dte2));
by patient_id type;
if a and 0<=abs(dte1 - dte2)<=1;
run;
I'll give it a try to see if this method works better on the PC.
Clearly we don't see the problem -
which involves the physical size of your data
and system capacity on the utility areas that proc sql would use for sorting
and we see none of the opportunity to improve run-time that might be possible depending
on database platforms holding your data
and indexes that might be available
so we cannot offer much reliable guidance until we know more
except here are some rules and ideas
since you demand a.dte and b.dte both be present for your join clause, why not use an inner join
If you have no indexes and your large data are stored in standard sas datasets
consider pre- sorting using tagsort option, selecting just the columns you need for the join
If you keep the original row numbers you can return later to the large data and pick out just the rows you need by POINT-ing to the required rows
consider streamlining the sql join especially if you can remove that ABS() function
For example
select {whatever}
from have1 a, have2 b
Where a.dte = ( b.dte-1)
and a.id= b.id and a.type=b.type
To collect the other cases (of the results from abs() function) add a "union all" of a similar SELECT but reversing the DTE calculation in the where clause.
Most databases including sas sql struggle to optimise joins and where clauses involving OR conditions which you are hiding in that ABS()
If these ideas seem alien check out the papers on optimising your sql and data steps, while you wait for your code to run
Will pre-sorting actually help in the SQL? It seems to me that proc sql is a different econsystem that don't use the sorting in the data to make runtime shorter. Instead, it just does its own thing. That is why I'm thinking use DATA STEP to force SAS to use the sorting I've done.
Sorting the data takes only about 30 seconds.
Inner join is necessary here so that I can track how many missing variables or the quality of the matching. If I use inner join, I won't be able to know if there are duplicate merge and etc.
In fact, I tried to use group by and having statement to make the matching even more robust but when I run that code, I found the disk activity is several KB per second, which is abnormally low given the system resource that SAS can deploy.
Does changing some of the system option help?
proc sql _method ;
persuades sql to reveal the join techniques that it has adopted.
(often databases spool one of the tables of a left join leading to performance problems because they don't like using null in joins
Since your data sort quickly
A datastep merge would only cause problems where you have more than one row per ID and TYPE as it doesn't do cartesians
Thank you for telling me the caveat for the data step!
Hi,
I think the problem is that using a function with arguments from both tables is resolved by creating a cartesian product before the function is evaluated. Fortunately, simple comparisons usually perform much better. Try this:
proc sql;
create table want as
select a.*, coalesce(a.condition, b.condition) as condition
from have1 as a left join have2 as b
on a.patient_id = b.patient_id and a.type = b.type
where -1 <= (a.visit_date - b.visit_date) <= 1;
quit;
Placing the 'inexact' matching of dates in separate where clause may allow the SQL optimiser to evaluate this condition after it has evaluated the 'exact' match. Alternatively, try:
proc sql;
create table want as
select c.* from
(select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date
from have1 as a left join have2 as b
on a.patient_id = b.patient_id and a.type = b.type) as c
where -1 <= (visit_date - b_visit_date) <= 1;
quit;
Doesn't the use of b_dte in the where clause, eliminate rows selected by the Left side of the join which have no matching rows from the right side?
Yes, you're correct - values that are missing/absent in where clauses due to no matching rows from the right side will turn a LEFT JOIN into an INNER JOIN. One of the better reasons for converting an SQL join into a data step merge is get better control over like that.
However, my second offering could be easily fixed:
proc sql;
create table want as
select c.* from
(select a.*, coalesce(a.condition, b.condition) as condition, b.visit_date as b_visit_date
from have1 as a left join have2 as b
on a.patient_id = b.patient_id and a.type = b.type) as c
where b_visit_date = . or -1 <= (visit_date - b_visit_date) <= 1;
quit;
or, you could try this variation to the response by stat@sas:
data want;
merge have1(in=a) have2(in=b rename=(visit_date=b_visit_date));
by patient_id type;
if a and abs(visit_date - b_visit_date) <= 1; /* the only result less than 0 for abs() is missing */
run;
Thank you, guys. I'll implement this too see if this procedure get the problems resolved.
Really appreciate the help !
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.