Not really. The Host/Engine Dependent Information shows the DBMS specifics (schema) and the ODBC driver (the .dll at the bottom).
Run
data datameting;
set tms.datameting;
run;
first, and then the code against the dataset in WORK:
data remove8seconds;
  do until (last.UniekID);
    set datameting;
    by UniekID;
    /* and so on */
The first step will show you the raw time to simply get the data from the SQL server.
And if your UniekID groups are large enough to drop data out of the buffer, my data step will force a double read of the complete dataset.
With local data (WORK), the operating system will still have previously read dataset pages in the cache, and that alone will speed things up. And that's not even mentioning the performance difference between the network connection to your SQL server and the I/O bus to your internal disks, which should be substantial.
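If you want hard numbers for that comparison, you can switch on full performance statistics before running the steps; FULLSTIMER is a standard SAS system option, so this is just optional instrumentation:

options fullstimer; /* writes detailed real/CPU time and memory statistics for every step to the log */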
Please try the above first step, and let us know how long it took. Then run my data step.
That way you can see how much time the simple sequential transfer of the data alone takes.
One of the problems here is that SAS cannot push this processing down to the database (implicit pass-through) so that only the results are retrieved.
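To illustrate the contrast: a plain row filter like the hypothetical one below can usually be passed to the database implicitly, so only the matching rows cross the network (the threshold value is made up for illustration):

proc sql;
  create table filtered as
  select UniekID, Tijd, waarde
  from tms.datameting
  where waarde > 100; /* hypothetical filter; a WHERE like this is typically pushed to the DBMS */
quit;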
One additional try:
proc sql;
  create table remove8seconds as
  select
    UniekID,
    waarde,
    Tijd
  from tms.datameting
  group by UniekID
  having min(Tijd) + 8 <= Tijd and Tijd <= max(Tijd) - 8
  ;
quit;
This might enable SAS to push some of the processing to SQL Server, and it should be usable in explicit pass-through, if that is possible through an ODBC connection (I have no experience with that).
But if the result is still a large number of observations, you will still feel the network bottleneck.
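In case you want to try explicit pass-through, a rough, untested sketch might look like the following; the DSN name "tms_dsn", the table location, and the exact T-SQL details are all assumptions on my part:

proc sql;
  /* connect through the same ODBC data source the TMS libname uses;
     "tms_dsn" is a placeholder for your actual DSN */
  connect to odbc (datasrc="tms_dsn");
  create table remove8seconds as
  select * from connection to odbc (
    /* this part runs on SQL Server, so it must be valid T-SQL */
    select d.UniekID, d.waarde, d.Tijd
    from datameting d
    inner join (
      select UniekID,
             min(Tijd) as minT,
             max(Tijd) as maxT
      from datameting
      group by UniekID
    ) g
      on d.UniekID = g.UniekID
    where d.Tijd >= g.minT + 8
      and d.Tijd <= g.maxT - 8
  );
  disconnect from odbc;
quit;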
@Kurt_Bremser Okay, so it finished running the first part (it took about 12 minutes, I guess). I now have a new datameting. So now you want me to run the following, is that correct?
data remove8seconds;
  do until (last.UniekID);
    set work.datameting;
    by UniekID;
    if first.UniekID then start = Tijd;
  end;
  end = Tijd;
  do until (last.UniekID);
    set work.datameting;
    by UniekID;
    if start + 8 le Tijd le end - 8 then output;
  end;
  drop start end;
run;
PS it would be nice to see the log from that 12-minute transfer step.
@Kurt_Bremser There you go:
I must say that the code below works super fast:
data remove8seconds;
  do until (last.UniekID);
    set work.datameting;
    by UniekID;
    if first.UniekID then start = Tijd;
  end;
  end = Tijd;
  do until (last.UniekID);
    set work.datameting;
    by UniekID;
    if start + 8 le Tijd le end - 8 then output;
  end;
  drop start end;
run;
So it gives output data, but it also gives the following error:
You need to sort the data first; I guess that sorting was implicitly pushed to the DB when you ran the code against the table there.
Run
proc sort data=datameting;
  by UniekID Tijd;
run;
before running the data step again.
@Kurt_Bremser Okay, finally got it working. Appreciate your help.
It seemed that it copied the table from the ODBC library to WORK. Is that correct? If so, what's the benefit of this, and should I always start with the code below on every table? It seemed to make things 1000X faster.
data datameting;
  set tms.datameting;
run;
And of course I'm also interested in an explanation of:
proc sort data=datameting;
  by UniekID Tijd;
run;

data removeseconds;
  do until (last.UniekID);
    set work.datameting;
    by UniekID;
    if first.UniekID then start = Tijd;
  end;
  end = Tijd;
  do until (last.UniekID);
    set work.datameting;
    by UniekID;
    if start + 8 le Tijd le end - 8 then output;
  end;
  drop start end;
run;
The first step copies the data to your WORK library, which (usually) resides on your local disk, or a local disk within a server. "Local" can also be a disk storage box attached through a SAN, and these tend to be mighty fast nowadays, faster than a network connection. You can safely assume that you will get the best I/O performance out of WORK, if the SAS administrator is worth his/her pay.
Next, in order to use a BY statement in a data step (or any other step that supports a BY), the data has to be sorted accordingly; that's why the extra PROC SORT step is necessary.
By using BY in a data step, we get two additional automatic variables for each variable in the BY statement: first.variable and last.variable, which signal the first and the last observation of a group.
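A minimal sketch to make those flags visible, assuming a hypothetical data set HAVE that is already sorted by a group variable id; since the automatic variables are dropped from the output, we copy them into ordinary variables:

data flags;
  set have;
  by id;
  /* copy the automatic flags into ordinary variables so they
     appear in the output data set */
  f = first.id;
  l = last.id;
run;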
So a double DO works like this: first, it reads all observations of a group (until last.); during this, we can record certain things, here the Tijd value of the first observation of a UniekID group. After the first loop has finished, we have the contents of the last observation of the group in the PDV (Program Data Vector), so I store that Tijd value in the variable end. We can now use those two stored values to make decisions in the second DO loop, which reads exactly the same observations that the first loop read.
The double DO loop is very good when you need to make decisions on observations based on values stored in other observations within the same group. By double-reading a group in succession, most of the time the second read happens out of the disk cache, and not the storage medium itself, which translates to a single sequential physical read of the data, the fastest way to process data from (disk) storage, even in the days of SSD drives.
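If you want to see the mechanics in isolation, here is a self-contained toy version with made-up data (one group, Tijd values 0 to 30), so you can watch which observations survive the start+8/end-8 filter (here: 10 and 20):

/* made-up example data, already in BY-group order */
data have;
  input UniekID Tijd;
  datalines;
1 0
1 5
1 10
1 20
1 25
1 30
;

data want;
  /* first pass: find the first and last Tijd of the group */
  do until (last.UniekID);
    set have;
    by UniekID;
    if first.UniekID then start = Tijd;
  end;
  end = Tijd;
  /* second pass: re-read the same group and keep only observations
     at least 8 seconds inside the group's time range */
  do until (last.UniekID);
    set have;
    by UniekID;
    if start + 8 le Tijd le end - 8 then output;
  end;
  drop start end;
run;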
PS you can combine the first two steps:
proc sort
  data=tms.datameting
  out=datameting
;
  by UniekID Tijd;
run;
which should be a little faster than the two separate steps combined.