BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
cgates
Obsidian | Level 7
I have a query that works fine, but it takes hours to run because of how much data it pulls in (like past 7 years of data). There is a new restriction being imposed stating that any queries on this database will be killed after 30 minutes of runtime - no exceptions. So now I have to find a way to still pull all of the data I need but in 30 minute (or less) increments.

The most rudimentary way I can think to do this is to have separate queries for each year (with the hope that only pulling one year of data takes less than 30 min to run - this isnt guaranteed though so I might have to run every 6 month queries). EX. A query for 2020, another for 2021, another for 2022, and so forth. But I’m wondering if there’s a more eloquent/efficient way to write this with a loop that doesn’t require me to hard code the dates whether they’re every 6 months or every year. Basing it on runtime would be best since that is where the restriction is taking place. The hard part would be if I could get it to iterate based on time, the next run would have to know where the previous run stopped to it could append onto the previous dataset.

Any help is appreciated!

This is how the query is currently pulled:
PROC SQL;
CONNECT TO SQLSVR(DATABASE=DB, USER=MYID, PASSWORD=PASS READBUFF=1000);

CREATE TABLE RECORDPULL AS
SELECT * FROM CONNECTION TO SQLSVR
(SELECT *
FROM MYDATABASE
WHERE ADD_DT <= ‘2023-12-31’);

QUIT;
1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

You could try something like done in below sample code that pulls data from the DB in yearly or monthly slices. 

The code assumes that the SQL Server variable ADD_DT is of type DATE. If it's datetime then you would need to amend the code accordingly.

If your SQL is as simple as shared then I'd be using Proc Append and try if increasing the readbuff value increases performance. If using a SQL then potentially just using the default could perform best.

Below code creates monthly or yearly SAS tables because this allows to easily add more data (dates) at a later time or if something falls over to only re-run for the failed date periods.

%macro slices(source_tbl,start_dt, end_dt,interval=year,target_tbl=want,replace=YES);

  %local start stop end lib tbl;

  %let end  =%sysfunc(putn("&end_dt"d,f16. -l));
  %let start=&start_dt;
  %let lib=%scan(work.&target_tbl,-2);

  %do i=0 %to 1000;

    %let stop =%sysfunc(intnx(&interval,"&start_dt"d,&i,e));
    %let stop =%sysfunc(min(&stop,&end),date9.);

    %let tbl=%scan(work.&target_tbl,-1);
    %if %upcase(&interval)=YEAR %then %let tbl=&tbl._&interval._%sysfunc(putn("&start"d,year4. -l));
    %else 
    %if %upcase(&interval)=MONTH %then %let tbl=&tbl._&interval._%sysfunc(putn("&start"d,yymmn6. -l));

    %if not %sysfunc(exist(&lib..&tbl)) or %upcase(&replace)=YES %then
      %do;
        proc datasets lib=&lib nolist nowarn;
          delete &tbl;
          run;
          append data=&source_tbl base=&lib..&tbl;
            where ADD_DT between "&start"d and "&stop"d;
          run;
        quit;
      %end;
    %if %sysfunc(putn("&stop"d,f16. -l))=&end %then %goto exit;
    %let start=%sysfunc(intnx(&interval,"&start_dt"d,%eval(&i+1),b),date9.);
  %end;
  %exit:;

%mend;

View solution in original post

11 REPLIES 11
Sajid01
Meteorite | Level 14

Hello
Your query, if it is as posted, would be pulling all the data earlier than the specified date.

Add more filter conditions and consider using the between operator if you want to pull data between two dates.
Have a look at the following link. It is about pulling last seven years data

https://stackoverflow.com/questions/27599557/how-to-get-last-7-days-data-from-current-datetime-to-la...

cgates
Obsidian | Level 7
Yes I know how to filter to between dates. That is what I would do if I had to pull by every 6 months or every year. I’m asking for a more condensed solution based on runtime. Or a loop that can take a starting and end date and loop through every 6 months without me having to hard code it.
SASKiwi
PROC Star

How many rows and columns of data are you currently reading? To test if your program slowness is caused by a slow network connection or not then you could try this query which just does a row count:

PROC SQL;
CONNECT TO SQLSVR(DATABASE=DB, USER=MYID, PASSWORD=PASS READBUFF=1000);

CREATE TABLE RECORDPULL AS
SELECT * FROM CONNECTION TO SQLSVR
(SELECT count(*) as RowCount
FROM MYDATABASE
WHERE ADD_DT <= '2023-12-31');
QUIT;

If this query is way faster then you have a slow network connection. If not then the SQL Server database is poorly performing, and you should ask your database administrator why. Maybe there is no index on ADD_DT and if there was it could solve your performance issues without having to reinvent your query.

Reeza
Super User
It's easy enough to loop the dates, but your actual query is helpful, maybe the forum garbled your post but that doesn't look valid to me.

It looks like it pulls everything before the date, is that what you're doing? Do you know the start date?

Also, is your historical data actually changing? Can you pull the last 5 years, cache that and then only pull the last 2 years dynamically? Or 3 years?
Patrick
Opal | Level 21

You could try something like done in below sample code that pulls data from the DB in yearly or monthly slices. 

The code assumes that the SQL Server variable ADD_DT is of type DATE. If it's datetime then you would need to amend the code accordingly.

If your SQL is as simple as shared then I'd be using Proc Append and try if increasing the readbuff value increases performance. If using a SQL then potentially just using the default could perform best.

Below code creates monthly or yearly SAS tables because this allows to easily add more data (dates) at a later time or if something falls over to only re-run for the failed date periods.

%macro slices(source_tbl,start_dt, end_dt,interval=year,target_tbl=want,replace=YES);

  %local start stop end lib tbl;

  %let end  =%sysfunc(putn("&end_dt"d,f16. -l));
  %let start=&start_dt;
  %let lib=%scan(work.&target_tbl,-2);

  %do i=0 %to 1000;

    %let stop =%sysfunc(intnx(&interval,"&start_dt"d,&i,e));
    %let stop =%sysfunc(min(&stop,&end),date9.);

    %let tbl=%scan(work.&target_tbl,-1);
    %if %upcase(&interval)=YEAR %then %let tbl=&tbl._&interval._%sysfunc(putn("&start"d,year4. -l));
    %else 
    %if %upcase(&interval)=MONTH %then %let tbl=&tbl._&interval._%sysfunc(putn("&start"d,yymmn6. -l));

    %if not %sysfunc(exist(&lib..&tbl)) or %upcase(&replace)=YES %then
      %do;
        proc datasets lib=&lib nolist nowarn;
          delete &tbl;
          run;
          append data=&source_tbl base=&lib..&tbl;
            where ADD_DT between "&start"d and "&stop"d;
          run;
        quit;
      %end;
    %if %sysfunc(putn("&stop"d,f16. -l))=&end %then %goto exit;
    %let start=%sysfunc(intnx(&interval,"&start_dt"d,%eval(&i+1),b),date9.);
  %end;
  %exit:;

%mend;
LinusH
Tourmaline | Level 20

Pulling big chunks of data works best if the table is partitioned:

https://learn.microsoft.com/en-us/sql/relational-databases/partitions/partitioned-tables-and-indexes...

But if this is a recurring query, why not keep a local copy of the history data, and just pull the recent dates/year?

Data never sleeps
yabwon
Onyx | Level 15

seems like a simple %do-loop job:

%macro loopOver(start,end,by=200);

%local s e;
%let s=%sysevalf("&start."d+0);
%let e=%sysevalf("&end."d+0);
%put _local_;

PROC SQL noexec; /* <----- Remember to remove NOEXEC option */
CONNECT TO SQLSVR(DATABASE=DB, USER=MYID, PASSWORD=PASS READBUFF=1000);

%do i=&s. %to &e. %by &by.;

CREATE TABLE RECORDPULL_%sysfunc(int(&i.),yymmddn8.) AS
SELECT * FROM CONNECTION TO SQLSVR
(SELECT *
 FROM MYDATABASE
 WHERE ADD_DT 
  between %str(%')%sysfunc(int(&i.),yymmdd10.)%str(%')
      abd %str(%')%sysfunc(int(%sysevalf(&i.+&by.-1)),yymmdd10.)%str(%')
);

%end;

QUIT;



%mend loopOver;

options mprint; 
%loopOver(1jan2016,31dec2023)

Remember to adjust the code. Both, the BY= parameter and NOEXEC option.

 

BTW. I consider writing something like: 

SELECT * FROM CONNECTION TO SQLSVR (SELECT *

a very very bad programming practice... (those "*" to be clear)

 

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



cgates
Obsidian | Level 7
Thanks! This works except the by 200 element goes past the end date for the last date. In your example I would need it to stop at 31dec2023, but it goes to 18mar2024, since that is 200 days after 01sep2023.
yabwon
Onyx | Level 15
%macro loopOver(start,end,by=200);

%local s e;
%let s=%sysevalf("&start."d+0);
%let e=%sysevalf("&end."d+0);
%put _local_;

PROC SQL noexec; /* <----- Remember to remove NOEXEC option */
CONNECT TO SQLSVR(DATABASE=DB, USER=MYID, PASSWORD=PASS READBUFF=1000);

%do i=&e. %to &s. %by -&by.;

CREATE TABLE RECORDPULL_%sysfunc(int(&i.),yymmddn8.) AS
SELECT * FROM CONNECTION TO SQLSVR
(SELECT *
 FROM MYDATABASE
 WHERE ADD_DT 
  between %str(%')%sysfunc(int(&i.),yymmdd10.)%str(%')
      abd %str(%')%sysfunc(max(%sysevalf(&i.-&by.+1),&s.),yymmdd10.)%str(%')
);

%end;

QUIT;



%mend loopOver;

options mprint; 
%loopOver(1jan2016,31dec2023)
_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Patrick
Opal | Level 21

@cgates wrote:
Thanks! This works except the by 200 element goes past the end date for the last date. In your example I would need it to stop at 31dec2023, but it goes to 18mar2024, since that is 200 days after 01sep2023.

The %slices() macro I've posted earlier will stop at the selected end date. 

data have;
  format add_dt date9.;
  do add_dt='15Feb2020'd to '10Jun2023'd;
    output;
  end;
run;
%slices(work.have,01Jan2020, 20May2023,target_tbl=want);
proc sql;
  select max(ADD_DT) format=date9.
  from want_year_2023
  ;
quit;

Patrick_0-1712191960160.png

 

AhmedAl_Attar
Rhodochrosite | Level 12

Hi @cgates 

I'm surprised none of the replies commented on the READBUFF=1000 clause!

 

When I look at improving Database extraction queries, these are the things I look at:

  

  • Can I increase the READBUFF= value?

Typically, increasing this value causes faster data transfer, but it will depend on how much memory you have allocated to your SAS session.

%put %sysfunc(getoption(memsize));

 and whether or not you can increase it!?

 

  • How many of the actual table columns are really being used in subsequent steps -- if certain columns being dropping later, then I exclude them from extraction query at the first place. Only get the rows and columns that you actually need.
  • Is any of the table columns defined as BLOB or CLOB, if yes, these columns would have maximum value length of 32767 when moved over to SAS data set! Now you better ask yourself, could that cause data truncation? if, yes, then the original query (Select * from ....) would needs to be rewritten to properly handle such columns and ensure data integrity when moved to SAS.
  • Does the SAS code perform additional filtering in subsequent steps that could be applied to the query to start with? If yes, then add these filtering conditions to your query to reduce the number of extracted records. Only get the rows and columns that you actually need.

 

I hope this provides you with preliminary pre-requisites that you can incorporate with the rest of the solutions and suggestions you have already received,

Ahmed  

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 975 views
  • 10 likes
  • 8 in conversation