Solved: Re: using "having" in proc sql - run times

Ramakanthkrovi · Posted 07-19-2019 12:24 AM

I have a dataset with 8 million records, and the query below takes about 2 hours to run. Is this code fine, or is there a faster way to get the same result, for example by computing the max timestamp in a previous step and then filtering with a WHERE clause?

proc sql;
create table xyz as
select *
from abc
group by aa_id,ab_id
having date_timestamp = max(date_timestamp);
quit;

Thank you.

koyelghosh · Posted 07-19-2019 12:58 AM

I have seen SAS experts here recommend using PROC SQL for smaller datasets. Could you achieve the same goal with a carefully written DATA step? If PROC SQL is a good choice for 8 million records, one of the experts will tell you. So if possible, wait for their response or try a DATA step.

Best wishes

SASKiwi · Posted 07-19-2019 02:48 AM

Please post the log of your SQL query. I suspect it contains a SAS note about re-merging summary statistics back with the original data, given that your SELECT * pulls in columns that are neither in the GROUP BY nor summarized. If it is re-merging, this will definitely slow your query.
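
For anyone checking this on their own data, the log note in question reads "NOTE: The query requires remerging summary statistics back with the original data." A minimal sketch of one way to make it impossible to miss, assuming the table and column names from the original post: the NOREMERGE option on PROC SQL makes such a query stop with an error instead of quietly re-merging.

```sas
/* Sketch only: abc, aa_id, ab_id and date_timestamp are taken from the original post. */
/* With NOREMERGE, PROC SQL reports an error rather than re-merging, so this step is   */
/* expected to fail without creating xyz if the query needs a re-merge.                */
proc sql noremerge;
   create table xyz as
   select *
   from abc
   group by aa_id, ab_id
   having date_timestamp = max(date_timestamp);
quit;
```

This does not make the query faster; it only confirms whether re-merging is involved.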

Ramakanthkrovi · Posted 07-21-2019 07:31 PM

@SASKiwi Yes, the log contains that note. How can I overcome it?

SASKiwi · Posted 07-21-2019 07:37 PM

@Ramakanthkrovi - all columns in your SELECT must also be repeated in your GROUP BY except for summary-type calculations like the MAX function you are using.

Ramakanthkrovi · Posted 07-21-2019 07:48 PM

I am grouping by two columns already mentioned in the group by statement and I am selecting * from the dataset.

Reeza · Posted 07-21-2019 09:31 PM

@Ramakanthkrovi wrote:

I am grouping by two columns already mentioned in the group by statement and I am selecting * from the dataset.

Have you tried running it in stages, or without the HAVING clause, to see how long each part takes? If the parts take about 2 hours individually, the fact that the main query takes 2 hours won't be surprising. If the parts complete in minutes, then clearly something else is the issue.
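
One way to run it in stages might look like the sketch below. It assumes the table and column names from the original post; work.copy_abc and work.max_ts are hypothetical output names chosen for illustration. The FULLSTIMER system option writes real time, CPU time and memory use for each step to the log, so the stages can be compared.

```sas
/* Sketch only: output table names are hypothetical. */
options fullstimer;

/* Stage 1: plain copy, to measure raw read/write time for the table. */
proc sql;
   create table work.copy_abc as
   select *
   from abc;
quit;

/* Stage 2: the grouped summary on its own, with no HAVING and no re-merge. */
proc sql;
   create table work.max_ts as
   select aa_id
         ,ab_id
         ,max(date_timestamp) as date_timestamp_max
   from abc
   group by aa_id, ab_id;
quit;
```

If both stages finish in minutes, the time is probably going into the re-merge or into something outside the query.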

SASKiwi · Posted 07-23-2019 03:25 AM

@Ramakanthkrovi - to get rid of the re-merging note your query would have to look like this:

proc sql;
create table xyz as
select aa_id
      ,ab_id
      ,max(date_timestamp) as date_timestamp_max
from abc
group by aa_id,ab_id;
/* No HAVING clause here: comparing the ungrouped date_timestamp
   with max(date_timestamp) would itself force a re-merge. */
quit;
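
The query above returns only the two keys and the max timestamp. If all columns of the matching rows are needed, as in the original SELECT *, one possible approach is to join the summary back to the table explicitly. The following is a sketch under that assumption, using the names from the original post; the aliases a and m are just for illustration. It does by hand what the re-merge does automatically, so whether it runs faster depends on the data and the environment.

```sas
/* Sketch only: explicit join back to a per-group summary. */
proc sql;
   create table xyz as
   select a.*
   from abc as a
        inner join
        /* One row per (aa_id, ab_id) with its latest timestamp. */
        (select aa_id
               ,ab_id
               ,max(date_timestamp) as date_timestamp_max
         from abc
         group by aa_id, ab_id) as m
   on  a.aa_id = m.aa_id
   and a.ab_id = m.ab_id
   and a.date_timestamp = m.date_timestamp_max;
quit;
```

Like the original query, this keeps every row that ties for the max timestamp within a group.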

andreas_lds · Posted 07-19-2019 03:25 AM

You don't need proc sql at all:

proc sort data=sashelp.class out=work.class;
   by Sex descending Age;
run;

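data work.want;
   set work.class;
   by Sex descending Age;

   /* _Age holds the max Age of the current Sex group across rows. */
   length _Age 8;
   retain _Age;
   drop _Age;

   /* Sorted descending, so the first row of each group has the max. */
   if first.Sex then do;
      _Age = Age;
   end;

   /* Keep only rows that match the group max (ties included). */
   if Age = _Age;
run;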
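
Applied to the table in the question, the same pattern might look like the sketch below. It assumes the names from the original post (abc, aa_id, ab_id, date_timestamp); work.abc_sorted and _max_ts are hypothetical names introduced for illustration.

```sas
/* Sketch only: sort so the latest timestamp comes first in each group. */
proc sort data=abc out=work.abc_sorted;
   by aa_id ab_id descending date_timestamp;
run;

data work.xyz;
   set work.abc_sorted;
   by aa_id ab_id descending date_timestamp;

   /* _max_ts carries the group's latest timestamp from row to row. */
   retain _max_ts;
   drop _max_ts;

   /* first.ab_id is set whenever aa_id or ab_id changes. */
   if first.ab_id then _max_ts = date_timestamp;

   /* Keep every row that carries the group's latest timestamp. */
   if date_timestamp = _max_ts;
run;
```

The sort is the expensive part on 8 million rows, but the DATA step itself is a single sequential pass.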

Reeza · Posted 07-19-2019 08:21 PM

How wide is that table? You must have another issue, or you forgot to submit the QUIT statement.

It used to take me 20 minutes to process 30 million rows with a lot of calculations on a desktop with 8GB of RAM.

Ramakanthkrovi · Posted 07-23-2019 02:01 AM

Hardware could be an issue.

I am running SAS EG in the cloud, so I cannot pinpoint the exact problem.

Reeza · Posted 07-23-2019 11:01 AM

Running in the cloud isn't that different from running on a server, so I'm not sure how that affects anything.
