
taking too much time to run any alternatives

Frequent Contributor
Posts: 84

taking too much time to run any alternatives

Hi ,

My HAVE dataset is very big. Whenever I run the WANT program it takes too much time. Is there any way to reduce the run time?

I have heard of hash tables, but I am not yet comfortable writing hash code. Could anyone write a hash version of the WANT program?

Thanks for your time.

proc sql;
  create table want as
  select * from have
  where zip = &bzip;
quit;

Super User
Posts: 7,758

Re: taking too much time to run any alternatives

So you want to find the zipcode for the area with the maximum value of Y?

Or, more generically, find the record with the highest value of some variable?

Do this:

data want (keep=remember_x max_y rename=(remember_x=x max_y=y));
  set have end=done;
  retain remember_x "" max_y 0;   /* carry the running maximum across rows */
  if y > max_y then do;           /* new maximum found: remember it */
    max_y = y;
    remember_x = x;
  end;
  if done then output;            /* write a single row at end of file */
run;

That way you only have to make one pass through the dataset. If you need more columns in the output dataset, specify them accordingly.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 7,942

Re: taking too much time to run any alternatives

Try optimising your code/data.  From just the dataset you showed:

proc sql;
  create table want as
  select ZIP, X, Y
  from SASHELP.ZIPCODE
  group by ZIP
  having Y = max(Y);
quit;

This will be faster than running three separate steps. However, I don't know your data.

Frequent Contributor
Posts: 84

Re: taking too much time to run any alternatives

Thanks for the replies.

However, I want this part to be optimized.

proc sql;
  create table want as
  select * from have
  where zip = &bzip;
quit;

Is there any alternative, like a hash table or something else?

Thank You

Super User
Posts: 7,758

Re: taking too much time to run any alternatives

You can't optimize that. It is a single pass through the file with a simple WHERE condition; it doesn't get any simpler than that. What would you need the hash table for if there's only one zipcode to compare against?

You could try to optimize your storage by setting up a SPDS library on several physically separate disks. SPDS also stores metadata in the "buckets" that makes finding records easier for the SAS system.

Another trick to try is adding an index on zip to the dataset, unless the dataset is already sorted by zip.
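A minimal sketch of the index suggestion, assuming HAVE lives in the WORK library and ZIP is the variable to index (adjust the library and variable names to your data):

```sas
/* Create a simple index on ZIP; PROC SQL's CREATE INDEX does the same */
proc datasets library=work nolist;
  modify have;
  index create zip;
quit;

/* Subsequent WHERE lookups on ZIP can now use the index */
proc sql;
  create table want as
  select * from have
  where zip = &bzip;
quit;
```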

Super User
Posts: 7,942

Re: taking too much time to run any alternatives

The code I provided takes away the two additional steps and the macro processing, hence it should run quicker. There isn't much more from a code point of view to make it go faster. So, as Kurt Bremser has stated, you need to assess your data/storage and usage. If your data is really so big that one pass over it takes too long, then take a look at some documentation about data warehousing principles, for instance, and consider breaking the data into more manageable chunks.

Frequent Contributor
Posts: 84

Re: taking too much time to run any alternatives

Thanks for your suggestions.

This is my original question. Please advise.

Respected Advisor
Posts: 4,919

Re: taking too much time to run any alternatives

On the software side, Kurt's suggestion of adding an index on ZIPCODE is the best way to go. Further improvements would be hardware based.

PG

Frequent Contributor
Posts: 84

Re: taking too much time to run any alternatives

What does an index mean? Could you please elaborate, and could you write a sample code? Thanks.

Super User
Posts: 19,769

Re: taking too much time to run any alternatives

As mentioned above, create an index on your zipcode and/or sort your data by zipcode, especially if you're looping a process.

Creating and Exploiting SAS Indexes

http://www2.sas.com/proceedings/sugi29/123-29.pdf
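A sketch of the sort option mentioned above, using the dataset and variable names from this thread; a one-time sort pays off mainly when you subset or group by ZIP repeatedly:

```sas
/* One-time sort by ZIP; later BY-group processing and repeated
   ZIP lookups benefit from the sorted order */
proc sort data=have out=have_by_zip;
  by zip;
run;
```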

In general, if you absolutely require someone to write code for you, you need to hire a consultant. :-)

PROC Star
Posts: 1,167

Re: taking too much time to run any alternatives

Another software option is to store your data in a database that's optimized for retrievals of this type, but that's probably out of scope.

As others suggest, use an index. If you need a consultant, I'm available! :-)

Super User
Posts: 7,758

Re: taking too much time to run any alternatives

SAS(R) 9.2 SQL Procedure User's Guide or

Base SAS(R) 9.2 Procedures Guide

Super User
Posts: 3,250

Re: taking too much time to run any alternatives

You haven't really provided enough information on your problem to be able to decide if indexes would be better or not.

How big is your HAVE dataset - size in GB - gigabytes - and number of rows, and columns? How many rows do you expect in your WANT dataset?

As a general rule if you are selecting more than 25% of your rows in a query an index isn't going to help, plus there is the extra time of creating and maintaining the index.

There are many options to consider apart from indexes. For example, if your HAVE dataset has many columns you could try compressing it. If the compression percentage is high, then any query that runs on it will be quicker. On some SAS datasets I use, compression reduces their size by 80% or more. That means processing the dataset any way you like will be faster.
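For reference, compression is just a dataset option; a minimal sketch (COMPRESS=BINARY often compresses mostly numeric data better than COMPRESS=YES):

```sas
/* Rebuild the dataset with compression enabled; the log reports
   how much the size decreased */
data have_c (compress=yes);
  set have;
run;
```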

Frequent Contributor
Posts: 84

Re: taking too much time to run any alternatives

Hi SASKiwi,

How big is your HAVE dataset (size in GB)? 50 GB

Number of rows? 2 billion

Number of columns? 3

How many rows do you expect in your WANT dataset? 50,000

Thanks.

Super User
Posts: 3,250

Re: taking too much time to run any alternatives

With only 3 columns, compression is unlikely to help.

Your WANT row count suggests an index could help. How many distinct values does ZIP have, and how many will you be selecting in your query? The more distinct values there are and the fewer you select, the better the index will perform.

You might want to test your index once it is created by selecting just one value of ZIP, then two, and so on, to see how it performs.
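One way to run that test, as a sketch; it assumes a character ZIP variable (quote the values accordingly). With the MSGLEVEL=I option, SAS writes an INFO note to the log telling you whether the index was used for each WHERE clause:

```sas
options msglevel=i;  /* log INFO notes about index usage */

data test1;
  set have;
  where zip = '10001';            /* one value */
run;

data test2;
  set have;
  where zip in ('10001','10002'); /* two values */
run;
```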
