kumarK
Quartz | Level 8

Hi,

My HAVE dataset is very big. Whenever I run the WANT program it takes too much time. Is there any way to reduce the time?

I have heard about hash tables, but I am still not comfortable writing hash table code. Could anyone write hash table code for the WANT program?

Thanks for your time.

proc sql;
  create table want as
  select * from have where zip = &bzip;
quit;

Kurt_Bremser
Super User

So you want to find the zipcode for the area with the maximum value of y?

Or, more generically: find the record with a certain highest value?

Do this:

data want (keep=remember_x max_y rename=(remember_x=x max_y=y));
  set have end=done;
  retain remember_x "" max_y 0;    /* carry the running maximum across observations */
  if y > max_y then do;
    max_y = y;
    remember_x = x;
  end;
  if done then output;             /* write a single record after the last observation */
run;

That way you only have to make one pass through the dataset. If you need more columns in the output dataset, specify them accordingly.

RW9
Diamond | Level 26

Try optimising your code/data.  From just the dataset you showed:

proc sql;
  create table want as
  select ZIP,
         X,
         Y
  from   SASHELP.ZIPCODE
  group by ZIP
  having Y = max(Y);
quit;

This will be faster than the three steps. However, I don't know your data.

kumarK
Quartz | Level 8

Thanks for the replies.

However, I want this part to be optimized:

proc sql;
  create table want as
  select * from have where zip = &bzip;
quit;

Are there any alternatives, like a hash table or anything else?

Thank you

Kurt_Bremser
Super User

You can't optimize that. It is a single pass through the file with a simple WHERE condition; it doesn't get any simpler than that. What would you need the hash table for, if there's only one zipcode to compare to?

You could try to optimize your storage by setting up an SPDS library on several physically separate disks. SPDS also stores metadata in the "buckets", which makes finding records easier for the SAS system.
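A minimal sketch of such a setup using the file-based SPD Engine (SPDE), a close relative of SPDS that you can configure yourself without a server install; all paths here are hypothetical placeholders:

/* Metadata on one disk, data partitions and indexes spread
   across physically separate disks. */
libname speedy spde '/sas/meta'
        datapath=('/disk1/data' '/disk2/data')
        indexpath=('/disk3/idx');

/* Copy the big dataset into the SPDE library once. */
data speedy.have;
  set work.have;
run;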

Another trick to try is adding an index on zip to the dataset, unless the dataset is already sorted by zip.
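To illustrate, a minimal sketch of adding that index with PROC DATASETS, assuming the dataset sits in the WORK library:

proc datasets library=work nolist;
  modify have;
  index create zip;       /* simple index on the ZIP variable */
quit;

/* The WHERE clause can now use the index instead of scanning all rows. */
proc sql;
  create table want as
  select * from have where zip = &bzip;
quit;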

RW9
Diamond | Level 26

The code I provided takes away the two additional steps and the macro processing, hence it should run quicker. There isn't much, from a code point of view, to make it go much faster. So, as Kurt Bremser has stated, you need to assess your data/storage and usage. If your data is really so big that one pass over it takes too long, then take a look at some documentation on data warehousing principles, and consider breaking the data into more manageable chunks (a sketch follows).
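As a purely hypothetical illustration of chunking, the dataset could be split on ZIP ranges so that a query with a known ZIP only has to scan one slice:

/* Hypothetical two-way split by ZIP range; a query on a known
   ZIP then reads only the matching slice. */
data have_low have_high;
  set have;
  if zip < 50000 then output have_low;
  else output have_high;
run;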

kumarK
Quartz | Level 8

Thanks for your suggestions.

This is my original question, though. Please suggest how to speed it up.

PGStats
Opal | Level 21

On the software side, Kurt's suggestion of adding an index on ZIPCODE is the best way to go. Further improvements would be hardware based.

PG
kumarK
Quartz | Level 8

What is an index? Could you please elaborate and share a sample code? Thanks.

Reeza
Super User

As mentioned above, create an index on your zipcode and/or sort your data by zipcode as well - especially if you're looping a process.

Creating and Exploiting SAS Indexes

http://www2.sas.com/proceedings/sugi29/123-29.pdf
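For reference, a simple index can also be created in PROC SQL; a minimal sketch assuming the dataset sits in WORK:

proc sql;
  create index zip on work.have(zip);   /* equivalent to the PROC DATASETS approach above */
quit;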

In general, if you absolutely require someone to write code for you, you need to hire a consultant :-)

TomKari
Onyx | Level 15

Another software option is to store your data in a database that's optimized for retrievals of this type, but that's probably out of scope.

As others suggest, use an index. If you need a consultant, I'm available! :-)

SASKiwi
PROC Star

You haven't really provided enough information about your problem to decide whether indexes would help or not.

How big is your HAVE dataset (size in GB), and how many rows and columns does it have? How many rows do you expect in your WANT dataset?

As a general rule, if you are selecting more than 25% of your rows in a query, an index isn't going to help; plus there is the extra time of creating and maintaining the index.

There are many options to consider apart from indexes. For example, if your HAVE dataset has many columns you could try compressing it. If the compression percentage is high, then any query that runs on it will be quicker. On some SAS datasets I use, compression reduces their size by 80% or more! That means processing such a dataset any way you like will be faster.
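A minimal sketch of rebuilding a dataset with compression turned on; COMPRESS=BINARY is an assumption that tends to suit mostly numeric data (COMPRESS=YES suits character-heavy data):

/* Rebuild HAVE with compression; the log reports the size reduction. */
data have_compressed (compress=binary);
  set have;
run;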

kumarK
Quartz | Level 8

Hi SASKiwi,

How big is your HAVE dataset, in GB? 50 GB
And number of rows? 2 billion
And columns? 3
How many rows do you expect in your WANT dataset? 50,000

Thanks.

SASKiwi
PROC Star

With only 3 columns, compression is unlikely to help.

Your WANT row count suggests an index could help. How many distinct values does ZIP have, and how many will you be selecting in your query? The more distinct values you have and the fewer you are selecting, the better the index performance.
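A minimal sketch of checking that cardinality, assuming the HAVE dataset from the thread (note this check itself is a full pass over the data):

proc sql;
  select count(distinct zip) as zip_count
  from have;
quit;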

You might want to test your index once it is created by selecting just 1 value of ZIP, then 2, and so on, just to see how it performs.
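A minimal sketch of such a test; MSGLEVEL=I makes SAS write an INFO note to the log saying whether the index was used. The ZIP values here are hypothetical and assume a numeric ZIP, as the unquoted &bzip in the original query suggests:

options msglevel=i;               /* log INFO notes about index usage */

data test1;
  set have;
  where zip = 10001;              /* hypothetical single value */
run;

data test2;
  set have;
  where zip in (10001, 10002);    /* hypothetical two values */
run;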

