BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
dsbihill
Obsidian | Level 7

Hello I am working with a very large data set somewhere in the 50 million + records.  The code that i have attached works ok when i subset out a month but will not work on a much larger chunk of the data.

What is the best way for me to be able to do this on the entire dataset consisting of 5 years? 

Is sorting and then using data step my best option?

Thank you so much for all your expertise.

proc sql;

    create table ambient_temp_dc as

    select RNH, RN, datepart('occur date'n) as date, hour('occur date'n) as hour;

        avg(basic_AT) as avg_ambient

        from rmdeoap.gets_dw_eoa_flt_dccca_v

    where RNH = 'CSX' and 'occur date'n ge '01Nov2014'd

    group by RNH, RN, date, hour

    ;

quit;

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

50 million records isn't that big so there are various ways to handle this. I'd start off by trying a standard proc means even without the sort, though you may need it.

You can use formats, datetime12. to get the data by hour rather than calculate the hour/date and then separate them out afterwards if really desired.

If none of the above solutions work, you can always process the data a month at a time.

proc means data=rmdeoap.gets_dw_eoa_flt_dccca_v

  (where = (RNH='CSX') keep=RNH RN 'occur date'n basic_AT) noprint;

  class RNH RN 'occur_date'n;

  ways RNH*RN*'occur_date'n;

  format 'occur_date'n datetime12.;

  output out=ambient_temp_dc mean(basic_at)=avg_ambient;

run;



View solution in original post

3 REPLIES 3
Kurt_Bremser
Super User

Try to sort your existing dataset first.

proc sort

  data=rmdeoap.gets_dw_eoa_flt_dccca_v (

    where=(RNH = 'CSX' and 'occur date'n ge '01Nov2014'd)

  )

  out=tempdata

;

by RNH, RN, date, hour;

run;

If that succeeds, the data step (or proc means) to build the average is a breeze.

Note that the where condition on the input data set reduces the size of the utility file built while sorting, so you may escape your disk full problem. proc sql has the nasty habit of throwing all the data into one big utility file and doing all its stuff there, causing a tankerload of competing disk accesses that tend to slow down systems to a crawl and often cause unnecessary disk space problems.

proc sort can be made faster and more stable by using UTILLOC= to assign a separate physical disk for the utility files.

Reeza
Super User

50 million records isn't that big so there are various ways to handle this. I'd start off by trying a standard proc means even without the sort, though you may need it.

You can use formats, datetime12. to get the data by hour rather than calculate the hour/date and then separate them out afterwards if really desired.

If none of the above solutions work, you can always process the data a month at a time.

proc means data=rmdeoap.gets_dw_eoa_flt_dccca_v

  (where = (RNH='CSX') keep=RNH RN 'occur date'n basic_AT) noprint;

  class RNH RN 'occur_date'n;

  ways RNH*RN*'occur_date'n;

  format 'occur_date'n datetime12.;

  output out=ambient_temp_dc mean(basic_at)=avg_ambient;

run;



Ksharp
Super User

I prefer to Data step + Array .

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 1235 views
  • 4 likes
  • 4 in conversation