BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MeganE
Pyrite | Level 9

Hi,

 

I'm taking a dataset with 54 million records and want to subset it to another location with a where statement.  By default, i'm doing this:

    data out.PS&year._2015 (where=(hcfasaf="P"));
        set clm2015.ps&year.;
    run;

 

But it's taking about 45 minutes and i have to do 11 years worth.  I looked into proc copy, but there's no where= statement that works with that.

 

Any ideas?

Thanks,

Megan

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

Your use of the WHERE statement could be improved.  By putting it on the DATA statement, the program reads in all observations and only subsets the output.  Move it here:

 

   data out.PS&year._2015;
        set clm2015.ps&year. (where=(hcfasaf="P"));
    run;

 

 

That way, the program will subset the observations on the way in instead of on the way out.

View solution in original post

10 REPLIES 10
jklaverstijn
Rhodochrosite | Level 12
Have you considered creating an index on column hcfasaf? Once you want to subset your data most approaches take the same path behind the scenes. If you say "another location" have a look at other aspecs as well, like network bandwidth and I/O contention. If your source and target reside on the same on the Sam disk(s) then that may actually be your problem and not SAS.

Regards Jan.
MeganE
Pyrite | Level 9

There already is an index on the dataset, but on different variables. If were to create an index, i'd have to create a new copy anyway, b/c i can't modify what's there or it'll remove what currently exists and what's there is standard.

 

And i'm copying to and from the same server.  I don't know what an aspec is.

Kurt_Bremser
Super User

If the data has to travel over the network (ie SAS and the dataset do not reside in the same host), then 45 minutes is actually quite fast.

If everything is on the same server, you may have the problem of competing disk accesses.

For operations like this, I try to make source and target storage physically separate (source on SAN, target on internal disks; or source on one physical volume, target on another).

Have you tried the (compress=yes) dataset option on the target dataset? With (large) string variables present this can significantly reduce storage use and therefore I/O.

jklaverstijn
Rhodochrosite | Level 12

With "aspects" I mean all those factors that determine the performance of the task at hand. There are many ways to do things in SAS and often performance relies heaviliy on the choices you make. But your SAS code it is not the only thing to look at. Your system components (CPU, memory, I/O) are important players here. If your source and target are on the same server there is a chance they are also on the same disks. For best performance it is important that this situation is avoided.

 

On a side note it is surprising to hear you need to recreate all indexes when adding one. Could you explain a bit more why this would be the case?

 

And please keep in mind that at the end of the day you may just have to wait, as the task you try to perform takes a while because of the large volumes you are processing.

 

- Jan.

MeganE
Pyrite | Level 9

I just meant that i can't modify existing indexes.  This isn't "my" dataset, it belongs to the department.  I can't go changing things like that.  Other people use these datasets.

ChrisNZ
Tourmaline | Level 20

1- You can create a new index without removing anything.

    For example using  proc sql; create index ...

 

2- Please post the log times so we know whether the issue is I/O (likely) or CPU.
    If you are copying to the same disk sub-system, speed will be low. To such en extend that it can be faster to
    make two copy operations. For example to the WORK library and then from WORK to the final destination.


3- As outlined by @Astounding, the where clause should probably be moved.

 

4- See https://communities.sas.com/t5/Base-SAS-Programming/How-to-speed-up-performance-of-Sas-9-4-TS1M3-thr... where speeding up data access is also discussed. Compression and memory usage in particular could make a difference in your case.

Astounding
PROC Star

Your use of the WHERE statement could be improved.  By putting it on the DATA statement, the program reads in all observations and only subsets the output.  Move it here:

 

   data out.PS&year._2015;
        set clm2015.ps&year. (where=(hcfasaf="P"));
    run;

 

 

That way, the program will subset the observations on the way in instead of on the way out.

MeganE
Pyrite | Level 9

Thanks!  I ran this on a test case and my original way was 2 minutes, this one was 14 seconds.  🙂

FreelanceReinh
Jade | Level 19

I would follow @Astounding's suggestion if the subset is comparably small (i.e., say, only <50% of the observations meet the WHERE condition). If, on the contrary, the subset is large, a subsetting IF (if hcfasaf='P';) might be faster than the approach using WHERE.

Ksharp
Super User

This is best situation for PROC DS2.

 

 


libname y v9 '/folders/myfolders/';
data y.class;
 set sashelp.class;
run;

proc ds2;
data Female(overwrite=yes);
method run();
 set y.class;
 if sex='F';
end;
enddata;
run;
quit;

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 2577 views
  • 1 like
  • 7 in conversation