06-08-2016 03:16 PM
I have a data set with about 100,000 records in it. I would like to create a smaller data set from that one that is composed of the first 50,000 records. I know I can add a sequential counter column to it and then do it that way, but I was wondering if there is a way to do this by just selecting the first 50,000 records of the larger one on the fly?
06-08-2016 03:40 PM
Also note ...
You can subset directly:
proc summary data=have (obs=50000) .....
If the subset is for use in multiple steps, be wary of using a view. Your program will likely run faster if you create a data set rather than a view. Each time a view is used, it has to re-execute the instructions within it.
06-08-2016 04:21 PM
06-08-2016 05:18 PM
Rather than try to debate the issue, I ran a test. Granted, it's one machine and one operating system. I tested 10,000 runs against a data set, and it took about 14 minutes. Not wanting to wait that long, I tested only 5,000 runs against a view. That took nearly 12 minutes (almost twice as long per run). The test data set contained 100 variables and (originally) 100,000 observations.
Each "run" was a simple step:
At this point, I'm still convinced that the view is slower for multiple uses.
06-08-2016 05:38 PM
Yes @Astounding, there is certainly extra buffering involved in using a view. But OP's question was for an "on the fly" method, which I assume is something to be used just once.
OBS= dataset option is probably the most efficient method when nothing more than limiting the number observations read is required.
06-08-2016 07:05 PM - edited 06-08-2016 07:07 PM
Yes @Astounding you are correct and I didn't intend to suggest that VIEW was faster, only that for this example it won't make much difference.
options Compress=yes NOTE: Elapsed time VIEW: 00:01:04 NOTE: Elapsed time DATA: 00:00:45 options compress=no NOTE: Elapsed time VIEW: 00:01:13 NOTE: Elapsed time DATA: 00:00:15
This test program attempts to remove all but reading from the test.
data example; array a (1:100); do _n_ = 1 to 1e5; output; end; run; data testv / view=testv; set example(obs=50000); run; data testd; set example(obs=50000); run; %macro test(n); %local _begin_ i; %let _begin_ = %sysfunc(datetime()); options notes=0; %do i = 1 %to &n; data _null_; set testv; run; %end; options notes=1; %put NOTE: Elapsed time VIEW: %sysfunc(intck(seconds,&_begin_,%sysfunc(datetime())),TOD20); %let _begin_ = %sysfunc(datetime()); options notes=0; %do i = 1 %to &n; data _null_; set testd; run; %end; options notes=1; %put NOTE: Elapsed time DATA: %sysfunc(intck(seconds,&_begin_,%sysfunc(datetime())),TOD20); %mend; %test(250);
06-14-2016 02:39 PM
Yes Astounding, I actually tried this method after and it works fine as well. Going forward, this is probably what I would use.
Need further help from the community? Please ask a new question.