Hi Everyone
I have a data set with about 100,000 records in it. I would like to create a smaller data set from that one that is composed of the first 50,000 records. I know I can add a sequential counter column to it and then do it that way, but I was wondering if there is a way to do this by just selecting the first 50,000 records of the larger one on the fly?
Paul
Use a view:
data short / view=short;
set have;
if _n_ > 50000 then stop;
run;
proc summary data=short;
...
run;
Use a view:
data short / view=short;
set have;
if _n_ > 50000 then stop;
run;
proc summary data=short;
...
run;
Thanks PG. That works!
Paul
Also note ...
You can subset directly:
proc summary data=have (obs=50000) .....
If the subset is for use in multiple steps, be wary of using a view. Your program will likely run faster if you create a data set rather than a view. Each time a view is used, it has to re-execute the instructions within it.
@Astoundingfor the subset of first 50,000 obs using OBS= vs the view proposed by @PGStats or a new data set with 50,000 obs it will make little different if any for each method will still read 50,000 records and stop.
Rather than try to debate the issue, I ran a test. Granted, it's one machine and one operating system. I tested 10,000 runs against a data set, and it took about 14 minutes. Not wanting to wait that long, I tested only 5,000 runs against a view. That took nearly 12 minutes (almost twice as long per run). The test data set contained 100 variables and (originally) 100,000 observations.
Each "run" was a simple step:
data want;
set subset;
run;
or
data want;
set viewname;
run;
At this point, I'm still convinced that the view is slower for multiple uses.
Yes @Astounding, there is certainly extra buffering involved in using a view. But OP's question was for an "on the fly" method, which I assume is something to be used just once.
OBS= dataset option is probably the most efficient method when nothing more than limiting the number observations read is required.
Yes @Astounding you are correct and I didn't intend to suggest that VIEW was faster, only that for this example it won't make much difference.
options Compress=yes
NOTE: Elapsed time VIEW: 00:01:04
NOTE: Elapsed time DATA: 00:00:45
options compress=no
NOTE: Elapsed time VIEW: 00:01:13
NOTE: Elapsed time DATA: 00:00:15
This test program attempts to remove all but reading from the test.
data example;
array a[100] (1:100);
do _n_ = 1 to 1e5;
output;
end;
run;
data testv / view=testv;
set example(obs=50000);
run;
data testd;
set example(obs=50000);
run;
%macro test(n);
%local _begin_ i;
%let _begin_ = %sysfunc(datetime());
options notes=0;
%do i = 1 %to &n;
data _null_;
set testv;
run;
%end;
options notes=1;
%put NOTE: Elapsed time VIEW: %sysfunc(intck(seconds,&_begin_,%sysfunc(datetime())),TOD20);
%let _begin_ = %sysfunc(datetime());
options notes=0;
%do i = 1 %to &n;
data _null_;
set testd;
run;
%end;
options notes=1;
%put NOTE: Elapsed time DATA: %sysfunc(intck(seconds,&_begin_,%sysfunc(datetime())),TOD20);
%mend;
%test(250);
Yes Astounding, I actually tried this method after and it works fine as well. Going forward, this is probably what I would use.
Paul
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.