- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi everyone!
I have 1000 tables 100gb each. I am processing them one by one with simple data steps.
I want to write an if statement that would check the first row, and if column X is not equal to 1 then stop this data step (meaning stop to work with this dataset) and go to the next table.
Thanks!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
if x ne 1 then stop;
Are you positive X is an integer? If not then you might need to add some fuzz.
if -1E5 <= x-1 <= 1E5 then stop;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
if x ne 1 then stop;
Are you positive X is an integer? If not then you might need to add some fuzz.
if -1E5 <= x-1 <= 1E5 then stop;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Question: Are you going to produce a single dataset from all the qualifying datasets, or must you create one dataset for each qualifying dataset?
If the answer is "create 1 dataset from all qualifying datasets", then you could process all 1,000 in a single step:
data want (drop=_:);
set have: curobs=cobs indsname=dsnam;
retain _skipdata ' ';
if cobs=1 then _skipdata=ifc(x^=1,'Y','N');
if _skipdata='N';
sourcedsn=dsnam;
*** Your other code here ***;
run;
Notes:
- SET HAVE: , with the trailing : means read all datasets whose names begin with 'HAVE'. You could also say things like SET HAVE: OLD: .
- The curobs= tells SAS to put the current observation number in variable COBS. And cobs is the observation number relative to each incoming dataset, not to the aggregate of all the datasets.
- So the "if cobs=1 then ...." checks the value of X for the first obs of each dataset.
- I assume that you contemplate running the same code for each of the 1,000 datasets you refer to, so you can use a single copy of that code in this data step.
- I put in the "sourcedsn=dsnam" statement just in case you want to track the numerous source datasets for each record in the single output dataset.
But if you want one output data set for each qualifying input dataset (i.e. you want work.new_have1 from work.have1, work.new_have4 from work.have4), then you could:
%macro t (indsn=,outdsn=)/ des='Code to apply to each dataset';
data &outdsn;
set &indsn;
*** other code here ***;
run;
%mend t;
data _null_;
set have: (obs=1 keep=x) curobs=cobs indsname=dsname ;
if x=1;
outdsn='new_' || scan(dsname,2,'.');
call execute(cats('*%t(indsn=',dsname,',outdsn=',outdsn,');'));
run;
- The objective of this strategy is to avoid creating even a dummy output dataset for those original datasets that do not qualify.
- The use of the macro T is based on the assumption that you want to do the same process to each qualifying dataset (in this case each qualifying "HAVE..." dataset.
- The "obs=1" parameter just tells SAS to read only the first obs from each HAVE... dataset.
- The CALL EXECUTE statement sets up a call to macro T just for those datasets that meet your criterion in their first observation.
- Because the INDSNAME= parametr returns a two-level dataset name, you see a SCAN function that tracks only the second level (i.e. HAVE1 from WORK.HAVE1) in naming a new dataset.
- This program reads every record from every dataset, but it only keeps records from qualifying datasets. So if the non-qualifying datasets are large, and also comprise a large proportion of candidate datasets, then this strategy could become inefficient.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------