05-04-2017 08:28 AM
I am checking for invalid values for all variables in a dataset. My data has more than 250 numeric values and I know the range of values for all the variables. Some have a range of 0-6 and blank as values and some have 0-4, 99, ., and blank as valid values. I found several examples like using proc freq, proc format along with data step, data step with where and put statement. All the techniques I found seem to be little time consuming and I am on a time crunch.
I was wondering if there is much efficient way to get this done. Any recommendations would be really appreciated.
05-04-2017 08:57 AM
Excel files are useless for presenting example data, as they do not keep the structural information of SAS datasets (variable types, lengths, formats,....). On top of that, many corporate firewalls (where most SAS installations are used) will routinely block MS Office files for security reasons.
Post your example data in a data step, as described here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...
05-04-2017 10:27 AM
If I know ahead of time the ranges of values then when I read the data I use custom informats such as
Proc format library=work;
. , 1,2,3,4,5,6= _same_
other = _error_;
Then use that informat when reading the variables. If the value is out of range when read the log shows an error message.
If you have existing values to check then use arrays to check related variables.
If you have a format like
., 1,2,3,4 = 'In Range'
other = 'invalid';
Then in a data step you specify all of the variables that use that format such as:
array grp Q1 Q3 Q25-Q30; /*< list of ALL variables with the same rule*/
do i= 1 to dim(grp);
if put(grp[i],check.) = 'invalid' then put "appropriate message identifying varialbe and record information";
Note that the above works will with VNAME function to return the specific variable involved.
Unless you have 100+ formats needed the above should be easily accomplished to identify problem values and records in less than one day and if you only have 5 or 6 common ranges to check then considerably less. I would expect a couple hours to be more likely.
The complicated time consuming issues are when Var1 is in a range that indicates a further restrinction on other variables such as if Var1 =1 then variables Var3, Var4 and Var5 should be missing or within another specified range.