Hi all,
I have a list of tens of thousands of valid values for a set of variables and am trying to set up a lookup to determine which records have invalid values for these vars. I am trying to use PROC SQL select into to create a macro list of valid values for the variable:
proc sql noprint;
select(value) into :list_of_values separated by ','
from myfile;
quit;
With this in place, I can just set up a lookup for whether the value of a given variable matches one of the values in &list_of_values.
The problem is that I have too many values listed so SAS is indicating that the &list_of_values macro variable cannot be built because it would be too long. Is there a workaround for this? The only thing I can think to do is break up the variable list into multiple macro lists, but that is a bit of a pain.
I suspect a different approach such as a format might be more flexible.
Skeleton program example.
/* IF this set actually only contains valid values you would need a separate Fmtname value for each specific list of valid values. The C is for character variables, use N for numeric */ proc sql ; Create table validcntlin as select distinct (value) as start, "validvalue" as fmtname, label='Valid', type='C' from myfile; quit; proc format cntlin=validcntlin; run; data want; set have; where put(variablename,validvalue.) ne 'Valid'; run;
Advantage: once you have the cntlin data set created you don't need to pull values from the source again like you would with macro variables. Once the format is created if stored in library in the format search path you don't need to do anything until you have to update the format definition because of new data. If you need to add a new format for new variables you only need to create the data/code for that one format as the format catalog can hold many formats.
If multiple variables share the same range of valid values then you only need one format to test.
If variables are numeric and continuous it would likely be easier to write range statements in proc format, especially if you have a few code values .
Proc format; value somefmtname 1 - 3000='Valid' 9999='Valid' ; run;
For an example that only reports numbers in the range of 1 to 3000 (including decimal portions) and the code 9999 as 'Valid'.
This particular approach would have a formatted value of the numeric value as a BEST for anything except those in the ranges. OR use an othere= 'Invalid' and test for 'Invalid' as the formatted value. It is possible to code an Other into a Cntlin data set but would need more coding to add to the cntlin created in SQL to append a record with the variable, HLO and the value needed to set that.
Personally I've never used a discrete format with more than about 2,000 values but I don't see a problem. And the use is dead easy and possible in lots of places.
Obviously that dummy Want set should be restricted to only include identification variables and the invalid value variable(s).
I actually use a few formats like this when I read data sets because I have data sources that slip in new things like site identifiers.
So I have in my data step that reads the data lines of code like:
If put (siteid, validsitefmt.) ne 'Valid' then put "WARNING: Invalid site id of " siteid +1 "for " idvariable= _n_= ;
Then my log shows information about the problems before the data gets processed any further and I can 1) validate whether that value should be there from the sources and take appropriate action. Sometimes it is a simple spelling error, sometimes in means a new site and I update the metadata about the sites and the validation format (append a record if the Cntlin is a data set, or add line of code to Proc Format statements) and rerun the proc format to update. Then reread the file.
If macro variables can't contain the desired information because it exceeds the maximum length of a macro variable, another solution is CALL EXECUTE.
But either with macros or CALL EXECUTE, how do you indicate to the program which values are valid and which are invalid?
Will a subquery work instead?
Otherwise a data step that generates the full code dynamically does not have that limit, ie CALL EXECUTE()/DOSUBL.
%let age = 14;
data test;
set sashelp.class end = eof;;
where age = &age.;
if _n_ =1 then do;
call execute('proc sql; select * from sashelp.class where name in (');
end;
if not eof then do;
call execute(quote(name));
call execute(",");
end;
if eof then do;
call execute(quote(name));
call execute("); quit;");
end;
run;
@Walternate wrote:
Hi all,
I have a list of tens of thousands of valid values for a set of variables and am trying to set up a lookup to determine which records have invalid values for these vars. I am trying to use PROC SQL select into to create a macro list of valid values for the variable:
proc sql noprint;
select(value) into :list_of_values separated by ','
from myfile;
quit;
With this in place, I can just set up a lookup for whether the value of a given variable matches one of the values in &list_of_values.
The problem is that I have too many values listed so SAS is indicating that the &list_of_values macro variable cannot be built because it would be too long. Is there a workaround for this? The only thing I can think to do is break up the variable list into multiple macro lists, but that is a bit of a pain.
So this example creates a test dataset which consists of each valid value listed out separated by commas?
I suspect a different approach such as a format might be more flexible.
Skeleton program example.
/* IF this set actually only contains valid values you would need a separate Fmtname value for each specific list of valid values. The C is for character variables, use N for numeric */ proc sql ; Create table validcntlin as select distinct (value) as start, "validvalue" as fmtname, label='Valid', type='C' from myfile; quit; proc format cntlin=validcntlin; run; data want; set have; where put(variablename,validvalue.) ne 'Valid'; run;
Advantage: once you have the cntlin data set created you don't need to pull values from the source again like you would with macro variables. Once the format is created if stored in library in the format search path you don't need to do anything until you have to update the format definition because of new data. If you need to add a new format for new variables you only need to create the data/code for that one format as the format catalog can hold many formats.
If multiple variables share the same range of valid values then you only need one format to test.
If variables are numeric and continuous it would likely be easier to write range statements in proc format, especially if you have a few code values .
Proc format; value somefmtname 1 - 3000='Valid' 9999='Valid' ; run;
For an example that only reports numbers in the range of 1 to 3000 (including decimal portions) and the code 9999 as 'Valid'.
This particular approach would have a formatted value of the numeric value as a BEST for anything except those in the ranges. OR use an othere= 'Invalid' and test for 'Invalid' as the formatted value. It is possible to code an Other into a Cntlin data set but would need more coding to add to the cntlin created in SQL to append a record with the variable, HLO and the value needed to set that.
Personally I've never used a discrete format with more than about 2,000 values but I don't see a problem. And the use is dead easy and possible in lots of places.
Obviously that dummy Want set should be restricted to only include identification variables and the invalid value variable(s).
I actually use a few formats like this when I read data sets because I have data sources that slip in new things like site identifiers.
So I have in my data step that reads the data lines of code like:
If put (siteid, validsitefmt.) ne 'Valid' then put "WARNING: Invalid site id of " siteid +1 "for " idvariable= _n_= ;
Then my log shows information about the problems before the data gets processed any further and I can 1) validate whether that value should be there from the sources and take appropriate action. Sometimes it is a simple spelling error, sometimes in means a new site and I update the metadata about the sites and the validation format (append a record if the Cntlin is a data set, or add line of code to Proc Format statements) and rerun the proc format to update. Then reread the file.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.