Solved: Creating macro vars using PROC SQL select into

Walternate · Posted 06-25-2021 01:24 PM

Hi all,

I have a list of tens of thousands of valid values for a set of variables and am trying to set up a lookup to determine which records have invalid values for these vars. I am trying to use PROC SQL select into to create a macro list of valid values for the variable:

proc sql noprint;

select(value) into :list_of_values separated by ','

from myfile;

quit;

With this in place, I can just set up a lookup for whether the value of a given variable matches one of the values in &list_of_values.

The problem is that I have too many values listed so SAS is indicating that the &list_of_values macro variable cannot be built because it would be too long. Is there a workaround for this? The only thing I can think to do is break up the variable list into multiple macro lists, but that is a bit of a pain.

ballardw · Posted 06-25-2021 01:59 PM

I suspect a different approach such as a format might be more flexible.

Skeleton program example.

/* IF this set actually only contains valid values
   you would need a separate Fmtname value for each specific list
   of valid values. The C is for character variables, use N for numeric
*/
proc sql ;
   Create table validcntlin as
   select distinct (value) as start, "validvalue" as fmtname, label='Valid', type='C'
   from myfile;
quit;

proc format cntlin=validcntlin;
run;

data want;
   set have;
   where put(variablename,validvalue.) ne 'Valid';
run;

Advantage: once you have the cntlin data set created you don't need to pull values from the source again like you would with macro variables. Once the format is created if stored in library in the format search path you don't need to do anything until you have to update the format definition because of new data. If you need to add a new format for new variables you only need to create the data/code for that one format as the format catalog can hold many formats.

If multiple variables share the same range of valid values then you only need one format to test.

If variables are numeric and continuous it would likely be easier to write range statements in proc format, especially if you have a few code values .

Proc format;
value somefmtname
1 - 3000='Valid'
9999='Valid'
;
run;

For an example that only reports numbers in the range of 1 to 3000 (including decimal portions) and the code 9999 as 'Valid'.

This particular approach would have a formatted value of the numeric value as a BEST for anything except those in the ranges. OR use an othere= 'Invalid' and test for 'Invalid' as the formatted value. It is possible to code an Other into a Cntlin data set but would need more coding to add to the cntlin created in SQL to append a record with the variable, HLO and the value needed to set that.

Personally I've never used a discrete format with more than about 2,000 values but I don't see a problem. And the use is dead easy and possible in lots of places.

Obviously that dummy Want set should be restricted to only include identification variables and the invalid value variable(s).

I actually use a few formats like this when I read data sets because I have data sources that slip in new things like site identifiers.

So I have in my data step that reads the data lines of code like:

If put (siteid, validsitefmt.) ne 'Valid' then put "WARNING: Invalid site id of " siteid +1 "for " idvariable= _n_= ;

Then my log shows information about the problems before the data gets processed any further and I can 1) validate whether that value should be there from the sources and take appropriate action. Sometimes it is a simple spelling error, sometimes in means a new site and I update the metadata about the sites and the validation format (append a record if the Cntlin is a data set, or add line of code to Proc Format statements) and rerun the proc format to update. Then reread the file.

View solution in original post

PaigeMiller · Posted 06-25-2021 01:27 PM

If macro variables can't contain the desired information because it exceeds the maximum length of a macro variable, another solution is CALL EXECUTE.

But either with macros or CALL EXECUTE, how do you indicate to the program which values are valid and which are invalid?

--
Paige Miller

Reeza · Posted 06-25-2021 01:35 PM

Will a subquery work instead?

Otherwise a data step that generates the full code dynamically does not have that limit, ie CALL EXECUTE()/DOSUBL.

%let age = 14;

data test;
set sashelp.class end = eof;;
where age = &age.;

if _n_ =1 then do;
	call execute('proc sql; select * from sashelp.class where name in (');
end;

if not eof then do;
call execute(quote(name));
call execute(",");
end;

if eof then do;
call execute(quote(name));
call execute("); quit;");
end;

run;

@Walternate wrote:

Hi all,

I have a list of tens of thousands of valid values for a set of variables and am trying to set up a lookup to determine which records have invalid values for these vars. I am trying to use PROC SQL select into to create a macro list of valid values for the variable:

proc sql noprint;

select(value) into :list_of_values separated by ','

from myfile;

quit;

With this in place, I can just set up a lookup for whether the value of a given variable matches one of the values in &list_of_values.

The problem is that I have too many values listed so SAS is indicating that the &list_of_values macro variable cannot be built because it would be too long. Is there a workaround for this? The only thing I can think to do is break up the variable list into multiple macro lists, but that is a bit of a pain.

Walternate · Posted 06-25-2021 01:55 PM

So this example creates a test dataset which consists of each valid value listed out separated by commas?

Reeza · Posted 06-25-2021 02:09 PM

No - run the code. It generates the query below, which is also run and the results are displayed.

select * from sashelp.class where name in (<list of names of those who are 14>);

If both are in the same system though you can just do the following - no need for formats or macro variables.

proc sql;
create table want as
select * from tableA where name in (select name from TableB);
quit;

So no need to generate a macro list or format or any code.

ballardw · Posted 06-25-2021 01:59 PM

I suspect a different approach such as a format might be more flexible.

Skeleton program example.

/* IF this set actually only contains valid values
   you would need a separate Fmtname value for each specific list
   of valid values. The C is for character variables, use N for numeric
*/
proc sql ;
   Create table validcntlin as
   select distinct (value) as start, "validvalue" as fmtname, label='Valid', type='C'
   from myfile;
quit;

proc format cntlin=validcntlin;
run;

data want;
   set have;
   where put(variablename,validvalue.) ne 'Valid';
run;

Advantage: once you have the cntlin data set created you don't need to pull values from the source again like you would with macro variables. Once the format is created if stored in library in the format search path you don't need to do anything until you have to update the format definition because of new data. If you need to add a new format for new variables you only need to create the data/code for that one format as the format catalog can hold many formats.

If multiple variables share the same range of valid values then you only need one format to test.

If variables are numeric and continuous it would likely be easier to write range statements in proc format, especially if you have a few code values .

Proc format;
value somefmtname
1 - 3000='Valid'
9999='Valid'
;
run;