BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Walternate
Obsidian | Level 7

Hi all,

I have a list of tens of thousands of valid values for a set of variables and am trying to set up a lookup to determine which records have invalid values for these vars. I am trying to use PROC SQL select into to create a macro list of valid values for the variable:

 

proc sql noprint;

         select(value) into :list_of_values separated by ','

         from myfile;

quit;

 

With this in place, I can just set up a lookup for whether the value of a given variable matches one of the values in &list_of_values. 

 

The problem is that I have too many values listed so SAS is indicating that the &list_of_values macro variable cannot be built because it would be too long. Is there a workaround for this? The only thing I can think to do is break up the variable list into multiple macro lists, but that is a bit of a pain. 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
ballardw
Super User

I suspect a different approach such as a format might be more flexible.

Skeleton program example.

/* IF this set actually only contains valid values
   you would need a separate Fmtname value for each specific list
   of valid values. The C is for character variables, use N for numeric
*/
proc sql ;
   Create table validcntlin as
   select distinct (value) as start, "validvalue" as fmtname, label='Valid', type='C'
   from myfile;
quit;

proc format cntlin=validcntlin;
run;

data want;
   set have;
   where put(variablename,validvalue.) ne 'Valid';
run;

Advantage: once you have the cntlin data set created you don't need to pull values from the source again like you would with macro variables. Once the format is created if stored in library in the format search path you don't need to do anything until you have to update the format definition because of new data. If you need to add a new format for new variables you only need to create the data/code for that one format as the format catalog can hold many formats.

 

If multiple variables share the same range of valid values then  you only need one format to test.

If variables are numeric and continuous it would likely be easier to write range statements in proc format, especially if you have a few code values .

Proc format;
value somefmtname
1 - 3000='Valid'
9999='Valid'
;
run;

For an example that only reports numbers in the range of 1 to 3000 (including decimal portions) and the code 9999 as 'Valid'.

This particular approach would have a formatted value of the numeric value as a BEST for anything except those in the ranges. OR use an othere= 'Invalid' and test for 'Invalid' as the formatted value. It is possible to code an Other into a Cntlin data set but would need more coding to add to the cntlin created in SQL to append a record with the variable, HLO and the value needed to set that.

 

Personally I've never used a discrete format with more than about 2,000 values but I don't see a problem. And the use is dead easy and possible in lots of places.

 

Obviously that dummy Want set should be restricted to only include identification variables and the invalid value variable(s).

 

I actually use a few formats like this when I read data sets because I have data sources that slip in new things like site identifiers.

So I have in my data step that reads the data lines of code like:

 

If put (siteid, validsitefmt.) ne 'Valid' then put "WARNING: Invalid site id of " siteid +1 "for " idvariable= _n_= ;

Then my log shows information about the problems before the data gets processed any further and I can 1) validate whether that value should be there from the sources and take appropriate action. Sometimes it is a simple spelling error, sometimes in means a new site and I update the metadata about the sites and the validation format (append a record if the Cntlin is a data set, or add line of code to Proc Format statements) and rerun the proc format to update. Then reread the file.

View solution in original post

5 REPLIES 5
PaigeMiller
Diamond | Level 26

If macro variables can't contain the desired information because it exceeds the maximum length of a macro variable, another solution is CALL EXECUTE.

 

But either with macros or CALL EXECUTE, how do you indicate to the program which values are valid and which are invalid? 

--
Paige Miller
Reeza
Super User

Will a subquery work instead?

Otherwise a data step that generates the full code dynamically does not have that limit, ie CALL EXECUTE()/DOSUBL.

 

%let age = 14;

data test;
set sashelp.class end = eof;;
where age = &age.;

if _n_ =1 then do;
	call execute('proc sql; select * from sashelp.class where name in (');
end;

if not eof then do;
call execute(quote(name));
call execute(",");
end;

if eof then do;
call execute(quote(name));
call execute("); quit;");
end;

run;

@Walternate wrote:

Hi all,

I have a list of tens of thousands of valid values for a set of variables and am trying to set up a lookup to determine which records have invalid values for these vars. I am trying to use PROC SQL select into to create a macro list of valid values for the variable:

 

proc sql noprint;

         select(value) into :list_of_values separated by ','

         from myfile;

quit;

 

With this in place, I can just set up a lookup for whether the value of a given variable matches one of the values in &list_of_values. 

 

The problem is that I have too many values listed so SAS is indicating that the &list_of_values macro variable cannot be built because it would be too long. Is there a workaround for this? The only thing I can think to do is break up the variable list into multiple macro lists, but that is a bit of a pain. 

 

 


 

Walternate
Obsidian | Level 7

So this example creates a test dataset which consists of each valid value listed out separated by commas?

Reeza
Super User
No - run the code. It generates the query below, which is also run and the results are displayed.

select * from sashelp.class where name in (<list of names of those who are 14>);

If both are in the same system though you can just do the following - no need for formats or macro variables.

proc sql;
create table want as
select * from tableA where name in (select name from TableB);
quit;

So no need to generate a macro list or format or any code.
ballardw
Super User

I suspect a different approach such as a format might be more flexible.

Skeleton program example.

/* IF this set actually only contains valid values
   you would need a separate Fmtname value for each specific list
   of valid values. The C is for character variables, use N for numeric
*/
proc sql ;
   Create table validcntlin as
   select distinct (value) as start, "validvalue" as fmtname, label='Valid', type='C'
   from myfile;
quit;

proc format cntlin=validcntlin;
run;

data want;
   set have;
   where put(variablename,validvalue.) ne 'Valid';
run;

Advantage: once you have the cntlin data set created you don't need to pull values from the source again like you would with macro variables. Once the format is created if stored in library in the format search path you don't need to do anything until you have to update the format definition because of new data. If you need to add a new format for new variables you only need to create the data/code for that one format as the format catalog can hold many formats.

 

If multiple variables share the same range of valid values then  you only need one format to test.

If variables are numeric and continuous it would likely be easier to write range statements in proc format, especially if you have a few code values .

Proc format;
value somefmtname
1 - 3000='Valid'
9999='Valid'
;
run;

For an example that only reports numbers in the range of 1 to 3000 (including decimal portions) and the code 9999 as 'Valid'.

This particular approach would have a formatted value of the numeric value as a BEST for anything except those in the ranges. OR use an othere= 'Invalid' and test for 'Invalid' as the formatted value. It is possible to code an Other into a Cntlin data set but would need more coding to add to the cntlin created in SQL to append a record with the variable, HLO and the value needed to set that.

 

Personally I've never used a discrete format with more than about 2,000 values but I don't see a problem. And the use is dead easy and possible in lots of places.

 

Obviously that dummy Want set should be restricted to only include identification variables and the invalid value variable(s).

 

I actually use a few formats like this when I read data sets because I have data sources that slip in new things like site identifiers.

So I have in my data step that reads the data lines of code like:

 

If put (siteid, validsitefmt.) ne 'Valid' then put "WARNING: Invalid site id of " siteid +1 "for " idvariable= _n_= ;

Then my log shows information about the problems before the data gets processed any further and I can 1) validate whether that value should be there from the sources and take appropriate action. Sometimes it is a simple spelling error, sometimes in means a new site and I update the metadata about the sites and the validation format (append a record if the Cntlin is a data set, or add line of code to Proc Format statements) and rerun the proc format to update. Then reread the file.

SAS INNOVATE 2024

Innovate_SAS_Blue.png

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Get the $99 certification deal.jpg

 

 

Back in the Classroom!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 580 views
  • 0 likes
  • 4 in conversation