BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
dvdh
Fluorite | Level 6

Hi all! For a project I am building a data quality checking framework. I have a table containing ~100 different variable names, and their allowed values and subsequently have to check if all values are within these (discrete) allowed values. My end goal is to have a flag for each variable indicating if all values are within the allowed set of values. I know this is not very common, but I was trying to solve this by implementing an SQL query within a datastep in the following way. 

 

DATA WANT;
set HAVE;

call symputx('var_to_check', NAME); /* Creating global variable to pass to the sql query */
call symputx('allowed_vals', allowed_vals_str);


PROC SQL;
SELECT DISTINCT count(&var_to_check.)
INTO :var_count
FROM HAVE2
WHERE &var_to_check. NOT IN (&allowed_vals.)
;QUIT;

N = &var_count; 
RUN;

 

Now the SQL query seems to work, but only for one variable.. Does it 'break out' of the datastep when being run? And are their other, more efficient ways of implementing this? I am quite new to SAS..

 

Thank you!  

 

EDIT: I see I should have provided some data to make it more clear, sorry - quite new to SAS.. This is approximately what I want: 

 

data HAVE;
infile datalines delimiter='|';
length NAME $9;
length AllowedValues $5;
input NAME $ AllowedValues $;
datalines;
Confirmed_Status|{0,1}
Remidiation_Flag|{0,1}
;

data HAVE2;
infile datalines delimiter='|';
input Confirmed_Status $ Remidiation_Flag $;
datalines;
0|1
1|0
0|0
0|0
1|0
2|0
0|0
;

data WANT;
infile datalines delimiter='|';
length NAME $9;
length AllowedValues $5;
input NAME $ AllowedValues $ Flag;
datalines;
Confirmed_Status|{0,1}|1
Remidiation_Flag|{0,1}|0
;

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

So first structure you rules metadata so that it will be easy to use it to generate code.  For example if the values are character then the valid values need to be in quotes.

data rules ;
  length NAME $32 allowedValues $100;
  input NAME AllowedValues ;
datalines;
Confirmed_Status (0,1)
Remidiation_Flag ('0','1')
;

Then use the RULES metadata to generate the CODE you need to test the rules.

filename code temp;
data _null_;
  set rules end=eof;
  file code;
  if _N_=1 then put 
    'proc sql;'
   /'create table wide as'
   /'select'
   /' ' @;
  else put ',' @;
  put 'max(not ' name 'in ' allowedvalues ') as ' name ;
  if eof then put 
    'from actual'
  / ';'
  / 'quit;'
  / 'proc transpose data=wide out=want(rename=(col1=FLAG)) name=name;'
  / 'run;'
  ;
run;

Now let's create some input data:

data actual ;
  input Confirmed_Status Remidiation_Flag $;
datalines;
0 1
1 0
0 0
0 0
1 0
2 0
0 0
;

And run the code against it.

%include code / source2;

Results:

2809 +proc sql;
2810 +create table wide as
2811 +select
2812 + max(not Confirmed_Status in (0,1) ) as Confirmed_Status
2813 +,max(not Remidiation_Flag in ('0','1') ) as Remidiation_Flag
2814 +from actual
2815 +;
NOTE: Table WORK.WIDE created, with 1 rows and 2 columns.

2816 +quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


2817 +proc transpose data=wide out=want(rename=(col1=FLAG)) name=name;
2818 +run;

NOTE: There were 1 observations read from the data set WORK.WIDE.
NOTE: The data set WORK.WANT has 2 observations and 2 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
Obs          name          FLAG

 1     Confirmed_Status      1
 2     Remidiation_Flag      0

View solution in original post

10 REPLIES 10
PeterClemmensen
Tourmaline | Level 20

2 things.

 

  1. The Proc SQL step does not run within the Data Step. Rather, the data step runs first, then the Proc SQL. This is because, even though there is no Run Statement, the Proc Statement in Proc SQL serves as a step boundary
  2. You real goal here is to ensure that all variable values are within some pre defined values stored in a field in another table, correct? Can you supply some sample data?
dvdh
Fluorite | Level 6

Hi, thank you for the response! 

 

Yes, my goal is to ensure all data within a large table (HAVE2) adheres to some constraints provided in a smaller table (HAVE). I would like to create a flag within the small table (HAVE) if one or more values within the large table are not within my bounds. 

 

Here is a small example of what my data can look like.. Note that the allowed values can also be set of strings, and are not limited to just two values! 

 

 

 

Schermafbeelding 2022-11-29 103347.pngSchermafbeelding 2022-11-29 103428.png

PeterClemmensen
Tourmaline | Level 20

Ok. And what should happen if the allowed values are violated?

dvdh
Fluorite | Level 6
If the allowed values are violated, I would like to add an indication in the small table. So for example, a new parameter called domain_flag that's simply 1 or 0 - 0 if all values are within the allowed parameters, or 1 if they are not!
Kurt_Bremser
Super User

Have you ever seen a data step where the infile is a picture? I haven't, and no one else here, because data steps can only read text files.

So please, do never (as in NEVER) supply pictures, always (as in ALWAYS) supply data steps with datalines, which we can easily copy/paste into our SAS environment and submit to recreate your data.

LinusH
Tourmaline | Level 20

You can combine SQL and data step logic in PROC DS2.

Not sure if that solves your specific case.

Data never sleeps
dvdh
Fluorite | Level 6
Maybe! I will see if I can fix somethin with DS2
Quentin
Super User

Hi,

 

Since you ask for possible approaches, I would suggest you review Ron Cody's data cleaning book. Even the free excerpt may be helpful:

https://www.sas.com/storefront/aux/en/spcodydata/61703_excerpt.pdf

 

See on page 15, the section "Using Formats to Check for Invalid Values".

 

It sounds like you already have  a data source that defines the valid values for each variable.  If you create a format for each variable which defines valid/invalid, then counting the valid/invalid is as simple as running PROC FREQ.  There is a bit of over head to creating the formats and assigning them, but if you're developing a data cleaning process, I really like the use of formats for simple checks like this.  And there are lots of other approaches to data cleaning in the book as well.  It's a great book.

 

--Q.

 

 

 

 

Tom
Super User Tom
Super User

So first structure you rules metadata so that it will be easy to use it to generate code.  For example if the values are character then the valid values need to be in quotes.

data rules ;
  length NAME $32 allowedValues $100;
  input NAME AllowedValues ;
datalines;
Confirmed_Status (0,1)
Remidiation_Flag ('0','1')
;

Then use the RULES metadata to generate the CODE you need to test the rules.

filename code temp;
data _null_;
  set rules end=eof;
  file code;
  if _N_=1 then put 
    'proc sql;'
   /'create table wide as'
   /'select'
   /' ' @;
  else put ',' @;
  put 'max(not ' name 'in ' allowedvalues ') as ' name ;
  if eof then put 
    'from actual'
  / ';'
  / 'quit;'
  / 'proc transpose data=wide out=want(rename=(col1=FLAG)) name=name;'
  / 'run;'
  ;
run;

Now let's create some input data:

data actual ;
  input Confirmed_Status Remidiation_Flag $;
datalines;
0 1
1 0
0 0
0 0
1 0
2 0
0 0
;

And run the code against it.

%include code / source2;

Results:

2809 +proc sql;
2810 +create table wide as
2811 +select
2812 + max(not Confirmed_Status in (0,1) ) as Confirmed_Status
2813 +,max(not Remidiation_Flag in ('0','1') ) as Remidiation_Flag
2814 +from actual
2815 +;
NOTE: Table WORK.WIDE created, with 1 rows and 2 columns.

2816 +quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


2817 +proc transpose data=wide out=want(rename=(col1=FLAG)) name=name;
2818 +run;

NOTE: There were 1 observations read from the data set WORK.WIDE.
NOTE: The data set WORK.WANT has 2 observations and 2 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
Obs          name          FLAG

 1     Confirmed_Status      1
 2     Remidiation_Flag      0
dvdh
Fluorite | Level 6

Thank you! This works and allows me to be very flexible! Now I'm just gonna try to understand it properly 🙂 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 3573 views
  • 7 likes
  • 6 in conversation