Hi all! For a project I am building a data quality checking framework. I have a table containing ~100 different variable names, and their allowed values and subsequently have to check if all values are within these (discrete) allowed values. My end goal is to have a flag for each variable indicating if all values are within the allowed set of values. I know this is not very common, but I was trying to solve this by implementing an SQL query within a datastep in the following way.
DATA WANT;
set HAVE;
call symputx('var_to_check', NAME); /* Creating global variable to pass to the sql query */
call symputx('allowed_vals', allowed_vals_str);
PROC SQL;
SELECT DISTINCT count(&var_to_check.)
INTO :var_count
FROM HAVE2
WHERE &var_to_check. NOT IN (&allowed_vals.)
;QUIT;
N = &var_count;
RUN;
Now the SQL query seems to work, but only for one variable.. Does it 'break out' of the datastep when being run? And are their other, more efficient ways of implementing this? I am quite new to SAS..
Thank you!
EDIT: I see I should have provided some data to make it more clear, sorry - quite new to SAS.. This is approximately what I want:
data HAVE;
infile datalines delimiter='|';
length NAME $9;
length AllowedValues $5;
input NAME $ AllowedValues $;
datalines;
Confirmed_Status|{0,1}
Remidiation_Flag|{0,1}
;
data HAVE2;
infile datalines delimiter='|';
input Confirmed_Status $ Remidiation_Flag $;
datalines;
0|1
1|0
0|0
0|0
1|0
2|0
0|0
;
data WANT;
infile datalines delimiter='|';
length NAME $9;
length AllowedValues $5;
input NAME $ AllowedValues $ Flag;
datalines;
Confirmed_Status|{0,1}|1
Remidiation_Flag|{0,1}|0
;
So first structure you rules metadata so that it will be easy to use it to generate code. For example if the values are character then the valid values need to be in quotes.
data rules ;
length NAME $32 allowedValues $100;
input NAME AllowedValues ;
datalines;
Confirmed_Status (0,1)
Remidiation_Flag ('0','1')
;
Then use the RULES metadata to generate the CODE you need to test the rules.
filename code temp;
data _null_;
set rules end=eof;
file code;
if _N_=1 then put
'proc sql;'
/'create table wide as'
/'select'
/' ' @;
else put ',' @;
put 'max(not ' name 'in ' allowedvalues ') as ' name ;
if eof then put
'from actual'
/ ';'
/ 'quit;'
/ 'proc transpose data=wide out=want(rename=(col1=FLAG)) name=name;'
/ 'run;'
;
run;
Now let's create some input data:
data actual ;
input Confirmed_Status Remidiation_Flag $;
datalines;
0 1
1 0
0 0
0 0
1 0
2 0
0 0
;
And run the code against it.
%include code / source2;
Results:
2809 +proc sql; 2810 +create table wide as 2811 +select 2812 + max(not Confirmed_Status in (0,1) ) as Confirmed_Status 2813 +,max(not Remidiation_Flag in ('0','1') ) as Remidiation_Flag 2814 +from actual 2815 +; NOTE: Table WORK.WIDE created, with 1 rows and 2 columns. 2816 +quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 2817 +proc transpose data=wide out=want(rename=(col1=FLAG)) name=name; 2818 +run; NOTE: There were 1 observations read from the data set WORK.WIDE. NOTE: The data set WORK.WANT has 2 observations and 2 variables. NOTE: PROCEDURE TRANSPOSE used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
Obs name FLAG 1 Confirmed_Status 1 2 Remidiation_Flag 0
2 things.
Hi, thank you for the response!
Yes, my goal is to ensure all data within a large table (HAVE2) adheres to some constraints provided in a smaller table (HAVE). I would like to create a flag within the small table (HAVE) if one or more values within the large table are not within my bounds.
Here is a small example of what my data can look like.. Note that the allowed values can also be set of strings, and are not limited to just two values!
Ok. And what should happen if the allowed values are violated?
Have you ever seen a data step where the infile is a picture? I haven't, and no one else here, because data steps can only read text files.
So please, do never (as in NEVER) supply pictures, always (as in ALWAYS) supply data steps with datalines, which we can easily copy/paste into our SAS environment and submit to recreate your data.
You can combine SQL and data step logic in PROC DS2.
Not sure if that solves your specific case.
Hi,
Since you ask for possible approaches, I would suggest you review Ron Cody's data cleaning book. Even the free excerpt may be helpful:
https://www.sas.com/storefront/aux/en/spcodydata/61703_excerpt.pdf
See on page 15, the section "Using Formats to Check for Invalid Values".
It sounds like you already have a data source that defines the valid values for each variable. If you create a format for each variable which defines valid/invalid, then counting the valid/invalid is as simple as running PROC FREQ. There is a bit of over head to creating the formats and assigning them, but if you're developing a data cleaning process, I really like the use of formats for simple checks like this. And there are lots of other approaches to data cleaning in the book as well. It's a great book.
--Q.
So first structure you rules metadata so that it will be easy to use it to generate code. For example if the values are character then the valid values need to be in quotes.
data rules ;
length NAME $32 allowedValues $100;
input NAME AllowedValues ;
datalines;
Confirmed_Status (0,1)
Remidiation_Flag ('0','1')
;
Then use the RULES metadata to generate the CODE you need to test the rules.
filename code temp;
data _null_;
set rules end=eof;
file code;
if _N_=1 then put
'proc sql;'
/'create table wide as'
/'select'
/' ' @;
else put ',' @;
put 'max(not ' name 'in ' allowedvalues ') as ' name ;
if eof then put
'from actual'
/ ';'
/ 'quit;'
/ 'proc transpose data=wide out=want(rename=(col1=FLAG)) name=name;'
/ 'run;'
;
run;
Now let's create some input data:
data actual ;
input Confirmed_Status Remidiation_Flag $;
datalines;
0 1
1 0
0 0
0 0
1 0
2 0
0 0
;
And run the code against it.
%include code / source2;
Results:
2809 +proc sql; 2810 +create table wide as 2811 +select 2812 + max(not Confirmed_Status in (0,1) ) as Confirmed_Status 2813 +,max(not Remidiation_Flag in ('0','1') ) as Remidiation_Flag 2814 +from actual 2815 +; NOTE: Table WORK.WIDE created, with 1 rows and 2 columns. 2816 +quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 2817 +proc transpose data=wide out=want(rename=(col1=FLAG)) name=name; 2818 +run; NOTE: There were 1 observations read from the data set WORK.WIDE. NOTE: The data set WORK.WANT has 2 observations and 2 variables. NOTE: PROCEDURE TRANSPOSE used (Total process time): real time 0.00 seconds cpu time 0.01 seconds
Obs name FLAG 1 Confirmed_Status 1 2 Remidiation_Flag 0
Thank you! This works and allows me to be very flexible! Now I'm just gonna try to understand it properly 🙂
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.