DATA Step, Macro, Functions and more

Macro Data Quality: check numeric values outside boundary conditions character outside value list

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 127
Accepted Solution

Macro Data Quality: check numeric values outside boundary conditions character outside value list

Dear experts,

 

  1. I have the following table (input_analysis) as result of a proc content:

 

Library

Library Member Name

Data Set

Special

Variable Name

Variable

Variable

Variable

Variable

Variable Format

Name

 

Label

Data Set

 

Type

Length

Number

Label

 

 

 

 

Type (From

 

 

 

 

 

 

 

 

 

TYPE=)

 

 

 

 

 

 

IN

AG_201602

 

 

AB

1

8

43

 

BEST

IN

AG_201602

 

 

ACTIF

2

1

61

 

$

IN

AG_201602

 

 

ALT_RV_NR

1

8

71

 

BEST

IN

AG_201602

 

 

ALT_VNR

1

8

70

 

BEST

 

  1. I enhance the data with the following additional information:

 

proc sql; create table DC.input_analysis as select t.*

,t1.Variable_Name_Original   

,t1.Variable_Name_English    

,t1.Variable_description     

,t1.Variable_value_list

,t1.Variable_category  

,t1.Variable_min 

,t1.Variable_max

from DC.input_analysis t

left outer join In.Variables_information t1 on (t.variable_name=t1.Variable_Name_Original or t.variable_name=t1.Variable_Name_English)

;run;

 

  1. On the basis of the new variables(from Variables_information t1), I should perform some checks on the data related to each observation of the proc contents (e.g. ACTIF contained in Work.Test).
  2. I enhance the table with some variables that should be calculated for each observation of the proc content (i. variables in Work.Test)

proc sql; create table DC.input_analysis_result as select

t.*

,0 as n_missing

,0 as perc_missing

,0 as over_up_bound

,0 as max

,0 as over_dow_bound

,0 as min

,0 as n_not_value_list

,"                         " as unexpected_values

from DC.input_analysis t ;run;

 

  1. I would like to perform some calculations (here below two of them) recursively for each observation of the proc contents (variable name):

 

  1. in the table that I mentioned above (create table DC.input_analysis) there is a variable named Variable_value_list that contains for all the non numeric variables the value lists. E.g. for the Variable Source I have AG, SA. Check requested: running a proc sql for each not numeric variable and grouping by variable in order to get the value list and list all the values that are not in the value list (e.g. I found AB instead of AG and SA put not like in (‘%AG%’,’%SA%’)
  2. in the table that I mentioned above (create table DC.input_analysis) I have also to variables, t1.Variable_min,t1.Variable_max indicating the upper and lower bound. Min and max generated within the proc means are already a helpful indication but my aim is also to get the amount of observation for the numeric variables that do not satisfy the boundary conditions.

 

How can I automatize these calculation keep it as simple as possible? Thank a lot in advance.


Accepted Solutions
Solution
‎04-11-2016 10:04 AM
Super User
Super User
Posts: 7,392

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Hi,

 

Well, I am not writing it for you.  Here is an example using sashelp.class and a input dataset.  Basically the idea is from your test metadata, to generate a call out to one of the test macros.  You can have as many test macros, or calls to them as you like:

data input_analysis_res;
  length variable_name_original variable_list variable_min variable_max $200;
  variable_name_original="NAME"; output;
  variable_name_original="SEX"; variable_list="M F"; output;
  variable_list="";
  variable_name_original="AGE"; variable_min="14"; variable_max="20"; output;
  variable_name_original="HEIGHT"; variable_min="57"; variable_max="63"; output;
  variable_name_original="WEIGHT"; variable_min="90"; variable_max="120"; output;
run;

proc sql;
  create table RESULTS (VAR char(200),TEST char(200),ID char(200),VAL char(200));
quit;

%macro Check_List (ds=,var=,vlist=);
  proc sql;
    insert into RESULTS 
    select  distinct
            "&VAR.",
            "In List",
            NAME,
            &VAR.
    from    &DS.
    where   &VAR. not in (&VLIST.);
  quit;
%mend Check_List;

%macro Check_Range (ds=,var=,vmin=,vmax=);
  proc sql;
    insert into RESULTS
    select  distinct
            "&VAR.",
            "Outside Range &VMIN. - &VMAX.",
            NAME,
            strip(put(&VAR.,best.))
    from    &DS.
    where   &VAR. not between &VMIN. and &VMAX.;
  quit;
%mend Check_Range;

data _null_;
  set input_analysis_res;
  if variable_list ne "" then 
    call execute(cats('%Check_List (ds=sashelp.class,var=',variable_name_original,',vlist="',tranwrd(strip(variable_list)," ",'" "'),'");'));
  if variable_min ne "" then 
    call execute(cats('%Check_Range (ds=sashelp.class,var=',variable_name_original,',vmin=',variable_min,',vmax=',variable_max,');'));
run;

View solution in original post


All Replies
Super User
Super User
Posts: 7,392

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Hi,

 

What may be easier, is if you can post some test data (in the form of a datastep) and what you want the output to look like.  I would imagine a simple datastep with call execute() should be able to achieve what you want.  Its a bit difficult to say without seeing the actual data and what DC.input_analysis looks like.

data have;
  ab="ABC"; output;
  ab="DEF"; output;
run;
data test;
  var='AB'; test='ABC';
run;
proc sql;
  create table RESULTS (TEST char(200),RESULT char(200));
quit;
data _null_;
  set test;
  call execute('proc sql; insert into RESULTS set TEST="where '||strip(var)||' ne '||strip(test)||'",
                  RESULT=case when exists(select distinct '||strip(var)||' from HAVE where '||strip(var)||' ne "'||strip(test)||'") then "Fail" else "Pass" end; quit;');
run;
Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

[ Edited ]

HI RW9, attached you find:

- DC.input_analysis: input data set (data from the proc contents enhanced)

- test: data set to be tested

- expected results: in green the data that I already get running the proc means below:

 

proc means
data=test STACKODS n nmiss range min max;
var _numeric_;
ods output summary=stacked
;run;

 

In yellow the data I would like to get in an easy manner. My idea was: take individually each variable of the data set 'DC.input_analysis' (Macro + loop) and store the result obtained in 'DC.input_analysis'.

 

Thanks for your support,

Super User
Super User
Posts: 7,392

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Yes, so you can do something along these lines, within a if block, define the code to be checked and update in your master results table:

data _null_;
  set dc.input_analysis;
  if variable_value_list ne "" then do;
    call execute('proc sql; insert into RESULTS select distinct "Value not in list",CATEGORY from HAVE 
                    where CATEGORY not in ("'||strip(tranwrd(variable_value_list,",",'","'))||'");
                  quit;');
  end;
  if variable_min ne "" then do;
    call execute('proc sql; insert into RESULTS select distinct "Not in range",PAYMENT_DATE from HAVE 
                    where PAYMENT_DATE not between input("'||strip(VARIABLE_MIN)||'",ddmmyy10.) and input("'||strip(VARIABLE_MAX)||'",ddmmyy10.);
                  quit;');
  end;
  ...
run;
Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

thanks a lot for the suggestion... I will check it.

Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

[ Edited ]

Dear RW9,

 

I am trying to use your code but I am still doing something wron: here the the current version of my code:

 

data _null_;
set DC.input_analysis_res;
if Variable_value_list ne "" then do;
call execute('proc sql; insert into DC.input_analysis_res select distinct "Unexpected_values", variable_value_list in.test
where _all_ not in ("'||strip(tranwrd(variable_value_list,",",'","'))||'");
quit;');
end;
/*variable_min ne "" then do;
call execute('proc sql; insert into outliers_down select distinct "Not in range",PAYMENT_DATE from HAVE
where PAYMENT_DATE not between input("'||strip(VARIABLE_MIN)||'",ddmmyy10.) and input("'||strip(VARIABLE_MAX)||'",ddmmyy10.);
quit;');
end;*/
run;

 

Attached again the input file and the error message I got.

I do not get the role of the variable you described "category". Since I should do the same check for all the the variables I called it _ALL_. 

Further help on it is welcome.

Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

log screenshot


screenshot.PNG
Super User
Super User
Posts: 7,392

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

data _null_;
  set DC.input_analysis_res;
  if Variable_value_list ne "" then do;
    call execute('proc sql; 
                    insert into DC.input_analysis_res 
                    select distinct "Unexpected_values", variable_value_list in.test
                    where _all_ not in ("'||strip(tranwrd(variable_value_list,",",'","'))||'");
                  quit;');
  end;
run;

Well, I am not sure what: 

variable_value_list in.test

 Is meant to be, its not valid SQL syntaxt.  If you take in.test out, add a from statement, and remove the _ALL_ reference it should work.  Oh, one thing you might need to bear in mind, if you put a long string in there, you will get a warning to indicate long string, to avoid this use:
options noquotelenmax;

Before, and afterwards:

options quotelenmax;

 

One other thing to remember, you can put your actual code in a macro and call that, this may help organisation:

%macro In_list (var=,vals=);
  proc sql; 
    insert into DC.input_analysis_res 
    select distinct "Unexpected_values", variable_value_list in.test
    where  &VAR. not in (&VALS.);
  quit;
%mend In_List;

data _null_;
  set DC.input_analysis_res;
  if Variable_value_list ne "" then call execute('%In_List (var='||strip(var)||',vals=',strip(variable_value_list),');');
run;

 

Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Dear RW9,

 

the structure with the macrois at list for me better organized and I prefer it but still, I do not get what I want. :-(

Based on my understanding:

1. you check for each row of the table DC.input_analysis_res if the variable Variable_value_list is not empty, in case not you go to the next step;

2. using call execute you call the specified macro, defining the required inputs:

      2a. var='||strip(var)||;

      2b. vals=',strip(variable_value_list);

3. the macro recursively for each row, select the variable (colum) in the table test (e.g. Nationality) extracts all the values assumed (e.g. Jpanese, French...)  that are not  &vals;

4. the values selected in 3 are stored as variable "Unexpected_values" in DC.input_analysis_res.

 

Did I understand correctly?

Here the part that I am using for the debug and attached the log.

 

%macro In_list (var=,vals=);
proc sql;
insert into DC.input_analysis_res
select distinct
"Unexpected_values",
variable_value_list in.test
where &VAR. not in (&VALS.);
quit;
%mend In_List;
data _null_;
set DC.input_analysis_res;
if Variable_value_list ne "" then call execute('%In_List (var='||strip(var)||',vals=',strip(variable_value_list),');');
run;

/* debug:
proc sql; create table aaa as select &var., count(*) as n from in.test where &var. not in (&vals.) group by &var.; run;
proc sql; create table aaa as select segment, count(*) as n from in.test where segment not in (&vals.) group by segment; run;
*/


screenshot_2.PNG
Super User
Super User
Posts: 7,392

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

The error is here:

if Variable_value_list ne "" then call execute('%In_List (var='||strip(var)||',vals=',strip(variable_value_list),');');

                                                                                                            ^                                  ^

 

Should be 

if Variable_value_list ne "" then call execute('%In_List (var='||strip(var)||',vals='||strip(variable_value_list)||');');

 

You use commas when using a function to concatenate, e.g. cat("abc","def").  Your not in this instance so you ues the double bars.  Call execute accepts one string.  That is what your basically doing here creating a string which then gets pushed out after the datastep stops.  That string then becomes code,  so is then syntax checked and compiled.  If you struggling, check your string looks right first.

Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Dear RW9,

 

it gets better and better and when it will be working I can also adapt the same macro for defferent purposes but still not get the data I want. I googled  for the error and I tryit to get more information from the debug (Options nomacrogen NoSymbolgen nomlogic nomprint nomfile.) but I do not get why the variable var is not initialized: ERROR: All positional parameters must precede keyword parameters. Attached the log.


screenshot3.PNG
Super User
Super User
Posts: 7,392

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

The clue here is in the NOTE: Variable var is uninitialised.  In your dataset DC.input_analysis_res, look for the variable containing the variable name of the test in question - could it be variable_name_english?  The second thing to check, and I can't see it here, is what does variable_value_list contain, is there some values with a comma, e.g. A,B,D?  If so then you will need to pass it as a string:

if ... call execute('%In_List (var='||strip(variable_name_english)||',vals="'||strip(variable_value_list)||'");');

Note the double quotes after vals=.  

Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Dear RW9,

 

thanks for the indication, I now replaced all the "," used to separate the value in the value list with "_" and I do not get the error message anymore. :-)

Anyway I am still struggling with a syntax error (attachment). Any idea? Thanks again.


screenshot3.PNG
Super User
Super User
Posts: 7,392

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Your text is still being generated as:

...select distinct "Unexepected values",variable_value_list in.test where . not in...

                                                                                  ^      ^

The indicated part is not valid syntax.

Also, you have warnings about the variable VAR not existing in dataset DC.input_analysis_res, that is also why you have the var uninitialised.  I can't see where it is coming from, you would need to show the macro, and the dataset (at least a few lines of it) which feed into it.

Frequent Contributor
Posts: 127

Re: Macro Data Quality: check numeric values outside boundary conditions character outside value lis

Dear RW,

 

sorry but I am getting confuse, let me do a short recap:

1. DC.Input_analysis_res is created as a result of a proc contents (in this case of the file in.test)

proc contents data=in.test
out=DC.input_analysis noprint
;run;

2. some variables are dropped and other renamed:

proc sql; create table DC.input_analysis as select
(cats(trim(t.LIBNAME),trim(t.MEMNAME),trim(t.NAME))) as unique_ID_data_checks
,t.LIBNAME as library_name
,t.MEMNAME as data_set_name
,t.NAME as variable_name
,t.FORMAT
,t.LENGTH
,t.NOBS as observation_number
,t.MODATE as Last_Modified_Date from DC.input_analysis t ;run;

3. the following vatiables are added (e.g Variable_value_list) using an external data source:

proc sql; create table DC.input_analysis as select t.*
,t1.Variable_Name_Original
,t1.Variable_Name_English
,t1.Variable_description /* variable description */
,t1.Variable_value_list /* possible values (not comma separated!!!!!!) */
,t1.Default_value
,t1.Variable_category /* Classification: date, amount, ID, key */
,t1.Variable_min /* lower bound */
,t1.Variable_max /* upper bound */
from DC.input_analysis t
left outer join In.Variables_information t1 on (t.variable_name=t1.Variable_Name_Original or t.variable_name=t1.Variable_Name_English)
;run;
4. run the proc means and get already for the numeric variable in test the following information:

proc means
data=in.test STACKODS n nmiss range min max;
var _numeric_;
ods output summary=DC.stacked
;run;

5.Create the table DC.input_analysis_res where: the data from input analysis are taken and enhanced with the ones from DC.stacked and additionally 3 new variable are created (in bold):

proc sql;
create table DC.input_analysis_res as
select t1.*, t2.*,
0 as outliers_up,
0 as outliers_down,
" " as Unexpected_values
from DC.input_analysis t1 left outer join DC.stacked t2 on t1.variable_name=t2.Variable
;quit;

Till the point 5 everything is running smoothly. Now I should perform some checks that the command STACKODS was not able to do. For instance: find the value for each observation in DC.input_analysis_res, on the basis of Variable_value_list I would like to fill the variable Unexpected_values. How?Taking each observation of the variable variable_name in DC.input_analysis_res as variable in in.test and comparing the value of all observation for each variable with the value list given in DC.input_analysis_res (field Variable_value_list). In order to get what I want at the point 5, the macro you suggested seems to be the (almost) the perfect solution:

 

%macro In_list (var=,vals=);
Options nomacrogen NoSymbolgen nomlogic nomprint nomfile;
proc sql;
insert into DC.input_analysis_res
select distinct
"Unexpected_values",
variable_value_list in.test
where &VAR. not in (&VALS.);
quit;
%mend In_List;


data _null_;
set DC.input_analysis_res;
if Variable_value_list ne "" then call execute('%In_List (var='||strip(var)||',vals='||strip(variable_value_list)||');');
run;

 

Coming back to your question:

in in.test there is no variable named Variable_value_list (this was the input and it should remanin unchanged, the value list is given for each variable by all the observations).

in DC.input_analysis_res there is the variable Variable_value_list since it was taken as enahncement (left outer join) if the name would be worng, I would not get it in DC.input_analysis_res.

Did we have the same understanding of the same Situation? Thanks a lot RW9.

 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 16 replies
  • 494 views
  • 1 like
  • 2 in conversation