BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
2222
Calcite | Level 5

Hello community,

Can anyone help me to see if in multiple datasets an array or list of variables exists? Ideally, the code would check the "dirty" datasets and then compare these to the corresponding "cleaned" dataset checking for a set of variables that have (hopefully) been removed from the "dirty" datasets making them the "cleaned" datasets. For example, datasets1-5 have variables that have been removed to create dataset1(-5)_cleaned and id like some code to check to make sure all the "cleaning" has been done by returning a 0 or 1 or exists/does not exist for each variable in each cleaned dataset. (using sas EG)
Hope this makes some sense.

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

Try this instead then. 

 

First create some fake data to test this in a clean NEW SAS session, nothing else should exist in the work library.


*fake data set 1;
data class1;
    set sashelp.class;

    if age=14 then
        call missing(height, weight, sex);

    if name='Alfred' then
        call missing(sex, age, height);
    label age="Fancy Age Label";
run;

*fake data set 2;
data class2;
    set sashelp.class;

    if age=14 then
        call missing(height, weight, sex);

    if name='Alfred' then
        call missing(sex, age, height);
    label age="Fancy Age Label";
run;

Then run this code, exactly as is:

 



%let lib_compare = WORK;


%macro count_missing(input_dsn=);
*define formats to count missing values;
proc format;
    value $ missfmt ' '="Missing" other="Not Missing";
    value nmissfmt .="Missing" other="Not Missing";
run;

*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN.;
    table _all_ / missing;
    format _numeric_ nmissfmt. _character_ $missfmt.;
run;

ods select all;
*Format output;

data long;
    length variable $32. variable_value $50.;
    set temp;
    INPUT_DATASET = "&input_dsn.";
    Variable=scan(table, 2);
    Variable_Value=strip(trim(vvaluex(variable)));
    presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
    keep INPUT_DATASET variable variable_value frequency percent cum: presentation;
    label variable='Variable' variable_value='Variable Value';
run;

proc sort data=long;
    by INPUT_DATASET variable;
run;

*transpose only N;
*save to data set with prefix _CT_;
proc transpose data=long out=_ct_%scan(&input_dsn., 2, '.') prefix=N_;
    by INPUT_DATASET variable;
    id variable_value;
    var frequency;
run;

proc datasets lib=&lib_compare. nolist nodetails;
delete temp long;
quit;
%mend;


*run the macro for all data sets in the library specified;
data run_for_all_datasets;
set sashelp.vtable;
where libname = upcase("&lib_compare"); *must be upper case;
str = catt('%count_missing(input_dsn=', libname, '.', memname, ');');
call execute(str);
keep libname memname str;
run;

*Append all results together into one data set;
data summary_all;
length input_dataset $41. variable $32.;
set _ct_:;
run;

Now check the work library for a dataset called SUMMARY_ALL.

It should have a data set that looks like this if you run a PROC PRINT on it:

Obs	input_dataset	variable	_NAME_	N_Missing	N_Not Missing
1	WORK.CLASS1	Age	Frequency	1	18
2	WORK.CLASS1	Height	Frequency	4	15
3	WORK.CLASS1	Name	Frequency	.	19
4	WORK.CLASS1	Sex	Frequency	4	15
5	WORK.CLASS1	Weight	Frequency	4	15
6	WORK.CLASS2	Age	Frequency	1	18
7	WORK.CLASS2	Height	Frequency	4	15
8	WORK.CLASS2	Name	Frequency	.	19
9	WORK.CLASS2	Sex	Frequency	4	15
10	WORK.CLASS2	Weight	Frequency	4	15

This tells you the data set, the variable in the data set and how many records are missing/not missing. 

 

Is this what you're looking for? If so, then you can change the library reference in the code above (%let lib_compare = WORK;) to be the library you want to check and run the code again for it. Delete temporary datasets between runs and tests with the following, changing lib in the code as needed.

*uncomment once you test this is working;
*delete temporary count tables for each dataset;
proc datasets lib=WORK nodetails nolist;
delete _ct_: class: run_: summary:;
quit;

View solution in original post

35 REPLIES 35
Kurt_Bremser
Super User

Create dataset with your "allowed" variable names.

Then, use DICTIONARY.COLUMNS:

proc sql;
create table unclean as
  select memname, name
  from dictionary.columns
  where libname = "YOURLIB" and upcase(name) not in (select upcase(name) from allowed)
;
quit;

libname must be in uppercase.

2222
Calcite | Level 5

hi Kurt, thank you for replying.

could you please tell me what that code you wrote does and where do i put all the variables to check?

Kurt_Bremser
Super User
proc sql;
create table unclean as
  select memname, name /* these are variables in the COLUMNS pseudo-table */
  from dictionary.columns
  where
    libname = "YOURLIB" /* this should be the library containing your datasets */
    and upcase(name) not in
      (select upcase(name) from allowed) /* a sub-select querying a dataset */
;
quit;

The dataset WORK.ALLOWED in my example needs to contain your valid variable names.

2222
Calcite | Level 5

Hi Kurt,

 

Thanks so much for the explanation. I tried this but got an error i dont know how to fix (below):

 

proc sql;
create table unclean as
select memname, a90_cleaned, a95_cleaned /* these are variables in the COLUMNS pseudo-table */ (these are two of the many variables to check)
from dictionary.columns
where
libname = "\\SAS\D\S20\A\DATA" /* this should be the library containing your datasets */ (all the datasets are in this folder)
and upcase(name) not in
(select upcase(name) from allowed) /* a sub-select querying a dataset */
;
quit;

 

error.PNG

Kurt_Bremser
Super User

You have to create the dataset WORK.ALLOWED first, with a variable called NAME which contains your valid variable names. You can use a DATA step with DATALINES for this (unless you already have another source for this).

2222
Calcite | Level 5

Hi Kurt,

Thank you and happy new year!

I think I have miscommunicated the goal and so please forgive me for making it more complicated that it needed to be.

What I'm trying to do is see if, within a dataset, a certain variable value has been removed. The variable itself (the column name) will remain but the goal of the checking is to make sure sure that the cell has been cleared. There are 10 datasets and about 10 variables in each ds (some of which have the same name in each ds). So i just want to make sure that the code that cleans the dirty dataset has done its job. Maybe a count and compare would do it?

2222
Calcite | Level 5

I did this and the results say 0 for each variable so im assuming that means there is nothing in each of the variable columns, but this seems wrong/ugly and if i had a million variables id have to type them all in which seems wrong:

libname mylib "\\folder path";

proc sql;
select count(a7) as a7,
count(e1) as e1,
count(m2) as m2,
count(p) as p,
from mylib.w90_cleaned;
quit;

2222
Calcite | Level 5
Someone else has written code to remove the values (these are old values that are no longer valid), i want to check to make sure that the values have indeed been removed.
2222
Calcite | Level 5
yes sir. wish i had written it like you just did, sorry.
Patrick
Opal | Level 21

@2222 Before talking this any further can you please confirm if below code is heading into the direction of what you're after?

If it is not then can you please be as specific as possible what you want and ideally provide a have and a want table/example so we can understand.

If below code is heading into the direction of what you're after then please specify further:
1. Are the variables to test of type character or of type numeric?

2. Can there be more than one value you need to search for?

3. Is this a one-off or something you need to run on a regular basis?

...and I just read in your latest post that "some variables are string, some are numbers": I guess this means that unlike to your initial post you actually need to check for different values (or list of values) per variable. Can you please be specific and best provide some representative sample data and desired result as this often reduces requirement ambiguity a lot.

%macro exceptions(libname=,memname=,varname=,value=,exception_tbl=work.exceptions);

  proc datasets lib=work nolist nowarn;
    delete __exceptions;
  quit;

  data work.__exceptions;
    length
      __libname $8
      __memname $32
      __rownum  8
      __varname $32
      __value   $100
      ;
    keep
      __libname
      __memname
      __rownum
      __varname
      __value
      ;
    rename 
      __libname=libname
      __memname=memname
      __rownum =rownum
      __varname=varname
      __value  =value
      ;

    set &libname..&memname;
    if &varname="&value" then
      do;
        __libname="%upcase(&libname)";
        __memname="%upcase(&memname)";
        __rownum=_n_;
        __varname="&varname";
        __value="&value";
        output;
      end;
  run;

  proc append base=&exception_tbl data=work.__exceptions;
  run;

%mend;


proc datasets lib=work nolist nowarn;
  delete exceptions;
quit;

%exceptions(libname=sashelp,memname=class,varname=name,value=Louise);
%exceptions(libname=sashelp,memname=classfit,varname=name,value=Louise);

proc print data=work.exceptions;
run;

Patrick_0-1672797335294.png

 

2222
Calcite | Level 5
Hello Patrick, thank you for your reply and help. It really is just as Kirk said "So you need to check for the given variables that the count of non-missing values is zero?"
Patrick
Opal | Level 21

@2222 wrote:
Hello Patrick, thank you for your reply and help. It really is just as Kirk said "So you need to check for the given variables that the count of non-missing values is zero?"

Then I leave it to @Kurt_Bremser to support you further.

The way I understood your requirement only certain values would have been deleted resulting in the variable having rows ("cells") with missing values and other rows with actual still valid values.

That's what the code I posted is doing: Creating an exception table for all the "cells" with invalid values (the ones that should have been deleted). If everything is o.k. then this exception table has no rows.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 35 replies
  • 2827 views
  • 3 likes
  • 7 in conversation