BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
lansoprazole
Obsidian | Level 7

Hi,

 

I have a dataset with around 520 variable with many missing values. I want to delete variables with >90% missing values (I dont want to impute). Is there any macro or anything else to do it? 

 

I tried ChatGPT, but I got an error that GPT could not solve.

 

Here is GPTcode:

/* Step 1: Calculate the percentage of missing values for each variable */
ods output onewayfreqs=missing;
proc freq data=dxccsr_transposed;
tables _all_ / missing;
run;

/* Step 2: Identify variables with 90% or more missing values */
data _null_;
set missing end=lastobs;
if _n_ = 1 then call symputx('numvars',_nobs_);
array pctmiss(*) _:;
array vars(*) $ _CHARACTER_;
do i = 1 to dim(pctmiss);
if pctmiss(i) >= 90 then vars(i) = scan(vlabel(pctmiss(i)), 1, ' ');
end;
if lastobs then call symputx('delvars',catx(' ', of vars(*)));
run;

/* Step 3: Define a macro to drop variables */
%macro dropvars;
data dxccsr_transposed(drop=&delvars);
set dxccsr_transposed;
run;
%mend;

/* Step 4: Call the macro to drop variables */
%dropvars;

 

The log error:

NOTE: Line generated by the macro variable "DELVARS".
1 Table DXCCSR_EAR006 3
-
214
23

ERROR 214-322: Variable name 3 is not valid.

ERROR 23-7: Invalid value for the DROP option.

 

Thanks

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User
%macro drop_vars(dsn=,pct=);
proc transpose data=&dsn.(obs=0) out=vname;
var _all_;
run;
data _null_;
 set vname end=last;
 if _n_=1 then call execute('proc sql;create table n_miss as select ');
 call execute(catx(' ','nmiss(',_name_,') as ',_name_));
 if last then call execute("from &dsn.;quit;");
  else call execute(',');
run;
proc transpose data=n_miss out=n_miss2;
var _all_;
run;

%let dsid=%sysfunc(open(&dsn.));
%let nobs=%sysfunc(attrn(&dsid.,nlobs));
%let dsid=%sysfunc(close(&dsid.));

data n_miss3;
 set n_miss2;
 per_missing=col1/&nobs.;
run;
proc sql noprint;
select _NAME_ into :drops separated by ','
 from n_miss3
  where per_missing > &pct. ;
alter table &dsn.
 drop &drops.;
quit;
%mend;

data have;
 set sashelp.heart;
run;
/*
dsn is the dataset you want modify
pct is the percent of missing you want to drop
*/
%drop_vars(dsn=have,pct=0.6)

View solution in original post

15 REPLIES 15
ballardw
Super User

Names of your original variables? I ask because you are using

 

array pctmiss(*) _:;

And the ODS output for onewayfreqs will only have variables with names that start with _ for the variables I have created, none of the Proc Freq variables. If that is the case and some of your variables of interest do not start with _ that could be the problem. Or did you mean to use F_: for the formatted variables that Ods output creates?

 

 

The proc freq code you show would have the variable PERCENT with values of 90 or greater as your value of interest, not any of the other variables.  I don't see where you are checking for any missing values. I am afraid that your code checks to see if the original value was greater than 90, not the percent missing.

 

This will show the variables that have a value with a percent > 90 but not checking for missing.

data reduce;
   set missing;
   length var $ 32;
   var=scan(table,2);
   if percent>90;

   keep var percent;
run;

 

yabwon
Onyx | Level 15

You need a SAS programmer to help you, not a ChatGPT bull*hit, So, good you writing here 🙂

 

Try this:

data test;
  output;
  output;
  output;
  a=4;
  output;
  b=4;
  output;
  c=4;
  output;
  d=4;
  output;
  e=4;
  output;
  f=4;
  output;
  g=4;
  output;
  h=4;
  output;
  i=4;
  output;
  j=4;
  output;
  k=4;
  output;
  l=4;
  m=4;
  n=4;
  o=4;
  output;
run;
proc print data=test;
run;


data _null_;

  if 0 then set test; 
  array _variables_ _NUMERIC_; /* list your variables here */ 

  length _variableName_ $ 32 _numberOfMissing_ 8;
  declare hash _H_(ordered:'a');
  _H_.defineKey("_variableName_");
  _H_.defineData("_variableName_");
  _H_.defineData("_numberOfMissing_");
  _H_.DefineDone();

  do until(EOF);
    set test NOBS=NOBS end=EOF;
    
    do over _variables_;
      _variableName_  = vname(_variables_);
      _numberOfMissing_ = 0;
      _RC_=_H_.find(); 
      _numberOfMissing_ + nmiss(_variables_);
      _H_.replace();
    end;

  end;

  _H_.output(dataset:"missing_count");
  stop;
run;

proc print data=missing_count;
run;

data missing_count;
  if 0 then set test(drop=_ALL_) nobs=_nobs_;
  set missing_count;
  nobs=_nobs_;
  _percentOfMissing_ = _numberOfMissing_/nobs;
  format _percentOfMissing_ percent10.2;
run;

proc print data=missing_count;
run;

proc sql noprint;
  select _variableName_
  into :_variableNameMissing_ separated by " "
  from missing_count
  where _percentOfMissing_ > .90 /* <------------- 90% */
  ;
run;

options symbolgen;
data test2;
  set test(drop=&_variableNameMissing_.);
run;

 

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation



Ksharp
Super User
%macro drop_vars(dsn=,pct=);
proc transpose data=&dsn.(obs=0) out=vname;
var _all_;
run;
data _null_;
 set vname end=last;
 if _n_=1 then call execute('proc sql;create table n_miss as select ');
 call execute(catx(' ','nmiss(',_name_,') as ',_name_));
 if last then call execute("from &dsn.;quit;");
  else call execute(',');
run;
proc transpose data=n_miss out=n_miss2;
var _all_;
run;

%let dsid=%sysfunc(open(&dsn.));
%let nobs=%sysfunc(attrn(&dsid.,nlobs));
%let dsid=%sysfunc(close(&dsid.));

data n_miss3;
 set n_miss2;
 per_missing=col1/&nobs.;
run;
proc sql noprint;
select _NAME_ into :drops separated by ','
 from n_miss3
  where per_missing > &pct. ;
alter table &dsn.
 drop &drops.;
quit;
%mend;

data have;
 set sashelp.heart;
run;
/*
dsn is the dataset you want modify
pct is the percent of missing you want to drop
*/
%drop_vars(dsn=have,pct=0.6)
lansoprazole
Obsidian | Level 7

Great. Thanks for your help.

sqlGoddess
SAS Employee
i wanted to get 100% missing values for sas & a dataframe but get different number of columns for both. any idea why?

python import
df2=pd.read_csv("https://raw.githubusercontent.com/CharuSAS/SASPythonDataScientists/main/pattern_widespread_decline__...", encoding='latin-1')

here's the python code I used,
cols = dfbig.columns[dfbig.isnull().all
dfbig.drop(cols, axis=1, inplace=True)
24 columns are returned

sas import
filename dst2 url "https://raw.githubusercontent.com/CharuSAS/SASPythonDataScientists/main/pattern_widespread_decline__...";

proc import file=dst2 out=work.dst2 dbms=csv;
run;
I just made 2 changes to your sas code-
where per_missing = &pct. ;
%drop_vars(dsn=have,pct=1.0)
33 columns are returned.
Reeza
Super User
Which is correct?

Confirm the data is read in correctly in each program, likely one is reading it in correctly.

Both 'data import' procedures being used make a bunch of guesses in the background as to datatypes. Likely they're different, resulting in different data being processed by the drop_vars macro or the .drop() operation in python.
sqlGoddess
SAS Employee

good point,

I went back & ran a proc contents after the sas import, 

sqlGoddess_0-1724098046270.png

and here's the python describe on the dataframe

sqlGoddess_1-1724098093457.png

both have read in 66907 rows and 170 columns.

 

the problem seems to be at the point of dropping columns with 100% missing values.

Ksharp
Super User
And I noticed that your python function "dfbig.isnull()" only checked the NULL value, you also need to check EMPTY value.
Because SAS is unlike database, SAS take EMPTY value as NULL value , but database take them as two different thing .
Ksharp
Super User

OK. Here is another way to check if a variable is all missing.


data have;
 set sashelp.heart;
 call missing(sex,weight);
run;

ods select none;
ods output nlevels=want(where=(NNonMissLevels=0));
proc freq data=have  nlevels;
table _all_;
run;
ods select all;

You can compare it with my PROC SQL.

BTW, my code drop both character and numeric type variables , I am not sure your Python code would drop  both of them.

 

I am unable to test it due to not access to that csv .

and also you could check the missing value in both sas and python to see if there are some difference.

Ksharp
Super User
And make sure there are not the ERROR/WARNING info in LOG after running this macro.
Quentin
Super User

As an aside, because I already see a couple answers posted...

 

That code that ChatGPT spit out is really horrific.  You can imagine what it was trying to do (or I guess what it was predicting should be done), but it seems to me as so far off that it's not helpful.  I wonder if ChatGPT is significantly better at predicting code for other languages, where it might have had more training data.  I've seen plenty of programming folks saying they could use ChatGPT as a first draft of code, or to generate code for the basic/boring stuff.  But I'm not at all convinced that the SAS code it can generate is helpful. Interesting, maybe, but I'm not sure it's helpful.  

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.
ballardw
Super User

@Quentin wrote:

As an aside, because I already see a couple answers posted...

 

That code that ChatGPT spit out is really horrific.  You can imagine what it was trying to do (or I guess what it was predicting should be done), but it seems to me as so far off that it's not helpful.  I wonder if ChatGPT is significantly better at predicting code for other languages, where it might have had more training data.  I've seen plenty of programming folks saying they could use ChatGPT as a first draft of code, or to generate code for the basic/boring stuff.  But I'm not at all convinced that the SAS code it can generate is helpful. Interesting, maybe, but I'm not sure it's helpful.  


Considering that SAS Proc Import for text files, which knows what you are attempting to do and does a fair job, has recurring issues with making many id type variables numeric when likely not the best choice I sort of cringe at what Chatgpt might suggest for some of the files I read.

 

Of course we have absolutely no idea what @lansoprazole asked ChatGPT to do in the first place. Considering the percentage of poorly formed questions we see on this forum I would be surprised if the question was phrased well to begin with and then the bot used the very few parts that it understood, >90 for example, to build from.

 

Not to mention the likelihood of it coming up with use of a custom informat to provide data validations on reading is extremely small.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Reeza
Super User
Interesting response from ChatGPT. It seems to have the right steps but the code cobbled together doesn't align at all. The second step input wouldn't be what would come out of the PROC FREQ and it seems to be working on character variables only. And using labels not variable names.
Reeza
Super User

Apparently I wrote a macro to do this a few years ago...

https://gist.github.com/statgeek/feedc3fc520cb0d2018ca2a8cab241d8

 

 

%macro drop_missing_pct(input_dsn = , output_dsn=, pct = , id_vars=);

*input_dsn = input data set name;
*output_dsn = output data set name;
*pct = missing percent, variables with a percentage of missing above this value are dropped;
*id_vars = space delimited list of variables that you do not want to include in the analysis such as ID variables;


*create format for missing;
proc format;
    value $ missfmt ' '="Missing" other="Not Missing";
    value nmissfmt .="Missing" other="Not Missing";
run;

*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN. (drop = &ID_Vars);
    table _all_ / missing;
    format _numeric_ nmissfmt. _character_ $missfmt.;
run;


ods select all;
*Format and organize output;

data long;
    length variable $32. variable_value $50.;
    set temp;
    Variable=scan(table, 2);
    Variable_Value=strip(trim(vvaluex(variable)));
    presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
    keep variable variable_value frequency percent cum: presentation;
    label variable='Variable' variable_value='Variable Value';
run;

*not required for display purposes;
proc sort data=long;
    by variable;
run;

*select variables more than x% missing;
proc sql noprint;
select variable into :drop_var_list separated by " "
from long where variable_value = 'Missing' and percent > &pct;
quit;

*Drop variables;
data &output_dsn;
set &input_dsn;
drop &drop_var_list;
run;

*clean up;
*uncomment after testing;
/* proc sql; */
/* drop table long; */
/* drop table temp; */
/* quit; */

%mend;

***************************************************************************************************
*Example Usage
***************************************************************************************************;

data class;
    set sashelp.class;

    if age=14 then
        call missing(height, weight, sex);

    if name='Alfred' then
        call missing(sex, age, height);
    label age="Fancy Age Label";
run;


%drop_missing_pct(input_dsn = class, output_dsn = want, pct = 20, id_vars = Name);



*check output;
proc contents data=want;
run;

 

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 2495 views
  • 17 likes
  • 7 in conversation