BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Antoni
Fluorite | Level 6

Hello,
it's 12hours since i'm trying to get over this problem. I need to write a report about credit scorring.

There are values assigned as 'M' that represent missing data(some comedian did that). I need to get rid of both usual missing data which is "." and "M" so i can start calculating correlations and regression based on regular data. I tried to reasign those "M" values into regular ' ' in a range of variables that contain the "M" value but i'm constantly failing i have egsams soon and have to write this report till saturday but since 3 days i have about 8 lines. 
This is my recent ceration but it fails:

 

data wanted;
set have( keep=act_age -- app_char_cars);
if (act_age -- app_char_cars)='M' then (act_age -- app_char_cars)=' ';
run;

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

@Antoni wrote:

data wanted;
set sklejone( keep=act_age -- default_cus12);
array t act_age -- default_cus12;
do over t;
if t= 'M' then t=' ';
end;
run;

thats this one, i checked and had previously included character variables, i have included the right range of ony numerical variables and it did nothing, no error, no expected output

 


You didn't answer the question and you don't seem to understand what we've been saying about missing values.

You need to read the page below about SAS special missing values. 

 

You can have missing values as just missing, or you can code them to identify different types of missing values. For example, in survey data, I will often code N/A questions as .N and skipped questions/no answer as .M so they can be differentiated in the results.  

 

https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=lrcon&docsetTarget=p1xr9f...

 

So modifying my code according to the comment from @Astounding earlier in the thread, give the following macro definition. This should work now, I tested it.

 

 

 

%macro drop_missing_pct(input_dsn = , output_dsn=, pct = , id_vars=);

*create format for missing;
proc format;
    value $ missfmt ' '="Missing" other="Not Missing";
     value nmissfmt  ._ - .Z ="Missing" other="Not Missing";
run;

*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN. (drop = &ID_Vars);
    table _all_ / missing;
    format _numeric_ nmissfmt. _character_ $missfmt.;
run;


ods select all;
*Format output;

data long;
    length variable $32. variable_value $50.;
    set temp;
    Variable=scan(table, 2);
    Variable_Value=strip(trim(vvaluex(variable)));
    presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
    keep variable variable_value frequency percent cum: presentation;
    label variable='Variable' variable_value='Variable Value';
run;

*not required for display purposes;
proc sort data=long;
    by variable;
run;

*select variables more than x% missing;
proc sql noprint;
select variable into :drop_var_list separated by " "
from long where variable_value = 'Missing' and percent > &pct;
quit;

*Drop variables;
data &output_dsn;
set &input_dsn;
drop &drop_var_list;
run;

*clean up;
*uncomment after testing;
/* proc sql; */
/* drop table long; */
/* drop table temp; */
/* quit; */

%mend;

Example showing it works, with special missing values:

*create test data set with many missing values;
data class;
	set sashelp.class;

	if age=13 then
		call missing(height, weight, sex);

	if name in ('Jane', 'Carol', 'Henry', 'Alfred') then
		do;
			age=.A;
			sex=.M;
		end;

	if name in ('Judy', 'Mary') then
		weight=.W;


run;

%drop_missing_pct(input_dsn=class, output_dsn=want, pct=20, id_vars=Name);
*check output;

proc contents data=want;
run;

title 'Missing Report for data set';

proc print data=long;
run;

title 'Data set with columns missing more than 20% dropped';

proc print data=want (obs=5);
run;

 

View solution in original post

46 REPLIES 46
novinosrin
Tourmaline | Level 20


data wanted;
set have( keep=act_age -- app_char_cars);
array t  act_age -- app_char_cars;
do over t;
 if t= 'M' then t=' ';
end;
run;

 Hi @Antoni -

1. Group the list of variables in an array. This assumes all the variables in the list are of the same type and the arrangement is contiguous

2. Cycle through the array using a Loop. Yes, it's a linear search and replace method. Therefore, If an element aka variable in the array is "M", reset the value of the element to ' '

 

Antoni
Fluorite | Level 6
it is. If i would want to delete variables with more than 40% missing values, both ' ' and 'M' it would require some kind of macro to be written, right?
Antoni
Fluorite | Level 6

ERROR: All variables in array list must be the same type, i.e., all numeric or character. - well that's not new sight for me 😞

novinosrin
Tourmaline | Level 20

@Antoni  Hence the reason I mentioned "This assumes all the variables in the list are of the same type and the arrangement is contiguous"

 

Either way, If you are running against time for a deliverable I suggest not to be concerned about the mechanisms. Please post a clean representative mock sample of what you HAVE and what you WANT. I'm sure somebody will provide the needed solution. You need to include the data types and other info so that you can avoid going back and forth.

 

Just act fast, get the report delivered and then open up a new thread to discuss & learn. Chill

 

 

 

Astounding
PROC Star

Chances are your data is fine and needs no change.  The "comedian" may have actually known what (s)he was doing.

 

If you intend to run correlations, these variables must be numeric.  So what appears as "M" is actually a special form of a missing value.  SAS supports 28 different forms of missing values for numeric variables, and "." is only one of those.  So are .A, .B, ..., .Z, and ._

 

Using a special missing value lets you differentiate why the value is missing, but it is still missing.  So if all this rings true, let your data alone!  It's fine.  Try running a correlation and see what happens.

Antoni
Fluorite | Level 6
the problem is i am completly new to this kid of operations and it's easier to me to have only numerical values because i won't have to deal with it through many other operations that needs to be done
ChrisNZ
Tourmaline | Level 20

> it's easier to me to have only numerical values

 

You already have numerical values. Read @Astounding 's reply again. Chances are you need to fix exactly nothing.

Antoni
Fluorite | Level 6
and what if i want to delete variables that have more than 40% missing data? i would have to create macro?
Reeza
Super User

I'm not sure a macro actually helps that much here. 

 

That being said, here's an example of a macro that would do it. There's a small example at the end that shows how it's used as well and it drops anything over 20% in that example. I included that as a parameter so it's actually a more generic program. If your list ends up being longer than 64k characters this may not work as well for you. 

 




%macro drop_missing_pct(input_dsn = , output_dsn=, pct = , id_vars=);

*create format for missing;
proc format;
    value $ missfmt ' '="Missing" other="Not Missing";
    value nmissfmt .="Missing" other="Not Missing";
run;

*Proc freq to count missing/non missing;
ods select none;
*turns off the output so the results do not get too messy;
ods table onewayfreqs=temp;

proc freq data=&INPUT_DSN. (drop = &ID_Vars);
    table _all_ / missing;
    format _numeric_ nmissfmt. _character_ $missfmt.;
run;


ods select all;
*Format output;

data long;
    length variable $32. variable_value $50.;
    set temp;
    Variable=scan(table, 2);
    Variable_Value=strip(trim(vvaluex(variable)));
    presentation=catt(frequency, " (", trim(put(percent/100, percent7.1)), ")");
    keep variable variable_value frequency percent cum: presentation;
    label variable='Variable' variable_value='Variable Value';
run;

*not required for display purposes;
proc sort data=long;
    by variable;
run;

*select variables more than x% missing;
proc sql noprint;
select variable into :drop_var_list separated by " "
from long where variable_value = 'Missing' and percent > &pct;
quit;

*Drop variables;
data &output_dsn;
set &input_dsn;
drop &drop_var_list;
run;

*clean up;
*uncomment after testing;
/* proc sql; */
/* drop table long; */
/* drop table temp; */
/* quit; */

%mend;

And the test to show it works:

 


*test;

data class;
    set sashelp.class;

    if age=14 then
        call missing(height, weight, sex);

    if name='Alfred' then
        call missing(sex, age, height);
    label age="Fancy Age Label";
run;


%drop_missing_pct(input_dsn = class, output_dsn = want, pct = 20, id_vars = Name);



*check output;
proc contents data=want;
run;



https://gist.github.com/statgeek/feedc3fc520cb0d2018ca2a8cab241d8

Astounding
PROC Star

Given that there might be special missing values in the data, this line in the macro should be revised:

    value nmissfmt .="Missing" other="Not Missing";

It should become:

    value nmissfmt  ._ - .Z ="Missing" other="Not Missing";
Antoni
Fluorite | Level 6
yes and here's one otf the problem i'm dealing with that i could not find a solution for

drop_missing_pct(input_dsn = , output_dsn=, pct = , id_vars=)

i'm unable to put id_vars= because i need the results from all of the 2300 variables, so i need to either put them as range, write them separately or ther's other solution that im not aware of, please be patient im very new to those things and there were 6 lessons where we made really basic things and now we need to create this kind of report..
Reeza
Super User
ID_VARS are for variables you DO NOT want included, not those you want to include. Did you try running the code? Does it not do what you want?
Antoni
Fluorite | Level 6

it showed this error,

ERROR: Invalid macro parameter name 4. It should be a valid SAS identifier no longer than 32 characters.
ERROR: A dummy macro will be compiled.

on this line:
length variable $32. variable_value $50.;

i tried to change "variable value %50" to 32 but its still the same

 

 

Reeza
Super User
Did my example code not work? Did you run it exactly as posted?

Show the log from the code I provided and then show the code AND log you submitted for your data set.

sas-innovate-white.png

Missed SAS Innovate in Orlando?

Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.

 

Register now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 46 replies
  • 2235 views
  • 14 likes
  • 6 in conversation