DATA Step, Macro, Functions and more

Delete observations randomly based on percentage and impute them

Accepted Solution Solved
Reply
Contributor
Posts: 21
Accepted Solution

Delete observations randomly based on percentage and impute them

Hello,

I am new to sas programming. I am doing data analysis. I want to delete observations(not collumn or raw) and then in output i need full data set in which it should show deleted value as missing so i can impute those values for later use.I want to delete observations randomly based on percentage. can anyone help me with code ?


Accepted Solutions
Solution
‎02-25-2017 02:40 PM
Trusted Advisor
Posts: 1,381

Re: Delete observations randomly based on percentage and impute them

Here is a possible method-code but I dont know what is the preffered randon functio

in order to force the desired percantage:

 

%let n=5; /* number of observations-values per row */
%let p=0.05; /* desired percentage */
data want;
 set test(drop=F) end=eof;
     array obs {&n} _numeric_;
     retain nmis 0;

     do i=1 to &n;
        if round(ranuni(100),0.01) = &p
        then do; nmis+1; call missing(obs(i)); end;
     end;
if eof then put nmis= " is the number of missing values created "; DROP i nmis; run;

View solution in original post


All Replies
Super User
Posts: 17,837

Re: Delete observations randomly based on percentage and impute them

1. CALL MISSING() can be used to set multiple variables to missing. 

2. If you need an exact sample use PROC SURVEYSELECT and merge back with original dataset to identify selected records. Or use the following if approximate is ok. 

 

if rand('Bernoulli', 0.5) =1 then call missing(of _all_); 

Super User
Posts: 9,681

Re: Delete observations randomly based on percentage and impute them

It is better to post an example to expalin what you want .

Respected Advisor
Posts: 4,649

Re: Delete observations randomly based on percentage and impute them

[ Edited ]

Here is an example based on sashelp.class of how to select a random sample of 20% from each sex group for deletion

 

/* Sort by subgroups, if needed */
proc sort data=sashelp.class out=class; by sex; run;

/* Select sample in each subgroup */
proc surveyselect data=class out=classSample
    samprate=0.2 outall;
strata sex; /* subgroups */
run;

/* Set variables to missing for selected obs */
data classSample;
set classSample;
if selected then call missing(weight, height);
drop selected selectionProb samplingWeight;
run;

note: you could generate 100 random samples by adding option reps=100 to the proc surveyselect statement. The variable replicate would then be added to the output data.

PG
Contributor
Posts: 21

Re: Delete observations randomly based on percentage and impute them

suppose this is original data=

A B C D E

4 5 8 10 9

3 5 2  6  8

8 9 3  5  7

9 3 5  6  2

 

in out put data should be in this way. so later stages I can impute them.

A B C D E

4 5    10 9

3     2 6  8

8 9  3 5  7

   3  5     2

the data should be complete and there should be blank where we deleted values.

my original data is way bigger then this. please any help will be appreciated.

Trusted Advisor
Posts: 1,381

Re: Delete observations randomly based on percentage and impute them

The term "observation" in sas means, ususally, a row in the dataset.

Your output has same number of observation (rows) as input.

You did not deleted observations but randomaly assign missing to variables.

 

Should the percentage be of observations (rows) or of values ? which I'm not sure is the same.

 

 

Contributor
Posts: 21

Re: Delete observations randomly based on percentage and impute them

I want to delete random data values not the observations. and percentage should be depend on data values.if we think of 20%, out of 500 observations in every variable it should delete 100 data values. and out put should contain 500 observations in which deleted values should be blank or missing.

Solution
‎02-25-2017 02:40 PM
Trusted Advisor
Posts: 1,381

Re: Delete observations randomly based on percentage and impute them

Here is a possible method-code but I dont know what is the preffered randon functio

in order to force the desired percantage:

 

%let n=5; /* number of observations-values per row */
%let p=0.05; /* desired percentage */
data want;
 set test(drop=F) end=eof;
     array obs {&n} _numeric_;
     retain nmis 0;

     do i=1 to &n;
        if round(ranuni(100),0.01) = &p
        then do; nmis+1; call missing(obs(i)); end;
     end;
if eof then put nmis= " is the number of missing values created "; DROP i nmis; run;
Super User
Posts: 17,837

Re: Delete observations randomly based on percentage and impute them

@art297 You can use call Missing (of _all_) for both numeric and character. 

 

@Shmuel's code modified

 

%let n=5; /* number of observations-values per row */
%let p=0.05; /* desired percentage */
data want;
 set test(drop=F) end=eof;
     retain nmis 0;


        if round(ranuni(100),0.01) = &p
        then do; 
                nmis+1; 
               call missing(of _all_);
        end;
       
        if eof then put nmis= " is the number of missing values created ";
     DROP i nmis;
run;

Or from PGStats solution:

 

%let p=0.05;

proc sort data=sashelp.class out=class;
    by sex;
run;

proc surveyselect data=class out=selected samprate=&p outall;
    strata sex;
run;

data missing;
    set selected;

    if selected=1 then
        call missing (of _all);
run;
Trusted Advisor
Posts: 1,381

Re: Delete observations randomly based on percentage and impute them

@Reeza, using call missing(of _ALL_) - should it not enter missing values to all variables in the observation ?

 

Looking at wanted result of example, that is not the desired:

in out put data should be in this way. so later stages I can impute them.
A B C  D E
4 5   10 9
3 2 6  8
8 9 3  5 7
  3 5    2
PROC Star
Posts: 7,363

Re: Delete observations randomly based on percentage and impute them

@Reeza: I may be missing something, but @Shmuel and I suggested code for deleting the values in a random selection of 20% of the cells, not deleting all values for 20% of the cases.  As such, I don't think call missing(of _all_) would be applicable.

 

Super User
Posts: 17,837

Re: Delete observations randomly based on percentage and impute them

@Shmuel and @art297 You're correct, my mistake! 

 

PROC Star
Posts: 7,363

Re: Delete observations randomly based on percentage and impute them

@Shmuel types faster than I do but, if you need to delete both character and numeric (which I presume you do as you mentioned that the results should either be missing or blank), the method can be expanded to include both:

 

data want;
  set have;
  array nums(*) _numeric_;
  array chars(*) _character_;
  do _n_=1 to dim(nums);
    if rand("Uniform") lt .2 then call missing(nums(_n_));
  end;
  do _n_=1 to dim(chars);
    if rand("Uniform") lt .2 then call missing(chars(_n_));
  end;
run;

Art, CEO, AnalystFinder.com

Contributor
Posts: 21

Re: Delete observations randomly based on percentage and impute them

data RandClass(drop=i);
call streaminit(1234);
set sashelp.cars(keep=Cylinders Horsepower Enginesize);
array x {*} _numeric_;
do i = 1 to dim(x);
   if rand("Bern", 0.1) then         /* p=0.4 ==> about 40% missing */
      x[i]=.; end;
run;
proc print;
run;

data sashelp;
set sashelp.cars;
update cars Randclass;
by Cylinders Horsepower Enginesize;
run;
proc print cars;

 

 

I have removed data values randomly from three variables cylinders horsepower and enginesize from sashelp.cars. when i am trying to update newly created varaibles with missing value in data set sashelp.cars its giving me errors like

 
67 data sashelp;
68 set sashelp.cars;
69 update cars Randclass;
ERROR: File WORK.CARS.DATA does not exist.
70 by Cylinders Horsepower Enginesize;
71 run;
proc print cars;
____
22
202
ERROR 22-322: Syntax error, expecting one of the following: ;, BLANKLINE, CONTENTS, DATA, DOUBLE, GRANDTOTAL_LABEL, GRANDTOT_LABEL,
GRAND_LABEL, GTOTAL_LABEL, GTOT_LABEL, HEADING, LABEL, N, NOOBS, NOSUMLABEL, OBS, ROUND, ROWS, SPLIT, STYLE,
SUMLABEL, UNIFORM, WIDTH.
ERROR 202-322: The option or parameter is not recognized and will be ignored.
 
can anyone help me in this errors which is coming when i am trying to update. half part is working correct even i am getting as an output randclass dataset with 3 variables. in the 2nd part when i am trying to update its giving me errors.
Super User
Posts: 17,837

Re: Delete observations randomly based on percentage and impute them

The error is clear, do you have a dataset CARS created, check your library? If not then check your code. I think you meant to create a dataset called CARS but you called it SASHELP instead? Make sure you're consistent with the names used. 

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 299 views
  • 2 likes
  • 6 in conversation