Solved: Re: How to randomly delete three observations by group?

hua · Posted 10-30-2018 12:31 PM

Hello all,

I'm trying to randomly delete three observations for each group, and add the result as a new column. Does anyone know how to randomly delete three observations by group?

The dataset I have is like the following:

ID year prep

1 2000 550

1 2001 600

1 2002 580

1 2003 550

1 2004 570

... ... ...

1 2018 550

2 2000 600

2 2001 600

2 2002 580

2 2003 550

2 2004 570

... ... ...

and I'm trying to get the new dataset as:

ID year prep new

1 2000 550 550

1 2001 600 .

1 2002 580 580

1 2003 550 .

1 2004 570 570

1 2005 530 530

1 2006 545 .

... ... ... ...

1 2018 550 550

2 2000 600 600

2 2001 600 .

2 2002 580 .

2 2003 550 550

2 2004 570 570

... ... ... ...

Thank you all!

Best,

Hua

SuryaKiran · Posted 10-30-2018 01:22 PM

I recommend using PROC SURVEYSELECT. Seed values changes with time so that when ever you run the random changes.

data have;
prep=500;
do id=1 to 3;
	do year=2000 to 2008;
	prep+50;
	output;
	end;
end;
run;

proc surveyselect data=have
	method=srs n=3 
	seed=%SYSFUNC(round(%Sysfunc(time())))  
	out=test (Keep=id year prep) noprint ;
strata id ;
run;

proc sql;
create table want as 
select a.*,case when b.prep is null then a.prep else . end as new_prep
	from have a
	left join test b on (a.id=b.id and a.year=b.year);
quit;

Thanks,
Suryakiran

View solution in original post

Reeza · Posted 10-30-2018 12:45 PM

1. Add a random number
2. sort by random number
3. Set to missing for the first 3 (or reassign as needed).
4. Resort to desired order

I don't know of another quick way, except possibly using PROC SURVEYSELECT and select N-3 from each group and then merging that back in.

hua · Posted 10-30-2018 02:10 PM

Thanks! It's a good way to do it!

novinosrin · Posted 10-30-2018 12:54 PM

Some idea here:

data have;
input ID     year     prep;
cards;
1      2000       550
1      2001       600
1      2002       580
1      2003       550
1      2004       570
1      2018       550
2      2000       600
2      2001       600
2      2002       580
2      2003       550
2      2004       570
;
data want;
do _n_=1 by 1 until(last.id);
set have;
by id;
end;
array t(3) _temporary_;
do _d=1 to 3;
t(_d)=rand('integer',1,_n_);
end;
do _n_=1 by 1 until(last.id);
set have;
by id;
if _n_ not in t then output;
end;
drop _:;
run;

novinosrin · Posted 10-30-2018 01:08 PM

@hua This version is a better one :


data want;
array t(3) _temporary_;
call missing(of t(*), __d);
do _n_=1 by 1 until(last.id);
set have;
by id;
end;
do while(n(of t(*))<dim(t));
_d=rand('integer',1,_n_);
if _d not in t then do;__d+1; t(__d)=_d;end;
end;
do _n_=1 by 1 until(last.id);
set have;
by id;
if _n_ not in t then output;
end;
drop _:;
run;

SuryaKiran · Posted 10-30-2018 01:22 PM

I recommend using PROC SURVEYSELECT. Seed values changes with time so that when ever you run the random changes.

data have;
prep=500;
do id=1 to 3;
	do year=2000 to 2008;
	prep+50;
	output;
	end;
end;
run;

proc surveyselect data=have
	method=srs n=3 
	seed=%SYSFUNC(round(%Sysfunc(time())))  
	out=test (Keep=id year prep) noprint ;
strata id ;
run;

proc sql;
create table want as 
select a.*,case when b.prep is null then a.prep else . end as new_prep
	from have a
	left join test b on (a.id=b.id and a.year=b.year);
quit;

Thanks,
Suryakiran

novinosrin · Posted 10-30-2018 01:25 PM

That can be over come by simple assignment of

call streaminit(123);

data want;
call streaminit(123);
array t(3) _temporary_;
call missing(of t(*), __d);
do _n_=1 by 1 until(last.id);
set have;
by id;
end;
do while(n(of t(*))<dim(t));
_d=rand('integer',1,_n_);
if _d not in t then do;__d+1; t(__d)=_d;end;
end;
do _n_=1 by 1 until(last.id);
set have;
by id;
if _n_ not in t then output;
end;
drop _:;
run;

I bet this this much faster if the requirement is "randomly delete three observations by group"

hua · Posted 10-30-2018 02:11 PM

Yes, it works for randomly deleting 3 obs, all I need to do next is to merge the result to the original dataset. Thank you!

novinosrin · Posted 10-30-2018 02:13 PM

Hmm Hang on I think I m missing something, There should not be a need merge back. I admit I haven't got the clarity of final output you want. Can you clarify, let me tweak the existing one plz

I think this change will do

if _n_ in t then call missing(new);
else new=prep;

data want;
call streaminit(123);
array t(3) _temporary_;
call missing(of t(*), __d);
do _n_=1 by 1 until(last.id);
set have;
by id;
end;
do while(n(of t(*))<dim(t));
_d=rand('integer',1,_n_);
if _d not in t then do;__d+1; t(__d)=_d;end;
end;
do _n_=1 by 1 until(last.id);
set have;
by id;
if _n_  in t then call missing(new);
else new=prep;
output;
end;
drop _:;
run;

hua · Posted 10-30-2018 02:18 PM

The randomly deleting part words well. The final output I want is to add a new column, which is the result after randomly deleting 3 obs, to the original dataset. SuryaKiran's code works well, which randomly selected 3 obs and then merge them as missing value to the original one.

novinosrin · Posted 10-30-2018 02:20 PM

Got it. Sure proc survey select is convenient and i tested that being much slower. I have modified mine for my satisfaction after all.

SAS Innovate 2025: Register Now