Zerg
Calcite | Level 5

Hello all,

 

I would appreciate your help with writing an iterative process. I need to split a sample into three groups based on the following procedure:

 

1. Randomly select 3 seed observations from the main dataset and assign them to three sub datasets, "high", "middle", and "low", according to the values of variable A of the seed observations. Each sub dataset then starts with one observation.

 

2. Working through the main dataset with the 3 seed observations excluded, for each observation compute the squared difference between its value of variable A and the current median of variable A in each sub dataset. The observation is added to the sub dataset with the smallest squared difference (see the worked illustration after this list).

 

3. Repeat step 2 until all the observations in the main dataset have been examined.
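
For illustration (the actual seeds depend on the random draw): suppose the three seeds happen to be -1.10 ("low"), -0.18 ("middle"), and 0.58 ("high"), and the next observation examined is -0.72. The squared differences are (-0.72 + 1.10)**2 = 0.1444, (-0.72 + 0.18)**2 = 0.2916, and (-0.72 - 0.58)**2 = 1.69, so -0.72 is added to "low" and the median of "low" becomes -0.91.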

 

I have figured out the first step and have the following sample data to start with:

 

data have;
input a; 
cards;
-1.35
-1.10
-1.02
-0.72
-0.18
-0.11
0.31
0.58
0.67
;
run;

*randomly generate 3 seed observations*;
proc surveyselect data=have out=rand method=srs sampsize=3 seed=100 noprint; run;
data rand; set rand; n+1; run;
data t1; set rand; if n=1; drop n; run; *low sub dataset*;
data t2; set rand; if n=2; drop n; run; *middle sub dataset*;
data t3; set rand; if n=3; drop n; run; *high sub dataset*;

*exclude the 3 seed observations from main dataset*;
proc sql; create table data as select
m.a
from have m left join rand r
on m.a=r.a
where r.a is null;
quit;

After running the code above, I have 3 sub datasets "t1", "t2", and "t3", and a main dataset "data". How can I code steps 2 and 3 with these datasets? I am also open to coding step 1 in a more efficient manner. Many thanks!

 

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

Are you trying to do k-means clustering?

Check PROC FASTCLUS.


4 REPLIES 4
PGStats
Opal | Level 21

Here is a way to do this with arrays. It will work for small datasets.

 

data have;
input x; 
cards;
-1.35
-1.10
-1.02
-0.72
-0.18
-0.11
0.31
0.58
0.67
;

/* Put the data into random order, so that the first three obs will be random picks */
data temp;
set have;
rnd = rand("uniform");
run;
proc sort data=temp; by rnd; run;

/* Implement clustering algorithm */
data a b c;
array a{100} _temporary_;
array b{100} _temporary_;
array c{100} _temporary_;
set temp end=done;
select (_n_);
    when (101) error "Capacity exceeded.";
    /* the first three (randomly ordered) obs become the seeds */
    when (1) a{_n_} = x; 
    when (2) b{_n_} = x; 
    when (3) c{_n_} = x; 
    /* assign every other obs to the group whose current median
       gives the smallest squared difference */
    otherwise do;
        da = (x-median(of a{*}))**2;
        db = (x-median(of b{*}))**2;
        dc = (x-median(of c{*}))**2;
        if da = min(da, db, dc) then a{_n_} = x;
        else if db = min(da, db, dc) then b{_n_} = x;
        else c{_n_} = x;
        end;
    end;
/* after the last obs, write each stored value to its group's dataset */
if done then 
    do _n_ = 1 to dim(a);
        x = coalesce(a{_n_},b{_n_},c{_n_});
        if      n(a{_n_}) then output a;
        else if n(b{_n_}) then output b;
        else if n(c{_n_}) then output c;
        end;
keep x;
run;
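
To take a quick look at the resulting groups, something along these lines could follow (an added check, not part of the original step):

/* inspect the three groups produced above (added check) */
proc print data=a; run;
proc print data=b; run;
proc print data=c; run;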
PG
Zerg
Calcite | Level 5

Thank you for the code. I will test it to see if it meets my needs.

Ksharp
Super User

Are you trying to do k-means clustering?

Check PROC FASTCLUS.
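
For example, a minimal sketch on the sample posted above (one reasonable setup, not confirmed by the original reply; note that FASTCLUS assigns observations to the nearest cluster mean rather than the median-based rule described in the question):

/* k-means style clustering of variable a into 3 groups (assumed setup) */
proc fastclus data=have maxclusters=3 maxiter=100 out=clustered;
   var a;
run;

/* the OUT= dataset contains a CLUSTER variable identifying each group */
proc print data=clustered;
run;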

Zerg
Calcite | Level 5

Thank you for your suggestion. This looks like a handy procedure that gives me what I am looking for.
