Solved: Re: selecting and removing observations from two data sets iteratively

SWEETSAS · Posted 12-15-2014 09:20 AM

data out;

input a b;

0.7 1.3

0.5 0.3

2.0 0.8

0.6 0.45

-0.4 2

;

run;

data reference;

input a b;

0.9 1.3

0.5 0.7

;

run;

Proc fastclus data=out maxc1 replace=none maxiter=0 noprint seed=reference

Out=mahalanobis;

Var pin:;

Run;

I found the following code somewhere on the web and I like to modify it using macro to do the following task:

The data data set "mahalanobis" will contain the variable "distance", representing the distance from the each observation in the data set "out" from the first observation in the "reference" data set.

Now I like to remove the first observation in the "reference" data set and the the observation with smallest value of "distance" in "mahalanobis" data set. Theses two obervations will be stored in an "output" data set with seperate ID to identify that they are from different data set. I will then repeat the same step for all the data set in "reference" data set. That is, compute the distance between the next first observation in "reference" data set and observations in "out" data set. I will then output the first observation in reference data set and the observation with the smallest distance in the "malanobis" data set.

When there is no more data in the reference data set, the macro ends.

Thanks

Tom · Posted 12-20-2014 11:20 PM

Try this macro.

%macro match

(cases=ref /* List of records to use as CASES */

,controls=main /* List of records to use as CONTROLS */

,id=id /* ID variable in CONTROLS */

,varlist=x y /* Variables to use in distance calculations */

,out=out /* Output dataset */

,with_replacement=0 /* Select with replacement? (0/1) */

);

%local i n rename remove ;

%* Generate RENAME list for &VARLIST varaibles ;

%do i=1 %to %sysfunc(countw(&varlist));

%let rename=&rename %scan(&varlist,&i)=match_%scan(&varlist,&i);

%end;

* Get count of number of cases ;

data _null_;

if eof then call symputx('n',_n_ -1);

set &cases end=eof;

run;

* Use all CONTROL records to start with ;

data distance;

set &controls;

keep &id &varlist ;

run;

%if %sysfunc(exist(&out)) or %sysfunc(exist(&out,view)) %then %do;

* Remove existing output dataset ;

proc delete data=&out; run;

%end;

%* Set REMOVE to missing value to that all controls used on first pass;

%let remove=.;

%do i=1 %to &n ;

* Calculate distance from next CASES, exclude previosly used CONTROL ;

proc fastclus maxc=1 replace=none maxiter=0 noprint

data=distance(where=((&id ^= &remove)))

seed=&cases (firstobs=&i obs=&i )

out=distance ;

;

var &varlist;

run;

* Sort to find closest ;

proc sort; by distance ; run;

* Generate next record ;

* Set REMOVE to eliminate this case ;

data next ;

set &cases (firstobs=&i obs=&i);

set distance(obs=1 keep=&id &varlist distance rename=(&id=match_&id &rename));

if not &with_replacement then call symputx('remove',match_&id);

run;

* Add to output dataset ;

proc append base=&out data=next force;

run;

%end;

%mend match;

data main;

input id x y;

datalines;

1 0.5 0.3

2 0.7 1.3

3 2.0 0.8

4 0.6 0.45

5 .25 .4

;

data ref;

input id x y;

datalines;

1 0.6 1.2

2 0.5 0.7

3 -0.3 1.5

;

options mprint ;

%match() ;

proc print;

run;

%match(with_replacement=1) ;

proc print;

run;

Without Replacement
Obs	id	x	y	match_id	match_x	match_y	DISTANCE
1	1	0.6	1	2	0.7	1.3	0.14142
2	2	0.5	1	4	0.6	0.45	0.26926
3	3	-0.3	2	5	0.25	0.4	1.22984
With Replacement
Obs	id	x	y	match_id	match_x	match_y	DISTANCE
1	1	0.6	1	2	0.7	1.3	0.14142
2	2	0.5	1	4	0.6	0.45	0.26926
3	3	-0.3	2	2	0.7	1.3	1.0198

View solution in original post

stat_sas · Posted 12-15-2014 02:13 PM

Try this.

Proc fastclus data=out maxc=1 replace=none maxiter=0 noprint seed=reference

Out=mahalanobis;

Var a b;

Run;

SWEETSAS · Posted 12-15-2014 06:01 PM

Thanks. But program is not different from what I have above except the change of the variable names. My main question is how to get it to run iteratively as described above.

stat_sas · Posted 12-15-2014 06:47 PM

Just to understand better, you are looking a way to classify each observation from dataset out based on observations in dataset reference right?

SWEETSAS · Posted 12-15-2014 09:42 PM

What I like to do is similar to matching each observation in "mahalanobis" data set with each observation in "reference" data set using smallest distance. However, in matching, a calliber is use to specify the range of value for matching criteria. But in my case I want to to match using smallest distance. Once I see a match, I will remove the two observation that are matched, and look for the next two observation. I will do this until the last observations in reference data.

This is similar to what is done in matching case and control using the %gmatch from Mayo Clinic. But in my case, I want to matching based on smallest distance from the "mahalanobis" data set.

Thanks

stat_sas · Posted 12-16-2014 02:45 PM

So the goal is to remove two observations from out data set. First will have minimum distance from the first observation from reference data set and second will have minimum distance from the second observation from reference data set right?

SWEETSAS · Posted 12-16-2014 11:32 PM

Remove the observation with minimum distance from the out data and the first observation in reference data. This is because the minimum distance is calculated relative to the first observation in the reference data set.

Once the first observation in reference data set is removed and send to some output data set, obviously next observation will then become the first obsevation in the reference data set, and the distance between tthis "new" first observation and the remaining observations in "out" data set is computed again. The observation with the smallest distance relative to first observation in the reference data set is identified. This observation with minimum distance and the first observation in reference data set is again removed and send to some output data set as in the first case. Then next observation in the reference data set becomes the first; then the distance is computed between this first observation in the reference data set and the remainining observations in out data set, the observation with the smallest distance is identify and remove and set to an output data set, ....

stat_sas · Posted 12-16-2014 11:55 PM

How about this?

Proc fastclus data=out maxc=2 replace=none maxiter=0 noprint seed=reference

Out=mahalanobis;

Var a b;

Run;

proc sort data=mahalanobis;

by cluster distance;

run;

data want;

set mahalanobis;

by cluster;

if not first.cluster;

run;

SWEETSAS · Posted 12-18-2014 08:22 AM

Thanks very much. These code were helpful in that when I modified it I seem to get close to what I wanted. I basically changed

data want;

set mahalanobis;

by cluster;

if not first.cluster;

run;

to

data want;

set mahalanobis;

by cluster;

if first.cluster;

run;

to capture the first observation in each cluster with the thought that the first observation in each cluster will reflect the observation with smallest distance to reference observation. But when so observations are close in value, this does not work well as some observations are omitted because of their proximity.

Essentially, I am trying to match each observation in "out" data set with the observation in the reference data set based on the minimum mahalanobis distance. The reason I use this procedure is because Feng et al,http://www.lexjansen.com/pharmasug/2006/publichealthresearch/pr05.pdf used the fastclus procedure to trick SAS to match by mahalanobis distance.

The main difference between what I am trying to do and what Feng et al. did is that I am not using any caliper and propensity score. I just want to match observations in the two data set by minimum distance. At the end I will have total observation equals number of observation in the reference data set.

Please, help

stat_sas · Posted 12-18-2014 03:54 PM

Hi,

Please provide sample data that supports above description.

Thanks,

SWEETSAS · Posted 12-18-2014 06:41 PM

data main;

input id x y;

datalines;

1 0.7 1.3

2 0.5 0.3

3 2.0 0.8

4 0.6 0.45

5 .25 .4

;

run;

data ref;

input id x y;

datalines;

1 0.6 1.2

2 0.5 0.7

3 -0.3 1.5

;

run;

proc fastclus data=main maxc=1 replace=none maxiter=0 noprint seed=ref

out=mahalanobis_to_point(drop=cluster);

var prin:;

run;

after first iteration we should have something like the mahalanobis_to_point:

data set

Obs id a b CLUSTER DISTANCE

1 1 0.7 1.30 1 0.20000

2 3 2.0 0.80 1 1.20830

4 4 0.6 0.45 4 0.20616

5 2 0.5 0.30 4 0.22361

6 5 0.25 0.4 5 0.32214

You then use a code like this:

proc sql noprint;

create table want(drop=diff) as

select a.*,b.id as out_id, b.distance as diff,b.a as out_a, b.b as out_b

from ref a,mahalanobis_to_point b

where a.id=1

having diff=min(diff);

quit;

want data set will look like:

Obs id a b out_id Out_a Out_b

1 1 0.6 1.2 1 0.7 1.3

In the second iteration the ref-1st data set now becomes:

data ref-1st;*let’s call this data set ref-1st

input id a b;

datalines;

2 0.5 0.7

3 -0.3 1.5

4 0.4 0.5

;

run;

This is because we have find the observation with the smallest distance to a=0.6 and b=1.2 and have remove the observation from the ref data set.

We then run

proc fastclus data=main maxc=1 replace=none maxiter=0 noprint seed=ref-1st

out=mahalanobis_to_point(drop=cluster);

var prin:;

run;

we again select the observation in the mahalanobis_to_point data set with smallest distance to a=0.5 b=0.7 and at it to the want data set

proc sql noprint;

create table want(drop=diff) as

select a.*,b.id as out_id, b.distance as diff,b.a as out_a, b.b as out_b

from ref-1st a,mahalanobis_to_point b

where a.id=1

having diff=min(diff);

quit;

the growing want data set will now look like:

Obs id a b out_id Out_a Out_2

1 1 0.6 1.2 1 0.7 1.3

1 2 0.5 0.7 4 0.5 0.45

We will then remove the match observation from ref-1st data set; let call the resulting data set ref-2

Ref-2 data set will look like:

data ref-2;*let’s call this data set ref-2

input id a b;

datalines;

3 -0.3 1.5

4 0.4 0.5

;

run;

we then run the proc fastclus again

proc fastclus data=main maxc=1 replace=none maxiter=0 noprint seed=ref-2

out=mahalanobis_to_point(drop=cluster);

var prin:;

run;

This will continue until we exhaust at the observations in ref data set. That is we find the match for each of the four observation in the ref data set

Observe that the data set “main” is the same throughout the iterations, while the ref data set reduces by one observation after each iteration.

stat_sas · Posted 12-20-2014 03:14 PM

Thanks for providing details. I can see that above suggested code is providing what you looking for. It provides observations from main which are close to ref data set and you save those into a separate data set right?

SWEETSAS · Posted 12-20-2014 04:50 PM

Yes. The challenge is in removing the first observation from ref once the distance has been calculated, so that the same can be done for the next observation. This is because the seed option in fastclus procedure compute the distance from the first observation in the ref data set to all the observation in main data set.

Reeza · Posted 12-20-2014 05:24 PM

Why don't you post what you have for the macro version, and we can take a look at that.

You probably want a counter somewhere to include how many obs in the ref file, and a do while loop in the macro.

Not sure of the structure quite yet but something like this:

%macro loop();

*get the number of observations in ref dataset;

proc sql noprint;

select count(*) into :count

from ref;

quit;

*loop start;

%do %while (&count>0);

*run proc fastclus;

*id obs you need to keep;

*append into dataset that keeps for the run;

*remove obs from ref data;

*generate new count variable;

*end loop;

%end;

%mend;

SWEETSAS · Posted 12-21-2014 12:32 PM

Thanks Reeza. You guys are awesome

Classroom Training Available!