topic Re: Nearest neighbour between two datasets in Statistical Procedures

Nearest neighbour between two datasets

Bjarke — Wed, 28 Aug 2013 21:55:46 GMT

Hi,

I would like to implement a nearest neighbours algorithm. More specifically, I have say 20000 customers in a dataset A that I would like to match against another a dataset B containing 2000 customers and for each customer in A find the top 50 nearest neighbours (most alike customers) in B. All of the relevant variables would be more or less standardised. For instance, income and age would both be numerical values with roughly same mean and variance. Sadly, I haven't been able to find any procedures in SAS that can match one dataset against another in this fashion.. If anyone can point me to the right one for this I would be euphoric!

Seeing as I can't find a fitting procedure I'm thinking of how to do this using data steps:

1. Merge the datasets A and B together to create a similarity matrix named C. For each customer in dataset A a variable would be created for each customer in dataset B showing the similarity using Euclidean distance. Furthermore, a linked variable would be created to hold the unique id of the customer in question.

2. For each customer in C select the 50 variables that have the lowest value (=highest similarity) and keep the unique customer ID's connected to these values. The result of this would be that I for each customer in A will have 50 unique customer ID's that will help me find the 50 records in B that match the customer the best.

Is this the right way to do it?

Kind regards,

Bjarke

Re: Nearest neighbour between two datasets

Reeza — Wed, 28 Aug 2013 22:01:33 GMT

This sounds like a Case/Control problem, where you match patients based on several variables.

If so you can search on those terms to help find relevant solutions. Probabilistic match is a good one to look into as well.

I'm a fan of the Mayo Clinic Macros but if you want a solution that is customized that's also doable. There's another thread on here today where lists an algorithm that may be close to what you want.

If it's not a case/control problem, ignore the above

Re: Nearest neighbour between two datasets

art297 — Wed, 28 Aug 2013 22:04:33 GMT

I think you had a typo in 2 and that you meant to say "select the 50 records ..". How many variables do you have that are common and standardized in the two datasets?

Re: Nearest neighbour between two datasets

Bjarke — Wed, 28 Aug 2013 22:26:01 GMT

Thanks for the quick responses!

Reeza:

I must say that I can't find the similarity with Case/Control. Perhaps I should expand a bit more on the purpose of this data mining.. The 2000 customers in B could be customers, who has just left the insurance company that I work at. I would then like to see if I can to some extent predict whether similar customers will leave as well. Similarly, it could be a very valuable tool for selecting which customers to select for cross-selling purposes. In this case the 2000 customers would already have been targeted and the rest of the portfolio would be matched to select similar customers to those that had a good hit-rate when attempting to sell additional products.

Arthur:

Well, my point was actually to have a variable on the dataset C for each customer in B. By selecting the 50 variables with the lowest distance I would be able to identify the top 50 records. I've edited the original text to clarify this. Another approach would be to create an observation for each combination of A and B resulting in roughly 40 million observations and then selecting from these. My hope is that the variable-based approach will be faster. Moreover, I might want to expand the datasets with even more customers if possible.

As an start I believe perhaps 4 variables will be in common and used for the similarity rating. However, it would be very nice if the method would be able to handle up to 10 variables.

Re: Nearest neighbour between two datasets

art297 — Wed, 28 Aug 2013 23:55:15 GMT

Sounds more like predicting churn than it does trying to find nearest neighbor. Take a look at: http://www2.sas.com/proceedings/sugi27/p114-27.pdf

and do a google search on: predicting churn sas

Re: Nearest neighbour between two datasets

PGStats — Thu, 29 Aug 2013 02:19:25 GMT

SAS has procedure modeclus that can find nearest neighbors quite efficiently. Look at the following example

/* Create some example datasets */

data A B;
array v var1-var5;
call streaminit(56645);
do custId=1 to 200;
     do _n_ = 1 to 5;
           v{_n_} = rand("NORMAL");
           end;
     output A;
     end;
do custId=1 to 2000;
     do _n_ = 1 to 5;
          v{_n_} = rand("NORMAL");
          end;
     output B;
     end;
run;

/* Concatenate the datasets. Change Ids in observations from dataset A */

data C(keep=custId var1-var5);
set A (in=inA) B;
if inA then custId = -custId;
run;

/* Find the nearest neighbors (ignore the warning) */

ods _all_ close;
proc modeclus data=C dk=51 neighbor;
var var1-var5;
id custId;
ods output Neighbor=D;
run;
ods listing;

/* Reformat the output; keep only Ids originaly from dataset A */

data E(keep=custId neighborPos neighborId);
retain custId;
set D;
if not missing(ID) then do;
     custId = -input(ID,best.);
     neighborPos = 0;
     end;
neighborPos + 1;
if custId < 0 then stop;
neighborId = input(Nbor,best.);
run;

Re: Nearest neighbour between two datasets

Reeza — Thu, 29 Aug 2013 04:03:22 GMT

I still think Case/Control Problem.

See the DIST Macro from Mayo Clinic:

http://www.mayo.edu/research/departments-divisions/department-health-sciences-research/division-biomedical-statistics-informatics/software/locally-written-sas-macros

Re: Nearest neighbour between two datasets

Bjarke — Fri, 20 Sep 2013 15:47:50 GMT

I am sorry that I have not responded, I've been swamped at work so I haven't been able to test anything untill now.

@Reeza)
You were right, it is a Case/Control problem! I've had to modify the %dist-macro slightly, but it is just great for my needs! Thank you!

@PGStats)
I'm sorry, but I couldn't quite make sense of your code.. However, I used some of yours thoughts when implementing my example.

@Arthur)
We are predicting churn in general using other methods, which are more suited for the purpose. The main point of the program is to find customers to approach for a sale attempt. Nevertheless, I find it interested if it could be possible to use this classification of nearest neighbours in other areas as well.

I've added the code for a simple example using my approach below:


filename macro '\\P-114-230-013\Arbejdsmapper\BJF\Mersalg\Macro';
%inc macro(nobs);
%inc macro(distMacro_modified);

/* Creating example dataset */
data all;
    infile datalines dsd;
    input Group: 1.         /* 0 = customers with sales attempt. 1 = customer without sales attempt.*/
          ID:    2.
          Match1:2.
          Match2:2.        
          SuccesfulSale:1.; /* Boolean to indicate whether or not a sales attempt was succesful. */
    datalines;
    0, 1, 7,  4,    1
    0, 2, 9,  6,    0
    0, 3, 22, 8,    0
    0, 4, 27, 10,     1
    1, 5, 5,  5,    .
    1, 6, 22, 7,    .
    1, 7, 17, 8,    .
    1, 8, 5,  8,    .
    1, 9, 8,  9,    .
    1, 10, 10, 11,    .
    1, 11, 14, 12,    .
    1, 12, 18, 13,    .
    1, 13, 21, 14,    .
    1, 14, 23, 15,    .
    0, 15, 2,  22,    0
    0, 16, 9,      5,    1
    0, 17, 17, 13,    1
    0, 18, 29,     2,    1
    0, 19, 14, 14,    0
    0, 20,  4, 17,    0
proc sort; by group id; run;

/* Use slightly altered version of the distance macro */
%dist(data=all,group=group,id=id,mvars=Match1 Match2,wts=1 2.5,
       out=distanceMatrix,transf=1,dist=2);

/* Outputs a matrix with the combination of each customer without sales and customer with sales including the distance between them. */
data distanceMatrixSelectNearest;
    set distanceMatrix;
    array idSales    _C_ID1-_C_ID10;
    array dist      _C1-_C10;

    do _i_ = 1 to 10;
        distanceNearest = dist{_i_};
        idNearest         = idSales{_i_};
        keep id idNearest distanceNearest;
        output;
    end;
proc sort; by id distanceNearest; run;

/* Selects the 3 nearest neighbours */
data distanceMatrixSelectNearest2;
    retain nearestNeighbourCount;
    set distanceMatrixSelectNearest;
    by id distanceNearest;

    if first.id then nearestNeighbourCount = 1; else nearestNeighbourCount = nearestNeighbourCount + 1;
    if nearestNeighbourCount <= 3;
proc sort; by idNearest; run;

/* Merges sales information onto the NN matrix */
proc sort data=all out=salesAttempt (drop=group match1 match2); where group=0; by id; run;
data distanceMatrixSelectNearest3;
    merge distanceMatrixSelectNearest2 (in=a)
          salesAttempt (rename=id=idNearest);
    by idNearest;
    if a;
proc sort; by id nearestNeighbourCount; run;

/* Create a single observation for each customer without a sale attempt containing a sales prediction as well as a id and distance to NN */
data customersToCall;
    retain nearestNeighbour1-nearestNeighbour3;
    retain nearestNeighbourSales1-nearestNeighbourSales3;
    retain nearestNeighbourDist1-nearestNeighbourDist3;
    array nn_id nearestNeighbour1-nearestNeighbour3;
    array nn_sale nearestNeighbourSales1-nearestNeighbourSales3;
    array nn_dist nearestNeighbourDist1-nearestNeighbourDist3;
    set distanceMatrixSelectNearest3;
    by id;

    nn_id{nearestNeighbourCount} = idNearest;
    nn_dist{nearestNeighbourCount} = distanceNearest;
    nn_sale{nearestNeighbourCount} = succesfulSale;

    drop idNearest distanceNearest succesfulSale nearestNeighbourCount;
    if last.id;
    if last.id then do;
        totalDist = 0;
        salesPrediction = 0;

        /* Find the total distance */
        do _i_ = 1 to 3;
            totalDist = totalDist + nn_dist{_i_};
        end;

        /* Add every sales attempt and take the average*/
        do _i_ = 1 to 3;
            salesPrediction = salesPrediction + nn_sale{_i_};
        end;
        salesPrediction = salesPrediction / 3;
    end;
proc sort; by descending salesPrediction; run;

data customersToCall2;
set customersToCall;
by descending salesPrediction;
if _n_ <= 3;
run;

Feel free to correct me on my code and my thoughts of using this approach.

Once again, thank you!

Kind regards,
Bjarke

Re: Nearest neighbour between two datasets

1zmm — Tue, 24 Sep 2013 12:23:15 GMT

You might also use PROC FASTCLUS:

1. Use the SEED= option of the PROC FASTCLUS statement to include a data set

of observations around which you want other "new" observations to cluster;

2. Use the DATA= option of the PROC FASTCLUS statement to include a data set

of the "new" observiations to be clustered;

3. Set the MAXITER option of the PROC FASTCLUS statement to zero (MAXITER=0)

to prevent the procedure from changing the original central "seed"

observations (see #1 above).

Since PROC FASTCLUS is designed for interval/ratio variables, you can only incorporate

nominal variables by creating separate clusters of observations within each category

of the nominal variables. To do this, sort by the nominal variables beforehand, and

use the BY statement in PROC FASTCLUS.