How to best match 2 columns with 2 other columns

ecclesrl · Posted 12-01-2020 06:03 PM

Old school programmer but new to SAS.

I've got two tables of portfolio data with multiple columns. Table B has many more rows than Table A. I'm trying to match ALL the rows in Table A with the closest matching row in Table B based on the closest matching Long_Cap to Short_Cap AND Long_Beta to Short_Beta. Doing a close match on ONE column is trivial, but two columns has me stumped. I have a feeling that PROC MODECLUS may be the solution, treating the Caps and Betas as pairs and then matching each up with it's nearest neighbor. So, basically, match ALL the blue dots in the chart below with the closest orange dot, using an orange dot only once. Any help would be appreciated.

TABLE A
Long_Ticker	Long_Cap	Long_Beta
ACCO	1299.922492	2.424713513
ARCB	916.517387	2.385758789
AVP	945.995035	-3.556207808
BKE	1159.9804	4.534069004
CAL	1438.941506	1.885933611
CBB	879.567966	-3.080362524
CLNE	306.703682	-1.03556073
DNR	888.640745	4.21629385
ECHO	781.413416	-0.075975634
ELY	1316.89162	-3.776411519
...	...	...

TABLE B
Short_Ticker	Short_Cap	Short_Beta
ABEO	741.387823	-1.774667292
ADMS	771.976243	6.600176051
ADTN	931.694585	1.452481453
AFI	435.270485	4.9820341
AMAG	468.710018	-1.700958032
AMPH	884.573353	1.227854433
AROW	472.594912	1.35509369
ATSG	1368.062531	2.215485056
BATRA	1097.594852	0.577381117
BLX	1058.906717	1.013467181
BSRR	404.209734	3.787348746
BSTC	311.509465	0.393381415
CALX	299.848804	4.23718815
CASH	895.597704	0.024619045
CATO	398.507035	5.609581484
CDZI	324.929355	2.381925261
...	...	...

mkeintz · Posted 12-01-2020 08:55 PM

I don't think treating the CAPs and BETAs as pairs, followed by applying some definition of distance (e.g. Euclidean) in two-dimensional space, is the real problem.

As I understand it, for data sets A (N=N(a)) and B (N=N(b)), where N(b)>=N(a), you would want N(a) clusters. Each cluster would have two members, one point from A and one (unique) point from B. I'm not familiar with proc modeclus, but I'd be surprised if it could be made to do this. I'm not sure if any of the clustering proc's would work, given the constraints you want for each "cluster".

And remember, if the objective is to minimize the total distance over all final pairings, there are N(b) * N(b-1) * ....* [1+N(b)-N(a)] possible collections of pairings to consider. For example, if N(a) is 5 and N(b) is 10, there are 10*9*8*7*6 possible combinations.

But unless the N's are relatively small, I don't think examining all possible combinations is a reasonable approach. That's why I imagine one of the optimization proc's in SAS O/R (not part of everyone's SAS license) would work. Or possibly someone has developed a similar capability using PROC IML.

You also could just do a sub-optimal matching. Make a dataset of the cartesian crossing of A and B, with N(a)*N(b) distances. Then take each obs from A, find the nearest available B (e.g. minimum distance), and remove that B from further consideration. Do that for all the A's, generating the total distance. You could do this a number of times, each time using a different random sequence of draws from A, and choose the best of your samples.

I imagine there are a number of others in this community far more versed in this sort of optimization.

Editted ps: The problem here is the need to sample B without replacement. If B's could be reused, a simple proc sql would do the trick.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PGStats · Posted 12-01-2020 11:11 PM

You would need :

1) a distance function, such as D = EUCLID(0.01*(long_cap - short_cap), long_beta - short_beta)
2) to generate a list (a dataset) of all the distances betwen short tickers and long tickers
3) to use these distances as costs in the linear assignment algorithm available in SAS/OR procedure OPTNET

PG

mkeintz · Posted 12-01-2020 11:39 PM

Can PROC OPTNET be forced to assign a dataset B observation to only 1 from dataset A, even if it is the closest to multiple A's?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PGStats · Posted 12-01-2020 11:46 PM

That's exactly (and only) what a linear assignment does: an exclusive one-to-one match.

PG

How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

Re: How to best match 2 columns with 2 other columns

Click image to register for webinar

Classroom Training Available!