Old school programmer but new to SAS.
I've got two tables of portfolio data with multiple columns. Table B has many more rows than Table A. I'm trying to match ALL the rows in Table A with the closest matching row in Table B based on the closest matching Long_Cap to Short_Cap AND Long_Beta to Short_Beta. Doing a close match on ONE column is trivial, but two columns has me stumped. I have a feeling that PROC MODECLUS may be the solution, treating the Caps and Betas as pairs and then matching each up with it's nearest neighbor. So, basically, match ALL the blue dots in the chart below with the closest orange dot, using an orange dot only once. Any help would be appreciated.
TABLE A | ||
Long_Ticker | Long_Cap | Long_Beta |
ACCO | 1299.922492 | 2.424713513 |
ARCB | 916.517387 | 2.385758789 |
AVP | 945.995035 | -3.556207808 |
BKE | 1159.9804 | 4.534069004 |
CAL | 1438.941506 | 1.885933611 |
CBB | 879.567966 | -3.080362524 |
CLNE | 306.703682 | -1.03556073 |
DNR | 888.640745 | 4.21629385 |
ECHO | 781.413416 | -0.075975634 |
ELY | 1316.89162 | -3.776411519 |
... | ... | ... |
TABLE B | ||
Short_Ticker | Short_Cap | Short_Beta |
ABEO | 741.387823 | -1.774667292 |
ADMS | 771.976243 | 6.600176051 |
ADTN | 931.694585 | 1.452481453 |
AFI | 435.270485 | 4.9820341 |
AMAG | 468.710018 | -1.700958032 |
AMPH | 884.573353 | 1.227854433 |
AROW | 472.594912 | 1.35509369 |
ATSG | 1368.062531 | 2.215485056 |
BATRA | 1097.594852 | 0.577381117 |
BLX | 1058.906717 | 1.013467181 |
BSRR | 404.209734 | 3.787348746 |
BSTC | 311.509465 | 0.393381415 |
CALX | 299.848804 | 4.23718815 |
CASH | 895.597704 | 0.024619045 |
CATO | 398.507035 | 5.609581484 |
CDZI | 324.929355 | 2.381925261 |
... | ... | ... |
I don't think treating the CAPs and BETAs as pairs, followed by applying some definition of distance (e.g. Euclidean) in two-dimensional space, is the real problem.
As I understand it, for data sets A (N=N(a)) and B (N=N(b)), where N(b)>=N(a), you would want N(a) clusters. Each cluster would have two members, one point from A and one (unique) point from B. I'm not familiar with proc modeclus, but I'd be surprised if it could be made to do this. I'm not sure if any of the clustering proc's would work, given the constraints you want for each "cluster".
And remember, if the objective is to minimize the total distance over all final pairings, there are N(b) * N(b-1) * ....* [1+N(b)-N(a)] possible collections of pairings to consider. For example, if N(a) is 5 and N(b) is 10, there are 10*9*8*7*6 possible combinations.
But unless the N's are relatively small, I don't think examining all possible combinations is a reasonable approach. That's why I imagine one of the optimization proc's in SAS O/R (not part of everyone's SAS license) would work. Or possibly someone has developed a similar capability using PROC IML.
You also could just do a sub-optimal matching. Make a dataset of the cartesian crossing of A and B, with N(a)*N(b) distances. Then take each obs from A, find the nearest available B (e.g. minimum distance), and remove that B from further consideration. Do that for all the A's, generating the total distance. You could do this a number of times, each time using a different random sequence of draws from A, and choose the best of your samples.
I imagine there are a number of others in this community far more versed in this sort of optimization.
Editted ps: The problem here is the need to sample B without replacement. If B's could be reused, a simple proc sql would do the trick.
You would need :
1) a distance function, such as D = EUCLID(0.01*(long_cap - short_cap), long_beta - short_beta)
2) to generate a list (a dataset) of all the distances betwen short tickers and long tickers
3) to use these distances as costs in the linear assignment algorithm available in SAS/OR procedure OPTNET
Can PROC OPTNET be forced to assign a dataset B observation to only 1 from dataset A, even if it is the closest to multiple A's?
That's exactly (and only) what a linear assignment does: an exclusive one-to-one match.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Early bird rate extended! Save $200 when you sign up by March 31.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.