BookmarkSubscribeRSS Feed
ecclesrl
Calcite | Level 5

Old school programmer but new to SAS.

I've got two tables of portfolio data with multiple columns. Table B has many more rows than Table A. I'm trying to match ALL the rows in Table A with the closest matching row in Table B based on the closest matching Long_Cap to Short_Cap AND Long_Beta to Short_Beta. Doing a close match on ONE column is trivial, but two columns has me stumped. I have a feeling that PROC MODECLUS may be the solution, treating the Caps and Betas as pairs and then matching each up with it's nearest neighbor. So, basically, match ALL the blue dots in the chart below with the closest orange dot, using an orange dot only once. Any help would be appreciated.

TABLE A
Long_TickerLong_CapLong_Beta
ACCO1299.9224922.424713513
ARCB916.5173872.385758789
AVP945.995035-3.556207808
BKE1159.98044.534069004
CAL1438.9415061.885933611
CBB879.567966-3.080362524
CLNE306.703682-1.03556073
DNR888.6407454.21629385
ECHO781.413416-0.075975634
ELY1316.89162

-3.776411519

......

...

 

TABLE B  
Short_TickerShort_CapShort_Beta
ABEO741.387823-1.774667292
ADMS771.9762436.600176051
ADTN931.6945851.452481453
AFI435.2704854.9820341
AMAG468.710018-1.700958032
AMPH884.5733531.227854433
AROW472.5949121.35509369
ATSG1368.0625312.215485056
BATRA1097.5948520.577381117
BLX1058.9067171.013467181
BSRR404.2097343.787348746
BSTC311.5094650.393381415
CALX299.8488044.23718815
CASH895.5977040.024619045
CATO398.5070355.609581484
CDZI324.9293552.381925261
.........

 

Capture.JPG

 

4 REPLIES 4
mkeintz
PROC Star

I don't think treating the CAPs and BETAs as pairs, followed by applying some definition of distance (e.g. Euclidean) in two-dimensional space, is the real problem.

 

As I understand it, for data sets A (N=N(a)) and B (N=N(b)), where N(b)>=N(a), you would want N(a) clusters.  Each cluster would have two members, one point from A and one (unique) point from B.  I'm not familiar with proc modeclus, but I'd be surprised if it could be made to do this.  I'm not sure if any of the clustering proc's would work, given the constraints you want for each "cluster".  

 

 

And remember, if the objective is to minimize the total distance over all final pairings, there are  N(b)  * N(b-1) * ....* [1+N(b)-N(a)] possible collections of pairings to consider.  For example, if N(a) is 5 and N(b) is 10, there are 10*9*8*7*6 possible combinations.

 

But unless the N's are relatively small, I don't think examining all possible combinations is a reasonable approach.  That's why I imagine one of the optimization proc's in SAS O/R (not part of everyone's SAS license) would work.  Or possibly someone has developed a similar capability using PROC IML.

 

You also could just do a sub-optimal matching.  Make a dataset of the cartesian crossing of A and B, with N(a)*N(b) distances.  Then take each obs from A, find the nearest available B (e.g. minimum distance), and remove that B from further consideration.  Do that for all the A's, generating the total distance.  You could do this a number of times, each time using a different random sequence of draws from A, and choose the best of your samples.

 

I imagine there are a number of others in this community far more versed in this sort of optimization.

Editted ps: The problem here is the need to sample B without replacement.  If B's could be reused, a simple proc sql would do the trick.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

You would need :

 

1) a distance function, such as D = EUCLID(0.01*(long_cap - short_cap), long_beta - short_beta)
2) to generate a list (a dataset) of all the distances betwen short tickers and long tickers
3) to use these distances as costs in the linear assignment algorithm available in SAS/OR procedure OPTNET

PG
mkeintz
PROC Star

Can PROC OPTNET be forced to assign a dataset B observation to only 1 from dataset A, even if it is the closest to multiple A's?  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

That's exactly (and only) what a linear assignment does: an exclusive one-to-one match.

PG

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 635 views
  • 0 likes
  • 3 in conversation