Hello everyone This is not a "real" problem, since I already know another way to do it, but I am looking for a more efficient method. For context, what I am doing is trying to benchmark rental units with eachother. So, the user is picking a rental unit in the data, and then it will benchmark itself against the closest neighbours in the dataset. Assume that my dataset is the "Have"-sheet in the attached excel-file. So I have picked the address called "TEST-address 1" from my dataset and calculated the distance from the picked address to the other addresses (neighbours). What I want now is to reduce the dataset, so it only consists of "relevant" neighbours. The closest neighbours in the dataset are obviously the most relevant neighbours, but how are those defined? I have 4 conditons that have to be met 1) there has to be at least 10 different addresses (neighbours) 2) there has to be at least 3 different owners 3) there has to be at least 5 different properties 4) one owner cannot be too dominant in terms of values - one owner can at least have a share of 90% of the value. E.g. if the total value in the dataset is $100 and owner A has $95 then owner A has 95% of the share, which is too dominant. My current method is to do a do-loop. So when I have picked an address the program will be looking for the closest neighbours one at a time, and stop looking when all the above conditions has been met. In the "Want-sheet" it should stop at row 12. However, there is a rule that if the next addresses are in the same property, then it should stop when all the addresses in the property are in there. Hence, it should stop at row 13 instead. However, I want to test, if it is more efficient to do it the other way around. Instead of evaluating the conditions one neighbour at a time, I want the conditions to already be calculated as variables for each neighbour, like I have done in the "want"-sheet. So e.g. the 13th row states that if the 13th row is included in the final dataset then there will be 5 owners, 11 addresses and 5 properties in the dataset, and the owner with the largest share of value has a share of 50%. It is quite easy to calculate the number of owners, neighbours and properties, but the "max owner share" is difficult. Just to illustrate, how the "max owner share" should be calculated, I have created the yellow columns. The yellow columns could be an alternative method to do it, but the problem is that I would then have to create several new columns. In my original dataset there are over 400 different owners, meaning that I would have to create over 400 new columns. Does anyone have any ideas/suggestions on, how to make this more efficent? Thanks! 😀 Edit: Thanks to @ballardw, I have converted the excel-file to a data step code. The mentioned yellow columns are all the variables after "Value". I hope it makes sense 😀 data WORK.WANT;
infile datalines dsd truncover;
input Owner $5. address $15. Distance_to_picked_address commax5. Picked_owner $4. Picked_address $14. Property $6.
Value 5. no_of_owners 3. no_of_addresses 3. no_of_properties 2. max_owner_share commax5. acc_value 6. TEST1 5.
TEST2 4. TEST3 4. TEST4 4. TEST5 5. TEST6 5.;
datalines;
TEST TEST-address1 0.0 TEST TEST-address1 PROP 1000 0 0 0 0% 0 0 0 0 0 0 0
TEST1 TEST1-address1 0.1 TEST TEST-address1 PROP1 1025 1 1 1 100% 1025 1025 0 0 0 0 0
TEST1 TEST1-address2 0.1 TEST TEST-address1 PROP1 700 1 2 1 100% 1725 1725 0 0 0 0 0
TEST2 TEST2-address1 0.2 TEST TEST-address1 PROP2 625 2 3 2 73% 2350 1725 625 0 0 0 0
TEST3 TEST3-address1 0.3 TEST TEST-address1 PROP3 375 3 4 3 63% 2725 1725 625 375 0 0 0
TEST3 TEST3-address2 0.3 TEST TEST-address1 PROP3 500 3 5 3 53% 3225 1725 625 875 0 0 0
TEST4 TEST4-address1 0.4 TEST TEST-address1 PROP4 250 4 6 4 50% 3475 1725 625 875 250 0 0
TEST4 TEST4-address2 0.4 TEST TEST-address1 PROP4 250 4 7 4 46% 3725 1725 625 875 500 0 0
TEST5 TEST5-address1 0.5 TEST TEST-address1 PROP5 525 5 8 5 41% 4250 1725 625 875 500 525 0
TEST5 TEST5-address2 0.5 TEST TEST-address1 PROP5 500 5 9 5 36% 4750 1725 625 875 500 1025 0
TEST5 TEST5-address3 0.5 TEST TEST-address1 PROP5 1625 5 10 5 42% 6375 1725 625 875 500 2650 0
TEST5 TEST5-address4 0.5 TEST TEST-address1 PROP5 1075 5 11 5 50% 7450 1725 625 875 500 3725 0
TEST6 TEST6-address1 0.6 TEST TEST-address1 PROP6 875 6 12 6 45% 8325 1725 625 875 500 3725 875
TEST6 TEST6-address2 0.6 TEST TEST-address1 PROP6 1000 6 13 6 40% 9325 1725 625 875 500 3725 1875
TEST6 TEST6-address3 0.6 TEST TEST-address1 PROP7 1400 6 14 7 35% 10725 1725 625 875 500 3725 3275
TEST6 TEST6-address4 0.6 TEST TEST-address1 PROP7 1050 6 15 7 37% 11775 1725 625 875 500 3725 4325
TEST1 TEST1-address3 1.0 TEST TEST-address1 PROP8 1675 6 16 8 32% 13450 3400 625 875 500 3725 4325
TEST1 TEST1-address4 1.0 TEST TEST-address1 PROP8 800 6 17 8 30% 14250 4200 625 875 500 3725 4325
TEST1 TEST1-address5 1.0 TEST TEST-address1 PROP8 625 6 18 8 32% 14875 4825 625 875 500 3725 4325
;;;;
run;
data have;
set want;
keep Owner address Distance_to_picked_address Picked_owner Picked_address Property
Value;
run;
... View more