BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
hkgejd
Fluorite | Level 6

Hello everyone

 

This is not a "real" problem, since I already know another way to do it, but I am looking for a more efficient method.

 

For context, what I am doing is trying to benchmark rental units with eachother. So, the user is picking a rental unit in the data, and then it will benchmark itself against the closest neighbours in the dataset. Assume that my dataset is the "Have"-sheet in the attached excel-file. So I have picked the address called "TEST-address 1" from my dataset and calculated the distance from the picked address to the other addresses (neighbours).

 

What I want now is to reduce the dataset, so it only consists of "relevant" neighbours. The closest neighbours in the dataset are obviously the most relevant neighbours, but how are those defined? I have 4 conditons that have to be met

1) there has to be at least 10 different addresses (neighbours)

2) there has to be at least 3 different owners

3) there has to be at least 5 different properties

4) one owner cannot be too dominant in terms of values - one owner can at least have a share of 90% of the value. E.g. if the total value in the dataset is $100 and owner A has $95 then owner A has 95% of the share, which is too dominant.

 

My current method is to do a do-loop. So when I have picked an address the program will be looking for the closest neighbours one at a time, and stop looking when all the above conditions has been met. In the "Want-sheet" it should stop at row 12. However, there is a rule that if the next addresses are in the same property, then it should stop when all the addresses in the property are in there. Hence, it should stop at row 13 instead.

 

However, I want to test, if it is more efficient to do it the other way around. Instead of evaluating the conditions one neighbour at a time, I want the conditions to already be calculated as variables for each neighbour, like I have done in the "want"-sheet. So e.g. the 13th row states that if the 13th row is included in the final dataset then there will be 5 owners, 11 addresses and 5 properties in the dataset, and the owner with the largest share of value has a share of 50%.

 

It is quite easy to calculate the number of owners, neighbours and properties, but the "max owner share" is difficult. Just to illustrate, how the "max owner share" should be calculated, I have created the yellow columns. The yellow columns could be an alternative method to do it, but the problem is that I would then have to create several new columns. In my original dataset there are over 400 different owners, meaning that I would have to create over 400 new columns.

 

Does anyone have any ideas/suggestions on, how to make this more efficent?

 

Thanks! 😀

 

Edit: Thanks to @ballardw, I have converted the excel-file to a data step code. The mentioned yellow columns are all the variables after "Value". I hope it makes sense 😀

data WORK.WANT;
  infile datalines dsd truncover;

input Owner $5. address $15. Distance_to_picked_address commax5. Picked_owner $4. Picked_address $14. Property $6. 
Value 5. no_of_owners 3. no_of_addresses 3. no_of_properties 2. max_owner_share commax5. acc_value 6. TEST1 5. 
TEST2 4. TEST3 4. TEST4 4. TEST5 5. TEST6 5.;
datalines;
TEST TEST-address1   0.0 TEST TEST-address1 PROP  1000 0 0  0 0%   0     0    0   0   0   0    0
TEST1 TEST1-address1 0.1 TEST TEST-address1 PROP1 1025 1 1  1 100% 1025  1025 0   0   0   0    0
TEST1 TEST1-address2 0.1 TEST TEST-address1 PROP1 700  1 2  1 100% 1725  1725 0   0   0   0    0
TEST2 TEST2-address1 0.2 TEST TEST-address1 PROP2 625  2 3  2 73%  2350  1725 625 0   0   0    0
TEST3 TEST3-address1 0.3 TEST TEST-address1 PROP3 375  3 4  3 63%  2725  1725 625 375 0   0    0
TEST3 TEST3-address2 0.3 TEST TEST-address1 PROP3 500  3 5  3 53%  3225  1725 625 875 0   0    0
TEST4 TEST4-address1 0.4 TEST TEST-address1 PROP4 250  4 6  4 50%  3475  1725 625 875 250 0    0
TEST4 TEST4-address2 0.4 TEST TEST-address1 PROP4 250  4 7  4 46%  3725  1725 625 875 500 0    0
TEST5 TEST5-address1 0.5 TEST TEST-address1 PROP5 525  5 8  5 41%  4250  1725 625 875 500 525  0
TEST5 TEST5-address2 0.5 TEST TEST-address1 PROP5 500  5 9  5 36%  4750  1725 625 875 500 1025 0
TEST5 TEST5-address3 0.5 TEST TEST-address1 PROP5 1625 5 10 5 42%  6375  1725 625 875 500 2650 0
TEST5 TEST5-address4 0.5 TEST TEST-address1 PROP5 1075 5 11 5 50%  7450  1725 625 875 500 3725 0
TEST6 TEST6-address1 0.6 TEST TEST-address1 PROP6 875  6 12 6 45%  8325  1725 625 875 500 3725 875
TEST6 TEST6-address2 0.6 TEST TEST-address1 PROP6 1000 6 13 6 40%  9325  1725 625 875 500 3725 1875
TEST6 TEST6-address3 0.6 TEST TEST-address1 PROP7 1400 6 14 7 35%  10725 1725 625 875 500 3725 3275
TEST6 TEST6-address4 0.6 TEST TEST-address1 PROP7 1050 6 15 7 37%  11775 1725 625 875 500 3725 4325
TEST1 TEST1-address3 1.0 TEST TEST-address1 PROP8 1675 6 16 8 32%  13450 3400 625 875 500 3725 4325
TEST1 TEST1-address4 1.0 TEST TEST-address1 PROP8 800  6 17 8 30%  14250 4200 625 875 500 3725 4325
TEST1 TEST1-address5 1.0 TEST TEST-address1 PROP8 625  6 18 8 32%  14875 4825 625 875 500 3725 4325
;;;;
run;

data have;
set want;

keep Owner address Distance_to_picked_address Picked_owner Picked_address Property
Value;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @hkgejd,

 

Instead of the additional variables TEST1, TEST2, etc. you could use a temporary array, indexed with owner numbers, which are assigned by an informat, which in turn is created dynamically from the owner data.

 

Something like:

/* Create an informat OWNERNO that assigns unique numbers to owners */

proc sort data=have(keep=owner) out=owners nodupkey;
by owner;
run;

data owninfmt;
retain fmtname 'ownerno' type 'I';
set owners(rename=(owner=start)) nobs=n;
if _n_=1 then call symputx('n_own',n);
label=_n_;
run;

proc format cntlin=owninfmt;
run;

/* Create derived variables */

data want(drop=_i);
set have;
retain no_of_owners no_of_addresses no_of_properties max_owner_share 0;
array val[&n_own] _temporary_;
if _n_>1 then do;
  _i=input(owner,ownerno.);
  val[_i]=sum(val[_i],value);
  no_of_owners=n(of val[*]);
  no_of_addresses+1;
  no_of_properties+(property ne lag(property)); /* assumes suitable sort order */
  acc_value+value;
  max_owner_share=max(of val[*])/acc_value; /* proportions between 0 and 1 */
end;
format max_owner_share percent.;
run;

View solution in original post

4 REPLIES 4
ballardw
Super User

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

hkgejd
Fluorite | Level 6

Hi ballardw

 

Thanks for your reply!

 

That makes sense, I will convert the excel-file to a data step code immediately 😀

FreelanceReinh
Jade | Level 19

Hello @hkgejd,

 

Instead of the additional variables TEST1, TEST2, etc. you could use a temporary array, indexed with owner numbers, which are assigned by an informat, which in turn is created dynamically from the owner data.

 

Something like:

/* Create an informat OWNERNO that assigns unique numbers to owners */

proc sort data=have(keep=owner) out=owners nodupkey;
by owner;
run;

data owninfmt;
retain fmtname 'ownerno' type 'I';
set owners(rename=(owner=start)) nobs=n;
if _n_=1 then call symputx('n_own',n);
label=_n_;
run;

proc format cntlin=owninfmt;
run;

/* Create derived variables */

data want(drop=_i);
set have;
retain no_of_owners no_of_addresses no_of_properties max_owner_share 0;
array val[&n_own] _temporary_;
if _n_>1 then do;
  _i=input(owner,ownerno.);
  val[_i]=sum(val[_i],value);
  no_of_owners=n(of val[*]);
  no_of_addresses+1;
  no_of_properties+(property ne lag(property)); /* assumes suitable sort order */
  acc_value+value;
  max_owner_share=max(of val[*])/acc_value; /* proportions between 0 and 1 */
end;
format max_owner_share percent.;
run;
hkgejd
Fluorite | Level 6

Hi @FreelanceReinh 

 

That is a very nice workaround!

 

I will have to test this on my "real" data, but I think this will be way faster than my current method.

 

Thank you so much 😀

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 823 views
  • 1 like
  • 3 in conversation