BookmarkSubscribeRSS Feed
112
Calcite | Level 5 112
Calcite | Level 5

The city counsel on Boston would like to limit the number of overpriced and under represented Airbnb listings in the Boston area. The new rules require that hosts own the properties they rent out, and live in them for at least nine months of the year every year as well as pay an annual licensing fee. Many have not paid their fees and the city needs to perform an audit. But with few auditors available they need to limit their search. The city would like to narrow the listings to how many higher priced listings, with lower reviews than average, and are available more than 90 days per year.

The Director housing of Boston would like to know if, the number of listings that are the top (approximately) 15%-25% of listings in terms of price, show fewer and lower reviews than average, and are available more than 90 days per year. What the average price, number of days available, ratings and number of ratings posted. This will be the initial target of their audit.

 
data boston_airbnb_listings;
set boston_airbnb_listings;
price_numeric = input(price, comma32.);
run;
 
/* Data step to calculate percentiles and mean reviews and filter */
data work.boston_airbnb_ratings;
    set work.boston_airbnb_listings;
 
        /* Use PROC SUMMARY to calculate percentiles and mean */
        proc summary data=work.boston_airbnb_listings nway;
            var price_numeric;
            output out=summary_stats
                p15= p15
                p25= p25
                mean= mean_reviews;
        run;
end;
 
    /* Filter based on criteria: top 15%-25% of listings in terms of price, lower reviews, and available > 90 days */
    if price_numeric >= p15 and price_numeric <= p25 then output;
run;
 
how can i revise this step?
4 REPLIES 4
LinusH
Tourmaline | Level 20

Without digging into the detailed requirements, I would think first calculate the limits (e.g. in  you PROC SUMMERY), then join the result back to the detiail data to categorize the listings.

Burt to get more tangible help, please provide sample input data in data step with DATALINES, and a sample desired output.

Data never sleeps
Zinedine
Calcite | Level 5

I really like how this approach uses PROC UNIVARIATE to pre-calculate stats before merging, as preprocessing large datasets can improve efficiency. When working with merges like this, I always double check that the BY variables match and am specific about filtering with IF statements. Validation is also important - I like to check merged vs skipped row counts. For massive files, a JOIN may perform better than MERGE. Overall this gives a good starting template that I'd tweak the filters and UNIVARIATE options in based on my unique analysis needs. Nice work putting this code together - it provides a solid baseline to build from.

andreas_lds
Jade | Level 19

@112 wrote:

 

 
data boston_airbnb_listings;
set boston_airbnb_listings;
price_numeric = input(price, comma32.);
run;
 

 

It is almost always a bad idea to overwrite an existing dataset during development.

Also note, that you can't embed a proc in a data step.

Tom
Super User Tom
Super User

Your indentation is messed up. 

 

By indenting the third step (the PROC) you make the code file look like you want to run the third step in the middle of the second step.  You have to finish one step before you can start running another.

 

So put the steps in the right order.

proc summary data=work.boston_airbnb_listings nway;
    var price_numeric;
    output out=summary_stats
        p15= p15
        p25= p25
        mean= mean_reviews
    ;
run;

data work.boston_airbnb_ratings;
   set work.boston_airbnb_listings;
   if _n_=1 then set summary_stats;
run;

Now that you have the summary stats merged back onto the detailed records you can begin to use them together.

 

But I don't think you are answering the question.

 

You seem to be trying to subset to values between the 15th percentile and the 25th percentile. But they are asking for the TOP values.  Perhaps just the top 15% but perhaps as much as the top 25%.  Totally different subset.

 

Also do you have those other variables it mentions? Or do you need to calculate those also?

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 862 views
  • 0 likes
  • 5 in conversation