The city counsel on Boston would like to limit the number of overpriced and under represented Airbnb listings in the Boston area. The new rules require that hosts own the properties they rent out, and live in them for at least nine months of the year every year as well as pay an annual licensing fee. Many have not paid their fees and the city needs to perform an audit. But with few auditors available they need to limit their search. The city would like to narrow the listings to how many higher priced listings, with lower reviews than average, and are available more than 90 days per year.
The Director housing of Boston would like to know if, the number of listings that are the top (approximately) 15%-25% of listings in terms of price, show fewer and lower reviews than average, and are available more than 90 days per year. What the average price, number of days available, ratings and number of ratings posted. This will be the initial target of their audit.
Without digging into the detailed requirements, I would think first calculate the limits (e.g. in you PROC SUMMERY), then join the result back to the detiail data to categorize the listings.
Burt to get more tangible help, please provide sample input data in data step with DATALINES, and a sample desired output.
I really like how this approach uses PROC UNIVARIATE to pre-calculate stats before merging, as preprocessing large datasets can improve efficiency. When working with merges like this, I always double check that the BY variables match and am specific about filtering with IF statements. Validation is also important - I like to check merged vs skipped row counts. For massive files, a JOIN may perform better than MERGE. Overall this gives a good starting template that I'd tweak the filters and UNIVARIATE options in based on my unique analysis needs. Nice work putting this code together - it provides a solid baseline to build from.
@112 wrote:
data boston_airbnb_listings;set boston_airbnb_listings;price_numeric = input(price, comma32.);run;
It is almost always a bad idea to overwrite an existing dataset during development.
Also note, that you can't embed a proc in a data step.
Your indentation is messed up.
By indenting the third step (the PROC) you make the code file look like you want to run the third step in the middle of the second step. You have to finish one step before you can start running another.
So put the steps in the right order.
proc summary data=work.boston_airbnb_listings nway;
var price_numeric;
output out=summary_stats
p15= p15
p25= p25
mean= mean_reviews
;
run;
data work.boston_airbnb_ratings;
set work.boston_airbnb_listings;
if _n_=1 then set summary_stats;
run;
Now that you have the summary stats merged back onto the detailed records you can begin to use them together.
But I don't think you are answering the question.
You seem to be trying to subset to values between the 15th percentile and the 25th percentile. But they are asking for the TOP values. Perhaps just the top 15% but perhaps as much as the top 25%. Totally different subset.
Also do you have those other variables it mentions? Or do you need to calculate those also?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Ready to level-up your skills? Choose your own adventure.