Editor's note: SAS programming concepts in this and other Free Data Friday articles remain useful, but SAS OnDemand for Academics has replaced SAS University Edition as a free e-learning option.
One of the great commercial success stories of recent years has been Airbnb. The company acts as an online marketplace allowing hosts to offer short or long-term stays at their properties, from which Airbnb gains a commission. It is often thought of as being aimed at people who have a spare room they can rent out for a little extra cash, but it isn’t restricted to that type of host. Some hosts have very many properties listed and could be considered “professional” hosts rather than the “amateurs” with just a single spare bedroom.
In this edition of Free Data Friday we will be looking at Airbnb listings data from the website insideairbnb.com to see if we can find any differences between the type and price of listings from hosts with just a single listing and hosts with multiple listings. To do this we will be using the JupyterLab interface from SAS University Edition along with SASPy, which allows SAS to be called from Python code. I should point out at this stage that I am by no means a Python expert, so if you see anything in the code you think could be improved, please leave a comment below.
The data is available as a CSV file from the insideairbnb website, which publishes listings data for many different cities. I decided to use the listings data for New York because it has a reasonable number of listings and each one is allocated to one of the easily recognisable five boroughs as its neighbourhood group.
Importing the data into SAS proved more problematic than anticipated. The problem arose with the “name” field, which appears to be a description of the property from the listing advertisement. Some of the rows contain line breaks, which caused errors when importing the file into SAS. This left me with the prospect of either pre-editing the file or finding another way of creating the SAS data set. I had been wanting to try SASPy for a while and this seemed the perfect opportunity to see if Python and SASPy could facilitate the data load.
In order to use SASPy you start SAS in the usual way but choose to use the JupyterLab environment instead of SAS Studio. Here’s the first cell in the notebook, which imports SASPy and the Python pandas module:
import pandas as pd
import saspy
To read the CSV file I then used the pandas read_csv function with the usecols parameter. This tells pandas to keep only the named fields from the CSV and ignore all the others. Leaving out the “name” field sidesteps the issue of the undesirable line breaks, and the data is read into an in-memory data structure called a dataframe.
df=pd.read_csv("/folders/myshortcuts/Dropbox/listings.csv",
usecols=["host_id","host_name","neighbourhood_group","neighbourhood",
"latitude","longitude","room_type","price","minimum_nights",
"calculated_host_listings_count","availability_365"])
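To see the effect of usecols in isolation, here is a minimal sketch that reads a tiny, made-up CSV from a string rather than the real listings file. The sample rows, the SAMPLE_CSV name and the load_selected helper are assumptions for illustration only; the one real call is pandas.read_csv with its usecols parameter.

```python
import io

import pandas as pd

# Hypothetical miniature of listings.csv. The quoted "name" field on the
# second record contains an embedded line break, like the real file.
SAMPLE_CSV = (
    'id,name,host_id,room_type,price\n'
    '1,"Cosy room",101,Private room,80\n'
    '2,"Bright loft\nnear the park",102,Entire home/apt,200\n'
)


def load_selected(csv_text, columns):
    """Read only the named columns from CSV text into a dataframe."""
    return pd.read_csv(io.StringIO(csv_text), usecols=columns)


# The "name" column is not requested, so it never reaches the dataframe.
sample_df = load_selected(SAMPLE_CSV, ["host_id", "room_type", "price"])
print(sample_df)
```

The resulting dataframe has one row per listing and only the three requested columns, so nothing containing a line break is passed on to SAS.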
Next I use SASPy to instantiate a SAS session:
sas = saspy.SASsession()
I then declare the SAS library where I am going to save the resulting data set:
sas.saslib("dbox","base","/folders/myshortcuts/Dropbox")
Now I can save the pandas dataframe to a SAS data set for further processing in SAS:
airbnb=sas.df2sd(df,"airbnb","dbox")
From this point on all my processing could be done using SAS code. There are a number of ways of calling SAS from the JupyterLab Python kernel. The easiest is probably the %%SAS magic command, which forces all code in the cell to be run by the SAS session. Here's the cell which creates a column in the data set to help distinguish between single and multiple listing hosts, runs Proc Means and prints the output:
%%SAS
libname dbox "/folders/myshortcuts/Dropbox";
data all_data;
set dbox.airbnb;
if calculated_host_listings_count=1 then single=1;
else if calculated_host_listings_count>1 then single=0;
run;
proc means data=all_data noprint;
class single room_type neighbourhood_group;
var price;
output out=all_stats mean=avg_price;
run;
proc print data=all_stats(obs=20);
run;
Here's the output from the Proc Print:
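For readers more at home in Python, the logic of the data step and the Proc Means call can be sketched in plain Python. This is only an illustration under stated assumptions: the rows are invented, the helper names single_flag and average_price are hypothetical, and only the combination of single and room_type is summarised (the rows that Proc Means labels with _type_=6).

```python
from collections import defaultdict

# Invented rows standing in for dbox.airbnb; none of these values are real.
ROWS = [
    {"calculated_host_listings_count": 1, "room_type": "Private room", "price": 60},
    {"calculated_host_listings_count": 1, "room_type": "Private room", "price": 80},
    {"calculated_host_listings_count": 3, "room_type": "Private room", "price": 100},
    {"calculated_host_listings_count": 5, "room_type": "Entire home/apt", "price": 200},
]


def single_flag(count):
    """Mirror the data step: 1 for one listing, 0 for more, None otherwise."""
    if count == 1:
        return 1
    if count is not None and count > 1:
        return 0
    return None


def average_price(rows):
    """Return {(single, room_type): (average price, row count)}."""
    prices = defaultdict(list)
    for row in rows:
        flag = single_flag(row["calculated_host_listings_count"])
        if flag is None:
            # Proc Means drops rows with a missing class value by default.
            continue
        prices[(flag, row["room_type"])].append(row["price"])
    return {key: (sum(vals) / len(vals), len(vals)) for key, vals in prices.items()}


stats = average_price(ROWS)
for (single, room_type), (avg_price, freq) in sorted(stats.items()):
    print(single, room_type, avg_price, freq)
```

The count kept alongside each average plays the role of the _freq_ variable that Proc Means adds to the output data set.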
Now I can run Proc SGPie to generate some Pie Charts showing the percentage of listings by room type for all hosts, single listing hosts and multiple listing hosts. I can use the _type_ variable to distinguish between combinations of class variables.
%%SAS
title1 "Airbnb Listings by Room Type";
title2 "New York City";
title3 "All Hosts";
footnote1 j=r "Data From: http://insideairbnb.com";
proc sgpie data=all_stats(where=(_type_=2));
format _freq_ comma10.;
pie room_type / response=_freq_ maxslices=4
dataskin=gloss datalabeldisplay=(category percent)
datalabelloc=callout;
run;
title1 "Airbnb Listings by Room Type";
title2 "New York City";
title3 "Single Listing Hosts";
footnote1 j=r "Data From: http://insideairbnb.com";
proc sgpie data=all_stats(where=(single=1 and _type_=6));
format _freq_ comma10.;
pie room_type / response=_freq_ maxslices=4
dataskin=gloss datalabeldisplay=(category percent)
datalabelloc=callout;
run;
title1 "Airbnb Listings by Room Type";
title2 "New York City";
title3 "Multiple Listing Hosts";
footnote1 j=r "Data From: http://insideairbnb.com";
proc sgpie data=all_stats(where=(single=0 and _type_=6));
format _freq_ comma10.;
pie room_type / response=_freq_ maxslices=4
dataskin=gloss datalabeldisplay=(category percent)
datalabelloc=callout;
run;
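The _type_ values in the where clauses above follow from the order of the variables in the class statement. Proc Means treats each class variable as one binary digit, with the rightmost variable as the lowest digit, and adds up the digits of the variables that define a given row. A minimal sketch of that arithmetic (the type_value helper is a hypothetical name, not part of SAS or SASPy):

```python
# Class variables in the order they appear in the class statement.
CLASS_VARS = ["single", "room_type", "neighbourhood_group"]


def type_value(active, class_vars=CLASS_VARS):
    """Return the _type_ value for the class variables named in `active`."""
    value = 0
    # The rightmost class variable contributes 1, the next 2, then 4, ...
    for position, name in enumerate(reversed(class_vars)):
        if name in active:
            value += 2 ** position
    return value


for combo in (
    {"neighbourhood_group"},
    {"room_type"},
    {"single", "neighbourhood_group"},
    {"single", "room_type"},
):
    print(sorted(combo), type_value(combo))
```

So _type_=2 selects the rows summarised by room_type alone, while _type_=6 selects the rows summarised by both single and room_type.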
We can see that for single listing hosts, entire homes/apartments make up more than half of the listings, with private rooms accounting for about 39%. For multiple listing hosts these numbers are almost exactly reversed, with private rooms being more prevalent. Hotel rooms are almost entirely the province of multiple listing hosts, although the percentages are very small.
Now we move on to location:
%%SAS
title1 "Airbnb Listings by Borough";
title2 "New York City";
title3 "All Hosts";
footnote1 j=r "Data From: http://insideairbnb.com";
proc sgpie data=all_stats(where=(_type_=1));
format _freq_ comma10.;
pie neighbourhood_group / response=_freq_ maxslices=5
dataskin=gloss datalabeldisplay=(category percent)
datalabelloc=callout;
run;
title1 "Airbnb Listings by Borough";
title2 "New York City";
title3 "Single Listing Hosts";
footnote1 j=r "Data From: http://insideairbnb.com";
proc sgpie data=all_stats(where=(single=1 and _type_=5));
format _freq_ comma10.;
pie neighbourhood_group / response=_freq_ maxslices=5
dataskin=gloss datalabeldisplay=(category percent)
datalabelloc=callout;
run;
title1 "Airbnb Listings by Borough";
title2 "New York City";
title3 "Multiple Listing Hosts";
footnote1 j=r "Data From: http://insideairbnb.com";
proc sgpie data=all_stats(where=(single=0 and _type_=5));
format _freq_ comma10.;
pie neighbourhood_group / response=_freq_ maxslices=5
dataskin=gloss datalabeldisplay=(category percent)
datalabelloc=callout;
run;
There's not a huge difference here except for Queens, which accounts for 16% of the listings from multiple listing hosts but only 10% of the listings from single listing hosts.
Finally we can chart prices by room type for the different types of host:
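Each percentage on the pie charts is a slice's share of the total _freq_ across the rows selected by the where clause, so the shares are worked out separately within each type of host. A minimal sketch of that arithmetic, using invented counts and a hypothetical percent_shares helper:

```python
def percent_shares(freq_by_category):
    """Return each category's share of the total count, as a percentage."""
    total = sum(freq_by_category.values())
    return {name: 100 * count / total for name, count in freq_by_category.items()}


# Invented counts for illustration only; they are not taken from the Airbnb data.
example_freq = {"Borough A": 50, "Borough B": 30, "Borough C": 20}

shares = percent_shares(example_freq)
for name, share in shares.items():
    print(f"{name}: {share:.0f}%")
```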
%%SAS
title1 "Airbnb Average Prices by Room Type";
title2 "New York City";
title3 "All Hosts";
proc sgplot data=all_stats(where=(_type_=2));
vbar room_type / response=avg_price
dataskin=gloss datalabel=avg_price
categoryorder=respdesc;
xaxis label="Room Type";
yaxis label="Average Price US$";
run;
title1 "Airbnb Average Prices by Room Type";
title2 "New York City";
title3 "Single Listing Hosts";
proc sgplot data=all_stats(where=(single=1 and _type_=6));
vbar room_type / response=avg_price
dataskin=gloss datalabel=avg_price
categoryorder=respdesc;
xaxis label="Room Type";
yaxis label="Average Price US$";
run;
title1 "Airbnb Average Prices by Room Type";
title2 "New York City";
title3 "Multiple Listing Hosts";
proc sgplot data=all_stats(where=(single=0 and _type_=6));
vbar room_type / response=avg_price
dataskin=gloss datalabel=avg_price
categoryorder=respdesc;
xaxis label="Room Type";
yaxis label="Average Price US$";
run;
We can see here that there are marked differences in prices for hotel rooms and shared rooms, where single listing hosts are much more expensive than multiple listing hosts. In contrast, private rooms from multiple listing hosts are more expensive on average.
It's hard to know what to make of these results without further analysis, but it may be that single listing hosts charge more because they are more likely to offer premium rooms than multiple listing hosts, who may concentrate on volume over quality.
Did you find something else interesting in this data? Share in the comments. I’m glad to answer any questions.
Visit [[this link]] to see all the Free Data Friday articles.