How much should you ask for rent: 3 steps to build a price model with Airbnb data for Boston

Daria Satco
8 min read · Apr 19, 2021
Heatmap of listing prices in Boston (Sep 2016)

In this post I demonstrate how you can use open-source data to start your Airbnb business as a host in a smart way. Inside Airbnb provides publicly available information about a city’s Airbnb listings, compiled from the Airbnb website. As an example I take a dataset for Boston, scraped in September 2016, which includes 3.5k listings with 90+ feature columns, as well as an availability calendar for 365 days ahead and the reviews for each listing. Following the CRISP-DM methodology, I would like to answer the following 3 questions:

  • What does the Boston Airbnb market look like?
  • What is the busiest/most expensive period, and how can we account for price seasonality?
  • How can we predict the price for a new listing?

Step 1: Understanding and preparing the data

Before moving on to exploratory data analysis (EDA) and modelling, we need to check and clean the data. First of all, we would like to consider only active listings, so I consider a listing active if it meets the following 2 conditions:

  1. The last review was less than 6 months ago
  2. The last calendar update was less than 2 months ago
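These two conditions can be sketched as a pandas filter. This is a minimal illustration on toy data: the real Inside Airbnb listings file stores the calendar update as a relative string (e.g. “2 weeks ago”), so the pre-parsed `calendar_updated_date` column here is a simplifying assumption.

```python
import pandas as pd

# Toy listings frame; the real data comes from Inside Airbnb's listings file.
# `calendar_updated_date` is assumed to be already parsed to a date.
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "last_review": ["2016-08-20", "2016-01-10", "2016-09-01"],
    "calendar_updated_date": ["2016-08-30", "2016-05-01", "2016-09-02"],
})
listings["last_review"] = pd.to_datetime(listings["last_review"])
listings["calendar_updated_date"] = pd.to_datetime(listings["calendar_updated_date"])

scrape_date = pd.Timestamp("2016-09-07")  # date the dataset was scraped

# Keep listings whose last review is less than ~6 months old and whose
# calendar was updated less than ~2 months ago.
active = listings[
    (scrape_date - listings["last_review"] < pd.Timedelta(days=182))
    & (scrape_date - listings["calendar_updated_date"] < pd.Timedelta(days=61))
]
print(active["id"].tolist())  # listings 1 and 3 pass both checks
```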

Second, we need to transform several badly typed columns:

  1. All price columns are originally strings like “$…”, so I convert them into numerical values.
  2. All rates are originally strings like “…%”, which I transform into numerical values.
  3. The amenities description is given in a JSON-like format, which I convert into a list (and later into separate columns, one per amenity).
  4. Drop empty/single-value columns.
  5. Convert object columns with dates (or intervals) into datetime format.
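The first three conversions can be sketched in a few lines of pandas. The toy values below mimic the raw Inside Airbnb formats (dollar strings, percent strings, and a brace-delimited amenities set); column names match the dataset, but the exact parsing details are an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "price": ["$150.00", "$1,200.00"],
    "host_response_rate": ["100%", "95%"],
    "amenities": ['{TV,"Wireless Internet",Kitchen}', '{Kitchen,Heating}'],
})

# 1. "$…" strings -> floats (drop the dollar sign and thousands separator)
df["price"] = df["price"].str.replace("[$,]", "", regex=True).astype(float)

# 2. "…%" strings -> floats
df["host_response_rate"] = df["host_response_rate"].str.rstrip("%").astype(float)

# 3. "{a,b,…}" amenities -> list, then one indicator column per amenity
df["amenities"] = (
    df["amenities"].str.strip("{}").str.replace('"', "").str.split(",")
)
amenity_dummies = df["amenities"].str.join("|").str.get_dummies()
df = df.join(amenity_dummies)
```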

Third, I look for or generate additional features:

  1. I use geo-data from the Boston neighbourhood map to assign each listing to a neighbourhood and calculate the listing density (= number of listings per km²).
  2. There are 68k reviews in the dataset, for which I calculated a sentiment score (with nltk.sentiment.vader.SentimentIntensityAnalyzer ). These scores, together with the review scores, are aggregated by district into 2 new features: district_sentiment and district_rev_score .
  3. The categorical columns are converted into numerical ones with the pandas.get_dummies method.
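The district-level aggregation in point 2 can be sketched as follows. The sentiment values here are made-up stand-ins (in the post they come from VADER's compound score), so the sketch stays dependency-free; the two-step mean (per listing, then per district) is one reasonable reading of the aggregation.

```python
import pandas as pd

# Hypothetical per-review sentiment scores (in the post these come from
# nltk.sentiment.vader.SentimentIntensityAnalyzer).
reviews = pd.DataFrame({
    "listing_id": [1, 1, 2, 3],
    "sentiment": [0.8, 0.6, 0.2, 0.9],
})
listings = pd.DataFrame({
    "id": [1, 2, 3],
    "district": ["Back Bay", "Back Bay", "Allston"],
})

# Average sentiment per listing, then average over each district
per_listing = reviews.groupby("listing_id")["sentiment"].mean()
listings = listings.merge(per_listing, left_on="id", right_index=True, how="left")
district_sentiment = listings.groupby("district")["sentiment"].mean()
listings["district_sentiment"] = listings["district"].map(district_sentiment)
```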

Finally, we need to select features for further modelling. Since I want a model that can recommend a price for a new listing, it doesn’t make sense to include information about previous activity. Thus, based on business sense, I exclude the following features:

  • host rates
  • review rates
  • activity (last review, last update, etc.)
  • additional fees (cleaning, security deposit, etc.)
  • availability (as we can see from the figure above, there is no correlation between the available share — the share of days during one year when the listing is available for booking — and the price)

I ended up with a preprocessed dataset of 1870 samples and 73 features, where all NaNs were filled with -1.

Step 2: Build the model

To build the regression model I use lightgbm.LGBMRegressor with the default ‘gbdt’ boosting type and the default objective function (L2 loss, i.e. mean squared error). The hyperparameters (max_depth, num_leaves, n_estimators, learning_rate) were tuned with sklearn.model_selection.GridSearchCV .

from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score

# `preprocessed` is the cleaned DataFrame from Step 1;
# `target` is a one-element list with the target column name, e.g. ['price']
X = preprocessed.drop(columns=target)
y = preprocessed[target[0]]

# leave 20% of the data for the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# hyperparameter grid
parameters = {
    'max_depth': [2, 3, 4, 5],
    'num_leaves': [5, 7, 8, 10],
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.05, 0.1, 0.2],
}

# look for the optimal parameter combination with 3-fold cross-validation
lgbm = LGBMRegressor(random_state=42)
clf = GridSearchCV(lgbm, parameters, cv=3, scoring='r2')
clf.fit(X_train, y_train)
best_params = clf.best_params_
print('Best parameters:', best_params)

# refit the model with the optimal parameters
lgbm_tuned = LGBMRegressor(**best_params, random_state=42)
lgbm_tuned.fit(X_train, y_train)
preds = lgbm_tuned.predict(X_test)
r2 = r2_score(y_test, preds)
print('R2-score =', round(r2, 2))

The resulting test score is R2 = 0.74, and the top 10 features show reasonable dependencies. The top 3 features describe the listing size: the number of bedrooms, the number of bathrooms and the room type. Density and district_sentiment are connected with the listing’s geo-position, because these 2 features describe the neighbourhood’s location and status. So, as you can see, I’ve got a quite clear and simple model that recommends a price for a listing based on its most common features: size and location.
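Inspecting feature importances is a quick sanity check on any tree-based model. A minimal sketch on synthetic data, using sklearn's GradientBoostingRegressor as a stand-in for LightGBM (it exposes the same `feature_importances_` attribute); the data and feature names here are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, 500),
    "bathrooms": rng.integers(1, 3, 500),
    "noise": rng.normal(size=500),
})
# Synthetic price driven mostly by size; "noise" carries no signal.
y = 80 * X["bedrooms"] + 40 * X["bathrooms"] + rng.normal(0, 10, 500)

model = GradientBoostingRegressor(random_state=42).fit(X, y)
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances)  # bedrooms should dominate; noise should be near zero
```

If an uninformative feature dominates, that usually signals leakage or a preprocessing bug.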

Step 3: Collect business insights

Q1: What does the Boston Airbnb market look like?

Average listing price in Boston and by different neighbourhoods

Let’s take a step back and suppose you face a choice of what kind of business to start. You are not yet ready to enter the listing details and get a price prediction, but you need a rough estimate of the potential income. The fastest way to guess how much money you can ask for your listing is to look at the average price in the city. You’ll get a bit more information from the price distribution by neighbourhood. And don’t forget to pay attention to the listing density, because a lower density may be advantageous. An additional way to understand whether the listing location is good or bad is to analyze the neighbourhood rating, based on the average review or sentiment score (see the plots in the notebook).

Another question you may have in mind is: what type of housing is most common?

This way you can learn the structure of the market in order to identify your competitive advantage. The next step is to look at the average prices and the price distribution in the segment you are interested in (e.g. a private room in a condominium). Also check which amenities are frequently provided: it makes a nice checklist when you add a listing on the website.

Q2: What is the most busy/expensive period and how to account for price seasonality?

We can see from the dataset that the listing price is not constant during the year, and the number of available listings changes as well. Before discussing the results, I would like to mention that the availability in the data is a little confusing, because (quoting Inside Airbnb):

The Airbnb calendar for a listing does not differentiate between a booked night vs an unavailable night, therefore these bookings have been counted as “unavailable”. This serves to understate the Availability metric because popular listings will be “booked” rather than being “blacked out” by a host.

Let us keep in mind this comment while analyzing the following figures.

Price dynamics by month

Let’s start with the monthly seasonality: as we can see, the price changes by up to 26% from month to month. The highest listing prices occur in September–October, probably because of the high demand from students coming to MIT, and the lowest in February (the dead time after the winter holidays). The busiest time is May–August; apparently, it is vacation time.
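One simple way to quantify this monthly seasonality is to divide each month’s average price by the yearly average. A minimal sketch on a toy calendar (the real calendar data has one row per listing per day; the prices below are made up):

```python
import pandas as pd

# Toy calendar of nightly prices; the real Inside Airbnb calendar file
# has one row per listing per day.
calendar = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-09-15", "2016-09-16", "2017-02-10", "2017-02-11"]
    ),
    "price": [260.0, 240.0, 190.0, 210.0],
})

# Seasonal coefficient per month = monthly average / overall average
monthly_avg = calendar.groupby(calendar["date"].dt.month)["price"].mean()
season_coef = monthly_avg / calendar["price"].mean()
print(season_coef)  # September above 1, February below 1
```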

Price dynamics by week day

The price dynamics during the week are not as strong as during the year. The maximum difference in average price is 4%, between Tuesday and Saturday, and 2.9% overall between working days (Monday–Thursday plus Sunday) and the weekend (Friday–Saturday). The small price difference between weekends and working days is also influenced by the increase in availability over the weekend. Probably there is a segment of listings which are usually available only on weekends. Thus, the competition is even stronger during these couple of days, so it is risky to increase the price.

In the following section I will discuss how to include these insights into the final guess of the suitable price.

Q3: How to predict the price for a new listing?

We have already discussed the model that predicts a relevant price for a new listing based on its common properties, such as the number of bedrooms/bathrooms, the number of people it accommodates, the listing density, the sentiment rate of the neighbourhood, etc. However, the prediction is based on a dataset scraped on a certain date (the data is usually scraped once every 1–2 months), so the price recommended by the model is relevant only for that month. What about the right price to set for next month, or for the summer?

In the previous section I demonstrated the (monthly and weekly) seasonality of the listing prices. It seems reasonable to introduce a seasonal coefficient and tune the price according to the date of interest. I also propose to do this separately for different neighbourhoods, because, as we can see in the figure above, the magnitude of the price change differs drastically across districts. I implemented the method calibrate_price_by_month , which takes the model prediction and scales it according to the date of interest and the name of the neighbourhood.
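The author’s calibrate_price_by_month lives in the linked notebook; a hypothetical re-implementation might look like this. It assumes a precomputed table of per-neighbourhood monthly coefficients (the values below are made up), falling back to 1.0 when a combination is missing.

```python
# Hypothetical per-neighbourhood monthly coefficients: ratio of that
# month's average price to the yearly average (values are made up).
season_coefs = {
    ("Back Bay", 9): 1.12,
    ("Back Bay", 2): 0.90,
    ("Allston", 9): 1.05,
}

def calibrate_price_by_month(base_price, neighbourhood, month):
    """Scale a model prediction by the seasonal coefficient of the
    given neighbourhood and month (1.0 if the combination is unknown)."""
    coef = season_coefs.get((neighbourhood, month), 1.0)
    return round(base_price * coef, 2)

print(calibrate_price_by_month(200.0, "Back Bay", 9))  # -> 224.0
```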

To sum up

In this post I demonstrated how we can easily extract insights from the Inside Airbnb datasets, explore the market and define an optimal price for a new (or existing) listing:

  1. Always start with cleaning the data. Remove old records, dummy columns and unreasonable features. You don’t really need a feature-selection procedure if you have a business sense of the relevant factors. Just take them!
  2. Challenge your regression model by looking at the feature importances and the residual plot! The top features should be explainable and the residuals should not be biased.
  3. Use analytics to enhance your model! Be aware of the patterns your model hasn’t seen in the dataset. You don’t need to build a time-series forecast to include the seasonality. Build a single-date regression and tune the output with a scaling coefficient.

The EDA and modelling pipelines can be found in my GitHub repository. Choose the location you are interested in and quickly repeat the analysis!

Thanks for reading :)
