Project Guide for Hotels to Improve Customer Satisfaction Through Data Science Using Python

Sourav Saha
Published in Geek Culture
12 min read · Mar 27, 2021


How Can Data Science Boost the Tourism Industry?

Image from unsplash.com

Introduction

Hotels are one of the fastest-growing segments of the tourism sector. Tourism is also a potentially large source of employment, and hotels are a major part of the hospitality industry. The hotel industry contributes actively to a nation’s economic growth, and this trend is expected to continue gradually and, in turn, boost tourism in any destination.

In this project, we will take the following steps to help the hotel improve guest satisfaction:

  1. Extract hotel reviews from the website “Booking.com” through web scraping.
  2. Perform Exploratory Data Analysis to get meaningful insights from the data.
  3. Perform Sentiment Analysis to understand customers’ sentiments towards the hotel.
  4. Apply Topic Modeling to understand the major factors behind customers’ negative sentiment.

Web Scraping

In this project, we will scrape the reviews of “Hotel Ramada Caravela Beach Resort” in Goa, one of the top tourist destinations in India. The reviews were extracted from the link below:

https://www.booking.com/reviews/in/hotel/ramada-caravela-beach-resort.en-gb.html?page=1

To scrape the reviews of another hotel, simply replace the hotel name “ramada-caravela-beach-resort” with that hotel’s name as it appears in its booking.com link.
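
As a minimal sketch of this substitution, the helper below builds the review-page URL from a hotel name and a page number. The function name build_review_url and its country parameter are illustrative assumptions (the "in" path segment is assumed to be a country code); only the URL pattern itself comes from the link above.

```python
def build_review_url(hotel_linkname, page, country="in"):
    """Return a booking.com review-page URL for a hotel name and page number.

    Hypothetical helper: it only mirrors the pattern of the link above.
    """
    base = "https://www.booking.com/reviews/{}/hotel/{}.en-gb.html?page={}"
    return base.format(country, hotel_linkname, page)


url = build_review_url("ramada-caravela-beach-resort", 1)
print(url)
```

Changing only the first argument is enough to point the same pattern at a different hotel.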

To start the web scraping process, open the above link, right-click anywhere on the web page and select the “Inspect” option to find the HTML tag associated with the information we want to scrape, as shown in the screenshot below.

Under the tag “ul.review_list”, we will scrape the information shown below, along with screenshots of the corresponding HTML tags:

Now that we have all the HTML tags we want to scrape, let us start coding!

First, we will import all the libraries that we need for this project.

# importing packages
import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
%matplotlib inline
import re
from bs4 import BeautifulSoup as bs
import requests
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords,wordnet
from wordcloud import WordCloud
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
import warnings
warnings.filterwarnings("ignore")

Next, we create a function named “scrape_reviews”, which takes 2 arguments:

  1. hotel_linkname = the name of the hotel you want to scrape, as it appears in the booking.com link
  2. total_pages = the total number of review pages you want to scrape

Let us go through an overview of the function “scrape_reviews”:

  • Build the URL of the review web page.
  • Retrieve data from the server.
  • Parse the HTML page with the Beautiful Soup library.
  • Clean the text using the .strip() and .replace() methods.
  • Use a while loop to scrape all the pages.
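
The cleaning step can be sketched in isolation. The raw strings below are made-up stand-ins for scraped text, not real page content, and clean_date and clean_tags are hypothetical helper names. One detail worth knowing is that .strip(chars) removes any of the given characters from both ends rather than a literal prefix, so .replace() is the safer choice for removing a label.

```python
def clean_date(raw):
    """Trim whitespace, then drop the literal 'Reviewed: ' label."""
    return raw.strip().replace("Reviewed: ", "")


def clean_tags(raw):
    """Turn a bullet-separated tag block into a comma-separated string."""
    return raw.strip().replace("\n\n\n", "").replace("•", ",").lstrip(", ")


# Made-up raw strings standing in for scraped text
raw_date = "\nReviewed: 21 December 2020 - verified\n"
raw_tags = "\n•Leisure trip\n\n\n•Couple\n\n\n•Stayed 2 nights\n"

cleaned_date = clean_date(raw_date)                      # label removed as a whole
over_stripped = raw_date.strip().strip("Reviewed: ")     # also eats the trailing 'ied'
cleaned_tags = clean_tags(raw_tags)

print(cleaned_date)
print(over_stripped)
print(cleaned_tags)
```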

The function returns the following 3 dataframes:

  • reviewer_info: a dataframe that includes reviewers’ basic information
  • pos_reviews: a dataframe that includes all the positive reviews
  • neg_reviews: a dataframe that includes all the negative reviews

def scrape_reviews(hotel_linkname, total_pages):
    # Create empty lists for the reviewers' information and the positive & negative reviews
    info = []
    positive = []
    negative = []

    # booking.com reviews link
    url = 'https://www.booking.com/reviews/in/hotel/' + hotel_linkname + '.html?page='
    page_number = 1

    # Use a while loop to scrape all the pages
    while page_number <= total_pages:
        page = requests.get(url + str(page_number))  # retrieve data from server
        soup = bs(page.text, "html.parser")  # parse the HTML source with Python's html.parser
        review_box = soup.find('ul', {'class': 'review_list'})

        # ratings
        ratings = [i.text.strip() for i in review_box.find_all('span', {'class': 'review-score-badge'})]

        # reviewer_info: sliced in steps of three (name, country, review headline)
        reviewer_info = [i.text.strip() for i in review_box.find_all('span', {'itemprop': 'name'})]
        reviewer_name = reviewer_info[0::3]
        reviewer_country = reviewer_info[1::3]
        general_review = reviewer_info[2::3]

        # reviewer_review_times
        review_times = [i.text.strip() for i in review_box.find_all('div', {'class': 'review_item_user_review_count'})]

        # review_date: remove the 'Reviewed: ' label
        # (.replace is used because .strip('Reviewed: ') strips a set of characters, not a prefix)
        review_date = [i.text.strip().replace('Reviewed: ', '') for i in review_box.find_all('p', {'class': 'review_item_date'})]

        # reviewer_tag
        reviewer_tag = [i.text.strip().replace('\n\n\n', '').replace('•', ',').lstrip(', ')
                        for i in review_box.find_all('ul', {'class': 'review_item_info_tags'})]

        # positive_review
        positive_review = [i.text.strip('눇').strip() for i in review_box.find_all('p', {'class': 'review_pos'})]

        # negative_review
        negative_review = [i.text.strip('눉').strip() for i in review_box.find_all('p', {'class': 'review_neg'})]

        # append all reviewers' info into one list
        for i in range(len(reviewer_name)):
            info.append([ratings[i], reviewer_name[i], reviewer_country[i], general_review[i],
                         review_times[i], review_date[i], reviewer_tag[i]])

        # build positive review list
        for i in range(len(positive_review)):
            positive.append(positive_review[i])

        # build negative review list
        for i in range(len(negative_review)):
            negative.append(negative_review[i])

        # page change
        page_number += 1

    # Reviewer_info df
    reviewer_info = pd.DataFrame(info,
                                 columns=['Rating', 'Name', 'Country', 'Overall_review', 'Review_times', 'Review_date', 'Review_tags'])
    reviewer_info['Rating'] = pd.to_numeric(reviewer_info['Rating'])
    reviewer_info['Review_times'] = pd.to_numeric(reviewer_info['Review_times'].apply(lambda x: re.findall(r"\d+", x)[0]))
    reviewer_info['Review_date'] = pd.to_datetime(reviewer_info['Review_date'])

    # positive & negative reviews dfs
    pos_reviews = pd.DataFrame(positive, columns=['positive_reviews'])
    neg_reviews = pd.DataFrame(negative, columns=['negative_reviews'])

    return reviewer_info, pos_reviews, neg_reviews
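
The stride slices in the function rely on the name spans arriving in a fixed order. Here is a minimal sketch with made-up values, assuming each review contributes exactly three consecutive entries (name, country, review headline):

```python
# Made-up flat list, as if collected from the name spans of two reviews
flat = ["Asha", "India", "Great beach location",
        "Tom", "United Kingdom", "Rooms need renovation"]

names = flat[0::3]       # every third item starting at index 0
countries = flat[1::3]   # every third item starting at index 1
headlines = flat[2::3]   # every third item starting at index 2

for row in zip(names, countries, headlines):
    print(row)
```

If a review is missing one of the three entries, the slices fall out of step, so it is worth checking that the three lists have equal length before combining them.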

The function “show_data” below prints the length of a dataframe and its total number of missing values, and returns its first five rows.

def show_data(df):
    print("The length of the dataframe is: {}".format(len(df)))
    print("Total NAs: {}".format(df.isnull().sum().sum()))  # count NAs in the dataframe passed in
    return df.head()

Now that the function is ready, we pass in the hotel’s name and the total number of pages we want to scrape. You can change the hotel’s name as you wish and set its corresponding total number of review pages.

reviewer_info, pos_reviews, neg_reviews = scrape_reviews('ramada-caravela-beach-resort', total_pages=13)

Now, using the “show_data” function, we will check our scraped data output.

show_data(reviewer_info) #reviewers’ basic information
show_data(pos_reviews) #Positive reviews
show_data(neg_reviews) #Negative reviews

Exploratory Data Analysis (EDA)

  1. Distribution of Positive and Negative Reviews

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
values = [len(pos_reviews), len(neg_reviews)]
ax.pie(values,
       labels=['Number of Positive Reviews', 'Number of Negative Reviews'],
       colors=['gold', 'lightcoral'],
       shadow=True,
       startangle=90,
       autopct='%1.2f%%')
ax.axis('equal')
plt.title('Positive Reviews Vs. Negative Reviews');

Positive reviews outnumber negative reviews by almost 10%.

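  2. Violin Plot of the Customer Ratings for the top 10 reviewers’ countries of origin

# top10_df is not defined in the original code; here it is assumed to hold
# the 10 countries with the most reviews, ordered by review count
top10_df = reviewer_info['Country'].value_counts().head(10).rename_axis('Country').reset_index(name='Counts')
top10_list = top10_df['Country'].tolist()
top10 = reviewer_info[reviewer_info.Country.isin(top10_list)]
fig, ax = plt.subplots()
fig.set_size_inches(20, 5)
ax = sns.violinplot(x='Country',
                    y='Rating',
                    data=top10,
                    order=top10_list,
                    linewidth=2)
plt.suptitle('Distribution of Ratings by Country')
plt.xticks(rotation=90);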

The countries in the above plot are ordered by their review counts. The plot shows how ratings relate to the reviewers’ countries of origin. From the box plot elements inside the violins, we see that the median ratings given by reviewers from Australia, South Africa and Qatar are a bit higher than those from other countries, while the median ratings given by reviewers from the UK, the US and Sweden are the lowest. The shape of most distributions (skinny at each end and wide in the middle) indicates that the ratings are highly concentrated around the median, which is around 8 to 9. However, we probably need more data to get a better idea of the distributions.
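
The ordering and the medians that the violin plot encodes can also be computed directly. The sketch below uses plain Python and a handful of made-up (country, rating) pairs rather than the scraped data:

```python
from collections import defaultdict
from statistics import median

# Made-up (country, rating) pairs; the real values come from reviewer_info
reviews = [("India", 8.3), ("India", 9.2), ("India", 7.5),
           ("United Kingdom", 6.7), ("United Kingdom", 7.9),
           ("Australia", 9.6)]

ratings_by_country = defaultdict(list)
for country, rating in reviews:
    ratings_by_country[country].append(rating)

# Order countries by review count (descending), as in the plot
ordered = sorted(ratings_by_country, key=lambda c: len(ratings_by_country[c]), reverse=True)

for country in ordered:
    values = ratings_by_country[country]
    print(country, len(values), median(values))
```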

  3. Distribution of Review Tags Count for each Trip Type

# Define tag list
tag_list = ['Business', 'Leisure', 'Group', 'Couple', 'Family', 'friends', 'Solo']
# Count for each review tag
tag_counts = []
for tag in tag_list:
    counts = reviewer_info['Review_tags'].str.count(tag).sum()
    tag_counts.append(counts)
# Convert to a dataframe
trip_type = pd.DataFrame({'Trip Type': tag_list, 'Counts': tag_counts}).sort_values('Counts', ascending=False)
# Visualize the trip type counts from Review_tags
fig = px.bar(trip_type, x='Trip Type', y='Counts', title='Review Tags Counts for each Trip Type')
fig.show()

From the above plot, we can see that most people came to Goa, India for leisure, either as a family or as a couple.
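
These counts come from substring matching: str.count in the code above adds up how often each keyword occurs inside the tag strings. Here is a plain-Python sketch of the same idea, using made-up tag strings in place of the Review_tags column:

```python
# Made-up tag strings; the real ones come from the Review_tags column
review_tags = ["Leisure trip, Couple, Stayed 2 nights",
               "Leisure trip, Family with young children, Stayed 3 nights",
               "Business trip, Solo traveller, Stayed 1 night"]

tag_list = ["Business", "Leisure", "Group", "Couple", "Family", "friends", "Solo"]

# For each keyword, sum its occurrences across all tag strings
tag_counts = {tag: sum(tags.count(tag) for tags in review_tags) for tag in tag_list}

# Sort by count, highest first
for tag, count in sorted(tag_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(tag, count)
```

The matching is case-sensitive, so each keyword has to be written exactly as it appears in the tags.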

Lemmatization of word tokens

Next, we carry out lemmatization, which is the process of converting a word to its base form. For example, words like “book”, “books” and “booked” are all converted to the single word “book”.

# wordnet and treebank have different tagging systems
# Create a function to map treebank POS tags to wordnet tags
def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # default, return wordnet tag "NOUN"

# Create a function to lemmatize tokens in the reviews
def lemmatized_tokens(text):
    text = text.lower()
    pattern = r'\b[a-zA-Z]{3,}\b'
    tokens = nltk.regexp_tokenize(text, pattern)  # tokenize the text
    tagged_tokens = nltk.pos_tag(tokens)  # a list of tuples (word, pos_tag)

    stop_words = stopwords.words('english')
    new_stopwords = ["hotel", "everything", "anything", "nothing", "thing", "need",
                     "good", "great", "excellent", "perfect", "much", "even", "really"]  # customize extra stop_words
    stop_words.extend(new_stopwords)
    stop_words = set(stop_words)

    wordnet_lemmatizer = WordNetLemmatizer()
    # lemmatize each (word, tag) pair via "get_wordnet_pos",
    # skipping stop words and punctuation
    lemmatized_words = [wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(tag))
                        for (word, tag) in tagged_tokens
                        if word not in stop_words and word not in string.punctuation]
    return lemmatized_words
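
The tokenization and stop-word filtering that happen before lemmatization can be tried without NLTK. This sketch uses Python's re module with the same pattern; the short stop-word set and the sample sentence are made up for illustration, and lemmatization itself is not reproduced here.

```python
import re

PATTERN = r"\b[a-zA-Z]{3,}\b"  # same pattern as above: words of 3 or more letters

# Tiny illustrative stop-word set (the function above uses NLTK's English list plus extras)
STOP_WORDS = {"the", "was", "and", "not", "did", "all", "hotel", "good"}


def simple_tokens(text):
    """Lowercase, keep words of 3 or more letters, drop stop words."""
    tokens = re.findall(PATTERN, text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


tokens = simple_tokens("The room was not clean, and WiFi did not work at all!")
print(tokens)
```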

Create Word Cloud for Positive Reviews & Negative Reviews

# Create a function to generate a wordcloud
def wordcloud(review_df, review_colname, color, title):
    '''
    INPUTS:
        review_df: dataframe, positive or negative reviews
        review_colname: column name, positive or negative review
        color: background color of the wordcloud
        title: title of the wordcloud
    OUTPUT:
        Wordcloud visualization
    '''
    text = review_df[review_colname].tolist()
    text_str = ' '.join(lemmatized_tokens(' '.join(text)))  # call function "lemmatized_tokens"
    wordcloud = WordCloud(collocations=False,
                          background_color=color,
                          width=1600,
                          height=800,
                          margin=2,
                          min_font_size=20).generate(text_str)
    plt.figure(figsize=(15, 10))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.figtext(.5, .8, title, fontsize=20, ha='center')
    plt.show()

# Word cloud for Positive Reviews
wordcloud(pos_reviews, 'positive_reviews', 'white', 'Positive Reviews: ')
# Word cloud for Negative Reviews
wordcloud(neg_reviews, 'negative_reviews', 'black', 'Negative Reviews:')

In positive reviews, most people are highly satisfied with the Location of the property, which is close to the beach, and with the friendly Staff.

The word Staff also appears in the negative reviews, which suggests that the behavior of a few staff members was not good. People are also not satisfied with the Room and the food, which the business should focus on resolving.

Sentiment Analysis

Sentiment Analysis is performed on the “Overall_review” column of the dataframe named “reviewer_info”.

# Create a function to get the subjectivity
def subjectivity(text):
    return TextBlob(text).sentiment.subjectivity

# Create a function to get the polarity
def polarity(text):
    return TextBlob(text).sentiment.polarity

# Create two new columns
reviewer_info['Subjectivity'] = reviewer_info['Overall_review'].apply(subjectivity)
reviewer_info['Polarity'] = reviewer_info['Overall_review'].apply(polarity)
#################################################################################
# Create a function to compute the negative, neutral and positive analysis
def getAnalysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

reviewer_info['Analysis'] = reviewer_info['Polarity'].apply(getAnalysis)
#################################################################################
# plot the polarity and subjectivity
fig = px.scatter(reviewer_info,
                 x='Polarity',
                 y='Subjectivity',
                 color='Analysis',
                 size='Subjectivity')
# add a vertical line at x=0 for Neutral Reviews
fig.update_layout(title='Sentiment Analysis',
                  shapes=[dict(type='line',
                               yref='paper', y0=0, y1=1,
                               xref='x', x0=0, x1=0)])
fig.show()

The X-axis is Polarity and the Y-axis is Subjectivity. Polarity represents how positive, negative or neutral the review is, on TextBlob’s scale from -1 (most negative) to 1 (most positive). Subjectivity measures how much a review expresses personal opinion and feeling rather than fact, from 0 (objective) to 1 (subjective), so the higher the Subjectivity, the more the review describes people’s feelings towards a subject. Bigger dots indicate more subjectivity. We can see that there are more positive reviews than negative ones, but we still need to know which topics drive negative sentiment towards the hotel, so we will use LDA Topic Modeling.
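
The labelling rule behind the colours is a simple threshold on polarity. Here is a minimal sketch with made-up polarity scores (real scores come from TextBlob) that also tallies each label:

```python
from collections import Counter


def get_analysis(score):
    """Label a polarity score: below 0 is Negative, exactly 0 is Neutral, above 0 is Positive."""
    if score < 0:
        return "Negative"
    elif score == 0:
        return "Neutral"
    return "Positive"


# Made-up polarity scores in TextBlob's range of -1 to 1
scores = [0.8, 0.35, 0.0, -0.4, 0.6, -0.1]
labels = [get_analysis(s) for s in scores]
label_counts = Counter(labels)
print(label_counts)
```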

LDA Topic Modeling

We will apply the LDA model to find the distribution of topics and the high-probability words in each topic. Here, the objective of LDA Topic Modeling is to look at the negative reviews and find out which topics the hotel should focus on to improve customer satisfaction.

To understand LDA Topic modeling in more detail click on this link which explains the concept properly in full detail.

The following steps were carried out to perform LDA Topic Modeling:

  • The reviews were converted to a document-term matrix
  • The optimal LDA model was found using GridSearch and parameter tuning
  • The LDA model performance scores were compared
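
To make the first step concrete, here is a plain-Python sketch of a document-term matrix built from three made-up reviews. It only counts words; the function below relies on scikit-learn's vectorizers, which additionally filter terms by document frequency and apply tf-idf weighting.

```python
from collections import Counter

# Made-up, already tokenized reviews
docs = [["room", "dirty", "wifi"],
        ["wifi", "slow", "wifi"],
        ["food", "cold", "room"]]

# Vocabulary: one column per distinct term, in alphabetical order
vocab = sorted({term for doc in docs for term in doc})

# Document-term matrix: one row per review, one count per vocabulary term
dtm = []
for doc in docs:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocab])

print(vocab)
for row in dtm:
    print(row)
```
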

# Create a function to build the optimal LDA model
def optimal_lda_model(df_review, review_colname):
    '''
    INPUTS:
        df_review: dataframe that contains the reviews
        review_colname: name of column that contains reviews

    OUTPUTS:
        lda_tfidf: Latent Dirichlet Allocation (LDA) model
        dtm_tfidf: document-term matrix in the tfidf format
        tfidf_vectorizer: word frequency in the reviews
        A graph comparing LDA Model Performance Scores with different params
    '''
    docs_raw = df_review[review_colname].tolist()

    #************ Step 1: Convert to document-term matrix ************#
    # Transform text to vector form using the vectorizer object
    tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                    stop_words='english',
                                    lowercase=True,
                                    token_pattern=r'\b[a-zA-Z]{3,}\b',  # words of 3+ letters, to avoid some meaningless words
                                    max_df=0.9,  # discard words that appear in > 90% of the reviews
                                    min_df=10)  # discard words that appear in < 10 reviews
    # apply transformation
    tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
    # convert to document-term matrix
    dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
    print("The shape of the tfidf is {}, meaning that there are {} {} and {} tokens made through the filtering process.".format(
        dtm_tfidf.shape, dtm_tfidf.shape[0], review_colname, dtm_tfidf.shape[1]))

    #******* Step 2: GridSearch & parameter tuning to find the optimal LDA model *******#
    # Define Search Param
    search_params = {'n_components': [5, 10, 15, 20, 25, 30],
                     'learning_decay': [.5, .7, .9]}
    # Init the Model
    lda = LatentDirichletAllocation()
    # Init Grid Search Class
    model = GridSearchCV(lda, param_grid=search_params)
    # Do the Grid Search
    model.fit(dtm_tfidf)

    #***** Step 3: Output the optimal lda model and its parameters *****#
    # Best Model
    best_lda_model = model.best_estimator_
    # Model Parameters
    print("Best Model's Params: ", model.best_params_)
    # Log Likelihood Score: Higher the better
    print("Model Log Likelihood Score: ", model.best_score_)
    # Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
    print("Model Perplexity: ", best_lda_model.perplexity(dtm_tfidf))

    #*********** Step 4: Compare LDA Model Performance Scores ***********#
    # Get Log Likelihoods from Grid Search Output
    gscore = model.cv_results_  # reuse the fitted grid search instead of fitting it a second time
    n_topics = [5, 10, 15, 20, 25, 30]
    log_likelihoods_5 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay'] == 0.5]
    log_likelihoods_7 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay'] == 0.7]
    log_likelihoods_9 = [gscore['mean_test_score'][gscore['params'].index(v)] for v in gscore['params'] if v['learning_decay'] == 0.9]
    # Show graph
    plt.figure(figsize=(12, 8))
    plt.plot(n_topics, log_likelihoods_5, label='0.5')
    plt.plot(n_topics, log_likelihoods_7, label='0.7')
    plt.plot(n_topics, log_likelihoods_9, label='0.9')
    plt.title("Choosing Optimal LDA Model")
    plt.xlabel("Num Topics")
    plt.ylabel("Log Likelihood Scores")
    plt.legend(title='Learning decay', loc='best')
    plt.show()

    return best_lda_model, dtm_tfidf, tfidf_vectorizer

best_lda_model, dtm_tfidf, tfidf_vectorizer = optimal_lda_model(neg_reviews, 'negative_reviews')

From the graph, we see that the choice of learning decay has little impact below 15 topics, and that 5 topics produce the best model.
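
For reference, the perplexity printed by the function relates to the log-likelihood as perplexity = exp(-log-likelihood per word). Here is a small sketch of that relationship, with made-up numbers rather than values from this model:

```python
import math


def perplexity(log_likelihood, n_words):
    """Perplexity = exp(-log-likelihood per word); lower is better."""
    return math.exp(-log_likelihood / n_words)


# Made-up totals: a higher (less negative) log-likelihood gives a lower perplexity
worse = perplexity(-1200.0, 200)
better = perplexity(-1000.0, 200)
print(worse, better)
```
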
Now, let’s output the words in the topics we just created.

# Create a function to inspect the topics we created
def display_topics(model, feature_names, n_top_words):
    '''
    INPUTS:
        model: the model we created
        feature_names: tells us what word each column in the matrix represents
        n_top_words: number of top words to display
    OUTPUTS:
        a dataframe that contains the topics we created and the weights of each token
    '''
    topic_dict = {}
    for topic_idx, topic in enumerate(model.components_):
        topic_dict["Topic %d words" % (topic_idx+1)] = ['{}'.format(feature_names[i])
                                                        for i in topic.argsort()[:-n_top_words - 1:-1]]
        topic_dict["Topic %d weights" % (topic_idx+1)] = ['{:.1f}'.format(topic[i])
                                                          for i in topic.argsort()[:-n_top_words - 1:-1]]
    return pd.DataFrame(topic_dict)

# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
display_topics(best_lda_model, tfidf_vectorizer.get_feature_names_out(), n_top_words=20)
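
The slice [:-n_top_words - 1:-1] applied to argsort() picks the indices of the largest weights in descending order. Here is a plain-Python sketch of the same indexing, using made-up weights and words (sorted over indices stands in for NumPy's argsort):

```python
# Made-up topic weights and the words they belong to
weights = [0.2, 3.1, 1.5, 0.7, 2.4]
words = ["pool", "wifi", "room", "beach", "food"]
n_top_words = 3

# Indices sorted by ascending weight (what argsort returns)
order = sorted(range(len(weights)), key=lambda i: weights[i])

# Walk backwards from the end and stop after n_top_words items
top_idx = order[:-n_top_words - 1:-1]
top_words = [words[i] for i in top_idx]

print(top_words)
```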

Now, let’s visualize the topics with pyLDAvis!

# Topic Modelling Visualization for the Negative Reviews
pyLDAvis.sklearn.prepare(best_lda_model, dtm_tfidf, tfidf_vectorizer)

On the left-hand side of the visualization, each topic is represented by a bubble. The larger the bubble, the more prevalent the topic, with number 1 being the most popular topic and number 5 the least popular. The distance between two bubbles represents the topic similarity.

The right-hand side shows the top 30 most relevant terms for the topic you select on the left. The blue bar represents the overall term frequency, and the red bar indicates the estimated term frequency within the selected topic. So, if you see a bar with both red and blue, it means the term also appears in other topics. You can hover over the term to see which other topic(s) include it.

For example, in the above output we can see that Topic 3 includes the word ‘wifi’, so the business should focus on improving wifi services in its rooms.

End Note:

The results of this project can help hotel management understand what actions need to be taken to attract more tourists to the hotel and to improve tourist satisfaction, so any hotel can carry out this project to improve guest satisfaction. The same approach can also be applied to competitor hotels to understand how customers perceive them.
