Analysis of hotel data in Dalian

Time:2020-1-13

This project comes from the 6th session of Shiyanlou's “Lou+ Data Analysis and Mining Practice” course. The course is designed for junior data analysis and data mining engineers and includes 35 experiments, 20 challenges, 5 comprehensive projects and 1 large project, with the aim of getting you started with data analysis and mining in six weeks.

Data acquisition

The data was collected on August 27, 2019, and the prices are for a stay from August 28 to August 29. Hotel prices fluctuate between the low and peak tourist seasons. At the time of collection Dalian was at the transition between seasons, so prices were becoming more reasonable but were still above the normal level.

import pandas as pd
import jieba
from tqdm import tqdm_notebook
from wordcloud import WordCloud
import numpy as np
from gensim.models import Word2Vec
import warnings

warnings.filterwarnings('ignore')
df = pd.read_csv('https://s3.huhuhang.com/temporary/b1vzDs.csv')
df.shape

Output:

(2475, 7)

Data cleaning

# The collected data contains duplicates. First, drop rows that share the same hotel name.
df = df.drop_duplicates(['HotelName'])
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2219 entries, 0 to 2474
Data columns (total 7 columns):
Unnamed: 0            2219 non-null int64
index                 2219 non-null int64
HotelName             2219 non-null object
HotelLocation         2219 non-null object
HotelCommentValue     2219 non-null float64
HotelCommentAmount    2219 non-null int64
HotelPrice            2219 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 138.7+ KB
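By default `drop_duplicates` keeps the first row for each hotel name and preserves the original index labels, which is why the index above still runs from 0 to 2474 with only 2219 entries. A minimal sketch on made-up rows (the names and prices are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data: names and prices are made up for illustration.
toy = pd.DataFrame({
    "HotelName": ["A Inn", "B Hotel", "A Inn", "C Hostel"],
    "HotelPrice": [120.0, 300.0, 125.0, 80.0],
})

# keep="first" is the default: the first row seen for each name survives.
deduped = toy.drop_duplicates(["HotelName"])

kept_index = list(deduped.index)             # original labels are preserved
kept_price_a = deduped.loc[0, "HotelPrice"]  # price from the first "A Inn" row
print(kept_index, kept_price_a)
```

If the later of two duplicate listings is the more trustworthy one, `keep="last"` would be the variant to try.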

After removing duplicates, 2219 valid hotel records remain. The five columns used in the analysis are:

  • “HotelName”: the hotel name
  • “HotelLocation”: the district where the hotel is located
  • “HotelCommentValue”: the hotel rating
  • “HotelCommentAmount”: the number of reviews the hotel has received
  • “HotelPrice”: the hotel's lowest price

Some hotels are newly opened (or have no rating for other reasons) and were assigned a rating of “0” during data acquisition. We separate these records into a new-hotel data set for later analysis and prediction.

df_new_hotel = df[df["HotelCommentValue"]==0].drop(['Unnamed: 0'], axis=1).set_index(['index'])
df_new_hotel.head()

Output: [table: first five rows of df_new_hotel]

Hotels that already have a rating are likewise separated from the original data set for analysis and modeling.

df_in_ana = df[df["HotelCommentValue"]!=0].drop(["Unnamed: 0", "index"], axis=1)
df_in_ana.shape

Output:

(1669, 5)
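The two filters are complementary, so the row counts of the two parts should add up to the deduplicated total (1669 rated hotels plus the unrated ones should give 2219). A small check of this kind, sketched on toy data with a hypothetical helper `split_by_rating`:

```python
import pandas as pd

def split_by_rating(frame, col="HotelCommentValue"):
    """Split a frame into (unrated, rated), using 0 as the 'no rating' marker."""
    unrated_mask = frame[col] == 0
    return frame[unrated_mask], frame[~unrated_mask]

# Toy data for illustration only.
toy = pd.DataFrame({
    "HotelName": ["A", "B", "C", "D"],
    "HotelCommentValue": [4.5, 0.0, 3.9, 0.0],
})
unrated, rated = split_by_rating(toy)

# Every row lands in exactly one of the two parts.
assert len(unrated) + len(rated) == len(toy)
print(len(unrated), len(rated))
```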

Data analysis

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline

plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
sns.distplot(df_in_ana['HotelPrice'].values)

Output:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7353b9c240>

The price distribution shows that most hotels cost less than 500 yuan per night, with the highest concentration at 200-300 yuan per night; few hotels cost more than 500 yuan per night. Based on this distribution and on actual price levels, we divide the hotels into the following grades by price:

  • “Cheap” hotels, 100 yuan per night or less
  • “Economy” hotels, 100-300 yuan per night
  • “Comfortable” hotels, 300-500 yuan per night
  • “High end” hotels, 500-1000 yuan per night
  • “Luxury” hotels, above 1000 yuan per night

df_in_ana['HotelLabel'] = df_in_ana['HotelPrice'].apply(lambda x: 'luxury' if x > 1000 else
                                                        ('high end' if x > 500 else
                                                        ('comfortable' if x > 300 else
                                                        ('economy' if x > 100 else 'cheap'))))
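The same grading can also be expressed with `pd.cut`, which keeps the bin edges in one place. This is an alternative sketch rather than the code used in this post; the toy prices are made up, and the bin edges are taken from the grade list above.

```python
import numpy as np
import pandas as pd

# Intervals are right-closed by default, so a price of exactly 100 is "cheap"
# and exactly 300 is "economy", matching the `x > 100` / `x > 300` comparisons
# in the lambda.
bins = [-np.inf, 100, 300, 500, 1000, np.inf]
labels = ["cheap", "economy", "comfortable", "high end", "luxury"]

prices = pd.Series([80, 100, 156, 300, 450, 999, 1200])  # toy prices
grades = pd.cut(prices, bins=bins, labels=labels).astype(str)
print(grades.tolist())
```

Leaving out `.astype(str)` keeps the result as an ordered categorical, which sorts by grade rather than alphabetically in later `groupby` calls.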

After the division, let’s get a general idea of the proportion of different types of hotels:

hotel_label = df_in_ana.groupby('HotelLabel')['HotelName'].count()
plt.pie(hotel_label.values, labels=hotel_label.index, autopct='%.1f%%', explode=[0, 0.1, 0.1, 0.1, 0.1], shadow=True)

Output:

([<matplotlib.patches.Wedge at 0x7f735196bf28>,
  <matplotlib.patches.Wedge at 0x7f7351974978>,
  <matplotlib.patches.Wedge at 0x7f735197d358>,
  <matplotlib.patches.Wedge at 0x7f735197dcf8>,
  <matplotlib.patches.Wedge at 0x7f73519096d8>],
 [Text(1.0995615668223722, 0.0310541586125, 'luxury'),
  Text(0.8817809341165916, 0.813917922292212, 'cheap'),
  Text(-1.1653378183544278, -0.28633506441395257, 'economy'),
  Text(0.9862461234793722, -0.6836070391108557, 'comfortable'),
  Text(1.1898928807304072, -0.15541857156431768, 'high end')],
 [Text(0.5997608546303848, 0.016938631970454542, '0.9%'),
  Text(0.5143722115680117, 0.47478545467045696, '21.9%'),
  Text(-0.679780394040083, -0.16702878757480563, '62.0%'),
  Text(0.5753102386963004, -0.3987707728146658, '11.0%'),
  Text(0.6941041804260709, -0.09066083341251863, '4.1%')])

The pie chart shows that 62.0% of the hotels are economy hotels and 21.9% are cheap hotels, while high-end and luxury hotels account for a relatively small share (4.1% and 0.9%), which fits the general positioning of a tourist city.

Let’s take a look at the geographical distribution of the hotels:

from pyecharts import Map  # pyecharts 0.5.x API

map_hotel = Map("Dalian hotel regional distribution map", width=1000, height=600)

hotel_distribution = df_in_ana.groupby('HotelLocation')['HotelName'].count().sort_values(ascending=False)
hotel_distribution = hotel_distribution[:8]

h_values = list(hotel_distribution.values)
district = list(hotel_distribution.index)

# With Chinese-language data, pyecharts 0.5.x may expect the Chinese city name ('大连') as maptype
map_hotel.add("", district, h_values, maptype='Dalian', is_visualmap=True,
              visual_range=[min(h_values), max(h_values)],
              visual_text_color="#fff", symbol_size=20, is_label_show=True)
map_hotel.render('dalian_hotel.html')

Because some hotels filled in their location in a nonstandard way on the website, the location values are somewhat inconsistent. These inconsistent values are hard to consolidate, but each accounts for only a small share and therefore ranks low after sorting, so we keep only the top 8 areas.

For the hotels collected, most are located in Shahekou District and Jinzhou District, which is directly related to the distribution of Dalian's major scenic spots, such as the famous Xinghai Square and the cross-sea bridge in Shahekou District, and Jinshitan and Discovery Kingdom in Jinzhou District. (The high-tech park has no corresponding area on the map because it is not an administrative district but a technology development zone at the junction of Ganjingzi District and Shahekou District. Its share does not affect the figures for either district, so it does not get in the way of the analysis.)
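Instead of dropping the long tail of inconsistent location strings, one option is to fold it into a single bucket so that the totals are preserved. A minimal sketch, assuming a hypothetical helper `top_n_with_other` and made-up counts:

```python
import pandas as pd

def top_n_with_other(counts, n, other_label="Other"):
    """Keep the n largest categories and fold the rest into one bucket."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n].copy()
    tail_total = ordered.iloc[n:].sum()
    if tail_total > 0:
        top[other_label] = tail_total
    return top

# Toy counts for illustration; the odd entries mimic inconsistent location strings.
counts = pd.Series({
    "Shahekou District": 40, "Jinzhou District": 35, "Zhongshan District": 20,
    "near Xinghai": 3, "Development Zone (east)": 2,
})
summary = top_n_with_other(counts, n=3)
print(summary.to_dict())
```

A map cannot draw an "Other" bucket, but a pie or bar chart built from the same counts can, and the shares then refer to all hotels rather than only to those in the top areas.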

Dalian hotel regional distribution map

Dalian is a tourist city, and the positioning and grade of hotels can be expected to differ between administrative districts. It is therefore interesting to look at how hotels of different grades are distributed across districts:

hotel_distribution = df_in_ana.groupby('HotelLocation')['HotelName'].count().sort_values(ascending=False)
hotel_distribution = hotel_distribution[:8]
hotel_label_distr = df_in_ana.groupby([ 'HotelLocation','HotelLabel'])['HotelName'].count().sort_values(ascending=False).reset_index()
in_use_district = list(hotel_distribution.index)
hotel_label_distr = hotel_label_distr[hotel_label_distr['HotelLocation'].isin(in_use_district)]

fig, axes = plt.subplots(1, 5, figsize=(17,8))
hotel_label_list = ['high end', 'comfortable', 'economy', 'luxury', 'cheap']
for i in range(len(hotel_label_list)):
    current_df = hotel_label_distr[hotel_label_distr['HotelLabel']==hotel_label_list[i]]
    axes[i].set_title('Regional distribution of {} hotels'.format(hotel_label_list[i]))
    axes[i].pie(current_df.HotelName, labels=current_df.HotelLocation, autopct='%.1f%%', shadow=True)

The distribution of each grade across districts shows that all grades of hotels are concentrated in Shahekou District, Jinzhou District and Zhongshan District. Interestingly, there are no luxury hotels in Lvshunkou District. Luxury hotels are concentrated in Shahekou District and also have a large share in Zhongshan District, which has a lot to do with history and geography. Dalian people often describe Zhongshan District as the legendary “rich area”. Many business travelers choose to stay there, which also encourages investment in high-end and luxury hotels in the district.

Besides price (grade), we also consider reviews when booking a hotel: the higher the rating and the more reviews, the more inclined we are to book. So, for the rated data set, let’s look at how these Dalian hotels are reviewed.

First, based on the ratings and on how consumers generally perceive them, the hotels are labeled as follows:

  • Above 4.6 points: “super good”
  • 4.0-4.6 points: “not bad”
  • 3.0-4.0 points: “general”
  • 3.0 points or below: “poor”

df_in_ana['HotelCommentLevel'] = df_in_ana['HotelCommentValue'].apply(lambda x: 'super good' if x > 4.6 else
                                                                      ('not bad' if x > 4.0 else
                                                                      ('general' if x > 3.0 else 'poor')))

Grouping by rating level and hotel grade, we visualize the data:

hotel_label_level = df_in_ana.groupby(['HotelCommentLevel','HotelLabel'])['HotelName'].count().sort_values(ascending=False).reset_index()
fig, axes = plt.subplots(1, 5, figsize=(17,8))
for i in range(len(hotel_label_list)):
    current_df = hotel_label_level[hotel_label_level['HotelLabel'] == hotel_label_list[i]]
    axes[i].set_title('Ratings of {} hotels'.format(hotel_label_list[i]))
    axes[i].pie(current_df.HotelName, labels=current_df.HotelCommentLevel, autopct='%.1f%%', shadow=True)

The rating distribution for each grade shows that poor ratings occur mainly among cheap and economy hotels, with cheap hotels hit hardest. Comfortable, high-end and luxury hotels, whose lowest price is above 300 yuan per night, have almost no poor ratings, which confirms the common belief that you get what you pay for; luxury hotels have the highest share of “super good” ratings.

However, the share of “super good” ratings does not rise steadily with hotel grade: for high-end hotels it is lower than for the cheaper comfortable hotels. One possible reason is that the service guests expect at that price exceeds what the hotel actually delivers. This is a reminder to consumers not to assume blindly that expensive means good, and a reminder to hotels to deliver service that matches the price rather than inflating it.
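Pie charts make it hard to compare the share of one rating level across grades. A row-normalized cross-tabulation puts the same numbers in a single table. A sketch on toy data (the column names follow the ones used in this post, the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "HotelLabel": ["cheap", "cheap", "cheap", "cheap", "luxury", "luxury"],
    "HotelCommentLevel": ["poor", "not bad", "not bad", "super good",
                          "super good", "not bad"],
})

# normalize="index" turns each row (grade) into shares that sum to 1.
share = pd.crosstab(toy["HotelLabel"], toy["HotelCommentLevel"], normalize="index")
print(share.round(2))
```

Reading down one column of `share` gives the comparison made in the text, for example how the “super good” share changes from one grade to the next.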

Hotel list

Based on the analysis so far, we can compile a “grass planting list” (hotels worth recommending) and a “lightning protection list” (hotels to avoid):

The “grass planting list” collects, for travelers with different needs, hotels that have good ratings, many reviews (tried by many people and found satisfactory) and a reasonable price. The “lightning protection list” collects the poorly rated hotels, as a reminder not to take a chance on them.
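The selection rule described here (one grade, a rating above a cut-off and a minimum number of reviews) can be wrapped in a small function. This is a sketch on made-up rows, not the code used for the lists in this post; `recommend` and its parameter names are hypothetical.

```python
import pandas as pd

def recommend(frame, label, min_reviews, min_rating=4.6):
    """Hotels of one grade with a high rating and enough reviews, cheapest first."""
    mask = (
        (frame["HotelLabel"] == label)
        & (frame["HotelCommentValue"] > min_rating)
        & (frame["HotelCommentAmount"] > min_reviews)
    )
    return frame[mask].sort_values(by="HotelPrice")

# Toy rows for illustration only.
toy = pd.DataFrame({
    "HotelName": ["A", "B", "C", "D"],
    "HotelLabel": ["cheap", "cheap", "cheap", "economy"],
    "HotelCommentValue": [4.8, 4.7, 4.2, 4.9],
    "HotelCommentAmount": [900, 600, 5000, 3000],
    "HotelPrice": [95.0, 60.0, 70.0, 180.0],
})
picked = recommend(toy, "cheap", min_reviews=500)
print(picked["HotelName"].tolist())
```

Keeping the review threshold as a parameter makes it easy to use a different minimum for each grade.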

Grass planting list

# Cheap hotels
df_pos_cheap = df_in_ana[(df_in_ana['HotelLabel'] == 'cheap') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 500)].sort_values(by=['HotelPrice'], ascending=False)
df_pos_cheap

Output: [table of recommended cheap hotels]

# Economy hotels
df_pos_economy = df_in_ana[(df_in_ana['HotelLabel'] == 'economy') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 2000)].sort_values(by=['HotelPrice'])
df_pos_economy

Output: [table of recommended economy hotels]

# Comfortable hotels
df_pos_comfortable = df_in_ana[(df_in_ana['HotelLabel'] == 'comfortable') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 1000)].sort_values(by=['HotelPrice'])
df_pos_comfortable

Output: [table of recommended comfortable hotels]

# High-end hotels
df_pos_hs = df_in_ana[(df_in_ana['HotelLabel'] == 'high end') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 1000)].sort_values(by=['HotelPrice'])
df_pos_hs

Output: [table of recommended high-end hotels]

# Luxury hotels
df_pos_luxury = df_in_ana[(df_in_ana['HotelLabel'] == 'luxury') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 500)].sort_values(by=['HotelPrice'])
df_pos_luxury

Output: [table of recommended luxury hotels]

Lightning protection list

df_neg = df_in_ana[(df_in_ana['HotelCommentValue'] < 3.0) \
                         & (df_in_ana['HotelCommentAmount'] > 50)].sort_values(by=['HotelPrice'], ascending=False)
df_neg

Output: [table of poorly rated hotels with more than 50 reviews]

The science of hotel names

For hotels at the extremes, a pattern seems plausible: very expensive high-end hotels usually adopt an elegant business style, and their names sound “expensive”; cheaper hotels that compete on price and volume and target students or travelers on a tight budget choose names that are either light and fresh or plain and blunt, signalling “value for money”. We use word clouds to test whether this theory holds for hotels in Dalian.

!wget -nc "http://labfile.oss.aliyuncs.com/courses/1176/fonts.zip"
!unzip -o fonts.zip
from wordcloud import WordCloud

def get_word_map(hotel_name_list):
    word_dict ={}
    for hotel_name in tqdm_notebook(hotel_name_list):
        hotel_name = hotel_name.replace('(', '')
        hotel_name = hotel_name.replace(')', '')
        word_list = list(jieba.cut(hotel_name, cut_all=False))
        for word in word_list:
            if word == 'Dalian' or len(word) < 2:  # skip the city name ('大连' in Chinese-language data) and one-character tokens
                continue
            if word not in word_dict:
                word_dict[word] = 0
            word_dict[word] += 1

    font_path = 'fonts/SourceHanSerifK-Light.otf'
    wc = WordCloud(font_path=font_path, background_color='white', max_words=1000, 
                            max_font_size=120, random_state=42, width=800, height=600, margin=2)
    wc.generate_from_frequencies(word_dict)

    return wc
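The counting loop in `get_word_map` can also be written with `collections.Counter`. The sketch below works on pre-tokenized toy names so that it runs without jieba; `count_words` and its parameters are hypothetical names. Since a `Counter` is a `dict` subclass, it should be usable wherever `word_dict` is used.

```python
from collections import Counter

def count_words(token_lists, stop_words=(), min_len=2):
    """Count tokens across all names, skipping stop words and short tokens."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(
            t for t in tokens if t not in stop_words and len(t) >= min_len
        )
    return counter

# Pre-tokenized toy names; in the post the tokens come from jieba.cut.
names = [
    ["Dalian", "Xinghai", "Apartment"],
    ["Xinghai", "Sea", "View", "Hotel"],
    ["Youth", "Apartment", "A"],
]
freq = count_words(names, stop_words={"Dalian"})
print(freq.most_common(2))
```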

To make sure each word cloud has enough data, we do not use the original grade classification here. Instead we select hotels priced at or below 150 yuan and hotels priced above 500 yuan as two relatively extreme groups and check whether their names differ in typical ways.

part1 = df_in_ana[df_in_ana['HotelPrice'] <= 150]['HotelName'].values
part2 = df_in_ana[df_in_ana['HotelPrice'] > 500]['HotelName'].values
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
axes[0].set_title('Name cloud of lower-priced hotels')
axes[0].imshow(get_word_map(part1), interpolation='bilinear')
axes[1].set_title('Name cloud of higher-priced hotels')
axes[1].imshow(get_word_map(part2), interpolation='bilinear')

Output:

<matplotlib.image.AxesImage at 0x7f73515c1908>

The results show clear differences between the two groups. In the names of low-priced hotels, words such as “guest house”, “theme”, “youth”, “express hotel” and “hotel” appear frequently, which matches our understanding of how such hotels position themselves. In the names of high-priced hotels, “Xinghai” (“star sea”), “sea view”, “hot spring” and “square” appear frequently. Xinghai Square in Shahekou District is one of Dalian’s best-known landmarks, and nearby hotels (especially high-end ones) like to put “Xinghai” in their names: besides highlighting the location, the word seems to add a touch of style. In addition, high-end hotels do not seem to like the “XX hotel” form of name and prefer to style themselves as “hotels” or “hotel apartments”. More surprising still, cheap and expensive hotels alike are fond of the word “apartment”, which seems to be a trend in the hotel industry.

Judging a hotel by its name

As the label of a person or thing, a name shapes the first impression. We have just analyzed the naming characteristics of hotels at the price extremes, so to some extent a name can tell us which level a hotel belongs to, in the spirit of the proverb “at three years old you can see the whole life”. For a newcomer hotel that has just opened and has no rating yet, we can predict its price level from its name and check whether its pricing matches its positioning. Combined with the rating characteristics of the different grades analyzed earlier, this gives a rough idea of whether a new hotel is overpriced, or whether it is worth being the guinea pig and trying it out.

There is one catch. Because the environment and the times change, new hotels follow different naming strategies from older ones, and this difference has a significant impact on modeling and prediction. So what follows is just an interesting experiment with the methods we have learned: the results will not be accurate, but the process is fun.

df_in_ana['HotelPrice'].median()

Output:

156.0

Based on the word cloud analysis above and the median price of the rated hotels, we set 150 yuan as the threshold. Hotels priced at or below 150 yuan per night are labeled 1, and hotels priced above 150 yuan per night are labeled 0. This split keeps the two classes roughly balanced and also reflects the differences in hotel names to some extent.

df_in_ana['PriceLabel'] = df_in_ana['HotelPrice'].apply(lambda x:1 if x <= 150 else 0)
df_new_hotel['PriceLabel'] = df_new_hotel['HotelPrice'].apply(lambda x:1 if x <= 150 else 0)
# Word segmentation helper
def word_cut(x):
    x = x.replace('(', '')  # remove the parentheses
    x = x.replace(')', '')
    return jieba.lcut(x)
# Build the training and test sets
x_train = df_in_ana['HotelName'].apply(word_cut).values
y_train = df_in_ana['PriceLabel'].values
x_test = df_new_hotel['HotelName'].apply(word_cut).values
y_test = df_new_hotel['PriceLabel'].values

The training set contains 1669 records, 790 of which are labeled 1. The test set contains 550 records, 195 of which are labeled 1.

# Train a Word2Vec model (a shallow neural network for word vectors) on the segmented
# hotel names, then represent each name as the sum of its word vectors
from gensim.models import Word2Vec
import warnings

warnings.filterwarnings('ignore')
w2v_model = Word2Vec(size=200, min_count=10)  # gensim 3.x; the parameter is vector_size in gensim 4+
w2v_model.build_vocab(x_train)
w2v_model.train(x_train, total_examples=w2v_model.corpus_count, epochs=5)

def sum_vec(text):
    vec = np.zeros(200).reshape((1, 200))
    for word in text:
        try:
            vec += w2v_model.wv[word].reshape((1, 200))
        except KeyError:  # word is not in the vocabulary (fewer than min_count occurrences)
            continue
    return vec

train_vec = np.concatenate([sum_vec(text) for text in tqdm_notebook(x_train)])
# Build a neural network classifier and fit it on the training data
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.neural_network import MLPClassifier
from IPython import display 

model = MLPClassifier(hidden_layer_sizes=(100, 50, 20), learning_rate='adaptive')
model.fit(train_vec, y_train)

# Plot the loss curve to monitor how the loss changes during training
display.clear_output(wait=True)
plt.plot(model.loss_curve_)

Output:

[<matplotlib.lines.Line2D at 0x7f73400b8198>]

PS: Because the data set is small and hotel names carry limited information, the training result here is not very good.
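Two properties of the summed-vector representation may contribute to the weak fit: the sum ignores word order, and tokens without a vector are simply skipped. The sketch below illustrates both on a toy vocabulary; `toy_vectors` and `sum_vec_toy` are made up for illustration and stand in for the trained Word2Vec model.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; real vectors come from the trained Word2Vec model.
toy_vectors = {
    "sea": np.array([1.0, 0.0, 2.0]),
    "view": np.array([0.5, 1.0, 0.0]),
}

def sum_vec_toy(tokens, vectors, dim=3):
    """Sum the vectors of known tokens; unknown tokens contribute nothing."""
    total = np.zeros(dim)
    for token in tokens:
        if token in vectors:
            total += vectors[token]
    return total

known = sum_vec_toy(["sea", "view", "villa"], toy_vectors)    # "villa" is unknown
swapped = sum_vec_toy(["view", "sea"], toy_vectors)           # same words, other order
unknown = sum_vec_toy(["villa", "branch"], toy_vectors)       # nothing is known
print(known, swapped, unknown)
```

A name made up entirely of unknown tokens ends up as an all-zero vector, so the classifier cannot tell such names apart.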

# Sum the word vectors for the test set as well
test_vec = np.concatenate([sum_vec(text) for text in tqdm_notebook(x_test)])
# Predict with the trained model and store the results in the test table
y_pred = model.predict(test_vec)
df_new_hotel['PredLabel'] = y_pred  # assign the array directly; a default-indexed pd.Series would be aligned on df_new_hotel's index
# Accuracy of the predictions
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

Output:

0.6163636363636363

The prediction accuracy is only about 62%, which is a rather unsatisfactory result. Let’s expand the data and look for the main reason.
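To put this accuracy in context, it helps to compare it with a trivial baseline. Using only the counts reported in this post (550 test hotels, 195 of them labeled 1), a rule that always predicts label 0 would be right about 64.5% of the time, so the classifier does not beat the majority-class baseline on this test set:

```python
# Counts and accuracy reported in this post.
n_test = 550          # unrated hotels in the test set
n_test_label1 = 195   # of which labeled 1 (price <= 150)
model_accuracy = 0.6163636363636363

# A classifier that always predicts the majority class (label 0).
baseline_accuracy = (n_test - n_test_label1) / n_test
model_correct = round(model_accuracy * n_test)

print(f"baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f} "
      f"({model_correct} of {n_test} correct)")
```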

new_hotel_questionable = df_new_hotel[(df_new_hotel['PriceLabel'] ==0) & (df_new_hotel['PredLabel']==1)]
new_hotel_questionable = new_hotel_questionable.sort_values(by='HotelPrice', ascending=False)
new_hotel_questionable

Output: [table of new hotels priced above 150 yuan but predicted as label 1]

The results show that many new hotels, especially the expensive ones, are villa-style resort hotels, a type that is not prominent in the rated data set. The classifier is therefore not sensitive to them, and the chance of misclassification increases considerably.
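When predictions are written back into a frame whose index is not 0..n-1 (`df_new_hotel` is indexed by the original `index` column), it is worth checking how the assignment lines rows up, because a table of "questionable" hotels is only meaningful if each prediction sits next to the right hotel. A minimal sketch of the pitfall on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame indexed by non-consecutive labels, like df_new_hotel after set_index.
toy = pd.DataFrame({"HotelName": ["A", "B", "C"]}, index=[7, 0, 42])
y_pred = np.array([1, 0, 1])

# Assigning a Series aligns on index labels: only label 0 exists in both,
# and it receives the Series value at label 0, not the second prediction.
toy["aligned"] = pd.Series(y_pred)

# Assigning the array itself (or a Series built with index=toy.index)
# keeps the predictions in row order.
toy["positional"] = y_pred
print(toy)
```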

plt.figure(figsize=(15, 7))
plt.imshow(get_word_map(new_hotel_questionable['HotelName'].values), interpolation='bilinear')

Output:

<matplotlib.image.AxesImage at 0x7f7333b06d68>

Compared with the data set used for modeling, the names of newly opened hotels contain additional words such as “store No.”, “branch store” and “villa”, which lowers the prediction accuracy.
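How much of the new hotels' vocabulary is missing from the training vocabulary can be measured directly. The sketch below computes the share of test tokens that do not reach a minimum count in the training names (Word2Vec drops words below `min_count`); `oov_rate` is a hypothetical helper and the token lists are toy data.

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, min_count=1):
    """Share of test tokens missing from the training vocabulary."""
    counts = Counter(t for tokens in train_tokens for t in tokens)
    vocab = {t for t, c in counts.items() if c >= min_count}
    flat_test = [t for tokens in test_tokens for t in tokens]
    if not flat_test:
        return 0.0, Counter()
    missing = [t for t in flat_test if t not in vocab]
    return len(missing) / len(flat_test), Counter(missing)

# Toy tokenized names for illustration only.
train = [["sea", "view", "hotel"], ["sea", "apartment"], ["youth", "hotel"]]
test = [["sea", "villa"], ["branch", "hotel"]]

rate, missing = oov_rate(train, test, min_count=2)
print(rate, missing.most_common())
```

Applied to the real `x_train` and `x_test` with `min_count=10`, a high rate would support the explanation given above, and the `missing` counter would list the new words behind it.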

Meet the new hotels

Beyond their names, we can also look at the geographical distribution of the new hotels and at how their average prices have changed.

new_hotel_distri = df_new_hotel.groupby('HotelLocation')['HotelName'].count().sort_values(ascending=False)[:7]
plt.pie(new_hotel_distri.values, labels=new_hotel_distri.index, autopct='%.1f%%', shadow=True)

Output:

([<matplotlib.patches.Wedge at 0x7f7333ae1240>,
  <matplotlib.patches.Wedge at 0x7f7333ae1c50>,
  <matplotlib.patches.Wedge at 0x7f7333ae9630>,
  <matplotlib.patches.Wedge at 0x7f7333ae9fd0>,
  <matplotlib.patches.Wedge at 0x7f7333af29b0>,
  <matplotlib.patches.Wedge at 0x7f7333afd390>,
  <matplotlib.patches.Wedge at 0x7f7333afdd30>],
 [Text(0.49522412178448982, 0.9822184427113841, 'Jinzhou District'),
  Text(-1.0502523061308453, 0.327062828011443, 'Ganjingzi District'),
  Text(-0.7189197652449374, -0.8325589295300148, 'Shahekou District'),
  Text(0.10878704263418966, -1.0946074087794706, 'Lvshunkou district'),
  Text(0.645723922346646, -0.8905282793117135, 'Zhongshan District'),
  Text(0.9702662169179598, -0.5182503915171803, 'Xigang District'),
  Text(1.0890040760087287, -0.1551454879665377, 'Pulandian district')],
 [Text(0.2701222482463081, 0.5357555142062095, '35.1%'),
  Text(-0.5728648942531882, 0.17839790618805892, '20.1%'),
  Text(-0.39213805376996586, -0.4541230524709171, '16.8%'),
  Text(0.059338386891376174, -0.597058586606984, '9.0%'),
  Text(0.35221304849163515, -0.4857426978063891, '7.8%'),
  Text(0.5292361183188871, -0.2826820317366438, '6.6%'),
  Text(0.5940022232774883, -0.08462481161811146, '4.5%')])

The pie chart shows that more than 30% of the new hotels (35.1%) are in Jinzhou District, while Shahekou District, the established hotel cluster, attracted only 16.8% of the new hotels.

df_new_hotel['HotelLabel'] = df_new_hotel['HotelPrice'].apply(lambda x: 'luxury' if x > 1000 else
                                                              ('high end' if x > 500 else
                                                              ('comfortable' if x > 300 else
                                                              ('economy' if x > 100 else 'cheap'))))
new_hotel_label = df_new_hotel.groupby('HotelLabel')['HotelName'].count()
plt.pie(new_hotel_label.values, labels=new_hotel_label.index, autopct='%.1f%%', explode=[0, 0.1, 0.1, 0.1, 0.1], shadow=True)

Output:

([<matplotlib.patches.Wedge at 0x7f7333abbdd8>,
  <matplotlib.patches.Wedge at 0x7f7333a44828>,
  <matplotlib.patches.Wedge at 0x7f7333a4d208>,
  <matplotlib.patches.Wedge at 0x7f7333a4dba8>,
  <matplotlib.patches.Wedge at 0x7f7333a59588>],
 [Text(1.0859612910752763, 0.17518011955161772, 'luxury'),
  Text(0.6137971106588083, 1.0311416522218948, 'cheap'),
  Text(-1.1999216970224413, 0.01370842860376746, 'economy'),
  Text(0.46080283077562195, -1.1079985339111122, 'comfortable'),
  Text(1.149441699723409, -0.3446502271207151, 'high end')],
 [Text(0.5923425224046961, 0.09555279248270056, '5.1%'),
  Text(0.3580483145509714, 0.6014992971294385, '22.7%'),
  Text(-0.6999543232630907, 0.007996583352197684, '44.0%'),
  Text(0.26880165128577943, -0.6463324781148153, '18.9%'),
  Text(0.6705076581421987, -0.20104596582041712, '9.3%')])

Besides the affordable hotels that most travelers choose, the share of high-end and luxury hotels among newly opened hotels has also risen markedly (9.3% and 5.1%, compared with 4.1% and 0.9% among the rated hotels). Together with the word cloud of the newly opened hotels, this suggests that more hotel operators are investing in high-end hotels, mainly villa-style resort hotels, reflecting travelers’ pursuit of a more comfortable, higher-quality experience.

In terms of price, there are also some interesting results:

df2 = df_new_hotel.groupby('HotelLabel')['HotelPrice'].mean().reset_index()
df1=df_in_ana.groupby('HotelLabel')['HotelPrice'].mean().reset_index()
price_change_percent = (df2['HotelPrice'] -  df1['HotelPrice'])/df1['HotelPrice'] * 100
plt.title('Change in average price of newly opened hotels by grade')
plt.bar(df1['HotelLabel'] ,price_change_percent, width = 0.35)
plt.ylim(-18, 18)
for x, y in enumerate(price_change_percent):
    if y < 0:
        plt.text(x, y, '{:.1f}%'.format(y), ha='center', fontsize=12, va='top')
    else:
        plt.text(x, y, '{:.1f}%'.format(y), ha='center', fontsize=12, va='bottom')

Compared with the older, rated hotels, the average prices of new hotels change as follows:

  1. The average price of luxury and cheap hotels has fallen.
  2. The average price of mid-range hotels (economy, comfortable and high-end) has risen.

Hotels at the two extremes attract attention and occupancy by lowering prices. Mid-range hotels seek returns by changing their business philosophy and following current trends, but how well this works ultimately depends on whether guests accept it.
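The percentage change behind this comparison subtracts two frames produced by separate `groupby(...).reset_index()` calls, which lines the rows up by position. That works as long as both data sets contain every grade. Keeping the grade as the index is a slightly more defensive variant, because pandas then aligns on the label. A sketch with made-up mean prices:

```python
import pandas as pd

# Toy mean prices per grade; values are made up for illustration.
old_mean = pd.Series({"cheap": 80.0, "economy": 180.0, "luxury": 1500.0})
new_mean = pd.Series({"economy": 198.0, "cheap": 76.0})  # no new luxury hotels

# Series arithmetic aligns on the grade label, not on row position;
# grades missing from either side become NaN and are dropped.
change_pct = ((new_mean - old_mean) / old_mean * 100).dropna()
print(change_pct.round(1).to_dict())
```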

Summary

This experiment analyzes hotel data from Dalian. It mines information such as price and regional distribution and provides a “grass planting list” and a “lightning protection list” of rated hotels (no more worrying about which hotel to pick when friends visit Dalian). It analyzes word clouds of hotel names, finds a relationship between hotel names and hotel grades, builds a classification model to predict whether the names of newly opened, unrated hotels match their pricing, examines the regional and grade distribution of newly opened hotels, and compares their average prices with those of the rated hotels, which offers a side view of how Dalian’s tourism industry is developing. Because the data set is small and hotel naming depends strongly on region and era, the modeling and prediction results are not good. Still, learning these techniques and applying them from different angles is an interesting way to deepen one’s understanding of them.

Also published on the Zhihu column: https://xuanlan.zhihu.com/p/85909205