The Airbnb Dataset

About the Data

    The dataset used in our project was obtained from Inside Airbnb, an organization that collects Airbnb data for cities across different countries and continents. The data on the site originates from publicly accessible information on the Airbnb website; Inside Airbnb analyzed, cleaned, and rearranged it to facilitate future use by the public. At this point, most of the cities included are metropolises, but the dataset is still expanding to cover more regions around the globe. Since our research questions mainly focus on Los Angeles, we requested the listing, reviews, calendar, and neighborhood data for LA directly from the website.

    For Los Angeles, the dataset contains information about more than 40k Airbnbs in the listing file; over 1 million review records in the reviews file; the availability of every recorded Airbnb over the following 12 months in the calendar file; and a list of all Los Angeles neighborhoods and their corresponding neighborhood groups in the neighborhood file.

Processing the Data

    Since the dataset has already been cleaned and integrated by Inside Airbnb, and comes with a data dictionary that describes each variable in detail, it is quite straightforward to read and understand. However, because the data comes directly from Airbnb listings, some listing names contain special characters, emojis, or languages other than English. We therefore identified all listings with non-English characters in their names and created a subset of the original dataset without them, in case any of our future research focuses on the information contained in listing names. We also kept a subset containing only the listings with special characters, along with the original dataset, in case we want a comprehensive view using all the information the dataset provides.
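
    The split described above can be sketched as follows. This is a minimal Python illustration rather than the R code we actually ran, and it rests on two assumptions: listings are rows with a `name` field, and "non-English" is approximated as "contains any non-ASCII character". The sample rows are hypothetical.

```python
def split_by_name(listings):
    """Split listing rows into (ascii_only, non_ascii) based on the 'name' field.

    Assumption: any non-ASCII character (emoji, accented letters, CJK text, ...)
    marks a name as "non-English". This is a rough proxy, not a language detector.
    """
    ascii_only, non_ascii = [], []
    for row in listings:
        name = row.get("name") or ""  # treat a missing name as an empty string
        if name.isascii():
            ascii_only.append(row)
        else:
            non_ascii.append(row)
    return ascii_only, non_ascii


# Hypothetical sample rows, for illustration only.
sample = [
    {"id": "1", "name": "Cozy Venice Beach Studio"},
    {"id": "2", "name": "Sunny loft ☀ near DTLA"},
    {"id": "3", "name": "好莱坞公寓"},
]

english_only, special = split_by_name(sample)
print([row["id"] for row in english_only])
print([row["id"] for row in special])
```

    A stricter or looser rule (for example, allowing accented Latin letters) would change which listings fall into each subset, so the rule should be chosen to match the research question.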

    Using R, we combined the listing file with the reviews file on the Airbnb id present in both datasets. The reviews dataset is simply a record of Airbnb ids and review dates, so each observation corresponds to one review made on that particular day. By counting the number of observations for each Airbnb id, we obtained the number of reviews for each recorded Airbnb. We then used the merge function to add the review count to the original listing dataset.
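
    The count-then-merge step can be sketched in a few lines. The version below is a Python illustration of the idea, not our R script; the field names `id`, `listing_id`, and `date` and the sample rows are assumptions made for the example.

```python
from collections import Counter


def add_review_counts(listings, reviews):
    """Return copies of the listing rows with a 'review_count' field added.

    Each review row is assumed to hold one review, so counting rows per
    listing id gives the number of reviews for that listing.
    """
    counts = Counter(review["listing_id"] for review in reviews)
    # Keep every listing; listings with no review rows get a count of 0.
    return [
        {**listing, "review_count": counts.get(listing["id"], 0)}
        for listing in listings
    ]


# Hypothetical sample rows, for illustration only.
listings = [
    {"id": "101", "name": "Studio A"},
    {"id": "102", "name": "Bungalow B"},
    {"id": "103", "name": "Loft C"},
]
reviews = [
    {"listing_id": "101", "date": "2022-01-05"},
    {"listing_id": "101", "date": "2022-03-17"},
    {"listing_id": "103", "date": "2022-02-11"},
]

with_counts = add_review_counts(listings, reviews)
print([(row["id"], row["review_count"]) for row in with_counts])
```

    The sketch keeps listings that have no reviews and gives them a count of 0. In R, `merge` drops unmatched rows by default, so an equivalent result would need `all.x = TRUE` plus replacing the resulting missing counts with 0.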

    We also joined an updated neighborhood dataset to the listing dataset. To build it, we found the median income of each neighborhood from the LA Times and combined it with the original neighborhood dataset by neighborhood name. We then used the neighborhood name again to join the updated neighborhood dataset to our listing dataset, so that all our data sits in a single table.
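
    Joining on a name is more fragile than joining on an id, because spelling, spacing, and capitalization can differ between sources. The Python sketch below illustrates one way to do such a join while reporting names that fail to match. The field name `neighbourhood`, the normalization rule, and the income figures are all placeholders assumed for illustration; they are not values from our data.

```python
def normalize(name):
    """Collapse whitespace and ignore case so near-identical names match."""
    return " ".join(name.split()).casefold()


def attach_income(listings, income_by_neighbourhood):
    """Add 'median_income' to each listing row by neighbourhood name.

    Returns (merged_rows, unmatched_names). Listings whose neighbourhood has
    no income entry keep a median_income of None and are reported.
    """
    lookup = {normalize(k): v for k, v in income_by_neighbourhood.items()}
    merged, unmatched = [], set()
    for row in listings:
        income = lookup.get(normalize(row["neighbourhood"]))
        if income is None:
            unmatched.add(row["neighbourhood"])
        merged.append({**row, "median_income": income})
    return merged, sorted(unmatched)


# Placeholder values for illustration only; not real income figures.
income = {"Venice": 50000, "Silver Lake": 60000}
rows = [
    {"id": "1", "neighbourhood": "venice "},
    {"id": "2", "neighbourhood": "Silver  Lake"},
    {"id": "3", "neighbourhood": "Hollywood"},
]

merged, unmatched = attach_income(rows, income)
print([(row["id"], row["median_income"]) for row in merged])
print(unmatched)
```

    Checking the list of unmatched names before analysis helps catch neighborhoods that silently lost their income value during the join.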

    We mainly used Tableau and R for data visualization, so we converted all subsets of the original dataset into .csv files for convenience.

Data Critique

  • What does this data set entail? What information does it give us? Why is it significant?
    This data set contains Airbnb data on dozens of cities and countries around the world. Each city has listings data, reviews data, and calendar data. The calendar data shows whether a listing was available on each date between June 6, 2022, when the data was scraped, and June 5, 2023. The listings data for a specific city, which is much of what we will be focusing on, contains location data, host data, and detailed listing data. Location data includes the approximate latitude and longitude and the neighborhood within the city. Host data includes the host’s name, how many listings they have, and whether they are verified. The detailed listing data includes the amenities, the number of bedrooms, the price, the reviews, and the availability. This data is significant because, according to the mission of Inside Airbnb, the project that provides it, the data gives insight into the effect of Airbnb on residential communities.
  • How was this data generated?
    The Airbnb data was generated by scraping public information from the Airbnb website. The data in the dataset is simply a snapshot of listings at a certain moment in time. For example, if an Airbnb listing is deleted after the data has been generated, the listing and all of its information will still appear in the dataset.
  • What are the original sources?
    The original source of this dataset is the Airbnb website, and all information used in the dataset is public information from that site (such as room availability for the coming year and reviews for each listing). All location information in the listing dataset is anonymized by Airbnb, and the availability information depends largely on how each host defines availability. The dataset does not use the neighborhood names shown on the Airbnb site because they are inaccurate. Instead, the neighborhood name for each listing is assigned by comparing the listing’s geographic location with the city’s definition of neighborhoods.
  • Who or what organization funded the creation of this dataset? (find out if possible)
    This dataset was initially created for Inside Airbnb, a project that investigates how Airbnb has impacted the residential communities around its listings. The person behind the creation of this dataset, and of the project itself, is Murray Cox, a community artist and activist. He got the idea for the project while working with Clarisa James, the Executive Director of DIVAS for Social Justice.
  • What information, events, or phenomena can your dataset illuminate?
    According to its website, Inside Airbnb aims to “empower communities to understand, decide and control the role of renting residential homes to tourists”. Although our project only focuses on one city (Los Angeles), the dataset itself is quite large and contains data for cities all over the world, which means we can gain insight into global Airbnb records and their role in local tourism. One significant attribute of the Airbnb dataset is that it combines consumer-side and host-side data. By looking at the reviews, costs, frequency of nights, and more, we can interpret consumers’ interests and trends in tourism. On the host side, the nightly rates, neighborhood locations, property size, and more can be connected to changes in the real estate market and help explain the negative effects of a huge influx in demand, in particular how tourism impacts gentrification.
  • What information is left out? What can the data set not reveal?
    Our dataset only shows the number and frequency of reviews; no variable contains the actual review text or quantifies the review (for example, with a score). This makes it especially hard to understand how guests feel about their stay. One of the variables describes the availability of an Airbnb throughout the year, but it is measured as the total number of days the Airbnb is available, so we cannot tell whether an Airbnb is busier during certain months. In fact, since most of the data in our dataset is measured per year, we are unable to observe detailed trends within that year. Although our data set contains thousands of records on host ID, neighborhood, price, address, and more, it does not reveal core information about the guests (customers) who book the Airbnbs. Thus, it is difficult to obtain the demographic context of the guests and interpret the outside factors they take into account when booking an Airbnb. As the global tourism industry has grown rapidly, along with housing prices in neighborhoods that see an influx of tourists, knowing the demographic segmentation would help us understand whether one age group is more likely to choose an Airbnb booking over an affordable hotel. Likewise, if we knew that one area is more likely to be booked by students or families and another by senior executives, we could better interpret the growth and history of those neighborhoods.
  • What are the ideological effects? (You should also give your account of the ideological effects of the way in which your sources have been divided into data (your dataset’s ontology). If your dataset were your only source, what information would be left out?)
    Ideological effects include the way the data is divided by location, such as different cities or regions across the globe, rather than by other descriptors such as price or house type. This limits us to comparing only listings within the same region. If the dataset were our only source, other regions would be left out, as we are only focusing on the Los Angeles area when analyzing the different pieces of information. Because our dataset is recorded mostly quantitatively as consumers post reviews online, we cannot measure the humanistic value behind their location interest. For example, we cannot know whether someone values aesthetics over commutability. Furthermore, it is crucial to note that Inside Airbnb is a completely separate organization from the Airbnb company itself and scrapes data from listings and reviews posted by users of Airbnb.com. The interests and presentations of these two entities clashed when Inside Airbnb compared its scraped data with Airbnb’s public data release for New York City in 2015. Murray Cox, the founder of Inside Airbnb, had been questioning the transparency of Airbnb and the thousands of listings that violate policy. Collaborating with Tom Slee, he published a report titled “How Airbnb’s Data Hid the Facts in New York City”. Cox and Slee hold strong biases against how Airbnb presents its own public data, and they purposely established Inside Airbnb to highlight illegal renting on the site and showcase how it negatively affects the housing market in cities. According to a press interview with Cox and Slee, they wanted to uncover the “lies” in Airbnb, and when they obtained the long list of data from multiple cities, they declared that it was “apparent proof that Airbnb was burying evidence of scofflaws on the platform” (Katz 2017).
    Thus, there are critical limitations in the data set and the analysis Inside Airbnb provides, as its intent is to showcase a negative view of Airbnb and the organization does not accurately track new changes to Airbnb’s listings in each city. The Airbnb company has repeatedly criticized Inside Airbnb as an inaccurate representation of its sources; however, Inside Airbnb has been partnering with city officials, governments, researchers, hotel industry leaders, and many others who are interested in using its data set. Its search engine optimization is also highly effective: when a user searches for public data related to Airbnb, Inside Airbnb is the first source to appear. This overshadows the actual public data release from Airbnb and could cause confusion about which data is more accurate.