

We found this dataset on the data science platform Kaggle, where it appears to have been created by an independent user. The dataset consists of 10,000 songs from 1960 to the present day, selected from the top music popularity charts of each year. The sources are the ARIA (Australian Recording Industry Association) and Billboard charts, whose rankings helped decide which 10,000 songs to include. However, it is unclear how far the compiler's own subjective decisions shaped the selection. What we do know is that they considered the songs’ commercial success and cultural significance, and that they tried to include diverse tracks so that the breadth of the music scene was represented.


The creator built the dataset by linking each song to the information Spotify provides about it and by using audio and lyric processors to generate metrics for the more intangible aspects of each song. They also obtained links to album cover images and track preview files from scdn.co, a domain associated with Spotify.

What does the data contain?
This dataset contains descriptive and numerical data for each song in the collection. Each record includes basic track information obtained from Spotify, such as:
- Track Name
- Artist Name(s)
- Album Name
- Album Release Date
- Disc Number
- Track Number
- Track Duration (ms)
- Artist Genres
Furthermore, the dataset provides quantitative measures of various audible and sensory aspects of each track:
- Danceability
- Energy
- Key
- Loudness
- Mode
- Speechiness
- Acousticness
- Instrumentalness
- Liveness
- Valence
- Tempo
- Time Signature
Other fields in the dataset include:
- Track URI
- Artist URI(s)
- Album URI
- Album Artist URI(s)
- Album Artist Name(s)
- Album Image URL
- Track Preview URL
- Explicit
- Popularity
- ISRC
- Added By
- Added At
- Label
- Copyrights
The data lends itself to a few initial lines of inquiry. For instance, it can be used for a time-series analysis of music popularity, genres, and audio features from the 1960s until now, which can help show how particular characteristics of songs change over time. The dataset can also be used to relate songs’ audio features to their genres and other time-independent fields, which can help explain in quantifiable terms why certain songs or styles of songs became popular.
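As a rough illustration of the time-series idea above, the sketch below averages one audio feature per release decade. It is a minimal sketch under stated assumptions, not a description of any analysis actually run: it assumes the rows come from something like csv.DictReader, that "Album Release Date" begins with a four-digit year, and that the feature column holds numeric text. The helper names (decade_of, mean_by_decade) and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def decade_of(release_date: str) -> int:
    """Return the decade (e.g. 1980) for a date string that starts with a 4-digit year."""
    year = int(release_date[:4])
    return year - year % 10


def mean_by_decade(rows, feature):
    """Average one numeric column per release decade, skipping rows with missing values."""
    buckets = defaultdict(list)
    for row in rows:
        date = row.get("Album Release Date", "")
        value = row.get(feature, "")
        # Assumption: unusable rows have an empty date or an empty feature value.
        if len(date) < 4 or not date[:4].isdigit() or value in ("", None):
            continue
        buckets[decade_of(date)].append(float(value))
    return {decade: mean(values) for decade, values in sorted(buckets.items())}


# Tiny made-up rows standing in for csv.DictReader output; values are illustrative only.
sample_rows = [
    {"Album Release Date": "1965-03-01", "Danceability": "0.50"},
    {"Album Release Date": "1968", "Danceability": "0.60"},
    {"Album Release Date": "1984-07-13", "Danceability": "0.70"},
    {"Album Release Date": "", "Danceability": "0.90"},  # skipped: no release date
]
decade_means = mean_by_decade(sample_rows, "Danceability")
print(decade_means)
```

The same function could be called with "Energy", "Valence", or "Popularity" to compare trends, provided those columns are numeric in the actual file.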
What’s missing?
Notably, the dataset is missing geographic data and other information, such as artist demographics, that we were interested in researching. We used various web scraping processes and separate online databases to add this missing information. By augmenting the dataset with artists’ country of origin and other demographic data, including racial/ethnic background and gender, we could track more complex trends over time. For example, we could observe whether a shift in the audio features recorded in the data correlates with specific eras, historical movements, or geographic locations. We could also trace the representation of marginalized groups in the music scene, which would let us draw more socially meaningful insights from the data. Using the demographic data, we could also see how marginalized artists became more prevalent in parallel with notable historical developments, such as wider technological access through music platforms and social media, social movements, or other changes that could have helped certain artists gain more listeners.
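The augmentation step described above amounts to joining each track to a separate table of artist information. The sketch below shows one possible way to do that and is not a record of the method actually used. It assumes that "Artist Name(s)" lists several artists separated by commas and that the scraped data has been collected into a dictionary keyed by artist name; the artist_info structure, the "Artist Countries" field, and the function names are all hypothetical, and the sample names are placeholders.

```python
def split_artists(field, sep=","):
    """Split a multi-artist field into clean individual names."""
    return [name.strip() for name in field.split(sep) if name.strip()]


def add_artist_countries(rows, artist_info, sep=","):
    """Return copies of the rows with a list of artist countries, plus unmatched names."""
    augmented, unmatched = [], set()
    for row in rows:
        countries = []
        for name in split_artists(row.get("Artist Name(s)", ""), sep):
            info = artist_info.get(name)
            if info is None:
                unmatched.add(name)  # recorded so gaps can be filled in by hand
            countries.append(info["country"] if info else None)
        # Build a new dict so the original track rows are left unchanged.
        augmented.append({**row, "Artist Countries": countries})
    return augmented, sorted(unmatched)


# Placeholder names only; no real artist data is implied.
artist_info = {
    "Artist A": {"country": "Country X"},
    "Artist B": {"country": "Country Y"},
}
tracks = [
    {"Track Name": "Song 1", "Artist Name(s)": "Artist A, Artist C"},
    {"Track Name": "Song 2", "Artist Name(s)": "Artist B"},
]
augmented, unmatched = add_artist_countries(tracks, artist_info)
print(augmented[0]["Artist Countries"], unmatched)
```

Returning the unmatched names makes the coverage of the added data visible, which matters because gaps in scraped demographic data could themselves skew any trend analysis. Matching on names is also fragile when an artist's name contains the separator, so joining on "Artist URI(s)" may be more reliable if the external source provides Spotify identifiers.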
The data has several limitations that follow from how it was created. It is biased toward music that is mainstream and commercially successful in the English-speaking parts of the world. Because it relies on top rankings from Billboard and ARIA (two Western, English-speaking music organizations), it tends to leave out songs by prominent artists in regions such as Asia who do not record in English. In addition, by its very nature as a collection of popular songs, the data leaves out independent and underground artists who do not receive mainstream coverage and attention. The cutoff at 1960 also makes it difficult to track trends associated with the songs of the 1960s, such as how the Civil Rights Movement affected music, because the figures from that period cannot be compared with data from the 1940s and 1950s or earlier. Moreover, it is hard to be sure of the artists’ intentions on all of these tracks, so we cannot tell whether the songs directly reflect the socio-political climate of their time or were a more self-contained form of artistic expression.