Master's candidate in Biostatistics, Tushar Patni, will present:
“Comparing Levenshtein and Euclidean Distances for Clustering Time Use Data”
Plan B Adviser: Erika Helgeson
Abstract:
With the variety of wearable technology devices available on the market, such as Fitbit and other activity trackers, time use data is becoming more prevalent. It is therefore valuable to understand the advantages and limitations of the methods used to analyze such data. For categorical time series data, clustering with the conventional Euclidean distance metric is not feasible. In this study, we proposed using the Levenshtein distance, which is commonly used in the field of genetics, to account for the dynamic nature of time series data. To compare the Levenshtein and Euclidean distances for clustering, data was extracted from the American Time Use Survey (ATUS), an online repository. ATUS collects data on when and how individuals spend their time on various activities, such as paid work, socializing, and household work. The ATUS data was restructured into two formats: a total time format, representing the total time spent in each activity, and a time series format. Several clustering algorithms were applied to both formats, using Euclidean distance for the total time data and Levenshtein distance for the time series data. The results showed that the Levenshtein distance metric identified clusters that the Euclidean distance did not pick up, but only when the clustering algorithms identified more than three clusters.
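As a rough illustration of the two data formats and distances described in the abstract, the following Python sketch compares three made-up daily activity sequences. The activity codes ("S" for sleep, "W" for paid work, "H" for household work), the eight-slot days, and all function names are hypothetical and chosen only for this example; they are not taken from ATUS or from the study's own code.

```python
import math
from typing import List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def total_time(seq: Sequence[str], activities: Sequence[str]) -> List[int]:
    """Total time format: number of time slots spent in each activity."""
    return [list(seq).count(act) for act in activities]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


# Hypothetical activity codes and toy days (one character per time slot).
ACTIVITIES = ["S", "W", "H"]
day_a = list("SSWWWWHH")
day_b = list("WWWWSSHH")  # same totals as day_a, different ordering
day_c = list("SSWWWHHH")  # similar ordering to day_a, different totals

for name, other in [("day_b", day_b), ("day_c", day_c)]:
    d_euc = euclidean(total_time(day_a, ACTIVITIES), total_time(other, ACTIVITIES))
    d_lev = levenshtein(day_a, other)
    print(f"day_a vs {name}: Euclidean={d_euc:.3f}, Levenshtein={d_lev}")
```

In this toy example, day_a and day_b spend identical totals in each activity, so their Euclidean distance in the total time format is zero, while the Levenshtein distance on the sequences still separates them because the activities occur in a different order. Either distance could then be passed as a precomputed dissimilarity matrix to a clustering algorithm.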
Refreshments will be served prior to the presentation.