DATA CLUSTERING IN DIFFERENT ENVIRONMENTS: AN ANALYSIS OF COST, TIME AND QUALITY
DOI: https://doi.org/10.56238/levv16n49-112

Keywords: Clustering, Cloud computing, Local environment, ENADE, Silhouette Score, Data science

Abstract
This study analyzes the efficiency of clustering algorithms in two distinct computational environments: local and cloud-based. The research adopts a quantitative, experimental approach, measuring and comparing the performance of four algorithms (KMeans, MiniBatchKMeans, DBSCAN, and HDBSCAN) on three metrics: execution time, operational cost, and clustering quality. The dataset was extracted from the 2022 National Student Performance Exam (ENADE), specifically from questions on students' perceptions of the pandemic's impact on their academic experience. Data processing included cleaning, normalization, and structuring for analysis in both environments. Implementation used Python, PostgreSQL, Visual Studio Code, and Amazon SageMaker, with identical parameters maintained across all experiments. Cluster quality was assessed primarily with the Silhouette index, complemented by analyses of computational complexity and processing time. Results showed that the cloud environment outperformed the local one in execution time, with MiniBatchKMeans standing out, while the local environment was more economical in total cost. No significant differences in clustering quality were observed between the two environments. We conclude that the choice between local and cloud environments should weigh the project profile, data volume, processing urgency, and available resources. This research contributes to a practical understanding of the advantages and limitations of each infrastructure, offering insights for technical and strategic decision-making in data science, especially in educational contexts. It also emphasizes the importance of replicability, test automation, and careful metric selection to ensure reliable results in experiments with real-world data.
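For illustration, the benchmarking procedure described above can be sketched in Python as follows. This is a minimal sketch, assuming scikit-learn 1.3 or later (which provides all four algorithms, including HDBSCAN); it uses synthetic placeholder data in place of the preprocessed ENADE responses, and the parameter values shown are hypothetical, since the abstract does not list the ones fixed in the study.

import time
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the cleaned, normalized ENADE 2022 features.
X, _ = make_blobs(n_samples=5000, n_features=8, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Hypothetical parameter choices; the study keeps parameters identical
# across the local and cloud environments.
models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=42),
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=4, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=10),
    "HDBSCAN": HDBSCAN(min_cluster_size=50),
}

for name, model in models.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start

    # Density-based methods may flag points as noise (label -1); the
    # Silhouette index is computed on clustered points only and requires
    # at least two clusters.
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    score = silhouette_score(X[mask], labels[mask]) if n_clusters >= 2 else float("nan")

    print(f"{name:16s} time={elapsed:.3f}s clusters={n_clusters} silhouette={score:.3f}")

Running the same script in both environments and recording the elapsed times alongside instance pricing yields the time and cost comparisons that the study reports; excluding noise points before scoring is one common convention for comparing density-based and centroid-based methods on equal footing.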