DATA CLUSTERING IN DIFFERENT ENVIRONMENTS: AN ANALYSIS OF COST, TIME AND QUALITY
DOI: https://doi.org/10.56238/levv16n49-112

Keywords: Clustering, Cloud computing, Local environment, ENADE, Silhouette Score, Data science

Abstract
This study analyzes the efficiency of clustering algorithms in two distinct computational environments: local and cloud-based. The research adopts a quantitative, experimental approach, measuring and comparing the performance of four algorithms (KMeans, MiniBatchKMeans, DBSCAN, and HDBSCAN) on three metrics: execution time, operational cost, and clustering quality. The dataset was extracted from the 2022 National Student Performance Exam (ENADE), specifically from questions on students' perceptions of the pandemic's impact on their academic experience. Data processing included cleaning, normalization, and structuring for analysis in both environments. Implementation used Python, PostgreSQL, Visual Studio Code, and Amazon SageMaker, with identical parameters maintained across all experiments. Cluster quality was assessed primarily with the Silhouette index, complemented by analyses of computational complexity and processing time. Results showed that the cloud environment outperformed the local one in execution time, with MiniBatchKMeans standing out, while the local environment was more economical in total cost. No significant differences in clustering quality were observed between the two environments. We conclude that the choice between local and cloud environments should weigh the project profile, data volume, processing urgency, and available resources. This research contributes to a practical understanding of the advantages and limitations of each infrastructure, offering insights for technical and strategic decision-making in data science, especially in educational contexts. It also emphasizes the importance of replicability, test automation, and careful metric selection to ensure reliable results in experiments with real-world data.
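For illustration, the benchmarking procedure described above can be sketched in Python as follows. This is a minimal sketch, assuming scikit-learn 1.3 or later (which provides all four algorithms, including HDBSCAN); it uses synthetic placeholder data in place of the preprocessed ENADE responses, and the parameter values shown are hypothetical, since the abstract does not list the ones fixed in the study.

import time
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN, HDBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the cleaned, normalized ENADE 2022 features.
X, _ = make_blobs(n_samples=5000, n_features=8, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Hypothetical parameter choices; the study keeps parameters identical
# across the local and cloud environments.
models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=42),
    "MiniBatchKMeans": MiniBatchKMeans(n_clusters=4, n_init=10, random_state=42),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=10),
    "HDBSCAN": HDBSCAN(min_cluster_size=50),
}

for name, model in models.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start

    # Density-based methods may flag points as noise (label -1); the
    # Silhouette index is computed on clustered points only and requires
    # at least two clusters.
    mask = labels != -1
    n_clusters = len(set(labels[mask]))
    score = silhouette_score(X[mask], labels[mask]) if n_clusters >= 2 else float("nan")

    print(f"{name:16s} time={elapsed:.3f}s clusters={n_clusters} silhouette={score:.3f}")

Running the same script in both environments and recording the elapsed times alongside instance pricing yields the time and cost comparisons that the study reports; excluding noise points before scoring is one common convention for comparing density-based and centroid-based methods on equal footing.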