Cyberbullying Detection on Twitter for the Colombian Population Using Artificial Intelligence Techniques

Felipe Mauricio Guerra Sáenz; Oscar Fernando Bedoya Leiva; Marcela Holguín Mera

doi:10.24050/reia.v22i44.1757

How to cite

Guerra Sáenz, F. M., Bedoya Leiva, O. F., & Holguín Mera, M. (2025). Cyberbullying Detection on Twitter for the Colombian Population Using Artificial Intelligence Techniques. Revista EIA, 22(44), 4402 pp. 1–37. https://doi.org/10.24050/reia.v22i44.1757

Published: Jul 1, 2025

Keywords:

Artificial intelligence

Cyberbullying

Colombian Spanish

Language models

Logistic regression

Machine learning

Random forests

Transformers

Twitter

XGBoost

Issue

Vol. 22 No. 44 (2025): Table of Contents, Revista EIA No. 44

Section

Articles

License terms

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.

Copyright statement

The authors grant Universidad EIA, exclusively and with the right to assign to third parties, all exploitation rights deriving from works accepted for publication in Revista EIA and in any products derived from it. These include, in particular, the rights of reproduction, distribution, public communication (including interactive making available), and transformation (including adaptation, modification, and, where applicable, translation), for all modes of exploitation (by way of example and without limitation: print, electronic, online, computer media, or audiovisual formats, as well as any other format, including for promotional or advertising purposes and/or for the creation of derivative products), worldwide and for the full legal term of the rights provided for in the current text of the Intellectual Property Law. The authors make this assignment without any right to remuneration or compensation.

The authorization granted to Revista EIA takes effect on the date the work is included in the corresponding volume and issue in Revista EIA's Open Journal Systems platform, as well as in the databases and indexes in which the publication is indexed.

All content in Revista EIA is published under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Felipe Mauricio Guerra Sáenz

Universidad del Valle, Colombia

Oscar Fernando Bedoya Leiva

Universidad del Valle, Colombia

Marcela Holguín Mera

Universidad de San Buenaventura, Colombia

Abstract

Cyberbullying is an increasingly relevant issue in contemporary society and can have serious consequences for the emotional and psychological well-being of victims. Given the vast volume of digital interactions, manually detecting and removing every cyberbullying comment is challenging and labor-intensive for online platform moderators. There is therefore a need for automatic models that use artificial intelligence techniques to detect cyberbullying. This article proposes machine learning models and language models for cyberbullying detection on the social network Twitter. The machine learning techniques used are XGBoost, logistic regression, and random forests. For the language model, a fine-tuning process was applied to roberta-base-bne, a transformer-based masked language model. Although various models currently exist for this purpose, most were developed for English. For Spanish there are very few studies, and for Colombian Spanish in particular there is no precedent for contributions in this area.
Additionally, this article introduces a corpus of tweets written in Colombian Spanish, annotated by a qualified occupational therapist. Two distinct datasets stem from this corpus. In Dataset 1, the same word appears both in a tweet annotated as cyberbullying and in a tweet annotated as non-cyberbullying, with the annotation reflecting the context in which the word is used. In contrast, in Dataset 2 the cyberbullying and non-cyberbullying tweets contain different words. The two datasets are used to capture diverse language expressions and their contextual nuances, and to assess how well the applied techniques discern such context. On Dataset 1, the models achieve an area under the ROC curve of 0.797, 0.796, 0.785, and 0.910 with logistic regression, random forests, XGBoost, and roberta-base-bne, respectively. On Dataset 2, the corresponding values are 0.983, 0.978, 0.971, and 0.996. Finally, we introduce a web application named AI Cyberbullying Detector, tailored for therapists, that allows them to use artificial intelligence in cyberbullying-related studies.
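
The abstract reports every result as an area under the ROC curve. As a reminder of what that number measures, the following minimal Python sketch computes it as the fraction of (cyberbullying, non-cyberbullying) tweet pairs in which the cyberbullying tweet receives the higher score. The function name `roc_auc`, the 0/1 label encoding, and the toy scores are illustrative assumptions; this is not the authors' evaluation code or data.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise-ranking identity.

    labels: 1 for cyberbullying, 0 for non-cyberbullying (assumed encoding).
    scores: classifier scores, higher meaning "more likely cyberbullying".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels and scores, invented for illustration only.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC on toy data: {toy_auc:.3f}")  # 8 of 9 pairs ranked correctly
```

Under this reading, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for comparing the Dataset 1 and Dataset 2 figures above.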
