Cyberbullying Detection on Twitter for the Colombian Population Using Artificial Intelligence Techniques

Felipe Mauricio Guerra Sáenz
Oscar Fernando Bedoya Leiva
Marcela Holguín Mera

Abstract

Cyberbullying is an increasingly relevant issue in contemporary society and can have serious consequences for the emotional and psychological well-being of victims. Given the sheer volume of digital interactions, manually detecting and removing every cyberbullying comment is difficult and labor-intensive for online platform moderators. Automatic models that use artificial intelligence techniques to detect cyberbullying are therefore needed. This article proposes machine learning models and language-model-based models for cyberbullying detection on the social network Twitter. The machine learning techniques used are XGBoost, logistic regression, and random forests. For the language model, a fine-tuning process was applied to roberta-base-bne, a transformer-based masked language model. Although various models already exist for this purpose, most were developed for English. Very few studies address Spanish, and for Colombian Spanish in particular there is no precedent in this area.
Additionally, this article introduces a corpus of tweets written in Colombian Spanish, annotated by a qualified occupational therapist. Two distinct datasets are derived from this corpus. Dataset 1 pairs a tweet annotated as cyberbullying with another annotated as non-cyberbullying, both containing the same word, with the annotation taking into account the context in which the word is used. In Dataset 2, by contrast, cyberbullying and non-cyberbullying tweets contain different words. Using both datasets captures diverse language expressions and their contextual nuances, and makes it possible to assess how well the applied techniques discern that context. On Dataset 1, the models achieve an area under the ROC curve of 0.797, 0.796, 0.785, and 0.910 with logistic regression, random forests, XGBoost, and roberta-base-bne, respectively. On Dataset 2, the corresponding values are 0.983, 0.978, 0.971, and 0.996. Finally, we introduce a web application named AI Cyberbullying Detector, designed for therapists, which lets them apply artificial intelligence in cyberbullying-related studies.
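
The results above are reported as area under the ROC curve (AUC). As a reading aid, the following minimal sketch shows one standard way to compute that metric: the pairwise formulation, in which AUC is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the article, whose evaluation code is not shown here.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Counts, over all (positive, negative) pairs, how often the positive
    example receives the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = cyberbullying, 0 = non-cyberbullying.
# The scores stand in for a classifier's predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.3f}")  # prints "AUC = 0.889" (8 of 9 pairs ranked correctly)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported gap between Dataset 1 and Dataset 2. The quadratic pair loop is chosen for clarity; a rank-based implementation would be preferable for large test sets.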
