Humor detection in English-Hindi code-mixed social media content: Corpus and baseline system

Ankush Khandelwal, Sahil Swami, Syed S. Akthar, Manish Shrivastava

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The tremendous amount of user-generated data on social networking sites has driven the growing popularity of automatic text classification in computational linguistics over the past decade. Within this domain, one problem that has drawn the attention of many researchers is automatic humor detection in text. Detecting humor requires in-depth semantic understanding of the text, which makes the problem difficult to automate. As the number of social media users increases, many multilingual speakers switch between languages while posting on social media, a phenomenon called code-mixing. Code-mixing introduces challenges for the linguistic analysis of social media content (Barman et al., 2014), such as spelling variations and non-grammatical sentence structures. Past research includes detecting puns in texts (Kao et al., 2016) and humor in one-liners (Mihalcea et al., 2010) in a single language, but given the tremendous amount of code-mixed data available online, there is a need to develop techniques that detect humor in code-mixed tweets. In this paper, we analyze the task of humor detection in texts and describe a freely available corpus containing English-Hindi code-mixed tweets annotated with humorous (H) or non-humorous (N) tags. We also tag the words in the tweets with language tags (English/Hindi/Others). Moreover, we describe the experiments carried out on the corpus and provide a baseline classification system that distinguishes between humorous and non-humorous texts.
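To make the annotation scheme and the idea of a baseline classifier concrete, the following is a minimal sketch only, not the authors' system. It assumes a record layout of (word, language tag) pairs plus an H/N label, uses the hypothetical tag names "En" and "Hi", and trains a small multinomial Naive Bayes model (one of the classifier families named in the keywords) on word unigrams. The toy tweets are invented for illustration and are not taken from the corpus.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy records: (list of (word, language_tag), label).
# Invented for illustration; NOT drawn from the actual corpus.
TOY_TWEETS = [
    ([("yaar", "Hi"), ("this", "En"), ("joke", "En"), ("mast", "Hi"), ("hai", "Hi")], "H"),
    ([("kya", "Hi"), ("funny", "En"), ("joke", "En"), ("tha", "Hi")], "H"),
    ([("meeting", "En"), ("kal", "Hi"), ("subah", "Hi"), ("hai", "Hi")], "N"),
    ([("traffic", "En"), ("bahut", "Hi"), ("zyada", "Hi"), ("hai", "Hi")], "N"),
]


def features(tagged_tokens):
    """Return lowercased word unigrams; language tags are carried but unused here."""
    return [word.lower() for word, _lang in tagged_tokens]


def train(records):
    """Count labels and per-label word frequencies for multinomial Naive Bayes."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tagged_tokens, label in records:
        label_counts[label] += 1
        for word in features(tagged_tokens):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, tagged_tokens):
    """Return the label with the highest log-posterior under add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in features(tagged_tokens):
            if word in vocab:  # words unseen in training are skipped
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TOY_TWEETS)
print(predict(model, [("funny", "En"), ("joke", "En")]))
```

A real experiment would replace the toy list with the annotated corpus, hold out test data, and could add the language tags or character n-grams as further features; none of those choices are specified in the abstract.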

Original language: English (US)
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 1203-1207
Number of pages: 5
ISBN (Electronic): 9791095546009
State: Published - Jan 1 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: May 7 2018 to May 12 2018

Publication series

Name: LREC 2018 - 11th International Conference on Language Resources and Evaluation

Conference

Conference: 11th International Conference on Language Resources and Evaluation, LREC 2018
Country: Japan
City: Miyazaki
Period: 5/7/18 to 5/12/18

Keywords

  • Code-mixing
  • Extra tree classifier
  • Humor detection
  • Naive Bayes
  • Random forest classifier
  • SVM

Cite this

Khandelwal, A., Swami, S., Akthar, S. S., & Shrivastava, M. (2019). Humor detection in English-Hindi code-mixed social media content: Corpus and baseline system. In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, ... T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 1203-1207). (LREC 2018 - 11th International Conference on Language Resources and Evaluation). European Language Resources Association (ELRA).

@inproceedings{23dba05d55d44b2bbb6d0f084c233402,
title = "Humor detection in English-Hindi code-mixed social media content: Corpus and baseline system",
keywords = "Code-mixing, Extra tree classifier, Humor detection, Naive bayes, Random forest classifier, SVM",
author = "Ankush Khandelwal and Sahil Swami and Akthar, {Syed S.} and Manish Shrivastava",
year = "2019",
month = "1",
day = "1",
language = "English (US)",
series = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",
publisher = "European Language Resources Association (ELRA)",
pages = "1203--1207",
editor = "Hitoshi Isahara and Bente Maegaard and Stelios Piperidis and Christopher Cieri and Thierry Declerck and Koiti Hasida and Helene Mazo and Khalid Choukri and Sara Goggi and Joseph Mariani and Asuncion Moreno and Nicoletta Calzolari and Jan Odijk and Takenobu Tokunaga",
booktitle = "LREC 2018 - 11th International Conference on Language Resources and Evaluation",

}
