Generic Feature Selection Methodology to Named Entity Detection from Indian and European Languages

doi:10.4316/AECE.2019.01011

1/2019 - 11


MALARKODI, C. S. , DEVI, S. L.


Author keywords
classification, optimization, feature extraction, fuzzy logic, signal processing


About this article
Date of Publication: 2019-02-28
Volume 19, Issue 1, Year 2019, On page(s): 79 - 88
ISSN: 1582-7445, e-ISSN: 1844-7600
Digital Object Identifier: 10.4316/AECE.2019.01011
Web of Science Accession Number: 000459986900011
SCOPUS ID: 85064208532

Abstract


This paper describes the development of a language- and domain-independent Named Entity Recognition (NER) system that can identify named entities in any given dataset, irrespective of language and domain. The main novelty of the work is a generic feature selection methodology, which has been applied to 7 Indian languages and 5 European languages. Feature selection was carried out in two steps: first, a frequency-based approach was used; second, the k-means++ clustering algorithm was used to validate the patterns obtained from the frequency-based approach. The datasets used for the experiments belong to different genres. To the best of our knowledge, we are the first to develop a cross-lingual Named Entity (NE) system covering 12 languages that belong to different language families. We performed 10-fold cross-validation, analyzed the system output for all the languages, and discuss the causes of errors in the error analysis section. The performance of our system is also compared with that of existing systems.
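The abstract names k-means++ without giving details, so the following is only a minimal sketch of the standard k-means++ seeding step, not the authors' implementation. The function names and the toy 2-D vectors are hypothetical; the paper's actual feature representation is not stated here.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_seeds(points, k, rng):
    """Pick k initial centroids by k-means++ (D^2-weighted) sampling.

    Assumes `points` holds at least k distinct vectors; otherwise all
    weights can become zero and rng.choices raises ValueError.
    """
    # First seed: chosen uniformly at random.
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Weight of each point = squared distance to its nearest seed.
        weights = [min(squared_distance(p, s) for s in seeds) for p in points]
        # Next seed: sampled with probability proportional to that weight,
        # so points already chosen (weight 0) are never picked again.
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


# Hypothetical toy data: three loose groups of 2-D vectors.
toy_points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0),
              (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
initial_centroids = kmeans_pp_seeds(toy_points, 3, random.Random(0))
print(initial_centroids)
```

The seeds returned here would then initialise the usual k-means assignment and update loop; spreading them out in this way is what distinguishes k-means++ from plain random initialisation.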


Copyright ©2001-2024
Faculty of Electrical Engineering and Computer Science
Stefan cel Mare University of Suceava, Romania

