Tutorial: Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories

Half-day tutorial, Monday, September 14, 2015, 9:00 AM – 12:30 PM.

Aims, scope and learning objectives of the tutorial

This tutorial presents some of the solutions proposed for the author name disambiguation problem, one of the hardest problems faced by the digital library research community. By the end of the tutorial, we expect participants to understand the author name disambiguation problem and its challenges, and to be familiar with some of the proposed solutions.

Full description

This tutorial is divided into two parts. The first part is based on our survey, entitled “A Brief Survey of Automatic Methods for Author Name Disambiguation”, published in SIGMOD Record in June 2012 [Ferreira et al., 2012b]. In this introductory part, we contextualize the problem, present a formal definition of it and propose a general taxonomy for characterizing the automatic author name disambiguation methods proposed in the literature. Our taxonomy classifies the methods according to the main type of approach they exploit: author grouping, which tries to group the references to the same author using some type of similarity among reference attributes, or author assignment, which aims at directly assigning the references to their respective authors. Alternatively, the methods may be grouped according to the evidence explored in the disambiguation task: citation attributes, web information, or implicit data that can be extracted from the available information. At the end of the first part, we briefly describe some of the author name disambiguation methods according to our taxonomy and discuss open research challenges.
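To make the distinction between the two approach types concrete, the following is a minimal sketch and not any of the surveyed methods. It assumes each reference is a dictionary with a hypothetical "coauthors" field, uses Jaccard similarity over coauthor names as a stand-in for "some type of similarity among reference attributes", and treats the threshold value and the author profiles as illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of coauthor names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_references(refs, threshold=0.3):
    """Author grouping (illustrative): greedily cluster references whose
    coauthor lists are similar enough to a reference already in a cluster."""
    clusters = []
    for ref in refs:
        for cluster in clusters:
            if any(jaccard(ref["coauthors"], other["coauthors"]) >= threshold
                   for other in cluster):
                cluster.append(ref)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([ref])
    return clusters


def assign_reference(ref, profiles):
    """Author assignment (illustrative): pick the known author whose
    profile (a set of typical coauthors) best matches the reference."""
    return max(profiles, key=lambda author: jaccard(ref["coauthors"], profiles[author]))
```

Grouping needs no prior knowledge of the authors but must decide how many clusters to form, whereas assignment presupposes a profile for each candidate author, built here by hand and in practice typically learned from labeled examples.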

In the second part, we present some of our solutions for the problem. First, we present HHC – Heuristic-based Hierarchical Clustering [Cota et al., 2010]. HHC disambiguates a set of citation records by successively fusing clusters of citation records with similar author names based on a real-world heuristic applied to their citation attributes. Despite its simplicity, HHC reaches competitive results when compared with other methods described in the literature. Next, we present INDi – Incremental unsupervised Name Disambiguation [Carvalho et al., 2011]. INDi is an unsupervised incremental method that aims to disambiguate only the new ambiguous citation records inserted into a disambiguated digital library (DL), a strategy that avoids running the disambiguation process on the entire DL and losing any manual corrections that have been previously made. Finally, we present SAND – Self-training Associative Name Disambiguator [Ferreira et al., 2010; Ferreira et al., 2014]. SAND is a three-step self-training method for author name disambiguation that requires no manual labeling and no parameterization (in real-world scenarios), which makes it particularly suitable for the common situation in which only the most basic information in a citation record is available (i.e., author names, work titles and venue titles). We finish this second part by presenting SyGAR [Ferreira et al., 2012a], a tool for generating synthetic collections that allows the simulation of several realistic scenarios to support the evaluation of disambiguation methods.
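The idea of successively fusing clusters of citation records can be sketched as follows. This is an illustration under stated assumptions and not the HHC algorithm as published: the name-compatibility test (same last name and same first initial), the fusion heuristic (compatible names plus at least one shared coauthor) and the record fields "author" and "coauthors" are all hypothetical choices made for the example.

```python
def names_compatible(a, b):
    """Hypothetical name test: same last name and same first initial."""
    a_parts, b_parts = a.lower().split(), b.lower().split()
    return a_parts[-1] == b_parts[-1] and a_parts[0][0] == b_parts[0][0]


def should_fuse(c1, c2):
    """Assumed heuristic: compatible author names and a shared coauthor."""
    names_ok = any(names_compatible(x, y)
                   for x in c1["names"] for y in c2["names"])
    return names_ok and bool(c1["coauthors"] & c2["coauthors"])


def fuse_clusters(records):
    """Start with one cluster per record and keep fusing pairs of
    clusters until no pair satisfies the heuristic."""
    clusters = [{"ids": [r["id"]],
                 "names": {r["author"]},
                 "coauthors": set(r["coauthors"])} for r in records]
    changed = True
    while changed:
        changed = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if should_fuse(clusters[i], clusters[j]):
                    merged = clusters.pop(j)
                    clusters[i]["ids"] += merged["ids"]
                    clusters[i]["names"] |= merged["names"]
                    clusters[i]["coauthors"] |= merged["coauthors"]
                    changed = True
                    break
            if changed:
                break  # restart the scan with the updated clusters
    return clusters
```

Because a fused cluster accumulates the coauthors of all its records, two records that share no coauthor directly can still end up together through an intermediate record, which is the effect of fusing successively rather than comparing records only pairwise.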

Target audience

Digital library developers and practitioners interested in an overview of the author name disambiguation problem and its solutions. The level of this tutorial is introductory.

Presenters

Anderson A. Ferreira holds a BS degree in Computer Science from the Federal University of Viçosa, Brazil, and an MSc and a PhD degree in Computer Science from the Federal University of Minas Gerais, Brazil. As a PhD student, Anderson participated in several research projects developed at the UFMG Database Laboratory (LBD) and was a member of the Brazilian National Institute of Science and Technology for the Web – InWeb. He has published several articles in major conferences and journals in the digital libraries and databases areas, such as JCDL, SBBD/JIDM, JASIST, IP&M, Information Sciences, World Digital Libraries and SIGMOD Record. He is currently an Assistant Professor in the Computer Science Department of the Federal University of Ouro Preto, Brazil.

Marcos André Gonçalves received a BS degree in Computer Science from the Federal University of Ceará, Brazil (1995), an MSc in Computer Science from the State University of Campinas, Brazil (1997), and a PhD degree in Computer Science from the Virginia Polytechnic Institute and State University (2004). In 2005, he joined the Computer Science Department of the Federal University of Minas Gerais, where he is currently an Associate Professor. Dr. Gonçalves has been an active member of the Digital Library community, having regularly served as program committee member for JCDL and TPDL, as well as for other related conferences such as CIKM, WSDM, SIGIR and SPIRE. He is currently a Junior Member of the Brazilian Academy of Science. His main research areas are Digital Libraries, Information Retrieval, and Social Networks. Dr. Gonçalves is the coauthor of the books “Key Issues Regarding Digital Libraries: Evaluation and Integration” and “Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach”, both published by Morgan & Claypool Publishers.

Alberto H. F. Laender holds a BS degree in Electrical Engineering and an MSc degree in Computer Science, both from the Federal University of Minas Gerais, Brazil, and a PhD degree in Computing from the University of East Anglia, UK. He joined the Computer Science Department of the Federal University of Minas Gerais in 1975, where he is currently a Full Professor and the head of the Database Research Group. In 1997, he was a Visiting Scientist at HP Labs in Palo Alto, California. Prof. Laender has served as program committee member for several national and international conferences and workshops on databases and Web-related topics, such as VLDB, ER, CIKM, WWW, WIDM, WebDB, JCDL, TPDL, DocEng, EC-Web, SPIRE, ICDE, SIGIR, AMW, SBBD and WebMedia. He was also program committee chair of SPIRE 2002 and SBBD 2003, and program committee co-chair of WIDM 2003, WIDM 2004, CIKM 2007, AMW 2010 and WWW 2013 Workshops. Currently, he is a member of the Steering Committee of ER, CIKM and AMW. Prof. Laender is a member of the Brazilian Academy of Science and in 2010 was awarded the National Order of the Scientific Merit by the Brazilian President. He is the author of more than 150 refereed journal and conference papers, and his research interests include conceptual modeling and database design, web data management, web information systems and digital libraries.

References

  • Carvalho, A. P.; Ferreira, A. A.; Laender, A. H. F.; and Gonçalves, M. A. Incremental unsupervised name disambiguation in cleaned digital libraries. Journal of Information and Data Management, 2(3):289–304, 2011.
  • Cota, R. G.; Ferreira, A. A.; Gonçalves, M. A.; Laender, A. H. F.; and Nascimento, C. An unsupervised heuristic-based hierarchical method for name disambiguation in bibliographic citations. Journal of the American Society for Information Science and Technology, 61(9):1853–1870, 2010.
  • Fan, X.; Wang, J.; Pu, X.; Zhou, L.; and Lv, B. On graph-based name disambiguation. ACM Journal of Data and Information Quality, 2:10:1-10:23, 2011.
  • Esperidião, L. V. B.; Ferreira, A. A.; Laender, A. H. F.; Gonçalves, M. A.; Menotti, D.; Tavares, A. I.; and Assis, G. T. Reducing Fragmentation in Incremental Author Name Disambiguation. Journal of Information and Data Management, 5(3):293-307, 2014.
  • Ferreira, A. A.; Gonçalves, M. A.; Almeida, J. M.; Laender, A. H. F.; and Veloso, A. A tool for generating synthetic authorship records for evaluating author name disambiguation methods. Information Sciences, 206:42–62, 2012a.
  • Ferreira, A. A.; Gonçalves, M. A.; and Laender, A. H. F. A brief survey of automatic methods for author name disambiguation. SIGMOD Record, 41(2):15–26, 2012b.
  • Ferreira, A. A.; Gonçalves, M. A.; Almeida, J. M.; Laender, A. H. F.; and Veloso, A. SyGAR – A Synthetic Data Generator for Evaluating Name Disambiguation Methods. In Proceedings of the 13th European Conference on Digital Libraries, pages 437–441, Corfu, Greece, 2009.
  • Ferreira, A. A.; Gonçalves, M. A.; and Laender, A. H. F. Disambiguating Author Names in Large Bibliographic Repositories. In Proceedings of the International Conference on Digital Libraries, pages 170-180, New Delhi, India, 2013.
  • Ferreira, A. A.; Gonçalves, M. A.; and Laender, A. H. F. Disambiguating Author Names using Minimum Bibliographic Information. World Digital Libraries, 7(1):71-84, 2014.
  • Ferreira, A. A.; Machado, T. M.; and Gonçalves, M. A. Improving author name disambiguation with user relevance feedback. Journal of Information and Data Management, 3(3):332–347, 2012.
  • Ferreira, A. A.; Silva, R.; Gonçalves, M. A.; Veloso, A.; and Laender, A. H. F. Active associative sampling for author name disambiguation. In Proceedings of the 2012 ACM/IEEE Joint Conference on Digital Libraries, pages 175–184, Washington, DC, 2012.
  • Ferreira, A. A.; Veloso, A.; Gonçalves, M. A.; and Laender, A. H. F. Effective self-training author name disambiguation in scholarly digital libraries. In Proceedings of the 2010 ACM/IEEE Joint Conference on Digital Libraries, pages 39–48, Gold Coast, Queensland, Australia, 2010.
  • Ferreira, A. A.; Veloso, A.; Gonçalves, M. A.; and Laender, A. H. F. Self-training author name disambiguation for information scarce scenarios. Journal of the American Society for Information Science and Technology. 65(6):1257-1278, 2014.
  • Godoi, T. A.; Torres, R. S.; Carvalho, A. M. B. R.; Gonçalves, M. A.; Ferreira, A. A.; Fan, W.; and Fox, E. A. A Relevance Feedback Approach for the Author Name Disambiguation Problem. In Proceedings of the 2013 ACM/IEEE Joint Conference on Digital Libraries, pages 209-218, Indianapolis, Indiana, 2013.
  • Han, H.; Giles, C. L.; Zha, H.; Li, C.; and Tsioutsiouliklis, K. Two supervised learning approaches for name disambiguation in author citations. In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 296-305, Tucson, USA, 2004.
  • Han, H.; Xu, W.; Zha, H.; and Giles, C. L. A hierarchical naive Bayes mixture model for name disambiguation in author citations. In Proceedings of the 2005 ACM Symposium on Applied Computing, pages 1065-1069, Santa Fe, New Mexico, USA, 2005.
  • Han, H.; Zha, H.; and Giles, C. L. Name disambiguation in author citations using a k-way spectral clustering method. In Proceedings of the 5th ACM/IEEE Joint Conference on Digital Libraries, pages 334-343, Denver, CO, USA, 2005.
  • Huang, J.; Ertekin, S.; and Giles, C. L. Efficient name disambiguation for large scale databases. In Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 536-544, Berlin, Germany, 2006.
  • Laender, A. H. F.; Gonçalves, M. A.; Cota, R. G.; Ferreira, A. A.; Santos, R. L. T.; and Silva, A. J. C. Keeping a digital library clean: new solutions to old problems. In Proceedings of the ACM Symposium on Document Engineering, pages 257–262, 2008.
  • Levin, F. H.; and Heuser, C. A. Evaluating the use of social networks in author name disambiguation in digital libraries. Journal of Information and Data Management, 1(2):183-197, 2010.
  • Levin, M.; Krawzyk, S.; Bethard, S.; and Jurafsky, D. Citation-based bootstrapping for large-scale author disambiguation. Journal of the American Society for Information Science and Technology, 63(5):1030-1047, 2012.
  • Pereira, D. A.; Ribeiro-Neto, B. A.; Ziviani, N.; Laender, A. H. F.; Gonçalves, M. A.; and Ferreira, A. A. Using web information for author name disambiguation. In Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries, pages 49–58, Austin, TX, USA, 2009.
  • Santana, A. F.; Gonçalves, M. A.; Laender, A. H. F.; and Ferreira, A. A. Combining Domain-Specific Heuristics for Author Name Disambiguation. In Proceedings of the 2014 ACM/IEEE Joint Conference on Digital Libraries, pages 173-182, London, 2014.
  • Song, Y.; Huang, J.; Councill, I. G.; Li, J.; and Giles, C. L. Efficient topic-based unsupervised name disambiguation. In Proceedings of the 2007 ACM/IEEE Joint Conference on Digital Libraries, pages 342–351, 2007.
  • Tang, J.; Fong, A. C. M.; Wang, B.; and Zhang, J. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering, 24(6):975-987, 2012.
  • Torvik, V. I.; and Smalheiser, N. R. Author name disambiguation in MEDLINE. ACM Transactions on Knowledge Discovery from Data, 3(3):1-29, 2009.
  • Treeratpituk, P.; and Giles, C. L. Disambiguating authors in academic publications using random forests. In Proceedings of the 2009 ACM/IEEE Joint Conference on Digital Libraries, pages 39-48, Austin, TX, USA, 2009.
  • Veloso, A.; Ferreira, A. A.; Gonçalves, M. A.; Laender, A. H. F.; and Meira Jr., W. Cost-effective on-demand associative author name disambiguation. Information Processing & Management, 48(4):680–697, 2012.