
Past, Present, and Future of Face Recognition: A Review

1. Introduction

  • Natural character: The face is a natural biometric trait that humans themselves use to recognize individuals, making it arguably the most intuitive biometric for authentication and identification purposes [ 4 ]. For example, in access control, it is simple for administrators to monitor and evaluate approved persons after authentication using their facial characteristics. The support of ordinary staff (e.g., administrators) can boost the efficiency and applicability of recognition systems. In contrast, identifying fingerprints or irises requires an expert with professional competencies to provide accurate confirmation.
  • Nonintrusive: In contrast to fingerprint or iris images, facial images can be acquired quickly without physical contact, and people feel more relaxed when using the face as a biometric identifier. Moreover, a face recognition device can collect data in a friendly manner that people commonly accept [ 5 ].
  • Less cooperation: Face recognition requires less cooperation from the user than iris or fingerprint recognition. In some applications, such as surveillance, a face recognition system can identify an individual without the subject’s active involvement [ 5 ].
  • We provide an updated review of automated face recognition systems: the history, present, and future challenges.
  • We present 23 well-known face recognition datasets in addition to their assessment protocols.
  • We have reviewed and summarized nearly 180 scientific publications on facial recognition, and on the related problems of data acquisition and pre-processing, from 1990 to 2020. These publications are classified according to various approaches: holistic, geometric, local-texture, and deep learning, for both 2D and 3D facial recognition. We pay particular attention to methods based on deep learning, which are currently considered state-of-the-art in 2D face recognition.
  • We analyze and compare several deep learning methods according to the architecture implemented and their performance assessment metrics.
  • We study the performance of deep learning methods on the most commonly used datasets: (i) the Labeled Faces in the Wild (LFW) dataset [ 10 ] for 2D face recognition, and (ii) the Bosphorus and BU-3DFE datasets for 3D face recognition.
  • We discuss some new directions and future challenges for facial recognition technology, paying particular attention to 3D recognition.

2. Face Recognition History

  • 1964: The American researchers Bledsoe et al. [ 11 ] studied facial recognition computer programming. They devised a semi-automatic method in which operators entered twenty measurements into a computer, such as the size of the mouth or the eyes.
  • 1977: The system was improved by adding 21 additional markers (e.g., lip width, hair color).
  • 1988: Artificial intelligence was introduced to develop the previously used theoretical tools, which showed many weaknesses. Mathematics (“linear algebra”) was used to interpret images differently and to find a way to simplify and manipulate them independently of human markers.
  • 1991: Alex Pentland and Matthew Turk of the Massachusetts Institute of Technology (MIT) presented the first successful example of facial recognition technology, Eigenfaces [ 12 ], which uses the statistical method of principal component analysis (PCA).
  • 1998: To encourage industry and academia to move forward on this topic, the Defense Advanced Research Projects Agency (DARPA) developed the Face Recognition Technology (FERET) program [ 13 ], which provided the community with a sizable, challenging database composed of 2400 images of 850 persons.
  • 2005: The Face Recognition Grand Challenge (FRGC) [ 14 ] competition was launched to encourage and develop face recognition technology designed to support existing facial recognition initiatives.
  • 2011: Everything accelerated thanks to deep learning, a machine learning method based on artificial neural networks [ 9 ]. The computer selects the points to be compared: it learns better when it is supplied with more images.
  • 2014: Facebook learned to recognize faces with its internal algorithm, DeepFace [ 15 ]. The social network claimed that its method approaches human-level performance, with an accuracy close to 97%.
  • In its recent updates, Apple introduced a facial recognition feature, and its use has extended to retail and banking.
  • Mastercard developed Selfie Pay, a facial recognition framework for online transactions.
  • Since 2019, people in China who want to buy a new phone must consent to having their faces scanned by the operator.
  • In 2018, Chinese police used a smart monitoring system based on live facial recognition to arrest a suspect of “economic crime” at a concert: his face, listed in a national database, was identified in a crowd of 50,000 persons.

3. Face Recognition Systems

3.1. Main Steps in Face Recognition Systems

3.2. Assessment Protocols in Face Recognition

4. Available Datasets and Protocols for 2D Face Recognition

4.1. ORL Dataset

4.2. FERET Dataset

4.3. AR Dataset

4.4. XM2VTS Database

4.5. BANCA Dataset

4.6. FRGC Dataset

  • In experimental protocol 1 (Exp 1), two controlled still images of an individual are used: one for the gallery and the other for the probe.
  • In Exp 2, the four controlled images of a person are distributed between the gallery and the probe.
  • In Exp 4, a single controlled still image forms the gallery, and a single uncontrolled still image forms the probe.
  • Exps 3, 5, and 6 are designed for 3D images. (A minimal gallery/probe evaluation sketch follows this list.)
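The gallery/probe evaluations above reduce to a nearest-neighbor search in feature space. The following minimal sketch is not taken from the paper: the embeddings, their dimensionality, and the function name are hypothetical, and any feature extractor could produce them. It computes rank-1 identification accuracy for a gallery set and a probe set.

```python
# Hypothetical sketch: rank-1 identification over a gallery/probe split.
# Embeddings are assumed to come from any face feature extractor.
import numpy as np

def rank1_accuracy(gallery_feats, gallery_ids, probe_feats, probe_ids):
    """Fraction of probes whose nearest gallery entry (by cosine similarity)
    shares the probe's identity."""
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    p = probe_feats / np.linalg.norm(probe_feats, axis=1, keepdims=True)
    sims = p @ g.T                      # (num_probes, num_gallery) cosine scores
    nearest = np.argmax(sims, axis=1)   # best gallery match for each probe
    return np.mean(gallery_ids[nearest] == probe_ids)

# Toy usage with random 128-D embeddings
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 128))
probes = gallery + 0.01 * rng.normal(size=(5, 128))   # slightly perturbed copies
ids = np.arange(5)
print(rank1_accuracy(gallery, ids, probes, ids))      # expect 1.0 on this toy data
```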

4.7. LFW Database

4.8. CMU Multi-PIE Dataset

4.9. CASIA-WebFace Dataset

4.10. IARPA Janus Benchmark-A

4.11. MegaFace Database

4.12. CFP Dataset

4.13. MS-Celeb-1M Benchmark

4.14. DMFD Database

4.15. VGGFace Database

4.16. VGGFace2 Database

4.17. IARPA Janus Benchmark-B

4.18. MF2 Dataset

4.19. DFW Dataset

  • Impersonation protocol: used only to evaluate performance against impersonation attempts.
  • Obfuscation protocol: used in the case of disguises.
  • Overall performance protocol: used to evaluate any algorithm on the complete dataset.

4.20. IARPA Janus Benchmark-C

4.21. LFR Dataset

4.22. RMFRD and SMFRD: Masked Face Recognition Datasets

  • Masked face detection dataset (MFDD): it can be used to train an accurate masked face detection model.
  • Real-world masked face recognition dataset (RMFRD): it contains 5000 images of 525 persons wearing masks and 90,000 images of the same 525 individuals without masks, collected from the Internet ( Figure 17 ).
  • Simulated masked face recognition dataset (SMFRD): in addition, the proposers placed simulated masks on faces from standard large-scale facial datasets, such as the LFW [ 10 ] and CASIA-WebFace [ 30 ] datasets, thus expanding the volume and variety of masked facial recognition data. The SMFRD dataset covers 500,000 facial images of 10,000 persons and can be employed in practice alongside the original unmasked counterparts ( Figure 18 ).

5. Two-Dimensional Face Recognition Approaches

5.1. Holistic Methods

5.2. Geometric Approach

5.3. Local-Texture Approach

5.4. Deep Learning Approach

5.4.1. Introduction to Deep Learning

  • Unsupervised or generative (autoencoder (AE) [ 99 ], Boltzmann machine (BM) [ 100 ], recurrent neural network (RNN) [ 101 ], and sum-product network (SPN) [ 102 ]);
  • Supervised or discriminative (convolutional neural network (CNN));
  • Hybrid (deep neural network (DNN) [ 97 , 103 ]).

5.4.2. Convolutional Neural Networks (CNNs)

  • Convolutional layer: This is the CNN’s core building block that aims at extracting features from the input data. Each layer uses a convolution operation to obtain a feature map. After that, the activation or feature maps are fed to the next layer as input data [ 9 ].
  • Pooling layer: This is a form of non-linear down-sampling [ 104 , 105 ] that reduces the dimensionality of the feature map while retaining the crucial information. Among the various non-linear pooling functions, max-pooling is the most efficient and is superior to sub-sampling [ 106 ].
  • Rectified linear unit (ReLU) layer: This is a non-linear operation that applies the rectifier function f(x) = max(0, x) element-wise to its input.
  • Fully connected layer (FC): The high-level reasoning in the neural network is done via fully connected layers, after applying various convolutional and max-pooling layers [ 107 ]. (A minimal sketch combining these layers follows this list.)
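For concreteness, here is a minimal sketch that chains the four building blocks above (convolution, ReLU, max-pooling, and fully connected layers). It assumes PyTorch; the input resolution, layer sizes, and number of identities are illustrative choices, not an architecture taken from the papers reviewed.

```python
# Illustrative only: a tiny CNN built from the four layer types described above.
import torch
import torch.nn as nn

class TinyFaceCNN(nn.Module):
    def __init__(self, num_identities=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # convolution -> feature maps
            nn.ReLU(),                                    # non-linear activation
            nn.MaxPool2d(2),                              # down-sampling (max-pooling)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 28 * 28, 256),   # fully connected layers perform the
            nn.ReLU(),                      # high-level reasoning
            nn.Linear(256, num_identities), # one logit per identity (softmax applied in the loss)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# A 112x112 RGB face crop is halved twice by the pooling layers (112 -> 56 -> 28).
logits = TinyFaceCNN()(torch.randn(1, 3, 112, 112))
print(logits.shape)  # torch.Size([1, 100])
```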

5.4.3. Popular CNN Architectures

5.4.4. Deep CNN-Based Methods for Face Recognition

Investigations Based on AlexNet Architecture

Investigations Based on VGGNet Architecture

Investigations Based on GoogLeNet Architecture

Investigations Based on LeNet Architecture

Investigations Based on ResNet Architecture

6. Three-Dimensional Face Recognition

6.1. Factual Background and Acquisition Systems

6.1.1. Introduction to 3D Face Recognition

6.1.2. Microsoft Kinect Technology

6.2. Methods and Datasets

6.2.1. Challenges of 3D Facial Recognition

6.2.2. Traditional Methods of Machine Learning

  • Traditional methods of machine learning
  • Deep learning-based methods.

6.2.3. Deep Learning-Based Methods

6.2.4. Three-Dimensional Face Recognition Databases

7. Open Challenges

7.1. Face Recognition and Occlusion

7.2. Heterogeneous Face Recognition

7.3. Face Recognition and Ageing

7.4. Single Sample Face Recognition

  • In real-world applications (e.g., passports, immigration systems), only one sample of each individual is registered in the database and accessible for the recognition task [ 174 ].
  • Pattern recognition systems require vast amounts of training data to ensure the generalization of the learning systems.
  • Deep learning-based approaches are considered powerful techniques for face recognition. Nonetheless, they need a significant amount of training data to perform well [ 9 ].

7.5. Face Recognition in Video Surveillance

7.6. Face Recognition and Soft Biometrics

7.7. Face Recognition and Smartphones

7.8. Face Recognition and Internet of Things (IoT)

8. Conclusions

Author Contributions

Conflicts of Interest

  • Kortli, Y.; Jridi, M.; Al Falou, A.; Atri, M. A Review of Face Recognition Methods. Sensors 2020 , 20 , 342. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • O’Toole, A.J.; Roark, D.A.; Abdi, H. Recognizing moving faces: A psychological and neural synthesis. Trends Cogn. Sci. 2002 , 6 , 261–266. [ Google Scholar ] [ CrossRef ]
  • Dantcheva, A.; Chen, C.; Ross, A. Can facial cosmetics affect the matching accuracy of face recognition systems? In Proceedings of the 2012 IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington, VA, USA, 23–27 September 2012; pp. 391–398. [ Google Scholar ]
  • Sinha, P.; Balas, B.; Ostrovsky, Y.; Russell, R. Face recognition by humans: Nineteen results all computer vision researchers should know about. Proc. IEEE 2006 , 94 , 1948–1962. [ Google Scholar ] [ CrossRef ]
  • Ouamane, A.; Benakcha, A.; Belahcene, M.; Taleb-Ahmed, A. Multimodal depth and intensity face verification approach using LBP, SLF, BSIF, and LPQ local features fusion. Pattern Recognit. Image Anal. 2015 , 25 , 603–620. [ Google Scholar ] [ CrossRef ]
  • Porter, G.; Doran, G. An anatomical and photographic technique for forensic facial identification. Forensic Sci. Int. 2000 , 114 , 97–105. [ Google Scholar ] [ CrossRef ]
  • Li, S.Z.; Jain, A.K. Handbook of Face Recognition , 2nd ed.; Springer Publishing Company: New York, NY, USA, 2011. [ Google Scholar ]
  • Morder-Intelligence. Available online: https://www.mordorintelligence.com/industry-reports/facial-recognition-market (accessed on 21 July 2020).
  • Guo, G.; Zhang, N. A survey on deep learning based face recognition. Comput. Vis. Image Underst. 2019 , 189 , 10285. [ Google Scholar ] [ CrossRef ]
  • Huang, G.B.; Mattar, M.; Berg, T.; Learned-Miller, E. Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments ; Technical Report; University of Massachusetts: Amherst, MA, USA, 2007; pp. 7–49. [ Google Scholar ]
  • Bledsoe, W.W. The Model Method in Facial Recognition ; Technical Report; Panoramic Research, Inc.: Palo Alto, CA, USA, 1964. [ Google Scholar ]
  • Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991 , 3 , 71–86. [ Google Scholar ] [ CrossRef ]
  • Phillips, P.J.; Wechsler, H.; Huang, J.; Rauss, P. The FERET database and evaluation procedure for face recognition algorithms. Image Vis. Comput. 1998 , 16 , 295–306. [ Google Scholar ] [ CrossRef ]
  • Phillips, P.J.; Flynn, P.J.; Scruggs, T.; Bowyer, K.W.; Chang, J.; Hoffman, K.; Marques, J.; Min, J.; Worek, W. Overview of the face recognition grand challenge. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; pp. 947–954. [ Google Scholar ]
  • Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. Deepface: Closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708. [ Google Scholar ]
  • Chihaoui, M.; Elkefi, A.; Bellil, W.; Ben Amar, C. A Survey of 2D Face Recognition Techniques. Computers 2016 , 5 , 21. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Benzaoui, A.; Bourouba, H.; Boukrouche, A. System for automatic faces detection. In Proceedings of the 2012 3rd International Conference on Image Processing, Theory, Tools and Applications (IPTA), Istanbul, Turkey, 15–18 October 2012; pp. 354–358. [ Google Scholar ]
  • Martinez, A.M. Recognizing imprecisely localized, partially occluded and expression variant faces from a single sample per class. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 2002 , 24 , 748–763. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Sidahmed, S.; Messali, Z.; Ouahabi, A.; Trépout, S.; Messaoudi, C.; Marco, S. Nonparametric denoising methods based on contourlet transform with sharp frequency localization: Application to electron microscopy images with low exposure time. Entropy 2015 , 17 , 2781–2799. [ Google Scholar ]
  • Ouahabi, A. Image Denoising using Wavelets: Application in Medical Imaging. In Advances in Heuristic Signal Processing and Applications ; Chatterjee, A., Nobahari, H., Siarry, P., Eds.; Springer: Basel, Switzerland, 2013; pp. 287–313. [ Google Scholar ]
  • Ouahabi, A. A review of wavelet denoising in medical imaging. In Proceedings of the International Workshop on Systems, Signal Processing and Their Applications (IEEE/WOSSPA’13), Algiers, Algeria, 12–15 May 2013; pp. 19–26. [ Google Scholar ]
  • Nakanishi, A.Y.J.; Western, B.J. Advancing the State-of-the-Art in Transportation Security Identification and Verification Technologies: Biometric and Multibiometric Systems. In Proceedings of the 2007 IEEE Intelligent Transportation Systems Conference, Seattle, WA, USA, 30 September–3 October 2007; pp. 1004–1009. [ Google Scholar ]
  • Samaria, F.S.; Harter, A.C. Parameterization of a Stochastic Model for Human Face Identification. In Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [ Google Scholar ]
  • Martinez, A.M.; Benavente, R. The AR face database. CVC Tech. Rep. 1998 , 24 , 1–10. [ Google Scholar ]
  • Messer, K.; Matas, J.; Kittler, J.; Jonsson, K. XM2VTSDB: The extended M2VTS database. In Proceedings of the 1999 2nd International Conference on Audio and Video-based Biometric Person Authentication (AVBPA), Washington, DC, USA, 22–24 March 1999; pp. 72–77. [ Google Scholar ]
  • Bailliére, E.A.; Bengio, S.; Bimbot, F.; Hamouz, M.; Kittler, J.; Mariéthoz, J.; Matas, J.; Messer, K.; Popovici, V.; Porée, F.; et al. The BANCA Database and Evaluation Protocol. In Proceedings of the 2003 International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), Guildford, UK, 9–11 June 2003; pp. 625–638. [ Google Scholar ]
  • Huang, G.B.; Jain, V.; Miller, E.L. Unsupervised joint alignment of complex images. In Proceedings of the 2007 IEEE International Conference on Computer Vision (ICCV), Rio de Janeiro, Brazil, 14–20 October 2007; pp. 1–8. [ Google Scholar ]
  • Huang, G.; Mattar, M.; Lee, H.; Miller, E.G.L. Learning to align from scratch. Adv. Neural Inf. Process. Syst. 2012 , 25 , 764–772. [ Google Scholar ]
  • Gross, R.; Matthews, L.; Cohn, J.; Kanade, T.; Baker, S. Multi-PIE. Image Vis. Comput. 2010 , 28 , 807–813. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • CASIA Web Face. Available online: http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html (accessed on 21 July 2019).
  • Klare, B.F.; Klein, B.; Taborsky, E.; Blanton, A.; Cheney, J.; Allen, K.; Grother, P.; Mah, A.; Burge, M.; Jain, A.K. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1931–1939. [ Google Scholar ]
  • Shlizerman, I.K.; Seitz, S.M.; Miller, D.; Brossard, E. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4873–4882. [ Google Scholar ]
  • Shlizerman, I.K.; Suwajanakorn, S.; Seitz, S.M. Illumination-aware age progression. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 3334–3341. [ Google Scholar ]
  • Ng, H.W.; Winkler, S. A data-driven approach to cleaning large face datasets. In Proceedings of the 2014 IEEE International Conference on Image Processing (ICIP), Paris, France, 27–30 October 2014; pp. 343–347. [ Google Scholar ]
  • Sengupta, S.; Cheng, J.; Castillo, C.; Patel, V.M.; Chellappa, R.; Jacobs, D.W. Frontal to Profile Face Verification in the Wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–9. [ Google Scholar ]
  • Guo, Y.; Zhang, L.; Hu, Y.; He, X.; Gao, J. Ms-Celeb-1m: A dataset and benchmark for large-scale face recognition. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016. [ Google Scholar ]
  • Wang, T.Y.; Kumar, A. Recognizing Human Faces under Disguise and Makeup. In Proceedings of the 2016 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA), Sendai, Japan, 29 February–2 March 2016; pp. 1–7. [ Google Scholar ]
  • Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep Face Recognition. In Proceedings of the 2015 British Machine Vision Conference, Swansea, UK, 7–10 September 2015; pp. 41.1–41.12. [ Google Scholar ]
  • Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A dataset for recognizing faces across pose and age. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), Xi’an, China, 15–19 May 2018; pp. 67–74. [ Google Scholar ]
  • Whitelam, C.; Taborsky, E.; Blanton, A.; Maze, B.; Adams, J.; Miller, T.; Kalka, N.; Jain, A.K.; Duncan, J.A.; Allen, K. IARPA Janus Benchmark-B face dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 592–600. [ Google Scholar ]
  • Nech, A.; Shlizerman, I.K. Level playing field for million scale face recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3406–3415. [ Google Scholar ]
  • Kushwaha, V.; Singh, M.; Singh, R.; Vatsa, M. Disguised Faces in the Wild. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1–18. [ Google Scholar ]
  • Maze, B.; Adams, J.; Duncan, J.A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A.K.; Niggel, W.T.; Anderson, J.; Cheney, J.; et al. IARPA Janus benchmark-C: Face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast, QLD, Australia, 20–23 February 2018; pp. 158–165. [ Google Scholar ]
  • Elharrouss, O.; Almaadeed, N.; Al-Maadeed, S. LFR face dataset: Left-Front-Right dataset for pose-invariant face recognition in the wild. In Proceedings of the 2020 IEEE International Conference on Informatics, IoT, and Enabling Technologies (ICIoT), Doha, Qatar, 2–5 February 2020; pp. 124–130. [ Google Scholar ]
  • Wang, Z.; Wang, G.; Huang, B.; Xiong, Z.; Hong, Q.; Wu, H.; Yi, P.; Jiang, K.; Wang, N.; Pei, Y.; et al. Masked Face Recognition Dataset and Application. arXiv 2020 , arXiv:2003.09093v2. [ Google Scholar ]
  • Belhumeur, P.N.; Hespanha, J.P.; Kriegman, D.J. Eigenfaces vs Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 1997 , 19 , 711–720. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Stone, J.V. Independent component analysis: An introduction. Trends Cogn. Sci. 2002 , 6 , 59–64. [ Google Scholar ] [ CrossRef ]
  • Sirovich, L.; Kirby, M. Low-Dimensional procedure for the characterization of human faces. J. Opt. Soc. Am. 1987 , 4 , 519–524. [ Google Scholar ] [ CrossRef ]
  • Kirby, M.; Sirovich, L. Application of the Karhunen-Loève procedure for the characterization of human faces. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 1990 , 12 , 831–835. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Femmam, S.; M’Sirdi, N.K.; Ouahabi, A. Perception and characterization of materials using signal processing techniques. IEEE Trans. Instrum. Meas. 2001 , 50 , 1203–1211. [ Google Scholar ] [ CrossRef ]
  • Zhao, L.; Yang, Y.H. Theoretical analysis of illumination in PCA-based vision systems. Pattern Recognit. 1999 , 32 , 547–564. [ Google Scholar ] [ CrossRef ]
  • Pentland, A.; Moghaddam, B.; Starner, T. View-Based and modular eigenspaces for face recognition. In Proceedings of the 1994 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 21–23 June 1994; pp. 84–91. [ Google Scholar ]
  • Bartlett, M.; Movellan, J.; Sejnowski, T. Face Recognition by Independent Component Analysis. IEEE Trans. Neural Netw. 2002 , 13 , 1450–1464. [ Google Scholar ] [ CrossRef ]
  • Abhishree, T.M.; Latha, J.; Manikantan, K.; Ramachandran, S. Face recognition using Gabor Filter based feature extraction with anisotropic diffusion as a pre-processing technique. Procedia Comput. Sci. 2015 , 45 , 312–321. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Zehani, S.; Ouahabi, A.; Oussalah, M.; Mimi, M.; Taleb-Ahmed, A. Trabecular bone microarchitecture characterization based on fractal model in spatial frequency domain imaging. Int. J. Imaging Syst. Technol. accepted.
  • Ouahabi, A. Signal and Image Multiresolution Analysis , 1st ed.; ISTE-Wiley: London, UK, 2012. [ Google Scholar ]
  • Guetbi, C.; Kouame, D.; Ouahabi, A.; Chemla, J.P. Methods based on wavelets for time delay estimation of ultrasound signals. In Proceedings of the 1998 IEEE International Conference on Electronics, Circuits and Systems, Lisbon, Portugal, 7–10 September 1998; pp. 113–116. [ Google Scholar ]
  • Ferroukhi, M.; Ouahabi, A.; Attari, M.; Habchi, Y.; Taleb-Ahmed, A. Medical video coding based on 2nd-generation wavelets: Performance evaluation. Electronics 2019 , 8 , 88. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Wang, M.; Jiang, H.; Li, Y. Face recognition based on DWT/DCT and SVM. In Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM), Taiyuan, China, 22–24 October 2010; pp. 507–510. [ Google Scholar ]
  • Bookstein, F.L. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 1989 , 11 , 567–585. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Shih, F.Y.; Chuang, C. Automatic extraction of head and face boundaries and facial features. Inf. Sci. 2004 , 158 , 117–130. [ Google Scholar ] [ CrossRef ]
  • Zobel, M.; Gebhard, A.; Paulus, D.; Denzler, J.; Niemann, H. Robust facial feature localization by coupled features. In Proceedings of the 2000 4th IEEE International Conference on Automatic Face and Gesture Recognition (FG), Grenoble, France, 26–30 March 2000; pp. 2–7. [ Google Scholar ]
  • Wiskott, L.; Fellous, J.M.; Malsburg, C.V.D. Face recognition by elastic bunch graph matching. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 1997 , 19 , 775–779. [ Google Scholar ] [ CrossRef ]
  • Xue, Z.; Li, S.Z.; Teoh, E.K. Bayesian shape model for facial feature extraction and recognition. Pattern Recognit. 2003 , 36 , 2819–2833. [ Google Scholar ] [ CrossRef ]
  • Tistarelli, M. Active/space-variant object recognition. Image Vis. Comput. 1995 , 13 , 215–226. [ Google Scholar ] [ CrossRef ]
  • Lades, M.; Vorbuggen, J.C.; Buhmann, J.; Lange, J.; Malsburg, C.V.D.; Wurtz, R.P.; Konen, W. Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Comput. 1993 , 42 , 300–311. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Wiskott, L. Phantom faces for face analysis. Pattern Recognit. 1997 , 30 , 837–846. [ Google Scholar ] [ CrossRef ]
  • Duc, B.; Fischer, S.; Bigun, J. Face authentication with Gabor information on deformable graphs. IEEE Trans. Image Process. 1999 , 8 , 504–516. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Kotropoulos, C.; Tefas, A.; Pitas, I. Frontal face authentication using morphological elastic graph matching. IEEE Trans. Image Process. 2000 , 9 , 555–560. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Jackway, P.T.; Deriche, M. Scale-space properties of the multiscale morphological dilation-erosion. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 1996 , 18 , 38–51. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Tefas, A.; Kotropoulos, C.; Pitas, I. Face verification using elastic graph matching based on morphological signal decomposition. Signal Process. 2002 , 82 , 833–851. [ Google Scholar ] [ CrossRef ]
  • Kumar, D.; Garaina, J.; Kisku, D.R.; Sing, J.K.; Gupta, P. Unconstrained and Constrained Face Recognition Using Dense Local Descriptor with Ensemble Framework. Neurocomputing 2020 . [ Google Scholar ] [ CrossRef ]
  • Zehani, S.; Ouahabi, A.; Mimi, M.; Taleb-Ahmed, A. Statistical features extraction in wavelet domain for texture classification. In Proceedings of the 2019 6th International Conference on Image and Signal Processing and their Applications (IEEE/ISPA), Mostaganem, Algeria, 24–25 November 2019; pp. 1–5. [ Google Scholar ]
  • Ait Aouit, D.; Ouahabi, A. Nonlinear Fracture Signal Analysis Using Multifractal Approach Combined with Wavelet. Fractals Complex Geom. Patterns Scaling Nat. Soc. 2011 , 19 , 175–183. [ Google Scholar ] [ CrossRef ]
  • Girault, J.M.; Kouame, D.; Ouahabi, A. Analytical formulation of the fractal dimension of filtered stochastic signal. Signal Process. 2010 , 90 , 2690–2697. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Djeddi, M.; Ouahabi, A.; Batatia, H.; Basarab, A.; Kouamé, D. Discrete wavelet transform for multifractal texture classification: Application to ultrasound imaging. In Proceedings of the IEEE International Conference on Image Processing (IEEE ICIP2010), Hong Kong, China, 26–29 September 2010; pp. 637–640. [ Google Scholar ]
  • Ouahabi, A. Multifractal analysis for texture characterization: A new approach based on DWT. In Proceedings of the 10th International Conference on Information Science, Signal Processing and Their Applications (IEEE/ISSPA), Kuala Lumpur, Malaysia, 10–13 May 2010; pp. 698–703. [ Google Scholar ]
  • Davies, E.R. Introduction to texture analysis. In Handbook of Texture Analysis ; Mirmehdi, M., Xie, X., Suri, J., Eds.; Imperial College Press: London, UK, 2008; pp. 1–31. [ Google Scholar ]
  • Benzaoui, A.; Hadid, A.; Boukrouche, A. Ear biometric recognition using local texture descriptors. J. Electron. Imaging 2014 , 23 , 053008. [ Google Scholar ] [ CrossRef ]
  • Ahonen, T.; Hadid, A.; Pietikäinen, M. Face recognition with local binary patterns. In Proceedings of the 8th European Conference on Computer Vision (ECCV), Prague, Czech Republic, 11–14 May 2004; pp. 469–481. [ Google Scholar ]
  • Ahonen, T.; Hadid, A.; Pietikäinen, M. Face description with local binary patterns: Application to face recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2006 , 28 , 2037–2041. [ Google Scholar ] [ CrossRef ]
  • Beveridge, J.R.; Bolme, D.; Draper, B.A.; Teixeira, M. The CSU face identification evaluation system: Its purpose, features, and structure. Mach. Vis. Appl. 2005 , 16 , 128–138. [ Google Scholar ] [ CrossRef ]
  • Moghaddam, B.; Nastar, C.; Pentland, A. A bayesian similarity measure for direct image matching. In Proceedings of the 13th International Conference on Pattern Recognition (ICPR), Vienna, Austria, 25–29 August 1996; pp. 350–358. [ Google Scholar ]
  • Rodriguez, Y.; Marcel, S. Face authentication using adapted local binary pattern histograms. In Proceedings of the 9th European Conference on Computer Vision (ECCV), Graz, Austria, 7–13 May 2006; pp. 321–332. [ Google Scholar ]
  • Sadeghi, M.; Kittler, J.; Kostin, A.; Messer, K. A comparative study of automatic face verification algorithms on the banca database. In Proceedings of the 4th International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA), Guilford, UK, 9–11 June 2003; pp. 35–43. [ Google Scholar ]
  • Huang, X.; Li, S.Z.; Wang, Y. Jensen-shannon boosting learning for object recognition. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; pp. 144–149. [ Google Scholar ]
  • Boutella, E.; Harizi, F.; Bengherabi, M.; Ait-Aoudia, S.; Hadid, A. Face verification using local binary patterns and generic model adaptation. Int. J. Biomed. 2015 , 7 , 31–44. [ Google Scholar ] [ CrossRef ]
  • Benzaoui, A.; Boukrouche, A. 1DLBP and PCA for face recognition. In Proceedings of the 2013 11th International Symposium on Programming and Systems (ISPS), Algiers, Algeria, 22–24 April 2013; pp. 7–11. [ Google Scholar ]
  • Benzaoui, A.; Boukrouche, A. Face Recognition using 1DLBP Texture Analysis. In Proceedings of the 5th International Conference of Future Computational Technologies and Applications, Valencia, Spain, 27 May–1 June 2013; pp. 14–19. [ Google Scholar ]
  • Benzaoui, A.; Boukrouche, A. Face Analysis, Description, and Recognition using Improved Local Binary Patterns in One Dimensional Space. J. Control Eng. Appl. Inform. (CEAI) 2014 , 16 , 52–60. [ Google Scholar ]
  • Ahonen, T.; Rathu, E.; Ojansivu, V.; Heikkilä, J. Recognition of Blurred Faces Using Local Phase Quantization. In Proceedings of the 19th International Conference on Pattern Recognition (ICPR), Tampa, FL, USA, 8–11 December 2008; pp. 1–4. [ Google Scholar ]
  • Ojansivu, V.; Heikkil, J. Blur insensitive texture classification using local phase quantization. In Proceedings of the 3rd International Conference on Image and Signal Processing (ICSIP), Cherbourg-Octeville, France, 1–3 July 2008; pp. 236–243. [ Google Scholar ]
  • Tan, X.; Triggs, B. Enhanced local texture feature sets for face recognition under difficult lighting conditions. In Proceedings of the 3rd International Workshop on Analysis and Modeling of Faces and Gestures (AMFG), Rio de Janeiro, Brazil, 20 October 2007; pp. 168–182. [ Google Scholar ]
  • Lei, Z.; Ahonen, T.; Pietikainen, M.; Li, S.Z. Local Frequency Descriptor for Low-Resolution Face Recognition. In Proceedings of the 9th Conference on Automatic Face and Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011; pp. 161–166. [ Google Scholar ]
  • Kannala, J.; Rahtu, E. BSIF: Binarized statistical image features. In Proceedings of the 21th International Conference on Pattern Recognition (ICPR), Tsukuba, Japan, 11–15 November 2012; pp. 1363–1366. [ Google Scholar ]
  • Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Netw. 2015 , 61 , 85–117. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Deng, L. A tutorial survey of architectures, algorithms, and applications for deep learning. APSIPA Trans. Signal Inf. Process. 2014 , 3 , 1–29. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Deng, L.; Yu, D. Deep Learning: Methods and Applications. Found. Trends Signal Process. 2014 , 7 , 197–387. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010 , 11 , 3371–3408. [ Google Scholar ]
  • Salakhutdinov, R.; Hinton, G. Deep Boltzmann machines. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater, FL, USA, 16–19 April 2009; pp. 448–455. [ Google Scholar ]
  • Sutskever, I.; Martens, J.; Hinton, G. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 1017–1024. [ Google Scholar ]
  • Poon, H.; Domingos, P. Sum-product networks: A new deep architecture. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 689–690. [ Google Scholar ]
  • Kimb, K.; Aminantoa, M.E. Deep Learning in Intrusion Detection Perspective: Overview and further Challenges. In Proceedings of the International Workshop on Big Data and Information Security (IWBIS), Jakarta, Indonesia, 23–24 September 2017; pp. 5–10. [ Google Scholar ]
  • Ouahabi, A. Analyse spectrale paramétrique de signaux lacunaires. Traitement Signal 1992 , 9 , 181–191. [ Google Scholar ]
  • Ouahabi, A.; Lacoume, J.-L. New results in spectral estimation of decimated processes. IEEE Electron. Lett. 1991 , 27 , 1430–1432. [ Google Scholar ] [ CrossRef ]
  • Scherer, D.; Müller, A.; Behnke, S. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the 2010 International Conference on Artificial Neural Networks, Thessaloniki, Greece, 15–18 September 2010; pp. 92–101. [ Google Scholar ]
  • Coşkun, M.; Uçar, A.; Yildirim, Ö.; Demir, Y. Face recognition based on convolutional neural network. In Proceedings of the 2017 International Conference on Modern Electrical and Energy Systems (MEES), Kremenchuk, Ukraine, 15–17 November 2017; pp. 376–379. [ Google Scholar ]
  • Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998 , 86 , 2278–2324. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015 , 115 , 211–252. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [ Google Scholar ]
  • Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014. [ Google Scholar ]
  • Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [ Google Scholar ]
  • He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [ Google Scholar ]
  • Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 2019 , 42 , 7132–7141. [ Google Scholar ]
  • Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–26 June 2005; pp. 539–546. [ Google Scholar ]
  • Sun, Y.; Wang, X.; Tang, X. Deep learning face representation from predicting 10,000 classes. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1891–1898. [ Google Scholar ]
  • Sun, Y.; Chen, Y.; Wang, X.; Tang, X. Deep learning face representation by joint identification-verification. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1988–1996. [ Google Scholar ]
  • Sun, Y.; Wang, X.; Tang, X. Deeply learned face representations are sparse, selective, and robust. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2892–2900. [ Google Scholar ]
  • Sun, Y.; Liang, D.; Wang, X.; Tang, X. DeepID3: Face Recognition with Very Deep Neural Networks. arXiv 2015 , arXiv:1502.00873v1. [ Google Scholar ]
  • Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. Web-Scale training for face identification. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 2746–2754. [ Google Scholar ]
  • Ouahabi, A.; Depollier, C.; Simon, L.; Kouame, D. Spectrum estimation from randomly sampled velocity data [LDV]. IEEE Trans. Instrum. Meas. 1998 , 47 , 1005–1012. [ Google Scholar ] [ CrossRef ]
  • Liu, J.; Deng, Y.; Bai, T.; Huang, C. Targeting ultimate accuracy: Face recognition via deep embedding. arXiv 2015 , arXiv:1506.07310v4. [ Google Scholar ]
  • Masi, I.; Tran, A.T.; Hassner, T.; Leksut, J.T.; Medioni, G. Do we really need to collect millions of faces for effective face recognition? In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherland, 8–16 October 2016; pp. 579–596. [ Google Scholar ]
  • Zhang, X.; Fang, Z.; Wen, Y.; Li, Z.; Qiao, Y. Range loss for deep face recognition with Long-Tailed Training Data. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5419–5428. [ Google Scholar ]
  • Liu, W.; Wen, Y.; Yu, Z.; Yang, M. Large-margin softmax loss for convolutional neural networks. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 507–516. [ Google Scholar ]
  • Chen, B.; Deng, W.; Du, J. Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4021–4030. [ Google Scholar ]
  • Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823. [ Google Scholar ]
  • Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. arXiv 2013 , arXiv:1311.2901v3. [ Google Scholar ]
  • Ben Fredj, H.; Bouguezzi, S.; Souani, C. Face recognition in unconstrained environment with CNN. Vis. Comput. 2020 , 1–10. [ Google Scholar ] [ CrossRef ]
  • Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 499–515. [ Google Scholar ]
  • Wu, Y.; Liu, H.; Li, J.; Fu, Y. Deep Face Recognition with Center Invariant Loss. In Proceedings of the Thematic Workshop of ACM Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 408–414. [ Google Scholar ]
  • Yin, X.; Yu, X.; Sohn, K.; Liu, X.; Chandraker, M. Feature Transfer Learning for Face Recognition with Under-Represented Data. In Proceedings of the 2019 International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [ Google Scholar ]
  • Ranjan, R.; Castillo, C.D.; Chellappa, R. L2-constrained softmax loss for discriminative face verification. arXiv 2017 , arXiv:1703.09507v3. [ Google Scholar ]
  • Deng, J.; Zhou, Y.; Zafeiriou, S. Marginal Loss for Deep Face Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 2006–2014. [ Google Scholar ]
  • Wang, F.; Xiang, X.; Cheng, J.; Yuille, A.L. NormFace: L2 Hypersphere Embedding for Face Verification. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1041–1049. [ Google Scholar ]
  • Liu, Y.; Li, H.; Wang, X. Rethinking Feature Discrimination and Polymerization for Large-Scale Recognition. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), (Deep Learning Workshop), Long Beach, CA, USA, 4–9 December 2017. [ Google Scholar ]
  • Hasnat, M.; Bohné, J.; Milgram, J.; Gentric, S.; Chen, L. Von Mises-Fisher Mixture Model-based Deep Learning: Application to Face Verification. arXiv 2017 , arXiv:1706.04264v2. [ Google Scholar ]
  • Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6738–6746. [ Google Scholar ]
  • Zheng, Y.; Pal, D.K.; Savvides, M. Ring Loss: Convex Feature Normalization for Face Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5089–5097. [ Google Scholar ]
  • Guo, Y.; Zhang, L. One-Shot Face Recognition by Promoting Underrepresented Classes. arXiv 2018 , arXiv:1707.05574v2. [ Google Scholar ]
  • Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Liu, W. CosFace: Large Margin Cosine Loss for Deep Face Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5265–5274. [ Google Scholar ]
  • Wang, F.; Cheng, J.; Liu, W.; Liu, H. Additive Margin Softmax for Face Verification. IEEE Signal Process. Lett. 2018 , 25 , 926–930. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Wu, X.; He, R.; Sun, Z.; Tan, T. A Light CNN for Deep Face Representation with Noisy Labels. IEEE Trans. Inf. Forensics Secur. 2018 , 13 , 2884–2896. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Hayat, M.; Khan, S.H.; Zamir, W.; Shen, J.; Shao, L. Gaussian Affinity for Max-margin Class Imbalanced Learning. In Proceedings of the 2019 International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [ Google Scholar ]
  • Deng, J.; Guo, J.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 International Conference on Computer Vision and Pattern Recognition (CVPR), Lone Beach, CA, USA, 16–20 June 2019; pp. 4690–4699. [ Google Scholar ]
  • Huang, C.; Li, Y.; Loy, C.C.; Tang, X. Deep Imbalanced Learning for Face Recognition and Attribute Prediction. IEEE Trans. Pattern Anal. Mach. Intell. (PAMI) 2019 . Available online: https://ieeexplore.ieee.org/document/8708977 (accessed on 21 July 2020).
  • Song, L.; Gong, D.; Li, Z.; Liu, C.; Liu, W. Occlusion Robust Face Recognition Based on Mask Learning with Pairwise Differential Siamese Network. In Proceedings of the 2019 International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [ Google Scholar ]
  • Wei, X.; Wang, H.; Scotney, B.; Wan, H. Minimum margin loss for deep face recognition. Pattern Recognit. 2020 , 97 , 107012. [ Google Scholar ] [ CrossRef ]
  • Sun, J.; Yang, W.; Gao, R.; Xue, J.H.; Liao, Q. Inter-class angular margin loss for face recognition. Signal Process. Image Commun. 2020 , 80 , 115636. [ Google Scholar ] [ CrossRef ]
  • Wu, Y.; Wu, Y.; Wu, R.; Gong, Y.; Lv, K.; Chen, K.; Liang, D.; Hu, X.; Liu, X.; Yan, J. Rotation consistent margin loss for efficient low-bit face recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–18 June 2020; pp. 6866–6876. [ Google Scholar ]
  • Ling, H.; Wu, J.; Huang, J.; Li, P. Attention-based convolutional neural network for deep face recognition. Multimed. Tools Appl. 2020 , 79 , 5595–5616. [ Google Scholar ] [ CrossRef ]
  • Wu, B.; Wu, H. Angular Discriminative Deep Feature Learning for Face Verification. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2133–2137. [ Google Scholar ]
  • Chen, D.; Cao, X.; Wang, L.; Wen, F.; Sun, J. Bayesian face revisited: A joint formulation. In Proceedings of the European Conference on Computer Vision (ECCV), Firenze, Italy, 7–13 October 2012; pp. 566–579. [ Google Scholar ]
  • Chen, B.C.; Chen, C.S.; Hsu, W.H. Face recognition and retrieval using cross-age reference coding with cross-age celebrity dataset. IEEE Trans. Multimed. 2015 , 17 , 804–815. [ Google Scholar ] [ CrossRef ]
  • Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015; pp. 3730–3738. [ Google Scholar ]
  • Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [ Google Scholar ]
  • Oumane, A.; Belahcene, M.; Benakcha, A.; Bourennane, S.; Taleb-Ahmed, A. Robust Multimodal 2D and 3D Face Authentication using Local Feature Fusion. Signal Image Video Process. 2016 , 10 , 12–137. [ Google Scholar ] [ CrossRef ]
  • Oumane, A.; Boutella, E.; Benghherabi, M.; Taleb-Ahmed, A.; Hadid, A. A Novel Statistical and Multiscale Local Binary Feature for 2D and 3D Face Verification. Comput. Electr. Eng. 2017 , 62 , 68–80. [ Google Scholar ] [ CrossRef ]
  • Soltanpour, S.; Boufama, B.; Wu, Q.M.J. A survey of local feature methods for 3D face recognition. Pattern Recognit. 2017 , 72 , 391–406. [ Google Scholar ] [ CrossRef ]
  • Zhou, S.; Xiao, S. 3D Face Recognition: A Survey. Hum. Cent. Comput. Inf. Sci. 2018 , 8 , 8–35. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Min, R.; Kose, N.; Dugelay, J. KinectFaceDB: A Kinect Database for Face Recognition. IEEE Trans. Syst. Man Cybern. Syst. 2014 , 44 , 1534–1548. [ Google Scholar ] [ CrossRef ]
  • Drira, H.; Ben Amor, B.; Srivastava, A.; Daoudi, M.; Slama, R. 3D Face Recognition under Expressions, Occlusions, and Pose Variations. IEEE Trans. Pattern Anal. Mach. Intell. 2013 , 35 , 2270–2283. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Ribeiro Alexandre, G.; Marques Soares, J.; Pereira Thé, G.A. Systematic review of 3D facial expression recognition methods. Pattern Recognit. 2020 , 100 , 107108. [ Google Scholar ] [ CrossRef ]
  • Ríos-Sánchez, B.; Costa-da-Silva, D.; Martín-Yuste, N.; Sánchez-Ávila, C. Deep Learning for Facial Recognition on Single Sample per Person Scenarios with Varied Capturing Conditions. Appl. Sci. 2019 , 9 , 5474. [ Google Scholar ]
  • Kim, D.; Hernandez, M.; Choi, J.; Medioni, G. Deep 3D face identification. In Proceedings of the IEEE International Joint Conference on Biometrics (IJCB), Denver, CO, USA, 1–4 October 2017; pp. 133–142. [ Google Scholar ]
  • Gilani, S.Z.; Mian, A.; Eastwood, P. Deep, dense and accurate 3D face correspondence for generating population specific deformable models. Pattern Recognit. 2017 , 69 , 238–250. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Gilani, S.Z.; Mian, A.; Shafait, F.; Reid, I. Dense 3D face correspondence. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2018 , 40 , 1584–1598. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Gilani, S.Z.; Mian, A. Learning from Millions of 3D Scans for Large-scale 3D Face Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1896–1905. [ Google Scholar ]
  • Mimouna, A.; Alouani, I.; Ben Khalifa, A.; El Hillali, Y.; Taleb-Ahmed, A.; Menhaj, A.; Ouahabi, A.; Ben Amara, N.E. OLIMP: A Heterogeneous Multimodal Dataset for Advanced Environment Perception. Electronics 2020 , 9 , 560. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Benzaoui, A.; Boukrouche, A.; Doghmane, H.; Bourouba, H. Face recognition using 1DLBP, DWT, and SVM. In Proceedings of the 2015 3rd International Conference on Control, Engineering & Information Technology (CEIT), Tlemcen, Algeria, 25–27 May 2015; pp. 1–6. [ Google Scholar ]
  • Ait Aouit, D.; Ouahabi, A. Monitoring crack growth using thermography.-Suivi de fissuration de matériaux par thermographie. C. R. Mécanique 2008 , 336 , 677–683. [ Google Scholar ] [ CrossRef ]
  • Arya, S.; Pratap, N.; Bhatia, K. Future of Face Recognition: A Review. Procedia Comput. Sci. 2015 , 58 , 578–585. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Zafeiriou, S.; Zhang, C.; Zhang, Z. A survey on face detection in the wild: Past, present and future. Comput. Vis. Image Underst. 2015 , 138 , 1–24. [ Google Scholar ] [ CrossRef ] [ Green Version ]
  • Min, R.; Xu, S.; Cui, Z. Single-Sample Face Recognition Based on Feature Expansion. IEEE Access 2019 , 7 , 45219–45229. [ Google Scholar ] [ CrossRef ]
  • Zhang, D.; An, P.; Zhang, H. Application of robust face recognition in video surveillance systems. Optoelectron. Lett. 2018 , 14 , 152–155. [ Google Scholar ] [ CrossRef ]
  • Tome, P.; Vera-Rodriguez, R.; Fierrez, J.; Ortega-Garcia, J. Facial soft biometric features for forensic face recognition. Forensic Sci. Int. 2015 , 257 , 271–284. [ Google Scholar ] [ CrossRef ] [ PubMed ] [ Green Version ]
  • Fathy, M.E.; Patel, V.M.; Chellappa, R. Face-based Active Authentication on mobile devices. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, QLD, Australia, 19–24 April 2015; pp. 1687–1691. [ Google Scholar ]
  • Medapati, P.K.; Murthy, P.H.S.T.; Sridhar, K.P. LAMSTAR: For IoT-based face recognition system to manage the safety factor in smart cities. Trans. Emerg. Telecommun. Technol. 2019 , 1–15. Available online: https://onlinelibrary.wiley.com/doi/abs/10.1002/ett.3843?af=R (accessed on 10 July 2020).


Two-dimensional face recognition datasets:

Database | Release Date | Images | Subjects | Images/Subject
ORL [ ] | 1994 | 400 | 40 | 10
FERET [ ] | 1996 | 14,126 | 1199 | -
AR [ ] | 1998 | 3016 | 116 | 26
XM2VTS [ ] | 1999 | - | 295 | -
BANCA [ ] | 2003 | - | 208 | -
FRGC [ ] | 2006 | 50,000 | - | 7
LFW [ ] | 2007 | 13,233 | 5749 | ≈2.3
CMU Multi-PIE [ ] | 2009 | >750,000 | 337 | N/A
IJB-A [ ] | 2015 | 5712 | 500 | ≈11.4
CFP [ ] | 2016 | 7000 | 500 | >14
DMFD [ ] | 2016 | 2460 | 410 | 6
IJB-B [ ] | 2017 | 21,798 | 1845 | ≈36.2
MF2 [ ] | 2017 | 4.7 M | 672,057 | ≈7
DFW [ ] | 2018 | 11,157 | 1000 | ≈5.26
IJB-C [ ] | 2018 | 31,334 | 3531 | ≈6
LFR [ ] | 2020 | 30,000 | 542 | 10–260
RMFRD [ ] | 2020 | 95,000 | 525 | -
SMFRD [ ] | 2020 | 500,000 | 10,000 | -
Large-scale 2D face recognition training datasets:

Database | Release Date | Images | Subjects | Images/Subject
CASIA WebFace [ ] | 2014 | 494,414 | 10,575 | ≈46.8
MegaFace [ ] | 2016 | 1,027,060 | 690,572 | ≈1.4
MS-Celeb-1M [ ] | 2016 | 10 M | 100,000 | 100
VGGFace [ ] | 2016 | 2.6 M | 2622 | 1000
VGGFace2 [ ] | 2017 | 3.31 M | 9131 | ≈362.6
Deep CNN-based face recognition methods and their reported accuracy on the LFW benchmark:

# | Method | Authors | Year | Architecture | Networks | Verif. Metric | Training Set (images, identities) | Accuracy (%) ± SE
1 | DeepFace | Taigman et al. [ ] | 2014 | CNN-9 | 3 | Softmax | Facebook (4.4 M, 4 K) * | 97.35 ± 0.25
2 | DeepID | Sun et al. [ ] | 2014 | CNN-9 | 60 | Softmax + JB | CelebFaces+ [ ] (202 k, 10 k) * | 97.45 ± 0.26
3 | DeepID2 | Sun et al. [ ] | 2014 | CNN-9 | 25 | Contrastive Softmax + JB | CelebFaces+ (202 k, 10 k) * | 99.15 ± 0.13
4 | DeepID2+ | Sun et al. [ ] | 2014 | CNN-9 | 25 | Contrastive Softmax + JB | WDRef [ ] + CelebFaces+ (290 k, 12 k) * | 99.47 ± 0.12
5 | DeepID3 | Sun et al. [ ] | 2015 | VGGNet | 25 | Contrastive Softmax + JB | WDRef + CelebFaces+ (290 k, 12 k) | 99.53 ± 0.10
6 | FaceNet | Schroff et al. [ ] | 2015 | GoogleNet | 1 | Triplet Loss | Google (200 M, 8 M) * | 99.63 ± 0.09
7 | Web-Scale | Taigman et al. [ ] | 2015 | CNN-9 | 4 | Contrastive Softmax | Private database (4.5 M, 55 K) * | 98.37
8 | BAIDU | Liu et al. [ ] | 2015 | CNN-9 | 10 | Triplet Loss | Private database (1.2 M, 18 K) * | 99.77
9 | VGGFace | Parkhi et al. [ ] | 2015 | VGGNet | 1 | Triplet Loss | VGGFace (2.6 M, 2.6 K) | 98.95
10 | Augmentation | Masi et al. [ ] | 2016 | VGGNet-19 | 1 | Softmax | CASIA WebFace (494 k, 10 k) | 98.06
11 | Range Loss | Zhang et al. [ ] | 2016 | VGGNet-16 | 1 | Range Loss | CASIA WebFace + MS-Celeb-1M (5 M, 100 k) | 99.52
12 | Center Loss | Wen et al. [ ] | 2016 | LeNet | 1 | Center Loss | CASIA WebFace + CACD2000 [ ] + Celebrity+ [ ] (0.7 M, 17 k) | 99.28
13 | L-Softmax | Liu et al. [ ] | 2016 | VGGNet-18 | 1 | L-Softmax | CASIA-WebFace (490 k, 10 K) | 98.71
14 | L2-Softmax | Ranjan et al. [ ] | 2017 | ResNet-101 | 1 | L2-Softmax | MS-Celeb-1M (3.7 M, 58 k) | 99.78
15 | Marginal Loss | Deng et al. [ ] | 2017 | ResNet-27 | 1 | Marginal Loss | MS-Celeb-1M (4 M, 82 k) | 99.48
16 | NormFace | Wang et al. [ ] | 2017 | ResNet-28 | 1 | Contrastive Loss | CASIA WebFace (494 k, 10 k) | 99.19 ± 0.008
17 | Noisy Softmax | Chen et al. [ ] | 2017 | VGGNet | 1 | Noisy Softmax | CASIA WebFace (400 K, 14 k) | 99.18
18 | COCO Loss | Liu et al. [ ] | 2017 | ResNet-128 | 1 | COCO Loss | MS-Celeb-1M (3 M, 80 k) | -
19 | Center Invariant Loss | Wu et al. [ ] | 2017 | LeNet | 1 | Center Invariant Loss | CASIA WebFace (0.45 M, 10 k) | 99.12
20 | Von Mises-Fisher | Hasnat et al. [ ] | 2017 | ResNet-27 | 1 | vMF Loss | MS-Celeb-1M (4.61 M, 61.24 K) | 99.63
21 | SphereFace | Liu et al. [ ] | 2018 | ResNet-64 | 1 | A-Softmax | CASIA WebFace (494 k, 10 k) | 99.42
22 | Ring Loss | Zheng et al. [ ] | 2018 | ResNet-64 | 1 | Ring Loss | MS-Celeb-1M (3.5 M, 31 K) | 99.50
23 | MLR | Guo and Zhang [ ] | 2018 | ResNet-34 | 1 | CCS Loss | MS-Celeb-1M (10 M, 100 K) | 99.71
24 | CosFace | Wang et al. [ ] | 2018 | ResNet-64 | 1 | Large Margin Cosine Loss | CASIA WebFace (494 k, 10 k) | 99.73
25 | AM-Softmax | Wang et al. [ ] | 2018 | ResNet-20 | 1 | AM-Softmax Loss | CASIA WebFace (494 k, 10 k) | 99.12
26 | Light-CNN | Wu et al. [ ] | 2018 | ResNet-29 | 1 | Softmax | MS-Celeb-1M (5 M, 79 K) | 99.33
27 | Affinity Loss | Hayat et al. [ ] | 2019 | ResNet-50 | 1 | Affinity Loss | VGGFace2 (3.31 M, 8 K) | 99.65
28 | ArcFace | Deng et al. [ ] | 2019 | ResNet-100 | 1 | ArcFace | MS-Celeb-1M (5.8 M, 85 k) | 99.83
29 | CLMLE | Huang et al. [ ] | 2019 | ResNet-64 | 1 | CLMLE Loss | CASIA WebFace (494 k, 10 k) | 99.62
30 | PDSN | Song et al. [ ] | 2019 | ResNet-50 | 1 | Pairwise Contrastive Loss | CASIA WebFace (494 k, 10 k) | 99.20
31 | Feature Transfer | Yin et al. [ ] | 2019 | LeNet | 1 | Softmax | MS-Celeb-1M (4.8 M, 76.5 K) | 99.55
32 | Ben Fredj work | Ben Fredj et al. [ ] | 2020 | GoogleNet | 1 | Softmax with center loss | CASIA WebFace (494 k, 10 k) | 99.2 ± 0.04
33 | MML | Wei et al. [ ] | 2020 | Inception ResNet-V1 [ ] | 1 | MML Loss | VGGFace2 (3.05 M, 8 K) | 99.63
34 | IAM | Sun et al. [ ] | 2020 | Inception ResNet-V1 | 1 | IAM Loss | CASIA WebFace (494 k, 10 k) | 99.12
35 | RCM Loss | Wu et al. [ ] | 2020 | ResNet-18 | 1 | Rotation Consistent Margin Loss | CASIA WebFace (494 k, 10 k) | 98.91
36 | ACNN | Ling et al. [ ] | 2020 | ResNet-100 | 1 | ArcFace Loss | DeepGlint-MS1M (3.9 M, 86 K) | 99.83
37 | LMC / SDLMC / DLMC | Wu and Wu [ ] | 2020 | ResNet-32 | 1 | LMC / SDLMC / DLMC loss | CASIA WebFace (494 k, 10 k) | 98.13 / 99.03 / 99.07
| Database | Release Date | Images | Subjects | Data Type |
|---|---|---|---|---|
| BU-3DFE | 2006 | 2500 | 100 | Mesh |
| FRGC v1.0 | 2006 | 943 | 273 | Depth image |
| FRGC v2.0 | 2006 | 4007 | 466 | Depth image |
| CASIA | 2006 | 4623 | 123 | Depth image |
| ND2006 | 2007 | 13,450 | 888 | Depth image |
| Bosphorus | 2008 | 4666 | 105 | Point cloud |
| BJUT-3D | 2009 | 1200 | 500 | Mesh |
| Texas 3DFRD | 2010 | 1140 | 118 | Depth image |
| UMB-DB | 2011 | 1473 | 143 | Depth image |
| BU-4DFE | 2008 | 606 sequences = 60,600 frames | 101 | 3D video |


Face Recognition by Humans and Machines: Three Fundamental Advances from Deep Learning

Alice J. O’Toole

1 School of Behavioral and Brain Sciences, The University of Texas at Dallas, Richardson, Texas 75080, USA

Carlos D. Castillo

2 Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA

Deep learning models currently achieve human levels of performance on real-world face recognition tasks. We review scientific progress in understanding human face processing using computational approaches based on deep learning. This review is organized around three fundamental advances. First, deep networks trained for face identification generate a representation that retains structured information about the face (e.g., identity, demographics, appearance, social traits, expression) and the input image (e.g., viewpoint, illumination). This forces us to rethink the universe of possible solutions to the problem of inverse optics in vision. Second, deep learning models indicate that high-level visual representations of faces cannot be understood in terms of interpretable features. This has implications for understanding neural tuning and population coding in the high-level visual cortex. Third, learning in deep networks is a multistep process that forces theoretical consideration of diverse categories of learning that can overlap, accumulate over time, and interact. Diverse learning types are needed to model the development of human face processing skills, cross-race effects, and familiarity with individual faces.

1. INTRODUCTION

The fields of vision science, computer vision, and neuroscience are at an unlikely point of convergence. Deep convolutional neural networks (DCNNs) now define the state of the art in computer-based face recognition and have achieved human levels of performance on real-world face recognition tasks ( Jacquet & Champod 2020 , Phillips et al. 2018 , Taigman et al. 2014 ). This behavioral parity allows for meaningful comparisons of representations in two successful systems. DCNNs also emulate computational aspects of the ventral visual system ( Fukushima 1988 , Krizhevsky et al. 2012 , LeCun et al. 2015 ) and support surprisingly direct, layer-to-layer comparisons with primate visual areas ( Yamins et al. 2014 ). Nonlinear, local convolutions, executed in cascaded layers of neuron-like units, form the computational engine of both biological and artificial neural networks for human and machine-based face recognition. Enormous numbers of parameters, diverse learning mechanisms, and high-capacity storage in deep networks enable a wide variety of experiments at multiple levels of analysis, from reductionist to abstract. This makes it possible to investigate how systems and subsystems of computations support face processing tasks.

Our goal is to review scientific progress in understanding human face processing with computational approaches based on deep learning. As we proceed, we bear in mind wise words written decades ago in a paper on science and statistics: “All models are wrong, but some are useful” ( Box 1979 , p. 202) (see the sidebar titled Perspective: Theories and Models of Face Processing and the sidebar titled Caveat: Iteration Between Theory and Practice ). Since all models are wrong, in this review, we focus on what is useful. For present purposes, computational models are useful when they give us insight into the human visual and perceptual system. This review is organized around three fundamental advances in understanding human face perception, using knowledge generated from deep learning models. The main elements of these advances are as follows.

PERSPECTIVE: THEORIES AND MODELS OF FACE PROCESSING

Box (1976) reminds us that scientific progress comes from motivated iteration between theory and practice. In understanding human face processing, theories should be used to generate the questions, and machines (as models) should be used to answer the questions. Three elemental concepts are required for scientific progress. The first is flexibility. Effective iteration between theory and practice requires feedback between what the theory predicts and what the model reveals. The second is parsimony. Because all models are wrong, excessive elaboration will not find the correct model. Instead, economical descriptions of a phenomenon should be preferred over complex descriptions that capture less fundamental elements of human perception. Third, Box (1976 , p. 792) cautions us to avoid “worrying selectivity” in model evaluation. As he puts it, “since all models are wrong, the scientist must be alert to what is importantly wrong.”

These principles represent a scientific ideal, rather than a reality in the field of face perception by humans and machines. Applying scientific principles to computational modeling of human face perception is challenging for diverse reasons (see the sidebar titled Caveat: Iteration Between Theory and Practice below). We argue, as Cichy & Kaiser (2019) have, that although the utility of scientific models is usually seen in terms of prediction and explanation, their function for exploration should not be underrated. As scientific models, DCNNs carry out high-level visual tasks in neurally inspired ways. They are at a level of development that is ripe for exploring computational and representational principles that actually work but are not understood. This is a classic problem in reverse engineering—yet the use of deep learning as a model introduces a dilemma. The goal of reverse engineering is to understand how a functional but highly complex system (e.g., the brain and human visual system) solves a problem (e.g., recognizes a face). To accomplish this, a well-understood model is used to test hypotheses about the underlying mechanisms of the complex system. A prerequisite of reverse engineering is that we understand how the model works. Failing that, we risk using one poorly understood system to test hypotheses about another poorly understood system. Although deep networks are not black boxes (every parameter is knowable) ( Hasson et al. 2020 ), we do not fully understand how they recognize faces ( Poggio et al. 2020 ). Therefore, the primary goal should be to understand deep networks for face recognition at a conceptual and representational level.

CAVEAT: ITERATION BETWEEN THEORY AND PRACTICE

Box (1976) noted that scientific progress depends on motivated iteration between theory and practice. Unfortunately, a motivation to iterate between theory and practice is not a reasonable expectation for the field of computer-based face recognition. Automated face recognition is big business, and the best models were not developed to study human face processing. DCNNs provide a neurally inspired, but not copied, solution to face processing tasks. Computer scientists formulated DCNNs at an abstract level, based on neural networks from the 1980s ( Fukushima 1988 ). Current DCNN-based models of human face processing are computationally refined, scaled-up versions of these older networks. Algorithm developers make design and training decisions for performance and computational efficiency. In using DCNNs to model human face perception, researchers must choose between smaller, controlled models and larger-scale, uncontrolled networks (see also Richards et al. 2019 ). Controlled models are easier to analyze but can be limited in computational power and training data diversity. Uncontrolled models better emulate real neural systems but may be intractable. The easy availability of cutting-edge pretrained face recognition models, with a variety of architectures, has been the deciding factor for many research labs with limited resources and expertise to develop networks. Given the widespread use of these models in vision science, brain-similarity metrics for artificial neural networks have been developed ( Schrimpf et al. 2018 ). These produce a Brain-Score made up of a composite of neural and behavioral benchmarks. Some large-scale (uncontrolled) network architectures used in modeling human face processing (See Section 2.1 ) score well on these metrics.

A promising long-term strategy is to increase the neural accuracy of deep networks ( Grill-Spector et al. 2018 ). The ventral visual stream and DCNNs both enable hierarchical and feedforward processing. This offers two computational benefits consistent with DCNNs as models of human face processing. First, the universal approximation theorem ( Hornik et al. 1989 ) ensures that both types of networks can approximate any complex continuous function relating the input (visual image) to the output (face identity). Second, linear and nonlinear feedforward connections enable fast computation consistent with the speed of human facial recognition ( Grill-Spector et al. 2018 , Thorpe et al. 1996 ). Although current DCNNs lack other properties of the ventral visual system, these can be implemented as the field progresses.

  • Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. The face representations that emerge from deep networks trained for identification operate invariantly across changes in image and appearance, but they are not themselves invariant.
  • Computational theory and simulation studies of deep learning prompt reconsideration of a long-standing axiom in vision science: that face or object representations can be understood in terms of interpretable features. In deep learning models, the concept of a nameable deep feature, localized in an output unit of the network or in the latent variables of the space, must be reevaluated.
  • Natural environments provide highly variable training data that can structure the development of face processing systems using a variety of learning mechanisms that overlap, accumulate over time, and interact. It is no longer possible to invoke learning as a generic theoretical account of a behavioral or neural phenomenon.

We focus on deep learning findings that are relevant for understanding human face processing—broadly construed. The human face provides us with diverse information, including identity, gender, race or ethnicity, age, and emotional state. We use the face to make inferences about a person’s social traits ( Oosterhof & Todorov 2008 ). As we discuss below, deep networks trained for identification retain much of this diverse facial information (e.g., Colón et al. 2021 , Dhar et al. 2020 , Hill et al. 2019 , Parde et al. 2017 , Terhörst et al. 2020 ). The use of face recognition algorithms in applied settings (e.g., law enforcement) has spurred detailed performance comparisons between DCNNs and humans (e.g., Phillips et al. 2018 ). For analogous reasons, the problem of human-like race bias in DCNNs has also been studied (e.g., Cavazos et al. 2020 ; El Khiyari & Wechsler 2016 ; Grother et al. 2019 ; Krishnapriya et al. 2019 , 2020 ). Developmental data on infants’ exposure to faces in the first year(s) of life offer insight into how to structure the training of deep networks ( Smith & Slone 2017 ). These topics are within the scope of this review. Although we consider general points of comparison between DCNNs and neural responses in face-selective areas of the primate inferotemporal (IT) cortex, a detailed discussion of this topic is beyond the scope of this review. (For a review of primate face-selective areas that considers computational perspectives, see Hesse & Tsao 2020 ). In this review, we focus on the computational and representational principles of neural coding from a deep learning perspective.

The review is organized as follows. We begin with a brief review of where machine performance on face identification stands relative to humans in quantitative terms. Qualitative performance comparisons on identification and other face processing tasks (e.g., expression classification, social perception, development) are integrated into Sections 2 – 4 . These sections consider advances in understanding human face processing from deep learning approaches. We close with a discussion of where the next steps might lead.

1.1. Where We Are Now: Human Versus Machine Face Recognition

Deep learning models of face identification map widely variable images of a face onto a representation that supports identification accuracy comparable to that of humans. The steady progress of machines over the past 15 years can be summarized in terms of the increasingly challenging face images that they can recognize ( Figure 1 ). By 2007, the best algorithms surpassed humans on a task of identity matching for unfamiliar faces in frontal images taken indoors ( O’Toole et al. 2007 ). By 2012, well-established algorithms exceeded human performance on frontal images with moderate changes in illumination and appearance ( Kumar et al. 2009 , Phillips & O’Toole 2014 ). Machine ability to match identity for in-the-wild images appeared with the advent of DCNNs in 2013–2014. Human face recognition was marginally more accurate than DeepFace ( Taigman et al. 2014 ), an early DCNN, on the Labeled Faces in the Wild (LFW) data set ( Huang et al. 2008 ). LFW contains in-the-wild images taken mostly from the front. DCNNs now fare well on in-the-wild images with significant pose variation (e.g., Maze et al. 2018 , data set). Sengupta et al. (2016) found parity between humans and machines on frontal-to-frontal identity matching but human superiority on frontal-to-profile matching.

Figure 1. The progress of computer-based face recognition systems can be tracked by their ability to recognize faces with increasing levels of image and appearance variability. In 2006, highly controlled, cropped face images with moderate variability, such as the images of the same person shown, were challenging (images adapted with permission from Sim et al. 2002 ). In 2012, algorithms could tackle moderate image and appearance variability (the top 4 images are extreme examples adapted with permission from Huang et al. 2012 ; the bottom two images adapted with permission from Phillips et al. 2011 ). By 2018, deep convolutional neural networks (DCNNs) began to tackle wide variation in image and appearance (images adapted with permission from the database in Maze et al. 2018 ). In the 2012 and 2018 images, all side-by-side images show the same person except the bottom pair of 2018 panels.

Identity matching:

process of determining if two or more images show the same identity or different identities; this is the most common task performed by machines

Human face recognition:

the ability to determine whether a face is known

1.2. Expert Humans and State-of-the-Art Machines Work Together

DCNNs can sometimes even surpass normal human performance. Phillips et al. (2018) compared humans and machines matching the identity of faces in high-quality frontal images. Although this is generally considered an easy task, the images tested were chosen to be highly challenging based on previous human and machine studies. Four DCNNs developed between 2015 and 2017 were compared to human participants from five groups: professional forensic face examiners, professional forensic face reviewers, superrecognizers ( Noyes et al. 2017 , Russell et al. 2009 ), professional fingerprint examiners, and students. Face examiners, reviewers, and superrecognizers performed more accurately than fingerprint examiners, and fingerprint examiners performed more accurately than students. Machine performance, from 2015 to 2017, tracked human skill levels. The 2015 algorithm ( Parkhi et al. 2015 ) performed at the level of the students; the 2016 algorithm ( Chen et al. 2016 ) performed at the level of the fingerprint examiners ( Ranjan et al. 2017c ); and the two 2017 algorithms ( Ranjan et al. 2017 , c ) performed at the level of professional face reviewers and examiners, respectively. Notably, combining the judgments of individual professional face examiners with those of the best algorithm ( Ranjan et al. 2017 ) yielded perfect performance. This suggests a degree of strategic diversity for the face examiners and the DCNN and demonstrates the potential for effective human–machine collaboration ( Phillips et al. 2018 ).
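In computational terms, such human–machine collaboration amounts to fusing two sets of judgments made on the same image pairs. The sketch below shows one simple fusion rule (z-scoring each source and averaging); it is an illustration of the idea only, not the specific procedure used by Phillips et al. (2018), and all variable names and values are hypothetical.

```python
import numpy as np

def fuse_judgments(human_ratings, machine_scores):
    """Fuse per-pair identity judgments from a human rater and an algorithm.

    Both inputs are 1-D arrays of scores for the same face-image pairs
    (higher = more likely the same person). Each source is z-scored so that
    ratings on different scales are comparable, then the two are averaged.
    """
    h = (human_ratings - human_ratings.mean()) / human_ratings.std()
    m = (machine_scores - machine_scores.mean()) / machine_scores.std()
    return (h + m) / 2.0

# Toy example: 5 face pairs rated on a 1-7 scale by a human and 0-1 by a DCNN.
human = np.array([6.0, 2.0, 5.0, 1.0, 7.0])
machine = np.array([0.91, 0.30, 0.75, 0.22, 0.88])
print(fuse_judgments(human, machine))
```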

Combined, the data indicate that machine performance has improved from a level comparable to that of a person recognizing unfamiliar faces to one comparable to that of a person recognizing more familiar faces ( Burton et al. 1999 , Hancock et al. 2000 , Jenkins et al. 2011 ) (see Section 4.1 ).

2. RETHINKING INVERSE OPTICS AND FACE REPRESENTATIONS

Deep networks force us to rethink the universe of possible solutions to the problem of inverse optics in vision. These networks operate with a degree of invariance to image and appearance that was unimaginable by researchers less than a decade ago. Invariance refers to the model’s ability to consistently identify a face when image conditions (e.g., viewpoint, illumination) and appearance (e.g., glasses, facial hair) vary. The nature of the representation that accomplishes this is not well understood. The inscrutability of DCNN codes is due to the enormous number of computations involved in generating a face representation from an image and the uncontrolled training data. To create a face representation, millions of nonlinear, local convolutions are executed over tens (to hundreds) of layers of units. Researchers exert little or no control over the training data, but instead source face images from the web with the goal of finding as much labeled training data as possible. The number of images per identity and the types of images (e.g., viewpoint, expression, illumination, appearance, quality) are left (mostly) to what is found through web scraping. Nevertheless, DCNNs produce a surprisingly structured and rich face representation that we are beginning to understand.

2.1. Mining the Face Identity Code in Deep Networks

The face representation generated by DCNNs for the purpose of identifying a face also retains detailed information about the characteristics of the input image (e.g., viewpoint, illumination) and the person pictured (e.g., gender, age). As shown below, this unified representation can solve multiple face processing tasks in addition to identification.

2.1.1. Image characteristics.

Face representations generated by deep networks both are and are not invariant to image variation. These codes can identify faces invariantly over image change, but they are not themselves invariant. Instead, face representations of a single identity vary systematically as a function of the characteristics of the input image. The representations generated by DCNNs are, in fact, representations of face images.

Work to dissect face identity codes draws on the metaphor of a face space ( Valentine 1991 ) adapted to representations generated by a DCNN. Visualization and simulation analyses demonstrate that identity codes for face images retain ordered information about the input image ( Dhar et al. 2020 , Hill et al. 2019 , Parde et al. 2017 ). Viewpoint (yaw and pitch) can be predicted accurately from the identity code, as can media source (still image or video frame) ( Parde et al. 2017 ). Image quality (blur, usability, occlusion) is also available as the identity code norm (vector length). Poor-quality images produce face representations centered in the face space, creating a DCNN garbage dump. This organizational structure was replicated in two DCNNs with different architectures, one developed by Chen et al. (2016) with seven convolutional layers and three fully connected layers and another developed by Sankaranarayanan et al. (2016) with 11 convolutional layers and one fully connected layer. Image quality estimates can also be optimized directly in a DCNN using human ratings ( Best-Rowden & Jain 2018 ).
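As a concrete, deliberately simplified illustration of these read-outs, the sketch below assumes a precomputed matrix of top-layer DCNN descriptors with known yaw labels; the random arrays are stand-ins for real data, and the linear regressor is only one possible read-out model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical inputs: rows of `embeddings` are top-layer DCNN face descriptors
# (e.g., 512-D), and `yaw` holds the ground-truth head yaw for each image.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))   # stand-in for real DCNN identity codes
yaw = rng.uniform(-90, 90, size=1000)       # stand-in for real yaw labels

# 1) Viewpoint read-out: a linear regressor predicting yaw from the identity code.
X_tr, X_te, y_tr, y_te = train_test_split(embeddings, yaw, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("yaw prediction R^2:", reg.score(X_te, y_te))   # near 0 on random data,
                                                      # well above 0 on real codes

# 2) Image-quality proxy: the norm (vector length) of each un-normalized code.
quality_proxy = np.linalg.norm(embeddings, axis=1)
```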

Face space:

representation of the similarity of faces in a multidimensional space

For a closer look at the structure of DCNN face representations, Hill et al. (2019) examined the representations of highly controlled face images in a face space generated by a deep network trained with in-the-wild images. The network processed images of three-dimensional laser scans of human heads rendered from five viewpoints under two illumination conditions (ambient, harsh spotlight). Visualization of these representations in the resulting face space showed a highly ordered pattern (see Figure 2 ). Consistent with the network’s high accuracy at face identification, images clustered by identity. Identity clusters separated into regions of male and female faces (see Section 2.1.2 ). Within each identity cluster, the images separated by illumination condition—visible in the face space as chains of images. Within each illumination chain, the image representations were arranged in the space by viewpoint, which varied systematically along the image chain. To further probe the coding of identity, Hill et al. (2019) processed images of caricatures of the 3D heads (see also Blanz & Vetter 1999 ). Caricature representations were centered in each identity cluster, indicating that the network perceived a caricature as a good likeness of the identity.

Figure 2. Visualization of the top-level deep convolutional neural network (DCNN) similarity space for all images from Hill et al. (2019) . ( a – f ) Points are colored according to different variables. Grey polygonal borders are for illustration purposes only and show the convex hull of all images of each identity. These convex hulls are expanded by a margin for visibility. The network separates identities accurately. In panels a and d , the space is divided into male and female sections. In panels b and e , illumination conditions subdivide within identity groupings. In panels c and f , the viewpoint varies sequentially within illumination clusters. Dotted-line boxes in panels a – c show areas enlarged in panels d – g . Figure adapted with permission from Hill et al. (2019) .

DCNN face representation:

output vector produced for a face image processed through a deep network trained for faces

All results from Hill et al. (2019) were replicated using two networks with starkly different architectures. The first, developed by Ranjan et al. (2019) , was based on a ResNet-101 with 101 layers and skip connections; the second, developed by Chen et al. (2016) , had 15 convolution and pooling layers, a dropout layer, and one fully connected top layer. As measured using the brain-similarity metrics developed in Brain-Score ( Schrimpf et al. 2018 ), one of these architectures (ResNet-101) was the third most brain-like of the 25 networks tested. The ResNet-101 network scored well on both neural (V4 and IT cortex) and behavioral predictability for object recognition. Hill et al.’s (2019) replication of this face space using a shallower network ( Chen et al. 2016 ), however, suggests that network architecture may be less important than computational capacity in understanding high-level visual codes for faces (see Section 3.2 ).

Brain-Score:

neural and behavioral benchmarks that score an artificial neural network on its similarity to brain mechanisms for object recognition

Returning to the issue of human-like view invariance in a DCNN, Abudarham & Yovel (2020) compared the similarity of face representations computed within and across identities and viewpoints. Consistent with view-invariant performance, same-identity, different-view face pairs were more similar than different-identity, same-view face pairs. Consistent with a noninvariant face representation, correlations between similarity scores across head view decreased monotonically with increasing view disparity. These results support the characterization of DCNN codes as being functionally view invariant but with a view-specific code. Notably, earlier layers in the network showed view specificity, whereas higher layers showed view invariance.
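The logic of this comparison can be sketched directly on embedding vectors. The snippet below uses hypothetical (randomly generated) codes in place of real DCNN outputs; with real identity codes, the same-identity, different-view similarities would exceed the different-identity, same-view similarities.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: emb[identity][view] is the DCNN code for one image.
rng = np.random.default_rng(1)
emb = {i: {v: rng.normal(size=512) for v in ("frontal", "profile")} for i in range(10)}

same_id_diff_view = [cosine(emb[i]["frontal"], emb[i]["profile"]) for i in emb]
diff_id_same_view = [cosine(emb[i]["frontal"], emb[j]["frontal"])
                     for i in emb for j in emb if i < j]

# With a functionally view-invariant code, the first mean exceeds the second.
print(np.mean(same_id_diff_view), np.mean(diff_id_same_view))
```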

It is worth digressing briefly to consider invariance in the context of neural approaches to face processing. An underlying assumption of neural approaches is that “a major purpose of the face patches is thus to construct a representation of individual identity invariant to view direction” ( Hesse & Tsao 2020 , pp. 703). Ideas about how this is accomplished have evolved. Freiwald & Tsao (2010) posited the progressive computation of invariance via the pooling of neurons across face patches, as follows. In early patches, a neuron responds to a specific identity from specific views; in middle face patches, greater invariance is achieved by pooling the responses of mirror-symmetric views of an identity; in later face patches, each neuron pools inputs representing all views of the same individual to create a fully view-invariant representation. More recently, Chang & Tsao (2017) proposed that the brain computes a view-invariant face code using shape and appearance parameters analogous to those used in a computer graphics model of face synthesis ( Cootes et al. 1995 ) (see the sidebar titled Neurons, Neural Tuning, Population Codes, Features, and Perceptual Constancy ). This code retains information about the face, but not about the particular image viewed.

NEURONS, NEURAL TUNING, POPULATION CODES, FEATURES, AND PERCEPTUAL CONSTANCY

Barlow (1972 , p. 371) wrote, “Results obtained by recording from single neurons in sensory pathways…obviously tell us something important about how we sense the world around us; but what exactly have we been told?” In answer, Barlow (1972 , p. 371) proposed that “our perceptions are caused by the activity of a rather small number of neurons selected from a very large population of predominantly silent cells. The activity of each single cell is thus an important perceptual event and it is thought to be related quite simply to our subjective experience.” Although this proposal is sometimes caricatured as the grandmother cell doctrine (see also Gross 2002 ), Barlow simply asserts that single-unit activity can be interpreted in perceptual terms, and that the responses of small numbers of units, in combination, underlie subjective perceptual experience. This proposal reflects ideas gleaned from studies of early visual areas that have been translated, at least in part, to studies of high-level vision.

Over the past decade, single neurons in face patches have been characterized as selective for facial features (e.g., aspect ratio, hair length, eyebrow height) ( Freiwald et al. 2009 ), face viewpoint and identity ( Freiwald & Tsao 2010 ), eyes ( Issa & DiCarlo 2012 ), and shape or appearance parameters from an active appearance model of facial synthesis ( Chang & Tsao 2017 ). Neurophysiological studies of face and object processing also employ techniques aimed at understanding neural population codes. Using the pattern of neural responses in a population of neurons (e.g., IT), linear classifiers are used often to predict subjective percepts (commonly defined as the image viewed). For example, Chang & Tsao (2017) showed that face images viewed by a macaque could be reconstructed using a linear combination of the activity of just 205 face cells in face patches ML–MF and AM. This classifier provides a real neural network model of the face-selective cortex that can be interpreted in simple terms.

Population code models generated from real neural data (a few hundred units), however, differ substantially in scale from the face- and object-selective cortical regions that they model (1 mm³ of the cerebral cortex contains approximately 50,000 neurons and 300 million adjustable parameters; Azevedo et al. 2009 , Kandel et al. 2000 , Hasson et al. 2020 ). This difference in scale is at the core of a tension between model interpretability and real-world task generalizability ( Hasson et al. 2020 ). It also creates tension between the neural coding hypotheses suggested by deep learning and the limitations of current neuroscience techniques for testing these hypotheses. To model neural function, an electrode gives access to single neurons and (with multi-unit recordings) to relatively small numbers of neurons (a few hundred). Neurocomputational theory based on direct fit models posits that overparameterization (i.e., the extremely high number of parameters available for neural computation) is critical to the brain’s solution to real-world problems (see Section 3.2 ). Bridging the gap between the computational and neural scale of these perspectives remains an ongoing challenge for the field.

Deep networks suggest an alternative that is largely consistent with neurophysiological data but interprets the data in a different light. Neurocomputational theory posits that the ventral visual system untangles face identity information from image parameters ( DiCarlo & Cox 2007 ). The idea is that visual processing starts in the image domain, where identity and viewpoint information are entangled. With successive levels of neural processing, manifolds corresponding to individual identities are untangled from image variation. This creates a representational space where identities can be separated with hyperplanes. Image information is not lost, but rather, is rearranged (for object recognition results, see Hong et al. 2016 ). The retention of image and identity information in DCNN face representations is consistent with this theory. It is also consistent with basic neuroscience findings indicating the emergence of a representation dominated by identity that retains sensitivity to image features (See Section 2.2 ).

2.1.2. Appearance and demographics.

Faces can be described using what computer vision researchers have called attributes or soft biometrics (hairstyle, hair color, facial hair, and accessories such as makeup and glasses). The definition of attributes in the computational literature is vague and can include demographics (e.g., gender, age, race) and even facial expression. Identity codes from deep networks retain a wide variety of face attributes. For example, Terhörst et al. (2020) built a massive attribute classifier (MAC) to test whether 113 attributes could be predicted from the face representations produced by deep networks [ArcFace ( Deng et al. 2019 ) or FaceNet ( Schroff et al. 2015 )] for images from in-the-wild data sets ( Huang et al. 2008 , Liu et al. 2015 ). The MAC learned to map from DCNN-generated face representations to attribute labels. Cross-validated results showed that 39 of the attributes were easily predictable, and 74 of the 113 were predictable at reliable levels. Hairstyle, hair color, beard, and accessories were predicted easily. Attributes such as face geometry (e.g., round), periocular characteristics (e.g., arched eyebrows), and nose were moderately predictable. Skin and mouth attributes were not well predicted.
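A minimal version of this attribute read-out can be sketched by training a simple probe on frozen identity codes. The snippet below substitutes a logistic-regression probe for the MAC's neural network and uses random stand-ins for the embeddings and labels; it illustrates the analysis pipeline rather than reproducing the implementation of Terhörst et al. (2020).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: frozen DCNN face descriptors and one binary attribute label
# (e.g., "wears glasses"). Real studies use ArcFace/FaceNet codes for in-the-wild images.
rng = np.random.default_rng(2)
codes = rng.normal(size=(2000, 512))
wears_glasses = rng.integers(0, 2, size=2000)

probe = LogisticRegression(max_iter=1000)
acc = cross_val_score(probe, codes, wears_glasses, cv=5).mean()
print(f"cross-validated attribute accuracy: {acc:.2f}")  # ~0.5 on random codes
```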

The continuous shuffling of identity, attribute, and image information across layers of the network was demonstrated by Dhar et al. (2020) . They tracked the expressivity of attributes (identity, sex, age, pose) across layers of a deep network. Expressivity was defined as the degree to which a feature vector, from any given layer of a network, specified an attribute. Dhar et al. (2020) computed expressivity using a second neural network that estimated the mutual information between attributes and DCNN features. Expressivity order in the final fully connected layer of both networks (Resnet-101 and Inception Resnet v2; Ranjan et al. 2019 ) indicated that identity was most expressed, followed by age, sex, and yaw. Identity expressivity increased dramatically from the final pooling layer to the last fully connected layer. This echoes the progressive increase in the detectability of view-invariant face identity representations seen across face patches in the macaque ( Freiwald & Tsao 2010 ). It also raises the computational possibility of undetected viewpoint sensitivity in these neurons (see Section 3.1 ).
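Expressivity analyses of this kind can be approximated, very roughly, with off-the-shelf mutual information estimators applied to layer activations. The sketch below is only a crude stand-in for the learned estimator used by Dhar et al. (2020), and the activation matrices and labels are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical activations from two layers of an identity-trained network, plus
# an attribute label (e.g., sex) per image. Summing per-dimension MI is only a
# crude proxy for the learned mutual-information estimator in the original study.
rng = np.random.default_rng(3)
pool_layer = rng.normal(size=(1500, 256))
final_layer = rng.normal(size=(1500, 128))
sex = rng.integers(0, 2, size=1500)

for name, layer in [("final pooling", pool_layer), ("last fully connected", final_layer)]:
    mi = mutual_info_classif(layer, sex, random_state=0).sum()
    print(f"{name}: total MI with attribute = {mi:.3f}")
```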

Mutual information:

a statistical term from information theory that quantifies the codependence of information between two random variables

2.1.3. Social traits.

People make consistent (albeit invalid) inferences about a person’s social traits based on their face ( Todorov 2017 ). These judgments have profound consequences. For example, competence judgments about faces predict election success at levels far above chance ( Todorov et al. 2005 ). The physical structure of the face supports these trait inferences ( Oosterhof & Todorov 2008 , Walker & Vetter 2009 ), and thus it is not surprising that deep networks retain this information. Using face representations produced by a network trained for face identification ( Sankaranarayanan et al. 2016 ), 11 traits (e.g., shy, warm, impulsive, artistic, lazy), rated by human participants, were predicted at levels well above chance ( Parde et al. 2019 ). Song et al. (2017) found that more than half of 40 attributes were predicted accurately by a network trained for object recognition (VGG-16; Simonyan & Zisserman 2014 ). Human and machine trait ratings were highly correlated.

Other studies show that deep networks can be optimized to predict traits from images. Lewenberg et al. (2016) crowd-sourced large numbers of objective (e.g., hair color) and subjective (e.g., attractiveness) attribute ratings from faces. DCNNs were trained to classify images for the presence or absence of each attribute. They found highly accurate classification for the objective attributes and somewhat less accurate classification for the subjective attributes. McCurrie et al. (2017) trained a DCNN to classify faces according to trustworthiness, dominance, and IQ. They found significant accord with human ratings, with higher agreement for trustworthiness and dominance than for IQ.

2.1.4. Facial expressions.

Facial expressions are also detectable in face representations produced by identity-trained deep networks. Colón et al. (2021) found that expression classification was well above chance for face representations of images from the Karolinska data set ( Lundqvist et al. 1998 ), which includes seven facial expressions (happy, sad, angry, surprised, fearful, disgusted, neutral) seen from five viewpoints (frontal and 90- and 45-degree left and right profiles). Consistent with human data, happiness was classified most accurately, followed by surprise, disgust, anger, neutral, sadness, and fear. Notably, accuracy did not vary across viewpoint. Visualization of the identities in the emergent face space showed a structured ordering of similarity in which viewpoint dominated over expression.

2.2. Functional Invariance, Useful Variability

The emergent code from identity-trained DCNNs can be used to recognize faces robustly, but it also retains extraneous information that is of limited, or no, value for identification. Although demographic and trait information offers weak hints to identity, image characteristics and facial expression are not useful for identification. Attributes such as glasses, hairstyle, and facial hair are, at best, weak identity cues and, at worst, misleading cues that will not remain constant over extended time periods. In purely computational terms, the variability of face representations for different images of an identity can lead to errors. Although this is problematic in security applications, coincidental features and attributes can be diagnostic enough to support acceptably accurate identification performance in day-to-day face recognition ( Yovel & O’Toole 2016 ). (For related arguments based on adversarial images for object recognition, see Ilyas et al. 2019 , Xie et al. 2020 , Yuan et al. 2020 .) A less-than-perfect identification system in computational terms, however, can be a surprisingly efficient, multipurpose face processing system that supports identification and the detection of visually derived semantic information [called attributes by Bruce & Young (1986) ].

What do we learn from these studies that can be useful in understanding human visual processing of faces? First, we learn that it is computationally feasible to accommodate diverse information about faces (identity, demographics, visually derived semantic information), images (viewpoint, illumination, quality), and emotions (expression) in a unified representation. Furthermore, this diverse information can be accessed selectively from the representation. Thus, identity, image parameters, and attributes are all untangled when learning prioritizes the difficult within-category discrimination problem of face identification.

Second, we learn that to understand high-level visual representations for faces, we need to think in terms of categorical codes unbound from a spatial frame of reference. Although remnants of retinotopy and image characteristics remain in high-level visual areas (e.g., Grill-Spector et al. 1999 , Kay et al. 2015 , Kietzmann et al. 2012 , Natu et al. 2010 , Yue et al. 2010 ), the expressivity of spatial layout weakens dramatically from early visual areas to categorically structured areas in the IT cortex. Categorical face representations should capture what cognitive and perceptual psychologists call facial features (e.g., face shape, eye color). Indeed, altering these types of features in a face affects identity perception similarly for humans and deep networks ( Abudarham et al. 2019 ). However, neurocomputational theory suggests that finding these features in the neural code will likely require rethinking the interpretation of neural tuning and population coding (see Section 3.2 ).

Third, if the ventral stream untangles information across layers of computations, then we should expect traces of identity, image data, and attributes at many, if not all, neural network layers. These may variously dominate the strength of the neural signal at different layers (see Section 3.1 ). Thus, various layers in the network will likely succeed in predicting several types of information about the face and/or image, though with differing accuracy. For now, we should not ascribe too much importance to findings about which specific layer(s) of a particular network predict specific attributes. Instead, we should pay attention to the pattern of prediction accuracy across layers. We would expect the following pattern. Clearly, for the optimized attribute (identity), the output offers the clearest access. For subject-related attributes (e.g., demographics), this may also be the case. For image-related attributes, we would expect every layer in the network to retain some degree of prediction ability. Exactly how, where, and whether the neural system makes use of these attributes for specific tasks remain open questions.

3. RETHINKING VISUAL FEATURES: IMPLICATIONS FOR NEURAL CODES

Deep learning models force us to rethink the definition and interpretation of facial features in high-level representations. Theoretical ideas about the brain’s solution to complex real-world tasks such as face recognition must be reconciled at the level of neural units and representational spaces. Deep learning models can be used to test hypotheses about how faces are stored in the high-dimensional representational space defined by the pattern of responses of large numbers of neurons.

3.1. Units Confound Information that Separates in the Representation Space

Insight into interpreting facial features comes from deep network simulations aimed at understanding the relationship between unit responses and the information retained in the face representation. Parde et al. (2021) compared identification, gender classification, and viewpoint estimation in subspaces of a DCNN face space. Using an identity-trained network capable of all three tasks, they tested performance on the tasks using randomly sampled subsets of output units. Beginning at full dimensionality (512 units) and progressively decreasing sample size, they found no notable decline in identification accuracy for more than 3,000 in-the-wild faces until the sample size reached 16 randomly chosen units (3% of full dimensionality). Correlations between unit responses across representations were near zero, indicating that individual units captured nonredundant identity cues. Statistical power for identification (i.e., separating identities) was uniformly high for all output units, demonstrating that units used their entire response range to separate identities. A unit firing at its maximum provided no more, and no less, information than any other response value. This distinction may seem trivial, but it is not. The data suggest that every output unit acts to separate identities to the maximum degree possible. As such, all units participate in coding all identities. In information theory terms, this is an ideal use of neural resources.
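The unit-deletion logic can be sketched as follows: repeatedly sample subsets of output units and re-measure identity separation on the reduced code. The verification index, embeddings, and labels below are hypothetical stand-ins; with real DCNN codes, performance stays high until very few units remain.

```python
import numpy as np

def identity_separation(codes, labels):
    """Simple separation index: mean same-identity cosine similarity minus
    mean different-identity cosine similarity over all image pairs."""
    normed = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = normed @ normed.T
    same = labels[:, None] == labels[None, :]
    iu = np.triu_indices(len(labels), k=1)          # unique pairs only
    return sims[iu][same[iu]].mean() - sims[iu][~same[iu]].mean()

rng = np.random.default_rng(4)
codes = rng.normal(size=(500, 512))      # hypothetical 512-D identity codes
labels = rng.integers(0, 100, size=500)  # hypothetical identity labels

for k in (512, 128, 32, 16, 4):          # progressively smaller random unit samples
    units = rng.choice(512, size=k, replace=False)
    print(k, "units ->", round(identity_separation(codes[:, units], labels), 3))
```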

For gender classification and viewpoint estimation, performance declined at a much faster rate than for identification as units were deleted ( Parde et al. 2021 ). Statistical power for predicting gender and viewpoint was strong in the distributed code but weak at the level of the unit. Prediction power for these attributes was again roughly equivalent for all units. Thus, individual units contributed to coding all three attributes, but identity modulated individual unit responses far more strongly than did gender or viewpoint. Notably, a principal component (PC) analysis of representations in the full-dimensional space revealed subspaces aligned with identity, gender, and viewpoint ( Figure 3 ). Consistent with the strength of the categorical identity code in the representation, identity information dominated PCs explaining large amounts of variance, gender dominated the middle range of PCs, and viewpoint dominated PCs explaining small amounts of variation.
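The subspace analysis summarized in Figure 3 can be sketched by computing principal components of the face representations and measuring how strongly each PC aligns with a direction diagnostic of an attribute (here, a simple difference-of-means gender direction). The data below are hypothetical stand-ins for real identity codes.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
codes = rng.normal(size=(2000, 512))                 # hypothetical identity codes
is_male = rng.integers(0, 2, size=2000).astype(bool) # hypothetical gender labels

# One simple attribute-diagnostic direction: difference of class means, unit length.
gender_dir = codes[is_male].mean(0) - codes[~is_male].mean(0)
gender_dir /= np.linalg.norm(gender_dir)

# PCA components are unit-length, so their dot product with the direction is a cosine.
pca = PCA(n_components=50).fit(codes)
alignment = np.abs(pca.components_ @ gender_dir)
print("PC most aligned with the gender direction:", int(alignment.argmax()))
```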

Figure 3. Illustration of the separation of the task-relevant information into subspaces for an identity-trained deep convolutional neural network (DCNN). Each plot shows the similarity (cosine) between principal components (PCs) of the face space and directional vectors in the space that are diagnostic of identity ( top ), gender ( middle ), and viewpoint ( bottom ). Figure adapted with permission from Parde et al. (2021) .

The emergence and effectiveness of these codes in DCNNs suggest that caution is needed in ascribing significance only to stimuli that drive a neuron to high rates of response. Small-scale modulations of neural responses can also be meaningful. Let us consider a concrete example. A neurophysiologist probing the network used by Parde et al. (2021) would find some neurons that respond strongly to a few identities. Interpreting this as identity tuning, however, would be an incorrect characterization of a code in which all units participate in coding all identities. Concomitantly, few units in the network would appear responsive to viewpoint or gender variations because unit firing rates would modulate only slightly with changes in viewpoint or gender. Thus, the distributed coding of view and gender across units would likely be missed. The finding that neurons in macaque face patch AM respond selectively (i.e., with high response rates) to identity over variable views ( Freiwald & Tsao 2010 ) is consistent with DCNN face representations. It is possible, however, that these units also encode other face and image attributes, but with differential degrees of expressivity. This would be computationally consistent with the untangling theory and with DCNN codes.

Macaque face patches:

regions of the macaque cortex that respond selectively to faces, including the posterior lateral (PL), middle lateral (ML), middle fundus (MF), anterior lateral (AL), anterior fundus (AF), and anterior medial (AM)

Another example comes from the use of generative adversarial networks and related techniques to characterize the response properties of single (or multiple) neuron(s) in the primate visual cortex ( Bashivan et al. 2019 , Ponce et al. 2019 , Yuan et al. 2020 ). These techniques have examined neurons in areas V4 ( Bashivan et al. 2019 ) and IT ( Ponce et al. 2019 , Yuan et al. 2020 ). The goal is to progressively evolve images that drive neurons to their maximum response or that selectively (in)activate subsets of neurons. Evolved images show complex mosaics of textures, shapes, and colors. They sometimes show animals or people and sometimes reveal spatial patterns that are not semantically interpretable. However, these techniques rely on two strong assumptions. First, they assume that a neuron’s response can be characterized completely in terms of the stimuli that activate it maximally, thereby discounting other response rates as noninformative. The computational utility of a unit’s full response range in DCNNs suggests that reconsideration of this assumption is necessary. Second, these techniques assume that a neuron’s response properties can be visualized accurately as a two-dimensional image. Given the categorical, nonretinotopic nature of representations in high-level visual areas, this seems problematic. If the representation under consideration is not in the image or pixel domain, then image-based visualization may offer limited, and possibly misleading, insight into the underlying nature of the code.

3.2. Direct-Fit Models and Deep Learning

In rethinking visual features at a theoretical level, direct-fit models of neural coding appear to best explain deep learning findings in multiple domains (e.g., face recognition, language) ( Hasson et al. 2020 ). These models posit that neural computation fits densely sampled data from the environment. Implementation is accomplished using “overparameterized optimization algorithms that increase predictive (generalization) power, without explicitly modeling the underlying generative structure of the world” ( Hasson et al. 2020 , p. 418). Hasson et al. (2020) begins with an ideal model in a small-parameter space ( Figure 4 ). When the underlying structure of the world is simple, a small-parameter model will find the underlying generative function, thereby supporting generalization via interpolation and extrapolation. Despite decades of effort, small-parameter functions have not solved real-world face recognition with performance anywhere near that of humans.

Figure 4. ( a ) A model with too few parameters fails to fit the data. ( b ) The ideal-fit model fits with a small number of parameters and has generative power that supports interpolation and extrapolation. ( c ) An overfit function can model noise in the training data. ( d ) An overparameterized model generalizes well to new stimuli within the scope of the training samples. Figure adapted with permission from Hasson et al. (2020) .

When the underlying structure of the world is complex and multivariate, direct-fit models offer an alternative to models based on small-parameter functions. With densely sampled real-world training data, each new observation can be placed in the context of past experience. More formally, direct-fit models solve the problem of generalization to new exemplars by experience-scaffolded interpolation ( Hasson et al. 2020 ). This produces face recognition performance in the range of that of humans. A fundamental element of the success of deep networks is that they model the environment with big data, which can be structured in overparameterized spaces. The scale of the parameterization and the requirement to operate on real-world data are pivotal. Once the network is sufficiently parameterized to fit the data, the exact details of its architecture are not important. This may explain why starkly different network architectures arrive at similarly structured representations ( Hill et al. 2019 , Parde et al. 2017 , Storrs et al. 2020 ).
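The contrast drawn in Figure 4 can be illustrated with a toy fitting problem: a small-parameter model and a heavily parameterized model are both fit to densely sampled, noisy training data and then evaluated inside and outside the training range. This is a didactic sketch, not a model taken from Hasson et al. (2020).

```python
import numpy as np

def world(x):                                  # the "generative structure of the world"
    return np.sin(x)

rng = np.random.default_rng(6)
x_train = rng.uniform(-3, 3, 200)              # densely sampled training data
y_train = world(x_train) + rng.normal(0, 0.1, 200)

x_interp = np.linspace(-3, 3, 100)             # new stimuli within the scope of experience
x_extrap = np.linspace(4, 6, 100)              # stimuli outside that scope

for degree in (1, 25):                         # small-parameter vs. heavily parameterized fit
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    mse_in = np.mean((model(x_interp) - world(x_interp)) ** 2)
    mse_out = np.mean((model(x_extrap) - world(x_extrap)) ** 2)
    print(f"degree {degree:2d}: interpolation MSE = {mse_in:.3f}, "
          f"extrapolation MSE = {mse_out:.3g}")
```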

Returning to the issue of features, in neurocomputational terms, the strength of connectivity between neurons at synapses is the primary locus of information, just as weights between units in a deep network comprise information. We expect features, whatever they are, to be housed in the combination of connection strengths among units, not in the units themselves. In a high-dimensional multivariate encoding space, they are hyperplane directions through the space. Thus, features are represented across many computing elements, and each computing element participates in encoding many features ( Hasson et al. 2020 , Parde et al. 2021 ). If features are directions in a high-dimensional coding space ( Goodfellow et al. 2014 ), then units act as an arbitrary projection surface from which this information can be accessed—albeit in a nontransparent form.

A downside of direct-fit models is that they cannot generalize via extrapolation. The other-race effect is an example of how face recognition may fail due to limited experience ( Malpass & Kravitz 1969 ) (see Section 4.3.2 ). The extrapolation limit may be countered, however, by the capacity of direct-fit models to acquire expertise within the confines of experience. For example, in human perception, category experience selectively structures representations as new exemplars are learned. Collins & Behrmann (2020) show that this occurs in a way that reflects the greater experience that humans have with faces and computer-generated objects from novel made-up categories of objects, which the authors call YUFOs. They tracked the perceived similarity of pairs of other-race faces and YUFOs as people learned novel exemplars of each. Experience changed perceived similarities more selectively for faces than for YUFOs, enabling more nuanced discrimination of exemplars from the experienced category of faces.

In summary, direct-fit models offer a framework for thinking about high-level visual codes for faces in a way that unifies disparate data on single units and high-dimensional coding spaces. These models are fueled by the rich experience that we (models) gain from learning (training on) real-world data. They solve complex visual tasks with interpolated solutions that elude transparent semantic interpretation.

4. RETHINKING LEARNING IN HUMANS AND DEEP NETWORKS

Deep network models of human face processing force us to consider learning as a complex and diverse set of mechanisms that can overlap, accumulate over time, and interact. Learning in both humans and artificial neural networks can refer to qualitatively different phenomena. In both cases, learning involves multiple steps. For DCNNs, these steps are fundamental to a network’s ability to recognize faces across image and appearance variation. Human visual learning is likewise diverse and unfolds across the developmental lifespan in a process governed by genetics and environmental input ( Goodman & Shatz 1993 ). The stepwise implementation of learning is one way that DCNNs differ from previous face recognition networks. Considered as manipulable modeling tools, the learning steps in DCNNs force us to think in concrete and nuanced ways about how humans learn faces.

In this section, we outline the learning layers in human face processing ( Section 4.1 ), introduce the layers of learning used in training machines ( Section 4.2 ), and consider the relationship between the two in the context of human behavior ( Section 4.3.1 ). The human learning layers support a complex, biologically realized face processing system. The machine learning layers can be thought of as building blocks that can be combined in a variety of ways to model human behavioral phenomena. At the outset, we note that machine learning is designed to maximize performance—not to model the development of the human face processing system ( Smith & Slone 2017 ). Concomitantly, the sequential presentation of training data in DCNNs differs from the pattern of exposure that infants and young children have with faces and objects ( Jayaraman et al. 2015 ). The machine learning steps, however, can be modified to model human learning more closely. In practical terms, fully trained DCNNs, available on the web, are used (almost exclusively) to model human neural systems (see the sidebar titled Caveat: Iteration Between Theory and Practice ). It is important, therefore, to understand how (and why) these models are configured as they are and to understand the types of learning tools available for modeling human face processing. These steps may provide computational grounding for basic learning mechanisms hypothesized in humans.

4.1. Human Learning for Face Processing

To model human face processing, researchers need to consider the following types of learning. The most specific form of learning is familiar face recognition. People learn the faces of specific familiar individuals (e.g., friends, family, celebrities). Familiar faces are recognized robustly over challenging changes in appearance and image characteristics. The second-most specific is local population tuning. People recognize own-race faces more accurately than other-race faces, a phenomenon referred to as the other-race effect (e.g., Malpass & Kravitz 1969 ). This likely results from tuning to the statistical properties of the faces that we see most frequently—typically faces of our own race. The third-most specific is unfamiliar face recognition. People can differentiate unfamiliar faces perceptually. Unfamiliar refers to faces that a person has not encountered previously or has encountered infrequently. Unfamiliar face recognition is less robust to image and appearance change than is familiar face recognition. The least specific form of learning is object recognition. At a fundamental level of analysis, faces are objects, and both share early visual processing wetware.

4.2. How Deep Convolutional Neural Networks Learn Face Identification

Training DCNNs for face recognition involves a sequence of learning stages, each with a concrete objective. Unlike human learning, machine learning stages are executed in strict sequence. The goal across all stages of training is to build an effective method for converting images of faces into points in a high-dimensional space. The resulting high-dimensional space allows for easy comparison among faces, search, and clustering. In this section, we sketch out the engineering approach to learning, working forward from the most general to the most specific form of learning. This follows the implementation order used by engineers.

4.2.1. Object classification (between-category learning): Stage 1.

Deep networks for face identification are commonly built on top of DCNNs that have been pretrained for object classification. Pretraining is carried out using large data sets of objects, such as those available in ImageNet (Russakovsky et al. 2015), which contains more than 14 million images of over 1,000 classes of objects (e.g., volcanoes, cups, chihuahuas). The object categorization training procedure involves adjusting the weights on all layers of the network. For training to converge, a large training set is required. The loss function optimized in this procedure is typically the well-understood combination of Softmax and cross-entropy. Most practitioners do not execute this step themselves, because it has already been performed in pretrained models that can be downloaded from public repositories in formats compatible with DCNN software libraries [e.g., PyTorch (Paszke et al. 2019), TensorFlow (Abadi et al. 2016)]. Networks pretrained for object recognition have proven better starting points for face identification than networks that start from a random configuration (Liu et al. 2015, Yi et al. 2014).
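As a concrete illustration, the sketch below shows how Stage 1 is typically "executed" in practice: by downloading a network that has already been trained for 1,000-way object classification. The ResNet-50 backbone and torchvision weights are our illustrative choices, not those of any particular study.

```python
import torchvision.models as models

# Stage 1 in practice: start from a backbone already trained for 1,000-way
# object classification on ImageNet, rather than from a random initialization.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# The final fully connected layer maps a 2,048-dimensional feature vector to
# 1,000 object-class scores; Stage 2 replaces exactly this layer.
print(backbone.fc)  # Linear(in_features=2048, out_features=1000, bias=True)
```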

4.2.2. Face recognition (within-category learning): Stage 2.

Face recognition training is implemented in a second stage. In this stage, the last fully connected layer, which connects to the object-category nodes (e.g., volcanoes, cups), is removed from the network produced by Stage 1 training. In its place, a new fully connected layer that maps to the number of face identities available for training is attached. Depending on the size of the face training set, the weights of either all layers or all but a few layers at the beginning of the network are updated; the former is common when very large numbers of face identities are available. In academic laboratories, data sets include 5–10 million face images of 40,000–100,000 identities; in industry, far larger data sets are often used (Schroff et al. 2015). A technical difficulty in retraining an object classification network into a face recognition network is the large increase in the number of categories involved (approximately 1,000 objects versus 50,000+ faces). Special loss functions can address this issue [e.g., L2-Softmax/crystal loss (Ranjan et al. 2017), NormFace (Wang et al. 2017), angular Softmax (Li et al. 2018), additive margin Softmax (Wang et al. 2018), additive angular margin (Deng et al. 2019)].
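A minimal sketch of the Stage 2 "surgery" described above follows. The identity count, embedding size, and plain Softmax cross-entropy loss are illustrative placeholders; the margin-based losses cited above would replace the loss function while leaving the architecture unchanged.

```python
import torch.nn as nn
import torchvision.models as models

NUM_IDENTITIES = 50_000   # illustrative; real face training sets vary widely
EMBEDDING_DIM = 512       # the layer later retained as the face representation

backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
in_features = backbone.fc.in_features   # 2048 for ResNet-50

# Remove the object-category head and attach an embedding layer followed by a
# classifier over face identities. Training updates all (or nearly all) weights.
backbone.fc = nn.Sequential(
    nn.Linear(in_features, EMBEDDING_DIM),
    nn.Linear(EMBEDDING_DIM, NUM_IDENTITIES),
)
criterion = nn.CrossEntropyLoss()  # Softmax + cross-entropy over identities
```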

When the Stage 2 face training is complete, the last fully connected layer that connects to the 50,000+ face identity nodes is removed, leaving below it a relatively low-dimensional (128- to 5,000-unit) layer of output units. This can be thought of as the face representation. This output represents a face image, not a face identity. At this point in training, any arbitrary face image from any identity (known or unknown to the network) can be processed by the DCNN to produce a compact face image descriptor across the units of this layer. If the network functions perfectly, then it will produce identical codes for all images of the same person. This would amount to perfect image and appearance generalization. This is not usually achieved, even when the network is highly accurate (see Section 2 ).

In this state, the network is commonly employed to recognize faces not seen in training (unfamiliar faces). Stage 2 training supports a surprising degree of generalization (e.g., pose, expression, illumination, and appearance) for images of unfamiliar faces. This general face learning gives the system special knowledge of faces and enables it to perform within-category face discrimination for unfamiliar faces ( O’Toole et al. 2018 ). With or without Stage 3 training, the network is now capable of converting images of faces into points in a high-dimensional space, which, as noted above, is the primary goal of training. In practice, however, Stages 3 and 4 can provide a critical bridge to modeling behavioral characteristics of the human face processing system.
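Once the identity head is removed, the network is simply a function from a face image to a point in the representation space, and verification reduces to comparing two such points. The sketch below illustrates this; the embed and same_identity helpers are hypothetical names, and the similarity threshold is illustrative rather than recommended.

```python
import torch
import torch.nn.functional as F

def embed(model, images):
    """Map a batch of aligned face crops to L2-normalized face descriptors."""
    model.eval()
    with torch.no_grad():
        features = model(images)           # identity head already removed
    return F.normalize(features, dim=1)    # unit length simplifies comparison

def same_identity(model, img_a, img_b, threshold=0.4):
    """Verification: cosine similarity between two descriptors vs. a threshold.
    In practice the threshold is calibrated on a labeled validation set."""
    a, b = embed(model, img_a), embed(model, img_b)
    similarity = (a * b).sum(dim=1)        # cosine, because both are unit length
    return similarity > threshold
```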

4.2.3. Adapting to local statistics of people and visual environments: Stage 3.

The objective of Stage 3 training is to finalize the modification of the DCNN weights to better adapt to the application domain. The term application domain can refer to faces from a particular race or ethnicity or, as it is commonly used in industry, to the type of images to be processed (e.g., in-the-wild faces, passport photographs). This training is a crucial step in many applications because there will be no further transformation of the weights. Special care is needed in this training to avoid collapsing the representation into a form that is too specific. Training at this stage can improve performance for some faces and decrease it for others.

Whereas Stages 1 and 2 are used in the vast majority of published computational work, in Stage 3, researchers diverge. Although there is no standard implementation for this training, fine-tuning and learning a triplet loss embedding ( van der Maaten & Weinberger 2012 ) are common methods. These methods are conceptually similar but differ in implementation. In both methods, ( a ) new layers are added to the network, ( b ) specific subsets of layers are frozen or unfrozen, and ( c ) optimization continues with an appropriate loss function using a new data set with the desired domain characteristics. Fine-tuning starts from an already-viable network state and updates a nonempty subset of weights, or possibly all weights. It is typically implemented with smaller learning rates and can use smaller training sets than those needed for full training. Triplet loss is implemented by freezing all layers and adding a new, fully connected layer. Minimization is done with the triplet loss, again on a new (smaller) data set with the desired domain characteristics.
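The following sketch shows the triplet-loss variant of Stage 3 under two stated assumptions: the Stage 2 embedder is frozen, and a single new projection layer is trained on a small, domain-specific data set. Layer sizes and the margin are illustrative.

```python
import torch.nn as nn

class DomainAdapter(nn.Module):
    """Frozen general-purpose face embedder plus one trainable projection layer."""
    def __init__(self, embedder, dim=512):
        super().__init__()
        self.embedder = embedder
        for p in self.embedder.parameters():
            p.requires_grad = False         # keep the general face knowledge fixed
        self.project = nn.Linear(dim, dim)  # only this layer adapts to the domain

    def forward(self, x):
        return self.project(self.embedder(x))

triplet_loss = nn.TripletMarginLoss(margin=0.2)
# For each (anchor, positive, negative) triple drawn from the target domain:
#   loss = triplet_loss(adapter(anchor), adapter(positive), adapter(negative))
# where anchor and positive show the same identity and negative does not.
```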

A natural question is why Stage 2 (general face training) is not considered fine-tuning. The answer, in practice, comes down to viability and volume. When the training for Stage 2 starts, the network is not in a viable state to perform face recognition. Therefore, it requires a voluminous, diverse data set to function. Stage 3 begins with a functional network and can be tuned effectively with a small targeted data set.

This face knowledge history provides a tool for adapting to local face statistics (e.g., race) ( O’Toole et al. 2018 ).

4.2.4. Learning individual people: Stage 4.

In psychological terms, learning individual familiar faces involves seeing multiple, diverse images of the individuals to whom the faces belong. As we see more images of a person, we become more familiar with their face and can recognize it from increasingly variable images ( Dowsett et al. 2016 , Murphy et al. 2015 , Ritchie & Burton 2017 ). In computational terms, this translates into the question of how a network can learn to recognize a random set of special (familiar) faces with greater accuracy and robustness than other nonspecial (unfamiliar) faces—assuming, of course, the availability of multiple, variable images of the special faces. This stage of learning is defined, in nearly all cases, outside of the DCNN, with no change to weights within the DCNN.

The problem is as follows. The network starts with multiple images of each familiar identity and can produce a representation for each of those images, but what then? There is no standard familiarization protocol, but several approaches exist. We categorize these approaches here and link them to theoretical accounts of face familiarity in Section 4.3.3.

The first approach is averaging identity codes, or 1-class learning. It is common in machine learning to use an average (or weighted average) of the DCNN-generated face image representations as an identity code (see also Crosswhite et al. 2018 , Su et al. 2015 ). Averaging creates a person-identity prototype ( Noyes et al. 2021 ) for each familiar face.
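A minimal sketch of 1-class familiarization appears below, assuming the DCNN descriptors of one person's images are already available; the helper name is hypothetical.

```python
import torch
import torch.nn.functional as F

def identity_prototype(image_descriptors):
    """Average the descriptors of one person's images and renormalize,
    yielding a single person-identity prototype for that familiar face."""
    prototype = torch.stack(image_descriptors).mean(dim=0)
    return F.normalize(prototype, dim=0)
```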

The second is individual face contrast, or 2-class learning. This technique learns individual identities directly by contrasting each with all other identities. There are two classes because the model learns what makes each identity (the positive class) different from all other identities (the negative class). The distinctiveness of each familiar face is thereby enhanced relative to all other known faces (e.g., Noyes et al. 2021).

The third is multiple face contrast, or K-class learning. This refers to the use of identification training for a random set of (familiar) faces with a simple network (often a one-layer network). The network learns to map DCNN-generated face representations of the available images onto identity nodes.
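A sketch of K-class familiarization follows: a one-layer network maps frozen DCNN descriptors onto identity nodes for the familiarized people, with no change to the DCNN's own weights. Individual face contrast (2-class learning) corresponds to training one such output per identity against all others. Sizes are illustrative.

```python
import torch.nn as nn

EMBEDDING_DIM = 512   # size of the frozen DCNN face descriptor
NUM_FAMILIAR = 20     # illustrative number of familiarized identities

# One-layer identification network trained with Softmax + cross-entropy on
# descriptors of the familiar faces; the DCNN that produces the descriptors
# is left untouched.
familiarity_head = nn.Linear(EMBEDDING_DIM, NUM_FAMILIAR)
criterion = nn.CrossEntropyLoss()
```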

The fourth approach is fine-tuning individual face representations. Fine-tuning has also been used for learning familiar identities ( Blauch et al. 2020a ). It is an unusual method because it alters weights within the DCNN itself. This can improve performance for the familiarized faces but can limit the network’s ability to represent other faces.

These methods create a personal face learning history that supports more accurate and robust face processing for familiar people ( O’Toole et al. 2018 ).

4.3. Mapping Learning Between Humans and Machines

Deep networks rely on multiple types of learning that can be useful in formulating and testing complex, nuanced hypotheses about human face learning. Manipulable variables include order of learning, training data, and network plasticity at different learning stages. We consider a sample of topics in human face processing that can be investigated by manipulating learning in deep networks. Because these investigations are just beginning, we provide an overview of the work in progress and discuss possible next steps in modeling.

4.3.1. Development of face processing.

Early infants’ experience with faces is critical for the development of face processing skills ( Maurer et al. 2002 ). The timing of this experience has become increasingly clear with the availability of data sets gathered using head-mounted cameras in infants (1–15 months of age) (e.g., Jayaraman et al. 2015 , Yoshida & Smith 2008 ). In seeing the world from the perspective of the infant, it becomes clear that the development of sensorimotor abilities drives visual experience. Infants’ experience transitions from seeing only what is made available to them (often faces in the near range), to seeing the world from the perspective of a crawler (objects and environments), to seeing hands and the objects that they manipulate ( Fausey et al. 2016 , Jayaraman et al. 2015 , Smith & Slone 2017 , Sugden & Moulson 2017 ). Between 1 and 3 months of age, faces are frequent, temporally persistent, and viewed frontally at close range. This early experience with faces is limited to a few individuals. Faces become less frequent as the child’s first year progresses and attention shifts to the environment, to objects, and later to hands ( Jayaraman & Smith 2019 ).

The prevalence of a few important faces in the infants’ visual world suggests that early face learning may have an out-sized influence on structuring visual recognition systems. Infants’ visual experience of objects, faces, and environments can provide a curriculum for teaching machines ( Smith et al. 2018 ). DCNNs can be used to test hypotheses about the emergence of competence on different face processing tasks. Some basic computational challenges, however, need to be addressed. Training with very large numbers of objects (or faces) is required for deep network learning to converge (see Section 4.2.1 ). Starting small and building competence on multiple domains (faces, objects, environments) might require basic changes to deep network training. Alternatively, the small number of special faces in an infant’s life might be considered familiar faces. Perception and memory of these faces may be better modeled using tools that operate outside the deep network on representations that develop within the network (Stage 4 learning; Section 4.2.4 ). In this case, the quality of the representation produced at different points in a network’s development of more general visual knowledge varies (Stages 1 and 2 of training; Sections 4.2.1 and 4.2.2 ). The learning of these special faces early in development might interact with the learning of objects and scenes at the categorical level ( Rosch et al. 1976 , Yovel et al. 2012 ). A promising approach would involve pausing training in Stages 1 and 2 to test face representation quality at various points along the way to convergence.

4.3.2. Race bias in the performance of humans and deep networks.

People recognize own-race faces more accurately than other-race faces. For humans, this other-race effect begins in infancy ( Kelly et al. 2005 , 2007 ) and is manifest in children ( Pezdek et al. 2003 ). Although it is possible to reverse these effects in childhood ( Sangrigoli et al. 2005 ), training adults to recognize other-race faces yields only modest gains (e.g., Cavazos et al. 2019 , Hayward et al. 2017 , Laurence et al. 2016 , Matthews & Mondloch 2018 , Tanaka & Pierce 2009 ). Concomitantly, evidence for the experience-based contact hypothesis is weak when it is evaluated in adulthood ( Levin 2000 ). Clearly, the timing of experience is critical in the other-race effect. Developmental learning, which results in perceptual narrowing during a critical childhood period, may provide a partial account of the other-race effect ( Kelly et al. 2007 , Sangrigoli et al. 2005 , Scott & Monesson 2010 ).

Perceptual narrowing: sculpting of neural and perceptual processing via experience during a critical period in child development.

Face recognition algorithms from the 1990s and present-day DCNNs differ in accuracy for faces of different races (for a review, see Cavazos et al. 2020 ; for a comprehensive test of race bias in DCNNs, see Grother et al. 2019 ). Although training with faces of different races is often cited as a cause of race effects, it is unclear which training stage(s) contribute to the bias. It is likely that biased learning affects all learning stages. From the human perspective, for many people, experience favors own-race faces across the lifespan, potentially impacting performance through multiple learning mechanisms (developmental, unfamiliar, and familiar face learning). DCNN training may also use race-biased data at all stages. For humans, understanding the role of different types of learning in the other-race effect is challenging because experience with faces cannot be controlled. DCNNs can serve as a tool for studying critical periods and perceptual narrowing. It is possible to compare the face representations that emerge from training regimes that vary in the time course of exposure to faces of different races. The ability to manipulate training stage order, network plasticity, and training set diversity in deep networks offers an opportunity to test hypotheses about how bias emerges. The major challenge for DCNNs is the limited availability of face databases that represent the diversity of humans.

4.3.3. Familiar versus unfamiliar face recognition.

Face familiarity in a deep network can be modeled in more ways than we can count. The approaches presented in Section 4.2.4 are just a beginning. Researchers should focus first on the big questions. How do familiar and unfamiliar face representations differ—beyond simple accuracy and robustness? This has been much debated recently, and many questions remain ( Blauch et al. 2020a , b ; Young & Burton 2020 ; Yovel & Abudarham 2020 ). One approach is to ask where in the learning process representations for familiar and unfamiliar faces diverge. The methods outlined in Section 4.2.4 make some predictions.

In the individual and multiple face contrast methods, familiar and unfamiliar face representations are not differentiated within the deep network. Instead, familiar face representations generated by the DCNN are enhanced in another, simpler network populated with known faces. A familiar face’s representation is affected, therefore, by the other faces that we know well. Contrast techniques have preliminary empirical support. In the work of Noyes et al. (2021) , familiarization using individual-face contrast improved identification for both evasion and impersonation disguise. It also produced a pattern of accuracy similar to that seen for people familiar with the disguised individuals ( Noyes & Jenkins 2019 ). For humans who were unfamiliar with the disguised faces, the pattern of accuracy resembled that seen after general face training inside of the DCNN. There is also support for multiple-face contrast familiarization. Perceptual expertise findings that emphasize the selective effects of the exemplars experienced during highly skilled learning are consistent with this approach ( Collins & Behrmann 2020 ) (see Section 3.2 ).

Familiarization by averaging and fine-tuning both improve performance, but at a cost. For example, averaging the DCNN representations increased performance for evasion disguise by increasing tolerance for appearance variation (Noyes et al. 2021). It decreased performance, however, for impersonation disguise by allowing too much tolerance for appearance variation. Averaging methods highlight the need to balance the perception of identity across variable images with an ability to tell similar faces apart.

Familiarization via fine-tuning was explored by Blauch et al. (2020a) , who varied the number of layers tuned (all layers, fully connected layers, only the fully connected layer mapping the perceptual layer to identity nodes). Fine-tuning applied at lower layers alters the weights within the deep network to produce a perceptual representation potentially affected by familiar faces. Fine-tuning in the mapping layer is equivalent to multiclass face contrast learning ( Blauch et al. 2020b ). Blauch et al. (2020b) show that fine-tuning the perceptual representation, which they consider analogous to perceptual learning, is not necessary for producing a familiarity effect ( Blauch et al. 2020a ).

These approaches are not (necessarily) mutually exclusive and therefore can be combined to exploit useful features of each.

4.3.4. Objects, faces, both.

The organization of face-, body-, and object-selective areas in the ventral temporal cortex has been studied intensively (cf. Grill-Spector & Weiner 2014 ). Neuroimaging studies in childhood reveal the developmental time course of face selectivity and other high-level visual tasks (e.g., Natu et al. 2016 ; Nordt et al. 2019 , 2020 ). How these systems interact during development in the context of constantly changing input from the environment is an open question. DCNNs can be used to test functional hypotheses about the development of object and face learning (see also Grill-Spector et al. 2018 ).

In the case of machine learning, face recognition networks are more accurate when pretrained to categorize objects (Liu et al. 2015, Yi et al. 2014), and networks trained with only faces are more accurate for face recognition than networks trained with only objects (Abudarham & Yovel 2020, Blauch et al. 2020a). Human-like viewpoint invariance was found in a DCNN trained for face recognition but not in one trained for object recognition (Abudarham & Yovel 2020). In machine learning, networks are typically trained first with objects and then with faces. Networks can also learn object and face recognition simultaneously (Dobs et al. 2020), with minimal duplication of neural resources.

4.4. New Tools, New Questions, New Data, and a New Look at Old Data

Psychologists have long posited diverse and complex learning mechanisms for faces. Deep networks provide new tools that can be used to model human face learning with greater precision than was possible previously. This is useful because it encourages theoreticians to articulate hypotheses in ways specific enough to model. It may no longer be sufficient to explain a phenomenon in terms of generic learning or contact. Concepts such as perceptual narrowing should include ideas about where and how in the learning process this narrowing occurs. A major challenge ahead is the sheer number of knobs to be set in deep networks. Plasticity, for example, can be dialed up or down and applied to selected network layers; specific face diets can be administered across multiple learning stages (in sequence or simultaneously). The list goes on. In all of the topics discussed, and others not discussed, theoretical ideas should specify the manipulations thought to be most critical. We should follow the counsel of Box (1976) to worry selectively, focusing on what is most important. New tools succeed when they facilitate the discovery of things that we did not know or had not hypothesized. Testing these hypotheses will require new data and may suggest a reevaluation of existing data.

5. THE PATH FORWARD

In this review, we highlight fundamental advances in thinking brought about by deep learning approaches. These networks solve the inverse optics problem for face identification by untangling image, appearance, and identity over layers of neural-like processing. This demonstrates that robust face identification can be achieved with a representation that includes specific information about the face image(s) actually experienced. These representations retain information about appearance, perceived traits, expressions, and identity.

Direct-fit models posit that deep networks operate by placing new observations into the context of past experience. These models depend on overparameterized networks that create a high-dimensional space from real-world training data. Face representations housed within this space project onto units, thereby confounding stimulus features that (may) separate in the high-dimensional space. This raises questions about the transparency and interpretability of information gained by examining the response properties of network units. Deep networks can be studied at both the micro- and macroscale simultaneously and can be used to formulate hypotheses about the underlying neural code for faces. A key to understanding face representations is to reconcile the responses of neurons with the structure of the code in the high-dimensional space. This is a challenging problem best approached by combining psychological, neural, and computational methods.

The process of training a deep network is complex and layered. It draws on learning mechanisms aimed at objects and faces, visual categories of faces (e.g., race), and special familiar faces. Psychological and neural theory considers the many ways in which people and brains learn faces from real-world visual experience. DCNNs offer the potential to implement and test sophisticated hypotheses about how humans learn faces across the lifespan.

We should not lose sight of the fact that a compelling reason to study deep networks is that they actually work: they perform nearly as well as humans on face recognition tasks that have stymied computational modelers for decades. This might qualify as a property of deep networks that is importantly right (Box 1976). There is a difference, of course, between working and working like humans. Determining whether a deep network can work like humans, or could be made to do so by manipulating other properties of the network (e.g., architectures, training data, learning rules), is work that is just beginning.

SUMMARY POINTS

  • Face representations generated by DCNNs trained for identification retain information about the face (e.g., identity, demographics, attributes, traits, expression) and the image (e.g., viewpoint).
  • Deep learning face networks generate a surprisingly structured face representation from unstructured training with in-the-wild face images.
  • Individual output units from deep networks are unlikely to signal the presence of interpretable features.
  • Fundamental structural aspects of high-level visual codes for faces in deep networks replicate over a wide variety of network architectures.
  • Diverse learning mechanisms in DCNNs, applied simultaneously or in sequence, can be used to model human face perception across the lifespan.

FUTURE ISSUES

  • Large-scale systematic manipulations of training data (race, ethnicity, image variability) are needed to give insight into the role of experience in structuring face representations.
  • Fundamental challenges remain in understanding how to combine deep networks for face, object, and scene recognition in ways analogous to the human visual system.
  • Deep networks model the ventral visual stream at a generic level, arguably up to the level of the IT cortex. Future work should examine how downstream systems, such as face patches, could be connected into this system.
  • In rethinking the goals of face processing, we argue in this review that some longstanding assumptions about visual representations should be reconsidered. Future work should consider novel experimental questions and employ methods that do not rely on these assumptions.

ACKNOWLEDGMENTS

The authors are supported by funding provided by National Eye Institute grant R01EY029692-03 to A.J.O. and C.D.C.

DISCLOSURE STATEMENT

C.D.C. is an equity holder in Mukh Technologies, which may potentially benefit from research results.

1 This is the case in networks trained with the Softmax objective function.

LITERATURE CITED

  • Abadi M, Barham P, Chen J, Chen Z, Davis A, et al. 2016. Tensorflow: a system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) , pp. 265–83. Berkeley, CA: USENIX [ Google Scholar ]
  • Abudarham N, Shkiller L, Yovel G. 2019. Critical features for face recognition . Cognition 182 :73–83 [ PubMed ] [ Google Scholar ]
  • Abudarham N, Yovel G. 2020. Face recognition depends on specialized mechanisms tuned to view-invariant facial features: insights from deep neural networks optimized for face or object recognition . bioRxiv 2020.01.01.890277 . 10.1101/2020.01.01.890277 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Azevedo FA, Carvalho LR, Grinberg LT, Farfel JM, Ferretti RE, et al. 2009. Equal numbers of neuronal and nonneuronal cells make the human brain an isometrically scaled-up primate brain . J. Comp. Neurol 513 ( 5 ):532–41 [ PubMed ] [ Google Scholar ]
  • Barlow HB. 1972. Single units and sensation: a neuron doctrine for perceptual psychology? Perception 1 ( 4 ):371–94 [ PubMed ] [ Google Scholar ]
  • Bashivan P, Kar K, DiCarlo JJ. 2019. Neural population control via deep image synthesis . Science 364 ( 6439 ):eaav9436 [ PubMed ] [ Google Scholar ]
  • Best-Rowden L, Jain AK. 2018. Learning face image quality from human assessments . IEEE Trans. Inform. Forensics Secur 13 ( 12 ):3064–77 [ Google Scholar ]
  • Blanz V, Vetter T. 1999. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques , pp. 187–94. New York: ACM [ Google Scholar ]
  • Blauch NM, Behrmann M, Plaut DC. 2020a. Computational insights into human perceptual expertise for familiar and unfamiliar face recognition . Cognition 208 :104341. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Blauch NM, Behrmann M, Plaut DC. 2020b. Deep learning of shared perceptual representations for familiar and unfamiliar faces: reply to commentaries . Cognition 208 :104484. [ PubMed ] [ Google Scholar ]
  • Box GE. 1976. Science and statistics . J. Am. Stat. Assoc 71 ( 356 ):791–99 [ Google Scholar ]
  • Box GEP. 1979. Robustness in the strategy of scientific model building. In Robustness in Statistics , ed. Launer RL, Wilkinson GN, pp. 201–36. Cambridge, MA: Academic Press [ Google Scholar ]
  • Bruce V, Young A. 1986. Understanding face recognition . Br. J. Psychol 77 ( 3 ):305–27 [ PubMed ] [ Google Scholar ]
  • Burton AM, Bruce V, Hancock PJ. 1999. From pixels to people: a model of familiar face recognition . Cogn. Sci 23 ( 1 ):1–31 [ Google Scholar ]
  • Cavazos JG, Noyes E, O’Toole AJ. 2019. Learning context and the other-race effect: strategies for improving face recognition . Vis. Res 157 :169–83 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Cavazos JG, Phillips PJ, Castillo CD, O’Toole AJ. 2020. Accuracy comparison across face recognition algorithms: Where are we on measuring race bias? IEEE Trans. Biom. Behav. Identity Sci 3 ( 1 ):101–11 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Chang L, Tsao DY. 2017. The code for facial identity in the primate brain . Cell 169 ( 6 ):1013–28 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Chen JC, Patel VM, Chellappa R. 2016. Unconstrained face verification using deep CNN features. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) , pp. 1–9. Piscataway, NJ: IEEE [ Google Scholar ]
  • Cichy RM, Kaiser D. 2019. Deep neural networks as scientific models . Trends Cogn. Sci 23 ( 4 ):305–17 [ PubMed ] [ Google Scholar ]
  • Collins E, Behrmann M. 2020. Exemplar learning reveals the representational origins of expert category perception . PNAS 117 ( 20 ):11167–77 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Colón YI, Castillo CD, O’Toole AJ. 2021. Facial expression is retained in deep networks trained for face identification . J. Vis 21 ( 4 ):4 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Cootes TF, Taylor CJ, Cooper DH, Graham J. 1995. Active shape models-their training and application . Comput. Vis. Image Underst 61 ( 1 ):38–59 [ Google Scholar ]
  • Crosswhite N, Byrne J, Stauffer C, Parkhi O, Cao Q, Zisserman A. 2018. Template adaptation for face verification and identification . Image Vis. Comput 79 :35–48 [ Google Scholar ]
  • Deng J, Guo J, Xue N, Zafeiriou S. 2019. Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pp. 4690–99. Piscataway, NJ: IEEE [ PubMed ] [ Google Scholar ]
  • Dhar P, Bansal A, Castillo CD, Gleason J, Phillips P, Chellappa R. 2020. How are attributes expressed in face DCNNs? In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) , pp. 61–68. Piscataway, NJ: IEEE [ Google Scholar ]
  • DiCarlo JJ, Cox DD. 2007. Untangling invariant object recognition . Trends Cogn. Sci 11 ( 8 ):333–41 [ PubMed ] [ Google Scholar ]
  • Dobs K, Kell AJ, Martinez J, Cohen M, Kanwisher N. 2020. Using task-optimized neural networks to understand why brains have specialized processing for faces . J. Vis 20 ( 11 ):660 [ Google Scholar ]
  • Dowsett A, Sandford A, Burton AM. 2016. Face learning with multiple images leads to fast acquisition of familiarity for specific individuals . Q. J. Exp. Psychol 69 ( 1 ):1–10 [ PubMed ] [ Google Scholar ]
  • El Khiyari H, Wechsler H. 2016. Face verification subject to varying (age, ethnicity, and gender) demographics using deep learning . J. Biom. Biostat 7 :323 [ Google Scholar ]
  • Fausey CM, Jayaraman S, Smith LB. 2016. From faces to hands: changing visual input in the first two years . Cognition 152 :101–7 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Freiwald WA, Tsao DY. 2010. Functional compartmentalization and viewpoint generalization within the macaque face-processing system . Science 330 ( 6005 ):845–51 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Freiwald WA, Tsao DY, Livingstone MS. 2009. A face feature space in the macaque temporal lobe . Nat. Neurosci 12 ( 9 ):1187–96 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Fukushima K 1988. Neocognitron: a hierarchical neural network capable of visual pattern recognition . Neural Netw 1 ( 2 ):119–30 [ Google Scholar ]
  • Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, et al. 2014. Generative adversarial nets. In NIPS’14: Proceedings of the 27th International Conference on Neural Information Processing Systems , pp. 2672–80. New York: ACM [ Google Scholar ]
  • Goodman CS, Shatz CJ. 1993. Developmental mechanisms that generate precise patterns of neuronal connectivity . Cell 72 :77–98 [ PubMed ] [ Google Scholar ]
  • Grill-Spector K, Kushnir T, Edelman S, Avidan G, Itzchak Y, Malach R. 1999. Differential processing of objects under various viewing conditions in the human lateral occipital complex . Neuron 24 ( 1 ):187–203 [ PubMed ] [ Google Scholar ]
  • Grill-Spector K, Weiner KS. 2014. The functional architecture of the ventral temporal cortex and its role in categorization . Nat. Rev. Neurosci 15 ( 8 ):536–48 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Grill-Spector K, Weiner KS, Gomez J, Stigliani A, Natu VS. 2018. The functional neuroanatomy of face perception: from brain measurements to deep neural networks . Interface Focus 8 ( 4 ):20180013. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Gross CG. 2002. Genealogy of the “grandmother cell” . Neuroscientist 8 ( 5 ):512–18 [ PubMed ] [ Google Scholar ]
  • Grother P, Ngan M, Hanaoka K. 2019. Face recognition vendor test (FRVT) part 3: demographic effects . Rep., Natl. Inst. Stand. Technol., US Dept. Commerce, Gaithersburg, MD [ Google Scholar ]
  • Hancock PJ, Bruce V, Burton AM. 2000. Recognition of unfamiliar faces . Trends Cogn. Sci 4 ( 9 ):330–37 [ PubMed ] [ Google Scholar ]
  • Hasson U, Nastase SA, Goldstein A. 2020. Direct fit to nature: an evolutionary perspective on biological and artificial neural networks . Neuron 105 ( 3 ):416–34 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Hayward WG, Favelle SK, Oxner M, Chu MH, Lam SM. 2017. The other-race effect in face learning: using naturalistic images to investigate face ethnicity effects in a learning paradigm . Q. J. Exp. Psychol 70 ( 5 ):890–96 [ PubMed ] [ Google Scholar ]
  • Hesse JK, Tsao DY. 2020. The macaque face patch system: a turtle’s underbelly for the brain . Nat. Rev. Neurosci 21 ( 12 ):695–716 [ PubMed ] [ Google Scholar ]
  • Hill MQ, Parde CJ, Castillo CD, Colon YI, Ranjan R, et al. 2019. Deep convolutional neural networks in the face of caricature . Nat. Mach. Intel 1 ( 11 ):522–29 [ Google Scholar ]
  • Hong H, Yamins DL, Majaj NJ, DiCarlo JJ. 2016. Explicit information for category-orthogonal object properties increases along the ventral stream . Nat. Neurosci 19 ( 4 ):613–22 [ PubMed ] [ Google Scholar ]
  • Hornik K, Stinchcombe M, White H. 1989. Multilayer feedforward networks are universal approximators . Neural Netw 2 ( 5 ):359–66 [ Google Scholar ]
  • Huang GB, Lee H, Learned-Miller E. 2012. Learning hierarchical representations for face verification with convolutional deep belief networks. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pp. 2518–25. Piscataway, NJ: IEEE [ Google Scholar ]
  • Huang GB, Mattar M, Berg T, Learned-Miller E. 2008. Labeled faces in the wild: a database for studying face recognition in unconstrained environments . Paper presented at the Workshop on Faces in “Real-Life” Images: Detection, Alignment, and Recognition, Marseille, France [ Google Scholar ]
  • Ilyas A, Santurkar S, Tsipras D, Engstrom L, Tran B, Madry A. 2019. Adversarial examples are not bugs, they are features . arXiv:1905.02175 [stat.ML] [ Google Scholar ]
  • Issa EB, DiCarlo JJ. 2012. Precedence of the eye region in neural processing of faces . J. Neurosci 32 ( 47 ):16666–82 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Jacquet M, Champod C. 2020. Automated face recognition in forensic science: review and perspectives . Forensic Sci. Int 307 :110124. [ PubMed ] [ Google Scholar ]
  • Jayaraman S, Fausey CM, Smith LB. 2015. The faces in infant-perspective scenes change over the first year of life . PLOS ONE 10 ( 5 ):e0123780. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Jayaraman S, Smith LB. 2019. Faces in early visual environments are persistent not just frequent . Vis. Res 157 :213–21 [ PubMed ] [ Google Scholar ]
  • Jenkins R, White D, Van Montfort X, Burton AM. 2011. Variability in photos of the same face . Cognition 121 ( 3 ):313–23 [ PubMed ] [ Google Scholar ]
  • Kandel ER, Schwartz JH, Jessell TM, Siegelbaum S, Hudspeth AJ, Mack S, eds. 2000. Principles of Neural Science , Vol. 4 . New York: McGraw-Hill [ Google Scholar ]
  • Kay KN, Weiner KS, Grill-Spector K. 2015. Attention reduces spatial uncertainty in human ventral temporal cortex . Curr. Biol 25 ( 5 ):595–600 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Ge L, Pascalis O. 2007. The other-race effect develops during infancy: evidence of perceptual narrowing . Psychol. Sci 18 ( 12 ):1084–89 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kelly DJ, Quinn PC, Slater AM, Lee K, Gibson A, et al. 2005. Three-month-olds, but not newborns, prefer own-race faces . Dev. Sci 8 ( 6 ):F31–36 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Kietzmann TC, Swisher JD, König P, Tong F. 2012. Prevalence of selectivity for mirror-symmetric views of faces in the ventral and dorsal visual pathways . J. Neurosci 32 ( 34 ):11763–72 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Krishnapriya KS, Albiero V, Vangara K, King MC, Bowyer KW. 2020. Issues related to face recognition accuracy varying based on race and skin tone . IEEE Trans. Technol. Soc 1 ( 1 ):8–20 [ Google Scholar ]
  • Krishnapriya K, Vangara K, King MC, Albiero V, Bowyer K. 2019. Characterizing the variability in face recognition accuracy relative to race. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops , Vol. 1 , pp. 2278–85. Piscataway, NJ: IEEE [ Google Scholar ]
  • Krizhevsky A, Sutskever I, Hinton GE. 2012. Imagenet classification with deep convolutional neural networks. In NIPS’12: Proceedings of the 25th International Conference on Neural Information Processing Systems , pp. 1097–105. New York: ACM [ Google Scholar ]
  • Kumar N, Berg AC, Belhumeur PN, Nayar SK. 2009. Attribute and simile classifiers for face verification. In Proceedings of the 2009 IEEE International Conference on Computer Vision , pp. 365–72. Piscataway, NJ: IEEE [ Google Scholar ]
  • Laurence S, Zhou X, Mondloch CJ. 2016. The flip side of the other-race coin: They all look different to me . Br. J. Psychol 107 ( 2 ):374–88 [ PubMed ] [ Google Scholar ]
  • LeCun Y, Bengio Y, Hinton G. 2015. Deep learning . Nature 521 ( 7553 ):436–44 [ PubMed ] [ Google Scholar ]
  • Levin DT. 2000. Race as a visual feature: using visual search and perceptual discrimination tasks to understand face categories and the cross-race recognition deficit . J. Exp. Psychol. Gen 129 ( 4 ):559–74 [ PubMed ] [ Google Scholar ]
  • Lewenberg Y, Bachrach Y, Shankar S, Criminisi A. 2016. Predicting personal traits from facial images using convolutional neural networks augmented with facial landmark information . arXiv:1605.09062 [cs.CV] [ Google Scholar ]
  • Li Y, Gao F, Ou Z, Sun J. 2018. Angular softmax loss for end-to-end speaker verification. In Proceedings of the 11th International Symposium on Chinese Spoken Language Processing (ISCSLP) , pp. 190–94. Baixas, France: ISCA [ Google Scholar ]
  • Liu Z, Luo P, Wang X, Tang X. 2015. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision , pp. 3730–38. Piscataway, NJ: IEEE [ Google Scholar ]
  • Lundqvist D, Flykt A, Ohman A. 1998. Karolinska directed emotional faces . Database of standardized facial images, Psychol. Sect., Dept. Clin. Neurosci. Karolinska Hosp., Solna, Swed. https://www.kdef.se/#:~:text=The%20Karolinska%20Directed%20Emotional%20Faces,from%20the%20original%20KDEF%20images [ Google Scholar ]
  • Malpass RS, Kravitz J. 1969. Recognition for faces of own and other race . J. Personal. Soc. Psychol 13 ( 4 ):330–34 [ PubMed ] [ Google Scholar ]
  • Matthews CM, Mondloch CJ. 2018. Improving identity matching of newly encountered faces: effects of multi-image training . J. Appl. Res. Mem. Cogn 7 ( 2 ):280–90 [ Google Scholar ]
  • Maurer D, Le Grand R, Mondloch CJ. 2002. The many faces of configural processing . Trends Cogn. Sci 6 ( 6 ):255–60 [ PubMed ] [ Google Scholar ]
  • Maze B, Adams J, Duncan JA, Kalka N, Miller T, et al. 2018. IARPA Janus Benchmark—C: face dataset and protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB) , pp. 158–65. Piscataway, NJ: IEEE [ Google Scholar ]
  • McCurrie M, Beletti F, Parzianello L, Westendorp A, Anthony S, Scheirer WJ. 2017. Predicting first impressions with deep learning. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 518–25. Piscataway, NJ: IEEE [ Google Scholar ]
  • Murphy J, Ipser A, Gaigg SB, Cook R. 2015. Exemplar variance supports robust learning of facial identity . J. Exp. Psychol. Hum. Percept. Perform 41 ( 3 ):577–81 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Natu VS, Barnett MA, Hartley J, Gomez J, Stigliani A, Grill-Spector K. 2016. Development of neural sensitivity to face identity correlates with perceptual discriminability . J. Neurosci 36 ( 42 ):10893–907 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Natu VS, Jiang F, Narvekar A, Keshvari S, Blanz V, O’Toole AJ. 2010. Dissociable neural patterns of facial identity across changes in viewpoint . J. Cogn. Neurosci 22 ( 7 ):1570–82 [ PubMed ] [ Google Scholar ]
  • Nordt M, Gomez J, Natu V, Jeska B, Barnett M, Grill-Spector K. 2019. Learning to read increases the informativeness of distributed ventral temporal responses . Cereb. Cortex 29 ( 7 ):3124–39 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Nordt M, Gomez J, Natu VS, Rezai AA, Finzi D, Grill-Spector K. 2020. Selectivity to limbs in ventral temporal cortex decreases during childhood as selectivity to faces and words increases . J. Vis 20 ( 11 ):152 [ Google Scholar ]
  • Noyes E, Jenkins R. 2019. Deliberate disguise in face identification . J. Exp. Psychol. Appl 25 ( 2 ):280–90 [ PubMed ] [ Google Scholar ]
  • Noyes E, Parde C, Colon Y, Hill M, Castillo C, et al. 2021. Seeing through disguise: getting to know you with a deep convolutional neural network . Cognition . In press [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Noyes E, Phillips P, O’Toole A. 2017. What is a super-recogniser. In Face Processing: Systems, Disorders and Cultural Differences , ed. Bindemann M, pp. 173–201. Hauppage, NY: Nova Sci. Publ. [ Google Scholar ]
  • Oosterhof NN, Todorov A. 2008. The functional basis of face evaluation . PNAS 105 ( 32 ):11087–92 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • O’Toole AJ, Castillo CD, Parde CJ, Hill MQ, Chellappa R. 2018. Face space representations in deep convolutional neural networks . Trends Cogn. Sci 22 ( 9 ):794–809 [ PubMed ] [ Google Scholar ]
  • O’Toole AJ, Phillips PJ, Jiang F, Ayyad J, Pénard N, Abdi H. 2007. Face recognition algorithms surpass humans matching faces over changes in illumination . IEEE Trans. Pattern Anal. Mach. Intel ( 9 ):1642–46 [ PubMed ] [ Google Scholar ]
  • Parde CJ, Castillo C, Hill MQ, Colon YI, Sankaranarayanan S, et al. 2017. Face and image representation in deep CNN features. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) , pp. 673–80. Piscataway, NJ: IEEE [ Google Scholar ]
  • Parde CJ, Colón YI, Hill MQ, Castillo CD, Dhar P, O’Toole AJ. 2021. Face recognition by humans and machines: closing the gap between single-unit and neural population codes—insights from deep learning in face recognition . J. Vis In press [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Parde CJ, Hu Y, Castillo C, Sankaranarayanan S, O’Toole AJ. 2019. Social trait information in deep convolutional neural networks trained for face identification . Cogn. Sci 43 ( 6 ):e12729. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Parkhi OM, Vedaldi A, Zisserman A. 2015. Deep face recognition . Rep., Vis. Geom. Group, Dept. Eng. Sci., Univ. Oxford, UK [ Google Scholar ]
  • Paszke A, Gross S, Massa F, Lerer A, Bradbury J, et al. 2019. Pytorch: an imperative style, high-performance deep learning library. In NeurIPS 2019: Proceedings of the 32nd International Conference on Neural Information Processing Systems , pp. 8024–35. New York: ACM [ Google Scholar ]
  • Pezdek K, Blandon-Gitlin I, Moore C. 2003. Children’s face recognition memory: more evidence for the cross-race effect . J. Appl. Psychol 88 ( 4 ):760–63 [ PubMed ] [ Google Scholar ]
  • Phillips PJ, Beveridge JR, Draper BA, Givens G, O’Toole AJ, et al. 2011. An introduction to the good, the bad, & the ugly face recognition challenge problem. In Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG) , pp. 346–53. Piscataway, NJ: IEEE [ Google Scholar ]
  • Phillips PJ, O’Toole AJ. 2014. Comparison of human and computer performance across face recognition experiments . Image Vis. Comput 32 ( 1 ):74–85 [ Google Scholar ]
  • Phillips PJ, Yates AN, Hu Y, Hahn CA, Noyes E, et al. 2018. Face recognition accuracy of forensic examiners, superrecognizers, and face recognition algorithms . PNAS 115 ( 24 ):6171–76 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Poggio T, Banburski A, Liao Q. 2020. Theoretical issues in deep networks . PNAS 117 ( 48 ):30039–45 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Ponce CR, Xiao W, Schade PF, Hartmann TS, Kreiman G, Livingstone MS. 2019. Evolving images for visual neurons using a deep generative network reveals coding principles and neuronal preferences . Cell 177 ( 4 ):999–1009 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Ranjan R, Bansal A, Zheng J, Xu H, Gleason J, et al. 2019. A fast and accurate system for face detection, identification, and verification . IEEE Trans. Biom. Behav. Identity Sci 1 ( 2 ):82–96 [ Google Scholar ]
  • Ranjan R, Castillo CD, Chellappa R. 2017. L2-constrained softmax loss for discriminative face verification . arXiv:1703.09507 [cs.CV] [ Google Scholar ]
  • Ranjan R, Sankaranarayanan S, Castillo CD, Chellappa R. 2017c. An all-in-one convolutional neural network for face analysis. In Proceedings of the 2017 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) , pp. 17–24. Piscataway, NJ: IEEE [ Google Scholar ]
  • Richards BA, Lillicrap TP, Beaudoin P, Bengio Y, Bogacz R, et al. 2019. A deep learning framework for neuroscience . Nat. Neurosci 22 ( 11 ):1761–70 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Ritchie KL, Burton AM. 2017. Learning faces from variability . Q. J. Exp. Psychol 70 ( 5 ):897–905 [ PubMed ] [ Google Scholar ]
  • Rosch E, Mervis CB, Gray WD, Johnson DM, Boyes-Braem P. 1976. Basic objects in natural categories . Cogn. Psychol 8 ( 3 ):382–439 [ Google Scholar ]
  • Russakovsky O, Deng J, Su H, Krause J, Satheesh S, et al. 2015. ImageNet Large Scale Visual Recognition Challenge . Int. J. Comput. Vis 115 ( 3 ):211–52 [ Google Scholar ]
  • Russell R, Duchaine B, Nakayama K. 2009. Super-recognizers: people with extraordinary face recognition ability . Psychon. Bull. Rev 16 ( 2 ):252–57 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Sangrigoli S, Pallier C, Argenti AM, Ventureyra V, de Schonen S. 2005. Reversibility of the other-race effect in face recognition during childhood . Psychol. Sci 16 ( 6 ):440–44 [ PubMed ] [ Google Scholar ]
  • Sankaranarayanan S, Alavi A, Castillo C, Chellappa R. 2016. Triplet probabilistic embedding for face verification and clustering . arXiv:1604.05417 [cs.CV] [ Google Scholar ]
  • Schrimpf M, Kubilius J, Hong H, Majaj NJ, Rajalingham R, et al. 2018. Brain-score: Which artificial neural network for object recognition is most brain-like? bioRxiv 407007 . 10.1101/407007 [ CrossRef ] [ Google Scholar ]
  • Schroff F, Kalenichenko D, Philbin J. 2015. Facenet: a unified embedding for face recognition and clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition , pp. 815–23. Piscataway, NJ: IEEE [ Google Scholar ]
  • Scott LS, Monesson A. 2010. Experience-dependent neural specialization during infancy . Neuropsychologia 48 ( 6 ):1857–61 [ PubMed ] [ Google Scholar ]
  • Sengupta S, Chen JC, Castillo C, Patel VM, Chellappa R, Jacobs DW. 2016. Frontal to profile face verification in the wild. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) , pp. 1–9. Piscataway, NJ: IEEE [ Google Scholar ]
  • Sim T, Baker S, Bsat M. 2002. The CMU pose, illumination, and expression (PIE) database. In Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition , pp. 53–58. Piscataway, NJ: IEEE [ Google Scholar ]
  • Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition . arXiv:1409.1556 [cs.CV] [ Google Scholar ]
  • Smith LB, Jayaraman S, Clerkin E, Yu C. 2018. The developing infant creates a curriculum for statistical learning . Trends Cogn. Sci 22 ( 4 ):325–36 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Smith LB, Slone LK. 2017. A developmental approach to machine learning? Front. Psychol 8 :2124. [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Song A, Linjie L, Atalla C, Gottrell G. 2017. Learning to see people like people: predicting social impressions of faces . Cogn. Sci 2017 :1096–101 [ Google Scholar ]
  • Storrs KR, Kietzmann TC, Walther A, Mehrer J, Kriegeskorte N. 2020. Diverse deep neural networks all predict human it well, after training and fitting . bioRxiv 2020.05.07.082743 . 10.1101/2020.05.07.082743 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Su H, Maji S, Kalogerakis E, Learned-Miller E. 2015. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the 2015 IEEE International Conference on Computer Vision , pp. 945–53. Piscataway, NJ: IEEE [ Google Scholar ]
  • Sugden NA, Moulson MC. 2017. Hey baby, what’s “up”? One-and 3-month-olds experience faces primarily upright but non-upright faces offer the best views . Q. J. Exp. Psychol 70 ( 5 ):959–69 [ PubMed ] [ Google Scholar ]
  • Taigman Y, Yang M, Ranzato M, Wolf L. 2014. Deepface: closing the gap to human-level performance in face verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition , pp. 1701–8. Piscataway, NJ: IEEE [ Google Scholar ]
  • Tanaka JW, Pierce LJ. 2009. The neural plasticity of other-race face recognition . Cogn. Affect. Behav. Neurosci 9 ( 1 ):122–31 [ PubMed ] [ Google Scholar ]
  • Terhörst P, Fährmann D, Damer N, Kirchbuchner F, Kuijper A. 2020. Beyond identity: What information is stored in biometric face templates? arXiv:2009.09918 [cs.CV] [ Google Scholar ]
  • Thorpe S, Fize D, Marlot C. 1996. Speed of processing in the human visual system . Nature 381 ( 6582 ):520–22 [ PubMed ] [ Google Scholar ]
  • Todorov A 2017. Face Value: The Irresistible Influence of First Impressions . Princeton, NJ: Princeton Univ. Press [ Google Scholar ]
  • Todorov A, Mandisodza AN, Goren A, Hall CC. 2005. Inferences of competence from faces predict election outcomes . Science 308 ( 5728 ):1623–26 [ PubMed ] [ Google Scholar ]
  • Valentine T 1991. A unified account of the effects of distinctiveness, inversion, and race in face recognition . Q. J. Exp. Psychol. A 43 ( 2 ):161–204 [ PubMed ] [ Google Scholar ]
  • van der Maaten L, Weinberger K. 2012. Stochastic triplet embedding. In Proceedings of the 2012 IEEE International Workshop on Machine Learning for Signal Processing , pp. 1–6. Piscataway, NJ: IEEE [ Google Scholar ]
  • Walker M, Vetter T. 2009. Portraits made to measure: manipulating social judgments about individuals with a statistical face model . J. Vis 9 ( 11 ):12 [ PubMed ] [ Google Scholar ]
  • Wang F, Liu W, Liu H, Cheng J. 2018. Additive margin softmax for face verification . IEEE Signal Process. Lett 25 :926–30 [ Google Scholar ]
  • Wang F, Xiang X, Cheng J, Yuille AL. 2017. Normface: L 2 hypersphere embedding for face verification. In MM ‘17: Proceedings of the 25th ACM International Conference on Multimedia , pp. 1041–49. New York: ACM [ Google Scholar ]
  • Xie C, Tan M, Gong B, Wang J, Yuille AL, Le QV. 2020. Adversarial examples improve image recognition. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition , pp. 819–28. Piscataway, NJ: IEEE [ Google Scholar ]
  • Yamins DL, Hong H, Cadieu CF, Solomon EA, Seibert D, DiCarlo JJ. 2014. Performance-optimized hierarchical models predict neural responses in higher visual cortex . PNAS 111 ( 23 ):8619–24 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Yi D, Lei Z, Liao S, Li SZ. 2014. Learning face representation from scratch . arXiv:1411.7923 [cs.CV] [ Google Scholar ]
  • Yoshida H, Smith LB. 2008. What’s in view for toddlers? Using a head camera to study visual experience . Infancy 13 ( 3 ):229–48 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Young AW, Burton AM. 2020. Insights from computational models of face recognition: a reply to Blauch, Behrmann and Plaut . Cognition 208 :104422. [ PubMed ] [ Google Scholar ]
  • Yovel G, Abudarham N. 2020. From concepts to percepts in human and machine face recognition: a reply to Blauch, Behrmann & Plaut . Cognition 208 :104424. [ PubMed ] [ Google Scholar ]
  • Yovel G, Halsband K, Pelleg M, Farkash N, Gal B, Goshen-Gottstein Y. 2012. Can massive but passive exposure to faces contribute to face recognition abilities? J. Exp. Psychol. Hum. Percept. Perform 38 ( 2 ):285–89 [ PubMed ] [ Google Scholar ]
  • Yovel G, O’Toole AJ. 2016. Recognizing people in motion . Trends Cogn. Sci 20 ( 5 ):383–95 [ PubMed ] [ Google Scholar ]
  • Yuan L, Xiao W, Kreiman G, Tay FE, Feng J, Livingstone MS. 2020. Adversarial images for the primate brain . arXiv:2011.05623 [q-bio.NC] [ Google Scholar ]
  • Yue X, Cassidy BS, Devaney KJ, Holt DJ, Tootell RB. 2010. Lower-level stimulus features strongly influence responses in the fusiform face area . Cereb. Cortex 21 ( 1 ):35–47 [ PMC free article ] [ PubMed ] [ Google Scholar ]

Face Recognition

600 papers with code • 23 benchmarks • 64 datasets

Facial recognition is the task of making a positive identification of a face in a photo or video image against a pre-existing database of faces. It begins with detection, distinguishing human faces from other objects in the image, and then proceeds to identification of the detected faces.

State-of-the-art results for this task are reported mainly for its two constituent subtasks: face verification and face identification.

Most implemented papers

FaceNet: A Unified Embedding for Face Recognition and Clustering

On the widely used Labeled Faces in the Wild (LFW) dataset, our system achieves a new record accuracy of 99.63%.

ArcFace: Additive Angular Margin Loss for Deep Face Recognition


Recently, a popular line of research in face recognition is adopting margins in the well-established softmax loss function to maximize class separability.

VGGFace2: A dataset for recognising faces across pose and age

The dataset was collected with three goals in mind: (i) to have both a large number of identities and also a large number of images for each identity; (ii) to cover a large range of pose, age and ethnicity; and (iii) to minimize the label noise.

SphereFace: Deep Hypersphere Embedding for Face Recognition

This paper addresses deep face recognition (FR) problem under open-set protocol, where ideal face features are expected to have smaller maximal intra-class distance than minimal inter-class distance under a suitably chosen metric space.

A Light CNN for Deep Face Representation with Noisy Labels

This paper presents a Light CNN framework to learn a compact embedding on the large-scale face data with massive noisy labels.

Learning Face Representation from Scratch

The current situation in the field of face recognition is that data is more important than algorithm.

Circle Loss: A Unified Perspective of Pair Similarity Optimization

This paper provides a pair similarity optimization viewpoint on deep feature learning, aiming to maximize the within-class similarity $s_p$ and minimize the between-class similarity $s_n$.

MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition

In this paper, we design a benchmark task and provide the associated datasets for recognizing face images and link them to corresponding entity keys in a knowledge base.

CosFace: Large Margin Cosine Loss for Deep Face Recognition


The central task of face recognition, including face verification and identification, involves face feature discrimination.

RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition

We propose RepMLP, a multi-layer-perceptron-style neural network building block for image recognition, which is composed of a series of fully-connected (FC) layers.



Open Access

Peer-reviewed

Research Article

A deep facial recognition system using computational intelligent algorithms

Roles Conceptualization, Data curation, Formal analysis, Methodology, Supervision, Writing – original draft, Writing – review & editing

* E-mail: [email protected]

Affiliations Department of Information Systems, Faculty of Computers and Artificial Intelligence, Benha University, Benha City, Egypt, Department of Computer Science, Faculty of Computers and Informatics, Misr International University, Cairo, Egypt


Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Writing – original draft

Affiliation Department of Information Systems, Faculty of Computers and Artificial Intelligence, Benha University, Benha City, Egypt

Roles Formal analysis, Investigation, Methodology, Software, Validation, Writing – review & editing

Affiliation Department of Computer Science, Faculty of Computers and Artificial Intelligence, Benha University, Benha City, Egypt

Roles Conceptualization, Investigation, Project administration, Writing – original draft, Writing – review & editing

Affiliations Department of Scientific Computing, Faculty of Computers and Artificial Intelligence, Benha University, Benha City, Egypt, Department of Computer Science, Higher Technological Institute, 10th of Ramadan City, Egypt

  • Diaa Salama AbdELminaam, 
  • Abdulrhman M. Almansori, 
  • Mohamed Taha, 
  • Elsayed Badr


  • Published: December 3, 2020
  • https://doi.org/10.1371/journal.pone.0242269


The development of biometric applications, such as facial recognition (FR), has recently become important in smart cities. Many scientists and engineers around the world have focused on establishing increasingly robust and accurate algorithms and methods for these types of systems and their applications in everyday life. FR is a developing technology with multiple real-time applications. The goal of this paper is to develop a complete FR system using transfer learning in fog computing and cloud computing. The developed system uses deep convolutional neural networks (DCNN) because of their dominant representation; some conditions, including occlusions, expressions, illuminations, and pose, can affect deep FR performance. The DCNN is used to extract relevant facial features, and these features allow faces to be compared in an efficient way. The system can be trained to recognize a set of people and to learn via an online method, by integrating the new people it processes and improving its predictions on the ones it already has. The proposed recognition method was tested with three different standard machine learning algorithms (decision tree (DT), K-nearest neighbor (KNN), and support vector machine (SVM)). The proposed system has been evaluated using three datasets of face images (SDUMLA-HMT, 113, and CASIA) via the performance metrics of accuracy, precision, sensitivity, specificity, and time. The experimental results show that the proposed method achieves superiority over the other algorithms according to all parameters. The suggested algorithm results in higher accuracy (99.06%), higher precision (99.12%), higher recall (99.07%), and higher specificity (99.10%) than the comparison algorithms.

Citation: Salama AbdELminaam D, Almansori AM, Taha M, Badr E (2020) A deep facial recognition system using computational intelligent algorithms. PLoS ONE 15(12): e0242269. https://doi.org/10.1371/journal.pone.0242269

Editor: Seyedali Mirjalili, Torrens University Australia, AUSTRALIA

Received: May 28, 2020; Accepted: October 25, 2020; Published: December 3, 2020

Copyright: © 2020 Salama AbdELminaam et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript.

Funding: This study was funded by a grant from DSA Lab, Faculty of Computers and Artificial Intelligence, Benha University to author DSA (28211231302952).

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

The face is considered the most critical part of the human body. Research shows that even a face can speak, and it has different words for different emotions. It plays a crucial role in interacting with people in society. It conveys people's identity and thus can be used as a key for security solutions in many organizations. The facial recognition (FR) system is increasingly trending across the world as an extraordinarily safe and reliable security technology. It is gaining significant importance and attention from thousands of corporate and government organizations because of its high level of security and reliability [ 1 – 3 ].

Moreover, the FR system is providing vast benefits compared to other biometric security solutions such as palmprints and fingerprints. The system captures biometric measurements of a person from a specific distance without interacting with the person. In crime deterrent applications, this system can help many organizations identify a person who has any kind of criminal record or other legal issues. Thus, this technology is becoming essential for numerous residential buildings and corporate organizations. This technique is based on the ability to recognize a human face and then compare the different features of the face with previously recorded faces. This feature also increases the importance of the system and enables it to be widely used across the world. It is developed with user-friendly features and operations that include different nodal points of the face. There are approximately 80 to 90 unique nodal points of a face. From these nodal points, the FR system measures significant aspects including the distance between the eyes, length of the jawline, shape of the cheekbones, and depth of the eyes. These points are measured by creating a code called the faceprint, which represents the identity of the face in the computer database. With the introduction of the latest technology, systems based on 2D graphics are now available on 3D graphics, which makes the system more accurate and increases its reliability.

Biometrics is defined as the science and technology of measuring and statistically analyzing biological data. Biometrics are measurable behavioral and/or physiological characteristics that can be used to verify individual identity. For each individual, a unique biometric can be used for verification. Biometric systems are used in an increasing number of fields such as prison security, secured access, and forensics. Biometric systems recognize individuals by authenticating different biological features such as the face, hand geometry, iris, retina, and fingerprints. The FR system is a more natural biometric information process and must handle more variation than any other method. Thus, FR has become a recent topic in computer science related to biometrics and machine learning [ 4 , 5 ]. Machine learning is a computer science field that gives computers the capability to learn without being explicitly programmed. The main focus of machine learning is providing algorithms that can be trained to perform a task; machine learning is related to the fields of computational statistics and mathematical optimization. Machine learning includes multiple methods such as reinforcement learning, supervised learning, semi-supervised learning, and unsupervised learning [ 6 ]. Machine learning can be used for many tasks that people think only they can do, such as playing games, learning subjects, and recognition [ 6 ]. Most machine learning algorithms consume a massive amount of resources, so it is better to perform their tasks in a distributed environment such as cloud computing, fog computing, or edge computing.

Cloud computing is based on the shareability of many resources including services, applications, storage, servers, and networks to accomplish economies and consistency and thus provide the best concentration to maximize the efficiency of using the shared resources. Fog computing contains many services that are provided on the network edge, such as data storage, computing, data provision, and application services for end users who can be added to the network edge [ 7 ]. These environments would reduce the total amount of resource usage, speed up the completion time of tasks, and reduce costs via pay-per-use.

The main goals of this paper are to build a deep FR system using transfer learning in fog computing. This system is based on modern techniques of deep convolutional neural networks (DCNN) and machine learning. The proposed methods will be able to capture the biometric measurements of a person from a specific distance for crime deterrent purposes without interacting with the person. Thus, the proposed methods can help many organizations identify a person with any kind of criminal record or other legal issues.

The remainder of the paper is organized as follows. Section 2 presents related work in FR techniques and applications. Section 3 presents the components of traditional FR: face processing, deep feature extraction and face matching by in-depth features, machine learning, K-nearest neighbors (KNN), support vector machines (SVM), DCNN, the computing framework, fog computing, and cloud computing. Section 4 explains the proposed FR system using transfer learning in fog computing. Section 5 presents the experimental results. Section 6 provides the conclusion with the outcomes of the proposed system.

2. Literature review

Due to the significant development of machine learning, the computing environment, and recognition systems, many researchers have worked on pattern recognition and identification via different biometrics using various building mining model strategies. Some common recent works on FR systems are surveyed here in brief.

Singh, D et al. [ 8 ] proposed a COVID-19 disease classification model to classify infected patients from chest CT images. A convolutional neural network (CNN) is used to classify COVID-19-infected patients as infected (+ve) or not (−ve). Additionally, the initial parameters of the CNN are tuned using multi-objective differential evolution (MODE). The results show that the proposed CNN model outperforms competitive models, i.e., ANN, ANFIS, and CNN models, in terms of accuracy, F-measure, sensitivity, specificity, and Kappa statistics by 1.9789%, 2.0928%, 1.8262%, 1.6827%, and 1.9276%, respectively.

Schiller, D et al. [ 9 ] proposed a novel transfer learning approach for automatic emotion recognition (AER) across various modalities. The proposed model, used for facial expression recognition, utilizes saliency maps to transfer knowledge from an arbitrary source to a target network by mostly "hiding" non-relevant information. The proposed method is independent of the employed model since the experience is transferred solely via augmentation of the input data. The evaluation showed that the new model was able to adapt to the new domain faster when forced to focus on the parts of the input that were considered relevant.

Prakash, R et al. [ 10 ] proposed an automated face recognition method using a convolutional neural network (CNN) with a transfer learning approach, with weights learned from the pre-trained VGG-16 model. The extracted features are fed as input to a fully connected layer and softmax activation for classification. Two publicly available databases of face images, Yale and AT&T, are used to test the performance of the proposed method. Face recognition accuracy of 100% is achieved for the AT&T face images and 96.5% for the Yale face images. The results show that face recognition using a CNN with transfer learning gives better classification accuracy than the PCA method.

Deng et al. [ 11 ] proposed an additive angular margin loss (ArcFace) for face recognition. The proposed ArcFace has a clear geometric interpretation because of its exact correspondence to geodesic distance on a hypersphere. They also presented an extensive experimental evaluation against state-of-the-art FR methods using ten FR datasets. They showed that ArcFace consistently outperforms the state of the art and can be easily implemented with negligible computational overhead. The verification performance of open-sourced FR models on the LFW, CALFW, and CPLFW datasets reached 99.82%, 95.45%, and 92.08%, respectively [ 11 ].

Wang et al. [ 12 ] proposed a large margin cosine loss (LMCL), reformulating the SoftMax loss as a cosine loss by L2-normalizing both the features and the weight vectors to remove radial variations, and using a cosine margin term to enlarge the decision margin in angular space. They achieved maximal between-class variance and minimal intra-class variance via cosine decision-margin maximization and normalization. They referred to their model, trained with LMCL, as CosFace. They based their experiments on the Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge datasets. They confirmed the efficiency of their proposed approach, achieving 99.33%, 96.1%, 77.11%, and 89.88% accuracy on the LFW, YTF, MF1 Rank1, and MF1 Veri benchmarks, respectively [ 12 ].

Tran et al. [ 13 ] proposed a disentangled representation learning-generative adversarial network (DR-GAN) with three distinct novelties. First, the encoder-decoder structure of the generator allows DR-GAN to learn a representation that is both discriminative and generative, including image synthesis. Second, the representation is disentangled from other face variations, for example, through the pose code provided to the decoder and the pose estimation in the discriminator. Third, DR-GAN can take one or multiple images as input and produce one unified representation along with an arbitrary number of synthesized images. They tested their network using the Multi-PIE database. They compared their strategy with face recognition techniques on Multi-PIE, CFP, and IJB-A and achieved average face verification accuracy with greater than tenfold standard deviation. They accomplished comparable performance on frontal-frontal verification with an improvement of ~1.4% for frontal-profile verification [ 13 ].

Masi et al. [ 14 ] proposed to increase the training data size for face recognition systems through domain-specific data augmentation. They presented techniques to enrich realistic datasets with important facial variations by manipulating the faces in the datasets while matching query images processed by standard convolutional neural networks. They tested their framework against the LFW and IJB-A benchmarks and Janus CS2 on a large number of downloaded images. They followed the standard protocol for unrestricted, labeled outside data and reported a mean classification accuracy of 100% equal error rate [ 14 ].

Ding and Tao [ 15 ] proposed a comprehensive framework based on convolutional neural networks (CNN) to overcome the difficulties faced in video-based face recognition (VFR). The CNN learns blur-robust features by using training data comprising artificially blurred data and still images. They proposed a trunk-branch ensemble CNN model (TBE-CNN) to improve CNN features under pose variations and occlusions. TBE-CNN extracts information from the whole face image and from patches cropped around facial components, sharing the low- and middle-level convolutional layers between the trunk and branch networks. They also proposed an improved triplet loss function to enhance the discriminative power of the representations learned by TBE-CNN. TBE-CNN was tested on three video face databases: YouTube Faces, COX Face, and PaSC [ 15 ].

Al-Waisy et al. [ 16 ] proposed a multimodal deep learning framework that relies on local feature representation for face recognition. They combined the advantages of local handcrafted feature descriptors with a deep belief network (DBN) to address face recognition in unconstrained conditions. They proposed a multimodal local feature extraction approach based on combining the advantages of the fractal dimension with the curvelet transform, which they called the curvelet–fractal approach. The main motivation of this approach is that the curvelet transform can effectively represent the essential facial structure, while the fractal dimension represents the texture descriptors of face images. They also proposed a multimodal deep face recognition (MDFR) approach to add feature representation by training a DBN on top of the local feature representations. They compared the results of the proposed MDFR approach with the curvelet–fractal approach on four face datasets: the LFW, CAS-PEAL-R1, FERET, and SDUMLA-HMT databases. The results obtained by their approaches outperformed other methods, including WPCA, DBN, and LBP, achieving new results on the four datasets [ 16 ].

Sivalingam et al. [ 17 ] proposed an efficient partial face detection method using an AlexNet CNN to detect emotions based on images of half-faces. They identified the key landmarks and concentrated on textural features. They proposed an AlexNet CNN strategy to discriminatively match the two extracted local features, and both the textural and geometrical information of the local features was used for matching. The similarity of two faces was computed according to the distance between the aligned features. They tested their approach on four widely used face datasets and demonstrated the effectiveness and limitations of their proposed method [ 17 ].

Jonnathann et al. [ 18 ] presented a comparison between deep learning and conventional machine learning strategies (for example, artificial neural networks, extreme learning machines, SVM, optimum-path forest, and KNN). For facial biometric recognition, they concentrated on CNNs. They used three datasets: AR Face, YALE, and SDUMLA-HMT [ 19 ]. Further research on FR can be found in [ 20 – 23 ].

3. Material and methods

  • Ethics Statement

All participants provided written informed consent and appropriate, photographic release. The individuals shown in Fig 1 have given written informed consent (as outlined in PLOS consent form) to publish their image.

thumbnail


https://doi.org/10.1371/journal.pone.0242269.g001

3.1 Traditional facial recognition components

The whole system comprises three modules, as shown in Fig 1 .

  • In the beginning, the face detector is utilized on videos or images to detect faces.
  • The prominent feature detector aligns each face to be normalized and recognized with the best match.
  • Finally, the face images are fed into the FR module with the aligned results.

Before an image is input into the FR module, the image is scanned using face anti-spoofing, and then recognition is performed.

The overall recognition process can be formulated as M[F(P_i(I_i)), F(P_j(I_j))],

  • where M indicates the face matching algorithm, which is used to calculate the degree of similarity,
  • F refers to extracting the feature encoded for identity information,
  • P is the face-processing stage of occlusal facial treatment, expressions, highlights, and phenomena; and
  • I i and I j are two faces in the images.

3.1.1 Face processing.

Deep learning approaches are commonly used because of their dominant representation; Ghazi and Ekenel [ 24 ] showed that some conditions, including occlusions, expressions, illuminations, and pose, can affect deep FR performance. One of the main challenges in FR applications is handling this variation; in this paper, we summarize the deep face-processing methods for pose, and similar techniques can address other variations. The face-processing techniques are categorized as "one-to-many augmentation" and "many-to-one normalization" [ 24 ]; a small augmentation sketch is given after the list below.

  • "One-to-many augmentation" : Create many images from a single image with the ability to change the situation, which helps increase the ability of deep networks to work and learn.
  • "Many-to-one normalization" : The canonical view of face images is recovered from nonfrontal-view images, after which FR is performed under controlled conditions.

3.1.2 Deep feature extraction: Network architecture.

The architectures can be categorized as backbone and assembled networks, as shown in Table 1. Inspired by the success of ImageNet [ 25 ], typical CNN architectures such as SENet, ResNet, GoogLeNet, and VGGNet are used as baseline models in FR, either as full or partial implementations [ 26 – 30 ].

thumbnail

https://doi.org/10.1371/journal.pone.0242269.t001

In addition to the mainstream methods, FR is still used as an architecture design to improve efficiency. Additionally, with backbone networks as basic blocks, FR methods can be implemented in assembled networks, possibly with multiple tasks or multiple inputs. Each network is related to one type of input or one type of task. During adoption, higher performance is attained after the results of assembled networks are collected [ 30 ].

Loss Function. SoftMax loss is commonly used as the supervisory signal for object recognition, and it encourages separability of the features. For FR, where intra-class variations may be larger than inter-class variations, SoftMax loss loses its effectiveness. The main families of loss functions are listed below; illustrative formulas follow the list.

  • Euclidean-distance-based loss:

Intravariance compression and intervariance enlargement are based on the Euclidean distance.

  • Angular/cosine-margin-based loss:

Discriminative learning of facial features is performed according to angular similarity, with prominent and potentially large angular/cosine separability between the features learned.

  • SoftMax loss and its variations:

Performance is enhanced by using SoftMax loss or a modification of it.
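For illustration only (these generic forms are standard in the literature and are not taken from this paper), the plain softmax loss and an additive angular-margin variant can be written as

L_softmax = −(1/N) Σ_i log( e^(W_{y_i}ᵀ x_i + b_{y_i}) / Σ_j e^(W_jᵀ x_i + b_j) )

L_margin = −(1/N) Σ_i log( e^(s·cos(θ_{y_i} + m)) / ( e^(s·cos(θ_{y_i} + m)) + Σ_{j≠y_i} e^(s·cos θ_j) ) )

where θ_j is the angle between the normalized feature x_i and the class weight W_j, s is a scale factor, and m is the additive angular margin that enlarges the inter-class gap.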

3.1.3 Face matching by deep features.

After training the deep networks with massive data and an appropriate loss function, each test image is passed through the networks to obtain its deep feature representation. The L2 distance or cosine distance is most commonly used to compute feature similarity; for identification and verification tasks, nearest neighbor (NN) search and threshold comparison are used. Many other methods can be used to process the deep features and compute facial matching with high accuracy, such as the sparse representation-based classifier (SRC) and metric learning. A minimal matching sketch follows.
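A minimal MATLAB sketch of verification by cosine similarity and threshold comparison is shown below; the feature vectors and the threshold value are illustrative placeholders, not values from the paper.

% Minimal sketch: face verification by cosine similarity of deep features.
f1 = randn(4096, 1);                            % placeholder deep feature of face I_i
f2 = randn(4096, 1);                            % placeholder deep feature of face I_j
cosSim = dot(f1, f2) / (norm(f1) * norm(f2));   % cosine similarity in [-1, 1]
threshold = 0.5;                                % assumed operating threshold
isSamePerson = cosSim >= threshold;             % verification decision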

FR is a developed object classification; face-processing methods can also handle variations in poses, expressions, and occlusions. There are many new complicated kinds of FR related to features present in the real world, such as cross pose FR, cross-age FR, and video FR. Sometimes, more realistic datasets are constructed to simulate scenes from reality.

3.2 Machine learning

Machine learning is developed from computational learning theory and pattern recognition. A learning algorithm uses a set of samples called a training set as an input.

In general, there exist two main categories of learning: supervised and unsupervised. The objective of supervised learning is to learn the prediction of the proper output vector for any input vector. Classification tasks are applications in which the target label is a finite number in a discrete category. Defining the unsupervised learning objective is challenging. A primary objective is to find similar samples of sensible clusters identified within input data, called clustering.

3.2.1 K-nearest neighbors.

KNN classifies a new sample by assigning it the most frequent label among its k closest training samples under a chosen distance metric, typically the Euclidean distance d(x, z) = ||x − z||.

KNN must store the entire training set, and this is one of the limitations that makes KNN challenging to apply to large datasets. A minimal usage sketch follows.
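The following minimal MATLAB sketch (fitcknn from the Statistics and Machine Learning Toolbox, with placeholder feature matrices and an assumed k = 5) illustrates how a KNN classifier could be applied to deep face features; it is not the exact configuration used in this paper.

% Minimal sketch: KNN classification of face feature vectors.
Xtrain = randn(200, 4096);                 % placeholder training features (200 faces)
Ytrain = randi(10, 200, 1);                % placeholder subject IDs (10 subjects)
knnModel = fitcknn(Xtrain, Ytrain, 'NumNeighbors', 5, 'Distance', 'euclidean');
Xtest = randn(1, 4096);                    % one query feature vector
predictedID = predict(knnModel, Xtest);    % majority vote among the 5 nearest neighbors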

3.2.2 Support vector machine.

The soft-margin SVM finds a separating hyperplane by solving min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1..n} ξ_i, subject to y_i(wᵀx_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0 for i = 1, …, n.

Although we use the L1 norm for the penalty term Σ_{i=1..n} ξ_i, there exist other penalty terms such as the L2 norm, which should be chosen with respect to the needs of the application. Moreover, the parameter C is a hyperparameter that can be chosen via cross-validation or Bayesian optimization. An important property of the SVM is that the resulting classifier uses only a few training points, known as support vectors, to classify a new data point.

In addition to linear classification, SVMs can perform nonlinear classification by mapping the input variables to a high-dimensional feature space in which a separating hyperplane is found. SVMs can also perform multiclass classification in addition to binary classification [ 34 ].

SVMs are among the best off-the-shelf supervised learning models that are capable of effectively working with high-dimensional datasets and are efficient regarding memory usage due to the employment of support vectors for prediction. SVMs are useful in several real-world systems including protein classification, image classification, and handwritten character recognition.
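A minimal MATLAB sketch of multiclass SVM classification (fitcecoc wrapping binary SVM learners, with placeholder data) is shown below; the linear kernel and the box constraint C = 1 are illustrative assumptions, not the settings used in the experiments.

% Minimal sketch: multiclass SVM on face feature vectors (ECOC scheme over binary SVMs).
Xtrain = randn(200, 4096);                         % placeholder training features
Ytrain = randi(10, 200, 1);                        % placeholder subject IDs
svmTemplate = templateSVM('KernelFunction', 'linear', 'BoxConstraint', 1);  % C = 1 assumed
svmModel = fitcecoc(Xtrain, Ytrain, 'Learners', svmTemplate);
predictedID = predict(svmModel, randn(1, 4096));   % classify one query vector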

3.3 Computing framework

The recognition system has different parts, and the computing framework is one of the essential parts for processing data. The most common computing frameworks are cloud computing and fog computing. An FR application can utilize a framework chosen based on the processing location and the application. In some applications, data must be processed immediately after acquisition; in other applications, instant data processing is not required. Fog computing is a network architecture that supports processing data immediately, close to where they are acquired [ 35 ].

3.3.1 Fog computing.

Fog computing extends cloud computing by relaying and transmitting information from the datacenter to servers at the edge of the network. The fog architecture on edge servers provides networking, storage space, limited computing, data filtering, and logical intelligence close to the data sources, complementing the datacenters. This structure is used in fields such as military and e-health applications [ 36 , 37 ].

3.3.2 Cloud computing.

To obtain accessible data, data are sent to the datacenter for analysis and processing. A significant amount of time and effort is expended to transfer and process data in this type of architecture, indicating that it is not sufficient to work with big data. Big data processing increases the cloud server's CPU usage [ 38 ]. There are various types of cloud computing such as Infrastructure as a Service (IaaS) , Platform as a Service (PaaS) , Software as a Service (SaaS ), and Mobile Backend as a Service (MBaaS ) [ 39 ].

Big data applications such as FR require a method and design that distribute computing to process big data in a fast and repeatable way [ 40 , 41 ]. Data are divided into packages, and each package is assigned to a different computer for processing. A move from the cloud to fog or distributed computing aims at 1) a reduction in network loading, 2) an increase in data processing speed, 3) a decrease in CPU usage, 4) a decrease in energy consumption, and 5) the processing of higher data volumes.

4. Proposed facial recognition system

4.1 Traditional deep convolutional neural networks


Krizhevsky et al. [ 28 ] developed AlexNet for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [ 34 ]. The first layer of AlexNet is used to filter the input image. The input image has a height (H), width (W), and depth (D) of 227×227×3; D = 3 to account for the colors red, green, and blue. The first convolutional layer filters the input color image; it has 96 kernels (K) with an 11×11×3 filter (F) and a four-pixel stride (S). The stride is the distance between the receptive field centers of neighboring neurons in the kernel map. The formula ((W−F+2P)/S)+1 is employed to compute the output size of the convolutional layer, where P refers to the number of padded pixels, which can be as low as zero. The first convolutional layer output size is therefore ((227−11+0)/4)+1 = 55. The input of the second convolutional layer has a size of 55×55×(number of filters), and the number of filters in this layer is 256. Since the work of these layers is distributed over 2 GPUs, the load of each layer is divided by 2. The next layer is the convolutional layer, followed by the pooling layer, in which each feature map is reduced in dimensionality while important features are retained. The pooling method can be max, sum, average, etc.; a max-pooling layer is employed in AlexNet. A total of 256 filters are the input of this layer. Each filter has a size of 5×5×256 with a stride of two pixels. When two GPUs are used, the work is divided into 55/2×55/2×256/2 ≈ 27×27×128 inputs for each GPU. The normalized output of the second convolutional layer is connected to the third layer, which has 384 kernels of size 3×3. The fourth convolutional layer also has 384 kernels of size 3×3, divided over 2 GPUs, so the load of each GPU is 3×3×192. The fifth convolutional layer has 256 kernels of size 3×3, divided over 2 GPUs, so each GPU has a load of 3×3×128. The last three convolutional layers are created without pooling or normalization layers. The outputs of these three layers are delivered as the input to two fully connected layers, each with 4096 neurons. Fig 2 illustrates the architecture used in AlexNet to classify different classes with ImageNet as a training dataset [ 34 ]. DCNNs can learn features hierarchically, and a DCNN increases the image classification accuracy, especially with large datasets [ 42 ]. Since the implementation of a DCNN requires a large number of images to attain high classification rates, an insufficient number of color images among the subjects' identification images creates an extra challenge for recognition systems [ 35 , 36 ]. A DCNN consists of neural networks with convolutional layers that perform feature extraction and classification on images [ 37 ]. The difference between the information used for testing and the original data used to train the DCNN is minimized by using a training set with different sizes or scales but the same features; such features are extracted and classified well using a deep network [ 43 ]. Therefore, the DCNN will be utilized in the recognition and classification tasks. The AlexNet architecture is shown in Fig 2, and a small check of the layer-size arithmetic is given below.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g002
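The layer-size arithmetic above can be checked with a small helper; the sketch below simply evaluates ((W−F+2P)/S)+1 for the first convolutional layer and is independent of any toolbox.

% Minimal sketch: spatial output size of a convolutional layer, ((W-F+2P)/S)+1.
convOutputSize = @(W, F, P, S) floor((W - F + 2*P) / S) + 1;
firstLayerOut = convOutputSize(227, 11, 0, 4);   % returns 55, matching the text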

4.2 Fundamentals of transfer learning

The core idea of transfer learning (TL) is shown in Fig 3. The approach utilizes a relatively complex and successful pretrained model, trained on an enormous data source, e.g., ImageNet, which is a large visual database developed for visual object recognition research [ 41 ]. It contains over 14,000,000 manually annotated pictures, and one million pictures are furnished with bounding boxes. ImageNet contains in excess of 20,000 categories [ 11 ]. Ordinarily, pretrained models are trained on a subset of ImageNet with 1,000 classes. The learned knowledge is then "transferred" to a relatively simplified target task (e.g., classifying alcohol abuse versus non-abuse) that has only a limited quantity of private data. Two attributes are important to support the transfer [ 44 ]: i. the success of the pretrained model avoids user intervention in the exhausting hyperparameter tuning of new tasks; ii. the early layers of pretrained models can be treated as feature extractors that capture low-level features, for example, edges, tints, shades, and textures. Conventional TL retrains the new layers [ 13 ]: first, the pretrained model is loaded, and then the entire structure of the neural network is retrained. Critically, the global learning rate is fixed; the transferred layers are given a low learning-rate factor, while newly added layers are given a high factor.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g003

4.3 Adaptive deep convolutional neural networks (the proposed face recognition system)

The proposed system consists of three essential stages, including

  • preprocessing,
  • feature extraction
  • recognition, and identification.

In preprocessing , the framework begins by capturing images that must contain a human face as the subject of interest.

This image is passed to the face detector module. The face detector works on detecting the human face and segmenting it as the region of interest (ROI). The obtained ROI then continues through the preprocessing steps: it is resized to the predefined size for alignment purposes.

In feature extraction , the preprocessed ROI is handled to extract a feature vector using the modified version of AlexNet. The extracted vector represents the significant details of the associated image.

Finally, recognition and identification determine to which enrolled subject in the system's database a feature vector belongs. Each new feature vector represents either a new subject or an already registered subject. For the feature vector of an already registered subject, the system recognizes the associated ID; for the feature vector of a newly registered subject, the system adds a new record to the connected database.

Fig 4 illustrates the general overall view of the proposed face recognition system.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g004

The system performs the following steps on the face images to obtain the distinctive features of each face:

All participants provided written informed consent and appropriate, photographic release. The individuals shown in Fig 5 have given written informed consent (as outlined in PLOS consent form) to publish their image.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g005

In the preprocessing step, as shown in Fig 5 , the system begins by ensuring that the input image is an RGB image and aligning it to a uniform size. Then, the face detection step is performed. This step uses a well-known face detection mechanism, the Viola-Jones detection approach. The popularity of Viola-Jones detection stems from its ability to work well in real time and to achieve high accuracy. To detect faces in a specific image, this face detector uses detection windows of different sizes to scan the input image.

In this phase, the decision of whether a window contains a face is made. Haar-like filters are used to derive simple local features that are applied to face window candidates. In Haar-like filters, the feature values are obtained easily by finding the difference between the total light intensities of the pixels. Then, the region of interest is segmented by cropping and resizing the face image to 227×227, as shown in Fig 6 ; a minimal detection-and-cropping sketch is given below.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g006

All participants provided written informed consent and appropriate, photographic release. The individuals shown in Fig 6 have given written informed consent (as outlined in PLOS consent form) to publish their image.
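A minimal MATLAB sketch of this preprocessing step (Viola-Jones detection via vision.CascadeObjectDetector from the Computer Vision Toolbox, followed by cropping and resizing to 227×227) is given below; the input file name is a hypothetical placeholder.

% Minimal sketch: Viola-Jones face detection, cropping, and resizing to 227x227.
img = imread('subject_face.jpg');               % hypothetical input image
detector = vision.CascadeObjectDetector();      % default model detects frontal faces
bboxes = detector(img);                         % each row: [x y width height]
if ~isempty(bboxes)
    face = imcrop(img, bboxes(1, :));           % keep the first detected face as the ROI
    face = imresize(face, [227 227]);           % match the AlexNet input size
end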

  • 2. Feature Extraction Using the Pre-trained AlexNet

The accessible dataset size is inadequate to train a new deep model from scratch, and in any case, this is not feasible given the large number of images required. To maintain objectivity in this test, we applied the transfer learning approach to the pretrained architecture of AlexNet in three distinct ways. First, we needed to alter the structure. The last fully connected layer (FCL) was updated, since the original FCLs were created to perform 1,000-way classification. Twenty arbitrarily chosen classes were recorded: the scale, hairdresser chair, lorikeet, small poodle, Maltese dog, dark-striped cat, beer bottle, work station, necktie, trombone, protective crash helmet, cucumber, letterbox, pomegranate, Appenzeller, gag, snow panther, mountain bike, lock, and Diamondback. We observed that none of them were related to the face recognition task. Thus, we could not directly apply AlexNet as the feature extractor; consequently, fine-tuning was essential. Since the number of output neurons (1,000) in conventional AlexNet is not equal to the number of classes in our task (2), we needed to alter the corresponding softmax layer and classification layer, as indicated in Fig 7 .

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g007

In our transfer learning scheme, we used a new randomly initialized fully connected layer whose size equals the number of available subjects in the utilized dataset(s), a softmax layer, and a new classification layer with the same number of candidates. Fig 8 shows various available activation functions; we used softmax, since we have multiple outputs and the decision depends on the maximum score among them. Next, we set the training options. Three properties were checked before training. First, the overall number of training epochs should be small for transfer learning; we initially set it to 6. Second, the global learning rate was set to a small value of 10−4 to slow learning down, since the early layers of this network were pretrained. Third, the learning rate of the new layers was several times that of the transferred layers, since the transferred layers have pretrained weights while the newly added layers have randomly initialized weights. We also varied the number of transferred layers and tried various settings. AlexNet comprises five convolutional layers (CL1, CL2, CL3, CL4, and CL5) and three fully connected layers (FCL6, FCL7, and FCL8).

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g008

The pseudocode of the proposed algorithm is shown in Algorithm 1. It starts from the original AlexNet architecture and the image dataset of the subjects enrolled in the recognition system. For each image in the dataset, the subject's face is detected using Viola-Jones detection. The resulting face dataset is used for transfer learning: we adapt the architecture of AlexNet and then train the altered architecture on the face dataset. The trained model is used for feature extraction.

We update the corresponding softmax layer and classification layer as indicated in the pseudocode of the proposed algorithm (Algorithm 1); a possible MATLAB realization is sketched after the listing.

Algorithm 1: Transfer Learning using AlexNet model

Input ← original AlexNet Net , ImageFaceSet imds

Output ← modified trained AlexNet FNet , features FSet

1.     Begin

2.         // Preprocessing Face image(s) in imds

3.         For i = 1: length(imds)

4.            img ← read(imds,i)

5.             face ← detectFace(img)

6.             img ← resize(face,[227, 227])

7.          save(imds,I,img)

8.         End for

9.         // Adapt AlexNet Structure

10.        FLayers ← Net.Layers(1:END-3)

11.         FLayers.append(new FullyConnected layer)

12.         FLayers.append(new SoftMax layer)

13.        FLayers.append(new Classification layer)

14.         // Train FNet using options

15.         Options.set(SolverOptimizer ← stochastic gradient descent with momentum)

16.         Options.set(InitialLearnRate ←1e-3)

17.         Options.set(LearnRateSchedule ← Piecewise)

18.         Options.set(MiniBatchSize ←32)

19.         Options.set(MaxEpochs ←6)

20.         FNet ← trainNetwork(FLayers, imds, Options)

21.         //Use FNet to extract features

22.        FSet ← empty

23.         For j = 1: length(imds)

24.            img ← read(imds,j)

25.             F ← extract(FNet, img, ‘FC7’)

26.             FSet ← FSet U F

27.     End for
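A possible MATLAB realization of Algorithm 1 with the Deep Learning Toolbox is sketched below. It follows the steps of the pseudocode, but the folder name, the learning-rate factors, and the feature layer name ('fc7') are illustrative assumptions rather than the authors' exact settings.

% Minimal sketch: transfer learning from pretrained AlexNet (Algorithm 1).
imds = imageDatastore('face_dataset', 'IncludeSubfolders', true, 'LabelSource', 'foldernames');
numSubjects = numel(categories(imds.Labels));

net = alexnet;                                   % pretrained AlexNet (support package required)
layersTransfer = net.Layers(1:end-3);            % drop the last FC, softmax, and classification layers
layers = [
    layersTransfer
    fullyConnectedLayer(numSubjects, 'WeightLearnRateFactor', 10, 'BiasLearnRateFactor', 10)
    softmaxLayer
    classificationLayer];

options = trainingOptions('sgdm', ...
    'InitialLearnRate', 1e-3, ...
    'LearnRateSchedule', 'piecewise', ...
    'MiniBatchSize', 32, ...
    'MaxEpochs', 6);

augImds = augmentedImageDatastore([227 227 3], imds);    % resize faces to the input size
FNet = trainNetwork(augImds, layers, options);           % fine-tune on the face dataset

% Extract features for every image, as in the final loop of the algorithm.
FSet = activations(FNet, augImds, 'fc7', 'OutputAs', 'rows');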

  • 3. Face recognition Phase using Fog and Cloud Computing:

Fig 9 shows the fog computing face recognition framework. Fog systems comprise client devices, cloud nodes/servers, and distributed computing environments. The general differences from the conventional distributed computing process are as follows:

  • A distributed computing community oversees and controls numerous cloud nodes/servers.
  • Fog nodes/servers situated at the edge of the system between the system community and the client have a specific procurement device that can perform preprocessing and highlight extraction tasks and can communicate biometric data securely with the client devices and cloud node.
  • User devices are heterogeneous and include advanced mobile phones, personal computers (PCs), hubs, and other networkable terminals.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g009

There are multiple purposes behind the communication plan.

  • From the viewpoint of recognition efficiency, if FR information is sent to a node, the system communication cost will increase, since all information must be sent to and prepared by the cloud server. Additionally, the calculation load on the cloud server will increase.
  • From the point of view of recognition security, the cloud community, as the focal hub of the whole system, will become a target for attacks. If the focal hub is breached, information acquired from the fog nodes/servers becomes vulnerable.
  • Face recognition datasets are required for training if a neural system is utilized for recognition. Preparing datasets is normally time consuming and will greatly increase the training time if the training is carried out only by the nodes, risking the training quality.

Since the connection between a fog node and client devices is very inconsistent, we propose a general engineering system for cloud-based face recognition frameworks. This plan exploits the processing ability and capacity limit of fog nodes/servers and cloud servers.

The design incorporates preprocessing, including extraction, face recognition, and recognition-based security. The plan is partitioned into 6 layers as indicated by the information stream of fog architecture shown in Fig 10 :

  • User equipment layer : The FC/MEC client devices are heterogeneous, including PCs and smart terminals. These devices may use various fog nodes/servers through various conventions.
  • Network layer : This connects administration through various fog architecture protocols. It is able to obtain information transmitted from the system and client device layer and to compress and transmit the information.
  • Data processing layer : The essential task of this layer is to preprocess image(s) sent from client hardware, including information cleaning, filtering, and preprocessing. The task of this layer is performed on cloud nodes.
  • Extraction layer : After the image(s) are preprocessed, the extraction layer utilizes the related AlexNet to remove the highlights.
  • Analysis layer : This layer communicates through the cloud. Its primary task is to cluster the removed element vectors that were found by fog nodes/servers. It can coordinate data among registered clients and produces responses to requests.
  • Management layer : The management in the cloud server is, for the most part, responsible for(1) the choices and responses of the face recognition framework and (2) the information and logs of the fog nodes/servers that can be stored to facilitate recognition and authentication.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g010

All participants provided written informed consent and appropriate, photographic release. The individuals shown in Fig 11 , Fig 12 have given written informed consent (as outlined in PLOS consent form) to publish their image.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g011

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g012

As shown in Fig 11 , the recognition classifier of the Analysis layer is the most significant piece of the framework for data preparation. It is identified with the resulting cloud server response to guarantee the legitimacy of the framework. Relatedly, our work centres around recognition and authentication. Classifiers on fog nodes/servers can utilize their calculation ability and capacity limit for recognition. In any case, much of the scope information cannot be handled or stored because of the restricted calculation and capacity of fog nodes/servers. Moreover, as mentioned, sending classifiers on fog nodes/servers cannot meet the needs of an individual system. The cloud server has a greater storage capacity than fog nodes/servers; therefore, the cloud server can store many training sets and process these sets. It can send training sets to fog nodes/servers progressively for training with the goal that different fog nodes/servers receive appropriate sets.

Fig 12 shows Face images of SDUMLA-HMT subjects under different conditions as a dataset example.

5. Experimental results

In this section, we provide the results we obtained in the experiments. Some of these results will be presented as graphs, which present the relation between the performance and some of the parameters previously mentioned.

5.1 Runtime environment

The proposed recognition system was implemented and developed using MatlabR2018a on a PC with an Intel Core i7 CPU running at 2.2 GHz and Windows 10 Professional 64-bit edition. The proposed system is based on the dataset SDUMLA-HMT, which is available online for free.

5.2 Dataset(s)

SDUMLA-HMT is a publicly available database that has been used to evaluate the proposed system. The SDUMLA-HMT database was collected in 2010 by Shandong University, Jinan, China. It consists of five subdatabases—face, iris, finger vein, fingerprint, and gait—and contains 106 subjects (61 males and 45 females) with ages ranging between 17 and 31 years. In this work, we have used the face and iris databases only [ 19 ].

The face database was built using seven digital cameras. Each camera was used to capture the face of every subject with different poses (three images), different expressions (four images), different accessories (one image with a hat and one image with glasses), and under different illumination conditions (three images). The face database consists of 106×7×(3+4+2+3) = 8,904 images. All face images are 640×480 pixels and are stored in BMP format. Some face images of subject number 69 under different conditions are shown in Fig 12 [ 19 ].
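Assuming the face images are stored with one folder per subject (a hypothetical layout, not necessarily how SDUMLA-HMT is distributed), a small MATLAB sketch for loading them and creating a per-subject train/test split could look like this:

% Minimal sketch: load the face database and split it per subject.
imds = imageDatastore('SDUMLA_HMT/face', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');    % 106 subject folders assumed
[imdsTrain, imdsTest] = splitEachLabel(imds, 0.8, 'randomized'); % 80/20 split per subject (assumed ratio)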

5.3 Performance measure

Researchers have recently focused on enhancing face recognition systems in terms of accuracy metrics, often regardless of the latest technologies and computing environments. Today, cloud computing and fog computing are available to enhance the performance of face recognition and decrease time complexity; the proposed framework handles and carefully considers these issues. The classifier performance evaluator carries out various performance measures and classifies the FR outcomes as true positive (TP), false negative (FN), false positive (FP), and true negative (TN). Precision is the most interesting and sensitive measure that can be used in a wide-range comparison of the essential individual classifiers and the proposed system.

The measures are computed from these counts as Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall (sensitivity) = TP / (TP + FN), and Specificity = TN / (TN + FP); a small computation sketch follows the definitions below.

  • True Negative (TN): These are the negative tuples that were correctly labeled by the classifier.
  • True Positive (TP): These are the positive tuples that were correctly labeled by the classifier.
  • False Positive (FP): These are the negative tuples that were incorrectly labeled as positive.
  • False Negative (FN): These are the positive tuples that were mislabeled as negative.
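The formulas above can be computed directly from the counts; the MATLAB sketch below does so for a binary (one-vs-rest, per-subject) view, using placeholder label vectors rather than the paper's actual predictions.

% Minimal sketch: accuracy, precision, recall, and specificity from TP/TN/FP/FN.
yTrue = [1 1 0 1 0 0 1 0];         % placeholder ground truth (1 = target subject)
yPred = [1 0 0 1 0 1 1 0];         % placeholder classifier decisions
TP = sum(yPred == 1 & yTrue == 1);
TN = sum(yPred == 0 & yTrue == 0);
FP = sum(yPred == 1 & yTrue == 0);
FN = sum(yPred == 0 & yTrue == 1);
accuracy    = (TP + TN) / (TP + TN + FP + FN);
precision   = TP / (TP + FP);
recall      = TP / (TP + FN);      % also called sensitivity
specificity = TN / (TN + FP);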

5.4 Results & discussion

A set of experiments were performed to evaluate the proposed system in terms of the evaluation criteria. All experiments start by loading the color images from the data source, then passing them to the segmentation step. According to the pretrained AlexNet, the input image size cannot exceed 227×227, and the image depth limit is 3. Therefore, after segmentation, we performed a check step to guarantee the appropriateness of the image size. A resizing process to 227×227×3 for width, height, and depth is imperative if the size of the image exceeds the size limit. And the main parameters and ratios are represented in Table 2 .

thumbnail

https://doi.org/10.1371/journal.pone.0242269.t002

  • The experimental outcomes of the developed FR system and its comparison with various other techniques are presented in the scenario. It has been noted that the outcomes of the proposed algorithm outperformed most of its peers, especially in terms of precision.

5.4.1 Recognition time results

Fig 13 shows the comparison of the four algorithms: decision tree (DT), the KNN classifier, SVM, and the proposed DCNN powered by the pre-trained AlexNet classifier. Two parameters, observations/sec and recognition time in seconds per observation, are used for the comparison.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g013

  • The results show that the proposed DCNN has superiority over other machine learning algorithms according to observation/sec and recognition time

5.4.2 Precision results.

Fig 14 shows the precision of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g014

  • The results show that the proposed DCNN has superiority over the other machine learning algorithms in terms of precision for the 2nd and 3rd datasets, and it obtains, together with SVM, the best results for the 1st dataset.

5.4.3 Recall results.

Fig 15 shows the recall of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g015

  • The results show that the proposed DCNN has superiority over other machine learning algorithms, according to Recall parameters.

5.4.4 Accuracy results

Fig 16 displays the accuracy of our proposed system and the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g016

  • The results show that the proposed DCNN has superiority over other machine learning algorithms, according to Accuracy parameters.

5.4.5 Specificity results.

Fig 17 displays the specificity of our proposed system compared with the other algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g017

Table 3 shows the average results for precision, recall, accuracy, and specificity of the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.t003

Fig 18 displays the data documented in Table 3, representing the average results for precision, recall, accuracy, and specificity of our proposed system and the four algorithms using the three datasets SDUMLA-HMT, 113, and CASIA.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g018

Table 4 shows the comparison of the three algorithms and the algorithms developed by Jonnathann et al. [ 18 ] using the same dataset. Table 4 compares the accuracy rates of the developed classifiers versus the same classifiers developed by Jonnathann et al. [ 18 ], without considering the feature extraction methods.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.t004

Fig 19 shows the data documented in Table 4. It is noticeable that the proposed classifiers achieve the highest accuracy using KNN, SVM, and DCNN.

thumbnail

https://doi.org/10.1371/journal.pone.0242269.g019

6. Conclusion

FR is a more natural biometric information process than other proposed systems, and it must address more variation than any other method. It is one of the most famous combinatorial optimization problems, and solving it in a reasonable time requires an efficient optimization method. FR may face many difficulties and challenges in terms of the input image, such as different facial expressions, subjects wearing hats or glasses, and varying brightness levels. This study is based on an adaptive version of the most recent DCNN architecture, called AlexNet. This paper proposed a deep FR learning method using TL in fog computing. The proposed DCNN algorithm is based on a set of steps to process the face images to obtain the distinctive features of the face. These steps are divided into preprocessing, face detection, and feature extraction. The proposed method improves the solution by adjusting the parameters to search for the final optimal solution. In this study, the proposed algorithm and other popular machine learning algorithms, including the DT, KNN, and SVM algorithms, were tested on three standard benchmark datasets to demonstrate the efficiency and effectiveness of the proposed DCNN in solving the FR problem. These datasets were characterized by various numbers of images, including males and females. The proposed algorithm and the other algorithms were tested on different images in the first dataset, and the results demonstrated the effectiveness of the DCNN algorithm in terms of achieving the optimal solution (i.e., the best accuracy) with reasonable accuracy, recall, precision, and specificity compared to the other algorithms. At the same time, the proposed DCNN achieved the best accuracy compared with Jonnathann et al. [ 18 ]: the accuracy of the proposed method reached 99.4%, compared with 97.26% by Jonnathann et al. [ 18 ]. The suggested algorithm results in higher accuracy (99.06%), higher precision (99.12%), higher recall (99.07%), and higher specificity (99.10%) than the comparison algorithms.

Based on the experimental results and performance analysis on various test images (i.e., 30 images), the proposed algorithm can effectively locate an optimal solution within a reasonable time compared with other popular algorithms. In future work, we plan to improve this algorithm in two ways: first, by comparing it with different recent metaheuristic algorithms and testing it on the remaining instances of each dataset; second, by applying it to real-life FR problems in a specific domain.

  • 7. Gamaleldin AM. An introduction to cloud computing concepts. Egypt: Software Engineering Competence Center; 2013. https://doi.org/10.1016/j.aju.2012.12.001 pmid:26579251
  • 10. Prakash, R. Meena, N. Thenmoezhi, and M. Gayathri. "Face Recognition with Convolutional Neural Network and Transfer Learning." In 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), pp. 861–864. IEEE, 2019.
  • 11. Deng J, Guo J, Xue N, Zafeiriou S, ArcFace: Additive angular margin loss for deep face recognition. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). Long Beach, CA: IEEE; 2019. pp. 4685–4694.
  • 12. Wang H, Wang Y, Zhou Z, Ji X, Gong D, Zhou J, et al., CosFace: Large margin cosine loss for deep face recognition. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. Salt Lake City, UT: IEEE; 2018. pp. 5265–5274.
  • 13. Tran L, Yin X, Liu X, Disentangled representation learning GAN for pose-invariant face recognition. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). Honolulu, HI: IEEE; 2017. pp. 1415–1424.
  • 14. Masi I, Tran AT, Hassner T, Leksut JT, Medioni G. Do we really need to collect millions of faces for effective face recognition? In: Leibe B, Matas J, Sebe N, Welling M, editors. European conference on computer vision (ECCV). Cham, Switzerland: Springer; 2016. pp. 579–596.
  • 19. Yin Y, Liu L, Sun X, SDUMLA-HMT: A multimodal biometric database. In: Chinese conference on biometric recognition. Beijing, China: Springer; 2011. pp. 260–268.
  • 24. Ghazi MM, Ekenel HK, A comprehensive analysis of deep learning based representation for face recognition. In: 2016 IEEE conference on computer vision and pattern recognition workshops (CVPRW). Las Vegas, NV: IEEE; 2016. pp. 102–109.
  • 26. He K, Zhang X, Ren S, Sun J, Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). Las Vegas, NV: IEEE; 2016. pp. 770–778.
  • 27. Hu J, Shen L, Sun G, Squeeze-and-excitation networks. In: 2018 IEEE/CVF conference on computer vision and pattern recognition. Salt Lake City, UT: IEEE; 2018. pp. 7132–7141.
  • 28. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ, editors. Advances in neural information processing systems. Nevada, USA: Curran Associates Inc.; 2012. pp. 1097–1105.
  • 29. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:14091556. 2014.
  • 30. Szegedy C, Wei L, Yangqing J, Sermanet P, Reed S, Anguelov D, et al., Going deeper with convolutions. In: 2015 IEEE conference on computer vision and pattern recognition (CVPR). Boston, MA: IEEE; 2015. pp. 1–9.
  • 32. Guyon I, Boser BE, Vapnik V. Automatic capacity tuning of very large VC-dimension classifiers. In: Hanson SJ, Cowan JD, Giles CL, editors. Advances in neural information processing systems. San Mateo, CA: Morgan Kaufmann Publishers Inc.; 1993. pp. 147–155.
  • 33. Schölkopf B, Smola AJ. Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press; 2002. https://doi.org/10.1074/mcp.m200054-mcp200 pmid:12488466
  • 34. Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press; 2000.
  • 40. Nasr-Esfahani E, Samavi S, Karimi N, Soroushmehr SMR, Jafari MH, Ward K, et al., Melanoma detection by analysis of clinical images using convolutional neural network. In: 2016 38th annual international conference of the IEEE engineering in medicine and biology society (EMBC). Orlando, FL: IEEE; 2016. pp. 1373–1376.
  • 41. Pham TC, Luong CM, Visani M, Hoang VD. Deep CNN and data augmentation for skin lesion classification. In: Nguyen NT, Hoang DH, Hong TP, Pham H, Trawiński B, editors. Asian conference on intelligent information and database systems. Dong Hoi City, Vietnam: Springer; 2018. pp. 573–582.
  • 42. Deng J, Dong W, Socher R, Li L, Li K, Li FF, ImageNet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. Miami, FL: IEEE; 2009. pp. 248–255.
  • 44. Abdul Elminaam DS, Ibrahim SA. Building a robust heart diseases diagnose intelligent model based on RST using LEM2 and MODLEM2. In: Proceedings of the 32nd International Business Information Management Association Conference (IBIMA 2018)—Vision 2020: Sustainable Economic Development and Application of Innovation Management from Regional Expansion to Global Growth. Seville, Spain; 15–16 November 2018. pp. 5733–5744.

A real-time face detection based on skin detection and geometry features

  • Research Article
  • Published: 20 June 2024

Cite this article


  • Weijing Xu 1 &
  • Di Wang 1  


In recent decades, researchers have been interested in face detection because of its applications in computer vision and pattern recognition, which are commercially and academically valuable. Approaches to face detection include feature-based, appearance-based, template-based, and knowledge-based methods. The main challenges in face detection are complex backgrounds and varying poses, and most proposed methods have focused on these two challenges; varying lighting conditions are a further challenge. The proposed method is a color-based method that uses skin color features, which have proven effective for fast classification in face detection. Among the available color spaces, the proposed method uses YCbCr because it gives the best results under proper lighting conditions, RGB to increase accuracy and remove confusing objects, and HSV for images with unsuitable lighting conditions. Morphological operations are used to increase speed and accuracy, and geometric features such as holes, width, and height are used to determine whether a region is a face. Results show that the precision of the proposed method is 92.4%, 91%, and 93% on the Bao, IMM, and Aberdeen databases, respectively; the recall of the proposed method on the Champion dataset is 94.27%.
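To make the color-space pipeline concrete, the following is a minimal sketch (not the authors' implementation) of a YCbCr skin mask refined with morphological opening/closing and a simple geometric filter, using OpenCV. The threshold values, minimum blob area, and aspect-ratio bounds are common textbook choices, not the parameters tuned in the paper.

```python
import cv2
import numpy as np

def skin_mask(bgr_image):
    """Rough skin segmentation in YCbCr, cleaned with morphology.

    The Cb/Cr bounds below are widely used textbook values
    (77 <= Cb <= 127, 133 <= Cr <= 173), not the thresholds tuned in the paper.
    """
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)    # (Y, Cr, Cb) lower bounds
    upper = np.array([255, 173, 127], dtype=np.uint8)  # (Y, Cr, Cb) upper bounds
    mask = cv2.inRange(ycrcb, lower, upper)

    # Morphological opening removes speckle; closing fills small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

def candidate_faces(mask, min_area=400, min_aspect=0.7, max_aspect=1.8):
    """Keep connected skin regions whose height/width ratio looks face-like."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h < min_area:          # discard tiny blobs
            continue
        aspect = h / float(w)
        if min_aspect <= aspect <= max_aspect:
            boxes.append((x, y, w, h))
    return boxes
```

In a complete detector, further cues from the RGB and HSV spaces and the hole count inside each region would be applied before accepting a candidate as a face, as the abstract describes.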



Data availability.

Data generated and analyzed during the current study are available from the corresponding author upon request.


Acknowledgements

This work was supported by the 2022 Basic Scientific Research project of Shenyang Urban Construction University, funded by the Education Department of Liaoning Province: “Research on the Application of Blockchain Technology Empowering the Innovative Development of Urban Intelligent Transportation Systems under the 5G Background” (No. LJKMZ20221924).

Author information

Authors and affiliations.

School of Information and Control Engineering, Shenyang Urban Construction University, Shenyang, 110167, China

Weijing Xu & Di Wang


Corresponding author

Correspondence to Weijing Xu .

Ethics declarations

Conflict of interest.

The authors declare that there is no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Xu, W., Wang, D. A real-time face detection based on skin detection and geometry features. J Opt (2024). https://doi.org/10.1007/s12596-024-01949-0

Download citation

Received : 30 January 2024

Accepted : 28 May 2024

Published : 20 June 2024

DOI : https://doi.org/10.1007/s12596-024-01949-0


  • Face detection
  • Skin detection
  • Face geometry features


  • Open access
  • Published: 23 June 2024

Image-based facial emotion recognition using convolutional neural network on emognition dataset

  • Erlangga Satrio Agung 1 ,
  • Achmad Pratama Rifai 1 &
  • Titis Wijayanto 1  

Scientific Reports volume 14, Article number: 14429 (2024)


  • Computer science
  • Human behaviour
  • Information technology

Detecting emotions from facial images is difficult because facial expressions can vary significantly. Previous research on using deep learning models to classify emotions from facial images has been carried out on various datasets that contain a limited range of expressions. This study expands the use of deep learning for facial emotion recognition (FER) based on the Emognition dataset, which includes ten target emotions: amusement, awe, enthusiasm, liking, surprise, anger, disgust, fear, sadness, and neutral. A series of preprocessing steps was carried out to convert the video data into images and augment the data. This study proposes convolutional neural network (CNN) models built through two approaches: transfer learning (fine-tuning) with the pre-trained Inception-V3 and MobileNet-V2 models, and building a network from scratch using the Taguchi method to find a robust combination of hyperparameter settings. The proposed model demonstrated favorable performance over a series of experiments, with an accuracy of 96% and an average F1-score of 0.95 on the test data.


Introduction.

Humans use emotions to express their feelings to others and as a communication tool to convey information. Emotions reflect human mood in the form of a psychophysiological condition. Emotions result from human interactions and internal or external factors 1 . Dynamic changes in emotion play an important role in human life because they directly affect most of the daily activities and habits carried out by humans 2 . Emotions are the dominant driver of decisions made by humans 3 . Positive emotions result in the formation of good communication and increase human productivity. Meanwhile, negative emotions can harm both mental and physical conditions. Therefore, it is important to continue developing automated systems based on human emotions 4 .

Humans can express emotions through hands, voice, gestures, and facial expressions, with 55% of emotions conveyed through facial expressions 5 . The human face displays information cues that are relevant to provide expression of an emotional state or behavior. Facial expressions play an important role in human communication, as they help us understand the intentions of others. Hence, facial recognition emerges as an important domain in understanding human emotion. Among various facial recognition techniques, facial emotion recognition (FER) has seen substantial advancement 6 . Using machine learning, FER can help humans distinguish emotions through facial expressions by analyzing images or video data to obtain information about emotional states 7 , which is important for social interaction because it can help humans understand the feelings and intentions of others. FER is commonly used in various fields such as health, education, labor, robotics, communication, psychology, and others 8 .

Recent advancements in FER-based automation systems can be classified into two main parts of feature generation: conventional extraction and automatic extraction via deep neural networks 9 . While the conventional approach holds an advantage in computation time and is commonly used for real-time classification problems 10 , this approach lacks flexibility as it requires predefined feature extraction and classifiers 11 . As such, deeper knowledge of the feature extraction and classifier is required to fetch meaningful and good features for models’ input without discarding important information. This issue can be an obstacle to developing and implementing the detection models. On the other hand, an automated approach employing deep learning algorithms, such as the convolutional neural network (CNN), reduces or eliminates dependencies from other models and/or existing preprocessing techniques by carrying out end-to-end learning directly from input data 12 . However, CNNs require extensive data to obtain a higher level of classification 13 . Researchers worldwide have done much research by building the CNN model to solve the FER problem. Researchers have used various types of image data-based datasets as input data for the constructed models, such as Facial Emotion Recognition 2013 (FER 2013) and Extended Cohn-Kanade Dataset (CK +). However, the entire dataset still focuses on the 7 basic human emotions, so further development is needed to solve the FER problem with a more varied number of emotion classes.

This study is motivated by the need to address the limitations of the existing FER systems that predominantly recognize a limited set of basic emotions. Utilizing the Emognition dataset, which encompasses ten distinct emotions: neutral, amusement, anger, awe, disgust, enthusiasm, fear, liking, sadness, and surprise 14 , this study aims to develop FER models capable of handling a wider spectrum of human emotions. The Emognition dataset not only includes common emotions but also introduces four new emotion classes: awe, enthusiasm, amusement, and liking, providing a richer foundation for enhancing FER applications in areas such as mobile application development, education, product marketing, and tourism management. For example, the inclusion of the amusement emotion improves interaction with entertainment devices like games and movies 15 . Recognizing enthusiasm can help educators enhance learning environments and manage student engagement effectively 16 . In other fields, the emotion of enthusiasm can also be used to determine the suitability of the workload given to a worker, the awe emotion can significantly increase consumer willingness to share 17 , and liking emotion also has a role in shaping consumer preferences and brand affinity in marketing. Despite its potential, using the Emognition dataset in FER models is relatively unexplored, representing a significant gap in current research that this study seeks to address.

This study also explores whether CNNs, trained on the Emognition dataset 14 , can more effectively classify a more diverse range of emotions. This involves assessing the benefits of processing image data extracted from video sequences compared to direct video inputs, which could potentially allow for selecting relevant data to use and discarding irrelevant data, thus optimizing the performance of the FER models. Overall, this study aims to improve the accuracy and applicability of FER systems and broaden the scope of emotions that these systems can recognize. Such advancements in emotion recognition technology could have significant implications across various aspects of human life, including social interactions, mental health, education, and employment. This ongoing development of FER technology highlights its critical role as a necessary knowledge domain in the modern world.

The remainder of this paper is organized as follows. In the next section, previous related work is reviewed. In Sect. “ Methodology ”, the proposed method and the background theories are described. In Sect. “ Experimental setup ”, experimental works and obtained results are examined and analyzed. In the last section, conclusions and future works are discussed.

Related work

Research in FER has evolved through conventional and automated approaches involving various datasets and methodologies. Thus far, numerous studies have utilized datasets such as Cohn-Kanade (CK) 18 , Extended Cohn-Kanade (CK +) 19 , Facial Expression Recognition 2013 (FER 2013) 20 , Japanese Female Facial Expression (JAFFE) 21 , FACES 22 , and Radboud Faces Database (RaFD) 23 . These datasets primarily include six to eight basic emotion classes. For example, CK and CK + have seven emotional classes: neutral, anger, contempt, disgust, fear, happiness, and sadness 24 . Similarly, FER 2013 and JAFFE introduce seven classes with slightly different categories: neutral, angry, disgusted, fearful, happy, sad, and surprised 24 . While the datasets mentioned earlier introduce seven classes, FACES introduces six categories of emotions: neutral, sad, disgust, fear, anger, and happy 22 , and RaFD has eight emotional classes: anger, disgust, fear, happiness, sadness, surprise, disdain, and neutral. Although these datasets have provided insightful data, the emotions covered are basic and relatively simple.

Recently, Saganowski et al. 14 introduced the Emognition dataset, which includes ten distinct emotions: neutral, amusement, anger, awe, disgust, enthusiasm, fear, liking, sadness, and surprise. In addition to these emotions, the dataset offers several enhancements: it captures physiological signals, represents emotional states using both discrete and dimensional scales, and highlights differences among positive emotions. These improvements facilitate emotion recognition from both facial expression analysis and physiological perspectives, accommodating variances that might occur with specific emotions. While several studies have utilized the physiological signals from the Emognition dataset to classify emotions (e.g., 25 , 26 , 27 , 28 ), there has been relatively little research examining the facial expression data within the same dataset for emotion recognition and classification (e.g., 29 ). This gap highlights a key area for further investigation, aiming to fully leverage the dataset's capabilities in enhancing FER technologies.

Concerning FER techniques, researchers use several conventional methods to extract features from input image data. Some of these methods have been applied in several publications, including cropping faces and converting them into grayscale images 30 , using an optical flow-based approach 31 , and using a histogram of oriented gradients (HOG) 32 . In addition, several publications use an automated approach in extracting features based on the CNN algorithm, including the development of a model with an architectural configuration from scratch 33 , transfer learning with MobileNet 34 , MobileNet-V2 35 , VGG19 36 , DenseNet121 37 , and others.

In the conventional approach, Gupta 30 preprocessed the CK and Extended CK + datasets by cutting the facial region using the HAAR filter from OpenCV and then converting them into grayscale images. In addition, the detection of landmark points on the face is normalized at each point. The random sample technique in the training data distribution is applied with a ratio of 80% training data and 20% validation data. The training process is carried out with the support vector machine classifier model and obtains an accuracy of 93%. Using the same type of classifier, Anthwal and Ganotra 31 performed dense optical flow calculations on facial images to extract vertical and horizontal components. Preprocessing in this study was carried out using the Viola-Jones algorithm to cut the facial area and resize it to a grayscale image with a resolution of 256 × 256. Using Extended Cohn-Kanade (CK +) as a dataset, the best results were obtained for 6 class categories (excluding the contempt class) with 90.52% accuracy. In contrast, for 7 classes, only 83.38% accuracy was achieved, indicating that the classifier model performed better when trained on 6 class categories compared to 7 class categories.

Using the JAFFE and Cohn-Kanade (CK) datasets, Supta et al. 32 built an FER system based on the HOG and support vector machine (SVM). Preprocessing is carried out on the detected parts using histogram equalization techniques and image sharpening to reduce lighting effects. Then, HOG extracts distinctive image features from faces and combines them into feature vectors. Finally, SVM is used to classify expressions using polynomial kernel functions. The proposed system is evaluated on JAFFE and CK data and shows that the proposed FER system provides up to 97.62% and 98.61% accuracy for JAFFE and CK data, respectively.

The conventional feature extraction method for FER requires complex image preprocessing and manual feature extraction, which take a long time 38 . Manually extracted features depend heavily on the previous knowledge of the researchers. This causes the resulting features to be exposed to high bias and causes the loss of implicit patterns. The effectiveness of the extracted features also depends on how well the manual feature extraction technique is used. In contrast, an automated approach based on deep learning is very good at classifying images but requires extensive data to train and perform recognition efficiently 13 .

Extensive data requirements are one of the crucial factors in the context of deep learning model development. Ramalingam and Garzia 33 developed a CNN using two blocks of convolution, rectified linear unit (ReLU), and pooling layers for feature extraction, ending with a fully connected layer. This CNN achieved an accuracy of only 60% on the FER 2013 dataset, which included 35,887 sample images. One deep learning approach that can improve training accuracy is using transfer learning techniques. This technique takes advantage of features learned by models trained on ImageNet to overcome the problem of the lack of large datasets available online 35 .

Ramalingam and Garzia 33 also used transfer learning with the pretrained VGG16 model. In the FER2013 dataset, the accuracy of this transfer learning algorithm reaches 78%, so there is an increase in accuracy of 18% compared to the CNN model without transfer learning. With the same dataset, Raja Sekaran et al. 13 implemented transfer learning using AlexNet as a pretrained model. This research also implements an early stopping mechanism to overcome the problem of overfitting AlexNet. The proposed model only requires preprocessing in the form of image conversion to grayscale to reduce the effects of lighting and human skin color on the classification results. The model managed to achieve 70.52% accuracy for the FER dataset.

In FER 2013, Abdulsattar and Hussain 37 developed six well-known deep learning models for FER problems. These models are MobileNet, VGG16, VGG19, DenseNet121, Inception-v3, and Xception. Model performance was evaluated and compared using transfer learning and fine-tuning strategies on the FER2013 dataset. In transfer learning, all layers in the pretrained model are frozen or not retrained. However, in fine-tuning, all layers in the pretrained model are retrained. The results show that the fine-tuning strategy performs better than transfer learning, with differences ranging from 1 to 10%. The VGG16 pretrained model achieved the highest accuracy with a maximum accuracy of 0.6446.

Transfer learning using VGG16 was also carried out by Oztel et al. 39 using the RaFD dataset. The treatment of the VGG16 pretrained model was divided into two scenarios: with transfer learning and without transfer learning. The model without transfer learning was modified on the 39th and 41st layers of the model structure and a random weight was placed on the model layer, while the model with transfer learning kept the model structure intact. The VGG16 scenario with transfer learning produced the best accuracy compared to the VGG16 scenario without transfer learning. The less-optimal results in the VGG16 scenario without transfer learning are due to a lack of training data and input data imbalance.

Gondkar et al. 35 also carried out research using transfer learning on CK + . The models used in the transfer learning technique are Xception, VGG16, VGG19, ResNet50, ResNet101, ResNet152, ResNet50V2, ResNet101V2, ResNet152V2, InceptionV3, InceptionResNetV2, MobileNet, MobileNetV2, and DenseNet121. A comparative analysis of these models was performed using various evaluation metrics, such as model size, training accuracy, validation accuracy, training loss, and validation loss. The results showed that the pretrained models ResNet50V2, ResNet101V2, ResNet152V2, and MobileNet achieved training and validation accuracy of more than 90%. Most pretrained models have demonstrated outstanding performance, with ResNet101V2 achieving a training accuracy of 93.08% and a validation accuracy of 92.87%. MobileNet achieved training and validation accuracies higher than 90%. Regarding model size, MobileNet was the smallest yet more efficient than most other models.

Undeniably, the type of preprocessing used on the input data can impact the quality of the resulting model. This is evidenced by the preprocessing experiment conducted by Sajjanhar et al. 36 . In that study, three treatments were applied to the CK + , JAFFE, and FACES datasets. The first preprocessing step involved keeping the region of interest (ROI) by cropping the face area from the image. The second preprocessing step was performed by calculating the difference between the gray level intensities of the ROI image pixels at neutral and peak expressions. The third preprocessing step was performed by forming a local binary pattern (LBP) from the image. These three types of preprocessing were tested with the CNN algorithm, and the second preprocessing provided the best accuracy, with 85.19% on CK + data, 65.17% on JAFFE data, and 84.38% on FACES data.

Agobah et al. 34 applied transfer learning using MobileNetV1 across multiple datasets for training, validation, and testing. They optimize CNN training by combining center loss and softmax loss, using the FER 2013 dataset for training and the JAFFE and CK + datasets for validation and testing. This addition improved accuracies on CK + and JAFFE by 2.94% and 5.08%, respectively, with JAFFE achieving 96.43% precision and 95.24% recall and F1 score. While increasing the number of classes in the CK + dataset initially reduced accuracy due to complexity and data limitations, using a larger dataset significantly enhanced performance, raising accuracy by 4.41% over a smaller dataset. However, Agobah et al. 34 highlighted that some misclassifications occurred due to the inherent complexity of distinguishing emotions like anger and sadness.

Meena et al. 40 explored the use of CNN for sentiment identification on facial expressions in the CK + and FER-2013 datasets. Several architectures were developed to evaluate the efficiency of the proposed models on those datasets. The models were categorized into two types based on the data classes: the CNN-3 model considered positive, negative, and neutral expressions, while the CNN-7 model considered happy, neutral, sad, surprise, fear, angry, and disgust expressions. The CNN-3 model yielded accuracies of 79% and 95% for the FER-2013 and CK + databases, respectively. Meanwhile, the CNN-7 model resulted in slightly lower accuracies of 69% and 92% for the same datasets.

Recent FER studies explored advanced models for emotion recognition from video data 29 , 41 . Bilotti et al. 41 developed a multimodal CNN approach integrating facial frames, optical flow, and Mel Spectrograms, which achieved impressive accuracies of approximately 95% on the BAUM-1 and 95.5% on the RAVDESS datasets. In contrast, Manalu and Rifai 29 focused on hybrid CNN-RNN models, comparing a custom model with transfer learning models based on InceptionV3 and MobileNetV2 architectures. Their custom model achieved a maximum accuracy of 63%, less than the 66% by InceptionV3-RNN but higher than the 59% by MobileNetV2-RNN, with the custom model also offering significantly quicker processing times. Both studies highlight the potential of combining different data inputs and model architectures to enhance the accuracy and efficiency of facial expression recognition systems.

Overall, the studies above show the effectiveness of using the transfer learning method in building models given the limited data availability in the FER domain. Several pre-trained models are also used to obtain optimal accuracy on the FER, JAFFE, CK + , CK, FACES, and RaFD datasets. However, across the various studies conducted, no research has discussed the application of CNN-based deep learning using the Emognition dataset. Considering that the Emognition dataset covers a larger number of emotions, some of which have never been explored before, this study develops FER models using CNN to address the Emognition dataset. Besides exploring the transfer learning and fine-tuning strategy, this study also proposes a novel network with the aim of developing a more efficient model for FER. Building a CNN from scratch for facial emotion recognition provides a deep understanding of the network’s inner workings, allowing for customization and optimization of the architecture to the specifics of the task. Since the FER images have not been covered by standard pre-trained models, it is worth developing a full learning model which can be customized according to the problem specification.

Methodology

In this work, we use different types of methods for building the CNN model: transfer learning using MobileNet-V2 and Inception-V3 with fine-tuning strategies and building models from scratch by designing a new network with better efficiency. We also apply a serial type of preprocessing in the Emognition dataset to adjust with the research goal.

Data pre-processing

This process is carried out in several sequential stages: process video to frame, face cropping from frame, data cleaning, data splitting, rescaling, resizing, and data augmentation, as depicted in Fig.  1 . Following the selected deep learning method, the half-body video data is transformed into image data (frame sequences). This task is performed in the Process Video to Frame stage. Once image data in video frames is obtained, the facial region is automatically detected and then cropped within those frames. This activity took place in the Face Cropping from Frame stage.

figure 1

Stages of data pre-processing.

Upon obtaining the facial data, data cleaning is performed by retaining facial images corresponding to emotions while discarding those not accurately representing the intended emotions for their respective classes. During the data input process into the model, a dataframe is employed to enhance data management with greater flexibility and transparency. By utilizing the created data frame, shuffling is performed to the data to achieve a more balanced data distribution. The data are then split into training, validation, and test datasets after the shuffling. The training and validation data play a direct role in training the model, while the test data was solely involved in the testing process.

Subsequently, we apply a rescaling process to transform pixel values of the input images into a range between 0 and 1, and resizing was performed to standardize the input image dimensions. These rescaling and resizing procedures are undertaken to conform to the input requirements of the CNN model. The augmentation techniques are also employed to diversify the dataset, reducing the overfitting risk.

Transfer learning approach

In the transfer learning of the proposed models, the pre-trained weights of the feature extraction layers from both MobileNet-V2 and Inception-V3 are loaded, and fine-tuning is then performed on several layers of the pre-trained model. The choice of Inception-V3 is based on its high accuracy in previous studies. Inception-V3 introduces more complex and efficient architectural designs, including asymmetric convolutions. This means it uses convolutions of different sizes within the same module, allowing it to capture patterns over various spatial hierarchies more effectively. As a result, the network can learn more complex features with fewer parameters, reducing the risk of overfitting. However, Inception-V3 still has more than 24 million parameters, which may require longer training time. Meanwhile, MobileNet-V2 is selected for its small size, with around 3.4 million parameters, and relatively good accuracy, making it suitable for application on smaller devices. It uses depthwise separable convolutions as its basic building block, significantly reducing the number of parameters and computational cost compared to traditional convolutional layers without a substantial loss in accuracy. MobileNet-V2 also uses an architectural feature known as the inverted residual structure with linear bottlenecks. This design optimizes the flow of information through the network, ensuring that the model remains lightweight while still capturing the essential features necessary for accurate predictions, albeit not as powerful as Inception-V3, which can capture more complex features.

Fine-tuning is achieved by first freezing all pre-trained weights and training for several epochs; we then unfreeze half of the pre-trained model and continue training for the same number of epochs. As such, the training process of the transfer learning model is divided into two scenarios: scenario 1 and scenario 2, in which scenario 1 includes the stage before fine-tuning, and scenario 2 contains the fine-tuning process. Dense layers with a certain number of neurons are also added. These transfer learning and fine-tuning stages are adopted from Elgendy 42 . The two-stage training process of the transfer learning model is described as follows:

In this scenario, all convolutional layers of the selected pretrained model are frozen (freezing), and the classification head from the pretrained model is not utilized. A new classification head is then added, tailored to the case of emotion expression classification, and training is conducted. By freezing the pretrained model, the weights in the convolutional network are not updated during the training process. This scenario aims to train only the classification head of the model while maintaining the weights that have been trained in the pretrained model.

Unlike the first scenario, in this scenario, the pretrained model is partially unfrozen at the last 50% of its layers. Training then continues from where it left off in Scenario 1. Here, the first scenario is executed in the first half of the training process, in which if there are 100 epochs, then the first 50 epochs are dedicated for the first stage. Afterward, the training process is continued using the second scenario for the rest of epochs. The main focus here is to train a portion of the layers in the convolutional network to better suit the case of emotion expression classification. The choice to unfreeze the last 50% of layers is because these layers contain features that are specific to the data trained on the previous dataset.

The two-stage training process using scenarios 1 and 2 is intended to enhance the effectiveness of the transfer learning process. In scenario 1, the focus is on leveraging the pretrained model, which contains previously trained weights. However, as additional layers are added, these weights remain specific to the features of the previous dataset. Therefore, in scenario 2, part of the pretrained model, namely the last 50% of its layers, is retrained to align with the current case. This effort optimizes the weights of the pretrained model for better utilization in the transfer learning process.

MobileNet-V2

MobileNet is a very lightweight image classification model with minimal operations, initiated by Sandler et al. 43 . MobileNet has three variants: MobileNet-V1, MobileNet-V2, and MobileNet-V3. In this study, we use MobileNet-V2 as one of the pre-trained models in the transfer learning process. MobileNet-V2 is the smallest model in size and has the second fastest GPU processing time after MobileNet-V1 on the ImageNet dataset. In addition, this model has higher top-1 and top-5 accuracy than MobileNet-V1, and it has the smallest number of parameters compared with other existing models.

MobileNet-V2 uses two types of residual blocks, one with a stride of 1 and a second with a stride of 2. Each block consists of three layers: a 1 × 1 pointwise (expansion) convolution, a depthwise convolution, and a 1 × 1 linear convolution layer, with ReLU6 activation applied to the first two layers. The MobileNet-V2 architecture is shown in Fig.  2 .

figure 2

MobileNet-V2 architecture 44 .

Inception-V3

Inception-V3 is the successor of Inception-V2 and Inception-V1 35 , initiated by Szegedy et al. 45 . This model consists of five 5 × 5 convolution inception modules replaced by two 3 × 3 convolution layers and an efficient grid reduction block to reduce the number of parameters without sacrificing the overall model efficiency. In addition, four 7 × 7 convolution inception modules are replaced with two 1 × 7 and 7 × 1 convolution layers, followed by another grid reduction block.

Inception-V3 has an additional classifier connected to the ends of these 4 inception modules. At the end of the model, two inception modules used 3 × 1 and 1 × 3 convolution layers in parallel to increase dimensionality. Then it is connected to the average pooling layer, fully connected layer, dropout, and softmax to produce output. Compared to MobileNet-V2, Inception-V3 has a greater number of parameters with better top-1 accuracy and top-5 accuracy than MobileNet-V2 on ImageNet. The Inception-V3 architecture is shown in Fig.  3 .

figure 3

Inception-V3 architecture 46 .

Full learning approach

In the full learning approach, where the model is constructed from the ground up, we employ multiple feature extraction layers. Each of these layers is composed of a convolutional layer paired with a subsequent pooling layer. The convolutional layers vary in the quantity of filters employed and each has a predefined filter size. For the pooling layers, we utilize max pooling. Additionally, a flatten layer is incorporated to transform the feature matrix into a vector form. To diminish the risk of overfitting during the training phase, a dropout layer is also integrated into the model.

To optimize the result, a design of experiment (DoE) using the Taguchi method is performed to find a robust combination of the number of feature extraction layers and the number of epochs to be used in the build model of full learning technique. The stages of the design of the experiment using the Taguchi method are shown in Fig.  4 .

figure 4

Steps in Design of Experiment using Taguchi Method.

The process begins with the clear definition of the problem at hand, which in this case is to determine a robust combination of the number of feature extraction layers and epochs for the model. Following this, the output performance characteristic that will gauge the success of the model is defined. The output characteristic to be compared is the validation accuracy at the end of each epoch, since it enables direct comparisons between different models. The higher the accuracy level on the validation data, the better the model performs.

The next step involves pinpointing the control factors, which include the network architecture details (number of convolutional layers) and training parameter (number of epoch). The number of convolutional layers determines the depth of feature extraction, which can affect a model's ability to learn from complex data. More layers can capture intricate patterns but also risk overfitting. The number of epochs affects how well the model learns from the data; too few epochs can lead to underfitting, while too many may lead to overfitting. Both factors directly influence the model’s learning capacity and generalization to new data, making them critical control factors for optimizing performance.

Next, the levels of each control factor are selected, taking into account potential interactions between them and ensuring the appropriate degrees of freedom for the experiment’s statistical validity. Here, each factor has three levels (number of convolutional layers: 4, 5, 6; number of epochs: 50, 100, 150). This design is covered most efficiently with 9 runs in total, i.e., an L9 orthogonal array. The experiments are then executed on the same platform and device to minimize noise. Statistical analysis is conducted by analyzing the S/N ratio of the experimental results. Since the goal of this experiment is to find a combination of factors that maximizes the validation accuracy, the larger-is-better S/N ratio is chosen. The output of the DoE is the best combination of the number of convolutional layers and the number of epochs, which is then used for building and training the full learning model.
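As an illustration of the analysis step, the larger-is-better signal-to-noise ratio, S/N = −10·log10((1/n)·Σ 1/y_i²), can be computed directly from the validation accuracies of the nine L9 runs and averaged per factor level. The sketch below uses placeholder accuracy values; it is not the experimental data reported later in Table 4.

```python
import numpy as np

def sn_larger_is_better(values):
    """Taguchi larger-is-better S/N ratio: -10 * log10(mean(1 / y_i^2))."""
    y = np.asarray(values, dtype=float)
    return -10.0 * np.log10(np.mean(1.0 / y ** 2))

# L9 orthogonal array over (number of conv layers, epochs).
# The accuracy values here are placeholders, not measured results.
l9_runs = {
    (4, 50): 0.90, (4, 100): 0.91, (4, 150): 0.92,
    (5, 50): 0.92, (5, 100): 0.94, (5, 150): 0.95,
    (6, 50): 0.91, (6, 100): 0.93, (6, 150): 0.94,
}

# S/N ratio per run, then the mean S/N per level of each factor;
# the level with the highest mean S/N is selected as the robust setting.
sn_per_run = {k: sn_larger_is_better([acc]) for k, acc in l9_runs.items()}
for factor, idx in (("conv layers", 0), ("epochs", 1)):
    for level in sorted({k[idx] for k in l9_runs}):
        mean_sn = np.mean([sn for k, sn in sn_per_run.items() if k[idx] == level])
        print(f"{factor} = {level}: mean S/N = {mean_sn:.3f}")
```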

Evaluation criteria

After constructing the model using both techniques, this study analyzes the outcomes based on the evaluation metrics employed, namely accuracy, precision, and recall. The model creation process can be considered complete if these three metrics yield satisfactory results. However, if the outcomes are not deemed satisfactory, we undertake hyperparameter tuning for the CNN model to enhance the training outcomes. For the evaluation of several alternative models, some indicators have been set. These criteria are shown in Table 1 .

Based on Table 1 , the best model should have good accuracy, precision, and recall on the training and validation datasets with the smallest loss, not show overfitting or underfitting, have a good confusion matrix, and achieve the best accuracy, precision, recall, and F1-score on the test results. In addition, the computation time, especially the testing (inference) time, is also assessed to ensure the practicability of the model in real-world implementations.
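As a sketch of how these test-set indicators can be obtained, scikit-learn's utilities give the confusion matrix and per-class precision, recall, and F1-score directly. The names `model` and `test_gen` below are illustrative assumptions: a trained Keras classifier and a non-shuffled test data generator.

```python
import time
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# `model` and `test_gen` are assumed to already exist (illustrative names):
# a trained Keras classifier and a test generator created with shuffle=False.
start = time.perf_counter()
probs = model.predict(test_gen)
inference_time = time.perf_counter() - start          # total test (inference) time

y_pred = np.argmax(probs, axis=1)                      # predicted class indices
y_true = test_gen.classes                              # ground-truth class indices

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))  # precision, recall, F1 per class
print(f"Inference time: {inference_time:.2f} s")
```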

Consent for publication

We hereby provide consent for the publication of the manuscript detailed above, including any accompanying images or data contained within the manuscript.

Experimental setup

The Emognition dataset contains physiological signals and upper body video recordings from 43 participants who watched emotionally validated movie clips selected to elicit nine discrete emotions. According to Saganowski et al. 14 , the Emognition dataset offers several advantages compared to other datasets, including nine discrete emotions—amusement, awe, enthusiasm, liking, surprise, anger, disgust, fear, and sadness—plus one neutral emotion, an emphasis on the differences between positive emotions, and the possibility of diverse analyses in emotion recognition (ER) from both the physiological and facial expression domains. This study uses only the half-body video data of the Emognition dataset, a total of 387 videos. There are two frame rates: 60 FPS and 30 FPS; 287 videos are recorded at 60 FPS, while the remaining 100 videos are at 30 FPS. Generally, the videos in the Emognition dataset vary in length across videos and classes.

Data preprocessing

Converting video to frame.

At this juncture, frames from the video data are extracted by taking into account both the frame rate and the duration of the footage. The Emognition dataset encompasses two distinct frame rates: 60 frames per second (FPS) and 30 FPS. Employing a sampling method, frames at regular one-second intervals are collected throughout the length of the video. Accordingly, for videos recorded at 60 FPS, frames are extracted every 60th frame, and for those at 30 FPS, every 30th frame was selected.

Cropping face from frame

Subsequent to frame extraction, we proceed to isolate the facial region in each frame. This segmentation utilizes the Cascade Classifier function from the OpenCV library to precisely delineate the face in every sequence of frames acquired from the previous stage.
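The two stages just described, one-second frame sampling and face cropping, can be sketched as follows. The sketch assumes OpenCV's bundled frontal-face Haar cascade and illustrative detector parameters, since the exact cascade file and settings are not specified here.

```python
import cv2

def extract_face_frames(video_path, out_dir, sample_every=60):
    """Sample one frame per second (every 60th frame for 60 FPS video,
    sample_every=30 for 30 FPS) and save the cropped face region."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            for (x, y, w, h) in faces[:1]:            # keep the first detection only
                cv2.imwrite(f"{out_dir}/frame_{idx:06d}.jpg",
                            frame[y:y + h, x:x + w])
                saved += 1
        idx += 1
    cap.release()
    return saved
```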

Data cleaning

According to Saganowski et al. 14 , certain high-intensity emotional expressions manifest under multiple conditions within the film sequences, and a single sequence may elicit multiple emotions. In light of these findings, a subjective data pruning is conducted. Each image is individually inspected, with those deemed suitable retained and the unsuitable ones removed. This pruning process significantly reduces the total dataset volume. Additionally, an absence of a distinct 'surprise' emotion classification lead to its exclusion, narrowing down the emotion categories to nine, including a neutral category. This reduction yields a final count of 2,535 facial images, constituting a mere 6.12% of the original image dataset.

Shuffling and splitting

The data randomization and splitting processes are executed simultaneously. Randomized shuffling is managed via the random state parameter, and partitioning is dictated by the specified test size. Random shuffling in train-test splitting is employed to ensure that the training and testing datasets are representative of the overall dataset. This method mitigates the risk of bias in the model training process and enhances the model's ability to generalize from its training to unseen data. By shuffling, the data is randomized, preventing the model from learning potential patterns that may be due to the order of the data rather than the underlying relationships between the variables.

The datasets are then segmented into 80% for training, and 10% each for validation and testing. The aim of this segmentation is to balance the need for a model to learn effectively from the data (requiring a substantial training set) against the need to prevent overfitting and to accurately estimate the model's predictive performance on new, unseen data (requiring separate validation and testing sets). A larger training set allows the model to better understand the complex patterns and relationships within the data, which is vital for developing a model that performs well. Moreover, the relatively large training portion is carefully chosen in view of the limited amount of available data, only 2535 facial images.
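A minimal sketch of the shuffle-and-split step follows, assuming the cleaned images have been indexed in a pandas dataframe `df` with illustrative columns `filepath` and `label`; the `random_state` value is arbitrary and only fixes the shuffle for reproducibility.

```python
from sklearn.model_selection import train_test_split

# First split off 20% of the data, then halve that portion into validation
# and test sets, yielding the 80/10/10 partition described above.
train_df, rest_df = train_test_split(df, test_size=0.20,
                                     shuffle=True, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.50,
                                   shuffle=True, random_state=42)
print(len(train_df), len(val_df), len(test_df))
```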

Rescale and resize

Normalization of input pixel values is accomplished through the use of a rescale parameter set at 1/255 within the image data generator, converting the pixel values to a normalized matrix ranging between 0 and 1. Concurrently, resizing of images is administered through the target size parameter, aligning with the predetermined image dimensions for this study, which are set at 300 × 300 pixels.

Data augmentation

To further enrich the dataset, we apply data augmentation strategies using an image data generator. This involves applying a series of transformations to the input images, such as height shifts, shear intensity variations, zoom alterations, and horizontal flipping. These modifications enable the generation of new image variants, thereby enhancing the diversity of the training dataset.
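Rescaling, resizing, and augmentation can be expressed together with Keras' `ImageDataGenerator`, continuing the dataframe sketch above. The shift, shear, and zoom magnitudes and the batch size are illustrative values, while the 1/255 rescale and the 300 × 300 target size follow the description above; augmentation is applied to the training split only.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training data: rescaled and augmented. Validation/test data: rescaled only.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,        # normalize pixel values to [0, 1]
    height_shift_range=0.1,   # augmentation magnitudes are illustrative
    shear_range=0.1,
    zoom_range=0.1,
    horizontal_flip=True)
eval_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = train_datagen.flow_from_dataframe(
    train_df, x_col="filepath", y_col="label",
    target_size=(300, 300), batch_size=32, class_mode="categorical")
val_gen = eval_datagen.flow_from_dataframe(
    val_df, x_col="filepath", y_col="label",
    target_size=(300, 300), batch_size=32, class_mode="categorical")
test_gen = eval_datagen.flow_from_dataframe(
    test_df, x_col="filepath", y_col="label",
    target_size=(300, 300), batch_size=32, class_mode="categorical",
    shuffle=False)            # keep order so predictions align with labels
```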

Parameter configuration and model implementation

Several hyperparameters of the training process are predetermined. Details of these parameters are shown in Table 2. Meanwhile, the number of epochs for the full learning model is determined based on the Taguchi experiments. During training, the same hyperparameter settings are used for both the transfer learning and full learning models to ensure a fair comparison, except for the image size, which follows the input size of the respective architecture, and the number of epochs. The transfer learning models are trained for 100 epochs, divided equally between the first and second scenarios.

The proposed method is developed and implemented using TensorFlow written in Python. The models are fully trained and tested using Google Colab, accessed through a computer with an Intel(R) Core(TM) i5-10200H CPU at 2.40 GHz, 8 GB RAM, and a GeForce GTX 1650 Ti with Max-Q Design.

Results and discussion

Model architecture

Transfer learning with MobileNet-V2 and Inception-V3

In the first scenario, we freeze all the layers of the pretrained models, meaning that the parameters of the pretrained model are not retrained. The input data flow forward and backward through the feature extraction layers without any weight updates, so this first scenario trains only the fully connected layers. In the second scenario, we perform fine-tuning by activating the last half of the pretrained layers. The total numbers of layers in MobileNet-V2 and Inception-V3 are 154 and 311, respectively; therefore, we activate the last 77 layers in MobileNet-V2 and the last 155 layers in Inception-V3. The Adam optimizer is applied with a learning rate of 0.0001, keeping the fine-tuning updates small to reduce the risk of overfitting. The difference in the number of parameters is shown in Table 3, while the transfer learning architectures are illustrated in Fig. 5.
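
The two scenarios can be expressed in Keras roughly as follows, shown here for MobileNet-V2; the classification head, the 224 x 224 input size, and the reuse of the train_gen/val_gen generators (assumed to be rebuilt at the backbone's input size) are assumptions for illustration, not the study's exact configuration.

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

# Scenario 1: freeze every pretrained layer; only the new head is trained.
base.trainable = False
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(512, activation="relu"),   # assumed head
    tf.keras.layers.Dense(9, activation="softmax"),  # nine emotion classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=50)

# Scenario 2: unfreeze the last half of the backbone (77 of 154 layers)
# and continue training with the same low learning rate.
base.trainable = True
for layer in base.layers[:len(base.layers) // 2]:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=50)
```

The same pattern would apply to Inception-V3 via tf.keras.applications.InceptionV3, with the last 155 of its 311 layers unfrozen in the second scenario.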

Figure 5. Architectures of transfer learning models.

Full learning model

For the build-from-scratch technique, we use the Taguchi method to find a robust combination of the number of feature extraction layers and the number of epochs. The Taguchi method has been widely applied to obtain the optimal architecture design and hyperparameter setting of CNNs, thus avoiding time-consuming trial-and-error methods 47. Here, the number of feature extraction (convolution) layers is set at three levels: 4, 5, and 6. The first three feature extraction layers are identical across all designs. The 4th, 5th, and 6th convolution layers have 64 filters each, with a filter size of 3 × 3 and the ReLU activation function, and each convolution layer is followed by a pooling layer with a pool size of 2 × 2. The epoch factor has three levels: 50, 100, and 150 epochs.

Validation accuracy is used as the response, with the larger-is-better type of S/N ratio. The combination design and its results are shown in Table 4, while the data analysis output is presented in Fig. 6.
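
The larger-is-better S/N ratio used in this analysis is computed as S/N = -10 * log10(mean(1/y^2)); a small illustration with placeholder accuracy values is given below.

```python
import numpy as np

def sn_larger_is_better(y):
    """Taguchi larger-is-better S/N ratio: -10 * log10(mean(1 / y^2))."""
    y = np.asarray(y, dtype=float)
    return -10.0 * np.log10(np.mean(1.0 / y ** 2))

# Placeholder validation accuracies for one factor combination
# (not the study's actual measurements).
print(sn_larger_is_better([0.88, 0.90, 0.87]))
```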

Figure 6. Taguchi results.

Based on the main effect plots for the S/N ratio and for the mean in Fig. 6, the robust design combines the number of feature extraction layers at level 5 with the number of epochs at level 150. This combination provides higher accuracy and yields a design that is less sensitive to variation.

Next, we use five convolution layers, with 16 filters in the first convolution, 32 in the second, and 64 in the third, fourth, and fifth. The filter size for all convolution layers is 3 × 3, and each convolution layer is followed by a max-pooling layer with a pool size of 2 × 2. We also use a global average pooling layer to convert the feature maps into a vector for a fully connected part with two dense layers. The first dense layer has 512 neurons, and the second dense layer (output layer) has 9 neurons, matching the number of classes in the input data. The activation function for each layer is ReLU, except for the output layer, which uses Softmax activation to suit the multiclass classification problem.

With that architecture, the model has a total of 135,337 trainable parameters, all of which are trained over 150 epochs. The architecture of the full learning model is depicted in Fig. 7.
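
A Keras sketch of this build-from-scratch architecture is shown below; the layer sizes follow the description above and reproduce the stated 135,337 trainable parameters, while the 300 x 300 x 3 input shape follows the study's image size.

```python
import tensorflow as tf
from tensorflow.keras import layers

full_model = tf.keras.Sequential([
    layers.Input(shape=(300, 300, 3)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.GlobalAveragePooling2D(),
    layers.Dense(512, activation="relu"),
    layers.Dense(9, activation="softmax"),  # nine emotion classes
])
full_model.summary()  # reports 135,337 trainable parameters
```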

Figure 7. Full learning architecture.

Training result

Training processes were executed using the hyperparameter values described in Sect. "Experimental setup". Specifically, the number of convolutional layers and epochs for the full learning model was determined based on the Taguchi results. We used the two-scenario approach for the transfer learning models during the training process. The training processes for the MobileNet-V2, Inception-V3, and full learning models are presented in Figs. 8, 9 and 10, respectively.

Figure 8. Training process for transfer learning MobileNet-V2 model.

Figure 9. Training process for transfer learning Inception-V3 model.

Figure 10. Training process for full learning model.

The horizontal green line in Figs. 8 and 9 represents the boundary between the first and second scenarios. The training process was conducted for 100 epochs, with each scenario lasting 50 epochs. The model was evaluated on the validation data both before and after training, and a significant improvement in accuracy and a notable reduction in loss can be observed. Thus, through the training process using both the first and second scenarios, the model's performance was successfully enhanced, as evidenced by increased accuracy and decreased loss. This indicates that the model has effectively learned from the training data and can generalize well on the validation data.

Based on the evaluation of the accuracy, precision, and recall metrics from epoch to epoch on the training and validation data for each model, it can be concluded that the transfer learning model with Inception-V3 gives the best results up to the last epoch. The transfer learning model with Inception-V3 shows a consistent positive trend on the training and validation data without significant fluctuations in the validation data, so no overfitting is identified. Indications of overfitting or underfitting are also not found in the training graphs of MobileNet-V2 and the full learning model, indicating proper training for all models.

Further, the training results for the evaluation metrics of accuracy, precision, recall, and loss at the end of the training process can be seen in Fig. 11. Here, the transfer learning model with Inception-V3 provides optimal results on the training and validation data, whereas the transfer learning model with MobileNet-V2 shows less consistent results between training and validation. In contrast, the full learning model performs less well than the other models, scoring the lowest precision and accuracy, as well as the highest loss, on both the training and validation datasets. In addition, the evaluation results on the validation data conducted after the training process can be seen in Fig. 12.

Figure 11. Training results comparison between all models.

Figure 12. Evaluation of validation data.

Based on Fig. 12, the transfer learning model with Inception-V3 has the highest accuracy value and the smallest loss in the evaluation on the validation data. This model shows the best performance, which indicates that the training process for the transfer learning model with Inception-V3 is more effective than for the other models. The training results are also compared in terms of training time for the three models; the comparison is shown in Fig. 13.

Figure 13. Training time (in minutes) comparison.

Based on Fig. 13, the transfer learning model with Inception-V3 requires a shorter training time than the other models, at 72.25 min. Note that, although Inception-V3 needed a longer training time than MobileNet-V2 in its original training on ImageNet, here the training times for both transfer learning models are not significantly different. Meanwhile, the training time for the full learning model is the longest at 110.57 min. This is because the model must learn from scratch with all layers active and is trained for 150 epochs.

Overall, the transfer learning model with Inception-V3, together with the fine-tuning process, produces optimal results in the training process. Fine-tuning successfully adapts Inception-V3 to the input data effectively and with a relatively short training time.

Testing results

After the training and validation processes, the testing process is conducted to evaluate the performance of the detection model in generalizing to new data. Based on the testing process on the same test data, the total accuracy of the three models is compared. The results indicate that the transfer learning model with Inception-V3 has a better total accuracy than the other models, reaching 0.96, compared with 0.89 for MobileNet-V2 and 0.87 for the full learning model. This indicates that the ability of the transfer learning model with Inception-V3 to recognize the true class across the entire dataset is better than that of the other models. The accuracy is also compared for each class; the results of the comparison are shown in Fig. 14.

Figure 14. Testing accuracy.

Based on Fig. 14, the transfer learning model with Inception-V3 achieves better accuracy than the other models for each class in the input data. This indicates that the transfer learning model with Inception-V3 produces a high number of true positives and true negatives. Only in the awe class does the class accuracy of the transfer learning model with Inception-V3 equal that of the transfer learning model with MobileNet-V2.

The testing results are also compared on the recall metric for the three models, as shown in Fig. 15. The results indicate that the recall of the transfer learning model with Inception-V3 is superior in most classes. Only in the fear class is the recall of the Inception-V3 model lower than that of the transfer learning model with MobileNet-V2, whereas in the anger and awe classes, the recall of the MobileNet-V2 and Inception-V3 transfer learning models is the same. In the amusement class, the recall of the build-from-scratch model is superior to the other models, while in the enthusiasm and neutral classes the recalls are the same. In the remaining classes, the recall of the transfer learning model with Inception-V3 is superior to the other models.

Figure 15. Testing recall.

The testing results are also compared in terms of the precision value of each class, as shown in Fig. 16. Based on Fig. 16, the precision of the transfer learning model with Inception-V3 dominates in the enthusiasm and neutral classes compared with the other models. In the amusement, awe, disgust, and liking classes, its precision is the same as that of the transfer learning model with MobileNet-V2. In the sadness class, the highest precision is achieved by the transfer learning model with MobileNet-V2, and in the anger class, the precision of the build-from-scratch model and the transfer learning model with Inception-V3 is the same.

Figure 16. Testing precision.

The F1-score combines each class's recall and precision values, providing a more comprehensive analysis. Based on Fig. 17, the transfer learning model with Inception-V3 performs better than the other models for each class in the input data, except in the awe class, where its F1-score equals that of MobileNet-V2, meaning both have comparable performance in that class.
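
The per-class precision, recall, and F1-score comparisons reported here could be reproduced with scikit-learn along the lines of the sketch below; it assumes a trained model and a test generator created with shuffle=False so that predictions align with the ground-truth labels.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

probs = model.predict(test_gen)        # class probabilities per test image
y_pred = np.argmax(probs, axis=1)
y_true = test_gen.classes              # ground-truth labels (shuffle=False)

print(classification_report(y_true, y_pred,
                            target_names=list(test_gen.class_indices)))
print(confusion_matrix(y_true, y_pred))
```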

Figure 17. F1-score in testing data.

The assessment also includes a review of the testing duration for each model when evaluating a single image. The testing time is crucial since it influences the applicability of the developed model in a real system. These findings are depicted in Fig. 18. According to Fig. 18, the Inception-V3 model exhibits a longer testing time than the other models, owing to the greater complexity of the Inception-V3 transfer learning model. In contrast, the build-from-scratch model, with much lower complexity, processes images more quickly. Nevertheless, the testing times of all models are fast enough for real-time detection.

Figure 18. Testing time (in seconds) comparison.

The testing results confirm that the transfer learning model with Inception-V3 is the best choice for the task of classifying emotions in the Emognition dataset. Better performance on these evaluation metrics indicates the model's ability to achieve balanced overall classification performance and to recognize specific emotion classes. Although the transfer learning model with Inception-V3 has a longer testing time, it is not significantly different from that of the other models.

Experiments on JAFFE and KDEF datasets

To further validate the robustness and versatility of the proposed models, extensive testing was conducted using two well-established facial emotion recognition datasets: the Japanese Female Facial Expression (JAFFE) dataset 48 and the Karolinska Directed Emotional Faces (KDEF) dataset 49 . These datasets are frequently utilized in the FER field due to their diverse representation of facial emotions and have historically served as benchmarks for evaluating the effectiveness of FER algorithms.

The models used for these experiments were adapted from the best-performing models on the Emognition dataset, as detailed in Sect. “ Testing results ”. This approach leverages the sophisticated feature extraction capabilities already developed for the Emognition models, thus providing a strong foundation for recognizing emotions in JAFFE and KDEF images. The transfer learning technique was applied, utilizing the same hyperparameter configurations as outlined in Sects. “ Model architecture ” and “ Training result ”, ensuring consistency in model training and evaluation.

The performance of the adapted models on the JAFFE and KDEF datasets was assessed based on their testing accuracy and F1-scores. These metrics are critical for comparing the efficacy of the proposed models against existing models documented in recent literature. The results are systematically presented in Table 5 , which includes comparative data from other recent studies.

The analysis demonstrates that the proposed models, especially the Inception-V3 transfer learning model, are effective for facial emotion recognition tasks across different datasets. Compared with other research, the proposed models are competitive, often outperforming or matching state-of-the-art results. The models of Dada et al. 53, Lasri et al. 55, and Baygin et al. 56 stand out with slightly higher metrics on JAFFE. However, it should be noted that the proposed models are trained and tuned using pre-processing and hyperparameters specifically tailored for the Emognition dataset, while the existing models were developed and trained for the JAFFE and KDEF datasets. The proposed models could be further refined to enhance their accuracy and F1-scores, potentially by incorporating techniques from the best-performing models in the literature or by further tuning their architectures and training parameters for the respective datasets.

To provide a clearer visual representation of model effectiveness, Figs.  19 and 20 display the confusion matrices for the JAFFE and KDEF datasets, respectively. These matrices illustrate the precision of emotion classification across different emotions provided by the datasets, highlighting the models' strengths and areas for improvement in recognizing specific emotional expressions.

Figure 19. Confusion matrix of the models on JAFFE dataset.

Figure 20. Confusion matrix of the models on KDEF dataset.

The comparative analysis of these models on two different datasets provides important insights into their respective strengths and weaknesses. Inception-V3 emerges as the most consistent and reliable model, particularly advantageous in settings where high accuracy across a diverse range of emotions is required. The MobileNet-V2 and full learning models, while effective, demonstrate specific areas where enhancement is needed, particularly in the accurate classification of negative emotions. The lower accuracy of the MobileNet-V2 transfer learning model is notable for the sad expression, with a 25% misclassification rate primarily involving confusion with angry and disgust on the JAFFE dataset; it also shows lower precision for the angry expression on the KDEF dataset. The full learning model shows limitations with disgust, achieving only 75% accuracy on the JAFFE dataset with notable misclassifications involving fear and sad. On the KDEF dataset, its misclassifications primarily involve angry and disgust, indicating a recurring challenge in distinguishing between closely related negative emotions. Further improvement may include fine-tuning and hyperparameter optimization, which could offer significant benefits, particularly for models showing potential yet inconsistent performance across emotional categories.

Analyzing the training and testing outcomes shows that the transfer learning model utilizing Inception-V3 outperforms the other models. The transfer learning model with MobileNet-V2 shows less optimal performance compared with the transfer learning model with Inception-V3. This is caused by overfitting after the fine-tuning technique (scenario 2) is applied during the training process; as such, the fine-tuning technique is less suitable for MobileNet-V2 on the Emognition dataset. In addition, the number of parameters used in the training process of the MobileNet-V2 transfer learning model is smaller than that of the Inception-V3 transfer learning model, so MobileNet-V2 requires more data in the training process. This is in line with the finding of Abdulsattar and Hussain 37, who reported that the transfer learning model with MobileNet-V2 had less than optimal results compared to other models.

Although the results indicate that the full learning model shows less optimal performance than the transfer learning models overall, the difference in testing accuracy with MobileNet-V2 is not significant. Moreover, the full learning model has the lowest testing time among the three models. The full learning model has 135,337 trainable parameters, while the MobileNet-V2 transfer learning model has 1,320,969 trainable parameters in scenario 1 and 3,578,953 in scenario 2. This stark difference in complexity results in notably faster training and testing times for the full learning model. Hence, these results indicate that the full learning model is still promising for specific tasks such as FER. Several improvements in the datasets, architectural design, and training settings can be made to increase the performance of the full learning model.

One possible reason for the full learning model's weaker performance is the lack of training data available for fitting the model from scratch. The limited data for each category within the Emognition dataset may have hindered the full learning model from reaching a high level of accuracy. Transfer learning models, which benefit from pre-training on extensive and varied datasets, are better equipped to identify and learn new features and patterns. They come with a foundational understanding of class features, simplifying the classification task. To overcome the data scarcity issue, subsequent research could create an expansive database specifically for FER images to train full learning models more efficiently. This is in line with Raja Sekaran et al. 13, who note that large amounts of data are needed to successfully create a CNN model from scratch. Moreover, future research could aggregate existing datasets to form a comprehensive and sizable dataset for this purpose.

Another possible reason is that the hyperparameters employed were not ideally suited to the problem addressed in this research. Although this study determined the optimal number of convolutional layers and epochs, several other hyperparameters could be explored, such as the number of filters, filter sizes, number of neurons, activation functions, regularization, learning rate, and batch size. Comprehensive experimentation on these hyperparameters would therefore be necessary.

In addition, the architectural design of the full learning model can be improved through several approaches. Normalization techniques such as batch or layer normalization can be instrumental in stabilizing the training process, thereby accelerating convergence. To combat overfitting and foster generalization, dropout layers or regularization methods can be strategically integrated into the model. Residual connections, inspired by architectures such as ResNet, can be added to allow the training of deeper networks by enabling the direct flow of gradients. Ensemble methods, which combine the strengths of various models or architectures, can also be employed to improve overall accuracy. Incorporating these enhancements into subsequent research has the potential to pave the way for FER models that are not only more sophisticated and precise but also offer greater clarity in their decision-making processes.

Further, this study possesses several limitations that provide opportunities for future development. The developed models are based on static images for making predictions. The use of static images has several disadvantages: it lacks temporal context, as static images do not capture the dynamic nature of facial expressions and miss the temporal cues and changes over time that can be critical for accurately recognizing emotions. In addition, real-world emotional expressions often involve subtle movements and transitions. Static images cannot fully represent these micro-expressions, potentially leading to oversimplified models that struggle with the complexity of real facial expressions. A single static image may also not adequately represent the range of expressions associated with a particular emotion, reducing the generalizability of the model. Therefore, future studies can be directed toward developing video-based deep learning models for FER. Utilizing video datasets offers a significant benefit by recording facial expressions as they evolve over several frames, delivering a richer, more detailed basis for emotion classification than static image datasets can provide. This dynamic capture of facial changes enhances the model's ability to accurately recognize a wider range of emotions 57. Based on the analysis of the JAFFE and KDEF datasets, the proposed models offer promising performance. Hence, future studies can implement and evaluate the proposed architectures on several other datasets to assess the robustness of the models, especially their ability to learn different emotions.

This study utilizes a form of CNN that is not explainable, meaning the internal decision-making process of the model is not transparent to the researchers. Future research could focus on explainable artificial intelligence (XAI), with an emphasis on interpretability, to understand how the model processes and classifies input images. The implementation of XAI for FER holds particular significance for critical and sensitive sectors such as police investigations, psychology, the judiciary, and healthcare for several key reasons. In sensitive applications, understanding how decisions are made is crucial for establishing trust: XAI provides transparency into the decision-making process of FER systems, enabling stakeholders to comprehend why a particular emotion was recognized, which is vital for building confidence in the system's outputs. In environments like the judiciary or police investigations, the accuracy of emotion recognition can have profound implications, and XAI helps ensure that the conclusions drawn are based on valid, understandable reasoning, which is essential for accountability, especially in legal contexts where decisions can affect the outcomes of cases or investigations. The ethical use of FER in psychology and healthcare requires careful consideration of privacy and consent, as well as the potential consequences of misinterpretation; XAI enables deeper scrutiny of the ethical implications of deploying such technology by making the operational logic accessible and comprehensible.

Conclusions

This study introduces an image-based computer vision approach for developing deep learning techniques to automate facial emotion recognition (FER) using the Emognition dataset. The dataset consists of ten expressions: amusement, awe, enthusiasm, liking, surprise, anger, disgust, fear, sadness, and neutral. First, the dataset is subjected to a series of preprocessing steps to obtain a clean dataset of 2,535 facial images, which is then split into 2,028 images for training, 253 for validation, and 254 for testing. The development of CNN models involves two distinct methods: transfer learning with fine-tuning using the pre-trained models Inception-V3 and MobileNet-V2, and the creation of a CNN model from scratch.

The experimental results demonstrate that all three proposed CNN models perform admirably in classifying emotions across the nine emotion classes. The training and testing outcomes consistently support the conclusion that the transfer learning model with Inception-V3 exhibits superior performance compared with the other models. This finding also underscores the effectiveness of the fine-tuning process applied to Inception-V3 in adapting the model to the input data. The detailed analysis also indicates that the developed models successfully predicted several emotions unique to the Emognition dataset, namely amusement, enthusiasm, awe, and liking, with high accuracy. Furthermore, this research holds promise for practical implementation in various domains, including marketing, mental health, education, application development, and beyond.

Data availability

The datasets in this study are provided by the Emognition team. Please refer to the following article for the datasets: S. Saganowski, J. Komoszyńska, M. Behnke, B. Perz, D. Kunc, B. Klich, L. D. Kaczmarek, and P. Kazienko, “Emognition dataset: emotion recognition with self-reports, facial expressions, and physiology using wearables,” Sci Data, vol. 9, no. 1, pp. 1–9, Dec. 2022, https://doi.org/10.1038/s41597-022-01262-0 .

Krishna, A. H., Sri, A. B., Priyanka, K. Y. V. S., Taran, S. & Bajaj, V. Emotion classification using EEG signals based on tunable-Q wavelet transform. IET Sci. Meas. Technol. 13 (3), 375–380. https://doi.org/10.1049/iet-smt.2018.5237 (2019).

Ismael, A. M., Alçin, Ö. F., Abdalla, K. H. & Şengür, A. Two-stepped majority voting for efficient EEG-based emotion classification. Brain Inform. 7 (1), 1–12. https://doi.org/10.1186/s40708-020-00111-3 (2020).

Lerner, J. S., Li, Y., Valdesolo, P. & Kassam, K. S. Emotion and decision making. Annu. Rev. Psychol. 66 , 799–823. https://doi.org/10.1146/annurev-psych-010213-115043 (2015).

Aslan, M. CNN based efficient approach for emotion recognition. J King Saud Univ. Comput. Inf. Sci. 34 (9), 7335–7346. https://doi.org/10.1016/j.jksuci.2021.08.021 (2022).

Mehrabian, A. Nonverbal Communication 1st edn. (Routledge, 1972).

Gautam, C. & Seeja, K. R. Facial emotion recognition using Handcrafted features and CNN. Procedia Comput. Sci. 218 , 1295–1303. https://doi.org/10.1016/j.procs.2023.01.108 (2023).

Andalibi, N. & Buss, J. The human in emotion recognition on social media: attitudes, outcomes, risks. In: Conference on Human Factors in Computing Systems - Proceedings, Association for Computing Machinery, (2020). https://doi.org/10.1145/3313831.3376680 .

Jacintha, V., Simon, J., Tamilarasu, S., Thamizhmani, R., Thanga Yogesh, K. & Nagarajan, J. A review on facial emotion recognition techniques. In Proceedings of the 2019 IEEE International Conference on Communication and Signal Processing, ICCSP 2019, Institute of Electrical and Electronics Engineers Inc., pp. 517–521 (2019). https://doi.org/10.1109/ICCSP.2019.8698067 .

Ko, B. C. A brief review of facial emotion recognition based on visual information. Sensors 18 (2), 1–20. https://doi.org/10.3390/s18020401 (2018).

Suk, M. & Prabhakaran, B.: Real-time mobile facial expression recognition system-a case study In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 132–137 (2014). https://doi.org/10.1109/CVPRW.2014.25

Tian, Y., Luo, P., Wang, X. & Tang, X. Pedestrian Detection aided by deep learning semantic tasks. In: Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5079–5087, 2015. https://doi.org/10.1109/CVPR.2015.7299143

Walecki, R., Ognjen, R., Pavlovic, V., Schuller, B., Pantic, M. Deep structured learning for facial action unit intensity estimation. In: Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3405–3414 (2017). https://doi.org/10.1109/CVPR.2017.605

Raja Sekaran, S.A.P.C., Lee, P., Lim, K.M. Facial emotion recognition using transfer learning of AlexNet. In 2021 9th International Conference on Information and Communication Technology, ICoICT 2021, pp. 170–174, 2021. https://doi.org/10.1109/ICoICT52021.2021.9527512 .

Saganowski, S. et al. Emognition dataset: emotion recognition with self-reports, facial expressions, and physiology using wearables. Sci Data 9 (1), 1–9. https://doi.org/10.1038/s41597-022-01262-0 (2022).

Eilenberger, S.D. Amusement device for registering emotion (1943). https://patents.google.com/patent/US2379955 .

Piroozfar, P., Farooqi, I., Judd, A., Boseley, S., Farr, E.R.P. VR-enabled participatory design of educational spaces: an experiential approach. In: International Conference on Construction in the 21st Century, pp. 496–502 (2022).

Zhu, H., Duan, X. & Su, Y. Is the sense of awe an effective emotion to promote product sharing: Based on the type of awe and tie strength. J. Contemp. Mark. Sci. 4 (3), 325–340. https://doi.org/10.1108/jcmars-10-2021-0036 (2021).

Kanade, T., Cohn, J.F., Tian, Y. Comprehensive database for facial expression analysis. In Proceedings - 4th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2000 (2000). https://doi.org/10.1109/AFGR.2000.840611 .

Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z. & Matthews, I. (2010) The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, CVPRW 2010 (2010). https://doi.org/10.1109/CVPRW.2010.5543262 .

Agostinelli, F., Anderson, M.R. & Lee, H. Adaptive multi-column deep neural networks with application to robust image denoising. Adv. Neural Inf. Process. Syst. (2013).

Lyons, M., Akamatsu, S., Kamachi, M. & Gyoba, J. Coding facial expressions with Gabor wavelets. In: Proceedings - 3rd IEEE International Conference on Automatic Face and Gesture Recognition, FG 1998 (1998). https://doi.org/10.1109/AFGR.1998.670949 .

Ebner, N. C., Riediger, M. & Lindenberger, U. FACES-a database of facial expressions in young, middle-aged, and older women and men: Development and validation. Behav. Res. Methods 42 (1), 351–362. https://doi.org/10.3758/BRM.42.1.351 (2010).

Langner, O. et al. Presentation and validation of the radboud faces database. Cogn. Emot. 24 (8), 1377. https://doi.org/10.1080/02699930903485076 (2010).

Li, S. & Deng, W. Deep facial expression recognition: A survey. IEEE Trans. Affect. Comput. 13 (3), 1195–1215. https://doi.org/10.1109/TAFFC.2020.2981446 (2022).

Kunc, D., Komoszyńska, J., Perz, B., Kazienko, P. & Saganowski, S. Real-life validation of emotion detection system with wearables. Lecture Notes Comput. Sci. https://doi.org/10.1007/978-3-031-06527-9_5 (2022).

Kune, D. Unsupervised learning for physiological signals in real-life emotion recognition using wearables. In 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, ACIIW 2022 (2022). https://doi.org/10.1109/ACIIW57231.2022.10086004 .

Perz, B. Personalization of emotion recognition for everyday life using physiological signals from wearables. In 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, ACIIW 2022 (2022). https://doi.org/10.1109/ACIIW57231.2022.10086031 .

Kunc, D., Komoszynska, J., Perz, B., Saganowski, S., Kazienko, P. Emognition system—Wearables, physiology, and machine learning for real-life emotion capturing. In 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, ACIIW 2023 (2023). https://doi.org/10.1109/ACIIW59127.2023.10388097 .

Manalu, H. V. & Rifai, A. P. Detection of human emotions through facial expressions using hybrid convolutional neural network-recurrent neural network algorithm. Intell. Syst. Appl. 21 , 200339. https://doi.org/10.1016/j.iswa.2024.200339 (2024).

Gupta, S. Facial emotion recognition in real-time and static images. In Proceedings of the 2nd International Conference on Inventive Systems and Control, ICISC 2018 (2018). https://doi.org/10.1109/ICISC.2018.8398861 .

Anthwal, S. & Ganotra, D. An optical flow based approach for facial expression recognition. In: 2019 International Conference on Power Electronics, Control and Automation, ICPECA 2019—Proceedings, pp. 1–5, 2019. https://doi.org/10.1109/ICPECA47973.2019.8975442 .

Supta, S.R., Sahriar, M.R., Rashed, M.G., Das, D. & Yasmin, R. An effective facial expression recognition system. In Proceedings of 2020 IEEE International Women in Engineering (WIE) Conference on Electrical and Computer Engineering, WIECON-ECE 2020, pp. 66–69, 2020. https://doi.org/10.1109/WIECON-ECE52138.2020.9397965 .

Ramalingam, S. & Garzia, F. Facial expression recognition using transfer learning. In: Proceedings—International Carnahan Conference on Security Technology, pp. 1–5 (2018). https://doi.org/10.1109/CCST.2018.8585504 .

Agobah, H., Bamisile, O., Cai, D., Bensah Kulevome, D.K., Darkwa Nimo, B. & Huang, Q. Deep facial expression recognition using transfer learning and fine-tuning techniques. In: EI2 2022 - 6th IEEE Conference on Energy Internet and Energy System Integration, Institute of Electrical and Electronics Engineers Inc., 2022, pp. 1856–1860. https://doi.org/10.1109/EI256261.2022.10116540 .

Gondkar, A., Gandhi, R., Jadhav, N. Facial emotion recognition using transfer learning: A comparative study. In: 2021 2nd Global Conference for Advancement in Technology, GCAT 2021 (2021) https://doi.org/10.1109/GCAT52182.2021.9587608

Sajjanhar, A., Wu, Z., Wen, Q. Deep learning models for facial expression recognition. In 2018 International Conference on Digital Image Computing: Techniques and Applications, DICTA 2018, pp. 1–6 (2019). https://doi.org/10.1109/DICTA.2018.8615843 .

Abdulsattar, N.S. & Hussain, M.N. Facial expression recognition using transfer learning and fine-tuning strategies: a comparative study. In: Proceedings of the 2nd 2022 International Conference on Computer Science and Software Engineering, CSASE 2022, pp. 101–106 (2022). https://doi.org/10.1109/CSASE51777.2022.9759754 .

Shi, W. & Jiang, M. Fuzzy wavelet network with feature fusion and LM algorithm for facial emotion recognition. In: Proceedings of 2018 IEEE International Conference of Safety Produce Informatization, IICSPI 2018, pp. 582–586 (2019). https://doi.org/10.1109/IICSPI.2018.8690353 .

Oztel, I., Yolcu, G., Oz, C. Performance comparison of transfer learning and training from scratch approaches for deep facial expression recognition. In: UBMK 2019 - Proceedings, 4th International Conference on Computer Science and Engineering, pp. 1–6 (2019) https://doi.org/10.1109/UBMK.2019.8907203 .

Meena, G., Mohbey, K. K., Indian, A., Khan, M. Z. & Kumar, S. Identifying emotions from facial expressions using a deep convolutional neural network-based approach. Multimed. Tools Appl. 83 (6), 15711–15732. https://doi.org/10.1007/s11042-023-16174-3 (2024).

Bilotti, U., Bisogni, C., De Marsico, M. & Tramonte, S. Multimodal Emotion recognition via convolutional neural networks: comparison of different strategies on two multimodal datasets. Eng. Appl. Artif. Intell. https://doi.org/10.1016/j.engappai.2023.107708 (2024).

Elgendy, M. Deep learning for vision systems (2020).

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. & Chen, L.C. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520 (2018) https://doi.org/10.1109/CVPR.2018.00474 .

Shaees, S., Naeem, H., Arslan, M, Naeem, M.R., Ali, S.H. & Aldabbas, H. Facial emotion recognition using transfer learning. In: 2020 International Conference on Computing and Information Technology (ICCIT-1441), pp. 1–5, 2021. https://doi.org/10.1109/GCAT52182.2021.9587608 .

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens J. & Wojna, Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826 (2016). https://doi.org/10.1109/CVPR.2016.308 .

Kurama, V. A Review of Popular Deep Learning Architectures: ResNet, InceptionV3, and SqueezeNet. [Online]. Available: https://blog.paperspace.com/popular-deep-learning-architectures-resnet-inceptionv3-squeezenet/ (2020).

Lin, C. J., Li, Y. C. & Lin, H. Y. Using convolutional neural networks based on a Taguchi method for face gender recognition. Electron 9 (8), 1227. https://doi.org/10.3390/electronics9081227 (2020).

Lyons, M., Kamachi, M. & Gyoba, J. The Japanese female facial expression (JAFFE) dataset. [Online]. Available: https://zenodo.org/records/3451524 (1998).

Lundqvist, D., Flykt, A. & Öhman, A. The Karolinska Directed Emotional Faces—KDEF, CD ROM from Department of Clinical Neuroscience, Psychology section, Karolinska Institutet. ISBN 91-630-7164-9 (1998)

Yang, B., Cao, J., Ni, R. & Zhang, Y. Facial expression recognition using weighted mixture deep neural network based on double-channel facial images. IEEE Access 6 , 4630–4640. https://doi.org/10.1109/ACCESS.2017.2784096 (2017).

Ullah, Z., Qi, L., Hasan, A. & Asim, M. Improved deep CNN-based two stream super resolution and hybrid deep model-based facial emotion recognition. Eng. Appl. Artif. Intell. 116 , 105486. https://doi.org/10.1016/j.engappai.2022.105486 (2022).

Reddi, P. S. & Krishna, A. S. CNN implementing transfer learning for facial emotion recognition. Int. J. Intell. Syst. Appl. Eng. 11 (4s), 35–45 (2023).

Dada, E. G., Oyewola, D. O., Joseph, S. B., Emebo, O. & Oluwagbemi, O. O. Facial emotion recognition and classification using the convolutional neural network-10 (CNN-10). Appl. Comput. Intell. Soft Comput. https://doi.org/10.1155/2023/2457898 (2023).

Sari, M., Moussaoui, A. & Hadid, A. A simple yet effective convolutional neural network model to classify facial expressions. In: Modelling and Implementation of Complex Systems: Proceedings of the 6th International Symposium. Springer International Publishing, Cham, pp. 188–202 (2021)

Lasri, I., Riadsolh, A. & Elbelkacemi, M. Facial emotion recognition of deaf and hard-of-hearing students for engagement detection using deep learning. Educ. Inf. Technol. 28 (4), 4069–4092. https://doi.org/10.1007/s10639-022-11370-4 (2023).

Baygin, M. et al. Automated facial expression recognition using exemplar hybrid deep feature generation technique. Soft Comput. 27 (13), 8721–8737. https://doi.org/10.1007/s00500-023-08230-9 (2023).

Duncan, D., Shine, G. & English, C. Facial emotion recognition in real time. Comput. Sci . 1–7 (2016).

Acknowledgements

We would like to express our sincere thanks to Prof. Stanislaw Saganowski and Dr. Joanna Komoszyńska for granting us access to the Emognition Wearable Dataset.

Author information

Authors and affiliations

Department of Mechanical and Industrial Engineering, Universitas Gadjah Mada, Yogyakarta, Indonesia

Erlangga Satrio Agung, Achmad Pratama Rifai & Titis Wijayanto

Contributions

Erlangga Satrio Agung: Methodology, software, formal analysis, investigation, visualization, writing—original draft, writing—review and editing; Achmad Pratama Rifai: Conceptualization, methodology, formal analysis, investigation, resources, validation, visualization, writing—original draft, writing—review and editing; Titis Wijayanto: Data curation, conceptualization, writing—original draft, writing—review and editing.

Corresponding author

Correspondence to Achmad Pratama Rifai.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Agung, E.S., Rifai, A.P. & Wijayanto, T. Image-based facial emotion recognition using convolutional neural network on emognition dataset. Sci Rep 14 , 14429 (2024). https://doi.org/10.1038/s41598-024-65276-x

Received : 13 December 2023

Accepted : 18 June 2024

Published : 23 June 2024

DOI : https://doi.org/10.1038/s41598-024-65276-x

  • Facial emotion recognition
  • Convolutional neural network
  • Deep learning
  • Emognition dataset
