The present invention generally relates to personal authentication employing Visual Question Authentication Protocol (VQAP) using a combination of pictures and text passwords and, more particularly, to a Collaborative Context-aware Visual Question Answering (C2VQA) system to produce a text answer given an image and a text-based question about that image in the presence of corrupt, incomplete, or irrelevant data for purposes of authenticating a user seeking access to protected resources.
Some user interfaces (UIs) exist that require interaction with humans in a meaningful way. One such UI is the “I'm not a robot” or CAPTCHA example, where a user first checks an “I'm not a robot” box and is then presented with a matrix of images and asked to select those images that meet a certain characteristic. This is a verification system used to verify that a “user” is in fact not an automated script attempting to access some website. A user in this context is not individually known to the verification system, and no secret information is stored or used to validate the user's unique identity. Such systems are not relevant to the present invention.
The present invention is directed to an authentication system wherein a user is presented with a challenge to gain access to protected resources and must validate their unique identity against the information that the system has about the user. The resources may be located in a physical location, an on-line database, cloud-storage backup, webmail, on-line banking, or other resources that require protection. Strong automatic authentication processes are necessary in an age where authentication attacks, in the form of website intrusion, phishing, and cybercrime in general, are on the rise. Though a variety of approaches exist for authentication (including biometric methods), the majority are still split between text-based passwords and graphical methods that require a user to interact with images. At the heart of authentication is the validation of secret information that only the true user should know, such as a password in text-based systems. Text passwords, however, are a constant source of frustration for most users, as rules for creating secure combinations of characters, numbers, and symbols tend to result in less-memorable passwords. People are being taught to adhere to complex, and often confusing, rules for creating passwords that do not play to the strengths of human memory. Security rules around text-based passwords are often coupled with additional restrictions that specify how often passwords must be changed.
Graphical or image-based authentication, on the other hand, requires a user to interact with images in a meaningful way for purposes of authentication. One such approach is Visual Question Answering (VQA), which involves the task of producing a text answer given an image and a text-based question about that image. This approach requires understanding the meaning and intent of the question, looking at the image content, and then producing a response that is based on both current and previously acquired knowledge. Attempting to answer the given question while ignoring one or both of these inputs (such as by guessing the answer) will most likely fail. Therefore, in order to be successful in such a task, models must take in multiple modalities of data (visual and text) and join them together in a meaningful way in a process called multimodal fusion.
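By way of illustration, the following is a minimal sketch (in PyTorch) of how pooled CNN image features and an RNN encoding of the question might be fused and mapped to an answer. The layer sizes, class names, and the element-wise multiplication fusion are illustrative assumptions for this sketch, not the specific models of the invention.

```python
# Minimal illustrative VQA model: fuse CNN image features with an RNN question
# encoding and classify over candidate answers. All dimensions are assumptions.
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512,
                 image_feat_dim=2048, num_answers=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # word vectors for the question
        self.question_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.image_proj = nn.Linear(image_feat_dim, hidden_dim)   # project pooled CNN features
        self.classifier = nn.Linear(hidden_dim, num_answers)      # distribution over answers

    def forward(self, image_feats, question_tokens):
        _, (q_state, _) = self.question_rnn(self.embed(question_tokens))
        q_vec = q_state[-1]                                       # final question encoding
        v_vec = torch.relu(self.image_proj(image_feats))          # visual encoding
        fused = q_vec * v_vec                                     # simple multimodal fusion
        return self.classifier(fused)

# Example forward pass with random inputs of the assumed shapes.
model = SimpleVQA()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])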
Models that take advantage of such fusion are called multimodal models. The sources of information for training multimodal models are varied. Within the visual realm, there are images and videos, which can come from a number of different sources, such as cellphone recordings, surveillance cameras, television broadcasts, satellite imagery, or even medical images. Text sources are just as varied, including both structured data, such as ontologies like WordNet, and unstructured data, such as questions, answers, image captions, news posts, and even social media data. While this list is not exhaustive, it serves to show the variety of data sources that VQA involves. VQA itself has numerous variants that fall within the umbrella of multimodal models. Using video as the specific visual modality, the task is often referred to as Video Question Answering, which must additionally take into account the temporal aspect of questions that span periods of time and require that temporal understanding in order to give a correct answer. When the task involves multiple, related questions and answers, it is referred to as Visual Dialog, a variant of VQA that must handle the temporal aspects of memory by recalling previous answers as part of the context for answering the current question.
Each of these inputs, however, is a possible point of failure. Given this, we pose the following overarching questions: What if the input to Visual Question Answering systems were incomplete or corrupted? What are the implications of handling such input, and how can VQA systems be trained to relax their assumptions about the nature of their input? This query is not based on a fantastical stretch of the imagination or a contrived situation, but instead is firmly rooted in the reality that humans navigate every day. Living and acting in the real world, humans face corrupted, incomplete, and irrelevant data constantly, whether it is a misunderstood direction on a written form (corrupted), a tangentially related query from a child (incomplete), or a coworker's question that “sidetracks” the current conversation (irrelevant).
It is therefore an object of the present invention to approach VQA using a collaborative and context-aware approach in which the content of queries can be parsed to assess their relevance, if any, and iteratively refined for their ultimate resolution. The Collaborative Context-Aware Visual Question Answering (C2VQA) methodology according to the invention encompasses Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and deep learning, joint visual-text embedding, sequencing, and memory models to interpret the queries and best answer them.
We amend the task of VQA to describe the purpose of a Collaborative Context-Aware Visual Question Answering (C2VQA) system, namely to produce a text answer, given an image and a text-based question about that image in the presence of corrupt, incomplete, or irrelevant data. The C2VQA task is one that is more attuned to real-world scenarios and is more applicable to a larger set of applications than VQA alone.
More particularly, the invention is a system for improved personal authentication using a novel combination of pictures and text passwords that gets more secure over time as more users utilize the system. The invention increases the security of authentication by using methods that are easier for a valid user to remember (by using cues), while simultaneously being more difficult for an attacker to guess, even in the case where that attacker might be looking over the user's shoulder. The basic idea is that a user will initially register an image and a question/answer pair about this image when setting up an account. At authentication time, the user will be presented with a set of images (one of which may be the user's registered image) and a question and will be asked to provide the registered answer. By selecting the correct image and providing the correct answer to the given question, the user can prove their identity. Research has demonstrated that images are far more memorable than text-based passwords, leading to less wasted effort resetting user passwords. Because this requires two secrets to authenticate the user, it is immediately compatible with the idea of two-factor authentication, where a user could be required to answer the question via one method (e.g., a mobile phone) and select an image via another (e.g., a web browser).
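A minimal sketch, using hypothetical helper names and an in-memory store, of this register-then-challenge idea: a user stores an image together with a question/answer pair, and authentication requires both selecting the registered image and supplying the registered answer.

```python
# Illustrative registration/challenge flow. The data structure and helpers are
# assumptions for this sketch, not the claimed implementation.
import random
from dataclasses import dataclass

@dataclass
class Registration:
    image_id: str      # identifier of the user's registered image
    question: str      # the registered question about that image
    answer: str        # the secret answer only the user should know

registrations = {}     # user_id -> Registration (stand-in for a real credential store)

def register(user_id, image_id, question, answer):
    registrations[user_id] = Registration(image_id, question, answer.strip().lower())

def build_challenge(user_id, distractor_image_ids):
    """Return a shuffled image set (registered image among distractors) and the question."""
    reg = registrations[user_id]
    images = distractor_image_ids + [reg.image_id]
    random.shuffle(images)
    return images, reg.question

def authenticate(user_id, selected_image_id, given_answer):
    reg = registrations[user_id]
    return (selected_image_id == reg.image_id
            and given_answer.strip().lower() == reg.answer)

register("alice", "img_beach_042", "What color is the umbrella?", "red")
imgs, q = build_challenge("alice", ["img_park_007", "img_city_113"])
print(q, imgs, authenticate("alice", "img_beach_042", "Red"))
```

In a deployed system the registered answer would of course be salted and hashed rather than stored in plain text; the in-memory dictionary above is purely illustrative.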
At the root of the invention is a better method for selecting the set of images to present to a user using a specialized machine learning model, the combination of text and image passwords, and some other login scenarios that help to thwart attackers. This methodology does not suffer from security flaws that exist with other authentication methods, such as those based on selecting faces as a password and those that use text-based passwords.
The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
Referring now to the drawings, and more particularly to
The scenario selector component 105 can take the form of any number of implementations. For example, this could be as simple as random selection or as complex as a deterministic trigger that responds to advanced persistent threats (APTs), such as large scale spoofing attacks. In addition, this component could be configured to present multiple scenarios, while being mindful of the human capacity for remembering multiple graphical passwords. This component then passes along a question and a set of images to be scored by the relevance classifier 106, which splits the relevant from the irrelevant images before passing the set onto the user for authentication. The reasoning behind having such a component, and for having multiple authentication challenge scenarios, is flexibility, security, and adaptability.
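The scenario selector and the relevance filtering step could, for example, be sketched as follows. The scenario names, threat threshold, and toy classifier are illustrative assumptions; the relevance classifier 106 would in practice be a trained model rather than the stand-in callable shown.

```python
# Illustrative scenario selection plus relevance filtering of candidate images.
import random

SCENARIOS = ["A_normal", "B_wrong_images", "C_wrong_question"]

def select_scenario(threat_level=0.0):
    """Random by default; escalate to the active 'wrong images' scenario under attack."""
    if threat_level > 0.8:           # e.g., signals of a large-scale spoofing attempt
        return "B_wrong_images"
    return random.choice(SCENARIOS)

def filter_relevant(question, candidate_images, relevance_classifier, threshold=0.5):
    """Keep only the images the classifier deems relevant to the given question."""
    return [img for img in candidate_images
            if relevance_classifier(question, img) >= threshold]

# Usage with a toy classifier that treats any image tagged with a question word as relevant.
toy_classifier = lambda q, img: 1.0 if any(w in img for w in q.lower().split()) else 0.0
images = ["umbrella_beach.jpg", "car_street.jpg", "umbrella_rain.jpg"]
print(select_scenario(), filter_relevant("what color is the umbrella?", images, toy_classifier))
```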
Next, the background processes of block 208 are shown in
Once the registration process has been successfully completed for a user, the user can be authenticated for access to a protected resource. The process starts in function block 401 of
Now, assume an attacker attempts to authenticate as Alice. The attacker will be presented with steps 401 to 404, as before; however, we shall now assume that the system selects authentication scenario B, wrong images, an example of an active authentication scenario to aggressively guard against advanced persistent threats. In this case, the question is Alice's registered question, but all the images provided to the attacker are distractor images (steps 405 to 408) that are deemed relevant to the given question. In order to properly authenticate with this scenario, the attacker must select any image but provide the answer that only Alice should know, which is that none of the given images is her registered image. If the attacker fails to provide this answer, they would proceed through steps 409 to 411 and start again at step 402 until the maximum number of authentication attempts is reached, at which point they are not provided access to the system. Similarly, for scenario C, wrong question, one of the images is Alice's registered image, but the question is not Alice's registered question for her registered image, and only Alice would know this. In the example illustrated, the question “What model is the car?” is a registered question, but it is Bob's registered question, not Alice's. Alice could authenticate by selecting her registered image and communicating the fact that this is not her registered question.
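A minimal sketch of how responses might be verified under the three scenarios discussed above. The scenario identifiers and the special "not my image"/"not my question" responses are illustrative tokens for this sketch, not a prescribed user interface.

```python
# Illustrative verification of the three challenge scenarios (A: normal,
# B: wrong images, C: wrong question). Names and responses are assumptions.
def verify(scenario, registration, selected_image, given_answer):
    ans = given_answer.strip().lower()
    if scenario == "A_normal":
        # Registered image is present; user must pick it and give the registered answer.
        return selected_image == registration["image_id"] and ans == registration["answer"]
    if scenario == "B_wrong_images":
        # All images are distractors; the correct response is to flag that none match.
        return ans == "not my image"
    if scenario == "C_wrong_question":
        # Registered image is present, but the question belongs to another user.
        return selected_image == registration["image_id"] and ans == "not my question"
    return False

alice = {"image_id": "img_beach_042", "answer": "red"}
print(verify("C_wrong_question", alice, "img_beach_042", "not my question"))  # True
```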
The overall architecture and description of the invention to fulfill the C2VQA task will now be described. The design and components of the system are given in
The visual data is then fed into one or more image-captioning models. The generated captions are each tokenized and passed through the same word embedding as the question in order to generate word vector representations. Each of these models has the option of either outputting single image descriptions, which cover the entire image with a single sentence, or dense image captions, which cover parts of the image, referred to as image regions. Again, depending on the task, one or both types of captioning models could be used. In practical implementations, it was found that different models (both single and dense) trained on different datasets could provide a benefit. Because captioning models describe visual content through the lens of the data they were trained with, including the vocabulary of that dataset, this part of the architecture provides a great deal of flexibility to a C2VQA system. For example, if a new task required knowledge about fine-grained clothing or hair styles, this knowledge could be gained by substituting a pre-trained image-captioning model that was trained to describe images of people in this way. Each captioning model contains within it, at a minimum, a visual model, such as a Convolutional Neural Network (CNN), and word-generation components, such as a Recurrent Neural Network (RNN). If this architecture is treated as an end-to-end system, then such models would be fine-tuned as a part of the overall task. We purposefully did not use the end-to-end interpretation of the architecture, as a part of the goal of our design was to highlight how existing pre-trained components could be used in novel ways. Past this point, we describe a general and high-level process of learning that occurs over the separate channels of input, the image captions and the question. While the specifics are different for each later implementation, each sequence of input is encoded by some type of RNN, which is responsible for learning the important items in each data sequence.
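The caption/question encoding path might be sketched as follows, with a stub standing in for any pre-trained captioning model and a toy vocabulary. The shared word embedding and separate LSTM encoders reflect the general structure described above; sizes and names are assumptions.

```python
# Illustrative encoding of the two input channels: a generated caption and the
# question share one word embedding and are each encoded by an RNN.
import torch
import torch.nn as nn

def caption_image(image):                      # stand-in for a pre-trained captioning model
    return "a red umbrella on a sandy beach"

vocab = {"<unk>": 0, "a": 1, "red": 2, "umbrella": 3, "on": 4, "sandy": 5,
         "beach": 6, "what": 7, "color": 8, "is": 9, "the": 10}

def tokenize(text):
    return torch.tensor([[vocab.get(w, 0) for w in text.lower().strip("?").split()]])

embed = nn.Embedding(len(vocab), 300)          # shared word embedding for both channels
caption_rnn = nn.LSTM(300, 512, batch_first=True)
question_rnn = nn.LSTM(300, 512, batch_first=True)

caption = caption_image(None)
_, (c_state, _) = caption_rnn(embed(tokenize(caption)))
_, (q_state, _) = question_rnn(embed(tokenize("What color is the umbrella?")))
print(c_state.shape, q_state.shape)            # each encoding has shape (1, 1, 512)
```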
In order for the system to leverage all of the input data, some combination of the two types of data (images, via image captions, and question) is required. This combination of RNN outputs can be done in different ways. In some implementations, we favored “late fusion”, or the idea that the caption and question should be joined into a single feature representation at the latest possible layer. Other implementations utilized an earlier fusion, or even multiple layers of fusion, especially with dense captions that could be individually merged with the input question in different ways.
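A minimal sketch of the "late fusion" option, in which the caption and question encodings are kept separate until the final layers and joined there; the concatenation-plus-linear head and all dimensions are illustrative, and earlier or multi-layer fusion would instead merge the representations before or between the recurrent layers.

```python
# Illustrative late-fusion head: caption and question feature vectors are joined
# only at the latest layer before classification.
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, hidden_dim=512, num_classes=2):
        super().__init__()
        # Fusion happens only here, at the latest possible layer.
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, caption_vec, question_vec):
        return self.fc(torch.cat([caption_vec, question_vec], dim=-1))

head = LateFusionHead()
print(head(torch.randn(4, 512), torch.randn(4, 512)).shape)  # torch.Size([4, 2])
```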
The block in
At the root, it is necessary to point out that we are dealing with concepts from artificial intelligence, and most notably, Computer Vision (CV) and Natural Language Processing (NLP). As noted earlier, CNNs are a core component of the system, and owe lineage to both the areas of CV and machine learning. Likewise, NLP word vectorization techniques, such as Word2Vec, rely on machine learning to adapt to a given dataset. When combined with RNNs, these components, which are usable in both the CV and NLP domains, form the central parts of a Visual Question Answering (VQA) system.
Extending this to the base C2VQA system requires the addition of relevance and editing as new types of answers that the system can conceptually give back to the user in order to better work with them to achieve the current goal. We model relevance as a multi-class classification problem, where a single word in the question can be classified as irrelevant or the entire question can be classified as relevant. For the second aspect, triage, we use pre-trained models from the relevance task to demonstrate the filtering of irrelevant from relevant images based on the question.
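The relevance output could, for instance, be modeled as a multi-class head over the fused features, where class 0 denotes that the whole question is relevant and class i > 0 flags word i of the question as irrelevant. This class convention and the maximum question length are assumptions for illustration.

```python
# Illustrative multi-class relevance head over fused caption/question features.
import torch
import torch.nn as nn

MAX_QUESTION_LEN = 20

class RelevanceClassifier(nn.Module):
    def __init__(self, fused_dim=1024, max_len=MAX_QUESTION_LEN):
        super().__init__()
        # Output 0 -> question is relevant; outputs 1..max_len -> that word is irrelevant.
        self.fc = nn.Linear(fused_dim, max_len + 1)

    def forward(self, fused_features):
        return self.fc(fused_features)

clf = RelevanceClassifier()
logits = clf(torch.randn(1, 1024))             # fused caption/question features
pred = logits.argmax(dim=-1).item()
print("relevant" if pred == 0 else f"word {pred} is irrelevant")
```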
VQAP is a cross-over between a graphical recognition-based system and a text-based cued-recall system. In order to use VQAP, a user is required to register three types of data: images, questions, and answers. A user may register multiple images, a single image may be associated with one or more text-based questions, and each question must be associated with a single text-based answer. Each of these questions must be relevant with regard to its associated image, which is verified by using the relevance classifier. At registration time, when a user associates a question with a registered image as an authentication pair, the relevance classifier is given both of these items as input in order to output a classification of relevant or irrelevant. If the question/image pair is deemed irrelevant, then the user is prompted to alter the question or to select a different image. This ensures that a good authentication pair is selected and simultaneously begins to give VQAP the concept of “password strength”, which can easily be expanded into more complex feedback. For instance, VQAP could also notify a user that images they are registering are already used by another user (which may not be an issue) or are too distinct from other images in the system (which could leave the user vulnerable if not enough similar distractors can be found).
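Registration-time relevance checking might be sketched as follows; the helper name, threshold, and toy classifier are hypothetical, with the real system using the trained relevance classifier described above.

```python
# Illustrative registration-time gate: the relevance classifier accepts or rejects
# each question/image pair and returns "password strength"-style feedback.
def register_authentication_pair(image_id, question, answer, relevance_classifier,
                                 threshold=0.5):
    score = relevance_classifier(question, image_id)
    if score < threshold:
        return {"accepted": False,
                "feedback": "Question appears irrelevant to the image; "
                            "please alter the question or select a different image."}
    return {"accepted": True, "pair": (image_id, question, answer)}

toy_classifier = lambda q, img: 0.9 if "umbrella" in q.lower() and "beach" in img else 0.1
print(register_authentication_pair("img_beach_042", "What color is the umbrella?",
                                   "red", toy_classifier))
print(register_authentication_pair("img_beach_042", "What model is the car?",
                                   "sedan", toy_classifier))
```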
While the invention has been described in terms of a single preferred embodiment, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.
Number | Date | Country
--- | --- | ---
62775118 | Dec 2018 | US
62793256 | Jan 2019 | US