Embodiments of the invention relate generally to a platform server for generating content based on user experience and a method for providing the same.
Currently, in an environment offering various types of content, user-generated creative content does not imply creating something entirely new from nothing; rather, it involves generating new content by reflecting users' personal inclinations, such as experiences, thoughts, and preferences, on existing content.
Specifically, creative content is generated by interconnecting existing content in various ways. For this purpose, AI-based content generation methods have been proposed.
However, during the training of AI-based content generation models, unrefined language or information that does not match users' needs may be input. Consequently, there are difficulties in producing creative content that aligns with users' intentions.
The above information disclosed in this Background section is only for understanding of the background of the inventive concepts, and, therefore, it may contain information that does not constitute prior art.
Platform servers for generating content based on user experience and methods for providing the same according to embodiments of the invention are capable of generating content having maximized creativity while a user's own concept is formed based on the user's experiences and thoughts.
Additional features of the inventive concepts will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the inventive concepts.
According to one or more embodiments of the invention, a platform server for generating content based on user experience includes: a memory including a first learning model trained to generate a reconstructed content, based on a text, and a processor to communicate with the memory and to control the first learning model to output at least one reconstructed content corresponding to the text when the text is input from a user terminal. The processor is configured to receive an input content and a first text matching the input content from the user terminal, generate a bag of words, based on the first text, determine caption data using a second text derived from the first text included in the bag of words and a predetermined sentence structure, input a sentence indicated by the caption data to the first learning model, generate at least one reconstructed content corresponding to the sentence, connect the at least one reconstructed content with the input content, and output the at least one reconstructed content and the input content to the user terminal.
The memory may further include a second learning model configured to output at least one text indicating an input image. In a case where the input content is the input image, the processor may be configured to, upon receiving the first text matching the input content: generate a recommended text including a sentence or a word describing an object, an appearance, and a background of the input image by performing image captioning using the second learning model; provide the recommended text to the user terminal; and receive, as the first text, at least one final text determined or corrected by the user based on the recommended text.
In a case where the input content is the input image, the processor may be configured to provide the user terminal with a process for generating the caption data including a predetermined sentence structure forming the caption data and a word category for each item forming the sentence structure. The word category for each item may include at least one blank filled by settings of the user, and a connecting word may be formed between the blanks of the word category for each item.
The processor may be further configured to provide the user terminal with a recommended word to be input to each of the plurality of blanks, and provide the recommended word by considering a relationship with the input image and whether the recommended word coincides with the word category.
In a case where the reconstructed content is a reconstructed image, the processor may be further configured to receive a feedback for the reconstructed image selected by the user from the at least one reconstructed image, from the user terminal, additionally generate the second text of the bag of words, based on the feedback, and reconstruct a sentence of the caption data, based on the feedback.
The processor may be further configured to additionally generate the second text for at least one word category of the bag of words, based on the feedback.
When receiving the feedback from the user terminal, the processor may be further configured to provide a user interface to the user terminal to receive a feedback opinion for the reconstructed image and a pinpoint in the reconstructed image matching the feedback opinion in accordance with an operation of the user.
The memory may further include a third learning model trained as a morphological analyzer to preprocess text and separate the text into morphemes, and after receiving the feedback, the processor may be further configured to input the reconstructed image to the third learning model to determine a common morpheme token related to the reconstructed image, reconstruct a sentence of the caption data by using the determined common morpheme token, and provide the reconstructed sentence to the user terminal.
The user terminal may be configured to correct the reconstructed sentence in accordance with an operation of the user, and may transmit the finally determined reconstructed sentence to the content generation platform server, and when receiving the finally determined reconstructed sentence, the processor may be further configured to input the finalized reconstructed sentence to the first learning model, re-output the reconstructed image, and provide the reconstructed image to the user terminal.
The processor may be further configured to generate an archive of each of the reconstructed images generated as the reconstructed image is initially generated and the feedback is repeatedly performed.
The processor may be further configured to search for real images having a similarity level higher than a preset reference value, based on at least one of the reconstructed images, and provide the real images to the user terminal.
The processor may be further configured to extract caption data including a sentence or a word through captioning of each of the at least one reconstructed image, compare the extracted caption data with the real images, and recommend new caption data.
According to one or more other embodiments of the invention, a method for providing a platform for generating content based on user experience, the method being performed by a processor of a user terminal in conjunction with a platform server including at least one learning model, includes the steps of: determining an input image and a first text for the input image in response to an input of a user; performing image captioning on the input image and providing a recommended text including a sentence or a word for at least one category of an object, an appearance, or a background; additionally determining the first text for the input image in response to a user input for the recommended text; generating a bag of words, based on the determined first text; providing the user with a process for setting caption data, based on the bag of words; determining a second text of the caption data in accordance with an input of the user, and setting the caption data in accordance with the determined second text and a predetermined sentence structure; and inputting a sentence indicated by the set caption data, generating a reconstructed image through at least one learning model, and outputting the generated reconstructed image.
The step of providing the user with the process for setting the caption data, based on the bag of words, may include a step of outputting a process for generating the caption data including the predetermined sentence structure forming the caption data and a word category of each item forming the sentence structure.
The word category of each item of the process for generating the caption data may include at least one blank filled by settings of the user, and a connecting word may be formed between the blanks of the word category of each item.
The method may further include a step of receiving a feedback for a reconstructed image selected by the user from the at least one output reconstructed image, and a step of reconstructing a sentence of the caption data, based on the feedback.
The step of receiving the feedback for the reconstructed image may include a step of receiving a feedback opinion for the reconstructed image and a pinpoint in the reconstructed image matching the feedback opinion in accordance with an operation of the user.
The step of reconstructing the sentence of the caption data, based on the feedback, may include a step of inputting the reconstructed image to a third learning model to determine a common morpheme token related to the reconstructed image, and reconstructing the sentence of the caption data by using the determined common morpheme token to provide the reconstructed sentence to the user terminal.
The step of reconstructing the sentence of the caption data, based on the feedback, may include a step of additionally generating and providing the second text for at least one word category of the bag of words, based on the feedback.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention, and together with the description serve to explain the inventive concepts.
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of various embodiments or implementations of the invention. As used herein “embodiments” and “implementations” are interchangeable words that are non-limiting examples of devices or methods employing one or more of the inventive concepts disclosed herein. It is apparent, however, that various embodiments may be practiced without these specific details or with one or more equivalent arrangements. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring various embodiments. Further, various embodiments may be different, but do not have to be exclusive. For example, specific shapes, configurations, and characteristics of an embodiment may be used or implemented in another embodiment without departing from the inventive concepts.
Unless otherwise specified, the illustrated embodiments are to be understood as providing features of varying detail of some ways in which the inventive concepts may be implemented in practice. Therefore, unless otherwise specified, the features, components, modules, layers, films, panels, regions, and/or aspects, etc. (hereinafter individually or collectively referred to as “elements”), of the various embodiments may be otherwise combined, separated, interchanged, and/or rearranged without departing from the inventive concepts.
The use of cross-hatching and/or shading in the accompanying drawings is generally provided to clarify boundaries between adjacent elements. As such, neither the presence nor the absence of cross-hatching or shading conveys or indicates any preference or requirement for particular materials, material properties, dimensions, proportions, commonalities between illustrated elements, and/or any other characteristic, attribute, property, etc., of the elements, unless specified. Further, in the accompanying drawings, the size and relative sizes of elements may be exaggerated for clarity and/or descriptive purposes. When an embodiment may be implemented differently, a specific process order may be performed differently from the described order. For example, two consecutively described processes may be performed substantially at the same time or performed in an order opposite to the described order. Also, like reference numerals denote like elements.
When an element, such as a layer, is referred to as being “on,” “connected to,” or “coupled to” another element or layer, it may be directly on, connected to, or coupled to the other element or layer or intervening elements or layers may be present. When, however, an element or layer is referred to as being “directly on,” “directly connected to,” or “directly coupled to” another element or layer, there are no intervening elements or layers present. To this end, the term “connected” may refer to physical, electrical, and/or fluid connection, with or without intervening elements. Further, the D1-axis, the D2-axis, and the D3-axis are not limited to three axes of a rectangular coordinate system, such as the x, y, and z-axes, and may be interpreted in a broader sense. For example, the D1-axis, the D2-axis, and the D3-axis may be perpendicular to one another, or may represent different directions that are not perpendicular to one another. For the purposes of this disclosure, “at least one of X, Y, and Z” and “at least one selected from the group consisting of X, Y, and Z” may be construed as X only, Y only, Z only, or any combination of two or more of X, Y, and Z, such as, for instance, XYZ, XYY, YZ, and ZZ. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
Although the terms “first,” “second,” etc. may be used herein to describe various types of elements, these elements should not be limited by these terms. These terms are used to distinguish one element from another element. Thus, a first element discussed below could be termed a second element without departing from the teachings of the disclosure.
Spatially relative terms, such as “beneath,” “below,” “under,” “lower,” “above,” “upper,” “over,” “higher,” “side” (e.g., as in “sidewall”), and the like, may be used herein for descriptive purposes, and, thereby, to describe one element's relationship to another element(s) as illustrated in the drawings. Spatially relative terms are intended to encompass different orientations of an apparatus in use, operation, and/or manufacture in addition to the orientation depicted in the drawings. For example, if the apparatus in the drawings is turned over, elements described as “below” or “beneath” other elements or features would then be oriented “above” the other elements or features. Thus, the exemplary term “below” can encompass both an orientation of above and below. Furthermore, the apparatus may be otherwise oriented (e.g., rotated 90 degrees or at other orientations), and, as such, the spatially relative descriptors used herein should be interpreted accordingly.
The terminology used herein is for the purpose of describing particular embodiments and is not intended to be limiting. As used herein, the singular forms, “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Moreover, the terms “comprises,” “comprising,” “includes,” and/or “including,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, components, and/or groups thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It is also noted that, as used herein, the terms “substantially,” “about,” and other similar terms, are used as terms of approximation and not as terms of degree, and, as such, are utilized to account for inherent deviations in measured, calculated, and/or provided values that would be recognized by one of ordinary skill in the art.
Various embodiments are described herein with reference to sectional and/or exploded illustrations that are schematic illustrations of idealized embodiments and/or intermediate structures. As such, variations from the shapes of the illustrations as a result, for example, of manufacturing techniques and/or tolerances, are to be expected. Thus, embodiments disclosed herein should not necessarily be construed as limited to the particular illustrated shapes of regions, but are to include deviations in shapes that result from, for instance, manufacturing. In this manner, regions illustrated in the drawings may be schematic in nature and the shapes of these regions may not reflect actual shapes of regions of a device and, as such, are not necessarily intended to be limiting.
As customary in the field, some embodiments are described and illustrated in the accompanying drawings in terms of functional blocks, units, and/or modules. Those skilled in the art will appreciate that these blocks, units, and/or modules are physically implemented by electronic (or optical) circuits, such as logic circuits, discrete components, microprocessors, hard-wired circuits, memory elements, wiring connections, and the like, which may be formed using semiconductor-based fabrication techniques or other manufacturing technologies. In the case of the blocks, units, and/or modules being implemented by microprocessors or other similar hardware, they may be programmed and controlled using software (e.g., microcode) to perform various functions discussed herein and may optionally be driven by firmware and/or software. It is also contemplated that each block, unit, and/or module may be implemented by dedicated hardware, or as a combination of dedicated hardware to perform some functions and a processor (e.g., one or more programmed microprocessors and associated circuitry) to perform other functions. Also, each block, unit, and/or module of some embodiments may be physically separated into two or more interacting and discrete blocks, units, and/or modules without departing from the scope of the inventive concepts. Further, the blocks, units, and/or modules of some embodiments may be physically combined into more complex blocks, units, and/or modules without departing from the scope of the inventive concepts.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure is a part. Terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and should not be interpreted in an idealized or overly formal sense, unless expressly so defined herein.
In this specification, a ‘content generation platform server’ includes various devices which can perform computational processing to provide a user with results. For example, the content generation platform server according to an embodiment of the invention may include all of a computer, a server device, and a portable terminal, or may be provided in any form of these.
Here, for example, the computer may include a notebook, a desktop, a laptop, a tablet PC, a slate PC, or the like equipped with a web browser.
The server device is a server that processes information by communicating with an external device, and may include an application server, a computing server, a database server, a file server, a game server, a mail server, a proxy server, a web server, and the like.
For example, the portable terminal may include all types of handheld wireless communication devices, such as Personal Communication System (PCS), Global System for Mobile communications (GSM), Personal Digital Cellular (PDC), Personal Handyphone System (PHS), Personal Digital Assistant (PDA), International Mobile Telecommunication (IMT)-2000, Code Division Multiple Access (CDMA)-2000, W-Code Division Multiple Access (W-CDMA), and Wireless Broadband Internet (WiBro) terminals, as well as smartphones, and wearable devices such as watches, rings, bracelets, anklets, necklaces, glasses, contact lenses, or head-mounted devices (HMD).
A user experience-based content generation platform may include a content generation platform server and a terminal (not illustrated) that generates user experience-based content in response to a user input.
In an embodiment, a terminal generating the content may generate the user experience-based content through the following process. The terminal activates a content generation program to communicate with the content generation platform server, and transmits data in accordance with the user input. The content generation platform server analyzes the transmitted data to generate the content, based on a learning model. Thereafter, the content generation platform server transmits the content back to the terminal.
In another embodiment, a terminal generating the content may generate the user experience-based content through the following process. The terminal receives the user input and performs some data processing, requests the content generation platform server to perform an analysis using the learning model, receives data output from the learning model, and provides the data to the user.
In still another embodiment, a terminal generating the content may perform a user experience-based content generation method through the following process. The terminal receives a content generation program including the learning model from the content generation platform server, and thereafter executes the received content generation program to analyze data directly input by a user.
Hereinafter, a user experience-based content generation method will be described in which the terminal generating the content receives the user input and receives the data analyzed by the learning model from the content generation platform server. As a matter of course, according to another embodiment, a content generation process in which a part of the content generation method described as being performed by the content generation platform server is instead performed by the content generation terminal is also included in the embodiments of the invention. Here, the terminal may include the computer, the portable terminal, and the like as described above.
Hereinafter, the content generation method performed by the platform server according to an embodiment of the invention will be described with reference to
Referring to
The processor 110 may communicate with the memory 140, and the content generation program may include a first learning model trained to output at least one reconstructed content corresponding to a text based on a user input when receiving the text from a terminal.
Meanwhile, a user may have difficulty preparing the text for generating a reconstructed content desired by the user with the first learning model, and may have difficulty finding solutions when the reconstructed content is not generated in a desired direction.
An embodiment of the invention aims to provide a user experience-based content generation method by archiving the process of generating the content, or the process in which the user directly inputs the content and the text matching the content, and by assisting the user in composing the text based on this archive.
For this purpose, the processor 110 may receive input content and the first text matching the input content from the user.
Here, the user may directly select the input content in any manner through an archive input, and may directly type in the first text matching the selected input content.
In addition, the first text may be a text input by the user to the first learning model to obtain an initial concept of the reconstructed content to be generated, and the reconstructed content generated based on this text may be the input content. That is, the user may input the first text to the first learning model for the initial concept, and may set the generated reconstructed content as the input content.
Hereinafter, a case where the input content is an input image and the reconstructed content is a reconstructed image will be described as an example.
The first text described above may include a personal reason and an opinion which relate to the input image. Without being limited thereto, any information that may reflect experiences of the user may be used. For example, when the input image is a fox image, the processor 110 may register a silly fox face and a strong touching texture which relate to the fox image input by the user, as the first text. In this way, the first text is not used to simply describe a subject of the input image, but may include various opinions such as a drawing style of the input image.
The processor 110 may receive the input content and the first text through an operation of the user to generate the archive for setting an initial concept reflecting user experiences. The archive may match and store various content, for example, images and related texts, based on the user experiences. Thereafter, the processor 110 may use images and related texts when generating the reconstructed content matching the user's intention, based on the archive including at least one input content and the first text matching the input content.
As illustrated in
Meanwhile, an embodiment of the invention may provide the following recommendation service in relation to the input of the first text through the learning model.
For example, the second learning model may include a first image captioning learning model that generates the text for describing each object and situation within the input image when the input image is input.
In addition, the second learning model may include a second image captioning learning model that, when the input image is input, generates the text describing a type, a characteristic, or the like, each of which is a word category of a different attribute of the input image.
Therefore, referring to
In more detail, the processor 110 may filter the text input by the user and the recommended texts for setting the initial concept into base texts for generating the first text, through a bag-of-words (BoW) filter. The processor 110 may input the filtered base texts to a sentence generation learning model (sentence construction), and may generate a suggested first text for the input image to be provided to the user.
The user may correct and edit the provided first text to generate a final first text for the input image.
That is, the processor 110 may receive, as the first text, the final text determined or corrected by the user based on the input text and the recommended text. In other words, the recommended text may be selected as it is, or may be corrected in response to the operation of the user, so that the recommended text is reflected as the first text.
Referring to
Thereafter, the processor 110 may generate the bag of words, based on the first text. Through the above-described repeated processes, a plurality of first texts may be used, and the processor 110 may generate the bag of words, based on the plurality of first texts. In this case, the bag of words may be formed by reflecting characteristics of each user, based on the first texts input by the user. In addition, the processor 110 may recommend the first texts for generating the bag of words by learning the input images.
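For illustration only, the per-user bag-of-words construction described above may be sketched in Python as follows. The CATEGORY_HINTS table and the classify() heuristic are assumptions standing in for the learning model that assigns word categories in the platform; only the category names follow the description.

```python
from collections import defaultdict

# Minimal sketch: accumulate a per-user bag of words from first texts.
# The category labels (type/base/detail/style) follow the word categories
# described in this specification; classify() is a placeholder heuristic for
# the learning model that assigns categories in the actual platform.
CATEGORY_HINTS = {
    "type": {"oil-color painting", "illustration", "photograph", "pattern"},
    "style": {"realism art", "impressionism", "pop art"},
}

def classify(phrase: str) -> str:
    for category, examples in CATEGORY_HINTS.items():
        if phrase in examples:
            return category
    # Fallback: short noun-like phrases are treated as the base subject,
    # longer descriptive phrases as detail (illustrative heuristic only).
    return "base" if len(phrase.split()) <= 2 else "detail"

def build_bag_of_words(first_texts: list[list[str]]) -> dict[str, set[str]]:
    bag = defaultdict(set)
    for phrases in first_texts:          # one list of phrases per first text
        for phrase in phrases:
            bag[classify(phrase)].add(phrase.strip().lower())
    return dict(bag)

bag = build_bag_of_words([
    ["a fox", "silly fox face", "strong touching texture"],
    ["oil-color painting", "sitting in a field at sunrise", "realism art"],
])
```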
The processor 110 may determine caption data by using a second text included in the bag of words and a predetermined sentence structure. In this case, the caption data may mean a sentence expressing what content the image includes. The second text may mean a text included in the bag of words formed based on the plurality of first texts.
The processor 110 may cause an output unit (not illustrated) to output a predetermined sentence structure including a plurality of blanks matching the word categories for describing the input image and a connecting word between the plurality of blanks. A word category may mean a classification, by attribute, of the words to be input to the blanks.
Referring to
In this case, the processor 110 may provide a recommended word which may be input to each of the blanks corresponding to the items of each of the plurality of word categories, based on the first text. For this purpose, the processor 110 may provide the first text and/or the recommended word included in the first text by considering a relationship between the first text and the input image and whether the first text coincides with each word category.
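A hedged sketch of this per-blank recommendation, reusing the bag built in the previous sketch, is given below; the relevance score passed in (here a toy length-based score) is an assumption standing in for whatever image-text relevance measure the platform's models provide.

```python
# Illustrative only: keep candidates that belong to the blank's word category,
# then rank them by a caller-supplied relevance score against the input image.
def recommend_for_blank(category: str, bag: dict[str, set[str]],
                        relevance_to_image, top_k: int = 5) -> list[str]:
    candidates = bag.get(category, set())
    return sorted(candidates, key=relevance_to_image, reverse=True)[:top_k]

# Toy usage: longer phrases first; a real system would score image relevance.
top_details = recommend_for_blank("detail", bag, relevance_to_image=len)
```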
For example, the processor 110 may provide a predetermined sentence structure such as A ‘TYPE’ of ‘BASE’ that is ‘DETAIL’ in the ‘STYLE’. In this case, the ‘TYPE’, ‘BASE’, ‘DETAIL’, and ‘STYLE’ may be implemented as blank items that prompt inputs of words. Here, the words may mean phrases and clauses composed of multiple words, which may be words included in the first text input by the user for the input image and/or the first text recommended from the input image. Each blank may display the first text matching each word category required for describing the image, such as the type, base, detail, and style.
In addition, the processor 110 may additionally provide a reference word representing attributes of each blank item. In detail, the processor 110 may recommend the reference words representing the attributes of each word category item which is frequently used but not input to the archive by the user for each item of the type, base, detail, and style. In addition, the processor 110 may provide the image matching the reference word when recommending the reference word so that the user may intuitively understand the meaning of the reference word.
In the caption data, the above-described “A”, “of”, “that is”, and “in the” are connecting words between the blanks, and may include an article, a particle, or the like for completing the sentence when words are input to the blanks. That is, a connecting word in an embodiment of the invention may mean a word for completing the sentence when a word is input to a blank in the predetermined sentence structure.
Specifically, the processor 110 in an embodiment of the invention may guide the input of a high-quality sentence helpful for training the learning model, to prevent a case where users input sentences that do not satisfy reference values in terms of format, content, and the like, or a case where users input sentences that do not accurately reflect their intentions because they do not know which sentences to input. That is, the processor 110 guides the users on which words to include and how to combine them into a sentence.
Referring to
For example, when the input image is a fox and the predetermined sentence structure such as A ‘TYPE’ of ‘BASE’ that is ‘DETAIL’ in the ‘STYLE’ is output to a screen, the processor 110 may provide a plurality of recommended words which may be input to the blanks of the ‘TYPE’, ‘BASE’, ‘DETAIL’, and ‘STYLE’ in relation to the fox by searching the bag of words. The processor 110 may input the words oil-color painting, a fox, sitting in a field at sunrise, and realism art, which are selected by the user for the ‘TYPE’, ‘BASE’, ‘DETAIL’, and ‘STYLE’ among the plurality of recommended words, so that the words correspond to each blank. Based on the input words, the processor 110 may determine caption data such as “A oil-color painting of a fox that is sitting in a field at sunrise in the style of realism art.”
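The worked example above can be reduced to a simple template fill, sketched below. The template string follows the fox example (with “in the style of” as the final connecting phrase), and the helper name is illustrative rather than part of the platform.

```python
# Minimal sketch of the predetermined sentence structure: the four blanks are
# word categories, and the fixed pieces ("A", "of", "that is", "in the style
# of") are the connecting words described above.
TEMPLATE = "A {type} of {base} that is {detail} in the style of {style}"

def compose_caption(type_: str, base: str, detail: str, style: str) -> str:
    return TEMPLATE.format(type=type_, base=base, detail=detail, style=style) + "."

caption = compose_caption(
    type_="oil-color painting",
    base="a fox",
    detail="sitting in a field at sunrise",
    style="realism art",
)
# -> "A oil-color painting of a fox that is sitting in a field at sunrise in
#     the style of realism art."  (the article is corrected later by the
#     grammar check described below)
```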
That is, the processor 110 may provide a word belonging to the first text corresponding to the item of the TYPE from a previously recorded archive, or the recommended word for the input image, so that the user may select the word corresponding to the item of the TYPE. In addition, the processor 110 may provide a reference word typically used for the item of the TYPE, and a related reference image (Related Reference) for understanding an expression meaning of the reference word, so that the user may select the word corresponding to the item of the TYPE.
In addition, referring to
Referring to
In this case, the type, figuration, base, description, action, and style may be derived from a code of an object, an expression, an atmosphere, a style, a background, and others.
The type may be a finally shown output (for example, a pattern, an illustration, a manipulation, or a photograph), and the figuration may be simplified, shaped, abstracted, or the like.
In addition, the base may be a core subject, and the description may mean describing the subject.
In addition, the action may be an action taken by the subject.
A first code of the above-described type, figuration, base, description, action, and style may be further classified and applied to a second code in detail, and prototype codes of the type, base, detail, and style may be acquired.
These prototype codes may be word categories that guide an input to the blanks of the predetermined sentence structure. The word categories and the categories of the bag of words are not limited to those described above, and may be changed depending on the needs of an operator.
Meanwhile, the processor 110 in an embodiment of the invention may perform the following process to improve sentence completeness of the determined caption data.
As an example, the processor 110 may include a grammar check function, and may perform a grammar check on the determined caption data. The processor 110 may check for a sentence in which a grammatical error may occur when only the input words searched from the bag of words and the previously determined connecting words are combined, and may correct the connecting word or correct the format of a word input to a blank so that the combination of the words input to the blanks is correctly formed. For example, the processor 110 may correct the parts of speech of the words input to the blanks to fit the sentence structure, or may correct (for example, add, delete, or change) the connecting word.
As another example, when presenting a recommended word searched from the bag of words, the processor 110 in an embodiment of the invention may correct the part of speech of the recommended word, and may present the result by considering the sentence structure between the blank to which the recommended word is to be input and the preset connecting word. For example, depending on the connecting word adjacent to the blank to which a word A is to be input, the processor 110 may correct the part of speech of A, or may add a connecting word such as an article or a particle to A and present the result as the recommended word.
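A minimal sketch of one such correction, limited to “a/an” agreement with the word in the adjacent blank, is shown below; the full grammar check described above is assumed to be broader, and a vowel-letter test is used here as a rough proxy for a vowel sound.

```python
import re

# Illustrative stand-in for the grammar check: fix "a/an" agreement between a
# connecting word and the word placed in the adjacent blank.
def fix_articles(sentence: str) -> str:
    # "a" before a vowel letter -> "an"
    sentence = re.sub(r"\b[Aa] (?=[aeiouAEIOU])",
                      lambda m: "An " if m.group(0)[0] == "A" else "an ",
                      sentence)
    # "an" before a consonant letter -> "a"
    sentence = re.sub(r"\b[Aa]n (?=[^aeiouAEIOU\W])",
                      lambda m: "A " if m.group(0)[0] == "A" else "a ",
                      sentence)
    return sentence

print(fix_articles("A oil-color painting of a fox that is sitting "
                   "in a field at sunrise in the style of realism art."))
# -> "An oil-color painting of a fox that is sitting in a field at sunrise ..."
```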
The processor 110 may provide a tool to enable the user to freely input the sentence and generate the caption data, based on the bag of words. For example, referring to
The processor 110 may input the caption data to the learning model to generate at least one reconstructed content corresponding to the first text. The reconstructed content may be the reconstructed image. In this case, the processor 110 may generate the reconstructed content corresponding to the first text by using a multimodal AI that performs learning by simultaneously inputting various modalities.
Referring to
As illustrated in
The above-described feedback process may be performed after generating the at least one reconstructed content and before connecting and outputting the at least one reconstructed content and the input content (to be described later). Without being limited thereto, the feedback process may also be performed after connecting and outputting the at least one reconstructed content and the input content.
The processor 110 may additionally generate the second text of the bag of words, based on the feedback, using the archive including the first text in which the reconstructed image is used as the input image. That is, the processor 110 may receive the feedback on the reconstructed image, may receive the text and the pinpoint for the reconstructed image as the feedback, may generate an input image-first text pair by using the reconstructed image, and may store the input image-first text pair in the archive. In addition, the processor 110 may provide a process of generating the caption data, based on the additionally stored input image-first text pair, and may repeat the process of generating a reconstructed image that gradually further matches the intention of the user.
For this purpose, the processor 110 may additionally generate the second text in accordance with the category of the bag of words, based on the feedback.
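For illustration, the feedback record (opinion plus pinpoint) and the way it might be folded back into the archive and the bag of words could look like the sketch below; the data classes and the choice to file the opinion under the “detail” category are assumptions, not elements defined by the platform.

```python
from dataclasses import dataclass, field

# Illustrative data shapes only: the platform's actual archive and feedback
# records are not specified at this level of detail.
@dataclass
class Feedback:
    opinion: str                      # free-text feedback opinion
    pinpoint: tuple[float, float]     # (x, y) location marked on the image

@dataclass
class ArchiveEntry:
    image_id: str                     # input or reconstructed image
    first_texts: list[str]
    feedbacks: list[Feedback] = field(default_factory=list)

def apply_feedback(entry: ArchiveEntry, fb: Feedback,
                   bag: dict[str, set[str]]) -> None:
    """Store the feedback and fold its wording back into the bag of words."""
    entry.feedbacks.append(fb)
    # Treat the feedback opinion as additional second-text material for the
    # "detail" category; the real system would categorize it per word category.
    bag.setdefault("detail", set()).add(fb.opinion.strip().lower())
```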
Referring to
Meanwhile, referring to
Referring to
Referring to
In addition, referring to
As illustrated in
The processor 110 in an embodiment of the invention may include one or more cores, and may include a processor for data analysis and deep learning, such as a central processing unit of a computing device, a general purpose graphics processing unit, and a tensor processing unit. The processor 110 may read a computer program stored in the memory 140 to perform data processing for machine learning according to an embodiment of the invention. According to an embodiment of the invention, the processor 110 may perform operations for learning a neural network. The processor 110 may perform calculations for learning a neural network, such as processing input data for learning in deep learning, extracting features from the input data, calculating errors, and updating weights of the neural network using backpropagation. Although not illustrated, the processor 110 in an embodiment of the invention may input noise training data including clean label data and label noise data to a neural network model to select label noise, and may train a classifier by mixing the label noise data and the clean label data.
The neural network model may be a deep neural network. In an embodiment of the invention, the terms neural network and network function may be used with the same meaning. A deep neural network (DNN) may mean a neural network including multiple hidden layers in addition to an input layer and an output layer. When the deep neural network is used, latent structures of data may be identified. That is, latent structures of a photo, a text, a video, a voice, or a music (for example, what object is in the photo, what the content and emotion of the text are, what the content and emotion of the voice are, and the like) may be identified. The deep neural network may include a convolutional neural network (CNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a Q network, a U network, a Siamese network, and the like.
The convolutional neural network is a type of deep neural network that includes a convolutional layer. The convolutional neural network is a type of multilayer perceptron designed to use minimal preprocessing. The CNN may include one or more convolutional layers and artificial neural network layers combined therewith. The CNN may additionally use weights and pooling layers. Owing to this structure, the CNN may fully exploit two-dimensionally structured input data. The convolutional neural network may be used to recognize objects in images. The convolutional neural network may process image data by representing the image data as a matrix having dimensions. For example, in a case of image data encoded in RGB (red-green-blue), the image data may be represented as a two-dimensional matrix for each of the R, G, and B colors (for example, in a case of a two-dimensional image). That is, a color value of each pixel of the image data may be a component of the matrix, and a size of the matrix may be the same as a size of the image. Therefore, the image data may be represented as three two-dimensional matrices (a three-dimensional data array).
In the convolutional neural network, a convolutional process (input and output of a convolutional layer) may be performed by multiplying matrix components at each location of a convolutional filter and the image while moving the convolutional filter. The convolutional filter may include a matrix in a form of n*n. The convolutional filter may generally include a filter in a fixed form that is smaller than the total number of pixels of the image. That is, when an m*m image is input to the convolutional layer (for example, a convolutional layer in which the size of the convolutional filter is n*n), the matrix representing the n*n pixels including each pixel of the image may be multiplied component-wise with the convolutional filter (that is, multiplication between respective components of the matrices). Since the multiplication with the convolutional filter is used, a component matching the convolutional filter may be extracted from the image. For example, a 3*3 convolutional filter for extracting upper and lower (vertical) straight line components from the image may be configured as [[0,1,0], [0,1,0], [0,1,0]]. When the 3*3 convolutional filter for extracting the upper and lower straight line components from the image is applied to the input image, the upper and lower straight line components matching the convolutional filter may be extracted and output from the image. In the convolutional layer, the convolutional filter may be applied to each matrix for each channel representing the image (that is, to each of the R, G, and B colors in a case of an R, G, and B coded image). In the convolutional layer, features matching the convolutional filter may be extracted from the input image by applying the convolutional filter to the input image. A filter value (that is, a value of each component of the matrix) of the convolutional filter may be updated by backpropagation during a learning process of the convolutional neural network.
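The multiply-and-sum operation described above can be illustrated with the short sketch below, using the 3*3 vertical-line filter as an example; this is a plain “valid” convolution written out explicitly for clarity, not the platform's implementation.

```python
import numpy as np

# Sketch of the convolution described above: slide a 3x3 filter over a
# single-channel image and sum the element-wise products at each position.
def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 3x3 filter that responds to vertical (upper-and-lower) straight lines.
vertical_line_filter = np.array([[0, 1, 0],
                                 [0, 1, 0],
                                 [0, 1, 0]])

image = np.zeros((5, 5))
image[:, 2] = 1.0                     # a vertical line in the middle column
response = conv2d_valid(image, vertical_line_filter)  # strong response along that column
```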
The output of the convolutional layer may be connected to a subsampling layer to simplify the output of the convolutional layer. In this manner, a memory usage amount and a computational amount may be reduced. For example, when the output of the convolutional layer is input to a pooling layer having a 2*2 max pooling filter, the image may be compressed by outputting a maximum value included in each patch for each 2*2 patch from each pixel of the image. The above-described pooling may be a method for outputting a minimum value from the patch or outputting an average value of the patch, and any pooling method may be included in an embodiment of the invention.
The convolutional neural network may include one or more convolutional layers and sub-sampling layers. The convolutional neural network may extract features from the image by repeatedly performing the convolutional process and the sub-sampling process (for example, the above-described max pooling). Through the repeated convolutional process and sub-sampling process, the neural network may extract global features of the image.
The output of the convolutional layer or the subsampling layer may be input to a fully connected layer. The fully connected layer is a layer in which all neurons in one layer are connected to all neurons in the neighboring layer. The fully connected layer may mean a structure in which all nodes in each layer are connected to all nodes in other layers in the neural network.
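As an illustration of the convolution, 2*2 max-pooling, and fully connected structure described above, a minimal PyTorch module is sketched below; the channel counts, input size (3*64*64), and class count are arbitrary assumptions chosen only for the example.

```python
import torch
from torch import nn

# Minimal sketch of the structure described above: convolution, 2x2 max
# pooling for sub-sampling, and a fully connected layer at the end.
class SmallCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # RGB input -> 16 feature maps
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 2x2 max pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)  # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

logits = SmallCNN()(torch.randn(1, 3, 64, 64))   # -> shape (1, 10)
```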
At least one of a CPU, a GPGPU, and a TPU of the processor 110 may process learning of a network function. For example, the CPU together with the GPGPU may process the learning of the network function and classification of data using the network function. In addition, in one embodiment of the invention, the processors of a plurality of computing devices may be used together to process the learning of the network function and the classification of data using the network function. In addition, a computer program executed in the computing device according to one embodiment of the invention may be a program executable by the CPU, the GPGPU, or the TPU.
The memory 140 may store a computer program for providing a platform providing method, and the stored computer program may be read and executed by the processor 110. The memory 140 may store any form of information generated or determined by the processor 110 and any form of information received by the communication processor 150.
The memory 140 may store data supporting various functions of the content generation platform server 100 and a program for the operation of the processor 110, may store input/output data (for example, the input image, the first text, the reconstructed image, the second text of the bag of words, and the like), and may store multiple application programs or applications running on the content generation platform server 100, data for the operation of the content generation platform server 100, and commands. At least some of these application programs may be downloaded from an external server via wireless communication.
This memory 140 may include at least one type of storage medium among a flash memory type, a hard disk type, a solid state disk type, a silicon disk drive (SDD) type, a multimedia card micro type, a card type (for example, an SD or XD memory and the like), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, and an optical disk. In addition, the memory may be a database separate from a main device but connected to the main device in a wired or wireless manner.
The communication processor 150 may include one or more components that enable communication with an external device, and for example, may include at least one of a broadcast receiving module, a wired communication module, a wireless communication module, a short-range communication module, and a location information module.
Although not illustrated, the content generation platform server 100 in an embodiment of the invention may further include an output unit and an input unit.
The output unit may display a user interface (UI) for providing a label noise selection result, a learning result, or the like. The output unit may output any form of information generated or determined by the processor 110 and any form of information received by the communication processor 150.
The output unit may include at least one of a liquid crystal display (LCD), a thin film transistor-liquid crystal display (TFT LCD), an organic light-emitting diode (OLED), a flexible display, and a three-dimensional display (3D display). Some of these display modules may be configured as a transparent or light-transmitting type through which the outside is visible. This type may be referred to as a transparent display module, and a representative example of the transparent display module is a Transparent OLED (TOLED).
The input unit may receive information input by the user. The input unit may include a key and/or a button on the user interface, or a physical key and/or a physical button for receiving the information input by the user. A computer program for controlling a display according to the embodiments in the invention may be executed in response to the user input through the input unit.
Hereinafter, a method for providing a platform in which the content generation platform server 100 provides a learning model trained to output at least one reconstructed image corresponding to a text when the content generation platform server 100 receives the text from a user terminal will be described.
The processor 110 of the content generation platform server 100 may receive input content and a first text matching the input content from the user terminal through a content generation module 120 (Step 310). The input content may be an input image.
In addition, the input content may be a reconstructed image generated by a previous user through a first learning model, and the first text may be a text matching the reconstructed image.
In Step 310, the processor 110 may generate a recommended text including a sentence or a word describing an object, an appearance, a background, and the like through captioning of the input image using a second learning model of the content generation module 120, and may transmit and output the generated recommended text to the user terminal.
In addition, through the content generation module 120, the processor 110 may receive, as the first text, a final text determined or corrected based on a text input by the previous user and/or a recommended text output based on the input image from an image captioning model.
Next, the processor 110 may generate a bag of words, based on the first text through the content generation module 120 (S320).
In detail, the processor 110 may classify words included in the first text (here, the words mean words forming the first text including a phrase and a clause) into each word category, and may provide a recommended word for each word category.
Next, the processor 110 may determine caption data by using the second text included in the bag of words and a predetermined sentence structure through the content generation module 120 (S330).
In this case, through an output unit, the processor 110 may output a predetermined sentence structure including a plurality of blanks and connecting words between the plurality of blanks, which match the word category for describing the input image through the image captioning model of the content generation module 120.
The processor 110 may provide a recommended word which can be input to each of the plurality of blanks, and may provide the recommended word by considering a relationship with the input image and whether the recommended word coincides with the word category.
In detail, the server 100 may provide the recommended word for the first text and the recommended word for the input image to the user terminal to set a category for each word item of caption data of the predetermined sentence structure, and based on the recommended words, may receive the second text determined for each word item category from the user terminal in response to the user input. In this manner, the second text of the caption data may be determined.
The processor 110 may automatically set the determined second text, the connecting word, and the particle to determine the text for the caption data. Next, the processor 110 may input the text of the caption data to the first learning model through the content generation module 120, and may generate at least one reconstructed content corresponding to the text indicated by the caption data (S340).
Next, the processor 110 may provide information to the user terminal through the communication processor 150 to connect and output at least one reconstructed content and the input content based on a caption data input process of generating the reconstructed content through the content generation module 120 (S350).
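Steps S310 through S350 can be summarized in one hypothetical flow, reusing the bag-of-words and caption helpers sketched earlier; the callables passed in for word recommendation and image generation stand in for the content generation module and the first learning model and are not defined by the platform itself.

```python
# Hypothetical composition of steps S310-S350; reuses build_bag_of_words() and
# compose_caption() from the earlier sketches. recommend_words() must return a
# dict with the keys type_, base, detail, and style; generate_image() stands in
# for the first learning model.
def generate_reconstructed_content(input_image, first_texts,
                                   recommend_words, generate_image) -> dict:
    bag = build_bag_of_words(first_texts)            # S320: per-user bag of words
    choices = recommend_words(bag, input_image)      # S330: user picks per-category words
    caption = compose_caption(**choices)             # S330: fill the sentence structure
    reconstructed = generate_image(caption)          # S340: first learning model
    return {"input": input_image,                    # S350: connect and return both
            "caption": caption,
            "reconstructed": reconstructed}
```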
In addition, the processor 110 may control the user terminal to be provided with a process of receiving the feedback on the reconstructed image selected by the user from at least one reconstructed image through the feedback processing module 130.
When the user terminal provides the feedback on the reconstructed image to the feedback processing module 130, the user terminal may receive a feedback opinion on the reconstructed image and a pinpoint in the reconstructed image matching the feedback opinion in response to an operation of the user.
When receiving the feedback from the user terminal through the communication processor 150, the processor 110 may, through the feedback processing module 130, input the reconstructed image to the learning model to identify a common morpheme token related to the reconstructed image, may reconstruct a sentence of the caption data by using the identified common morpheme token, and may output the reconstructed sentence through the output unit. For this purpose, the processor 110 may read out, from the memory 140, a third learning model, which is a morphological analyzer trained to preprocess text and separate the text into morpheme units, and may use the third learning model.
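A hedged sketch of the common-morpheme step is given below; plain whitespace splitting stands in for the third learning model's morpheme segmentation, and the share threshold is an assumption made only for the example.

```python
from collections import Counter

def morphemes(text: str) -> list[str]:
    # Placeholder for the third learning model (a trained morphological
    # analyzer); whitespace splitting stands in for real morpheme segmentation.
    return text.lower().split()

def common_morpheme_tokens(texts: list[str], min_share: float = 1.0) -> list[str]:
    """Return tokens that appear in at least `min_share` of the texts."""
    counts = Counter()
    for text in texts:
        counts.update(set(morphemes(text)))
    threshold = min_share * len(texts)
    return [tok for tok, n in counts.items() if n >= threshold]

# Tokens shared by the captions of the images the user reacted to can then be
# spliced back into the caption sentence before it is re-sent for generation.
shared = common_morpheme_tokens([
    "a fox sitting in a field at sunrise",
    "a fox standing in a field at dusk",
])
# -> tokens such as "a", "fox", "in", "field", "at" (order not guaranteed)
```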
The processor 110 may correct the reconstructed sentence through the feedback processing module 130 in response to the operation of the user.
Next, the processor 110 may additionally generate the second text by updating the bag of words, based on the feedback of the reconstructed image through the feedback processing module 130.
Next, the processor 110 may reconstruct the sentence of the caption data, based on the feedback through the feedback processing module 130.
The processor 110 may additionally generate the second text in accordance with the word category of each item in the bag of words, based on the feedback through the feedback processing module 130.
The processor 110 may generate an archive of each reconstructed image generated as the reconstructed image is initially generated and the feedback is repeatedly performed through the content generation module 120.
The processor 110 may update the first learning model through the archive generated in this way. Since the first learning model is trained through the text according to a predetermined sentence structure of the caption data and the reconstructed image matching the sentence structure (for example, the input image), the first learning model may be trained with a high rate of matching the predetermined sentence structure.
The processor 110 may search for and provide real images having a similarity level higher than a preset reference value, based on at least one reconstructed image through the content generation module 120.
The processor 110 may extract the caption data including a sentence or a word through captioning of at least one reconstructed image through the content generation module 120, and may recommend new caption data by comparing the extracted caption data with the real images.
Meanwhile, the above-described method according to the invention may be implemented and stored as a program or an application in a medium to be executed in combination with a hardware server.
The disclosed embodiments may be implemented in a form of a recording medium storing instructions executable by a computer. The instructions may be stored in a form of program codes, and when the program codes are executed by a processor, program modules may be generated to perform operations of the disclosed embodiments. The recording medium may be implemented as a computer-readable recording medium.
The computer-readable storage medium includes all types of storage media that store instructions which can be read by a computer. For example, the computer-readable storage medium includes a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic tape, a magnetic disk, a flash memory, an optical data storage device, and the like.
According to the above-described technical solution in an embodiment of the invention, an archive reflecting a user's own concept can be formed, based on the user's experiences and thoughts, and an image having maximized creativity can be provided by using the archive.
According to the above-described technical solution in an embodiment of the invention, a learning model is trained by using a text and a predetermined sentence structure in a bag of words. Therefore, an image that matches the user's intention can be generated.
Advantageous effects in an embodiment of the invention are not limited to the advantageous effects described above, and other advantageous effects not described herein will be clearly understood by those skilled in the art, from the description below.
As described above, the disclosed embodiments have been described with reference to the accompanying drawings. Those skilled in the art to which an embodiment of the invention pertains will understand that the present disclosure can be implemented in forms different from those of the disclosed embodiments without changing the technical idea or essential features in the present disclosure. The disclosed embodiments are exemplary, and should not be construed as limitative.
The present disclosure has industrial applicability since a method for generating user experience-based content by using a learning model is performed through processing of a server and a terminal.
Although certain embodiments and implementations have been described herein, other embodiments and modifications will be apparent from this description. Accordingly, the inventive concepts are not limited to such embodiments, but rather to the broader scope of the appended claims and various obvious modifications and equivalent arrangements as would be apparent to a person of ordinary skill in the art.
Number | Date | Country | Kind |
---|---|---|---|
10-2022-0125347 | Sep 2022 | KR | national |
This application is a Bypass Continuation of International Patent Application No. PCT/KR 2023/014847, filed on Sep. 26, 2023, which claims priority from and the benefit of Korean Patent Application No. 10-2022-0125347, filed on Sep. 30, 2022, each of which is hereby incorporated by reference for all purposes as if fully set forth herein.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/KR2023/014847 | Sep 2023 | WO |
Child | 19093352 | US |