The present application claims priority to Chinese Patent Application No. 202310035955.X, filed on Jan. 10, 2023 and entitled “PROCESSING METHOD, APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM FOR MULTIMODAL DATA”, the entirety of which is incorporated herein by reference.
Embodiments of the present disclosure relate to the field of computer technology, and more particularly, to a processing method, apparatus, electronic device and storage medium for multimodal data.
Currently, multimodal deep learning has been widely studied; it aims to simultaneously process data of at least two modalities, such as speech, text, images, and videos.
In the prior art, the pre-training process of multimodal submodels usually focuses on contrastive learning between global features of multimodal data while ignoring the correspondence between finer-grained features, resulting in limited performance of pre-trained models on downstream multimodal data processing tasks.
The embodiments of the present disclosure provide a processing method, apparatus, electronic device and storage medium for multimodal data, which enable establishing fine-grained correspondence between multimodal data and improving the performance of pre-trained models on downstream multimodal data processing tasks.
In a first aspect, the embodiments of the present disclosure provide a processing method for multimodal data, comprising: obtaining data to be processed of an original modality; determining result data of a target modality corresponding to the data to be processed by processing the data to be processed with a target processing model;
wherein the target processing model comprises a multimodal submodel, and the pre-training task of the multimodal submodel includes a task of locating local data that matches second modal data from first modal data; wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality.
In a second aspect, the embodiments of the present disclosure further provide a processing apparatus for multimodal data, comprising: a data obtaining module, used for obtaining data to be processed of an original modality; a data processing module, used for determining result data of a target modality corresponding to the data to be processed by processing the data to be processed with a target processing model; wherein the target processing model comprises a multimodal submodel, and the pre-training task of the multimodal submodel includes a task of locating local data that matches second modal data from first modal data; wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality.
In a third aspect, the embodiments of the present disclosure further provide an electronic device comprising: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the processing method for multimodal data according to any embodiment of the present disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a storage medium comprising computer-executable instructions which, when executed by a computer processor, perform the processing method for multimodal data according to any embodiment of the present disclosure.
The technical solution of the embodiments of the present disclosure obtains the data to be processed of the original modality; and determines the result data of the target modality corresponding to the data to be processed by processing the data to be processed with the target processing model; wherein the target processing model includes a multimodal submodel, and the pre-training task of the multimodal submodel includes the task of locating local data that matches the second modal data from the first modal data; wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality. By setting the pre-training task of locating local data that matches the second modal data from the first modal data, the multimodal submodel can establish a finer-grained local correspondence relationship between the multimodal data, thereby improving the performance of the target processing model on downstream tasks.
In conjunction with the accompanying drawings and with reference to the following detailed description, the above and other features, advantages and aspects of the various embodiments of the present disclosure will become more apparent. Throughout the drawings, like or similar reference numerals denote like or similar elements. It should be understood that the drawings are illustrative and that the components and elements are not necessarily drawn to scale.
The following will describe embodiments of the present disclosure in more detail with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein. On the contrary, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the various steps described in the implementations of the methods of the present disclosure may be executed in different orders and/or in parallel. In addition, the implementations of the methods may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this regard.
The term “including” and its variations as used herein are open-ended, i.e., “including but not limited to”. The term “based on” means “based at least in part on”. The term “one embodiment” refers to “at least one embodiment”; the term “another embodiment” refers to “at least one additional embodiment”; and the term “some embodiments” refers to “at least some embodiments”. Relevant definitions of other terms will be given in the following description.
It should be noted that the concepts of “first” and “second” mentioned in this disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.
It should be noted that the modifiers “a/an” and “a plurality of” mentioned in the present disclosure are illustrative rather than restrictive. Those skilled in the art should understand that, unless otherwise specified in the context, they should be understood as “one or more”.
It may be understood that the data involved in this technical solution (including but not limited to the data itself, data acquisition or use) should comply with the requirements of relevant laws and regulations and relevant provisions.
As shown in
S110: obtaining data to be processed of an original modality;
S120: determining result data of a target modality corresponding to the data to be processed by processing the data to be processed with a target processing model; wherein the target processing model includes a multimodal submodel, and a pre-training task of the multimodal submodel includes a task of locating local data that matches second modal data from first modal data.
In embodiments of the present disclosure, the original modality and the target modality generally refer to different modalities, and different modalities can be considered as different data types, e.g., including but not limited to speech, text, image, and video modalities. With the trained target processing model, the input data to be processed of the original modality can be processed, and the result data of the target modality corresponding to the data to be processed can be determined.
In some optional implementations, the target processing model can be applied to at least one of the following tasks: a video-based text locating task, a text-based video temporal locating task, a video-based text retrieval task, a text-based video retrieval task, a video-based text generation task, a text-based video generation task, a video question answering task, and a video parsing task.
When the target processing model is applied to a video-based text locating task, the original modality may include video, and the target modality may include text; when the target processing model is applied to a text-based video temporal locating task, the original modality may include text, and the target modality may include video. In both cases, the determining the result data of the target modality corresponding to the data to be processed by processing the data to be processed with the target processing model may comprise: extracting a feature of the data to be processed and extracting a feature of data to be located of the target modality with the target processing model; encoding the feature of the data to be processed and the feature of the data to be located to obtain an encoding result; and locating local data that matches the data to be processed from the data to be located based on the encoding result. The video-based text locating task may include, for example, a task of locating matched local text segments from long text based on an input video (abbreviated as a text locating task); the text-based video temporal locating task may include, for example, a task of locating matched local video segments from a long video based on input text (abbreviated as an event locating task).
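For illustration only, the extract-encode-locate flow described above can be sketched in code. The following is a minimal Python sketch under assumed interfaces; the `feature_extractor`, `encoder`, and `locate_head` modules are hypothetical stand-ins for the target processing model's components and are not taken from the original disclosure.

```python
import torch

def locate_local_data(query, candidate, feature_extractor, encoder, locate_head):
    """Locate the piece of `candidate` (target modality) that matches `query`
    (original modality). `feature_extractor`, `encoder`, and `locate_head`
    are hypothetical stand-ins for the target processing model's components."""
    q_feat = feature_extractor(query)        # feature of the data to be processed, (B, Lq, D)
    c_feat = feature_extractor(candidate)    # feature of the data to be located, (B, Lc, D)
    # Encode both features jointly to obtain the encoding result.
    encoding = encoder(torch.cat([q_feat, c_feat], dim=1))
    # Predict, e.g., the start/end positions of the matching local segment.
    start, end = locate_head(encoding)
    return start, end
```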
When the target processing model is applied to a video-based text retrieval task, the original modality may include video, and the target modality may include text; when the target processing model is applied to a text-based video retrieval task, the original modality may include text, and the target modality may include video. In both cases, the determining the result data of the target modality corresponding to the data to be processed by processing the data to be processed with the target processing model may comprise: extracting a feature of the data to be processed with the target processing model, and matching the extracted feature with features of each data of the target modality in a predetermined library to retrieve corresponding result data from the predetermined library. The video-based text retrieval task includes tasks such as determining a text description corresponding to a classification according to the video; the text-based video retrieval task includes tasks such as searching for relevant complete videos based on input keywords.
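As a non-authoritative illustration of the retrieval flow, the sketch below matches the extracted feature against pre-computed features of a predetermined library; `feature_extractor`, `library_feats`, and `library_items` are assumed names, not components named in the disclosure.

```python
import torch
import torch.nn.functional as F

def retrieve_from_library(query, feature_extractor, library_feats, library_items, top_k=5):
    """Retrieve the result data of the target modality whose pre-computed features
    in the predetermined library best match the feature of the data to be processed.
    `library_feats` is an (N, D) tensor; `library_items` holds the N library entries."""
    q_feat = feature_extractor(query)                    # (1, D) query feature
    sims = F.cosine_similarity(q_feat, library_feats)    # (N,) similarity to each entry
    top = torch.topk(sims, k=min(top_k, sims.numel())).indices
    return [library_items[i] for i in top.tolist()]
```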
When the target processing model is applied to a video-based text generation task, the original modality may include video, and the target modality may include text; when the target processing model is applied to a text-based video generation task, the original modality may include text, and the target modality may include video. In both cases, the determining the result data of the target modality corresponding to the data to be processed by processing the data to be processed with the target processing model may comprise: extracting a feature of the data to be processed with the target processing model, and generating the result data of the corresponding target modality based on the extracted feature. The video-based text generation task includes tasks such as generating a text description corresponding to video content; the text-based video generation task includes tasks such as generating related videos based on input keywords.
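A minimal sketch of the generation flow follows, assuming a hypothetical `feature_extractor` and `generator` (for example, a text decoder or a video decoder); these names are illustrative assumptions only.

```python
def generate_result(data_to_process, feature_extractor, generator):
    """Extract the feature of the data to be processed and generate result data of
    the target modality from it. `feature_extractor` and `generator` (e.g. a text
    decoder or a video decoder) are hypothetical components."""
    feat = feature_extractor(data_to_process)
    return generator(feat)   # e.g. a caption for an input video, or frames for an input text
```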
When the target processing model is applied to a video question answering task, the original modality may include video, and the target modality may include text. At this time, the determining the result data of the target modality corresponding to the data to be processed by processing the data to be processed with the target processing model may comprise: extracting features of the video and question text with the target processing model, and generating answer text based on the features of the video and the question text. The video question answering task includes, for example, a video content comprehension task.
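The question answering flow might be sketched as follows; `video_encoder`, `text_encoder`, and `answer_decoder` are hypothetical stand-ins rather than components named in the disclosure.

```python
import torch

def answer_question(video, question, video_encoder, text_encoder, answer_decoder):
    """Extract features of the video and the question text, then decode answer text
    from the joint features. All three modules are hypothetical stand-ins."""
    v_feat = video_encoder(video)        # (B, Lv, D) video features
    q_feat = text_encoder(question)      # (B, Lq, D) question text features
    joint = torch.cat([v_feat, q_feat], dim=1)
    return answer_decoder(joint)         # answer text (e.g. token ids)
```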
When the target processing model is applied to a video parsing task, the original modality may include video, and the target modality may include text. At this time, the determining the result data of the target modality corresponding to the data to be processed by processing the data to be processed with the target processing model may comprise: extracting features of the video with the target processing model; dividing the video into different video segments according to the features of the video, and generating text corresponding to the content of each video segment. The video parsing task includes, for example, a video content comprehension task.
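A possible sketch of the video parsing flow, assuming hypothetical `feature_extractor`, `boundary_detector`, and `captioner` components:

```python
def parse_video(video_frames, feature_extractor, boundary_detector, captioner):
    """Extract per-frame features, split the video into segments at detected
    boundaries, and generate text for each segment. `boundary_detector` and
    `captioner` are hypothetical components."""
    feats = feature_extractor(video_frames)      # per-frame features
    boundaries = boundary_detector(feats)        # e.g. [0, 120, 300, num_frames]
    results = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        segment_text = captioner(feats[start:end])
        results.append({"start": start, "end": end, "text": segment_text})
    return results
```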
In these optional implementations, the original modality can be one of video and text, and the target modality can be the other. Thereby, the processing of modal data between video and text can be realized, which helps to intelligently produce and analyze videos. In addition, the target processing model can also handle other video and text tasks, as well as tasks between other multimodal data (such as mutual indexing and generation between audio and text), which are not exhaustively listed here.
In the embodiments of the present disclosure, the target processing model may include a pre-trained multimodal submodel or a model structure for specific downstream tasks. The multimodal submodel may include, for example, a transformer model and other models with comprehension ability of different modal data. The pre-training task of the multimodal submodel may include a task of locating local data that matches the second modal data from the first modal data.
When the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality. That is, the first modal data and the second modal data belong to different modalities, and each of the first modal data and the second modal data belongs to one of the original modality and the target modality. The task of locating local data that matches the second modal data from the first modal data can include, but is not limited to, the text locating task and the event locating task described above.
By introducing the task of locating local data that matches the second modal data from the first modal data in the pre-training process of the multimodal submodel, the pre-trained multimodal submodel can learn a finer-grained local correspondence between the multimodal data, thereby improving the performance of the target processing model to which the multimodal submodel belongs on downstream tasks (such as video-text mutual locating, retrieval, generation, and multimodal video analysis).
In addition, the pre-training task of the multimodal submodel can further include other tasks based on large-scale and broad multimodal data. The broad multimodal data can be considered as multimodal data that covers different domains and is not specific to downstream tasks. Through large-scale pre-training, the multimodal submodel can simultaneously acquire a high comprehension ability for multimodal data in different domains and extract common features between multimodal data in different domains, which is conducive to transferring this high comprehension ability of multimodal data to the target processing model to help the target processing model perform specific downstream tasks.
The technical solution of the embodiments of the present disclosure obtains the data to be processed of the original modality; and determines the result data of the target modality corresponding to the data to be processed by processing the data to be processed with the target processing model; wherein the target processing model includes a multimodal submodel, and the pre-training task of the multimodal submodel includes the task of locating local data that matches the second modal data from the first modal data; wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality. By setting the pre-training task of locating local data that matches the second modal data from the first modal data, the multimodal submodel can establish a finer-grained local correspondence relationship between the multimodal data, thereby improving the performance of the target processing model on downstream tasks.
Various optional solutions in the method for multimodal data processing provided in this embodiment and the above embodiments of the present disclosure can be combined. The method for multimodal data processing provided in this embodiment describes in detail the pre-training process of the multimodal submodel. By fusing each first modal segment data into a longer first modal data and constructing a first fusion feature of the first modal data based on a first feature of each first modal segment data, a foundation can be laid for the pre-training task of locating local data from the data. Afterwards, the first fusion feature and the second feature of the given second modal data can be encoded, and target segment data in the first modal data that matches the second modal data can be predicted according to the encoding result for supervised training of the multimodal submodel, so that the pre-trained multimodal submodel can learn a finer-grained local correspondence between the multimodal data.
S210: constructing a first fusion feature based on a first feature of each first modal segment data in the first modal data.
In this embodiment, each first modal segment data has the same modality, and each first modal segment data may be joined into longer first modal data. Based on an existing feature extraction model, a first feature may be extracted from each first modal segment data. Thereafter, a first fusion feature corresponding to the first modal data may be constructed based on each of the first features according to the concatenation order of the first modal segment data in the first modal data.
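As a minimal sketch of this construction step (assuming each first feature is a tensor of per-token or per-frame features; the function name and the permutation argument are illustrative assumptions, not part of the disclosure):

```python
import torch

def build_first_fusion_feature(segment_feats, order):
    """Concatenate the per-segment first features according to the concatenation
    order of the segments in the longer first modal data.
    `segment_feats` is a list of (L_i, D) tensors; `order` is a permutation
    of segment indices, e.g. [2, 0, 1]."""
    return torch.cat([segment_feats[i] for i in order], dim=0)
```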
As an example,
S220: encoding the first fusion feature and a second feature of second modal data to obtain an encoding result.
When the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality. The second feature can be extracted from the given second modal data based on the existing feature extraction model. The input first fusion feature and the second feature can be encoded based on the existing feature encoder to obtain the encoding result.
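For illustration, the joint encoding step might look like the following sketch, which uses a standard Transformer encoder as a stand-in for the "existing feature encoder"; the dimensions and layer counts are assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

# A stand-in for the feature encoder: a Transformer encoder applied to the
# concatenation of the first fusion feature and the second feature.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=4,
)

def encode(first_fusion_feat, second_feat):
    """first_fusion_feat: (B, L1, 512); second_feat: (B, L2, 512)."""
    joint = torch.cat([first_fusion_feat, second_feat], dim=1)
    return encoder(joint)   # encoding result, shape (B, L1 + L2, 512)
```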
Referring to
S230: predicting target segment data that matches the second modal data from each of the first modal segment data according to the encoding result.
In this embodiment, the encoding result may be input to a decoder and other subsequent network layers, so that the subsequent network layers can predict, from the first fusion feature, the first feature that matches the second feature according to the encoding result, and then locate the corresponding first modal segment data in the first modal data according to the matched first feature, i.e., locate the target segment data matching the second modal data.
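A minimal sketch of such a prediction head is shown below; pooling the encoding result and scoring each first modal segment is one possible design under assumed dimensions, not necessarily the design used in the disclosure.

```python
import torch
import torch.nn as nn

class SegmentLocatingHead(nn.Module):
    """Given the encoding result, score each first modal segment and predict
    which one matches the second modal data."""
    def __init__(self, d_model=512, num_segments=3):
        super().__init__()
        self.score = nn.Linear(d_model, num_segments)

    def forward(self, encoding):
        pooled = encoding.mean(dim=1)          # (B, d_model) pooled encoding result
        logits = self.score(pooled)            # (B, num_segments) per-segment scores
        return logits.argmax(dim=-1), logits   # predicted target segment index and scores
```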
For example, referring again to
S240: pre-training the multimodal submodel according to the target segment data and label data corresponding to the second modal data.
The label data corresponding to the second modal data may be obtained in advance. The label data may be the first modal segment data that actually corresponds to the second modal data in the first modal data, or may be position information, in the first modal data, of the actually corresponding first modal segment data.
When the label data is the actually corresponding first modal segment data, a loss value can be determined based on an existing loss function and according to the target segment data and the actually corresponding first modal segment data; when the label data is the real position information of the actually corresponding first modal segment data in the first modal data, the loss value can be determined based on an existing loss function and according to the position information of the target segment data in the first modal data and the real position information. Afterwards, the loss value can be fed back (e.g., via back propagation) to adjust parameters in the multimodal submodel, so that pre-training of the multimodal submodel can be completed.
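One possible form of this supervised pre-training step is sketched below, assuming the locating head outputs per-segment scores and the label is a segment index; the cross-entropy loss is merely an illustrative choice of the "existing loss function".

```python
import torch.nn.functional as F

def pretrain_step(logits, label, optimizer):
    """One supervised pre-training step. `logits` are the per-segment scores from
    the locating head; `label` is the index of the first modal segment data that
    actually corresponds to the second modal data."""
    loss = F.cross_entropy(logits, label)   # one possible choice of loss function
    optimizer.zero_grad()
    loss.backward()                         # feed the loss back to adjust parameters
    optimizer.step()
    return loss.item()
```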
The technical solution of the present disclosure describes the pre-training process of the multimodal submodel in detail. By fusing each of the first modal segment data into a longer first modal data and constructing a first fusion feature of the first modal data according to the first feature of each first modal segment data, a foundation can be laid for the pre-training task of locating local data from the data. Afterwards, the first fusion feature and the second feature of the given second modal data can be encoded, and the target segment data that matches the second modal data in the first modal data can be predicted according to the encoding result for supervised training of the multimodal submodel, so that the pre-trained multimodal submodel can learn a finer-grained local correspondence between the multimodal data.
In addition, the method for multimodal data processing provided by this embodiment and the methods for multimodal data processing provided in the above embodiments belong to the same disclosure concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and the same technical features have the same beneficial effects in this embodiment and the above embodiments.
Various optional solutions in the method for multimodal data processing provided in this embodiment and the above embodiments of the present disclosure can be combined. The method for multimodal data processing provided in this embodiment describes in detail the construction process of the first fusion feature of the long video. By constructing the first fusion feature of the long video, a foundation can be laid for the event locating task in pre-training, so that the pre-trained multimodal submodel learns the correspondence between complete text and fine-grained local video.
In this embodiment, when each of the first modal segment data comprises video segment data, the first fusion feature may be constructed based on at least one of the following: adjusting the order of each of the video segment data, and concatenating the first feature of each of the video segment data whose order has been adjusted; sampling each of the video segment data, and concatenating the first feature of each of the sampled video segment data.
In
As an example,
Method 1: randomly adjusting the order of V1-V3 to the 3rd, 1st, and 2nd (that is, the concatenation order of V1-V3 in the first modal data is the 3rd, 1st, and 2nd); concatenating the corresponding first features v1-v3 according to the adjusted order to obtain the first fusion feature vm.
Method 2: sampling each of the first features v1-v3 in the same way that V1-V3 are sampled. For example, in
In addition to the two construction methods shown in
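For illustration, the two construction methods for video (order adjustment and sampling) might be sketched as follows; the function names, the sampling stride, and the assumption that each first feature is a per-frame feature tensor are all hypothetical.

```python
import torch

def fuse_video_by_reordering(video_feats, new_order):
    """Method 1: adjust the order of the video segments (e.g. V1-V3 -> 3rd, 1st,
    2nd) and concatenate their first features in the adjusted order."""
    return torch.cat([video_feats[i] for i in new_order], dim=0)

def fuse_video_by_sampling(video_feats, keep_every=2):
    """Method 2: sample each segment's first feature (here, keep every
    `keep_every`-th frame feature) and concatenate the sampled features."""
    return torch.cat([f[::keep_every] for f in video_feats], dim=0)
```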
In some optional implementations, the label data corresponding to the second modal data may include: start and end frame position information of the video segment data corresponding to the second modal data in the first modal data.
For example, as shown in
Accordingly, the features input into the encoder in
In these optional implementations, the first fusion feature can be constructed based on the first feature by adjusting the order and/or sampling, thereby laying a foundation for the event locating task.
The technical solution of the embodiments of the present disclosure describes in detail the construction process of the first fusion feature of the long video. By constructing the first fusion feature of the long video, a foundation can be laid for the event locating task in pre-training, so that the pre-trained multimodal submodel can learn the correspondence between complete text and fine-grained local videos. Meanwhile, the modeling of video context temporal information can also be realized through the event locating task, which can improve the performance of the pre-trained model on more downstream tasks (such as video temporal positioning and other tasks).
Further, the method for multimodal data processing provided in this embodiment and the methods for multimodal data processing provided in the above embodiments belong to the same disclosure concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and the same technical features have the same beneficial effects in the present embodiment and the above embodiments.
Various optional solutions in the method for multimodal data processing provided in this embodiment and the above embodiments of the present disclosure can be combined. The method for multimodal data processing provided in this embodiment describes in detail the construction process of the first fusion feature of long text. By constructing the first fusion feature of long text, a foundation can be laid for text locating tasks in pre-training, so that the pre-trained multimodal submodel learns the correspondence between complete video and fine-grained local text.
In this embodiment, when each of the first modal segment data comprises text segment data, the first fusion feature may be constructed based on at least one of the following: adjusting the order of each text segment data, and concatenating a first feature of each adjusted text segment data; extracting a segment token feature of each text segment data, and aggregating the various segment token features.
In
As an example, two methods of constructing the first fusion feature are shown in
Method 1: adjusting the order of T1-T3 to the 3rd, 1st, and 2nd respectively (that is, the concatenation order of T1-T3 in the first modal data is the 3rd, 1st, and 2nd); concatenating the corresponding first features t1-t3 according to the adjusted order to obtain the first fusion feature.
Method 2: extracting a segment token feature (CLS Token) from each of the first features t1-t3; aggregating the CLS Tokens, for example, concatenating them according to the concatenation order of T1-T3 in the first modal data in
In addition to the two construction methods shown in
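The two text construction methods (order adjustment and CLS-token aggregation) might be sketched as follows; the assumption that the segment token feature is the first token of each first feature, as well as the function names, are illustrative only.

```python
import torch

def fuse_text_by_reordering(text_feats, new_order):
    """Method 1: adjust the order of the text segments (e.g. T1-T3 -> 3rd, 1st,
    2nd) and concatenate their first features in the adjusted order."""
    return torch.cat([text_feats[i] for i in new_order], dim=0)

def fuse_text_by_cls_tokens(text_feats):
    """Method 2: take the segment token (CLS) feature from each segment's first
    feature -- assumed here to be the first token -- and concatenate them in the
    segments' concatenation order."""
    cls_tokens = [f[:1] for f in text_feats]   # one (1, D) token per segment
    return torch.cat(cls_tokens, dim=0)
```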
In some optional implementations, the label data corresponding to the second modal data may include: start and end character position information or segment ordering information of the text segment data corresponding to the second modal data in the first modal data.
For example, referring to
Accordingly, the features input into the encoder in
In these optional implementations, the first fusion feature can be constructed based on each of the first features by adjusting the order and/or extracting the segment token features, thereby laying a foundation for the text locating task.
The technical solution of the embodiments of the present disclosure describes in detail the construction process of the first fusion feature of the long text. By constructing the first fusion feature of the long text, a foundation can be laid for the text locating task in pre-training, so that the pre-trained multimodal submodel can learn the correspondence between complete video and fine-grained local text. Further, the method for multimodal data processing provided in this embodiment and the methods for multimodal data processing provided in the above embodiments belong to the same disclosure concept; for technical details not described in detail in this embodiment, reference may be made to the above embodiments, and the same technical features have the same beneficial effects in the present embodiment and the above embodiments.
As illustrated in
In some optional implementations, the processing apparatus for multimodal data may further include:
In some optional implementations, the modal pre-training module may construct the first fusion feature by at least one of the following:
In some optional implementations, the label data corresponding to the second modal data may comprise: start and end frame position information of video segment data corresponding to the second modal data in the first modal data.
In some optional implementations, when each of the first modal segment data comprises text segment data, the first fusion feature may be constructed based on at least one of:
In some optional implementations, the label data corresponding to the second modal data may comprise: start and end character position information or segment ordering information of text segment data corresponding to the second modal data in the first modal data.
In some optional implementations, the target processing model may be applied to at least one of:
The processing apparatus for multimodal data provided by the embodiments of the present disclosure may perform the processing method for multimodal data provided by any embodiment of the present disclosure, and has functional modules corresponding to the performed method as well as corresponding beneficial effects.
It should be noted that the various units and modules included in the above-mentioned apparatus are divided only according to functional logic, and are not limited to the above-mentioned division, as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for ease of distinguishing them from each other, and are not used to limit the scope of protection of the present disclosure.
Referring now to
As shown in
Generally, the following devices can be connected to the I/O interface 705: input devices 706, including touch screens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 707, including liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 708, including magnetic tapes, hard disks, etc.; and communication devices 709. The communication devices 709 can allow the electronic device 700 to communicate with other devices by wire or wirelessly to exchange data. Although
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program comprising program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from the network through the communication device 709, or is installed from the storage device 708, or is installed from the ROM 702. When the computer program is executed by the processing device 701, the above-described functions defined in the method of the present disclosure are performed.
It should be noted that the computer-readable medium described above in this disclosure can be a computer-readable signal medium or a computer-readable storage medium or any combination thereof. The computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or any combination thereof. More specific examples of computer-readable storage media may include but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that may be used by or in conjunction with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, which carries computer-readable program code. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, which may send, propagate, or transmit programs for use by or in conjunction with instruction execution systems, apparatuses, or devices. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination thereof.
In some embodiments, the client and the server may communicate by using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (“LAN”), wide area networks (“WAN”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer-readable medium may be included in the electronic device, or it may exist alone and not assembled into the electronic device.
The above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device is caused to: obtain data to be processed of an original modality; determine result data of a target modality corresponding to the data to be processed by processing the data to be processed with a target processing model; wherein the target processing model comprises a multimodal submodel, and the pre-training task of the multimodal submodel includes a task of locating local data that matches second modal data from first modal data; wherein when the first modal data belongs to the original modality, the second modal data belongs to the target modality; when the first modal data belongs to the target modality, the second modal data belongs to the original modality.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the “C” language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., using an Internet service provider to connect via the Internet).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of systems, methods, and computer program products that may be implemented in accordance with various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may also occur in a different order than those marked in the figures. For example, two blocks shown in succession may actually be executed substantially in parallel, and they may sometimes be executed in the opposite order, depending on the function involved. It should also be noted that each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or may be implemented using a combination of dedicated hardware and computer instructions.
The modules described in the disclosed embodiments can be implemented by software or hardware. The name of the module does not limit the module itself in some cases. For example, the allocation module can also be described as “when creating a virtual machine in TCE-metal, assign the corresponding bare metal device module to the virtual machine.”
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems-on-chip (SOCs), complex programmable logic devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium can be a tangible medium that can contain or store programs for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination thereof. More specific examples of machine-readable storage media may include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
According to one or more embodiments of the present disclosure, [example 1] provides a processing method for multimodal data, the method includes:
According to one or more embodiments of the present disclosure, [example 2] provides a processing method for multimodal data, further including:
predicting target segment data that matches the second modal data from each of the first modal segment data according to the encoding result;
According to one or more embodiments of the present disclosure, [example 3] provides a processing method for multimodal data, further including:
in some optional implementations, when each of the first modal segment data comprises video segment data, the first fusion feature is constructed based on at least one of:
adjusting the order of each of the video segment data, and concatenating the first feature of each of the video segment data whose order has been adjusted;
sampling each of the video segment data, and concatenating the first feature of each of the sampled video segment data.
According to one or more embodiments of the present disclosure, [example 4] provides a processing method for multimodal data, further including:
According to one or more embodiments of the present disclosure, [example 5] provides a processing method for multimodal data, further including:
According to one or more embodiments of the present disclosure, [example 6] provides a processing method for multimodal data, further including:
According to one or more embodiments of the present disclosure, [example 7] provides a processing method for multimodal data, further including:
According to one or more embodiments of the present disclosure, [example 8] provides a processing method for multimodal data, including:
The above description is only a description of the preferred embodiments of the present disclosure and the technical principles used. Those skilled in the art should understand that the scope of the disclosure involved in the present disclosure is not limited to the specific combination of the above technical features, but should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosure concept, for example, technical solutions formed by mutually replacing the above features with (but not limited to) technical features with similar functions disclosed in the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring these operations to be performed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the particular features or acts described above. Rather, the particular features and acts described above are merely exemplary forms of implementation of the claims.
Number | Date | Country | Kind
--- | --- | --- | ---
202310035955.X | Jan 2023 | CN | national