The present disclosure generally relates to a machine learning model. Particularly, the present disclosure relates to a visual language model.
The analysis and application of aerospace big data belongs to the field of aerospace technology. With the advancement of observation and computer technologies, relevant organizations can now obtain massive heterogeneous aerospace data every day. Remote-sensing satellite data offers huge potential in providing insights (e.g., asset, commodity and carbon monitoring) for various clients ranging from governments and corporations to individuals. The common data-analytics methods for satellite data analysis and data services include feature extraction, image classification, segmentation, object detection, change detection, etc.
The traditional satellite data pre-processing pipeline is limited by a lack of data-processing and analytic capabilities, as it requires a large amount of manual intervention to extract usable information from massive remote-sensing data, resulting in low production efficiency, prohibitive cost for large-scale processing and susceptibility to human error. For example, more than 190,000 co-seismic surface landslides occurred after the Wenchuan Earthquake in 2008, and the current landslide inventories were all produced by manual identification and compilation, which is very time- and labor-intensive.
There is a need in the art for an improved technique for extracting usable information from massive remote-sensing data while achieving an increased level of automation in extracting the information, so as to reduce the amount of manual intervention involved.
In the present disclosure, SpaceGPT is introduced for chatting with a user in natural language and processing remote-sensing images according to instructions issued or hinted at by the user in a chat made in natural language. SpaceGPT is a unified model capable of learning from multiple label modalities in various datasets including remote-sensing imagery datasets (e.g., object detection and segmentation datasets), with visual-language model capabilities that connect a large language model (LLM) and one or more visual foundation models (VFMs) to enable sending and receiving images during chatting. A user can ask the model to perform certain image-related tasks using text queries. Automation in analyzing the remote-sensing images without intervention or assistance by human experts is advantageously achievable.
A first aspect of the present disclosure is to provide a computer-implemented method utilizing SpaceGPT for chatting with a user in natural language and performing an image-related task mentioned or hinted at by the user in a query. The image-related task is performed on one or more remote-sensing images.
In the method, a platform for bidirectionally communicating with the user and performing the image-related task is set up in software. The platform comprises a prompt manager for communicating with and prompting an LLM. The prompt manager is arranged to: forward the query to the LLM so as to cause the LLM to identify the image-related task from the query and determine one or more image-processing actions to be performed on the one or more remote-sensing images for accomplishing the image-related task; prompt the LLM to invoke a matched VFM selected from a predetermined set of one or more VFMs to perform an individual image-processing action if the LLM determines that the individual image-processing action matches an image-processing operation performable by the matched VFM; and receive from the LLM a reply to the user on any outcome of the image-related task. After the platform is set up, the platform is used to receive the query from the user, process the query and forward the reply to the user. Before the platform is used to process the query, one or more selected VFMs are fine-tuned with one or more remote-sensing imagery datasets. As a result, advantageously, one or more respective image-processing operations performable by the one or more selected VFMs are adapted to or optimized for remote sensing-related image processing. The one or more selected VFMs are selected from the predetermined set of one or more VFMs. It is possible that all VFMs in the predetermined set are selected.
Preferably and advantageously, the one or more selected VFMs are fine-tuned with the one or more remote-sensing imagery datasets in a self-supervised manner.
In certain embodiments, the one or more respective image-processing operations performed by the one or more selected VFMs include one or more operations selected from an image classification operation, an object detection operation and an image segmentation operation.
In certain embodiments, the one or more selected VFMs include one or more machine-learning models selected from OV-DETR, Grounding DINO and the Segment Anything Model (SAM).
In certain embodiments, the one or more respective image-processing operations performed by the one or more selected VFMs include one or more operations for detecting or identifying one or more types of hydrological or geomorphological catastrophes.
The one or more types of hydrological or geomorphological catastrophes may include flooding, landsliding, or both.
Preferably and advantageously, the platform is set up in a cloud-computing environment.
In certain embodiments, the platform further comprises a command interface for interfacing with the user. The command interface is arranged to receive the query from the user, forward the received query to the prompt manager and forward the reply to the user. The command interface is an LLM-supported command interface. The command interface may be further arranged to support sending and receiving image files during chatting.
In certain embodiments, the LLM is selected to be ChatGPT.
In certain embodiments, the platform further comprises the LLM and the predetermined set of one or more VFMs such that the platform is self-contained with a visual language model.
A second aspect of the present disclosure is to provide a system for chatting with a user in natural language and performing an image-related task mentioned or hinted at by the user in a query during chatting. The image-related task is performed on one or more remote-sensing images.
The system comprises one or more computers networked together. The one or more computers are configured to execute a computing process of chatting with the user in natural language and performing the image-related task as set forth in any of the embodiments of the disclosed method.
Other aspects of the present disclosure are disclosed as illustrated by the embodiments hereinafter.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been depicted to scale.
The present disclosure is concerned with a remote-sensing data service platform that utilizes a visual language model, named SpaceGPT, for chatting with a user in natural language and processing remote-sensing images according to instructions issued or hinted at by the user in a chat made in natural language. Advantageously, automation in analyzing the remote-sensing images without intervention or assistance by human experts is achievable even if the user is a technical layman in computer vision and is not aware of the computer-vision operations necessary for analyzing the remote-sensing images to achieve a certain high-level purpose. As a result, advantages of time efficiency and cost reduction in analyzing the remote-sensing images are obtained. To realize these advantages, the data service platform utilizes an LLM to understand the user's intent in processing the images and to invoke relevant computer-vision AI models fine-tuned on remote-sensing imagery datasets, such as open-vocabulary object detection and segmentation models with state-of-the-art architectures, thereby enabling the user to extract useful insights through natural-language text queries.
Before proceeding further, it is instructive to provide a short description of how an ordinary user interacts with SpaceGPT in analyzing a remote-sensing image.
A first aspect of the present disclosure is to provide a computer-implemented method for chatting with a user in natural language and performing an image-related task mentioned or hinted at by the user in a query during chatting. The image-related task is performed on one or more remote-sensing images. The disclosed method utilizes SpaceGPT to understand user intent in processing the one or more remote-sensing images so as to determine the image-related task. The disclosed method is exemplarily illustrated with the aid of FIG. 4, which depicts a workflow 400 of exemplary steps of the disclosed method.
The step 410 is an initialization step. In the step 410, a platform 205 used for bidirectionally communicating with the user and performing the image-related task is set up in software. In particular, the platform 205 comprises a prompt manager 240 for communicating with and prompting an LLM 210.
The LLM 210 is capable of understanding natural language in order to identify the image-related task from the query (referenced as 291) made by the user in the chat (referenced as 380). The image-related task may be explicitly mentioned in the query 291, or may be "hinted" or "implied" in the query 291 such that the LLM 210 is required to first understand the user's intent from the words of the query 291 and then infer the image-related task. In addition, the LLM 210 is also capable of determining one or more image-processing actions to be performed on the one or more remote-sensing images (referenced as 382) for accomplishing the image-related task. One practical example of the LLM 210 is ChatGPT.
The prompt manager 240 is arranged (or programmed) to at least perform the following three procedures. First, the prompt manager 240 forwards the query 291 made by the user to the LLM 210. As a result of this forwarding of the query 291, the LLM 210 is triggered or caused to identify the image-related task from the query 291 and to determine one or more image-processing actions to be performed on the one or more remote-sensing images 382 for accomplishing the image-related task. Second, the prompt manager 240 prompts the LLM 210 to invoke a matched VFM selected from a predetermined set of one or more VFMs 220 to perform an individual image-processing action if the LLM 210 determines that the individual image-processing action matches an image-processing operation performable by the matched VFM. Third, the prompt manager 240 receives from the LLM 210 a reply 292 to the user on any outcome of the image-related task.
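By way of non-limiting illustration, the following Python sketch shows one possible realization of the dispatch logic of the prompt manager 240. All names (PromptManager, llm_complete, VFM_REGISTRY) and the action-line protocol are hypothetical, and the canned llm_complete stub merely stands in for a real call to the LLM 210.

```python
# Hypothetical sketch of the prompt manager 240; not the actual
# implementation. A real deployment would replace llm_complete with a
# call to the LLM 210 and the registry entries with fine-tuned VFMs 220.
from typing import Callable, Dict

# Registry mapping operation names to VFM callables, each taking an image
# reference and a text argument and returning a textual outcome.
VFM_REGISTRY: Dict[str, Callable[[str, str], str]] = {
    "object_detection": lambda image, text: f"boxes for '{text}' in {image}",
    "segmentation":     lambda image, text: f"masks for '{text}' in {image}",
    "classification":   lambda image, text: f"class labels for {image}",
}

TOOL_PROMPT = (
    "You may call these image-processing operations: {tools}.\n"
    "Either answer directly, or reply with a line of the form\n"
    "'ACTION: <operation> | <image> | <text argument>' to invoke one.\n"
    "User query: {query}"
)

def llm_complete(prompt: str) -> str:
    """Stand-in for the LLM 210; returns canned text so the sketch runs."""
    if "User query" in prompt:
        return "ACTION: object_detection | scene.tif | building"
    return "All buildings have been detected; see the annotated image."

class PromptManager:
    """Forwards the query to the LLM, invokes matched VFMs on demand and
    returns the LLM's reply 292 on the outcome of the image-related task."""

    def handle_query(self, query: str) -> str:
        prompt = TOOL_PROMPT.format(tools=", ".join(VFM_REGISTRY), query=query)
        response = llm_complete(prompt)
        while response.startswith("ACTION:"):        # LLM requested an action
            op, image, text = (s.strip() for s in
                               response[len("ACTION:"):].split("|"))
            outcome = VFM_REGISTRY[op](image, text)  # invoke the matched VFM
            # Feed the outcome back so the LLM can continue or compose a reply.
            response = llm_complete(f"Observation: {outcome}")
        return response

print(PromptManager().handle_query("Find all buildings in scene.tif"))
```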
As shown in FIG. 2, in practical situations it is often the case that the prompt manager 240 is operated by one service provider while the LLM 210 and the one or more VFMs 220 are operated by another service provider (or other service providers), or may even be open-access software applications. It is therefore reasonable that the platform 205 does not include the LLM 210 and the VFM(s) 220, as depicted in FIG. 2.
Usually, the platform 205 further comprises a command interface 230 for interfacing with the user. The command interface 230 is arranged to receive the query 291 from the user, forward the received query 291 to the prompt manager 240 and forward the reply 292 to the user. Note that the command interface 230 is an LLM-supported command interface, which allows textual communication in natural language between the user and SpaceGPT 200. Note also that generally, the command interface 230 is further arranged to support sending and receiving image files during chatting.
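Continuing the purely illustrative sketch above, and under the same hypothetical names, the command interface 230 could be as simple as a text chat loop that forwards queries and relays replies; a production interface would additionally handle image-file transfer.

```python
# Hypothetical command interface 230: a minimal chat loop that forwards
# each query to the prompt manager and relays the reply to the user.
def chat_loop(manager: "PromptManager") -> None:
    print("SpaceGPT ready. Type a query ('exit' to quit).")
    while True:
        query = input("> ").strip()
        if query.lower() in {"exit", "quit"}:
            break
        print(manager.handle_query(query))   # forward query, relay reply 292
```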
Referring back to FIG. 4, after the platform 205 is set up in the step 410, the platform 205 is used in the step 430 to receive the query 291 from the user, process the query 291 and forward the reply 292 to the user.
It is desirable to customize some selected VFMs in the predetermined set of one or more VFMs 220 to remote sensing-based image processing such that higher levels of performance in performing corresponding image-processing operations can be achieved when compared to non-customized versions of aforementioned selected VFMs. Before the platform 205 is used in the step 430 to process the query 291, or more practically, before the chat 380 is initiated, one or more selected VFMs 225 are fine-tuned with one or more remote-sensing imagery datasets (as training datasets) in the step 420, where the one or more selected VFMs 225 are selected from the predetermined set of one or more VFMs 220. Optionally, all of the one or more VFMs 220 in the predetermined set are selected to be the one or more selected VFMs 225. As an advantageous result, one or more respective image-processing operations performable by the one or more selected VFMs 225 are adapted to or optimized for remote sensing-related image processing. Note that the step 420 is also an initialization step.
Preferably and advantageously, the one or more selected VFMs 225 are chosen such that these VFMs 225 can be fine-tuned in a self-supervised manner. As a result, the time cost of preparing the one or more remote-sensing imagery datasets for the fine-tuning purpose can be made lower in comparison to the case of fully-supervised training.
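As a purely illustrative example of such self-supervised fine-tuning, the sketch below trains a toy backbone on unlabeled remote-sensing tiles with a masked-reconstruction objective in the spirit of masked autoencoders. The toy encoder and decoder, the mask granularity and the hyper-parameters are assumptions for illustration and are not the fine-tuning recipe of the present disclosure.

```python
# Self-contained sketch of self-supervised fine-tuning on unlabeled
# remote-sensing tiles via masked reconstruction. The tiny encoder and
# decoder below are stand-ins for a real VFM backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(32, 3, 3, padding=1)   # predicts the hidden pixels
opt = torch.optim.AdamW(list(encoder.parameters()) +
                        list(decoder.parameters()), lr=1e-4)

def ssl_step(tiles: torch.Tensor, mask_ratio: float = 0.6) -> float:
    """One training step: hide random 16x16 blocks of each tile and train
    the model to reconstruct them, so no manual labels are needed."""
    b, c, h, w = tiles.shape
    coarse = (torch.rand(b, 1, h // 16, w // 16) < mask_ratio).float()
    mask = F.interpolate(coarse, size=(h, w), mode="nearest")  # 1 = hidden
    recon = decoder(encoder(tiles * (1 - mask)))   # sees only visible pixels
    loss = (F.mse_loss(recon, tiles, reduction="none") * mask).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Usage with a random stand-in batch of four 256x256 RGB tiles:
print(ssl_step(torch.rand(4, 3, 256, 256)))
```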
It is also preferable that the one or more selected VFMs 225 are chosen such that the one or more respective image-processing operations performed by the one or more selected VFMs 225 include one or more operations selected from an image classification operation, an object detection operation and an image segmentation operation. The latter three operations are commonly used in remote sensing-based image processing.
Examples of VFMs 225 that perform any of these three operations include OV-DETR, Grounding DINO and SAM. In certain embodiments, the one or more selected VFMs 225 include one or more machine-learning models selected from OV-DETR, Grounding DINO and SAM. More details of these three machine-learning models are provided as follows.
OV-DETR is an end-to-end transformer-based open-vocabulary detector that formulates the learning objective as binary matching between input queries and their corresponding objects. It conditions the Transformer decoder on input embeddings obtained from a pre-trained vision-language model such as CLIP, and the CLIP component may be fine-tuned on remote-sensing datasets such as Satlas and DOTA. For more details on OV-DETR, see Y. ZANG, W. LI, K. ZHOU, C. HUANG and C. C. LOY, "Open-Vocabulary DETR with Conditional Matching," arXiv: 2203.11876 [cs.CV], 2022, the disclosure of which is incorporated by reference herein.
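The conditional-matching idea can be illustrated with the following toy sketch: the decoder's object queries are conditioned on a CLIP embedding of the queried class, and each query is scored by a binary does-it-match head. The dimensions, module sizes and heads are illustrative stand-ins rather than the published OV-DETR architecture.

```python
# Toy illustration of conditioning detection queries on a CLIP embedding,
# as in OV-DETR's conditional matching; not the published model.
import torch
import torch.nn as nn

d, num_queries = 256, 16
queries = nn.Parameter(torch.randn(num_queries, d))   # learned object queries
clip_proj = nn.Linear(512, d)                         # CLIP dim -> model dim
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=2)
match_head = nn.Linear(d, 1)                          # binary matching score
box_head = nn.Linear(d, 4)                            # box regression

def detect(image_feats: torch.Tensor, clip_emb: torch.Tensor):
    """image_feats: (1, HW, d) encoder memory; clip_emb: (512,) class embedding."""
    cond = clip_proj(clip_emb)                 # condition vector for the class
    h = decoder((queries + cond).unsqueeze(0), image_feats)
    return match_head(h).sigmoid(), box_head(h).sigmoid()  # scores, boxes

scores, boxes = detect(torch.randn(1, 64, d), torch.randn(512))
print(scores.shape, boxes.shape)   # (1, 16, 1) and (1, 16, 4)
```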
Grounding DINO is an open-set object detector that combines the transformer-based detector DINO with grounded pre-training to detect arbitrary objects from human inputs such as category names or referring expressions. It accepts an (image, text) pair as input, and outputs 900 (by default) object boxes with similarity scores across all input words. Grounding DINO adopts the following architecture: a text backbone, an image backbone, a feature enhancer, a language-guided query-selection module, and a cross-modality decoder. Grounding DINO serves the same object-detection task as OV-DETR, with comparable accuracy and generalization capability but a different fine-tuning method on remote-sensing datasets. Grounding DINO offers an alternative choice when the result of OV-DETR is not satisfactory. For more details on Grounding DINO, see S. LIU et al., "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection," arXiv: 2303.05499 [cs.CV], 2023, the disclosure of which is incorporated by reference herein.
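A hedged usage sketch follows, based on the inference helpers published in the open-source Grounding DINO repository; module paths and signatures may differ between versions, and the configuration, weight and image paths are examples only.

```python
# Usage sketch of Grounding DINO on a remote-sensing tile, following the
# open-source repository's inference helpers (paths are examples).
from groundingdino.util.inference import load_model, load_image, predict

model = load_model("groundingdino/config/GroundingDINO_SwinT_OGC.py",
                   "weights/groundingdino_swint_ogc.pth")
image_source, image = load_image("scene.tif")

# An (image, text) pair goes in; boxes with per-phrase similarity scores
# come out. Thresholds prune the 900 candidate boxes produced by default.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="airplane . storage tank . landslide scar",
    box_threshold=0.35,
    text_threshold=0.25,
)
print(list(zip(phrases, logits.tolist())))
```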
SAM is a promptable segmentation system with zero-shot generalization to unfamiliar objects and images, without the need for additional training. SAM produces high-quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It serves object-segmentation tasks on remote-sensing images. For more details on SAM, see A. KIRILLOV et al., "Segment Anything," arXiv: 2304.02643 [cs.CV], 2023, the disclosure of which is incorporated by reference herein.
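For illustration, the sketch below follows the documented interface of the open-source segment-anything package, using a box prompt such as one produced by an upstream detector; the checkpoint and image paths are examples, and signatures may vary across releases.

```python
# Usage sketch of SAM for promptable segmentation of a remote-sensing image.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image_rgb = np.array(Image.open("scene.png").convert("RGB"))  # HxWx3 uint8
predictor.set_image(image_rgb)

# Prompt with a box (e.g., one produced by the detector upstream) to get
# a high-quality mask for that object.
masks, scores, _ = predictor.predict(
    box=np.array([100, 150, 380, 420]),   # xyxy pixel coordinates
    multimask_output=False,
)
print(masks.shape, scores)                # (1, H, W) boolean mask and score
```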
Apart from the image classification operation, object detection operation and image segmentation operation, all of which are usually deemed to be fundamental image-processing tools, it is possible that one or more higher-level image-processing tools directly related to remote sensing are performed in the one or more selected VFMs 225. In certain embodiments, the one or more respective image-processing operations performed by the one or more selected VFMs 225 may include, or may further include, one or more operations for detecting or identifying one or more types of hydrological or geomorphological catastrophes. Specifically, each of the one or more operations is configured to detect or identify a respective type of hydrological or geomorphological catastrophe. Examples of hydrological or geomorphological catastrophes include a flooding event and a landslide event. In certain embodiments, the one or more types of hydrological or geomorphological catastrophes include flooding, landsliding, or both. Generally, the aforesaid one or more operations are image-analysis-based operations.
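By way of a purely illustrative example, such a higher-level operation could be composed from the fundamental segmentation tool as follows: segment the water extent in pre-event and post-event images and flag newly inundated area as flooding. The NDWI-threshold stand-in, band layout, pixel area and threshold below are assumptions for illustration, not the disclosed operation.

```python
# Illustrative flood-detection operation composed from a segmentation tool.
import numpy as np

def segment_water(image: np.ndarray) -> np.ndarray:
    """Stand-in for a fine-tuned segmentation VFM: a crude NDWI threshold,
    (green - NIR) / (green + NIR), assuming 4-band (B, G, R, NIR) tiles."""
    green, nir = image[..., 1].astype(float), image[..., 3].astype(float)
    return (green - nir) / (green + nir + 1e-6) > 0.1

def detect_flood(pre: np.ndarray, post: np.ndarray,
                 pixel_area_m2: float = 100.0,
                 threshold_m2: float = 1e5) -> tuple:
    """Flag a flooding event if newly inundated area exceeds a threshold."""
    new_water = segment_water(post) & ~segment_water(pre)
    flooded_m2 = float(new_water.sum()) * pixel_area_m2
    return flooded_m2 > threshold_m2, flooded_m2

# Usage with random stand-in tiles (512x512 pixels, 10 m resolution):
pre = np.random.randint(0, 255, (512, 512, 4), dtype=np.uint8)
post = np.random.randint(0, 255, (512, 512, 4), dtype=np.uint8)
print(detect_flood(pre, post))
```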
In summary, the one or more selected VFMs 225 obtained after fine-tuning form a remote-sensing fine-tuned AI algorithm library 225, which contains AI models that are fine-tuned on the one or more remote-sensing imagery datasets and are capable of completing common tasks on remote-sensing images, such as image classification, object detection and image segmentation.
Although it is shown in FIG. 2 that the LLM 210 and the one or more VFMs 220 reside outside the platform 205, in certain embodiments the platform further comprises the LLM 210 and the predetermined set of one or more VFMs 220 such that the resultant platform (referenced as 205a) is self-contained with a visual language model.
It is instructive to provide an example for illustrating a typical flow of operations performed in SpaceGPT 200 on handling the chat 380 and processing the query 291. Assume that the step 502 has been executed such that the remote-sensing fine-tuned AI algorithm library 225 has been established in the predetermined set of one or more VFMs 220. Refer to FIG. 5, which depicts this typical flow of operations.
Other implementation details of the disclosed method or SpaceGPT 200 are elaborated as follows.
Preferably and advantageously, the platform 205 (or 205a) is set up in a cloud-computing environment. As the platform 205/205a is cloud-native, it enables the user to input desired remote-sensing image data or a data stream from a cloud or from a user device (such as a tablet) owned by the user, and to perform machine-learning analytics, with visual ChatGPT-style interaction, to discover actionable insights.
Those skilled in the art will appreciate that the platform 205 (or 205a) may be implemented by appropriate programming on a computer system or equivalent. The computer system may be a computer or a group of plural computers networked together. An individual computer may be a general-purpose computer, a computer equipped with specialized AI-computation processor(s), a desktop computer, a mobile computing device, a computing server, a distributed server (as realized in a cloud-computing environment), or any computing device. The aforesaid plural computers may be networked by cables or optical fibers, or may be networked wirelessly, and may be networked over the Internet.
Image files of the one or more remote-sensing images 382 may be uploaded to SpaceGPT 200 by the user through the command interface 230. If the one or more remote-sensing images 382 are stored at a location (e.g., at a storage device accessible through the Internet), the user may only need to provide an address of the location to SpaceGPT 200.
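As a purely illustrative sketch of this behavior, the helper below returns a local path for a user-supplied image reference, downloading the file when an address (here, an HTTP(S) URL) is given instead of an uploaded file; the function name and cache directory are hypothetical.

```python
# Sketch of resolving a user-provided image reference: either a local file
# uploaded through the command interface 230 or an address of a storage
# location accessible through the Internet.
import os
import urllib.request

def resolve_image(ref: str, cache_dir: str = "uploads") -> str:
    """Return a local path for `ref`, downloading it if it is a URL."""
    if ref.startswith(("http://", "https://")):
        os.makedirs(cache_dir, exist_ok=True)
        local = os.path.join(cache_dir, os.path.basename(ref) or "image.tif")
        urllib.request.urlretrieve(ref, local)   # fetch from remote storage
        return local
    return ref                                    # already a local upload
```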
A second aspect of the present disclosure is to provide a system for chatting with a user in natural language and performing an image-related task mentioned or hinted at by the user in a query during chatting, where the image-related task is performed on one or more remote-sensing images.
The system comprises one or more computers networked together. The one or more computers are configured to execute a computing process of chatting with the user in natural language and performing the image-related task according to any of the embodiments of the disclosed method as elaborated above in describing the workflow 400.
Details related to the one or more computers and a network that networks these computers together are mentioned above.
As final remarks, advantages of SpaceGPT 200 are summarized as follows.
Firstly, SpaceGPT 200 enables high-accuracy open-vocabulary object detection and segmentation on remote-sensing satellite images. With state-of-the-art model architectures, the object detection and segmentation models are trained and fine-tuned on multiple large-scale remote-sensing imagery datasets in a self-supervised manner, which ideally achieves accuracy comparable to that of closed-set object detection and segmentation models. This enables the disclosed visual language model 200 to generalize beyond the limited number of base classes labeled during the training phase, so the user can detect and segment common objects in remote-sensing imagery beyond the base classes. That is, the user is likely to be able to successfully detect a target object even if a training dataset used for training SpaceGPT 200 does not contain the target object. SpaceGPT 200 thereby helps remote-sensing data analytics avoid the limitations imposed by the small number of high-quality remote-sensing image datasets, a scarcity that stems from the variability of remote-sensing images in scale, resolution and elevation angle.
Secondly, SpaceGPT 200 allows a query to be made in natural language, with little or no prior knowledge of remote-sensing satellite data characteristics. SpaceGPT 200 utilizes a cutting-edge multi-modality VLM with both text and image understanding, allowing the user to carry out object detection and segmentation tasks using natural-language text queries. The LLM 210 is able to understand user intent and invoke the corresponding computer-vision model(s), such as object detection or segmentation models. This functionality of SpaceGPT 200 offers a revolutionary big-data insight-generation experience for users, even if these users are new to remote-sensing image processing.
Thirdly, SpaceGPT 200 built on a cloud-native platform is able to help all users, regardless of their level of expertise, access satellite data and gain insights faster and more easily than ever. It also helps companies, governments and civil society use satellite imagery to discover actionable insights regarding important phenomena, such as deforestation, agriculture, climate change, biodiversity, and supply chains worldwide.
The present disclosure may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiment is therefore to be considered in all respects as illustrative and not restrictive. The scope of the invention is indicated by the appended claims rather than by the foregoing description, and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
This application claims the benefit of U.S. Provisional Patent Application Ser. No. 63/589,987 filed Oct. 12, 2023, the disclosure of which is incorporated by reference herein in its entirety.