The present application hereby claims priority under 35 U.S.C. § 119 to European patent application number EP19200381.2 filed Sep. 30, 2019, the entire contents of which are hereby incorporated herein by reference.
Embodiments of the invention generally relate to intra-hospital genetic profile similar search.
In healthcare, physicians often base their decisions on experience with previous patient cases. The paradigm is that similar patients will respond similarly to the same treatment. Physicians therefore try to remember and associate similar patient cases with the patient they currently care for in order to decide on further diagnostic procedures or on treatment options. Traditionally, the search for similar patients is up to the individual physician and therefore dependent on the physician's personal experience and network.
Recent years saw considerable effort in the healthcare business to automate and thereby objectify the search for similar patients. One approach in this regard is to automatically query databases for cases with similar diagnoses, similar medical findings and/or similar courses of diseases. While this certainly constitutes a promising first step, studies indicate that such criteria are often not specific enough to provide a reliable support for the physician. What is more, criteria such as prior diagnoses or findings are inherently subjective as well, as they are likewise based on human assessment.
The inventors have discovered that what is therefore needed is an objective measure for the similarity between two cases. In principle, genetic data sets could provide such an objective standard of comparison. In oncology, the usage of large genetic data sets is a common approach in treating advanced cancer patients to decide on further treatment options with targeted therapies. However, the evidence for many of the mutations found in a patient's tumor is weak and their influence on therapy response is often unclear. Only rarely is the interpreting physician able to use his/her knowledge of previous patients with similar genetic profiles to decide on a treatment option. This is due to the vast number of combinatorial mutation profiles and the small number of patients with a genetic tumor profile within one hospital. As a consequence, the data available within one healthcare organization is generally sparse, and it is very difficult to determine, through manual searching, all of the relevant data that might be applicable to a particular patient. Accordingly, conventional clinical environments are not generally capable of matching patient information on the basis of genetic data sets.
For these reasons, the inventors have discovered that it would be, in principle, desirable to extend the search for similar cases to incorporate a plurality of healthcare organizations. However, this is not straightforwardly possible, as data privacy regulations impose tight constraints on the freedom to exchange medical information across different institutions. In particular, this applies to genetic data sets. For instance, it may be forbidden to directly exchange genetic raw data. For the same reasons, it is generally not possible to directly access genetic databases across different organizations and query them for similar cases.
Accordingly, at least one embodiment of the present invention is directed to providing devices and/or methods which allow for an improved way of sharing medical information for similar patient cases. Particularly, at least one embodiment of the present invention is directed to providing devices and/or methods that allow for a swift, objective and reliable identification of similar patient cases while respecting existing legal restrictions in exchanging medical information, and that allow for a seamless integration of the ensuing processes into existing clinical workflows.
Embodiments of the present invention are directed to a method for sharing medical data sets, corresponding system, corresponding computer-program product and computer-readable storage medium. Some embodiments are the object of the claims and are set out below.
In the following, the technical solution according to at least one embodiment of the present invention is described with respect to the claimed apparatuses as well as with respect to the claimed methods. Features, advantages or alternative embodiments described herein can likewise be assigned to other claimed objects and vice versa. In other words, claims addressing the inventive method can be improved by features described or claimed with respect to the apparatuses. In this case, functional features of the method are embodied by objective units or elements of the apparatus, for instance.
According to a first embodiment, a computer-implemented method for sharing medical information is provided. The method comprises several steps. A first step is directed to receiving a first genomic data set, the first genomic data set being generated at a first site. A further step is directed to comparing the first genomic data set with a plurality of second genomic data sets stored in a database external to the first site. A further step is directed to identifying, amongst the second genomic data sets, one or more reference genomic data sets, on the basis of determining a similarity between the first genomic data set and each of the second genomic data sets. A further step is directed to dispatching a notification to the first site indicative of the one or more reference genomic data sets.
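The four steps of the method (receiving, comparing, identifying, dispatching) can be illustrated by a minimal sketch. Everything concrete in it is an assumption made for illustration only and is not taken from the embodiments: a genomic data set is modeled as a set of mutation labels, the similarity is taken to be the Jaccard index, and the names `GenomicDataSet`, `find_references`, `build_notification` and the threshold value are hypothetical. Any other similarity measure could be substituted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicDataSet:
    """Hypothetical record: an identifier, an originating site, and mutation labels."""
    record_id: str
    site: str
    mutations: frozenset


def jaccard(a, b):
    """Jaccard index of two sets; defined here as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_references(first, second_sets, threshold=0.5):
    """Compare `first` with every stored data set and keep those whose
    similarity reaches the (assumed) threshold, most similar first."""
    scored = [(jaccard(first.mutations, s.mutations), s) for s in second_sets]
    hits = [(score, s) for score, s in scored if score >= threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


def build_notification(first, hits):
    """Build a notification indicative of the reference data sets.
    It carries identifiers and scores only, not the stored mutation data."""
    return {
        "to_site": first.site,
        "query": first.record_id,
        "references": [
            {"record_id": s.record_id, "site": s.site, "similarity": round(score, 3)}
            for score, s in hits
        ],
    }
```

In this sketch the notification identifies the reference data sets without containing their content, which is one possible way of keeping the exchanged information small; whether this suffices for a given data privacy regime is outside the scope of the sketch.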
According to an embodiment, a system for sharing medical information is provided. The system comprises an interface unit, a database and a computing unit. The interface unit is configured to communicate with a first site for receiving a first genomic data set. Further, the interface unit is configured to communicate with the database. The database is configured to store a plurality of second genomic data sets, the database being external to the first site. The computing unit is configured to compare the first genomic data set with a fraction or all of the second genomic data sets and to identify, amongst these second genomic data sets, one or more reference genomic data sets, on the basis of determining a similarity between the first genomic data set and the respective second genomic data sets. Further, the computing unit is configured to dispatch a notification to the first site indicative of the reference genomic data sets via the interface unit.
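The division of labor between the interface unit, the database and the computing unit can be sketched as three small classes. This is an illustrative sketch under stated assumptions, not the claimed system: the class and method names, the representation of a genomic data set as a set of mutation labels, the optional `keep` predicate used to select a fraction of the stored data sets, and the placeholder measure `shared_fraction` are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Optional

Profile = FrozenSet[str]  # assumed representation: a set of mutation labels


class Database:
    """Stores second genomic data sets by record id; external to the first site."""

    def __init__(self) -> None:
        self._records: Dict[str, Profile] = {}

    def store(self, record_id: str, profile: Profile) -> None:
        self._records[record_id] = profile

    def select(self, keep: Optional[Callable[[Profile], bool]] = None) -> Dict[str, Profile]:
        """Return all records, or only the fraction accepted by `keep`."""
        return {rid: p for rid, p in self._records.items() if keep is None or keep(p)}


class InterfaceUnit:
    """Stand-in for the communication channel; it only records what was sent."""

    def __init__(self) -> None:
        self.sent: List[dict] = []

    def dispatch(self, site: str, payload: dict) -> None:
        self.sent.append({"site": site, **payload})


def shared_fraction(first: Profile, other: Profile) -> float:
    """Placeholder similarity: share of the first profile's labels found in `other`."""
    return len(first & other) / len(first) if first else 0.0


class ComputingUnit:
    """Compares, identifies reference data sets, and dispatches the notification."""

    def __init__(self, database: Database, interface: InterfaceUnit,
                 similarity: Callable[[Profile, Profile], float], threshold: float) -> None:
        self.database = database
        self.interface = interface
        self.similarity = similarity
        self.threshold = threshold

    def handle(self, site: str, first: Profile,
               keep: Optional[Callable[[Profile], bool]] = None) -> List[str]:
        candidates = self.database.select(keep)
        references = sorted(
            rid for rid, profile in candidates.items()
            if self.similarity(first, profile) >= self.threshold
        )
        self.interface.dispatch(site, {"references": references})
        return references
```

Passing the similarity function into the computing unit keeps the comparison logic exchangeable without touching the database or the interface unit.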
According to an embodiment, a computer program product is provided. The computer program product comprises program elements which induce a computing unit of a system for sharing medical information to perform the method as described above in connection with one or more embodiments, when the program elements are loaded into a memory of the computing unit.
According to a further embodiment, program elements are stored that are readable and executable by a computing unit of a system for sharing medical information, in order to perform steps of the method as described above in connection with one or more embodiments, when the program elements are executed by the computing unit.
At least one embodiment is directed to a computer-implemented method for sharing medical information, comprising:
receiving a first genomic data set, the first genomic data set being generated at a first site;
comparing the first genomic data set received with a plurality of second genomic data sets stored in a database external to the first site;
identifying, amongst the plurality of second genomic data sets, one or more reference genomic data sets, based upon determining a similarity between the first genomic data set received and the plurality of second genomic data sets; and
dispatching a notification to the first site indicative of the one or more reference genomic data sets identified.
At least one embodiment is directed to a system for sharing medical information, comprising:
an interface unit, configured to communicate with a first site, for receiving a first genomic data set from the first site;
a database, configured to store second genomic data sets, the database being external to the first site; and
a computing unit, external to the first site and configured to:
At least one embodiment is directed to a non-transitory computer program product storing program elements which induce a computing unit of a system for sharing medical information to perform the method of an embodiment, when the program elements are loaded into a memory of the computing unit.
At least one embodiment is directed to a non-transitory computer-readable medium storing program elements, readable and executable by a computing unit of a system for sharing medical information, to perform the method of an embodiment, when the program elements are executed by the computing unit.
Characteristics, features and advantages of the above described invention, as well as the manner in which they are achieved, become clearer and more understandable in the light of the following description and embodiments, which will be described in detail with respect to the figures. The following description does not limit the invention to the contained embodiments. Same components or parts can be labeled with the same reference signs in different figures. In general, the figures are not drawn to scale. In the following:
The drawings are to be regarded as being schematic representations and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose become apparent to a person skilled in the art. Any connection or coupling between functional blocks, devices, components, or other physical or functional units shown in the drawings or described herein may also be implemented by an indirect connection or coupling. A coupling between components may also be established over a wireless connection. Functional blocks may be implemented in hardware, firmware, software, or a combination thereof.
Various example embodiments will now be described more fully with reference to the accompanying drawings in which only some example embodiments are shown. Specific structural and functional details disclosed herein are merely representative for purposes of describing example embodiments. Example embodiments, however, may be embodied in various different forms, and should not be construed as being limited to only the illustrated embodiments. Rather, the illustrated embodiments are provided as examples so that this disclosure will be thorough and complete, and will fully convey the concepts of this disclosure to those skilled in the art. Accordingly, known processes, elements, and techniques, may not be described with respect to some example embodiments. Unless otherwise noted, like reference characters denote like elements throughout the attached drawings and written description, and thus descriptions will not be repeated. The present invention, however, may be embodied in many alternate forms and should not be construed as limited to only the example embodiments set forth herein.
It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, components, regions, layers, and/or sections, these elements, components, regions, layers, and/or sections, should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments of the present invention. As used herein, the term “and/or,” includes any and all combinations of one or more of the associated listed items. The phrase “at least one of” has the same meaning as “and/or”.
Spatially relative terms, such as “beneath,” “below,” “lower,” “under,” “above,” “upper,” and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as “below,” “beneath,” or “under,” other elements or features would then be oriented “above” the other elements or features. Thus, the example terms “below” and “under” may encompass both an orientation of above and below. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly. In addition, when an element is referred to as being “between” two elements, the element may be the only element between the two elements, or one or more other intervening elements may be present.
Spatial and functional relationships between elements (for example, between modules) are described using various terms, including “connected,” “engaged,” “interfaced,” and “coupled.” Unless explicitly described as being “direct,” when a relationship between first and second elements is described in the above disclosure, that relationship encompasses a direct relationship where no other intervening elements are present between the first and second elements, and also an indirect relationship where one or more intervening elements are present (either spatially or functionally) between the first and second elements. In contrast, when an element is referred to as being “directly” connected, engaged, interfaced, or coupled to another element, there are no intervening elements present. Other words used to describe the relationship between elements should be interpreted in a like fashion (e.g., “between,” versus “directly between,” “adjacent,” versus “directly adjacent,” etc.).
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention. As used herein, the singular forms “a,” “an,” and “the,” are intended to include the plural forms as well, unless the context clearly indicates otherwise. As used herein, the terms “and/or” and “at least one of” include any and all combinations of one or more of the associated listed items. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Also, the term “example” is intended to refer to an example or illustration.
When an element is referred to as being “on,” “connected to,” “coupled to,” or “adjacent to,” another element, the element may be directly on, connected to, coupled to, or adjacent to, the other element, or one or more other intervening elements may be present. In contrast, when an element is referred to as being “directly on,” “directly connected to,” “directly coupled to,” or “immediately adjacent to,” another element there are no intervening elements present.
It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which example embodiments belong. It will be further understood that terms, e.g., those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Before discussing example embodiments in more detail, it is noted that some example embodiments may be described with reference to acts and symbolic representations of operations (e.g., in the form of flow charts, flow diagrams, data flow diagrams, structure diagrams, block diagrams, etc.) that may be implemented in conjunction with units and/or devices discussed in more detail below. Although discussed in a particular manner, a function or operation specified in a specific block may be performed differently from the flow specified in a flowchart, flow diagram, etc. For example, functions or operations illustrated as being performed serially in two consecutive blocks may actually be performed simultaneously, or in some cases be performed in reverse order. Although the flowcharts describe the operations as sequential processes, many of the operations may be performed in parallel, concurrently or simultaneously. In addition, the order of operations may be re-arranged. The processes may be terminated when their operations are completed, but may also have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, subprograms, etc.
Specific structural and functional details disclosed herein are merely representative for purposes of describing example embodiments of the present invention. This invention may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
Units and/or devices according to one or more example embodiments may be implemented using hardware, software, and/or a combination thereof. For example, hardware devices may be implemented using processing circuitry such as, but not limited to, a processor, Central Processing Unit (CPU), a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a System-on-Chip (SoC), a programmable logic unit, a microprocessor, or any other device capable of responding to and executing instructions in a defined manner. Portions of the example embodiments and corresponding detailed description may be presented in terms of software, or algorithms and symbolic representations of operation on data bits within a computer memory. These descriptions and representations are the ones by which those of ordinary skill in the art effectively convey the substance of their work to others of ordinary skill in the art. An algorithm, as the term is used here, and as it is used generally, is conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of optical, electrical, or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, or as is apparent from the discussion, terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device/hardware, that manipulates and transforms data represented as physical, electronic quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
In this application, including the definitions below, the term ‘module’ or the term ‘controller’ may be replaced with the term ‘circuit.’ The term ‘module’ may refer to, be part of, or include processor hardware (shared, dedicated, or group) that executes code and memory hardware (shared, dedicated, or group) that stores code executed by the processor hardware.
The module may include one or more interface circuits. In some examples, the interface circuits may include wired or wireless interfaces that are connected to a local area network (LAN), the Internet, a wide area network (WAN), or combinations thereof. The functionality of any given module of the present disclosure may be distributed among multiple modules that are connected via interface circuits. For example, multiple modules may allow load balancing. In a further example, a server (also known as remote, or cloud) module may accomplish some functionality on behalf of a client module.
Software may include a computer program, program code, instructions, or some combination thereof, for independently or collectively instructing or configuring a hardware device to operate as desired. The computer program and/or program code may include program or computer-readable instructions, software components, software modules, data files, data structures, and/or the like, capable of being implemented by one or more hardware devices, such as one or more of the hardware devices mentioned above. Examples of program code include both machine code produced by a compiler and higher level program code that is executed using an interpreter.
For example, when a hardware device is a computer processing device (e.g., a processor, Central Processing Unit (CPU), a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a microprocessor, etc.), the computer processing device may be configured to carry out program code by performing arithmetical, logical, and input/output operations, according to the program code. Once the program code is loaded into a computer processing device, the computer processing device may be programmed to perform the program code, thereby transforming the computer processing device into a special purpose computer processing device. In a more specific example, when the program code is loaded into a processor, the processor becomes programmed to perform the program code and operations corresponding thereto, thereby transforming the processor into a special purpose processor.
Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical or virtual equipment, or computer storage medium or device, capable of providing instructions or data to, or being interpreted by, a hardware device. The software also may be distributed over network coupled computer systems so that the software is stored and executed in a distributed fashion. In particular, for example, software and data may be stored by one or more computer readable recording mediums, including the tangible or non-transitory computer-readable storage media discussed herein.
Even further, any of the disclosed methods may be embodied in the form of a program or software. The program or software may be stored on a non-transitory computer readable medium and is adapted to perform any one of the aforementioned methods when run on a computer device (a device including a processor). Thus, the non-transitory, tangible computer readable medium, is adapted to store information and is adapted to interact with a data processing facility or computer device to execute the program of any of the above mentioned embodiments and/or to perform the method of any of the above mentioned embodiments.
Example embodiments may be described with reference to acts and symbolic representations of operations (e.g., in the form of flow charts, flow diagrams, data flow diagrams, structure diagrams, block diagrams, etc.) that may be implemented in conjunction with units and/or devices discussed in more detail below. Although discussed in a particular manner, a function or operation specified in a specific block may be performed differently from the flow specified in a flowchart, flow diagram, etc. For example, functions or operations illustrated as being performed serially in two consecutive blocks may actually be performed simultaneously, or in some cases be performed in reverse order.
According to one or more example embodiments, computer processing devices may be described as including various functional units that perform various operations and/or functions to increase the clarity of the description. However, computer processing devices are not intended to be limited to these functional units. For example, in one or more example embodiments, the various operations and/or functions of the functional units may be performed by other ones of the functional units. Further, the computer processing devices may perform the operations and/or functions of the various functional units without subdividing the operations and/or functions of the computer processing units into these various functional units.
Units and/or devices according to one or more example embodiments may also include one or more storage devices. The one or more storage devices may be tangible or non-transitory computer-readable storage media, such as random access memory (RAM), read only memory (ROM), a permanent mass storage device (such as a disk drive), solid state (e.g., NAND flash) device, and/or any other like data storage mechanism capable of storing and recording data. The one or more storage devices may be configured to store computer programs, program code, instructions, or some combination thereof, for one or more operating systems and/or for implementing the example embodiments described herein. The computer programs, program code, instructions, or some combination thereof, may also be loaded from a separate computer readable storage medium into the one or more storage devices and/or one or more computer processing devices using a drive mechanism. Such separate computer readable storage medium may include a Universal Serial Bus (USB) flash drive, a memory stick, a Blu-ray/DVD/CD-ROM drive, a memory card, and/or other like computer readable storage media. The computer programs, program code, instructions, or some combination thereof, may be loaded into the one or more storage devices and/or the one or more computer processing devices from a remote data storage device via a network interface, rather than via a local computer readable storage medium. Additionally, the computer programs, program code, instructions, or some combination thereof, may be loaded into the one or more storage devices and/or the one or more processors from a remote computing system that is configured to transfer and/or distribute the computer programs, program code, instructions, or some combination thereof, over a network. 
The remote computing system may transfer and/or distribute the computer programs, program code, instructions, or some combination thereof, via a wired interface, an air interface, and/or any other like medium.
The one or more hardware devices, the one or more storage devices, and/or the computer programs, program code, instructions, or some combination thereof, may be specially designed and constructed for the purposes of the example embodiments, or they may be known devices that are altered and/or modified for the purposes of example embodiments.
A hardware device, such as a computer processing device, may run an operating system (OS) and one or more software applications that run on the OS. The computer processing device also may access, store, manipulate, process, and create data in response to execution of the software. For simplicity, one or more example embodiments may be exemplified as a computer processing device or processor; however, one skilled in the art will appreciate that a hardware device may include multiple processing elements or processors and multiple types of processing elements or processors. For example, a hardware device may include multiple processors or a processor and a controller. In addition, other processing configurations are possible, such as parallel processors.
The computer programs include processor-executable instructions that are stored on at least one non-transitory computer-readable medium (memory). The computer programs may also include or rely on stored data. The computer programs may encompass a basic input/output system (BIOS) that interacts with hardware of the special purpose computer, device drivers that interact with particular devices of the special purpose computer, one or more operating systems, user applications, background services, background applications, etc. As such, the one or more processors may be configured to execute the processor executable instructions.
The computer programs may include: (i) descriptive text to be parsed, such as HTML (hypertext markup language) or XML (extensible markup language), (ii) assembly code, (iii) object code generated from source code by a compiler, (iv) source code for execution by an interpreter, (v) source code for compilation and execution by a just-in-time compiler, etc. As examples only, source code may be written using syntax from languages including C, C++, C#, Objective-C, Haskell, Go, SQL, R, Lisp, Java®, Fortran, Perl, Pascal, Curl, OCaml, Javascript®, HTML5, Ada, ASP (active server pages), PHP, Scala, Eiffel, Smalltalk, Erlang, Ruby, Flash®, Visual Basic®, Lua, and Python®.
Further, at least one embodiment of the invention relates to the non-transitory computer-readable storage medium including electronically readable control information (processor executable instructions) stored thereon, configured such that when the storage medium is used in a controller of a device, at least one embodiment of the method may be carried out.
The computer readable medium or storage medium may be a built-in medium installed inside a computer device main body or a removable medium arranged so that it can be separated from the computer device main body. The term computer-readable medium, as used herein, does not encompass transitory electrical or electromagnetic signals propagating through a medium (such as on a carrier wave); the term computer-readable medium is therefore considered tangible and non-transitory. Non-limiting examples of the non-transitory computer-readable medium include, but are not limited to, rewriteable non-volatile memory devices (including, for example, flash memory devices, erasable programmable read-only memory devices, or mask read-only memory devices); volatile memory devices (including, for example, static random access memory devices or dynamic random access memory devices); magnetic storage media (including, for example, an analog or digital magnetic tape or a hard disk drive); and optical storage media (including, for example, a CD, a DVD, or a Blu-ray Disc). Examples of the media with a built-in rewriteable non-volatile memory include, but are not limited to, memory cards; and examples of media with a built-in ROM include, but are not limited to, ROM cassettes; etc. Furthermore, various information regarding stored images, for example, property information, may be stored in any other form, or it may be provided in other ways.
The term code, as used above, may include software, firmware, and/or microcode, and may refer to programs, routines, functions, classes, data structures, and/or objects. Shared processor hardware encompasses a single microprocessor that executes some or all code from multiple modules. Group processor hardware encompasses a microprocessor that, in combination with additional microprocessors, executes some or all code from one or more modules. References to multiple microprocessors encompass multiple microprocessors on discrete dies, multiple microprocessors on a single die, multiple cores of a single microprocessor, multiple threads of a single microprocessor, or a combination of the above.
Shared memory hardware encompasses a single memory device that stores some or all code from multiple modules. Group memory hardware encompasses a memory device that, in combination with other memory devices, stores some or all code from one or more modules.
The term memory hardware is a subset of the term computer-readable medium.
The apparatuses and methods described in this application may be partially or fully implemented by a special purpose computer created by configuring a general purpose computer to execute one or more particular functions embodied in computer programs. The functional blocks and flowchart elements described above serve as software specifications, which can be translated into the computer programs by the routine work of a skilled technician or programmer.
Although described with reference to specific examples and drawings, modifications, additions and substitutions of example embodiments may be variously made according to the description by those of ordinary skill in the art. For example, the described techniques may be performed in an order different from that of the methods described, and/or components such as the described system, architecture, devices, circuit, and the like, may be connected or combined in a manner different from the above-described methods, or results may be appropriately achieved by other components or equivalents.
According to a first embodiment, a computer-implemented method for sharing medical information is provided. The method comprises several steps. A first step is directed to receiving a first genomic data set, the first genomic data set being generated at a first site. A further step is directed to comparing the first genomic data set with a plurality of second genomic data sets stored in a database external to the first site. A further step is directed to identifying, amongst the second genomic data sets, one or more reference genomic data sets, on the basis of determining a similarity between the first genomic data set and each of the second genomic data sets. A further step is directed to dispatching a notification to the first site indicative of the one or more reference genomic data sets.
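By way of a non-limiting illustration, the four steps above may be sketched in Python as follows. The class GenomicDataSet, the functions jaccard, find_references and build_notification, the representation of a genomic data set as a set of mutated gene names, the Jaccard index as similarity measure and the threshold of 0.5 are all assumptions made for this sketch; none of them is prescribed by the embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicDataSet:
    set_id: str
    site: str
    mutated_genes: frozenset  # hypothetical processed representation


def jaccard(a, b):
    """Similarity of two gene sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_references(first, database, threshold=0.5):
    """Compare the first data set with each stored second data set and
    keep those whose similarity reaches the (assumed) threshold."""
    return [second for second in database
            if jaccard(first.mutated_genes, second.mutated_genes) >= threshold]


def build_notification(first, references):
    """Notification addressed to the first site, indicating the references."""
    return {
        "to_site": first.site,
        "query": first.set_id,
        "references": [ref.set_id for ref in references],
    }


# Illustrative data only; gene sets and site names are made up for the sketch.
database = [
    GenomicDataSet("S1", "site_B", frozenset({"TP53", "KRAS", "EGFR"})),
    GenomicDataSet("S2", "site_C", frozenset({"BRCA1"})),
]
first = GenomicDataSet("Q1", "site_A", frozenset({"TP53", "KRAS"}))
notification = build_notification(first, find_references(first, database))
```

In this sketch only the identifiers of the reference genomic data sets are returned to the first site, which mirrors the idea that the stored data sets themselves need not leave the external database.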
In other words, it is an idea of at least one embodiment of the present invention to base the search for similar cases on a comparison of genetic data sets. If there is a match between two genomic data sets, a corresponding notification is generated, thereby sharing medical information. The matching involves the comparison with genomic data sets from a central knowledge database in which a plurality of genomic data sets is stored for comparison. The provision of a central database enables healthcare providers to upload genomic data sets to an external matching system which can more readily be configured to satisfy data privacy regulations when dealing with genomic data. In particular, by collecting genomic data in a central database, the access to the data can be tightly controlled while still enabling the exchange of data. While healthcare providers may not be allowed to directly access external databases for retrieving similar patient cases, they may still send the genomic data sets to an external facility comprising the database and providing means for comparing and matching two genomic data sets.
A genomic data set generally relates to genomic data of a patient. Genomic data may, for instance, be obtained by a biopsy procedure involving extraction of sample cells or tissues for examination to determine the presence or extent of a disease by determining the genomic state. The genomic state may relate to the DNA or RNA sequence or the chromosomal state. In oncology, another common way of obtaining a genomic data set is to analyze liquid patient samples for tumor DNA/RNA and extract the corresponding DNA or RNA sequence and/or chromosomal state. The extraction of the genetic sequence from a patient sample may involve known techniques such as sequencing, genotyping, the usage of microarray platforms including RNA or mRNA expression, or the usage of polymerase chain reaction (PCR) platforms, copy-number variation (CNV) platforms, (whole) genome sequencing platforms or the like. Thus, first and second genomic data set may relate to raw genomic data such as the DNA and/or RNA sequences. Further, genomic data comprised in first and second genomic data set may be in the form of gene expression levels, gene states, chromosomal states or the like. What is more (and as will be further detailed below), first and second genomic data set may also relate to already processed genomic data of a patient. “Processed” may mean that one or more genomic features and/or characteristic values have been derived (i.e., extracted or calculated) from the genomic raw data (i.e., the gene sequence). The genomic features may relate to high-level information derived from the genomic data sets (as will be further detailed below). The genomic features may be selected or tailored according to the clinical question at hand. For oncology related questions, the genomic features may, for instance, rely on identifying mutations in the genomic data. 
Accordingly, corresponding genomic features might relate to the genomic regions of mutations in the genomic data sets, mutation hotspots in the genomic data sets, the effect of mutation in the genomic data sets (gain or loss), and/or the clinical actionability of mutations in the genomic data sets.
Moreover, “processed” may mean that the genomic data underlying first and second genomic data sets underwent a filtering step. In this regard, information that does not identify a required piece of information such as a chromosomal DNA copy loss or gain may have been filtered out prior to forwarding the first genomic data set and/or storing the second genomic data sets in the database. As such, filtered genomic data may be created that generally only includes those regions of interest that may contain a chromosomal abnormality or alteration. In addition, first and second genomic data sets may comprise supplementary information such as information pertaining to the disease type and state of the patient, further patient information such as age or sex, the patient's health record, therapy and medication information, information about the practicing physician or the like. The supplementary information may be appended to the genomic data sets as metadata. Thus, summarizing the above, first and second genomic data sets may relate to raw or processed genomic data and may comprise metadata and supplementary information. Genomic data sets may, for instance, comprise plain gene sequences, information about gene mutations, gene associations, gain or loss, gene expression levels or gene states or, in general, information about genomic testing.
The first site may be seen as relating to a first clinical organization or environment from where the first genomic data set originates. As such, the first site may be embodied by a hospital, clinical consortium of a plurality of hospitals, a practice, a gene or cancer center, gene laboratory or the like. In general, the second genomic data sets have not been generated at the first site, but at sites different than the first site (i.e., at other clinical organizations) and have been previously uploaded to the database from these other sites.
The database is a database of genomic information or a genomic knowledge database. It may include any storage medium or organizational unit for storing and accessing genomic data sets and any supplementary information associated with the second genomic data sets. The database may include a plurality of individual memory units and repositories and may, in particular, include distributed data architectures. The database may include a variety of data records accessed by an appropriate interface to manage delivery of genomic data sets and supplementary information. The database being “external” to the first site may mean that it is not within the premises of the first site. In other words, the database may be located at a site different from the first site. Notably, the “location” of the database may also relate to a cloud platform, the server architecture of which is likewise external to the first site. The database may thus be seen as being physically separated from the first site. Further, it may be configured such that it cannot be accessed from the first site (or, generally, from the outside for that matter). The database may thus provide a platform for archiving sensitive genomic information from a plurality of institutions (sites).
The step of comparing may comprise accessing the database and retrieving each of the stored second genomic data sets for comparison to the first genomic data set. However, the step of comparing may further comprise selecting a sub-group from the second genomic data sets for the ensuing identification of reference genomic data sets.
The step of identifying one or more reference genomic data sets is directed to identifying those genomic data sets amongst the second genomic data sets that are similar to the first genomic data set. The similarity may amount to a plain similarity in gene sequences but may also include similar (higher-level) genomic features such as similar expression levels, similar gene mutation signatures, similar gain or loss, similar gene associations and so forth. Moreover, any of the available metadata (by way of the supplementary information) may be factored in. For instance, the identification of similar genomic data sets may involve retrieving genomic data sets from patients of similar age, the same sex, and/or who underwent similar treatment. In other words, patient context information may be used to perform a matching process for identifying similar genomic data sets. In general, the step of identifying may comprise evaluating one or more similarity criteria. Mathematically, this may include extracting, from first and second genomic data sets, one or more characteristic values according to the one or more similarity criteria, which characteristic values may then be compared to identify similar genomic data sets. The characteristic values may be aggregated to a score for each genomic data set, wherein individual characteristic values may be assigned different weights. Another expression for such a procedure would be applying a similarity metric to the genomic data sets (which similarity metric comprises a plurality of similarity criteria).
In other words, the step of identifying may comprise scoring first and second genomic data sets according to one or more similarity criteria (i.e., calculating a score for each genomic data set based on one or more similarity criteria). The similarity between two genomic data sets may be conceived as a “distance” between two genomic data sets in terms of one or more similarity criteria. The smaller the distance, the higher the similarity. If a score is calculated for each of the genomic data sets, the distance may be conceived as the difference between the scores of two genomic data sets.
The distance may equivalently be expressed in terms of a “degree of similarity”, where a smaller distance corresponds to a higher degree of similarity. Accordingly, the step of identifying may amount to identifying, amongst the second genomic data sets, reference genomic data sets having a degree of similarity to the first genomic data set above a certain value or threshold (i.e., a distance below a corresponding threshold). The threshold for the degree of similarity may be seen as a figurative threshold. However, the step of identifying may likewise comprise setting a predetermined threshold in this regard (either automatically, semi-automatically or by a user). In addition, the threshold may be seen as an appropriate margin of similarity around one or more characteristic values determined for the first genomic data set for quantifying the similarity to other genomic data sets.
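As a minimal sketch of the scoring described above, the following Python example aggregates characteristic values into a weighted score per genomic data set and treats the absolute difference of two scores as their distance. The criterion names "tmb" and "hotspots", the weights, the numeric values and the distance threshold are hypothetical and serve only to illustrate the arithmetic.

```python
def score(values, weights):
    """Aggregate characteristic values into one score using per-criterion weights."""
    return sum(weights[name] * values[name] for name in weights)


def identify(first_values, second_sets, weights, max_distance):
    """Return (identifier, distance) pairs for all second data sets whose
    score lies within max_distance of the first data set's score,
    ordered from most to least similar."""
    first_score = score(first_values, weights)
    hits = []
    for set_id, values in second_sets.items():
        distance = abs(first_score - score(values, weights))
        if distance <= max_distance:
            hits.append((set_id, distance))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical similarity criteria and weights.
weights = {"tmb": 0.5, "hotspots": 2.0}
first_values = {"tmb": 10.0, "hotspots": 3.0}      # score 11.0
second_sets = {
    "A": {"tmb": 12.0, "hotspots": 3.0},           # score 12.0, distance 1.0
    "B": {"tmb": 40.0, "hotspots": 1.0},           # score 22.0, distance 11.0
}
references = identify(first_values, second_sets, weights, max_distance=2.0)
```

A single scalar score discards information, since quite different data sets can arrive at the same score; comparing the characteristic values individually, or as a vector, avoids this at the cost of a more elaborate metric.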
The notification to the first site notifies the first site that one or more reference genomic data sets have been found. It enables a user at the first site to initiate further steps in order to take advantage of that information. The notification may comprise additional information that allows the user to contact colleagues associated with the one or more reference genomic data sets. To this end, the notification may comprise an indication of the site of origin and/or the responsible physicians of the one or more reference genomic data sets. The notification may comprise the therapy, the treatment response, and/or the genetic tumor profile corresponding to the reference genomic data set. The notification may be dispatched via a dedicated communication channel. The dedicated communication channel may be further configured to permit direct communication between the respective physicians, for instance, by exchanging text messages or by setting up telephone and/or video conferences. Further, the notification may contain a link (e.g., in the form of a URL) for one-time access to the reference genomic data sets and the corresponding supplementary information in the database.
The steps according to the first embodiment preferably happen external to the first site. In other words, the steps of receiving, comparing, identifying, and dispatching are carried out externally to the first site. These steps may be complemented by corresponding steps happening at the first site. These steps may comprise uploading the first genomic data set (to the database or corresponding system external to the first site) and receiving the notification. Further optional steps happening at the first site may be: generating the first genomic data set, selecting the first genomic data set for upload, and/or pre-processing the first genomic data set prior to uploading it. Of note, these steps may likewise form part of the method according to the first embodiment.
In summary, the above steps synergistically contribute to an improved way of automatically finding similar cases and thereby facilitate an efficient exchange of medical information for similar patient cases. Specifically, the usage of genomic data sets for identifying similar cases introduces an objective measure for matching similar cases. This is because the genomic data sets as such do not depend on subjective diagnosis steps. The usage of a database which collects comparative genomic data sets across a plurality of institutions (sites) makes it possible to considerably increase the amount of comparative data. Since the number of combinatorial similarity criteria in connection with genomic data sets is huge, the clustering of comparative data from a plurality of institutions is one of the preconditions for efficiently using genomic data sets for similar patient searches. What is more, the automated comparison and identification of similar cases according to the above embodiment greatly facilitates the procedure, as any manual searching can be dispensed with.
Moreover, the usage of the central database as a platform for identifying similar cases provides a way of sharing medical information in highly regulated environments. Through the intermediation of the database, it is not required to directly exchange genomic data sets between institutions and/or to grant direct access to local databases storing sensitive patient information. The usage of a central database is complemented with a notification step which informs participating users of similar patient cases and at the same time makes it possible to channel and regulate the information content forwarded to the users. In particular, this allows meaningful feedback about similar patient cases to be provided and at the same time ensures that the procedure is in line with all relevant data privacy regulations. What is more, the method according to the first embodiment readily integrates into clinical workflows, as the actual process steps are outsourced and performed automatically.
According to an embodiment, the method further comprises the step of introducing (or adopting) the first genomic data set in the database.
The step of introducing archives the first genomic data set in the database. With that, the first genomic data set may be used as comparative genomic data set (i.e., second genomic data set) for future cases. Introducing may further comprise storing any supplementary information provided together with the first genomic data set. As mentioned, the supplementary information may be appended to the genomic data set (as metadata) or provided in the form of separate files. Upon receipt, the first genomic data set may be assigned a unique identifier and all supplementary information may be assigned the same unique identifier unambiguously linking it to the respective first genomic data set. The unique identifier may be an accession number or any other suitable electronic identifier.
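The linking of a genomic data set and its supplementary information via a shared unique identifier may, for instance, be sketched as follows. The dictionary-based stand-in for the database, the function name introduce and the use of a random UUID as accession number are assumptions of this sketch rather than features of the described database.

```python
import uuid


def introduce(database, genomic_data, supplementary):
    """Archive a received data set and its supplementary information
    under one freshly generated identifier and return that identifier."""
    accession = uuid.uuid4().hex  # hypothetical choice of unique identifier
    database["genomic"][accession] = genomic_data
    database["supplementary"][accession] = supplementary
    return accession


# In-memory stand-in for the database, for illustration only.
database = {"genomic": {}, "supplementary": {}}
accession = introduce(
    database,
    genomic_data={"mutated_genes": ["TP53", "KRAS"]},
    supplementary={"tumor_type": "lung", "age_group": "60-69"},
)
```

Because both records are stored under the same key, the supplementary information can be retrieved for any genomic data set that is later identified as a reference.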
By including the first genomic data set (and any supplementary information) into the database alongside the second genomic data sets, the shared knowledge comprised in the system is enhanced and the similar patient search is rendered more efficient for subsequent queries.
According to an embodiment, first and/or second genomic data sets are anonymized, or, in other words, do not comprise any personal information pertaining to the patient.
“Anonymized” may mean that first and second genomic data sets do not reveal or contain any information from which the patient can be identified (e.g., the patient's name, address, photographs and the like). According to an embodiment, the method may further comprise the step of anonymizing the first genomic data set. The step of anonymizing may comprise filtering out any personal information with which the patient can be identified. The step of anonymizing may be carried out either at the first site or upon receiving the first genomic data set, i.e., external to the first site.
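A minimal sketch of the filtering step, assuming the genomic data set is held as a flat record of named fields, could look as follows. The field names in PERSONAL_FIELDS and the function name anonymize are hypothetical; which fields count as personal information would depend on the applicable regulations.

```python
# Hypothetical list of fields treated as personal information.
PERSONAL_FIELDS = {"name", "address", "photograph", "date_of_birth"}


def anonymize(record):
    """Return a copy of the record without the personal fields.
    The input record is left unchanged."""
    return {key: value for key, value in record.items()
            if key not in PERSONAL_FIELDS}


record = {
    "name": "Jane Doe",
    "address": "1 Example Street",
    "tumor_type": "lung",
    "mutated_genes": ["TP53", "KRAS"],
}
anonymized = anonymize(record)
```

Removing such fields does not by itself anonymize a raw gene sequence, which can act as a genetic fingerprint; this sketch only covers the explicit personal information mentioned above.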
By anonymizing the genomic data sets, it can be safely ruled out that the information contained in the genomic data sets or in the associated supplementary information can be traced back to the corresponding patient.
According to an embodiment, the database is a local database located at a second site different than the first site. Consequently, the first genomic data set is received at the second site, and the steps of comparing, identifying and dispatching are carried out at the second site.
In other words, this embodiment covers an implementation according to which the database sits at a local healthcare organization which provides its services to other institutions. In this respect, the database is a local database within the premises of the second site. Such a configuration may be beneficial if the access to the database needs to be tightly controlled. For instance, the interface to the database may be configured such that the database can only be accessed from within the second site without any direct connection to external networks. Like the first site, the second site may be a hospital, clinical consortium of a plurality of hospitals, a practice, a gene center or the like. The second genomic data sets contained in the database may either stem exclusively from the second site or originate from a plurality of external sites.
According to an embodiment, the database is configured as a cloud platform and the first genomic data set is received at the cloud platform, with the steps of comparing, identifying and dispatching being carried out at the cloud platform.
The embodiment constitutes a second example implementation of the database. Implementing the database as a cloud platform has the advantage that it can be more readily accessed from the sites participating in the patient similarity search program. Further, the entire communication between the individual sites (e.g., once a reference genomic data set has been found) may then be routed via the cloud platform. This may reduce the operational burden at the local sites and may decrease the hurdle for the local sites to participate. In turn, this may have the benefit that the build-up of the knowledge database is fostered. At the same time, data confidentiality may still be maintained by configuring the cloud platform such that the database cannot be directly accessed from the outside.
According to an embodiment, the step of identifying is based on applying a trained function to the first genomic data set. According to a further embodiment, the step of identifying is based on applying the trained function to first and second genomic data sets.
A trained function maps input data to output data. The output data can, in particular, depend on one or more parameters of the trained function. The one or more parameters of the trained function can be determined and/or be adjusted by training. The determination and/or the adjustment of the one or more parameters of the trained function can be based, in particular, on training data. The training data may comprise a pair made up of training input data and associated training output data. For creating training mapping data, the trained function is applied to the training input data. In particular, the determination and/or the adjustment can be based on a comparison of the training mapping data and the training output data.
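The training principle described above can be illustrated with a deliberately simple trained function having a single parameter. In the following sketch, the function form (a multiplication by the parameter w), the learning rate, the number of epochs and the training pairs are assumptions chosen for illustration; they do not represent the trained function of the embodiments.

```python
def train(pairs, learning_rate=0.01, epochs=500):
    """Adjust the single parameter w of the function x -> w * x so that the
    training mapping data approach the associated training output data."""
    w = 0.0
    for _ in range(epochs):
        for training_input, training_output in pairs:
            mapped = w * training_input        # training mapping data
            error = mapped - training_output   # comparison with training output data
            w -= learning_rate * error * training_input  # parameter adjustment
    return w


# Illustrative training pairs that follow output = 2 * input.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(pairs)
prediction = w * 4.0
```

After training, the parameter w is close to 2, so applying the function to a new input of 4.0 yields a value close to 8.0.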
Other terms for trained function are trained mapping specification, mapping specification with trained parameters, function with trained parameters, algorithm based on artificial intelligence, algorithm of machine learning. An example for a trained function is an artificial neural network, wherein the edge weights of the artificial neural network correspond to the parameters of the trained function.
In particular, the trained function may be applied to at least the first genomic data set. Additionally, the trained function may be applied to the second genomic data set. The trained function may be applied to the first genomic data set upon receipt of the first genomic data set. The trained function may be applied to the second genomic data set upon identifying the reference genomic data set or already prior to that, in particular, already (long) before the first genomic data set is received. The trained function may be trained to output genomic features and/or characteristic values. The corresponding outputs of the trained function may then be stored in the database alongside or in lieu of the corresponding second genomic data sets. According to some implementations, the trained function is applied to the genomic data sets upon storing them in the database.
The trained function may be configured (trained) so as to output a similarity score for the first genomic data set which can be matched with corresponding similarity scores of the second genomic data sets upon identifying the one or more reference genomic data sets. The trained function may be further configured (trained) to output one or more genomic features and/or characteristic values of the first genomic data set which can be compared to corresponding genomic features and/or characteristic values of the second genomic data sets upon identifying the one or more reference genomic data sets.
Accordingly, the corresponding outputs of the trained function may be seen as providing “intermediate results” on the basis of which the one or more reference genomic data sets may be identified. Of note, the further processing of the intermediate results may likewise be based on applying the same or another trained function to the intermediate results. Further, the trained function may be configured (trained) to directly identify the one or more reference genomic data sets when applied to the first genomic data set (i.e., without outputting intermediate results).
However, the usage of intermediate results may be beneficial to reduce the amount of data that needs to be stored and exchanged. Further, the usage of intermediate results may be beneficial from the perspective of data confidentiality. This is because the genomic data set can be effectively stripped from any genomic raw data by extracting genomic features and/or characteristic values. If the trained function is provided to the first site, genomic features and/or characteristic values may be calculated on-site. This opens the possibility to forward this information in the first genomic data set in lieu of the raw data.
In the training phase, the trained function may be trained on appropriate training data. The training data may comprise test genomic data sets as training input data and reference genomic data sets as training output data the similarity of which has been verified (e.g., by humans).
The usage of a trained function for identifying one or more reference genomic data sets has the advantage that the trained function may learn to rely on features, characteristics, and insights for quantifying the similarity of two genomic data sets which are not readily accessible by traditional techniques and/or the human mind. Moreover, using trained functions for identifying one or more reference genomic data sets enables a fast, i.e., basically on-the-fly search of a high number of second genomic data sets stored in the database. Further, the usage of trained functions synergistically contributes to the requirement of keeping genomic data as confidential as possible. This is because the usage of trained functions facilitates a highly autonomous data processing scheme requiring no or only little interaction with human operators (which might breach data confidentiality). Moreover, the trained function can be readily configured not to output any sensitive personal information. Thus, the trained function may also be used to anonymize genomic data sets.
According to an embodiment, the trained function is based on a support vector machine algorithm and/or a random forest algorithm and/or a regularized regression model.
Support vector machine algorithms, random forest algorithms as well as regularized regression models have proven particularly versatile in classifying data sets in general. Moreover, these algorithms showed particularly good results in connection with the analysis of genetic information. In extensive tests, the inventors have recognized that these algorithms are particularly suited for matching genomic data sets in similar patient searches.
According to an embodiment, first and second genomic data sets comprise supplementary information or metadata associated to the genetic information and the step of identifying is based on the supplementary information or metadata.
The supplementary information or metadata may comprise patient context information. Such context information may include information pertaining to a disease state of a particular patient, age, sex, or patient history. Further, the supplementary information or metadata may comprise disease phenotypes and genetic alterations. As such, the supplementary information may be factored into the process of identifying one or more reference genomic data sets. For instance, in the step of identifying, the search may be focused on genomic data sets from patients with similar disease phenotypes, in the sense that these genomic data sets are preselected for further detailed analysis. This has the benefit that the performance of the similarity search may be increased both in terms of accuracy and speed. Likewise, the trained function may use the supplementary information as further input data.
According to an embodiment, first and second genomic data sets comprise supplementary information and/or metadata associated to the genetic data sets and the step of comparing comprises preselecting the second genomic data sets on the basis of the supplementary information and/or metadata.
Preselecting may, for instance, comprise sorting genomic data sets with matching metadata into one or more groups. In the ensuing step of identifying only such genomic data sets may be considered that fall in the same group as the first genomic data set. According to an example, the aforementioned groups may relate to disease groups of cases having a clinical and functional similarity of the underlying diseases. Such disease groups may relate to grouping the genomic data sets according to tumor types, for instance. In a similar manner, alterations may be grouped into alteration groups that are functionally similar.
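The grouping-based preselection may be sketched as follows. The metadata key "tumor_type", the record layout and the function names group_by and preselect are hypothetical and only illustrate how second genomic data sets could be restricted to the group of the first genomic data set before the step of identifying.

```python
from collections import defaultdict


def group_by(data_sets, key):
    """Sort data sets into groups according to one metadata entry."""
    groups = defaultdict(list)
    for data_set in data_sets:
        groups[data_set["metadata"].get(key)].append(data_set)
    return groups


def preselect(first, second_sets, key="tumor_type"):
    """Keep only second data sets that fall in the same group as the first."""
    return group_by(second_sets, key).get(first["metadata"][key], [])


# Illustrative records; identifiers and tumor types are made up.
second_sets = [
    {"id": "S1", "metadata": {"tumor_type": "lung"}},
    {"id": "S2", "metadata": {"tumor_type": "breast"}},
    {"id": "S3", "metadata": {"tumor_type": "lung"}},
]
first = {"id": "Q1", "metadata": {"tumor_type": "lung"}}
candidates = preselect(first, second_sets)
```

In practice the groups could be computed once when data sets are stored, so that each query only has to look up the matching group.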
According to an embodiment, the first and/or second genomic data sets comprise one or more genomic features respectively derived from an underlying genetic sequence of a patient, and the step of identifying is based on the one or more genomic features.
A genomic feature is a feature that has been calculated and/or extracted from genetic raw data such as the gene sequence. Thus, the genomic feature may be seen as a high-level representation of one or more characteristics encoded in a gene sequence. In other words, genomic features are data objects extracted from the gene sequence. The genomic features may be associated with the aforementioned similarity criteria, preferably such that each genomic feature corresponds to one similarity criterion. Generating the genomic features may comprise processing the first and second genomic data sets so as to respectively extract, from the first and second genomic data sets, one or more genomic features, respectively corresponding to the one or more similarity criteria. In contrast to the aforementioned characteristic values, genomic features relate to more abstract data packages or objects.
As such, genomic features may comprise different kinds of information from sequence excerpts to gene expression profiles to plain numbers. Genomic features may thus be seen as containers for transporting arbitrary higher-level information about a gene sequence. Genomic features may be related to the characteristic values. On the one hand, a genomic feature may be a characteristic value by itself (if, for instance, the genomic feature relates to a number). On the other hand, one or more characteristic values may be derived from a genomic feature by further processing. Examples for genomic features may be annotated functions associated to a genetic region. An example would be a protein coding gene.
Further genomic features may in general address information about mutations in the gene sequence. This may include the location/existence of mutation hotspots in the genomic data sets as one genomic feature (hotspots are regions in a genome that exhibit elevated rates of mutations relative to a neutral expectation), the effect of a mutation as further genomic feature or the clinical actionability of mutations as yet a further genomic feature. For instance, such genomic features may be output by the trained function (e.g., in the form of the aforementioned intermediate results).
The usage of genomic features constitutes a way to condense the relevant information for conducting a similarity search based on genomic data. This is beneficial in terms of the system requirements for exchanging and storing genomic data sets. In addition, the process of identifying reference genomic data sets may be rendered more efficient since a smaller amount of data needs to be digested. Moreover, the usage of genomic features also contributes to data privacy. This is because (although being of course based on gene sequences) genomic features preferably do not contain any dedicated (whole) gene sequence. While the gene sequence constitutes a genetic fingerprint from which a corresponding patient can be identified, this is no longer possible (or at least considerably more difficult) for genomic features.
Therefore, according to an embodiment, first and/or second genomic data sets consist of one or more genomic features. Preferably, they do not contain any explicit gene sequences anymore.
Upon identifying one or more reference genomic data sets, each individual genomic feature may be individually compared. Alternatively, identification may be based on a condensed feature parameter (also denoted as a genomic feature set or genomic feature vector) which is based on a plurality of individual genomic features. According to an embodiment, first and second genomic data sets thus comprise a feature vector of a plurality of individual genomic features.
According to an embodiment, the one or more genomic features comprised in the first genomic data set are generated at the first site.
According to the above explanations, the usage of genomic features enhances the performance of the method, limits the amount of exchanged data and contributes to the data security. In this regard, deriving the genomic features already at the first site makes it possible to only forward high-level features. Genomic raw data, from which a patient may still be identified, may be retained on-site.
According to an embodiment, the step of identifying comprises extracting one or more genomic features from the first genomic data set.
The extraction may be performed at the first site prior to forwarding the first genomic data set or after receipt of the first genomic data set, e.g., at the cloud platform or at the second site. The extraction may be performed by applying the trained function to the first genomic data set.
According to an embodiment, the step of identifying comprises determining a similarity between the first genomic data set and the second genomic data sets by comparing the one or more genomic features of the first genomic data set to the corresponding one or more genomic features of the second genomic data sets.
According to an embodiment, the step of identifying comprises comparing a genomic feature vector of the first genomic data set to a corresponding genomic feature vector of the second genomic data sets.
According to an embodiment, the first and second genomic data sets each comprise a genomic feature vector being respectively generated from corresponding raw gene sequences (optionally by respectively applying a trained function to the raw gene sequences), wherein in the step of identifying, the similarity between first and second genomic data sets is estimated based on a comparison of their corresponding genomic feature vectors.
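By way of illustration only, the comparison of genomic feature vectors may be sketched as follows. The text does not prescribe a particular similarity measure; cosine similarity is used here purely as an assumed example, and the vectors and case identifiers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length genomic feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector carries no information to compare
    return dot / (norm_a * norm_b)


# Hypothetical feature vectors (assumption: numeric features in a fixed order).
first_vector = [1.0, 0.0, 2.0, 1.0]
second_vectors = {
    "case-001": [1.0, 0.0, 2.0, 1.0],
    "case-002": [0.0, 3.0, 0.0, 0.0],
}

similarities = {
    case_id: cosine_similarity(first_vector, vector)
    for case_id, vector in second_vectors.items()
}
```

In this sketch, only the feature vectors are needed for the comparison, so the raw gene sequences may remain at their respective sites.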
According to an embodiment, the step of identifying comprises: determining one or more similarity criteria associated with the first and second genomic data sets, processing the first and second genomic data sets so as to respectively extract, from the first and second genomic data sets, one or more characteristic values respectively corresponding to the one or more similarity criteria, and identifying the one or more reference genomic data sets on the basis of the characteristic values.
Characteristic values may in general be characteristic numbers which alone or as an ensemble classify or identify a genomic data set, e.g., for comparing it to others but also for compressing the amount of data contained in a genomic data set for storing or data exchange. Each characteristic value may relate to a similarity criterion usable for retrieving the one or more reference genomic data sets. Each characteristic value may correspond to one genomic feature as introduced above. Accordingly, the characteristic values may likewise be calculated from the genetic raw data, e.g., by applying a trained function to the raw data. Moreover, characteristic values may also relate to metadata such as the patient's sex, age, or treatment response and so forth. As mentioned, the step of processing for extracting the characteristic values may take place already at the first site, with the benefit that only the characteristic values need to be forwarded (thereby reducing the amount of data exchanged and increasing the data security).
Determining the similarity criteria may involve choosing or adapting the similarity criteria according to the first genomic data set currently under consideration. Further, determining may relate to defining a plurality of standardized criteria according to which each genomic data set is processed by default.
Noteworthy, the first and second genomic data sets may be processed independently from one another. In particular, the second genomic data sets may be processed before or long before the receipt of the first genomic data set. Specifically, the second genomic data sets' characteristic values may already be comprised in the second genomic data sets as stored in the database—either alongside or in lieu of any genetic raw data. As explained, the latter variant is beneficial in terms of storage space and data security.
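The extraction of characteristic values according to a set of similarity criteria may, for instance, be sketched as below. The record layout, the criterion names and the numeric actionability encoding are assumptions made for this example only; the gene names serve as placeholders and the values are not clinical data.

```python
def count_hotspot_mutations(record):
    """Number of mutations flagged as lying in a mutation hotspot."""
    return sum(1 for m in record["mutations"] if m.get("hotspot", False))


def max_actionability(record):
    """Highest actionability level among the mutations (hypothetical numeric scale)."""
    return max((m.get("actionability", 0) for m in record["mutations"]), default=0)


def mutated_gene_count(record):
    """Number of distinct genes carrying at least one mutation."""
    return len({m["gene"] for m in record["mutations"]})


# Assumption: each similarity criterion maps to one extractor function.
SIMILARITY_CRITERIA = {
    "hotspot_count": count_hotspot_mutations,
    "max_actionability": max_actionability,
    "mutated_gene_count": mutated_gene_count,
}


def extract_characteristic_values(record, criteria=SIMILARITY_CRITERIA):
    """Return one characteristic value per similarity criterion."""
    return {name: extractor(record) for name, extractor in criteria.items()}


# Hypothetical, already processed genomic data set (no raw sequence needed).
first_record = {
    "mutations": [
        {"gene": "KRAS", "hotspot": True, "actionability": 2},
        {"gene": "TP53", "hotspot": False, "actionability": 0},
    ]
}
first_values = extract_characteristic_values(first_record)
```

Because each data set is processed on its own, the same function may be applied to the second genomic data sets ahead of time and the results stored in the database.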
According to an embodiment, the processing of the first genomic data set so as to extract, from the first data set, the one or more characteristic values is performed at the first site.
This has the effect that only the characteristic values and no raw data need to be forwarded by the local sites. As mentioned, this is beneficial in terms of data confidentiality and contributes to lowering the amount of data that needs to be exchanged.
According to an embodiment, one or more (or all) similarity criteria (and therewith the corresponding characteristic values) are based on an evaluation of gene mutations.
As regards oncology related questions, focusing on mutations in genomic data bears several advantages. On the one hand, mutations allow for an efficient identification of reference genomic data sets since mutations usually pinpoint a disease or disease state very well. On the other hand, characteristic values associated with mutations may be useful for physicians to evaluate the case at hand, e.g., in molecular tumor boards.
Specifically, the similarity criteria may comprise genomic regions (areas in the gene sequence) of mutations in the genomic data sets, mutation hotspots in the genomic data sets (hotspots are regions in a genome that exhibit elevated rates of mutations relative to a neutral expectation), mutation consequences in terms of gain and/or loss of function, effects of mutations on the signaling pathway, the clinical actionability of mutations in the genomic data sets, tumor profiles, disease types, patient's age and/or sex, treatment plan and/or treatment response and any combination thereof. In turn, the corresponding characteristic values are based on and are indicative of these criteria.
The clinical actionability is, in other words, a measure of whether clinical action should be taken based on heterogeneous information generated by genomic analysis. As regards the clinical actionability, the ESMO Scale for Clinical Actionability of molecular Targets (ESCAT) may be used, for instance. Alternatively, the clinical actionability may be determined according to the guidelines of the Association for Molecular Pathology (AMP).
The above characteristics have proven useful for the process of identifying similar cases on the basis of comparing genomic data sets. Moreover, these values enable an efficient data exchange in regulated environments. On the one hand, this is because they are uncoupled from the underlying gene sequences (which might still allow the patient to be identified). On the other hand, values according to the above criteria provide indices that are in any case relevant for deciding on a case.
According to an embodiment, the step of identifying comprises calculating, for the first and second genomic data sets, a score as the weighted sum of the respective characteristic values, and comparing the scores of first and second genomic data sets.
By introducing a weighting of the individual characteristic values, in other words, different similarity criteria may be weighted differently for identifying the reference genomic data set. With that, different criteria may be balanced that contribute differently to the degree of similarity between two genomic data sets. According to an embodiment, the weights comprised in the weighted sum may be provided by the trained function.
According to an embodiment, the similarity between the first genomic data set and a second genomic data set is proportional to the difference in scores between the first and second genomic data sets. According to a further embodiment, the identification of the reference genomic data sets amongst the second genomic data sets may involve selecting those second genomic data sets as reference genomic data sets the score of which corresponds to the score of the first genomic data set within a predetermined margin. The predetermined margin may be set automatically and/or (semi-)automatically and/or by a user.
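The weighted-sum score and the selection within a predetermined margin may be illustrated by the following minimal sketch. The weights, the margin, the criterion names and the case identifiers are hypothetical; the sketch further assumes that a smaller difference in scores is treated as a closer match.

```python
def weighted_score(values, weights):
    """Score of a genomic data set as the weighted sum of its characteristic values."""
    return sum(weights[name] * values[name] for name in weights)


def select_references(first_values, second_sets, weights, margin):
    """Return {set_id: score difference} for all second sets within the margin."""
    first_score = weighted_score(first_values, weights)
    selected = {}
    for set_id, values in second_sets.items():
        difference = abs(weighted_score(values, weights) - first_score)
        if difference <= margin:
            selected[set_id] = difference
    return selected


# Hypothetical weights; according to the text they may be provided by the trained function.
weights = {"hotspot_count": 2.0, "max_actionability": 1.0}

first_values = {"hotspot_count": 1, "max_actionability": 2}       # score 4.0
second_sets = {
    "case-A": {"hotspot_count": 1, "max_actionability": 3},       # score 5.0
    "case-B": {"hotspot_count": 4, "max_actionability": 0},       # score 8.0
}

selected = select_references(first_values, second_sets, weights, margin=1.5)
```

Since a single score condenses several criteria into one number, different combinations of characteristic values may yield the same score; the characteristic values themselves may therefore be kept alongside the score for later inspection.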
According to an embodiment, the step of identifying comprises generating a ranking of the reference genomic data sets on the basis of their similarity to the first genomic data set.
The ranking may be based on the aforementioned difference in scores, the characteristic values, the genomic features or any of the explained similarity criteria. By ranking the reference genomic data sets, the first site may be provided with an indication as to the relevance of the retrieved reference genomic data sets. The higher a reference genomic data set is ranked, the more relevant it might be for the case at hand. In doing so, the method effectively integrates into existing workflows and helps the involved physicians to focus on the most relevant information.
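A ranking based on the difference in scores may, for example, look as follows. The identifiers and differences are hypothetical, and breaking ties by identifier is an assumption made only to keep the order deterministic.

```python
def rank_references(score_differences):
    """Order reference set identifiers from most to least similar.

    A smaller score difference is assumed to mean a higher rank;
    ties are broken by identifier so that the result is reproducible.
    """
    return sorted(
        score_differences,
        key=lambda set_id: (score_differences[set_id], set_id),
    )


# Hypothetical score differences relative to the first genomic data set.
score_differences = {"case-B": 0.7, "case-A": 0.2, "case-C": 0.7}
ranking = rank_references(score_differences)
```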
According to an embodiment, the step of dispatching further comprises the step of retrieving, for each reference genomic data set, supplementary information, and including the supplementary information in the notification.
As mentioned, the supplementary information may be stored alongside the second genomic data sets in the same or a different database. The supplementary information may be retrieved based on appropriate unique identifiers respectively assigned to each genomic data set stored in the database and the corresponding supplementary information. By including the supplementary information, the first site may be provided with additional information relevant for the case and not already provided in the notification.
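The retrieval of supplementary information via unique identifiers may be sketched as a simple keyed lookup. The in-memory dictionary stands in for the database, and the identifiers, field names and contact address are hypothetical.

```python
def attach_supplementary_information(reference_ids, supplementary_store):
    """Build one notification entry per reference genomic data set."""
    entries = []
    for set_id in reference_ids:
        entries.append(
            {
                "id": set_id,
                # .get() yields None if nothing is stored under this identifier
                "supplementary": supplementary_store.get(set_id),
            }
        )
    return entries


# Assumption: supplementary information is keyed by the same unique identifier
# that is assigned to the genomic data set in the database.
supplementary_store = {
    "GDS2-0007": {"site": "Site B", "contact": "tumor-board@site-b.example"},
}

notification = attach_supplementary_information(
    ["GDS2-0007", "GDS2-0042"], supplementary_store
)
```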
According to an embodiment, the supplementary information comprises contact information associated to the reference genomic data sets, information on the sites at which the reference genomic data sets have been generated, a therapy history associated to the reference genomic data sets, a treatment response profile associated to the reference genomic data sets, a genetic tumor profile associated to the reference genomic data sets, and any combination thereof.
By providing the first site with information about the site of origin and/or the treating physician of the respective reference genomic data set, a physician at the first site is enabled to retrieve additional information about the respective reference genomic data set and consult with her or his colleagues. As this involves forwarding personal data about the physician and not about the patient, the patient's data confidentiality is maintained. Likewise, the genetic tumor profile is of immediate use for the physicians at the first site as it provides valuable insights at one glance and can be readily discussed at the tumor boards at the first site. Further, since the tumor profile cannot be traced back to the patient, data confidentiality is maintained also with respect to this piece of information. The same holds true for the (anonymized) treatment history and treatment response profiles, which enable a treating physician to figure out which therapeutic measures have proven useful in parallel cases. To further ensure data privacy, the step of dispatching may comprise a step of anonymizing the notification such that it does not reveal or contain any information from which the patients belonging to the one or more reference genomic data sets can be identified (i.e., patient's name, address, photographs and the like).
According to a further embodiment, the notification includes the one or more reference genomic data sets.
For data security reasons, the reference genomic data sets included in the notification preferably do not contain any genetic raw data such as gene sequences but only high-level information that cannot be traced back to the respective patient (such as the aforementioned characteristic values, genomic features, similarity criteria or scores). To this end, an additional step of filtering the reference genomic data sets may be provided before appending them to the notification, stripping the reference genomic data sets of any genetic raw data.
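The filtering step may, for instance, be realized with an allow-list of high-level fields, as in the following sketch. The field names and the example record are hypothetical.

```python
# Assumption: only these high-level fields may leave the system in a notification.
ALLOWED_FIELDS = {"id", "characteristic_values", "genomic_features", "score"}


def strip_raw_data(data_set, allowed=ALLOWED_FIELDS):
    """Keep only allow-listed fields; raw sequences and personal data are dropped."""
    return {key: value for key, value in data_set.items() if key in allowed}


# Hypothetical reference genomic data set as stored in the database.
reference = {
    "id": "GDS2-0007",
    "score": 4.0,
    "characteristic_values": {"hotspot_count": 1},
    "raw_sequence": "ACGTTGCA",
    "patient_name": "Jane Doe",
}
filtered = strip_raw_data(reference)
```

An allow-list is chosen here over a deny-list so that any field not explicitly known to be safe is withheld by default.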
According to an embodiment, the step of dispatching comprises including the one or more characteristic values of the first genomic data set and/or the corresponding one or more characteristic values of the respective reference genomic data set into the notification.
With that, the physician at the first site may be provided with meaningful information as to why a respective reference genomic data set has been chosen and where the similarities and differences lie. Further, dependent on the underlying similarity criterion, the information therewith provided may be useful for the further analysis of the case.
According to an embodiment, the method further comprises the step of establishing a communication channel for direct communication between the first site and the respective sites of origin of the one or more reference genomic data sets.
The communication channel constitutes an interactive connection between the matched sites. The communication channel may enable real-time interaction between the treating physicians, e.g., by exchanging voice or text messages. The communication channel may be embodied in the form of a chatroom or virtual molecular tumor board, e.g., hosted by the cloud platform or the aforementioned second site. The communication channel may be based on a secured connection. The communication channel may be based on a VPN connection. Providing the communication channel may comprise a log-in step for the treating physicians using a registered ID and password which may be forwarded in the notification or via a separate communication channel such as via email or SMS (“short message service”). Access to the communication channel may be provided by a URL included in the notification or via existing user accounts. Information between participants may be exchanged in the form of verbal and/or written or textual communication. As such, the communication channel may be embodied by a secured internet connection, preferably comprising a voice over internet protocol (VoIP) connection and/or a (text/video or audio) chat connection. The communication channel may also provide for graphical user interfaces at the matched sites, e.g., in the form of a web client.
According to an embodiment, a system for sharing medical information is provided. The system comprises an interface unit, a database and a computing unit. The interface unit is configured to communicate with a first site for receiving a first genomic data set. Further, the interface unit is configured to communicate with the database. The database is configured to store a plurality of second genomic data sets, the database being external to the first site. The computing unit is configured to compare the first genomic data set with a fraction or all of the second genomic data sets and to identify, amongst these second genomic data sets, one or more reference genomic data sets, on the basis of determining a similarity between the first genomic data set and the respective second genomic data sets. Further, the computing unit is configured to dispatch a notification to the first site indicative of the reference genomic data sets via the interface unit.
The interface unit may be understood as an interface for data exchange at least between the first site, the system and any other sites of origin of the second genomic data sets. To this end, the interface unit may be configured to communicate over one or more connections or buses. The interface unit may be embodied by a gateway or other connection to a network (such as an Ethernet port or WLAN interface). The network may be realized as local area network (LAN), e.g., an intranet, ethernet or a wide area network (WAN), e.g., the internet. The network may comprise a combination of the different network types. According to an embodiment, the network connection may also be wireless.
The computing unit can be realized as a data processing system or as a part of a data processing system. Such a data processing system can, for example, comprise a cloud-computing system, a computer network, a computer, a tablet computer, a smartphone and the like. The computing unit can comprise hardware and/or software. The hardware can be, for example, a processor system, a memory system and combinations thereof. The hardware can be configurable by the software and/or be operable by the software. Generally, all units, sub-units or modules may at least temporarily be in data exchange with each other, e.g., via a network connection or respective interfaces. Consequently, individual units may be located apart from each other; in particular, the definition unit may be located apart, i.e., at the mobile device, from the remaining units of the computing units.
According to an embodiment of the present invention, the system is adapted to implement at least one embodiment of the inventive method for sharing medical information. The computing unit may be seen as a matching engine configured to compare the received first genomic data set to the second genomic data sets stored in the database and identify one or more reference genomic data sets on that basis.
To this end, the computing unit may be configured to access the database and retrieve one or more second genomic data sets for comparing them with the first genomic data set. Further, the computing unit may be configured to process the first genomic data set and/or the second genomic data sets for identifying one or more reference genomic data sets. The processing may comprise extracting one or more genomic features respectively from first and second genomic data sets, calculating one or more characteristic values respectively from first and second genomic data sets, respectively calculating a score for first and second genomic data sets, and calculating a degree of similarity between first and second genomic data sets (on the basis of one or more of the aforementioned processing steps).
Further, the computing unit may be configured to rank the identified reference genomic data sets according to their similarity to the first genomic data set. The computing unit may further be configured to run a trained function (to apply a trained function to the first and second genomic data sets) in the step of identifying one or more reference genomic data sets. Further, the computing unit may comprise communication modules configured to initiate and/or control the communication between the first site and the sites of origin of the one or more reference genomic data sets.
To this end, the communication modules may be configured to dispatch a notification to the first site that one or more reference genomic data sets have been found, e.g., via the interface unit or any other appropriate channel. Further, the communication modules may be configured to establish a communication channel between the first site and sites of origin of the one or more reference genomic data sets. The communication channel may be hosted by the system, e.g., via the communication modules and/or the interface, so that any information exchange is routed through the system. As an alternative, the communication channel may be configured as a direct communication channel between the involved sites.
The system may be configured as a local system characterized in that all system components (i.e., databases, computing and interface units) are arranged at one defined local site, such as a hospital, cancer or gene center. Although the system components may still be spread throughout the local site, e.g., in the form of a local server architecture, all processes run on premises within the local sites and all databases and repositories are likewise arranged within the local site.
As an alternative, the system may be configured as a cloud system or cloud platform comprising a real or virtual group of computers and databases, like a so-called ‘cluster’ or ‘cloud’.
According to an embodiment, a computer program product is provided. The computer program product comprises program elements which induce a computing unit of a system for sharing medical information to perform the method as described above in connection with one or more embodiments, when the program elements are loaded into a memory of the computing unit.
According to a further embodiment, program elements are stored that are readable and executable by a computing unit of a system for sharing medical information, in order to perform steps of the method as described above in connection with one or more embodiments, when the program elements are executed by the computing unit.
The realization of the invention by a computer program product and/or a computer-readable medium has the advantage that already existing providing systems can be easily adapted by software updates in order to work as proposed by the invention.
The computer program product can be, for example, a computer program or comprise another element next to the computer program as such. This other element can be hardware, for example a memory device, on which the computer program is stored, a hardware key for using the computer program and the like, and/or software, for example a documentation or a software key for using the computer program. The computer program product may further comprise development material, a runtime system and/or databases or libraries. The computer program product may be distributed among several computer instances.
In summary, by providing a platform for securely storing comparative data and processing uploaded genomic data sets, embodiments of the invention establish a way to base patient similarity search on genomic data and securely exchange information across a plurality of involved local sites.
Local sites A, B, C may contain local computing units 40A, 40B, 40C through which one or more users (such as physicians or other healthcare personnel) may interface to the system 100. Local computing units 40A, 40B, 40C may comprise a hardware or software component, e.g., a microprocessor or an FPGA (‘Field Programmable Gate Array’). Local computing units 40A, 40B, 40C may be embodied as workstations, tablets, smart phones, server systems or connectivity nodes. Local computing units 40A, 40B, 40C may be configured to perform steps according to the workflow described in connection with
Further, local sites A, B, C may contain acquisition units 50A, 50B, 50C for acquiring genomic data and transferring the genomic data into genomic data sets GDS. The genomic data acquired may be raw data or already processed genomic data. Raw data may be acquired from acquisition units including but not limited to microarray platforms including RNA or mRNA expression, genotyping, gene expression platforms, polymerase chain reaction (PCR) platforms, copy-number variation (CNV) platforms, (whole) genome sequencing platforms or the like. The genomic data acquired from the acquisition unit 50A, 50B, 50C may be in the form of gene sequences, gene expression levels, gene states or the like. Alternatively, the acquisition may be from storage or memory, such as acquiring a previously created genomic data set GDS from an appropriate archiving system as acquisition units. The raw data may subsequently be processed or (pre-)processed in the acquisition units 50A, 50B, 50C and/or in the local computing units 40A, 40B, 40C.
To interface with one or more users, local computing units 40A, 40B, 40C may comprise a user interface such as one or more displays or touch screens. Local computing units 40A, 40B, 40C may be configured as reading workplaces with which users can retrieve and review genomic data sets GDS and related supplementary information SI. To retrieve the supplementary information SI, local computing units 40A, 40B, 40C may be configured to query appropriate local data storage devices or repositories 30A, 30B, 30C within the respective sites A, B, C, for instance. To this end, local computing units 40A, 40B, 40C may be configured to extract a unique identifier from the genomic data sets GDS, indicative of the patient or case under consideration. Such unique identifier may be a patient ID, a case or accession number, a patient name or the like. The unique identifier may be assigned to the genomic data sets GDS upon their acquisition. The unique identifiers may subsequently be used to query the available local databases 30A, 30B, 30C for supplementary information SI having the same unique identifier. The supplementary information SI may comprise information pertaining to the disease type and state of the patient, further patient information such as age or sex, the patient's health record, therapy and medication information, information about the practicing physician or the like. The supplementary information SI may be provided in the form of an electronic medical record (EMR), for instance. The local storage devices 30A, 30B, 30C may be part of hospital information systems (HIS), radiology information systems (RIS), clinical information systems (CIS), laboratory information systems (LIS) and/or cardiovascular information systems (CVIS) or the like.
For reviewing genomic data sets by a user, local computing units 40A, 40B, 40C may be configured to execute at least one software component for serving a display unit and an input unit of local computing units 40A, 40B, 40C in order to provide a suitable graphical user interface. With the graphical user interface, the user may, for instance, select genomic data sets GDS for review from the acquisition units 50A, 50B, 50C or local storage devices 30A, 30B, 30C. Further, the user may review one or more graphical representations of the genomic data sets GDS as provided by the graphical user interface. Moreover, the graphical user interface may provide the user a selection of analytic tools with which he or she can further analyze the genomic data sets GDS currently under review. Further, the graphical user interface may allow the users to select genomic data sets GDS for sharing with institutions outside of the respective local site A, B, C.
Local computing units 40A, 40B, 40C may be configured to further process the genomic data sets GDS. This may comprise steps such as bringing the genomic data sets GDS into an appropriate format or data compression procedures but may also involve associating and/or appending corresponding supplementary information SI to the genomic data sets GDS as metadata.
Further, in terms of processing the genomic data, local computing units 40A, 40B, 40C may be configured to extract genomic features from the genomic data sets GDS. The genomic features may relate to high-level information derived from the genomic data sets GDS, e.g., by using bio-informatics algorithms. In particular, genomic features may be generated by applying a trained function to the genomic data sets GDS. The trained function may, for instance, be provided by the matching system 1. The trained function may be based on a support vector machine algorithm and/or a random forest algorithm and/or a regularized regression model. The genomic features may be selected or tailored according to the clinical question at hand. For oncology related questions, the genomic features may, for instance, rely on identifying mutations in the genomic data. Accordingly, corresponding genomic features might relate to the genomic regions of mutations, mutation hotspots, the effect of mutations (in terms of gain or loss of function), and/or the clinical actionability of mutations. As an alternative or in addition to that, the preprocessing as described above may also be performed in the acquisition units 50A, 50B, 50C. Moreover, the processing of genomic data sets GDS may comprise a filtering step. For instance, local computing units 40A, 40B, 40C may be configured to filter out information that does not identify a required piece of information such as a chromosomal DNA copy loss or gain. As such, filtered genomic data sets GDS may be created that generally only include those regions that may contain a chromosomal abnormality or alteration.
Thus, summarizing the above, genomic data sets GDS may relate to raw or processed genomic data and may comprise metadata and supplementary information SI. As such, genomic data sets GDS may, for instance, comprise plain gene sequences, information about gene mutations, gene associations, gain or loss, gene expression levels or gene states, tumor profiles, disease states, sex or age of the patient, and so forth.
The components at the respective sites A, B, C are interfaced with an appropriate local network enabling local communication at the respective sites A, B, C. Data transfer is preferably realized using a network connection. The network may be realized as local area network (LAN), e.g., an intranet, Ethernet or a wide area network (WAN). Network connection is preferably wireless, e.g., as wireless LAN (WLAN or Wi-Fi). The network may comprise a combination of the different network types. In particular, the network may comprise an HL7 and/or FHIR compatible network. HL7 (Health Level Seven) specifies a set of flexible standards, guidelines, and methodologies by which various healthcare systems can communicate with each other. It allows information to be shared and processed in a uniform and consistent manner and therefore enables clinical information to be shared easily. The FHIR (Fast Healthcare Interoperability Resources) standard builds on previous standards from HL7 and uses a web-based suite of API technology. It is meant to enhance the interoperability and support a wider variety of devices from workstations to tablets to smart phones.
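The filtering step for chromosomal copy loss or gain may be illustrated as follows. The segment representation and the assumption of a baseline of two copies per region are simplifications introduced for this sketch and are not taken from the text.

```python
def filter_abnormal_segments(segments, normal_copies=2):
    """Keep only segments whose copy number deviates from the assumed baseline."""
    return [s for s in segments if s["copies"] != normal_copies]


# Hypothetical copy-number segments of one genomic data set.
segments = [
    {"chrom": "chr1", "start": 1_000_000, "end": 2_000_000, "copies": 2},
    {"chrom": "chr8", "start": 5_000_000, "end": 6_500_000, "copies": 4},   # gain
    {"chrom": "chr17", "start": 7_000_000, "end": 7_800_000, "copies": 1},  # loss
]

filtered_segments = filter_abnormal_segments(segments)
```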
For patient privacy reasons, there is preferably no direct communication across the different sites A, B, C, however. This restriction is indicated by the dashed lines in
Local computing units 40A, 40B, 40C may comprise filtering modules (not shown) configured to filter out personal patient data from the genomic data sets GDS prior to uploading the genomic data sets GDS (i.e., local computing units 40A, 40B, 40C are configured to anonymize the genomic data sets GDS). In addition to that or as an alternative, also matching system 1 may be configured to anonymize the uploaded genomic data sets GDS1, likewise relying on appropriate filtering modules, for instance.
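A filtering module of this kind may be sketched as below. The list of personal fields is hypothetical. Replacing the patient identifier with a random pseudonym whose mapping stays at the local site is an addition made for this example and is not described in the text.

```python
import secrets

# Assumption: these field names mark personal patient data in a data set.
PERSONAL_FIELDS = ("patient_name", "address", "date_of_birth", "patient_id")


def anonymize_for_upload(data_set, pseudonym_map):
    """Remove personal fields and attach a random pseudonym.

    The mapping pseudonym -> patient ID is written to `pseudonym_map`,
    which is meant to remain at the local site and is never uploaded.
    """
    pseudonym = secrets.token_hex(8)  # 16 hexadecimal characters
    pseudonym_map[pseudonym] = data_set.get("patient_id")
    cleaned = {k: v for k, v in data_set.items() if k not in PERSONAL_FIELDS}
    cleaned["pseudonym"] = pseudonym
    return cleaned


# Hypothetical genomic data set prior to uploading.
local_map = {}
uploaded = anonymize_for_upload(
    {
        "patient_id": "P-123",
        "patient_name": "Jane Doe",
        "characteristic_values": {"hotspot_count": 1},
    },
    local_map,
)
```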
In the example as shown in
Alternatively, supplementary information SI may be stored in repository 30. Like the genomic data sets GDS2, the supplementary information SI may either be recorded locally at the site B where the matching system 1 resides or come from external sites A, C, e.g., in the form of an appendix to the uploaded genomic data sets GDS1.
Matching engine 10 may comprise a plurality of sub-units 11-14 configured to process genomic data sets GDS1, GDS2 for identifying similar genomic data sets and share this information with the external sites A, C. Matching engine 10 may comprise either a computer/processing unit, a microcontroller or an integrated circuit. Alternatively, matching engine 10 may comprise a real or virtual group of computers like a so called ‘cluster’ or ‘cloud’. Further matching engine 10 may be a server system. The server system may be a central server. Further, matching engine 10 may comprise a memory such as a RAM, e.g., for temporally loading genomic data sets GDS2 from the database for further processing.
Sub-unit 11 is a pre-processing module configured to analyze the uploaded genomic data sets GDS1 (also denoted as first genomic data set), to determine if and which pre-processing steps are required for the further analysis. Further, sub-unit 11 is configured to pre-process the uploaded genomic data set GDS1 accordingly. Analyzing may comprise analyzing the format and information content of the uploaded genomic data sets GDS1. Here, it may be determined, for instance, if the uploaded genomic data set GDS1 comprises raw data and/or already processed data. The outcome of this analysis may then be compared to the system requirements of the matching engine 10. If any discrepancy is detected, further pre-processing steps may be scheduled and carried out for bringing the genomic data sets GDS1 into shape for the subsequent similarity search. The pre-processing steps in general may be of the same kind as mentioned in connection with the processing steps performed by the local computing units 40A, 40B, 40C. This may, in particular, involve the extraction of genomic features. As mentioned, these genomic features may address mutations such as the genomic regions of mutations in the genomic data, mutation hotspots in the genomic data, the effect of mutation in the genomic data (gain or loss), and/or the clinical actionability of mutations in the genomic data set. To derive the genomic features, sub-unit 11 may be configured to apply and execute suitable bioinformatics algorithms, and, in particular, one or more trained functions. Whether the pre-processing is done in the matching engine 10 or already locally at the local sites A, C may vary according to the specific requirements. Outsourcing some or all of the pre-processing steps to the local sites A, C has the benefit of reduced data traffic and enhanced data security.
By contrast, centralizing the pre-processing at matching engine 10 may improve compatibility and ensures that the full genomic information is still present at matching engine 10. As yet a further option, pre-processing steps may also be split between matching engine 10 and local computing systems 40A, 40B, 40C.
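The analysis performed by sub-unit 11 may be illustrated by the following minimal sketch. It merely compares the content of an uploaded data set with an assumed requirement of the matching engine and schedules feature extraction where something is missing. The class `GenomicDataSet`, the set `REQUIRED_FEATURES` and the step name `extract_features` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical system requirement: features the similarity search expects.
REQUIRED_FEATURES = {"genomic_region", "hotspot", "consequence", "actionability"}


@dataclass
class GenomicDataSet:
    """Illustrative container: raw sequence and/or already extracted features."""
    raw_sequence: Optional[str] = None
    features: dict = field(default_factory=dict)


def plan_preprocessing(gds: GenomicDataSet) -> list:
    """Return the pre-processing steps still required for the given data set."""
    steps = []
    missing = REQUIRED_FEATURES - set(gds.features)
    if missing:
        if gds.raw_sequence is None:
            # Neither the features nor the raw data to derive them are present.
            raise ValueError("missing features cannot be derived without raw data")
        steps.append(("extract_features", sorted(missing)))
    return steps
```

A data set that already carries all required features yields an empty plan, which corresponds to the case where pre-processing has been completed at the local sites.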
Sub-unit 12 is a module configured to further process the uploaded genomic data sets GDS1 by searching and identifying reference genomic data sets. Reference genomic data sets are those genomic data sets amongst the genomic data sets GDS2 stored in database 20 that are “similar” to the uploaded genomic data sets GDS1. To identify the reference genomic data sets, sub-unit 12 may be configured to calculate a degree of similarity between the uploaded genomic data set GDS1 and the genomic data sets GDS2 from database 20. As will be further detailed below, sub-unit 12 is preferably configured to do so on the basis of a weighted comparison of distinct characteristic values extracted from the genomic data sets GDS1, GDS2 and/or the genomic features. For a more efficient search for reference genomic data sets, sub-unit 12 may also be configured to analyze any metadata adhered to the genomic data sets GDS1, GDS2. As mentioned, the metadata may comprise an indication (or electronic tag) about the kind of disease linked to the genomic data set. By evaluating this information, sub-unit 12 may, for instance, focus on genomic data sets GDS2 in database 20 having the same indication (or electronic tag) and, hence, belong to the same disease group.
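The preselection by disease group may be sketched as follows. The dictionary layout with a `metadata` entry and a `disease` key is an assumption made for illustration; the actual form of the electronic tag is not specified here.

```python
def preselect(uploaded_metadata: dict, stored_data_sets: list) -> list:
    """Restrict the search to stored data sets carrying the same disease tag.

    If the uploaded data set has no tag, all stored data sets remain candidates.
    """
    tag = uploaded_metadata.get("disease")
    if tag is None:
        return list(stored_data_sets)
    return [
        gds for gds in stored_data_sets
        if gds.get("metadata", {}).get("disease") == tag
    ]
```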
Sub-unit 13 is a module for retrieving supplementary information SI associated with the reference genomic data sets. The supplementary information SI may either be adhered to the genomic data sets GDS2 as metadata or be archived separately in designated databases such as repository 30B. If the supplementary information SI is adhered to the genomic data sets GDS2 in the form of metadata, e.g., in a header or the like, sub-unit 13 may be configured to access, read and process the metadata and retrieve the supplementary information SI directly from the genomic data sets GDS2. Alternatively, sub-unit 13 may be configured to query and retrieve the supplementary information SI from the corresponding repository 30B, e.g., by using an appropriate data identifier. Repository 30B may be separate from database 20 or integrated in database 20. As mentioned, the supplementary information SI may be information concerning the attending physician(s) responsible for the case, information concerning the kind of the disease, treatment information, information about the treatment response or the like.
Sub-unit 14 is a module for enabling information exchange across the sites A, C. In this regard, sub-unit 14 may be configured to dispatch a communication (notification NOT) to the site where the uploaded genomic data set GDS1 came from indicating that a reference genomic data set has been found. In this regard, the distributed environment 100 may be configured such that the notification NOT is displayed at the local computing systems 40A, 40B, 40C. Further, sub-unit 14 may be configured to provide a communication channel CH1, CH2 enabling communication between the site of origin of the uploaded genomic data set GDS1 and the site(s) of origin of the reference genomic datasets. The communication channel CH2 may be such that the respective sites of origin may communicate directly, e.g., via the computing systems 40A, 40B, 40C. In addition to that or as an alternative, the communication channel CH1 may be such that the communication between the sites A, B, C takes place via matching engine 10 (sub-unit 14) as communication node. Further, sub-unit 14 may be configured to include part or all of the retrieved supplementary information SI in the notification.
The designation of the distinct sub-units 11-14 is to be construed by way of example and not as limitation. Accordingly, sub-units 11-14 may be integrated to form one single unit or can be embodied by computer code segments configured to execute the corresponding method steps running on a processor or the like of the matching engine 10. Each sub-unit 11-14 may be individually connected to other sub-units and/or other components of the distributed environment 100 where data exchange is needed to perform the method steps. For example, sub-unit 11 may be connected to the interface units of local computing units 40A, 40B, 40C for receiving the uploaded genomic data sets. Likewise, sub-unit 14 may be directly connected to corresponding interface units of local computing units 40A, 40B, 40C to forward the notification NOT that reference genomic data sets have been found. Further, sub-unit 12 may be directly connected to database 20 and sub-unit 13 may be directly connected to repository 30B. In this regard, database 20 and repository 30B may be activated on a request basis, wherein the request is sent by matching engine 10. Interfaces for data exchange with the matching engine 10 may be realized as hardware or software interfaces, e.g., a PCI bus, USB or FireWire. Data transfer is preferably realized using a network connection. The network may be realized as local area network (LAN), e.g., an intranet or a wide area network (WAN). Network connection is preferably wireless, e.g., as wireless LAN (WLAN or Wi-Fi). Further, the network may comprise a combination of different networks.
A computing unit according to an embodiment of the invention may comprise part or all of the matching engine 10. Further, it may comprise part or all of the local computing systems 40A, 40B, 40C at the sites A, B, C. Of note, the layout of the computing unit, i.e., the physical distribution of sub-units is, in principle, arbitrary. For instance, filtering modules for anonymizing genomic data sets GDS may be comprised in local computing units 40A, 40B, 40C and/or in matching system 1. The same holds true for pre-processing modules such as sub-unit 11. Specifically, pre-processing modules may also be comprised in local computing units 40A, 40B, 40C or already in the acquisition units 50A, 50B, 50C.
One difference between the embodiment shown in
As in the case of the embodiment shown in
A first step S10 is directed to receiving an uploaded genomic data set GDS1 at the matching system 1, 1′ from one of the sites A, B, C. The site from which the uploaded genomic data set has been uploaded may also be denoted as “first site”. As will be further detailed in connection with
Subsequently, in step S20, the uploaded genomic data set GDS1 is compared to a plurality of genomic data sets GDS2 stored in database 20, 20′ (also denoted as “second genomic datasets”). The step of comparing may comprise accessing database 20, 20′ and retrieving one or more genomic data sets GDS2 from database 20, 20′ for comparison. The comparison may be carried out with respect to all of the genomic data sets GDS2 stored in database 20, 20′ or just with respect to a subset of the genomic data sets GDS2. Specifically, matching engine 10, 10′ may be configured to preselect one or more genomic data sets GDS2 from database 20, 20′ so that the uploaded genomic data set GDS1 is only compared to a fraction of the genomic data sets GDS2 comprised in database 20, 20′. In
In subsequent step S30, one or more reference genomic data sets are identified based on the genomic data sets GDS2 selected for comparison. As mentioned, a reference genomic data set is a genomic data set which has a certain degree of similarity to the uploaded genomic data set GDS1. The identification of similar genomic data sets may be based on the genomic sequence as such or, in other words, on raw data. In this regard, there are several known ways. One involves evaluating a spatial overlap of the gene sequences. However, according to several embodiments, the comparison is based on one or more higher-level genomic features or characteristic values CV1 . . . CVn encoded in the gene sequence that, depending on the state of the genomic data sets, might require further processing of the genomic data sets. These genomic features or characteristic values CV1 . . . CVn correspond to so-called “similarity criteria”. The similarity criteria may be chosen according to the case and/or the genomic data set at hand. In cancer therapy, the analysis of mutations in the gene sequence plays an important role and, accordingly, similarity criteria may likewise be based on evaluating mutations in the gene sequence. The corresponding genomic features and/or characteristic values CV1 . . . CVn may relate to very specific characteristics, such as the exact location of a given mutation in the gene sequence, but may as well concern more generic characteristics, such as the effect of mutations in the signaling pathway.
Example similarity criteria include
the genomic region of a mutation,
the presence of a mutation hotspot (are mutations occurring within a window of a predefined sequence length of amino acids?),
the clinical actionability of mutations,
the mutation consequence (e.g., gain vs. loss of function), or
the effect of mutations on signaling pathways.
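By way of illustration, the mutation-hotspot criterion in the above list may be evaluated as in the following minimal sketch. The window length of 10 amino acids and the minimum of two mutations per window are assumed values chosen for the example, not values prescribed herein.

```python
def has_hotspot(positions, window=10, min_count=2):
    """Check whether at least `min_count` mutations fall within one window.

    `positions` are amino-acid positions of the mutations; `window` is the
    predefined sequence length in amino acids (assumed value).
    """
    ordered = sorted(positions)
    for i in range(len(ordered) - min_count + 1):
        # The span of min_count consecutive mutations must fit into the window.
        if ordered[i + min_count - 1] - ordered[i] < window:
            return True
    return False
```

The Boolean result may then be encoded as a characteristic value, e.g., 1 for a hotspot and 0 otherwise.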
As regards the clinical actionability, the ESMO Scale for Clinical Actionability of molecular Targets (ESCAT) may be used, for instance. Alternatively, the clinical actionability may be determined according to the guidelines of the Association for Molecular Pathology (AMP).
Each genomic feature may correspond to one or more characteristic values CV1 . . . CVn. In this regard, the genomic features may be considered as a more abstract form of features extracted from a gene sequence as compared to the characteristic values CV1 . . . CVn. Genomic features may relate to data objects which can be translated into one or more characteristic values CV1 . . . CVn.
For identifying similarities among two genomic data sets, a degree of similarity may be determined by comparing the individual genomic features and/or characteristic values CV1 . . . CVn. Taking the genomic region of a mutation as an example, such an assessment may involve extracting the genomic region of a given mutation from the gene sequence of the uploaded genomic data set GDS1, extracting the corresponding genomic region from the gene sequence of a stored genomic data set GDS2, and comparing the ensuing characteristic values CV1 . . . CVn, e.g., in the form of calculating the difference in characteristic values CV1 . . . CVn. The result provides an indication of whether or not a mutation is at the same position in two genomic data sets GDS1, GDS2. Evidently, the result may be improved by sampling not only one similarity criterion but a plurality of different criteria. The ensemble of genomic features and/or characteristic values CV1 . . . CVn characterizes a genomic data set GDS1, GDS2 and, hence, may be used to efficiently identify similar genomic data sets. Such an ensemble may also be denoted as a genomic feature vector or feature set.
The genomic features and/or characteristic values CV1 . . . CVn may be extracted from the respective genomic data sets GDS1, GDS2 upon the actual identification of one or more reference genomic data sets, i.e., in the framework of step S30. In this case, step S30 may comprise an optional sub-step S31 in the form of a pre-processing step of extracting one or more genomic features and/or characteristic values CV1 . . . CVn from the genomic data sets GDS1, GDS2 according to one or more similarity criteria. According to an embodiment, step S31 involves applying the aforementioned trained function to the uploaded genomic data set GDS1 and/or the genomic data sets GDS2 from database 20, 20′. This pre-processing step S31 is optional, however, and may depend on the state of the uploaded genomic data set GDS1 (as, for instance, determined in optional step S12), the state of the genomic data sets GDS2 as stored in database 20, 20′ and the actual method relied upon for identifying similar genomic data sets. As an alternative and as already explained previously, the extraction of one or more genomic features and/or characteristic values CV1 . . . CVn may also be carried out in the framework of previous steps S10 or S20. What is more, at least the genomic data sets GDS2 comprised in database 20, 20′ may be held available in an already pre-processed format with the genomic features and/or characteristic values CV1 . . . CVn already extracted and readily available. A corresponding pre-processing is preferably performed upstream of the actual steps for identifying one or more reference genomic data sets GDS2 as this reduces the computation time for each uploaded genomic data set GDS1. For instance, the feature extraction may be carried out when integrating new genomic data sets GDS2 into database 20, 20′.
As mentioned, the extraction of the genomic features according to a set of similarity criteria may furthermore already be carried out at the local sites A, B, C (e.g., in the local computing units 40A, 40B, 40C). Other procedures that may form part of a pre-processing step (either within or outside of step S31) may include filtering out irrelevant information from the genomic data sets. For instance, portions of the sequence may be filtered out that do not identify a chromosomal DNA copy loss or gain. As such, filtered genomic data sets may be generated that generally only include those regions that may contain a chromosomal abnormality. Like in the case of the genomic feature extraction, this pre-processing may be performed already at the local sites A, B, C or by the matching system 1, 1′ once a genomic data set GDS1 has been uploaded.
For the actual identification of one or more reference genomic data sets, a similarity between the genomic data sets GDS1, GDS2 needs to be quantified. This may, for instance, be done by combining the genomic features of the involved genomic data sets GDS1, GDS2 to form feature vectors. A degree of similarity may then be derived by calculating the normalized dot product between the feature vector of the uploaded genomic data set GDS1 and the corresponding feature vector of genomic data set GDS2 from database 20, 20′ (also referred to as “cosine similarity”). Alternatively, a sum of squared differences between genomic features and/or characteristic values of two genomic data sets GDS1, GDS2 may be calculated as a measure for the similarity. Further alternatively, the genomic features and/or characteristic values CV1 . . . CVn may be aggregated to a score S for each genomic data set GDS1, GDS2. Specifically, the score S may be defined as the weighted sum of a plurality of genomic features and/or characteristic values CV1 . . . CVn as follows:
S=W1*CV1+W2*CV2+ . . . +Wn*CVn.
In the above formula, W1 . . . Wn denote weights, which may be positive or negative. Generally speaking, the weights W1 . . . Wn may be seen as indicating the importance of the corresponding genomic feature and/or characteristic value CV1 . . . CVn for finding similar genomic data sets GDS2. The degree of similarity between two genomic data sets GDS1, GDS2 may then be expressed as the difference or distance in the corresponding scores S. Of note, also the summands in the abovementioned dot product or the sum of the squared differences may be correspondingly weighted.
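The three measures described above may be sketched as follows. This is a minimal illustration assuming that the characteristic values CV1 . . . CVn have already been encoded as numbers; the function names are hypothetical. Cosine similarity is computed here as the dot product divided by the product of the vector lengths.

```python
import math


def cosine_similarity(a, b):
    """Dot product of two feature vectors, normalized by their lengths."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def sum_squared_differences(a, b, weights=None):
    """Optionally weighted sum of squared differences (smaller = more similar)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    if weights is None:
        weights = [1.0] * len(a)
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def weighted_score(cv, w):
    """Score S = W1*CV1 + W2*CV2 + ... + Wn*CVn."""
    if len(cv) != len(w):
        raise ValueError("one weight per characteristic value is required")
    return sum(wi * ci for wi, ci in zip(w, cv))


def score_distance(cv1, cv2, w):
    """Degree of similarity expressed as the distance between two scores S."""
    return abs(weighted_score(cv1, w) - weighted_score(cv2, w))
```

Of note, a score distance of zero does not imply identical feature vectors, since different combinations of characteristic values can aggregate to the same score S; the vector-based measures do not have this property.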
According to an embodiment, all or part of the procedures taking place in step S30 might be performed by one or more trained functions (which are applied on the uploaded genomic data set GDS1 and/or the genomic data sets GDS2 in database 20, 20′). According to the above, the trained functions may thus be configured (trained) so as to extract genomic features from the genomic data sets GDS1, GDS2 and output them either as intermediate values or their final output, to score the genomic data sets GDS1, GDS2 on the basis of the genomic features and/or to deliver one or more reference genomic data sets on that basis. However, the trained functions may also follow a completely different procedure and may just indicate one or more reference genomic data sets as the final result. The trained function may be based on regularized regression models (e.g., lasso, elastic net, etc.), random forest algorithms, and/or support vector machines.
Once the similarity between the uploaded genomic data set and the genomic data sets GDS2 stored in database 20, 20′ has been quantified in terms of the degree of similarity, one or more reference genomic data sets may be identified on that basis. This may involve ranking the genomic data sets GDS2 according to their degree of similarity to the uploaded genomic data set GDS1. The genomic data sets GDS2 ranked highest may then be identified as reference genomic data set(s). As an alternative or in addition to that, the degrees of similarity may be compared to a predefined threshold. Genomic data sets GDS2 with degrees of similarity above the predefined threshold may then be selected as reference genomic data set. The predefined threshold may be set automatically and/or (semi-)automatically and/or by a user. If none of the genomic data sets GDS2 has a degree of similarity greater than the predefined threshold, either no reference genomic data set is identified at all or the genomic data set(s) GDS2 with the highest degree of similarity is identified as reference genomic data set(s). Further, the identification of reference genomic data sets amongst the second genomic data sets may involve selecting those second genomic data sets as reference genomic data sets, the score of which lies within a predetermined margin around the score of the first genomic data set. The predetermined margin may be set automatically and/or (semi-) automatically and/or by a user.
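The selection strategies described above (ranking, threshold with optional fallback, and score margin) may be sketched as follows. The sketch assumes that the degrees of similarity are given as a mapping from data set identifiers to numbers where higher means more similar; the parameter names are hypothetical.

```python
def select_references(similarities, threshold=None, top_k=None,
                      fallback_to_best=False):
    """Select reference data sets from {identifier: degree of similarity}."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is not None:
        hits = [kv for kv in ranked if kv[1] > threshold]
        if not hits and fallback_to_best and ranked:
            # Nothing exceeds the threshold: optionally return the best match.
            hits = ranked[:1]
        ranked = hits
    if top_k is not None:
        ranked = ranked[:top_k]
    return [identifier for identifier, _ in ranked]


def select_by_margin(scores, first_score, margin):
    """Select data sets whose score S lies within a margin around first_score."""
    return [
        identifier for identifier, score in scores.items()
        if abs(score - first_score) <= margin
    ]
```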
A further step S40 is directed to dispatching a notification NOT to the site from which the uploaded genomic data set GDS1 has been uploaded. Notification NOT may be indicative, in general, of the result of the genomic similarity search performed by matching engine 10, 10′. According to an embodiment, notification NOT may be indicative of the one or more reference genomic data sets identified. If no reference genomic data set could be identified, this may be included in notification NOT as well. Optionally, step S40 may include sub-step S41 of retrieving, for each reference genomic data set, supplementary information SI and adhering it to the notification NOT. The supplementary information SI may include contact information of the attending physician, information about the therapy and the therapy response, or the like. As mentioned, the supplementary information SI may either be already comprised in the reference genomic data sets or be archived separately in designated databases such as in EMR-repositories 30B, 30′. Accordingly, the supplementary information SI may directly be retrieved from the reference genomic data sets or by querying corresponding repositories 30B, 30′ (e.g., based on the aforementioned unique identifiers).
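Steps S40 and S41 may be illustrated by the following minimal sketch, in which the repository is modeled as a plain dictionary keyed by unique identifier. The field names of the notification are hypothetical; no particular message format is implied.

```python
def build_notification(first_site, reference_ids, repository):
    """Assemble notification NOT for the site that uploaded the data set.

    `repository` maps unique identifiers to supplementary information SI
    (illustrative stand-in for an EMR repository).
    """
    return {
        "to_site": first_site,
        # An empty result is reported as well, as described for step S40.
        "match_found": bool(reference_ids),
        "references": [
            {
                "id": ref_id,
                "supplementary_information": repository.get(ref_id, {}),
            }
            for ref_id in reference_ids
        ],
    }
```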
Optional step S50 is directed to importing the uploaded genomic data set GDS1 into the matching system 1, 1′. This may comprise storing the uploaded genomic data set GDS1 in database 20, 20′ and archiving any supplementary information SI associated to the uploaded genomic data set GDS1 (e.g., either in database 20 itself or in repository 30B, 30′). Upon importing, the uploaded genomic data set GDS1 may be formatted such as to correspond to the genomic data sets GDS2 already stored in database 20, 20′. This may comprise extracting genomic features and/or characteristic values CV1 . . . CVn according to the one or more similarity criteria from the uploaded genomic data set GDS1. Further, data import may also include automated operations of tagging data as well as mapping the imported data to data already archived in the system. The actions of tagging and mapping may be based on any metadata adhered to the uploaded genomic data set and/or any piece of supplementary information SI uploaded together with the uploaded genomic data set. For instance, the disease type may be extracted from either the metadata or the supplementary information SI and used to map the uploaded genomic data set to a disease group within database 20, 20′. Prior to archiving, the uploaded genomic data set and any supplementary information SI may be subjected to an appropriate filtering procedure in order to ensure that the archived data is anonymized.
A further optional step S60 is directed to creating a communication channel CH1, CH2 between the sites associated with the matched genomic data sets. The communication channel CH2 may be configured such that it facilitates direct communication between the treating physicians associated with the matched genomic data sets GDS1, GDS2. In one embodiment, the communication channel CH1, CH2 is configured such that the communication is anonymous without the need to identify a specific patient and/or physician. The communication may, for instance, be effected via the local computing systems 40A, 40B, 40C. In this regard, the communication channel CH2 may connect local computing units 40A, 40B, 40C directly, e.g., by way of a secure internet connection. Alternatively, the communication channel CH1 may be such that the communication is routed via the matching system 1, 1′. In other words, the matching system 1, 1′ takes the role of a connectivity node between the local sites A, B, C associated with the matched genomic data sets.
In addition to that or as an alternative, the communication channel CH1, CH2 may enable a selective access to database 20, 20′ and/or the repository 30B, 30′ of the matching system 1, 1′. Further, the communication channel CH1, CH2 may be configured such that it enables local sites A the selective (one-time) access to a corresponding database 30C of another site. Further, the communication channel CH1, CH2 may be such that it provides the local site which uploaded a genomic data set GDS1 with supplementary information SI for download. To this end, a URL may be provided to the respective local sites, via which the data can be accessed and downloaded. The URL may, for instance, be included in the notification NOT. Further, the communication channel CH1, CH2 may be configured such that it induces local sites A to forward supplementary information SI associated to the one or more reference genomic data sets to the site of origin of the uploaded genomic data set GDS1.
A first step S1 is directed to acquiring genomic data sets GDS by acquisition units 50A, 50B, 50C. This may involve collecting a patient sample and inferring the genetic sequence from it (sequencing). Of note, the sequencing may be obtained from different cells of the body, for example, cells from a tumor. At this stage, the genomic data set GDS may mainly comprise genomic raw data such as a complete genetic sequence of the human genome, or one or more partial genetic sequences, for example, of a chromosome or part of a chromosome. Genetic information included within the genomic data set GDS may include nucleic acids, such as DNA or RNA, coding and/or non-coding RNA expression, and any other genetic or epigenetic modifications such as acetylations, methylations, or others. Further, acquisition units 50A, 50B, 50C may be configured to include metadata in the genomic data set GDS such as a patient ID, patient sex and/or age, the attending physician, a case number or the like. Moreover, the genomic data sets GDS may be provided with a unique identifier in the form of a data tag making the genomic data set GDS unambiguously identifiable at least within the local sites A, B, C. The unique identifier may be a local accession number, for instance. Preferably, the unique identifier is furthermore indicative of the local site A, B, C at which the genomic data set GDS has been generated, making the genomic data set GDS traceable to the respective site A, B, C.
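One conceivable way of forming a unique identifier that also indicates the site of origin is to prefix a local accession number with a site code, as in the following sketch. The format (site code, hyphen, zero-padded eight-digit number) is purely an assumption for illustration.

```python
def make_identifier(site: str, accession_number: int) -> str:
    """Combine a site code and a local accession number into one data tag."""
    return f"{site}-{accession_number:08d}"


def site_of(identifier: str) -> str:
    """Trace a data set back to the site at which it was generated."""
    return identifier.split("-", 1)[0]
```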
A second optional step S2 is directed to pre-processing the genomic data set GDS. This may comprise filtering the genomic data set GDS for relevant information. For instance, the raw data may be filtered for gene sequences containing abnormalities and/or mutations which may be meaningful for the later comparison to other genomic data sets GDS2. Moreover, the pre-processing step may comprise evaluating the raw genomic data set GDS according to one or more similarity criteria for the later similarity search in the matching system 1, 1′. The evaluation of the similarity criteria may yield an associated genomic feature or feature set and/or corresponding characteristic values CV1 . . . CVn. The ensemble of genomic features and/or characteristic values CV1 . . . CVn characterizes the genomic data set GDS for a given clinical question. The ensemble of genomic features may be appended to the raw data contained in the genomic data set GDS. According to an embodiment, the ensemble of genomic features and/or characteristic values CV1 . . . CVn may take the place of the raw data in the genomic data set GDS so that the genomic data set GDS only contains processed data in the form of the genomic features. According to an embodiment, step S2 is performed at local computing units 40A, 40B, 40C. As an alternative, at least parts of step S2 may also be performed already upon acquiring the genomic data set at the acquisition units 50A, 50B, 50C.
A further optional step S3 is directed to retrieving supplementary information SI corresponding to the genomic data set GDS and adhering it to the genomic data set GDS. This may involve querying local databases 30A, 30B, 30C at the sites A, B, C for supplementary information SI. This may be done using the aforementioned unique identifiers unambiguously linking the respective genomic data set GDS to the supplementary information SI. As mentioned, the supplementary information SI may include context information for the genomic data set GDS which may prove helpful for the later comparison to other genomic data sets GDS2 in the matching system 1, 1′. This may include annotated genes, features, physiological measurements, patient medical history, and/or phenotypic disease descriptions. According to an embodiment, step S3 is performed at local computing units 40A, 40B, 40C.
A further optional step S4 is directed to selecting a genomic data set GDS1 for uploading it to the matching system 1, 1′. To this end, the respective genomic data set GDS may be presented to a user via a graphical user interface at the local computing systems 40A, 40B, 40C. The user may then manually select whether or not the genomic data set GDS shall be uploaded to the matching system 1, 1′ for retrieving similar cases. To assist the user in this decision, local computing units 40A, 40B, 40C may be configured to display supplementary information SI corresponding to the genomic data set GDS under consideration. As an alternative, step S4 may also comprise a semi-automatic selection or automated pre-selection of uploading candidates which may, for instance, be based on prior actions of the user and may be presented to the user for review. Moreover, step S4 may comprise a fully automatic selection of genomic data set GDS for uploading. As mentioned, step S4 is optional. This may mean that the distributed environment 100, 200 may also be configured such that all genomic data sets GDS generated are (automatically) uploaded to the matching system 1, 1′. According to an embodiment, step S4 is performed at local computing units 40A, 40B, 40C.
Another optional step S5 is directed to anonymizing the genomic data set GDS1 selected for upload. This may comprise filtering out any personal information from genomic data set GDS1 that would enable identifying the patient belonging to genomic data set GDS1. According to an embodiment, step S5 is performed at local computing units 40A, 40B, 40C.
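A filtering module for step S5 may be sketched as follows, under the assumption that personal information is held in dedicated metadata fields. The field names in `PERSONAL_FIELDS` are hypothetical examples; which fields count as personal information depends on the deployment and the applicable regulations.

```python
# Hypothetical metadata fields treated as personal patient data.
PERSONAL_FIELDS = {"patient_id", "patient_name", "date_of_birth", "address"}


def anonymize(gds: dict) -> dict:
    """Return a copy of the data set with personal metadata fields removed.

    The input is left unmodified so that the local copy keeps its metadata.
    """
    cleaned = dict(gds)
    cleaned["metadata"] = {
        key: value
        for key, value in gds.get("metadata", {}).items()
        if key not in PERSONAL_FIELDS
    }
    return cleaned
```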
A further step S6 is directed to uploading the genomic data set GDS1 to the matching system 1, 1′. This may be performed using mutual interfaces (in the form of one or more interface units) at the local sites A, B, C and the matching system 1, 1′. The upload may be effected via an internet connection using an appropriate protocol such as HTTPS. According to an embodiment, step S6 is performed/initiated at local computing units 40A, 40B, 40C.
A further step S7 is directed to receiving notification NOT, e.g., via the mutual interfaces at the local sites A, B, C and the matching system 1, 1′. Notification NOT may indicate to a user that a reference genomic data set and therewith a similar case has been found by the matching system 1, 1′. Upon receipt, notification NOT may be displayed to the user via an appropriate graphical user interface at local computing units 40A, 40B, 40C. Notification NOT may contain supplementary information SI associated to the reference genomic data set such as phenotypic disease information, information about treatment and treatment response, disease progression, contact information about the attending physician. If no reference genomic data set has been found, this may likewise be indicated in notification NOT.
Another optional step S8 is directed to permitting communication via a communication channel CH1, CH2 between the local sites A, B, C associated to the matched genomic data sets. For instance, a communication session may be conducted between physicians associated to the matched genomic data sets via an appropriate communication channel CH1, CH2 as provided for by matching engine 10, 10′. As mentioned, the communication channel CH2 may either be established as a direct link between the involved sites A, B, C or routed through the matching system 1, 1′ (communication channel CH1). The communication channel CH1, CH2 may be a communication platform, e.g., a chat room for exchanging text messages or a video conference platform. If the communication is routed through the matching system 1, 1′, the matching system 1, 1′ may be configured to host such a communication platform. What is more, communication may also include that the matched sites are granted mutual access to their databases for retrieving supplementary information SI associated with the matched genomic data sets. For data privacy reasons, this access is preferably selective in the sense that only information relevant to the case at hand may be accessed and that the accessible information is anonymized. In addition to that or as an alternative, the supplementary information SI associated to the matched genomic data sets may be provided by the matching system 1, 1′ at a designated repository which may be accessed by the local sites A, B, C for download.
Wherever meaningful, individual embodiments or their individual aspects and features can be combined or exchanged with one another without limiting or widening the scope of the present invention. Advantages which are described with respect to one embodiment of the present invention are, wherever applicable, also advantageous to other embodiments.
The following points are also part of the disclosure:
1. Computer-implemented method for sharing medical information in a distributed environment comprising a plurality of local sites, the method comprising the steps of:
receiving a first genomic data set, the first genomic data set being generated at a first one of the local sites, wherein the first genomic data set comprises genomic data of a first patient;
comparing the first genomic data set with a plurality of second genomic data sets stored in a database external to the first site, wherein the second genomic data sets respectively comprise genomic data of patients different than the first patient;
identifying, amongst the second genomic data sets, one or more reference genomic data sets, on the basis of determining a similarity between the first genomic data set and the second genomic data sets, the reference genomic data sets having a predetermined degree of similarity to the first genomic data set; and
dispatching a notification to the first site indicative of the one or more reference genomic data sets.
2. Method according to 1, wherein the first and second genomic data sets do not comprise any personal information of the corresponding patient.
3. Method according to any of the preceding points, wherein at least a portion of the second genomic data sets has been generated at local sites different than the first site.
4. Method according to any of the preceding points, wherein the database is configured such that it cannot be accessed by the first site.
5. Method according to any of the preceding points, wherein the steps of receiving, comparing, identifying, and dispatching are carried out externally to the first site.
6. Method according to any of the preceding points, further with the step of including (or incorporating) the first genomic data set in the database.
7. Method according to any of the preceding points, wherein the first genomic data set comprises one or more genomic features respectively derived from an underlying gene sequence of a patient at the first site; and
the step of identifying is based on the one or more genomic features.
8. Method according to any of the preceding points, wherein the first genomic data set consists of one or more genomic features respectively derived from an underlying genetic sequence of a patient at the first site; and the step of identifying is based on the one or more genomic features.
9. Method according to 7 or 8, wherein the genomic features are based on evaluating mutations in the underlying genetic sequence, wherein the genomic features preferably comprise one or more genomic regions of mutations in the underlying genetic sequence; one or more mutation hotspots in the underlying genetic sequence; one or more effects of mutation in the underlying genetic sequence; and/or one or more clinical actionabilities of mutations in the underlying genetic sequence.
10. Method according to 7, 8 or 9, wherein the step of identifying comprises comparing the genomic features of the first genomic data set with corresponding genomic features of the second genomic data sets.
11. Method according to any of 8 to 10, further with the step of extracting one or more genomic features from first and/or second genomic data sets.
12. Method according to 11, wherein the step of extracting is based on applying a trained function to the first and/or second genomic data set, wherein the trained function is preferably based on a support vector machine algorithm and/or a random forest algorithm and/or a regularized regression model.
13. System for sharing medical information in a distributed environment comprising a plurality of local sites, the system comprising:
14. Usage of the method according to any one of points 1 to 12 for identifying one or more patients having a similar genomic data set as compared to the first patient.
15. Method for sharing medical information comprising the steps of:
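As a non-limiting illustration of points 1, 7 and 10 above, the steps of comparing, identifying and dispatching may be sketched as follows. The sketch assumes that each genomic data set is represented as a set of genomic feature labels and that similarity is measured by the Jaccard index; both are assumptions of this example, as the points above leave the similarity measure open. The function names, the feature labels used in any call and the default threshold of 0.5 are likewise hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Match = Tuple[str, float]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard index of two feature sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def identify_references(first: FrozenSet[str],
                        database: Dict[str, FrozenSet[str]],
                        threshold: float = 0.5) -> List[Match]:
    """Compare the first data set with every second data set and keep those
    reaching the predetermined degree of similarity, most similar first."""
    scored = [(ds_id, jaccard(first, features))
              for ds_id, features in database.items()]
    references = [(ds_id, score) for ds_id, score in scored if score >= threshold]
    return sorted(references, key=lambda match: match[1], reverse=True)


def match_and_notify(first: FrozenSet[str],
                     database: Dict[str, FrozenSet[str]],
                     dispatch: Callable[[List[Match]], None],
                     threshold: float = 0.5) -> List[Match]:
    """Receive a first data set, identify reference data sets and dispatch
    a notification indicative of them to the first site."""
    references = identify_references(first, database, threshold)
    # An empty list tells the first site that no similar case was found.
    dispatch(references)
    return references
```

Here `database` stands for the database external to the first site, and `dispatch` for whatever mechanism delivers the notification to the first site; only feature labels, and no personal information, enter the comparison.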
The patent claims of the application are formulation proposals without prejudice for obtaining more extensive patent protection. The applicant reserves the right to claim even further combinations of features previously disclosed only in the description and/or drawings.
References back that are used in dependent claims indicate the further embodiment of the subject matter of the main claim by way of the features of the respective dependent claim; they should not be understood as dispensing with obtaining independent protection of the subject matter for the combinations of features in the referred-back dependent claims. Furthermore, with regard to interpreting the claims, where a feature is concretized in more specific detail in a subordinate claim, it should be assumed that such a restriction is not present in the respective preceding claims.
Since the subject matter of the dependent claims in relation to the prior art on the priority date may form separate and independent inventions, the applicant reserves the right to make them the subject matter of independent claims or divisional declarations. They may furthermore also contain independent inventions which have a configuration that is independent of the subject matters of the preceding dependent claims.
None of the elements recited in the claims are intended to be a means-plus-function element within the meaning of 35 U.S.C. § 112(f) unless an element is expressly recited using the phrase “means for” or, in the case of a method claim, using the phrases “operation for” or “step for.”
Example embodiments being thus described, it will be obvious that the same may be varied in many ways. Such variations are not to be regarded as a departure from the spirit and scope of the present invention, and all such modifications as would be obvious to one skilled in the art are intended to be included within the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
19200381.2 | Sep 2019 | EP | regional |