This invention relates in general to the processing of data, and more particularly, to detecting gaps in a data stream. Even more particularly, this invention relates to detecting gaps in transmitted data streams and performing user configurable operations when these gaps are detected.
In today's rapidly changing marketplace it is important for businesses of all sizes to disseminate information about the goods and services they have to offer. To accomplish this efficiently, and comparatively inexpensively, many businesses have set up sites on the World Wide Web. These sites provide information on the products or services the business provides; the size, structure, and location of the business; or any other type of information which the business may wish people to access.
Conversely, it is also important for businesses to collect information on the people who are interested in them. These people may include customers, investors or potential employees. One inexpensive method of obtaining data on these people and their various interests is to recreate a visitor's activity on the website of the business. After assimilating data on visitors to their website, the business will have a clearer picture of their interests, and to some degree the effectiveness of the various portions of the website.
The construction and implementation of many websites, however, makes this a difficult task. Though a website may appear as a seamless entity when viewed with an internet web browser, in truth most websites are run by a variety of servers and computers. For example, one group of servers may be running applications providing information on support, some servers may be running CGI gateway applications, and others may be providing product data. This division means that a visitor to the website may be hosted by one server at the beginning of his visit, switched to another server while navigating the website, and wind up on a third before his visit is complete.
Thus, to recreate a visitor's activity on all websites during a single visit (session) all the data about that particular visitor's activity on every server which operates the website should be analyzed. Because there is such a large volume of data available on each user it is helpful to process the data feeds from these servers in real-time. This means that the availability of the data is of the utmost importance. If data is missing or otherwise incomplete the wrong calculations may take place. It is also costly to add missed data back to a set of data which has already been processed. Adding to the complications is the fact that data may not be reported from the various servers in a synchronous manner.
Therefore, in order to reconstruct a visitor's session it is critical that the system analyzing the data reported from the servers is aware of what data to expect, and what data is actually available. Furthermore, the system must be able to synchronize the data under scrutiny. Prior art systems for processing this session and utilization data were not necessarily aware of the type and availability of data, and would either process incomplete data or require data to be bundled and ready to be processed as a batch. Additionally, these prior art systems lacked awareness of the network topology from which they received data, which in turn hampered these systems' ability to make intelligent decisions about missing data.
Thus, there is a need for systems and methods which may process data streams from a network topology, detect gaps in a data stream in order to prevent the processing of incomplete data, and which may store the incomplete data separately until it is complete and capable of being processed as a whole.
Systems and methods for the detection of gaps in a set of data are disclosed. These systems and methods allow data to be associated with streams, gaps in the data to be detected, and appropriate remedial action to be taken. In many embodiments, streams may be defined based upon a network topology, and incoming data is then associated with those streams. Processing of these streams is then determined by an analysis of the timing of events within the stream.
Additionally, systems are presented which embody this type of methodology in computer systems, hardware, and software that detect gaps in a set of data.
In some embodiments, a time difference is calculated for the events in each stream and the processing of each stream depends upon the calculated time difference.
In other embodiments, the processing of all streams may be halted if the time difference calculated in any stream is greater than a first time period. In related embodiments the time period after which processing of streams may be halted is defined by a user.
In yet other embodiments, a notification is provided when a gap in any data stream is detected. In related embodiments, this notification may be an email sent to a system administrator.
Still other embodiments resume the processing of the data streams upon reception of more data associated with the stream in which a gap was detected.
In another set of embodiments, processing of the data streams continues after a certain period of time. In related embodiments, this period of time may be user configurable.
These, and other, aspects of the invention will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following description, while indicating various embodiments of the invention and numerous specific details thereof, is given by way of illustration and not of limitation. Many substitutions, modifications, additions and/or rearrangements may be made within the scope of the invention without departing from the spirit thereof, and the invention includes all such substitutions, modifications, additions and/or rearrangements.
The drawings accompanying and forming part of this specification are included to depict certain aspects of the invention. A clearer conception of the invention, and of the components and operation of systems provided with the invention, will become more readily apparent by referring to the exemplary, and therefore nonlimiting, embodiments illustrated in the drawings, wherein identical reference numerals designate the same components. The invention may be better understood by reference to one or more of these drawings in combination with the description presented herein. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale.
The invention and the various features and advantageous details thereof are explained more fully with reference to the nonlimiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well known starting materials, processing techniques, components and equipment are omitted so as not to unnecessarily obscure the invention in detail. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only and not by way of limitation. Various substitutions, modifications, additions and/or rearrangements within the spirit and/or scope of the underlying inventive concept will become apparent to those skilled in the art from this disclosure.
A few terms are defined or clarified to aid in understanding the descriptions that follow: a network topology is a mapping between the physical components that produce data and the logical groupings of data that the physical network may produce. A network topology may be the layout of a particular network or system of computers, which in turn may define the data which can be expected by a data processing system employing a method of detecting gaps in this data. This network topology may in turn be composed of logical or physical servers and their associated hosts and data locations. From this network topology a series of streams can be defined. A stream may be regarded as a logical data source, and may be a 1 to 1 mapping of physical to logical sources or a many to 1 mapping of many physical sources into one logical source. Servers may be fault tolerant, indicating that though there may be several hosts or data locations associated with a logical or physical server for the sake of redundancy, data coming from any host or data location associated with that particular server should be regarded as one stream.
Though the exemplary embodiment described below utilizes gap detection in a system designed to analyze data transmitted from servers and other machines implementing a website, those skilled in the art will appreciate that these same systems and methods may be employed for a myriad number of other uses and applications, such as detecting gaps in extant and resident files, or in other types of network transmissions. Additionally, it will be understood that these same systems and methods can be implemented in software systems, computer programs, hardware, and any combination thereof.
Attention is now directed to systems and methods for detecting gaps in a set of data or in a data transmission. These systems and methods may divide the sources of data to be processed into streams and analyze these streams to detect gaps. After gaps are detected remedial action may be taken, and processing of the data may continue. The systems and methods described are especially useful when employed in a data processing system designed to receive data from a variety of sources.
Turning now to
During user's 110 visit to the website, servers 120, 160, 180, hosts 140, 150, 170, 190 and data locations 130 provide information, data, and applications that user 110 utilizes. In turn, servers 120, 160, 180, hosts 140, 150, 170, 190 and data locations 130 record information about user's 110 activities.
This collected information may be analyzed to determine the activities that each user 110 performed during his visit to the website. To properly recreate the user's 110 activities, however, all the data about a user 110 on servers 120, 160, 180, hosts 140, 150, 170, 190 and data locations 130 should be analyzed. Since most systems designed to analyze this user 110 data, including relationship management servers, are designed to process real-time data feeds, availability of this user 110 data is critical. If data is missing, or data from different locations is out of synch, wrong calculations may occur.
Since user 110 data resides on servers 120, 160, 180, hosts 140, 150, 170, 190 and data locations 130, in order to analyze this data it usually must be transmitted from these servers 120, 160, 180, hosts 140, 150, 170, 190 and data locations 130 to a central location for collation and assembly. However, it is difficult to align data from many different sources to produce a cohesive set of data, as data arrives at different times from different servers and is not necessarily in chronological order. Additionally, many times the value of the data under analysis is highly dependent on the timeliness of that data. To assist in this collation and assembly, it is useful to detect when there are gaps in the incoming data in order that some form of remedial action may be taken.
In
The creation of these streams is based on the layout of the network under analysis. Usually, streams are defined in terms of servers 120, 160, 180 and their associated hosts 140, 150, 170, 190 or data locations 130. A host 140, 150, 170, 190 may be a single machine, and a server 120, 160, 180 may contain one or more hosts 140, 150, 170, 190 or data locations 130 that should be considered together. Each server 120, 160, 180 and its associated data locations 130 and hosts 140, 150, 170, 190 may be marked as a stream. A server 120, 160, 180 may also be marked as fault tolerant, indicating that data coming from that server 120, 160, 180 or the associated hosts 140, 150, 170, 190 and data locations 130 should be considered one stream. When files subsequently come in to this central location from the various hosts 140, 150, 170, 190, data locations 130, and servers 120, 160, 180 which make up the network topology, these files are then associated with one of the defined streams. Processing of these streams may then begin 220.
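The definition of streams from a network topology, and the association of incoming files with those streams, can be illustrated with a minimal sketch. It is offered under stated assumptions and is not the implementation of any embodiment: the names Server, define_streams, and associate_files are hypothetical, as are the host and file names used to exercise them. The sketch further assumes that a server not marked as fault tolerant yields one stream per host or data location, a choice the description leaves open.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical description of one server in a network topology."""
    name: str
    sources: list                 # hosts or data locations tied to this server
    fault_tolerant: bool = False  # True: all sources count as one stream


def define_streams(topology):
    """Map each physical source to the name of its logical stream.

    Fault tolerant servers collapse all of their sources into one stream
    (many to 1). Assumption: any other server gets one stream per source
    (1 to 1).
    """
    source_to_stream = {}
    for server in topology:
        for source in server.sources:
            if server.fault_tolerant:
                source_to_stream[source] = server.name
            else:
                source_to_stream[source] = f"{server.name}/{source}"
    return source_to_stream


def associate_files(files, source_to_stream):
    """Group incoming (source, filename) pairs by stream.

    Assumption: files from sources absent from the topology are ignored.
    """
    by_stream = {}
    for source, filename in files:
        stream = source_to_stream.get(source)
        if stream is not None:
            by_stream.setdefault(stream, []).append(filename)
    return by_stream
```

In this sketch a fault tolerant server with two redundant hosts produces a single stream, so a file arriving from either host is treated as data for the same logical source.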
During processing, gaps in the incoming data may be detected 230. During analysis or processing of the data available from the network, data loss between a last event and a next event may be calculated. This data loss may be determined by the comparison of a wide variety of factors, including transaction elements, timing, and the presence of fault tolerant physical devices. After this data loss reaches a certain threshold the system will determine that a gap exists. The amount of data loss which is acceptable before the system determines it has detected a gap may be user configurable.
In one specific embodiment, if the difference between the time of the last event received or processed and the next event available from a single stream is greater than a certain time period then the system will have detected a gap. In some embodiments the time period considered to be a gap is a variable labeled GAP_TIME, and may be set by a user. The GAP_TIME variable may be global, or set and assigned to each stream during configuration of the system within which the gap detection methodology is being utilized. Therefore, a stream has a gap if the time of the next event available in that stream is past the time of the last event plus the GAP_TIME variable. A stream can also be considered to have a gap if there is no data available for that stream upon start up of the gap detection method. In specific embodiments, if the GAP_TIME variable is set to 0 gap detection will not be performed on the incoming data.
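The GAP_TIME comparison described above can be expressed as a short sketch. The function names has_gap and gap_time_for are hypothetical, and times are assumed to be numbers of seconds; the description does not fix a unit. The handling of the case in which only one of the two times is known is likewise an assumption.

```python
def gap_time_for(stream, per_stream_gap_time, global_gap_time):
    """Return the GAP_TIME for a stream, preferring a per-stream setting."""
    return per_stream_gap_time.get(stream, global_gap_time)


def has_gap(last_event_time, next_event_time, gap_time):
    """Decide whether a single stream has a gap.

    last_event_time: time of the last event received or processed, or None
    next_event_time: time of the next event available, or None
    gap_time:        the GAP_TIME value; 0 disables gap detection
    """
    if gap_time == 0:
        return False  # GAP_TIME of 0: gap detection is not performed
    if last_event_time is None and next_event_time is None:
        return True   # no data available for the stream upon start up
    if last_event_time is None or next_event_time is None:
        return False  # assumption: not enough information to decide
    # Gap if the next event is past the last event plus GAP_TIME.
    return next_event_time > last_event_time + gap_time
```

With a GAP_TIME of 300, a next event at time 500 following a last event at time 100 is a gap, while a next event at time 400 is not, since 400 is not past 100 plus 300.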
If a gap is detected in a stream remedial action may then be taken 240. In many embodiments, this remedial action may consist of stopping processing of the data streams and sending a notification. In some related embodiments this notification may be an email to a user or system administrator regarding this gap in the data. While the processing of data is halted, incoming data may be stored for later processing.
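One possible shape for this remedial action is sketched below, under stated assumptions. The class StreamProcessor and its methods are hypothetical names, and the notification is modeled as a caller-supplied function rather than an actual email, so that the sketch stays self-contained.

```python
class StreamProcessor:
    """Hypothetical processor that halts and buffers data when a gap occurs."""

    def __init__(self, notify):
        self.notify = notify   # callable taking one message string
        self.halted = False
        self.pending = []      # data stored while processing is halted
        self.processed = []    # data that has been processed

    def on_gap(self, stream):
        """Stop processing and send one notification for this halt."""
        if not self.halted:
            self.halted = True
            self.notify(f"Gap detected in stream {stream}; processing halted")

    def receive(self, item):
        """Process the item, or store it for later if processing is halted."""
        if self.halted:
            self.pending.append(item)
        else:
            self.processed.append(item)

    def resume(self):
        """Resume processing, starting with the data stored during the halt."""
        self.halted = False
        self.processed.extend(self.pending)
        self.pending.clear()
```

In a deployment the notify function might wrap a mail client so that a system administrator receives the message; here any callable, such as the append method of a list, will serve.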
At some point the system employing the gap detection methodology may resume processing data 270, 280. If the stream in which a gap was detected begins receiving data again 250, the processing of streams may resume 270. Additionally, in some embodiments a user may define a threshold; after this threshold is reached the system may continue with processing data. This threshold may be dependent on a variety of factors such as transaction elements, fault tolerance, and timeliness of the data.
As the timeliness of data is often its value, in many embodiments, a GAP_CONTINUE variable is present to allow tradeoffs to be made between the accuracy of the data under analysis and its timeliness. In certain related embodiments this GAP_CONTINUE variable may be configured by a user, which gives users of the data processing system flexibility in tuning the behavior of the system.
In the event the stream in which a gap was detected does not begin receiving data after the amount of time defined by the GAP_CONTINUE variable has elapsed 260, the processing of data in streams other than the stream in which a gap was detected may continue 280. In some embodiments, if GAP_CONTINUE is set to −1, data processing may not continue until manually configured to do so; similarly, if GAP_CONTINUE is set to 0, data processing will continue with no pause. This gap detection methodology can then continue to be applied to the streams undergoing processing.
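The GAP_CONTINUE settings described above can be summarized in a brief sketch. The function name may_continue_without_stream is hypothetical, and the elapsed time is assumed to be measured in seconds. The sketch covers only the timeout decision; resumption triggered by the arrival of new data in the affected stream is a separate path that is not modeled here.

```python
def may_continue_without_stream(gap_continue, seconds_since_gap):
    """Decide whether the remaining streams may continue processing.

    gap_continue:      the GAP_CONTINUE value
                       -1 means wait until manually configured to continue
                        0 means continue with no pause
                       any positive value is the time to wait after a gap
    seconds_since_gap: time elapsed since the gap was detected
    """
    if gap_continue == -1:
        return False  # never continue on a timer; manual action is required
    return seconds_since_gap >= gap_continue
```

A small GAP_CONTINUE value favors the timeliness of the data, while a large value or −1 favors its completeness, which is the tradeoff this variable is intended to expose.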
Specific embodiments of the invention will now be further described by the following, nonlimiting examples which will serve to illustrate in some detail various features and functionality. The following examples are included to facilitate an understanding of ways in which the invention may be practiced. It should be appreciated that the examples can be considered to constitute preferred modes for the practice of the invention. However, it should be appreciated that many changes can be made in the exemplary embodiments which are disclosed without departing from the spirit and scope of the invention.
At initialization of the system there are two streams 310, 320 that have some data available, as can be seen in
The system processes only as much data as is available from all streams. In
As can be seen in
In
The system administrator may update the network topology file to include stream 3 330 and fix stream 2 320 so it produces data, as depicted in
The following example is an embodiment of the gap detection systems and methods discussed herein:
In the foregoing specification, the invention has been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of invention.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any component(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature or component of any or all the claims.
This application claims a benefit of priority under 35 U.S.C. § 119(e) to U.S. Patent Application No. 60/394,619 entitled “System and Method For Detecting Gaps in a Data Stream” by John C. Artz Jr. and Heeren Pathak filed Jul. 9, 2002. This application is related to U.S. patent application Ser. Nos. 10/616,107, entitled “System and Method of Associating Events with Requests” by John C. Artz et al., filed on Jul. 9, 2003, and 10/616,408, entitled “Method and System for Site Visitor Information” by John C. Artz et al., filed on Jul. 9, 2003. All applications cited within this paragraph are assigned to the current assignee hereof and are fully incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
60394619 | Jul 2002 | US |