1. Field of the Invention
The present invention relates to a method of data acquisition, and more particularly to a method (and system) of acquiring information from user communications while allowing the user to control the information acquired.
2. Background Description
Data acquisition is a very challenging problem for social software. In general, it is difficult to acquire valuable information. For instance, on average, an employee spends 40% of their working time writing e-mails and instant messages. The information in the e-mails and instant messages is valuable data, which can be used to infer an employee's knowledge.
In order to acquire useful communication information, previous systems acquire data through a corporate e-mail server or an instant message server. Such data acquisition is typically conducted without the users' knowledge. Thus, the acquisition raises various security and privacy concerns among users and is a major obstacle to the use of valuable communication data for corporate purposes.
In view of the foregoing and other exemplary problems, drawbacks, and disadvantages of the conventional methods and structures, an exemplary feature of the present invention is to provide a method and structure that can acquire data from a user's communications without affecting the privacy of the user.
In accordance with a first exemplary aspect of the present invention, a method of data acquisition includes extracting information from user communications and allowing a user to control the information to be extracted.
In accordance with a second exemplary aspect of the present invention, a method of data acquisition includes downloading a user's sent materials from a communication data repository, analyzing the downloaded materials and extracting data portions that are authored by the user, generating statistical values from the explicitly extracted data, transmitting the generated statistical values to one or multiple repositories, receiving the generated statistical values on one or multiple server machines, and aggregating statistical values of multiple users.
In accordance with a third exemplary aspect of the present invention, a distributed social sensor system implemented method of social network inference or expertise location includes installing a software program residing on an individual user's machine for downloading the user's own sent materials from a communication data repository, analyzing the downloaded materials and extracting the data portions that are explicitly authored by the user, generating statistical values from the explicitly extracted data, transmitting the generated statistical values to one or multiple social sensor server repositories, installing a software program residing on one or multiple social sensor server repository machines to receive generated statistical values of multiple users, and aggregating statistical values of multiple users to construct one or plural aggregated social networks, expertise inference, or social networks and expertise inference of multiple persons including only users or both users and non-users.
The present invention provides a set of network client software that resides in an end user's machine. In accordance with certain aspects of the invention, the present invention uses an algorithmic process to extract features from communications. Data is transferred into a hub repository using a client-server web architecture. The present invention also provides a mechanism to run these processes periodically without user intervention. Furthermore, an exemplary aspect of the present invention allows a user to control the information to be captured.
In accordance with an exemplary aspect, the present invention may infer social network or expertise data from communication. Acquisition of communication data, however, is extremely difficult, because of privacy concerns. Seldom do users want to reveal their communications to other people or allow a machine residing somewhere in the computer network to capture their communication data because of a potential privacy leakage.
Therefore, in accordance with an exemplary aspect, the present invention takes privacy-preservation and copyright-preservation into account for data acquisition. The present invention avoids capturing raw communication data by only taking the statistics of communication data that are explicitly authored by the user. Furthermore, the present invention provides a mechanism that allows a user to monitor acquired information and prevent certain information from being acquired. Additionally, the user is able to modify the inference result, before their inferred expertise or personal social network is aggregated into large repositories to be used for public application.
Accordingly, the present invention significantly increases the confidence level of users and makes them more willing to provide data without compromising their privacy. This invention fosters a foundation of large-scale social network and expertise inference applications.
The foregoing and other objects, aspects and advantages will be better understood from the following detailed description of a preferred embodiment of the invention with reference to the drawings, in which:
Referring now to the drawings, and more particularly to
Certain exemplary, non-limiting embodiments of the present invention are directed to a social sensor system (and method) that deploys social sensors in an employee's computer to gather features of the employee's communications. Because only features, not entire communications, are captured, users are more willing to contribute to the system, because the user's privacy will be maintained. In addition, the system allows users to set stop-words to exclude specific words from being captured. The system may also run periodically and automatically without any user intervention. Thus, this system can be used to capture valuable information that is appropriate for social inference in social software applications.
Most prior expertise locator systems acquire data by having individuals fill out profile information or by extracting or deriving the information from existing sources using artificial intelligence algorithms. Those sources could be “public”, such as co-authored documents and patents, or user-generated, such as blogs, wikis and social tagging systems. Data can also be acquired from private sources such as e-mail, chat, and calendar entries that contribute semantic information as well as social network data.
Private data, such as, but not limited to, e-mail logs, have the advantage of containing rich information from which information about what one knows and whom one knows can be derived. These data also address issues of (a) coverage—everyone uses email so data can be collected from everyone not just the people who have authored documents or other data; (b) maintainability—new email is constantly being generated; and (c) ease of use—people are already using email so other than asking users for permission to use their data there is no additional work required by the user.
Using private data, however, may violate a user's (or other party's) privacy. If privacy issues are not adequately addressed, users will quickly stop using an expertise locator system, opt out of volunteering their data, and generate negative word of mouth, all of which would severely affect any ability to have sufficient people in the system to deliver useful search results.
In accordance with an exemplary, non-limiting aspect of the present invention, the system uses e-mails and instant messaging as a data source to obtain appropriate information while maintaining the users' privacy. Additionally, public data from profile, blogs, forum, social bookmarking, etc., may be used to help enhance the expertise ranking accuracy.
In an exemplary embodiment of the present invention, the system (and method) may utilize a plurality (e.g., three) of data sources, including but not limited to, an employee's outgoing emails to other employees within the company, outgoing stored chats, and profile data from an enterprise directory. These data are contributed to a wider aggregated data pool. The system applies artificial intelligence algorithms to infer a participant's social network (who they know) and the expertise of those people (what they know) based on these communications (e.g., outgoing communications). The modified social networks (and the related expertise data) are aggregated to form a composite data pool.
Because of the sensitivity of the data, the present invention provides strict guidelines that restrict the data that may be collected, how the data is used, and what information is available to users. In particular, the present invention uses aggregated and inferred information, which prevents any user from seeing a direct relationship between any person in the system, their email, and the information being displayed. The system does not keep or display any information about whom a user communicated with and about what the user communicated.
The system collects data only from people who opt into the system. Once a user enters the system of the present invention, the user merely specifies a location of his/her e-mail archives and/or chat history. The system then extracts data from the e-mail archives and/or chat history. The real e-mail or chat data never leaves the user's machine. Only statistical indexes are transmitted.
Furthermore, in accordance with an exemplary non-limiting aspect of the present invention, the system extracts content from outgoing e-mail. That is, the system extracts content from e-mails that were authored by the person who opted into the system. The system may be configured to extract content from only outgoing e-mails authored by the user. The system, however, is not limited to extracting information from outgoing e-mails and may be used to extract information from any communication involving the user.
Additionally, the system may be configured to exclude threads that are embedded in the e-mail. The system may also be configured to exclude any e-mails marked private or confidential.
The system, as provided in several non-limiting embodiments of the present invention, is open for expertise and social network inference on all employees of a company by applying a collaborative filtering/link analysis algorithm, which makes unbiased, intelligent inferences about a large number of people based only on data contributed by a small number of people.
To further increase the privacy of contributing users and non-contributing parties, the system of the present invention may inform a non-contributing party that the party may be found through the system whenever a user's data can start making meaningful inferences on the party's expertise and social network. Additionally, the system allows any user (either a data contributor or a non-contributor), at any time, to specify the search items under which they cannot be found or the people with whom they cannot be associated.
Email history removal 314 removes the historical thread in an email. The purpose is to remove any portion in an email that is not written by the email sender.
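By way of a non-limiting illustration, the history removal described above might be sketched as follows. The marker patterns and the function name `strip_history` are assumptions introduced here for illustration only; the description does not specify how the historical thread is detected, and real mail clients vary in how they quote earlier messages.

```python
import re

# Hypothetical markers that commonly introduce a quoted thread. This list is
# an assumption for illustration and is not taken from the description.
_HISTORY_MARKERS = [
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}\s*$", re.IGNORECASE),
    re.compile(r"^On .+ wrote:\s*$"),
]


def strip_history(body: str) -> str:
    """Return only the text that precedes the first quoted-thread marker."""
    kept = []
    for line in body.splitlines():
        if any(pattern.match(line) for pattern in _HISTORY_MARKERS):
            break  # everything from here on is treated as earlier thread text
        if line.lstrip().startswith(">"):
            continue  # drop individually quoted lines
        kept.append(line)
    return "\n".join(kept).strip()
```

A heuristic of this kind errs toward discarding text, which is consistent with the stated purpose of keeping only the portion written by the sender.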
The email/IM filters 305 are used to exclude emails that have specific characteristics as defined in the metadata of the email (e.g., subject line, sender, cc, time, etc.). The purpose is to exclude emails that are configured not to be processed. For example, the system uses only the emails authored by the user, excludes emails whose subject lines contain specific words (e.g., confidential, attorney, personal, private, etc.), and uses only the emails sent to receivers within a range (e.g., only those emails sent to receivers inside the company, inside the business division, inside a country, etc.).
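As a minimal sketch of the metadata filters described above, the three example rules (author, subject words, receiver range) might be combined as shown below. The `Email` record, the `passes_filters` function, and the use of an address domain to stand for "inside the company" are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Email:
    # Hypothetical metadata record; field names are illustrative assumptions.
    sender: str
    recipients: List[str]
    subject: str


# Example subject words taken from the description above.
BLOCKED_SUBJECT_WORDS = ("confidential", "attorney", "personal", "private")


def passes_filters(email, user_address, allowed_domain,
                   blocked_words=BLOCKED_SUBJECT_WORDS):
    """Return True only if the email may be processed further."""
    # Rule 1: use only emails authored by the contributing user.
    if email.sender.lower() != user_address.lower():
        return False
    # Rule 2: exclude emails whose subject line contains a blocked word
    # (simple substring match, so "personal" also blocks "personally").
    subject = email.subject.lower()
    if any(word in subject for word in blocked_words):
        return False
    # Rule 3: use only emails whose receivers all fall within the range,
    # approximated here by a single address domain.
    suffix = "@" + allowed_domain.lower()
    return all(r.lower().endswith(suffix) for r in email.recipients)
```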
The stemming and stop-word removal 307 processes a text analysis scheme, which removes stop-words in sentences and converts all words to stems (e.g., convert “file”, “files”, “filed”, or “filing”, to “file”).
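For illustration only, the stop-word removal and stemming step might be sketched as below. The stemmer is a deliberately simple suffix stripper written for this example, not a standard stemming algorithm, and it maps the four example words to the common stem "fil" rather than "file". The stop-word list and the `extra_stop_words` parameter (standing in for user-specified stop-words) are likewise assumptions.

```python
import re

# Small illustrative stop-word list; a real deployment would use a fuller one.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "was", "were", "for", "on"}
)
_SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip one common suffix, then a trailing 'e', keeping at least 3 letters."""
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word


def normalize(text, extra_stop_words=()):
    """Lowercase, tokenize, drop stop-words, and stem the remaining words."""
    stop = STOP_WORDS | {w.lower() for w in extra_stop_words}
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(t) for t in tokens if t not in stop]
```

Under these assumptions, `normalize("The files were filed in the cabinet")` yields `["fil", "fil", "cabinet"]`, so that inflected forms of the same word are counted together in later statistics.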
The keyword extraction TF/IDF 315 calculates statistics of stemmed word term frequencies (TF) in each individual email. The inverse document frequency (IDF) is an optional statistic that can be extracted. The boxes described in this figure can apply not only to emails, but also to instant messages or calendar data.
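A minimal sketch of the statistics described above is given below. The description does not state which IDF formula is used; the common textbook form log(N / df) is assumed here, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(stems):
    """Count how often each stemmed word occurs in one email."""
    return Counter(stems)


def inverse_document_frequencies(documents):
    """Optional statistic: log(N / df) per term over a list of stemmed emails.

    The formula is an assumed textbook variant, not taken from the description.
    """
    n = len(documents)
    document_counts = Counter()
    for stems in documents:
        document_counts.update(set(stems))  # count each term once per email
    return {term: math.log(n / df) for term, df in document_counts.items()}
```

For example, given the stemmed emails `[["fil", "report", "fil"], ["report", "budget"]]`, the first email has a term frequency of 2 for "fil", and "report", which occurs in both emails, has an IDF of 0.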
The method 400 of data acquisition includes extracting information from user communications 410 and allowing a user to control the information to be extracted 420. Specifically, the method includes extracting information from, for example and not limited to, outgoing user communications. More specifically, the method includes extracting information from, for example and not limited to, communications that are authored by the contributing user. The controlling method may include, for example but not limited to, excluding some communications based on a user-specified exclude list, which includes a list of words or topics to be excluded. The controlling method may also include, for example but not limited to, excluding some communications based on a user-specified exclude list of communicating people.
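The user-controlled exclusion described above might, purely as an illustrative sketch, be expressed as a single check applied to each communication before extraction. The function name `is_excluded` and its parameters are hypothetical and are not taken from the description.

```python
def is_excluded(words, recipients, excluded_words, excluded_people):
    """Return True if a communication must be skipped under the user's lists.

    words: words (or topics) found in the communication
    recipients: addresses of the people the communication was exchanged with
    excluded_words / excluded_people: user-specified exclude lists
    """
    blocked_words = {w.lower() for w in excluded_words}
    blocked_people = {p.lower() for p in excluded_people}
    if any(w.lower() in blocked_words for w in words):
        return True
    return any(r.lower() in blocked_people for r in recipients)
```

In this sketch a single match on either list excludes the entire communication, which is the more conservative of the possible readings.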
The method 500 of data acquisition, may include downloading 510 a user's materials (e.g., sent materials) from a communication data repository, analyzing 520 the downloaded materials and extracting data portions (e.g., data portions that are authored by the user), generating 530 statistical values from the extracted data, transmitting 540 the generated statistical values to one or multiple repositories (e.g., social sensor server repositories), receiving 550 the generated statistical values on one or multiple server machines (e.g., social sensor server repository machines), and aggregating 560 statistical values of multiple users.
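As a non-limiting sketch of the generating 530, transmitting 540, receiving 550, and aggregating 560 steps, the client might serialize only term counts and the server might merge the counts received from multiple users. The JSON payload layout and the names `build_payload` and `aggregate` are assumptions made for illustration; the description does not define a wire format.

```python
import json
from collections import Counter


def build_payload(user_id, stemmed_messages):
    """Client side: turn already filtered, stemmed messages into statistics.

    Only term counts are serialized; the message text itself is never included.
    """
    terms = Counter()
    for stems in stemmed_messages:
        terms.update(stems)
    return json.dumps({"user": user_id, "terms": terms}, sort_keys=True)


def aggregate(payloads):
    """Server side: merge payloads into a term -> {user: count} index."""
    index = {}
    for raw in payloads:
        data = json.loads(raw)
        for term, count in data["terms"].items():
            index.setdefault(term, Counter())[data["user"]] += count
    return index
```

An index of this shape could then serve as input to an expertise inference step, since it records which users use which terms and how often without retaining any message content.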
The aggregated statistical values may then be used to construct one or plural aggregated social networks, expertise inference, or social networks and expertise inference of multiple people including only users or both users and non-users. The method 500 (and system) may include, for example but not limited to, a set of user interfaces to allow a user to manually add a person(s) to, or remove a person(s) from, the user's personal social network before or after aggregation. Furthermore, the method may include, for example but not limited to, a set of user interfaces to allow a user to manually remove the user from a set of expertise words before or after aggregation.
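The manual corrections described above might be backed by operations such as the following sketch. The data shapes (a set of contacts per user, and a mapping from expertise word to the set of associated users) and the function names are hypothetical.

```python
def apply_network_edits(inferred_contacts, added=(), removed=()):
    """Return the user's personal social network after manual edits.

    Removals take precedence over additions if a person appears in both.
    """
    return (set(inferred_contacts) | set(added)) - set(removed)


def remove_user_from_expertise(expertise_index, user, words):
    """Return a copy of the index with the user detached from the given words.

    expertise_index: mapping of expertise word -> set of associated users
    """
    blocked = set(words)
    return {
        word: (users - {user} if word in blocked else set(users))
        for word, users in expertise_index.items()
    }
```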
In certain exemplary aspects of the present invention, the above-described methods may be implemented in a distributed social sensor system for social network inference or expertise location, as described above and exemplarily illustrated in
Furthermore, the above methods may also include installing a software program residing on an individual user's machine for downloading the user's own sent materials from a communication data repository and installing a software program residing on one or multiple social sensor server repository machines to receive generated statistical values of multiple users.
The CPUs 611 are interconnected via a system bus 612 to a random access memory (RAM) 614, read-only memory (ROM) 616, input/output (I/O) adapter 618 (for connecting peripheral devices such as disk units 621 and tape drives 640 to the bus 612), user interface adapter 622 (for connecting a keyboard 624, mouse 626, speaker 628, microphone 632, and/or other user interface device to the bus 612), a communication adapter 634 for connecting an information handling system to a data processing network, the Internet, an Intranet, a personal area network (PAN), etc., and a display adapter 636 for connecting the bus 612 to a display device 638 and/or printer 639 (e.g., a digital printer or the like).
In addition to the hardware/software environment described above, a different aspect of the invention includes a computer-implemented method for performing the above method. As an example, this method may be implemented in the particular environment discussed above.
Such a method may be implemented, for example, by operating a computer, as embodied by a digital data processing apparatus, to execute a sequence of machine-readable (computer-readable) instructions. These instructions may reside in various types of signal-bearing or computer-readable media.
Thus, this aspect of the present invention is directed to a programmed product, comprising signal-bearing media or computer-readable media tangibly embodying a program of machine-readable (computer-readable) instructions executable by a digital data processor incorporating the CPU 611 and hardware above, to perform the method of the invention.
This computer-readable media may include, for example, a RAM contained within the CPU 611, as represented by the fast-access storage for example. Alternatively, the instructions may be contained in another computer-readable media, such as a magnetic data storage diskette 700 (
Whether contained in the diskette 700, the computer/CPU 611, or elsewhere, the instructions may be stored on a variety of computer-readable data storage media, such as DASD storage (e.g., a conventional “hard drive” or a RAID array), magnetic tape, electronic read-only memory (e.g., ROM, EPROM, or EEPROM), an optical storage device (e.g., CD-ROM, WORM, DVD, digital optical tape, etc.), paper “punch” cards, or other suitable signal-bearing media. In accordance with certain exemplary embodiments of the present invention, the computer-readable media may include transmission media such as digital and analog communication links and wireless links. In an illustrative embodiment of the invention, the machine-readable (computer-readable) instructions may comprise software object code.
While the invention has been described in terms of several exemplary embodiments, those skilled in the art will recognize that the invention can be practiced with modification within the spirit and scope of the appended claims.
Further, it is noted that Applicants' intent is to encompass equivalents of all claim elements, even if amended later during prosecution.