Claims
- 1. A method of classifying e-mail as spam, the method comprising:
clustering received e-mails into groups of substantially similar e-mails; selecting a set of one or more test e-mails from at least one of the groups, wherein a proportion of the e-mails in the set are spam e-mails; determining the proportion of spam e-mails in the set of test e-mails; classifying the e-mails in the at least one group as spam when the proportion of spam e-mails in the set exceeds a predetermined threshold proportion.
- 2. The method of claim 1 wherein selecting the set of one or more test e-mails comprises selecting a sufficient number of test e-mails for the set such that the proportion of spam e-mails in the set accurately reflects a proportion of spam e-mails in the at least one group.
- 3. The method of claim 1 wherein clustering comprises:
collecting a set of received e-mails; and performing duplicate detection on the set of received e-mails to cluster the set of received e-mails into groups of substantially similar e-mails.
- 4. The method of claim 1 wherein clustering comprises:
performing duplicate detection on a received e-mail when the e-mail is received to determine if the received e-mail is substantially similar to e-mails in an existing group of substantially similar e-mails; adding the received e-mail to the existing group of substantially similar e-mails when the received e-mail is substantially similar to e-mails in the existing group; and using the received e-mail to start a new group of substantially similar e-mails when the received e-mail is not substantially similar to e-mails in the existing group of substantially similar e-mails.
- 5. The method of claim 1 further comprising:
saving a signature of the at least one group when the proportion of spam e-mails in the set exceeds the predetermined threshold proportion; receiving a new e-mail; using the signature to determine whether the new e-mail is substantially similar to the e-mails in the at least one group; classifying the new e-mail as spam when the new e-mail is substantially similar to the e-mails in the at least one group.
- 6. The method of claim 5 wherein the substantially similar e-mails in the at least one group are part of a larger population of substantially similar e-mails, the method further comprising selecting a size for the at least one group such that a proportion of spam e-mails in the at least one group accurately reflects a proportion of spam e-mails in the larger population.
- 7. The method of claim 1 wherein the predetermined threshold is based on a misclassification cost of misclassifying spam e-mail as non-spam e-mail and a misclassification cost of misclassifying non-spam e-mail as spam e-mail.
- 8. A computer-usable medium having a computer program embodied thereon for classifying e-mail as spam, the computer program comprising instructions for causing a computer to perform the following operations:
cluster received e-mails into groups of substantially similar e-mails; select a set of one or more test e-mails from at least one of the groups, wherein a proportion of the e-mails in the set are spam e-mails; determine the proportion of spam e-mails in the set of test e-mails; classify the e-mails in the at least one group as spam when the proportion of spam e-mails in the set exceeds a predetermined threshold proportion.
- 9. The computer-usable medium of claim 8 wherein, to select the set of one or more test e-mails, the computer program further comprises instructions for causing a computer to select a sufficient number of test e-mails for the set such that the proportion of spam e-mails in the set accurately reflects a proportion of spam e-mails in the at least one group.
- 10. The computer-usable medium of claim 8 wherein, to cluster, the computer program further comprises instructions for causing a computer to:
collect a set of received e-mails; and perform duplicate detection on the set of received e-mails to cluster the set of received e-mails into groups of substantially similar e-mails.
- 11. The computer-usable medium of claim 8 wherein, to cluster, the computer program further comprises instructions for causing a computer to:
perform duplicate detection on a received e-mail when the e-mail is received to determine if the received e-mail is substantially similar to e-mails in an existing group of substantially similar e-mails; add the received e-mail to the existing group of substantially similar e-mails when the received e-mail is substantially similar to e-mails in the existing group; and use the received e-mail to start a new group of substantially similar e-mails when the received e-mail is not substantially similar to e-mails in the existing group of substantially similar e-mails.
- 12. The computer-usable medium of claim 8 wherein the computer program further comprises instructions for causing a computer to:
save a signature of the at least one group when the proportion of spam e-mails in the set exceeds the predetermined threshold proportion; receive a new e-mail; use the signature to determine whether the new e-mail is substantially similar to the e-mails in the at least one group; classify the new e-mail as spam when the new e-mail is substantially similar to the e-mails in the at least one group.
- 13. The computer-usable medium of claim 12 wherein the substantially similar e-mails in the at least one group are part of a larger population of substantially similar e-mails, the computer program further comprising instructions for causing a computer to select a size for the at least one group such that a proportion of spam e-mails in the at least one group accurately reflects a proportion of spam e-mails in the larger population.
- 14. The computer-usable medium of claim 8 wherein the predetermined threshold is based on a misclassification cost of misclassifying spam e-mail as non-spam e-mail and a misclassification cost of misclassifying non-spam e-mail as spam e-mail.
- 15. An apparatus for classifying e-mail as spam, the apparatus comprising:
means for clustering received e-mails into groups of substantially similar e-mails; means for selecting a set of one or more test e-mails from at least one of the groups, wherein a proportion of the e-mails in the set are spam e-mails; means for determining the proportion of spam e-mails in the set of test e-mails; means for classifying the e-mails in the at least one group as spam when the proportion of spam e-mails in the set exceeds a predetermined threshold proportion.
- 16. The apparatus of claim 15 wherein the means for selecting the set of one or more test e-mails comprises means for selecting a sufficient number of test e-mails for the set such that the proportion of spam e-mails in the set accurately reflects a proportion of spam e-mails in the at least one group.
- 17. The apparatus of claim 15 wherein the means for clustering comprises:
means for collecting a set of received e-mails; and means for performing duplicate detection on the set of received e-mails to cluster the set of received e-mails into groups of substantially similar e-mails.
- 18. The apparatus of claim 15 wherein the means for clustering comprises:
means for performing duplicate detection on a received e-mail when the e-mail is received to determine if the received e-mail is substantially similar to e-mails in an existing group of substantially similar e-mails; means for adding the received e-mail to the existing group of substantially similar e-mails when the received e-mail is substantially similar to e-mails in the existing group; and means for using the received e-mail to start a new group of substantially similar e-mails when the received e-mail is not substantially similar to e-mails in the existing group of substantially similar e-mails.
- 19. The apparatus of claim 15 further comprising:
means for saving a signature of the at least one group when the proportion of spam e-mails in the set exceeds the predetermined threshold proportion; means for receiving a new e-mail; means for using the signature to determine whether the new e-mail is substantially similar to the e-mails in the at least one group; means for classifying the new e-mail as spam when the new e-mail is substantially similar to the e-mails in the at least one group.
- 20. The apparatus of claim 19 wherein the substantially similar e-mails in the at least one group are part of a larger population of substantially similar e-mails, the apparatus further comprising means for selecting a size for the at least one group such that a proportion of spam e-mails in the at least one group accurately reflects a proportion of spam e-mails in the larger population.
- 21. The apparatus of claim 15 wherein the predetermined threshold is based on a misclassification cost of misclassifying spam e-mail as non-spam e-mail and a misclassification cost of misclassifying non-spam e-mail as spam e-mail.
- 22. A method of classifying e-mails, the method comprising:
clustering received e-mails into groups of substantially similar e-mails; selecting one or more test e-mails from at least one of the groups; determining a class for the one or more test e-mails; classifying at least one non-test e-mail in the at least one group based on the determined class of the one or more test e-mails.
- 23. The method of claim 22 wherein clustering comprises:
performing duplicate detection on a received e-mail when the e-mail is received to determine if the received e-mail is substantially similar to e-mails in an existing group of substantially similar e-mails; adding the received e-mail to the existing group of substantially similar e-mails when the received e-mail is substantially similar to e-mails in the existing group; and using the received e-mail to start a new group of substantially similar e-mails when the received e-mail is not substantially similar to e-mails in the existing group of substantially similar e-mails.
- 24. The method of claim 22 wherein clustering comprises:
collecting a set of received e-mails; and performing duplicate detection on the set of received e-mails to cluster the set of received e-mails into groups of substantially similar e-mails.
- 25. The method of claim 22 further comprising:
receiving a new e-mail; performing duplicate detection on the new e-mail to determine if the new e-mail is substantially similar to the e-mails in the at least one group; classifying the new e-mail based on the class of the one or more test e-mails when the new e-mail is substantially similar to the e-mails in the at least one group.
- 26. The method of claim 22 wherein the substantially similar e-mails in the at least one group are part of a larger population of substantially similar e-mails, the method further comprising selecting a size for the at least one group such that a proportion of the e-mails in the at least one group belonging to a particular class accurately reflects a proportion of the e-mails in the larger population that belong to the particular class.
- 27. The method of claim 22 wherein selecting one or more test e-mails comprises selecting multiple test e-mails.
- 28. The method of claim 27 wherein classifying at least one non-test e-mail comprises classifying the at least one non-test e-mail into a particular class when a proportion of the multiple test e-mails belonging to the particular class exceeds a predetermined threshold proportion.
- 29. The method of claim 28 wherein selecting multiple test e-mails comprises selecting a sufficient number of test e-mails such that a proportion of the multiple test e-mails belonging to the particular class accurately reflects a proportion of the e-mails in the at least one group that belong to the particular class.
- 30. The method of claim 29 wherein the particular class is spam.
- 31. The method of claim 30 wherein the predetermined threshold is based on a misclassification cost of misclassifying spam e-mail as non-spam e-mail and a misclassification cost of misclassifying non-spam e-mail as spam e-mail.
- 32. The method of claim 22 wherein determining a class for the one or more test e-mails comprises determining whether the one or more e-mails are spam e-mails such that spam e-mail in the received e-mails can be filtered.
- 33. A computer-usable medium having a computer program embodied thereon for classifying e-mails, the computer program comprising instructions for causing a computer to perform the following operations:
cluster received e-mails into groups of substantially similar e-mails; select one or more test e-mails from at least one of the groups; determine a class for the one or more test e-mails; classify at least one non-test e-mail in the at least one group based on the determined class of the one or more test e-mails.
- 34. The computer-usable medium of claim 33 wherein, to cluster, the computer program further comprises instructions for causing a computer to:
perform duplicate detection on a received e-mail when the e-mail is received to determine if the received e-mail is substantially similar to e-mails in an existing group of substantially similar e-mails; add the received e-mail to the existing group of substantially similar e-mails when the received e-mail is substantially similar to e-mails in the existing group; and use the received e-mail to start a new group of substantially similar e-mails when the received e-mail is not substantially similar to e-mails in the existing group of substantially similar e-mails.
- 35. The computer-usable medium of claim 33 wherein, to cluster, the computer program further comprises instructions for causing a computer to:
collect a set of received e-mails; and perform duplicate detection on the set of received e-mails to cluster the set of received e-mails into groups of substantially similar e-mails.
- 36. The computer-usable medium of claim 33 wherein the computer program further comprises instructions for causing a computer to:
receive a new e-mail; perform duplicate detection on the new e-mail to determine if the new e-mail is substantially identical to the e-mails in the at least one group; classify the new e-mail based on the class of the one or more test e-mails when the new e-mail is substantially identical to the e-mails in the at least one group.
- 37. The computer-usable medium of claim 33 wherein the substantially similar e-mails in the at least one group are part of a larger population of substantially similar e-mails, the computer program further comprising instructions for causing a computer to select a size for the at least one group such that a proportion of the e-mails in the at least one group belonging to a particular class accurately reflects a proportion of the e-mails in the larger population that belong to the particular class.
- 38. The computer-usable medium of claim 33 wherein, to select one or more test e-mails, the computer program further comprises instructions for causing a computer to select multiple test e-mails.
- 39. The computer-usable medium of claim 38 wherein, to classify at least one non-test e-mail, the computer program further comprises instructions for causing a computer to classify the at least one non-test e-mail into a particular class when a proportion of the multiple test e-mails belonging to the particular class exceeds a predetermined threshold proportion.
- 40. The computer-usable medium of claim 39 wherein, to select multiple test e-mails, the computer program further comprises instructions for causing a computer to select a sufficient number of test e-mails such that a proportion of the multiple test e-mails belonging to the particular class accurately reflects a proportion of the e-mails in the at least one group that belong to the particular class.
- 41. The computer-usable medium of claim 40 wherein the particular class is spam.
- 42. The computer-usable medium of claim 41 wherein the predetermined threshold is based on a misclassification cost of misclassifying spam e-mail as non-spam e-mail and a misclassification cost of misclassifying non-spam e-mail as spam e-mail.
- 43. The computer-usable medium of claim 33 wherein, to determine a class for the one or more test e-mails, the computer program further comprises instructions for causing a computer to determine whether the one or more e-mails are spam e-mails such that spam e-mail in the received e-mails can be filtered.
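For illustration only, the steps recited in the claims above (incremental clustering by duplicate detection, sampling test e-mails from a group, comparing the spam proportion to a threshold, and reusing a saved group signature for new e-mail) can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the word-shingle signature, the Jaccard cutoff of 0.6, the cost-ratio threshold formula, and every function and class name here are hypothetical choices made for the example and do not come from the claims.

```python
import random
from dataclasses import dataclass, field


def signature(text: str) -> frozenset:
    """Hypothetical signature: the set of lowercase 3-word shingles."""
    words = text.lower().split()
    if len(words) < 3:
        return frozenset([" ".join(words)])
    return frozenset(" ".join(words[i:i + 3]) for i in range(len(words) - 2))


def similar(a: frozenset, b: frozenset, min_jaccard: float = 0.6) -> bool:
    """Assumed test for 'substantially similar': Jaccard overlap of signatures."""
    union = a | b
    return bool(union) and len(a & b) / len(union) >= min_jaccard


@dataclass
class Group:
    sig: frozenset                      # signature of the first e-mail in the group
    emails: list = field(default_factory=list)


def add_to_groups(groups: list, email: str) -> None:
    """Add an e-mail to the first matching group, or start a new group."""
    sig = signature(email)
    for group in groups:
        if similar(group.sig, sig):
            group.emails.append(email)
            return
    groups.append(Group(sig, [email]))


def cost_threshold(cost_ham_as_spam: float, cost_spam_as_ham: float) -> float:
    """Assumed cost-based threshold: call a group spam when
    p * cost_spam_as_ham > (1 - p) * cost_ham_as_spam."""
    return cost_ham_as_spam / (cost_ham_as_spam + cost_spam_as_ham)


def classify_group(group, is_spam, threshold, sample_size, rng) -> bool:
    """Sample test e-mails, label them with is_spam (for example, a human
    reviewer), and report whether the spam proportion exceeds the threshold."""
    sample = rng.sample(group.emails, min(sample_size, len(group.emails)))
    proportion = sum(is_spam(e) for e in sample) / len(sample)
    return proportion > threshold


def matches_saved(saved_sigs: list, email: str) -> bool:
    """Check a new e-mail against signatures saved from spam groups."""
    sig = signature(email)
    return any(similar(s, sig) for s in saved_sigs)
```

In use, a caller would feed each received e-mail to `add_to_groups`, call `classify_group` on a group with a labeling function and a seeded `random.Random`, and append the group's `sig` to a saved list whenever the result is true, so that later arrivals can be screened with `matches_saved`.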
CLAIM OF PRIORITY
[0001] This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Serial No. 60/442,124, filed on Jan. 24, 2003, the entire contents of which are hereby incorporated by reference.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60442124 | Jan 2003 | US |