The publishing industry is highly competitive.
where Φ(x) = (2π)^(−1/2) ∫_{−∞}^{x} e^(−v²/2) dv is the cumulative distribution function of the standard normal distribution.
Using this model, the entire sales curve can be obtained given the first few weeks of data. However, for an accurate prediction, at least 25 weeks of data are needed, which usually include the peak sales week. Peak sales are strongly correlated with the total sales of a book. If peak sales could be predicted using external features, then feeding that prediction to the statistical model would help obtain the entire sales curve.
A limitation with the above statistical approach is that within 25 weeks after publication, most books have already reached their sales peak, and the height of this peak is a good indication of whether a book is going to sell well or not. If only the first few weeks of data before the sales peak are used, however, the estimation is not accurate. Therefore, predictions derived from the statistical model are not suitable for the fast-changing nature of the publishing industry. The systems and methods disclosed herein can predict a book's performance prior to its publication.
One example embodiment is a system for machine learning classification. The system includes representations of characteristics of products, a pre-processor, and a machine learning classifier. The pre-processor is configured to determine (i) representations of comparative intrinsic characteristics of the products based on the representations of characteristics of products and (ii) representations of corresponding comparative extrinsic characteristics of the products. The pre-processor is also configured to generate a data structure representing relationships between the comparative intrinsic characteristics and the comparative extrinsic characteristics. The machine learning classifier is trained with the data structure. The classifier is configured to return representations of comparative extrinsic characteristics in response to comparative intrinsic characteristics. The pre-processor can be configured to filter the representations of characteristics of products before determining the representations of comparative intrinsic characteristics of the products and generating the data structure.
Another example embodiment is a method of machine learning classification. The method includes determining representations of comparative intrinsic characteristics of products based on representations of characteristics of the products, determining representations of corresponding comparative extrinsic characteristics of the products, generating a data structure representing relationships between the comparative intrinsic characteristics and the comparative extrinsic characteristics, and training a machine learning classifier with the data structure to return representations of comparative extrinsic characteristics in response to given comparative intrinsic characteristics.
In many embodiments, the products are books. In such embodiments, the characteristics of the products can be any of: fame of an author of the book, previous cumulative sales for the author, genre and/or topic of the book, publisher value for the book, and seasonal fluctuations. The representations of characteristics of the books can be filtered to filter-out books that do not fit in a general market. The comparative intrinsic characteristics of the products can be differences between a type of characteristic between two books, or differences between multiple types of characteristics between two books.
Another example embodiment is a system for machine learning classification. The system includes a machine learning classifier and a disambiguator. The machine learning classifier is trained with data representing relationships between intrinsic characteristics of products and extrinsic characteristics of the products, and is configured to return a plurality of representations of comparative extrinsic characteristics for a given product in response to a plurality of comparative intrinsic characteristics between the given product and a plurality of other products. The disambiguator is configured to, based on the plurality of representations of comparative extrinsic characteristics for the given product, rank a plurality of intervals between the extrinsic characteristics for the plurality of other products and determine an extrinsic characteristic for the given product based on the ranking. The disambiguator can be configured to rank the plurality of intervals based on a tally of the intervals, where an interval is tallied when it is in a range of a given comparative extrinsic characteristic for the given product when compared to a given other product.
Another example embodiment is a method of machine learning classification. The method includes inputting, to a machine learning classifier trained with data representing relationships between intrinsic characteristics of products and extrinsic characteristics of the products, a plurality of representations of comparative intrinsic characteristics between a given product and a plurality of other products to obtain a plurality of representations of comparative extrinsic characteristics for the given product. The method further includes ranking a plurality of intervals between the extrinsic characteristics for the plurality of other products based on the plurality of representations of comparative extrinsic characteristics for the given product, and determining an extrinsic characteristic for the given product based on the ranking.
In many embodiments, the given product may be an unpublished book and the other products may be published books. In such embodiments, the determined extrinsic characteristic for the given product can be a peak sales or cumulative sales value.
The foregoing will be apparent from the following more particular description of example embodiments, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.
A description of example embodiments follows.
Described herein are systems and methods that can use important factors in product (e.g., book) purchase patterns, order them with respect to their importance for various book genres, and predict both the sales at their peak week and the total sales in the first year after publication of individual books before their market launch. By using the weekly and peak sales information of thousands of print books published in the United States, along with information about the genre they are written in, the fame/visibility and publishing history of their authors, the past success statistics of the imprints that publish these books, and the seasonality of the book industry in general, methodologies have been developed, as disclosed herein, to predict the number of copies a book will sell at its peak point after publishing and how many additional copies it will continue to sell during its first year. Moreover, these contributing factors can be ranked in terms of significance and offer ways to increase sales by manipulating those that can be altered.
The findings were tested and validated on an unprecedented complete data set of more than 170,000 hardcovers published in the United States since 2008. By using both complex machine learning techniques, such as neural networks, and more explanatory methods, such as linear regressions, both accurate predictions and human-readable conclusions can be obtained.
The disclosed systems and methods can integrate a variety of data sources, including, for example, Bookscan (for sales information and meta-data), Goodreads (additional information about books and authors), and Wikipedia Pageviews (author name recognition and fame). Machine learning and statistical data mining techniques can be used for sales prediction, which not only provides accurate prediction results, but also provides insights about what determines a book's commercial success. Because data is used not only from each book individually, but also from an extensive overview of the market in general in terms of properties and sales histories of over a hundred thousand books, combined with known sales drivers, such as an author's previous success and name recognition, a very accurate prediction can be made about how a book will sell months before the book is on the shelves.
The disambiguator 525 can be configured to rank the plurality of intervals based on a tally of the intervals 535a-n, where an interval is tallied when it is in a range of a given comparative extrinsic characteristic for the given product when compared to a given other product. For example, if the extrinsic characteristic is whether one book has more peak sales than another book, then the intervals 535a-n can represent a ranking of books based on their peak sales 530a-n. When information for a given book is input into the classifier 515, the classifier 515 can return results 520a-n in the form of whether the given book is likely to have greater or lesser sales as compared to each of the other books. The intervals 535a-n may be tallied based on the results 520a-n from the classifier 515. For each of the results 520a-n, if the given book is predicted to have peak sales greater than a book corresponding to peak sales designated by 530b, then the tallies of intervals 535c through 535n can be increased by one. This tallying process can continue for all results 520a-n returned by the classifier 515. The interval with the highest tally can then be used to calculate (e.g., based on an interpolation) the peak sales 540 of the subject book. If more than one interval has the same number of highest tally values, then the peak sales 540 of the subject book can be calculated based on all highest tallied intervals.
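The interval-tallying logic described above can be sketched as follows. This is an illustrative reconstruction: the function names, the midpoint interpolation, and the handling of the open-ended edge intervals are assumptions, not details taken from the disclosure.

```python
# Illustrative sketch of the interval-tallying disambiguation step.
# `peak_sales` holds the known peak sales of the other (training) books,
# sorted ascending; `results[i]` is +1 if the classifier predicts the
# given book outsells book i, -1 otherwise.

def tally_intervals(peak_sales, results):
    """Return per-interval vote counts. Interval j lies between
    peak_sales[j-1] and peak_sales[j]; j=0 is below the lowest book
    and j=n is above the highest."""
    n = len(peak_sales)
    tallies = [0] * (n + 1)                  # n books define n+1 intervals
    for i, r in enumerate(results):
        if r > 0:                            # predicted to outsell book i:
            for j in range(i + 1, n + 1):    # vote for intervals above book i
                tallies[j] += 1
        else:                                # predicted to sell less:
            for j in range(0, i + 1):        # vote for intervals below book i
                tallies[j] += 1
    return tallies

def predict_peak(peak_sales, tallies):
    """Take the midpoint of the highest-tallied interval(s); ties are
    averaged over all top intervals, per the disclosure. The half/double
    bounds for the open-ended edge intervals are an assumption."""
    best = max(tallies)
    tops = [j for j, t in enumerate(tallies) if t == best]
    estimates = []
    for j in tops:
        lo = peak_sales[j - 1] if j > 0 else peak_sales[0] / 2
        hi = peak_sales[j] if j < len(peak_sales) else peak_sales[-1] * 2
        estimates.append((lo + hi) / 2)
    return sum(estimates) / len(estimates)
```

For example, with other-book peak sales of 100, 200, and 400, a book predicted to outsell the first two but not the third lands in the interval between 200 and 400, yielding an interpolated estimate of 300.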
The following describes example implementations of the disclosed systems and methods.
I. Preliminary Knowledge of Book Sales Patterns
It has been found that (1) most bestselling books reach their sales peak in less than ten weeks after release, (2) sales tend to follow a universal pattern and can be described by a statistical model, and (3) the model can help predict future sales. Using this statistical model, the entire sales curve of a book can be predicted from only three parameters, which can be accurately calculated using the weekly sales data during the first 25 weeks after publication. The limitation of this approach is that within 25 weeks after publication, most books have already reached their sales peak, and the height of this peak is a good indication of whether a book is going to sell well or not. If only the first few weeks of data before the sales peak are used, however, the estimation is not accurate. Therefore, predictions derived from the statistical model are not suitable for the fast-changing nature of the publishing industry. The systems and methods disclosed herein, however, can predict a book's performance prior to its publication.
II. Data
One example data source is from Nielsen/NPD Bookscan, a sales data provider for the book publishing industry. The database includes information for all print books in the United States since 2003, from the meta-data of each book (e.g., the ISBN number, author name, title, category, BISAC number, publisher, price) to weekly sales of each book since its publication. The top selling 10,000 books of each month published between 2008 and 2015 can be obtained. For simplicity, only hardcovers were considered for the studies presented herein, a total of 170,927 books, but the systems and methods disclosed herein can be applied to other types of books or products. All hardcovers were used when selecting and calculating the features, but to establish the model and to test its predictive power, only more recent hardcovers published in 2015 were considered, a total of 16,120 books.
Additionally, Wikipedia Pageviews data can be used to quantify the fame (visibility) of each author. Wikipedia Pageviews are an indication of the number of people who have visited a Wikipedia article during a given time period and are available since 2008. They were originally provided in the form of a large data dump, but since August 2016, Wikipedia provides an API for users to query pageviews more easily. Moreover, book descriptions can be obtained from Amazon and Goodreads data, which can be analyzed using Natural Language Processing techniques, offering information about a book's content that is not available from Bookscan.
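As a concrete illustration of querying pageviews, the helper below builds a request URL for the public Wikimedia Pageviews REST API. The endpoint layout follows the public API; the function name, argument names, and defaults are illustrative assumptions.

```python
from urllib.parse import quote

# Base of the public Wikimedia Pageviews per-article endpoint.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(article, start, end,
                  project="en.wikipedia", granularity="monthly"):
    """Build a Pageviews query URL for one article.

    `start`/`end` are YYYYMMDD strings. Article titles use underscores
    instead of spaces; other characters are percent-encoded.
    """
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{BASE}/{project}/all-access/all-agents/"
            f"{title}/{granularity}/{start}/{end}")

url = pageviews_url("Stephen King", "20160101", "20161231")
```

Issuing an HTTP GET on the resulting URL returns JSON with per-period view counts, which can then be aggregated into the cumulative and recent visibility features described below.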
To compare the results with existing predictions from the industry, Publisher Marketplace data can be used for prediction reference. Publisher Marketplace is a website that provides information about deals in the publishing industry. In general, when publishers decide to purchase a book, they estimate the sales of the book, and based on that estimate offer different size deals; the higher the estimate, the higher the deal. The deals are reported by book agents on the website, and may include information such as author, working title, publisher buying the book, deal date, deal type, and deal size. Sometimes after the book is published, the ISBN number is added as well. Publisher Marketplace categorizes deals into 5 categories: a nice deal ($1-$49,000), a very nice deal ($50,000-$99,000), a good deal ($100,000-$250,000), a significant deal ($251,000-$499,000), and a major deal ($500,000 and up). From the 85,078 deals on the site, 22,105 include the size-of-deal information, and 3,617 of those are hardcovers.
III. Model
A. Features
When choosing a book, a reader is affected by many aspects. Sometimes it is the author, sometimes it is the properties of the book itself, such as genre and topic, and sometimes it is how well a book is advertised. The example features presented herein are obtained from these three aspects. For the author, readers tend to choose famous authors or authors they have read and liked before. For the book itself, readers usually have specific genre preferences. Additionally, well-advertised books with greater market presence are more likely to attract readers.
Some of these factors are easily quantified. For authors, fame/visibility can be represented by the Wikipedia Pageviews, i.e., how many people click on an author's page over time. Also, readers tend to buy books from authors they have read before; therefore, authors with higher previous sales are more likely to attract readers for a new book. The genre information is given by the BISAC code, as explained below. For advertising, it is more difficult to find a direct measure. However, advertising is usually the publishers' responsibility, and some publishers have more marketing resources than others; therefore, publisher information can be used to quantify the extent of book advertisement. Finally, it is known that book sales have seasonal fluctuations, which can be taken into consideration as well. Features used can include cumulative fame of the author, previous cumulative sales of the author, genre information, publisher value for the book, and seasonal fluctuations. The calculation of these features can involve various data sources.
1. Author Features
Author Visibility/Fame: Wikipedia Pageviews can be used as a proxy of the general public's interest in an author, i.e., his or her fame. There are many aspects of visibility: cumulative visibility, representing all visits since the page creation date, is relevant for some authors, while recent visibility is more relevant for others. In order to capture different aspects of visibility, the following author parameters for each book can be used, representing the visibility feature group:
Previous Sales: Bookscan weekly sales data can be used to calculate the previous sales of all books written by an author. Similar to an author's visibility, previous sales can be incorporated in various ways. For example, previous sales in a different genre from the predicted book are relevant for authors who change genres during their career. Previous sales belonging to other genres may have less influence than previous sales in the same genre as the new book. Therefore, previous sales can be split into two parts: previous sales in this genre, and previous sales not in this genre. In summary, the following information for each book can be determined, representing the previous sales feature group:
2. Book Features
Genre information: Differences between genres are important. Fiction and nonfiction books have extremely different behaviors, and within fiction and nonfiction each sub-genre has its own behavior as well. Direct information about genres can be obtained from the BISAC code, which is a standard code used to categorize books based on topical content. There are 52 major sections in the BISAC code list, such as “COMPUTERS”, “FICTION”, and “HISTORY”. Under each major section, there are a number of detailed descriptors that represent sub-topics; for example, “FIC022000” under “FICTION” stands for “FICTION / Mystery & Detective / General”. All of the subcategories may have different behaviors; however, analyzing each of them separately is not ideal. First, under each subcategory, there may be too few books to obtain a meaningful generalized model. Second, some genres show very similar behavior in terms of sales numbers. And finally, too many models would make it difficult to pinpoint important common patterns in the publishing industry. Therefore, clustering can be used to reduce the number of sub-genres needed for the model. First, a coarse-grained BISAC code can be obtained under these rules:
After this process, each sub-genre can be clustered based on both the number of books in them and the median sales of the top selling books in that genre. This can be done separately for fiction and nonfiction. The aim of the clustering is to aggregate genres having comparable size (i.e., the number of books) and comparable potential (median sale of top selling books, i.e., “the ceiling sales of this genre”). An algorithm that can be used for clustering is a K-means algorithm. Table I and
The result of genre clustering is used to group books and establish the model. Within each grouping, the median sales of top 100 selling books can be used for each genre as a feature. For example, General Fiction and Literary are clustered to Fiction Group C. Some clusters are topically surprising; for example, Nonfiction Group B combines Religion, Business, and Economics; however, size and sales potential wise, these three genres are similar. The result of genre clustering can be used to group books and calculate features. Various statistics (including the mean, median, standard deviations, 10th, 25th, 75th and 90th percentile, same hereafter) of the book sales within each genre cluster can be used, forming a genre cluster feature group.
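The clustering step can be sketched with a minimal K-means implementation over the two quantities named above, log-scaled because both are fat-tailed. The genre table below is fabricated for illustration, and a production pipeline would typically use a library implementation instead.

```python
import math
import random

# Fabricated example: genre -> (number of books, median sales of
# top-selling books in that genre).
genres = {
    "Mystery": (9000, 60000), "Thriller": (8000, 55000),
    "Poetry":  (400, 3000),   "Drama":    (350, 2500),
}
# Log-scale both coordinates, since both quantities are fat-tailed.
points = {g: (math.log10(n), math.log10(s)) for g, (n, s) in genres.items()}

def dist2(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's-algorithm K-means over a dict of 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(list(points.values()), k)  # init at data points
    labels = {}
    for _ in range(iters):
        # Assignment step: each genre joins its nearest center.
        labels = {g: min(range(k), key=lambda c: dist2(p, centers[c]))
                  for g, p in points.items()}
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [points[g] for g, l in labels.items() if l == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return labels

clusters = kmeans(points, k=2)
```

On this toy data the two large, high-ceiling genres end up in one cluster and the two small, low-ceiling genres in the other, mirroring the size-and-potential grouping described above.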
Topic Information: Genre information is assigned by publishers and can be different from how readers categorize books. For example, books under BISAC “BUS” (Business) can cover very different subjects, varying from finance to science of success. Therefore, topics from one-paragraph book summaries on Amazon or Goodreads can be extracted to get a better sense of the true content of the book. Non-negative Matrix Factorization (NMF) techniques from Natural Language Processing can be used.
The NMF outputs two matrices: a topic-keyword matrix and a book-topic matrix. The topic-keyword matrix enables creation of a topic-keyword bipartite graph showing the composition of each topic. For each topic, the book sales distribution and corresponding statistics can be obtained. Then, since each book is represented as a linear combination of several topics, with weights taken from the book-topic matrix, its features can be calculated as a weighted average of each statistic over its topics.
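The weighted-average feature computation can be sketched as follows, assuming the book-topic weights and the per-topic sales statistics have already been computed; all names and numbers below are fabricated for illustration.

```python
# Fabricated per-topic sales statistics (e.g., computed from the books
# that load strongly on each topic).
topic_stats = {
    "finance":   {"median": 12000, "mean": 30000},
    "self-help": {"median":  8000, "mean": 15000},
}

# One row of the book-topic matrix for a single book, normalized to sum
# to one: this book is 75% "finance" and 25% "self-help".
book_weights = {"finance": 0.75, "self-help": 0.25}

def topic_features(weights, stats):
    """Weighted average of each per-topic statistic, using the book's
    topic weights -- one feature per statistic name."""
    keys = next(iter(stats.values())).keys()
    return {k: sum(w * stats[t][k] for t, w in weights.items())
            for k in keys}

features = topic_features(book_weights, topic_stats)
# features["median"] = 0.75*12000 + 0.25*8000 = 11000
```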
Publishing Month (Seasonal Fluctuations): Book sales are largely related to publishing month. In a previous analysis of New York Times Bestsellers, it was found that books needed to sell more copies to reach the bestsellers list during the holiday season in December, due to people buying more books for the holidays. Similar seasonal fluctuations in sales were observed for all hardcovers as well. Using the fiction and nonfiction hardcover books published between 2008 and 2015, books were aggregated by publishing month, shown in
3. Publisher Features
In Bookscan data, each book is recorded with a publisher and an imprint. In the publishing industry, a publisher usually has multiple imprints with their own different missions. Some imprints may be dedicated to one specific genre. For example, Portfolio under Penguin Random House only publishes business books. An imprint is independent from the publisher with regards to selecting books to publish and taking on the responsibility of advertising. Some imprints are more attractive to authors because they offer higher deals and have more resources when advertising the book. Additionally, in order to keep their reputations, such imprints are likely to be more selective. Therefore, books published by those imprints would be expected to sell more. To calculate the imprint value, the median sales of the top 160 (20 books on average over an 8-year period) selling books under each imprint can be used. It was found that even with better imprints, some books sell badly. In
B. Filtering
Filtering can be conducted on the data before feeding it into the model, with the purpose of excluding special books that do not fit in the general market. The filtering criteria can be:
C. Model Establishment
After obtaining the individual features, Linear Regression models can be used to predict the peak sales (or one-year sales) for each book. From the data described above, it can be observed that all features are fat-tail distributed. Furthermore, it can be observed that the peak sales (or one-year sales) are also log-normally distributed. Therefore, a logarithm can be taken of the dependent and independent variables, leading to the model:
log(Peak Sale)˜a1 log(Fame)+a2 log(Previous Sales)+a3 log(Imprint Value)+a4 log(Month Value)+a5 log(Genre Value)+const (1)
for each genre grouping (five fiction groupings and five nonfiction groupings).
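A minimal sketch of fitting Eq. (1) after the log transform follows. The data is synthetic and generated to follow the model exactly, so ordinary least squares recovers the coefficients; the disclosed approach additionally applies the Ridge regularization described in the next section.

```python
import numpy as np

# Synthetic fat-tailed features for n books (log-normal draws).
rng = np.random.default_rng(0)
n = 200
fame, prev_sales, imprint = (rng.lognormal(3, 1, n) for _ in range(3))

# Synthetic ground-truth peak sales following Eq. (1) exactly, with
# a1=0.4, a2=0.3, a3=0.2, const=1.0 (fabricated coefficients).
peak = np.exp(0.4 * np.log(fame) + 0.3 * np.log(prev_sales)
              + 0.2 * np.log(imprint) + 1.0)

# Take logs of dependent and independent variables, then fit by OLS.
X = np.column_stack([np.log(fame), np.log(prev_sales),
                     np.log(imprint), np.ones(n)])
coef, *_ = np.linalg.lstsq(X, np.log(peak), rcond=None)
# coef recovers [0.4, 0.3, 0.2, 1.0]
```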
a. Model Solving
An algorithm that can be used is Ridge Regression, which is linear regression with L-2 regularization. Ridge Regression penalizes large estimates to reduce over-fitting and is also known to work well with co-linear relationships between independent variables.
Classic linear regression can be written as finding the parameter vector ω=(ω0, ω1, . . . , ωp) for:

ŷ = ω0 + ω1x1 + ω2x2 + . . . + ωpxp (2)
that minimizes the residual sum of squares between the observed values y in the dataset and the predicted values ŷ. More mathematically, the objective function is:

min_ω ∥Xω − y∥² (3)
For Ridge Regression, the objective function not only minimizes the residual sum of squares between observed and predicted values, but also penalizes the size of the coefficients using the L-2 norm (Euclidean norm), leading to the objective function:

min_ω ∥Xω − y∥² + α∥ω∥² (4)
where α≥0 controls the amount of shrinkage: the larger the value of α, the greater the shrinkage, and the more robust the coefficients become to co-linearity.
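The Ridge objective has the closed-form solution ω = (XᵀX + αI)⁻¹Xᵀy, sketched below. This is an illustration of the shrinkage behavior, not the disclosed implementation; note that, unlike typical library implementations, this sketch penalizes every coefficient equally, including any intercept column.

```python
import numpy as np

def ridge(X, y, alpha):
    """Closed-form ridge solution: solve (X^T X + alpha*I) w = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Tiny fabricated design matrix and targets.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [0.5, 3.0]])
y = np.array([3.0, 2.5, 4.0, 3.5])

# As alpha grows, the coefficient vector shrinks toward zero.
w_small = ridge(X, y, 0.01)
w_large = ridge(X, y, 100.0)
```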
b. Model Testing
Various testing methods can be used for the model.
1. k-Fold Cross Validation
The cross-validation method is a classic testing method in the machine learning field. The whole dataset is randomly partitioned into k equal-size subsamples. Of the k subsamples, one is retained as the test data for the model and the remaining k−1 subsamples are used as training data. The whole process is repeated k times, with each subsample used exactly once as test data. Each time, an R2 score is obtained for the test sample. The k R2 scores from the folds can then be averaged to produce a single estimate. In the testing, k=10 was used.
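The k-fold procedure can be sketched as follows; `fit_and_score` is a placeholder for fitting the model on the training fold and computing the R2 score on the held-out fold.

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, fit_and_score):
    """Each fold serves once as the test set; the k scores returned by
    `fit_and_score(train_idx, test_idx)` are averaged into one estimate."""
    folds = kfold_indices(len(data), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(fit_and_score(train, test))
    return sum(scores) / k
```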
2. Predict on Next Year
Another test may be performed using books published in a previous year to establish the model to make predictions for books published in the next year. For example, the model can be built based on books published in 2015 and predict the sales for books published in 2016. The performance is again indicated by the R2 score.
3. Using External Reference Data
Predictions can be compared with existing predictions in the publishing industry. Publisher Marketplace provides deal information between imprints and books, and the deal size can be regarded as a prediction made by an imprint. The prediction can be compared with their deal size to see whether there are improvements with the model. A comparison of deal size and actual one-year sale of books is shown in
Book sales follow a heavy-tail distribution, leading to a class imbalance problem where there are far more low-selling books than high-selling books. This imbalance can cause methods like Linear Regression to under-predict the high-selling books, which are, however, the most important books for publishers. In order to address this imbalance problem, the Learning to Place approach addresses the following question: Given a sequence of previously published books ranked by their sales, where would a new book be placed in this sequence?
The example Learning to Place approach includes two stages: 1) Establish a pairwise relationship classifier, which predicts whether a new book will sell better or worse than each book in the training set; and 2) Assign a place to the predicted book based on the pairwise relationships, i.e., find the best place for it in the previously given book sequence ranked by sales. The process is graphically explained in the flowchart of
The Learning to Place approach works as follows:
Training Phase: For each book pair i and j, with feature vectors fi and fj, the two feature vectors can be concatenated as Xij=[fi,fj]. If book i's sales number is greater than book j's, then yij=1; if i's sales number is smaller than j's, then yij=−1 (ties can be ignored in the training phase). Formally, denoting with si the sales of book i, and with B the set of books in the training set, the training data is:

{(Xij, yij) | i, j∈B, si≠sj} (5)
By defining the training data, the problem is converted into a classification problem, in which 1 or −1 is predicted for each book pair. This training data is then sent to a classification algorithm (classifier) F to fit the y label and obtain the weights on each feature in matrix X. A Random Forest classifier can be used for this phase.
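The training-pair construction can be sketched as follows. The feature values are fabricated; the resulting (X, y) would be passed to an off-the-shelf classifier such as the Random Forest mentioned above.

```python
# Fabricated training set: each book has a feature vector and a sales
# number. Books 1 and 2 tie on sales, so their pair is skipped.
books = [
    {"features": [1.0, 0.2], "sales": 500},
    {"features": [0.4, 0.9], "sales": 1500},
    {"features": [0.7, 0.7], "sales": 1500},
]

def pairwise_training_data(books):
    """Build X_ij = [f_i, f_j] with label +1 if book i outsells book j,
    -1 if it sells less; self-pairs and ties are skipped."""
    X, y = [], []
    for i, bi in enumerate(books):
        for j, bj in enumerate(books):
            if i == j or bi["sales"] == bj["sales"]:
                continue
            X.append(bi["features"] + bj["features"])
            y.append(1 if bi["sales"] > bj["sales"] else -1)
    return X, y

X, y = pairwise_training_data(books)
# 4 usable ordered pairs remain out of the 6 ordered pairs of 3 books.
```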
Testing Phase:
1) Pairwise relationship prediction: For each new (test) book k, obtain

Xki = [fk, fi], for each i∈B. (6)
Then apply the classifier on the testing data to get the predicted pairwise relationship between the predicted book and all other books in the training data.
ŷki = F(Xki). (7)
2) Assign the place of the predicted instance: After obtaining the pairwise relationships, treat each book in the training data as a “voter.” Books (voters) from the training data can be sorted by sales, dividing the sales axis into intervals. If ŷki=1, i.e., book k should sell more than book i, the sales intervals to the right of si obtain a “vote.” If ŷki=−1, book i “votes” for the intervals to the left of si. After the voting process, a voting distribution is obtained for each test book, and the interval with the most “votes” is taken as the predicted sales interval for book k.
Model Testing: To test the model, k-fold Cross Validation can be used, a known testing method in machine learning. An evaluation score can be obtained for each fold of the test sample. Example testing that was performed used k=5. The evaluation scores used were:
IV. Results
Books written by first-time authors and those written by experienced authors were separated. This separation was based on the fact that, for books written by first-time authors, Previous Sales would be zero and predictive power would be limited.
Results for books written by experienced authors:
Table II and Table III show the fitting R2 value and the cross-validation score for each genre grouping for peak sales and one-year sales, respectively. Both results are similar and the scores for one-year sales are higher for some groups.
From the tables, it can be seen that some groups have high cross-validation scores, such as Fiction Group A and Fiction Group B. However, some groups have low cross-validation scores, such as Fiction Group D. There are multiple reasons for this: (1) some of the groups are very small, which leads to small test samples in cross validation, and the R2 for a small dataset may not be accurate; (2) some new editions of old books exist in the dataset, which could influence the performance of the model. In general, the performance for fiction is better than for nonfiction, indicating that there may be other features worth using for nonfiction.
Learning to Place was applied to books published in 2015, aiming to predict the one-year sales of each book. Results from Linear Regression are also shown for comparison.
1. Predictions
Finally, Table IV shows the R2, AUC score, and high-end RMSE for fiction and nonfiction, comparing a K-nearest-neighbor baseline, Linear Regression, and Learning to Place. It confirms that for both fiction and nonfiction, Learning to Place always offers higher R2 and AUC scores and lower high-end RMSE, indicating that it outperforms the other methods.
2. Feature Importance
Feature Importance for Fiction and Nonfiction: To find out which feature group is important, the normalized accuracy score was plotted using each feature group for fiction and nonfiction, shown in
Feature Importance for Different Genres: Learning to Place was applied on selected genres, and the feature importance difference between genres was examined. The five largest genres under fiction (Mystery, Thriller, Fantasy, Historical, Literary) and nonfiction (Biography, Business, Cooking, History, Religion) were selected, and the feature importance score vector for each genre was obtained.
Since there are features in three main categories: author, book and publisher, the importance for each of these categories can be examined. To achieve this, three models can be trained, each including only one feature category. Sales of each book can be predicted using each of these three models separately, the absolute error Eauthor, Ebook, Epublisher obtained, compared to the true sales of the book, and the three errors normalized so that they sum to one. A ternary plot can be used to inspect where the errors are coming from for different books.
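The per-book error decomposition can be sketched as follows; the prediction values are fabricated, and the uniform fallback when all three errors are zero is an assumption.

```python
def normalized_errors(true_sales, pred_author, pred_book, pred_publisher):
    """Absolute errors of the three single-category models, normalized
    to sum to one -- the coordinates for one point on the ternary plot."""
    errors = [abs(true_sales - p)
              for p in (pred_author, pred_book, pred_publisher)]
    total = sum(errors)
    if total == 0:
        return [1 / 3] * 3        # all models exact: split evenly
    return [e / total for e in errors]

# Fabricated example: true sales 1000; the author-only model errs by
# 100, the book-only model by 200, the publisher-only model by 300.
shares = normalized_errors(1000, 900, 1200, 700)
# shares -> [1/6, 1/3, 1/2]
```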
Example Digital Processing Environment
In one embodiment, the processor routines 92 and data 94 are a computer program product (generally referenced 92), including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROM's, CD-ROM's, diskettes, and tapes) that provides at least a portion of the software instructions for the system. Computer program product 92 can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable, communication and/or wireless connection. In other embodiments, the programs are a computer program propagated signal product 75 (
In alternative embodiments, the propagated signal can be an analog carrier wave or digital signal carried on the propagated medium. For example, the propagated signal may be a digitized signal propagated over a global network (e.g., the Internet), a telecommunications network, or other network. In one embodiment, the propagated signal is a signal that is transmitted over the propagation medium over a period of time, such as the instructions for a software application sent in packets over a network over a period of milliseconds, seconds, minutes, or longer. In another embodiment, the computer readable medium of computer program product 92 is a propagation medium that the computer system 50 may receive and read, such as by receiving the propagation medium and identifying a propagated signal embodied in the propagation medium, as described above for computer program propagated signal product. Generally speaking, the term “carrier medium” or transient carrier encompasses the foregoing transient signals, propagated signals, propagated medium, storage medium and the like. In other embodiments, the program product 92 may be implemented as Software as a Service (SaaS), or other installation or communication supporting end-users.
While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the embodiments encompassed by the appended claims.
This application claims the benefit of U.S. Provisional Application No. 62/522,325, filed on Jun. 20, 2017, and U.S. Provisional Application No. 62/685,612, filed on Jun. 15, 2018. The entire teachings of the above applications are incorporated herein by reference.
This invention was made with government support under Grant No. FA9550-15-1-0077 from the Air Force Office of Scientific Research. The government has certain rights in the invention.