The present application generally relates to supervised machine learning techniques for learning models for use in making job listing recommendations to users of an online job hosting service. More specifically, the application describes techniques for training models using training data having weak labels derived from user actions.
Many online job hosting services have job recommendation services that attempt to identify and recommend the job listings that best match the experiences and interests of users. When requested, or perhaps on some periodic basis, a job recommendation service will present some top number of “best” (e.g., highest ranked or closest matching) job listings to the user. Some job recommendation services use supervised machine learning techniques to learn one or more models for classifying and/or ranking job listings for each user. Some of these learned models operate on a per-user basis (e.g., personalized models), such that the ranking of job listings depends upon the individual actions taken by each user with respect to the specific job listings presented to that user. However, training such models can be difficult when there is insufficient training data. In such instances, alternative and non-conventional approaches are needed.
Embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which:
Described herein are methods and systems that use supervised machine learning techniques to train models for use with a recommendation engine, where each model is used to classify job listings as relevant or irrelevant for recommending to an individual user of an online job hosting service, and where the training data used to train the model(s) include multiple categories of labeled data based on user actions. In the following description, for purposes of explanation, numerous specific details and features are set forth in order to provide a thorough understanding of the various aspects of different embodiments of the present invention. It will be evident, however, to one skilled in the art, that the present invention may be practiced with varying combinations of the many details and features.
Many online job hosting services have job recommendation and search services that attempt to identify job listings that best match the experiences and interests of users. When requested, or perhaps on some periodic basis, a job recommendation service will present some top number of “best” (e.g., highest ranked) job listings to the user. Some job recommendation services use supervised machine learning techniques to learn a model for classifying (e.g., as relevant or irrelevant) the job listings for each user. Some of these learned models operate on a per-user basis (e.g., personalized models), such that the ranking of job listings depends upon the individual actions taken by each user with respect to the specific jobs presented to the user. However, training such models can be difficult if there is insufficient training data.
In order to learn how to rank jobs for users, training data in the form of examples of relevant and irrelevant job listing recommendations is required to train the models. Typically, being able to learn a good model depends on whether there is a sufficient volume of training data (e.g., job listing recommendations labeled as relevant or irrelevant). However, obtaining a sufficient volume of training data for job listing recommendations is challenging. Generally, two types of labeled training examples exist. The first type of training example can be thought of as an explicit label/signal that arises from explicit user actions, such as when a user applies for a recommended job (Job Apply), takes action to save a recommended job for later viewing (Job Save), or takes action to dismiss a recommended job (Job Dismiss). The second type of labeled training example can be thought of as an implicit label/signal, which arises from user actions for which a more subtle inference can be drawn. For example, implicit labels/signals arise when a user is presented with a job listing recommendation and chooses to view the job listing (referred to herein as a Job View), or alternatively, chooses not to view the job listing (referred to herein as a Job Skip). While explicit labels/signals are of higher quality, they tend to be far fewer in quantity. Thus, in accordance with embodiments of the present invention, using implicit labels/signals addresses the challenge posed by insufficient training examples, especially when training personalized per-member random effect components of a recommendation engine, because a single member is unlikely to have an adequate number of explicit labels/signals for training a robust per-member model for that member.
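By way of illustration only, the following Python sketch shows one possible way of deriving labeled training examples from the explicit and implicit user actions described above; the event schema, names, and mapping are assumptions for illustration, not a definitive implementation.

from dataclasses import dataclass

# Explicit actions yield strong labels; implicit actions yield weak labels.
ACTION_TO_LABEL = {
    "JOB_APPLY":   ("positive", "explicit"),
    "JOB_SAVE":    ("positive", "explicit"),
    "JOB_DISMISS": ("negative", "explicit"),
    "JOB_VIEW":    ("positive", "implicit"),
    "JOB_SKIP":    ("negative", "implicit"),
}

@dataclass
class TrainingExample:
    member_id: str
    job_id: str
    label: str        # "positive" or "negative"
    signal_type: str  # "explicit" (strong) or "implicit" (weak)

def label_from_action(member_id: str, job_id: str, action: str) -> TrainingExample:
    """Derive a (possibly weakly) labeled training example from a user action."""
    label, signal_type = ACTION_TO_LABEL[action]
    return TrainingExample(member_id, job_id, label, signal_type)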
As illustrated in
The problem with this approach is that the learned model does not take into consideration the relative weight, or importance, of the different signals (e.g., user actions) that are mapped to the two target classes. For instance, it is inherently easy to understand that, when a user simply views a job listing, this user action is not as strong a signal of interest in the job listing as when the user saves the job listing, or actually applies for the position associated with the job listing. Similarly, when a user interacts with a user interface element (e.g., a button) to dismiss a job listing, this explicit user action expresses more disinterest in the job listing than when a user is presented with a job listing in a list of recommended job listings but simply skips over (e.g., does not select for viewing) the job listing. Accordingly, the learned ranking model is not as effective as it could be: users are ultimately likely to be presented with job listing recommendations that are less relevant, and/or certain relevant job listing recommendations that could and should be presented to a user will not be.
Consistent with embodiments of the present invention and as illustrated in
The front-end layer may comprise a user interface module (e.g., a web server) 302, which receives requests from various client computing devices and communicates appropriate responses to the requesting client devices. For example, the user interface module(s) 302 may receive requests in the form of Hypertext Transfer Protocol (HTTP) requests or other web-based API requests.
The application logic layer may include one or more application server modules or services (e.g., job hosting service 304), which, in conjunction with the user interface module(s) 302, generate various user interfaces (e.g., web pages) with data retrieved from various data sources in the data layer. Consistent with some embodiments, individual application server modules (not shown) are used to implement the functionality associated with various applications and/or services provided by the online system 300, beyond the functions of the job hosting service 304. For example, with some embodiments, the job hosting service 304 may be integrated with a social networking system or service offering a variety of other functions and services, such as a news feed, photo sharing, and so forth.
As illustrated in
As shown in
As shown in
In
At method operation 502, labeled training data is obtained for the particular user on whose behalf the job listing recommendations are to be generated. For instance, as described in connection with the example user interfaces of
In any case, at method operation 604, the request is processed by first obtaining for the user some candidate set of job listings, and then for each job listing in the candidate set, processing the job listing with a machine-learned classification model that has been trained with training examples—both positive and negative training examples—that are grouped into multiple groups based on some user actions. Accordingly, each user action represents a weak label for the respective training example to which it applies.
Next, at method operation 606, for each job listing recommendation that, according to the model, is classified as relevant to the user for whom the job listing recommendations are to be generated and ultimately presented, a ranking score is generated. For example, a global machine-learned model might be used to derive ranking scores for the respective job listings. Finally, at method operation 608, a user interface is presented to the user, where the user interface presents some subset of the ranked job listing recommendations, ordered in accordance with their respective ranking scores.
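By way of illustration only, the following Python sketch shows the general classify-then-rank flow described above; the model interfaces (e.g., classifier.predict and ranker.score) are hypothetical names used for illustration.

def recommend_jobs(member_id, candidate_jobs, classifier, ranker, top_k=10):
    # Classify each candidate job listing as relevant or irrelevant for the member.
    relevant = [job for job in candidate_jobs
                if classifier.predict(member_id, job) == "relevant"]
    # Score the relevant listings with, e.g., a global machine-learned ranking model.
    scored = [(ranker.score(member_id, job), job) for job in relevant]
    # Present the top-k listings ordered by descending ranking score.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [job for _, job in scored[:top_k]]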
The following is a description of one example of an embodiment of the present invention, expressed mathematically. Specifically, in this example, the mathematical formulas are provided for training examples that are grouped by the following three user actions: Job Applies, Job Dismisses and Job Skips.
Training a classification model using a supervised machine learning technique involves learning a function g(x) so as to minimize a loss \ell(y, g(x)). More formally, the optimization involves minimizing the risk (e.g., the expected value of the loss over the data distribution), defined as:
R(g) = E_{(x,y) \sim p(x,y)}[\ell(y \cdot g(x))]
which can be re-written as,
R(g) = \pi \, E_{x \sim P_{positive}(x)}[\ell(g(x))] + (1 - \pi) \, E_{x \sim P_{negative}(x)}[\ell(-g(x))]
with,
\pi being the class prior (i.e., the fraction of positives in the entire data set),
P_{positive}(x) being the data distribution of the positive class, and
P_{negative}(x) being the data distribution of the negative class.
By way of example, logistic regression corresponds to learning the sigmoid function on a linear combination of the input x so as to minimize the logistic loss \ell(y \cdot g(x)) = \ln(1 + \exp(-y \cdot g(x))). Using the above formulation for R(g) assumes that the training data has sufficient and representative positive and negative examples drawn from P_{positive}(x) and P_{negative}(x), respectively. When sufficient training examples are not available, an alternative approach is needed. Consistent with embodiments of the invention, the alternative approach uses training examples associated with weak labels—that is, job listings associated with user actions, where the user action is used to infer the negative or positive label.
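By way of illustration only, the following Python sketch computes the logistic loss and the class-prior-weighted empirical risk R(g) for a linear model g(x) = w · x; the function and variable names are assumptions for illustration.

import numpy as np

def logistic_loss(margin):
    # \ell(y * g(x)) = ln(1 + exp(-y * g(x))); 'margin' is y * g(x).
    return np.log1p(np.exp(-margin))

def empirical_risk(w, X_pos, X_neg, pi):
    # pi * E_pos[loss(g(x))] + (1 - pi) * E_neg[loss(-g(x))]
    pos_term = logistic_loss(X_pos @ w).mean()     # positive examples, y = +1
    neg_term = logistic_loss(-(X_neg @ w)).mean()  # negative examples, y = -1
    return pi * pos_term + (1.0 - pi) * neg_term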
Assuming there are two groups of training examples with weak labels, B1 and B2, then X_{B1} are samples drawn from B1, and X_{B2} are samples drawn from B2. The assumed generative process for the data in B1 and B2 is given as:
P_{B1}(x) = \theta_{B1} P_{positive}(x) + (1 - \theta_{B1}) P_{negative}(x)
P_{B2}(x) = \theta_{B2} P_{positive}(x) + (1 - \theta_{B2}) P_{negative}(x)
The above formulation expresses that the weak-label samples can be considered as drawn from the positive and negative populations with mixing coefficients \theta_{B1} and \theta_{B2}, respectively.
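By way of illustration only, the following Python sketch simulates this assumed generative process; the Gaussian class distributions are purely illustrative stand-ins for P_{positive}(x) and P_{negative}(x).

import numpy as np

rng = np.random.default_rng(0)

def sample_weak_group(n, theta, dim=5):
    # With probability theta, draw from P_positive; otherwise from P_negative.
    from_pos = rng.random(n) < theta
    pos = rng.normal(loc=+1.0, size=(n, dim))  # stand-in for P_positive(x)
    neg = rng.normal(loc=-1.0, size=(n, dim))  # stand-in for P_negative(x)
    return np.where(from_pos[:, None], pos, neg)

X_B1 = sample_weak_group(1000, theta=0.9)  # a mostly positive weak-label group
X_B2 = sample_weak_group(1000, theta=0.1)  # a mostly negative weak-label group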
This leads to a corrected training objective. In the case of weak labels, samples are drawn from P_{B1}(x) and P_{B2}(x) instead of P_{positive}(x) and P_{negative}(x). The objective function R(g) is then re-writable in terms of samples drawn from P_{B1}(x) and P_{B2}(x) as:
R(g) = a \, E_{x \sim P_{B1}(x)}[\ell(g(x))] + b \, E_{x \sim P_{B1}(x)}[\ell(-g(x))] + c \, E_{x \sim P_{B2}(x)}[\ell(g(x))] + d \, E_{x \sim P_{B2}(x)}[\ell(-g(x))]
This is possible for hyper-parameters (a, b, c, d) that satisfy the following,
a \theta_{B1} + c \theta_{B2} = \pi
a (1 - \theta_{B1}) + c (1 - \theta_{B2}) = 0
b \theta_{B1} + d \theta_{B2} = 0
b (1 - \theta_{B1}) + d (1 - \theta_{B2}) = 1 - \pi
If \theta_{B1} and \theta_{B2} are known, then solving for the hyper-parameters (a, b, c, d) involves a set of four linear equations in four variables, which can be solved in closed form as:
a = \frac{\pi (1 - \theta_{B2})}{\theta_{B1} - \theta_{B2}}
b = \frac{-(1 - \pi) \theta_{B2}}{\theta_{B1} - \theta_{B2}}
c = \frac{-\pi (1 - \theta_{B1})}{\theta_{B1} - \theta_{B2}}
d = \frac{(1 - \pi) \theta_{B1}}{\theta_{B1} - \theta_{B2}}
Using these values for the hyper-parameters (a, b, c, d), a classification model g(x) that optimizes R(g) can be learned using the weakly labeled samples X_{B1} and X_{B2}.
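By way of illustration only, the following Python sketch computes the closed-form hyper-parameters (a, b, c, d) above and checks them against the four constraints; the numeric values of \theta_{B1}, \theta_{B2}, and \pi are arbitrary examples.

import numpy as np

def correction_weights(theta_b1, theta_b2, pi):
    # Closed-form solution of the four linear constraints; requires theta_b1 != theta_b2.
    denom = theta_b1 - theta_b2
    a = pi * (1.0 - theta_b2) / denom
    b = -(1.0 - pi) * theta_b2 / denom
    c = -pi * (1.0 - theta_b1) / denom
    d = (1.0 - pi) * theta_b1 / denom
    return a, b, c, d

# Sanity check against the constraints for example values.
a, b, c, d = correction_weights(theta_b1=0.9, theta_b2=0.1, pi=0.3)
assert np.isclose(a * 0.9 + c * 0.1, 0.3)   # a*theta_B1 + c*theta_B2 = pi
assert np.isclose(a * 0.1 + c * 0.9, 0.0)   # a(1-theta_B1) + c(1-theta_B2) = 0
assert np.isclose(b * 0.9 + d * 0.1, 0.0)   # b*theta_B1 + d*theta_B2 = 0
assert np.isclose(b * 0.1 + d * 0.9, 0.7)   # b(1-theta_B1) + d(1-theta_B2) = 1 - pi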
Consistent with embodiments of the invention, the concept expressed above can be extended to situations in which more than two labels are available. Take as an example a job listing recommendation engine that considers user actions relating to Job Applies, Job Dismisses, and Job Skips—two strong labels and one weak label—in labeling training data. The objective function from above can now be rewritten as:

R(g) = a \, E_{x \sim P_{B1}(x)}[\ell(g(x))] + b \, E_{x \sim P_{B1}(x)}[\ell(-g(x))] + c \, E_{x \sim P_{B2}(x)}[\ell(g(x))] + d \, E_{x \sim P_{B2}(x)}[\ell(-g(x))] + e \, E_{x \sim P_{B3}(x)}[\ell(g(x))] + f \, E_{x \sim P_{B3}(x)}[\ell(-g(x))]
This is possible for hyper-parameters (a, b, c, d, e, f) that satisfy:
a \theta_{B1} + c \theta_{B2} + e \theta_{B3} = \pi
a (1 - \theta_{B1}) + c (1 - \theta_{B2}) + e (1 - \theta_{B3}) = 0
b \theta_{B1} + d \theta_{B2} + f \theta_{B3} = 0
b (1 - \theta_{B1}) + d (1 - \theta_{B2}) + f (1 - \theta_{B3}) = 1 - \pi
Here, B1 is Job Applies, B2 is Job Dismisses, and B3 is Job Skips. Since the user action Job Apply is considered a strong positive signal, \theta_{B1} = 1. Similarly, since Job Dismiss is a strong negative signal, \theta_{B2} = 0. This gives:
a + e \theta_{B3} = \pi
c + e (1 - \theta_{B3}) = 0
b + f \theta_{B3} = 0
d + f (1 - \theta_{B3}) = 1 - \pi
This system has an infinite number of solutions. One family of solutions, parameterized by a free f, is as follows:
a = \pi
b = -f \theta_{B3}
c = 0
d = 1 - \pi - f (1 - \theta_{B3})
e = 0
with f remaining a free hyper-parameter, selected as described below.
Note that, since \theta_{B3} is expected to be very small, it can be approximated as 0. This makes the estimate for b equal to 0, and the estimate for d equal to (1 - \pi - f). The new objective function can then be re-written as:
R(g) = \pi \, E_{x \sim P_{B1}(x)}[\ell(g(x))] + (1 - \pi - f) \, E_{x \sim P_{B2}(x)}[\ell(-g(x))] + f \, E_{x \sim P_{B3}(x)}[\ell(-g(x))]
This is equivalent to using Job Applies with a weight of π, Job Dismisses with a weight of (1−π−f), and Job Skips with a weight of f.
Finally, a grid search over different values of f can be performed to identify the value of f that produces the best performance on some validation set.
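By way of illustration only, the following Python sketch trains a weighted logistic regression under this weighting scheme and grid-searches f on a validation set; the data arrays, the scikit-learn estimator, and the AUC selection criterion are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_with_skip_weight(X_apply, X_dismiss, X_skip, pi, f):
    # Job Applies are positives; Job Dismisses and Job Skips are negatives,
    # weighted pi, (1 - pi - f), and f respectively, per the objective above.
    X = np.vstack([X_apply, X_dismiss, X_skip])
    y = np.concatenate([np.ones(len(X_apply)),
                        np.zeros(len(X_dismiss)),
                        np.zeros(len(X_skip))])
    w = np.concatenate([np.full(len(X_apply), pi),
                        np.full(len(X_dismiss), 1.0 - pi - f),
                        np.full(len(X_skip), f)])
    return LogisticRegression().fit(X, y, sample_weight=w)

def grid_search_f(X_apply, X_dismiss, X_skip, X_val, y_val, pi,
                  grid=(0.01, 0.05, 0.1, 0.2)):
    # Pick the f that gives the best validation performance (here, AUC).
    best_f, best_auc = None, -np.inf
    for f in grid:
        model = train_with_skip_weight(X_apply, X_dismiss, X_skip, pi, f)
        auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
        if auc > best_auc:
            best_f, best_auc = f, auc
    return best_f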
The machine 800 may include processors 810, memory 830, and I/O components 850, which may be configured to communicate with each other such as via a bus 802. In an example embodiment, the processors 810 (e.g., a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an ASIC, a Radio-Frequency Integrated Circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, a processor 812 and a processor 814 that may execute the instructions 816. The term “processor” is intended to include multi-core processors that may comprise two or more independent processors (sometimes referred to as “cores”) that may execute instructions contemporaneously. Although FIG. 8 shows multiple processors 810, the machine 800 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.
The memory 830 may include a main memory 832, a static memory 834, and a storage unit 836, all accessible to the processors 810 such as via the bus 802. The main memory 832, the static memory 834, and the storage unit 836 store the instructions 816 embodying any one or more of the methodologies or functions described herein. The instructions 816 may also reside, completely or partially, within the main memory 832, within the static memory 834, within the storage unit 836, within at least one of the processors 810 (e.g., within the processor's cache memory), or any suitable combination thereof, during execution thereof by the machine 800.
The I/O components 850 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 850 that are included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones will likely include a touch input device or other such input mechanisms, while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 850 may include many other components that are not shown in FIG. 8. The I/O components 850 are grouped according to functionality merely for simplifying the following discussion and the grouping is in no way limiting. In various example embodiments, the I/O components 850 may include output components 852 and input components 854. The output components 852 may include visual components (e.g., a display such as a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibratory motor, resistance mechanisms), other signal generators, and so forth. The input components 854 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or another pointing instrument), tactile input components (e.g., a physical button, a touch screen that provides location and/or force of touches or touch gestures, or other tactile input components), audio input components (e.g., a microphone), and the like.
In further example embodiments, the I/O components 850 may include biometric components 856, motion components 858, environmental components 860, or position components 862, among a wide array of other components. For example, the biometric components 856 may include components to detect expressions (e.g., hand expressions, facial expressions, vocal expressions, body gestures, or eye tracking), measure biosignals (e.g., blood pressure, heart rate, body temperature, perspiration, or brain waves), identify a person (e.g., voice identification, retinal identification, facial identification, fingerprint identification, or electroencephalogram-based identification), and the like. The motion components 858 may include acceleration sensor components (e.g., accelerometer), gravitation sensor components, rotation sensor components (e.g., gyroscope), and so forth. The environmental components 860 may include, for example, illumination sensor components (e.g., photometer), temperature sensor components (e.g., one or more thermometers that detect ambient temperature), humidity sensor components, pressure sensor components (e.g., barometer), acoustic sensor components (e.g., one or more microphones that detect background noise), proximity sensor components (e.g., infrared sensors that detect nearby objects), gas sensors (e.g., gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere), or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment. The position components 862 may include location sensor components (e.g., a GPS receiver component), altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude may be derived), orientation sensor components (e.g., magnetometers), and the like.
Communication may be implemented using a wide variety of technologies. The I/O components 850 may include communication components 864 operable to couple the machine 800 to a network 880 or devices 870 via a coupling 882 and a coupling 872, respectively. For example, the communication components 864 may include a network interface component or another suitable device to interface with the network 880. In further examples, the communication components 864 may include wired communication components, wireless communication components, cellular communication components, Near Field Communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components to provide communication via other modalities. The devices 870 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a USB).
Moreover, the communication components 864 may detect identifiers or include components operable to detect identifiers. For example, the communication components 864 may include Radio Frequency Identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor to detect one-dimensional bar codes such as Universal Product Code (UPC) bar code, multi-dimensional bar codes such as Quick Response (QR) code, Aztec code, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D bar code, and other optical codes), or acoustic detection components (e.g., microphones to identify tagged audio signals). In addition, a variety of information may be derived via the communication components 864, such as location via Internet Protocol (IP) geolocation, location via Wi-Fi® signal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.
The various memories (i.e., 830, 832, 834, and/or memory of the processor(s) 810) and/or storage unit 836 may store one or more sets of instructions and data structures (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. These instructions (e.g., the instructions 816), when executed by processor(s) 810, cause various operations to implement the disclosed embodiments.
As used herein, the terms “machine-storage medium,” “device-storage medium,” “computer-storage medium” mean the same thing and may be used interchangeably in this disclosure. The terms refer to a single or multiple storage devices and/or media (e.g., a centralized or distributed database, and/or associated caches and servers) that store executable instructions and/or data. The terms shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media, including memory internal or external to processors. Specific examples of machine-storage media, computer-storage media and/or device-storage media include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), FPGA, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms “machine-storage media,” “computer-storage media,” and “device-storage media” specifically exclude carrier waves, modulated data signals, and other such media, at least some of which are covered under the term “signal medium” discussed below.
In various example embodiments, one or more portions of the network 880 may be an ad hoc network, an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a WWAN, a MAN, the Internet, a portion of the Internet, a portion of the PSTN, a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks. For example, the network 880 or a portion of the network 880 may include a wireless or cellular network, and the coupling 882 may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or another type of cellular or wireless coupling. In this example, the coupling 882 may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1×RTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) standard, others defined by various standard-setting organizations, other long range protocols, or other data transfer technology.
The instructions 816 may be transmitted or received over the network 880 using a transmission medium via a network interface device (e.g., a network interface component included in the communication components 864) and utilizing any one of a number of well-known transfer protocols (e.g., HTTP). Similarly, the instructions 816 may be transmitted or received using a transmission medium via the coupling 872 (e.g., a peer-to-peer coupling) to other devices. The terms “transmission medium” and “signal medium” mean the same thing and may be used interchangeably in this disclosure. The terms “transmission medium” and “signal medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying the instructions 816 for execution by the machine 800, and includes digital or analog communications signals or other intangible media to facilitate communication of such software. Hence, the terms “transmission medium” and “signal medium” shall be taken to include any form of modulated data signal, carrier wave, and so forth. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
The terms “machine-readable medium,” “computer-readable medium” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure. The terms are defined to include both machine-storage media and transmission media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals.