The present disclosure is generally directed to industrial systems, and more specifically, to systems and methods for management and recognition of human activities.
The goal of human activity recognition (HAR) is to classify the activity a human is doing at a particular moment or over a window of time. These activities are typically drawn from the action space, 𝒜, of all possible actions that the person could be performing in a certain context. The range of applications that HAR covers is vast, as illustrated by the following examples.
In healthcare, an HAR model could aim at identifying if a patient has currently fallen over or is sleeping. For health care providers, an HAR model might try to see if the provider is performing certain actions in a correct order (e.g., washing their hands before putting on their gloves). In sports, an HAR model might try to discern if the human is walking, running, or jumping. In an industrial setting, an HAR model might be concerned with observing how quickly workers are performing certain actions, e.g., picking up a box, hammering a nail into an assembly part, etc. In human-robot collaboration, an HAR model could be used to aid such a system in helping the robot identify whether the accompanying human has performed a task yet.
Aspects of the present disclosure can involve a method, which can include, for receipt of sensor data of a plurality of workers operating across a plurality of physical areas from a plurality of sensors, extracting pose data of the plurality of workers from the sensor data, the pose data indicative of one or more poses of the plurality of workers; determining pose distributions from the extracted pose data; clustering the pose distributions based on similarity to form a plurality of clusters; and training a model for each of the pose distributions of each of the plurality of physical areas to generate a plurality of models, wherein at least a portion of weights used in the plurality of models are shared among ones of the plurality of models belonging to a same cluster of the plurality of clusters and from different ones of the plurality of physical areas.
Aspects of the present disclosure can involve a computer program, having instructions for executing a process, the instructions which can include, for receipt of sensor data of a plurality of workers operating across a plurality of physical areas from a plurality of sensors, extracting pose data of the plurality of workers from the sensor data, the pose data indicative of one or more poses of the plurality of workers; determining pose distributions of each site from the extracted pose data; clustering the pose distributions based on similarity to form a plurality of clusters; and training a model for each of the pose distributions of each of the plurality of physical areas to generate a plurality of models, wherein at least a portion of weights used in the plurality of models are shared among ones of the plurality of models belonging to a same cluster of the plurality of clusters and from different ones of the plurality of physical areas.
Aspects of the present disclosure can involve a system, which can include, for receipt of sensor data of a plurality of workers operating across a plurality of physical areas from a plurality of sensors, means for extracting pose data of the plurality of workers from the sensor data, the pose data indicative of one or more poses of the plurality of workers; means for determining pose distributions of each site from the extracted pose data; means for clustering the pose distributions based on similarity to form a plurality of clusters; and means for training a model for each of the pose distributions of each of the plurality of physical areas to generate a plurality of models, wherein at least a portion of weights used in the plurality of models are shared among ones of the plurality of models belonging to a same cluster of the plurality of clusters and from different ones of the plurality of physical areas.
Aspects of the present disclosure can involve an apparatus, which can include a processor, configured to, for receipt of sensor data of a plurality of workers operating across a plurality of physical areas from a plurality of sensors, extract pose data of the plurality of workers from the sensor data, the pose data indicative of one or more poses of the plurality of workers; determine pose distributions of each site from the extracted pose data; cluster the pose distributions based on similarity to form a plurality of clusters; and train a model for each of the pose distributions of each of the plurality of physical areas to generate a plurality of models, wherein at least a portion of weights used in the plurality of models are shared among ones of the plurality of models belonging to a same cluster of the plurality of clusters and from different ones of the plurality of physical areas.
Example implementations can thereby prepare input for HAR models that is robust to factors such as camera position or lighting conditions, and facilitate a system of management for HAR models that can adapt to changes in conditions in the production areas. The example implementations can further maximize training of HAR models based on similarity in worker movement by reducing the amount of labeled data required for such models. The example implementations can further appropriately choose features for HAR models based on expected activities in the corresponding production areas. Further, the example implementations can cluster production areas based on similarity of worker activity, while using only a low-dimensional representation of workers that can be readily extracted from a variety of sensors.
The following detailed description provides details of the figures and example implementations of the present application. Reference numerals and descriptions of redundant elements between figures are omitted for clarity. Terms used throughout the description are provided as examples and are not intended to be limiting. For example, the use of the term “automatic” may involve fully automatic or semi-automatic implementations involving user or administrator control over certain aspects of the implementation, depending on the desired implementation of one of ordinary skill in the art practicing implementations of the present application. Selection can be conducted by a user through a user interface or other input means, or can be implemented through a desired algorithm. Example implementations as described herein can be utilized either singularly or in combination and the functionality of the example implementations can be implemented through any means according to the desired implementations.
The classification of human activity is usually done by using available sensor data that can capture or monitor any human(s) in the production area observed by the sensor(s). Machine learning (ML) has been shown to be successful in developing models that can take sensor data observing humans and estimate what actions the worker is doing based on the sensor data (coming from Red Green Blue (RGB) or thermal cameras, Light Detection and Ranging (LiDAR), etc.). Typically, these ML models have as output a conditional probability taking the general form
p_θ(a|X),
where a is an action from the action space of interest, and X is the sensor data capturing the person of interest. Here, θ represents the parameters of the ML model used. In most applications, these parameters are learned from a collection of labeled data
𝒟 = {(X_i, a_i)}_{i=1}^{n}.
Producing 𝒟 can be expensive, as creating labels for sensor data is a time-consuming manual process. With respect to the industrial setting,
In many HAR models, the input X to the model is not the raw sensor data, but significant features extracted from the raw sensor data. While what these features are varies depending on both the sensors used and the application in mind, a common feature of interest is human pose data. This pose data X can be represented as a tensor of shape (T,N,C). Here, T is the number of time frames from which pose data is taken, N is the number of joints along the human body that are being observed, and C is the number of channels observed at each joint and each timestamp. For instance, C=2 in the case where the (x,y)-pixel coordinates of each joint are tracked in a frame. If position, velocity, and acceleration (x,y,z)-vectors are tracked for each joint in a frame, then C=9.
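As an illustrative sketch of the (T,N,C) layout described above, the following Python snippet builds a pose tensor with C=9 from (x,y,z) joint positions. The function names, the use of nested lists, and the backward-difference estimates of velocity and acceleration are assumptions made for illustration only and are not part of the example implementations.

```python
def make_pose_tensor(positions, dt=1.0):
    """Build a (T, N, 9) nested list from (T, N, 3) joint positions.

    Velocity and acceleration are estimated with backward differences
    (an assumption of this sketch); frames without enough history keep zeros.
    """
    T, N = len(positions), len(positions[0])
    velocity = [[[0.0] * 3 for _ in range(N)] for _ in range(T)]
    acceleration = [[[0.0] * 3 for _ in range(N)] for _ in range(T)]
    for t in range(1, T):
        for n in range(N):
            velocity[t][n] = [
                (positions[t][n][c] - positions[t - 1][n][c]) / dt
                for c in range(3)
            ]
    for t in range(2, T):
        for n in range(N):
            acceleration[t][n] = [
                (velocity[t][n][c] - velocity[t - 1][n][c]) / dt
                for c in range(3)
            ]
    # Concatenate the three (x, y, z) vectors into C = 9 channels per joint.
    return [
        [list(positions[t][n]) + velocity[t][n] + acceleration[t][n]
         for n in range(N)]
        for t in range(T)
    ]


def tensor_shape(tensor):
    """Return (T, N, C) for a nested-list pose tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))


# Hypothetical data: T=4 frames, N=2 joints; joint 0 moves along x as t**2.
positions = [[(float(t * t), 0.0, 0.0), (1.0, 1.0, 1.0)] for t in range(4)]
pose = make_pose_tensor(positions)
```

Here `tensor_shape(pose)` gives (4, 2, 9), matching T=4, N=2, and C=9.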
Human pose data provides a lightweight representation of the sensor data that is flexible and useful for many different applications. As an example, an RGB camera image capturing a human is stored as an (H,W,3)-shaped tensor, where H and W are the height and width in pixels of the image, and 3 corresponds to the number of color channels in the image. In many HAR models, N*C«H*W*3, so that the pose data lives on a substantially lower-dimensional manifold than the raw image. For example,
On the other hand, working with pose data alone comes at the cost of losing contextual data within the image that could be useful in helping a HAR model learn the desired action. A common problem with many vision-based deep learning models, though, is that when trained on data from an RGB camera, the models learn features that are specific to the conditions of the production area during the time of recording, such as camera perspective and lighting. There are many forms of human pose data, but in general the above form is flexible and amenable to various transformations such as rotation, scaling, translation, and flipping. These transformations are used for proper training of machine learning models, as they can serve as normalization operations that improve a model's ability to generalize, and the resulting pose data is relatively robust to the above-mentioned camera-specific issues. Moreover, extracting pose data from sensors such as RGB or depth cameras is achievable with current state-of-the-art methods in ML.
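To make the role of these transformations concrete, the following Python sketch normalizes a single frame of 2-D joint coordinates by translation, rotation, and scaling. The choice of a root joint and a reference joint (for example, hip and neck) and the function name are hypothetical; the example implementations do not prescribe a particular normalization.

```python
import math


def normalize_pose(frame, root=0, ref=1):
    """Normalize one frame of 2-D joints given as [(x, y), ...].

    Translates the root joint to the origin, rotates the root->ref bone
    onto the +y axis, and scales that bone to unit length. The indices
    `root` and `ref` are hypothetical choices for this sketch.
    """
    rx, ry = frame[root]
    centered = [(x - rx, y - ry) for x, y in frame]
    bx, by = centered[ref]
    length = math.hypot(bx, by)
    if length == 0.0:
        raise ValueError("root and reference joints coincide")
    # Angle of the bone measured from the +y axis toward the +x axis;
    # rotating counterclockwise by this angle maps the bone onto +y.
    angle = math.atan2(bx, by)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [
        ((x * cos_a - y * sin_a) / length, (x * sin_a + y * cos_a) / length)
        for x, y in centered
    ]


# The same pose seen from two hypothetical camera setups: frame_b is
# frame_a rotated by 90 degrees, scaled by 3, and shifted by (5, -2).
frame_a = [(0.0, 0.0), (0.0, 2.0), (1.0, 0.0)]
frame_b = [(5.0, -2.0), (-1.0, -2.0), (5.0, 1.0)]
norm_a = normalize_pose(frame_a)
norm_b = normalize_pose(frame_b)
```

After normalization, `norm_a` and `norm_b` agree up to floating-point error, which is the sense in which the pose representation can be made insensitive to camera placement. Mirror flips are not removed by this sketch.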
While human pose data has been shown to be useful in developing accurate HAR models, many challenges remain, specifically in the Industrial Internet of Things (IIoT) field. In this setting, sensor data can be used to capture worker movement, but the exact task and the nature of the work can change over time for any physical production area. Thus, deploying or developing models in these production areas can be difficult due to the dynamic setting the sensors are observing. Using effective HAR models often requires large amounts of resources as well, due to the complicated architecture and large number of parameters that many deep learning models use. Activity recognition from human pose data is a well-studied problem. For example, graph convolutional networks leveraging spatio-temporal human pose data are a state-of-the-art method for HAR models. Some related art implementations stress the importance of pose normalization (such as scale and rotation invariance, along with illumination concerns) for improved accuracy of HAR models. However, most related art implementations do not consider the broader picture of using HAR on a wider scale and are focused on building these models with particular setups, and so do not maximize the impact these models can have. The related art lacks a solution that addresses all of these problems. The challenge of finding the optimal way to train multiple HAR models by appropriately leveraging the similarity in the movements of the different settings remains open. A technical consequence of such a solution would be to reduce the amount of labeled data required for such models. Also, accounting for changes in conditions in the production areas using a suitable system of management for the HAR models remains difficult.
To address the above issues, example implementations described herein are directed to improving the efficiency of model learning in industrial businesses, in particular in the industrial setting. For example, in the factory floor 101 of
Further, the example implementations utilize clustering based on features extracted from the pose data, which is then used to determine how the models are trained so as to maximize the training of HAR models based on the similarity in worker movement. By using such an approach, in contrast to the related art, models can be generated efficiently and in a timely manner through the reduction of the amount of labeled data required for such models.
Suppose there are K production areas, and in each production area there is an interest in detecting an action from the action space 𝒜(k) for k=1, . . . , K. In general, the action spaces 𝒜(k) are not the same across the different production areas, even when there might be similarities. Building a separate model for each of the production areas may thereby miss out on the commonality between the sensor data (or pose data, etc.) X(k) across the different production areas: even if 𝒜(k)∩𝒜(k′)=Ø, it is still possible that there is a high degree of similarity between the corresponding sensor data X(k) and X(k′). In optimizing neural networks, a loss function ℒ(θ; X, a) is minimized as a function of θ with respect to a labeled pair (X,a), where θ represents the trainable parameters of the neural network. Most optimization algorithms use a variant of stochastic gradient descent (SGD), where the parameters are updated according to the rule
θ ← θ − η∇_θ ℒ(θ; X, a).
The parameters in SGD can be updated according to an average of the gradient of the loss function across a mini-batch of data ℬ = {(X_i, a_i)}. If the X_i in the mini-batch are drawn from production areas that are significantly different, this average gradient estimate may have high variance during each iteration, which could impact training time and/or performance. This can be overcome with a significant amount of data, a long enough training time, or appropriate tuning (such as of the scaling factor η), but due to the expensive data curation process, this may or may not be feasible.
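The mini-batch update can be sketched in a few lines of Python. The toy model below (a scalar linear predictor with a squared-error loss) and the function names are assumptions chosen to keep the example small; a HAR model would instead use a classification loss over the action space.

```python
def sgd_step(theta, batch, grad_fn, eta):
    """One mini-batch SGD update: theta <- theta - eta * (mean gradient)."""
    grads = [grad_fn(theta, x, a) for x, a in batch]
    mean_grad = [
        sum(g[j] for g in grads) / len(grads) for j in range(len(theta))
    ]
    return [t - eta * g for t, g in zip(theta, mean_grad)]


def squared_loss(theta, x, a):
    """Toy loss 0.5 * (w*x + b - a)**2 for the hypothetical linear model."""
    w, b = theta
    return 0.5 * (w * x + b - a) ** 2


def squared_loss_grad(theta, x, a):
    """Gradient of the toy loss with respect to (w, b)."""
    w, b = theta
    err = w * x + b - a
    return [err * x, err]


# Hypothetical mini-batch of (X_i, a_i) pairs and scaling factor eta.
batch = [(1.0, 2.0), (2.0, 4.0)]
theta = [0.0, 0.0]
for _ in range(200):
    theta = sgd_step(theta, batch, squared_loss_grad, eta=0.1)
```

When the samples in a batch come from very different production areas, the per-sample gradients returned by `grad_fn` tend to disagree more, so their mean is a noisier estimate of the descent direction.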
Example implementations thereby involve a method for aggregating and organizing sensor data from different production areas for effective training of machine learning models. By grouping together similar production areas appropriately, machine learning models can thereby be trained to identify different tasks. The example implementations begin by first observing that the distribution of features extracted from the pose data can be used to characterize the production area from which they were sampled.
Now, given M production areas, example implementations devise a strategy for clustering these production areas together and training them based on their region similarities. It is assumed throughout that in each production area, the sensor data is observing human workers, and that from this sensor data human pose data can be extracted from each worker in the sensor's field of view such that it can be represented as a spatio-temporal tensor as shown in
Then, based on features chosen 504 for the production areas in question, the example implementations cluster production areas 505 based on the similarity of these features that are extracted from the pose data. After clustering the production areas, example implementations train or use a model 506 for each production area by jointly considering models within each cluster, even though the goal of the model for each production area may be different. After some time, each model is evaluated 507, and depending on the performance example implementations can decide to adjust the clustering or continue using each model for the task at hand.
Clustering based on similarity of the extracted features, rather than the poses themselves, creates some flexibility in how the exact clusters are produced. The feature extraction process 504 is depicted in
Suppose there is a need to cluster M production areas into K different groups. For each k=1, . . . , K, define the index set of cluster k as
𝒞(k) = {m | production area m belongs to cluster k}.
Within each cluster, the individual tasks (which are specified by their action spaces (k
used for classification of sensor data X(k
ƒ(k
Here, the symbol ⊕ denotes concatenation.
The function 702 p(k
The function 704 τ:ℝ^(T×N×C)→ℝ^(T×N×C) performs a suitable combination of transformations to align the human pose 703 to a common perspective 705. In the same way as above, this is the same as step 503 taken in
The function 706 F:ℝ^(T×N×C)→ℝ^D is the feature extraction step 504 that produces features of interest 707 from the aligned pose tensor 705 of shape (T,N,C) modeling the spatio-temporal human pose graph. Examples of features extracted could be time-dependent relationships, bone angle vectors, local coordinate changes of joints, and so on.
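As a hedged illustration of such a feature extraction step, the Python sketch below maps an aligned 2-D pose sequence to a flat feature vector made of time-averaged bone direction vectors and mean per-joint displacements. The bone list, the specific features, and the function name are assumptions for illustration; the example implementations leave the feature choice to the desired implementation.

```python
import math


def extract_features(pose, bones):
    """Map a (T, N, 2) nested-list pose to a flat feature vector.

    Features (assumptions of this sketch):
      * per bone (i, j): unit direction from joint i to joint j,
        averaged over time (2 values per bone);
      * per joint: mean displacement between consecutive frames.
    The output length is D = 2 * len(bones) + N.
    """
    T, N = len(pose), len(pose[0])
    features = []
    for i, j in bones:
        sum_dx = sum_dy = 0.0
        for frame in pose:
            dx = frame[j][0] - frame[i][0]
            dy = frame[j][1] - frame[i][1]
            norm = math.hypot(dx, dy) or 1.0  # guard against zero-length bones
            sum_dx += dx / norm
            sum_dy += dy / norm
        features += [sum_dx / T, sum_dy / T]
    for n in range(N):
        moved = sum(
            math.hypot(pose[t][n][0] - pose[t - 1][n][0],
                       pose[t][n][1] - pose[t - 1][n][1])
            for t in range(1, T)
        )
        features.append(moved / max(T - 1, 1))
    return features


# Hypothetical aligned pose: T=2 frames, N=2 joints, one bone (0 -> 1).
aligned = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(0.0, 0.0), (1.0, 0.0)],
]
feats = extract_features(aligned, bones=[(0, 1)])
```

With one bone and two joints the output has D = 2*1 + 2 = 4 entries, independent of the number of frames T.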
The function 708 ψ(k):ℝ^D→ℝ^L
Lastly, the function 711 g(k
The central function in performing human activity recognition in the above framework is the function 708 ω(k) that processes the input space of feature extracted poses X̂(k
Naively clustering based on the distribution of the raw sensor data 701 X(k) will lead to problems, in that it will not successfully capture similarities in the observed workers in each frame without careful preprocessing. For example, suppose there are two production areas k and k′ that observe the same tasks performed by workers but in different settings. One would expect the sensor data X(k) and X(k′) to have similar distributions, but unless the lighting conditions, camera perspective, occlusions, etc. are similar for each production area, the distributions could look significantly different from one another. With proper processing of the data, some of these issues can be resolved, but this often comes at a cost in interpretability of the data.
One could instead consider the distribution of X̂(k), the feature-extracted pose data from production area k. By comparing the distributions of X̂(k) and X̂(k′) as opposed to those of X(k) and X(k′), the effects that the conditions around the production area may have on the observed data can thereby be removed. Moreover, the fact that the random variables X̂(k) have a support that lives on a significantly lower-dimensional manifold than the corresponding X(k) can help with clustering, as high-dimensional clustering is often problematic.
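One possible way to compare feature distributions and group production areas is sketched below in Python. The histogram estimate of each distribution, the Jensen-Shannon divergence as the similarity measure, and the threshold-based single-linkage grouping are all assumptions made for this sketch; the example implementations do not mandate a specific distance or clustering algorithm.

```python
import math


def histogram(samples, bins, lo, hi):
    """Normalized histogram of scalar feature samples over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        idx = min(max(int((s - lo) / width), 0), bins - 1)
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), which lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_areas(histograms, threshold):
    """Group areas whose pairwise divergence is below `threshold`.

    Uses union-find, so groups are merged transitively (single linkage).
    Returns one integer cluster label per production area.
    """
    n = len(histograms)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if js_divergence(histograms[i], histograms[j]) < threshold:
                parent[find(j)] = find(i)
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Hypothetical scalar feature samples from three production areas.
area_samples = [
    [0.10, 0.20, 0.15, 0.12],
    [0.11, 0.18, 0.14, 0.20],
    [0.90, 0.85, 0.95, 0.80],
]
hists = [histogram(s, bins=4, lo=0.0, hi=1.0) for s in area_samples]
clusters = cluster_areas(hists, threshold=0.5)
```

In this toy data the first two areas share a cluster and the third is placed on its own. For multi-dimensional features, a different density estimate or distance would likely be needed.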
Once clustering 505 is accomplished, such models 506 can be trained using previously acquired labeled data.
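The sharing of weights among models in the same cluster can be pictured with the following Python sketch, in which every production area in a cluster holds a reference to the same shared parameter list while keeping its own final layer sized to its own action space. The class name, the elementwise shared layer, and the zero-initialized heads are hypothetical simplifications, not the architecture of the example implementations.

```python
class AreaModel:
    """Per-area model: cluster-shared weights plus an area-specific head."""

    def __init__(self, shared_weights, num_actions):
        # The same list object is held by every area in the cluster.
        self.shared = shared_weights
        # Area-specific last layer, one row of weights per action.
        self.head = [[0.0] * len(shared_weights) for _ in range(num_actions)]

    def hidden(self, features):
        """Shared part: elementwise scaling followed by a ReLU."""
        return [max(0.0, w * x) for w, x in zip(self.shared, features)]

    def scores(self, features):
        """Area-specific part: one linear score per action."""
        h = self.hidden(features)
        return [sum(a * b for a, b in zip(row, h)) for row in self.head]


def build_models(cluster_of_area, actions_per_area, feature_dim):
    """Create one model per area, sharing weights within each cluster."""
    shared_by_cluster = {}
    models = {}
    for area, cluster in cluster_of_area.items():
        weights = shared_by_cluster.setdefault(cluster, [1.0] * feature_dim)
        models[area] = AreaModel(weights, actions_per_area[area])
    return models


# Hypothetical assignment: areas A and B share cluster 0, C is in cluster 1.
models = build_models(
    cluster_of_area={"A": 0, "B": 0, "C": 1},
    actions_per_area={"A": 2, "B": 3, "C": 2},
    feature_dim=4,
)
# An update made while training on area A is visible to area B's model.
models["A"].shared[0] = 2.0
```

Because the heads stay separate, areas in the same cluster can still have different action spaces, while labeled data from any of them contributes to the shared weights.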
In the event each model can be trained sequentially 5062 according to some queue of production areas (YES), the models can be trained in the cluster as follows (e.g., in the deep learning case split learning methods are similar): First pick the production area km at the front of the queue 5063, and train model ƒ(k
If the models are not to be trained sequentially, but in parallel (NO), then within each cluster the model(s) ƒ(k
that is then shared across each of the ƒ(k
Since each production area km is likely to change due to the dynamic nature of factory conditions (e.g., in the morning the production line may differ from what is produced during the evening), the model ƒ(k
In the latter case, it may be desirable to re-use the model ƒ(k
If so (YES), then rather than immediately reassigning production area km to a new cluster, the flow proceeds to 5072 to update the last layer g(k
Upon passing a suitable test on validation data 5073, if the performance of model ƒ(k
On the other hand, if the task has not changed at 5071 (NO), then the flow proceeds to 5076 to consider whether the distribution of feature-extracted pose data (k
Example implementations provide systems and methods for optimizing management of multiple HAR models by clustering production areas with human activity based on similarity between the distributions of certain features of extracted human pose data, which contributes to reduced costs by efficiently training models.
As described herein, the example implementations can obtain sensed data for a plurality of time periods from a plurality of specific physical areas as illustrated in
Furthermore, example implementations receive the input of feature selection, and the clustering is done based on the received feature selection as illustrated in
In example implementations, the extracted posture distribution data can be aligned to a common perspective as illustrated in
In example implementations, there is a feedback mechanism for updating clusters in which clusters are updated based on changes in environment as illustrated in
Computer device 1205 in computing environment 1200 can include one or more processing units, cores, or processors 1210, memory 1215 (e.g., RAM, ROM, and/or the like), internal storage 1220 (e.g., magnetic, optical, solid state storage, and/or organic), and/or I/O interface 1225, any of which can be coupled on a communication mechanism or bus 1230 for communicating information or embedded in the computer device 1205. I/O interface 1225 is also configured to receive images from cameras or provide images to projectors or displays, depending on the desired implementation.
Computer device 1205 can be communicatively coupled to input/user interface 1235 and output device/interface 1240. Either one or both of input/user interface 1235 and output device/interface 1240 can be a wired or wireless interface and can be detachable. Input/user interface 1235 may include any device, component, sensor, or interface, physical or virtual, that can be used to provide input (e.g., buttons, touch-screen interface, keyboard, a pointing/cursor control, microphone, camera, braille, motion sensor, optical reader, and/or the like). Output device/interface 1240 may include a display, television, monitor, printer, speaker, braille, or the like. In some example implementations, input/user interface 1235 and output device/interface 1240 can be embedded with or physically coupled to the computer device 1205. In other example implementations, other computer devices may function as or provide the functions of input/user interface 1235 and output device/interface 1240 for a computer device 1205.
Examples of computer device 1205 may include, but are not limited to, highly mobile devices (e.g., smartphones, devices in vehicles and other machines, devices carried by humans and animals, and the like), mobile devices (e.g., tablets, notebooks, laptops, personal computers, portable televisions, radios, and the like), and devices not designed for mobility (e.g., desktop computers, other computers, information kiosks, televisions with one or more processors embedded therein and/or coupled thereto, radios, and the like).
Computer device 1205 can be communicatively coupled (e.g., via I/O interface 1225) to external storage 1245 and network 1250 for communicating with any number of networked components, devices, and systems, including one or more computer devices of the same or different configuration. Computer device 1205 or any connected computer device can be functioning as, providing services of, or referred to as a server, client, thin server, general machine, special-purpose machine, or another label.
I/O interface 1225 can include, but is not limited to, wired and/or wireless interfaces using any communication or I/O protocols or standards (e.g., Ethernet, 802.11x, Universal Serial Bus, WiMax, modem, a cellular network protocol, and the like) for communicating information to and/or from at least all the connected components, devices, and network in computing environment 1200. Network 1250 can be any network or combination of networks (e.g., the Internet, local area network, wide area network, a telephonic network, a cellular network, satellite network, and the like).
Computer device 1205 can use and/or communicate using computer-usable or computer-readable media, including transitory media and non-transitory media. Transitory media include transmission media (e.g., metal cables, fiber optics), signals, carrier waves, and the like. Non-transitory media include magnetic media (e.g., disks and tapes), optical media (e.g., CD ROM, digital video disks, Blu-ray disks), solid state media (e.g., RAM, ROM, flash memory, solid-state storage), and other non-volatile storage or memory.
Computer device 1205 can be used to implement techniques, methods, applications, processes, or computer-executable instructions in some example computing environments. Computer-executable instructions can be retrieved from transitory media and stored on and retrieved from non-transitory media. The executable instructions can originate from one or more of any programming, scripting, and machine languages (e.g., C, C++, C#, Java, Visual Basic, Python, Perl, JavaScript, and others).
Processor(s) 1210 can execute under any operating system (OS) (not shown), in a native or virtual environment. One or more applications can be deployed that include logic unit 1260, application programming interface (API) unit 1265, input unit 1270, output unit 1275, and inter-unit communication mechanism 1295 for the different units to communicate with each other, with the OS, and with other applications (not shown). The described units and elements can be varied in design, function, configuration, or implementation and are not limited to the descriptions provided.
In some example implementations, when information or an execution instruction is received by API unit 1265, it may be communicated to one or more other units (e.g., logic unit 1260, input unit 1270, output unit 1275). In some instances, logic unit 1260 may be configured to control the information flow among the units and direct the services provided by API unit 1265, input unit 1270, output unit 1275, in some example implementations described above. For example, the flow of one or more processes or implementations may be controlled by logic unit 1260 alone or in conjunction with API unit 1265. The input unit 1270 may be configured to obtain input for the calculations described in the example implementations, and the output unit 1275 may be configured to provide output based on the calculations described in example implementations.
Processor(s) 1210 can be configured to, for receipt of sensor data of a plurality of workers operating across a plurality of physical areas from a plurality of sensors as illustrated in
Processor(s) 1210 can be configured to process a feature selection, wherein the clustering of the pose distributions based on the similarity is done based on the feature selection, as illustrated in
Depending on the desired implementation, the pose distributions are aligned to a common perspective as illustrated in
Processor(s) 1210 can be configured to update the plurality of clusters based on a determination of a change to one or more of the plurality of physical areas based on changes to the pose distributions as illustrated in
Depending on the desired implementation, the change to the one or more of the plurality of physical areas is one or more of a task change and a distribution change as illustrated in
Processor(s) 1210 can be configured to train the model for the each of the pose distributions of the each of the plurality of physical areas to generate the plurality of models in parallel for a determination that compute resources are available to train the model for the each of the pose distributions of the each of the plurality of physical areas in parallel, and sequentially for the determination that the compute resources are not available to train the model for the each of the pose distributions of the each of the plurality of physical areas in parallel as illustrated in
Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In example implementations, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result.
Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” or the like, can include the actions and processes of a computer system or other information processing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system's memories or registers or other information storage, transmission or display devices.
Example implementations may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs. Such computer programs may be stored in a computer readable medium, such as a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may involve tangible mediums such as, but not limited to optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of tangible or non-transitory media suitable for storing electronic information. A computer readable signal medium may include mediums such as carrier waves. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Computer programs can involve pure software implementations that involve instructions that perform the operations of the desired implementation.
Various general-purpose systems may be used with programs and modules in accordance with the examples herein, or it may prove convenient to construct a more specialized apparatus to perform desired method steps. In addition, the example implementations are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the example implementations as described herein. The instructions of the programming language(s) may be executed by one or more processing devices, e.g., central processing units (CPUs), processors, or controllers.
As is known in the art, the operations described above can be performed by hardware, software, or some combination of software and hardware. Various aspects of the example implementations may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out implementations of the present application. Further, some example implementations of the present application may be performed solely in hardware, whereas other example implementations may be performed solely in software. Moreover, the various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways. When performed by software, the methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.
Moreover, other implementations of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the teachings of the present application. Various aspects and/or components of the described example implementations may be used singly or in any combination. It is intended that the specification and example implementations be considered as examples only, with the true scope and spirit of the present application being indicated by the following claims.