The present invention relates to image processing.
Computer vision and image recognition are being increasingly used in agriculture and horticulture, for example, to help manage crop production and automate farming.
T. Hague, N. D. Tillett and H. Wheeler: “Automated Crop and Weed Monitoring in Widely Spaced Cereals”, Precision Agriculture, volume 7, pp. 21-32 (2006) describes an approach for automatic assessment of crop and weed area in images of widely-spaced (0.25 m) cereal crops, captured from a tractor-mounted camera.
WO 2013/134480 A1 describes a method of real-time plant selection and removal from a plant field. The method includes capturing an image of a section of the plant field, segmenting the image into regions indicative of individual plants within the section, selecting the optimal plants for retention from the image based on the image and previously thinned plant field sections, and sending instructions to the plant removal mechanism for removal of the plants corresponding to the unselected regions of the image before the machine passes the unselected regions.
A. English, P. Ross, D. Ball and P. Corke: “Vision based guidance for robot navigation in agriculture”, 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, pp. 1693-1698 (2014) describes a method of vision-based texture tracking to guide autonomous vehicles in agricultural fields. The method works by extracting and tracking the direction and lateral offset of the dominant parallel texture in a simulated overhead view of the scene and hence abstracts away crop-specific details such as colour, spacing and periodicity.
A. English, P. Ross, D. Ball, B. Upcroft and P. Corke: “Learning crop models for vision-based guidance of agricultural robots”, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 2015, pp. 1158-1163 describes a vision-based method of guiding autonomous vehicles within crop rows in agricultural fields. The location of the crop rows is estimated with an SVM regression algorithm using colour, texture and 3D structure descriptors from a forward-facing stereo camera pair.
P. Lottes, J. Behley, N. Chebrolu, A. Milioto and C. Stachniss: “Joint Stem Detection and Crop-Weed Classification for Plant-Specific Treatment in Precision Farming”, 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 8233-8238 describes an approach which outputs the stem location for weeds, which allows for mechanical treatments, and the covered area of the weed for selective spraying. The approach uses an end-to-end trainable fully-convolutional network that simultaneously estimates stem positions as well as the covered area of crops and weeds. It jointly learns the class-wise stem detection and the pixel-wise semantic segmentation.
According to a first aspect of the present invention there is provided a method comprising receiving an image of a crop, the image comprising an array of elements which includes depth information, wherein the image is a multiple-channel image comprising at least a colour channel and a depth channel providing per-element depth information, providing the image to a trained convolutional neural network to generate a response map comprising an image comprising intensity values having respective peaks corresponding to the stem of a plant in the crop, obtaining, from the response map, coordinates corresponding to the respective peaks, and converting the coordinates in image coordinates into stem locations in real-world dimensions using the provided depth information.
Thus, segmentation is neither used nor required, and coordinates, for example point- or line-like coordinates, can be extracted directly from the response map. Furthermore, labelling (or "annotation") of images used to train the network can be easier because features can be labelled directly.
The peak(s) may correspond to point(s) which are part(s) of plant(s) in the crop, such as a stem, leaf tip or meristem. The peak(s) may correspond to line(s) which are crop rows.
The method may comprise adding and/or displaying at least some of the peaks for the detected coordinates on the received image.
Processing the image to obtain the processed image may include per-pixel processing. The per-pixel processing may include standardization and/or normalization.
The method may be performed by at least one processing unit. The at least one processing unit may include at least one central processing unit and/or at least one graphics processing unit. The method is preferably performed by a graphics processing unit.
The method may further comprise receiving a mapping array for mapping depth information to a corresponding element. Thus, the mapping array provides the per-element depth information.
The mapping array may take the form of a texture. Using a texture can have the advantage of allowing sampling of the texture using normalised coordinates, thereby permitting linear interpolation and so facilitating handling of different image sizes. Textures can also be accessed quickly.
The method may further comprise receiving the depth information from a depth sensing image sensor. The depth sensing image sensor may be a stereo depth sensing image sensor. The multiple-channel image may further include an infrared channel. The multiple-channel image may further include an optical flow image.
The intensity values may have Gaussian distributions in the vicinity of each detected location.
The trained convolutional neural network may comprise a series of at least two encoder-decoder modules (each encoder-decoder module comprising an encoder and a decoder). The trained convolutional neural network may be a multi-stage pyramid network.
The method may comprise extracting coordinates in image coordinates from the response map.
The method may further comprise converting the coordinates in image coordinates into camera coordinates in real-world dimensions. If the position of the ground plane is also known, then this can allow an imaging system to be automatically calibrated without, for example, needing to know image sensor height and image sensor angle of orientation. Thus, it is possible to calibrate the size of viewing area and crop dimensions (such as height of crop), automatically.
The method may further comprise amalgamating the camera coordinates corresponding to the same plant from more than one frame into an amalgamated coordinate. Amalgamating the coordinates may comprise projecting the coordinates onto a gridmap. Amalgamating the coordinates may comprise data association between different frames. Data association between different frames may include matching coordinates and updating matched coordinates. Matching may comprise using a descriptor or determination of distance between coordinates. Updating may comprise using an extended Kalman filter. The method may further comprise converting the coordinates in image coordinates into locations in real-world dimensions.
The method may further comprise calculating a trajectory in dependence on the detected stem coordinate and transmitting a control message to a control system in dependence on the trajectory. The control message may include the trajectory. The trajectory may comprise at least two setpoints. The control system may comprise a system for controlling at least one motor and/or at least one actuator. The control message may be transmitted via a communications network. The communications network may be a TCP/IP or UDP/IP network. The communications network may be a serial network.
The response map image may or may not be displayed.
According to a second aspect of the present invention there is provided a computer program which, when executed by at least one processor, performs the method of any of the first aspect.
According to a third aspect of the present invention there is provided a computer program product comprising a computer-readable medium (which may be non-transitory) storing a computer program which, when executed by at least one processor, performs the method of any of the first aspect.
According to a fourth aspect of the present invention there is provided a computer system comprising at least one processor and memory, wherein the at least one processor is configured to perform the method of any of the first aspect.
The at least one processor may perform the method in real-time. The at least one processor may include at least one graphics processor. The at least one processor may include at least one tensor processor. The at least one processor may include at least one floating-point processor. The at least one processor may include at least one central processor.
According to a fifth aspect of the present invention there is provided a system comprising a multiple-image sensor system for obtaining images and the computer system of the fourth aspect. The multiple-image sensor system is arranged to provide the images to the computer system and the computer system is configured to process the images.
The system may further comprise a control system. The control system may comprise a system for controlling at least one motor and/or at least one actuator. The control message may be transmitted via a communications network. The communications network may be a TCP/IP or UDP/IP network. The communications network may be a serial network.
According to a sixth aspect of the present invention there is provided a vehicle, such as a tractor, comprising the system of the fifth aspect. The vehicle may comprise an implement mounted to the vehicle. The implement may be a weed-control implement.
Certain embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Referring to
Referring also to
The weeding system 2 is of an in-row (or “intra-row”) type capable of weeding between plants 4 in a row 5 and employs reciprocating blade weeders or rotary weeders. The weeding system 2 may, however, be an inter-row type of weeding implement (which tends to be used for weeding broad acre crops) and/or may employ other forms of weeder unit, such as reciprocating blade weeders.
As will be described hereinafter, an imaging system 11 (
Referring also to
Referring to
In this case, the image sensor system 12 comprises a colour image sensor 13 and two infrared image sensors 151, 152 (
Other combinations of cameras and other types of depth sensing can be used. For example, a colour camera and a time-of-flight infrared-based depth camera 14 can be used, such as an Intel® RealSense™ LIDAR camera or Microsoft® Azure Kinect™. Two colour cameras can be used, for example, the Stereolabs™ ZED 2 stereo camera. An external camera (i.e., a different camera not contained in the same housing), such as a separate colour camera, can be provided even if a camera of the same type is already provided, e.g., to provide stereo depth sensing and/or to provide a camera with a different field of view.
Referring still to
Referring to
The stem locations 29 are provided to the control system 23 in messages 30 to control the positions of the weeding units 7.
Referring to
Referring to
In this example, a four-core CPU 31 can implement simultaneous multithreading (or "hyper-threading") to provide eight executing threads. However, the CPU 31 may have more cores (for example, six or eight cores) or fewer cores (for example, two cores).
Referring still to
GPU computing allows use of on-chip memory which is quicker than global external GPU RAM. On-chip throughput of shared memory inside an SM can be >1 TB/s and local per-warp RAM can be even quicker.
Referring to
The GPU 32 includes several processing blocks including an optional optical flow mask generation block 60, a processing block 70 for detecting and matching keypoints, a pixel mapping block 80 and ground plane determination block 90.
The processing blocks include a network input preparation block 100 and a camera pose determination block 110. The processing blocks include a trained neural network block 120, a network output processing block 130 and a stem publisher 140.
The CPU 31 also performs bundle adjustment 150 using depth and infrared images 16, 18 to refine the camera orientation and trajectory. In particular, bundle adjustment block 150 generates translational and rotational matrices 151, 152 which are used by the GPU 32 to convert a position in the real world in camera coordinates into a position in the real world in world coordinates. Bundle adjustment 150 may, however, be incorporated into camera pose determination block 110 and be performed by the GPU 32.
Referring also to
The multiple-image sensor system 12 outputs a stream of colour, infrared and depth image frames 14, 16, 18. The CPU 31 processes frames 14, 16, 18, in real-time, on a frame-by-frame basis. Processing in the GPU 32 is staged and ordered so that certain tasks occur at the same time and, thus, make better use of GPU 32 processing.
In block 55, the CPU 31 copies the infrared image frame 18 to the GPU 32. If optical flow is performed, then the infrared image frame 18 passes to the optical flow generation block 60 to generate an optical flow mask 61. The infrared image frame 18 also passes to the keypoint detection and matching block 70, which is based, for example, on oriented FAST and rotated BRIEF (ORB), scale-invariant feature transform (SIFT) or speeded up robust features (SURF) processes.
In block 56, the CPU 31 copies the depth image frame 16 to the GPU 32. The depth image frame 16 passes to the keypoint detection and matching block 70, to the mapping array block 80 which generates a mapping array 81 and to the ground plane position determination block 90 which determines the ground plane 91, for example using a random sample consensus (RANSAC) process.
In block 57, the CPU 31 copies the colour image frame 14 to the GPU 32. If generated, the optical flow mask 61 passes to the network input pre-processing block 100.
The keypoint detection and matching block 70 returns matched keypoint positions 73, 73′ in pixel coordinates of current frame and world coordinates of previous camera pose. The keypoints 73, 73′ pass to the camera pose determination block 110 which outputs camera pose 111. As will be explained in more detail later, the bundle adjustment block 150 (
The colour, depth and infrared image frames 14, 16, 18, optical flow 61 (if used), mapping array 81 and ground plane position 91 pass to the network input pre-processing block 100 which generates a network input image 101 which is fed to a trained network 120 which generates a network output 121 (herein also referred to as a “response map”).
The network output 121, the mapping array 81 and the depth image frame 16 pass to the network output processing block 130 which generates plant positions 131 (
The camera pose 111 and the plant positions 132 pass to a stem publisher block 140 which performs stem amalgamation which outputs the plant stem location 29. Using the stem location 29, a new hoe position determination block 160 computes a new hoe position.
Before describing the image sensor system 12 (
Referring to
A set of extrinsic parameters (or “extrinsics”) can be used to transform (in particular, by rotation and translation) from the world coordinate system (X′, Y′, Z′) to the camera coordinate system (XC′, YC′, ZC′) and a set of intrinsic parameters (or “intrinsics”) can be used to project the camera coordinate system (XC′, YC′, ZC′) onto the pixel coordinate system (xp, yp) according to focal length, principal point and skew coefficients.
Processing of images 14, 16, 18 (
Hereinafter, in general, coordinates given in lower case, such as x, y, refer to pixel coordinates in an image, and coordinates given in upper case, such as X, Y, refer to normalised coordinates in an image or field of view. Normalised coordinates comprise a pair of values, each value lying in a range between 0 and 1 (i.e., 0≤X≤1; 0≤Y≤1) and can be found by dividing a pixel position in a given dimension by the relevant size of the dimension, that is, width or height (expressed in pixels). Normalised coordinates need not correspond to a discrete pixel and, thus, are not restricted to discrete values. Normalised coordinates can be interpolated.
The subscript "W" is used to denote world and the subscript "C" is used to denote camera. Thus, OW refers to an origin in world coordinates and OC refers to an origin in camera coordinates.
An array is a portion of memory with an allocated memory address. The address can be expressed in terms of integer values (for example, 32-bit integer values), float values or other formats. An array can be expressed (and accessed) relative to a base memory address using an index number, i.e., the number of values from the base memory address. Arrays can be accessed in strides using two-dimensional or three-dimensional pixel coordinates if later converted to an index position. A one-dimensional indexed array can be accessed in strides to represent two-dimensional or three-dimensional data in row-major or column-major format.
Suitable code for converting x- and y-pixel values x, y into an index value is:
__host__ __device__ int XYtoIndex(int x, int y, int width, int height)
{ (void)height; return y * width + x; } // row-major index; body reconstructed as the original listing is truncated
Using an index can facilitate handling of textures in memory since textures are usually stored in a sequence.
An image tends to be defined as an array, usually with multiple channels. For example, there are three channels for an RGB image and there are four channels for an RGBA image. Allocated memory is divided into 8-bit (i.e., one byte) integer values in the range 0 to 255. Like an array, an image can be expressed (and accessed) using an index value or pixel coordinates if later converted to an index position.
Textures are accessed through a texture cache (not shown) on a GPU 32 (
Texture accessing of values is often quicker than normal array accessing on a GPU 32 (
Textures are created by copying memory via the texture cache (not shown) to create a new texture object (not shown).
Surfaces are special, writable textures enabling values to be written to surface memory (not shown) in the memory 33 (
Framebuffer textures allow textures to be rendered, without displaying on a screen, into a texture object and used later on in the pipeline. Thus, it replaces the screen as the primary destination to display textures. Since it is itself a texture, a framebuffer texture can be accessed later on using two-dimensional normalised coordinates.
As will be explained in more detail hereinafter, a mapping texture can be written as a surface to memory (during generation), and read as a texture from memory. Textures can be accessed, rendered, read from and written to via graphics frameworks such as OpenGL and accessed via a cudaTextureObject_t in the Nvidia CUDA API using CUDA-OpenGL interoperability. GPU frameworks other than CUDA can be used, such as OpenCL.
As explained earlier, colour, infrared and depth frames 14, 16, 18 are copied onto the GPU before mapping texture generation. Thus, once copied to the GPU 32, colour, infrared and depth frames 14, 16, 18 can be referred to as textures.
As will be explained in more detail hereinafter, a final composite in RGBNDI format is rendered using the mapping texture, depth, infrared and colour textures to a framebuffer texture. The framebuffer texture is later sampled to create the network input array 101 (
Referring to
The colour image sensor 13 is used to obtain a colour image 14 and the infrared image sensors 151, 152 are used to obtain a depth image 16 which is aligned with respect to the first infrared image sensor 151. The first infrared image sensor 151 can also be used to obtain an infrared image 18. As will be explained hereinafter in more detail, the infrared image 18 can be used to create a normalized difference vegetation index (NDVI) image. Herein, the infrared image sensors 151, 152 are referred to as a depth sensor 17.
The colour image sensor 13 is offset by a distance s with respect to a first infrared image sensor 151. The colour image sensor 13 may suffer from distortion, such as modified Brown-Conrady distortion, whereas the infrared image sensors 151, 152 may not suffer distortion or may suffer distortion which is different from that suffered by the colour image sensor 13. Furthermore, the colour image sensor 13 may have a field of view 175 ("rgbFOV") which is different from a field of view 176 ("depthFOV") of the infrared image sensor 151. However, in some image sensor systems, the colour image sensors and infrared image sensors may have the same fields of view 175, 176. As a result of the offset s, distortion and possible differences in the fields of view 175, 176, the colour image 14 and the depth and infrared images 16, 18 may be mis-aligned and/or can have different boundaries.
Referring to
The colour sensor 13 has an upper-left corner 177 ("rgbTopLeft"), an upper-right corner 178 ("rgbTopRight"), a lower-left corner 179 ("rgbBottomLeft") and a lower-right corner 180 ("rgbBottomRight"), a width 181 ("width") and height 182 ("height"), a left edge 183 ("leftCol"), a right edge 184 ("rightCol"), a top edge 185 ("topRow") and a bottom edge 186 ("bottomRow"). The depth sensor 17 has upper left, upper right, lower left and lower right corners 187, 188, 189, 190, a width 191 and height 192, a left edge 193, a right edge 194, a top edge 195, and a bottom edge 196.
The depth sensor 17 has a larger field of view than the colour sensor 13. Accordingly, not every part of an image obtained by the depth sensor 17 can be mapped to a corresponding part of an image obtained by the colour sensor 13.
Referring to
The focal lengths of the colour and depth sensors 13, 17 are used to compute the corners 177, 178, 179, 180 of the colour sensor 13. The sensors 13, 17 have respective sets of intrinsic parameters including an x-value of the position of a principal point ("ppx"), a y-value of the position of the principal point ("ppy"), a focal length in the x-direction ("fx") and a focal length in the y-direction ("fy"). The width 181 and height 182 (of the field of view) of the colour sensor 13 are width and height respectively. For example, width and height of the colour sensor 13 may take values of 424 and 240 pixels respectively.
The x- and y-values (“FOV.x”, “FOV.y”) of a field of view FOV can be calculated using:
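The expression itself is not reproduced above. For a pinhole camera model, a standard formulation in terms of the focal lengths and image dimensions (given here as an assumption rather than the exact expression used) is:

\mathrm{FOV}.x = 2\arctan\left(\frac{\mathrm{width}}{2 f_x}\right), \qquad \mathrm{FOV}.y = 2\arctan\left(\frac{\mathrm{height}}{2 f_y}\right)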
The half-frame values of width xh and height yh of the colour sensor 13 with respect to the depth sensor 17 can be calculated using:
The corner positions 177, 178, 179, 180 of the colour sensor 13 with respect to the depth sensor 17 can be calculated using:
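Neither expression is reproduced above. One plausible formulation, assuming that both fields of view share a common optical axis and that the results are expressed in normalised depth-sensor coordinates (assumptions made here purely for illustration), is:

x_h = \frac{\tan(\mathrm{rgbFOV}.x/2)}{2\tan(\mathrm{depthFOV}.x/2)}, \qquad y_h = \frac{\tan(\mathrm{rgbFOV}.y/2)}{2\tan(\mathrm{depthFOV}.y/2)}

\mathrm{rgbTopLeft} = (0.5 - x_h,\ 0.5 - y_h), \qquad \mathrm{rgbBottomRight} = (0.5 + x_h,\ 0.5 + y_h)

with rgbTopRight and rgbBottomLeft found analogously.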
As will be explained in more detail hereinafter, the boundaries 183, 184, 185, 186 of the colour sensor 13 are computed during initialisation of the system 20 and are subsequently used in mapping array generation and pixel mapping. Reducing the proportion of the depth sensor 17 that needs to be mapped onto the colour sensor 13 can help to reduce processing and memory overheads. For example, along with faster texture memory accesses on the GPU 32 (
Referring to
As hereinbefore described, the infrared sensors 151, 152 can also be seen as a depth sensor 17. For convenience, when discussing the source of depth images 16, the infrared sensors 151, 152 are described as a unitary depth sensor 17. However, as explained earlier, depth images can be obtained in other ways, e.g., without stereo depth sensing and/or without using infrared sensors.
As hereinbefore explained, the image processing system 20 can compute sensor boundaries (step S15.1). This typically need only be performed once, during initialisation, for example, when the system 20 is switched on.
The image processing system 20 processes images from the colour sensor 13 and the infrared sensors 151, 152 (steps S15.2 and S15.3). Frames from the sensors 13, 151, 152 (i.e., colour images 14 and depth images 16) are processed frame-by-frame (for example, at a frame rate of 30 fps) in real time.
Referring in particular to
A mapping array 81 (herein also referred to as a “mapping texture”) comprising mapping information 85, for example two pairs of pixel coordinates, is used to map depth pixels 83 from the depth image 16 onto the colour image 14 to create a colour-depth image 86 which can be seen as an array of pixels 87 containing both colour and depth information. As explained earlier, by using a search area 197 (
Images can be processed without adjusting for perspective (herein referred to as “unadjusted mapping” or “straight mapping”) or adjusting for perspective (herein referred to as “view-transforming mapping” or “plan view mapping”) by, for example, using constant width mapping or mapping using orthographic projection.
Generating Mapping Texture without Adjusting for Perspective
Referring to
The boundaries 183, 184, 185, 186 of the field of view 59 of the colour sensor 13 (or simply “the boundaries 183, 184, 185, 186 of the colour sensor 13”) provide the boundaries of the search area 197. Thus, in this case, the search area 197 is rectangular. In the case where an adjustment is made for perspective, a modified, non-rectangular search area is used.
The GPU 32 (
The GPU 32 (
The GPU 32 extracts a pixel value 200 (“depthPixel”) of a pixel 83 (
The GPU 32 uses the extracted depth value 202 and infrared sensor intrinsics 203 (“intrinDepth”) to generate a depth point 204 (“depthPoint”) in a point cloud having coordinates in the infrared camera coordinate system (XIR, YIR, ZIR) (step S18.3). Suitable code which generates the depth point 204 is:
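The listing is not reproduced above; a minimal sketch of such a call, assuming the depth pixel coordinates xd, yd and the extracted depth value (named depthValue here for illustration) are in scope in the enclosing kernel and that the helper functions have the signatures sketched below, might be:

// Deproject the sampled depth pixel (xd, yd) and its depth value (in meters)
// into a three-dimensional point in the infrared camera coordinate system.
float3 depthPoint = pixel2Point(make_float2((float)xd, (float)yd), depthValue, intrinDepth);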
which calls the function pixel2Point, which in turn returns depthCoord as the depth point 204:
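A sketch of pixel2Point under the same assumptions (a simple pinhole deprojection which ignores lens distortion; the Intrinsics structure is hypothetical) might be:

struct Intrinsics { float fx, fy, ppx, ppy; }; // hypothetical container for the sensor intrinsics

__host__ __device__ float3 pixel2Point(float2 pixel, float depth, Intrinsics intrin)
{
    // Deproject a pixel and its depth value into a point in the sensor's
    // camera coordinate system (pinhole model, distortion ignored).
    float3 depthCoord;
    depthCoord.x = (pixel.x - intrin.ppx) / intrin.fx * depth;
    depthCoord.y = (pixel.y - intrin.ppy) / intrin.fy * depth;
    depthCoord.z = depth;
    return depthCoord;
}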
The GPU 32 then uses depth-to-colour extrinsics 205 ("extrinD2C") to map the depth point 204 having coordinates in the infrared camera coordinate system to a depth point 206 ("rgbPoint") in a point cloud having coordinates in the colour camera coordinate system (Xrgb, Yrgb, Zrgb) (step S18.4). The depth-to-colour extrinsics 205 are based on infrared sensor extrinsics (not shown) and colour sensor extrinsics (not shown), for example, by multiplying the two sets of extrinsics.
Suitable code which generates the depth point 206 is:
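Again, the listing is not reproduced above; a hedged sketch of the call, using the variable names given in the description, might be:

// Transform the point from the infrared camera coordinate system into the
// colour camera coordinate system using the depth-to-colour extrinsics.
float3 rgbPoint = point2Point(depthPoint, extrinD2C);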
which calls the function point2Point, which in turn returns rgbCoord as the depth point 206:
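A sketch of point2Point, assuming a hypothetical Extrinsics structure holding a column-major 3×3 rotation matrix and a translation vector, might be:

struct Extrinsics { float rotation[9]; float translation[3]; }; // hypothetical; column-major rotation

__host__ __device__ float3 point2Point(float3 p, Extrinsics e)
{
    // Rotate and translate a point from one camera coordinate system to another.
    float3 rgbCoord;
    rgbCoord.x = e.rotation[0] * p.x + e.rotation[3] * p.y + e.rotation[6] * p.z + e.translation[0];
    rgbCoord.y = e.rotation[1] * p.x + e.rotation[4] * p.y + e.rotation[7] * p.z + e.translation[1];
    rgbCoord.z = e.rotation[2] * p.x + e.rotation[5] * p.y + e.rotation[8] * p.z + e.translation[2];
    return rgbCoord;
}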
The GPU 32 then uses colour sensor intrinsics 207 ("intrinCol") to project the depth point 206 into xrgb, yrgb coordinates 208 on the colour sensor (step S18.5). Suitable code which generates the coordinates 208 is:
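As before, a minimal sketch of the call might be:

// Project the point in the colour camera coordinate system onto the colour
// sensor to obtain x- and y-pixel coordinates.
float2 texCoord = point2Pixel(rgbPoint, intrinCol);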
which calls the function point2Pixel, which in turn returns texCoord as the coordinates 208:
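A sketch of point2Pixel, using the same hypothetical Intrinsics structure as above, might be:

__host__ __device__ float2 point2Pixel(float3 p, Intrinsics intrin)
{
    // Project a three-dimensional point onto the sensor's pixel plane (pinhole model).
    float2 texCoord;
    texCoord.x = p.x / p.z * intrin.fx + intrin.ppx;
    texCoord.y = p.y / p.z * intrin.fy + intrin.ppy;
    return texCoord;
}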
The GPU 32 outputs a set of two pairs of coordinates comprising x- and y-colour pixel values xrgb, yrgb and x- and y-depth pixel values xd, yd ("outputArray").
In this example, the x- and y-colour pixel values xrgb, yrgb are output as an index value found by calling a function normXYtoIndex:
__host__ __device__ int normXYtoIndex(float x, float y, int width, int height)
{ return XYtoIndex((int)(x * width), (int)(y * height), width, height); } // scales normalised coordinates to pixels; body reconstructed as the original listing is truncated
Using an index can facilitate handling of textures in memory since textures are usually stored in a sequence in memory. However, conversion into an index value is not necessary.
The camera sensors 13, 151, 152 (
It is not necessary to adjust for perspective to identify stem locations using the images captured using the sensors 13, 151, 152 (
Perspective adjustment can be achieved in different ways.
One approach is to modify the search space of the mapping texture to scale each texture row by the fixed width in camera coordinates. Corresponding colour and depth sensor pixel coordinates can be found by interpolating along the search space using the fixed width and an average depth value calculated per texture row.
Another approach is to generate a mapping texture of corresponding depth and colour sensor pixel positions scaled by pixel dimensions, with an unmodified search space. This involves finding the camera coordinates for each pixel position in the mapping array and transforming them to normalised coordinates using an orthographic projection to scale the horizontal and vertical dimensions by fixed distances in camera coordinates (in meters). The image is rendered using the newly-transformed camera coordinates. The previously-found corresponding mapping coordinates are used to sample the depth and colour textures to find pixel intensities at each pixel position.
Both approaches will now be described in more detail.
Referring also to
Referring in particular to
The modified search space 197′ can be found from an RGB image 14, IR sensor intrinsics 96 and depth-related information 211 which allows a mapping texture 81 to be generated which transforms images from a perspective view of the ground into a plan view (or “deperspectified view”) of the ground.
Referring in particular to
The minBound Xmin,j and maxBound Xmax,j values indicating the edges of the search space on the depth sensor are found by projecting the average depth value per row ZC,avg and the minWidth XC,min or maxWidth XC,max onto the depth image sensor using the depth intrinsics, namely the principal point (ppx, ppy) and focal lengths (fx, fy), and dividing by the image width wp (in pixels):
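The expression is not reproduced above. Following the operations described (projection onto the depth sensor using the intrinsics, then division by the image width), a plausible form, stated here as an assumption, is:

X_{\min,j} = \frac{1}{w_p}\left(\frac{X_{C,\min}}{Z_{C,\mathrm{avg},j}}\, f_x + \mathrm{ppx}\right), \qquad X_{\max,j} = \frac{1}{w_p}\left(\frac{X_{C,\max}}{Z_{C,\mathrm{avg},j}}\, f_x + \mathrm{ppx}\right)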
For normalised pixel position (i, j) in the mapping texture with sampling width ws and sampling height hs in pixels, the sampling coordinates for the depth image sensor can be found by interpolating between Xmin_i and Xmax_i and using the normalised j value:
For normalised pixel position (inorm, jnorm) in the mapping texture, the sampling coordinates are:
The topRow 185 and bottomRow 186 are normalised:
or (in the case of normalised pixels)
Once the sampling coordinates for the depth sensor have been found, a depth pixel value can be sampled and deprojected into a camera coordinate using pixel2Point. It can then be transferred to the RGB sensor (using point2Point) and projected onto the RGB sensor (using point2Pixel) to find the pixel sampling position for the RGB sensor.
Referring to
Generally, two approaches can be used.
First, the system 20 (
Secondly, the system 20 (
This standardises the values to have zero mean and unit standard deviation.
This can be done for all input channels, instead of just subtracting pixel mean values for RGB data and using the first approach based on minimum and maximum values.
Projections can be used to represent a three-dimensional object on a two-dimensional plane. Different types of projection can be used.
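Expressed as a formula (the standard standardisation, given here for completeness):

v' = \frac{v - \mu}{\sigma}

where v is a pixel value and μ and σ are the mean and standard deviation used for that channel.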
Orthographic projection is a form of parallel projection in which all the projection lines 250 are orthogonal to the projection plane resulting in the projected size of objects remaining constant with distance from the image plane 249.
Referring to
Referring also to
To describe transforming a three-dimensional object to view on a two-dimensional screen in NDC coordinates, a Model, View (camera) and Projection (orthographic/perspective) matrices [M], [V], [P] can be used.
where [M], [V], [P] are 4×4 matrices of the form:
In this case, a simple model [M] is used to transform the NDC coordinate system with the y-axis running down the image centred on the camera origin. Reference is made to page 117, Chapter 3 of OpenGL Programming Guide, Fifth edition, version 2.
All coordinates outside of top t 252, bottom b 253, left l 254, right r 255 are clipped (or "removed") and are not rendered.
Referring also to
The edges of the rendered image are equal to the top t 252, bottom b 253, left l 254 and right r 255 parameters.
Linearly scaling the vertical and horizontal dimensions of the image effectively creates an image of constant width.
Referring also to
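The matrix is not reproduced above; the standard OpenGL orthographic projection matrix takes the form:

[P] = \begin{bmatrix} \dfrac{2}{r-l} & 0 & 0 & -\dfrac{r+l}{r-l} \\ 0 & \dfrac{2}{t-b} & 0 & -\dfrac{t+b}{t-b} \\ 0 & 0 & \dfrac{-2}{f-n} & -\dfrac{f+n}{f-n} \\ 0 & 0 & 0 & 1 \end{bmatrix}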
where t, b, l, r, n and f are the top, bottom, left, right, near and far 252, 253, 254, 255, 256, 257.
Referring also to
Referring again to
Referring to
Referring again to
The mapping texture 81 is referred to as a texture since it can be stored in a cudaTextureObject. It may, however, also be referred to as a “mapping array”.
The images 14, 16, 18, 61 can be pre-processed (or “prepared”), for example, to standardise and/or normalise values, and to generate derived images such as NDVI, prior to generating the network input 101.
Image channels in each image can be normalised by subtracting the dataset mean; for example, the mean of the green channel across the dataset can be subtracted from the green channel of all images in the dataset. Additionally, or alternatively, image channels can be normalised by dividing values by the standard deviation of that channel across the dataset.
Colour image values which are in the form of an 8-bit number (and, thus, take a value between 0 and 255) can be normalised to take a value between 0 and 1 before being standardised. For NDVI, NDVI values which lie between −1 and 1 can be normalised to lie between −0.5 and 0.5.
For depth images, a standard deviation and mean depth for each image can be used to standardise the data between −0.5 and 0.5. Ground plane parameters can be used to transform depth values to give the perpendicular distance from the ground plane 91 (i.e., height off the ground) rather than the distance to the camera. Alternatively, it is possible to normalise per image row. Normalising depth coordinates using the ground plane 91 can yield a real crop height.
Referring to
Suitable stacked networks include an hourglass network and a multi-stage pyramid network. In this example, a multi-stage pyramid network is used similar to that described in W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei and J. Sun: "Rethinking on Multi-Stage Networks for Human Pose Estimation", arXiv:1901.00148v4 (2019), using a two-stage network 301 and ResNet-18 feature extractors (instead of ResNet-50 feature extractors). An EfficientNet feature extractor can be used instead of a ResNet feature extractor, such as that described in M. Tan and Q. Le: "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", arXiv:1905.11946v5 (2020), to help increase accuracy and reduce execution time.
A decoder upsamples and combines image features encoded by convolutional layers into per-class response maps of a suitable resolution for extracting features of interest. For example, the output image may have a resolution which is 1/N times the resolution of the input image, where N≥1, for example, 1, 2, 3, 4 or more.
Decoder modules can have different convolutional structures, such as Fully Convolutional Networks (FCNs), or more complex structures, such as Feature Pyramid Networks (FPNs). For example, J. Long, E. Shelhamer and T. Darrell in "Fully Convolutional Networks for Semantic Segmentation", arXiv:1411.4038v2 (2015), describe upsampling and combining image features to produce a single response map of suitable resolution. T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan and S. Belongie: "Feature Pyramid Networks for Object Detection", arXiv:1612.03144v2 (2017), describes networks which can be used to combine encoded feature maps of different resolutions with added convolutional layers producing a response map of predictions at each level of the network. Loss can be calculated for each response map at each level of the network allowing what is referred to as "coarse-to-fine supervision" of learned image features deep into the network structure.
A stack of encoder-decoder modules need not be used. For example, a set of one or more convolutional modules (or blocks) that can generate a response map of suitable resolution can be used. For instance, the module may comprise upsampling layers (such as bi-linear upsampling) or convolutional layers containing learned parameters.
The neural network 120 works by extracting features from an image using layers of convolution filters. Convolutional filters are sets of learnable parameters which can be combined with a portion of an image as a kernel (or filter) using a dot product. The result of the dot product provides the response (or “activation”) of the kernel to that portion of the image. By passing the filter across all parts of the image, an activation map is created.
The neural network allows custom convolutional filters to learn to extract the desired features from an input image. The parameters of the convolutional layers are known as the "weights" of the network. Convolutional weights are randomly assigned at the beginning of a training process in which annotated ground-truth images are provided to allow the network predictions to be refined.
Referring to
Referring also to
A user annotates the image 320 by identifying stem coordinates 321, for example, by positioning a cursor and clicking on the image 320 at the location of the stem.
The image 320 is converted into a binary image 322 containing annotated stem coordinates 323 (where a pixel value is equal to 255 at each stem coordinate, else is equal to 0).
The binary image 322 is converted into a Gaussian-filtered image 324 (or “heatmap”) containing a Gaussian peak 325 at each stem coordinate 323 by passing a Gaussian filter (not shown) across the binary image 322.
The heatmap 324 provides a gradient allowing pixel intensity to be proportional to likelihood of the pixel being a stem coordinate.
A dataset is labelled with stem location annotations and split, 80:20, into training and validation datasets. A dataloader script can be used to load batches of images into the network. The dataloader script also applies random augmentations, such as flipping and rotation, to the images to increase image variability and reduce the chance of the network overfitting to trivial image features (number of rows, certain shadow orientations, etc.). Network predictions are compared to ground truth images to calculate a measure of loss at the end of each training iteration. Network parameters can be refined over repeated training iterations. The validation dataset can be used every few training iterations to calculate network performance metrics on images unseen in the training set to gauge real-world performance. Network structure can be described and training sequences conducted using machine learning frameworks such as Pytorch or Tensorflow. The dataset consists of between 1,000 and 2,000 labelled images, although a larger number of images, for example, between 5,000 and 10,000, can be used. There are between 20 and 40 epochs and validation is performed every 5 epochs. The batch size is between 50 and 100 images, although the batch size varies according to the size of the images. The learning rate is 0.0025.
Referring also to
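The loss used is the mean-squared error, which in its standard form is:

E = \frac{1}{N}\sum_{n=1}^{N}\left(y_n - \hat{y}_n\right)^2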
where E is the mean-squared error, N is the number of data points, yn is the observed value and ŷn is the predicted value.
A measure of loss for the network output 121 can be found by finding the MSE for each pixel in the output image 121 compared to the corresponding pixel in the ground truth image 324. The loss function can be minimised and the weights of each network layer updated using a gradient descent algorithm.
The gradient descent algorithm is defined by:
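In its standard form, the update rule is:

w^{(\tau+1)} = w^{(\tau)} - \eta\, \nabla E\left(w^{(\tau)}\right)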
where w(τ+1) represents an updated weight value, w(τ) represents a current weight, η is the learning rate of the network and ∇E is the derivative of the error function.
The amount of change in the updated weight value is governed by the value of the learning rate η.
To allow convolutional weights in the network to be further refined, the activation map of a convolutional layer is combined with a non-linear activation function to give a differentiable, non-linear result able to be used with the gradient descent algorithm. Suitable non-linear activation functions include tanh, sigmoid and rectified linear unit (ReLu).
Images can be batched together to insert into the network, allowing a loss for the entire batch (or epoch) to be minimised (batched gradient descent) or per training image example (stochastic gradient descent).
Over many training examples, the gradient descent algorithm minimises the result towards a global minimum, indicating an optimum result has been reached. The speed of change of the updated weight values is governed by the learning rate η. Adjustment of the learning rate can prevent the gradient descent algorithm from settling in a local minimum, which would prevent an optimal result from being found.
Optimisation of the learning rate parameter can be conducted by analysing the momentum or rate of change of each network parameter to prevent gradients stagnating. Suitable optimisers include variations on the Adam (Adaptive Moment Estimation) optimiser which stores an exponentially decaying average of past gradients for each network parameter to estimate gradient moment and prevent stagnation.
Referring to
As will be explained in more detail hereinafter, the stem coordinate is found and is normalised (i.e., taking a value between 0 and 1) using the network output array pixel width 124 and height 125 for more flexible handling later on in the pipeline. The coordinate is taken from the top left corner 126 of the network output array 121. The stem coordinate need not be normalised and/or a different origin can be used.
Using a pose network with non-maximum suppression allows individual keypoints representing stem coordinates to be extracted directly from the network output 121.
Extracting keypoint labels provides a direct approach to finding stem coordinates instead of extracting a centroid of a segmentation. This can be especially useful when, as in this case, cameras are angled and the stem is at the bottom of the plant rather than in the centre. Semantic segmentation is not used, which allows an output feature map 121 to be smaller than the input image 101 and helps to process images faster.
Furthermore, keypoints make it quicker to label images and are more robust compared to segmentation networks as the entire loss function is focused on the correct position of the keypoint rather than the quality of segmentation.
Referring to
Further information is needed to generate three-dimensional stem locations 29. A stem location 29 is a stem coordinate which has been converted into camera coordinates (i.e., a three-dimensional point expressed, for example, in meters) and combined with odometry information to create a three-dimensional point expressed, for example, in meters in world coordinates. If the camera origin of the frame is known in world coordinates, then stem locations 29 can be tracked from one frame to another.
In this case, visual odometry is used to provide the camera origin of the frame in world coordinates, and odometry information is provided by the camera pose determination block 110 based on keypoint detection and matching in block 70. However, visual odometry need not be used. For example, mechanical odometry, for instance using, among other things, a wheel, can be used.
Referring to
Peaks in the image can also be found using other approaches, for example, by comparing the positions of maximum pixel intensity for each row and column in the image. If a pixel is a maximum in its row and in the corresponding column, then the pixel is a peak.
Confidence of the prediction (i.e., the predicted plant stem coordinate) can be measured by measuring the standard deviation of the output image. The greater the standard deviation, the greater the spread between results and the sharper the peaks. Thus, a threshold can be set to screen possible plant stem coordinates cx, cy.
The plant stem coordinates cx, cy 131 in the image 121 are then converted into stem coordinates cXC, cYC 122 in meters with respect to the camera origin (step S42.2). In particular, the mapping texture 81 is used to find the corresponding pixel in the depth image 16 and the pixel value can be converted into a point in camera coordinates by de-projecting the pixel value using the depth sensor intrinsics.
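As an illustration only (not the code used by the system), a simple host-side sketch of this row/column-maximum test over a single-channel response map stored in row-major order might be:

#include <utility>
#include <vector>

// Return the (x, y) pixel coordinates of pixels which are simultaneously the
// maximum of their row and the maximum of their column.
std::vector<std::pair<int, int>> findPeaks(const std::vector<float>& map, int width, int height)
{
    std::vector<int> rowArgMax(height, 0), colArgMax(width, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            if (map[y * width + x] > map[y * width + rowArgMax[y]]) rowArgMax[y] = x;
            if (map[y * width + x] > map[colArgMax[x] * width + x]) colArgMax[x] = y;
        }
    std::vector<std::pair<int, int>> peaks;
    for (int y = 0; y < height; ++y) {
        int x = rowArgMax[y];
        if (colArgMax[x] == y)  // maximum of its row and of its column
            peaks.push_back({x, y});
    }
    return peaks;
}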
Determining Plant Position with Respect to Hoe Blade Origin
Referring again to
Referring to
Referring to
Referring in particular to
It is assumed that each plant 26 has a single position P1, P2, . . . , Pn in world coordinates and that it does not move, and that the cameras 13, 15 (
Referring also to
Vector g is found from ground plane coefficients:
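Assuming the plane coefficients [a, b, c, d] described in the ground-plane section, with [a, b, c] normalised to unit length:

\mathbf{g} = \frac{(a, b, c)}{\left|(a, b, c)\right|}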
Vector z is the unit vector of the camera z axis:
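That is:

\mathbf{z} = (0, 0, 1)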
The normal vector ng×z is found via the cross product of vectors g and z:
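Written out in components:

\mathbf{n}_{g\times z} = \mathbf{g} \times \mathbf{z} = \left(g_y z_z - g_z z_y,\ g_z z_x - g_x z_z,\ g_x z_y - g_y z_x\right)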
As g and z are both unit vectors, angle θ can be found:
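Using the dot product:

\theta = \arccos\left(\mathbf{g} \cdot \mathbf{z}\right)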
Using the normal vector as the rotation axis and angle between g and z, a quaternion qg×z can be formed:
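In the standard axis-angle form, with the normalised rotation axis written as n̂:

q_{g\times z} = \left(\cos\tfrac{\theta}{2},\ \hat{\mathbf{n}}_{g\times z}\,\sin\tfrac{\theta}{2}\right)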
Once a quaternion has been found, it can be converted to a 4×4 rotation matrix using maths libraries such as GLM for OpenGL or Eigen in C++:
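The matrix itself is not reproduced above; for a quaternion q = (qr, qi, qj, qk), the standard conversion, extended to a 4×4 homogeneous matrix, is:

R = \begin{bmatrix} 1-2s(q_j^2+q_k^2) & 2s(q_i q_j - q_k q_r) & 2s(q_i q_k + q_j q_r) & 0 \\ 2s(q_i q_j + q_k q_r) & 1-2s(q_i^2+q_k^2) & 2s(q_j q_k - q_i q_r) & 0 \\ 2s(q_i q_k - q_j q_r) & 2s(q_j q_k + q_i q_r) & 1-2s(q_i^2+q_j^2) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}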
where s = |q|^-2; if qg×z is a unit quaternion, s = 1^-2 = 1.
As explained earlier, the GPU 32 continually (i.e., every frame) computes a ground plane 91. The camera system 12 also has an origin 136 ("camera system origin"). The camera system 12 sits on a support 137, such as a pole, and lies above a point 139 on the ground plane 91. Each weeding unit blade 422 (in this case, a reciprocating blade although it may be another type, such as a rotary blade) has an origin 423, 356 which sits behind the vertical plane of the camera system 12.
The updating process performed by block 130 receives as inputs n plant locations 132 per frame in frame coordinates of m plants over t successive frames, namely:
and the position 111 of each frame origin in world coordinates, namely:
and outputs a single world coordinate 29 for each plant 26 (even though there may be many frame coordinates per plant corresponding to that world coordinate), namely:
The frame origin is expressed as
The frame plant coordinates are expressed by:
Frame plant coordinates are converted into world coordinates using:
Plant world coordinates can be converted back into frame coordinates
The direction of the world coordinate axes compared to the frame coordinate axes should be noted.
Referring to
The gridmap 411 is divided into parts 4131, 4132, one part 4131, 4132 for each row. In this case, the gridmap 411 is divided into two parts 4131, 4132 separated by a gridmap centreline 414, each row 4131, 4132 having a row centre 4151, 4152.
The gridmap 411 is used in a process of amalgamating points 122 (
Referring also to
Referring also to
The process allows lateral adjustment 421 of the physical hoe blades 422 using horizontal offsets from the frame origin 401 in units of distance, e.g., meters. There is no vertical offset between gridmap and the frame origin. The hoe blades 422 are mounted at a position a fixed distance D from the frame origin 401, for example, 1.4 m.
Referring also to
Referring to
The GPU 32 (
The GPU 32 (
The world plant coordinates in the list 430 are converted to current frame coordinates using frame origin and plotted using their corresponding intensity (step S53.3). Plant world coordinates that are within the current frame Ft are saved to make an updated list 430 of valid plant coordinates and plant coordinates outside the frame are discarded (step S53.4).
Converted frame points are amalgamated using a Gaussian mask (step S53.5). The greater the number of points in the amalgamation, the higher the intensity. The closer the points are together in a group, the taller the peak.
Peaks are extracted from the masked plot (step S53.6). This gives an amalgamated plant world coordinate for each plant in each frame.
Peaks outside the gridmap boundary and in the relief area 416 are passed to the data association process to be published as a single coordinate per plant (step S53.7).
Referring to
The GPU 32 (
The GPU 32 (
The GPU 32 (
If matched, it updates the standard deviation of the previous valid point with current point standard deviation to reflect new information (step S54.5):
It also updates X and Y position world coordinates of previous valid point with new information of the matched current point (step S54.6):
If the current point is not matched with a previous valid point, the GPU 32 (
If a previous valid point has been matched more than a fixed number of times (for example, 4 or 5) and is no longer within the relief area 416 (in other words, it has dropped out of the bottom), it is added to the list to be published and removed from the list of valid points (step S54.8).
Further details about Bayesian tracking can be found on pages 96 to 97 of “Pattern Recognition and Machine Learning”, Christopher M. Bishop (2006).
Thus, the output is a single per-plant, per-frame world coordinate 29 to publish to the hoe blade unit.
Referring to
Before applying the RANSAC algorithm, the pixel values in the depth image 16 are deprojected into a set of points 233 in camera coordinates in real-world units of measurement (e.g., meters) while counting the number 237 of valid depth points in the image (step S55.1). Some regions 238 of the depth image 16 (shown shaded) have no depth information.
The points 233 of the deprojected point cloud 236 and the valid point number 237 are provided to the RANSAC algorithm, together with a set of RANSAC coefficients 239 (step S55.2).
The points 233 are provided in an array of three-dimensional points in real-world units of distance (e.g., meters):
A RANSAC algorithm available through the Point Cloud Library (PCL) library (https://pointclouds.org/) can be used.
The input RANSAC coefficients 239 comprise a maximum probability (which can take a value between 0 and 1), a distance threshold (which can be set to a value between 0 and 1), a maximum number of iterations (which can be set to positive non-zero integer) and an optimise coefficient flag (which can be set to TRUE or FALSE). The RANSAC algorithm outputs a set of plane coefficients 240 and an inlier array 241.
The inlier array 241 comprises an array of values which are set to 0 or 1 for each pointcloud coordinate 233 to indicate whether it is an inlier or an outlier (i.e., did not contribute to the equation of the plane).
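A minimal host-side sketch of such a plane fit using the PCL SACSegmentation interface mentioned above is given below; the parameter values shown are placeholders rather than the values used by the system.

#include <pcl/ModelCoefficients.h>
#include <pcl/PointIndices.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/segmentation/sac_segmentation.h>

// Fit a plane to a point cloud with RANSAC and return its coefficients a, b, c, d.
pcl::ModelCoefficients fitGroundPlane(pcl::PointCloud<pcl::PointXYZ>::Ptr cloud)
{
    pcl::ModelCoefficients coefficients;
    pcl::PointIndices inliers;                 // indices of points lying on the plane
    pcl::SACSegmentation<pcl::PointXYZ> seg;
    seg.setModelType(pcl::SACMODEL_PLANE);     // fit a plane model
    seg.setMethodType(pcl::SAC_RANSAC);        // use RANSAC
    seg.setDistanceThreshold(0.02);            // inlier distance threshold (placeholder)
    seg.setMaxIterations(100);                 // maximum number of iterations (placeholder)
    seg.setProbability(0.99);                  // desired probability (placeholder)
    seg.setOptimizeCoefficients(true);         // refine coefficients using the inliers
    seg.setInputCloud(cloud);
    seg.segment(inliers, coefficients);
    return coefficients;
}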
Referring also to
The normal vector n to the plane and its distance to the camera origin are given by the set of coefficients [a, b, c] and d respectively (shown as h in
Referring also to
The RANSAC algorithm involves randomly sampling a small number of points 233 (in the case of a plane, three points are enough) to calculate an equation of a plane. All other points can be sampled to calculate the perpendicular distance hp of each point 233 to the plane 91. The measure of the level of fit of the plane to the data is determined by the number of points which are within the inlier distance din (which can be set by the user). Fitting a plane 91 to a small number of points 233 and counting the number of inliers is repeated up to the maximum number of iterations nmax (which can be set by the user). The plane with the greatest number of inliers is chosen and its parameters a, b, c and d are returned as the output 240 and can be used as ground plane position 231.
The inlier distance din is the perpendicular distance threshold from the ground plane 91. Points 233 within this threshold are considered to be inliers and are counted for each RANSAC iteration. The maximum number of iterations nmax is the maximum number of planes to test against the data to calculate the number of inliers.
Calculating the Plane Equation from Three Points
Three points (x1, y1, z1), (x2, y2, z2) and (x3, y3, z3) can be used to calculate an equation of a plane. This can be achieved by finding two vectors a, b from three points:
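Taking the first point as a common origin:

\mathbf{a} = (x_2 - x_1,\ y_2 - y_1,\ z_2 - z_1), \qquad \mathbf{b} = (x_3 - x_1,\ y_3 - y_1,\ z_3 - z_1)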
The normal vector n between the two vectors a, b is found using the cross product:
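Written out in components:

\mathbf{n} = \mathbf{a} \times \mathbf{b} = \left(a_y b_z - a_z b_y,\ a_z b_x - a_x b_z,\ a_x b_y - a_y b_x\right)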
The unit vector u is found by dividing each component of the normal vector n by the magnitude of the normal vector n:
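That is:

\mathbf{u} = \frac{\mathbf{n}}{|\mathbf{n}|} = \left(\frac{n_x}{|\mathbf{n}|},\ \frac{n_y}{|\mathbf{n}|},\ \frac{n_z}{|\mathbf{n}|}\right)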
The plane equation is found by substituting the components ux, uy, uz of the unit vector u and the normal vector magnitude |n| into the plane equation
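One standard form, given here as an assumption about the exact expression used, fixes the offset d using one of the three points, for example (x1, y1, z1):

a x + b y + c z + d = 0, \qquad (a, b, c) = (u_x, u_y, u_z), \qquad d = -(u_x x_1 + u_y y_1 + u_z z_1)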
Calculating Perpendicular Distance of Point from a Plane
Substituting the x, y and z components of each point 233 into the plane equation yields the perpendicular distance hp of the point 233 from the plane 91:
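For a plane with a unit normal, the signed distance is simply the residual of the plane equation:

h_p = a x + b y + c z + d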
Referring also to
Determining the position of the ground plane allows non-contact machine levelling and hoe blade depth control. Thus, wheel units or “land wheels” (which run on the ground and which are normally used for machine levelling) can be omitted, which can help to decrease the weight of the machine (i.e., the weed-control implement 2) and to reduce disturbing and damaging the crop. Also, land wheels can lack traction and so be prone to wheel slip in wet conditions and so provide an unreliable indication of speed.
Referring to
The camera pose determination process is used to track objects. It can take advantage of bundle adjustment, allowing iterative optimisation of points across many frames to calculate the camera movement across keypoints 73, for example in the form of ORB points 73, over frames in real time. As explained earlier, using visual odometry allows wheel units (or "land wheels"), which can be prone to wheel slip in wet conditions and which can damage crops, to be omitted. Adding GPS to the bundle adjustment calculation can also allow low-cost, real-time kinematic (RTK) positioning-like localisation without the need for a base station.
The ORB-SLAM2 SLAM algorithm uses matched ORB feature points 73 between video frames as input to a bundle adjustment and loop closing method to calculate camera movement in world coordinates. The use of the ORB-SLAM2 SLAM algorithm can have one or more advantages. For example, it returns a six degrees of freedom (6-DOF) camera pose for each successive image frame. It can be run in real-time at acceptable frame rates (e.g., 30 frames per second), even on a CPU (i.e., it does not need to run on a GPU). Keypoints 73 can be filtered using the optical flow mask, for instance, to compensate for stationary objects. An RGB-D version is available which allows harnessing of depth coordinates as well as ORB points.
Camera poses 111 are calculated in world coordinates, with the origin being the position of the first frame allowing successive frames to be subtracted to give the translation between video frames in the x, y and z dimensions in real-world distances (e.g., in meters). Knowing translation between overlapping video frames allows successive plant positions to be merged together and a position with respect to physical hoe blades to be calculated, irrespective of camera latency.
Referring to
Referring to
Referring again to
The location xi of a stereo keypoint 73 can be defined as:
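In the ORB-SLAM2 formulation (reproduced here as an assumption about the exact notation used in the description):

x_i = \left(u_x,\ u_y,\ u_R\right)^{\mathsf T}, \qquad u_R = u_x - \frac{f_x\, b}{d}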
where ux, uy are the two-dimensional x- and y-pixel coordinates of a keypoint 73 (e.g., a matched ORB point) in an image (in this case, an infrared image 18 obtained by the first infrared image sensor 151), and uR is an approximation of the x-position of the keypoint in the image of the second stereo sensor 152, computed using the ux pixel coordinate, the horizontal focal length of the lens fx, the depth value d and a value of the distance b between the two image sensors (in centimeters).
An ORB point 71 provides a unique, rotated BRIEF descriptor 72 for matching with other ORB points 71.
Referring to
The rotational matrix R 151 and translation matrix t 152 therefore give the position of the current frame 111 in world coordinates in relation to previous keyframes 111′ using the matched points 73 for optimisation. ORB-SLAM2 also conducts bundle adjustment between the previously matched keyframes 111″, 111′″, 111″″ to perform further pose refinement 153 to reduce or even prevent errors from propagating as one frame is added onto the next. Expressed differently, the current frame 111 is a keyframe (or camera pose) and a local bundle adjustment model is used for refining over many previous frames if the ratio of close versus far keypoints drops below a certain level. A close keypoint can be defined as one where the depth is less than 40 times the stereo baseline.
The camera pose determination block 110 (
The problem can be formulated as a non-linear least squares problem to minimise the reprojection error between poses. This creates a large coefficient matrix to be solved to find the rotation and translation coefficients which best minimise the reprojection error. This can be solved iteratively using the Gauss-Newton method or, preferably, a Levenberg-Marquardt method (also referred to as the "damped least squares" method), which introduces a damping factor to Gauss-Newton and interpolates it with gradient descent to help avoid fitting to local minima.
Bundle adjustment problems can be solved more efficiently by taking advantage of the sparse nature of the coefficient matrix as most of the values are zero due to not all of the matched points (i.e., observations) being observed in every image. Approaches to taking advantage of the sparse structure of the coefficient matrix include Schur reduction and Cholesky decomposition.
The g2o library is a bundle adjustment library for computer vision, providing a general framework for solving non-linear least squares problems. It uses a number of linear solvers and chooses the best one to solve the required problem depending on dataset size and number of parameters.
Motion bundle adjustment: finding rotational and translational matrices R, t between current and previous camera poses using matched keypoints in current frame
The first frame to enter the algorithm is inserted as a keyframe and its pose set to the origin (0, 0, 0) in world coordinates. An initial “model” or “map” is created from all keypoints in the initial frame. A stereo keypoints are used, their position in camera coordinates is used to make an initial map (no transformation to world coordinates needed as the first frame is at the origin).
A k-nearest neighbours search (or other similar search) is used to match the descriptors 72 of ORB points 73 in the current image with previously matched ORB points 73′ in the model.
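A minimal sketch of such a k-nearest neighbours match, assuming OpenCV's brute-force matcher on binary ORB descriptors (the ratio-test threshold is illustrative and not taken from the text):

```python
import cv2


def match_descriptors(desc_current, desc_model, ratio=0.75):
    """k-NN match (k=2) of binary ORB descriptors, filtered with a ratio test."""
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    knn = matcher.knnMatch(desc_current, desc_model, k=2)
    return [pair[0] for pair in knn
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
```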
The least squares problem is minimised to find the rotational and translational matrices R, t:
where x_i is the computed stereo keypoint position from the matched ORB point, π(·) is the function which reprojects the model three-dimensional point into the current frame, R, t are the rotational and translational matrices used to transform the previously-matched model point into the current frame's pose and hence reduce the reprojection error when projected into the image plane, fx and fy are the horizontal and vertical focal lengths of the lens, and cx and cy are the principal point coordinates of the sensor (used for translating projected coordinates into pixel coordinates measured from the top left-hand corner of the image).
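A plausible form of this minimisation, reconstructed from the variables described above and consistent with motion-only bundle adjustment in ORB-SLAM2-style systems, is:

\[ \{R, t\} = \underset{R,\,t}{\arg\min} \sum_{i} \rho\!\left( \left\| x_i - \pi\!\left( R X_i + t \right) \right\|_{\Sigma}^{2} \right), \qquad \pi\!\left( \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} \right) = \begin{bmatrix} f_x X / Z + c_x \\ f_y Y / Z + c_y \\ f_x (X - b)/Z + c_x \end{bmatrix} \]

where X_i is the previously-matched model point, and ρ and Σ denote a robust (e.g., Huber) cost and the covariance associated with the keypoint (both are assumptions here, as they are not named in the passage above).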
Optimisation is performed across visible keyframes KI and all points PI seen in those keyframes. Keyframes which observe points PI are optimised, whereas all other keyframes Kf remain fixed but still contribute to the cost function:
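A plausible form of this cost, reconstructed from the description above (and again following local bundle adjustment as used in ORB-SLAM2-style systems), is:

\[ \{X_j, R_l, t_l \mid j \in P_I,\; l \in K_I\} = \arg\min \sum_{k \in K_I \cup K_f} \; \sum_{j \in P_I} \rho\!\left( \left\| x_j^{\,k} - \pi\!\left( R_k X_j + t_k \right) \right\|_{\Sigma}^{2} \right) \]

in which the poses R_k, t_k of the fixed keyframes K_f appear in the summation but are not treated as optimisation variables.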
Referring to
Features in an image can be defined using a keypoint 71 and a corresponding descriptor 72 classifying the keypoint. A two-dimensional feature detector and descriptor extractor process is used to find keypoints 71 and descriptors 72 in the infrared images 18, which can be masked using the optical flow mask 61. A suitable function is cv2.ORB() in OpenCV. Other feature detector and descriptor extractor processes can, however, be used, such as the scale-invariant feature transform (SIFT) or speeded-up robust features (SURF) processes.
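As a hedged sketch (cv2.ORB_create() being the modern OpenCV equivalent of the cv2.ORB() factory mentioned above; the feature count is illustrative):

```python
import cv2


def detect_features(infrared_image, flow_mask, n_features=1000):
    """Detect ORB keypoints and extract their rotated-BRIEF descriptors,
    restricted to the optical flow mask (an 8-bit, single-channel mask)."""
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(infrared_image, flow_mask)
    return keypoints, descriptors
```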
Optical flow allows outliers to be removed from images. Outliers mainly comprise stationary objects (i.e., stationary with respect to the moving camera), such as the back of the tractor 1 (
Referring to
Infrared frames 18 are received at a frame rate of 30 frames per second and an array of optical flow vectors 62 is produced for each frame 18 (step S67.1). Thresholding is performed on each vector, i.e., for the vector corresponding to a particular pixel, its mask value = 1 if |vi| > 0, otherwise its mask value = 0, resulting in a noisy optical flow mask 63 (step S67.2). Erosion is performed on the noisy optical flow mask 63, resulting in an optical flow mask 61 for each frame (step S67.3). Erosion removes bridges, branches and small protrusions in the noisy mask 63, while dilation fills in holes and gaps.
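A minimal sketch of steps S67.1 to S67.3, assuming Farneback dense optical flow and an illustrative erosion kernel (neither of which is specified in the text above):

```python
import cv2
import numpy as np


def optical_flow_mask(prev_frame, frame, kernel_size=5):
    """Dense optical flow -> magnitude threshold -> erosion,
    for single-channel 8-bit infrared frames."""
    flow = cv2.calcOpticalFlowFarneback(prev_frame, frame, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)      # step S67.1
    magnitude = np.linalg.norm(flow, axis=2)
    noisy_mask = (magnitude > 0).astype(np.uint8) * 255                 # step S67.2
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.erode(noisy_mask, kernel)                                # step S67.3
```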
Referring to
The blade trajectory 601 takes account of the forward speed 603 of the weeding machine to compensate for limits on the acceleration and velocity of the hoe blades 422. The forward speed 603 is calculated from the rate of change of frame position, which can be obtained from the camera pose 111 or by other forms of odometry (step S72.1).
Because the camera is some distance ahead of the hoe blade 355, the CPU 31 is able to calculate at what distance 606 on the ground (ahead of the crop plant 26) the hoe blade 422 needs to move out so that there is appropriate clearance 607 from the crop stem 602 before the hoe blade 422 reaches the position of the crop plant 26. The CPU 31 calculates the distance 607 over which clearance should be maintained so that the hoe blade 422 can cut back in an appropriate distance 608 behind the crop plant.
The hoe trajectory 601 comprises a series of setpoints 604 stored in an array 605 in memory, implemented as a circular buffer. As the weeding machine moves along the crop row, setpoint positions 604 from this array are communicated via CANopen to brushless DC (BLDC) motors (not shown) driving actuators (not shown) that control the position of each hoe blade.
The setpoint positions communicated to the motor drive can either be a sequence of discrete position values that together define a motion profile or, where the drive supports it, a trigger position for the start of a pre-defined motion profile. In the latter case the machine position along the crop row must also be communicated to the motor drive.
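A minimal sketch of such a circular setpoint buffer (the class and method names are hypothetical, and the CANopen transmission itself is not shown):

```python
from collections import deque


class SetpointBuffer:
    """Circular buffer of hoe-blade trajectory setpoints."""

    def __init__(self, capacity=256):
        self._buffer = deque(maxlen=capacity)   # oldest setpoints are discarded automatically

    def push(self, position_setpoint):
        self._buffer.append(position_setpoint)

    def next_setpoint(self):
        """Return the next setpoint to be sent to the motor drive, if any."""
        return self._buffer.popleft() if self._buffer else None
```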
Referring to
Referring again to
Referring to
As explained earlier, plant stem locations can be extracted using a sample area 127 which is scanned across the image 1211. Using a cross-shaped sample area 127, a 2D peak corresponding to a peak with maxima in both x- and y-directions can be found and its position extracted. This approach is suited for extracting plant stem coordinates.
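As an illustration, 2D peaks can be extracted from a response map with a local-maximum filter; the sketch below uses a square window rather than the cross-shaped sample area 127, and the window size and intensity threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter


def extract_2d_peaks(response_map, window=15, min_intensity=0.5):
    """Return (row, col) image coordinates of local maxima in the response map."""
    is_local_max = maximum_filter(response_map, size=window) == response_map
    return np.argwhere(is_local_max & (response_map > min_intensity))
```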
In some cases, a peak in one of the directions may be all that is needed. For example, the lateral position (e.g., along the x-axis) may be all that is of interest, such as the lateral position of a ridge of a ploughed field, the lateral position of a row of a crop or the lateral position of a ridge of a row.
Referring to
As in the first example, a sample area 127′ is scanned across the image 1212. In this case, however, a three-section bar-shaped sample area 127′ is used. Using the bar-shaped sample area 127′, a 1D peak corresponding to a peak with a maximum in the x-direction can be found and its position extracted. This approach is suited for extracting the lateral position of a row of a crop or the lateral position of a ridge of a row.
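A hedged sketch of extracting such a 1D lateral peak by collapsing the response map along the y-direction (the prominence threshold is illustrative):

```python
import numpy as np
from scipy.signal import find_peaks


def extract_lateral_peaks(response_map, min_prominence=10.0):
    """Locate x-direction peaks, e.g., lateral positions of crop rows or ridges."""
    column_profile = response_map.sum(axis=0)                 # 1D profile along x
    peaks, _ = find_peaks(column_profile, prominence=min_prominence)
    return peaks                                              # x-pixel coordinates
```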
Referring to
The image 320 is converted into a binary image 322 (
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Referring to
Referring again to
The image may, however, be provided with depth information in other ways, in particular, without using a depth image 16.
Referring to
A colour camera 13 captures an image 14 which includes the object 341. The object 341 is used as a suitable reference point and may lie, for example, in the centre of the image 14. The image 14 covers an area 342 of ground including the object 341.
A user measures a distance L from the point on the object 341 to a base point 343 lying directly under the camera 13. Thus, the distance L is measured in the x-y plane, perpendicular to the z-axis. The height H of the camera 13 above the base point 343 is known or measured. Knowing the distance L and the height H, the distance D between the camera 13 and the point on the object 341 can be calculated. Thus, the depth value for the pixel 82c corresponding to the point on the object is known. Using camera geometry 344 and camera intrinsics 207, a crude depth d for all other pixels 82 in the object can be calculated.
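In equation form (a hedged sketch, assuming a pinhole camera model and an approximately flat ground plane; the symbols for the ray and plane normal are introduced here for illustration only):

\[ D = \sqrt{L^2 + H^2}, \qquad r(x, y) = \begin{pmatrix} (x - c_x)/f_x \\ (y - c_y)/f_y \\ 1 \end{pmatrix}, \qquad d(x, y) \approx \frac{H}{\hat{n} \cdot r(x, y)} \]

where f_x, f_y, c_x, c_y are the camera intrinsics 207, \(\hat{n}\) is the unit normal of the ground plane expressed in camera coordinates (which can be fixed using the reference point and the camera geometry 344), and d(x, y) is the crude depth assigned to pixel (x, y).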
Referring again to
Per-element depth information can be used to render an image (herein also referred to as a "rendered image" or "mesh image") in which the width and height of the image correspond to real-world coordinates. This can allow pixel coordinates in the rendered image to be converted directly into real-world coordinates. Moreover, a set of rendered images will generally have the same or similar widths and heights, potentially even if the images are captured using different cameras in the same nominal set-up (e.g., nominal height, angle, etc.), which can help with training and with feature identification. An example of an orthographic image is shown in
Referring to
where x, y are pixel coordinates, ppx and ppy are x- and y-values of the position of a principal point in pixels, X, Y are in real-world camera coordinates, and fx, fy are focal lengths of the lens in x and y dimensions.
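A plausible reconstruction of the deprojection, consistent with these variables and the standard pinhole model, is:

\[ X = \frac{(x - pp_x)\, Z}{f_x}, \qquad Y = \frac{(y - pp_y)\, Z}{f_y} \]

where Z is the per-element depth value for the pixel (x, y).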
The individual 3D depth points (X, Y, Z) 261 are transformed by an orthographic projection matrix 262, using a corresponding mapping array 263 and viewing box parameters 252, 253, 254, 255, 256, 257.
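A minimal sketch of an orthographic projection matrix built from viewing box parameters (the left/right/bottom/top/near/far names are assumptions standing in for items 252 to 257):

```python
import numpy as np


def orthographic_matrix(left, right, bottom, top, near, far):
    """OpenGL-style orthographic projection matrix mapping the viewing box
    to normalised device coordinates."""
    return np.array([
        [2.0 / (right - left), 0.0, 0.0, -(right + left) / (right - left)],
        [0.0, 2.0 / (top - bottom), 0.0, -(top + bottom) / (top - bottom)],
        [0.0, 0.0, -2.0 / (far - near), -(far + near) / (far - near)],
        [0.0, 0.0, 0.0, 1.0],
    ])


# A 3D depth point (X, Y, Z) in real-world camera coordinates is then projected as:
# ndc = orthographic_matrix(l, r, b, t, n, f) @ np.array([X, Y, Z, 1.0])
```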
The processor 32 (
Referring to
Referring to
The processes hereinbefore described use per-element depth information in generally one of two ways.
Referring to
In the first pipeline, an un-rendered image 101 is fed into the network 120 (
Referring to
In the second pipeline, the image 14, 16, 18 is re-rendered using the per-element depth information 16 and view parameters in real-world dimensions to produce the network input 101. The per-element image depth information 16 is used to construct the re-rendered (orthographic) image 269, which has real-world dimensions matching the view parameters. This allows the view parameters in real-world dimensions to be used to convert stem coordinates 132 in image coordinates into stem coordinates 29 in real-world dimensions. Thus, the per-element depth information 16 is used indirectly to obtain stem coordinates 29 in real-world dimensions.
Referring again to
Referring to
Referring also to
Referring to
Referring to
Although the normalised depth image 16′ provides improved contrast, it can be inconsistent from image to image and vary according to camera position. Thus, the image may be processed further to correct for a crop height between −0.2 m and 0.5 m from the ground plane 91. For example, some of the image features shown in
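One way the correction might be carried out is sketched below; the −0.2 m and 0.5 m limits are taken from the text, while the rescaling to the range 0 to 1 is an assumption.

```python
import numpy as np


def normalise_height(height_above_ground, low=-0.2, high=0.5):
    """Clip per-element height above the ground plane 91 to the expected crop
    range and rescale to [0, 1] for consistency across frames and camera positions."""
    clipped = np.clip(height_above_ground, low, high)
    return (clipped - low) / (high - low)
```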
Referring to
Referring to
Referring again to
Referring to
In this case, separate RGB and NDI textures 501, 502 are prepared. Each texture 501, 502 can comprise, for example, 3 channels. One or both textures 501, 502 can be rendered on a screen for display to the user.
The two textures 501, 502 can then be amalgamated into one array comprising, for instance, 6 channels. As hereinbefore described, the array may be processed to standardise and/or normalise values.
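A minimal sketch of amalgamating the two textures into a single 6-channel array, with per-channel standardisation (the choice of standardisation is an assumption):

```python
import numpy as np


def amalgamate_textures(rgb_texture, ndi_texture):
    """Concatenate 3-channel RGB and 3-channel NDI textures into one
    6-channel array and standardise each channel to zero mean, unit variance."""
    stacked = np.concatenate([rgb_texture, ndi_texture], axis=-1).astype(np.float32)
    mean = stacked.mean(axis=(0, 1), keepdims=True)
    std = stacked.std(axis=(0, 1), keepdims=True) + 1e-8
    return (stacked - mean) / std
```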
Referring to
Referring to
Images that are displayed to a user during system operation may be augmented to provide the user with information and assistance. For example, images may be augmented with a projection (or "overlay") displaying, in different views, the position of crop rows with respect to a ground plane, or a projection displaying, in camera view, an outline of the area which is visible in an orthographic view. The image
Referring to
This can be used for projecting a collection of points forming points of interest, lines 701, and edges of polygons 704, which have been computed as hereinbefore described. Points of interest may be, for example, points on plants, such as stem locations.
The GPU 32 (
The GPU 32 (
The GPU 32 (
It will be appreciated that various modifications may be made to the embodiments hereinbefore described. Such modifications may involve equivalent and other features which are already known in the design, manufacture and use of computer image systems and/or farm implements and component parts thereof and which may be used instead of or in addition to features already described herein. Features of one embodiment may be replaced or supplemented by features of another embodiment.
Other frame rates may be used, such as 15, 60 or 90 frames per second.
The system need not be used to control a hoeing implement, but can instead be used, for instance, for targeted application of herbicide, insecticide or fertilizer, e.g., by spraying herbicide on vegetation that appears in the NDVI image but is beyond a safe radius around the crop stem.
A mechanical hoe need not be used. For example, the system can steer an electrode that uses electric discharge to destroy weeds along the crop row around the crop plants.
The keypoint need not be a stem, but can be another part of the plant, such as a part of a leaf (such as leaf midrib), a node stem, a meristem, a flower, a fruit or a seed. Action taken as a result of identifying the part of the plant need not be specific to the detected part. For example, having identified a leaf midrib, the system may treat (e.g., spray) the whole leaf.
A gridmap need not be used. For example, only data association and matching can be used (i.e., without any use of a gridmap). Amalgamating the coordinates may comprise data association between different frames. Data association between different frames may include matching coordinates and updating matched coordinates. Matching may comprise using a descriptor or determination of distance between coordinates.
Updating may comprise using an extended Kalman filter. Image patches can be used to match plant stems over multiple frames, for example using a KCF tracker or by matching keypoint descriptors. A Hungarian algorithm with bipartite graph matching can be used for data association/matching.
A RANSAC algorithm need not be used for plane fitting. Alternatives include using a Hough transform to transform point cloud coordinates into a parametrised Hough space, where points can be grouped to find optimal plane coefficients.
An alternative to bundle adjustment is Iterative Closest Point (ICP). ICP algorithms can be used to iteratively reduce the distance between matched closest point positions without using a non-linear optimiser (such as Levenberg-Marquardt or Gauss-Newton) to optimise an energy function.
Although claims have been formulated in this application to particular combinations of features, it should be understood that the scope of the disclosure of the present invention also includes any novel features or any novel combination of features disclosed herein either explicitly or implicitly or any generalization thereof, whether or not it relates to the same invention as presently claimed in any claim and whether or not it mitigates any or all of the same technical problems as does the present invention. The applicants hereby give notice that new claims may be formulated to such features and/or combinations of such features during the prosecution of the present application or of any further application derived therefrom.
This application is a U.S. National Stage Application of International Patent Application No. PCT/GB2022/052153, filed on Aug. 19, 2022, which claims priority to United Kingdom Patent Application No. 2111983.9, filed on Aug. 20, 2021, the entire contents of all of which are incorporated by reference herein.