1. Technical Field
This invention relates to a system and method for phonetic searching of data.
2. Description of Related Art
Distributed File Systems (DFS) allow access to files from multiple hosts via a computer network. This makes it possible for multiple processors to share files and storage resources and for example to access and process data in parallel. Distributed file systems may include facilities for transparent replication and fault tolerance, that is, when a limited number of nodes in a file system go offline, the system continues to work without any data loss.
A DFS is particularly useful for providing access to large data sources, in particular for parallel processing and searching; the Hadoop Distributed File System (HDFS) is an example of one such open-source DFS.
Hurence Hadoop Audio Miner is a product employed in call centers for performing audio-to-text transcription of source audio files, typically recordings of client contacts, on the Hadoop platform. A Hadoop-based text mining engine is then used to perform searches on behalf of users.
It should be appreciated that in order to make audio files text searchable, significant computational effort is required to generate a textual transcription of the original media files; for large or rapidly growing bodies of media files, it may not be feasible to provide the processing resources to implement this approach. Even where a transcript is produced, it typically contains many incorrectly transcribed words, preventing successful searching. Separately, once text files have been extracted from an audio source, they are typically relatively small, and so providing search engines with local access to this information is not critical to reasonable performance.
On the other hand, phonetic searching does not create the same processing demands for indexing files, but local access to indexed information is important for performing phonetic searching.
Nexidia Search GRID provides a REST-based development environment where applications use multiple machines in parallel to provide phonetic searching.
Separately, the Aurix Phonetic Speech Search Engine allows high volumes of recordings to be processed, with less hardware power than with conventional Large Vocabulary Continuous Speech Recognition (LVCSR) systems. The Aurix Engine allows audio to be indexed at high rates with the index files being compressed as they are generated.
Nonetheless, expanding such offerings to deal with large scale media sources continually and possibly rapidly generating media files as well as handling search requests raises problems in: (1) the generation and storage of the index data, (2) the management of the generated index data to accommodate the dynamically changing nature of the target media corpus, and (3) the retrieval of the stored index data on demand for media searching.
According to one aspect of the present invention there is provided a method of indexing media information for phonetic searching according to claim 1.
In a second aspect there is provided a method of phonetically searching media information according to claim 16.
Further aspects of the invention provide computer program products stored on computer readable storage media which, when executed on processors of a distributed multi-processor system, are arranged to perform the steps of any one of claims 1 to 13 and 16 to 18.
Still further aspects comprise distributed multi-processor systems arranged to perform the steps of any one of claims 1 to 13 and 16 to 18.
In embodiments of the present invention, the scheduling of indexing tasks ensures that no single indexing task can block a cluster of processors within a distributed file system.
Embodiments of the invention can provide efficient phonetic search (audio mining) of a large corpus of audio material within the constraints imposed by the Hadoop software framework for distributed computation; in particular by aggregating generated index data (searchable phonetic representations of the audio material) into a relatively small number of archive files. The management of the archive files permits dynamic change of the searchable audio corpus; and provides for efficient access to the archive files for audio search.
It will be seen that using a DFS framework can ensure data locality, so that where possible, searching occurs on a cluster node that holds a local copy of a block of index data. The index data within the block can thus be read by the framework in an efficient streaming read operation (possibly skipping over any data for media files which are not included within the search).
An embodiment of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Referring now to the drawings, there are essentially two main components to the phonetic searching system of the preferred embodiment: indexing and searching, each of these being linked through a set of common archive files.
The embodiment is implemented on Hadoop, which allows the system to run on a distributed cluster of commodity server hardware which can be easily expanded as required. There are three components of Hadoop of particular relevance in the present case: the Hadoop Distributed File System (HDFS), the Hadoop Map-Reduce Framework (MR) and the Hadoop Distributed Database (HBase).
Briefly, HDFS provides an interface to a fault-tolerant distributed file system that transcends the limitations of any individual machine in a cluster. HDFS is optimized for the storage of a relatively small number of large (i.e. gigabyte to terabyte-scale) files, and for high data read rates via fast sequential streaming read operations, at the expense of latency, i.e. slow seek times to random positions within these files. All files within HDFS are stored as a sequence of blocks, each replicated across a number of cluster nodes in order to provide overall resilience against failure of any individual cluster node. The block size is configurable and while it defaults to 64 MB, in the present embodiment it is set to 256 MB. Files within HDFS, once created, may not be modified; however, they may be deleted, and data may be appended to an existing file.
The MR framework provides for scheduled computation against individual blocks of files stored within HDFS in so far as is possible on a cluster node that contains a local copy of that block, in order to minimise network traffic between cluster nodes. This is particularly useful for audio mining, where the index files contain relatively high amounts of data to be read and so remote access could result in a networking bottleneck.
HBase provides a convenient means of storing the results of audio mining in a form that can be readily accessed.
In one example, audio tracks are extracted (possibly from an associated video file), transcoded to linear PCM and placed in the external audio database by a single external process operating in tandem with or within the recording system 10. In high volume systems, this could present a significant scalability bottleneck to the ingestion rate of files for indexing. Thus, in such high volume systems, the audio extraction and transcoding processing could be performed in a distributed manner or possibly combined with or incorporated into the indexing jobs described below.
Thus, for the purposes of the present invention, the media database 12 could include any combination of video files, audio files or transcoded audio information.
The recording system 10 produces a list of pointers 14, for example, URLs, to each media file which is to be processed and searchable by the system. The pointers are essentially written to a queue and processed on a FIFO basis.
A partitioning pre-processor 16 grabs a set number of URLs from the front of the pointer queue and partitions this set into a number of subsets, in such a fashion that each subset represents an approximately equal workload; these subsets form the input to a Hadoop MR indexing job 18. (In the current embodiment this partitioning is not itself performed as a distributed computation, but it could be implemented as such). The partitioning determines the way that the overall indexing computation is split between a set of indexing tasks 20 that comprise the indexing job 18. Each task 20 processes one subset of the overall set of URLs. These tasks get scheduled for execution among the nodes of the cluster as free computational resource becomes available.
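By way of illustration, the partitioning pre-processor's behaviour might be sketched as follows. This is a minimal sketch, not the embodiment's implementation: the `partition_urls` helper, the round-robin assignment policy and the URL scheme are all illustrative assumptions, standing in for any policy that yields subsets of approximately equal workload.

```python
def partition_urls(url_queue, batch_size, num_subsets):
    """Take a fixed-size batch of URLs from the front of the FIFO queue and
    split it into num_subsets of approximately equal size; each subset will
    form the input of one indexing task within an indexing job."""
    batch = url_queue[:batch_size]                # grab the front of the queue
    subsets = [[] for _ in range(num_subsets)]
    for i, url in enumerate(batch):
        subsets[i % num_subsets].append(url)      # round-robin assignment
    return subsets, url_queue[batch_size:]        # subsets plus remaining queue

# Hypothetical usage: 10 queued media files, batches of 8, 3 tasks per job.
queue = ["hdfs://media/%d.wav" % n for n in range(10)]
subsets, remaining = partition_urls(queue, batch_size=8, num_subsets=3)
```

Because the subsets differ in size by at most one URL, each indexing task receives an approximately equal share of the batch.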
In the example shown there are 3 indexing jobs 181 . . . 183. The number N of indexing jobs which run concurrently depends on several factors including: the number of concurrent feeds (NF) from which input media files are taken; the "chunk size" (C) into which each feed is broken before being stored, for example, television programmes are typically 1 hour in length; the frequency (FR) with which the system schedules new indexing jobs; and the overall cluster throughput (TP), which is a function of indexing rate per node (hardware dependent) and cluster size (number of nodes). Thus:
A number of these indexing jobs are allowed to run concurrently; thus indexing is not blocked even if one particular indexing job takes a long time to complete. (This can occur if a constituent indexing task represents a disproportionately large amount of the overall computational workload, for example because it contains one or more unusually large files).
Thus, breaking the overall work burden into sufficiently small chunks distributes work efficiently across the cluster, without danger that any one task ends up with a disproportionate share of the load. Moreover, this also improves responsiveness of the system to the concurrent search requests described later, ensuring that the cluster does not risk becoming dominated by long-running indexing tasks while search requests are pending.
Each indexing job 18 instantiates one or more Map tasks 20, each task processing the media files from one of the sets of URLs provided to the job 18 by the partitioner 16. In the simplest implementation, a single task handles the set of media files awaiting ingest, and this may contain multiple URLs. For each URL in the set, the task 20 reads the corresponding source media file and generates a binary index file corresponding to a probabilistic phonetic representation of the audio contents of the media file. Each index file 21 is then appended to an archive file 22. Since the process of appending files is inherently serial in nature, it is arranged that concurrently executing indexing tasks append the index files they generate to different archive files, in order that the indexing tasks are able to run in parallel.
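The task structure described above might be sketched as follows. The `indexing_task` and `index_fn` names are illustrative assumptions (the real phonetic indexer is, in one embodiment, the Aurix engine); the key point shown is that each task appends only to its own archive stream, so concurrent tasks never contend for the same serial append.

```python
import io

def indexing_task(task_id, urls, archives, index_fn):
    """One Map task: index each media file in this task's URL subset and
    append the resulting binary index record to the task's own archive
    file, recording the (start, end) offsets of each record."""
    archive = archives[task_id]          # archive stream private to this task
    offsets = {}
    for url in urls:
        index_data = index_fn(url)       # phonetic index for one media file
        start = archive.tell()
        archive.write(index_data)        # serial append, but to *this* archive only
        offsets[url] = (start, archive.tell())
    return offsets

# Hypothetical usage with in-memory archives and a dummy indexer.
archives = {0: io.BytesIO(), 1: io.BytesIO()}
offsets = indexing_task(0, ["u1", "u2"], archives, lambda u: u.encode())
```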
As mentioned, a notable point about many DFS systems, and HDFS in particular, is that once data is appended to an archive file, it cannot be modified. This makes the file system particularly useful for the present invention, where indexed data is simply appended to an archive file which is then searchable. If the set of archives becomes too large, or if it is required to keep the amount of searchable media material within a fixed size rather than allowing it to accrue indefinitely, then, for example, archive files of a given age could be removed, it being appreciated that the portion of the media database 12 indexed in a deleted archive file would then no longer be phonetically searchable.
Nonetheless, it would still be possible to physically delete index data associated with specific media files, if required. This would require rewriting the containing archive file with the deleted index data excluded, then replacing the old archive file with the updated one and updating the corresponding meta-data. This would be a potentially expensive operation, and would need to be carried out by a periodic maintenance activity that physically removes index data associated with media files that have been logically deleted, somewhat analogous to defragmenting a hard disk. The fact that the index data is distributed across a number of archive files would help, since each archive file typically represents only a proportion of the total, and as the archive files can be maintained individually, there would be no need to take the entire archive offline at any point in time.
The phonetic stream which is produced by the indexing tasks can be of any given format, but essentially it needs to be compatible with the search tasks which will be searching through the indexed information. In one embodiment, the indexing is performed so as to allow search tasks running the Aurix Phonetic Speech Search Engine to search through the indexed information.
The items shown in the meta-data and index header sections of the record for a given media file show a sync field (essentially a flag comprising a byte sequence that allows the start of an index data record to be validated), an ID field indicating the ID of the media file in the database 12 to which the index data record corresponds, and a length field (a 64-bit record of the length of the index data block). Other meta-data (not shown) includes offsets within the containing archive file for the start and end of the index data associated with a given audio file. The index data header is shown as comprising a record of the audio sample rate and the language and speech type used to generate the audio data. It should also be appreciated that other fields could potentially be added to the meta-data: for example, the number of audio channels. This meta-data could also be stored in a separate database (possibly a HBase), keyed by the audio ID.
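The sync, ID and length fields described above might be packed and validated as in the following sketch. The concrete byte layout, the sync byte sequence and the helper names are illustrative assumptions; only the presence of a sync flag, a media-file ID and a 64-bit length field is taken from the description.

```python
import struct

SYNC = b"\xde\xad\xbe\xef"        # hypothetical sync byte sequence

# Hypothetical fixed layout: 4-byte sync, 8-byte media-file ID,
# 8-byte (64-bit) length of the index data block, all big-endian.
HEADER = struct.Struct(">4sQQ")

def pack_record(media_id, index_data):
    """Build one archive record: meta-data header followed by index data."""
    return HEADER.pack(SYNC, media_id, len(index_data)) + index_data

def read_record(buf, pos):
    """Validate the record start via the sync field, then return the media
    ID, the index data, and the offset of the next record."""
    sync, media_id, length = HEADER.unpack_from(buf, pos)
    if sync != SYNC:
        raise ValueError("bad sync: record start failed validation")
    start = pos + HEADER.size
    return media_id, buf[start:start + length], start + length
```

A usage example: `pack_record(7, b"...")` followed by `read_record` on the result round-trips the ID and index data.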
Nonetheless, storing this meta-data in the index data record within an archive file 22 improves efficiency during searching, because it eliminates the need to retrieve it from a database.
As mentioned above, distributed file systems, and particularly HDFS, store data blocks in a redundant fashion and as such any given block of an archive file can be replicated on a number of nodes within a cluster. Search tasks against that block will preferentially be run on one of the nodes that holds a local copy of the block, in order to minimize network traffic. Thus, it will be appreciated that writing indexed information in this format enables efficient searching to be performed by tasks running in parallel across the nodes of a cluster.
It should also be appreciated that in the present implementation, an archive file may contain an incomplete block (at the end), and as indicated in
Although all the index data could in theory be appended into a single archive file, multiple active archive files tend to be more efficient, as appending to a single archive file could represent a performance bottleneck whereas multiple archive files can be appended to concurrently (up to a limit imposed by the number of processing cores and the I/O capacity of the DFS cluster). However, there are also efficiency reasons not to allow the number of archive files to become too large. The data for a given block of an archive file within HDFS can be read from disk in an efficient streaming read operation, rather than requiring individual seeks to the start of each index data file (as would be the case if the index data was stored as individual files). It is therefore best if the archive files are a significant multiple of the block size, rather than being of the order of the block size or less, in order to amortize the proportionately greater cost of processing a part block.
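The streaming read described above, including skipping over index data for media files not included in a search, might be sketched as follows. The 8-byte ID and length fields are a simplified, hypothetical record layout (sync bytes omitted); the point shown is that the length field in each record's meta-data lets the reader stream through a block sequentially, without a seek per index file.

```python
def scan_block(buf, wanted_ids):
    """Stream through the records packed into one archive block, collecting
    index data only for media files included in the search and skipping the
    rest by advancing past each record using its recorded length."""
    pos, hits = 0, {}
    while pos < len(buf):
        media_id = int.from_bytes(buf[pos:pos + 8], "big")     # hypothetical 8-byte ID
        length = int.from_bytes(buf[pos + 8:pos + 16], "big")  # index data length
        if media_id in wanted_ids:
            hits[media_id] = buf[pos + 16:pos + 16 + length]
        pos += 16 + length                                     # skip whole record
    return hits
```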
Referring now to
Each time a search MR job 32 is instantiated, it instantiates a number of Map Tasks 1 . . . P, each corresponding to a local block of an archive file to be searched. Increasing the HDFS block size for the archive files from the default of 64 MB to at least 256 MB, as indicated above, ensures that the computational overhead of setting up a search map task is outweighed by the computational effort required to perform the search, even for small searches. The search job 32 looks for searches in a search queue 28 and passes each search query to every task which has not yet performed that search on its block, i.e. at any given time a task may be performing more than one search as it traverses its block. In the embodiment, each task writes its search results to a common HBase database 30 for later retrieval. Once all tasks have reported their results for a given search, the results can be retrieved by the search interface 24 and returned to a client across the network 26; these results typically take the form of a number of links to the original media files along with the detected locations of the search query in those files.
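The search-side structure described above might be sketched as follows. This is an illustrative assumption throughout: the `run_search_tasks` name is hypothetical, a plain substring test stands in for actual phonetic matching against the index data, and a dictionary stands in for the shared HBase results store 30.

```python
def run_search_tasks(blocks, pending_queries, results_store):
    """One search job: for each local archive block, run a task that applies
    every pending query not yet performed against that block, writing any
    hits to a shared results store (HBase in the embodiment)."""
    for block_id, records in blocks.items():        # one Map task per block
        for query in pending_queries:               # a task may carry several searches
            for media_id, phonetic_index in records:
                if query in phonetic_index:         # stand-in for phonetic matching
                    results_store.setdefault(query, []).append((media_id, block_id))
    return results_store

# Hypothetical usage: two indexed media files in one block, one pending query.
blocks = {"block0": [(1, "k ae t"), (2, "d ao g")]}
results = run_search_tasks(blocks, ["ae"], {})
```

Once every task has reported, the entries under a query's key correspond to the links and locations returned to the client.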
Distributed file systems replicate data across the nodes of a cluster and in typical configurations blocks might be mirrored across three nodes, bearing in mind that any given block can be replicated to any three nodes of a cluster which may in fact comprise a large number of nodes. Thus, for example, the search tasks for given search jobs might be scheduled across a large proportion of the nodes in the cluster.
In
In the above-described embodiment, the partitioner 16 submits sets of a fixed number of URLs to each indexing job. However, it will be seen that if the partitioner were to take into account the size of the media files, partitioning could be handled on the basis of media file size, such that each subset contained approximately the same amount of data.
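The size-based alternative just described might be sketched as a greedy balancing scheme; the `partition_by_size` helper and the largest-first heap policy are illustrative assumptions, one of several ways to make the subsets carry approximately equal amounts of data.

```python
import heapq

def partition_by_size(media_files, num_subsets):
    """Greedy size-based partitioning: assign each (url, size) media file,
    largest first, to the currently lightest subset, so that all subsets
    end up holding approximately the same amount of data."""
    heap = [(0, i, []) for i in range(num_subsets)]   # (bytes so far, tiebreak, subset)
    heapq.heapify(heap)
    for url, size in sorted(media_files, key=lambda f: -f[1]):
        total, i, subset = heapq.heappop(heap)        # lightest subset so far
        subset.append(url)
        heapq.heappush(heap, (total + size, i, subset))
    return [subset for _, _, subset in heap]

# Hypothetical usage: five files with sizes in, say, hundreds of MB.
files = [("a", 5), ("b", 4), ("c", 3), ("d", 3), ("e", 1)]
parts = partition_by_size(files, 2)
```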
The invention is not limited to the embodiment(s) described herein but can be amended or modified without departing from the scope of the present invention.
The present application relates to U.S. application Ser. No. ______ entitled “A System and Method for Phonetic Searching of Data” (Ref: 512115-US-NP/P105558us00/A181FC) co-filed herewith and which is incorporated herein by reference.