A discord is a refinement of the concept of an anomalous subsequence of a time series. As one of the topical issues in time series mining, discord discovery is applied in a wide range of real-world areas (medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, etc.). In this article, we propose a novel parallel algorithm for discord discovery on a high-performance cluster whose nodes are based on many-core accelerators, for the case when the time series cannot fit in main memory. We assume that the time series is partitioned across the cluster nodes, and we achieve parallelization both among the cluster nodes and within a single node. Within a cluster node, the algorithm employs a set of matrix data structures to store and index the subsequences of the time series, and to provide efficient vectorization of computations on the accelerator. Each node processes its own partition in two phases, namely candidate selection and discord refinement, with each phase requiring one linear scan through the partition. The local discords found are then combined into a global candidate set and transmitted to every cluster node. Next, each node refines the global candidate set over its own partition, resulting in a local true discord set. Finally, the global true discord set is constructed as the intersection of the local true discord sets. The experimental evaluation on a real computer cluster with real and synthetic time series shows the high scalability of the proposed algorithm.

Currently, the discovery of anomalous subsequences in very long time series is a topical issue in a wide spectrum of real-world applications, namely medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, and others. In such applications, one often has to deal with multi-terabyte time series that cannot fit into main memory.

Keogh et al. [

Further, Yankov, Keogh

Our research is devoted to parallel and distributed algorithms for time series mining. In previous work [

The rest of the article is organized as follows. Section 2 gives the formal definitions along with a brief description of DADD. Section 3 provides a brief review of the state of the art. Section 4 presents the proposed methodology. Section 5 provides the results of the experimental evaluation of our approach. Finally, Section 6 summarizes the results obtained and suggests directions for further research.

Below, we follow [

A

A

A

Two subsequences

A subsequence

A subsequence

Given a positive parameter

DADD, the original serial disk-based algorithm [

When computing the distance between subsequences, DADD requires that the arguments first be z-normalized so that each has a mean of zero and a standard deviation of one. Here,
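As an illustration, z-normalization of a subsequence can be sketched as follows (a minimal Python sketch; the function name and the zero-variance fallback are our own assumptions, not part of DADD):

```python
import numpy as np

def z_normalize(subsequence):
    """Scale a subsequence to zero mean and unit standard deviation."""
    x = np.asarray(subsequence, dtype=float)
    mu, sigma = x.mean(), x.std()
    if sigma == 0.0:
        # Constant subsequence: z-normalization is undefined, return zeros.
        return np.zeros_like(x)
    return (x - mu) / sigma
```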

In the original DADD algorithm, the Euclidean distance is used as the distance measure, yet the algorithm can be used with any distance function, which need not necessarily be a metric [

The DADD algorithm [

Phase 1. Candidate selection.
Phase 2. Discord refinement.

At the first phase, the algorithm scans through the time series

At the second phase, the algorithm initially sets the distances from all candidates to their nearest neighbors to infinity. Then, the algorithm scans through the time series
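The two phases described above can be sketched in a simplified, in-memory form as follows (a Python sketch under our own assumptions: the z-normalized Euclidean distance is used, r is the user-given range threshold, and "non-self match" means an offset difference of at least the subsequence length m; the real algorithm is disk-aware):

```python
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def dist(a, b):
    # Euclidean distance between z-normalized subsequences
    return float(np.linalg.norm(znorm(a) - znorm(b)))

def range_discords(ts, m, r):
    """Simplified in-memory sketch of the two-phase scheme: a subsequence is
    reported iff every non-self match (offset difference >= m) lies at
    distance >= r from it."""
    ts = np.asarray(ts, dtype=float)
    n = len(ts) - m + 1
    subs = [ts[i:i + m] for i in range(n)]

    # Phase 1: candidate selection. A subsequence that comes within r of an
    # existing candidate eliminates that candidate and is not added itself.
    cand = set()
    for i in range(n):
        survives = True
        for j in [j for j in cand if abs(i - j) >= m]:
            if dist(subs[i], subs[j]) < r:
                cand.discard(j)
                survives = False
        if survives:
            cand.add(i)

    # Phase 2: discord refinement. A candidate is a true discord iff its
    # nearest non-self match lies at distance >= r.
    discords = set(cand)
    for i in range(n):
        for j in [j for j in discords if abs(i - j) >= m]:
            if dist(subs[i], subs[j]) < r:
                discords.discard(j)
    return sorted(discords)
```

Note that phase 1 never discards a true discord: if some subsequence lies within r of a candidate, neither of the two can be a discord, so both may be safely eliminated.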

First introduced by Keogh et al. [

The original HOTSAX algorithm [

Further, Yankov, Keogh et al. [

Several noteworthy works are devoted to the parallelization of DADD. The DDD (Distributed Discord Discovery) algorithm [

The PDD (Parallel Discord Discovery) algorithm [

In their further work [

Concluding this brief review, we should also mention the matrix profile (MP) concept proposed by Keogh et al. [

The parallelization employs two levels of parallelism, namely across the cluster nodes and among the threads of a single node. We implement these levels, respectively, through partitioning of the input time series combined with MPI, and through OpenMP. Within a single node, we employ a matrix representation of the data to effectively parallelize computations with OpenMP. Below, we show an approach to the implementation of these ideas.

To provide parallelism at the level of the cluster nodes, we perform time series partitioning across the nodes as follows. Let

This means that the head part of every partition except the first overlaps with the tail part of the previous partition in m-1 data points (where m is the subsequence length), so that no subsequence spanning a partition boundary is lost.
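The partitioning scheme can be sketched as follows (a Python sketch; the function and variable names are illustrative, and for simplicity the base partition length is taken as n // k):

```python
def partition_bounds(n, m, k):
    """Split a time series of length n into k partitions (one per cluster
    node) for subsequence length m. Every partition except the first is
    extended to the left by m - 1 points, so that every length-m
    subsequence falls entirely within some partition."""
    base = n // k
    bounds = []
    for p in range(k):
        start = p * base
        end = n if p == k - 1 else (p + 1) * base
        head = start if p == 0 else start - (m - 1)  # overlap with the previous partition
        bounds.append((head, end))
    return bounds
```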

The time series partition is stored as a matrix of aligned subsequences to enable computations over aligned data with as many auto-vectorizable loops as possible. We avoid unaligned memory access, since it can cause inefficient vectorization due to the time overhead of loop peeling [
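The idea of the padded subsequence matrix can be sketched as follows (a Python/NumPy sketch; the VPU width of 16 floats, corresponding to 512-bit vector registers, is an assumption for illustration):

```python
import numpy as np

def subsequence_matrix(partition, m, w=16):
    """Store every length-m subsequence of a partition as a row of a matrix.
    Each row is zero-padded up to the nearest multiple of w (the number of
    floats held by the VPU), so that row starts stay on vector-aligned
    boundaries and inner loops over a row can be vectorized without peeling."""
    partition = np.asarray(partition, dtype=np.float32)
    n = len(partition) - m + 1
    padded_len = ((m + w - 1) // w) * w   # round m up to a multiple of w
    mat = np.zeros((n, padded_len), dtype=np.float32)
    for i in range(n):
        mat[i, :m] = partition[i:i + m]
    return mat
```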

Let us denote the number of floats stored in the VPU (vector processing unit of the many-core accelerator) by

The

The parallel algorithm employs the data structures depicted in

Set of discords

Let us denote the ratio of the number of candidates selected at a cluster node to the number of all subsequences of the time series by

The

To provide fast access to the candidate index during the selection phase, it is implemented as a deque (double-ended queue) with three attributes, namely
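A minimal sketch of such a deque-backed index is shown below (the attribute and method names here are our own illustrative assumptions and do not reproduce the paper's three attributes):

```python
from collections import deque

class CandidateIndex:
    """Deque-backed index of candidate offsets (illustrative sketch)."""
    def __init__(self):
        self.offsets = deque()            # candidate offsets in scan order
    def add(self, offset):
        self.offsets.append(offset)       # O(1) insertion at the tail
    def pop_oldest(self):
        return self.offsets.popleft()     # O(1) removal at the head
    def __len__(self):
        return len(self.offsets)
```

A deque gives constant-time operations at both ends, which is what a linear scan over the candidates requires.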

Let

After the selection phase, all the nodes exchange the candidates found to construct the combined candidate set, so that at each cluster node the candidate body contains potential discords from all the nodes. At the second phase, the algorithm refines the combined candidate set, comparing the parameter

To parallelize this activity, we process the rows of the subsequence matrix in a segment-wise manner and employ an additional attribute of the candidate body, namely

In the implementation, we apply the following parallelization scheme at the level of the cluster nodes. Let the input time series

Next, as opposed to MR-DADD [

Then the combined candidate set

The parallel implementation of the candidate selection and refinement phases is depicted in



In the selection phase, we parallelize the outer loop over the rows of the subsequence matrix, while in the inner loop over the candidates, each thread processes its own segment of the candidate index. By the end of the phase, the candidates found by each thread are placed into the candidate body, and all the cluster nodes exchange the resulting candidate bodies via the MPI_Send and MPI_Recv functions to form the combined candidate set, which serves as the input for the second phase.

In the refinement phase, we also parallelize the outer loop over the rows of the subsequence matrix, and in the inner loop over the candidates, each thread processes its own segments of the candidate body and bitmap. In this implementation, we do not use the early-abandoning technique for the distance calculation, relying on the fact that vectorization of the squared Euclidean distance is likely to yield greater benefits. By the end of the phase, the column-wise conjunction of the elements in the bitmap matrix yields the set of true discords found by the current cluster node. The intersection of such sets is computed by one of the cluster nodes, to which the remaining nodes send their resulting sets.
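The final conjunction and intersection steps can be sketched as follows (a NumPy sketch; in the real implementation the per-node masks are exchanged via MPI rather than passed as an in-memory list):

```python
import numpy as np

def global_true_discords(node_bitmaps):
    """node_bitmaps[p] is a boolean matrix for node p: entry [t, c] is True
    iff thread t found no non-self match of candidate c within distance r.
    A column-wise AND inside a node gives the node's local true discords;
    an AND across the nodes implements the set intersection."""
    local_masks = [bm.all(axis=0) for bm in node_bitmaps]  # per-node conjunction
    global_mask = np.logical_and.reduce(local_masks)       # intersection across nodes
    return np.flatnonzero(global_mask)                     # indices of true discords
```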

We evaluated the proposed algorithm during the experiments conducted on the Tornado SUSU computer cluster [

In the first series of experiments, we assessed the algorithm's scaled speedup, which is defined as the speedup obtained when the problem size is increased linearly with the number of nodes added to the computer cluster [

where
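One common convention for writing the scaled speedup (given here as an illustrative reconstruction, not necessarily the paper's exact notation) is

```latex
\gamma(k) = \frac{k \cdot t(1, n)}{t(k, k \cdot n)},
```

where $t(k, n)$ denotes the runtime of the algorithm on $k$ nodes over a time series of length $n$. Under this convention, linear scalability corresponds to $\gamma(k) \approx k$, since a perfectly scaling algorithm solves the $k$-fold problem on $k$ nodes in the same time as the base problem on one node.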

For the evaluation, we took ECG time series [

# cluster nodes | Time series length, n | candidate | refined | candidate | refined | candidate | refined
---|---|---|---|---|---|---|---
1 | 10^{6} | 5,407 | 772 | 28,916 | 533 | 17,366 | 548
2 | 2·10^{6} | 15,115 | 3,245 | 15,979 | 2,124 | 28,376 | 1,174
4 | 4·10^{6} | 23,990 | 3,199 | 29,075 | 2,098 | 54,281 | 1,169
8 | 8·10^{6} | 45,989 | 3,167 | 53,678 | 1,992 | 109,890 | 1,070
16 | 1.6·10^{7} | 117,659 | 2,779 | 129,528 | 1,533 | 299,157 | 1,892
32 | 3.2·10^{7} | 374,536 | 2,666 | 294,844 | 1,288 | 732,517 | 721
64 | 6.4·10^{7} | 587,795 | 2,517 | 707,956 | 1,036 | 1,785,410 | 570
128 | 1.28·10^{8} | 1,145,661 | 2,324 | 1,541,099 | 764 | 4,743,032 | 765

The results of the experiments are depicted in

In the second series of the experiments, we compared the performance of our algorithm against the analogs we have already considered in Section 3, namely DDD [

Throughout the experiments, we used synthetic time series generated according to the Random Walk model [

Analog algorithm | Analog hardware | # cluster nodes (our algorithm) | # cores (threads) per node (our algorithm) | Time series length, n | m | Analog time, s | Our time, s
---|---|---|---|---|---|---|---
DDD | 4 CPU @2.13 GHz | 2 | 4 (16) | 10^{7} | 512 | 5,382 | 745
MR-DADD | 8 CPU @3.0 GHz | 2 | 8 (32) | 10^{6} | 512 | 240 | 99
GPU-STAMP | 2880 CUDA cores | 2 | 30 (120) | 2^{21} | 256 | 11,664 | 83
MP-HPC | 39 CPU @2.6 GHz | 4 | 60 (240) | 8·10^{5} | 1000 | 6,000 | 32

In this article, we addressed the problem of discovering anomalous subsequences in a very long time series. Currently, there is a wide spectrum of real-world applications where it is typical to deal with multi-terabyte time series that cannot fit in main memory: medicine, astronomy, economics, climate modeling, predictive maintenance, energy consumption, and others. In this study, we employ the discord concept: a discord is a subsequence of the time series that has the largest distance to its nearest non-self-match neighbor subsequence.

We proposed a novel parallel algorithm for discord discovery in very long time series on a modern high-performance cluster with nodes based on many-core accelerators. Our algorithm utilizes the serial disk-aware algorithm by Yankov, Keogh et al. [

The experimental evaluation on a real computer cluster with real and synthetic time series showed the high scalability of the proposed algorithm: throughout the experiments, it demonstrated close-to-linear scalability, which improves further under a higher computational load caused by a greater discord length. Moreover, the algorithm's performance was ahead of the analogs, which do not employ both a computer cluster and many-core accelerators.

In further studies, we plan to elaborate versions of the algorithm for computer clusters with GPU nodes.