Early detection of dynamic harmful cascades in large-scale networks ଝ
Abstract
Quickly detecting harmful cascades in networks can allow us to analyze the causes and prevent further spreading of destructive influence. Since it is often impossible to observe the state of all nodes in a network, a common method is to detect harmful cascades from sparsely placed sensors. However, the harmful cascades are usually dynamic (e.g., the cascade initiators and diffusion trajectories can change over the time), which can severely destroy the robustness of selected sensors. Meanwhile the large scale of current networks greatly increases the time complexity of sensor selection. Motivated by the observation, in this paper we investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks. Specifically, we first put forward a dynamic susceptible-infected model to describe harmful cascades, and formally define a detection time minimization (DTM) problem which focuses on effective sensors placement for early detection of dynamic cascades. We prove that it is #P-hard to calculate the objective function exactly and propose two Monte-Carlo methods to estimate it efficiently. We prove the NP-hardness of DTM problem and design a corresponding greedy algorithm. Based on that, we propose an efficient upper bound based greedy (UBG) algorithm with the theoretical performance guarantee reserved. To further meet different types of large-scale networks, we propose two accelerations of UBG: Quickest-Path-UBG for sparse networks and Local-Reduction-UBG for dense networks to improve the time complexity. The experimental results on synthetic and real-world social networks demonstrate the practicality of our approaches.
Harmful cascade spreading through kinds of network structures has become more and more ubiquitous in the modern world. A contagious disease like Severe Acute Respiratory Syndrome (SARS) can spread quickly through a population contact network and lead to an epidemic [12, 34] . A computer virus on a few servers can fast spread to other servers or computers in a computer connection network [15, 17] . In a similar vein, a rumor started by a few individuals can spread quickly through the online social network [2, 26] . It is crucial to detect the harmful cascades as soon as they happen or shortly thereafter, since it allows us to study the causes and prevent further spreading of harmful influence [32, 38] .
In practice, it is commonly infeasible or unaffordable to monitor all nodes at all times, and therefore a common way to detect cascades is to select important nodes where we can place sensors for monitoring [7, 19, 31, 39] . However, existing methods usually viewed the cascade data as static and deterministic, ignoring an important fact that the harmful cascades are usually dynamic and time-variant, in the sense that the cascade initiators and diffusion trajectories can change randomly over the time. Actually the dynamic brings new challenges to network monitoring, since it can severely destroy the robustness of selected sensors [22] . Meanwhile the existing sensor selection algorithms can only work for small networks and not be scalable well to large networks of the https://doi.org/10.1016/j.jocs.2017. 10 .014 1877-7503/© 2017 Elsevier B.V. All rights reserved. day [28] . Motivated by these observations, in this paper we investigate the scalable solutions to sensor selection problem for early detection of dynamic harmful cascades in networks. To model the dynamic property, we carry on our work from a model-driven perspective.
To provide a unified framework, we model all the above examples as an infection spreading in a network G = (V, E), where V is the set of nodes (i.e. individuals) and E is the set of edges (i.e. relationships). In a population network, the infection is the disease that is transmitted between individuals. In the example of a computer virus spreading in a network, the infection is the computer virus, while for the case of a rumor spreading in a social network, the infection is the rumor.
Under this unified network framework, we specifically propose a new dynamic cascade model to describe harmful cascade diffusions, and define a detection time minimization (DTM) problem S * = argmin |S|=k,S⊆V D(S), where k is a given parameter determined by budget or monitor capacity, S is sensor nodes set, and D(S) is detection time. The DTM problem focuses on effective sensor nodes selection for early detection of dynamic harmful cascades.
We first prove the NP-hardness of DTM problem and design a corresponding greedy algorithm. Considering the limitation caused by greedy algorithm inefficiency, we then propose two alternative Monte-Carlo methods, each having its pros and cons, to estimate the #P-hard objective function D(S) efficiently. To further address the scalability issue, we propose an efficient upper bound based greedy (UBG) algorithm and its two accelerations Quickest-Path-UBG and Local-Reduction-UBG to cater for different types of large-scale networks. These two accelerations are close to UBG in performance but orders of magnitude faster than UBG in time complexity. Experiments on synthetic and real-world social networks demonstrate the practicality of our approaches.
Detecting the whole network by monitoring finite sensor nodes has been widely applied to detect water contaminations in a water distribution network [19] and virus outbreaks in a human society [7] . Some early work places sensors by topological measures, e.g. targeting high degree nodes [32] or highly connected nodes [8] . However the effects along this idea are commonly unsatisfied, since they ignore the multi-step complexity of network diffusions.
The "Battle of Water Sensor Network" challenge [31] motivated a number of works to optimize sensor placement in water networks to detect contamination [16, 18] . By utilizing submodular property, they proposed to optimize the sensor placement with different criterions such as maximizing the probability of detection, minimizing the detection time, or minimizing the size of the subnetwork affected by the phenomena [24] . In their works the data are a set of deterministic scenarios and the random property of diffusion are not taken into consideration. What's more, their methods are not scalable to large networks [28] .
In addition to the above two types of methods, another related work [1] set sensors along fixed paths in the network so as to gather sufficient information to locate possible contaminations. Early detection of contagious outbreaks by monitoring the neighborhood (friends) of a randomly chosen node (individual) was studied by Christakis and Fowler [7] . Krause et al. [20] presented efficient schedules for minimizing energy consumption in battery operated sensors, while other works analyzed distributed solutions with limited communication capacities and costs [13, 21, 22] . In addition, Berry et al. [4] equated the placement problem with a p-median problem. Li et al. [25] proposed a dynamic-programming (DP) based greedy algorithm which is with a near-optimal performance guarantee.
By contrast, our work is geared to investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks from a model-driven prospective. The main difference from previous works is that we take dynamic properties of cascades into consideration: (1) the cascade initiator is dynamic since we do not know who will initiate another harmful diffusion next time, and (2) the diffusion process is dynamic since the propagation trajectory is uncertain. Under these two dynamic sources, we aim to accurately and fast select k nodes as sensors for early detection of harmful cascades. Besides, as the size of network becomes more and more large-scale, we also need to ensure the effectiveness of the proposed sensor selection methods for large networks.
To minimize the time to find the dynamic harmful cascades, our search attempts to optimize the selection of sensors in a scalable way. Our contributions are summarized as follows:
• We formulate a Detection Time Minimization (DTM) problem.
To describe the dynamic harmful cascade on networks, we put forward a dynamic cascade model for the harmful diffusion. • We prove that it is #P-hard to exactly calculate the objective function of DTM problem and propose two equivalent Monte-Carlo methods to estimate it. Each has its advantages and disadvantages. • We show the NP-hardness of DTM problem as it can be shown to contain the Set Cover problem as a simple special case. We convert the DTM problem to a constrained max optimization problem, prove the submodularity of the new objective function, and then employ the greedy algorithm which achieves an approximation ratio of 1 − 1/e. • We theoretically establish new upper bounds for the remaining time. Based on these bounds, we further propose a new Upper Bound based Greedy (UBG in short) algorithm which can significantly reduce the number of estimations in greedy-based algorithms, especially at the initial step. • We propose two accelerations of UBG: Quickest-Path-UBG and Local-Reduction-UBG to address the DTM problem for large networks, which are close to UBG in performance but orders of magnitude faster than UBG in time complexity.
In addition, our methods not only aim at early detection of harmful cascades, but also shed light on a host of other applications. For example, the goal of Emerging Topic Detection is to identify emergent topics in a social network, assuming full access to the stream of all postings. Providers, such as Twitter or Facebook, have an immediate access to all tweets or postings as they are submitted to their server [6, 29] , while outside observers need an efficient mechanism to monitor changes, such as the methods developed in this work. Another example, an emerging trend in algorithmic stock trading is the use of automatic search through the Web and social networks for pieces of information that can be used in trading decisions before they appear in the more popular news sites [23, 27] . Similarly, intelligence, business and politics analysts are scanning online sources for new information. The theoretical framework and method established in this paper can also be used in these applications.
The rest of the paper is organized as follows. Section 2 presents the cascade model and problem formulation. In Section 3 we show the # P-hardness of detection time calculation and derive two equivalent estimation methods. Section 4 is devoted to the NP-hardness of DTM problem, the submodular properties of transformed objective function, and the corresponding greedy algorithm. To prune the unnecessary estimation calls, we analysis the upper bounds for the new proposed UBG algorithm in Section 5 . Section 6 presents two accelerations of UBG for large scale networks. Section 7 shows the experimental results. We conclude the paper in Section 8. Table 1 outlines the major variables used.
In this section we start with a description of dynamic cascade model and then we define the Detection Time Minimization (DTM) problem abstracted from the early detection problem.
We start somewhat with the framework of [24] , where the models introduced are essentially descriptive to specify a joint distribution over all nodes' behavior in a global sense. In contrast, we focus on more operational models, from mathematical sociology [3] and interacting particle systems [10] , to explicitly represent the step-by-step dynamics of infection.
By considering operational models for the dynamic diffusion of a harmful item through a network G, represented by a directed graph, we will speak of each individual node as being either infected or uninfected. Motivated by the properties of contaminants like rumor and virus, we will focus from now on the progressive case in which nodes can switch from being uninfected to being infected, but do not switch in the other direction. We also focus on the setting that each node's tendency to become infected increases monotonically as more of its neighbors become active. Thus, the process can be reviewed with respect to some particular uninfected node v: as time unfolds, more and more of v's neighbors become infected; at some point, this may cause v to become infected, and v's decision may in turn trigger further infections by nodes connected with v.
Motivated by above observation, we put forward a dynamic susceptible-infected (DSI) model for the harmful cascade spreading, which can be seen as a variant of the common susceptible-infected (SI) model [3] . The susceptible nodes are those with at least one infected neighbor, and the infected nodes do not recover.
Specifically (1)
The DSI model is attached with a probability distribution = { (u), u ∈ V} to describe the dynamic of cascade initiators.
can be seen as a priori knowledge of nodes being infected initially, where u ∈ V (u) = 1 holds. The DSI model first chooses an initially infected node u ∈ V according to distribution , and then it works as follows. Let I t ⊆ V be the set of nodes that gets infected at step t ≥ 0, with I 0 = {u}. Define J t := 0≤i≤t I i be the cumulative set of nodes that get infected before step t ≥ 0. Then, at step t + 1, each node u ∈ J t may infect its out-neighbors v ∈ V \J t with an independent probability of ip(u, v). Thus, a node v ∈ V \J t is infected at step t + 1 with the probability
If node v is successfully infected, it is added into the set I t+1 . Then update J t+1 by J t+1 ←− J t ∪ I t+1 . Note that each infected node has more than one chance to activate its susceptible out-neighbors until they get infected, and each node stays infected once it is infected by others. Obviously the cumulative infected process (J t ) t≥0 is Markovian.
In a diffusion model (I t ) t≥0 , given an initially infected node u ∈ V and a set of sensors S ⊆ V, the random detection time is defined as
where a ∧ b : = min{a, b} and T max is the time interval that we observe. We assume inf{ ∅ } =+∞. The random detection time (u, S) denotes the time delay that a contaminant initiated from node u is detected by one of the sensors in S. Proof. Above properties come from the definition in Eq. (2). We put the detailed proofs in Section 8.1. Note that the detection time (u, S) can be viewed as a special type of stopping time in stochastic process theory [11] .ᮀ Let the variable X denote the random initiator with distribution , i.e., for every u ∈ V, we have
The (expected) detection time from a random initiator to one of the selected sensors in S can be define as
where E is an expectation operator under the cascade model. By the conditional expectation, we can calculate the expected detection time D(S) in the following way:
Indeed, Eq. (5) can be reached like this
where (X) denotes the -algebra generated by variable X. Eq. (5) converts the global detection time E[ (X, S)] as a summation of local detection time E[ (u, S)] with u ∈ V. This conversion provides a basic for Section 5.
Our goal is to find k nodes as sensors in a network in order to minimize the time until an infection -starting from a random initiator in the network -is detected. Formally, we formulate the problem as the following discrete optimization problem: we want to find a subset S * ⊆ V such that |S * | = k and D(S * ) = min{D(S)||S| = k, S ⊆ V}, i. e .,
where k is a given parameter determined by budget or monitor capacity. We call this as detection time minimization problem (DTM problem for short). For the sake of concreteness, we will discuss our results in terms of the DSI model in particular.
To solve the DTM problem Eq. (6), the first issue is how to calculate the objective function D(S) given a sensor set S. This question isn't as easy as its description in Eq. (4). For example, the DSI process is underspecified, since we have not prescribed the order in which newly infected nodes in a given step t will attempt to activate their neighbors. Thus, it is not initially obvious that the process is even well-defined, in the sense that it yields the same distribution over outcomes regardless of how we schedule the attempted activations. Actually the computation of D(S) is #P-hard, by showing a reduction from the positive partitioned 2-DNF assignment counting problem [9] .
Proof. We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] . For detailed proof, see Section 8.2. ᮀ Since it is intractable to exactly compute D(S) on a typically sized graph, a natural idea is to employ Monte-Carlo methods to estimate D(S), which can be implemented in two different ways as follows:
The expected detection time D(S) is obtained by directly simulating the random process of diffusion triggered by a random node, say u, chosen according to the distribution defined as Eq. (3). Let I t denote the set of nodes newly infected in the t-th iteration with I 0 = {u}. In the (t + 1)-th iteration, a node u ∈ J t : = ∪ 0≤i≤t I i attempts to activate each uninfected neighbor v / ∈ J t with the probability
The process is repeated until the diffusion hits the sensor set S at some step T(u, S), i.e.,
where inf{ ∅ } : =+∞ as convention. The detection time of this single simulation is recorded at T(u, S) ∧ T max , which is right (u, S) defined in Eq. (2) . We run such simulations for many times and finally estimate the expected detecting time D(S) by averaging over all simulations.
According to the characteristic of the DSI model, the time cost that an infected node u spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(u, v), and we denote this time cost as C(u, v), i.e.,
for any integral s ≥ 1. We can flip all coins a priori to produce a weighted graph G = (V, E, c), where an edge (u, v) is labeled with the time cost c(u, v) -a sample of random variable C(u, v). Actually such a snapshot provides an easy way to sample the detection time of any sensor set S, which exactly equals to the smallest time cost from the initially infected node, say u (chosen according to the distribution ), to the nearest sensor in S. Define c(u, S) be the smallest time cost from u to the nearest sensor in S, then the detection time of this single simulation is recorded as c(u, S) ∧ T max . We produce plenty of snapshots and finally estimate the expected detection time D(S) by averaging over all snapshots.
The snapshot simulation is equivalent to the propagation simulation in estimating D(S). More specifically, the random variables T(u, S) and c(u, S) are identically distributed.
, it is enough to prove that random variables T(u, S) and c(u, S) are identically distributed. For proof details, see Section 8.3. ᮀ
We have confirmed that the two methods are equivalent, while either has its own advantages and disadvantages. For estimating a specific D(S), the simulation method is faster, because it only needs to examine a small portion of edges while the snapshot method has to examine all the edges. For estimating the expected detection time of different sensor sets, the snapshot method outperforms the simulation method in terms of time complexity, since each snapshot serves all sensor sets. Under these observations, the heuristic algorithms use the propagation simulation, while the greedy-based algorithms employ the snapshot simulation in the experimental part.
In this section we first show the NP-hardness of DTM problem in Eq. (6), as it can be shown to contain the Set Cover problem as a simple special case, then we propose a simple greedy algorithm for the DTM problem. Proof. Consider an instance of the NP-complete Set Cover problem, defined by a collection of subsets S = {S 1 , S 2 , . . ., S m } of a ground set U = {u 1 , u 2 , . . ., u n }, we wish to know whether there exist k of the subsets whose union is equal to U. We can assume that k < n < m here. We show that this can be viewed as a special case of the optimal problem (6).
Given an arbitrary instance of the Set Cover problem, we define a corresponding directed bipartite graph with n + m nodes: there is a node i corresponding to each element u i ∈ U, a node j corresponding to each set S j ∈ S, and a directed edge (i, j) with activation probability ip(i, j) = 1 whenever u i ∈ S j . Define the probability distribution = { (u), u ∈ V} on the knowledge of nodes being initially infected as follows
The Set Cover problem is equivalent to deciding if there is a set S of k nodes in this bipartite graph with D(S) ≤ 1. Note that for the instance we have defined, activation is a deterministic process, as all probabilities are 0 or 1. Monitoring k nodes to detect diffusion initiated from U corresponding to sets in a Set Cover solution results in covering all n nodes corresponding to the ground set U, and if any set S of k nodes has D(S) ≤ 1, then the Set Cover problem must be solvable.ᮀ Since the optimization problem is NP-hard and the network is prohibitively large, we cannot compute the optimum value to verify the actual quality of approximations. Hence a natural idea is to employ the greedy algorithm as approximation method. To make better use of the greedy algorithm, we consider an equivalent optimization problem Eq. (6) as follows,
where
is defined as the (expected) remaining time for taking actions when a contaminant is detected. The above alternative formulation has key properties as described in the following Theorem 4. Theoretically, a non-negative
for all S ⊆ T. Proof. It is obvious that the remaining time function R is monotone and R(∅) =0. Now we prove that R is submodular. According to the definitions in Eqs. (5) and (10), it suffices to show that
for all S ⊆ T ⊆ V and v ∈ V \T . By the second property of in Propo-
Note that (u, T) ≤ (u, S) always works by the monotone decrease property of in Proposition 1. Now we discuss Eq. (14) separately according to the value of (u, {v}).
, which holds by the monotone decrease of .
Eq. (14) is proven and we thus get that R(S) is submodular.ᮀ By above properties in Theorem 4, the problem given in Eq. (9) can be approximated by the greedy algorithm in Algorithm 1 with the set function f : = R. For any submodular and monotone function f with f(∅) =0, the problem of finding a set S of size k that maximizes f(S) can be approximated by the greedy algorithm in Algorithm 1. The algorithm iteratively selects a new sensor u that maximizes the incremental change of f and includes the new sensor into the set S until k sensors have been selected. It is shown that the algorithm guarantees an approximation ratio of f(S)/f(S * ) ≥ 1 −1/e, where S is the output of the greedy algorithm and S * is the optimal solution [30] .
In Greedy(k, R), a thorny problem is that there is no efficient way to compute R(S) given a placement S, and we turn to run snapshots for 10, 000 trials to obtain an accurate estimate of R(S), mentioned in Section 3. This actually leads to expensive selection time. Another source of inefficiency in Greedy(k, R) is that there exists O(kN) iterations at the remaining time estimation step, where k is the size of the initial sensor set, and N is the number of nodes. When N is large, the efficiency of the algorithm is unsatisfactory.
Hence, in order to improve the efficiency of Greedy(k, R), one can either reduce the number of calls for evaluating R(S), or develop advanced heuristic algorithms which can conduct fast and approximate estimations for R(S) at the expense of accuracy guarantees.
In order to prune the estimation calls in Greedy(k, R), a natural idea is to employ the Cost-Effective Lazy Forward selection (CELF) algorithm proposed in [24] . The principle behind is that the marginal gain of a node in the current iteration cannot be more than that in previous iterations, and thus the number of estimation calls can be greatly pruned. CELF optimization produces the same sensor set as the original greedy algorithm, but it is much faster than the original one [24] .
Although CELF significantly improves Greedy(k, R), the sensor selection time is still unaffordable on large networks. In particular, in the first round to establish the initial upper bounds, CELF needs to estimate R({v}) using MC simulations for each node v, leading to N times of MC calls, which is time-consuming, especially when the network is very large. The limitation leads to a rather fundamental question that, can we derive an upper bound of R({v}) which can be used to further prune unnecessary detection time estimations (MC calls) in Greedy(k, R)?
Motivated by this question and the idea in [40] , in this section we derive an initial upper bound of R({v}) for Greedy(k, R). Based on the bound, we propose a new greedy algorithm Upper Bound based Greedy (UBG for short), which outperforms the original CELF algorithm. Essentially different from the bounds that derived for the influence spread under the IC model [40] , we here
In this part, we aim to derive an upper bound of R(v). Before introducing the upper bounds in Theorem 5, we first prepare two propositions. Let P u (v ∈ J t ) denote the probability that node v becomes infected before step t when the initially infected node is u. We have the first proposition as follows.
For v ∈ V , the remaining time R(v) under the DSI model can be calculated as
Proof. In fact, by the definition in Eq.
(2), we first have
where the fourth '=' is due to the fact that u ∈ V (u) = 1 in above derivation. ᮀ Proposition 2 reveals that we can treat the global remaining time R(v) as a summation of all T max propagation steps of local probabilities
Based on Proposition 2, a following question is, what is the relationship between two sets,
Proposition 3. For k ≥ 1, we have the following inequation
Proof. For k ≥ 1, by the definition of conditional expectation and DSI model, we obtain
where the first '≤' is due to 1 {v/ ∈J k−1 } ≤ 1, and the second '≤'
x i in the above derivation.ᮀ Proposition 3 clearly identifies the ordering relationship between two adjacent elements in the series P u (v ∈ J 0 ), P u (v ∈ J 1 ), P u (v ∈ J 2 ), · · ·. Now we simplify the results in Proposition 3 by using the form of matrix. Let IP be the infection probability matrix with the element at position (u, v) being ip(u, v). For t ≥ 0, denote the row vector
as the probabilities of nodes being infected before step t, i. e .,  u t (v) := P u (v ∈ J t ). Obviously, we have  u 0 (v) = 1 {u} (v). Now we can rewrite Proposition 3 by using the matrix form,
By iteration, we further get that u t ≤ u 0 · (E + IP) t , where E is a unit matrix. Furthermore, due to the definition of probability  u t (v), it follows that
Hereafter, define A ∧ 1 : = {a(i, j) ∧ 1} for a matrix A = {a(i, j)}.
With the above preparations, we can present the results on upper bound of remaining time as follows,
where E is a unit matrix and [A] (u,v) means the element at position
where is a prior distribution on the likelihood of nodes being the infected source.
We first use a toy example to explain how to calculate the upper bound.
Example 1. Given a directed network G in Fig. 1 with infection probability matrix in Eq. (22) . Let T max = 10, we have ((E + IP) t ∧ 1) is intractable when the network size is large. To overcome the difficulty, we adopt the following procedure to calculate Tmax−1 t=0
If t ≥1 ≤ T max − 1, it implies that we do not need to calculate (E + IP) t any more when t ≥ t ≥1 , and therefore we have
where 1 is a N × N matrix with all elements being 1. Additionally we find that
Proof. In fact, when IP is small enough, it follows that
where the approximation stems from Taylor expansion.ᮀ By Eqs. (21) and (25), the new upper bound of remaining time can be approximately put as
which is relatively tractable when IP is small.
Based on the upper bound, we propose a new UBG algorithm for early outbreak detection. First we explain the difference between UBG and CELF.
The CELF algorithm [24] exploits the submodular property to improve the original greedy algorithm. However, CELF demands N (the number of nodes in the network) remaining time estimations to establish the initial bounds of marginal increments, which is time expensive on large graphs.
In contrast, the proposed Upper Bound based Greedy (UBG) algorithm uses the derived new bound to further reduce the number of remaining time estimations, especially in the initialization step. This way, the nodes are ranked by their upper bound scores which can be used to prune unnecessary calculations in the CELF algorithm. We use Example 2 for illustration.
Even with the UBG algorithm proposed in Section 5, its running time is still unbearable and may not be suitable for large social networks. Although UBG can greatly reduce the number of remaining time estimation calls, each estimation call is very timeconsuming, as it needs produce enough samples and average them. Hence a possible alternative to further accelerate UBG is to employ heuristics to approximate the remaining time R(S) or the marginal return R(S ∪ {u}) − R(S) in UBG. Along this idea, we here introduce two accelerative algorithms to address the inefficiency of UBG: Quickest-Path-UBG and Local-Reduction-UBG. The experimental results will show that (i) both of them are efficient in terms of running time, and (ii) the former one applies to sparse networks better, while the latter one is more effective in dense networks in terms of detection time.
In this part we introduce a tractable heuristic to approximate the remaining time R(S). From Eq. (5) and Eq. (10), the key point of estimating R(S) is how to estimate E[ (u, S)]. Note that E[ (u, S)] measures the expected time delay of a diffusion initiated from node u propagating to sensor set S. An intuitive idea is that the most likely propagation path should be the quickest path from node u to the set S. Hence, a question is how to measure the quickest path from node u to the set S? The proposed Quickest-Path-UBG is inspired by answering the question.
According to the geometric distribution, if a random variable X is distributed geometrically with a parameter p, it follows that E[X] = 1/p. Since the time that a node u spends in infecting its uninfected neighborhood v is distributed geometrically with the parameter ip(u, v), the value 1/ip(u, v) should be the expected time cost that an infection propagates from node u to node v along the edge (u, v). Fig. 2 shows an example.
For any two nodes u and v, let d(u, v) be the smallest time cost among the paths connecting u and v. For example, d( , ) = min{5 + 3.3, 10 + 5} = 8.3 in Fig. 2 . Then, for a subset S ⊆ V, we define d(u, S) := min v ∈ S d(u, v). (27) Intuitively, d(u, S) denotes the expected time that an infection propagates from u to S along the quickest path in the graph G = (V, E, m) . Therefore, a fundamental question arises, is there some approximate relationship between the detection time and the shortest distance. Theoretically we have Theorem 6. When the shortest path from u to S in graph G = (V, E, m) is unique, the quantity d(u, S) ∧ T max can be used to approximate the detection time E[ (u, S)], i.e.,
Proof. In fact, when the shortest path from u to S in graph G = (V, E, m) is unique, it follows that
where we use an result borrowed from probability theory for the approximation: consider random variables
.ᮀ Based on Theorem 6, combining Eq. (5), Eq. (10) and Eq. (28), we have the following derivation,
Hence we can approximate the remaining time
rather than the heavy Monte-Carlo estimation MC(S ∪ {u}) in the 09th row of UBG algorithm. We nail down this new method as Quickest-Path-UBG, which is shown explicitly in Algorithm 3.
Note that the reason we call this acceleration as Quickest-Path-UBG rather than Shortest-Path-UBG lies in that, our Quickest-Path-UBG introduces a new measurement to reflect the infection time cost rather than the distance in common sense.
In this part we turn to a tractable heuristic to approximate the marginal return ı u (S) : = R(S ∪ {u}) − R(S) in the 09th row of UBG algorithm. To facilitate the description, we will approximate the marginal reduction D(S) − D(S ∪ {u}) rather than the marginal return R(S ∪ {u}) − R(S), due to the fact that R(S) = T max − D(S).
In sociology literature, degree and other centrality-based heuristics are commonly used to select influential nodes in social networks [37] . These methods usually looked over the multi-step complexity of network diffusions, and assumed that the infection can propagate ahead only one hop before the end. In other words, if a node initiates an infection, the diffusion spread is at most its first-order neighbors. We call this as one-hop assumption hereafter. Especially, experimental results in [24] showed that selecting vertices with maximum indegree as sensors results in earlier infection detection than other heuristics, which validated the rationality of one-hop assumption to some extent. Under this observation, we will propose a tractable method upon the one-hop assumption to approximate the marginal reduction D(S) − D(S ∪ {u}).
We first introduce a useful result in probability theory: if the random variable X i is distributed geometrically with parameter p i for i = 1, · · ·, n and they are independent, then ∧ i=1,· · ·,n X i is distributed geometrically with parameter 1 − According to the characteristic of the DSI model, the time cost that an infected node w spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(w, v). If the initially infected node w has multi-neighbors in sensor set S, the infections to different neighbors are independent. Under the onehop assumption, the expected detection time of w can thus be calculated like this
If the initially infected node w has no neighbors in sensor set S, the expected detection time of w is defined as T max . If w ∈ S, it follows that E[ (w, S)] = 0. Fig. 3 presents an example to show the calculation result of expected detection time E[ (w, S)] for each node w ∈ V under the one-hop assumption. With the preparations above, we are about to calculate the marginal reduction ı u (S) = D(S) − D(S ∪ {u}) under the one-hop assumption. Assume the set S has been selected as sensor set to detect infections. When considering adding another node u as a new sensor into S, the reduction of the expected detection time can be put as follows
which is by Eq. (5). We now use Theorem 7 to present the concrete expression of the right part of Eq. (30) . Similar with the definition of Eq. (1), we predefine the parents of set S as follows,
Theorem 7. In the DSI model with infection probability ip(w, v) on a directed graph G = (V, E), the reduction ı u (S) of the expected detection time can be calculated like this
under one-hop assumption, where f(u, S) is defined as The reason that we call it as Local-Reduction-UBG is that we only consider the local change in monitoring {u} and Par(u) \ Par(S) Table 2 Statistics of the four real-world networks.
Digger Twitter Epinions Small-world #Node 8194 32,986 51,783 200,000 #Edge 56,440 763,713 476,
We conduct experiments on both synthetic and real-world data sets to evaluate the UBG algorithm, the Quickest-Path-UBG algorithm and Local-Reduction-UBG algorithm. We implement the algorithms using C++ with the Standard Template Library (STL). All experiments are run on a Linux (Ubuntu 11.10) machine with six-core 1400 MHz AMD CPU and 32 GB memory.
Three real and one synthetic data sets are used for comparisons. The Digger data 1 is a heterogeneous network, including Digg stories, user actions (submit, digg, comment and reply) with respect to the stories, and friendship relations among users. The Twitter and Epinions data sets can both be obtained from Stanford Datasource 2 . Epinions is a general consumer review site where visitors can read reviews about a variety of items to help them decide a purchase. The synthetic Small-world data set is the type of graphs in which each node can be reached by a small number of hops. For smallworld model we set the parameter of the nearest neighbors k = 15 and the rewiring probability p = 0.1.
The above networks are representative ones, covering a variety of networks with different types of relations and sizes. The details of the data sets are listed in Table 2 where degree means in-degree. In our experiments, an undirected graph is regarded as a bidirectional graph.
We compare the UBG in Algorithm 2, the Quickest-Path-UBG in Algorithm 3, and Local-Reduction-UBG in Algorithm 4 with both the greedy and heuristic algorithms.
• CELF [24] . The state-of-the-art greedy algorithm, where uses 10,000 snapshots in the whole process for any network. • DEGREE [37] . A heuristic algorithm based on "degree centrality", with high-degree nodes as key ones. The seeds are the nodes with the k highest in-degrees. • INTER-MONITOR DISTANCE [35] . A heuristic algorithm which requires any pair of sensors to be at least d hops away, where d is as large as it can choose k monitors. • PageRank [5] . A link analysis algorithm which ranks the importance of pages in a Web graph. We implement the power method with a damping factor of 0.85, and pick the k highest-ranked nodes as seeds. • Random. It simply selects k random vertices in the graph as the seed set, which is taken as the baseline.
In our experiments, to obtain the detection time of sensor sets provided by heuristic algorithms, we run Monte-Carlo simulation on the networks 10, 000 times and calculate the mean. The simple greedy algorithm is not compared because many works have reported that CELF has the same optimization result and less running time. Since the DEGREE heuristic is the state-of-the-art [37] , we do not implement heuristics such as distance centrality and betweenness centrality-based heuristics.
We mainly report results on a uniform infection probability of 0.1 assigned to each directed link in the network, i.e., ip(u, v) = 0.1 for any directed edge (u, v) ∈ E. One can refer to the work [14, 33, 36] for learning real values of the parameters {ip(u, v) : (u, v) ∈ E} from available data. Besides, we let the time horizon T max = 30 and the prior distribution be uniform in the network.
In Section 3, we proposed two Monte-Carlo methods to estimate D(S). Table 3 shows the estimation results of Propagation Simulation and Snapshot Simulation. Here we select 10 nodes with the highest in-degrees in each network as the sensor set S. We can find that Propagation Simulation and Snapshot Simulation release almost the same estimation results, which confirms their equivalence in estimating D(S). Fig. 4 shows the cumulative time cost of these two Monte-Carlo methods. Here we randomly select 10 sensor sets {S 1 , S 2 , . . ., S 10 } from the Digger data and every sensor set has five sensor nodes, i.e. |S i | = 5 for all i = 1, 2, . . ., 10. We can see that the cumulative time cost of Propagation Simulation increases linearly, while that of Snapshot Simulation has a big jump at the first sensor set and then increases slowly. The reason behind is that Snapshot Simulation needs to establish numerous snapshots to estimate the first D(S 1 ), and these established snapshots can be reused in the posterior estimations of {D(S i )} 10 i=2 . Table 4 shows the gap between the real value of remaining time R(S) and its upper bounds. Here we also select ten nodes with the highest in-degrees in each network as the sensor set S.
In Table 5 , we compare the number of estimation or approximation calls at the first 10 iterations among CELF, UBG, Quickest-Path-UBG and Local-Reduction-UBG on the four data sets. From the results, we can observe that the call number in UBG and its two accelerations are significantly reduced compared to that in CELF, especially at the first round. One may notice that in Table 5 , CELF occasionally defeats our methods, but the total call number of our methods are much less than CELF. As listed in 5, the total call number of the first 10 iterations of UBG and its two accelerations, compared to CELF, is reduced at a rate of 94% at least on the four data sets. Similarly, at least 81% reduction of call numbers of CELF can be observed in the first 50 iterations. From the observation, we can conclude that our UBG is more efficient than CELF on large networks.
Detection time measures the time delay of a message propagated from a diffusion source to a sensor. We run tests on the four data sets and obtain detection time results w.r.t. parameter k (the number of sensors), where k increases from 1 to 50 as shown in Fig. 5 . UBG, as an updated version of CELF, has competitive results on the four data sets. More importantly, the detection times of UBG and CELF are completely the same in the four figures, which explains again that UBG and CELF share the same results in sensor selection. The only difference between UBG and CELF is the number of remaining time estimation calls. The Quickest-Path-UBG and Local-reduction-UBG always perform better than other heuristics. More specifically, if the networks are sparse like Digger and Epinions, the Quickest-Path-UBG is more competitive; if the networks are dense like Twitter and Smallworld, the Local-reduction-UBG is more competitive.
Selection time measures the time cost of an algorithm selecting sensors. Fig. 6 shows the time cost of selecting sensors with k = 50. UBG is 4-8 times faster than CELF. One may argue that such a low improvement of UBG can be neglected in large networks. In fact, UBG scales well to large networks. Because, with the size of a network increase, Monte-Carlo simulations take more time, and thus UBG will achieve better performance by pruning more unnecessary Monte-Carlo simulations.
As to heuristics, Degree and Random are very fast in selecting candidate nodes, which take less than 1 second. Quickest-Path-UBG and Local-Reduction-UBG are exciting and adoptable, due to their good performance in detection time. The PageRank and Inter-Monitor Distance are slightly slower and undesirable, in view of their poor performance in detection time.
We run tests on Twitter data set and obtain the selection time w.r.t. parameter ip (the infection probability). In the experiments, ip increases from 0.1 to 0.5. We assign a uniform infection probability ip to each directed link under the DSI Model. From the results in Fig. 7 , we can conclude that with the parameter ip growing larger, the CELF is more time-consuming. By contrast, the UBG and its accelerations are robust and insensitive to the parameter ip. 2) First, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∪ T(u, S 2 ). For any t ∈ T(u, S 1 ∪ S 2 ), I t ∩ (S 1 ∪ S 2 ) / = ∅ holds. From I t ∩ (S 1 ∪ S 2 ) = (I t ∩ S 1 ) ∪ (I t ∩ S 2 ) / = ∅, we know either I t ∩ S 1 or I t ∩ S 2 is nonempty, which implies t ∈ T(u, S 1 ) ∪ T(u, S 2 ). Hence T(u, S 1 ∪ S 2 ) ⊂ T(u, S 1 ) ∪ T(u, S 2 ) works. Conversely, for any t ∈ T(u, S 1 ) ∪ T(u, S 2 ), t will be located in T(u, S 1 ) or T(u, S 2 ). Without loss of generality, we assume t ∈ T(u, S 1 ), which means
is nonempty, which indicates t ∈ T(u, S 1 ∪ S 2 ). Hence T(u, S 1 ) ∪ T(u, S 2 ) ⊂ T(u, S 1 ∪ S 2 ) also works.
Second, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ). We take infimum operations on both sides of the first step conclusion, i.e., inf T(u, S 1 ∪ S 2 ) = inf T(u, S 1 ) ∪ T(u, S 2 ). Since all these sets are be composed of integers, we obviously get T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ).
Finally, we take the infimums with T max on both sides of the second step conclusion, i.e., T(u, S 1 ∪ S 2 ) ∧ T max = T(u, S 1 ) ∧ T(u, S 2 ) ∧ T max , which means (u, S 1 ∪ S 2 ) = (u, S 1 ) ∧ (u, S 2 ) by the definition of .
3) Similarly, in order to prove this property, we only need to , {u}) , i.e., for any t ≥ 0, the probability that t is located in T(u, {v}) equals the probability that t is located in 4) If S ⊂ T, in order to prove this property, we only need to prove T(u, S) ⊂ T(u, T ) and notice the fact that the infimum of a larger set is smaller. For any t ∈ T(u, S), I t ∩ S / = ∅ holds. Besides, as S ⊂ T, I t ∩ S / = ∅ implies I t ∩ T / = ∅, which means t is also in T(u, T ). So, T(u, S) ⊂ T(u, T ), by which we can easily conclude the property, is verified. 5) If T(u, S) is empty, then T(u, S) =+∞. (u, S) =+ ∞ ∧ T max = T max ∈ [0, T max ]. If T(u, S) is nonempty, then every element in it is greater than 0 and notice that (u, S) = T(u, S) ∧ T max ≤ T max . Therefore, we can conclude 0 ≤ (u, S) ≤ T max .
We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] , denoted by (A1).
(A1): Let E be a subset of N 2 and X i , Y j , (i, j) ∈ E are pairwise distinct variables. We define a formula F : = ∨ {X i ∧ Y j : (i, j) ∈ E}, which is a disjunction of all the conjunctive clauses X i ∧ Y j , (i, j) ∈ E. How many valuations that satisfy the formula F? Then we introduce the following problem (A2).
(A2): G = (V, E) is a directed graph, where each edge e ∈ E is associated with a weight w e which is geometrically distributed with a parameter p, i.e. P(w e = i) = p(1 − p) i−1 , i ≥ 1. s, t are two nodes in G and k is a positive integer. What is the probability of the event that the shortest path from s to t has length T at least k?
Now we prove that (A1) is reducible to (A2). We map the variable X i to a node x i and variable Y j to a node y j . If (i, j) exists in E, then we add an edge from x i to y j . Also, we add a source node s and an edge (s, x i ), associated with weight w (s,x i ) which is geometrically distributed with parameter 1/2, for any i ∈ {i : (i, j) ∈ E}. Thus, w (s,x i ) equals 1 with probability 1/2 and strictly greater than 1 with probability 1/2. Similarly, we add a terminal node t and an edge from each y j to t with a weight identically distributed as w (s,x i ) . So we construct a graph with geometrically distributed weights on edges. Now we can build a probability-preserving bijection between the valuations of X i , Y j and the subgraphs of the constructed random graph: for a valuation , the corresponding subgraph is the one where each edge adjacent to x i has length 1 iff (X i ) = 1. The same argument for Y j .
We set k = 3 and claim the fact that the probability P(T ≥ 3) in (A2) is the number of valuations satisfying that F is divided by 2 N , where N is the number of variables. Indeed, the fact is based on the following observation: a valuation is true iff some adjacent pair of X i and Y j is true if the incident edges to the corresponding x i and y j have length 1 in the corresponding subgraph. Note that any edge of length greater than 1 is irrelevant as the structure of the graph, which ensures it can never be part of a path of length 3 from s to t, For any path from s to t must jump three times: from s to some x i , from x i to some y j , and from y j to t. Thus, from what we have stated, (A1) can be reducible to (A2), which implies (A2) is #P-hard.
We aim to solve the problem (denoted by (A)) of calcu-
In order to prove the random variables T(u, S) and c(u, S) are identically distributed, we need to prove P[T (u, S) = t] = P[c(u, S) = t] for any t > 0.
On the one hand, we can prove P[T (u, S) = t] ≤ P[c(u, S) = t]. Given t > 0, we find that for any simulation with T(u, S) = t, a snapshot with c(u, S) = t can be constructed. By definition, if T(u, S) = t, then J i ∩ S = ∅, i = 0, 1, · · ·, t − 1 and J t ∩ S / = ∅. For any node v i+1 ∈ J i+1 \J i / = ∅, suppose that v i+1 is infected through the edge (s, v i+1 ) with s ∈ J i , v i+1 ∈ J i+1 . Denote n s = inf{j : s ∈ J j }. Then we let the edge (s, v i+1 ) take weight c(s, v i+1 ) = i + 1 − n s . For any node s ∈ J t , we denote n s , i.e. n s = sup{j : s ∈ J j }. Then for any v ∈ V \J t , let the edge (s , v i+1 ) take weight c(s , v) = t − n s + C, where C is sampled from a geometric distribution. In addition, we take a weight sampled from geometric distribution on each of the rest edges. In such operations, a snapshot with c(u, S) = t is constructed, which implies P[T (u, S) = t] ≤ P[c(u, S) = t].
On the other hand, we can also get P[c(u, S) = t] ≤ P[T (u, S) = t]. For any snapshot with c(u, S) = t, a simulation with T(u, S) = t can be constructed. We let node u be the initiator, i.e. J 0 = {u} and denote the smallest number in the set {c(u, v) : u ∈ J 0 , v ∈ V \J 0 } by n 0 . Meanwhile, we denote the set of nodes {v : c(u, v) = n s , u ∈ J 0 , v ∈ V \J 0 } by n 0 . Let J n 0 = J 0 ∪ n 0 . Complementally, we define all the J i = J 0 , i = 0, · · ·, n 0 − 1. Denote the smallest number in the set {c(u, v) : u ∈ J t , v ∈ V \J t } by n t and the set of nodes {v : c(u, v) = n s , u ∈ J t , v ∈ V \J t } by n t . Let J t+n t = J n t = J t ∪ n t . Complementally, we define J i : = J t for all i = t, · · ·, t + n t − 1. We now can construct a simulation of the diffusion model (I t ) t≥0 = (J t+1 \J t ) t≥0 . Hence, we have a simulation with T(u, S) = t, which implies P[c(u, S) = t] ≤ P[T (u, S) = t].
Therefore, we have P[T (u, S) = t] = P[c(u, S) = t] for any t > 0, which implies that T(u, S) and c(u, S) are identically distributed. Hence, the simulation and snapshot simulations are equivalent in estimating D(S).
In this paper, we discussed scalable solutions to detection time minimization problem for early detection of dynamic harmful cascades in networks. To address this problem, we proposed an upper bound based greedy (UBG) algorithm and its two accelerative algorithms to cater for lager scale networks. The UBG solution guarantees a near-optimal detection time, pruning at least 80% Monte-Carlo calls of CELF. The novel accelerations on UBG can significantly reduce the selection time and further achieve at least 10 2 times speed-raising. There are several interesting future directions. First, the detection time D(S) is # P-hard to calculate exactly, it is still a question how to design an efficient algorithm to estimate D(S) with a theoretical guarantee. Second, the infection probability ip and prior distribution are predefined in this paper, how to learn these parameters for the DSI model from available cascade data still remains unexplored.
|