CORD-19:6e2db98aa242e3e84116afdfad5e250943c55ac5 JSONTXT 12 Projects

Annnotations TAB TSV DIC JSON TextAE

Id Subject Object Predicate Lexical cue
T1 0-69 Sentence denotes Early detection of dynamic harmful cascades in large-scale networks ଝ
T2 71-79 Sentence denotes Abstract
T3 80-217 Sentence denotes Quickly detecting harmful cascades in networks can allow us to analyze the causes and prevent further spreading of destructive influence.
T4 218-370 Sentence denotes Since it is often impossible to observe the state of all nodes in a network, a common method is to detect harmful cascades from sparsely placed sensors.
T5 371-567 Sentence denotes However, the harmful cascades are usually dynamic (e.g., the cascade initiators and diffusion trajectories can change over the time), which can severely destroy the robustness of selected sensors.
T6 568-672 Sentence denotes Meanwhile the large scale of current networks greatly increases the time complexity of sensor selection.
T7 673-830 Sentence denotes Motivated by the observation, in this paper we investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks.
T8 831-1082 Sentence denotes Specifically, we first put forward a dynamic susceptible-infected model to describe harmful cascades, and formally define a detection time minimization (DTM) problem which focuses on effective sensors placement for early detection of dynamic cascades.
T9 1083-1218 Sentence denotes We prove that it is #P-hard to calculate the objective function exactly and propose two Monte-Carlo methods to estimate it efficiently.
T10 1219-1303 Sentence denotes We prove the NP-hardness of DTM problem and design a corresponding greedy algorithm.
T11 1304-1436 Sentence denotes Based on that, we propose an efficient upper bound based greedy (UBG) algorithm with the theoretical performance guarantee reserved.
T12 1437-1530 Sentence denotes To further meet different types of large-scale networks, we propose two accelerations of UBG:
T13 1531-1643 Sentence denotes Quickest-Path-UBG for sparse networks and Local-Reduction-UBG for dense networks to improve the time complexity.
T14 1644-1760 Sentence denotes The experimental results on synthetic and real-world social networks demonstrate the practicality of our approaches.
T15 1762-1880 Sentence denotes Harmful cascade spreading through kinds of network structures has become more and more ubiquitous in the modern world.
T16 1881-2038 Sentence denotes A contagious disease like Severe Acute Respiratory Syndrome (SARS) can spread quickly through a population contact network and lead to an epidemic [12, 34] .
T17 2039-2162 Sentence denotes A computer virus on a few servers can fast spread to other servers or computers in a computer connection network [15, 17] .
T18 2163-2281 Sentence denotes In a similar vein, a rumor started by a few individuals can spread quickly through the online social network [2, 26] .
T19 2282-2473 Sentence denotes It is crucial to detect the harmful cascades as soon as they happen or shortly thereafter, since it allows us to study the causes and prevent further spreading of harmful influence [32, 38] .
T20 2474-2698 Sentence denotes In practice, it is commonly infeasible or unaffordable to monitor all nodes at all times, and therefore a common way to detect cascades is to select important nodes where we can place sensors for monitoring [7, 19, 31, 39] .
T21 2699-2979 Sentence denotes However, existing methods usually viewed the cascade data as static and deterministic, ignoring an important fact that the harmful cascades are usually dynamic and time-variant, in the sense that the cascade initiators and diffusion trajectories can change randomly over the time.
T22 2980-3117 Sentence denotes Actually the dynamic brings new challenges to network monitoring, since it can severely destroy the robustness of selected sensors [22] .
T23 3118-3288 Sentence denotes Meanwhile the existing sensor selection algorithms can only work for small networks and not be scalable well to large networks of the https://doi.org/10.1016/j.jocs.2017.
T24 3289-3327 Sentence denotes 10 .014 1877-7503/© 2017 Elsevier B.V.
T25 3328-3359 Sentence denotes All rights reserved. day [28] .
T26 3360-3533 Sentence denotes Motivated by these observations, in this paper we investigate the scalable solutions to sensor selection problem for early detection of dynamic harmful cascades in networks.
T27 3534-3618 Sentence denotes To model the dynamic property, we carry on our work from a model-driven perspective.
T28 3619-3828 Sentence denotes To provide a unified framework, we model all the above examples as an infection spreading in a network G = (V, E), where V is the set of nodes (i.e. individuals) and E is the set of edges (i.e. relationships).
T29 3829-3923 Sentence denotes In a population network, the infection is the disease that is transmitted between individuals.
T30 3924-4108 Sentence denotes In the example of a computer virus spreading in a network, the infection is the computer virus, while for the case of a rumor spreading in a social network, the infection is the rumor.
T31 4109-4444 Sentence denotes Under this unified network framework, we specifically propose a new dynamic cascade model to describe harmful cascade diffusions, and define a detection time minimization (DTM) problem S * = argmin |S|=k,S⊆V D(S), where k is a given parameter determined by budget or monitor capacity, S is sensor nodes set, and D(S) is detection time.
T32 4445-4553 Sentence denotes The DTM problem focuses on effective sensor nodes selection for early detection of dynamic harmful cascades.
T33 4554-4644 Sentence denotes We first prove the NP-hardness of DTM problem and design a corresponding greedy algorithm.
T34 4645-4857 Sentence denotes Considering the limitation caused by greedy algorithm inefficiency, we then propose two alternative Monte-Carlo methods, each having its pros and cons, to estimate the #P-hard objective function D(S) efficiently.
T35 4858-5086 Sentence denotes To further address the scalability issue, we propose an efficient upper bound based greedy (UBG) algorithm and its two accelerations Quickest-Path-UBG and Local-Reduction-UBG to cater for different types of large-scale networks.
T36 5087-5202 Sentence denotes These two accelerations are close to UBG in performance but orders of magnitude faster than UBG in time complexity.
T37 5203-5306 Sentence denotes Experiments on synthetic and real-world social networks demonstrate the practicality of our approaches.
T38 5307-5505 Sentence denotes Detecting the whole network by monitoring finite sensor nodes has been widely applied to detect water contaminations in a water distribution network [19] and virus outbreaks in a human society [7] .
T39 5506-5631 Sentence denotes Some early work places sensors by topological measures, e.g. targeting high degree nodes [32] or highly connected nodes [8] .
T40 5632-5760 Sentence denotes However the effects along this idea are commonly unsatisfied, since they ignore the multi-step complexity of network diffusions.
T41 5761-5922 Sentence denotes The "Battle of Water Sensor Network" challenge [31] motivated a number of works to optimize sensor placement in water networks to detect contamination [16, 18] .
T42 5923-6183 Sentence denotes By utilizing submodular property, they proposed to optimize the sensor placement with different criterions such as maximizing the probability of detection, minimizing the detection time, or minimizing the size of the subnetwork affected by the phenomena [24] .
T43 6184-6315 Sentence denotes In their works the data are a set of deterministic scenarios and the random property of diffusion are not taken into consideration.
T44 6316-6384 Sentence denotes What's more, their methods are not scalable to large networks [28] .
T45 6385-6575 Sentence denotes In addition to the above two types of methods, another related work [1] set sensors along fixed paths in the network so as to gather sufficient information to locate possible contaminations.
T46 6576-6738 Sentence denotes Early detection of contagious outbreaks by monitoring the neighborhood (friends) of a randomly chosen node (individual) was studied by Christakis and Fowler [7] .
T47 6739-6962 Sentence denotes Krause et al. [20] presented efficient schedules for minimizing energy consumption in battery operated sensors, while other works analyzed distributed solutions with limited communication capacities and costs [13, 21, 22] .
T48 6963-7047 Sentence denotes In addition, Berry et al. [4] equated the placement problem with a p-median problem.
T49 7048-7173 Sentence denotes Li et al. [25] proposed a dynamic-programming (DP) based greedy algorithm which is with a near-optimal performance guarantee.
T50 7174-7351 Sentence denotes By contrast, our work is geared to investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks from a model-driven prospective.
T51 7352-7659 Sentence denotes The main difference from previous works is that we take dynamic properties of cascades into consideration: (1) the cascade initiator is dynamic since we do not know who will initiate another harmful diffusion next time, and (2) the diffusion process is dynamic since the propagation trajectory is uncertain.
T52 7660-7789 Sentence denotes Under these two dynamic sources, we aim to accurately and fast select k nodes as sensors for early detection of harmful cascades.
T53 7790-7958 Sentence denotes Besides, as the size of network becomes more and more large-scale, we also need to ensure the effectiveness of the proposed sensor selection methods for large networks.
T54 7959-8093 Sentence denotes To minimize the time to find the dynamic harmful cascades, our search attempts to optimize the selection of sensors in a scalable way.
T55 8094-8138 Sentence denotes Our contributions are summarized as follows:
T56 8139-8198 Sentence denotes • We formulate a Detection Time Minimization (DTM) problem.
T57 8199-8469 Sentence denotes To describe the dynamic harmful cascade on networks, we put forward a dynamic cascade model for the harmful diffusion. • We prove that it is #P-hard to exactly calculate the objective function of DTM problem and propose two equivalent Monte-Carlo methods to estimate it.
T58 8470-8631 Sentence denotes Each has its advantages and disadvantages. • We show the NP-hardness of DTM problem as it can be shown to contain the Set Cover problem as a simple special case.
T59 8632-8913 Sentence denotes We convert the DTM problem to a constrained max optimization problem, prove the submodularity of the new objective function, and then employ the greedy algorithm which achieves an approximation ratio of 1 − 1/e. • We theoretically establish new upper bounds for the remaining time.
T60 8914-9167 Sentence denotes Based on these bounds, we further propose a new Upper Bound based Greedy (UBG in short) algorithm which can significantly reduce the number of estimations in greedy-based algorithms, especially at the initial step. • We propose two accelerations of UBG:
T61 9168-9354 Sentence denotes Quickest-Path-UBG and Local-Reduction-UBG to address the DTM problem for large networks, which are close to UBG in performance but orders of magnitude faster than UBG in time complexity.
T62 9355-9485 Sentence denotes In addition, our methods not only aim at early detection of harmful cascades, but also shed light on a host of other applications.
T63 9486-9639 Sentence denotes For example, the goal of Emerging Topic Detection is to identify emergent topics in a social network, assuming full access to the stream of all postings.
T64 9640-9894 Sentence denotes Providers, such as Twitter or Facebook, have an immediate access to all tweets or postings as they are submitted to their server [6, 29] , while outside observers need an efficient mechanism to monitor changes, such as the methods developed in this work.
T65 9895-10150 Sentence denotes Another example, an emerging trend in algorithmic stock trading is the use of automatic search through the Web and social networks for pieces of information that can be used in trading decisions before they appear in the more popular news sites [23, 27] .
T66 10151-10255 Sentence denotes Similarly, intelligence, business and politics analysts are scanning online sources for new information.
T67 10256-10358 Sentence denotes The theoretical framework and method established in this paper can also be used in these applications.
T68 10359-10405 Sentence denotes The rest of the paper is organized as follows.
T69 10406-10467 Sentence denotes Section 2 presents the cascade model and problem formulation.
T70 10468-10581 Sentence denotes In Section 3 we show the # P-hardness of detection time calculation and derive two equivalent estimation methods.
T71 10582-10738 Sentence denotes Section 4 is devoted to the NP-hardness of DTM problem, the submodular properties of transformed objective function, and the corresponding greedy algorithm.
T72 10739-10860 Sentence denotes To prune the unnecessary estimation calls, we analysis the upper bounds for the new proposed UBG algorithm in Section 5 .
T73 10861-10930 Sentence denotes Section 6 presents two accelerations of UBG for large scale networks.
T74 10931-10972 Sentence denotes Section 7 shows the experimental results.
T75 10973-11008 Sentence denotes We conclude the paper in Section 8.
T76 11009-11051 Sentence denotes Table 1 outlines the major variables used.
T77 11052-11230 Sentence denotes In this section we start with a description of dynamic cascade model and then we define the Detection Time Minimization (DTM) problem abstracted from the early detection problem.
T78 11231-11409 Sentence denotes We start somewhat with the framework of [24] , where the models introduced are essentially descriptive to specify a joint distribution over all nodes' behavior in a global sense.
T79 11410-11595 Sentence denotes In contrast, we focus on more operational models, from mathematical sociology [3] and interacting particle systems [10] , to explicitly represent the step-by-step dynamics of infection.
T80 11596-11805 Sentence denotes By considering operational models for the dynamic diffusion of a harmful item through a network G, represented by a directed graph, we will speak of each individual node as being either infected or uninfected.
T81 11806-12027 Sentence denotes Motivated by the properties of contaminants like rumor and virus, we will focus from now on the progressive case in which nodes can switch from being uninfected to being infected, but do not switch in the other direction.
T82 12028-12165 Sentence denotes We also focus on the setting that each node's tendency to become infected increases monotonically as more of its neighbors become active.
T83 12166-12450 Sentence denotes Thus, the process can be reviewed with respect to some particular uninfected node v: as time unfolds, more and more of v's neighbors become infected; at some point, this may cause v to become infected, and v's decision may in turn trigger further infections by nodes connected with v.
T84 12451-12658 Sentence denotes Motivated by above observation, we put forward a dynamic susceptible-infected (DSI) model for the harmful cascade spreading, which can be seen as a variant of the common susceptible-infected (SI) model [3] .
T85 12659-12766 Sentence denotes The susceptible nodes are those with at least one infected neighbor, and the infected nodes do not recover.
T86 12767-12783 Sentence denotes Specifically (1)
T87 12784-12904 Sentence denotes The DSI model is attached with a probability distribution = { (u), u ∈ V} to describe the dynamic of cascade initiators.
T88 12905-13000 Sentence denotes can be seen as a priori knowledge of nodes being infected initially, where u ∈ V (u) = 1 holds.
T89 13001-13119 Sentence denotes The DSI model first chooses an initially infected node u ∈ V according to distribution , and then it works as follows.
T90 13120-13201 Sentence denotes Let I t ⊆ V be the set of nodes that gets infected at step t ≥ 0, with I 0 = {u}.
T91 13202-13293 Sentence denotes Define J t := 0≤i≤t I i be the cumulative set of nodes that get infected before step t ≥ 0.
T92 13294-13417 Sentence denotes Then, at step t + 1, each node u ∈ J t may infect its out-neighbors v ∈ V \J t with an independent probability of ip(u, v).
T93 13418-13488 Sentence denotes Thus, a node v ∈ V \J t is infected at step t + 1 with the probability
T94 13489-13557 Sentence denotes If node v is successfully infected, it is added into the set I t+1 .
T95 13558-13601 Sentence denotes Then update J t+1 by J t+1 ←− J t ∪ I t+1 .
T96 13602-13782 Sentence denotes Note that each infected node has more than one chance to activate its susceptible out-neighbors until they get infected, and each node stays infected once it is infected by others.
T97 13783-13849 Sentence denotes Obviously the cumulative infected process (J t ) t≥0 is Markovian.
T98 13850-13990 Sentence denotes In a diffusion model (I t ) t≥0 , given an initially infected node u ∈ V and a set of sensors S ⊆ V, the random detection time is defined as
T99 13991-14064 Sentence denotes where a ∧ b : = min{a, b} and T max is the time interval that we observe.
T100 14065-14088 Sentence denotes We assume inf{ ∅ } =+∞.
T101 14089-14225 Sentence denotes The random detection time (u, S) denotes the time delay that a contaminant initiated from node u is detected by one of the sensors in S.
T102 14226-14232 Sentence denotes Proof.
T103 14233-14286 Sentence denotes Above properties come from the definition in Eq. (2).
T104 14287-14329 Sentence denotes We put the detailed proofs in Section 8.1.
T105 14330-14549 Sentence denotes Note that the detection time (u, S) can be viewed as a special type of stopping time in stochastic process theory [11] .ᮀ Let the variable X denote the random initiator with distribution , i.e., for every u ∈ V, we have
T106 14550-14656 Sentence denotes The (expected) detection time from a random initiator to one of the selected sensors in S can be define as
T107 14657-14716 Sentence denotes where E is an expectation operator under the cascade model.
T108 14717-14820 Sentence denotes By the conditional expectation, we can calculate the expected detection time D(S) in the following way:
T109 14821-14861 Sentence denotes Indeed, Eq. (5) can be reached like this
T110 14862-14917 Sentence denotes where (X) denotes the -algebra generated by variable X.
T111 14918-15033 Sentence denotes Eq. (5) converts the global detection time E[ (X, S)] as a summation of local detection time E[ (u, S)] with u ∈ V.
T112 15034-15081 Sentence denotes This conversion provides a basic for Section 5.
T113 15082-15245 Sentence denotes Our goal is to find k nodes as sensors in a network in order to minimize the time until an infection -starting from a random initiator in the network -is detected.
T114 15246-15430 Sentence denotes Formally, we formulate the problem as the following discrete optimization problem: we want to find a subset S * ⊆ V such that |S * | = k and D(S * ) = min{D(S)||S| = k, S ⊆ V}, i. e .,
T115 15431-15501 Sentence denotes where k is a given parameter determined by budget or monitor capacity.
T116 15502-15578 Sentence denotes We call this as detection time minimization problem (DTM problem for short).
T117 15579-15677 Sentence denotes For the sake of concreteness, we will discuss our results in terms of the DSI model in particular.
T118 15678-15797 Sentence denotes To solve the DTM problem Eq. (6), the first issue is how to calculate the objective function D(S) given a sensor set S.
T119 15798-15856 Sentence denotes This question isn't as easy as its description in Eq. (4).
T120 15857-16033 Sentence denotes For example, the DSI process is underspecified, since we have not prescribed the order in which newly infected nodes in a given step t will attempt to activate their neighbors.
T121 16034-16227 Sentence denotes Thus, it is not initially obvious that the process is even well-defined, in the sense that it yields the same distribution over outcomes regardless of how we schedule the attempted activations.
T122 16228-16365 Sentence denotes Actually the computation of D(S) is #P-hard, by showing a reduction from the positive partitioned 2-DNF assignment counting problem [9] .
T123 16366-16372 Sentence denotes Proof.
T124 16373-16485 Sentence denotes We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] .
T125 16486-16725 Sentence denotes For detailed proof, see Section 8.2. ᮀ Since it is intractable to exactly compute D(S) on a typically sized graph, a natural idea is to employ Monte-Carlo methods to estimate D(S), which can be implemented in two different ways as follows:
T126 16726-16917 Sentence denotes The expected detection time D(S) is obtained by directly simulating the random process of diffusion triggered by a random node, say u, chosen according to the distribution defined as Eq. (3).
T127 16918-17002 Sentence denotes Let I t denote the set of nodes newly infected in the t-th iteration with I 0 = {u}.
T128 17003-17139 Sentence denotes In the (t + 1)-th iteration, a node u ∈ J t : = ∪ 0≤i≤t I i attempts to activate each uninfected neighbor v / ∈ J t with the probability
T129 17140-17233 Sentence denotes The process is repeated until the diffusion hits the sensor set S at some step T(u, S), i.e.,
T130 17234-17269 Sentence denotes where inf{ ∅ } : =+∞ as convention.
T131 17270-17390 Sentence denotes The detection time of this single simulation is recorded at T(u, S) ∧ T max , which is right (u, S) defined in Eq. (2) .
T132 17391-17518 Sentence denotes We run such simulations for many times and finally estimate the expected detecting time D(S) by averaging over all simulations.
T133 17519-17753 Sentence denotes According to the characteristic of the DSI model, the time cost that an infected node u spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(u, v), and we denote this time cost as C(u, v), i.e.,
T134 17754-17777 Sentence denotes for any integral s ≥ 1.
T135 17778-17948 Sentence denotes We can flip all coins a priori to produce a weighted graph G = (V, E, c), where an edge (u, v) is labeled with the time cost c(u, v) -a sample of random variable C(u, v).
T136 17949-18200 Sentence denotes Actually such a snapshot provides an easy way to sample the detection time of any sensor set S, which exactly equals to the smallest time cost from the initially infected node, say u (chosen according to the distribution ), to the nearest sensor in S.
T137 18201-18359 Sentence denotes Define c(u, S) be the smallest time cost from u to the nearest sensor in S, then the detection time of this single simulation is recorded as c(u, S) ∧ T max .
T138 18360-18477 Sentence denotes We produce plenty of snapshots and finally estimate the expected detection time D(S) by averaging over all snapshots.
T139 18478-18565 Sentence denotes The snapshot simulation is equivalent to the propagation simulation in estimating D(S).
T140 18566-18654 Sentence denotes More specifically, the random variables T(u, S) and c(u, S) are identically distributed.
T141 18655-18749 Sentence denotes , it is enough to prove that random variables T(u, S) and c(u, S) are identically distributed.
T142 18750-18787 Sentence denotes For proof details, see Section 8.3. ᮀ
T143 18788-18897 Sentence denotes We have confirmed that the two methods are equivalent, while either has its own advantages and disadvantages.
T144 18898-19076 Sentence denotes For estimating a specific D(S), the simulation method is faster, because it only needs to examine a small portion of edges while the snapshot method has to examine all the edges.
T145 19077-19272 Sentence denotes For estimating the expected detection time of different sensor sets, the snapshot method outperforms the simulation method in terms of time complexity, since each snapshot serves all sensor sets.
T146 19273-19446 Sentence denotes Under these observations, the heuristic algorithms use the propagation simulation, while the greedy-based algorithms employ the snapshot simulation in the experimental part.
T147 19447-19660 Sentence denotes In this section we first show the NP-hardness of DTM problem in Eq. (6), as it can be shown to contain the Set Cover problem as a simple special case, then we propose a simple greedy algorithm for the DTM problem.
T148 19661-19667 Sentence denotes Proof.
T149 19668-19918 Sentence denotes Consider an instance of the NP-complete Set Cover problem, defined by a collection of subsets S = {S 1 , S 2 , . . ., S m } of a ground set U = {u 1 , u 2 , . . ., u n }, we wish to know whether there exist k of the subsets whose union is equal to U.
T150 19919-19953 Sentence denotes We can assume that k < n < m here.
T151 19954-20031 Sentence denotes We show that this can be viewed as a special case of the optimal problem (6).
T152 20032-20344 Sentence denotes Given an arbitrary instance of the Set Cover problem, we define a corresponding directed bipartite graph with n + m nodes: there is a node i corresponding to each element u i ∈ U, a node j corresponding to each set S j ∈ S, and a directed edge (i, j) with activation probability ip(i, j) = 1 whenever u i ∈ S j .
T153 20345-20458 Sentence denotes Define the probability distribution = { (u), u ∈ V} on the knowledge of nodes being initially infected as follows
T154 20459-20576 Sentence denotes The Set Cover problem is equivalent to deciding if there is a set S of k nodes in this bipartite graph with D(S) ≤ 1.
T155 20577-20692 Sentence denotes Note that for the instance we have defined, activation is a deterministic process, as all probabilities are 0 or 1.
T156 20693-21115 Sentence denotes Monitoring k nodes to detect diffusion initiated from U corresponding to sets in a Set Cover solution results in covering all n nodes corresponding to the ground set U, and if any set S of k nodes has D(S) ≤ 1, then the Set Cover problem must be solvable.ᮀ Since the optimization problem is NP-hard and the network is prohibitively large, we cannot compute the optimum value to verify the actual quality of approximations.
T157 21116-21195 Sentence denotes Hence a natural idea is to employ the greedy algorithm as approximation method.
T158 21196-21306 Sentence denotes To make better use of the greedy algorithm, we consider an equivalent optimization problem Eq. (6) as follows,
T159 21307-21312 Sentence denotes where
T160 21313-21407 Sentence denotes is defined as the (expected) remaining time for taking actions when a contaminant is detected.
T161 21408-21501 Sentence denotes The above alternative formulation has key properties as described in the following Theorem 4.
T162 21502-21531 Sentence denotes Theoretically, a non-negative
T163 21532-21546 Sentence denotes for all S ⊆ T.
T164 21547-21553 Sentence denotes Proof.
T165 21554-21627 Sentence denotes It is obvious that the remaining time function R is monotone and R(∅) =0.
T166 21628-21662 Sentence denotes Now we prove that R is submodular.
T167 21663-21738 Sentence denotes According to the definitions in Eqs. (5) and (10), it suffices to show that
T168 21739-21771 Sentence denotes for all S ⊆ T ⊆ V and v ∈ V \T .
T169 21772-21807 Sentence denotes By the second property of in Propo-
T170 21808-21901 Sentence denotes Note that (u, T) ≤ (u, S) always works by the monotone decrease property of in Proposition 1.
T171 21902-21972 Sentence denotes Now we discuss Eq. (14) separately according to the value of (u, {v}).
T172 21973-22016 Sentence denotes , which holds by the monotone decrease of .
T173 22017-22230 Sentence denotes Eq. (14) is proven and we thus get that R(S) is submodular.ᮀ By above properties in Theorem 4, the problem given in Eq. (9) can be approximated by the greedy algorithm in Algorithm 1 with the set function f : = R.
T174 22231-22408 Sentence denotes For any submodular and monotone function f with f(∅) =0, the problem of finding a set S of size k that maximizes f(S) can be approximated by the greedy algorithm in Algorithm 1.
T175 22409-22579 Sentence denotes The algorithm iteratively selects a new sensor u that maximizes the incremental change of f and includes the new sensor into the set S until k sensors have been selected.
T176 22580-22755 Sentence denotes It is shown that the algorithm guarantees an approximation ratio of f(S)/f(S * ) ≥ 1 −1/e, where S is the output of the greedy algorithm and S * is the optimal solution [30] .
T177 22756-22972 Sentence denotes In Greedy(k, R), a thorny problem is that there is no efficient way to compute R(S) given a placement S, and we turn to run snapshots for 10, 000 trials to obtain an accurate estimate of R(S), mentioned in Section 3.
T178 22973-23021 Sentence denotes This actually leads to expensive selection time.
T179 23022-23222 Sentence denotes Another source of inefficiency in Greedy(k, R) is that there exists O(kN) iterations at the remaining time estimation step, where k is the size of the initial sensor set, and N is the number of nodes.
T180 23223-23290 Sentence denotes When N is large, the efficiency of the algorithm is unsatisfactory.
T181 23291-23552 Sentence denotes Hence, in order to improve the efficiency of Greedy(k, R), one can either reduce the number of calls for evaluating R(S), or develop advanced heuristic algorithms which can conduct fast and approximate estimations for R(S) at the expense of accuracy guarantees.
T182 23553-23714 Sentence denotes In order to prune the estimation calls in Greedy(k, R), a natural idea is to employ the Cost-Effective Lazy Forward selection (CELF) algorithm proposed in [24] .
T183 23715-23908 Sentence denotes The principle behind is that the marginal gain of a node in the current iteration cannot be more than that in previous iterations, and thus the number of estimation calls can be greatly pruned.
T184 23909-24040 Sentence denotes CELF optimization produces the same sensor set as the original greedy algorithm, but it is much faster than the original one [24] .
T185 24041-24158 Sentence denotes Although CELF significantly improves Greedy(k, R), the sensor selection time is still unaffordable on large networks.
T186 24159-24399 Sentence denotes In particular, in the first round to establish the initial upper bounds, CELF needs to estimate R({v}) using MC simulations for each node v, leading to N times of MC calls, which is time-consuming, especially when the network is very large.
T187 24400-24600 Sentence denotes The limitation leads to a rather fundamental question that, can we derive an upper bound of R({v}) which can be used to further prune unnecessary detection time estimations (MC calls) in Greedy(k, R)?
T188 24601-24727 Sentence denotes Motivated by this question and the idea in [40] , in this section we derive an initial upper bound of R({v}) for Greedy(k, R).
T189 24728-24870 Sentence denotes Based on the bound, we propose a new greedy algorithm Upper Bound based Greedy (UBG for short), which outperforms the original CELF algorithm.
T190 24871-24980 Sentence denotes Essentially different from the bounds that derived for the influence spread under the IC model [40] , we here
T191 24982-25036 Sentence denotes In this part, we aim to derive an upper bound of R(v).
T192 25037-25121 Sentence denotes Before introducing the upper bounds in Theorem 5, we first prepare two propositions.
T193 25122-25245 Sentence denotes Let P u (v ∈ J t ) denote the probability that node v becomes infected before step t when the initially infected node is u.
T194 25246-25287 Sentence denotes We have the first proposition as follows.
T195 25288-25364 Sentence denotes For v ∈ V , the remaining time R(v) under the DSI model can be calculated as
T196 25365-25371 Sentence denotes Proof.
T197 25372-25405 Sentence denotes In fact, by the definition in Eq.
T198 25406-25424 Sentence denotes (2), we first have
T199 25425-25646 Sentence denotes where the fourth '=' is due to the fact that u ∈ V (u) = 1 in above derivation. ᮀ Proposition 2 reveals that we can treat the global remaining time R(v) as a summation of all T max propagation steps of local probabilities
T200 25647-25738 Sentence denotes Based on Proposition 2, a following question is, what is the relationship between two sets,
T201 25739-25753 Sentence denotes Proposition 3.
T202 25754-25797 Sentence denotes For k ≥ 1, we have the following inequation
T203 25798-25804 Sentence denotes Proof.
T204 25805-25885 Sentence denotes For k ≥ 1, by the definition of conditional expectation and DSI model, we obtain
T205 25886-25954 Sentence denotes where the first '≤' is due to 1 {v/ ∈J k−1 } ≤ 1, and the second '≤'
T206 25955-26142 Sentence denotes x i in the above derivation.ᮀ Proposition 3 clearly identifies the ordering relationship between two adjacent elements in the series P u (v ∈ J 0 ), P u (v ∈ J 1 ), P u (v ∈ J 2 ), · · ·.
T207 26143-26216 Sentence denotes Now we simplify the results in Proposition 3 by using the form of matrix.
T208 26217-26311 Sentence denotes Let IP be the infection probability matrix with the element at position (u, v) being ip(u, v).
T209 26312-26344 Sentence denotes For t ≥ 0, denote the row vector
T210 26345-26441 Sentence denotes as the probabilities of nodes being infected before step t, i. e ., Â u t (v) := P u (v ∈ J t ).
T211 26442-26483 Sentence denotes Obviously, we have  u 0 (v) = 1 {u} (v).
T212 26484-26542 Sentence denotes Now we can rewrite Proposition 3 by using the matrix form,
T213 26543-26627 Sentence denotes By iteration, we further get that u t ≤ u 0 · (E + IP) t , where E is a unit matrix.
T214 26628-26704 Sentence denotes Furthermore, due to the definition of probability  u t (v), it follows that
T215 26705-26774 Sentence denotes Hereafter, define A ∧ 1 : = {a(i, j) ∧ 1} for a matrix A = {a(i, j)}.
T216 26775-26875 Sentence denotes With the above preparations, we can present the results on upper bound of remaining time as follows,
T217 26876-26944 Sentence denotes where E is a unit matrix and [A] (u,v) means the element at position
T218 26946-27029 Sentence denotes where is a prior distribution on the likelihood of nodes being the infected source.
T219 27030-27101 Sentence denotes We first use a toy example to explain how to calculate the upper bound.
T220 27102-27112 Sentence denotes Example 1.
T221 27113-27197 Sentence denotes Given a directed network G in Fig. 1 with infection probability matrix in Eq. (22) .
T222 27198-27285 Sentence denotes Let T max = 10, we have ((E + IP) t ∧ 1) is intractable when the network size is large.
T223 27286-27370 Sentence denotes To overcome the difficulty, we adopt the following procedure to calculate Tmax−1 t=0
T224 27371-27493 Sentence denotes If t ≥1 ≤ T max − 1, it implies that we do not need to calculate (E + IP) t any more when t ≥ t ≥1 , and therefore we have
T225 27494-27546 Sentence denotes where 1 is a N × N matrix with all elements being 1.
T226 27547-27572 Sentence denotes Additionally we find that
T227 27574-27580 Sentence denotes Proof.
T228 27581-27630 Sentence denotes In fact, when IP is small enough, it follows that
T229 27631-27773 Sentence denotes where the approximation stems from Taylor expansion.ᮀ By Eqs. (21) and (25), the new upper bound of remaining time can be approximately put as
T230 27774-27821 Sentence denotes which is relatively tractable when IP is small.
T231 27822-27908 Sentence denotes Based on the upper bound, we propose a new UBG algorithm for early outbreak detection.
T232 27909-27962 Sentence denotes First we explain the difference between UBG and CELF.
T233 27963-28061 Sentence denotes The CELF algorithm [24] exploits the submodular property to improve the original greedy algorithm.
T234 28062-28246 Sentence denotes However, CELF demands N (the number of nodes in the network) remaining time estimations to establish the initial bounds of marginal increments, which is time expensive on large graphs.
T235 28247-28439 Sentence denotes In contrast, the proposed Upper Bound based Greedy (UBG) algorithm uses the derived new bound to further reduce the number of remaining time estimations, especially in the initialization step.
T236 28440-28573 Sentence denotes This way, the nodes are ranked by their upper bound scores which can be used to prune unnecessary calculations in the CELF algorithm.
T237 28574-28608 Sentence denotes We use Example 2 for illustration.
T238 28610-28748 Sentence denotes Even with the UBG algorithm proposed in Section 5, its running time is still unbearable and may not be suitable for large social networks.
T239 28749-28924 Sentence denotes Although UBG can greatly reduce the number of remaining time estimation calls, each estimation call is very timeconsuming, as it needs produce enough samples and average them.
T240 28925-29091 Sentence denotes Hence a possible alternative to further accelerate UBG is to employ heuristics to approximate the remaining time R(S) or the marginal return R(S ∪ {u}) − R(S) in UBG.
T241 29092-29190 Sentence denotes Along this idea, we here introduce two accelerative algorithms to address the inefficiency of UBG:
T242 29191-29233 Sentence denotes Quickest-Path-UBG and Local-Reduction-UBG.
T243 29234-29474 Sentence denotes The experimental results will show that (i) both of them are efficient in terms of running time, and (ii) the former one applies to sparse networks better, while the latter one is more effective in dense networks in terms of detection time.
T244 29475-29562 Sentence denotes In this part we introduce a tractable heuristic to approximate the remaining time R(S).
T245 29563-29653 Sentence denotes From Eq. (5) and Eq. (10), the key point of estimating R(S) is how to estimate E[ (u, S)].
T246 29654-29773 Sentence denotes Note that E[ (u, S)] measures the expected time delay of a diffusion initiated from node u propagating to sensor set S.
T247 29774-29886 Sentence denotes An intuitive idea is that the most likely propagation path should be the quickest path from node u to the set S.
T248 29887-29966 Sentence denotes Hence, a question is how to measure the quickest path from node u to the set S?
T249 29967-30036 Sentence denotes The proposed Quickest-Path-UBG is inspired by answering the question.
T250 30037-30177 Sentence denotes According to the geometric distribution, if a random variable X is distributed geometrically with a parameter p, it follows that E[X] = 1/p.
T251 30178-30442 Sentence denotes Since the time that a node u spends in infecting its uninfected neighborhood v is distributed geometrically with the parameter ip(u, v), the value 1/ip(u, v) should be the expected time cost that an infection propagates from node u to node v along the edge (u, v).
T252 30443-30467 Sentence denotes Fig. 2 shows an example.
T253 30468-30568 Sentence denotes For any two nodes u and v, let d(u, v) be the smallest time cost among the paths connecting u and v.
T254 30569-30629 Sentence denotes For example, d( , ) = min{5 + 3.3, 10 + 5} = 8.3 in Fig. 2 .
T255 30630-30841 Sentence denotes Then, for a subset S ⊆ V, we define d(u, S) := min v ∈ S d(u, v). (27) Intuitively, d(u, S) denotes the expected time that an infection propagates from u to S along the quickest path in the graph G = (V, E, m) .
T256 30842-30976 Sentence denotes Therefore, a fundamental question arises, is there some approximate relationship between the detection time and the shortest distance.
T257 30977-31009 Sentence denotes Theoretically we have Theorem 6.
T258 31010-31171 Sentence denotes When the shortest path from u to S in graph G = (V, E, m) is unique, the quantity d(u, S) ∧ T max can be used to approximate the detection time E[ (u, S)], i.e.,
T259 31172-31178 Sentence denotes Proof.
T260 31179-31272 Sentence denotes In fact, when the shortest path from u to S in graph G = (V, E, m) is unique, it follows that
T261 31273-31377 Sentence denotes where we use an result borrowed from probability theory for the approximation: consider random variables
T262 31378-31476 Sentence denotes .ᮀ Based on Theorem 6, combining Eq. (5), Eq. (10) and Eq. (28), we have the following derivation,
T263 31477-31520 Sentence denotes Hence we can approximate the remaining time
T264 31521-31611 Sentence denotes rather than the heavy Monte-Carlo estimation MC(S ∪ {u}) in the 09th row of UBG algorithm.
T265 31612-31704 Sentence denotes We nail down this new method as Quickest-Path-UBG, which is shown explicitly in Algorithm 3.
T266 31706-31945 Sentence denotes Note that the reason we call this acceleration as Quickest-Path-UBG rather than Shortest-Path-UBG lies in that, our Quickest-Path-UBG introduces a new measurement to reflect the infection time cost rather than the distance in common sense.
T267 31946-32090 Sentence denotes In this part we turn to a tractable heuristic to approximate the marginal return ı u (S) : = R(S ∪ {u}) − R(S) in the 09th row of UBG algorithm.
T268 32091-32275 Sentence denotes To facilitate the description, we will approximate the marginal reduction D(S) − D(S ∪ {u}) rather than the marginal return R(S ∪ {u}) − R(S), due to the fact that R(S) = T max − D(S).
T269 32276-32417 Sentence denotes In sociology literature, degree and other centrality-based heuristics are commonly used to select influential nodes in social networks [37] .
T270 32418-32580 Sentence denotes These methods usually looked over the multi-step complexity of network diffusions, and assumed that the infection can propagate ahead only one hop before the end.
T271 32581-32689 Sentence denotes In other words, if a node initiates an infection, the diffusion spread is at most its first-order neighbors.
T272 32690-32735 Sentence denotes We call this as one-hop assumption hereafter.
T273 32736-32972 Sentence denotes Especially, experimental results in [24] showed that selecting vertices with maximum indegree as sensors results in earlier infection detection than other heuristics, which validated the rationality of one-hop assumption to some extent.
T274 32973-33116 Sentence denotes Under this observation, we will propose a tractable method upon the one-hop assumption to approximate the marginal reduction D(S) − D(S ∪ {u}).
T275 33117-33554 Sentence denotes We first introduce a useful result in probability theory: if the random variable X i is distributed geometrically with parameter p i for i = 1, · · ·, n and they are independent, then ∧ i=1,· · ·,n X i is distributed geometrically with parameter 1 − According to the characteristic of the DSI model, the time cost that an infected node w spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(w, v).
T276 33555-33679 Sentence denotes If the initially infected node w has multi-neighbors in sensor set S, the infections to different neighbors are independent.
T277 33680-33774 Sentence denotes Under the onehop assumption, the expected detection time of w can thus be calculated like this
T278 33775-33896 Sentence denotes If the initially infected node w has no neighbors in sensor set S, the expected detection time of w is defined as T max .
T279 33897-33938 Sentence denotes If w ∈ S, it follows that E[ (w, S)] = 0.
T280 33939-34084 Sentence denotes Fig. 3 presents an example to show the calculation result of expected detection time E[ (w, S)] for each node w ∈ V under the one-hop assumption.
T281 34085-34220 Sentence denotes With the preparations above, we are about to calculate the marginal reduction ı u (S) = D(S) − D(S ∪ {u}) under the one-hop assumption.
T282 34221-34291 Sentence denotes Assume the set S has been selected as sensor set to detect infections.
T283 34292-34421 Sentence denotes When considering adding another node u as a new sensor into S, the reduction of the expected detection time can be put as follows
T284 34422-34442 Sentence denotes which is by Eq. (5).
T285 34443-34530 Sentence denotes We now use Theorem 7 to present the concrete expression of the right part of Eq. (30) .
T286 34531-34616 Sentence denotes Similar with the definition of Eq. (1), we predefine the parents of set S as follows,
T287 34617-34627 Sentence denotes Theorem 7.
T288 34628-34793 Sentence denotes In the DSI model with infection probability ip(w, v) on a directed graph G = (V, E), the reduction ı u (S) of the expected detection time can be calculated like this
T289 34794-35029 Sentence denotes under one-hop assumption, where f(u, S) is defined as The reason that we call it as Local-Reduction-UBG is that we only consider the local change in monitoring {u} and Par(u) \ Par(S) Table 2 Statistics of the four real-world networks.
T290 35030-35136 Sentence denotes Digger Twitter Epinions Small-world #Node 8194 32,986 51,783 200,000 #Edge 56,440 763,713 476,
T291 35138-35301 Sentence denotes We conduct experiments on both synthetic and real-world data sets to evaluate the UBG algorithm, the Quickest-Path-UBG algorithm and Local-Reduction-UBG algorithm.
T292 35302-35381 Sentence denotes We implement the algorithms using C++ with the Standard Template Library (STL).
T293 35382-35488 Sentence denotes All experiments are run on a Linux (Ubuntu 11.10) machine with six-core 1400 MHz AMD CPU and 32 GB memory.
T294 35489-35553 Sentence denotes Three real and one synthetic data sets are used for comparisons.
T295 35554-35737 Sentence denotes The Digger data 1 is a heterogeneous network, including Digg stories, user actions (submit, digg, comment and reply) with respect to the stories, and friendship relations among users.
T296 35738-35822 Sentence denotes The Twitter and Epinions data sets can both be obtained from Stanford Datasource 2 .
T297 35823-35954 Sentence denotes Epinions is a general consumer review site where visitors can read reviews about a variety of items to help them decide a purchase.
T298 35955-36072 Sentence denotes The synthetic Small-world data set is the type of graphs in which each node can be reached by a small number of hops.
T299 36073-36184 Sentence denotes For smallworld model we set the parameter of the nearest neighbors k = 15 and the rewiring probability p = 0.1.
T300 36185-36304 Sentence denotes The above networks are representative ones, covering a variety of networks with different types of relations and sizes.
T301 36305-36385 Sentence denotes The details of the data sets are listed in Table 2 where degree means in-degree.
T302 36386-36463 Sentence denotes In our experiments, an undirected graph is regarded as a bidirectional graph.
T303 36464-36622 Sentence denotes We compare the UBG in Algorithm 2, the Quickest-Path-UBG in Algorithm 3, and Local-Reduction-UBG in Algorithm 4 with both the greedy and heuristic algorithms.
T304 36623-36636 Sentence denotes • CELF [24] .
T305 36637-36757 Sentence denotes The state-of-the-art greedy algorithm, where uses 10,000 snapshots in the whole process for any network. • DEGREE [37] .
T306 36758-36845 Sentence denotes A heuristic algorithm based on "degree centrality", with high-degree nodes as key ones.
T307 36846-36932 Sentence denotes The seeds are the nodes with the k highest in-degrees. • INTER-MONITOR DISTANCE [35] .
T308 36933-37083 Sentence denotes A heuristic algorithm which requires any pair of sensors to be at least d hops away, where d is as large as it can choose k monitors. • PageRank [5] .
T309 37084-37161 Sentence denotes A link analysis algorithm which ranks the importance of pages in a Web graph.
T310 37162-37278 Sentence denotes We implement the power method with a damping factor of 0.85, and pick the k highest-ranked nodes as seeds. • Random.
T311 37279-37376 Sentence denotes It simply selects k random vertices in the graph as the seed set, which is taken as the baseline.
T312 37377-37558 Sentence denotes In our experiments, to obtain the detection time of sensor sets provided by heuristic algorithms, we run Monte-Carlo simulation on the networks 10, 000 times and calculate the mean.
T313 37559-37701 Sentence denotes The simple greedy algorithm is not compared because many works have reported that CELF has the same optimization result and less running time.
T314 37702-37863 Sentence denotes Since the DEGREE heuristic is the state-of-the-art [37] , we do not implement heuristics such as distance centrality and betweenness centrality-based heuristics.
T315 37864-38032 Sentence denotes We mainly report results on a uniform infection probability of 0.1 assigned to each directed link in the network, i.e., ip(u, v) = 0.1 for any directed edge (u, v) ∈ E.
T316 38033-38159 Sentence denotes One can refer to the work [14, 33, 36] for learning real values of the parameters {ip(u, v) : (u, v) ∈ E} from available data.
T317 38160-38257 Sentence denotes Besides, we let the time horizon T max = 30 and the prior distribution be uniform in the network.
T318 38258-38325 Sentence denotes In Section 3, we proposed two Monte-Carlo methods to estimate D(S).
T319 38326-38413 Sentence denotes Table 3 shows the estimation results of Propagation Simulation and Snapshot Simulation.
T320 38414-38502 Sentence denotes Here we select 10 nodes with the highest in-degrees in each network as the sensor set S.
T321 38503-38663 Sentence denotes We can find that Propagation Simulation and Snapshot Simulation release almost the same estimation results, which confirms their equivalence in estimating D(S).
T322 38664-38735 Sentence denotes Fig. 4 shows the cumulative time cost of these two Monte-Carlo methods.
T323 38736-38911 Sentence denotes Here we randomly select 10 sensor sets {S 1 , S 2 , . . ., S 10 } from the Digger data and every sensor set has five sensor nodes, i.e. |S i | = 5 for all i = 1, 2, . . ., 10.
T324 38912-39098 Sentence denotes We can see that the cumulative time cost of Propagation Simulation increases linearly, while that of Snapshot Simulation has a big jump at the first sensor set and then increases slowly.
T325 39099-39310 Sentence denotes The reason behind is that Snapshot Simulation needs to establish numerous snapshots to estimate the first D(S 1 ), and these established snapshots can be reused in the posterior estimations of {D(S i )} 10 i=2 .
T326 39311-39400 Sentence denotes Table 4 shows the gap between the real value of remaining time R(S) and its upper bounds.
T327 39401-39495 Sentence denotes Here we also select ten nodes with the highest in-degrees in each network as the sensor set S.
T328 39497-39677 Sentence denotes In Table 5 , we compare the number of estimation or approximation calls at the first 10 iterations among CELF, UBG, Quickest-Path-UBG and Local-Reduction-UBG on the four data sets.
T329 39678-39847 Sentence denotes From the results, we can observe that the call number in UBG and its two accelerations are significantly reduced compared to that in CELF, especially at the first round.
T330 39848-39985 Sentence denotes One may notice that in Table 5 , CELF occasionally defeats our methods, but the total call number of our methods are much less than CELF.
T331 39986-40162 Sentence denotes As listed in 5, the total call number of the first 10 iterations of UBG and its two accelerations, compared to CELF, is reduced at a rate of 94% at least on the four data sets.
T332 40163-40264 Sentence denotes Similarly, at least 81% reduction of call numbers of CELF can be observed in the first 50 iterations.
T333 40265-40362 Sentence denotes From the observation, we can conclude that our UBG is more efficient than CELF on large networks.
T334 40363-40462 Sentence denotes Detection time measures the time delay of a message propagated from a diffusion source to a sensor.
T335 40463-40627 Sentence denotes We run tests on the four data sets and obtain detection time results w.r.t. parameter k (the number of sensors), where k increases from 1 to 50 as shown in Fig. 5 .
T336 40628-40710 Sentence denotes UBG, as an updated version of CELF, has competitive results on the four data sets.
T337 40711-40892 Sentence denotes More importantly, the detection times of UBG and CELF are completely the same in the four figures, which explains again that UBG and CELF share the same results in sensor selection.
T338 40893-40983 Sentence denotes The only difference between UBG and CELF is the number of remaining time estimation calls.
T339 40984-41074 Sentence denotes The Quickest-Path-UBG and Local-reduction-UBG always perform better than other heuristics.
T340 41075-41289 Sentence denotes More specifically, if the networks are sparse like Digger and Epinions, the Quickest-Path-UBG is more competitive; if the networks are dense like Twitter and Smallworld, the Local-reduction-UBG is more competitive.
T341 41290-41362 Sentence denotes Selection time measures the time cost of an algorithm selecting sensors.
T342 41363-41423 Sentence denotes Fig. 6 shows the time cost of selecting sensors with k = 50.
T343 41424-41458 Sentence denotes UBG is 4-8 times faster than CELF.
T344 41459-41543 Sentence denotes One may argue that such a low improvement of UBG can be neglected in large networks.
T345 41544-41587 Sentence denotes In fact, UBG scales well to large networks.
T346 41588-41771 Sentence denotes Because, with the size of a network increase, Monte-Carlo simulations take more time, and thus UBG will achieve better performance by pruning more unnecessary Monte-Carlo simulations.
T347 41772-41882 Sentence denotes As to heuristics, Degree and Random are very fast in selecting candidate nodes, which take less than 1 second.
T348 41883-42001 Sentence denotes Quickest-Path-UBG and Local-Reduction-UBG are exciting and adoptable, due to their good performance in detection time.
T349 42002-42131 Sentence denotes The PageRank and Inter-Monitor Distance are slightly slower and undesirable, in view of their poor performance in detection time.
T350 42132-42243 Sentence denotes We run tests on Twitter data set and obtain the selection time w.r.t. parameter ip (the infection probability).
T351 42244-42293 Sentence denotes In the experiments, ip increases from 0.1 to 0.5.
T352 42294-42381 Sentence denotes We assign a uniform infection probability ip to each directed link under the DSI Model.
T353 42382-42502 Sentence denotes From the results in Fig. 7 , we can conclude that with the parameter ip growing larger, the CELF is more time-consuming.
T354 42503-42593 Sentence denotes By contrast, the UBG and its accelerations are robust and insensitive to the parameter ip.
T355 42594-42656 Sentence denotes 2) First, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∪ T(u, S 2 ).
T356 42657-42718 Sentence denotes For any t ∈ T(u, S 1 ∪ S 2 ), I t ∩ (S 1 ∪ S 2 ) / = ∅ holds.
T357 42719-42873 Sentence denotes From I t ∩ (S 1 ∪ S 2 ) = (I t ∩ S 1 ) ∪ (I t ∩ S 2 ) / = ∅, we know either I t ∩ S 1 or I t ∩ S 2 is nonempty, which implies t ∈ T(u, S 1 ) ∪ T(u, S 2 ).
T358 42874-42929 Sentence denotes Hence T(u, S 1 ∪ S 2 ) ⊂ T(u, S 1 ) ∪ T(u, S 2 ) works.
T359 42930-43025 Sentence denotes Conversely, for any t ∈ T(u, S 1 ) ∪ T(u, S 2 ), t will be located in T(u, S 1 ) or T(u, S 2 ).
T360 43026-43091 Sentence denotes Without loss of generality, we assume t ∈ T(u, S 1 ), which means
T361 43093-43143 Sentence denotes is nonempty, which indicates t ∈ T(u, S 1 ∪ S 2 ).
T362 43144-43204 Sentence denotes Hence T(u, S 1 ) ∪ T(u, S 2 ) ⊂ T(u, S 1 ∪ S 2 ) also works.
T363 43205-43265 Sentence denotes Second, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ).
T364 43266-43394 Sentence denotes We take infimum operations on both sides of the first step conclusion, i.e., inf T(u, S 1 ∪ S 2 ) = inf T(u, S 1 ) ∪ T(u, S 2 ).
T365 43395-43505 Sentence denotes Since all these sets are be composed of integers, we obviously get T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ).
T366 43506-43733 Sentence denotes Finally, we take the infimums with T max on both sides of the second step conclusion, i.e., T(u, S 1 ∪ S 2 ) ∧ T max = T(u, S 1 ) ∧ T(u, S 2 ) ∧ T max , which means (u, S 1 ∪ S 2 ) = (u, S 1 ) ∧ (u, S 2 ) by the definition of .
T367 43734-44069 Sentence denotes 3) Similarly, in order to prove this property, we only need to , {u}) , i.e., for any t ≥ 0, the probability that t is located in T(u, {v}) equals the probability that t is located in 4) If S ⊂ T, in order to prove this property, we only need to prove T(u, S) ⊂ T(u, T ) and notice the fact that the infimum of a larger set is smaller.
T368 44070-44111 Sentence denotes For any t ∈ T(u, S), I t ∩ S / = ∅ holds.
T369 44112-44202 Sentence denotes Besides, as S ⊂ T, I t ∩ S / = ∅ implies I t ∩ T / = ∅, which means t is also in T(u, T ).
T370 44203-44285 Sentence denotes So, T(u, S) ⊂ T(u, T ), by which we can easily conclude the property, is verified.
T371 44286-44370 Sentence denotes 5) If T(u, S) is empty, then T(u, S) =+∞. (u, S) =+ ∞ ∧ T max = T max ∈ [0, T max ].
T372 44371-44488 Sentence denotes If T(u, S) is nonempty, then every element in it is greater than 0 and notice that (u, S) = T(u, S) ∧ T max ≤ T max .
T373 44489-44536 Sentence denotes Therefore, we can conclude 0 ≤ (u, S) ≤ T max .
T374 44537-44666 Sentence denotes We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] , denoted by (A1).
T375 44667-44672 Sentence denotes (A1):
T376 44673-44757 Sentence denotes Let E be a subset of N 2 and X i , Y j , (i, j) ∈ E are pairwise distinct variables.
T377 44758-44889 Sentence denotes We define a formula F : = ∨ {X i ∧ Y j : (i, j) ∈ E}, which is a disjunction of all the conjunctive clauses X i ∧ Y j , (i, j) ∈ E.
T378 44890-44937 Sentence denotes How many valuations that satisfy the formula F?
T379 44938-44983 Sentence denotes Then we introduce the following problem (A2).
T380 44984-44989 Sentence denotes (A2):
T381 44990-45223 Sentence denotes G = (V, E) is a directed graph, where each edge e ∈ E is associated with a weight w e which is geometrically distributed with a parameter p, i.e. P(w e = i) = p(1 − p) i−1 , i ≥ 1. s, t are two nodes in G and k is a positive integer.
T382 45224-45320 Sentence denotes What is the probability of the event that the shortest path from s to t has length T at least k?
T383 45321-45365 Sentence denotes Now we prove that (A1) is reducible to (A2).
T384 45366-45436 Sentence denotes We map the variable X i to a node x i and variable Y j to a node y j .
T385 45437-45497 Sentence denotes If (i, j) exists in E, then we add an edge from x i to y j .
T386 45498-45668 Sentence denotes Also, we add a source node s and an edge (s, x i ), associated with weight w (s,x i ) which is geometrically distributed with parameter 1/2, for any i ∈ {i : (i, j) ∈ E}.
T387 45669-45765 Sentence denotes Thus, w (s,x i ) equals 1 with probability 1/2 and strictly greater than 1 with probability 1/2.
T388 45766-45886 Sentence denotes Similarly, we add a terminal node t and an edge from each y j to t with a weight identically distributed as w (s,x i ) .
T389 45887-45959 Sentence denotes So we construct a graph with geometrically distributed weights on edges.
T390 45960-46215 Sentence denotes Now we can build a probability-preserving bijection between the valuations of X i , Y j and the subgraphs of the constructed random graph: for a valuation , the corresponding subgraph is the one where each edge adjacent to x i has length 1 iff (X i ) = 1.
T391 46216-46243 Sentence denotes The same argument for Y j .
T392 46244-46415 Sentence denotes We set k = 3 and claim the fact that the probability P(T ≥ 3) in (A2) is the number of valuations satisfying that F is divided by 2 N , where N is the number of variables.
T393 46416-46637 Sentence denotes Indeed, the fact is based on the following observation: a valuation is true iff some adjacent pair of X i and Y j is true if the incident edges to the corresponding x i and y j have length 1 in the corresponding subgraph.
T394 46638-46907 Sentence denotes Note that any edge of length greater than 1 is irrelevant as the structure of the graph, which ensures it can never be part of a path of length 3 from s to t, For any path from s to t must jump three times: from s to some x i , from x i to some y j , and from y j to t.
T395 46908-47001 Sentence denotes Thus, from what we have stated, (A1) can be reducible to (A2), which implies (A2) is #P-hard.
T396 47002-47056 Sentence denotes We aim to solve the problem (denoted by (A)) of calcu-
T397 47058-47210 Sentence denotes In order to prove the random variables T(u, S) and c(u, S) are identically distributed, we need to prove P[T (u, S) = t] = P[c(u, S) = t] for any t > 0.
T398 47211-47274 Sentence denotes On the one hand, we can prove P[T (u, S) = t] ≤ P[c(u, S) = t].
T399 47275-47385 Sentence denotes Given t > 0, we find that for any simulation with T(u, S) = t, a snapshot with c(u, S) = t can be constructed.
T400 47386-47476 Sentence denotes By definition, if T(u, S) = t, then J i ∩ S = ∅, i = 0, 1, · · ·, t − 1 and J t ∩ S / = ∅.
T401 47477-47606 Sentence denotes For any node v i+1 ∈ J i+1 \J i / = ∅, suppose that v i+1 is infected through the edge (s, v i+1 ) with s ∈ J i , v i+1 ∈ J i+1 .
T402 47607-47638 Sentence denotes Denote n s = inf{j : s ∈ J j }.
T403 47639-47712 Sentence denotes Then we let the edge (s, v i+1 ) take weight c(s, v i+1 ) = i + 1 − n s .
T404 47713-47781 Sentence denotes For any node s ∈ J t , we denote n s , i.e. n s = sup{j : s ∈ J j }.
T405 47782-47919 Sentence denotes Then for any v ∈ V \J t , let the edge (s , v i+1 ) take weight c(s , v) = t − n s + C, where C is sampled from a geometric distribution.
T406 47920-48012 Sentence denotes In addition, we take a weight sampled from geometric distribution on each of the rest edges.
T407 48013-48124 Sentence denotes In such operations, a snapshot with c(u, S) = t is constructed, which implies P[T (u, S) = t] ≤ P[c(u, S) = t].
T408 48125-48193 Sentence denotes On the other hand, we can also get P[c(u, S) = t] ≤ P[T (u, S) = t].
T409 48194-48278 Sentence denotes For any snapshot with c(u, S) = t, a simulation with T(u, S) = t can be constructed.
T410 48279-48410 Sentence denotes We let node u be the initiator, i.e. J 0 = {u} and denote the smallest number in the set {c(u, v) : u ∈ J 0 , v ∈ V \J 0 } by n 0 .
T411 48411-48501 Sentence denotes Meanwhile, we denote the set of nodes {v : c(u, v) = n s , u ∈ J 0 , v ∈ V \J 0 } by n 0 .
T412 48502-48525 Sentence denotes Let J n 0 = J 0 ∪ n 0 .
T413 48526-48594 Sentence denotes Complementally, we define all the J i = J 0 , i = 0, · · ·, n 0 − 1.
T414 48595-48747 Sentence denotes Denote the smallest number in the set {c(u, v) : u ∈ J t , v ∈ V \J t } by n t and the set of nodes {v : c(u, v) = n s , u ∈ J t , v ∈ V \J t } by n t .
T415 48748-48781 Sentence denotes Let J t+n t = J n t = J t ∪ n t .
T416 48782-48854 Sentence denotes Complementally, we define J i : = J t for all i = t, · · ·, t + n t − 1.
T417 48855-48944 Sentence denotes We now can construct a simulation of the diffusion model (I t ) t≥0 = (J t+1 \J t ) t≥0 .
T418 48945-49038 Sentence denotes Hence, we have a simulation with T(u, S) = t, which implies P[c(u, S) = t] ≤ P[T (u, S) = t].
T419 49039-49173 Sentence denotes Therefore, we have P[T (u, S) = t] = P[c(u, S) = t] for any t > 0, which implies that T(u, S) and c(u, S) are identically distributed.
T420 49174-49255 Sentence denotes Hence, the simulation and snapshot simulations are equivalent in estimating D(S).
T1 49256-49402 Sentence denotes In this paper, we discussed scalable solutions to detection time minimization problem for early detection of dynamic harmful cascades in networks.
T2 49403-49554 Sentence denotes To address this problem, we proposed an upper bound based greedy (UBG) algorithm and its two accelerative algorithms to cater for lager scale networks.
T3 49555-49661 Sentence denotes The UBG solution guarantees a near-optimal detection time, pruning at least 80% Monte-Carlo calls of CELF.
T4 49662-49791 Sentence denotes The novel accelerations on UBG can significantly reduce the selection time and further achieve at least 10 2 times speed-raising.
T5 49792-49840 Sentence denotes There are several interesting future directions.
T6 49841-50012 Sentence denotes First, the detection time D(S) is # P-hard to calculate exactly, it is still a question how to design an efficient algorithm to estimate D(S) with a theoretical guarantee.
T7 50013-50204 Sentence denotes Second, the infection probability ip and prior distribution are predefined in this paper, how to learn these parameters for the DSI model from available cascade data still remains unexplored.