PubAnnotation

Id	Subject	Object	Predicate	Lexical cue
T1	0-69	Sentence	denotes	Early detection of dynamic harmful cascades in large-scale networks ଝ
T2	71-79	Sentence	denotes	Abstract
T3	80-217	Sentence	denotes	Quickly detecting harmful cascades in networks can allow us to analyze the causes and prevent further spreading of destructive influence.
T4	218-370	Sentence	denotes	Since it is often impossible to observe the state of all nodes in a network, a common method is to detect harmful cascades from sparsely placed sensors.
T5	371-567	Sentence	denotes	However, the harmful cascades are usually dynamic (e.g., the cascade initiators and diffusion trajectories can change over the time), which can severely destroy the robustness of selected sensors.
T6	568-672	Sentence	denotes	Meanwhile the large scale of current networks greatly increases the time complexity of sensor selection.
T7	673-830	Sentence	denotes	Motivated by the observation, in this paper we investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks.
T8	831-1082	Sentence	denotes	Specifically, we first put forward a dynamic susceptible-infected model to describe harmful cascades, and formally define a detection time minimization (DTM) problem which focuses on effective sensors placement for early detection of dynamic cascades.
T9	1083-1218	Sentence	denotes	We prove that it is #P-hard to calculate the objective function exactly and propose two Monte-Carlo methods to estimate it efficiently.
T10	1219-1303	Sentence	denotes	We prove the NP-hardness of DTM problem and design a corresponding greedy algorithm.
T11	1304-1436	Sentence	denotes	Based on that, we propose an efficient upper bound based greedy (UBG) algorithm with the theoretical performance guarantee reserved.
T12	1437-1530	Sentence	denotes	To further meet different types of large-scale networks, we propose two accelerations of UBG:
T13	1531-1643	Sentence	denotes	Quickest-Path-UBG for sparse networks and Local-Reduction-UBG for dense networks to improve the time complexity.
T14	1644-1760	Sentence	denotes	The experimental results on synthetic and real-world social networks demonstrate the practicality of our approaches.
T15	1762-1880	Sentence	denotes	Harmful cascade spreading through kinds of network structures has become more and more ubiquitous in the modern world.
T16	1881-2038	Sentence	denotes	A contagious disease like Severe Acute Respiratory Syndrome (SARS) can spread quickly through a population contact network and lead to an epidemic [12, 34] .
T17	2039-2162	Sentence	denotes	A computer virus on a few servers can fast spread to other servers or computers in a computer connection network [15, 17] .
T18	2163-2281	Sentence	denotes	In a similar vein, a rumor started by a few individuals can spread quickly through the online social network [2, 26] .
T19	2282-2473	Sentence	denotes	It is crucial to detect the harmful cascades as soon as they happen or shortly thereafter, since it allows us to study the causes and prevent further spreading of harmful influence [32, 38] .
T20	2474-2698	Sentence	denotes	In practice, it is commonly infeasible or unaffordable to monitor all nodes at all times, and therefore a common way to detect cascades is to select important nodes where we can place sensors for monitoring [7, 19, 31, 39] .
T21	2699-2979	Sentence	denotes	However, existing methods usually viewed the cascade data as static and deterministic, ignoring an important fact that the harmful cascades are usually dynamic and time-variant, in the sense that the cascade initiators and diffusion trajectories can change randomly over the time.
T22	2980-3117	Sentence	denotes	Actually the dynamic brings new challenges to network monitoring, since it can severely destroy the robustness of selected sensors [22] .
T23	3118-3288	Sentence	denotes	Meanwhile the existing sensor selection algorithms can only work for small networks and not be scalable well to large networks of the https://doi.org/10.1016/j.jocs.2017.
T24	3289-3327	Sentence	denotes	10 .014 1877-7503/© 2017 Elsevier B.V.
T25	3328-3359	Sentence	denotes	All rights reserved. day [28] .
T26	3360-3533	Sentence	denotes	Motivated by these observations, in this paper we investigate the scalable solutions to sensor selection problem for early detection of dynamic harmful cascades in networks.
T27	3534-3618	Sentence	denotes	To model the dynamic property, we carry on our work from a model-driven perspective.
T28	3619-3828	Sentence	denotes	To provide a unified framework, we model all the above examples as an infection spreading in a network G = (V, E), where V is the set of nodes (i.e. individuals) and E is the set of edges (i.e. relationships).
T29	3829-3923	Sentence	denotes	In a population network, the infection is the disease that is transmitted between individuals.
T30	3924-4108	Sentence	denotes	In the example of a computer virus spreading in a network, the infection is the computer virus, while for the case of a rumor spreading in a social network, the infection is the rumor.
T31	4109-4444	Sentence	denotes	Under this unified network framework, we specifically propose a new dynamic cascade model to describe harmful cascade diffusions, and define a detection time minimization (DTM) problem S * = argmin \|S\|=k,S⊆V D(S), where k is a given parameter determined by budget or monitor capacity, S is sensor nodes set, and D(S) is detection time.
T32	4445-4553	Sentence	denotes	The DTM problem focuses on effective sensor nodes selection for early detection of dynamic harmful cascades.
T33	4554-4644	Sentence	denotes	We first prove the NP-hardness of DTM problem and design a corresponding greedy algorithm.
T34	4645-4857	Sentence	denotes	Considering the limitation caused by greedy algorithm inefficiency, we then propose two alternative Monte-Carlo methods, each having its pros and cons, to estimate the #P-hard objective function D(S) efficiently.
T35	4858-5086	Sentence	denotes	To further address the scalability issue, we propose an efficient upper bound based greedy (UBG) algorithm and its two accelerations Quickest-Path-UBG and Local-Reduction-UBG to cater for different types of large-scale networks.
T36	5087-5202	Sentence	denotes	These two accelerations are close to UBG in performance but orders of magnitude faster than UBG in time complexity.
T37	5203-5306	Sentence	denotes	Experiments on synthetic and real-world social networks demonstrate the practicality of our approaches.
T38	5307-5505	Sentence	denotes	Detecting the whole network by monitoring finite sensor nodes has been widely applied to detect water contaminations in a water distribution network [19] and virus outbreaks in a human society [7] .
T39	5506-5631	Sentence	denotes	Some early work places sensors by topological measures, e.g. targeting high degree nodes [32] or highly connected nodes [8] .
T40	5632-5760	Sentence	denotes	However the effects along this idea are commonly unsatisfied, since they ignore the multi-step complexity of network diffusions.
T41	5761-5922	Sentence	denotes	The "Battle of Water Sensor Network" challenge [31] motivated a number of works to optimize sensor placement in water networks to detect contamination [16, 18] .
T42	5923-6183	Sentence	denotes	By utilizing submodular property, they proposed to optimize the sensor placement with different criterions such as maximizing the probability of detection, minimizing the detection time, or minimizing the size of the subnetwork affected by the phenomena [24] .
T43	6184-6315	Sentence	denotes	In their works the data are a set of deterministic scenarios and the random property of diffusion are not taken into consideration.
T44	6316-6384	Sentence	denotes	What's more, their methods are not scalable to large networks [28] .
T45	6385-6575	Sentence	denotes	In addition to the above two types of methods, another related work [1] set sensors along fixed paths in the network so as to gather sufficient information to locate possible contaminations.
T46	6576-6738	Sentence	denotes	Early detection of contagious outbreaks by monitoring the neighborhood (friends) of a randomly chosen node (individual) was studied by Christakis and Fowler [7] .
T47	6739-6962	Sentence	denotes	Krause et al. [20] presented efficient schedules for minimizing energy consumption in battery operated sensors, while other works analyzed distributed solutions with limited communication capacities and costs [13, 21, 22] .
T48	6963-7047	Sentence	denotes	In addition, Berry et al. [4] equated the placement problem with a p-median problem.
T49	7048-7173	Sentence	denotes	Li et al. [25] proposed a dynamic-programming (DP) based greedy algorithm which is with a near-optimal performance guarantee.
T50	7174-7351	Sentence	denotes	By contrast, our work is geared to investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks from a model-driven prospective.
T51	7352-7659	Sentence	denotes	The main difference from previous works is that we take dynamic properties of cascades into consideration: (1) the cascade initiator is dynamic since we do not know who will initiate another harmful diffusion next time, and (2) the diffusion process is dynamic since the propagation trajectory is uncertain.
T52	7660-7789	Sentence	denotes	Under these two dynamic sources, we aim to accurately and fast select k nodes as sensors for early detection of harmful cascades.
T53	7790-7958	Sentence	denotes	Besides, as the size of network becomes more and more large-scale, we also need to ensure the effectiveness of the proposed sensor selection methods for large networks.
T54	7959-8093	Sentence	denotes	To minimize the time to find the dynamic harmful cascades, our search attempts to optimize the selection of sensors in a scalable way.
T55	8094-8138	Sentence	denotes	Our contributions are summarized as follows:
T56	8139-8198	Sentence	denotes	• We formulate a Detection Time Minimization (DTM) problem.
T57	8199-8469	Sentence	denotes	To describe the dynamic harmful cascade on networks, we put forward a dynamic cascade model for the harmful diffusion. • We prove that it is #P-hard to exactly calculate the objective function of DTM problem and propose two equivalent Monte-Carlo methods to estimate it.
T58	8470-8631	Sentence	denotes	Each has its advantages and disadvantages. • We show the NP-hardness of DTM problem as it can be shown to contain the Set Cover problem as a simple special case.
T59	8632-8913	Sentence	denotes	We convert the DTM problem to a constrained max optimization problem, prove the submodularity of the new objective function, and then employ the greedy algorithm which achieves an approximation ratio of 1 − 1/e. • We theoretically establish new upper bounds for the remaining time.
T60	8914-9167	Sentence	denotes	Based on these bounds, we further propose a new Upper Bound based Greedy (UBG in short) algorithm which can significantly reduce the number of estimations in greedy-based algorithms, especially at the initial step. • We propose two accelerations of UBG:
T61	9168-9354	Sentence	denotes	Quickest-Path-UBG and Local-Reduction-UBG to address the DTM problem for large networks, which are close to UBG in performance but orders of magnitude faster than UBG in time complexity.
T62	9355-9485	Sentence	denotes	In addition, our methods not only aim at early detection of harmful cascades, but also shed light on a host of other applications.
T63	9486-9639	Sentence	denotes	For example, the goal of Emerging Topic Detection is to identify emergent topics in a social network, assuming full access to the stream of all postings.
T64	9640-9894	Sentence	denotes	Providers, such as Twitter or Facebook, have an immediate access to all tweets or postings as they are submitted to their server [6, 29] , while outside observers need an efficient mechanism to monitor changes, such as the methods developed in this work.
T65	9895-10150	Sentence	denotes	Another example, an emerging trend in algorithmic stock trading is the use of automatic search through the Web and social networks for pieces of information that can be used in trading decisions before they appear in the more popular news sites [23, 27] .
T66	10151-10255	Sentence	denotes	Similarly, intelligence, business and politics analysts are scanning online sources for new information.
T67	10256-10358	Sentence	denotes	The theoretical framework and method established in this paper can also be used in these applications.
T68	10359-10405	Sentence	denotes	The rest of the paper is organized as follows.
T69	10406-10467	Sentence	denotes	Section 2 presents the cascade model and problem formulation.
T70	10468-10581	Sentence	denotes	In Section 3 we show the # P-hardness of detection time calculation and derive two equivalent estimation methods.
T71	10582-10738	Sentence	denotes	Section 4 is devoted to the NP-hardness of DTM problem, the submodular properties of transformed objective function, and the corresponding greedy algorithm.
T72	10739-10860	Sentence	denotes	To prune the unnecessary estimation calls, we analysis the upper bounds for the new proposed UBG algorithm in Section 5 .
T73	10861-10930	Sentence	denotes	Section 6 presents two accelerations of UBG for large scale networks.
T74	10931-10972	Sentence	denotes	Section 7 shows the experimental results.
T75	10973-11008	Sentence	denotes	We conclude the paper in Section 8.
T76	11009-11051	Sentence	denotes	Table 1 outlines the major variables used.
T77	11052-11230	Sentence	denotes	In this section we start with a description of dynamic cascade model and then we define the Detection Time Minimization (DTM) problem abstracted from the early detection problem.
T78	11231-11409	Sentence	denotes	We start somewhat with the framework of [24] , where the models introduced are essentially descriptive to specify a joint distribution over all nodes' behavior in a global sense.
T79	11410-11595	Sentence	denotes	In contrast, we focus on more operational models, from mathematical sociology [3] and interacting particle systems [10] , to explicitly represent the step-by-step dynamics of infection.
T80	11596-11805	Sentence	denotes	By considering operational models for the dynamic diffusion of a harmful item through a network G, represented by a directed graph, we will speak of each individual node as being either infected or uninfected.
T81	11806-12027	Sentence	denotes	Motivated by the properties of contaminants like rumor and virus, we will focus from now on the progressive case in which nodes can switch from being uninfected to being infected, but do not switch in the other direction.
T82	12028-12165	Sentence	denotes	We also focus on the setting that each node's tendency to become infected increases monotonically as more of its neighbors become active.
T83	12166-12450	Sentence	denotes	Thus, the process can be reviewed with respect to some particular uninfected node v: as time unfolds, more and more of v's neighbors become infected; at some point, this may cause v to become infected, and v's decision may in turn trigger further infections by nodes connected with v.
T84	12451-12658	Sentence	denotes	Motivated by above observation, we put forward a dynamic susceptible-infected (DSI) model for the harmful cascade spreading, which can be seen as a variant of the common susceptible-infected (SI) model [3] .
T85	12659-12766	Sentence	denotes	The susceptible nodes are those with at least one infected neighbor, and the infected nodes do not recover.
T86	12767-12783	Sentence	denotes	Specifically (1)
T87	12784-12904	Sentence	denotes	The DSI model is attached with a probability distribution = { (u), u ∈ V} to describe the dynamic of cascade initiators.
T88	12905-13000	Sentence	denotes	can be seen as a priori knowledge of nodes being infected initially, where u ∈ V (u) = 1 holds.
T89	13001-13119	Sentence	denotes	The DSI model first chooses an initially infected node u ∈ V according to distribution , and then it works as follows.
T90	13120-13201	Sentence	denotes	Let I t ⊆ V be the set of nodes that gets infected at step t ≥ 0, with I 0 = {u}.
T91	13202-13293	Sentence	denotes	Define J t := 0≤i≤t I i be the cumulative set of nodes that get infected before step t ≥ 0.
T92	13294-13417	Sentence	denotes	Then, at step t + 1, each node u ∈ J t may infect its out-neighbors v ∈ V \J t with an independent probability of ip(u, v).
T93	13418-13488	Sentence	denotes	Thus, a node v ∈ V \J t is infected at step t + 1 with the probability
T94	13489-13557	Sentence	denotes	If node v is successfully infected, it is added into the set I t+1 .
T95	13558-13601	Sentence	denotes	Then update J t+1 by J t+1 ←− J t ∪ I t+1 .
T96	13602-13782	Sentence	denotes	Note that each infected node has more than one chance to activate its susceptible out-neighbors until they get infected, and each node stays infected once it is infected by others.
T97	13783-13849	Sentence	denotes	Obviously the cumulative infected process (J t ) t≥0 is Markovian.
T98	13850-13990	Sentence	denotes	In a diffusion model (I t ) t≥0 , given an initially infected node u ∈ V and a set of sensors S ⊆ V, the random detection time is defined as
T99	13991-14064	Sentence	denotes	where a ∧ b : = min{a, b} and T max is the time interval that we observe.
T100	14065-14088	Sentence	denotes	We assume inf{ ∅ } =+∞.
T101	14089-14225	Sentence	denotes	The random detection time (u, S) denotes the time delay that a contaminant initiated from node u is detected by one of the sensors in S.
T102	14226-14232	Sentence	denotes	Proof.
T103	14233-14286	Sentence	denotes	Above properties come from the definition in Eq. (2).
T104	14287-14329	Sentence	denotes	We put the detailed proofs in Section 8.1.
T105	14330-14549	Sentence	denotes	Note that the detection time (u, S) can be viewed as a special type of stopping time in stochastic process theory [11] .ᮀ Let the variable X denote the random initiator with distribution , i.e., for every u ∈ V, we have
T106	14550-14656	Sentence	denotes	The (expected) detection time from a random initiator to one of the selected sensors in S can be define as
T107	14657-14716	Sentence	denotes	where E is an expectation operator under the cascade model.
T108	14717-14820	Sentence	denotes	By the conditional expectation, we can calculate the expected detection time D(S) in the following way:
T109	14821-14861	Sentence	denotes	Indeed, Eq. (5) can be reached like this
T110	14862-14917	Sentence	denotes	where (X) denotes the -algebra generated by variable X.
T111	14918-15033	Sentence	denotes	Eq. (5) converts the global detection time E[ (X, S)] as a summation of local detection time E[ (u, S)] with u ∈ V.
T112	15034-15081	Sentence	denotes	This conversion provides a basic for Section 5.
T113	15082-15245	Sentence	denotes	Our goal is to find k nodes as sensors in a network in order to minimize the time until an infection -starting from a random initiator in the network -is detected.
T114	15246-15430	Sentence	denotes	Formally, we formulate the problem as the following discrete optimization problem: we want to find a subset S * ⊆ V such that \|S * \| = k and D(S * ) = min{D(S)\|\|S\| = k, S ⊆ V}, i. e .,
T115	15431-15501	Sentence	denotes	where k is a given parameter determined by budget or monitor capacity.
T116	15502-15578	Sentence	denotes	We call this as detection time minimization problem (DTM problem for short).
T117	15579-15677	Sentence	denotes	For the sake of concreteness, we will discuss our results in terms of the DSI model in particular.
T118	15678-15797	Sentence	denotes	To solve the DTM problem Eq. (6), the first issue is how to calculate the objective function D(S) given a sensor set S.
T119	15798-15856	Sentence	denotes	This question isn't as easy as its description in Eq. (4).
T120	15857-16033	Sentence	denotes	For example, the DSI process is underspecified, since we have not prescribed the order in which newly infected nodes in a given step t will attempt to activate their neighbors.
T121	16034-16227	Sentence	denotes	Thus, it is not initially obvious that the process is even well-defined, in the sense that it yields the same distribution over outcomes regardless of how we schedule the attempted activations.
T122	16228-16365	Sentence	denotes	Actually the computation of D(S) is #P-hard, by showing a reduction from the positive partitioned 2-DNF assignment counting problem [9] .
T123	16366-16372	Sentence	denotes	Proof.
T124	16373-16485	Sentence	denotes	We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] .
T125	16486-16725	Sentence	denotes	For detailed proof, see Section 8.2. ᮀ Since it is intractable to exactly compute D(S) on a typically sized graph, a natural idea is to employ Monte-Carlo methods to estimate D(S), which can be implemented in two different ways as follows:
T126	16726-16917	Sentence	denotes	The expected detection time D(S) is obtained by directly simulating the random process of diffusion triggered by a random node, say u, chosen according to the distribution defined as Eq. (3).
T127	16918-17002	Sentence	denotes	Let I t denote the set of nodes newly infected in the t-th iteration with I 0 = {u}.
T128	17003-17139	Sentence	denotes	In the (t + 1)-th iteration, a node u ∈ J t : = ∪ 0≤i≤t I i attempts to activate each uninfected neighbor v / ∈ J t with the probability
T129	17140-17233	Sentence	denotes	The process is repeated until the diffusion hits the sensor set S at some step T(u, S), i.e.,
T130	17234-17269	Sentence	denotes	where inf{ ∅ } : =+∞ as convention.
T131	17270-17390	Sentence	denotes	The detection time of this single simulation is recorded at T(u, S) ∧ T max , which is right (u, S) defined in Eq. (2) .
T132	17391-17518	Sentence	denotes	We run such simulations for many times and finally estimate the expected detecting time D(S) by averaging over all simulations.
T133	17519-17753	Sentence	denotes	According to the characteristic of the DSI model, the time cost that an infected node u spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(u, v), and we denote this time cost as C(u, v), i.e.,
T134	17754-17777	Sentence	denotes	for any integral s ≥ 1.
T135	17778-17948	Sentence	denotes	We can flip all coins a priori to produce a weighted graph G = (V, E, c), where an edge (u, v) is labeled with the time cost c(u, v) -a sample of random variable C(u, v).
T136	17949-18200	Sentence	denotes	Actually such a snapshot provides an easy way to sample the detection time of any sensor set S, which exactly equals to the smallest time cost from the initially infected node, say u (chosen according to the distribution ), to the nearest sensor in S.
T137	18201-18359	Sentence	denotes	Define c(u, S) be the smallest time cost from u to the nearest sensor in S, then the detection time of this single simulation is recorded as c(u, S) ∧ T max .
T138	18360-18477	Sentence	denotes	We produce plenty of snapshots and finally estimate the expected detection time D(S) by averaging over all snapshots.
T139	18478-18565	Sentence	denotes	The snapshot simulation is equivalent to the propagation simulation in estimating D(S).
T140	18566-18654	Sentence	denotes	More specifically, the random variables T(u, S) and c(u, S) are identically distributed.
T141	18655-18749	Sentence	denotes	, it is enough to prove that random variables T(u, S) and c(u, S) are identically distributed.
T142	18750-18787	Sentence	denotes	For proof details, see Section 8.3. ᮀ
T143	18788-18897	Sentence	denotes	We have confirmed that the two methods are equivalent, while either has its own advantages and disadvantages.
T144	18898-19076	Sentence	denotes	For estimating a specific D(S), the simulation method is faster, because it only needs to examine a small portion of edges while the snapshot method has to examine all the edges.
T145	19077-19272	Sentence	denotes	For estimating the expected detection time of different sensor sets, the snapshot method outperforms the simulation method in terms of time complexity, since each snapshot serves all sensor sets.
T146	19273-19446	Sentence	denotes	Under these observations, the heuristic algorithms use the propagation simulation, while the greedy-based algorithms employ the snapshot simulation in the experimental part.
T147	19447-19660	Sentence	denotes	In this section we first show the NP-hardness of DTM problem in Eq. (6), as it can be shown to contain the Set Cover problem as a simple special case, then we propose a simple greedy algorithm for the DTM problem.
T148	19661-19667	Sentence	denotes	Proof.
T149	19668-19918	Sentence	denotes	Consider an instance of the NP-complete Set Cover problem, defined by a collection of subsets S = {S 1 , S 2 , . . ., S m } of a ground set U = {u 1 , u 2 , . . ., u n }, we wish to know whether there exist k of the subsets whose union is equal to U.
T150	19919-19953	Sentence	denotes	We can assume that k < n < m here.
T151	19954-20031	Sentence	denotes	We show that this can be viewed as a special case of the optimal problem (6).
T152	20032-20344	Sentence	denotes	Given an arbitrary instance of the Set Cover problem, we define a corresponding directed bipartite graph with n + m nodes: there is a node i corresponding to each element u i ∈ U, a node j corresponding to each set S j ∈ S, and a directed edge (i, j) with activation probability ip(i, j) = 1 whenever u i ∈ S j .
T153	20345-20458	Sentence	denotes	Define the probability distribution = { (u), u ∈ V} on the knowledge of nodes being initially infected as follows
T154	20459-20576	Sentence	denotes	The Set Cover problem is equivalent to deciding if there is a set S of k nodes in this bipartite graph with D(S) ≤ 1.
T155	20577-20692	Sentence	denotes	Note that for the instance we have defined, activation is a deterministic process, as all probabilities are 0 or 1.
T156	20693-21115	Sentence	denotes	Monitoring k nodes to detect diffusion initiated from U corresponding to sets in a Set Cover solution results in covering all n nodes corresponding to the ground set U, and if any set S of k nodes has D(S) ≤ 1, then the Set Cover problem must be solvable.ᮀ Since the optimization problem is NP-hard and the network is prohibitively large, we cannot compute the optimum value to verify the actual quality of approximations.
T157	21116-21195	Sentence	denotes	Hence a natural idea is to employ the greedy algorithm as approximation method.
T158	21196-21306	Sentence	denotes	To make better use of the greedy algorithm, we consider an equivalent optimization problem Eq. (6) as follows,
T159	21307-21312	Sentence	denotes	where
T160	21313-21407	Sentence	denotes	is defined as the (expected) remaining time for taking actions when a contaminant is detected.
T161	21408-21501	Sentence	denotes	The above alternative formulation has key properties as described in the following Theorem 4.
T162	21502-21531	Sentence	denotes	Theoretically, a non-negative
T163	21532-21546	Sentence	denotes	for all S ⊆ T.
T164	21547-21553	Sentence	denotes	Proof.
T165	21554-21627	Sentence	denotes	It is obvious that the remaining time function R is monotone and R(∅) =0.
T166	21628-21662	Sentence	denotes	Now we prove that R is submodular.
T167	21663-21738	Sentence	denotes	According to the definitions in Eqs. (5) and (10), it suffices to show that
T168	21739-21771	Sentence	denotes	for all S ⊆ T ⊆ V and v ∈ V \T .
T169	21772-21807	Sentence	denotes	By the second property of in Propo-
T170	21808-21901	Sentence	denotes	Note that (u, T) ≤ (u, S) always works by the monotone decrease property of in Proposition 1.
T171	21902-21972	Sentence	denotes	Now we discuss Eq. (14) separately according to the value of (u, {v}).
T172	21973-22016	Sentence	denotes	, which holds by the monotone decrease of .
T173	22017-22230	Sentence	denotes	Eq. (14) is proven and we thus get that R(S) is submodular.ᮀ By above properties in Theorem 4, the problem given in Eq. (9) can be approximated by the greedy algorithm in Algorithm 1 with the set function f : = R.
T174	22231-22408	Sentence	denotes	For any submodular and monotone function f with f(∅) =0, the problem of finding a set S of size k that maximizes f(S) can be approximated by the greedy algorithm in Algorithm 1.
T175	22409-22579	Sentence	denotes	The algorithm iteratively selects a new sensor u that maximizes the incremental change of f and includes the new sensor into the set S until k sensors have been selected.
T176	22580-22755	Sentence	denotes	It is shown that the algorithm guarantees an approximation ratio of f(S)/f(S * ) ≥ 1 −1/e, where S is the output of the greedy algorithm and S * is the optimal solution [30] .
T177	22756-22972	Sentence	denotes	In Greedy(k, R), a thorny problem is that there is no efficient way to compute R(S) given a placement S, and we turn to run snapshots for 10, 000 trials to obtain an accurate estimate of R(S), mentioned in Section 3.
T178	22973-23021	Sentence	denotes	This actually leads to expensive selection time.
T179	23022-23222	Sentence	denotes	Another source of inefficiency in Greedy(k, R) is that there exists O(kN) iterations at the remaining time estimation step, where k is the size of the initial sensor set, and N is the number of nodes.
T180	23223-23290	Sentence	denotes	When N is large, the efficiency of the algorithm is unsatisfactory.
T181	23291-23552	Sentence	denotes	Hence, in order to improve the efficiency of Greedy(k, R), one can either reduce the number of calls for evaluating R(S), or develop advanced heuristic algorithms which can conduct fast and approximate estimations for R(S) at the expense of accuracy guarantees.
T182	23553-23714	Sentence	denotes	In order to prune the estimation calls in Greedy(k, R), a natural idea is to employ the Cost-Effective Lazy Forward selection (CELF) algorithm proposed in [24] .
T183	23715-23908	Sentence	denotes	The principle behind is that the marginal gain of a node in the current iteration cannot be more than that in previous iterations, and thus the number of estimation calls can be greatly pruned.
T184	23909-24040	Sentence	denotes	CELF optimization produces the same sensor set as the original greedy algorithm, but it is much faster than the original one [24] .
T185	24041-24158	Sentence	denotes	Although CELF significantly improves Greedy(k, R), the sensor selection time is still unaffordable on large networks.
T186	24159-24399	Sentence	denotes	In particular, in the first round to establish the initial upper bounds, CELF needs to estimate R({v}) using MC simulations for each node v, leading to N times of MC calls, which is time-consuming, especially when the network is very large.
T187	24400-24600	Sentence	denotes	The limitation leads to a rather fundamental question that, can we derive an upper bound of R({v}) which can be used to further prune unnecessary detection time estimations (MC calls) in Greedy(k, R)?
T188	24601-24727	Sentence	denotes	Motivated by this question and the idea in [40] , in this section we derive an initial upper bound of R({v}) for Greedy(k, R).
T189	24728-24870	Sentence	denotes	Based on the bound, we propose a new greedy algorithm Upper Bound based Greedy (UBG for short), which outperforms the original CELF algorithm.
T190	24871-24980	Sentence	denotes	Essentially different from the bounds that derived for the influence spread under the IC model [40] , we here
T191	24982-25036	Sentence	denotes	In this part, we aim to derive an upper bound of R(v).
T192	25037-25121	Sentence	denotes	Before introducing the upper bounds in Theorem 5, we first prepare two propositions.
T193	25122-25245	Sentence	denotes	Let P u (v ∈ J t ) denote the probability that node v becomes infected before step t when the initially infected node is u.
T194	25246-25287	Sentence	denotes	We have the first proposition as follows.
T195	25288-25364	Sentence	denotes	For v ∈ V , the remaining time R(v) under the DSI model can be calculated as
T196	25365-25371	Sentence	denotes	Proof.
T197	25372-25405	Sentence	denotes	In fact, by the definition in Eq.
T198	25406-25424	Sentence	denotes	(2), we first have
T199	25425-25646	Sentence	denotes	where the fourth '=' is due to the fact that u ∈ V (u) = 1 in above derivation. ᮀ Proposition 2 reveals that we can treat the global remaining time R(v) as a summation of all T max propagation steps of local probabilities
T200	25647-25738	Sentence	denotes	Based on Proposition 2, a following question is, what is the relationship between two sets,
T201	25739-25753	Sentence	denotes	Proposition 3.
T202	25754-25797	Sentence	denotes	For k ≥ 1, we have the following inequation
T203	25798-25804	Sentence	denotes	Proof.
T204	25805-25885	Sentence	denotes	For k ≥ 1, by the definition of conditional expectation and DSI model, we obtain
T205	25886-25954	Sentence	denotes	where the first '≤' is due to 1 {v/ ∈J k−1 } ≤ 1, and the second '≤'
T206	25955-26142	Sentence	denotes	x i in the above derivation.ᮀ Proposition 3 clearly identifies the ordering relationship between two adjacent elements in the series P u (v ∈ J 0 ), P u (v ∈ J 1 ), P u (v ∈ J 2 ), · · ·.
T207	26143-26216	Sentence	denotes	Now we simplify the results in Proposition 3 by using the form of matrix.
T208	26217-26311	Sentence	denotes	Let IP be the infection probability matrix with the element at position (u, v) being ip(u, v).
T209	26312-26344	Sentence	denotes	For t ≥ 0, denote the row vector
T210	26345-26441	Sentence	denotes	as the probabilities of nodes being infected before step t, i. e ., Â u t (v) := P u (v ∈ J t ).
T211	26442-26483	Sentence	denotes	Obviously, we have Â u 0 (v) = 1 {u} (v).
T212	26484-26542	Sentence	denotes	Now we can rewrite Proposition 3 by using the matrix form,
T213	26543-26627	Sentence	denotes	By iteration, we further get that u t ≤ u 0 · (E + IP) t , where E is a unit matrix.
T214	26628-26704	Sentence	denotes	Furthermore, due to the definition of probability Â u t (v), it follows that
T215	26705-26774	Sentence	denotes	Hereafter, define A ∧ 1 : = {a(i, j) ∧ 1} for a matrix A = {a(i, j)}.
T216	26775-26875	Sentence	denotes	With the above preparations, we can present the results on upper bound of remaining time as follows,
T217	26876-26944	Sentence	denotes	where E is a unit matrix and [A] (u,v) means the element at position
T218	26946-27029	Sentence	denotes	where is a prior distribution on the likelihood of nodes being the infected source.
T219	27030-27101	Sentence	denotes	We first use a toy example to explain how to calculate the upper bound.
T220	27102-27112	Sentence	denotes	Example 1.
T221	27113-27197	Sentence	denotes	Given a directed network G in Fig. 1 with infection probability matrix in Eq. (22) .
T222	27198-27285	Sentence	denotes	Let T max = 10, we have ((E + IP) t ∧ 1) is intractable when the network size is large.
T223	27286-27370	Sentence	denotes	To overcome the difficulty, we adopt the following procedure to calculate Tmax−1 t=0
T224	27371-27493	Sentence	denotes	If t ≥1 ≤ T max − 1, it implies that we do not need to calculate (E + IP) t any more when t ≥ t ≥1 , and therefore we have
T225	27494-27546	Sentence	denotes	where 1 is a N × N matrix with all elements being 1.
T226	27547-27572	Sentence	denotes	Additionally we find that
T227	27574-27580	Sentence	denotes	Proof.
T228	27581-27630	Sentence	denotes	In fact, when IP is small enough, it follows that
T229	27631-27773	Sentence	denotes	where the approximation stems from Taylor expansion.ᮀ By Eqs. (21) and (25), the new upper bound of remaining time can be approximately put as
T230	27774-27821	Sentence	denotes	which is relatively tractable when IP is small.
T231	27822-27908	Sentence	denotes	Based on the upper bound, we propose a new UBG algorithm for early outbreak detection.
T232	27909-27962	Sentence	denotes	First we explain the difference between UBG and CELF.
T233	27963-28061	Sentence	denotes	The CELF algorithm [24] exploits the submodular property to improve the original greedy algorithm.
T234	28062-28246	Sentence	denotes	However, CELF demands N (the number of nodes in the network) remaining time estimations to establish the initial bounds of marginal increments, which is time expensive on large graphs.
T235	28247-28439	Sentence	denotes	In contrast, the proposed Upper Bound based Greedy (UBG) algorithm uses the derived new bound to further reduce the number of remaining time estimations, especially in the initialization step.
T236	28440-28573	Sentence	denotes	This way, the nodes are ranked by their upper bound scores which can be used to prune unnecessary calculations in the CELF algorithm.
T237	28574-28608	Sentence	denotes	We use Example 2 for illustration.
T238	28610-28748	Sentence	denotes	Even with the UBG algorithm proposed in Section 5, its running time is still unbearable and may not be suitable for large social networks.
T239	28749-28924	Sentence	denotes	Although UBG can greatly reduce the number of remaining time estimation calls, each estimation call is very timeconsuming, as it needs produce enough samples and average them.
T240	28925-29091	Sentence	denotes	Hence a possible alternative to further accelerate UBG is to employ heuristics to approximate the remaining time R(S) or the marginal return R(S ∪ {u}) − R(S) in UBG.
T241	29092-29190	Sentence	denotes	Along this idea, we here introduce two accelerative algorithms to address the inefficiency of UBG:
T242	29191-29233	Sentence	denotes	Quickest-Path-UBG and Local-Reduction-UBG.
T243	29234-29474	Sentence	denotes	The experimental results will show that (i) both of them are efficient in terms of running time, and (ii) the former one applies to sparse networks better, while the latter one is more effective in dense networks in terms of detection time.
T244	29475-29562	Sentence	denotes	In this part we introduce a tractable heuristic to approximate the remaining time R(S).
T245	29563-29653	Sentence	denotes	From Eq. (5) and Eq. (10), the key point of estimating R(S) is how to estimate E[ (u, S)].
T246	29654-29773	Sentence	denotes	Note that E[ (u, S)] measures the expected time delay of a diffusion initiated from node u propagating to sensor set S.
T247	29774-29886	Sentence	denotes	An intuitive idea is that the most likely propagation path should be the quickest path from node u to the set S.
T248	29887-29966	Sentence	denotes	Hence, a question is how to measure the quickest path from node u to the set S?
T249	29967-30036	Sentence	denotes	The proposed Quickest-Path-UBG is inspired by answering the question.
T250	30037-30177	Sentence	denotes	According to the geometric distribution, if a random variable X is distributed geometrically with a parameter p, it follows that E[X] = 1/p.
T251	30178-30442	Sentence	denotes	Since the time that a node u spends in infecting its uninfected neighborhood v is distributed geometrically with the parameter ip(u, v), the value 1/ip(u, v) should be the expected time cost that an infection propagates from node u to node v along the edge (u, v).
T252	30443-30467	Sentence	denotes	Fig. 2 shows an example.
T253	30468-30568	Sentence	denotes	For any two nodes u and v, let d(u, v) be the smallest time cost among the paths connecting u and v.
T254	30569-30629	Sentence	denotes	For example, d( , ) = min{5 + 3.3, 10 + 5} = 8.3 in Fig. 2 .
T255	30630-30841	Sentence	denotes	Then, for a subset S ⊆ V, we define d(u, S) := min v ∈ S d(u, v). (27) Intuitively, d(u, S) denotes the expected time that an infection propagates from u to S along the quickest path in the graph G = (V, E, m) .
T256	30842-30976	Sentence	denotes	Therefore, a fundamental question arises, is there some approximate relationship between the detection time and the shortest distance.
T257	30977-31009	Sentence	denotes	Theoretically we have Theorem 6.
T258	31010-31171	Sentence	denotes	When the shortest path from u to S in graph G = (V, E, m) is unique, the quantity d(u, S) ∧ T max can be used to approximate the detection time E[ (u, S)], i.e.,
T259	31172-31178	Sentence	denotes	Proof.
T260	31179-31272	Sentence	denotes	In fact, when the shortest path from u to S in graph G = (V, E, m) is unique, it follows that
T261	31273-31377	Sentence	denotes	where we use an result borrowed from probability theory for the approximation: consider random variables
T262	31378-31476	Sentence	denotes	.ᮀ Based on Theorem 6, combining Eq. (5), Eq. (10) and Eq. (28), we have the following derivation,
T263	31477-31520	Sentence	denotes	Hence we can approximate the remaining time
T264	31521-31611	Sentence	denotes	rather than the heavy Monte-Carlo estimation MC(S ∪ {u}) in the 09th row of UBG algorithm.
T265	31612-31704	Sentence	denotes	We nail down this new method as Quickest-Path-UBG, which is shown explicitly in Algorithm 3.
T266	31706-31945	Sentence	denotes	Note that the reason we call this acceleration as Quickest-Path-UBG rather than Shortest-Path-UBG lies in that, our Quickest-Path-UBG introduces a new measurement to reflect the infection time cost rather than the distance in common sense.
T267	31946-32090	Sentence	denotes	In this part we turn to a tractable heuristic to approximate the marginal return ı u (S) : = R(S ∪ {u}) − R(S) in the 09th row of UBG algorithm.
T268	32091-32275	Sentence	denotes	To facilitate the description, we will approximate the marginal reduction D(S) − D(S ∪ {u}) rather than the marginal return R(S ∪ {u}) − R(S), due to the fact that R(S) = T max − D(S).
T269	32276-32417	Sentence	denotes	In sociology literature, degree and other centrality-based heuristics are commonly used to select influential nodes in social networks [37] .
T270	32418-32580	Sentence	denotes	These methods usually looked over the multi-step complexity of network diffusions, and assumed that the infection can propagate ahead only one hop before the end.
T271	32581-32689	Sentence	denotes	In other words, if a node initiates an infection, the diffusion spread is at most its first-order neighbors.
T272	32690-32735	Sentence	denotes	We call this as one-hop assumption hereafter.
T273	32736-32972	Sentence	denotes	Especially, experimental results in [24] showed that selecting vertices with maximum indegree as sensors results in earlier infection detection than other heuristics, which validated the rationality of one-hop assumption to some extent.
T274	32973-33116	Sentence	denotes	Under this observation, we will propose a tractable method upon the one-hop assumption to approximate the marginal reduction D(S) − D(S ∪ {u}).
T275	33117-33554	Sentence	denotes	We first introduce a useful result in probability theory: if the random variable X i is distributed geometrically with parameter p i for i = 1, · · ·, n and they are independent, then ∧ i=1,· · ·,n X i is distributed geometrically with parameter 1 − According to the characteristic of the DSI model, the time cost that an infected node w spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(w, v).
T276	33555-33679	Sentence	denotes	If the initially infected node w has multi-neighbors in sensor set S, the infections to different neighbors are independent.
T277	33680-33774	Sentence	denotes	Under the onehop assumption, the expected detection time of w can thus be calculated like this
T278	33775-33896	Sentence	denotes	If the initially infected node w has no neighbors in sensor set S, the expected detection time of w is defined as T max .
T279	33897-33938	Sentence	denotes	If w ∈ S, it follows that E[ (w, S)] = 0.
T280	33939-34084	Sentence	denotes	Fig. 3 presents an example to show the calculation result of expected detection time E[ (w, S)] for each node w ∈ V under the one-hop assumption.
T281	34085-34220	Sentence	denotes	With the preparations above, we are about to calculate the marginal reduction ı u (S) = D(S) − D(S ∪ {u}) under the one-hop assumption.
T282	34221-34291	Sentence	denotes	Assume the set S has been selected as sensor set to detect infections.
T283	34292-34421	Sentence	denotes	When considering adding another node u as a new sensor into S, the reduction of the expected detection time can be put as follows
T284	34422-34442	Sentence	denotes	which is by Eq. (5).
T285	34443-34530	Sentence	denotes	We now use Theorem 7 to present the concrete expression of the right part of Eq. (30) .
T286	34531-34616	Sentence	denotes	Similar with the definition of Eq. (1), we predefine the parents of set S as follows,
T287	34617-34627	Sentence	denotes	Theorem 7.
T288	34628-34793	Sentence	denotes	In the DSI model with infection probability ip(w, v) on a directed graph G = (V, E), the reduction ı u (S) of the expected detection time can be calculated like this
T289	34794-35029	Sentence	denotes	under one-hop assumption, where f(u, S) is defined as The reason that we call it as Local-Reduction-UBG is that we only consider the local change in monitoring {u} and Par(u) \ Par(S) Table 2 Statistics of the four real-world networks.
T290	35030-35136	Sentence	denotes	Digger Twitter Epinions Small-world #Node 8194 32,986 51,783 200,000 #Edge 56,440 763,713 476,
T291	35138-35301	Sentence	denotes	We conduct experiments on both synthetic and real-world data sets to evaluate the UBG algorithm, the Quickest-Path-UBG algorithm and Local-Reduction-UBG algorithm.
T292	35302-35381	Sentence	denotes	We implement the algorithms using C++ with the Standard Template Library (STL).
T293	35382-35488	Sentence	denotes	All experiments are run on a Linux (Ubuntu 11.10) machine with six-core 1400 MHz AMD CPU and 32 GB memory.
T294	35489-35553	Sentence	denotes	Three real and one synthetic data sets are used for comparisons.
T295	35554-35737	Sentence	denotes	The Digger data 1 is a heterogeneous network, including Digg stories, user actions (submit, digg, comment and reply) with respect to the stories, and friendship relations among users.
T296	35738-35822	Sentence	denotes	The Twitter and Epinions data sets can both be obtained from Stanford Datasource 2 .
T297	35823-35954	Sentence	denotes	Epinions is a general consumer review site where visitors can read reviews about a variety of items to help them decide a purchase.
T298	35955-36072	Sentence	denotes	The synthetic Small-world data set is the type of graphs in which each node can be reached by a small number of hops.
T299	36073-36184	Sentence	denotes	For smallworld model we set the parameter of the nearest neighbors k = 15 and the rewiring probability p = 0.1.
T300	36185-36304	Sentence	denotes	The above networks are representative ones, covering a variety of networks with different types of relations and sizes.
T301	36305-36385	Sentence	denotes	The details of the data sets are listed in Table 2 where degree means in-degree.
T302	36386-36463	Sentence	denotes	In our experiments, an undirected graph is regarded as a bidirectional graph.
T303	36464-36622	Sentence	denotes	We compare the UBG in Algorithm 2, the Quickest-Path-UBG in Algorithm 3, and Local-Reduction-UBG in Algorithm 4 with both the greedy and heuristic algorithms.
T304	36623-36636	Sentence	denotes	• CELF [24] .
T305	36637-36757	Sentence	denotes	The state-of-the-art greedy algorithm, where uses 10,000 snapshots in the whole process for any network. • DEGREE [37] .
T306	36758-36845	Sentence	denotes	A heuristic algorithm based on "degree centrality", with high-degree nodes as key ones.
T307	36846-36932	Sentence	denotes	The seeds are the nodes with the k highest in-degrees. • INTER-MONITOR DISTANCE [35] .
T308	36933-37083	Sentence	denotes	A heuristic algorithm which requires any pair of sensors to be at least d hops away, where d is as large as it can choose k monitors. • PageRank [5] .
T309	37084-37161	Sentence	denotes	A link analysis algorithm which ranks the importance of pages in a Web graph.
T310	37162-37278	Sentence	denotes	We implement the power method with a damping factor of 0.85, and pick the k highest-ranked nodes as seeds. • Random.
T311	37279-37376	Sentence	denotes	It simply selects k random vertices in the graph as the seed set, which is taken as the baseline.
T312	37377-37558	Sentence	denotes	In our experiments, to obtain the detection time of sensor sets provided by heuristic algorithms, we run Monte-Carlo simulation on the networks 10, 000 times and calculate the mean.
T313	37559-37701	Sentence	denotes	The simple greedy algorithm is not compared because many works have reported that CELF has the same optimization result and less running time.
T314	37702-37863	Sentence	denotes	Since the DEGREE heuristic is the state-of-the-art [37] , we do not implement heuristics such as distance centrality and betweenness centrality-based heuristics.
T315	37864-38032	Sentence	denotes	We mainly report results on a uniform infection probability of 0.1 assigned to each directed link in the network, i.e., ip(u, v) = 0.1 for any directed edge (u, v) ∈ E.
T316	38033-38159	Sentence	denotes	One can refer to the work [14, 33, 36] for learning real values of the parameters {ip(u, v) : (u, v) ∈ E} from available data.
T317	38160-38257	Sentence	denotes	Besides, we let the time horizon T max = 30 and the prior distribution be uniform in the network.
T318	38258-38325	Sentence	denotes	In Section 3, we proposed two Monte-Carlo methods to estimate D(S).
T319	38326-38413	Sentence	denotes	Table 3 shows the estimation results of Propagation Simulation and Snapshot Simulation.
T320	38414-38502	Sentence	denotes	Here we select 10 nodes with the highest in-degrees in each network as the sensor set S.
T321	38503-38663	Sentence	denotes	We can find that Propagation Simulation and Snapshot Simulation release almost the same estimation results, which confirms their equivalence in estimating D(S).
T322	38664-38735	Sentence	denotes	Fig. 4 shows the cumulative time cost of these two Monte-Carlo methods.
T323	38736-38911	Sentence	denotes	Here we randomly select 10 sensor sets {S 1 , S 2 , . . ., S 10 } from the Digger data and every sensor set has five sensor nodes, i.e. \|S i \| = 5 for all i = 1, 2, . . ., 10.
T324	38912-39098	Sentence	denotes	We can see that the cumulative time cost of Propagation Simulation increases linearly, while that of Snapshot Simulation has a big jump at the first sensor set and then increases slowly.
T325	39099-39310	Sentence	denotes	The reason behind is that Snapshot Simulation needs to establish numerous snapshots to estimate the first D(S 1 ), and these established snapshots can be reused in the posterior estimations of {D(S i )} 10 i=2 .
T326	39311-39400	Sentence	denotes	Table 4 shows the gap between the real value of remaining time R(S) and its upper bounds.
T327	39401-39495	Sentence	denotes	Here we also select ten nodes with the highest in-degrees in each network as the sensor set S.
T328	39497-39677	Sentence	denotes	In Table 5 , we compare the number of estimation or approximation calls at the first 10 iterations among CELF, UBG, Quickest-Path-UBG and Local-Reduction-UBG on the four data sets.
T329	39678-39847	Sentence	denotes	From the results, we can observe that the call number in UBG and its two accelerations are significantly reduced compared to that in CELF, especially at the first round.
T330	39848-39985	Sentence	denotes	One may notice that in Table 5 , CELF occasionally defeats our methods, but the total call number of our methods are much less than CELF.
T331	39986-40162	Sentence	denotes	As listed in 5, the total call number of the first 10 iterations of UBG and its two accelerations, compared to CELF, is reduced at a rate of 94% at least on the four data sets.
T332	40163-40264	Sentence	denotes	Similarly, at least 81% reduction of call numbers of CELF can be observed in the first 50 iterations.
T333	40265-40362	Sentence	denotes	From the observation, we can conclude that our UBG is more efficient than CELF on large networks.
T334	40363-40462	Sentence	denotes	Detection time measures the time delay of a message propagated from a diffusion source to a sensor.
T335	40463-40627	Sentence	denotes	We run tests on the four data sets and obtain detection time results w.r.t. parameter k (the number of sensors), where k increases from 1 to 50 as shown in Fig. 5 .
T336	40628-40710	Sentence	denotes	UBG, as an updated version of CELF, has competitive results on the four data sets.
T337	40711-40892	Sentence	denotes	More importantly, the detection times of UBG and CELF are completely the same in the four figures, which explains again that UBG and CELF share the same results in sensor selection.
T338	40893-40983	Sentence	denotes	The only difference between UBG and CELF is the number of remaining time estimation calls.
T339	40984-41074	Sentence	denotes	The Quickest-Path-UBG and Local-reduction-UBG always perform better than other heuristics.
T340	41075-41289	Sentence	denotes	More specifically, if the networks are sparse like Digger and Epinions, the Quickest-Path-UBG is more competitive; if the networks are dense like Twitter and Smallworld, the Local-reduction-UBG is more competitive.
T341	41290-41362	Sentence	denotes	Selection time measures the time cost of an algorithm selecting sensors.
T342	41363-41423	Sentence	denotes	Fig. 6 shows the time cost of selecting sensors with k = 50.
T343	41424-41458	Sentence	denotes	UBG is 4-8 times faster than CELF.
T344	41459-41543	Sentence	denotes	One may argue that such a low improvement of UBG can be neglected in large networks.
T345	41544-41587	Sentence	denotes	In fact, UBG scales well to large networks.
T346	41588-41771	Sentence	denotes	Because, with the size of a network increase, Monte-Carlo simulations take more time, and thus UBG will achieve better performance by pruning more unnecessary Monte-Carlo simulations.
T347	41772-41882	Sentence	denotes	As to heuristics, Degree and Random are very fast in selecting candidate nodes, which take less than 1 second.
T348	41883-42001	Sentence	denotes	Quickest-Path-UBG and Local-Reduction-UBG are exciting and adoptable, due to their good performance in detection time.
T349	42002-42131	Sentence	denotes	The PageRank and Inter-Monitor Distance are slightly slower and undesirable, in view of their poor performance in detection time.
T350	42132-42243	Sentence	denotes	We run tests on Twitter data set and obtain the selection time w.r.t. parameter ip (the infection probability).
T351	42244-42293	Sentence	denotes	In the experiments, ip increases from 0.1 to 0.5.
T352	42294-42381	Sentence	denotes	We assign a uniform infection probability ip to each directed link under the DSI Model.
T353	42382-42502	Sentence	denotes	From the results in Fig. 7 , we can conclude that with the parameter ip growing larger, the CELF is more time-consuming.
T354	42503-42593	Sentence	denotes	By contrast, the UBG and its accelerations are robust and insensitive to the parameter ip.
T355	42594-42656	Sentence	denotes	2) First, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∪ T(u, S 2 ).
T356	42657-42718	Sentence	denotes	For any t ∈ T(u, S 1 ∪ S 2 ), I t ∩ (S 1 ∪ S 2 ) / = ∅ holds.
T357	42719-42873	Sentence	denotes	From I t ∩ (S 1 ∪ S 2 ) = (I t ∩ S 1 ) ∪ (I t ∩ S 2 ) / = ∅, we know either I t ∩ S 1 or I t ∩ S 2 is nonempty, which implies t ∈ T(u, S 1 ) ∪ T(u, S 2 ).
T358	42874-42929	Sentence	denotes	Hence T(u, S 1 ∪ S 2 ) ⊂ T(u, S 1 ) ∪ T(u, S 2 ) works.
T359	42930-43025	Sentence	denotes	Conversely, for any t ∈ T(u, S 1 ) ∪ T(u, S 2 ), t will be located in T(u, S 1 ) or T(u, S 2 ).
T360	43026-43091	Sentence	denotes	Without loss of generality, we assume t ∈ T(u, S 1 ), which means
T361	43093-43143	Sentence	denotes	is nonempty, which indicates t ∈ T(u, S 1 ∪ S 2 ).
T362	43144-43204	Sentence	denotes	Hence T(u, S 1 ) ∪ T(u, S 2 ) ⊂ T(u, S 1 ∪ S 2 ) also works.
T363	43205-43265	Sentence	denotes	Second, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ).
T364	43266-43394	Sentence	denotes	We take infimum operations on both sides of the first step conclusion, i.e., inf T(u, S 1 ∪ S 2 ) = inf T(u, S 1 ) ∪ T(u, S 2 ).
T365	43395-43505	Sentence	denotes	Since all these sets are be composed of integers, we obviously get T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ).
T366	43506-43733	Sentence	denotes	Finally, we take the infimums with T max on both sides of the second step conclusion, i.e., T(u, S 1 ∪ S 2 ) ∧ T max = T(u, S 1 ) ∧ T(u, S 2 ) ∧ T max , which means (u, S 1 ∪ S 2 ) = (u, S 1 ) ∧ (u, S 2 ) by the definition of .
T367	43734-44069	Sentence	denotes	3) Similarly, in order to prove this property, we only need to , {u}) , i.e., for any t ≥ 0, the probability that t is located in T(u, {v}) equals the probability that t is located in 4) If S ⊂ T, in order to prove this property, we only need to prove T(u, S) ⊂ T(u, T ) and notice the fact that the infimum of a larger set is smaller.
T368	44070-44111	Sentence	denotes	For any t ∈ T(u, S), I t ∩ S / = ∅ holds.
T369	44112-44202	Sentence	denotes	Besides, as S ⊂ T, I t ∩ S / = ∅ implies I t ∩ T / = ∅, which means t is also in T(u, T ).
T370	44203-44285	Sentence	denotes	So, T(u, S) ⊂ T(u, T ), by which we can easily conclude the property, is verified.
T371	44286-44370	Sentence	denotes	5) If T(u, S) is empty, then T(u, S) =+∞. (u, S) =+ ∞ ∧ T max = T max ∈ [0, T max ].
T372	44371-44488	Sentence	denotes	If T(u, S) is nonempty, then every element in it is greater than 0 and notice that (u, S) = T(u, S) ∧ T max ≤ T max .
T373	44489-44536	Sentence	denotes	Therefore, we can conclude 0 ≤ (u, S) ≤ T max .
T374	44537-44666	Sentence	denotes	We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] , denoted by (A1).
T375	44667-44672	Sentence	denotes	(A1):
T376	44673-44757	Sentence	denotes	Let E be a subset of N 2 and X i , Y j , (i, j) ∈ E are pairwise distinct variables.
T377	44758-44889	Sentence	denotes	We define a formula F : = ∨ {X i ∧ Y j : (i, j) ∈ E}, which is a disjunction of all the conjunctive clauses X i ∧ Y j , (i, j) ∈ E.
T378	44890-44937	Sentence	denotes	How many valuations that satisfy the formula F?
T379	44938-44983	Sentence	denotes	Then we introduce the following problem (A2).
T380	44984-44989	Sentence	denotes	(A2):
T381	44990-45223	Sentence	denotes	G = (V, E) is a directed graph, where each edge e ∈ E is associated with a weight w e which is geometrically distributed with a parameter p, i.e. P(w e = i) = p(1 − p) i−1 , i ≥ 1. s, t are two nodes in G and k is a positive integer.
T382	45224-45320	Sentence	denotes	What is the probability of the event that the shortest path from s to t has length T at least k?
T383	45321-45365	Sentence	denotes	Now we prove that (A1) is reducible to (A2).
T384	45366-45436	Sentence	denotes	We map the variable X i to a node x i and variable Y j to a node y j .
T385	45437-45497	Sentence	denotes	If (i, j) exists in E, then we add an edge from x i to y j .
T386	45498-45668	Sentence	denotes	Also, we add a source node s and an edge (s, x i ), associated with weight w (s,x i ) which is geometrically distributed with parameter 1/2, for any i ∈ {i : (i, j) ∈ E}.
T387	45669-45765	Sentence	denotes	Thus, w (s,x i ) equals 1 with probability 1/2 and strictly greater than 1 with probability 1/2.
T388	45766-45886	Sentence	denotes	Similarly, we add a terminal node t and an edge from each y j to t with a weight identically distributed as w (s,x i ) .
T389	45887-45959	Sentence	denotes	So we construct a graph with geometrically distributed weights on edges.
T390	45960-46215	Sentence	denotes	Now we can build a probability-preserving bijection between the valuations of X i , Y j and the subgraphs of the constructed random graph: for a valuation , the corresponding subgraph is the one where each edge adjacent to x i has length 1 iff (X i ) = 1.
T391	46216-46243	Sentence	denotes	The same argument for Y j .
T392	46244-46415	Sentence	denotes	We set k = 3 and claim the fact that the probability P(T ≥ 3) in (A2) is the number of valuations satisfying that F is divided by 2 N , where N is the number of variables.
T393	46416-46637	Sentence	denotes	Indeed, the fact is based on the following observation: a valuation is true iff some adjacent pair of X i and Y j is true if the incident edges to the corresponding x i and y j have length 1 in the corresponding subgraph.
T394	46638-46907	Sentence	denotes	Note that any edge of length greater than 1 is irrelevant as the structure of the graph, which ensures it can never be part of a path of length 3 from s to t, For any path from s to t must jump three times: from s to some x i , from x i to some y j , and from y j to t.
T395	46908-47001	Sentence	denotes	Thus, from what we have stated, (A1) can be reducible to (A2), which implies (A2) is #P-hard.
T396	47002-47056	Sentence	denotes	We aim to solve the problem (denoted by (A)) of calcu-
T397	47058-47210	Sentence	denotes	In order to prove the random variables T(u, S) and c(u, S) are identically distributed, we need to prove P[T (u, S) = t] = P[c(u, S) = t] for any t > 0.
T398	47211-47274	Sentence	denotes	On the one hand, we can prove P[T (u, S) = t] ≤ P[c(u, S) = t].
T399	47275-47385	Sentence	denotes	Given t > 0, we find that for any simulation with T(u, S) = t, a snapshot with c(u, S) = t can be constructed.
T400	47386-47476	Sentence	denotes	By definition, if T(u, S) = t, then J i ∩ S = ∅, i = 0, 1, · · ·, t − 1 and J t ∩ S / = ∅.
T401	47477-47606	Sentence	denotes	For any node v i+1 ∈ J i+1 \J i / = ∅, suppose that v i+1 is infected through the edge (s, v i+1 ) with s ∈ J i , v i+1 ∈ J i+1 .
T402	47607-47638	Sentence	denotes	Denote n s = inf{j : s ∈ J j }.
T403	47639-47712	Sentence	denotes	Then we let the edge (s, v i+1 ) take weight c(s, v i+1 ) = i + 1 − n s .
T404	47713-47781	Sentence	denotes	For any node s ∈ J t , we denote n s , i.e. n s = sup{j : s ∈ J j }.
T405	47782-47919	Sentence	denotes	Then for any v ∈ V \J t , let the edge (s , v i+1 ) take weight c(s , v) = t − n s + C, where C is sampled from a geometric distribution.
T406	47920-48012	Sentence	denotes	In addition, we take a weight sampled from geometric distribution on each of the rest edges.
T407	48013-48124	Sentence	denotes	In such operations, a snapshot with c(u, S) = t is constructed, which implies P[T (u, S) = t] ≤ P[c(u, S) = t].
T408	48125-48193	Sentence	denotes	On the other hand, we can also get P[c(u, S) = t] ≤ P[T (u, S) = t].
T409	48194-48278	Sentence	denotes	For any snapshot with c(u, S) = t, a simulation with T(u, S) = t can be constructed.
T410	48279-48410	Sentence	denotes	We let node u be the initiator, i.e. J 0 = {u} and denote the smallest number in the set {c(u, v) : u ∈ J 0 , v ∈ V \J 0 } by n 0 .
T411	48411-48501	Sentence	denotes	Meanwhile, we denote the set of nodes {v : c(u, v) = n s , u ∈ J 0 , v ∈ V \J 0 } by n 0 .
T412	48502-48525	Sentence	denotes	Let J n 0 = J 0 ∪ n 0 .
T413	48526-48594	Sentence	denotes	Complementally, we define all the J i = J 0 , i = 0, · · ·, n 0 − 1.
T414	48595-48747	Sentence	denotes	Denote the smallest number in the set {c(u, v) : u ∈ J t , v ∈ V \J t } by n t and the set of nodes {v : c(u, v) = n s , u ∈ J t , v ∈ V \J t } by n t .
T415	48748-48781	Sentence	denotes	Let J t+n t = J n t = J t ∪ n t .
T416	48782-48854	Sentence	denotes	Complementally, we define J i : = J t for all i = t, · · ·, t + n t − 1.
T417	48855-48944	Sentence	denotes	We now can construct a simulation of the diffusion model (I t ) t≥0 = (J t+1 \J t ) t≥0 .
T418	48945-49038	Sentence	denotes	Hence, we have a simulation with T(u, S) = t, which implies P[c(u, S) = t] ≤ P[T (u, S) = t].
T419	49039-49173	Sentence	denotes	Therefore, we have P[T (u, S) = t] = P[c(u, S) = t] for any t > 0, which implies that T(u, S) and c(u, S) are identically distributed.
T420	49174-49255	Sentence	denotes	Hence, the simulation and snapshot simulations are equivalent in estimating D(S).
T1	49256-49402	Sentence	denotes	In this paper, we discussed scalable solutions to detection time minimization problem for early detection of dynamic harmful cascades in networks.
T2	49403-49554	Sentence	denotes	To address this problem, we proposed an upper bound based greedy (UBG) algorithm and its two accelerative algorithms to cater for lager scale networks.
T3	49555-49661	Sentence	denotes	The UBG solution guarantees a near-optimal detection time, pruning at least 80% Monte-Carlo calls of CELF.
T4	49662-49791	Sentence	denotes	The novel accelerations on UBG can significantly reduce the selection time and further achieve at least 10 2 times speed-raising.
T5	49792-49840	Sentence	denotes	There are several interesting future directions.
T6	49841-50012	Sentence	denotes	First, the detection time D(S) is # P-hard to calculate exactly, it is still a question how to design an efficient algorithm to estimate D(S) with a theoretical guarantee.
T7	50013-50204	Sentence	denotes	Second, the infection probability ip and prior distribution are predefined in this paper, how to learn these parameters for the DSI model from available cascade data still remains unexplored.

T1

0-69

Sentence

denotes

Early detection of dynamic harmful cascades in large-scale networks ଝ

T2

71-79

Sentence

denotes

Abstract

T3

80-217

Sentence

denotes

Quickly detecting harmful cascades in networks can allow us to analyze the causes and prevent further spreading of destructive influence.

T4

218-370

Sentence

denotes

Since it is often impossible to observe the state of all nodes in a network, a common method is to detect harmful cascades from sparsely placed sensors.

T5

371-567

Sentence

denotes

However, the harmful cascades are usually dynamic (e.g., the cascade initiators and diffusion trajectories can change over the time), which can severely destroy the robustness of selected sensors.

T6

568-672

Sentence

denotes

Meanwhile the large scale of current networks greatly increases the time complexity of sensor selection.

T7

673-830

Sentence

denotes

Motivated by the observation, in this paper we investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks.

T8

831-1082

Sentence

denotes

Specifically, we first put forward a dynamic susceptible-infected model to describe harmful cascades, and formally define a detection time minimization (DTM) problem which focuses on effective sensors placement for early detection of dynamic cascades.

T9

1083-1218

Sentence

denotes

We prove that it is #P-hard to calculate the objective function exactly and propose two Monte-Carlo methods to estimate it efficiently.

T10

1219-1303

Sentence

denotes

We prove the NP-hardness of DTM problem and design a corresponding greedy algorithm.

T11

1304-1436

Sentence

denotes

Based on that, we propose an efficient upper bound based greedy (UBG) algorithm with the theoretical performance guarantee reserved.

T12

1437-1530

Sentence

denotes

To further meet different types of large-scale networks, we propose two accelerations of UBG:

T13

1531-1643

Sentence

denotes

Quickest-Path-UBG for sparse networks and Local-Reduction-UBG for dense networks to improve the time complexity.

T14

1644-1760

Sentence

denotes

The experimental results on synthetic and real-world social networks demonstrate the practicality of our approaches.

T15

1762-1880

Sentence

denotes

Harmful cascade spreading through kinds of network structures has become more and more ubiquitous in the modern world.

T16

1881-2038

Sentence

denotes

A contagious disease like Severe Acute Respiratory Syndrome (SARS) can spread quickly through a population contact network and lead to an epidemic [12, 34] .

T17

2039-2162

Sentence

denotes

A computer virus on a few servers can fast spread to other servers or computers in a computer connection network [15, 17] .

T18

2163-2281

Sentence

denotes

In a similar vein, a rumor started by a few individuals can spread quickly through the online social network [2, 26] .

T19

2282-2473

Sentence

denotes

It is crucial to detect the harmful cascades as soon as they happen or shortly thereafter, since it allows us to study the causes and prevent further spreading of harmful influence [32, 38] .

T20

2474-2698

Sentence

denotes

In practice, it is commonly infeasible or unaffordable to monitor all nodes at all times, and therefore a common way to detect cascades is to select important nodes where we can place sensors for monitoring [7, 19, 31, 39] .

T21

2699-2979

Sentence

denotes

However, existing methods usually viewed the cascade data as static and deterministic, ignoring an important fact that the harmful cascades are usually dynamic and time-variant, in the sense that the cascade initiators and diffusion trajectories can change randomly over the time.

T22

2980-3117

Sentence

denotes

Actually the dynamic brings new challenges to network monitoring, since it can severely destroy the robustness of selected sensors [22] .

T23

3118-3288

Sentence

denotes

Meanwhile the existing sensor selection algorithms can only work for small networks and not be scalable well to large networks of the https://doi.org/10.1016/j.jocs.2017.

T24

3289-3327

Sentence

denotes

T25

3328-3359

Sentence

denotes

T26

3360-3533

Sentence

denotes

Motivated by these observations, in this paper we investigate the scalable solutions to sensor selection problem for early detection of dynamic harmful cascades in networks.

T27

3534-3618

Sentence

denotes

To model the dynamic property, we carry on our work from a model-driven perspective.

T28

3619-3828

Sentence

denotes

To provide a unified framework, we model all the above examples as an infection spreading in a network G = (V, E), where V is the set of nodes (i.e. individuals) and E is the set of edges (i.e. relationships).

T29

3829-3923

Sentence

denotes

In a population network, the infection is the disease that is transmitted between individuals.

T30

3924-4108

Sentence

denotes

In the example of a computer virus spreading in a network, the infection is the computer virus, while for the case of a rumor spreading in a social network, the infection is the rumor.

T31

4109-4444

Sentence

denotes

Under this unified network framework, we specifically propose a new dynamic cascade model to describe harmful cascade diffusions, and define a detection time minimization (DTM) problem S * = argmin |S|=k,S⊆V D(S), where k is a given parameter determined by budget or monitor capacity, S is sensor nodes set, and D(S) is detection time.

T32

4445-4553

Sentence

denotes

The DTM problem focuses on effective sensor nodes selection for early detection of dynamic harmful cascades.

T33

4554-4644

Sentence

denotes

We first prove the NP-hardness of DTM problem and design a corresponding greedy algorithm.

T34

4645-4857

Sentence

denotes

Considering the limitation caused by greedy algorithm inefficiency, we then propose two alternative Monte-Carlo methods, each having its pros and cons, to estimate the #P-hard objective function D(S) efficiently.

T35

4858-5086

Sentence

denotes

To further address the scalability issue, we propose an efficient upper bound based greedy (UBG) algorithm and its two accelerations Quickest-Path-UBG and Local-Reduction-UBG to cater for different types of large-scale networks.

T36

5087-5202

Sentence

denotes

These two accelerations are close to UBG in performance but orders of magnitude faster than UBG in time complexity.

T37

5203-5306

Sentence

denotes

Experiments on synthetic and real-world social networks demonstrate the practicality of our approaches.

T38

5307-5505

Sentence

denotes

Detecting the whole network by monitoring finite sensor nodes has been widely applied to detect water contaminations in a water distribution network [19] and virus outbreaks in a human society [7] .

T39

5506-5631

Sentence

denotes

Some early work places sensors by topological measures, e.g. targeting high degree nodes [32] or highly connected nodes [8] .

T40

5632-5760

Sentence

denotes

However the effects along this idea are commonly unsatisfied, since they ignore the multi-step complexity of network diffusions.

T41

5761-5922

Sentence

denotes

The "Battle of Water Sensor Network" challenge [31] motivated a number of works to optimize sensor placement in water networks to detect contamination [16, 18] .

T42

5923-6183

Sentence

denotes

By utilizing submodular property, they proposed to optimize the sensor placement with different criterions such as maximizing the probability of detection, minimizing the detection time, or minimizing the size of the subnetwork affected by the phenomena [24] .

T43

6184-6315

Sentence

denotes

In their works the data are a set of deterministic scenarios and the random property of diffusion are not taken into consideration.

T44

6316-6384

Sentence

denotes

What's more, their methods are not scalable to large networks [28] .

T45

6385-6575

Sentence

denotes

In addition to the above two types of methods, another related work [1] set sensors along fixed paths in the network so as to gather sufficient information to locate possible contaminations.

T46

6576-6738

Sentence

denotes

Early detection of contagious outbreaks by monitoring the neighborhood (friends) of a randomly chosen node (individual) was studied by Christakis and Fowler [7] .

T47

6739-6962

Sentence

denotes

Krause et al. [20] presented efficient schedules for minimizing energy consumption in battery operated sensors, while other works analyzed distributed solutions with limited communication capacities and costs [13, 21, 22] .

T48

6963-7047

Sentence

denotes

In addition, Berry et al. [4] equated the placement problem with a p-median problem.

T49

7048-7173

Sentence

denotes

Li et al. [25] proposed a dynamic-programming (DP) based greedy algorithm which is with a near-optimal performance guarantee.

T50

7174-7351

Sentence

denotes

By contrast, our work is geared to investigate the scalable sensor selection problem for early detection of dynamic harmful cascades in networks from a model-driven prospective.

T51

7352-7659

Sentence

denotes

The main difference from previous works is that we take dynamic properties of cascades into consideration: (1) the cascade initiator is dynamic since we do not know who will initiate another harmful diffusion next time, and (2) the diffusion process is dynamic since the propagation trajectory is uncertain.

T52

7660-7789

Sentence

denotes

Under these two dynamic sources, we aim to accurately and fast select k nodes as sensors for early detection of harmful cascades.

T53

7790-7958

Sentence

denotes

Besides, as the size of network becomes more and more large-scale, we also need to ensure the effectiveness of the proposed sensor selection methods for large networks.

T54

7959-8093

Sentence

denotes

To minimize the time to find the dynamic harmful cascades, our search attempts to optimize the selection of sensors in a scalable way.

T55

8094-8138

Sentence

denotes

Our contributions are summarized as follows:

T56

8139-8198

Sentence

denotes

• We formulate a Detection Time Minimization (DTM) problem.

T57

8199-8469

Sentence

denotes

To describe the dynamic harmful cascade on networks, we put forward a dynamic cascade model for the harmful diffusion. • We prove that it is #P-hard to exactly calculate the objective function of DTM problem and propose two equivalent Monte-Carlo methods to estimate it.

T58

8470-8631

Sentence

denotes

Each has its advantages and disadvantages. • We show the NP-hardness of DTM problem as it can be shown to contain the Set Cover problem as a simple special case.

T59

8632-8913

Sentence

denotes

We convert the DTM problem to a constrained max optimization problem, prove the submodularity of the new objective function, and then employ the greedy algorithm which achieves an approximation ratio of 1 − 1/e. • We theoretically establish new upper bounds for the remaining time.

T60

8914-9167

Sentence

denotes

Based on these bounds, we further propose a new Upper Bound based Greedy (UBG in short) algorithm which can significantly reduce the number of estimations in greedy-based algorithms, especially at the initial step. • We propose two accelerations of UBG:

T61

9168-9354

Sentence

denotes

Quickest-Path-UBG and Local-Reduction-UBG to address the DTM problem for large networks, which are close to UBG in performance but orders of magnitude faster than UBG in time complexity.

T62

9355-9485

Sentence

denotes

In addition, our methods not only aim at early detection of harmful cascades, but also shed light on a host of other applications.

T63

9486-9639

Sentence

denotes

For example, the goal of Emerging Topic Detection is to identify emergent topics in a social network, assuming full access to the stream of all postings.

T64

9640-9894

Sentence

denotes

Providers, such as Twitter or Facebook, have an immediate access to all tweets or postings as they are submitted to their server [6, 29] , while outside observers need an efficient mechanism to monitor changes, such as the methods developed in this work.

T65

9895-10150

Sentence

denotes

Another example, an emerging trend in algorithmic stock trading is the use of automatic search through the Web and social networks for pieces of information that can be used in trading decisions before they appear in the more popular news sites [23, 27] .

T66

10151-10255

Sentence

denotes

Similarly, intelligence, business and politics analysts are scanning online sources for new information.

T67

10256-10358

Sentence

denotes

The theoretical framework and method established in this paper can also be used in these applications.

T68

10359-10405

Sentence

denotes

The rest of the paper is organized as follows.

T69

10406-10467

Sentence

denotes

Section 2 presents the cascade model and problem formulation.

T70

10468-10581

Sentence

denotes

In Section 3 we show the # P-hardness of detection time calculation and derive two equivalent estimation methods.

T71

10582-10738

Sentence

denotes

Section 4 is devoted to the NP-hardness of DTM problem, the submodular properties of transformed objective function, and the corresponding greedy algorithm.

T72

10739-10860

Sentence

denotes

To prune the unnecessary estimation calls, we analysis the upper bounds for the new proposed UBG algorithm in Section 5 .

T73

10861-10930

Sentence

denotes

Section 6 presents two accelerations of UBG for large scale networks.

T74

10931-10972

Sentence

denotes

Section 7 shows the experimental results.

T75

10973-11008

Sentence

denotes

We conclude the paper in Section 8.

T76

11009-11051

Sentence

denotes

Table 1 outlines the major variables used.

T77

11052-11230

Sentence

denotes

In this section we start with a description of dynamic cascade model and then we define the Detection Time Minimization (DTM) problem abstracted from the early detection problem.

T78

11231-11409

Sentence

denotes

We start somewhat with the framework of [24] , where the models introduced are essentially descriptive to specify a joint distribution over all nodes' behavior in a global sense.

T79

11410-11595

Sentence

denotes

In contrast, we focus on more operational models, from mathematical sociology [3] and interacting particle systems [10] , to explicitly represent the step-by-step dynamics of infection.

T80

11596-11805

Sentence

denotes

By considering operational models for the dynamic diffusion of a harmful item through a network G, represented by a directed graph, we will speak of each individual node as being either infected or uninfected.

T81

11806-12027

Sentence

denotes

Motivated by the properties of contaminants like rumor and virus, we will focus from now on the progressive case in which nodes can switch from being uninfected to being infected, but do not switch in the other direction.

T82

12028-12165

Sentence

denotes

We also focus on the setting that each node's tendency to become infected increases monotonically as more of its neighbors become active.

T83

12166-12450

Sentence

denotes

Thus, the process can be reviewed with respect to some particular uninfected node v: as time unfolds, more and more of v's neighbors become infected; at some point, this may cause v to become infected, and v's decision may in turn trigger further infections by nodes connected with v.

T84

12451-12658

Sentence

denotes

Motivated by above observation, we put forward a dynamic susceptible-infected (DSI) model for the harmful cascade spreading, which can be seen as a variant of the common susceptible-infected (SI) model [3] .

T85

12659-12766

Sentence

denotes

The susceptible nodes are those with at least one infected neighbor, and the infected nodes do not recover.

T86

12767-12783

Sentence

denotes

Specifically (1)

T87

12784-12904

Sentence

denotes

The DSI model is attached with a probability distribution = { (u), u ∈ V} to describe the dynamic of cascade initiators.

T88

12905-13000

Sentence

denotes

can be seen as a priori knowledge of nodes being infected initially, where u ∈ V (u) = 1 holds.

T89

13001-13119

Sentence

denotes

The DSI model first chooses an initially infected node u ∈ V according to distribution , and then it works as follows.

T90

13120-13201

Sentence

denotes

Let I t ⊆ V be the set of nodes that gets infected at step t ≥ 0, with I 0 = {u}.

T91

13202-13293

Sentence

denotes

Define J t := 0≤i≤t I i be the cumulative set of nodes that get infected before step t ≥ 0.

T92

13294-13417

Sentence

denotes

Then, at step t + 1, each node u ∈ J t may infect its out-neighbors v ∈ V \J t with an independent probability of ip(u, v).

T93

13418-13488

Sentence

denotes

Thus, a node v ∈ V \J t is infected at step t + 1 with the probability

T94

13489-13557

Sentence

denotes

If node v is successfully infected, it is added into the set I t+1 .

T95

13558-13601

Sentence

denotes

Then update J t+1 by J t+1 ←− J t ∪ I t+1 .

T96

13602-13782

Sentence

denotes

Note that each infected node has more than one chance to activate its susceptible out-neighbors until they get infected, and each node stays infected once it is infected by others.

T97

13783-13849

Sentence

denotes

Obviously the cumulative infected process (J t ) t≥0 is Markovian.

T98

13850-13990

Sentence

denotes

In a diffusion model (I t ) t≥0 , given an initially infected node u ∈ V and a set of sensors S ⊆ V, the random detection time is defined as

T99

13991-14064

Sentence

denotes

where a ∧ b : = min{a, b} and T max is the time interval that we observe.

T100

14065-14088

Sentence

denotes

We assume inf{ ∅ } =+∞.

T101

14089-14225

Sentence

denotes

The random detection time (u, S) denotes the time delay that a contaminant initiated from node u is detected by one of the sensors in S.

T102

14226-14232

Sentence

denotes

Proof.

T103

14233-14286

Sentence

denotes

Above properties come from the definition in Eq. (2).

T104

14287-14329

Sentence

denotes

We put the detailed proofs in Section 8.1.

T105

14330-14549

Sentence

denotes

Note that the detection time (u, S) can be viewed as a special type of stopping time in stochastic process theory [11] .ᮀ Let the variable X denote the random initiator with distribution , i.e., for every u ∈ V, we have

T106

14550-14656

Sentence

denotes

The (expected) detection time from a random initiator to one of the selected sensors in S can be define as

T107

14657-14716

Sentence

denotes

where E is an expectation operator under the cascade model.

T108

14717-14820

Sentence

denotes

By the conditional expectation, we can calculate the expected detection time D(S) in the following way:

T109

14821-14861

Sentence

denotes

Indeed, Eq. (5) can be reached like this

T110

14862-14917

Sentence

denotes

where (X) denotes the -algebra generated by variable X.

T111

14918-15033

Sentence

denotes

Eq. (5) converts the global detection time E[ (X, S)] as a summation of local detection time E[ (u, S)] with u ∈ V.

T112

15034-15081

Sentence

denotes

This conversion provides a basic for Section 5.

T113

15082-15245

Sentence

denotes

Our goal is to find k nodes as sensors in a network in order to minimize the time until an infection -starting from a random initiator in the network -is detected.

T114

15246-15430

Sentence

denotes

Formally, we formulate the problem as the following discrete optimization problem: we want to find a subset S * ⊆ V such that |S * | = k and D(S * ) = min{D(S)||S| = k, S ⊆ V}, i. e .,

T115

15431-15501

Sentence

denotes

where k is a given parameter determined by budget or monitor capacity.

T116

15502-15578

Sentence

denotes

We call this as detection time minimization problem (DTM problem for short).

T117

15579-15677

Sentence

denotes

For the sake of concreteness, we will discuss our results in terms of the DSI model in particular.

T118

15678-15797

Sentence

denotes

To solve the DTM problem Eq. (6), the first issue is how to calculate the objective function D(S) given a sensor set S.

T119

15798-15856

Sentence

denotes

This question isn't as easy as its description in Eq. (4).

T120

15857-16033

Sentence

denotes

For example, the DSI process is underspecified, since we have not prescribed the order in which newly infected nodes in a given step t will attempt to activate their neighbors.

T121

16034-16227

Sentence

denotes

Thus, it is not initially obvious that the process is even well-defined, in the sense that it yields the same distribution over outcomes regardless of how we schedule the attempted activations.

T122

16228-16365

Sentence

denotes

Actually the computation of D(S) is #P-hard, by showing a reduction from the positive partitioned 2-DNF assignment counting problem [9] .

T123

16366-16372

Sentence

denotes

Proof.

T124

16373-16485

Sentence

denotes

We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] .

T125

16486-16725

Sentence

denotes

For detailed proof, see Section 8.2. ᮀ Since it is intractable to exactly compute D(S) on a typically sized graph, a natural idea is to employ Monte-Carlo methods to estimate D(S), which can be implemented in two different ways as follows:

T126

16726-16917

Sentence

denotes

The expected detection time D(S) is obtained by directly simulating the random process of diffusion triggered by a random node, say u, chosen according to the distribution defined as Eq. (3).

T127

16918-17002

Sentence

denotes

Let I t denote the set of nodes newly infected in the t-th iteration with I 0 = {u}.

T128

17003-17139

Sentence

denotes

In the (t + 1)-th iteration, a node u ∈ J t : = ∪ 0≤i≤t I i attempts to activate each uninfected neighbor v / ∈ J t with the probability

T129

17140-17233

Sentence

denotes

The process is repeated until the diffusion hits the sensor set S at some step T(u, S), i.e.,

T130

17234-17269

Sentence

denotes

where inf{ ∅ } : =+∞ as convention.

T131

17270-17390

Sentence

denotes

The detection time of this single simulation is recorded at T(u, S) ∧ T max , which is right (u, S) defined in Eq. (2) .

T132

17391-17518

Sentence

denotes

We run such simulations for many times and finally estimate the expected detecting time D(S) by averaging over all simulations.

T133

17519-17753

Sentence

denotes

According to the characteristic of the DSI model, the time cost that an infected node u spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(u, v), and we denote this time cost as C(u, v), i.e.,

T134

17754-17777

Sentence

denotes

for any integral s ≥ 1.

T135

17778-17948

Sentence

denotes

We can flip all coins a priori to produce a weighted graph G = (V, E, c), where an edge (u, v) is labeled with the time cost c(u, v) -a sample of random variable C(u, v).

T136

17949-18200

Sentence

denotes

Actually such a snapshot provides an easy way to sample the detection time of any sensor set S, which exactly equals to the smallest time cost from the initially infected node, say u (chosen according to the distribution ), to the nearest sensor in S.

T137

18201-18359

Sentence

denotes

Define c(u, S) be the smallest time cost from u to the nearest sensor in S, then the detection time of this single simulation is recorded as c(u, S) ∧ T max .

T138

18360-18477

Sentence

denotes

We produce plenty of snapshots and finally estimate the expected detection time D(S) by averaging over all snapshots.

T139

18478-18565

Sentence

denotes

The snapshot simulation is equivalent to the propagation simulation in estimating D(S).

T140

18566-18654

Sentence

denotes

More specifically, the random variables T(u, S) and c(u, S) are identically distributed.

T141

18655-18749

Sentence

denotes

, it is enough to prove that random variables T(u, S) and c(u, S) are identically distributed.

T142

18750-18787

Sentence

denotes

For proof details, see Section 8.3. ᮀ

T143

18788-18897

Sentence

denotes

We have confirmed that the two methods are equivalent, while either has its own advantages and disadvantages.

T144

18898-19076

Sentence

denotes

For estimating a specific D(S), the simulation method is faster, because it only needs to examine a small portion of edges while the snapshot method has to examine all the edges.

T145

19077-19272

Sentence

denotes

For estimating the expected detection time of different sensor sets, the snapshot method outperforms the simulation method in terms of time complexity, since each snapshot serves all sensor sets.

T146

19273-19446

Sentence

denotes

Under these observations, the heuristic algorithms use the propagation simulation, while the greedy-based algorithms employ the snapshot simulation in the experimental part.

T147

19447-19660

Sentence

denotes

In this section we first show the NP-hardness of DTM problem in Eq. (6), as it can be shown to contain the Set Cover problem as a simple special case, then we propose a simple greedy algorithm for the DTM problem.

T148

19661-19667

Sentence

denotes

Proof.

T149

19668-19918

Sentence

denotes

Consider an instance of the NP-complete Set Cover problem, defined by a collection of subsets S = {S 1 , S 2 , . . ., S m } of a ground set U = {u 1 , u 2 , . . ., u n }, we wish to know whether there exist k of the subsets whose union is equal to U.

T150

19919-19953

Sentence

denotes

We can assume that k < n < m here.

T151

19954-20031

Sentence

denotes

We show that this can be viewed as a special case of the optimal problem (6).

T152

20032-20344

Sentence

denotes

Given an arbitrary instance of the Set Cover problem, we define a corresponding directed bipartite graph with n + m nodes: there is a node i corresponding to each element u i ∈ U, a node j corresponding to each set S j ∈ S, and a directed edge (i, j) with activation probability ip(i, j) = 1 whenever u i ∈ S j .

T153

20345-20458

Sentence

denotes

Define the probability distribution = { (u), u ∈ V} on the knowledge of nodes being initially infected as follows

T154

20459-20576

Sentence

denotes

The Set Cover problem is equivalent to deciding if there is a set S of k nodes in this bipartite graph with D(S) ≤ 1.

T155

20577-20692

Sentence

denotes

Note that for the instance we have defined, activation is a deterministic process, as all probabilities are 0 or 1.

T156

20693-21115

Sentence

denotes

Monitoring k nodes to detect diffusion initiated from U corresponding to sets in a Set Cover solution results in covering all n nodes corresponding to the ground set U, and if any set S of k nodes has D(S) ≤ 1, then the Set Cover problem must be solvable.ᮀ Since the optimization problem is NP-hard and the network is prohibitively large, we cannot compute the optimum value to verify the actual quality of approximations.

T157

21116-21195

Sentence

denotes

Hence a natural idea is to employ the greedy algorithm as approximation method.

T158

21196-21306

Sentence

denotes

To make better use of the greedy algorithm, we consider an equivalent optimization problem Eq. (6) as follows,

T159

21307-21312

Sentence

denotes

where

T160

21313-21407

Sentence

denotes

is defined as the (expected) remaining time for taking actions when a contaminant is detected.

T161

21408-21501

Sentence

denotes

The above alternative formulation has key properties as described in the following Theorem 4.

T162

21502-21531

Sentence

denotes

Theoretically, a non-negative

T163

21532-21546

Sentence

denotes

for all S ⊆ T.

T164

21547-21553

Sentence

denotes

Proof.

T165

21554-21627

Sentence

denotes

It is obvious that the remaining time function R is monotone and R(∅) =0.

T166

21628-21662

Sentence

denotes

Now we prove that R is submodular.

T167

21663-21738

Sentence

denotes

According to the definitions in Eqs. (5) and (10), it suffices to show that

T168

21739-21771

Sentence

denotes

for all S ⊆ T ⊆ V and v ∈ V \T .

T169

21772-21807

Sentence

denotes

By the second property of in Propo-

T170

21808-21901

Sentence

denotes

Note that (u, T) ≤ (u, S) always works by the monotone decrease property of in Proposition 1.

T171

21902-21972

Sentence

denotes

Now we discuss Eq. (14) separately according to the value of (u, {v}).

T172

21973-22016

Sentence

denotes

, which holds by the monotone decrease of .

T173

22017-22230

Sentence

denotes

Eq. (14) is proven and we thus get that R(S) is submodular.ᮀ By above properties in Theorem 4, the problem given in Eq. (9) can be approximated by the greedy algorithm in Algorithm 1 with the set function f : = R.

T174

22231-22408

Sentence

denotes

For any submodular and monotone function f with f(∅) =0, the problem of finding a set S of size k that maximizes f(S) can be approximated by the greedy algorithm in Algorithm 1.

T175

22409-22579

Sentence

denotes

The algorithm iteratively selects a new sensor u that maximizes the incremental change of f and includes the new sensor into the set S until k sensors have been selected.

T176

22580-22755

Sentence

denotes

It is shown that the algorithm guarantees an approximation ratio of f(S)/f(S * ) ≥ 1 −1/e, where S is the output of the greedy algorithm and S * is the optimal solution [30] .

T177

22756-22972

Sentence

denotes

In Greedy(k, R), a thorny problem is that there is no efficient way to compute R(S) given a placement S, and we turn to run snapshots for 10, 000 trials to obtain an accurate estimate of R(S), mentioned in Section 3.

T178

22973-23021

Sentence

denotes

This actually leads to expensive selection time.

T179

23022-23222

Sentence

denotes

Another source of inefficiency in Greedy(k, R) is that there exists O(kN) iterations at the remaining time estimation step, where k is the size of the initial sensor set, and N is the number of nodes.

T180

23223-23290

Sentence

denotes

When N is large, the efficiency of the algorithm is unsatisfactory.

T181

23291-23552

Sentence

denotes

Hence, in order to improve the efficiency of Greedy(k, R), one can either reduce the number of calls for evaluating R(S), or develop advanced heuristic algorithms which can conduct fast and approximate estimations for R(S) at the expense of accuracy guarantees.

T182

23553-23714

Sentence

denotes

In order to prune the estimation calls in Greedy(k, R), a natural idea is to employ the Cost-Effective Lazy Forward selection (CELF) algorithm proposed in [24] .

T183

23715-23908

Sentence

denotes

The principle behind is that the marginal gain of a node in the current iteration cannot be more than that in previous iterations, and thus the number of estimation calls can be greatly pruned.

T184

23909-24040

Sentence

denotes

CELF optimization produces the same sensor set as the original greedy algorithm, but it is much faster than the original one [24] .

T185

24041-24158

Sentence

denotes

Although CELF significantly improves Greedy(k, R), the sensor selection time is still unaffordable on large networks.

T186

24159-24399

Sentence

denotes

In particular, in the first round to establish the initial upper bounds, CELF needs to estimate R({v}) using MC simulations for each node v, leading to N times of MC calls, which is time-consuming, especially when the network is very large.

T187

24400-24600

Sentence

denotes

The limitation leads to a rather fundamental question that, can we derive an upper bound of R({v}) which can be used to further prune unnecessary detection time estimations (MC calls) in Greedy(k, R)?

T188

24601-24727

Sentence

denotes

Motivated by this question and the idea in [40] , in this section we derive an initial upper bound of R({v}) for Greedy(k, R).

T189

24728-24870

Sentence

denotes

Based on the bound, we propose a new greedy algorithm Upper Bound based Greedy (UBG for short), which outperforms the original CELF algorithm.

T190

24871-24980

Sentence

denotes

Essentially different from the bounds that derived for the influence spread under the IC model [40] , we here

T191

24982-25036

Sentence

denotes

In this part, we aim to derive an upper bound of R(v).

T192

25037-25121

Sentence

denotes

Before introducing the upper bounds in Theorem 5, we first prepare two propositions.

T193

25122-25245

Sentence

denotes

Let P u (v ∈ J t ) denote the probability that node v becomes infected before step t when the initially infected node is u.

T194

25246-25287

Sentence

denotes

We have the first proposition as follows.

T195

25288-25364

Sentence

denotes

For v ∈ V , the remaining time R(v) under the DSI model can be calculated as

T196

25365-25371

Sentence

denotes

Proof.

T197

25372-25405

Sentence

denotes

In fact, by the definition in Eq.

T198

25406-25424

Sentence

denotes

(2), we first have

T199

25425-25646

Sentence

denotes

where the fourth '=' is due to the fact that u ∈ V (u) = 1 in above derivation. ᮀ Proposition 2 reveals that we can treat the global remaining time R(v) as a summation of all T max propagation steps of local probabilities

T200

25647-25738

Sentence

denotes

Based on Proposition 2, a following question is, what is the relationship between two sets,

T201

25739-25753

Sentence

denotes

Proposition 3.

T202

25754-25797

Sentence

denotes

For k ≥ 1, we have the following inequation

T203

25798-25804

Sentence

denotes

Proof.

T204

25805-25885

Sentence

denotes

For k ≥ 1, by the definition of conditional expectation and DSI model, we obtain

T205

25886-25954

Sentence

denotes

where the first '≤' is due to 1 {v/ ∈J k−1 } ≤ 1, and the second '≤'

T206

25955-26142

Sentence

denotes

x i in the above derivation.ᮀ Proposition 3 clearly identifies the ordering relationship between two adjacent elements in the series P u (v ∈ J 0 ), P u (v ∈ J 1 ), P u (v ∈ J 2 ), · · ·.

T207

26143-26216

Sentence

denotes

Now we simplify the results in Proposition 3 by using the form of matrix.

T208

26217-26311

Sentence

denotes

Let IP be the infection probability matrix with the element at position (u, v) being ip(u, v).

T209

26312-26344

Sentence

denotes

For t ≥ 0, denote the row vector

T210

26345-26441

Sentence

denotes

as the probabilities of nodes being infected before step t, i. e ., Â u t (v) := P u (v ∈ J t ).

T211

26442-26483

Sentence

denotes

Obviously, we have Â u 0 (v) = 1 {u} (v).

T212

26484-26542

Sentence

denotes

Now we can rewrite Proposition 3 by using the matrix form,

T213

26543-26627

Sentence

denotes

By iteration, we further get that u t ≤ u 0 · (E + IP) t , where E is a unit matrix.

T214

26628-26704

Sentence

denotes

Furthermore, due to the definition of probability Â u t (v), it follows that

T215

26705-26774

Sentence

denotes

Hereafter, define A ∧ 1 : = {a(i, j) ∧ 1} for a matrix A = {a(i, j)}.

T216

26775-26875

Sentence

denotes

With the above preparations, we can present the results on upper bound of remaining time as follows,

T217

26876-26944

Sentence

denotes

where E is a unit matrix and [A] (u,v) means the element at position

T218

26946-27029

Sentence

denotes

where is a prior distribution on the likelihood of nodes being the infected source.

T219

27030-27101

Sentence

denotes

We first use a toy example to explain how to calculate the upper bound.

T220

27102-27112

Sentence

denotes

Example 1.

T221

27113-27197

Sentence

denotes

Given a directed network G in Fig. 1 with infection probability matrix in Eq. (22) .

T222

27198-27285

Sentence

denotes

Let T max = 10, we have ((E + IP) t ∧ 1) is intractable when the network size is large.

T223

27286-27370

Sentence

denotes

To overcome the difficulty, we adopt the following procedure to calculate Tmax−1 t=0

T224

27371-27493

Sentence

denotes

If t ≥1 ≤ T max − 1, it implies that we do not need to calculate (E + IP) t any more when t ≥ t ≥1 , and therefore we have

T225

27494-27546

Sentence

denotes

where 1 is a N × N matrix with all elements being 1.

T226

27547-27572

Sentence

denotes

Additionally we find that

T227

27574-27580

Sentence

denotes

Proof.

T228

27581-27630

Sentence

denotes

In fact, when IP is small enough, it follows that

T229

27631-27773

Sentence

denotes

where the approximation stems from Taylor expansion.ᮀ By Eqs. (21) and (25), the new upper bound of remaining time can be approximately put as

T230

27774-27821

Sentence

denotes

which is relatively tractable when IP is small.

T231

27822-27908

Sentence

denotes

Based on the upper bound, we propose a new UBG algorithm for early outbreak detection.

T232

27909-27962

Sentence

denotes

First we explain the difference between UBG and CELF.

T233

27963-28061

Sentence

denotes

The CELF algorithm [24] exploits the submodular property to improve the original greedy algorithm.

T234

28062-28246

Sentence

denotes

However, CELF demands N (the number of nodes in the network) remaining time estimations to establish the initial bounds of marginal increments, which is time expensive on large graphs.

T235

28247-28439

Sentence

denotes

In contrast, the proposed Upper Bound based Greedy (UBG) algorithm uses the derived new bound to further reduce the number of remaining time estimations, especially in the initialization step.

T236

28440-28573

Sentence

denotes

This way, the nodes are ranked by their upper bound scores which can be used to prune unnecessary calculations in the CELF algorithm.

T237

28574-28608

Sentence

denotes

We use Example 2 for illustration.

T238

28610-28748

Sentence

denotes

Even with the UBG algorithm proposed in Section 5, its running time is still unbearable and may not be suitable for large social networks.

T239

28749-28924

Sentence

denotes

Although UBG can greatly reduce the number of remaining time estimation calls, each estimation call is very timeconsuming, as it needs produce enough samples and average them.

T240

28925-29091

Sentence

denotes

Hence a possible alternative to further accelerate UBG is to employ heuristics to approximate the remaining time R(S) or the marginal return R(S ∪ {u}) − R(S) in UBG.

T241

29092-29190

Sentence

denotes

Along this idea, we here introduce two accelerative algorithms to address the inefficiency of UBG:

T242

29191-29233

Sentence

denotes

Quickest-Path-UBG and Local-Reduction-UBG.

T243

29234-29474

Sentence

denotes

The experimental results will show that (i) both of them are efficient in terms of running time, and (ii) the former one applies to sparse networks better, while the latter one is more effective in dense networks in terms of detection time.

T244

29475-29562

Sentence

denotes

In this part we introduce a tractable heuristic to approximate the remaining time R(S).

T245

29563-29653

Sentence

denotes

From Eq. (5) and Eq. (10), the key point of estimating R(S) is how to estimate E[ (u, S)].

T246

29654-29773

Sentence

denotes

Note that E[ (u, S)] measures the expected time delay of a diffusion initiated from node u propagating to sensor set S.

T247

29774-29886

Sentence

denotes

An intuitive idea is that the most likely propagation path should be the quickest path from node u to the set S.

T248

29887-29966

Sentence

denotes

Hence, a question is how to measure the quickest path from node u to the set S?

T249

29967-30036

Sentence

denotes

The proposed Quickest-Path-UBG is inspired by answering the question.

T250

30037-30177

Sentence

denotes

According to the geometric distribution, if a random variable X is distributed geometrically with a parameter p, it follows that E[X] = 1/p.

T251

30178-30442

Sentence

denotes

Since the time that a node u spends in infecting its uninfected neighborhood v is distributed geometrically with the parameter ip(u, v), the value 1/ip(u, v) should be the expected time cost that an infection propagates from node u to node v along the edge (u, v).

T252

30443-30467

Sentence

denotes

Fig. 2 shows an example.

T253

30468-30568

Sentence

denotes

For any two nodes u and v, let d(u, v) be the smallest time cost among the paths connecting u and v.

T254

30569-30629

Sentence

denotes

For example, d( , ) = min{5 + 3.3, 10 + 5} = 8.3 in Fig. 2 .

T255

30630-30841

Sentence

denotes

Then, for a subset S ⊆ V, we define d(u, S) := min v ∈ S d(u, v). (27) Intuitively, d(u, S) denotes the expected time that an infection propagates from u to S along the quickest path in the graph G = (V, E, m) .

T256

30842-30976

Sentence

denotes

Therefore, a fundamental question arises, is there some approximate relationship between the detection time and the shortest distance.

T257

30977-31009

Sentence

denotes

Theoretically we have Theorem 6.

T258

31010-31171

Sentence

denotes

When the shortest path from u to S in graph G = (V, E, m) is unique, the quantity d(u, S) ∧ T max can be used to approximate the detection time E[ (u, S)], i.e.,

T259

31172-31178

Sentence

denotes

Proof.

T260

31179-31272

Sentence

denotes

In fact, when the shortest path from u to S in graph G = (V, E, m) is unique, it follows that

T261

31273-31377

Sentence

denotes

where we use an result borrowed from probability theory for the approximation: consider random variables

T262

31378-31476

Sentence

denotes

.ᮀ Based on Theorem 6, combining Eq. (5), Eq. (10) and Eq. (28), we have the following derivation,

T263

31477-31520

Sentence

denotes

Hence we can approximate the remaining time

T264

31521-31611

Sentence

denotes

rather than the heavy Monte-Carlo estimation MC(S ∪ {u}) in the 09th row of UBG algorithm.

T265

31612-31704

Sentence

denotes

We nail down this new method as Quickest-Path-UBG, which is shown explicitly in Algorithm 3.

T266

31706-31945

Sentence

denotes

Note that the reason we call this acceleration as Quickest-Path-UBG rather than Shortest-Path-UBG lies in that, our Quickest-Path-UBG introduces a new measurement to reflect the infection time cost rather than the distance in common sense.

T267

31946-32090

Sentence

denotes

In this part we turn to a tractable heuristic to approximate the marginal return ı u (S) : = R(S ∪ {u}) − R(S) in the 09th row of UBG algorithm.

T268

32091-32275

Sentence

denotes

To facilitate the description, we will approximate the marginal reduction D(S) − D(S ∪ {u}) rather than the marginal return R(S ∪ {u}) − R(S), due to the fact that R(S) = T max − D(S).

T269

32276-32417

Sentence

denotes

In sociology literature, degree and other centrality-based heuristics are commonly used to select influential nodes in social networks [37] .

T270

32418-32580

Sentence

denotes

These methods usually looked over the multi-step complexity of network diffusions, and assumed that the infection can propagate ahead only one hop before the end.

T271

32581-32689

Sentence

denotes

In other words, if a node initiates an infection, the diffusion spread is at most its first-order neighbors.

T272

32690-32735

Sentence

denotes

We call this as one-hop assumption hereafter.

T273

32736-32972

Sentence

denotes

Especially, experimental results in [24] showed that selecting vertices with maximum indegree as sensors results in earlier infection detection than other heuristics, which validated the rationality of one-hop assumption to some extent.

T274

32973-33116

Sentence

denotes

Under this observation, we will propose a tractable method upon the one-hop assumption to approximate the marginal reduction D(S) − D(S ∪ {u}).

T275

33117-33554

Sentence

denotes

We first introduce a useful result in probability theory: if the random variable X i is distributed geometrically with parameter p i for i = 1, · · ·, n and they are independent, then ∧ i=1,· · ·,n X i is distributed geometrically with parameter 1 − According to the characteristic of the DSI model, the time cost that an infected node w spends in infecting its uninfected neighbor v is distributed geometrically with parameter ip(w, v).

T276

33555-33679

Sentence

denotes

If the initially infected node w has multi-neighbors in sensor set S, the infections to different neighbors are independent.

T277

33680-33774

Sentence

denotes

Under the onehop assumption, the expected detection time of w can thus be calculated like this

T278

33775-33896

Sentence

denotes

If the initially infected node w has no neighbors in sensor set S, the expected detection time of w is defined as T max .

T279

33897-33938

Sentence

denotes

If w ∈ S, it follows that E[ (w, S)] = 0.

T280

33939-34084

Sentence

denotes

Fig. 3 presents an example to show the calculation result of expected detection time E[ (w, S)] for each node w ∈ V under the one-hop assumption.

T281

34085-34220

Sentence

denotes

With the preparations above, we are about to calculate the marginal reduction ı u (S) = D(S) − D(S ∪ {u}) under the one-hop assumption.

T282

34221-34291

Sentence

denotes

Assume the set S has been selected as sensor set to detect infections.

T283

34292-34421

Sentence

denotes

When considering adding another node u as a new sensor into S, the reduction of the expected detection time can be put as follows

T284

34422-34442

Sentence

denotes

which is by Eq. (5).

T285

34443-34530

Sentence

denotes

We now use Theorem 7 to present the concrete expression of the right part of Eq. (30) .

T286

34531-34616

Sentence

denotes

Similar with the definition of Eq. (1), we predefine the parents of set S as follows,

T287

34617-34627

Sentence

denotes

Theorem 7.

T288

34628-34793

Sentence

denotes

In the DSI model with infection probability ip(w, v) on a directed graph G = (V, E), the reduction ı u (S) of the expected detection time can be calculated like this

T289

34794-35029

Sentence

denotes

under one-hop assumption, where f(u, S) is defined as The reason that we call it as Local-Reduction-UBG is that we only consider the local change in monitoring {u} and Par(u) \ Par(S) Table 2 Statistics of the four real-world networks.

T290

35030-35136

Sentence

denotes

Digger Twitter Epinions Small-world #Node 8194 32,986 51,783 200,000 #Edge 56,440 763,713 476,

T291

35138-35301

Sentence

denotes

We conduct experiments on both synthetic and real-world data sets to evaluate the UBG algorithm, the Quickest-Path-UBG algorithm and Local-Reduction-UBG algorithm.

T292

35302-35381

Sentence

denotes

We implement the algorithms using C++ with the Standard Template Library (STL).

T293

35382-35488

Sentence

denotes

All experiments are run on a Linux (Ubuntu 11.10) machine with six-core 1400 MHz AMD CPU and 32 GB memory.

T294

35489-35553

Sentence

denotes

Three real and one synthetic data sets are used for comparisons.

T295

35554-35737

Sentence

denotes

The Digger data 1 is a heterogeneous network, including Digg stories, user actions (submit, digg, comment and reply) with respect to the stories, and friendship relations among users.

T296

35738-35822

Sentence

denotes

The Twitter and Epinions data sets can both be obtained from Stanford Datasource 2 .

T297

35823-35954

Sentence

denotes

Epinions is a general consumer review site where visitors can read reviews about a variety of items to help them decide a purchase.

T298

35955-36072

Sentence

denotes

The synthetic Small-world data set is the type of graphs in which each node can be reached by a small number of hops.

T299

36073-36184

Sentence

denotes

For smallworld model we set the parameter of the nearest neighbors k = 15 and the rewiring probability p = 0.1.

T300

36185-36304

Sentence

denotes

The above networks are representative ones, covering a variety of networks with different types of relations and sizes.

T301

36305-36385

Sentence

denotes

The details of the data sets are listed in Table 2 where degree means in-degree.

T302

36386-36463

Sentence

denotes

In our experiments, an undirected graph is regarded as a bidirectional graph.

T303

36464-36622

Sentence

denotes

We compare the UBG in Algorithm 2, the Quickest-Path-UBG in Algorithm 3, and Local-Reduction-UBG in Algorithm 4 with both the greedy and heuristic algorithms.

T304

36623-36636

Sentence

denotes

• CELF [24] .

T305

36637-36757

Sentence

denotes

The state-of-the-art greedy algorithm, where uses 10,000 snapshots in the whole process for any network. • DEGREE [37] .

T306

36758-36845

Sentence

denotes

A heuristic algorithm based on "degree centrality", with high-degree nodes as key ones.

T307

36846-36932

Sentence

denotes

The seeds are the nodes with the k highest in-degrees. • INTER-MONITOR DISTANCE [35] .

T308

36933-37083

Sentence

denotes

A heuristic algorithm which requires any pair of sensors to be at least d hops away, where d is as large as it can choose k monitors. • PageRank [5] .

T309

37084-37161

Sentence

denotes

A link analysis algorithm which ranks the importance of pages in a Web graph.

T310

37162-37278

Sentence

denotes

We implement the power method with a damping factor of 0.85, and pick the k highest-ranked nodes as seeds. • Random.

T311

37279-37376

Sentence

denotes

It simply selects k random vertices in the graph as the seed set, which is taken as the baseline.

T312

37377-37558

Sentence

denotes

In our experiments, to obtain the detection time of sensor sets provided by heuristic algorithms, we run Monte-Carlo simulation on the networks 10, 000 times and calculate the mean.

T313

37559-37701

Sentence

denotes

The simple greedy algorithm is not compared because many works have reported that CELF has the same optimization result and less running time.

T314

37702-37863

Sentence

denotes

Since the DEGREE heuristic is the state-of-the-art [37] , we do not implement heuristics such as distance centrality and betweenness centrality-based heuristics.

T315

37864-38032

Sentence

denotes

We mainly report results on a uniform infection probability of 0.1 assigned to each directed link in the network, i.e., ip(u, v) = 0.1 for any directed edge (u, v) ∈ E.

T316

38033-38159

Sentence

denotes

One can refer to the work [14, 33, 36] for learning real values of the parameters {ip(u, v) : (u, v) ∈ E} from available data.

T317

38160-38257

Sentence

denotes

Besides, we let the time horizon T max = 30 and the prior distribution be uniform in the network.

T318

38258-38325

Sentence

denotes

In Section 3, we proposed two Monte-Carlo methods to estimate D(S).

T319

38326-38413

Sentence

denotes

Table 3 shows the estimation results of Propagation Simulation and Snapshot Simulation.

T320

38414-38502

Sentence

denotes

Here we select 10 nodes with the highest in-degrees in each network as the sensor set S.

T321

38503-38663

Sentence

denotes

We can find that Propagation Simulation and Snapshot Simulation release almost the same estimation results, which confirms their equivalence in estimating D(S).

T322

38664-38735

Sentence

denotes

Fig. 4 shows the cumulative time cost of these two Monte-Carlo methods.

T323

38736-38911

Sentence

denotes

Here we randomly select 10 sensor sets {S 1 , S 2 , . . ., S 10 } from the Digger data and every sensor set has five sensor nodes, i.e. |S i | = 5 for all i = 1, 2, . . ., 10.

T324

38912-39098

Sentence

denotes

We can see that the cumulative time cost of Propagation Simulation increases linearly, while that of Snapshot Simulation has a big jump at the first sensor set and then increases slowly.

T325

39099-39310

Sentence

denotes

The reason behind is that Snapshot Simulation needs to establish numerous snapshots to estimate the first D(S 1 ), and these established snapshots can be reused in the posterior estimations of {D(S i )} 10 i=2 .

T326

39311-39400

Sentence

denotes

Table 4 shows the gap between the real value of remaining time R(S) and its upper bounds.

T327

39401-39495

Sentence

denotes

Here we also select ten nodes with the highest in-degrees in each network as the sensor set S.

T328

39497-39677

Sentence

denotes

In Table 5 , we compare the number of estimation or approximation calls at the first 10 iterations among CELF, UBG, Quickest-Path-UBG and Local-Reduction-UBG on the four data sets.

T329

39678-39847

Sentence

denotes

From the results, we can observe that the call number in UBG and its two accelerations are significantly reduced compared to that in CELF, especially at the first round.

T330

39848-39985

Sentence

denotes

One may notice that in Table 5 , CELF occasionally defeats our methods, but the total call number of our methods are much less than CELF.

T331

39986-40162

Sentence

denotes

As listed in 5, the total call number of the first 10 iterations of UBG and its two accelerations, compared to CELF, is reduced at a rate of 94% at least on the four data sets.

T332

40163-40264

Sentence

denotes

Similarly, at least 81% reduction of call numbers of CELF can be observed in the first 50 iterations.

T333

40265-40362

Sentence

denotes

From the observation, we can conclude that our UBG is more efficient than CELF on large networks.

T334

40363-40462

Sentence

denotes

Detection time measures the time delay of a message propagated from a diffusion source to a sensor.

T335

40463-40627

Sentence

denotes

We run tests on the four data sets and obtain detection time results w.r.t. parameter k (the number of sensors), where k increases from 1 to 50 as shown in Fig. 5 .

T336

40628-40710

Sentence

denotes

UBG, as an updated version of CELF, has competitive results on the four data sets.

T337

40711-40892

Sentence

denotes

More importantly, the detection times of UBG and CELF are completely the same in the four figures, which explains again that UBG and CELF share the same results in sensor selection.

T338

40893-40983

Sentence

denotes

The only difference between UBG and CELF is the number of remaining time estimation calls.

T339

40984-41074

Sentence

denotes

The Quickest-Path-UBG and Local-reduction-UBG always perform better than other heuristics.

T340

41075-41289

Sentence

denotes

More specifically, if the networks are sparse like Digger and Epinions, the Quickest-Path-UBG is more competitive; if the networks are dense like Twitter and Smallworld, the Local-reduction-UBG is more competitive.

T341

41290-41362

Sentence

denotes

Selection time measures the time cost of an algorithm selecting sensors.

T342

41363-41423

Sentence

denotes

Fig. 6 shows the time cost of selecting sensors with k = 50.

T343

41424-41458

Sentence

denotes

UBG is 4-8 times faster than CELF.

T344

41459-41543

Sentence

denotes

One may argue that such a low improvement of UBG can be neglected in large networks.

T345

41544-41587

Sentence

denotes

In fact, UBG scales well to large networks.

T346

41588-41771

Sentence

denotes

Because, with the size of a network increase, Monte-Carlo simulations take more time, and thus UBG will achieve better performance by pruning more unnecessary Monte-Carlo simulations.

T347

41772-41882

Sentence

denotes

As to heuristics, Degree and Random are very fast in selecting candidate nodes, which take less than 1 second.

T348

41883-42001

Sentence

denotes

Quickest-Path-UBG and Local-Reduction-UBG are exciting and adoptable, due to their good performance in detection time.

T349

42002-42131

Sentence

denotes

The PageRank and Inter-Monitor Distance are slightly slower and undesirable, in view of their poor performance in detection time.

T350

42132-42243

Sentence

denotes

We run tests on Twitter data set and obtain the selection time w.r.t. parameter ip (the infection probability).

T351

42244-42293

Sentence

denotes

In the experiments, ip increases from 0.1 to 0.5.

T352

42294-42381

Sentence

denotes

We assign a uniform infection probability ip to each directed link under the DSI Model.

T353

42382-42502

Sentence

denotes

From the results in Fig. 7 , we can conclude that with the parameter ip growing larger, the CELF is more time-consuming.

T354

42503-42593

Sentence

denotes

By contrast, the UBG and its accelerations are robust and insensitive to the parameter ip.

T355

42594-42656

Sentence

denotes

2) First, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∪ T(u, S 2 ).

T356

42657-42718

Sentence

denotes

For any t ∈ T(u, S 1 ∪ S 2 ), I t ∩ (S 1 ∪ S 2 ) / = ∅ holds.

T357

42719-42873

Sentence

denotes

From I t ∩ (S 1 ∪ S 2 ) = (I t ∩ S 1 ) ∪ (I t ∩ S 2 ) / = ∅, we know either I t ∩ S 1 or I t ∩ S 2 is nonempty, which implies t ∈ T(u, S 1 ) ∪ T(u, S 2 ).

T358

42874-42929

Sentence

denotes

Hence T(u, S 1 ∪ S 2 ) ⊂ T(u, S 1 ) ∪ T(u, S 2 ) works.

T359

42930-43025

Sentence

denotes

Conversely, for any t ∈ T(u, S 1 ) ∪ T(u, S 2 ), t will be located in T(u, S 1 ) or T(u, S 2 ).

T360

43026-43091

Sentence

denotes

Without loss of generality, we assume t ∈ T(u, S 1 ), which means

T361

43093-43143

Sentence

denotes

is nonempty, which indicates t ∈ T(u, S 1 ∪ S 2 ).

T362

43144-43204

Sentence

denotes

Hence T(u, S 1 ) ∪ T(u, S 2 ) ⊂ T(u, S 1 ∪ S 2 ) also works.

T363

43205-43265

Sentence

denotes

Second, we prove T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ).

T364

43266-43394

Sentence

denotes

We take infimum operations on both sides of the first step conclusion, i.e., inf T(u, S 1 ∪ S 2 ) = inf T(u, S 1 ) ∪ T(u, S 2 ).

T365

43395-43505

Sentence

denotes

Since all these sets are be composed of integers, we obviously get T(u, S 1 ∪ S 2 ) = T(u, S 1 ) ∧ T(u, S 2 ).

T366

43506-43733

Sentence

denotes

Finally, we take the infimums with T max on both sides of the second step conclusion, i.e., T(u, S 1 ∪ S 2 ) ∧ T max = T(u, S 1 ) ∧ T(u, S 2 ) ∧ T max , which means (u, S 1 ∪ S 2 ) = (u, S 1 ) ∧ (u, S 2 ) by the definition of .

T367

43734-44069

Sentence

denotes

3) Similarly, in order to prove this property, we only need to , {u}) , i.e., for any t ≥ 0, the probability that t is located in T(u, {v}) equals the probability that t is located in 4) If S ⊂ T, in order to prove this property, we only need to prove T(u, S) ⊂ T(u, T ) and notice the fact that the infimum of a larger set is smaller.

T368

44070-44111

Sentence

denotes

For any t ∈ T(u, S), I t ∩ S / = ∅ holds.

T369

44112-44202

Sentence

denotes

Besides, as S ⊂ T, I t ∩ S / = ∅ implies I t ∩ T / = ∅, which means t is also in T(u, T ).

T370

44203-44285

Sentence

denotes

So, T(u, S) ⊂ T(u, T ), by which we can easily conclude the property, is verified.

T371

44286-44370

Sentence

denotes

5) If T(u, S) is empty, then T(u, S) =+∞. (u, S) =+ ∞ ∧ T max = T max ∈ [0, T max ].

T372

44371-44488

Sentence

denotes

If T(u, S) is nonempty, then every element in it is greater than 0 and notice that (u, S) = T(u, S) ∧ T max ≤ T max .

T373

44489-44536

Sentence

denotes

Therefore, we can conclude 0 ≤ (u, S) ≤ T max .

T374

44537-44666

Sentence

denotes

We prove the theorem by a reduction from the counting problem of the positive partitioned 2-DNF assignment [9] , denoted by (A1).

T375

44667-44672

Sentence

denotes

(A1):

T376

44673-44757

Sentence

denotes

Let E be a subset of N 2 and X i , Y j , (i, j) ∈ E are pairwise distinct variables.

T377

44758-44889

Sentence

denotes

We define a formula F : = ∨ {X i ∧ Y j : (i, j) ∈ E}, which is a disjunction of all the conjunctive clauses X i ∧ Y j , (i, j) ∈ E.

T378

44890-44937

Sentence

denotes

How many valuations that satisfy the formula F?

T379

44938-44983

Sentence

denotes

Then we introduce the following problem (A2).

T380

44984-44989

Sentence

denotes

(A2):

T381

44990-45223

Sentence

denotes

G = (V, E) is a directed graph, where each edge e ∈ E is associated with a weight w e which is geometrically distributed with a parameter p, i.e. P(w e = i) = p(1 − p) i−1 , i ≥ 1. s, t are two nodes in G and k is a positive integer.

T382

45224-45320

Sentence

denotes

What is the probability of the event that the shortest path from s to t has length T at least k?

T383

45321-45365

Sentence

denotes

Now we prove that (A1) is reducible to (A2).

T384

45366-45436

Sentence

denotes

We map the variable X i to a node x i and variable Y j to a node y j .

T385

45437-45497

Sentence

denotes

If (i, j) exists in E, then we add an edge from x i to y j .

T386

45498-45668

Sentence

denotes

Also, we add a source node s and an edge (s, x i ), associated with weight w (s,x i ) which is geometrically distributed with parameter 1/2, for any i ∈ {i : (i, j) ∈ E}.

T387

45669-45765

Sentence

denotes

Thus, w (s,x i ) equals 1 with probability 1/2 and strictly greater than 1 with probability 1/2.

T388

45766-45886

Sentence

denotes

Similarly, we add a terminal node t and an edge from each y j to t with a weight identically distributed as w (s,x i ) .

T389

45887-45959

Sentence

denotes

So we construct a graph with geometrically distributed weights on edges.

T390

45960-46215

Sentence

denotes

Now we can build a probability-preserving bijection between the valuations of X i , Y j and the subgraphs of the constructed random graph: for a valuation , the corresponding subgraph is the one where each edge adjacent to x i has length 1 iff (X i ) = 1.

T391

46216-46243

Sentence

denotes

The same argument for Y j .

T392

46244-46415

Sentence

denotes

We set k = 3 and claim the fact that the probability P(T ≥ 3) in (A2) is the number of valuations satisfying that F is divided by 2 N , where N is the number of variables.

T393

46416-46637

Sentence

denotes

Indeed, the fact is based on the following observation: a valuation is true iff some adjacent pair of X i and Y j is true if the incident edges to the corresponding x i and y j have length 1 in the corresponding subgraph.

T394

46638-46907

Sentence

denotes

Note that any edge of length greater than 1 is irrelevant as the structure of the graph, which ensures it can never be part of a path of length 3 from s to t, For any path from s to t must jump three times: from s to some x i , from x i to some y j , and from y j to t.

T395

46908-47001

Sentence

denotes

Thus, from what we have stated, (A1) can be reducible to (A2), which implies (A2) is #P-hard.

T396

47002-47056

Sentence

denotes

We aim to solve the problem (denoted by (A)) of calcu-

T397

47058-47210

Sentence

denotes

In order to prove the random variables T(u, S) and c(u, S) are identically distributed, we need to prove P[T (u, S) = t] = P[c(u, S) = t] for any t > 0.

T398

47211-47274

Sentence

denotes

On the one hand, we can prove P[T (u, S) = t] ≤ P[c(u, S) = t].

T399

47275-47385

Sentence

denotes

Given t > 0, we find that for any simulation with T(u, S) = t, a snapshot with c(u, S) = t can be constructed.

T400

47386-47476

Sentence

denotes

By definition, if T(u, S) = t, then J i ∩ S = ∅, i = 0, 1, · · ·, t − 1 and J t ∩ S / = ∅.

T401

47477-47606

Sentence

denotes

For any node v i+1 ∈ J i+1 \J i / = ∅, suppose that v i+1 is infected through the edge (s, v i+1 ) with s ∈ J i , v i+1 ∈ J i+1 .

T402

47607-47638

Sentence

denotes

Denote n s = inf{j : s ∈ J j }.

T403

47639-47712

Sentence

denotes

Then we let the edge (s, v i+1 ) take weight c(s, v i+1 ) = i + 1 − n s .

T404

47713-47781

Sentence

denotes

For any node s ∈ J t , we denote n s , i.e. n s = sup{j : s ∈ J j }.

T405

47782-47919

Sentence

denotes

Then for any v ∈ V \J t , let the edge (s , v i+1 ) take weight c(s , v) = t − n s + C, where C is sampled from a geometric distribution.

T406

47920-48012

Sentence

denotes

In addition, we take a weight sampled from geometric distribution on each of the rest edges.

T407

48013-48124

Sentence

denotes

In such operations, a snapshot with c(u, S) = t is constructed, which implies P[T (u, S) = t] ≤ P[c(u, S) = t].

T408

48125-48193

Sentence

denotes

On the other hand, we can also get P[c(u, S) = t] ≤ P[T (u, S) = t].

T409

48194-48278

Sentence

denotes

For any snapshot with c(u, S) = t, a simulation with T(u, S) = t can be constructed.

T410

48279-48410

Sentence

denotes

We let node u be the initiator, i.e. J 0 = {u} and denote the smallest number in the set {c(u, v) : u ∈ J 0 , v ∈ V \J 0 } by n 0 .

T411

48411-48501

Sentence

denotes

Meanwhile, we denote the set of nodes {v : c(u, v) = n s , u ∈ J 0 , v ∈ V \J 0 } by n 0 .

T412

48502-48525

Sentence

denotes

Let J n 0 = J 0 ∪ n 0 .

T413

48526-48594

Sentence

denotes

Complementally, we define all the J i = J 0 , i = 0, · · ·, n 0 − 1.

T414

48595-48747

Sentence

denotes

Denote the smallest number in the set {c(u, v) : u ∈ J t , v ∈ V \J t } by n t and the set of nodes {v : c(u, v) = n s , u ∈ J t , v ∈ V \J t } by n t .

T415

48748-48781

Sentence

denotes

Let J t+n t = J n t = J t ∪ n t .

T416

48782-48854

Sentence

denotes

Complementally, we define J i : = J t for all i = t, · · ·, t + n t − 1.

T417

48855-48944

Sentence

denotes

We now can construct a simulation of the diffusion model (I t ) t≥0 = (J t+1 \J t ) t≥0 .

T418

48945-49038

Sentence

denotes

Hence, we have a simulation with T(u, S) = t, which implies P[c(u, S) = t] ≤ P[T (u, S) = t].

T419

49039-49173

Sentence

denotes

Therefore, we have P[T (u, S) = t] = P[c(u, S) = t] for any t > 0, which implies that T(u, S) and c(u, S) are identically distributed.

T420

49174-49255

Sentence

denotes

Hence, the simulation and snapshot simulations are equivalent in estimating D(S).

T1

49256-49402

Sentence

denotes

In this paper, we discussed scalable solutions to detection time minimization problem for early detection of dynamic harmful cascades in networks.

T2

49403-49554

Sentence

denotes

To address this problem, we proposed an upper bound based greedy (UBG) algorithm and its two accelerative algorithms to cater for lager scale networks.

T3

49555-49661

Sentence

denotes

The UBG solution guarantees a near-optimal detection time, pruning at least 80% Monte-Carlo calls of CELF.

T4

49662-49791

Sentence

denotes

The novel accelerations on UBG can significantly reduce the selection time and further achieve at least 10 2 times speed-raising.

T5

49792-49840

Sentence

denotes

There are several interesting future directions.

T6

49841-50012

Sentence

denotes

First, the detection time D(S) is # P-hard to calculate exactly, it is still a question how to design an efficient algorithm to estimate D(S) with a theoretical guarantee.

T7

50013-50204

Sentence

denotes

Second, the infection probability ip and prior distribution are predefined in this paper, how to learn these parameters for the DSI model from available cascade data still remains unexplored.

CORD-19:6e2db98aa242e3e84116afdfad5e250943c55ac5 JSON TXT 12 Projects

Annnotations TAB TSV DIC JSON TextAE

CORD-19:6e2db98aa242e3e84116afdfad5e250943c55ac5 JSONTXT 12 Projects

Annnotations TAB TSV DIC JSON TextAE

CORD-19:6e2db98aa242e3e84116afdfad5e250943c55ac5 JSON TXT 12 Projects