Video synopsis: A survey (translated paper)

Aggie · Updated: 2024-09-20

Computer Vision and Image Understanding 181 (2019) 26–38

journal homepage: www.elsevier.com/locate/cviu

Authors:

Kemal Batuhan Baskurt, Refik Samet

Keywords:

Video surveillance, Video processing, Video synopsis, Motion detection, Object tracking, Optimization, Background generation, Stitching

ABSTRACT

Video synopsis is an activity-based video condensation approach for efficient video browsing and retrieval of surveillance footage. It is one of the most effective ways to reduce the inactive density of an input video and thus provide fast and easy retrieval of the parts of interest. Unlike frame-based video summarization methods, the activities of interest are shifted in the time domain to obtain a more compact video representation. Although the number of studies on video synopsis has increased over the past years, there has been no survey of the subject. The aim of this article is to review state-of-the-art video synopsis approaches and provide a comprehensive analysis. The methodology of video synopsis is described to give an overview of the algorithm flow. Recent literature is examined from different aspects such as optimization type, camera topology, input data domain, and activity clustering mechanisms. Commonly used performance evaluation techniques are also examined. Finally, the current state of the literature and potential future research directions are discussed after an exhaustive analysis covering most of the studies in this field, from the earliest to the present. To the best of our knowledge, this study is the first review of published video synopsis approaches.

1 INTRODUCTION

Control and management of the huge amounts of recorded video are becoming more difficult with each passing day, given the rapid increase in security camera usage in daily life. Efficient video browsing and retrieval are critical issues considering the amount of raw video data to be summarized, and the manpower required to monitor visual data is a challenging problem. Therefore, video condensation techniques are being widely investigated through a large number of applications in diverse disciplines.

A popular approach to the video condensation problem is video synopsis, which has been investigated in the literature over the last decade. Video synopsis provides activity-based video condensation, in contrast to frame-based techniques such as video fast-forward (Smith and Kanade, 1998), video abstraction (Truong and Venkatesh, 2007), and video summarization (Chakraborty et al., 2015). Video synopsis operates on an activity as its processing unit, while frame-based approaches use a frame. Video synopsis achieves higher efficiency than frame-based video condensation techniques, since smaller processing units allow more detailed video analysis and therefore better condensation. Activities can be shifted in the time domain, and more than one activity can be shown simultaneously in a frame even if they come from different time periods.

The aim of video synopsis approaches is to find the best rearrangement of the activities so as to display most of them in the shortest time period. The biggest problem is handling activity collisions, as they can lead to the loss of important content and thereby reduce efficiency. Collisions also cause a chaotic viewing experience that degrades the visual quality of surveillance applications. Displaying the maximum number of objects with minimal collisions entails more computational complexity than frame-based methods, because the activities are processed separately instead of the whole frame at once. Thus, video synopsis has become a hot spot in video summarization, especially with the growth in the computational capacity of computers over the past years.

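
To make the collision notion concrete, here is a minimal illustrative sketch (not from any of the surveyed papers): a tube is modeled as a dict mapping frame index to a bounding box, and the collision cost of a candidate temporal shift is the total overlap area of the two tubes on the frames they share in the synopsis timeline.

```python
def box_overlap(a, b):
    """Overlap area of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def collision_cost(tube_a, tube_b, shift_a=0, shift_b=0):
    """Total overlap area between two tubes after temporal shifting.

    A tube maps frame index -> bounding box; shifting a tube by s moves
    its detection at frame t to frame t + s in the synopsis timeline.
    """
    shifted_a = {t + shift_a: box for t, box in tube_a.items()}
    shifted_b = {t + shift_b: box for t, box in tube_b.items()}
    common = shifted_a.keys() & shifted_b.keys()
    return sum(box_overlap(shifted_a[t], shifted_b[t]) for t in common)
```

Shifting one tube so that its frames no longer coincide with the other's drives the cost to zero, which is exactly the effect the optimization seeks.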
Existing video synopsis studies can be categorized by different aspects such as optimization type, camera topology, input data domain, and activity clustering. The aim of optimization is to find the best temporal positions of the selected activities in order to obtain a more compact representation; this is the most important part of the algorithm flow in video synopsis. Therefore, the dominant criterion for categorization is optimization type, which is divided into two categories, namely on-line and off-line. A large portion of the approaches perform off-line optimization of all activities to find the global optimum. However, the latest approaches increasingly use on-line optimization, which rearranges each new activity to find a local optimum. Camera topology divides the studies into two groups: single- and multi-camera solutions. Most approaches are oriented toward a single-camera view, which makes the optimization problem easier. Multi-camera approaches need to build a global energy definition covering the whole camera network with the intention of finding a solution that is optimal for all cameras; on the other hand, they provide the opportunity to display and analyze activities from a wider perspective. Some studies focusing on run-time performance propose techniques applied directly to compressed data instead of spending time and computational power on transforming the data to the pixel domain. Even though their run-time performance is significantly higher, their condensation ratio cannot compete with pixel-domain methods. Besides, some studies apply activity clustering to group similar activities and display them together, with the aim of providing a better understanding of the scene, as focusing on similar activities is easier for the user.

In this paper, we analyze 35 video synopsis approaches, covering all of the existing studies up to this point. The approaches are analyzed along the aforementioned aspects, and the diversity of pre- and post-processing methods used in existing video synopsis approaches is examined in detail.

The rest of the paper is organized as follows. Section 2 provides an overview of existing video synopsis approaches, emphasizing novelty and contribution to the field. Methods used in the algorithm flow of video synopsis are described in Section 3. An analysis of the approaches according to optimization type, camera topology, input data domain, and activity clustering is given in Section 4. Evaluation criteria and commonly used datasets are presented in Section 5. Finally, Section 6 contains the conclusions of the study.

2 Related works

Video synopsis is an activity-based video condensation technique whose main purpose is to display as many activities as possible simultaneously in the shortest time period. An activity represents a group of object instances belonging to a time period in which the object is visible. The activities extracted from the source are shifted in the time domain to calculate their optimal positions with the minimum number of collisions. Unlike frame-based video summarization techniques, activities from different time periods can be shifted into the same frame through pixel-based analysis. Therefore, more efficient condensation is achieved than with frame-based video summarization methods.

Activity-based video condensation was proposed by Rav-Acha et al. (2006) under the name of video synopsis, a novel approach that shifts detected activities in the time domain to display them simultaneously over a shorter time period, as depicted in Fig. 1. Their approach contained two main phases: on-line and off-line. The on-line phase included generating activities and storing them in a queue. Subsequently, the off-line phase started after selecting a time range for the video synopsis, with tube rearrangement, background generation, and object stitching. A global energy function containing activity, temporal consistency, and collision costs was defined, and the simulated annealing method (Kirkpatrick et al., 1983) was then applied for energy minimization, as illustrated in Fig. 2.

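
The energy-minimization idea can be illustrated with a toy sketch (an assumption-laden simplification, not the authors' implementation): tubes are reduced to 1-D temporal intervals, the energy combines a pairwise-overlap collision term with a lost-activity penalty, and simulated annealing searches over start frames; the weight 10 and the linear cooling schedule are arbitrary illustrative choices.

```python
import math
import random

def energy(starts, lengths, synopsis_len):
    """Toy energy of a rearrangement: pairwise temporal overlap between
    tubes (collision term) plus a penalty for frames pushed beyond the
    synopsis window (lost-activity term)."""
    e = 0
    for i in range(len(starts)):
        for j in range(i + 1, len(starts)):
            lo = max(starts[i], starts[j])
            hi = min(starts[i] + lengths[i], starts[j] + lengths[j])
            e += max(0, hi - lo)                                  # collision
        e += 10 * max(0, starts[i] + lengths[i] - synopsis_len)   # truncation
    return e

def anneal(lengths, synopsis_len, steps=5000, t0=5.0, seed=0):
    """Simulated annealing over the start frame of each tube."""
    rng = random.Random(seed)
    starts = [0] * len(lengths)
    cur = best = energy(starts, lengths, synopsis_len)
    best_starts = list(starts)
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-3       # linear cooling schedule
        i = rng.randrange(len(starts))
        old = starts[i]
        starts[i] = rng.randrange(synopsis_len)  # propose a new start frame
        new = energy(starts, lengths, synopsis_len)
        # accept downhill moves always, uphill moves with Boltzmann probability
        if new <= cur or rng.random() < math.exp(-(new - cur) / temp):
            cur = new
            if new < best:
                best, best_starts = new, list(starts)
        else:
            starts[i] = old                      # reject: undo the move
    return best_starts, best
```

For three tubes of length 3 in a 10-frame synopsis, a collision-free arrangement such as starts 0, 3, 6 has zero energy, and the annealer should quickly improve on the all-at-zero initial state.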
Their study is important because it proposed the video synopsis approach for the first time. Even though the study led to follow-up work, it is still a primitive version of video synopsis. Accordingly, the researchers continued to improve the approach by applying video synopsis to endless video streams, as reported by Pritch et al. (2007). The term 'tube', representing an activity consisting of object trajectories across video frames, was first used in this study and has been widely used in the literature ever since.

They applied a better object detection method to improve the precision of video synopsis and, using additional terms, proposed a more detailed energy function definition than Rav-Acha et al. (2006). However, these two studies focused only on theoretical improvement without any effort at practical implementation, so the authors unified and expanded their previous research in Pritch et al. (2008) by providing an analysis of computational performance. Tubes were shifted in jumps of 10 frames, moving object detection was applied to every 10th frame, image resolution was reduced, and so on. Even though this is not sufficient for full adaptation to real-world applications, the performance improvement made the proposed approach more applicable to video surveillance scenarios. Their study also made a positive contribution to the field by providing an analysis of the run-time performance of both the on-line and off-line steps of the method.

Subsequently, they offered activity clustering in order to display similar activities together (Pritch et al., 2009). Appearance and motion features were used for clustering, providing the opportunity to display a video synopsis of the same person's activities, or of all activities in the same direction. Unlike previous approaches, long tubes were divided into 'tubelets', subsets of at most 50 video frames. As clustering similar activities was novel in video synopsis at that time, they contributed to the field by providing a different perspective on existing studies.

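
The tubelet split is straightforward to sketch, assuming (as an illustration, not a claim about the paper's data structures) that a tube is an ordered list of per-frame detections:

```python
def split_into_tubelets(tube, max_len=50):
    """Split a long tube (an ordered list of per-frame detections) into
    tubelets of at most max_len consecutive frames."""
    return [tube[i:i + max_len] for i in range(0, len(tube), max_len)]
```

A 120-frame tube thus yields tubelets of 50, 50, and 20 frames, which can then be rearranged independently during optimization.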
The studies mentioned up to this point are by the authors who first proposed video synopsis. Even so, although they improved on their first proposed approach in several subsequent studies, limitations remain, such as time-consuming optimization on video with dense activity, huge memory requirements, and uncertainty in determining the video synopsis length. Their studies are important because they pioneered the field and helped build the principal methodology adopted by subsequent studies over a long period of time.

Xu et al. (2008) formulated the optimization of activities in terms of set theory, obtaining a universal set representing the optimal temporal positions of the activities. The main difference from the preceding approaches is that temporal consistency was not considered in the rearrangement of activities. Even though a comparison with Pritch et al. (2007) was provided in which their method outperformed the classical one, their study did not attract much attention and was not adopted by subsequent studies. The probable reason is their simple optimization method, which obtains local optima, compared to the global solution of Pritch et al. (2007).

Yildiz et al. (2008) applied pixel-based instead of object-based analysis for activity detection. The input video was shrunk to retain only the parts with high activity by extracting horizontal paths with minimum energy in the video frames; the inactive parts of the video were removed instead of temporally shifting the activities. A pipeline-based framework was proposed to obtain real-time video synopsis with low memory consumption (Vural and Akgul, 2009). This study was extended with an eye-tracking technology able to detect the video parts that the operator did or did not pay attention to. In this way, they provided the opportunity to cluster similar activities to be displayed together in the video synopsis. Their approach applied pixel-based optimization without object boundary information, so object unity might be broken in the video synopsis. The visual quality of the generated video synopsis was lower than that of object-based approaches, especially on scenes with high activity density.

Rodriguez (2010) contributed to the field by using an object detection method unaffected by camera motion, so that activities obtained from moving cameras could be displayed in the video synopsis. A template-matching-based clustering method was also used to group the similar activities used in the video synopsis. Chou et al. (2015) proposed clustering similar activities: four regions in the camera view were first defined as possible entrance and exit locations, and activities were then clustered by these regions. They used a method to cluster similar trajectories with different sampling rates, speeds, and sizes to achieve optimal results for their video synopsis. Lin et al. (2015) also proposed an approach that clusters activities, with novel methods for anomaly detection, object tracking, and optimization in a video synopsis. Learning-based anomaly detection was applied to detect activities, which were later clustered using predefined regions of the scene, similar to the entrance and exit regions of Chou et al. (2015). Even though different activity clustering criteria are used in these methods, their main purpose is the same: to make the video synopsis easier to view by displaying activities with similar properties together. Besides adding an activity clustering step to the methodology, they contributed to the field by adapting clustering metrics to the optimization. Their methods open new paths of investigation and possible improvement.

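
Entrance/exit-region clustering in the style described above can be sketched minimally; the region centers and the nearest-center assignment rule here are illustrative assumptions, not the papers' actual definitions.

```python
def nearest_region(point, regions):
    """Index of the region center closest to the point (squared distance)."""
    return min(range(len(regions)),
               key=lambda i: (point[0] - regions[i][0]) ** 2 +
                             (point[1] - regions[i][1]) ** 2)

def cluster_by_entry_exit(trajectories, regions):
    """Group trajectories by the (entrance region, exit region) pair of
    their first and last positions."""
    clusters = {}
    for name, traj in trajectories.items():
        key = (nearest_region(traj[0], regions), nearest_region(traj[-1], regions))
        clusters.setdefault(key, []).append(name)
    return clusters
```

Trajectories sharing the same entry and exit regions end up in the same cluster and can be displayed together in the synopsis.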
In contrast to the general tradition of temporal shifting in video synopsis, Nie et al. (2013) changed both the temporal and spatial positions of the activities in order to prevent collisions. The background belonging to the spatially shifted activities was expanded to keep the background consistent: a synthetic background expansion was applied until there was enough space to place all activities without any collisions, as shown in Fig. 3. Their method is the only one to shift the spatial positions of activities. Activity collisions were minimized in this way, but the novelty also brought shortcomings; changing the background may damage the understanding of a scene, since the background is extended into regions that had no activity in the sample images. The extension cannot be applied if no activity-free regions are available, so the proposed method is limited to specific scenes.

Li et al. (2016) proposed a different approach to the object collision problem in video synopsis, in which colliding objects are scaled down to minimize the collision. A metric representing the scale-down factor of each object was used in the optimization step. Even though the object collision problem was technically minimized, the proposed method might disturb the user: a reduction in object size gives the video synopsis an artificial look, as a car and a person that appear close together in the scene might end up with similar sizes. Nevertheless, even this situation is prevented to a certain degree by an additional metric. He et al. (2017a,b) took activity collision analysis one step further by defining collision statuses between activities, such as collision-free, colliding in the same direction, and colliding in opposite directions. They also proposed a graph-based optimization method that considers these collision states to improve activity density, putting activity collisions at the center of their optimization strategy.

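
The scale-down idea can be illustrated with a small sketch (an assumption for illustration, not the paper's formulation): shrinking each bounding box about its center by a factor can turn an overlapping pair into a collision-free one.

```python
def scale_box(box, factor):
    """Shrink or grow a bounding box (x1, y1, x2, y2) about its center."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    hw = (box[2] - box[0]) / 2 * factor
    hh = (box[3] - box[1]) / 2 * factor
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def overlap_area(a, b):
    """Overlap area of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)
```

For example, two 10x10 boxes overlapping at a corner become disjoint when both are scaled by 0.5, at the cost of the artificial appearance discussed above.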
Hence, a more detailed analysis of activity collision was provided than in other video synopsis studies. However, beyond the improvements from minimizing collisions, other metrics such as activity cost and chronological order were ignored; their optimization method therefore still needs improvement to find the optimal rearrangement.

Huang et al. (2014) emphasized the importance of on-line optimization techniques, which enable tube rearrangement at detection time without having to wait before starting optimization. Moreover, a synopsis table representing, for each pixel, the activities with their frame numbers was proposed. Even though the rearrangement obtains only a local optimum, a video synopsis could be generated in real time while activity analysis was still in progress. The biggest problem with their on-line method was that it completely ignored activity collisions in order to improve run-time performance; another deficiency of the proposed optimization method was the use of manually determined threshold values instead of a more sophisticated decision mechanism. As a result, a tradeoff between run-time performance and condensation ratio arose that decreased precision.

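
The synopsis-table idea can be illustrated with a much-simplified 1-D sketch (per-frame occupancy counts instead of per-pixel bookkeeping; the greedy threshold rule stands in for the manually tuned thresholds mentioned above and is an assumption for illustration):

```python
def place_online(length, table, max_density=1):
    """Greedy on-line placement: find the earliest start frame where
    adding a tube of the given length keeps per-frame density within
    max_density, then commit the tube to the table."""
    start = 0
    while True:
        # grow the table on demand so the window always has full length
        if len(table) < start + length:
            table.extend([0] * (start + length - len(table)))
        window = table[start:start + length]
        if all(v < max_density for v in window):
            for t in range(start, start + length):
                table[t] += 1
            return start
        start += 1
```

Each incoming tube is placed immediately, without waiting for the remaining activities, which is exactly the local-optimum tradeoff described above.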
Zhu et al. (2014) noted the deficiency of a single-camera view in video synopsis: in video surveillance applications, an activity generally spans more than one camera view. They therefore proposed a multi-camera video synopsis approach with a panoramic view constructed using homographies between partially overlapping camera views. Activities from different cameras were associated via trajectory matching in the overlapping views. They also proposed a key-frame selection approach whereby the key frames of an activity, in which the appearance or motion of an object changes significantly, are used instead of all frames, reducing the redundancy of consecutive frames. Similarly, Zhu et al. (2016a) proposed a multi-camera video synopsis approach using a timestamp selection method to find the critical moments of an activity. Key timestamps were defined as when an object first appears, when it merges with or splits from another object, and when it disappears from the video. Unlike Zhu et al. (2014), object re-identification using visual information was applied between camera views. The energy function for optimization was also improved to be adaptable to multi-camera topology, and the chronological order of objects was kept not only within one camera view but also among different camera views.

Hoshen and Peleg (2015) suggested a multi-camera video synopsis approach that defines a master camera and slave cameras around it. Once an activity is detected in the master camera, a video synopsis containing the activities of the slave cameras in the related time period is generated. Although object re-identification between the cameras was not applied, they aimed to provide a wider perspective on the activity of the master. Mahapatra et al. (2016) offered another video synopsis framework for multiple cameras with overlapping fields of view, for which a common ground plane was generated via a homography between camera overlaps. Activities were classified into seven categories, namely walking, running, bending, jumping, hand shaking, one-hand waving, and both-hands waving. Thus, they provided video synopses of specific activity types.

Multi-camera video synopsis approaches are more applicable to real-world applications when considering distributed video surveillance networks. Nevertheless, optimization becomes more complicated with the additional metrics used to associate objects across cameras. Another important point is the overlapping of camera views: studies applicable to non-overlapping camera views seem more practical, as they place one less restriction on camera topology.

Unlike the approaches explained up to now, Lin et al. (2017) mainly focused on accelerating the computing speed of video synopsis via a distributed processing model. Their framework included computing and storage nodes created for distributed computation, where the nodes represented different computers on a network or application threads. Their video synopsis algorithm was divided into several steps, such as video initialization, object detection, tracking, classification, and optimization, which were computed in a distributed fashion. The input video was segmented, each segment was analyzed on a different node, and the tubes generated on each node were stored on storage nodes. Finally, another node generated the final video synopsis using the data on the storage nodes. A region of interest of the scene was also defined in order to reduce the input processing region. Furthermore, the video size and frame rate were reduced to increase performance without affecting the accuracy of object detection. This was the first study to perform video synopsis with a distributed architecture, and it was innovative considering the distributed camera topology of video surveillance applications. The study made it possible to apply high-precision but time-consuming optimization methods at close to real-time performance.

Besides, there are video synopsis approaches that work in the compressed domain (Wang et al., 2013a,b; Zhong et al., 2014; Liao et al., 2017). They emphasized that video decoding increases the complexity of the approach and makes real-time operation hard; activity detection was therefore carried out on the compressed video, with the required flags set for use in the optimization step. Partial decoding was applied to improve run-time performance. Nevertheless, their object detection methods in the compressed domain are simple compared to pixel-based methods, and since inefficiency in object detection directly affects video synopsis performance, these methods need further improvement in precision.

The video synopsis approaches mentioned so far have commonly focused on the optimization step of the flow. Nevertheless, there have been studies focusing on other steps, such as background generation and object tracking tailored to video synopsis. Feng et al. (2010) proposed a background generation approach that chooses the video frames with the most activity and represents changes in the scene. They later proposed sticky tracking to minimize the object blinking problem, which causes ghost objects in the video synopsis (Feng et al., 2012). Objects with intersecting trajectories were merged into a single activity to be used in the video synopsis; the purpose was not to obtain perfect object tracking but to provide activity coherence.

Baskurt and Samet (2018) proposed another object tracking approach tailored to the requirements of video synopsis. Their approach focused on long-term tracking so as to represent each target with just one activity in the video synopsis. The target object was modeled with more than one correlation filter, representing the different appearances of the target during tracking. Robustness to environmental challenges such as illumination variation and scale and appearance changes was obtained in this way. Lu et al. (2013) focused on object detection artifacts, such as shadows and interruptions of object tracking, which reduce the efficiency of content analysis. They proposed supporting both motion detection and object tracking with additional visual features in order to eliminate shadows and increase the robustness of tracking against collisions. Baskurt and Samet (2017) also focused on increasing the robustness of object detection by proposing an adaptive background generation approach. Hsia et al. (2016) concentrated on efficiently searching an activity database to generate a video synopsis: a novel range tree approach was proposed whose main purpose is to find the tubes selected by the user efficiently and to reduce the complexity of the algorithm.

These studies have made important contributions to other video synopsis studies. Each step in the video synopsis pipeline feeds the others, so failures in the steps before optimization, especially object detection and object tracking, directly affect the video synopsis output. Improving the optimization step alone is not enough to obtain the best results in a video synopsis. Therefore, the specific adaptation of commonly known methods from different fields, such as object detection and tracking, makes an important contribution to the study of video synopsis.

Finally, Zhu et al. (2013, 2016b) emphasized using the support of non-visual data in video synopsis. Information on weather forecasts, traffic monitoring, and scheduled public events was associated with the visual data to cluster activities and achieve better video content analysis. Even though the non-visual data helped activity clustering and provided a better understanding of the activities, these studies did not mainly focus on video synopsis but rather on data acquisition and its association with the activities.

To summarize this section, an overview emphasizing the novelty and contributions of video synopsis approaches was presented. Studies were summarized with comments on both their pros and cons. It is evident that there is considerable variety in the studies, as some of them focused on several steps in their methodology whereas others aimed to improve performance efficiency. While one branch of studies tried to move the video synopsis approach to a multi-camera topology, others focused on contributing by changing the input data domain. Furthermore, some studies suggested performing an additional activity clustering step to display similar activities together. In this sense, recent literature on the field of video synopsis can be divided into several categories that are analyzed and discussed in Section 4.

总结本节,我们着重介绍了视频摘要方法的新颖性和贡献,并对各项研究的优缺点进行了评论。显然,这些研究存在显著的多样性:其中一些侧重于方法中的若干步骤,而另一些则旨在提高性能效率。一个研究分支试图将视频摘要方法扩展到多摄像机拓扑,另一些研究则侧重于通过改变输入数据域做出贡献。此外,一些研究建议执行额外的活动聚类步骤,以便将类似的活动一起显示。从这个意义上说,最近视频摘要领域的文献可以分为几个类别,这些类别将在第4节中进行分析和讨论。

3 The methodology of video synopsis

视频摘要的方法论

In this section, we analyze the methodology of video synopsis described in Fig. 4. Video synopsis generation starts with object detection; then object tracking is applied to create activities. Next, activity clustering is applied to display similar activities together, followed by optimization of the selected activities to obtain the optimal temporal rearrangement. Afterwards, a time-lapse background representing the time period of the selected activities is created, and finally, the activities are stitched to the generated background. Table 1 gives an overview of the methods used in object detection, object tracking and optimization, which are the most critical steps of the methodology.

在本节中,我们将分析图4所示的视频概要的方法。视频摘要的生成从对象检测开始,然后应用对象跟踪来创建活动。然后利用活动聚类来显示相似的活动,并对所选的活动进行优化,得到最优的时间重排。然后,创建一个表示所选活动时间段的延时背景,最后将活动缝合到生成的背景上。表1概述了用于对象检测、对象跟踪和优化的方法,这些是方法中最关键的步骤。

Object detection is the first step in the algorithm flow of video synopsis. Most methods prefer to use motion to define the objects. Simple motion detection methods such as pixel difference, temporal median, etc. show poor performance in complex scenes with dynamic background objects, dense motion, and significant variation of illumination. These environmental difficulties are handled better by the more complex background modeling algorithms listed in Table 1. Human detection methods are also used for object detection instead of motion detection. They provide more precise results, as the false detection ratio is lower. Motion detection methods are more likely to be affected by artifacts, as they provide lower-level image analysis compared to human detection methods. On the other hand, using motion for object detection provides the opportunity to use different types of objects as targets. Motion detection methods are also scene independent, unlike template matching or training-based methods that need target-specific training beforehand.

视频摘要算法流程的第一步是目标检测。大多数方法首选使用运动来定义目标。简单的运动检测方法,如像素差、时间中值等,在具有动态背景对象、运动密集、光照变化明显的复杂场景中表现较差。通过表1中列出的更复杂的背景建模算法,可以更好地处理这些环境困难。目标检测也可以采用人体检测方法来代替运动检测。由于误检率较低,人体检测可以提供更精确的结果。与人体检测方法相比,运动检测方法提供的图像分析层次较低,因此更容易受到伪影的影响。另一方面,使用运动进行目标检测提供了将不同类型的对象作为目标的机会。与需要预先针对特定目标进行训练的模板匹配或基于训练的方法相比,运动检测方法也是独立于场景的。
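As a minimal illustration of the simple motion detection methods mentioned above, the sketch below (an assumption for illustration, not a method from any cited study) maintains a running-average background model in NumPy and thresholds the per-pixel difference; the function name and parameter values are illustrative:

```python
import numpy as np

def detect_moving_pixels(frames, alpha=0.05, thresh=30):
    """Simple motion detection with a running-average background model.

    frames: iterable of 2-D uint8 grayscale arrays.
    Returns one boolean foreground mask per frame.
    """
    background = None
    masks = []
    for frame in frames:
        f = frame.astype(np.float32)
        if background is None:
            background = f.copy()          # bootstrap with the first frame
        # pixels differing strongly from the background model are foreground
        mask = np.abs(f - background) > thresh
        masks.append(mask)
        # slowly adapt the background to gradual illumination changes
        background = (1 - alpha) * background + alpha * f
    return masks
```

As the survey notes, such pixel-level models fail under dynamic backgrounds and dense motion, which is why more complex background modeling algorithms are preferred in practice.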

After detecting targets, object tracking associates the detected objects in consecutive frames to build an object trajectory, which represents an activity in a video synopsis. It has a direct effect on video synopsis performance: tracking failures that cause broken trajectories, mismatches of colliding objects, etc. decrease accuracy, and creating more than one activity for the same object breaks semantic completeness. These deficiencies also make the optimization problem more difficult, as redundant activities will be generated. Therefore, robust object tracking methods designed for video synopsis contribute significantly to the accuracy of a video synopsis.

目标检测完成后,目标跟踪将连续帧中检测到的目标进行关联,构建目标轨迹,该轨迹表示视频摘要中的一个活动。它对视频摘要的性能有直接影响:跟踪失败会导致轨迹中断、碰撞对象的错误匹配等,从而降低准确性;为同一对象创建多个活动则会破坏语义完整性。这些不足也使优化问题更加困难,因为会产生冗余的活动。因此,针对视频摘要设计的鲁棒目标跟踪方法对视频摘要的准确性有重要贡献。
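A minimal sketch of how tracking can associate detections across consecutive frames into trajectories: the greedy IoU-based linker below (a simplification for illustration, not the method of any cited study) extends the track whose last box overlaps a new detection most, otherwise it starts a new track, i.e. a new activity. All names and the threshold are assumptions:

```python
def iou(a, b):
    # axis-aligned boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def link_detections(frames_dets, min_iou=0.3):
    """Greedy frame-to-frame association.

    frames_dets: per-frame lists of boxes.
    Returns tracks, each a list of (frame_index, box).
    """
    tracks = []
    for t, dets in enumerate(frames_dets):
        for box in dets:
            best, best_iou = None, min_iou
            for tr in tracks:
                last_t, last_box = tr[-1]
                if last_t == t - 1:          # only extend tracks alive last frame
                    o = iou(last_box, box)
                    if o > best_iou:
                        best, best_iou = tr, o
            if best is not None:
                best.append((t, box))
            else:
                tracks.append([(t, box)])   # a broken match starts a new activity
    return tracks
```

Note how a single missed association immediately splits one object into two tracks, i.e. two redundant activities, which is exactly the failure mode the paragraph above describes.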

Some video synopsis approaches cluster the activities according to different criteria such as motion direction, action type, target type, etc. The aim is to improve the visual quality of the video synopsis, as viewing similar activities together makes the video easier for the user to trace. Details of the approaches that apply activity clustering are discussed in Section 4.4.

一些视频摘要方法根据不同的标准,如运动方向、动作类型、目标类型等,对活动进行聚类。其目的是提高视频摘要的视觉质量,因为一起观看类似的活动可以让用户更容易跟踪视频。应用活动聚类的方法的详细信息将在第4.4节中讨论。

The optimization step, which is the most important part of video synopsis, is applied after obtaining the activities of the source video. Optimization aims to find the best rearrangement of the activities in order to display most of them in a shorter time period with minimum collision. Activities are shifted in the time domain to be placed in their optimal positions in the video synopsis. The optimal positions of the activities are determined by constraints such as background consistency, spatial collision, temporal consistency, etc. A detailed analysis of the optimization approaches used in video synopsis is provided in Section 4.1.

在获得源视频的活动信息后,应用视频摘要中最重要的优化步骤。优化的目的是找到活动的最佳重新排列方式,以便在更短的时间内以最小的冲突显示大部分活动。活动在时域中被移位,以放置到视频摘要中的最佳位置。活动的最优位置由背景一致性、空间碰撞、时间一致性等约束条件确定。第4.1节详细分析了视频摘要中使用的优化方法。
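To make the rearrangement idea concrete, the sketch below greedily shifts each activity (modeled as a list of per-frame bounding boxes) to the earliest synopsis start time at which it collides with no previously placed activity. This is far simpler than the energy minimization used in the literature and is offered only as an illustration; all names are assumptions:

```python
def boxes_overlap(a, b):
    # axis-aligned boxes as (x1, y1, x2, y2)
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def greedy_schedule(activities, synopsis_len):
    """Assign each activity the earliest collision-free start time.

    activities: list of activities, each a list of per-frame boxes.
    Returns one synopsis start time per activity.
    """
    occupied = {}   # synopsis frame index -> boxes already shown there
    starts = []
    for act in activities:
        placed = False
        for s in range(synopsis_len - len(act) + 1):
            ok = True
            for dt, box in enumerate(act):
                if any(boxes_overlap(box, other)
                       for other in occupied.get(s + dt, [])):
                    ok = False
                    break
            if ok:
                starts.append(s)
                for dt, box in enumerate(act):
                    occupied.setdefault(s + dt, []).append(box)
                placed = True
                break
        if not placed:
            # no collision-free slot: place at time 0 and accept the collision
            starts.append(0)
            for dt, box in enumerate(act):
                occupied.setdefault(dt, []).append(box)
    return starts
```

Two activities occupying the same image region are forced apart in time, while spatially disjoint activities can be shown simultaneously, which is the core of the condensation effect.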

A time-lapse background representing the activities and the scene changes covering the corresponding time period needs to be created after finding the optimal places for the activities. Considering that the output is a synthetic video obtained by rearranging activities belonging to different time periods, the video synopsis looks more natural with better background generation. Improved background generation provides a better user experience, as visual inconsistency is minimized. Background generation does not affect the condensation performance of video synopsis; it just provides better visual quality. However, it has not been applied in most of the studies in the literature.

在为活动找到最佳位置后,需要创建一个覆盖相应时间段、表示活动和场景变化的延时背景。考虑到输出是由不同时间段的活动重新排列后合成的视频,更好的背景生成使视频摘要输出看起来更自然。改进的背景生成使视觉不一致性最小化,从而提供更好的用户体验。背景生成并不影响视频摘要的压缩性能,只是提供更好的视觉质量。然而,文献中的大部分研究并没有应用这一步骤。
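One common way to build such a time-lapse background is a per-segment temporal median; the sketch below (an illustrative assumption, not a specific method from the survey) splits the video into chunks and takes a per-pixel median inside each, so slow scene changes such as day-to-night transitions are preserved while moving objects are filtered out:

```python
import numpy as np

def timelapse_background(frames, n_segments=4):
    """Per-segment temporal-median backgrounds for a time-lapse synopsis.

    frames: sequence of equally sized grayscale arrays.
    Returns n_segments background images, one per time segment.
    """
    frames = np.asarray(frames)
    chunks = np.array_split(frames, n_segments)
    # the per-pixel median suppresses transient foreground objects
    return [np.median(c, axis=0).astype(frames.dtype) for c in chunks]
```

Stitching each activity onto the background of its own segment keeps the synthetic output visually consistent with the original scene appearance at that time.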

Stitching objects to the time-lapse background is the last step in the video synopsis flow. Stitching does not affect the precision of the approaches; it just improves the visual quality of the output. Therefore, no great attention has been paid to improving this step. Most studies did not apply a specific stitching or blending algorithm other than a pixel exchange between the object and the generated background. However, using a proper stitching method increases the quality of the output, as objects from different time periods are displayed at the same time over a single background.

视频摘要流程的最后一步是将对象拼接到延时背景上。拼接不会影响方法的精度,只是提高了输出的视觉质量。因此,对这一步骤的改进并没有给予很大的重视。大部分研究没有使用特定的拼接或混合算法,只是对物体和生成的背景进行像素交换。然而,使用适当的拼接方法可以提高输出的质量,因为不同时间段的物体会同时显示在同一背景上。

The methodology of video synopsis commonly applied in the literature was explained in this section. The next section categorizes the literature of video synopsis from different aspects, namely optimization type, camera topology, input data domain, and activity selection criteria. A detailed analysis of the video synopsis approaches according to the mentioned aspects is provided.

本节解释了文献中常用的视频摘要方法。接下来从优化类型、摄像机拓扑结构、输入数据域、活动选择标准等方面对视频摘要文献进行分类。并从以上几个方面对视频概要方法进行了详细的分析。

4 Classification of video synopsis approaches

视频概要方法的分类

Video synopsis approaches can be divided into four groups by content, namely optimization type, camera topology, input data domain, and activity clustering. The distribution of the studies over the years is provided in Fig. 5, and the ratio of publications according to the four mentioned groups is shown in Fig. 6.

视频摘要方法按内容可分为优化类型、摄像机拓扑结构、输入数据域和活动聚类四类。多年来的研究分布如图5所示,四个类别的文献占比如图6所示。

It is evident that off-line optimization approaches have been more dominant than on-line approaches. Although on-line approaches appeared early on, they have always been in the minority. Similarly, single-camera approaches are more popular than multi-camera approaches. There was no multi-camera approach until 2014, even though video synopsis was first proposed in 2006. Rare interest in approaches using the compressed domain appeared in 2013, 2014 and 2017. Also, there has been no consistent trend in video synopsis approaches that apply activity clustering, as they appear only in specific time periods. A general overview shows that while there is no significant trend in approaches using the compressed domain or activity clustering, the number of on-line and multi-camera approaches has increased in recent years. This situation gives us a clue about future trends in the field of video synopsis. The following subsections provide detailed analyses of the four mentioned aspects.

很明显,离线优化方法比在线优化方法更具有优势。虽然联机方法很早就出现了,但它们始终是少数。类似地,单摄像机方法比多摄像机方法更受欢迎。直到2014年才有多摄像机的方法,尽管视频摘要是在2006年首次提出的。2013年、2014年和2017年出现了对使用压缩域的方法的罕见兴趣。此外,在特定时间段出现的应用活动聚类的视频摘要方法也没有一致的趋势。总体来看,虽然压缩域和活动聚类的方法没有明显的发展趋势,但近年来在线和多摄像机方法的数量有所增加。这一情况为我们了解视频摘要领域的未来趋势提供了线索。下面的小节将对上述四个方面进行详细的分析。

4.1. Aspect 1: Optimization type

方面1:优化类型

Optimization is the most important step in video synopsis. All optimization methods aim to obtain a mapping of activities from the source video to proper positions in the video synopsis. The final goal is to display all of the activities in the shortest time period while avoiding collisions as much as possible. Generally, the optimization problem is defined as the minimization of a global energy function that consists of several costs, such as maximum activity, background consistency, temporal consistency, and spatial collision. While some studies used additional costs, others did not use all of them. A brief explanation of the commonly used costs is provided as follows:

优化是视频摘要中最重要的步骤。所有的优化方法都是为了获得活动从源视频到视频摘要中合适位置的映射。最终的目标是在最短的时间内显示所有的活动,同时尽可能避免碰撞。通常,优化问题被定义为全局能量函数的最小化,该全局能量函数包含多个代价项,如最大活动、背景一致性、时间一致性以及空间碰撞。虽然一些研究使用了额外的代价项,但其他研究并没有全部使用。现将常用的代价项简要说明如下:

• The activity cost forces the inclusion of the maximum number of activities in a video synopsis. Activities left out of the synopsis are penalized by this term. Leaving out any activity is not desired in video synopsis approaches; therefore, this term is used by almost all of them.

活动成本要求在视频摘要中包含尽可能多的活动。未被包含的活动将受到该项的惩罚。在视频摘要方法中遗漏任何活动都是不可取的,因此几乎所有方法都使用该项。

• The aim of the background consistency cost is to guarantee the stitching of tubes to background images having a similar appearance. This term measures the cost of stitching an object to the time-lapse background. Inconsistency between a tube and the background is penalized, as it is assumed that each tube is surrounded by pixels from its original background.

背景一致性代价的目的是保证管道与具有相似外观的背景图像进行缝合。该项衡量将一个物体缝合到延时背景上的代价。管道与背景之间的不一致会受到惩罚,因为假设每个管道都被其原始背景的像素所包围。

• The role of the temporal consistency cost is to preserve the temporal order of the activities; therefore, activity shifts that break the temporal order are penalized. Changing the temporal order of the activities in the optimization phase may provide a more compact representation by increasing the variation of activity sequences. On the other hand, preserving the chronological order is important for the causality relations of the activities. Analyzing the activities that interact in the source video is easier if the temporal consistency is preserved. Approaches generally use a weight parameter for this term in order to balance the semantic integrity and the optimal activity representation of the video synopsis.

• 时间一致性成本的作用是维持活动的时间顺序,因此打破时间顺序的活动移位会受到惩罚。在优化阶段改变活动的时间顺序可以通过增加活动序列的变化提供更紧凑的表示。另一方面,保持时间顺序对于活动的因果关系是很重要的。如果保持时间一致性,分析源视频中具有交互作用的活动将更容易。为了平衡视频摘要的语义完整性和最佳的活动表现形式,通常为该项使用一个权重参数。

• The collision cost prevents spatial collisions of the activities in order to provide better visual quality. Spatial collisions of the activities are penalized by increasing the total energy. Handling the spatial collisions of the activities is the main problem of the optimization step. Activities generally collide with each other in the crowded scenes captured by surveillance cameras. Allowing collisions in a video synopsis decreases the visual clarity and the traceability of the activities, even though it provides more compact output with a higher number of activities in a shorter time period. Nevertheless, a video synopsis longer than the source video may be created if spatial collisions are completely prevented, especially for crowded scenes. This term is at the center of the activity optimization phase, as it is the most challenging problem in the representation. The majority of the approaches focus on finding an optimal solution for activity collision.

• 碰撞成本防止活动之间的空间碰撞,以提供更好的视觉质量。活动的空间碰撞通过增加总能量来惩罚。处理活动的空间碰撞是优化步骤的主要问题。考虑到监控摄像头捕捉到的拥挤场景,活动之间通常会发生碰撞。在视频摘要中允许碰撞会降低活动的视觉清晰度和可跟踪性,即使它能在更短的时间内以更多的活动提供更紧凑的输出。然而,如果完全避免空间碰撞,特别是在拥挤场景中,可能会产生比源视频更长的视频摘要。该项处于活动优化阶段的核心,因为它是表示中最具挑战性的问题。大多数方法都侧重于寻找活动碰撞的最优解。

While the activity and background consistency costs are calculated for each activity separately, the temporal consistency and collision costs are calculated between the activities in the video synopsis. Weight parameters are generally used, especially for the temporal consistency and spatial collision costs, to find the optimal solution. An illustration of the different activity arrangements that can be obtained after minimizing the same energy function with different weights of the temporal consistency cost is provided in Fig. 7. Scenarios preserving the chronological order absolutely (a), preserving the chronological order partially (b), and ignoring the chronological order (c) are represented. Fig. 7 shows that displaying activities in the same chronological order as the source video results in a longer video synopsis.

活动成本和背景一致性成本是针对每个活动单独计算的,而时间一致性成本和碰撞成本是在视频摘要中的活动之间计算的。为了寻找最优解,通常为时间一致性和空间碰撞代价使用权重参数。图7给出了用不同的时间一致性代价权重将相同的能量函数最小化后得到的不同活动排列形式,分别表示绝对保留时间顺序(a)、部分保留时间顺序(b)和忽略时间顺序(c)的场景。从图7可以看出,按照源视频的时间顺序显示活动会导致更长的视频摘要。

All the activities are represented in 28 frames in this case, as illustrated in Fig. 7(a). Ignoring the chronological order of the activities with a lower weight parameter provides a more compact representation.
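The weighted-sum energy described above can be sketched as follows, assuming each activity is a pair of its source start time and a list of per-frame bounding boxes. The background consistency term is omitted for brevity, and the weight values are purely illustrative:

```python
def boxes_overlap(a, b):
    # axis-aligned boxes as (x1, y1, x2, y2)
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def synopsis_energy(activities, starts, synopsis_len,
                    w_activity=10.0, w_temporal=1.0, w_collision=5.0):
    """Total energy of one candidate arrangement.

    activities: list of (source_start, per_frame_boxes) pairs.
    starts: synopsis start time per activity (None = excluded).
    Terms: activity cost for excluded tubes, temporal cost for pairs
    shown in reversed chronological order, collision cost per frame of
    spatial overlap between a pair of tubes.
    """
    energy = 0.0
    for (src, boxes), s in zip(activities, starts):
        if s is None or s + len(boxes) > synopsis_len:
            energy += w_activity            # penalize an excluded activity
    placed = [(src, boxes, s) for (src, boxes), s in zip(activities, starts)
              if s is not None and s + len(boxes) <= synopsis_len]
    for i in range(len(placed)):
        for j in range(i + 1, len(placed)):
            src_i, boxes_i, s_i = placed[i]
            src_j, boxes_j, s_j = placed[j]
            # temporal consistency: penalize reversed source order
            if (src_i - src_j) * (s_i - s_j) < 0:
                energy += w_temporal
            # collision: overlapping boxes shown in the same synopsis frame
            for t in range(max(s_i, s_j),
                           min(s_i + len(boxes_i), s_j + len(boxes_j))):
                if boxes_overlap(boxes_i[t - s_i], boxes_j[t - s_j]):
                    energy += w_collision
    return energy
```

Raising `w_temporal` relative to `w_collision` reproduces the behavior of Fig. 7: chronological order is preserved at the price of a longer synopsis, while a low temporal weight lets the optimizer reorder activities for a more compact result.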


