Computer Vision and Image Understanding 181 (2019) 26–38




Kemal Batuhan Baskurt, Refik Samet


Video surveillance, Video processing, Video synopsis, Motion detection, Object tracking, Optimization, Background generation, Stitching




Video synopsis is an activity-based video condensation approach that enables efficient video browsing and retrieval for surveillance cameras. It is one of the most effective ways to reduce the inactive portions of input video and thus provide fast and easy retrieval of the parts of interest. Unlike frame-based video summarization methods, the activities of interest are shifted in the time domain to obtain a more compact video representation. Although the number of studies on video synopsis has increased over the past years, there has still been no survey on the subject. The aim of this article is to review state-of-the-art approaches in video synopsis and provide a comprehensive analysis. The methodology of video synopsis is described to provide an overview of the algorithm flow. Recent literature is examined from different aspects such as optimization type, camera topology, input data domain, and activity clustering mechanisms. Commonly used performance evaluation techniques are also examined. Finally, the current state of the literature and potential future research directions are discussed after an exhaustive analysis that covers most of the studies in this field from the earliest to the present. To the best of our knowledge, this study is the first review of published video synopsis approaches.



Controlling and managing the huge amounts of recorded video becomes more difficult with each passing day, given the rapid increase in security camera usage in daily life. Efficient video browsing and retrieval are critical issues when considering the amount of raw video data to be summarized. The manpower required to monitor visual data is a challenging problem. Therefore, video condensation techniques are being widely investigated in a large number of applications across diverse disciplines.


A popular approach to solving the video condensation problem is video synopsis, which has been investigated in the literature over the last decade. Video synopsis provides activity-based video condensation, in contrast to frame-based techniques such as video fast-forward (Smith and Kanade, 1998), video abstraction (Truong and Venkatesh, 2007), and video summarization (Chakraborty et al., 2015). Video synopsis operates on an activity as its processing unit, while frame-based approaches use a frame. Video synopsis achieves higher efficiency than frame-based video condensation techniques because smaller processing units allow a more detailed video analysis and therefore better condensation. Activities can be shifted in the time domain, and more than one activity can be shown simultaneously in a frame even though they come from different time periods.


The aim of video synopsis approaches is to find the best rearrangement of the activities in order to display most of them in the shortest time period. The biggest problem is handling activity collisions, as they can lead to the loss of important content, thereby reducing efficiency. Collisions also cause a chaotic viewing experience, which decreases the visual quality for surveillance applications. Displaying the maximum number of objects with minimal collisions is computationally more complex than frame-based methods, because the activities are processed separately instead of processing the whole frame at once. Thus, video synopsis has become a hot spot in video summarization, especially with the support of improvements in the computational capacity of computers over the past years.


Existing video synopsis studies can be categorized by different aspects such as optimization type, camera topology, input data domain, and activity clustering. The aim of optimization is to find the best temporal positions of selected activities in order to obtain a more compact representation, which is the most important part of the algorithm flow in video synopsis. Therefore, the most dominant criterion for categorization is optimization type, which is divided into two categories, namely on-line and off-line. Most approaches perform off-line optimization of all activities to find the global optimum. However, the latest approaches increasingly use on-line optimization, which rearranges each new activity as it arrives and finds a local optimum. In terms of camera topology, studies fall into two groups: single-camera and multi-camera solutions. Most approaches are oriented toward the single-camera view, which makes the optimization problem easier. Multi-camera approaches need to build a global energy definition that covers the whole camera network in order to find the optimal solution for all cameras. On the other hand, they provide the opportunity to display and analyze activities from a wider perspective. Some studies focusing on run-time performance propose techniques applied directly to compressed data instead of spending time and computation power on transforming data to the pixel domain. Even though their run-time performance is significantly better, their condensation ratio cannot compete with that of pixel-domain methods. In addition, some studies apply activity clustering to group similar activities and display them together, with the aim of providing a better understanding of the scene, since focusing on similar activities is easier for the user.


In this paper, we analyze 35 video synopsis approaches that cover all of the existing studies up to this point. The approaches are analyzed on the aforementioned aspects, and the diversity of pre/post-processing methods used in existing video synopsis approaches is examined in detail.


The rest of the paper is organized as follows. Section 2 provides an overview of existing video synopsis approaches, emphasizing their novelty and contribution to the field. Methods used in the algorithm flow of video synopsis are described in Section 3. An analysis of the approaches according to optimization type, camera topology, input data domain, and activity clustering is given in Section 4. Evaluation criteria and commonly used datasets are presented in Section 5. Finally, Section 6 contains the conclusions of the study.


2 Related works

Video synopsis is an activity-based video condensation technique whose main purpose is to display as many activities as possible simultaneously in the shortest time period. An activity represents a group of object instances belonging to a time period in which the object is visible. The activities extracted from the source are shifted in the time domain to calculate their optimal positions with the minimum number of collisions. Unlike frame-based video summarization techniques, activities from different time periods can be shifted into the same frame through pixel-based analysis. Therefore, more efficient condensation is achieved compared to frame-based video summarization methods.


Activity-based video condensation was proposed by Rav-Acha et al. (2006) under the name of video synopsis, a novel approach that shifts detected activities in the time domain to display them simultaneously over a shorter time period, as depicted in Fig. 1. Their approach contained two main phases: on-line and off-line. The on-line phase included activity generation and storage of the activities in a queue. The off-line phase started after a time range for the video synopsis was selected and comprised tube rearrangement, background generation, and object stitching. A global energy function containing activity, temporal consistency, and collision costs was defined, and the simulated annealing method (Kirkpatrick et al., 1983) was then applied for energy minimization, as illustrated in Fig. 2.
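The idea of minimizing a global energy with simulated annealing can be illustrated with a minimal sketch. It is not the formulation of Rav-Acha et al. (2006): the tube representation (a source start frame plus one bounding box per frame), the three toy cost terms, the weights, and the cooling schedule are all assumptions chosen for illustration.

```python
import math
import random

def box_overlap(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def energy(tubes, starts, length, w_collision=1.0, w_order=10.0, w_activity=100.0):
    """Toy global energy: activity loss + order violations + collisions.

    tubes  : list of (source_start_frame, [box per frame])  (hypothetical format)
    starts : start frame of each tube in the synopsis
    length : number of frames in the synopsis
    """
    e = 0.0
    for i, (src_i, boxes_i) in enumerate(tubes):
        # Activity cost: frames pushed beyond the end of the synopsis are lost.
        e += w_activity * max(0, starts[i] + len(boxes_i) - length)
        for j in range(i + 1, len(tubes)):
            src_j, boxes_j = tubes[j]
            # Temporal consistency cost: penalize reversed chronological order.
            if (src_i - src_j) * (starts[i] - starts[j]) < 0:
                e += w_order
            # Collision cost: box overlap on frames shown at the same time.
            lo = max(starts[i], starts[j])
            hi = min(starts[i] + len(boxes_i), starts[j] + len(boxes_j))
            for t in range(lo, hi):
                e += w_collision * box_overlap(boxes_i[t - starts[i]],
                                               boxes_j[t - starts[j]])
    return e

def anneal(tubes, length, steps=5000, t0=50.0, cooling=0.999, seed=0):
    """Simulated annealing over the temporal start position of each tube."""
    rng = random.Random(seed)
    starts = [0] * len(tubes)
    cur = energy(tubes, starts, length)
    best, best_e = list(starts), cur
    temp = t0
    for _ in range(steps):
        i = rng.randrange(len(tubes))
        old = starts[i]
        starts[i] = rng.randrange(length)      # propose a new temporal shift
        new = energy(tubes, starts, length)
        # Always accept improvements; accept worse states with Boltzmann probability.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            cur = new
            if cur < best_e:
                best, best_e = list(starts), cur
        else:
            starts[i] = old                    # reject the proposal
        temp *= cooling
    return best, best_e
```

For two tubes that follow the same path, placing both at frame 0 yields a large collision cost, whereas playing them back to back in chronological order yields zero energy in this toy model.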


Their study is important because it proposed the video synopsis approach for the first time. Even though it led to follow-up studies, it remains a primitive version of video synopsis. The researchers continued to improve the approach by applying video synopsis to endless video streams, as reported by Pritch et al. (2007). The term ‘tube’, representing an activity consisting of object trajectories in video frames, was first used in that study and has been widely used in the literature ever since.


They applied a better object detection method to improve the precision of video synopsis and proposed a more detailed energy function than that of Rav-Acha et al. (2006) by using additional terms. However, these two studies focused only on theoretical improvement without any effort on practical implementation, and so the authors unified and expanded their previous research in Pritch et al. (2008) by providing an analysis of computational performance. Several speed-ups were applied: tubes were shifted in jumps of 10 frames, moving object detection was applied only to every 10th frame, image resolution was reduced, etc. Even though this is not sufficient for full adaptation to real-world applications, the performance improvement made the proposed approach more applicable to video surveillance scenarios. Their study also made a positive contribution to the field by providing an analysis of the run-time performance of both the on-line and off-line steps of the method.


Subsequently, they introduced activity clustering in order to display similar activities together (Pritch et al., 2009). Appearance and motion features were used for clustering, which made it possible to display a video synopsis of the same person’s activities or of all the activities in the same direction. Unlike previous approaches, long tubes were divided into ‘tubelets’, which were subsets with a maximum of 50 video frames. As clustering similar activities was novel in video synopsis at that time, they contributed to the field by providing a different perspective on existing studies.
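Dividing a long tube into tubelets can be sketched as follows. The non-overlapping split and the representation of a tube as a plain list of per-frame entries are assumptions made for illustration; the cited study may split tubes differently.

```python
def split_into_tubelets(tube, max_len=50):
    """Split a tube (a list of per-frame entries) into chunks of at most max_len frames."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [tube[i:i + max_len] for i in range(0, len(tube), max_len)]

# Example: a 120-frame tube becomes tubelets of 50, 50, and 20 frames.
tubelets = split_into_tubelets(list(range(120)))
lengths = [len(t) for t in tubelets]
```

Shorter units of bounded length are easier to compare with one another, which is convenient when clustering by appearance or motion.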


The studies mentioned up to this point are by the authors who first proposed video synopsis. Although they improved on their first approach in several subsequent studies, limitations remain, such as time-consuming optimization on video with dense activity, huge memory requirements, and uncertainty in determining the video synopsis length. Their studies are important because they pioneered the field and helped to build the principal methodology adopted by subsequent studies over a long period of time.


Xu et al. (2008) formulated the optimization problem of activities in terms of set theory, in which a universal set representing the optimal temporal positions of the activities was obtained. The main difference from the preceding approaches is that temporal consistency was not considered when rearranging the activities. Even though a comparison with Pritch et al. (2007) was provided in which their method outperformed the classical one, their study did not attract much attention and was not adopted by subsequent studies. The probable reason is that their simple optimization method obtains local optima, in contrast to the global solution of Pritch et al. (2007).


Yildiz et al. (2008) applied a pixel-based analysis instead of an object-based one for activity detection. The input video was shrunk to retain only the parts with high activity by extracting horizontal paths with minimum energy in the video frames. They removed the inactive parts of the video instead of temporally shifting the activities. A pipeline-based framework was proposed to obtain real-time video synopsis with low memory consumption (Vural and Akgul, 2009). This study was extended to integrate eye tracking technology, which was able to detect the video parts that the operator did or did not pay attention to. In this way, they provided the opportunity to cluster similar activities to be displayed together in the video synopsis. Their approach applied pixel-based optimization without object boundary information. Therefore, object unity might be broken in the video synopsis. The visual quality of the generated video synopsis was lower than that of object-based approaches, especially in scenes with high activity density.


Rodriguez (2010) contributed to the field by using an object detection method unaffected by camera motion, so that activities obtained from moving cameras could be displayed in the video synopsis. A template matching-based clustering method was also used to group similar activities in the video synopsis. Chou et al. (2015) proposed the clustering of similar activities. Four regions in a camera view were first defined as possible entrance and exit locations, then activities were clustered by these regions. They used a method to cluster similar trajectories with different sampling rates, speeds, and sizes to achieve optimal results for their video synopsis. Lin et al. (2015) also proposed an approach based on activity clustering, with novel methods for anomaly detection, object tracking, and optimization in video synopsis. Learning-based anomaly detection was applied to detect activities, which were later clustered using predefined entrance and exit regions of the scene, similar to the approach of Chou et al. (2015). Even though different activity clustering criteria are used in these methods, their main purpose was to make the video synopsis easier to view by displaying activities with similar properties together. Besides adding an activity clustering step to the methodology, they contributed to the field by adapting clustering metrics to the optimization. Their methods open new paths of investigation and possible improvements.


In contrast to the general tradition of temporal shifting in video synopsis, Nie et al. (2013) changed both the temporal and spatial positions of the activities in order to prevent collisions. The background belonging to the spatially shifted activities was expanded to keep the background consistent. A synthetic background expansion was applied until there was enough space to place all activities without any collisions, as shown in Fig. 3. Their method is the only one that shifts the spatial position of the activities. Activity collisions were minimized in this way, but the novelty also brought some shortcomings: changing the background may damage the understanding of a scene, since the background was extended to regions that did not have activity in the sample images. The extension could not be applied if there were no available regions without activity, so the application of the proposed method is limited to specific scenes.


Li et al. (2016) proposed a different approach to solving the object collision problem in video synopsis, in which colliding objects were scaled down in order to minimize the collision. A metric representing the scale-down factor of each object was used in the optimization step. Even though the object collision problem was technically minimized, the proposed method might disturb the user. For instance, a reduction in object size makes the video synopsis look artificial, as a car and a person that appear close together in the scene might have similar sizes. Nevertheless, this situation is prevented to a certain degree by an additional metric. He et al. (2017a,b) took activity collision analysis one step further by defining collision statuses between activities such as collision-free, colliding in the same direction, and colliding in opposite directions. They also proposed a graph-based optimization method that considers these collision states to improve the activity density, putting activity collisions at the center of their optimization strategy.


Hence, a more detailed analysis of activity collision was provided compared to other video synopsis studies. However, while collisions were minimized, other metrics such as activity cost, chronological order, etc. were ignored. Therefore, their optimization method still needs to be improved to find the optimal rearrangement.


Huang et al. (2014) emphasized the importance of on-line optimization techniques, which enable tube rearrangement at the time of detection without any need to wait before starting optimization. Moreover, a synopsis table representing activities with their frame numbers for each pixel was proposed. Even though the rearrangement reached only a local optimum, a real-time video synopsis could be generated while activity analysis was still being processed. The biggest problem with their on-line method was that it completely ignored activity collisions in order to improve run-time performance. Another deficiency of the proposed optimization method was its use of manually determined threshold values instead of a more complex decision mechanism. As a result, a tradeoff between run-time performance and condensation ratio arose that decreased precision.
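The general on-line idea, placing each tube as soon as it is detected rather than waiting for a global optimization, can be sketched as a greedy procedure. This is a generic illustration and not the synopsis-table method of Huang et al. (2014): the class name, the bounding-box tube format, and the collision threshold are hypothetical, and unlike the method described above, the sketch does check collisions.

```python
def box_overlap(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def collision(start_a, boxes_a, start_b, boxes_b):
    """Total box overlap over the frames in which both tubes are visible."""
    lo = max(start_a, start_b)
    hi = min(start_a + len(boxes_a), start_b + len(boxes_b))
    return sum(box_overlap(boxes_a[t - start_a], boxes_b[t - start_b])
               for t in range(lo, hi))

class OnlineSynopsis:
    """Greedy on-line placement: each new tube is fixed once and never moved."""

    def __init__(self, max_collision=0):
        self.max_collision = max_collision   # manually chosen threshold (assumption)
        self.placed = []                     # list of (start_frame, boxes)

    def add(self, boxes):
        # Start searching at the previous tube's start to keep chronological order.
        start = self.placed[-1][0] if self.placed else 0
        while True:
            total = sum(collision(start, boxes, s, b) for s, b in self.placed)
            if total <= self.max_collision:
                self.placed.append((start, boxes))
                return start
            start += 1                       # delay the tube by one frame and retry

synopsis = OnlineSynopsis()
first = synopsis.add([(0, 0, 10, 10)] * 5)       # placed at frame 0
second = synopsis.add([(0, 0, 10, 10)] * 5)      # same path, so it is delayed
third = synopsis.add([(50, 50, 60, 60)] * 3)     # different region, can overlap in time
```

Because earlier placements are never revisited, the result is only a local optimum, which reflects the tradeoff between run-time performance and condensation ratio noted above.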


Zhu et al. (2014) pointed out a deficiency of single-camera video synopsis: in video surveillance applications, an activity generally takes place in more than one camera view. Thus, they proposed a multi-camera video synopsis approach with a panoramic view constructed using homography between partially overlapping camera views. Activities from different cameras were associated via trajectory matching in overlapping camera views. They also proposed a key frame selection approach in which only the key frames of an activity, where the appearance or motion of an object changes significantly, are used instead of all frames, in order to reduce the redundancy of consecutive frames. Similarly, Zhu et al. (2016a) proposed a multi-camera video synopsis approach using a timestamp selection method to find the critical moments of an activity. Key timestamps were defined as the times at which an object first appears, merges with another object, splits, and disappears in the video. Unlike Zhu et al. (2014), object re-identification using visual information was applied between camera views. The energy function for optimization was also improved so as to be adaptable to multi-camera topology. The chronological order of objects was kept not only within one camera view but also among different camera views.


Hoshen and Peleg (2015) suggested a multi-camera video synopsis approach which defined a master camera and slave cameras around the master. Once an activity was detected in the master camera, a video synopsis containing the activities of the slave cameras in the related time period was generated. Although object re-identification between the cameras was not applied, they aimed to provide a wider perspective on the activity in the master camera. Mahapatra et al. (2016) offered another video synopsis framework for multiple cameras with overlapping fields of view, for which a common ground plane was generated via a homography between the camera overlaps. Activities were classified into seven categories, namely walking, running, bending, jumping, hand shaking, one hand waving, and both hands waving. Thus, they provided video synopsis of specific activity types.


Multi-camera video synopsis approaches are more applicable to real-world use when considering distributed video surveillance networks. Nevertheless, optimization becomes more complicated because of the additional metrics used for associating objects across different cameras. Another important point is the overlap of camera views. Studies applied to non-overlapping camera views seem more efficient, as they have one fewer restriction on camera topology.


Unlike the approaches explained up to now, Lin et al. (2017) mainly focused on accelerating the computation of video synopsis via a distributed processing model. Their framework included computing and storage nodes created for distributed computation, in which the nodes represented different computers on a network or application threads. Their video synopsis algorithm was divided into several steps such as video initialization, object detection, tracking, classification, and optimization, which were computed in a distributed fashion. The input video was segmented, each segment was analyzed on a different node, and the tubes generated on each node were stored on storage nodes. Finally, another node generated the final video synopsis using the data on the storage nodes. A region of interest in the scene was also defined in order to reduce the area to be processed. Furthermore, video size and frames per second were reduced to increase performance without affecting the accuracy of object detection. This was the first study to perform video synopsis with a distributed architecture and was innovative when considering the distributed camera topology of video surveillance applications. This study provided the opportunity to apply high-precision but time-consuming optimization methods at close to real-time performance.


In addition, there are video synopsis approaches which work in the compressed domain (Wang et al., 2013a,b; Zhong et al., 2014; Liao et al., 2017). They emphasized that video decoding increases the complexity of the approach and makes it hard to work in real time; thus activity detection was carried out on the compressed video, and the required flags were set for use in the optimization step. Partial decoding was applied to improve the run-time performance of the approaches. Nevertheless, their object detection methods in the compressed domain were simple compared to pixel-based methods. Because inefficiency in object detection directly affects video synopsis performance, these methods need further improvement in precision.


The video synopsis approaches mentioned so far have commonly focused on the optimization step of the flow. Nevertheless, there have been studies that focused on other steps, such as background generation and object tracking tailored to video synopsis. Feng et al. (2010) proposed a background generation approach aimed at choosing the video frames with the most activity and representing changes in the scene. They later proposed sticky tracking to minimize the object blinking problem, which causes ghost objects in video synopsis (Feng et al., 2012). Objects with intersecting trajectories were merged into a single activity to be used in the video synopsis; the purpose was not to obtain perfect object tracking but to provide activity coherence.


Baskurt and Samet (2018) proposed another object tracking approach tailored to the requirements of video synopsis. Their approach focused on long-term tracking in order to represent each target with just one activity in the video synopsis. The target object was modeled with more than one correlation filter, each representing a different appearance of the target during tracking. Robustness against environmental challenges such as illumination variation and scale and appearance changes was obtained in this way. Lu et al. (2013) focused on object detection artifacts such as shadows and on interruptions of object tracking, which reduce the efficiency of content analysis. They proposed supporting both the motion detection and object tracking methods with additional visual features in order to eliminate shadows and increase the robustness of the tracking method against collisions. Baskurt and Samet (2017) also focused on increasing the robustness of object detection by proposing an adaptive background generation approach. Hsia et al. (2016) concentrated on efficiently searching an activity database to generate video synopsis. A novel range tree approach was proposed whose main purpose was to find the tubes selected by the user efficiently and to reduce the complexity of the algorithm.


These studies have made an important contribution to other video synopsis studies. Each step in the video synopsis pipeline feeds the others; thus a failure in the steps before optimization, such as object detection and object tracking, directly affects the video synopsis output. Improving the optimization step alone is not enough to obtain the best results in video synopsis. Therefore, the specific adaptation of well-known methods from different fields, such as object detection and tracking, makes an important contribution to the study of video synopsis.


Finally, Zhu et al. (2013, 2016b) emphasized the use of non-visual data to support video synopsis. Information on weather forecasts, traffic monitoring, and scheduled public events was associated with visual data to cluster activities and achieve better video content analysis. Even though using non-visual data helped activity clustering or provided a better understanding of the activities, these studies did not mainly focus on video synopsis, but rather on data acquisition and association with the activities.


To summarize, this section presented an overview emphasizing the novelty and contribution of video synopsis approaches. Studies were summarized with comments on both their pros and cons. It is evident that the studies vary considerably: some of them focused on individual steps of the methodology, whereas others aimed to improve run-time performance. While one branch of studies tried to move the video synopsis approach to multi-camera topology, others focused on changing the input data domain. Furthermore, some studies suggested performing an additional activity clustering step to display similar activities together. In this sense, recent literature in the field of video synopsis can be divided into several categories, which are analyzed and discussed in Section 4.


3 The methodology of video synopsis


In this section, we analyze the methodology of video synopsis described in Fig. 4. Video synopsis generation starts with object detection, then object tracking is applied to create activities. Next, activity clustering is applied to display similar activities together, followed by optimization of the selected activities to obtain the optimal temporal rearrangement. Afterwards, a time-lapse background representing the time period of the selected activities is created, and finally, the activities are stitched onto the generated background. Table 1 gives an overview of the methods used in object detection, object tracking, and optimization, which are the most critical steps of the methodology.


Object detection is the first step in the algorithm flow of video synopsis. Most methods prefer to use motion to define the objects. Simple motion detection methods such as pixel difference, temporal median, etc. perform poorly in complex scenes with dynamic background objects, dense motion, and significant variation of illumination. These environmental difficulties are handled better by the more complex background modeling algorithms listed in Table 1. Human detection methods are also used for object detection instead of motion detection. They provide more precise results, as the false detection ratio is lower. Motion detection methods are more likely to be affected by artifacts, as they provide lower-level image analysis compared to human detection methods. On the other hand, using motion for object detection provides the opportunity to use different types of objects as targets. Motion detection methods are also scene independent, in contrast to template matching or training-based methods that need target-specific training beforehand.
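The simple motion detection methods mentioned above, temporal median and pixel difference, can be combined in a few lines. This is a minimal sketch in plain Python; the frame format (nested lists of grayscale values) and the threshold value are assumptions for illustration only.

```python
from statistics import median

def temporal_median_background(frames):
    """Per-pixel median over time; frames are equally sized 2-D lists of gray values."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[y][x] for f in frames) for x in range(width)]
            for y in range(height)]

def motion_mask(frame, background, threshold=25):
    """Pixel difference: mark pixels that deviate from the background by more than threshold."""
    return [[abs(p - b) > threshold for p, b in zip(row, bg_row)]
            for row, bg_row in zip(frame, background)]

def make_frame(bright_x=None):
    """Tiny 2x3 test frame with an optional bright 'object' pixel in the first row."""
    frame = [[10] * 3 for _ in range(2)]
    if bright_x is not None:
        frame[0][bright_x] = 200
    return frame

# A bright pixel moves left to right; the median over time recovers the empty scene.
frames = [make_frame(), make_frame(0), make_frame(1), make_frame(2), make_frame()]
background = temporal_median_background(frames)
mask = motion_mask(frames[2], background)
```

The fixed global threshold is exactly what makes such methods fragile under illumination changes and dynamic backgrounds, which motivates the more complex background models.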


After targets are detected, object tracking associates the detected objects in consecutive frames to build an object trajectory, which represents an activity in a video synopsis. Tracking has a direct effect on video synopsis performance: failures such as broken trajectories and mismatches between colliding objects decrease accuracy, and creating more than one activity for the same object breaks semantic completeness. These deficiencies also make the optimization problem more difficult, as redundant activities are generated. Therefore, robust object tracking methods tailored to video synopsis contribute significantly to its accuracy.
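Associating detections across consecutive frames can be sketched with a greedy intersection-over-union (IoU) matcher. This is a deliberately simple illustration, not a method from the surveyed studies; the function names, the box format, and the IoU threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0, iw) * max(0, ih)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def build_tubes(detections_per_frame, min_iou=0.3):
    """Link detections frame by frame; each tube is a list of (frame_index, box)."""
    tubes = []    # every tube created so far
    active = []   # tubes that were extended in the previous frame
    for t, boxes in enumerate(detections_per_frame):
        next_active = []
        unmatched = list(boxes)
        for tube in active:
            last_box = tube[-1][1]
            # Greedily pick the unmatched detection that overlaps the tube most.
            best = max(unmatched, key=lambda b: iou(last_box, b), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                tube.append((t, best))
                unmatched.remove(best)
                next_active.append(tube)
        # Detections that match no active tube start new tubes.
        for box in unmatched:
            tube = [(t, box)]
            tubes.append(tube)
            next_active.append(tube)
        active = next_active
    return tubes

detections = [
    [(0, 0, 10, 10), (100, 100, 110, 110)],
    [(2, 0, 12, 10), (100, 102, 110, 112)],
    [(4, 0, 14, 10)],
]
tubes = build_tubes(detections)
```

In this sketch a tube ends as soon as it misses one frame, so a single missed detection splits an object into two tubes, which is the kind of broken trajectory described above.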


Some video synopsis approaches cluster the activities according to different criteria such as motion direction, action type, target type, etc. Their aim is to improve the visual quality of the video synopsis, as viewing similar activities together makes the video easier for the user to follow. Details of the approaches that apply activity clustering are discussed in Section 4.4.


The optimization step, which is the most important part of video synopsis, is applied after the activities of the source video have been obtained. Optimization aims to find the best rearrangement of the activities in order to display most of them in the shortest time period with minimum collision. Activities are shifted in the time domain and placed at their optimal positions in the video synopsis. The optimal positions of the activities are determined by constraints such as background consistency, spatial collision, temporal consistency, etc. A detailed analysis of the optimization approaches used in video synopsis is provided in Section 4.1.


A time-lapse background representing the activities and scene changes over the corresponding time period needs to be created after the optimal positions of the activities have been found. Video synopsis output looks more natural with better background generation, considering that the output is a synthetic video produced by rearranging activities that belong to different time periods. Improving background generation provides a better user experience, as visual inconsistency is minimized. Background generation does not affect the condensation performance of video synopsis; it only provides better visual quality. However, it has not been applied in most of the studies in the literature.


Stitching objects onto a time-lapse background is the last step in the video synopsis flow. Stitching does not affect the precision of the approaches; it only improves the visual quality of the output. Therefore, no great attention has been paid to improving this step. Most studies did not apply a specific stitching or blending algorithm other than pixel exchange between the object and the generated background. However, using a proper stitching method increases the quality of the output, as objects from different time periods are displayed at the same time over a single background.
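The basic pixel exchange mentioned above can be sketched as follows. The function name, the nested-list image format, and the binary object mask are assumptions for illustration; no blending is performed.

```python
def stitch(background, patch, mask, top, left):
    """Copy the object pixels of `patch` onto a copy of `background`.

    mask[y][x] is True where the patch pixel belongs to the object; (top, left)
    is the position of the patch in the background (assumed to fit inside it).
    """
    out = [row[:] for row in background]          # leave the input untouched
    for dy, (patch_row, mask_row) in enumerate(zip(patch, mask)):
        for dx, (pixel, is_object) in enumerate(zip(patch_row, mask_row)):
            if is_object:
                out[top + dy][left + dx] = pixel  # plain pixel exchange
    return out

background = [[0] * 4 for _ in range(3)]
patch = [[9, 9], [9, 9]]
mask = [[True, False], [True, True]]
frame = stitch(background, patch, mask, top=1, left=1)
```

Because object pixels simply overwrite the background, differences in illumination between the object's original time period and the generated background remain visible as seams, which is what a dedicated blending method would reduce.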


The methodology of video synopsis commonly applied in the literature was explained in this section. The next section categorizes the video synopsis literature according to different aspects, namely optimization type, camera topology, input data domain, and the activity selection criteria, and provides a detailed analysis of the approaches with respect to these aspects.


4. Classification of video synopsis approaches


Video synopsis approaches can be divided into four groups by content, namely optimization type, camera topology, input data domain, and activity clustering. The distribution of the studies over the years is provided in Fig. 5, and the ratio of publications according to the four groups is shown in Fig. 6.


It is evident that off-line optimization approaches have been more dominant than on-line approaches. Although on-line approaches appeared early on, they have always been in the minority. Similarly, single-camera approaches are more popular than multi-camera approaches. There was no multi-camera approach until 2014, even though video synopsis was first proposed in 2006. Occasional interest in approaches using the compressed domain appeared in 2013, 2014 and 2017. Also, there has been no consistent trend in video synopsis approaches that apply activity clustering, as they appear only in specific time periods. A general overview shows that, while there is no significant trend in compressed-domain and activity clustering approaches, the number of on-line and multi-camera approaches has increased in recent years. This gives a clue about future trends in the field of video synopsis. The following subsections provide detailed analyses of the four aspects mentioned.


4.1. Aspect 1: Optimization type


Optimization is the most important step in video synopsis. All optimization methods aim to obtain a mapping of activities from the source video to proper positions in the video synopsis. The final goal is to display all of the activities in the shortest time period while avoiding collisions as much as possible. Generally, the optimization problem is defined as the minimization of a global energy function that consists of several costs, such as the activity, background consistency, temporal consistency, and spatial collision costs. Some studies used additional costs, while others did not use all of them. A brief explanation of the commonly used costs is provided as follows:


• The activity cost forces the inclusion of the maximum number of activities in a video synopsis. Activities left out of the synopsis are penalized by this term. Leaving out any activity is undesirable in video synopsis; therefore, this term is used by almost all approaches.


• The aim of the background consistency cost is to guarantee stitching of tubes to background images having a similar appearance. This term measures the cost of stitching an object to the time-lapse background. Inconsistency between a tube and the background is penalized, as it is assumed that each tube is surrounded by pixels from its original background.


• The role of the temporal consistency cost is to preserve the temporal order of the activities; therefore, activity shifts that break the temporal order are penalized. Changing the temporal order of the activities in the optimization phase may provide a more compact representation by increasing the variation of activity sequences. On the other hand, preserving the chronological order is important for the causal relations between the activities. Analyzing activities that interact in the source video is easier if temporal consistency is preserved. Approaches generally use a weight parameter for this term in order to balance the semantic integrity and the optimal activity representation of the video synopsis.

• The collision cost prevents spatial collisions between the activities in order to provide better visual quality. Spatial collisions are penalized by increasing the total energy. Handling spatial collisions is the main problem of the optimization step. Activities frequently collide with each other because surveillance cameras often capture crowded scenes. Allowing collisions in the video synopsis decreases the visual clarity and the traceability of the activities, even though it provides a more compact output with a higher number of activities in a shorter time period. On the other hand, a video synopsis longer than the source video may be created if spatial collisions are completely prevented, especially for crowded scenes. This term is at the center of the activity optimization phase, as it is the most challenging problem in the representation. The majority of the approaches focus on finding an optimal solution for activity collision.

While the activity and the background consistency costs are calculated for each activity separately, the temporal consistency and the collision costs are calculated between pairs of activities in the video synopsis. Weight parameters are generally used, especially for the temporal consistency and the spatial collision costs, to find the optimal solution. An illustration of the different activity representations that can be obtained by minimizing the same energy function with different weights of the temporal consistency cost is provided in Fig. 7. Scenarios for preserving chronological order absolutely (a), preserving chronological order partially (b), and ignoring chronological order (c) are represented. Fig. 7 shows that displaying the activities in the same chronological order as the source video results in a longer video synopsis.
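The interplay of these costs can be sketched in a few lines. The following is a simplified illustration, not the energy function of any specific approach: each activity is a hypothetical `Tube` with one fixed bounding box, the temporal consistency cost is a unit penalty for every pair whose source order is reversed, and the background consistency cost is omitted because it requires pixel data. The names `energy`, `w_temporal` and `w_collision` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


@dataclass(frozen=True)
class Tube:
    """Hypothetical activity record with one fixed bounding box."""
    name: str
    src_start: int   # first frame in the source video
    length: int      # duration in frames
    box: Box


def box_overlap(a: Box, b: Box) -> int:
    """Intersection area of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)


def frame_overlap(start_a: int, len_a: int, start_b: int, len_b: int) -> int:
    """Number of synopsis frames in which both activities are visible."""
    return max(0, min(start_a + len_a, start_b + len_b) - max(start_a, start_b))


def energy(tubes: List[Tube], placement: Dict[str, Optional[int]],
           w_temporal: float = 1.0, w_collision: float = 1.0) -> float:
    """Total cost of a placement; a start of None means the tube is left out."""
    total = 0.0
    placed = []
    for t in tubes:
        if placement.get(t.name) is None:
            # Activity cost: excluded tubes are penalized by their volume.
            total += (t.box[2] - t.box[0]) * (t.box[3] - t.box[1]) * t.length
        else:
            placed.append(t)
    for i, a in enumerate(placed):
        for b in placed[i + 1:]:
            sa, sb = placement[a.name], placement[b.name]
            # Temporal consistency cost: source order reversed in the synopsis.
            if (a.src_start - b.src_start) * (sa - sb) < 0:
                total += w_temporal
            # Collision cost: overlapping area times shared frames.
            total += w_collision * box_overlap(a.box, b.box) * frame_overlap(
                sa, a.length, sb, b.length)
    return total


tubes = [Tube("A", 0, 10, (0, 0, 10, 10)), Tube("B", 100, 10, (5, 5, 15, 15))]
simultaneous = energy(tubes, {"A": 0, "B": 0})    # collide for 10 frames
sequential = energy(tubes, {"A": 0, "B": 10})     # no collision, order kept
reversed_order = energy(tubes, {"A": 10, "B": 0}) # no collision, order broken
```

Raising `w_temporal` makes order-breaking placements more expensive, which pushes a minimizer towards chronological but longer arrangements; lowering it has the opposite effect.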


All the activities are represented in 28 frames in this case, as illustrated in Fig. 7(a). Ignoring the chronological order of the activities by using a lower weight parameter provides more compact

