Methods, Techniques, Trends and Challenges While Implementing Content Based Video Retrieval Systems

Jitesh Neve


Following on from my previously published article, this paper moves the discussion one step closer to implementation. Content Based Video Retrieval (CBVR) is the process of retrieving desired videos from a large collection on the basis of features extracted from the videos. The extracted features are used to index, classify and retrieve relevant videos while filtering out undesired ones. Videos can be represented by the audio, text, faces and objects in their frames. An individual video possesses unique motion features, color histograms, motion histograms, text features, audio features, and features extracted from the faces and objects in its frames. Videos that contain useful information and occupy significant database space remain under-utilized unless CBVR systems exist that can select relevant videos precisely while filtering out undesired ones. Results have shown improved performance (higher precision and recall) when features suited to the particular type of video are used. Various combinations of these features can also be used to achieve the desired performance. This paper presents the wide and complex area of CBVR and CBVR systems in a comprehensive and simple way. The processes at the different stages of a CBVR system are described systematically. Types of features, their combinations, and the methods, techniques and algorithms that use them are also shown. Various querying methods, features such as GLCM and Gabor magnitude, similarity measures such as the Kullback-Leibler distance, and the relevance feedback method are discussed.


In today's electronic world a huge amount of useful digital information, such as images, audio and video in addition to text, exists online and is easily accessible to the public, government authorities, professionals and researchers at reasonably low cost. Several factors drive this: the rapid growth in user-friendly, inexpensive multimedia acquisition devices such as high-resolution mobile phone cameras, handycams and other advanced digital devices; the availability of high-capacity storage such as memory cards and hard disks; the rapidly growing number of applications that upload multimedia content over the internet; and advances in web technology and internet infrastructure. Video data carries a great deal of information for users of multimedia systems and applications such as digital libraries, publications, education, broadcasting and entertainment. Such applications are useful only when video retrieval systems are efficient enough to retrieve videos and other important information from large databases as quickly as possible. However, searching for video over the web is extremely challenging for existing web search engines, so novel methodologies are required that can manipulate video information according to its content.

For multimedia mining, combinations of multimedia data are stored and arranged using techniques such as classification and annotation of videos. Most web based video retrieval systems index and search videos using the text associated with them, but this technique does not perform well because the text does not contain enough information about the videos. Since conventional query-by-text retrieval is not effective for video, Content Based Video Retrieval (CBVR) is considered one of the best practical solutions for better retrieval quality. By exploiting rich video content, there is tremendous scope in video retrieval to enhance the performance of conventional search engines. This is leading CBVR in a direction that promises more effective video search engines in the future.

Content Based Video Retrieval Process:

With the advancement of multimedia, digital video creation has become very common. Videos contain an enormous amount of information, which necessitates search techniques for specific content. Retrieval is complicated by the many video formats in use and by the many languages that the audio track of a film may contain. As a result, video retrieval is not as easy as text or pattern searching.

A video can be considered at four different levels: frame, shot, scene and story. At the frame level, each frame is treated separately as a static image. A set of contiguous frames acquired through one continuous camera recording makes up a shot. A set of contiguous shots with common semantic significance makes up a scene. The complete video object is the story level.
A hierarchical video browser consists of a number of levels, from the video title, to groups of shots, to shots, and to individual frames. Representative frames are displayed at each level. Subsequent levels are displayed when selected.
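
The frame, shot, scene and story levels can be pictured as a nested data structure. The following is a minimal sketch, not part of the original work: the class names (Shot, Scene, Video) and the browse() helper are hypothetical and serve only to illustrate how a hierarchical browser might walk the levels.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    """Contiguous frames from one continuous camera recording."""
    start_frame: int                                   # first frame index (inclusive)
    end_frame: int                                     # last frame index (inclusive)
    key_frames: List[int] = field(default_factory=list)  # representative frame indices

    def frame_count(self) -> int:
        return self.end_frame - self.start_frame + 1


@dataclass
class Scene:
    """Contiguous shots that share a common semantic meaning."""
    label: str
    shots: List[Shot] = field(default_factory=list)


@dataclass
class Video:
    """Story level: the complete video object."""
    title: str
    scenes: List[Scene] = field(default_factory=list)

    def frame_count(self) -> int:
        return sum(s.frame_count() for sc in self.scenes for s in sc.shots)

    def browse(self) -> List[str]:
        """List the hierarchy top-down, one indented line per level entry."""
        lines = [self.title]
        for scene in self.scenes:
            lines.append(f"  scene: {scene.label}")
            for shot in scene.shots:
                lines.append(
                    f"    shot {shot.start_frame}-{shot.end_frame} "
                    f"key frames {shot.key_frames}"
                )
        return lines
```

A browser built on such a structure would show only the title and scene lines at first and reveal the shot lines when a scene is selected.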

Discussion Related to the Challenges:

A system that supports content-based video indexing and retrieval has, in general, two stages. The first, the database population stage, performs the following tasks. Video segmentation: segment the video into its constituent shots. Key frame selection: select one frame or more to represent each shot. Feature extraction: extract low-level and other features from the key frames or their interrelationships in order to represent these frames, and hence the shots. The second stage, the retrieval subsystem, processes the presented query (usually in the form of query by example, QBE), performs similarity matching operations, and finally displays the results to the user.

Every year video content grows in volume, and different techniques are available to capture, compress, display, store and transmit video, while editing and manipulating video based on its content is still a non-trivial activity.
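
The segmentation and key frame selection tasks of the database population stage can be sketched as follows. This is an illustration under stated assumptions only: it detects cuts with a simple gray-level histogram difference (a common baseline technique, swapped in here for brevity), it picks the middle frame of each shot as the key frame, and the function names and the threshold of 0.5 are hypothetical choices.

```python
import numpy as np


def gray_histogram(frame, bins=16):
    """Normalized intensity histogram of one gray-level frame (values 0..255)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()


def segment_shots(frames, threshold=0.5):
    """Split a frame sequence into shots.

    A cut is declared wherever the L1 distance between the histograms of
    two consecutive frames exceeds `threshold` (an assumed value).
    Returns a list of (start, end) index pairs, both inclusive.
    """
    if len(frames) == 0:
        return []
    hists = [gray_histogram(f) for f in frames]
    boundaries = [0]
    for i in range(1, len(frames)):
        if np.abs(hists[i] - hists[i - 1]).sum() > threshold:
            boundaries.append(i)
    boundaries.append(len(frames))
    return [(boundaries[k], boundaries[k + 1] - 1)
            for k in range(len(boundaries) - 1)]


def select_key_frames(shots):
    """Pick the middle frame of every shot as its key frame (an assumption)."""
    return [(start + end) // 2 for start, end in shots]
```

For example, a sequence of five dark frames followed by five bright frames is split into the shots (0, 4) and (5, 9), with frames 2 and 7 chosen as key frames.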

Recent advances in multimedia technologies allow the capture and storage of video data with relatively inexpensive computers. However, without appropriate search techniques all these data are hardly usable. Users want to query the content instead of the raw video data. Today research is focused on video retrieval. In one approach, edge detection and DCT based block matching are used for shot segmentation and a region based approach is used for retrieval. In Content Based Video Retrieval (CBVR), feature extraction plays the main role. Features are extracted from the regions as SIFT descriptors, and the features of the query object are compared with the shot features for retrieval. The Internet is today's largest source of information; it contains a high density of multimedia objects whose content is often semantically related. Identifying relevant media objects in such a vast collection is a major problem studied in the area of multimedia information retrieval. Before the emergence of content-based retrieval, media was annotated with text, which allowed it to be accessed by text-based searching based on the classification of subject or semantics.

In typical content-based retrieval systems, the contents of the media in the database are extracted and described by multi-dimensional feature vectors, also called descriptors. One novel video retrieval system uses Generalized Eigenvalue Decomposition (GED). The system contains two major subsystems: database creation and database searching. Both subsystems use new methods for shot-feature extraction, feature dimension transformation and feature similarity measurement based on GED.

Classification of Videos:

Classification of videos helps to increase the efficiency of video retrieval and is one of the most important tasks. During video classification, information is obtained from features extracted from the video components, and videos are then placed in previously defined categories. This information includes visual and motion features of video components such as objects, shots and scenes. Most classification techniques perform either semantic content classification or non-semantic content classification. The most suitable technique is chosen according to the type of video and the application, so that a video is assigned to the closest of the pre-defined categories. Semantic video classification can be performed at three levels: video genres, video events and objects in the video.

Video genre based classification assigns a video to one of a set of pre-defined categories. These categories are commonly occurring kinds of video such as sports, news, cartoons, movies, wildlife and documentaries. Genre based classification has a better and broader detection capability, while objects and events have a narrow detection range. Event based video classification detects events in the video data and assigns the video to one of the pre-defined categories. An event is said to have occurred if it has significant and visible video content.

A video can have many events and each event can have sub-events. One of the most important steps in content based video classification is to classify the events of a video. Shots are the most elementary component of a video, so the classification of shots determines the classification of videos. Shots are classified using the features of the objects they contain. Different kinds of video features, including motion, color, texture and edge, are extracted for every shot for video retrieval. Image retrieval methods and techniques can be used in key frame based video retrieval systems, which exploit the low level visual features of key frames. In key frame based retrieval a video is abstracted and represented by the features of its key frames, so the indexing methods of image databases can be applied to shot indexing. Each shot and all its key frames are linked to each other. For video retrieval, a shot is found by identifying its key frame. The computational cost of using all the frames of a shot to retrieve a video is much higher than the cost when only key frames represent the shot. The visual features of the query key frames are compared with those of the videos in the database for retrieval.
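
Key frame comparison on low level visual features can be sketched with color histograms and a Kullback-Leibler distance. The code below is an assumption-laden illustration: the bin count, the symmetrised form of the distance, the smoothing constant `eps`, and the function names are all hypothetical choices, not details taken from any particular system.

```python
import numpy as np


def color_histogram(frame, bins=8):
    """Concatenate the R, G and B histograms of an H x W x 3 frame
    into one normalized feature vector."""
    parts = []
    for channel in range(3):
        hist, _ = np.histogram(frame[:, :, channel], bins=bins, range=(0, 256))
        parts.append(hist)
    vec = np.concatenate(parts).astype(float)
    return vec / vec.sum()


def kl_distance(p, q, eps=1e-10):
    """Symmetrised Kullback-Leibler distance between two histograms.
    `eps` avoids log(0) for empty bins (an assumed smoothing choice)."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def rank_key_frames(query_frame, database_frames):
    """Return database indices ordered from most to least similar."""
    q = color_histogram(query_frame)
    distances = [kl_distance(q, color_histogram(f)) for f in database_frames]
    return [int(i) for i in np.argsort(distances)]
```

Because each shot is linked to its key frames, the index returned for the best matching key frame leads directly to the shot, and hence to the video, that contains it.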

Key frames are also employed in face and object based video retrieval. A large number of the existing CBVR systems work with key frames. Key frames can deliver a lot of useful information for retrieval and, if required, the static features of key frames can be used to measure video similarity along with motion features and object features. Object based video classification is based on object detection in video data. Faces and text are also used to classify videos. The method proposed by Dimitrova et al. classifies four types of TV programs. Faces and text are detected and then tracked through each frame of a video segment. Frames are labeled for a specific type according to the faces and text they contain.
An HMM (hidden Markov model) is trained to classify each type of frame using these labels. The appearance of textual information in streaming video frames makes it possible to build an automated video retrieval system based on the text appearing in consecutive frames. Video classification using objects such as faces and text works only in specific environments, and as a basis for video indexing it has the limitation that it is not generic. Object based video classification usually shows poor performance.

Querying a Video:

  1. Query by Object: An image of the object is provided as the query. Occurrences of the object in the video database are detected, and the locations of the object determine the success of the query.

  2. Query by Text: As is popular in content based image retrieval, example images can be used as a query to retrieve relevant videos from a video database (query by example), but this has the limitation that the motion information of the video being searched is not utilized; it relies only on appearance information. Also, finding a video clip for a concept of interest may become too complex using an example image. A textual query offers a more natural interface and is claimed to be a better approach for querying video databases.

  3. Query by Example: Query by example works better when the visual features of the query are used for content based video retrieval. Low level features are obtained from the key frames of the query video and are then compared with the visual features of the key frames of the database videos to pick out similar videos.

  4. Query by shot: Some systems utilize the entire video shot as the query instead of key frames. This can be a better option but with a higher computational cost.

  5. Query by clip: A clip can give better retrieval performance than a shot because a shot does not carry sufficient information about the whole context. All clips that have significant similarity or relevance to the query clip are retrieved.

  6. Query by Faces and Texts: Faces and text can also be used as a query to retrieve a video segment whose frames are labeled for a specific type according to the faces and text they contain. A suitable algorithm can then search for the video requested by the query clip using the information obtained from the faces and text in the frames of the query clip.

Proposed Method:

Content Based Video Retrieval consists of two phases: a registration phase and a query execution phase.

A. Registration Phase

In this phase a feature vector table is built from the transformed visual contents of each video that is to be registered and stored in the video database. The algorithm for the registration phase is as follows:
1. Select a video to be stored.
2. Extract the key frames of the selected video. In the proposed work every 20th frame is taken as a key frame.
3. Extract the Red, Green and Blue components of each key frame.
4. Apply the Hybrid Wavelet Transform to each individual plane of each key frame.
5. Prepare the feature vector of energy coefficients.
6. Repeat steps 1 to 5 for all videos in the database to obtain the feature vector table.
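
The registration steps above can be sketched as follows. This is a minimal illustration under several assumptions, not the proposed implementation: an orthonormal Cosine (DCT-II) transform stands in for the Hybrid Wavelet Transform, whose construction is not given here; the energy feature is taken as the squared coefficients of the low-frequency `keep` x `keep` block; key frame features are averaged per video; and key frame sampling starts at frame 0. All function names are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix (stand-in for the Hybrid Wavelet Transform)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m


def extract_key_frames(frames, step=20):
    """Step 2: take every 20th frame (indices 0, 20, 40, ...) as a key frame."""
    return frames[::step]


def frame_features(frame, keep=4):
    """Steps 3 to 5: transform each color plane and keep the energy of the
    low-frequency keep x keep block as the feature."""
    h, w = frame.shape[:2]
    row_t, col_t = dct_matrix(h), dct_matrix(w)
    feats = []
    for channel in range(3):                      # Red, Green, Blue planes
        plane = frame[:, :, channel].astype(float)
        coeffs = row_t @ plane @ col_t.T          # separable 2-D transform
        feats.append((coeffs[:keep, :keep] ** 2).ravel())
    return np.concatenate(feats)


def register_video(frames, keep=4):
    """Average the key frame features into one vector (assumed aggregation)."""
    key_frames = extract_key_frames(frames)
    return np.mean([frame_features(f, keep) for f in key_frames], axis=0)


def build_feature_table(videos, keep=4):
    """Step 6: map every video name to its feature vector."""
    return {name: register_video(frames, keep) for name, frames in videos.items()}
```

With `keep=4` each video is described by 3 x 16 = 48 values regardless of frame size, which is the kind of feature vector size reduction that fractional coefficients aim for.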

B. Query Execution Phase

The algorithm for the query execution phase for a given query video is as follows:
1. Extract the key frames of the query video. In the proposed work every 20th frame is taken as a key frame.
2. Extract the Red, Green and Blue components of each key frame.
3. Apply the Hybrid Wavelet Transform to each individual plane of each key frame.
4. Prepare the feature vector of energy coefficients.
5. Compare the query video feature vector with the feature database using a similarity measure to obtain the set of relevant matches from the database.
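
The comparison in the final step can be sketched as a nearest-neighbour search over the feature vector table. The similarity measure is not named in the text, so Euclidean distance is assumed here; the function names, the `top_k` parameter and the precision and recall helper are likewise hypothetical.

```python
import numpy as np


def euclidean_distance(a, b):
    """Distance between two feature vectors (assumed similarity measure)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def retrieve(query_vector, feature_table, top_k=3):
    """Return the top_k (name, distance) pairs, closest match first."""
    scored = [(name, euclidean_distance(query_vector, vec))
              for name, vec in feature_table.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:top_k]


def precision_recall(retrieved, relevant):
    """Precision and recall of the retrieved names against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

The query vector must be produced with the same transform and the same coefficient selection that were used during registration, otherwise the distances are not comparable.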

Variations in Proposed Research Work:

The paper explores both the registration phase and the query execution phase to analyze the impact of fractional coefficients on feature vector size reduction and retrieval efficiency. The constituent orthogonal transforms, Cosine and Haar, are also compared with the Hybrid Wavelet Transform for Content Based Video Retrieval. The paper also explores the computational efficiency of the Cosine, Haar and Cosine-Haar Hybrid Wavelet transforms.
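
The idea of fractional coefficients can be illustrated with the two constituent transforms. The sketch below builds orthonormal Cosine (DCT-II) and Haar matrices and reports how many coefficients a given fraction keeps and what share of the total energy they hold. The Hybrid Wavelet Transform itself is not reproduced because its construction is not given here. Selecting the top-left block of coefficients and all function names are assumptions made for illustration.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II (Cosine) transform matrix."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m


def haar_matrix(n):
    """Orthonormal Haar transform matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        m = h.shape[0]
        top = np.kron(h, [1.0, 1.0])              # averaging rows
        bottom = np.kron(np.eye(m), [1.0, -1.0])  # differencing rows
        h = np.vstack([top, bottom]) / np.sqrt(2.0)
    return h


def fractional_energy(plane, transform, fraction):
    """Keep the top-left block holding roughly `fraction` of all coefficients.

    Returns (number of coefficients kept, share of total energy kept).
    Assumes a square plane with non-zero total energy.
    """
    coeffs = transform @ plane @ transform.T
    n = plane.shape[0]
    side = max(1, int(round(n * np.sqrt(fraction))))
    kept = (coeffs[:side, :side] ** 2).sum()
    return side * side, float(kept / (coeffs ** 2).sum())
```

On an 8 x 8 plane a fraction of 0.25 keeps 16 of the 64 coefficients per plane, so the feature vector shrinks to a quarter of its full size, while the energy share indicates how much information that smaller vector retains.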


The system proposed so far has been the subject of extensive research and testing. I am confident that content based video retrieval will improve once the proposed concepts are integrated.


[1] Shweta Ghodeswar, B. B. Meshram, Technicians, "Content Based Video Retrieval".
[2] R. Kanagavalli, Dr. K. Duraiswamy, "Object Based Video Retrievals", International Journal of Communications and Engineering, Volume 06, No. 6, Issue 01, March 2012.
[3] Dr. S. D. Sawarkar, V. S. Kubde, "Content Based Video Retrieval using trajectory and Velocity features", International Journal of Electronics and Computer Science Engineering, ISSN 2277-1956.
[4] Ali Amiri, Mahmood Fathy, and Atusa Naseri, "A Novel Video Retrieval System Using GED-based Similarity Measure", International Journal of Signal Processing, Image Processing and Pattern Recognition, Vol. 2, No. 3, September 2009.
[5] S. Thanga Ramya, P. Rangarajan, "Knowledge Based Methods for Video Data Retrieval", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 3, No. 5, Oct 2011.
[6] B. V. Patel, A. V. Deorankar, B. B. Meshram, "Content Based Video Retrieval using Entropy, Edge Detection, Black and White Color Features", 2nd International Conference on Computer Engineering and Technology (ICCET), 2010.
[7] Gao, X. and X. Tang, 2002. "Unsupervised video shot segmentation and model-free anchorperson detection for news video story parsing". IEEE Trans. Circuits Syst. Video Technol., 12: 765-776.
[8] Shan Li, Moon-Chuen Lee, 2005. "An improved sliding window method for shot change detection". Proceedings of the 7th IASTED International Conference on Signal and Image Processing, Aug. 15-17, Honolulu, Hawaii, USA, pp. 464-468.
[9] Hamdy K. Elminir, Mohamed Abu ElSoud, "Multi feature content based video retrieval using high level semantic concept", IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 4, No. 2, July 2012.
[10] Boreczky, J. S. and L. D. Wilcox, 1998. "A hidden Markov model framework for video segmentation using audio and image features". In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, May 12-15.