Best video retrieval without using Crawling Search Algorithms

Jitesh Neve

The traditional retrieval process is hard to beat, yet it is somewhat failure-prone because it depends on crawling. This article proposes a retrieval process that does not rely on crawling. Content-based video retrieval is the operation in which video results are retrieved in response to a query, commonly a text query. A text query alone, however, is often not sufficient for retrieving videos, because a video contains a collection of images, audio, frames and shots. Many versions of the same video may also exist, and video is distributed in many different formats. Retrieving videos is therefore complicated and challenging. A lot of work has been done in the color domain, but there is still room for improvement.

This article therefore proposes some new directions for content-based video retrieval. With the advancement of multimedia, digital video creation has become very common. There is enormous information present in the video, necessitating search techniques for specific content. In this paper, we present an approach that enables search based on the textual information present in the video. Regions of textual information are identified within the frames of the video. The video is then annotated with the textual content present in the images.

Current Process:

Current search engines are built on crawling: the search engine has to crawl the web. This is done by crawling processes, often called insect threads, each of which runs a particular implementation of a crawling algorithm. They are now commonly called ‘crawlers’. The crawlers crawl (search) the entire web (i.e. the World Wide Web), and the information they collect has to be maintained on a server.

The search engine then generally searches the server where the crawled information is stored. One drawback of this system is that crawling has to be done continuously, and the data on the server has to be updated just as continuously.

See the above image: I searched for “OSI Reference Model” in the Images module and found some images that are not relevant to the query. The results that do not match are marked with a red outline. This happened because the crawler captured the image URL together with a description containing words that match the query. When the query was fired, it was matched against the text context of each image rather than the image content, and the images whose context matched were added to the result set.

The same thing happens with videos. For images it is relatively simple to judge relevancy, but videos are more complicated in this respect. The rest of this article therefore draws on research into video retrieval and content-based video classification.


A large amount of multimedia information is being generated in various domains. The need for efficient techniques for accessing the relevant information from these multimedia databases is on the rise. Content based information retrieval techniques are the bare minimum needed to access the relevant information in these collections. Search and retrieval techniques have reached maturity for textual data, resulting in powerful search engines. Active research has been seen in the area of image retrieval and video retrieval during the last decade. Large video collections, which were thought impossible at one stage, are becoming increasingly common due to the advancement of storage technologies. Conventional approaches for video indexing are based on characterizing the video content using a set of computational features. These techniques often do not exploit the commonly available high-level information conveyed by textual captions. Retrieval of relevant videos from large video databases has immense applications in various domains. Retrieval from broadcast news video databases is vital for effective information access.

A video hardly needs an introduction, since everyone is familiar with the concept. In short, a video is a sequence of images (frames) of the same height and width, displayed at a specific frame rate (frames per second). A video usually also contains audio that is played along with the frames.

Video Retrieval: Retrieving video is more complicated than retrieving text because video exists in several formats, and the audio track may be in any of many languages. It is therefore not as easy as text or pattern searching.

Video Retrieval Schemes:

Several approaches have been reported for indexing and retrieval of video collections. They model spatial and temporal characteristics of the video for representation of the video content. In the spatial domain, feature vectors are computed from different parts of the frames and their relationship is encoded as a descriptor. The temporal analysis partitions the video into basic elements like frames, shots, scenes or video segments. Each of the video segments is then characterized by the appearance and dynamism of the video content. It is often assumed that features like histograms, moments, texture and motion vectors can describe the information content of the video clip. In a database of videos, one can query for relevant videos with example images, as is popular for content based image retrieval. Extending this approach has the limitation that it does not utilize the motion information of the video and employs only the appearance information. Moreover, finding example video clips for the concept of interest can be quite complex for many applications. Textual query is a promising approach for querying in video databases, since it offers a more natural interface.

Text Detection in video frames:

An important step in characterizing video clips based on the textual content is the robust detection of the textual blocks in images/frames. Binarization techniques, which use global, local, or adaptive thresholding, are the simplest methods for text localization in images. These methods are widely used for document image segmentation. Text detection in videos and natural images needs more advanced techniques. Methods which detect text in video use either region properties in the spatial domain or textural characteristics.
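
To make the idea of binarization concrete, here is a minimal sketch of global and block-local thresholding. It is only an illustration, not the detection method used in this article: the function names, the threshold of 128, the block size and the offset are all assumptions, and the frame is assumed to be grayscale with dark text on a light background.

```python
import numpy as np

def binarize_global(gray, threshold=128):
    """Mark pixels darker than one fixed threshold as ink (True)."""
    return np.asarray(gray) < threshold

def binarize_local(gray, block=8, offset=10):
    """Mark a pixel as ink when it is darker than its block's mean minus an offset."""
    gray = np.asarray(gray, dtype=float)
    ink = np.zeros(gray.shape, dtype=bool)
    for r in range(0, gray.shape[0], block):
        for c in range(0, gray.shape[1], block):
            tile = gray[r:r + block, c:c + block]
            ink[r:r + block, c:c + block] = tile < tile.mean() - offset
    return ink

# Toy frame: light background (200) with a dark 2x3 "text" patch (30).
frame = np.full((8, 8), 200, dtype=np.uint8)
frame[3:5, 2:5] = 30
global_mask = binarize_global(frame)
local_mask = binarize_local(frame)
```

On real video frames, where lighting and background vary, a fixed global threshold usually fails, which is why the more advanced techniques mentioned above are needed.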

Text Query for Retrieval:

An advanced video retrieval solution could identify the text present in the video, recognize the text, and compute the similarity between the query string and pre-indexed textual information present in the video. However, the success of this technique depends on two important aspects: (a) the quality of the input video and (b) the availability of an OCR for robustly recognizing the text images. In many practical situations, we find video clips where the resolution of the broadcast is not enough even for a reasonably robust OCR to work on. Moreover, for many languages, we do not have OCRs available for decoding the textual content in the frames. Since we do not have OCRs available to work effectively on the text in the video data, we use text images to index the videos.


This section details the procedure used to retrieve the videos containing the search query. The video clips are indexed according to the textual regions extracted from the videos. Text images are used instead of text strings for indexing and retrieval; the process is otherwise the same as for text strings, except that the notion of the alphabet is replaced by an abstract feature representation. For each word, we derive a sequence of feature vectors by splitting the word image into vertical strips.

For each of the strips, features based on their profiles, structural properties, etc. are extracted. These features are computed very similar to the method described in, where a search engine for content based access to document image collections in a digital library is presented. Profile-based features are useful as shape signatures. The upper profile, lower profile and transition profile are employed here as features for representing the words. The Background-to-Ink Transition profile records the number of transitions from an ink pixel to the background pixel for every image column. The upper and lower profiles are computed by recording, for each image column, the distance from the top/bottom boundary of the word image to the closest ink pixel.
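
The profile features described above can be sketched as follows. This is a hedged illustration, assuming the word image is already binarized into a boolean array (True for ink) and that each strip is one pixel column wide; the function name and the choice to give empty columns a profile value equal to the image height are assumptions, not details from the source.

```python
import numpy as np

def profile_features(ink):
    """Per-column upper profile, lower profile and transition count.

    `ink` is a 2-D boolean array, True where a pixel is ink.
    Columns without ink get a profile value equal to the image height.
    """
    ink = np.asarray(ink, dtype=bool)
    height = ink.shape[0]
    has_ink = ink.any(axis=0)
    # Distance from the top boundary to the first ink pixel in each column.
    upper = np.where(has_ink, ink.argmax(axis=0), height)
    # Distance from the bottom boundary to the closest ink pixel in each column.
    lower = np.where(has_ink, ink[::-1].argmax(axis=0), height)
    # Ink-to-background changes met while scanning each column top to bottom.
    transitions = (ink[:-1] & ~ink[1:]).sum(axis=0)
    return np.stack([upper, lower, transitions], axis=1)

word = np.array([
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
], dtype=bool)
features = profile_features(word)  # one row of three features per column
```

Each row of `features` is the feature vector for one column, so a word image of width W yields a sequence of W vectors that can later be compared against another sequence.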

These features are normalized before using them for characterizing the text block. Structural features like mean and standard deviation are also used for characterizing images. The mean captures the average number of ink pixels per column in a word image. The standard deviation is the square root of the average squared deviation of the per-column ink pixel count from that mean. It measures the spread of ink pixels in the word image.
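
A small sketch of the structural features and of one possible normalization follows. The source does not say how the features are normalized, so dividing a profile by the word image height is purely an assumption made for illustration, as are the function names.

```python
import numpy as np

def structural_features(ink):
    """Mean and standard deviation of the ink pixel count per column."""
    counts = np.asarray(ink, dtype=bool).sum(axis=0)
    return float(counts.mean()), float(counts.std())

def normalize_profile(profile, height):
    """Scale a profile by the word image height so values lie in [0, 1]."""
    return np.asarray(profile, dtype=float) / height

# Toy word image: columns 0 and 2 hold two ink pixels each, the others none.
sample = np.array([
    [0, 0, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
], dtype=bool)
mean_ink, std_ink = structural_features(sample)
scaled = normalize_profile([1, 5, 1, 5], height=sample.shape[0])
```

Normalizing by the image height makes profiles from word images of different sizes comparable, which is the purpose the text gives for normalization.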

Word matching and retrieval:

Given a textual query, we render an image of the text query, extract features and match them with the feature sequences stored in the database. We employ a matching algorithm which compares the features of the rendered image with the features present in the database. In [9], it is shown that Dynamic Time Warping (DTW) can be used for finding the similarity of images. DTW is a popular method for matching sequences like strings and speech samples. Hence we use DTW to compare the feature sequences corresponding to images in video frames. In order to handle size variations, we normalize all the features during computation. Our approach enables searching in videos with text in any language.

Dynamic Time Warping (DTW):

Dynamic Time Warping (DTW) is a dynamic programming based procedure used to align two sequences of feature vectors and compute a similarity measure. DTW computes the optimal alignment between the sequences, so that the cumulative distance measure consisting of local distances between aligned elements is minimized. The mapping is mostly non-linear when the input sequences are of different lengths. Let the features extracted from the two word images be A1, A2, A3, . . . AM and B1, B2, B3, . . . BN. Then the DTW cost between the two sequences is calculated using the equation:

D(i, j) = min{ D(i-1, j-1), D(i, j-1), D(i-1, j) } + d(i, j)

where d(i, j) is the cost of aligning the i-th element of A with the j-th element of B, computed using a simple squared Euclidean distance, and D(M, N) is the total cost of matching the two sequences.
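
The DTW cost can be computed with a straightforward dynamic programming table. The sketch below uses the standard DTW recurrence with a squared Euclidean local cost; the function name `dtw_cost` is an assumption, and no path-length normalization or search-window constraint is applied, although a practical system might add either.

```python
import numpy as np

def dtw_cost(a, b):
    """DTW cost between two sequences of feature vectors (one vector per row)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    m, n = len(a), len(b)
    # D[i, j] is the best cumulative cost of aligning a[:i] with b[:j].
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.sum((a[i - 1] - b[j - 1]) ** 2)  # squared Euclidean distance
            D[i, j] = d + min(D[i - 1, j - 1], D[i, j - 1], D[i - 1, j])
    return float(D[m, n])

# The second sequence repeats one element; warping absorbs the repeat.
same_shape = dtw_cost([[1], [2], [3]], [[1], [2], [2], [3]])
different = dtw_cost([[0], [0]], [[1]])
```

Because the alignment may stretch or compress either sequence, two renderings of the same word at different widths can still obtain a low cost, whereas a column-by-column comparison would not.
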
Here we propose a system that is yet to be implemented. The proposed system is intended to improve video retrieval, and it will make use of the methodology of content-based video retrieval.

Content-Based Video Retrieval (CBVR) is the application of the image retrieval problem to video, that is, the problem of searching for digital videos in large databases. “Content-based” means that the search analyzes the actual content of the video. The term ‘content’ in this context might refer to colors, shapes and textures. Without the ability to examine video content, searches must rely on descriptive text supplied by people, such as captions or keywords. Although the term "search engine" is often used indiscriminately to describe crawler-based search engines, human-powered directories, and everything in between, they are not all the same. Each type of "search engine" gathers and ranks listings in radically different ways. We can improve the results by improving these processes. In particular, vector quantization (VQ) can be added to the process to improve retrieval, and the Linde-Buzo-Gray (LBG) algorithm can be incorporated as well.

Vector Quantization: Vector quantization (VQ) is a classical quantization technique from signal processing that allows the modeling of probability density functions by the distribution of prototype vectors. It was originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms.
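
As a rough illustration of vector quantization, the sketch below maps each input vector to the index of its nearest prototype (code vector). The codebook here is simply given by hand; the function name and the toy data are assumptions made for the example.

```python
import numpy as np

def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest code vector."""
    vectors = np.asarray(vectors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance between every vector and every code vector.
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

points = [[0, 0], [1, 0], [10, 10], [9, 11]]
codebook = [[0.5, 0.0], [9.5, 10.5]]
indices = quantize(points, codebook)
# Each point can now be stored as a small index instead of a full vector.
approximation = np.asarray(codebook)[indices]
```

Storing only the indices and the codebook is what made the technique attractive for compression, and the same indices can serve as a compact description of a frame for retrieval.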

Need of Vector Quantization: The usage of video codecs based on vector quantization has declined significantly in favor of those based on motion compensated prediction combined with transform coding, e.g. those defined in MPEG standards, as the low decoding complexity of vector quantization has become less relevant.

Linde-Buzo-Gray (LBG): One of the popular methods used to generate a codebook is the Linde-Buzo-Gray (LBG) algorithm. In this technique, the centroid of the training set is calculated as the first code vector. Two vectors C1 & C2 are then generated by adding a constant error to, and subtracting it from, the code vector.
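
The splitting step can be sketched as below. The text above describes only the first centroid and the split into C1 and C2; the re-clustering loop that follows each split is the usual LBG refinement and is included here as an assumption, as are the function name, the error value `eps`, the fixed iteration count and the requirement that the codebook size be a power of two.

```python
import numpy as np

def lbg_codebook(training, size, eps=0.01, iterations=10):
    """Grow a codebook by repeated splitting until it holds `size` code vectors.

    `size` is assumed to be a power of two.
    """
    training = np.asarray(training, dtype=float)
    # First code vector: the centroid of the whole training set.
    codebook = training.mean(axis=0, keepdims=True)
    while len(codebook) < size:
        # Split every code vector in two by adding and subtracting a constant error.
        codebook = np.concatenate([codebook + eps, codebook - eps])
        for _ in range(iterations):
            dist = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            nearest = dist.argmin(axis=1)
            for k in range(len(codebook)):
                members = training[nearest == k]
                if len(members):  # keep the old code vector if its cluster is empty
                    codebook[k] = members.mean(axis=0)
    return codebook

training_set = [[0, 0], [0, 2], [10, 10], [10, 12]]
book = lbg_codebook(training_set, size=2)
```

With two well-separated groups of training vectors, the two code vectors settle on the centroids of the groups after the first split.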


A great deal of research and testing has gone into the concepts behind the proposed system. I am confident that content-based video retrieval will improve once the proposed concepts are integrated.

We have described an approach for video search based on the text present in the videos using dynamic time warping. This approach has wider applicability than other techniques, since it does not depend on an OCR being available for the language of the text. Our future work will focus on improving the accuracy as well as the speed of the techniques used here. Accuracy can be improved by using better techniques as well as by using a larger feature set which discriminates words better from each other. Speed can be improved by optimizing our implementation of the Dynamic Time Warping algorithm as well as by looking at related computational techniques to minimize the number of possible matches.


  1. Dr. Sudeep D. Thepade, Krishnasagar S. Subhedarpage, Ankur A. Mali, “Performance Rise in Content Based Video Retrieval using Multi-level Thepade's sorted ternary Block Truncation Coding with Intermediate block videos and even-odd videos”, IEEE 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 978-1-4673-6217-7/13.

  2. Sudeep D. Thepade, Pritam H. Patil, “Novel Keyframe Extraction for Video Content Summarization using LBG Codebook Generation Technique of Vector Quantization”, International Journal of Computer Applications (0975 – 8887), Volume 111 – No 9, February 2015.

  3. Dr. H. B. Kekre, Sudeep D. Thepade, “Image Retrieval using Augmented Block Truncation Coding Techniques”, ACM International Conference on Advances in Computing, Communication and Control (ICAC3-2009), pp. 384-390, 23-24 Jan 2009, Fr. Conceicao Rodrigous College of Engg., Mumbai. Uploaded on the online ACM portal.

  4. Dr. H. B. Kekre, Sudeep D. Thepade, “Using YUV Color Space to Hoist the Performance of Block Truncation Coding for Image Retrieval”, IEEE International Advance Computing Conference 2009 (IACC_09), Thapar University, Patiala, INDIA, 6-7 March 2009.

  5. Dr. H. B. Kekre, Sudeep D. Thepade, “Color Based Image Retrieval using Amendment Block Truncation Coding with YCbCr Color Space”, International Journal on Imaging (IJI), Volume 2, Number A09, Autumn 2009, pp. 2-14. Available online at (ISSN: 0974-0627)

  6. C. V. Jawahar, Balakrishna Chennupati, Balamanohar Paluri, “Video Retrieval Based on Textual Queries”, Center for Visual Information Technology, International Institute of Information Technology, Gachibowli, Hyderabad - 500 019. Available at -

  7. A. K. Jain and Bin Yu, “Automatic text location in images and video frames”, Pattern Recognition. Vol.3, 3:2055–2076, Dec 1998.

  8. Andreas Girgensohn, John Adcock, Matthew Cooper and Lynn Wilcox, “Interactive search in large video collections”, Conference on Human Factors in Computing Systems, April 2005.