Leveraging Analytics for Software Testing

Manoj Philip Mathen

Software Testing is no longer a ritual to be performed at the end of the Software Development Life Cycle (SDLC). The increasing role of IT in running businesses has made it necessary to build quality into software. Defects exist, and it is important to ensure that they are detected and fixed as early as possible. Software Testing plays a crucial role in this. We have seen how 7-8 second pit-stops in Formula One racing determine the results of 1-2 hour long racing events. Pit-stops are today not seen as a penalty, but as an opportunity to secure a win. In the same manner, Software Testing can be looked upon as an opportunity (before the product release) to improve the quality and win the race!

For this, a key aspect is to reduce the time taken for testing and improve the effectiveness of the testing that is done in the short time. Let us take a closer look at the lifecycle of a defect, examine the various stages and check what can be optimized.

The first event, Defect Injection, can happen at any stage in an SDLC. It can happen during requirement definition or later during the build stage. While there are methods and approaches to reduce or prevent injection, the scope of this article is to analyze how the time elapsed after defect injection can be reduced, as in Formula One!

From Figure 1 above, the total time elapsed (T) = A + B + C + D, wherein:

A = The time elapsed between the injection of a defect and its detection
B = The time elapsed between the detection of a defect and its reporting
C = The time elapsed between the reporting of a defect and its fix
D = The time elapsed between the fix of a defect and its retest (including any regression tests)
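
The decomposition above can be illustrated with a small Python sketch. The `DefectTimeline` record and the sample timestamps are hypothetical and are not taken from any particular defect-tracking tool; the sketch only shows that the four intervals add up to the total time T between injection and retest.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class DefectTimeline:
    """Hypothetical record holding the five events in a defect's life cycle."""
    injected: datetime
    detected: datetime
    reported: datetime
    fixed: datetime
    retested: datetime

    def intervals(self) -> Dict[str, timedelta]:
        """Return the intervals A, B, C and D as defined in the text."""
        return {
            "A": self.detected - self.injected,   # injection -> detection
            "B": self.reported - self.detected,   # detection -> report
            "C": self.fixed - self.reported,      # report -> fix
            "D": self.retested - self.fixed,      # fix -> retest
        }

    def total(self) -> timedelta:
        """T = A + B + C + D."""
        return sum(self.intervals().values(), timedelta())


# Illustrative timestamps only (assumed values).
timeline = DefectTimeline(
    injected=datetime(2024, 1, 1, 9, 0),
    detected=datetime(2024, 1, 8, 9, 0),
    reported=datetime(2024, 1, 8, 15, 0),
    fixed=datetime(2024, 1, 10, 15, 0),
    retested=datetime(2024, 1, 11, 15, 0),
)

for name, span in timeline.intervals().items():
    print(name, span)
print("T", timeline.total())
```

In this made-up example A dominates the total, which is the situation in which predicting defects up front would be expected to help the most.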

Further analysis of each of these events might help us optimize the total time T. However, a relatively new concept is to know about the defect beforehand: in other words, to predict defects and then (a) try to avoid their injection and (b) scout for (detect) those that have slipped in. If this is possible, the intervals 'A' and 'B' in the above equation can be reduced drastically.

Let us explore how this can be made possible. One approach is Defect Prediction: leveraging analytics for software testing to predict possible defects upfront.

Application Ecosystem and Unstructured Data

There is a lot of information that can be gleaned from the application / system ecosystem. Historical application data is like the 'fingerprint' or DNA of an application and is too useful to be ignored. The majority of such data is 'unstructured' and hence comes with certain challenges: it cannot be consumed directly as it is. Some of the common sources of data are shown in Figure #2 below.

Broadly speaking, unstructured data falls under 2 major categories.

  1. Bitmap objects
  2. Textual objects

Bitmap objects comprise all image, video and audio files. Textual objects include log files, server statistics, emails, reports etc. In the case of defect analytics, we will mostly encounter textual objects.

Defect Prediction Model

Following are the high level steps in coming up with a Defect Prediction Model. This is a generic model and it is vital that the context specifics for any application / system / enterprise be incorporated to fine tune this for the specific enterprise.

  • Identify any sources of application / system information which might contain historic data, future predictions / analysis etc. Some of the common sources of unstructured data are given in Figure #2 above.

  • Indexing: Data needs to be indexed for faster retrieval and search, which are crucial for dynamic defect modeling. An index is defined on the basis of some value in the data. If no index is identified, the entire data set / folder has to be scanned every time the desired information is retrieved. Indexing unstructured data is difficult, since the data does not follow any fixed pattern. However, it is very much feasible for application logs across environments, where a pattern can be found. For example, 'Warning' and 'Error' are good index words.

  • Tags / Metadata: Data in a document can be tagged using metadata. This facilitates easy search and retrieval. It is important to identify a soft structure for the data. This is again difficult, as it involves heterogeneous data sources and new sources can be added at any time.

  • Classification / Taxonomy: Here application / system data is classified based on the relationships that exist within the data. Data needs to be grouped and placed in hierarchies based on the relationships / taxonomy that exist in the particular context. Inconsistent nomenclature and naming standards within an organization are a common challenge encountered in this stage.

  • Content Addressable Storage (CAS): Data can be stored together with its metadata. CAS assigns a unique name to every object stored in it, and an object is retrieved based on its content and not its location. Application logs, emails, extracts and server logs can be stored using CAS.

  • Build Analytics Engine: Once the storage is decided, the analytics engine can be built. Most enterprises would have corporate licenses for relevant BI / analytics reporting software, and the same can be leveraged.
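
The indexing, tagging and content-addressable storage steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the log format, the `build_index` helper and the in-memory `ContentStore` class are all hypothetical, and SHA-256 is used here simply as one common way to derive a name from content.

```python
import hashlib
from collections import defaultdict

# Index words suggested in the text; a real system would tune this list.
INDEX_WORDS = ("ERROR", "WARNING")


def build_index(lines):
    """Map each index word to the 0-based numbers of the lines containing it."""
    index = defaultdict(list)
    for number, line in enumerate(lines):
        upper = line.upper()
        for word in INDEX_WORDS:
            if word in upper:
                index[word].append(number)
    return dict(index)


class ContentStore:
    """Toy content-addressable store: the key is a digest of the content itself."""

    def __init__(self):
        self._objects = {}
        self._metadata = {}

    def put(self, content: bytes, **tags) -> str:
        """Store content under its SHA-256 digest and attach metadata tags."""
        key = hashlib.sha256(content).hexdigest()
        self._objects[key] = content
        self._metadata.setdefault(key, {}).update(tags)
        return key

    def get(self, key: str) -> bytes:
        """Retrieve content by its content-derived key, not by a location."""
        return self._objects[key]

    def find(self, **tags):
        """Return the keys whose metadata matches every given tag."""
        return [
            key for key, meta in self._metadata.items()
            if all(meta.get(tag) == value for tag, value in tags.items())
        ]


# Made-up log lines for illustration.
sample_log = [
    "2024-01-08 09:00:01 INFO  Service started",
    "2024-01-08 09:00:05 WARNING Cache miss ratio high",
    "2024-01-08 09:00:09 ERROR Payment gateway timeout",
    "2024-01-08 09:00:12 ERROR Payment gateway timeout",
]

index = build_index(sample_log)
store = ContentStore()
key = store.put("\n".join(sample_log).encode("utf-8"), source="app-log", env="test")

print(index)
print(store.find(env="test") == [key])
```

Because the key is derived from the content, storing the same log extract twice yields the same key, so duplicates collapse naturally; the metadata tags then provide the "soft structure" used for search.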

UIMA: Unstructured Information Management Architecture

UIMA is an open source platform and one of the leading industry standards for content analytics. Other general frameworks that can be used for natural language processing include the General Architecture for Text Engineering (GATE) and the Natural Language Toolkit (NLTK). UIMA helps in multi-modal analytics of unstructured information and is integrated with the search technologies built by IBM. UIMA stores information in a structured format. Once structured, the data can be mined and searched, and information can be extracted from it.
The Defect Prediction Model can leverage some of the below techniques for unstructured data analysis:

  • Breaking up of documents into separate words
  • Grouping and classifying according to taxonomy
  • Detecting ‘error’ and ‘warning’ from application logs
  • Detecting ‘events’ and ‘time’ from application behavior (databases, customer records, logs etc.)
  • Detecting relationships between various elements
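
A few of the techniques listed above (breaking text into words, detecting 'error' and 'warning' entries, and extracting the time of each event) can be sketched with the Python standard library alone. The sketch does not use UIMA itself; the log line layout, the `LOG_PATTERN` expression and the helper names are assumptions made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: "<YYYY-MM-DD HH:MM:SS> <LEVEL> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def tokenize(text):
    """Break text into lower-case alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def parse_line(line):
    """Extract time, level and words from one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["time"], "%Y-%m-%d %H:%M:%S"),
        "level": match["level"],
        "words": tokenize(match["message"]),
    }


def problem_word_counts(lines):
    """Count the words that appear in ERROR and WARNING entries."""
    counts = Counter()
    for line in lines:
        event = parse_line(line)
        if event and event["level"] in {"ERROR", "WARNING"}:
            counts.update(event["words"])
    return counts


# Made-up log lines for illustration.
sample_log = [
    "2024-01-08 09:00:01 INFO  Service started",
    "2024-01-08 09:00:05 WARNING Cache miss ratio high",
    "2024-01-08 09:00:09 ERROR Payment gateway timeout",
    "2024-01-08 09:00:12 ERROR Payment gateway timeout",
]

counts = problem_word_counts(sample_log)
print(counts.most_common(3))
```

Word counts of this kind are only a starting point; grouping them by taxonomy and relating them to other elements, as the list above suggests, would be the next step in a fuller engine.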

Figure #3 above depicts a straw man for Defect Prediction Engine. This leverages the UIMA model and can be fine-tuned further depending on the particular context.


Software Defect Prediction gives us an early start towards better quality software. Given the increased focus on analytics, machine learning and IoT, it is high time that these technologies were leveraged for the betterment of Software Testing. The model proposed above is a small step towards that goal.

About the Author

Manoj Philip Mathen has spent the last 12+ years at the forefront of distributed computing technology. He has developed an in-depth understanding of enterprise software architecture, having worked in several development and testing engagements within the FSI space. An author and speaker at several international forums, he has several works on validating SOA based applications, DWT. His current areas of interest include End-to-End validation of Software Systems.

He is currently a Sr. Solutions Architect, working for an IT Major.