Advertisement

Real Life Case Study on File Enumeration using Talend ETL

Abstract

1. Introduction

Learning is an integral part in one’s life and it is very true for a person who works especially in the software industry.  The changes in the software field is very rapid and the professional needs to update their skills based on the demand in business. Now a days, coming to classroom for learning new skills is quite challenging due to their work pressure. Hence lot of people are opting for online training such as e-learning or WebEx.  Once a person attends the training through online mode, he should be evaluated through online test or sometimes only active participation in the training session is considered.  One of the way to evaluate the participant’s skill learning is to assign credit points after attending the training session. One of the main criteria to calculate the credit point for a training session is whether he attended all the training sessions for a given course. Since the training session is conducted through online mode so the attendance for the participants are captured during the training session and stored in Excel Sheet format. If the number of trainings conducted are large in number then collating all the attendance records and prepare the report for attendees Vs credit point for each training session consumes more time and effort.

For example, if we prepare the report manually, it will take lot of time as there are lot of steps involved in this process. First one has to collect all the attendance sheets, for example if a training is of 10 classes, we need to collate 10 different sheets, put them in a single workbook and write some formulas or sometimes a Macro or script. For this the person responsible to do this task needs to have some knowledge on how this has to be done. This costs the time, effort.

If we automate these tasks through open source tool it will have the following advantages:

1.      No need to collate the sheets manually since the tool links to the directory and read the files from that folder automatically.

2.      The execution time is faster.

3.      No need to pay for the tool since it’s an open source

4.      Manual error can be avoided.

5.      Very accurate

There are many open source ETL(Extract, Transform and Load) tools are exist to automate these kind of tasks.  Talend is one of the open source ETL tool is chosen for this work.

Talend offers Eclipse based Interface.It has data quality features which allows more customized SQL queries and Java code. Talend Open studio is one of the most powerful data Integration ETL tool in the market. It fitsvery wellfor every kind of Data Integration process from small file transformation to big Data migration, analysis of the given data and more over it has a highly scalable Architecture feature.  To get further information of Talend, the readers can refer at: https://www.talend.com/products/data-integration, https://www.talend.com/resource/etl-tool.html

In this article, we will explain how to automate credit point calculation based on the participant’s attendance in the training sessions.

2. A Real Life Case Study

2.1 Business Requirement

·         Many IT organizations train their employees on different skills to deploy them in projects.

·         Employees attend trainings online from their work place and their attendance needs to be tracked.

·         This helps organization to deploy those employees who have participated in training for certain minimum threshold duration that is required for gaining knowledge on such skills.

·         Such information will be useful for organization to track employee attendance and hence plan them to deploy in IT projects.

·         Employees provide their name and organization email-ID and log into online training sessions.

·         After session is completed attendance files get generated for each day in a folder.

Format of zipped attendance files with sample data is attached along with this article.

·         If Session is for four days, four attendance files will get as shown below and also there should be visualization that depicts duration each eligible participant has attended.

 

2.1. Sample data in attendance file for one session as shown below:

Figure 1 shows the sample data of day 1 of the WebEx session. If the session is for 4 days duration, we will have similar attendance sheets getting generated for remaining days but the data can be different per say, participants not joining a session, participant joining the session late, closing the session before the threshold time etc.

The files then are kept in a folder where the job will read from.

 

2.2 Job Design in Talend Open Studio

 

Step 1: Create Configfile.txt as shown below which is a key-value pair file.  This configuration file will be read by Talend Job and send this to the job upon which the job will get executed.

ConfigFile.txt

key,value

dirpathparam,C:\Users\kavicp\Downloads\td

Session_Duration,480

numofsessionsconsidered,3

Dirpathparam: It is directory where unzipped session attendance files exists.

Session_Duration: Duration of each session which is uniform for every session.

numofsessionsconsidered: How many sessions participants need to attend if they have to be considered as eligible for training

Step 2: Under Repository window create context group that will have contexts which will be used to read key,value details from configfile.txt created in step 1.

 

Step 3: Create a job in Talend 

Figure 3: Job showing the ETL process

Figure 4: Subjob showing components reading data from config file and loading values in talend environment

Figure 4 shows components which reads the directory which has session attendance files also reads parameters from configuration file and load context variables.

tSystem
component executes one or more system commands. Set the property value command “cmd /c cd”.  This will provide directory wherethe configuration file exists.

 

tFileInputDelimitedcomponent opens a file and reads it row by row to split them up into fields then sends fields as defined in the Schema to the next Job component, via a Row link

This component reads keys and values from config file using property Filename:

Here FileName is set to ((String)globalMap.get("tSystem_1_OUTPUT")) + "/" + "config.txt"

Global variable : ((String)globalMap.get("tSystem_1_OUTPUT")) will get path of  directory

Where config file exists and that is concatenated with ~/~ and further concatenated with

“config.txt”

 

Click on edit schema and provide below schema for Key and columns which will map to schema of config file key and value fields.

Key string not null

Value string not null

tContextLoad Component  can be used to load a context from a flow. In this case study this is used to load below context items to talend job environment:

key,value

dirpathparam,C:\Users\kavip\Downloads\td

Session_Duration,480

numofsessionsconsidered,3

Figure 7 shows the components which iterates through session attendance files and generates the consolidated file that contains attendance details of all participants of all sessions.

tFileList component retrieves a set of files or folders based on a file mask pattern and iterates. In this Case Study this is used to retrieve set of participant’s attendance files which are .csv files and will iterate on each one of them.

The properties to be set for this component are:

a)      Directory: context.dirpathparam . This directory has participant’s attendance files

b)      Filemask: *.csvThis property is used for retrieving files of .csv

 

Figure 8: properties of tFileList component

tFileInputDelimitedComponent opens a file and reads it row by row to split them up into fields then sends fields as defined in the Schema to the next Job component. In this Case Study this is used to read attendance csv files from dir path. The properties that are to be set for this component:

FileName: ((String)globalMap.get("tFileList_1_CURRENT_FILEPATH")). This global variable gets file path of attendance file that is being processed.

Row Separator: "\n" -à row separator is new line in attendance file

Field Separator: "\t"  --à Field separator tab for each of the fields in attendance file

Header: 3  --à to read from 3rd line as attendance file has names of columns on 3rd row

Figure 9: properties to be set for tFileInputDelimited component

tMap Component: Transforms and route data from single or multiple sources to single or multiple destinations. Use this component to map input and output fields

Assign the below expression to Duration column.   This expression will remove “mins”or “min” keyword from duration column

StringHandling.RIGHT(row7.Duration,1).equals("s")?StringHandling.TRIM( StringHandling.LEFT(row7.Duration,StringHandling.LEN(row7.Duration) -5)):StringHandling.TRIM( StringHandling.LEFT(row7.Duration,StringHandling.LEN(row7.Duration) -4));

tConvertType component: Allows specific conversions at runtime from one Talend java type to another. This component convert’s duration column from string to integer as it is required in subsequent steps to sum the “Duration” column

tFileOutputDelimited Component: generates intermediate file that will have consolidated attendance records of all participants of all sessions attended. Inter.csv file will be created in same folder where config file exists by setting FileName property value

FileName : ((String)globalMap.get("tSystem_1_OUTPUT")) + "/" + "inter.csv"

Click on sync columns to get same schema from tconverttype component.

The inter.csv file created in Figure 12 will be used as an input by the tFileinputDelimiter component for further processing as shown in figure 13.

The functionality of Figure 13is to generate three excel files:

a)      Eligible candidates

b)      Ineligible due to less sessions attended than threshold set by organization

c)      Ineligible due to less duration attended than threshold set by organization on a specific date

tFileInputDelimitedcomponent: Opens a file and reads it row by row to split them up into fields then sends fields as defined in the Schema to the next Job component, via a Row link. In this Case Study this is used to read inter.csv file that is created by earlier sub job.

Figure 14: properties of tfileInputDelimiter

Set the below properties in the component shown in Figure 14

FileName : ((String)globalMap.get("tSystem_1_OUTPUT")) + "/" + "inter.csv"

Row Seperator: "\n" -à row separator is new line in attendance file

Field Seperator:";"  --à Field separator ; for each of the fields in consolidated attendance file

Header: 1  --à to read column data from 1st line as attendance file has names of columns on 3rd row

Set the schema as mentioned below

Name string 23

Email string 27

Duration integer 7

Date string 10

 tSortRow Component: Sorts input data based on one or several columns, by sort type and order

Set the properties as mentioned below

Name Asc

Email Asc

tAggregateRow_2Component: Receives a flow and aggregates it based on one or more columns. For each output line, are provided the aggregation key and the relevant result of set operations (min, max, sum...).Set the following properties for this component.In this Case Study these properties will calculate sum(duration) attended by each employee email-ID and date attended.

Group By: Email ,Date ( click on + icon to add)

Operation : Sum on Duration

 tMap 2Component:Is used to filters input rows by setting one or more conditions on the selected columns.tMap_2 has 2 outputs. One is op1 and another is ineligible.

Duration is give greater than or equal to 70%  of duration of session ( Context.Session_Duration i.e threshold for attendance as set by organization) records will be sent to op1 otherwise to ineligible output.( i.e These are ineligible candidates on specific day).

Figure 15:Mapping of input fields to respective outputs

tFileoutputExcel2component: outputs data to a XL file. In this Case study information of every participant who is ineligible on specific day is generated. Properties that are set for this component:

FileName : ((String)globalMap.get("tSystem_1_OUTPUT")) + "/" + "ineligiblecandidates_onspecificday.xlsx"

Sheet Name: “EligibleCandidates”

tAggregateRow_1Component: Receives a flow and aggregates it based on one or more columns. Set the following properties:

Group By:Email (click on + icon to add)

Operation: Sum on Duration

 tFileOutputExcel component: generates excel files.

tFileOutputExcel_3: This component generates excel file due to less sessions attended.

This will get created in same directory where config file is available.

FileName: ((String)globalMap.get("tSystem_1_OUTPUT")) + "/" + "eligiblecandidates.xlsx"

Sheet: “InEligiblecandidates”. This sheet will get created.

tFileOutputExcel_3: Generates excel file containing list of participants with less sessions attendance.This will get created in same directory where config file is available.

FileName: ((String)globalMap.get("tSystem_1_OUTPUT")) + "/" + "eligiblecandidates.xlsx"

Sheet: “Eligiblecandidates”. This sheet will get created.

tFiledelete component: Deletes a specified file. In this case study inte.csv which is consolidated file will be deleted after every run of job as there is no need to have this file after job is run.

Figure 18:Properties to be set to delete inter.csv file

Step 4: Create a build and execute Talend Job. Unzip archive and place config.txt in folder where TrainingAttendanceJob_run.bat file exists. This bat file has code that will execute job when clicked on it.When job is executed, the excel files will get generated as per case study business requirement. The output can also be viewed as a bar chart using tBarChart component

Figure 20: Table showing eligible candidates

Figure 21: Bar chart showing eligible candidates.

The output can be generated both in a tabular format and in a Bar graph. The Bar chart can be used by the middle and higher management for analysis purpose.

To summarize, Talend open Studio is Open Source tool having rich set of components that deal with reading and enumerating files for further processing.   As we have seen through the case study, the amount of time now spend on manually collating the sheets is reduced to a greater extent. Needless to say that the usage of tool makes our life easy and human errors can be eliminated.

 About the Authors

 

 

 








Added on March 23, 2017 Comment

Comments

#1

Kanta commented on March 24, 2017 at 12:19 p.m.

Talend open Studio is really good.

Post a comment