Vivek K. Singh and Amish Choudhary
Open Source Evangelist
Data Quality is a proven technology for Data enrichment and record linking. We are extending Data Quality and Analytics techniques to delve into student’s demographic and teacher’s engagement data for Indian education system. In this paper, we are proposing a process of collecting education data and do data quality using open source data quality tool (osDQ) also called “Aggregate Profiler”: http://sourceforge.net/projects/dataquality/. Record linking techniques are used to define correlation for engagement and demographic analysis.
In the paper “Integrated Data Quality (DQ) along Data Life cycle”, author proposed a model for Data life cycle consisting of eight Phases. In this paper we applied the data life cycle phases to Indian education system using Open Source Data Quality and Profiling tool. The tool provides capability for data cleaning and data analysis to get correlation of data. While the process allows deriving various inferences based on data, we are using the process to correlate the impact on student grades of the commute distance and schedule load of teachers.
During our visit to various schools and parent interviews we felt the need to analyze the student’s demographic data and teacher’s engagement data to see the correlation on students grades of the time spend during the commute and teacher’s calendar. Indian education system is one of the most cumbersome due to sheer volume of data, non-uniform standard for collecting and storing the data. Hence we are proposing using Data Management Framework to cleanse and analyze the data. The tool provide
• Big data support – HIVE thrift server support
• Format Creation, Format Matching ( Phone, Date, String and Number), Format standardization
• Fuzzy Logic based similarity check, Cardinality check between tables and files
• Export and import from XML, XLS or CSV format, PDF export
• File Analysis, Regex search, Standardization, DB search
• Complete DB Scan, SQL interface, Data Dictionary, Schema Comparison
• Statistical Analysis, Reporting ( dimension and measure based)
• Pattern Matching , DeDuplication, Case matching, Basket Analysis, Distribution Chart
• Data generation and Data masking features
• Meta Data Information, Reverse engineering of Data Model
• Timeliness analysis , String length analysis
• Mysql, Oracle,Postgres,MS Access, Db2, SQL Server certified
Data management framework defines eight phases as shown as Rim in Figure 1 Data Management Framework Process.
3.1 Data Definition Phase
We started by profiling the data. Figure 2 shows the profile of data like Record count, Pattern information and Distribution chart using Aggregate profiler tool. We defined the data nomenclature and entity identification by doing semantic validation using tools Fuzzy logic based similarity check, standardization and Case formatting feature. The following entities were identified: Students, Teachers, Grade, Teacher’s calendar, Time, Student home pin code, School’s pin code.
3.2 Data Capture Phase
During data capture phase we used Aggregate Profiler tool’s Mysql JDBC connectors to get Student data. Teacher calendar data was imported from spreadsheet using Import file option using First row column name option. From the captured data we identified relevant Entity Attributes to be captured and discarded. Validation of data sources was done for authenticity of data.
3.3 Data Validation and Formatting Phase
Most of data cleaning was done during this phase. We decided to use the tool feature to provide database comparison. Also we defined the range rules to find outliners like too large distance or students in hostels. Timeliness of data was used to weed out old and stale data like teacher’s calendar.
3.4 Data Movement and Sharing Phase
During this phase each department was treated as separate domains. Data that was shared across department boundary like student identification was also identified. Similarly reference data attributes that can change was identifed during this phase and migration plan were put in place. For e.g. student can change residence during or between the sessions.
Record linking was done using fuzzy logic based similarity check (Figure 3) we massaged demographic data to provide uniform data – like locations “Electronic City” and “ECity” where mapped to same location/ Also Name Correlation was done using same logic – Carol and Carole were mapped to same teacher.
3.5 Data Modeling and Storage Phase
After cleaning the data and applying various filtering, validation and formatting rules, relevant data pertaining to student and school demographics, student grade and teachers schedule calendar was stored in relevant demograhic tables in Mysql database.
Figure 4 Student Entity Data
3.6 Data Archival Phase
We did not archive any information
3.7 Data Retrieval Phase
During this phase we provided Data access layer using Aggregate profiler’s JDBC connectors Figure 5. Tools provide connector to several RDBMS and Hadoop Hive. We have stored out data in mysql so we used mysql connector. If dataset is very large (Volume, Variety and Velocity) Hive can be used. RDBMS was faster than big data.
Figure 5 Data Connectors
3.8 Data Presentation Phase
During this phase we did data analysis to determine correlation between the data. Figure 6 shows the bar chart of Average grade of student as a function of commuting distance
Applying the data management framework we found a strong correlation between commuting distance and student grades Figure 6. However there was a weak correlation between teacher’s load and student grades Figure 7.
Figure 7 Average Grades vs. Teaching Hours
The strong correlation between commuting distance and student grade can be explained due to student fatigue factor and amount of time Indian parents get to teach the students at home.
However in Average Grade vs. Teaching hours spend by teachers a lot of other factors like subject taught and skill level of teachers also comes into play.
The data management framework and Aggregate Profiler Data Quality tool allows the data quality to be applied across multiple domains for cleaning, analysing and presenting data across different domains. It can be used by education institutions to do meaningful analysis of data by focusing on subset of data and identify area of improvents.
 Singh, Vivek. K. 2012. Integrated Data Quality along Data Life Cycle. SiliconIndia (http://bi.siliconindia.com/bi-expert/Integrated-Data-Quality-DQ-along-Data-Life-Cycle-Part-1-eid-423.html)
 Gibson, Neal and Talburt, John 2012. Hive: Crowdsourcing education data. In Journal of Computing Sciences in Colleges Volume 25 Issue 5 (May 2010), 72-78.