Data Cleansing Pitfalls

by Global Employees Software Development Company
Data scrubbing, or cleansing, is the process of detecting and removing errors and inconsistencies from information to improve data quality, and it is often carried out by a data entry expert. The need for data cleansing increases significantly when several data sources are integrated. The process of making data consistent and accurate is riddled with issues, a few of which are discussed below:

High Data Volume

Applications like data warehouses continually load massive amounts of information from a number of sources, and along with it they carry a substantial quantity of dirty data (errors). In such cases, data cleansing becomes both formidable and significant.

Misspellings

These occur mainly because of typing errors. Misspellings of common words and grammatical errors can be detected and corrected, but because databases contain massive amounts of unique data, it is difficult to detect spelling errors at input level. Furthermore, spelling errors in data such as names and addresses are always hard to find and fix.
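One common way to catch misspellings in fields with a limited set of valid values is to compare each entry against a reference list and suggest the closest match. The sketch below uses Python's standard `difflib` module; the city list, the function name, and the 0.8 similarity cutoff are assumptions chosen for illustration, not part of any particular cleansing tool.

```python
import difflib

# Hypothetical reference list of valid city names (an assumption for this sketch).
KNOWN_CITIES = ["Sacramento", "San Francisco", "Los Angeles", "San Diego"]

def suggest_correction(value, vocabulary=KNOWN_CITIES, cutoff=0.8):
    """Return the closest known spelling, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(suggest_correction("Sacremento"))  # Sacramento
print(suggest_correction("Berlin"))      # None
```

This only works where a reference vocabulary exists. For free-form names and addresses there is usually no such list, which is why those errors remain hard to fix.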

Lexical Errors

Such errors are discrepancies between the specified format and the actual structure of the data items. For instance, a certain database records attributes for name, age, sex, and height. When a person omits an intermediate value (such as age), the values for the following attributes shift into the wrong fields. In this example, if no value is entered for age, the value for sex (say, male) is read as the age, and the value for height is read as the sex.
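A simple first check for this kind of error is to count the fields in each record and flag rows that do not match the expected layout. The sketch below assumes a hypothetical comma-separated layout of name, age, sex, height; the schema and function name are illustrative only.

```python
import csv
import io

# Assumed record layout for this sketch.
EXPECTED_FIELDS = ["name", "age", "sex", "height"]

def find_malformed_rows(text):
    """Return (line_number, row) pairs whose field count differs from the schema."""
    reader = csv.reader(io.StringIO(text))
    return [
        (number, row)
        for number, row in enumerate(reader, start=1)
        if len(row) != len(EXPECTED_FIELDS)
    ]

# The second record is missing the age, so every later value has shifted left.
sample = "Alice,34,female,165\nBob,male,180\n"
print(find_malformed_rows(sample))  # [(2, ['Bob', 'male', '180'])]
```

A field count only detects that something is missing, not which field; deciding that requires type or range checks on the individual values.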

Misfielded Value

This problem arises when an entered value is correct as far as format goes, yet does not belong in the field. For instance, the value Germany is recorded in the city field.
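Because a misfielded value is well formed, format checks will not catch it; one possible approach is to test the value against reference data for other fields. The sketch below flags a city entry that is actually a known country. The country set is a tiny hypothetical sample, and the function name is an assumption.

```python
# Hypothetical reference set; a real check would use a complete country list.
COUNTRIES = {"Germany", "France", "United States"}

def is_country_in_city_field(value):
    """Return True if the city value matches a known country name."""
    return value.strip().title() in COUNTRIES

print(is_country_in_city_field("Germany"))  # True
print(is_country_in_city_field("Berlin"))   # False
```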

Domain Format Errors

Such errors arise when the value for a specific attribute is correct, yet does not comply with the domain format. For instance, a NAME field may require the first name and surname to be separated by a comma, yet the input has no comma. In this case, the input may be right, yet it does not comply with the domain format.
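Domain format rules like the comma requirement can often be expressed as a regular expression. The sketch below assumes the rule is "exactly one comma with non-empty text on both sides"; the pattern and function name are illustrative and would need adjusting for a real domain format.

```python
import re

# Assumed rule: two non-empty parts separated by exactly one comma.
NAME_PATTERN = re.compile(r"[^,]+,\s*[^,]+")

def has_valid_name_format(value):
    """Return True if the value has the form 'part, part'."""
    return NAME_PATTERN.fullmatch(value.strip()) is not None

print(has_valid_name_format("John, Smith"))  # True
print(has_valid_name_format("John Smith"))   # False
```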

Irregularities

These deal with non-uniform usage of values or units. For instance, when employee salaries are entered, they are recorded in various currencies. Such information requires subjective interpretation and can often lead to incorrect results.
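When the unit is recorded alongside the value, irregularities can be removed by converting everything to one unit. The sketch below converts salaries to US dollars; the exchange rates are made-up placeholders, and the function name is an assumption. Real rates would have to come from a trusted source.

```python
from decimal import Decimal

# Illustrative, made-up exchange rates to USD (not real market rates).
RATES_TO_USD = {
    "USD": Decimal("1"),
    "EUR": Decimal("1.10"),
    "INR": Decimal("0.012"),
}

def to_usd(amount, currency):
    """Convert an amount in the given currency to USD, rounded to cents."""
    try:
        rate = RATES_TO_USD[currency.upper()]
    except KeyError:
        raise ValueError(f"unknown currency: {currency!r}")
    return (Decimal(str(amount)) * rate).quantize(Decimal("0.01"))

print(to_usd(50000, "EUR"))  # 55000.00
```

If the currency was never recorded, no conversion is possible, and the value needs the subjective interpretation described above.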

Missing Values

These arise from omissions that occur when the data is gathered, and they signify that a value was unavailable during data entry. Missing values include both null values and dummy values, for instance 999-9999 and 000-0000 in a phone number field.
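Dummy values are harder to spot than nulls because they look like real data, so a check usually needs an explicit list of known placeholders. The sketch below treats nulls, empty strings, and the two dummy phone numbers mentioned above as missing; the function name and the placeholder set are assumptions for illustration.

```python
# Known dummy values for the phone number field (taken from the examples above).
DUMMY_PHONES = {"999-9999", "000-0000"}

def is_missing_phone(value):
    """Return True for null, empty, or known placeholder phone numbers."""
    if value is None:
        return True
    text = str(value).strip()
    return text == "" or text in DUMMY_PHONES

phones = ["555-0142", None, "999-9999", "  ", "000-0000"]
print([is_missing_phone(p) for p in phones])  # [False, True, True, True, True]
```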

Contradiction

This error arises when the same real-world entity is described by different data values. For instance, a personal database may hold two records for the same person that have different birth dates, while all other values are the same.
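Contradictions of this kind can be found by grouping records on the fields that identify a person and checking whether another field takes more than one value within a group. The sketch below assumes records are dictionaries and that name plus email identify a person; the field names and function name are hypothetical.

```python
from collections import defaultdict

def find_contradictions(records, key_fields=("name", "email"), check_field="birth_date"):
    """Map each identity key to its conflicting values for check_field."""
    seen = defaultdict(set)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        seen[key].add(record[check_field])
    # Keep only identities that carry more than one distinct value.
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}

people = [
    {"name": "Ann Lee", "email": "ann@example.com", "birth_date": "1990-04-12"},
    {"name": "Ann Lee", "email": "ann@example.com", "birth_date": "1990-12-04"},
    {"name": "Raj Rao", "email": "raj@example.com", "birth_date": "1985-07-30"},
]
print(find_contradictions(people))
```

The check reports that a conflict exists; deciding which birth date is correct still needs an outside source.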

Cryptic Values and Abbreviations

These involve cryptic values and abbreviations in fields, which a data entry expert often encounters. For instance, only the initials of a college are entered rather than its full name. This type of error increases the odds of duplication and makes the data harder to sort.
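Where the abbreviations in use are known, a lookup table can expand them to a single canonical form so that duplicates collapse and sorting behaves. The sketch below uses a small hypothetical table of college initials; the table contents and function name are assumptions.

```python
# Hypothetical lookup table of known abbreviations.
ABBREVIATIONS = {
    "MIT": "Massachusetts Institute of Technology",
    "UCLA": "University of California, Los Angeles",
}

def expand_abbreviation(value):
    """Return the full name for a known abbreviation, else the value unchanged."""
    key = value.strip().replace(".", "").upper()
    return ABBREVIATIONS.get(key, value)

print(expand_abbreviation("M.I.T."))            # Massachusetts Institute of Technology
print(expand_abbreviation("Stanford University"))  # Stanford University
```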


