California State University Cleaning and Profiling Code Worksheet


Cleaning and Profiling Code
Use only Hadoop MapReduce in this part of your project. Do not use anything else.
You must write and submit 2 separate MapReduce jobs:

MR Job 1. Data profiling: to explore your data
- Name the files: , , (use these exact names for your classes)
- This MR job counts the number of records in a dataset.
- Run it on the original dataset, before cleaning, and output the number of records.
- Run it on the cleaned dataset (the result of MR Job 2, described below) and output the number of records. If the two record counts don't match, you should figure out why.
- Re-submit a schema if it has changed.

MR Job 2. Data cleaning: to avoid nasty exceptions later on in your analytic
- Name the files: , , (use these exact names for your classes)
- This MR job cleans the data, for example by dropping columns you don't need.
- It should write out a new file with only the columns you will use in your analytic.
- The selected columns for your data schema

For full credit, provide the classes for each job.
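
The logic of the two jobs can be sketched in plain Python before writing the Hadoop classes. This is only an illustration of the map and reduce steps, not Hadoop code: the function names, the sample rows, the comma delimiter, and the choice of columns to keep are all assumptions made for the example, and the assignment itself still requires Hadoop MapReduce.

```python
from collections import defaultdict

def count_mapper(line):
    """Map step of the profiling job: emit one (key, 1) pair per record."""
    yield ("records", 1)

def count_reducer(key, values):
    """Reduce step: sum the 1s to get the total number of records."""
    return key, sum(values)

def clean_mapper(line, keep=(0, 2), delimiter=","):
    """Map-only cleaning step: keep the selected columns, drop malformed rows."""
    fields = line.split(delimiter)
    if len(fields) <= max(keep):
        return  # too few columns: skip the record
    selected = [fields[i].strip() for i in keep]
    if any(value == "" for value in selected):
        return  # a column we need is blank: skip the record
    yield delimiter.join(selected)

def run_count(lines):
    """Local stand-in for the shuffle that groups map output by key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in count_mapper(line):
            grouped[key].append(value)
    return dict(count_reducer(k, v) for k, v in grouped.items())

# Hypothetical sample data: id, state, year
raw = ["1,NY,2020", "2,,2021", "3,CA", "4,TX,2019"]
cleaned = [out for line in raw for out in clean_mapper(line)]

print(run_count(raw))      # record count before cleaning
print(run_count(cleaned))  # record count after cleaning
```

In this sample the two counts differ because one row has too few columns and is dropped by the cleaning step, which is the kind of explanation the worksheet asks for when the counts don't match.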


Data Profiling
Data Cleaning
Data Profiling
Data profiling helps you discover, understand and organize your data.
Data profiling helps cover the basics with your data, verifying that the information in
your tables matches the descriptions.
For example, a state column might use a combination of both two-letter codes
and the fully spelled out (sometimes incorrectly) name of the state. Data
profiling would uncover this inconsistency and inform the creation of a
standardization rule that could make them all consistent, two-letter codes.
Sometimes the data profiling process leads you to conclude that your dataset is unusable.
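
The state-column example above could be profiled roughly as follows. This is a minimal sketch in plain Python: the category names, the regular expressions, and the sample values are assumptions chosen for illustration.

```python
import re
from collections import Counter

def state_format(value):
    """Classify one state value by its shape."""
    text = value.strip()
    if not text:
        return "blank"
    if re.fullmatch(r"[A-Z]{2}", text):
        return "two-letter code"
    if re.fullmatch(r"[A-Za-z .]+", text):
        return "spelled out or other text"
    return "unexpected"

# Hypothetical column values, including a misspelling and a blank
states = ["CA", "California", "NY", "Califronia", "", "TX", "N.Y."]
profile = Counter(state_format(s) for s in states)
print(profile)
```

A mixed result like this one is the signal that a standardization rule is needed before the column can be used.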
Why profile data?
Data profiling allows you to answer the following questions about your data:

• Is the data complete? Are there blank or null values?
• Is the data unique? How many distinct values are there? Is the data duplicated?
• Are there anomalous patterns in your data? What is the distribution of patterns in your data? Are these the patterns you expect?
• What range of values exists, and is it expected? What are the maximum, minimum, and average values for given data? Are these the ranges you expect?
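
Several of these questions can be answered with a few counts per column. The sketch below is one possible way to do that in plain Python; the function name, the dictionary keys, and the sample values are assumptions for illustration.

```python
def profile_column(values):
    """Answer completeness, uniqueness, and range questions for one column."""
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    numeric = []
    for v in non_null:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            pass  # not a number: ignore it for the range statistics
    return {
        "records": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "duplicates": len(non_null) - len(set(non_null)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": sum(numeric) / len(numeric) if numeric else None,
    }

# Hypothetical age column with blanks, a duplicate, and a suspicious value
ages = ["34", "", "27", "34", None, "210"]
report = profile_column(ages)
print(report)
```

Here the maximum of 210 is outside the range you would expect for an age, which is the kind of finding the last question is meant to surface.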
Structure discovery, also known as structure analysis, validates that the data that you have is
consistent and formatted correctly. There are several different processes that you can use for this,
such as pattern matching.
For example, if you have a data set of phone numbers, pattern matching helps you find the
valid sets of formats within the data set.
Pattern matching also helps you understand whether a field is text- or number-based along with
other format-specific information.
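
One common way to do this kind of pattern matching is to reduce each value to its shape and count the shapes. The sketch below assumes a convention where every digit becomes 9 and every letter becomes A; the function name and the sample phone numbers are made up for the example.

```python
import re
from collections import Counter

def shape(value):
    """Generalize a value: digits become 9, letters become A, the rest is kept."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))

# Hypothetical phone column
phones = ["212-555-0101", "(212) 555-0102", "2125550103", "212-555-0104", "call me"]
patterns = Counter(shape(p) for p in phones)

for pattern, count in patterns.most_common():
    print(pattern, count)
```

Shapes made only of 9s and punctuation suggest a number-based field, while shapes containing A point to text that may not belong in the column.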
Structure discovery also examines simple basic statistics in the data. By using statistics like the
minimum and maximum values, means, medians, modes and standard deviations, you can gain
insight into the validity of the data.
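
As a small illustration, Python's standard `statistics` module can compute these values for a numeric column. The sample values are made up, and `stdev` here is the sample standard deviation (use `pstdev` if the data is the whole population).

```python
import statistics

# Hypothetical numeric column with one value far from the rest
values = [12.0, 15.0, 15.0, 18.0, 95.0]

summary = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "stdev": statistics.stdev(values),
}
print(summary)
```

A mean that sits far above the median, as it does here, hints that a few large values are pulling the average and deserve a closer look.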
Content discovery is the process of looking more closely into the individual elements of the database
to check data quality. This can help you find areas that contain null values or values that are incorrect
or ambiguous.
Many data management tasks start with an accounting for all the inconsistent and ambiguous entries
in your data sets.
For example, finding street addresses that do not fit the correct format and correcting them is an
essential part of this step. The problems that can arise from non-standard data, like being
unable to reach customers via mail because the data set includes incorrectly formatted addresses, are
costly, and they can be addressed early in the data management process.
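
A content check for the address example might look like the sketch below. The address rule is a hypothetical one (house number, street name, then one of a few suffixes) chosen only to show the idea; a real rule would depend on your data.

```python
import re

# Hypothetical rule: house number, street name, then a known suffix
ADDRESS_RULE = re.compile(r"^\d+\s+[A-Za-z0-9 .]+\s(St|Ave|Blvd|Rd|Dr)\.?$")

def check_addresses(addresses):
    """Return (row, problem) pairs for missing or non-standard addresses."""
    problems = []
    for row, address in enumerate(addresses):
        text = (address or "").strip()
        if not text:
            problems.append((row, "missing"))
        elif not ADDRESS_RULE.match(text):
            problems.append((row, "non-standard"))
    return problems

# Hypothetical address column
addresses = ["12 Main St", "", "Main Street 12", "400 Ocean Ave.", None]
problems = check_addresses(addresses)
print(problems)
```

Listing the row positions along with the problem makes it easier to inspect the offending entries before deciding how to correct them.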
Data Profiling – Statistics Gathering
Attribute Level (data row level) profiling
• All Data Types
• Null Count, Null Percentage: number and/or percentage of records with a null
• Mode: most frequent value
• PatternCount: number of distinct patterns observed; for example, mm/dd/yyyy or 999999.99
• Datatype observed always (or almost always) in the column
• Length of data in the column
• Uniqueness
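
Some of the statistics in this list can be gathered for one column as in the sketch below. The function names and the sample zip codes are assumptions, and the sketch assumes the column holds strings and has at least one non-blank value.

```python
from collections import Counter

def observed_type(value):
    """Guess the datatype of one string value."""
    for caster, name in ((int, "integer"), (float, "decimal")):
        try:
            caster(value)
            return name
        except ValueError:
            continue
    return "text"

def attribute_profile(values):
    """Gather null count, mode, observed datatype, length, and uniqueness."""
    present = [v for v in values if v.strip() != ""]
    types = Counter(observed_type(v) for v in present)
    lengths = [len(v) for v in present]
    return {
        "null_count": len(values) - len(present),
        "null_pct": 100.0 * (len(values) - len(present)) / len(values),
        "mode": Counter(present).most_common(1)[0][0],
        "datatype": types.most_common(1)[0][0],
        "min_length": min(lengths),
        "max_length": max(lengths),
        "unique": len(set(present)) == len(present),
    }

# Hypothetical zip code column
zips = ["94103", "10001", "", "94103", "1000A"]
stats = attribute_profile(zips)
print(stats)
```

The datatype reported is the one observed most often, so a column that is "almost always" an integer still shows up as one even when a few values are not.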
• Numeric Data Types
• Mean
• Median
• Precision
• Standard Deviation
For fields with non-unique data, the frequency distribution
(group-by) can yield very interesting results
• Can be compared with allowed values
• Frequent and infrequent values should be studied
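
A frequency distribution and its comparison with allowed values can be sketched as follows. The allowed set and the sample values are hypothetical.

```python
from collections import Counter

ALLOWED = {"active", "inactive", "pending"}  # hypothetical allowed values

# Hypothetical status column
statuses = ["active", "active", "inactive", "actve", "pending", "active", "ACTIVE"]

frequency = Counter(statuses)  # the group-by: value -> number of occurrences
not_allowed = {value: n for value, n in frequency.items() if value not in ALLOWED}

print(frequency.most_common())
print(not_allowed)
```

The infrequent values are often the interesting ones: here they turn out to be a typo and a capitalization variant of a frequent value.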
Data Cleaning
Finding incorrect records in a dataset and removing or replacing them with
clean data.
The complete data cleaning process can be broken down into two broad
data cleaning steps:
1. Identify and fill in missing values.
2. Correct existing data.
• Remove entries that have letters or other non-numeric values where there should be only numbers (such as zip codes and phone numbers), and entries with invalid characters (like @ or ' symbols in names or physical addresses).
• Fix missing values.
• Fix unwanted values that do not fit in the dataset.
• Be mindful of outliers; in some cases, outliers that are suspicious and do not make sense are removed.
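
These rules can be sketched as a single cleaning function. The field names, the five-digit zip rule, the set of characters accepted in a name, and the default used to fill a missing city are all assumptions made for the example.

```python
import re

def clean_record(record, default_city="UNKNOWN"):
    """Return a cleaned copy of the record, or None if it should be dropped."""
    zip_code = record.get("zip", "").strip()
    name = record.get("name", "").strip()
    if not re.fullmatch(r"\d{5}", zip_code):
        return None  # letters or wrong length where only digits belong
    if not re.fullmatch(r"[A-Za-z .-]+", name):
        return None  # invalid characters such as @ or ', or a blank name
    city = record.get("city", "").strip() or default_city  # fill a missing value
    return {"name": name, "zip": zip_code, "city": city}

# Hypothetical records
records = [
    {"name": "Ana Lopez", "zip": "94103", "city": "San Francisco"},
    {"name": "Bob@home", "zip": "10001", "city": "New York"},
    {"name": "Cy Young", "zip": "1000A", "city": "Boston"},
    {"name": "Dee Park", "zip": "60601", "city": ""},
]
kept = [c for c in map(clean_record, records) if c is not None]
print(kept)
```

Whether to drop a record or repair it is a judgment call; this sketch drops records with invalid zip codes or names and fills in a missing city.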
Datasets – for team organization
All datasets should be “brought together”
Once you have your data profiled and cleaned you work together to produce your
You might want to bring in another dataset
MapReduce *can* be used to bring your datasets together, but it is not required for
the project
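
One way MapReduce can bring two datasets together is a reduce-side join: each mapper tags its records with the dataset they came from, and the reducer pairs up records that share a key. The plain-Python sketch below only imitates that flow; the function names, the tags "A" and "B", and the sample rows are assumptions.

```python
from collections import defaultdict

def join_mapper(source, line, key_index=0):
    """Emit (join key, (source tag, remaining fields)) for one record."""
    fields = line.split(",")
    rest = fields[:key_index] + fields[key_index + 1:]
    yield fields[key_index], (source, rest)

def join_reducer(key, tagged_values):
    """Pair every record from dataset A with every record from dataset B."""
    left = [v for tag, v in tagged_values if tag == "A"]
    right = [v for tag, v in tagged_values if tag == "B"]
    for a in left:
        for b in right:
            yield [key] + a + b

def run_join(dataset_a, dataset_b):
    """Local stand-in for the shuffle that groups map output by key."""
    grouped = defaultdict(list)
    for source, lines in (("A", dataset_a), ("B", dataset_b)):
        for line in lines:
            for key, value in join_mapper(source, line):
                grouped[key].append(value)
    return [row for key in sorted(grouped) for row in join_reducer(key, grouped[key])]

# Hypothetical datasets keyed by state code
dataset_a = ["CA,100", "NY,200"]
dataset_b = ["CA,Sacramento", "TX,Austin"]
joined = run_join(dataset_a, dataset_b)
print(joined)
```

This behaves like an inner join, so keys that appear in only one dataset produce no output rows.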
