University of Tampa Python Spark Project

Description

Answer these questions using Spark code. Submit your code (in a .py file) and the answers to the questions (in a text file). The answers should use the full dataset, not the small dataset. Start with the code shown below. (Hint: for any task that asks for a max/largest value, don't use sortByKey, because it is much slower than a better option.)

1. Which day had the largest number of installed drives, and what was this number?
2. How many distinct drives (by model+serial) are installed (i.e., exist in the data) in each year?
3. What's the max drive capacity per year?

Full dataset: change the file path to file:///ssd/data/backblaze.csv (146 million rows); my solution took 17 min.

Run Spark like this (options go before the script name, otherwise they are passed to the script instead of to spark-submit):

    spark-submit --master=local[5] backblaze-spark.py

Or, to hide log messages:

    spark-submit --master=local[5] backblaze-spark.py 2> /dev/null

Look at /home/jeckroth/cinf201/2022-spring/spark/backblaze.py for some more example code.

Starting code, with some examples that you can remove:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Backblaze").getOrCreate()

schema = "day DATE, serial STRING, model STRING, capacity LONG, failure INTEGER"
d = spark.read.schema(schema).load("file:///home/jeckroth/cinf201/spark/assignment/small-backblaze.csv", format="csv", sep=",", header="true")
d = d.rdd

# print first 10 rows
print(d.take(10))

## How many failures occurred each year?

# make key (year) & value (failure 0/1)
d2 = d.map(lambda row: (row.day.year, row.failure))
# add up failures per year
failureCounts = d2.reduceByKey(lambda cnt, rowcnt: cnt + rowcnt)
print(failureCounts.collect())

## Which model (not serial number) has the most failures overall?

# grab model & failure from data, model is the key
d3 = d.map(lambda row: (row.model, row.failure))
# count failures for that model; result so far: [(modelX, 55), (modelY, 2100)]
d3 = d3.reduceByKey(lambda cnt, rowcnt: cnt + rowcnt)
# flip keys and values; result so far: [(55, modelX), (2100, modelY)]
d3 = d3.map(lambda pair: (pair[1], pair[0]))
# sort by failure count (now the key, first in the pair), largest first
d3 = d3.sortByKey(ascending=False) ### NOT EFFICIENT TECHNIQUE
print(d3.collect())
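
The hint about avoiding sortByKey can be illustrated with a minimal sketch: finding a single largest value does not need a full sort, because a pairwise reduction can keep the larger of two pairs at each step. The names `larger_count` and `most_failures` below are hypothetical helpers, not part of the starting code, and the sample totals are made up for illustration. This is one possible approach, not necessarily the one the instructor has in mind.

```python
from functools import reduce
from operator import add


def larger_count(a, b):
    """Return whichever (key, count) pair has the larger count."""
    return a if a[1] >= b[1] else b


def most_failures(rdd):
    """Hypothetical helper. Assumes rdd holds (model, failure) pairs,
    like d.map(lambda row: (row.model, row.failure)) in the starting code."""
    totals = rdd.reduceByKey(add)        # (model, total failures)
    return totals.reduce(larger_count)   # one pass over the totals, no sort


# Plain-Python check of the reducer on made-up totals (not real Backblaze data).
sample_totals = [("modelX", 55), ("modelY", 2100), ("modelZ", 7)]
best = reduce(larger_count, sample_totals)
print(best)  # ('modelY', 2100)
```

With the starting code's RDD, this would be called as `most_failures(d.map(lambda row: (row.model, row.failure)))`. PySpark's `RDD.max` also accepts a `key` function, so `totals.max(key=lambda pair: pair[1])` is an equivalent way to write the last step.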


