I have been using PySpark for some time now, and I thought I would share how I began learning Spark, my experiences, the problems I encountered, and how I solved them. You are more than welcome to suggest and/or request code snippets in the comments section below or on Twitter at @siaterliskonsta.
DataFrames
From Spark’s website, a DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
Reading Files
Here are a few ways to read files into a DataFrame with PySpark.
Selecting Columns
Now that you have some data in your DataFrame, you may need to select specific columns instead of the whole thing. Here is how you do it:
Filtering
GroupBy
RDDs
Reading Files
Let’s see how to read files into Spark RDDs (Resilient Distributed Datasets).