摇钱树网咖计费软件可以用在学校机房常用软件呢

点击联系发帖人 时间：2018-01-25 08:31

学校机房常用软件

Spark SQL and DataFrames - Spark 2.1.1 Documentation
Spark SQL, DataFrames and Datasets Guide
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided
by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally,
Spark SQL uses this extra information to perform extra optimizations. There are several ways to
interact with Spark SQL including SQL and the Dataset API. When computing a result
the same execution engine is used, independent of which API/language you are using to express the
computation. This unification means that developers can easily switch back and forth between
different APIs based on which provides the most natural way to express a given transformation.
All of the examples on this page use sample data included in the Spark distribution and can be run in
the spark-shell, pyspark shell, or sparkR shell.
One use of Spark SQL is to execute SQL queries.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to
configure this feature, please refer to the
section. When running
SQL from within another programming language the results will be returned as a .
You can also interact with the SQL interface using the
Datasets and DataFrames
A Dataset is a distributed collection of data.
Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong
typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized
execution engine. A Dataset can be
from JVM objects and then
manipulated using functional transformations (map, flatMap, filter, etc.).
The Dataset API is available in
. Python does not have the support for the Dataset API. But due to Python’s dynamic nature,
many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally
row.columnName). The case for R is similar.
A DataFrame is a Dataset organized into named columns. It is conceptually
equivalent to a table in a relational database or a data frame in R/Python, but with richer
optimizations under the hood. DataFrames can be constructed from a wide array of
as: structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala,
Java, , and .
In Scala and Java, a DataFrame is represented by a Dataset of Rows.
In , DataFrame is simply a type alias of Dataset[Row].
While, in , users need to use Dataset&Row& to represent a DataFrame.
Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.
Getting Started
Starting Point: SparkSession
The entry point into all functionality in Spark is the
class. To create a basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.appName(&Spark SQL basic example&)
.config(&spark.some.config.option&, &some-value&)
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
The entry point into all functionality in Spark is the
class. To create a basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession;
SparkSession spark = SparkSession
.builder()
.appName(&Java Spark SQL basic example&)
.config(&spark.some.config.option&, &some-value&)
.getOrCreate();
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
The entry point into all functionality in Spark is the
class. To create a basic SparkSession, just use SparkSession.builder:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName(&Python Spark SQL basic example&) \
.config(&spark.some.config.option&, &some-value&) \
.getOrCreate()
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
The entry point into all functionality in Spark is the
class. To initialize a basic SparkSession, just call sparkR.session():
sparkR.session(appName = &R Spark SQL basic example&, sparkConfig = list(spark.some.config.option = &some-value&))
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Note that when invoked for the first time, sparkR.session() initializes a global SparkSession singleton instance, and always returns a reference to this instance for successive invocations. In this way, users only need to initialize the SparkSession once, then SparkR functions like read.df will be able to access this global instance implicitly, and users don’t need to pass the SparkSession instance around.
SparkSession in Spark 2.0 provides builtin support for Hive features including the ability to
write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables.
To use these features, you do not need to have an existing Hive setup.
Creating DataFrames
With a SparkSession, applications can create DataFrames from an ,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
val df = spark.read.json(&examples/src/main/resources/people.json&)
// Displays the content of the DataFrame to stdout
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
With a SparkSession, applications can create DataFrames from an ,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset&Row& df = spark.read().json(&examples/src/main/resources/people.json&);
// Displays the content of the DataFrame to stdout
df.show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
With a SparkSession, applications can create DataFrames from an ,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
# spark is an existing SparkSession
df = spark.read.json(&examples/src/main/resources/people.json&)
# Displays the content of the DataFrame to stdout
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
With a SparkSession, applications can create DataFrames from a local R data.frame,
from a Hive table, or from .
As an example, the following creates a DataFrame based on the content of a JSON file:
df &- read.json(&examples/src/main/resources/people.json&)
# Displays the content of the DataFrame
NA Michael
# Another method to print the first few rows and optionally truncate the printing of long values
showDF(df)
## +----+-------+
## +----+-------+
## |null|Michael|
19| Justin|
## +----+-------+
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Untyped Dataset Operations (aka DataFrame Operations)
DataFrames provide a domain-specific language for structured data manipulation in , ,
As mentioned above, in Spark 2.0, DataFrames are just Dataset of Rows in Scala and Java API. These operations are also referred as “untyped transformations” in contrast to “typed transformations” come with strongly typed Scala/Java Datasets.
Here we include some basic examples of structured data processing using Datasets:
// This import is needed to use the $-notation
import spark.implicits._
// Print the schema in a tree format
df.printSchema()
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the &name& column
df.select(&name&).show()
// +-------+
// +-------+
// |Michael|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select($&name&, $&age& + 1).show()
// +-------+---------+
name|(age + 1)|
// +-------+---------+
// |Michael|
// | Justin|
// +-------+---------+
// Select people older than 21
df.filter($&age& & 21).show()
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+
// Count people by age
df.groupBy(&age&).count().show()
// +----+-----+
// | age|count|
// +----+-----+
// +----+-----+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
For a complete list of the types of operations that can be performed on a Dataset refer to the .
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
// col(&...&) is preferable to df.col(&...&)
import static org.apache.spark.sql.functions.col;
// Print the schema in a tree format
df.printSchema();
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// Select only the &name& column
df.select(&name&).show();
// +-------+
// +-------+
// |Michael|
// | Justin|
// +-------+
// Select everybody, but increment the age by 1
df.select(col(&name&), col(&age&).plus(1)).show();
// +-------+---------+
name|(age + 1)|
// +-------+---------+
// |Michael|
// | Justin|
// +-------+---------+
// Select people older than 21
df.filter(col(&age&).gt(21)).show();
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+
// Count people by age
df.groupBy(&age&).count().show();
// +----+-----+
// | age|count|
// +----+-----+
// +----+-----+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
For a complete list of the types of operations that can be performed on a Dataset refer to the .
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
In Python it’s possible to access a DataFrame’s columns either by attribute
(df.age) or by indexing (df['age']). While the former is convenient for
interactive data exploration, users are highly encouraged to use the
latter form, which is future proof and won’t break with column names that
are also attributes on the DataFrame class.
# spark, df are from the previous example
# Print the schema in a tree format
df.printSchema()
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
# Select only the &name& column
df.select(&name&).show()
# +-------+
# +-------+
# |Michael|
# | Justin|
# +-------+
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
# +-------+---------+
name|(age + 1)|
# +-------+---------+
# |Michael|
# | Justin|
# +-------+---------+
# Select people older than 21
df.filter(df['age'] & 21).show()
# +---+----+
# |age|name|
# +---+----+
# | 30|Andy|
# +---+----+
# Count people by age
df.groupBy(&age&).count().show()
# +----+-----+
# | age|count|
# +----+-----+
# +----+-----+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
For a complete list of the types of operations that can be performed on a DataFrame refer to the .
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
# Create the DataFrame
df &- read.json(&examples/src/main/resources/people.json&)
# Show the content of the DataFrame
NA Michael
# Print the schema in a tree format
printSchema(df)
## |-- age: long (nullable = true)
## |-- name: string (nullable = true)
# Select only the &name& column
head(select(df, &name&))
## 1 Michael
# Select everybody, but increment the age by 1
head(select(df, df$name, df$age + 1))
name (age + 1.0)
## 1 Michael
# Select people older than 21
head(where(df, df$age & 21))
# Count people by age
head(count(groupBy(df, &age&)))
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
For a complete list of the types of operations that can be performed on a DataFrame refer to the .
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the .
Running SQL Queries Programmatically
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView(&people&)
val sqlDF = spark.sql(&SELECT * FROM people&)
sqlDF.show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a Dataset&Row&.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView(&people&);
Dataset&Row& sqlDF = spark.sql(&SELECT * FROM people&);
sqlDF.show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView(&people&)
sqlDF = spark.sql(&SELECT * FROM people&)
sqlDF.show()
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
The sql function enables applications to run SQL queries programmatically and returns the result as a SparkDataFrame.
df &- sql(&SELECT * FROM table&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Global Temporary View
Temporary views in Spark SQL are session-scoped and will disappear if the session that creates it
terminates. If you want to have a temporary view that is shared among all sessions and keep alive
until the Spark application terminates, you can create a global temporary view. Global temporary
view is tied to a system preserved database global_temp, and we must use the qualified name to
refer it, e.g. SELECT * FROM global_temp.view1.
// Register the DataFrame as a global temporary view
df.createGlobalTempView(&people&)
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql(&SELECT * FROM global_temp.people&).show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql(&SELECT * FROM global_temp.people&).show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
// Register the DataFrame as a global temporary view
df.createGlobalTempView(&people&);
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql(&SELECT * FROM global_temp.people&).show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
// Global temporary view is cross-session
spark.newSession().sql(&SELECT * FROM global_temp.people&).show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
# Register the DataFrame as a global temporary view
df.createGlobalTempView(&people&)
# Global temporary view is tied to a system preserved database `global_temp`
spark.sql(&SELECT * FROM global_temp.people&).show()
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
# Global temporary view is cross-session
spark.newSession().sql(&SELECT * FROM global_temp.people&).show()
# +----+-------+
# +----+-------+
# |null|Michael|
19| Justin|
# +----+-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
CREATE GLOBAL TEMPORARY VIEW temp_view AS SELECT a + 1, b * 2 FROM tbl
SELECT * FROM global_temp.temp_view
Creating Datasets
Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use
a specialized
to serialize the objects
for processing or transmitting over the network. While both encoders and standard serialization are
responsible for turning an object into bytes, encoders are code generated dynamically and use a format
that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
the bytes back into an object.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface
case class Person(name: String, age: Long)
// Encoders are created for case classes
val caseClassDS = Seq(Person(&Andy&, 32)).toDS()
caseClassDS.show()
// +----+---+
// |name|age|
// +----+---+
// |Andy| 32|
// +----+---+
// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
val path = &examples/src/main/resources/people.json&
val peopleDS = spark.read.json(path).as[Person]
peopleDS.show()
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
import java.util.Arrays;
import java.util.Collections;
import java.io.Serializable;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
public static class Person implements Serializable {
private String name;
private int age;
public String getName() {
return name;
public void setName(String name) {
this.name = name;
public int getAge() {
return age;
public void setAge(int age) {
this.age = age;
// Create an instance of a Bean class
Person person = new Person();
person.setName(&Andy&);
person.setAge(32);
// Encoders are created for Java beans
Encoder&Person& personEncoder = Encoders.bean(Person.class);
Dataset&Person& javaBeanDS = spark.createDataset(
Collections.singletonList(person),
personEncoder
javaBeanDS.show();
// +---+----+
// |age|name|
// +---+----+
// | 32|Andy|
// +---+----+
// Encoders for most common types are provided in class Encoders
Encoder&Integer& integerEncoder = Encoders.INT();
Dataset&Integer& primitiveDS = spark.createDataset(Arrays.asList(1, 2, 3), integerEncoder);
Dataset&Integer& transformedDS = primitiveDS.map(new MapFunction&Integer, Integer&() {
public Integer call(Integer value) throws Exception {
return value + 1;
}, integerEncoder);
transformedDS.collect(); // Returns [2, 3, 4]
// DataFrames can be converted to a Dataset by providing a class. Mapping based on name
String path = &examples/src/main/resources/people.json&;
Dataset&Person& peopleDS = spark.read().json(path).as(personEncoder);
peopleDS.show();
// +----+-------+
// +----+-------+
// |null|Michael|
19| Justin|
// +----+-------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
Interoperating with RDDs
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first
method uses reflection to infer the schema of an RDD that contains specific types of objects. This
reflection based approach leads to more concise code and works well when you already know the schema
while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct Datasets when the columns and their types are not known until runtime.
Inferring the Schema Using Reflection
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes
to a DataFrame. The case class
defines the schema of the table. The names of the arguments to the case class are read using
reflection and become the names of the columns. Case classes can also be nested or contain complex
types such as Seqs or Arrays. This RDD can be implicitly converted to a DataFrame and then be
registered as a table. Tables can be used in subsequent SQL statements.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.Encoder
// For implicit conversions from RDDs to DataFrames
import spark.implicits._
// Create an RDD of Person objects from a text file, convert it to a Dataframe
val peopleDF = spark.sparkContext
.textFile(&examples/src/main/resources/people.txt&)
.map(_.split(&,&))
.map(attributes =& Person(attributes(0), attributes(1).trim.toInt))
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView(&people&)
// SQL statements can be run by using the sql methods provided by Spark
val teenagersDF = spark.sql(&SELECT name, age FROM people WHERE age BETWEEN 13 AND 19&)
// The columns of a row in the result can be accessed by field index
teenagersDF.map(teenager =& &Name: & + teenager(0)).show()
// +------------+
// +------------+
// |Name: Justin|
// +------------+
// or by field name
teenagersDF.map(teenager =& &Name: & + teenager.getAs[String](&name&)).show()
// +------------+
// +------------+
// |Name: Justin|
// +------------+
// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager =& teenager.getValuesMap[Any](List(&name&, &age&))).collect()
// Array(Map(&name& -& &Justin&, &age& -& 19))
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
Spark SQL supports automatically converting an RDD of
into a DataFrame.
The BeanInfo, obtained using reflection, defines the schema of the table. Currently, Spark SQL
does not support JavaBeans that contain Map field(s). Nested JavaBeans and List or Array
fields are supported though. You can create a JavaBean by creating a class that implements
Serializable and has getters and setters for all of its fields.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
// Create an RDD of Person objects from a text file
JavaRDD&Person& peopleRDD = spark.read()
.textFile(&examples/src/main/resources/people.txt&)
.javaRDD()
.map(new Function&String, Person&() {
public Person call(String line) throws Exception {
String[] parts = line.split(&,&);
Person person = new Person();
person.setName(parts[0]);
person.setAge(Integer.parseInt(parts[1].trim()));
return person;
// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset&Row& peopleDF = spark.createDataFrame(peopleRDD, Person.class);
// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView(&people&);
// SQL statements can be run by using the sql methods provided by spark
Dataset&Row& teenagersDF = spark.sql(&SELECT name FROM people WHERE age BETWEEN 13 AND 19&);
// The columns of a row in the result can be accessed by field index
Encoder&String& stringEncoder = Encoders.STRING();
Dataset&String& teenagerNamesByIndexDF = teenagersDF.map(new MapFunction&Row, String&() {
public String call(Row row) throws Exception {
return &Name: & + row.getString(0);
}, stringEncoder);
teenagerNamesByIndexDF.show();
// +------------+
// +------------+
// |Name: Justin|
// +------------+
// or by field name
Dataset&String& teenagerNamesByFieldDF = teenagersDF.map(new MapFunction&Row, String&() {
public String call(Row row) throws Exception {
return &Name: & + row.&String&getAs(&name&);
}, stringEncoder);
teenagerNamesByFieldDF.show();
// +------------+
// +------------+
// |Name: Justin|
// +------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by passing a list of
key/value pairs as kwargs to the Row class. The keys of this list define the column names of the table,
and the types are inferred by sampling the whole dataset, similar to the inference that is performed on JSON files.
from pyspark.sql import Row
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
lines = sc.textFile(&examples/src/main/resources/people.txt&)
parts = lines.map(lambda l: l.split(&,&))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))
# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView(&people&)
# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql(&SELECT name FROM people WHERE age &= 13 AND age &= 19&)
# The results of SQL queries are Dataframe objects.
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
teenNames = teenagers.rdd.map(lambda p: &Name: & + p.name).collect()
for name in teenNames:
print(name)
# Name: Justin
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Programmatically Specifying the Schema
When case classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed
and fields will be projected differently for different users),
a DataFrame can be created programmatically with three steps.
Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of
Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided
by SparkSession.
For example:
import org.apache.spark.sql.types._
// Create an RDD
val peopleRDD = spark.sparkContext.textFile(&examples/src/main/resources/people.txt&)
// The schema is encoded in a string
val schemaString = &name age&
// Generate the schema based on the string of schema
val fields = schemaString.split(& &)
.map(fieldName =& StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(&,&))
.map(attributes =& Row(attributes(0), attributes(1).trim))
// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView(&people&)
// SQL can be run over a temporary view created using DataFrames
val results = spark.sql(&SELECT name FROM people&)
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes =& &Name: & + attributes(0)).show()
// +-------------+
// +-------------+
// |Name: Michael|
Name: Andy|
// | Name: Justin|
// +-------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.
When JavaBean classes cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a Dataset&Row& can be created programmatically with three steps.
Create an RDD of Rows from the original RDD;
Create the schema represented by a StructType matching the structure of
Rows in the RDD created in Step 1.
Apply the schema to the RDD of Rows via createDataFrame method provided
by SparkSession.
For example:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
// Create an RDD
JavaRDD&String& peopleRDD = spark.sparkContext()
.textFile(&examples/src/main/resources/people.txt&, 1)
.toJavaRDD();
// The schema is encoded in a string
String schemaString = &name age&;
// Generate the schema based on the string of schema
List&StructField& fields = new ArrayList&&();
for (String fieldName : schemaString.split(& &)) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
StructType schema = DataTypes.createStructType(fields);
// Convert records of the RDD (people) to Rows
JavaRDD&Row& rowRDD = peopleRDD.map(new Function&String, Row&() {
public Row call(String record) throws Exception {
String[] attributes = record.split(&,&);
return RowFactory.create(attributes[0], attributes[1].trim());
// Apply the schema to the RDD
Dataset&Row& peopleDataFrame = spark.createDataFrame(rowRDD, schema);
// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView(&people&);
// SQL can be run over a temporary view created using DataFrames
Dataset&Row& results = spark.sql(&SELECT name FROM people&);
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset&String& namesDS = results.map(new MapFunction&Row, String&() {
public String call(Row row) throws Exception {
return &Name: & + row.getString(0);
}, Encoders.STRING());
namesDS.show();
// +-------------+
// +-------------+
// |Name: Michael|
Name: Andy|
// | Name: Justin|
// +-------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java" in the Spark repo.
When a dictionary of kwargs cannot be defined ahead of time (for example,
the structure of records is encoded in a string, or a text dataset will be parsed and
fields will be projected differently for different users),
a DataFrame can be created programmatically with three steps.
Create an RDD of tuples or lists from the original RDD;
Create the schema represented by a StructType matching the structure of
tuples or lists in the RDD created in the step 1.
Apply the schema to the RDD via createDataFrame method provided by SparkSession.
For example:
# Import data types
from pyspark.sql.types import *
sc = spark.sparkContext
# Load a text file and convert each line to a Row.
lines = sc.textFile(&examples/src/main/resources/people.txt&)
parts = lines.map(lambda l: l.split(&,&))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))
# The schema is encoded in a string.
schemaString = &name age&
fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)
# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)
# Creates a temporary view using the DataFrame
schemaPeople.createOrReplaceTempView(&people&)
# SQL can be run over DataFrames that have been registered as a table.
results = spark.sql(&SELECT name FROM people&)
results.show()
# +-------+
# +-------+
# |Michael|
# | Justin|
# +-------+
Find full example code at "examples/src/main/python/sql/basic.py" in the Spark repo.
Aggregations
provide common
aggregations such as count(), countDistinct(), avg(), max(), min(), etc.
While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in
to work with strongly typed Datasets.
Moreover, users are not limited to the predefined aggregate functions and can create their own.
Untyped User-Defined Aggregate Functions
Users have to extend the
abstract class to implement a custom untyped aggregate function. For example, a user-defined average
can look like:
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
object MyAverage extends UserDefinedAggregateFunction {
// Data types of input arguments of this aggregate function
def inputSchema: StructType = StructType(StructField(&inputColumn&, LongType) :: Nil)
// Data types of values in the aggregation buffer
def bufferSchema: StructType = {
StructType(StructField(&sum&, LongType) :: StructField(&count&, LongType) :: Nil)
// The data type of the returned value
def dataType: DataType = DoubleType
// Whether this function always returns the same output on the identical input
def deterministic: Boolean = true
// Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
// standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
// the opportunity to update its values. Note that arrays and maps inside the buffer are still
// immutable.
def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = 0L
buffer(1) = 0L
// Updates the given aggregation buffer `buffer` with new input data from `input`
def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
if (!input.isNullAt(0)) {
buffer(0) = buffer.getLong(0) + input.getLong(0)
buffer(1) = buffer.getLong(1) + 1
// Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
// Calculates the final result
def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
// Register the function to access it
spark.udf.register(&myAverage&, MyAverage)
val df = spark.read.json(&examples/src/main/resources/employees.json&)
df.createOrReplaceTempView(&employees&)
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
val result = spark.sql(&SELECT myAverage(salary) as average_salary FROM employees&)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala" in the Spark repo.
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.MutableAggregationBuffer;
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
public static class MyAverage extends UserDefinedAggregateFunction {
private StructType inputSchema;
private StructType bufferSchema;
public MyAverage() {
List&StructField& inputFields = new ArrayList&&();
inputFields.add(DataTypes.createStructField(&inputColumn&, DataTypes.LongType, true));
inputSchema = DataTypes.createStructType(inputFields);
List&StructField& bufferFields = new ArrayList&&();
bufferFields.add(DataTypes.createStructField(&sum&, DataTypes.LongType, true));
bufferFields.add(DataTypes.createStructField(&count&, DataTypes.LongType, true));
bufferSchema = DataTypes.createStructType(bufferFields);
// Data types of input arguments of this aggregate function
public StructType inputSchema() {
return inputSchema;
// Data types of values in the aggregation buffer
public StructType bufferSchema() {
return bufferSchema;
// The data type of the returned value
public DataType dataType() {
return DataTypes.DoubleType;
// Whether this function always returns the same output on the identical input
public boolean deterministic() {
return true;
// Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
// standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
// the opportunity to update its values. Note that arrays and maps inside the buffer are still
// immutable.
public void initialize(MutableAggregationBuffer buffer) {
buffer.update(0, 0L);
buffer.update(1, 0L);
// Updates the given aggregation buffer `buffer` with new input data from `input`
public void update(MutableAggregationBuffer buffer, Row input) {
if (!input.isNullAt(0)) {
long updatedSum = buffer.getLong(0) + input.getLong(0);
long updatedCount = buffer.getLong(1) + 1;
buffer.update(0, updatedSum);
buffer.update(1, updatedCount);
// Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
public void merge(MutableAggregationBuffer buffer1, Row buffer2) {
long mergedSum = buffer1.getLong(0) + buffer2.getLong(0);
long mergedCount = buffer1.getLong(1) + buffer2.getLong(1);
buffer1.update(0, mergedSum);
buffer1.update(1, mergedCount);
// Calculates the final result
public Double evaluate(Row buffer) {
return ((double) buffer.getLong(0)) / buffer.getLong(1);
// Register the function to access it
spark.udf().register(&myAverage&, new MyAverage());
Dataset&Row& df = spark.read().json(&examples/src/main/resources/employees.json&);
df.createOrReplaceTempView(&employees&);
df.show();
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
Dataset&Row& result = spark.sql(&SELECT myAverage(salary) as average_salary FROM employees&);
result.show();
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedUntypedAggregation.java" in the Spark repo.
Type-Safe User-Defined Aggregate Functions
User-defined aggregations for strongly typed Datasets revolve around the
abstract class.
For example, a type-safe user-defined average can look like:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)
object MyAverage extends Aggregator[Employee, Average, Double] {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
def zero: Average = Average(0L, 0L)
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
def reduce(buffer: Average, employee: Employee): Average = {
buffer.sum += employee.salary
buffer.count += 1
// Merge two intermediate values
def merge(b1: Average, b2: Average): Average = {
b1.sum += b2.sum
b1.count += b2.count
// Transform the output of the reduction
def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
// Specifies the Encoder for the intermediate value type
def bufferEncoder: Encoder[Average] = Encoders.product
// Specifies the Encoder for the final output value type
def outputEncoder: Encoder[Double] = Encoders.scalaDouble
val ds = spark.read.json(&examples/src/main/resources/employees.json&).as[Employee]
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name(&average_salary&)
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala" in the Spark repo.
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.TypedColumn;
import org.apache.spark.sql.expressions.Aggregator;
public static class Employee implements Serializable {
private String name;
private long salary;
// Constructors, getters, setters...
public static class Average implements Serializable
private long sum;
private long count;
// Constructors, getters, setters...
public static class MyAverage extends Aggregator&Employee, Average, Double& {
// A zero value for this aggregation. Should satisfy the property that any b + zero = b
public Average zero() {
return new Average(0L, 0L);
// Combine two values to produce a new value. For performance, the function may modify `buffer`
// and return it instead of constructing a new object
public Average reduce(Average buffer, Employee employee) {
long newSum = buffer.getSum() + employee.getSalary();
long newCount = buffer.getCount() + 1;
buffer.setSum(newSum);
buffer.setCount(newCount);
return buffer;
// Merge two intermediate values
public Average merge(Average b1, Average b2) {
long mergedSum = b1.getSum() + b2.getSum();
long mergedCount = b1.getCount() + b2.getCount();
b1.setSum(mergedSum);
b1.setCount(mergedCount);
return b1;
// Transform the output of the reduction
public Double finish(Average reduction) {
return ((double) reduction.getSum()) / reduction.getCount();
// Specifies the Encoder for the intermediate value type
public Encoder&Average& bufferEncoder() {
return Encoders.bean(Average.class);
// Specifies the Encoder for the final output value type
public Encoder&Double& outputEncoder() {
return Encoders.DOUBLE();
Encoder&Employee& employeeEncoder = Encoders.bean(Employee.class);
String path = &examples/src/main/resources/employees.json&;
Dataset&Employee& ds = spark.read().json(path).as(employeeEncoder);
ds.show();
// +-------+------+
name|salary|
// +-------+------+
// |Michael|
// | Justin|
// +-------+------+
MyAverage myAverage = new MyAverage();
// Convert the function to a `TypedColumn` and give it a name
TypedColumn&Employee, Double& averageSalary = myAverage.toColumn().name(&average_salary&);
Dataset&Double& result = ds.select(averageSalary);
result.show();
// +--------------+
// |average_salary|
// +--------------+
// +--------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaUserDefinedTypedAggregation.java" in the Spark repo.
Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
A DataFrame can be operated on using relational transformations and can also be used to create a temporary view.
Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section
describes the general methods for loading and saving data using the Spark Data Sources and then
goes into specific options that are available for the built-in data sources.
Generic Load/Save Functions
In the simplest form, the default data source (parquet unless otherwise configured by
spark.sql.sources.default) will be used for all operations.
val usersDF = spark.read.load(&examples/src/main/resources/users.parquet&)
usersDF.select(&name&, &favorite_color&).write.save(&namesAndFavColors.parquet&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
Dataset&Row& usersDF = spark.read().load(&examples/src/main/resources/users.parquet&);
usersDF.select(&name&, &favorite_color&).write().save(&namesAndFavColors.parquet&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.read.load(&examples/src/main/resources/users.parquet&)
df.select(&name&, &favorite_color&).write.save(&namesAndFavColors.parquet&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- read.df(&examples/src/main/resources/users.parquet&)
write.df(select(df, &name&, &favorite_color&), &namesAndFavColors.parquet&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Manually Specifying Options
You can also manually specify the data source that will be used along with any extra options
that you would like to pass to the data source. Data sources are specified by their fully qualified
name (i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their short
names (json, parquet, jdbc, orc, libsvm, csv, text). DataFrames loaded from any data
source type can be converted into other types using this syntax.
val peopleDF = spark.read.format(&json&).load(&examples/src/main/resources/people.json&)
peopleDF.select(&name&, &age&).write.format(&parquet&).save(&namesAndAges.parquet&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
Dataset&Row& peopleDF =
spark.read().format(&json&).load(&examples/src/main/resources/people.json&);
peopleDF.select(&name&, &age&).write().format(&parquet&).save(&namesAndAges.parquet&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.read.load(&examples/src/main/resources/people.json&, format=&json&)
df.select(&name&, &age&).write.save(&namesAndAges.parquet&, format=&parquet&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- read.df(&examples/src/main/resources/people.json&, &json&)
namesAndAges &- select(df, &name&, &age&)
write.df(namesAndAges, &namesAndAges.parquet&, &parquet&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Run SQL on files directly
Instead of using read API to load a file into DataFrame and query it, you can also query that
file directly with SQL.
val sqlDF = spark.sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
Dataset&Row& sqlDF =
spark.sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&);
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
df = spark.sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- sql(&SELECT * FROM parquet.`examples/src/main/resources/users.parquet`&)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Save Modes
Save operations can optionally take a SaveMode, that specifies how to handle existing data if
present. It is important to realize that these save modes do not utilize any locking and are not
atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the
Scala/JavaAny LanguageMeaning
SaveMode.ErrorIfExists (default)
"error" (default)
When saving a DataFrame to a data source, if data already exists,
an exception is expected to be thrown.
SaveMode.Append
When saving a DataFrame to a data source, if data/table already exists,
contents of the DataFrame are expected to be appended to existing data.
SaveMode.Overwrite
"overwrite"
Overwrite mode means that when saving a DataFrame to a data source,
if data/table already exists, existing data is expected to be overwritten by the contents of
the DataFrame.
SaveMode.Ignore
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
the save operation is expected to not save the contents of the DataFrame and to not
change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
Saving to Persistent Tables
DataFrames can also be saved as persistent tables into Hive metastore using the saveAsTable
command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a
default local Hive metastore (using Derby) for you. Unlike the createOrReplaceTempView command,
saveAsTable will materialize the contents of the DataFrame and create a pointer to the data in the
Hive metastore. Persistent tables will still exist even after your Spark program has restarted, as
long as you maintain your connection to the same metastore. A DataFrame for a persistent table can
be created by calling the table method on a SparkSession with the name of the table.
By default saveAsTable will create a “managed table”, meaning that the location of the data will
be controlled by the metastore. Managed tables will also have their data deleted automatically
when a table is dropped.
Currently, saveAsTable does not expose an API supporting the creation of an “external table” from a DataFrame.
However, this functionality can be achieved by providing a path option to the DataFrameWriter with path as the key
and location of the external table as its value (a string) when saving the table with saveAsTable. When an External table
is dropped only its metadata is removed.
Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore. This brings several benefits:
Since the metastore can return only necessary partitions for a query, discovering all the partitions on the first query to the table is no longer needed.
Hive DDLs such as ALTER TABLE PARTITION ... SET LOCATION are now available for tables created with the Datasource API.
Note that partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
Parquet Files
is a columnar format that is supported by many other data processing systems.
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
of the original data. When writing Parquet files, all columns are automatically converted to be nullable for
compatibility reasons.
Loading Data Programmatically
Using the data from the above example:
// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._
val peopleDF = spark.read.json(&examples/src/main/resources/people.json&)
// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet(&people.parquet&)
// Read in the parquet file created above
// Parquet files are self-describing so the schema is preserved
// The result of loading a Parquet file is also a DataFrame
val parquetFileDF = spark.read.parquet(&people.parquet&)
// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView(&parquetFile&)
val namesDF = spark.sql(&SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19&)
namesDF.map(attributes =& &Name: & + attributes(0)).show()
// +------------+
// +------------+
// |Name: Justin|
// +------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset&Row& peopleDF = spark.read().json(&examples/src/main/resources/people.json&);
// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write().parquet(&people.parquet&);
// Read in the Parquet file created above.
// Parquet files are self-describing so the schema is preserved
// The result of loading a parquet file is also a DataFrame
Dataset&Row& parquetFileDF = spark.read().parquet(&people.parquet&);
// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView(&parquetFile&);
Dataset&Row& namesDF = spark.sql(&SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19&);
Dataset&String& namesDS = namesDF.map(new MapFunction&Row, String&() {
public String call(Row row) {
return &Name: & + row.getString(0);
}, Encoders.STRING());
namesDS.show();
// +------------+
// +------------+
// |Name: Justin|
// +------------+
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
peopleDF = spark.read.json(&examples/src/main/resources/people.json&)
# DataFrames can be saved as Parquet files, maintaining the schema information.
peopleDF.write.parquet(&people.parquet&)
# Read in the Parquet file created above.
# Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile = spark.read.parquet(&people.parquet&)
# Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView(&parquetFile&)
teenagers = spark.sql(&SELECT name FROM parquetFile WHERE age &= 13 AND age &= 19&)
teenagers.show()
# +------+
# +------+
# |Justin|
# +------+
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df &- read.df(&examples/src/main/resources/people.json&, &json&)
# SparkDataFrame can be saved as Parquet files, maintaining the schema information.
write.parquet(df, &people.parquet&)
# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile &- read.parquet(&people.parquet&)
# Parquet files can also be used to create a temporary view and then used in SQL statements.
createOrReplaceTempView(parquetFile, &parquetFile&)
teenagers &- sql(&SELECT name FROM parquetFile WHERE age &= 13 AND age &= 19&)
head(teenagers)
## 1 Justin
# We can also run custom R-UDFs on Spark DataFrames. Here we prefix all the names with &Name:&
schema &- structType(structField(&name&, &string&))
teenNames &- dapply(df, function(p) { cbind(paste(&Name:&, p$name)) }, schema)
for (teenName in collect(teenNames)$name) {
cat(teenName, &\n&)
## Name: Michael
## Name: Andy
## Name: Justin
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
path &examples/src/main/resources/people.parquet&
SELECT * FROM parquetTable
Partition Discovery
Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. The Parquet data source is now able to discover and infer
partitioning information automatically. For example, we can store all our previously used
population data into a partitioned table using the following directory structure, with two extra
columns, gender and country as partitioning columns:
└── table
├── gender=male
├── ...
├── country=US
└── data.parquet
├── country=CN
└── data.parquet
└── ...
└── gender=female
├── ...
├── country=US
└── data.parquet
├── country=CN
└── data.parquet
└── ...
By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL
will automatically extract the partitioning information from the paths.
Now the schema of the returned DataFrame becomes:
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)
Notice that the data types of the partitioning columns are automatically inferred. Currently,
numeric data types and string type are supported. Sometimes users may not want to automatically
infer the data types of the partitioning columns. For these use cases, the automatic type inference
can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which is default to
true. When type inference is disabled, string type will be used for the partitioning columns.
Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
by default. For the above example, if users pass path/to/table/gender=male to either
SparkSession.read.parquet or SparkSession.read.load, gender will not be considered as a
partitioning column. If users need to specify the base path that partition discovery
should start with, they can set basePath in the data source options. For example,
when path/to/table/gender=male is the path of the data and
users set basePath to path/to/table/, gender will be a partitioning column.
Schema Merging
Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
source is now able to automatically detect this case and merge schemas of all these files.
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we
turned it off by default starting from 1.5.0. You may enable it by
setting data source option mergeSchema to true when reading Parquet files (as shown in the
examples below), or
setting the global SQL option spark.sql.parquet.mergeSchema to true.
// This is used to implicitly convert an RDD to a DataFrame.
import spark.implicits._
// Create a simple DataFrame, store into a partition directory
val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i =& (i, i * i)).toDF(&value&, &square&)
squaresDF.write.parquet(&data/test_table/key=1&)
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i =& (i, i * i * i)).toDF(&value&, &cube&)
cubesDF.write.parquet(&data/test_table/key=2&)
// Read the partitioned table
val mergedDF = spark.read.option(&mergeSchema&, &true&).parquet(&data/test_table&)
mergedDF.printSchema()
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
|-- value: int (nullable = true)
|-- square: int (nullable = true)
|-- cube: int (nullable = true)
|-- key: int (nullable = true)
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repo.
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
public static class Square implements Serializable {
private int value;
private int square;
// Getters and setters...
public static class Cube implements Serializable {
private int value;
private int cube;
// Getters and setters...
List&Square& squares = new ArrayList&&();
for (int value = 1; value &= 5; value++) {
Square square = new Square();
square.setValue(value);
square.setSquare(value * value);
squares.add(square);
// Create a simple DataFrame, store into a partition directory
Dataset&Row& squaresDF = spark.createDataFrame(squares, Square.class);
squaresDF.write().parquet(&data/test_table/key=1&);
List&Cube& cubes = new ArrayList&&();
for (int value = 6; value &= 10; value++) {
Cube cube = new Cube();
cube.setValue(value);
cube.setCube(value * value * value);
cubes.add(cube);
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
Dataset&Row& cubesDF = spark.createDataFrame(cubes, Cube.class);
cubesDF.write().parquet(&data/test_table/key=2&);
// Read the partitioned table
Dataset&Row& mergedDF = spark.read().option(&mergeSchema&, true).parquet(&data/test_table&);
mergedDF.printSchema();
// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths
|-- value: int (nullable = true)
|-- square: int (nullable = true)
|-- cube: int (nullable = true)
|-- key: int (nullable = true)
Find full example code at "examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java" in the Spark repo.
from pyspark.sql import Row
# spark is from the previous example.
# Create a simple DataFrame, stored into a partition directory
sc = spark.sparkContext
squaresDF = spark.createDataFrame(sc.parallelize(range(1, 6))
.map(lambda i: Row(single=i, double=i ** 2)))
squaresDF.write.parquet(&data/test_table/key=1&)
# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
cubesDF = spark.createDataFrame(sc.parallelize(range(6, 11))
.map(lambda i: Row(single=i, triple=i ** 3)))
cubesDF.write.parquet(&data/test_table/key=2&)
# Read the partitioned table
mergedDF = spark.read.option(&mergeSchema&, &true&).parquet(&data/test_table&)
mergedDF.printSchema()
# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column appeared in the partition directory paths.
|-- double: long (nullable = true)
|-- single: long (nullable = true)
|-- triple: long (nullable = true)
|-- key: integer (nullable = true)
Find full example code at "examples/src/main/python/sql/datasource.py" in the Spark repo.
df1 &- createDataFrame(data.frame(single=c(12, 29), double=c(19, 23)))
df2 &- createDataFrame(data.frame(double=c(19, 23), triple=c(23, 18)))
# Create a simple DataFrame, stored into a partition directory
write.df(df1, &data/test_table/key=1&, &parquet&, &overwrite&)
# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
write.df(df2, &data/test_table/key=2&, &parquet&, &overwrite&)
# Read the partitioned table
df3 &- read.df(&data/test_table&, &parquet&, mergeSchema = &true&)
printSchema(df3)
# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column appeared in the partition directory paths
|-- single: double (nullable = true)
|-- double: double (nullable = true)
|-- triple: double (nullable = true)
|-- key: integer (nullable = true)
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.
Hive metastore Parquet table conversion
When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.
Hive/Parquet Schema Reconciliation
There are two key differences between Hive and Parquet from the perspective of table schema
processing.
Hive is case insensitive, while Parquet is not
Hive considers all columns nullable, while nullability in Parquet is significant
Due to this reason, we must reconcile Hive metastore schema with Parquet schema when converting a
Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:
Fields that have the same name in both schema must have the same data type regardless of
nullability. The reconciled field should have the data type of the Parquet side, so that
nullability is respected.
The reconciled schema contains exactly those fields defined in Hive metastore schema.
Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
Any fields that only appear in the Hive metastore schema are added as nullable field in the
reconciled schema.
Metadata Refreshing
Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
conversion is enabled, metadata of those converted tables are also cached. If these tables are
updated by Hive or other external tools, you need to refresh them manually to ensure consistent
// spark is an existing SparkSession
spark.catalog.refreshTable(&my_table&)
// spark is an existing SparkSession
spark.catalog().refreshTable(&my_table&);
# spark is an existing SparkSession
spark.catalog.refreshTable(&my_table&)
REFRESH TABLE my_table;
Configuration
Configuration of Parquet can be done using the setConf method on SparkSession or by running
SET key=value commands using SQL.
Property NameDefaultMeaning
spark.sql.parquet.binaryAsString
Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do
not differentiate between binary data and strings when writing out the Parquet schema. This
flag tells Spark SQL to interpret}

51无线网