Spark SQL & DataFrames | Apache Spark



Spark SQL is Apache Spark's module for working with structured data.
Integrated
Seamlessly mix SQL queries with Spark programs.
Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API. Usable in Java, Scala, Python and R.
results = spark.sql("SELECT * FROM people")
names = results.rdd.map(lambda p: p.name)
Apply functions to results of SQL queries.
Uniform data access
Connect to any data source the same way.
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.
spark.read.json("s3n://...").createOrReplaceTempView("json")
results = spark.sql(
  """SELECT *
     FROM people
     JOIN json ...""")
Query and join different data sources.
Hive integration
Run SQL or HiveQL queries on existing warehouses.
Spark SQL supports the HiveQL syntax as well as Hive SerDes and UDFs, allowing
you to access existing Hive warehouses.
Spark SQL can use existing Hive metastores, SerDes, and UDFs.
Standard connectivity
Connect through JDBC or ODBC.
A server mode provides industry standard JDBC and ODBC connectivity for business intelligence tools.
Use your existing BI tools to query big data.
Performance & scalability
Spark SQL includes a cost-based optimizer, columnar storage and code generation to make queries fast.
At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance.
Don't worry about using a different engine for historical data.
Community
Spark SQL is developed as part of Apache Spark. It thus gets
tested and updated with each Spark release.
If you have questions about the system, ask on the
Spark mailing lists.
The Spark SQL developers welcome contributions. If you'd like to help out,
read how to
contribute to Spark, and send us a patch!
Getting started
To get started with Spark SQL:
Download Spark. It includes Spark SQL as a module.
Read the Spark SQL and DataFrame guide to learn the API.
Latest News
Spark 3.2.3 released
(Nov 28, 2022)
Spark 3.3.1 released
(Oct 25, 2022)
Spark 3.2.2 released
(Jul 17, 2022)
Spark 3.3.0 released
(Jun 16, 2022)
Archive
Apache Spark, Spark, Apache, the Apache feather logo, and the Apache Spark project logo are either registered
trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
See guidance on use of Apache Spark trademarks.
All other marks mentioned may be trademarks or registered trademarks of their respective owners.
Copyright 2018 The Apache Software Foundation, licensed under the
Apache License, Version 2.0.