Download

td-spark uses a YY.MM.patch versioning scheme that encodes the release year, the release month, and the patch number. Note: td-spark-assembly-latest_xxxx.jar is an alias for the most recent release, td-spark-assembly-YY.MM.patch_(spark version).
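As an illustration of the naming scheme, here is a minimal Python sketch that splits an assembly jar file name into its parts. The helper `parse_jar_name` and the `TdSparkJar` tuple are hypothetical names introduced for this example and are not part of td-spark. The sketch assumes that the two-digit year maps to 2000 + YY and that the suffix after the underscore has the form `spark<version>`, as in the file names listed under v19.11.0.

```python
import re
from typing import NamedTuple, Optional


class TdSparkJar(NamedTuple):
    """Parts of a td-spark assembly jar name (hypothetical helper type)."""
    year: Optional[int]
    month: Optional[int]
    patch: Optional[int]
    spark_version: str
    is_latest_alias: bool


# Matches td-spark-assembly-YY.MM.patch_spark<version>.jar
# and the td-spark-assembly-latest_spark<version>.jar alias.
_JAR_PATTERN = re.compile(
    r"^td-spark-assembly-"
    r"(?:(?P<latest>latest)|(?P<yy>\d{2})\.(?P<mm>\d{1,2})\.(?P<patch>\d+))"
    r"_spark(?P<spark>.+)\.jar$"
)


def parse_jar_name(name: str) -> TdSparkJar:
    """Split a jar file name into release year, month, patch, and Spark version."""
    m = _JAR_PATTERN.match(name)
    if m is None:
        raise ValueError(f"not a td-spark assembly jar name: {name!r}")
    if m.group("latest"):
        # The alias carries no explicit release number.
        return TdSparkJar(None, None, None, m.group("spark"), True)
    return TdSparkJar(
        2000 + int(m.group("yy")),  # assumption: YY counts years from 2000
        int(m.group("mm")),
        int(m.group("patch")),
        m.group("spark"),
        False,
    )


jar = parse_jar_name("td-spark-assembly-19.11.0_spark2.4.4.jar")
print(jar.year, jar.month, jar.patch, jar.spark_version)
```

A name that does not follow the scheme raises `ValueError`, which makes a mistyped file name easy to spot before it reaches spark-submit.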

WARNING: Spark 2.4.x + Scala 2.11 support has been deprecated since December 2020. Consider migrating to Spark 3.x + Scala 2.12.

For Spark (Scala)

td-spark is a library that can be used with your own Spark cluster. Download one of the jar files listed below and pass its path to spark-submit with the --jars option, for example --jars (path to td-spark-assembly-xxx.jar).
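If you launch jobs from a script, the command line can be assembled programmatically. The following is a minimal sketch, not an official launcher: `spark_submit_command` is a hypothetical helper, `my_app.py` is a placeholder application, and the jar path is the same placeholder used above. It relies only on the standard spark-submit options `--jars` and `--conf`.

```python
import shlex
from typing import Dict, List, Optional


def spark_submit_command(
    jar_path: str,
    app: str,
    conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build a spark-submit argument list that puts the td-spark jar on the classpath."""
    cmd = ["spark-submit", "--jars", jar_path]
    # Each Spark property is passed as a separate --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)  # the application file comes after all options
    return cmd


cmd = spark_submit_command("/path/to/td-spark-assembly-xxx.jar", "my_app.py")
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where spark-submit is on the PATH. Building a list rather than a single string avoids shell-quoting problems when the jar path contains spaces.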

For PySpark (Python)

Install td-pyspark from PyPI with pip:

$ pip install td-pyspark

If you want to install PySpark as well, specify the [spark] option:

$ pip install td-pyspark[spark]

Docker Images

Docker images of td-spark packaged with Spark are available on DockerHub under devtd. Here are example commands for running spark-shell with td-spark using Docker:

Spark 3.2.0:

$ docker pull devtd/td-spark-shell:latest_spark3.2.0
$ docker run -it -e TD_API_KEY=$TD_API_KEY devtd/td-spark-shell:latest_spark3.2.0

PySpark 3.2.0:

$ docker pull devtd/td-spark-pyspark:latest_spark3.2.0
$ docker run -it -e TD_API_KEY=$TD_API_KEY devtd/td-spark-pyspark:latest_spark3.2.0

Release Notes

v21.10.0

  • Upgrade to Spark 3.2.0

  • Internal library version upgrade:

    • Upgrade jackson to 2.12.3

    • Upgrade Airframe to 21.10.0

    • Upgrade msgpack-java to 0.9.0

    • Upgrade td-client-java to 0.9.6

    • Upgrade presto-jdbc to 350

    • Upgrade fluency-treasuredata to 2.6.0

Downloads

v21.5.0

  • Upgrade to Spark 3.1.1 and Hadoop 3.2

  • Ramp up reading a large number of partitions

  • (Experimental) Support for a vectorized reader. To enable it, set spark.td.enableVectorizedReader to true. This feature is experimental and may change in future versions.

  • Internal library version upgrade:

    • Upgrade json4s to 3.7.0-M5

    • Upgrade msgpack-java to 0.8.22

    • Upgrade fluency to 2.5.1

    • Upgrade td-client-java to 0.9.5

    • Upgrade Airframe to 21.3.1

Downloads

v21.3.0

  • Upgrade to Spark 3.0.2 and Python 3.9

  • Fixed a bug when including null values in ArrayType

  • Support for Spark 2.4.x was removed as of 21.3.0

Downloads

v20.12.0

  • Fixed a bug when creating partitions

Downloads

v20.10.0

  • Upgrade to Spark 2.4.7, Spark 3.0.1

  • Fixed a bug that caused DataFrame uploads to fail when the DataFrame contains a time column whose type is not Long

  • Fixed a bug when reading Map type values inside a column

  • Fixed the partition reader to reflect the spark.sql.maxPartitionBytes and spark.sql.files.openCostInBytes configuration parameters. This reduces the number of Spark tasks needed by packing multiple partition read tasks into a single task. See also the Spark SQL Performance Tuning Guide.

  • Internal library version upgrade:

    • Upgrade jackson to 2.10.5

    • Upgrade json4s to 3.6.6

    • Upgrade fluency to 2.4.1

    • Upgrade presto-jdbc to 338 to fix a performance issue when used with JDK 11

    • Upgrade Airframe to 20.10.0

    • Upgrade to Scala 2.11.12, Scala 2.12.12

    • Upgrade td-client-java to 0.9.3
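To give a feel for how packing partition reads reduces the task count (the partition reader fix in this release), here is a rough Python sketch of greedy packing. It is not td-spark's implementation. It follows the general approach Spark uses for file-based sources: sort reads by size, charge each read a fixed open cost, and start a new task when the size limit would be exceeded. The function name `pack_read_tasks` is hypothetical, and the defaults of 128 MB and 4 MB are Spark's defaults for the maximum partition size and the open cost.

```python
from typing import List

MB = 1024 * 1024


def pack_read_tasks(
    sizes: List[int],
    max_partition_bytes: int = 128 * MB,
    open_cost_in_bytes: int = 4 * MB,
) -> List[List[int]]:
    """Greedily group partition sizes (in bytes) into Spark tasks."""
    tasks: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    # Largest reads first, so that small reads fill the remaining space.
    for size in sorted(sizes, reverse=True):
        if current and current_size + size > max_partition_bytes:
            tasks.append(current)  # close the current task and start a new one
            current = []
            current_size = 0
        current.append(size)
        # Every read pays a fixed open cost in addition to its data size.
        current_size += size + open_cost_in_bytes
    if current:
        tasks.append(current)
    return tasks


tasks = pack_read_tasks([10 * MB] * 20)
print(len(tasks))  # 20 partition reads are packed into 3 tasks
```

Raising the maximum partition size packs more reads into each task, while raising the open cost makes many small reads count as more expensive, so fewer of them fit in one task.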

Downloads

v20.6.2

  • Fixed the handling of HTTP responses when receiving 5xx errors from APIs.

Downloads

v20.6.1

This release supports Spark 2.4.6 and Spark 3.0.0 (official release).

Downloads

v20.6.0

Downloads

Major Changes

  • Support swapping table contents

Bug Fixes

  • Bump to msgpack-java 0.8.20 with JDK8 compatibility

  • Fixed NPE in reading specific Array column values

  • Handle 504 responses properly

Internal Changes

v20.4.0

Downloads

Changes

  • Spark 2.4.5 support

  • Support ap02 for spark.td.site configuration

v20.2.0

Downloads

Changes

  • Spark 3.0.0-preview2 support

v19.11.1

Downloads

Bug Fixes

  • Fixed a bug in uploading a DataFrame whose time column contains null or non-unixtime values.

  • Fixed an error when installing td_pyspark using Python 2

v19.11.0

Downloads

  • td-spark-assembly-19.11.0_spark2.4.4.jar (Spark 2.4.4, Scala 2.11)

  • td-spark-assembly-19.11.0_spark3.0.0-preview.jar (Spark 3.0.0-preview, Scala 2.12)

  • Commands for running spark-shell with Docker:

    • Spark 2.4.4: docker run -it -e TD_API_KEY=$TD_API_KEY armtd/td-spark-shell:19.11.0_spark2.4.4

    • Spark 3.0.0-preview: docker run -it -e TD_API_KEY=$TD_API_KEY armtd/td-spark-shell:19.11.0_spark3.0.0-preview

    • PySpark 2.4.4: docker run -it -e TD_API_KEY=$TD_API_KEY armtd/td-spark-pyspark:19.11.0_spark2.4.4

    • PySpark 3.0.0.dev0: docker run -it -e TD_API_KEY=$TD_API_KEY armtd/td-spark-pyspark:19.11.0_spark3.0.0-preview

Major Changes

  • Support Spark 2.4.4 (Scala 2.11) and Spark 3.0.0-preview (Scala 2.12, pyspark 3.0.0.dev0)

  • Support using multiple TD accounts with val td2 = td.withApiKey("...") (Scala), td2 = td.with_apikey("...") (Python).

Bug Fixes

  • Fixed the table preview of array column values inserted from td-spark

Internal Changes

v19.7.0

Downloads

Major Changes

  • Fully support PySpark. Install the package from PyPI: https://pypi.org/project/td-pyspark/

Bug Fixes

  • Fixed scala-parser-combinator error when using td.presto(sql).

  • Bump to Fluency 2.3.2 with configuration fix

  • Add retry around drop table/database

Internal Changes

  • Upgrade to Airframe 19.8.9