PySpark Study Notes

Published: 2023-04-17 11:21:41  Author: 大脚板同志

These notes are based on Spark for Python Developers by Amit Nandi.

 

1.1  word count example
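
The notes give no code for this section, so here is a minimal sketch of how a word count is commonly written in PySpark. It is not taken from the book. The names tokenize, word_count, main and SAMPLE_LINES are hypothetical, and the sketch assumes a local Spark installation with the pyspark package available.

```python
import re
from operator import add

# Hypothetical sample input; any iterable of strings would do.
SAMPLE_LINES = ["to be or not to be", "that is the question"]


def tokenize(line):
    """Split a line into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", line.lower())


def word_count(sc, lines):
    """Return a list of (word, count) pairs computed with the SparkContext sc."""
    return (sc.parallelize(lines)
              .flatMap(tokenize)              # one record per word
              .map(lambda word: (word, 1))    # pair each word with a count of 1
              .reduceByKey(add)               # sum the counts per word
              .collect())


def main():
    # Imported here so that tokenize() can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "WordCount")
    try:
        counts = word_count(sc, SAMPLE_LINES)
        # Print the most frequent words first, ties in alphabetical order.
        for word, n in sorted(counts, key=lambda kv: (-kv[1], kv[0])):
            print(word, n)
    finally:
        sc.stop()
```

Calling main() runs the job on two local threads. The flatMap, map, reduceByKey chain is the usual shape of this example: split lines into words, emit (word, 1) pairs, then sum the pairs per key.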

 

 

Chapter 5    Streaming Live Data with Spark

 

Goal: "investigate various implementations using live sources of data such as TCP sockets to the Twitter firehose and put in place a low latency, high throughput, and scalable data pipeline combining Spark, Kafka and Flume."

Fault tolerance

The main Spark Streaming fault tolerance mechanisms are checkpointing, automatic driver restart, and automatic failover. Spark recovers from driver failure using checkpointing, which preserves the application state. After a failure, lost results have to be recomputed, and DStream transformations have exactly-once semantics, so the recomputed result matches that of a failure-free run.
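
To make the checkpointing mechanism concrete, here is a minimal sketch of a driver that can be restarted from a checkpoint, using the DStream API that the book covers (newer Spark releases treat this API as legacy). It is an illustration, not the book's code. The checkpoint directory, host, port, and the names update_count, create_context and main are assumptions made for the example.

```python
# Hypothetical settings for illustration only.
CHECKPOINT_DIR = "/tmp/spark-streaming-checkpoint"
HOST = "localhost"
PORT = 9999


def update_count(new_values, running_total):
    """State update for updateStateByKey: add this batch's counts to the total.

    running_total is None the first time a key is seen.
    """
    return sum(new_values) + (running_total or 0)


def create_context():
    """Build a fresh StreamingContext; only called when no checkpoint exists."""
    # Imported here so that update_count() can be used without Spark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "CheckpointedWordCount")
    ssc = StreamingContext(sc, 5)      # 5-second batch interval
    ssc.checkpoint(CHECKPOINT_DIR)     # required for stateful operations

    lines = ssc.socketTextStream(HOST, PORT)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .updateStateByKey(update_count))
    counts.pprint()
    return ssc


def main():
    from pyspark.streaming import StreamingContext

    # Rebuild the context from the checkpoint if one exists,
    # otherwise create it from scratch with create_context().
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
    ssc.start()
    ssc.awaitTermination()
```

The whole pipeline is defined inside create_context() because StreamingContext.getOrCreate only calls that function on the first run. On a restart it restores the DStream graph and the running counts from the checkpoint directory instead. Automatic driver restart itself depends on how the application is deployed and is outside this sketch.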

Processing live data with TCP sockets