Setting up a Spark 3.x remote development environment on Linux with IDEA

Published 2024-01-13 21:21:06, author: zhjh256

Dependencies

JDK 8 or 11 both work; do not go up to JDK 17.

  The first problem with JDK 17 is that reflection into JDK internal classes is disallowed by default, so a lot of configuration has to change.

scala 2.13

  Scala 2.13 paves the way for Scala 3.0 and brings quite a few improvements. Programming in Scala (4th edition) is a good way to learn it.

hadoop 3.2.1

  Because the latest Windows Hadoop winutils build currently only goes up to 3.2.1, it is best to use the same version, or at least one that is not too far off.

  The Hadoop bundled with Spark by default has no native libraries, so its performance is relatively poor; install a standalone Hadoop distribution as well.

hive 3.1.3

  Spark tables can exist on their own, but warnings are raised, so Hive has to be installed as well and Spark run on Hive.

spark 3.3.4

  Just use the latest version.

postgresql or mysql

  Stores the Hive metastore metadata; PostgreSQL is used as the example here.

zookeeper

  Install 3.5.9, because both HBase and HDFS rely on ZooKeeper for leader election.

Once everything above is installed, a complete big data stack is up: Spark SQL, Spark streaming jobs, Spark graph and Spark machine learning workloads can all run on it, and a single framework covers all computation and ETL with fast scaling and fast scheduling. A typical big data platform also includes HBase and Flink, but these two are relatively independent and can be skipped.

Single-node standalone setup on Linux

  Although Spark can schedule on its own, if HDFS is used for storage it is better to set up a single-node YARN rather than standalone; this is closer to the production mode.

  Configure the following three files under $SPARK_HOME/conf.

  spark-env.sh

export JAVA_HOME="/usr/local/jdk-11.0.21+9"
export SPARK_MASTER_HOST=192.168.231.128
export SPARK_LOCAL_IP=192.168.231.128
export SPARK_MASTER_PORT=7077
# history: configure the history server
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 -Dspark.history.retainedApplications=30 -Dspark.history.fs.logDirectory=/root/spark-3.3.4-bin-hadoop3-scala2.13/history"

  workers

192.168.231.128

  spark-defaults.conf 

spark.master                     spark://192.168.231.128:7077
spark.eventLog.enabled           true
spark.eventLog.dir               /root/spark-3.3.4-bin-hadoop3-scala2.13/logs

 

Configuring Scala and Spark support in IDEA, with job submission via spark-submit and directly from IDEA

  1. Install the Scala plugin

  2. Add Scala to the global libraries

  3. Add Scala to the module settings

  4. Configure the Scala plugin in Maven

  Parent pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.study.sparkmaven</groupId>
    <artifactId>sparkmaven</artifactId>
    <packaging>pom</packaging>
    <version>1.0-SNAPSHOT</version>
    <modules>
        <module>hello</module>
        <module>wordcount</module>
    </modules>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <!-- Scala version -->
        <version.scala>2.13.9</version.scala>
        <version.maven.compile.plugin>3.9.0</version.maven.compile.plugin>
        <version.maven.scala.plugin>2.15.2</version.maven.scala.plugin>
        <version.maven.jar.plugin>3.2.2</version.maven.jar.plugin>
        <version.maven.dependency.plugin>3.2.0</version.maven.dependency.plugin>
        <version.maven.assembly.plug>3.3.0</version.maven.assembly.plug>
        <spark.version>3.2.0</spark.version>
    </properties>

    <build>
        <pluginManagement>
            <plugins>
                <!-- maven-compiler-plugin -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>${version.maven.compile.plugin}</version>
                    <!-- configuration -->
                    <configuration>
                        <!-- language level -->
                        <source>${maven.compiler.source}</source>
                        <target>${maven.compiler.target}</target>
                        <!-- source file encoding -->
                        <encoding>UTF-8</encoding>
                        <!-- emit debug info -->
                        <debug>true</debug>
                    </configuration>
                </plugin>
                <plugin>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <version>${version.maven.scala.plugin}</version>
                    <configuration>
                        <!-- Scala version -->
                        <scalaVersion>${version.scala}</scalaVersion>
                    </configuration>
                    <!-- executions -->
                    <executions>
                        <!-- execution -->
                        <execution>
                            <!-- with multiple executions, each id must be set and must be unique -->
                            <id>scala-compile</id>
                            <!-- phase to bind to -->
                            <phase>compile</phase>
                            <!-- goals to run in that phase -->
                            <goals>
                                <goal>compile</goal>
                            </goals>
                        </execution>
                        <execution>
                            <id>scala-test-compile</id>
                            <phase>test-compile</phase>
                            <goals>
                                <goal>testCompile</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
                <!-- maven-assembly-plugin -->
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>${version.maven.assembly.plug}</version>
                    <!-- configuration -->
                    <configuration>
                        <!-- assembly descriptor to use -->
                        <descriptorRefs>
                            <!-- build a jar that bundles the dependencies -->
                            <descriptorRef>jar-with-dependencies</descriptorRef>
                        </descriptorRefs>
                    </configuration>
                    <executions>
                        <!-- execution -->
                        <execution>
                            <id>make-assembly</id>
                            <!-- bind to the package phase -->
                            <phase>package</phase>
                            <goals>
                                <!-- the only goal needed here -->
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>

    <dependencyManagement>
        <dependencies>
            <!-- Scala standard library -->
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
                <version>${version.scala}</version>
            </dependency>
            <!-- Scala compiler -->
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-compiler</artifactId>
                <version>${version.scala}</version>
            </dependency>
            <!-- Scala reflection -->
            <dependency>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-reflect</artifactId>
                <version>${version.scala}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-core_2.13</artifactId>
                <version>${spark.version}</version>
                <!-- comment out provided to run in IDEA; enable it when packaging for spark-submit -->
                <!-- <scope>provided</scope>-->
            </dependency>
            <dependency>
                <groupId>org.apache.spark</groupId>
                <artifactId>spark-sql_2.13</artifactId>
                <version>${spark.version}</version>
            </dependency>
        </dependencies>
    </dependencyManagement>
</project>

  Child module pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>sparkmaven</artifactId>
        <groupId>com.study.sparkmaven</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>wordcount</artifactId>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.13</artifactId>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.13</artifactId>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
            </plugin>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <launchers>
                        <launcher>
                            <id>wordcount</id>
                            <mainClass>com.sparklearn.WordCount</mainClass>
                        </launcher>
                    </launchers>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <classpathPrefix>lib/</classpathPrefix>
                            <mainClass>com.sparklearn.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

  At this point, development, compilation, and packaging all work.

Local run (Windows)

  If the built jars are always copied to Linux to run, or remote development is used, this step is optional.

  Download Hadoop and the matching Hadoop winutils package (make sure the versions are the same), extract the winutils package over %HADOOP_HOME%/bin, and copy the resulting hadoop.dll into c:/windows/system32. Then run the following directly in IDEA:

package com.sparklearn

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {
    // When submitting via spark-submit, do not call setMaster(): it overrides the --master
    // command line option and gives incorrect results, so it is best to control the master
    // from the command line; set local only for the Windows environment, otherwise leave it unset.
    // However, when running the Spark job directly with java -jar xxx.jar, the master must be specified.
    val conf = new SparkConf().setMaster("local[*]").setAppName("wordcount").set("spark.testing.memory", "2147480000")
    val sc = new SparkContext(conf)
    sc.textFile("D:\\hadoop-3.2.1\\data\\a.txt")
      .flatMap(_.split(" "))
      .map(_ -> 1)
      .reduceByKey(_ + _)
      .collect()
      .foreach(println(_))
  }
}

  Unlike spark-submit, local mode does not need a running standalone cluster; just run it from IDEA to debug.

  See https://blog.csdn.net/saranjiao/article/details/106082374 for details. A local master always means running locally; only spark://xxx:xx submits to the cluster.

  The jar built from the Scala code above can also be run via spark-submit, e.g. spark-submit --master spark://127.0.0.1:7077 --class com.sparklearn.WordCount wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar, or run directly with java -jar wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar.

  When running directly with java -jar xxx.jar, the target environment's CLASSPATH must contain Spark's dependencies such as spark-core, spark-sql and the Scala libraries they depend on; otherwise class-not-found errors are thrown.

  When submitting via spark-submit, the driver is also hosted by the Spark or YARN cluster, so the Spark dependencies only need to be marked provided at packaging time and the jar stays very small; bundling the dependencies also works (no error is raised), the jar is just much larger.

Remote debugging

  When developing on Windows and running on Linux, debugging is unavoidable at some point.

Spark execution flow and the main function

  Scheduling works by executing the application's main function, which is why it is called a job; it is not simply attaching a jar and pointing at a method (Transwarp TDH starts jobs quickly, most likely because this part was changed so that a JVM does not have to be started every time). Spark natively requires that the SparkSession be initialized inside main and that there be only one, so an isolation mechanism is needed here; see the sketch after the links below.

   https://blog.51cto.com/u_16099276/7409034

  https://blog.csdn.net/weixin_42011858/article/details/129516511

  https://blog.csdn.net/luoyepiaoxue2014/article/details/128076590
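
  A minimal sketch of that single-session constraint (the object and app names below are illustrative, not from the original post): SparkSession.builder().getOrCreate() hands back the session that already exists in the driver JVM, if any, instead of creating a second one.

import org.apache.spark.sql.SparkSession

object SingleSessionSketch {
  def main(args: Array[String]): Unit = {
    // getOrCreate() returns the session already created in this JVM (if any),
    // which is how the "one SparkSession per application" rule is normally honored.
    val spark = SparkSession.builder()
      .appName("single-session-sketch")
      .master("local[*]") // for local testing only; leave unset when using spark-submit
      .getOrCreate()
    try {
      println(spark.range(100).count()) // trivial action just to exercise the session
    } finally {
      spark.stop()
    }
  }
}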

Terminating a running Spark job

  https://blog.csdn.net/wankunde/article/details/117813432
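
  The link above covers killing a whole application from outside (via the master or YARN). As an in-application alternative (a hedged sketch; "etl-2024" is an arbitrary example group id), SparkContext can also cancel jobs by job group:

import org.apache.spark.sql.SparkSession

object CancelJobGroupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cancel-sketch").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // Jobs started from this thread are tagged with the group id;
    // interruptOnCancel = true asks Spark to interrupt the running task threads on cancel.
    sc.setJobGroup("etl-2024", "long running ETL", interruptOnCancel = true)

    // ... long-running actions would be submitted here ...

    // From another thread (e.g. an admin endpoint) the whole group can be cancelled:
    sc.cancelJobGroup("etl-2024")

    spark.stop()
  }
}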

Spark Streaming

package com.sparklearn

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.storage.StorageLevel

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.sparkContext.setLogLevel("WARN")

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}

  Likewise, package it and submit it to the Linux server to run.

  Start nc first.

[lightdb@lightdb-dev ~]$ nc -lk 9999

  Then submit the job.

[root@lightdb-dev ~]# spark-submit --master spark://127.0.0.1:7077 --class com.sparklearn.NetworkWordCount wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar localhost 9999
24/01/03 12:35:41 INFO SparkContext: Running Spark version 3.3.4
24/01/03 12:35:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/03 12:35:41 INFO ResourceUtils: ==============================================================
24/01/03 12:35:41 INFO ResourceUtils: No custom resources configured for spark.driver.
24/01/03 12:35:41 INFO ResourceUtils: ==============================================================
24/01/03 12:35:41 INFO SparkContext: Submitted application: NetworkWordCount
24/01/03 12:35:41 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/01/03 12:35:41 INFO ResourceProfile: Limiting resource is cpu
....
24/01/03 12:35:43 INFO SingleEventLogFileWriter: Logging events to file:/root/spark-3.3.4-bin-hadoop3-scala2.13/logs/local-1704256542621.inprogress
-------------------------------------------
Time: 1704256544000 ms
-------------------------------------------

  Type anything into the nc session, for example:

[lightdb@lightdb-dev ~]$ nc -lk 9999
888
666
非 fe
s^C
[lightdb@lightdb-dev ~]$ 

  Spark job output:

-------------------------------------------
Time: 1704256626000 ms
-------------------------------------------
(888,1)

-------------------------------------------
Time: 1704256627000 ms
-------------------------------------------
(666,1)

-------------------------------------------
Time: 1704256628000 ms
-------------------------------------------

-------------------------------------------
Time: 1704256629000 ms
-------------------------------------------

-------------------------------------------
Time: 1704256630000 ms
-------------------------------------------

-------------------------------------------
Time: 1704256631000 ms
-------------------------------------------
(非,1)
(fe,1)

  Gracefully stopping a Spark Streaming job
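
  A minimal sketch under the same assumptions as the NetworkWordCount example above (nc -lk 9999 on localhost): enable spark.streaming.stopGracefullyOnShutdown so that a shutdown signal drains the in-flight batches, or call ssc.stop(..., stopGracefully = true) explicitly.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("graceful-stop")
      .setMaster("local[2]") // for local testing only; leave unset when using spark-submit
      // Drain in-flight batches on shutdown (e.g. SIGTERM from a kill) instead of dropping them.
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Same socket source as the NetworkWordCount example above.
    ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(x => (x, 1))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()

    // Programmatic variant: stop only after all received data has been processed.
    // ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}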

spark sql

Differences and relationship between RDDs and DataFrames
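
  A minimal round-trip sketch (column names are illustrative): an RDD is the low-level, schema-less API over plain objects, a DataFrame is the same distributed data with a schema that the engine can optimize, and each can be obtained from the other.

import org.apache.spark.sql.SparkSession

object RddVsDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-vs-df").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: functions over plain Scala objects, no schema known to the engine.
    val rdd = spark.sparkContext.parallelize(Seq(("spark", 3), ("hive", 1)))

    // DataFrame: the same data plus a schema, so SQL-style operations can be optimized.
    val df = rdd.toDF("word", "cnt")
    df.filter($"cnt" > 1).show()

    // Every DataFrame still exposes its underlying RDD of Rows.
    println(df.rdd.count())

    spark.stop()
  }
}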

Spark interaction with external databases
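
  A hedged sketch of reading from and writing to PostgreSQL over JDBC (URL, credentials and table names are placeholders; the PostgreSQL JDBC driver jar must be on the classpath, e.g. via spark-submit --jars):

import java.util.Properties

import org.apache.spark.sql.SparkSession

object JdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jdbc-sketch").master("local[*]").getOrCreate()

    // Placeholder connection details.
    val url = "jdbc:postgresql://192.168.231.128:5432/testdb"
    val props = new Properties()
    props.setProperty("user", "postgres")
    props.setProperty("password", "postgres")
    props.setProperty("driver", "org.postgresql.Driver")

    // Read a table into a DataFrame ...
    val df = spark.read.jdbc(url, "public.some_table", props)
    df.show(10)

    // ... and write results back out.
    df.write.mode("append").jdbc(url, "public.some_table_copy", props)

    spark.stop()
  }
}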

spark graphx

Spark machine learning

Typical errors

scalac: error while loading package, class file 'D:\scala-2.13.0\lib\scala-library.jar(scala/reflect/package.class)' is broken
(class java.lang.RuntimeException/error reading Scala signature of package.class: Scala signature package has wrong version
 expected: 5.0
 found: 5.2 in package.class)

Scala 2.13 was installed and the dependencies were also switched to 2.13 following https://www.80wz.com/qawfw/880.html, yet the error above was still raised.

spark-shell and spark-submit both work fine now. With Spark standalone already running, submit from IDEA via SparkContext:

package com.sparklearn

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("spark://192.168.231.128:7077").setAppName("wordcount").set("spark.testing.memory","2147480000")
    val sc = new SparkContext(conf)
    sc.textFile("/root/spark-3.3.4-bin-hadoop3-scala2.13/data/a.txt")
      .flatMap(_.split(" "))
      .map(_->1)
      .reduceByKey(_+_)
      .collect()
      .foreach(println(_))
  }
}

The error is as follows:

PS D:\sparkmaven\wordcount\target> java -jar .\wordcount-1.0-SNAPSHOT-jar-with-dependencies.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
24/01/02 16:01:05 INFO SparkContext: Running Spark version 3.2.0
24/01/02 16:01:05 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
        at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
        at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
        at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
        at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
        at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1691)
        at org.apache.hadoop.security.SecurityUtil.setConfigurationInternal(SecurityUtil.java:104)
        at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:88)
        at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:312)
        at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:300)
        at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:575)
        at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2510)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2510)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:314)
        at com.sparklearn.WordCount$.main(WordCount.scala:9)
        at com.sparklearn.WordCount.main(WordCount.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
        at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
        at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
        at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
        ... 13 more
24/01/02 16:01:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/02 16:01:05 INFO ResourceUtils: ==============================================================
24/01/02 16:01:05 INFO ResourceUtils: No custom resources configured for spark.driver.
24/01/02 16:01:05 INFO ResourceUtils: ==============================================================
24/01/02 16:01:05 INFO SparkContext: Submitted application: wordcount
24/01/02 16:01:05 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, scri
pt: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
24/01/02 16:01:05 INFO ResourceProfile: Limiting resource is cpu
24/01/02 16:01:05 INFO ResourceProfileManager: Added ResourceProfile id: 0
24/01/02 16:01:05 INFO SecurityManager: Changing view acls to: zjhua
24/01/02 16:01:05 INFO SecurityManager: Changing modify acls to: zjhua
24/01/02 16:01:05 INFO SecurityManager: Changing view acls groups to:
24/01/02 16:01:05 INFO SecurityManager: Changing modify acls groups to:
24/01/02 16:01:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(zjhua); groups with view permissions: Set(); users  with m
odify permissions: Set(zjhua); groups with modify permissions: Set()
24/01/02 16:01:06 INFO Utils: Successfully started service 'sparkDriver' on port 58717.
24/01/02 16:01:06 INFO SparkEnv: Registering MapOutputTracker
24/01/02 16:01:06 INFO SparkEnv: Registering BlockManagerMaster
24/01/02 16:01:06 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
24/01/02 16:01:06 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$ (in unnamed module @0x2928854b) cannot access class sun.nio.ch.DirectBuffer (in module jav
a.base) because module java.base does not export sun.nio.ch to unnamed module @0x2928854b
        at org.apache.spark.storage.StorageUtils$.<init>(StorageUtils.scala:213)
        at org.apache.spark.storage.StorageUtils$.<clinit>(StorageUtils.scala)
        at org.apache.spark.storage.BlockManagerMasterEndpoint.<init>(BlockManagerMasterEndpoint.scala:110)
        at org.apache.spark.SparkEnv$.$anonfun$create$9(SparkEnv.scala:348)
        at org.apache.spark.SparkEnv$.registerOrLookupEndpoint$1(SparkEnv.scala:287)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:336)
        at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:191)
        at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:460)
        at com.sparklearn.WordCount$.main(WordCount.scala:9)
        at com.sparklearn.WordCount.main(WordCount.scala)

  For configuring the Hadoop environment variables, see https://blog.csdn.net/lvoelife/article/details/133349627. This is not required when running on Linux, but winutils must be configured to run from the local project or to submit to Linux from it. So one alternative is to simply use IDEA remote development.

  does not export sun.nio.ch to unnamed module: set the environment variables JAVA_OPT and JAVA_TOOL_OPTIONS, both with the value --add-exports=java.base/sun.nio.ch=ALL-UNNAMED.

Submitting with spark-submit directly on the server works, but submitting remotely fails (single-node standalone mode), as follows:

24/01/02 19:30:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.231.128:7077...
24/01/02 19:30:16 INFO TransportClientFactory: Successfully created connection to /192.168.231.128:7077 after 34 ms (0 ms spent in bootstraps)
24/01/02 19:30:35 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
24/01/02 19:30:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.231.128:7077...
24/01/02 19:30:56 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://192.168.231.128:7077...
24/01/02 19:31:16 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
24/01/02 19:31:16 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
24/01/02 19:31:16 INFO SparkUI: Stopped Spark web UI at http://DESKTOP-UPH3K6S.mshome.net:4040
24/01/02 19:31:16 INFO StandaloneSchedulerBackend: Shutting down all executors
24/01/02 19:31:16 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
24/01/02 19:31:16 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
24/01/02 19:31:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
24/01/02 19:31:16 INFO MemoryStore: MemoryStore cleared
24/01/02 19:31:16 INFO BlockManager: BlockManager stopped
24/01/02 19:31:16 INFO BlockManagerMaster: BlockManagerMaster stopped
24/01/02 19:31:16 WARN MetricsSystem: Stopping a MetricsSystem that is not running
24/01/02 19:31:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
24/01/02 19:31:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56604.
24/01/02 19:31:16 INFO NettyBlockTransferService: Server created on DESKTOP-UPH3K6S.mshome.net:56604
24/01/02 19:31:16 INFO SparkContext: Successfully stopped SparkContext
24/01/02 19:31:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
24/01/02 19:31:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, DESKTOP-UPH3K6S.mshome.net, 56604, None)
24/01/02 19:31:16 ERROR SparkContext: Error initializing SparkContext.
java.lang.NullPointerException: Cannot invoke "org.apache.spark.rpc.RpcEndpointRef.askSync(Object, scala.reflect.ClassTag)" because the return value of "org.apache.spark.storage.BlockManagerMa
ster.driverEndpoint()" is null
        at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:78)
        at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:518)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:596)
        at com.sparklearn.WordCount$.main(WordCount.scala:9)
        at com.sparklearn.WordCount.main(WordCount.scala)
24/01/02 19:31:16 INFO SparkContext: SparkContext already stopped.
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.apache.spark.rpc.RpcEndpointRef.askSync(Object, scala.reflect.ClassTag)" because the return value of "org.apache.s
park.storage.BlockManagerMaster.driverEndpoint()" is null
        at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:78)
        at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:518)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:596)
        at com.sparklearn.WordCount$.main(WordCount.scala:9)
        at com.sparklearn.WordCount.main(WordCount.scala)
24/01/02 19:31:16 INFO ShutdownHookManager: Shutdown hook called
24/01/02 19:31:16 INFO ShutdownHookManager: Deleting directory D:\Temp\spark-5f86f6fc-7eeb-4939-b802-0f94466b812e

The cause of this last one has not been found yet; as a workaround, the built jar is copied to the server via ssh and submitted there with spark-submit.

org.apache.spark.Logging not found in Spark 2 and Spark 3

  See http://www.taodudu.cc/news/show-4644278.html?action=onClick and https://download.csdn.net/download/yewakui2253/10440634; using org.apache.spark.internal.Logging directly does not work.

  Instead, add log4j to the job, specify a log4j configuration on the spark-submit command line, or change the log4j configuration globally. For example:

    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.sparkContext.setLogLevel("WARN")

Managing jar dependencies for Spark job submission

  Besides building a fat jar, or having the cluster environment contain the dependency libraries with CLASSPATH configured, there are several other approaches; see https://blog.51cto.com/u_16087105/6223665.

Handling various warnings

spark-submit error: Failed to get database default, returning NoSuchObjectException

  Running SQL with spark-sql reports Failed to get database default, returning NoSuchObjectException because Hive is needed, but my Spark installation did not include Hive, so Hive has to be downloaded and Spark configured to use it.

  Copy Hive's configuration file hive-site.xml (hive/conf/hive-site.xml) into the Spark configuration directory (spark/conf/hive-site.xml).

  Installing Hive and configuring HiveServer2, switching the Hive metastore metadata from Derby to PostgreSQL
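
  Once hive-site.xml is in place, a session built with enableHiveSupport() talks to the shared metastore instead of an embedded Derby one. A minimal sketch (the table name is a placeholder):

import org.apache.spark.sql.SparkSession

object SparkOnHiveSketch {
  def main(args: Array[String]): Unit = {
    // Assumes hive-site.xml has been copied to $SPARK_HOME/conf (or is on the classpath),
    // pointing at the PostgreSQL-backed metastore described above.
    val spark = SparkSession.builder()
      .appName("spark-on-hive")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show databases").show()
    // "some_hive_table" is a placeholder for an existing table.
    spark.sql("select count(*) from default.some_hive_table").show()

    spark.stop()
  }
}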

Difference between Hive on Spark and Spark on Hive

Hive on Spark
Spark is integrated into Hive: Hive both stores the metadata and parses the HQL statements; only the execution engine is swapped from the default MR engine to Spark, which takes over the computation. Deployment is relatively complex.

Spark on Hive
Hive only stores the metadata, while Spark parses and executes the SQL statements using Spark SQL syntax; deployment is simple.

The advantage of Spark on Hive is a more flexible programming interface that suits all kinds of data processing needs, but its performance may fall short of Hive on Spark, especially for complex queries.

Unable to load native-hadoop library for your platform

  See https://blog.csdn.net/csdnliu123/article/details/105488895: download a standalone Hadoop, then configure LD_LIBRARY_PATH.

WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0

  Resolved by following https://stackoverflow.com/questions/41136946/hive-metastore-warnings-in-spark.

Understanding Spark execution: the Spark web UI

  https://baijiahao.baidu.com/s?id=1754353885149130981&wfr=spider&for=pc

  The Spark web UI is commonly used for performance analysis and tuning: https://blog.csdn.net/Gavin_ke/article/details/130093634

Differences between Spark tables and Hive tables

Spark tables and Hive tables differ in the following ways:

Different technology stack: Spark tables are read and written by Spark programs, while Hive tables are read and written with HiveQL statements.

Different storage: Spark tables live in Spark's memory, while Hive tables are stored on Hadoop's HDFS.

Different processing speed: Spark tables speed up processing through distributed computation and in-memory storage, while Hive tables are processed more slowly.

Different supported data sources: Spark tables can read many data sources, including HDFS, Hive tables, relational databases and NoSQL databases, while Hive tables can only read data on HDFS.

You can set spark.sql.legacy.createHiveTableByDefault to false

24/01/05 10:27:37 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.

https://blog.51cto.com/u_16213399/7219855

Spark 2 to 3 brings significant improvements and incompatibilities; see the official migration guide when upgrading: https://spark.apache.org/docs/3.0.0-preview2/sql-migration-guide.html
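
  Two hedged ways to deal with the warning above, sketched below (the table name demo_tbl is illustrative): set the legacy flag when building the session, or name the table provider explicitly with USING.

import org.apache.spark.sql.SparkSession

object CreateTableProviderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("create-table-provider")
      // Option 1: make CREATE TABLE without a USING clause produce a native data source
      // table instead of a Hive SerDe table, which silences the warning.
      .config("spark.sql.legacy.createHiveTableByDefault", "false")
      .enableHiveSupport()
      .getOrCreate()

    // Option 2: name the provider explicitly in each statement.
    spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT, name STRING) USING parquet")

    spark.stop()
  }
}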

Hadoop HDFS startup reports a permission problem

  Starting HDFS reports: hadoop100 Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password)

  Set up passwordless ssh, as in the session below.

[root@lightdb-dev ~]# cd .ssh/
[root@lightdb-dev .ssh]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:RMPAOE2aY4qKzB4qzR+DdpFy0iGOdPORAEIubBC/hHs root@lightdb-dev
The key's randomart image is:
+---[RSA 3072]----+
|=+.. =ooo        |
|=o  +o+...       |
|o=+o*+  .        |
|+*.*o+..         |
|+ E =.  S        |
|=. = .           |
|o*o +            |
|+.+. o           |
|o. ..            |
+----[SHA256]-----+
[root@lightdb-dev .ssh]# ssh lightdb-dev 
root@lightdb-dev's password: 
Activate the web console with: systemctl enable --now cockpit.socket

Last login: Fri Jan  5 14:37:30 2024
[root@lightdb-dev ~]# exit
logout
Connection to lightdb-dev closed.
[root@lightdb-dev .ssh]# ssh lightdb-dev 
root@lightdb-dev's password: 

[root@lightdb-dev .ssh]# 
[root@lightdb-dev .ssh]# 
[root@lightdb-dev .ssh]# ssh-copy-id lightdb-dev 
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/root/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
root@lightdb-dev's password: 

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'lightdb-dev'"
and check to make sure that only the key(s) you wanted were added.

[root@lightdb-dev .ssh]# ssh lightdb-dev 
Activate the web console with: systemctl enable --now cockpit.socket

Last login: Fri Jan  5 14:40:02 2024 from 192.168.231.128

 

References

https://blog.51cto.com/jiayq/5501230 IDEA configuration

https://blog.csdn.net/weixin_42641909/article/details/96740257 Various ways to configure Log4j for Spark

https://blog.csdn.net/yiluohan0307/article/details/80048765 Remote debugging