
Read/Write hive table when running spark2 by oozie

Last updated 6 years ago

Problem description

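After installing a CDH 5.12 cluster, spark.sql can read and write Hive tables when spark2-shell or spark2-submit is run directly on a cluster machine. When the same job is launched from an Oozie shell action, however, Spark cannot see the Hive catalog and therefore cannot read or write Hive tables. This happens whether spark2-submit is called directly in the shell node or wrapped in an sh script, and whether --deploy-mode is client or cluster.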

Investigation

  1. Comparing the classpath of the java processes started in the two ways shows that the process launched by Oozie is missing the directory /opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/conf/yarn-conf.

  2. Tracing the launch scripts in debug mode (bash -x) to see how that directory gets onto the classpath leads to this line in /opt/cloudera/parcels/SPARK2-2.2.0.cloudera2-1.cdh5.12.0.p0.232957/lib/spark2/conf/spark-env.sh:

    HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
  3. Applying the same method to spark2-submit under Oozie shows why the directory is not added: when Oozie runs the script, HADOOP_CONF_DIR has already been set to /data/yarn/nm/usercache/icarbonx/appcache/application_1521461396754_4654/container_e37_1521461396754_4654_01_000002/oozie-hadoop-conf-1523771262850, so the yarn-conf default in spark-env.sh never applies. Reading the Oozie source reveals the root cause; the relevant code is:

// https://github.com/apache/oozie/blob/release-4.3.1/sharelib/oozie/src/main/java/org/apache/oozie/action/hadoop/ShellMain.java
// Excerpt only: constants and other members are omitted.
public class ShellMain extends LauncherMain {

    private void prepareHadoopConfigs(Configuration actionConf, Map<String, String> envp, File currDir) throws IOException {
        if (actionConf.getBoolean(CONF_OOZIE_SHELL_SETUP_HADOOP_CONF_DIR, false)) {
            String actionXml = envp.get(OOZIE_ACTION_CONF_XML);
            if (actionXml != null) {
                // A fresh config directory is created under the current working directory.
                File confDir = new File(currDir, "oozie-hadoop-conf-" + System.currentTimeMillis());
                writeHadoopConfig(actionXml, confDir);
                if (actionConf.getBoolean(CONF_OOZIE_SHELL_SETUP_HADOOP_CONF_DIR_WRITE_LOG4J_PROPERTIES, true)) {
                    System.out.println("Writing " + LOG4J_PROPERTIES + " to " + confDir);
                    writeLoggerProperties(actionConf, confDir);
                }
                System.out.println("Setting " + HADOOP_CONF_DIR + " and " + YARN_CONF_DIR
                    + " to " + confDir.getAbsolutePath());
                // Both variables in the environment map now point at the generated directory.
                envp.put(HADOOP_CONF_DIR, confDir.getAbsolutePath());
                envp.put(YARN_CONF_DIR, confDir.getAbsolutePath());
            }
        }
    }

}

// https://github.com/apache/oozie/blob/release-4.3.1/sharelib/oozie/src/main/java/org/apache/oozie/action/hadoop/LauncherMain.java
// Excerpt only: other members are omitted.
public abstract class LauncherMain {

    protected static String[] HADOOP_SITE_FILES = new String[]
            {"core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml"};

    // Copies the action XML into the target directory once per name in HADOOP_SITE_FILES.
    protected void writeHadoopConfig(String actionXml, File basrDir) throws IOException {
        File actionXmlFile = new File(actionXml);
        System.out.println("Copying " + actionXml + " to " + basrDir + "/" + Arrays.toString(HADOOP_SITE_FILES));
        basrDir.mkdirs();
        File[] dstFiles = new File[HADOOP_SITE_FILES.length];
        for (int i = 0; i < dstFiles.length; i++) {
            dstFiles[i] = new File(basrDir, HADOOP_SITE_FILES[i]);
        }
        copyFileMultiplex(actionXmlFile, dstFiles);
    }
}
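
The interaction between Oozie's preset HADOOP_CONF_DIR and the default in spark-env.sh can be reproduced in isolation. The following is only a minimal sketch: resolve_hadoop_conf_dir is a hypothetical helper wrapping the single line quoted from spark-env.sh, and both paths in the calls are made-up placeholders, not values from a real cluster.

```shell
#!/bin/sh
# Hypothetical helper: mimics the one line from spark-env.sh in a subshell,
# so the variables set here do not leak into the caller.
#   $1: value of SPARK_CONF_DIR
#   $2: value HADOOP_CONF_DIR already has when spark-env.sh runs (may be empty)
resolve_hadoop_conf_dir() (
  SPARK_CONF_DIR=$1
  HADOOP_CONF_DIR=$2
  # Same expansion as in spark-env.sh: the default is used only when
  # HADOOP_CONF_DIR is unset or empty.
  HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
  printf '%s\n' "$HADOOP_CONF_DIR"
)

# Direct launch: nothing preset, so the yarn-conf default is chosen.
resolve_hadoop_conf_dir /etc/spark2/conf ""
# prints /etc/spark2/conf/yarn-conf

# Oozie shell action: the launcher has already exported its own directory,
# so the yarn-conf default is skipped. (Placeholder path.)
resolve_hadoop_conf_dir /etc/spark2/conf /tmp/oozie-hadoop-conf-123
# prints /tmp/oozie-hadoop-conf-123
```

In the second case the directory that wins is the one written by writeHadoopConfig, which holds only the four files named in HADOOP_SITE_FILES.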

Solution

Under CDH -> Spark2 -> Configuration -> Spark 2 Client Advanced Configuration Snippet (Safety Valve) for spark2-conf/spark-env.sh, add:

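# Add yarn-conf to the classpath when it exists but HADOOP_CONF_DIR points elsewhere.
# The test uses -d because -n on a string ending in /yarn-conf is always true.
if [ -d "$SPARK_CONF_DIR/yarn-conf" ] && [ "$HADOOP_CONF_DIR" != "$SPARK_CONF_DIR/yarn-conf" ]; then
  export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$SPARK_CONF_DIR/yarn-conf"
fi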
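
To check whether the change took effect, the classpath comparison from the investigation can be repeated on a job launched by Oozie. The sketch below is one possible way to do that comparison, not the procedure actually used above: missing_entries is a hypothetical helper, and the two classpath strings are shortened placeholders. How the real classpath strings are captured from the running processes is left out.

```shell
#!/bin/sh
# Hypothetical helper: prints every entry of the reference classpath ($1)
# that does not appear in the classpath under test ($2).
# Runs in a subshell so the IFS and globbing changes stay local.
missing_entries() (
  ref=$1
  other=$2
  set -f    # keep entries such as /path/jars/* from being glob-expanded
  IFS=:     # split the reference classpath on colons
  for entry in $ref; do
    [ -n "$entry" ] || continue          # skip empty entries
    case ":$other:" in
      *":$entry:"*) ;;                   # entry present in both classpaths
      *) printf '%s\n' "$entry" ;;       # entry missing from the second one
    esac
  done
)

# Placeholder classpaths: direct launch vs. launch through Oozie.
direct_cp='/opt/spark2/conf:/opt/spark2/conf/yarn-conf:/opt/spark2/jars/*'
oozie_cp='/opt/spark2/conf:/opt/spark2/jars/*'

missing_entries "$direct_cp" "$oozie_cp"
# prints /opt/spark2/conf/yarn-conf
```

If the snippet in spark-env.sh is working, running the same comparison against the classpath of an Oozie-launched process should no longer list the yarn-conf directory.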