Recompiling Cloudera's Apache Spark to Include Officially Unsupported Modules

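People say open-source software is full of pitfalls, and it really is: it took me several days of fiddling to get this build to work =.= Here is the process for everyone. This post uses Vagrant to import a preconfigured CentOS 7 build environment and compiles CDH Spark from source.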

Overall you can follow the instructions at https://github.com/teamclairvoyant/vagrant-sparkbuilder. Things to watch out for:

Vagrant configuration

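Do not use Windows. Unless you know how to adapt the configuration files, do not attempt the vagrant-sparkbuilder project on Windows. At present both Mac OS and CentOS hosts run into an SSH key authorization problem; the fix is described below.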

VirtualBox installation

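After installing VirtualBox on CentOS, remember to enable CPU virtualization, then install its components (this command sets up the VirtualBox kernel modules):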

sudo /usr/lib/virtualbox/vboxdrv.sh setup
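Before running the setup script, a quick look at the CPU flags can save a failed VM boot. The sketch below is my own addition rather than part of the vagrant-sparkbuilder instructions, and the helper name has_virt_flags is hypothetical. It looks for the vmx (Intel VT-x) or svm (AMD-V) flag in /proc/cpuinfo. A missing flag suggests hardware virtualization is unavailable or hidden from the OS; a present flag does not by itself prove the firmware setting is switched on.

```shell
# Hypothetical helper: succeed if the given cpuinfo file lists vmx or svm.
# Defaults to /proc/cpuinfo, which only exists on Linux hosts.
has_virt_flags() {
  local cpuinfo=${1:-/proc/cpuinfo}
  grep -Eq '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' "$cpuinfo"
}

# Example use on a Linux host:
# has_virt_flags && echo "virtualization flags found" || echo "no vmx/svm flag"
```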

Downloading the Vagrant box

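Network connectivity inside China is really poor. Bringing up the project downloads a CentOS box image, so copy the download URL, fetch the file yourself, and then import it with this command: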

vagrant box add D:\hc-download  --name bento/centos-7.2
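As a minimal sketch of the same import on a Unix-like host, the wrapper below checks that the downloaded file exists before handing it to vagrant box add. The function name import_box and the example file path are hypothetical; the box name bento/centos-7.2 is the one used above.

```shell
# Hypothetical wrapper: register a locally downloaded box file under a name.
import_box() {
  local name=$1 file=$2
  if [ ! -f "$file" ]; then
    echo "box file not found: $file" >&2
    return 1
  fi
  vagrant box add --name "$name" "$file"
}

# Example (the path is a placeholder for wherever you saved the download):
# import_box bento/centos-7.2 "$HOME/Downloads/centos-7.2.box"
# vagrant box list   # the new box should now be listed
```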

SSH key problem

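The preceding steps are no big deal; this is the one where I got stuck. After the VM boots, Vagrant connects over SSH using a private key. The project has presumably been tested in several environments, but on my machine I got an authorization error. The fix is below. Stack Overflow really is a great site, but... the most-upvoted answer is not always the solution we need...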

1. Log in to vagrant machine: vagrant ssh, use default password vagrant.
2. Create ssh keys: for example, ssh-keygen -t rsa -b 4096 -C "vagrant" (as advised by GitHub's relevant guide).
3. Rename the public key file (by default id_rsa.pub), overriding the old one: mv .ssh/id_rsa.pub .ssh/authorized_keys.
4. Reload ssh service in case needed: sudo service ssh reload.
5. Copy the private key file (by default id_rsa) to the host machine: for instance, use a fine combination of cat and clipboard, cat .ssh/id_rsa, paint and copy (better ways must exist, go invent one!).
6. Logout from the vagrant machine: logout.
7. Find the current private key used by vagrant by looking at its configuration: vagrant ssh-config (look for instance IdentityFile "/[...]/private_key").
8. Replace the current private key with the one you created at the host machine: for example, nano /[...]/private_key and paste from the clipboard, if all else fails. (Note, however, that if your private_key is not project specific but shared by multiple vagrant machines, you better configure the path yourself in order to not break other perfectly working machines! Changing the path is as simple as adding a line config.ssh.private_key_path = "path/to/private_key" into the Vagrantfile.)
9. Test the setup: vagrant ssh should now work.
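The host-side steps (finding the key Vagrant uses and replacing it) can be scripted instead of pasted by hand. This is a rough sketch under my own assumptions: the helper name identity_file and the new_private_key path are hypothetical, and it assumes vagrant ssh-config prints one IdentityFile line, optionally quoted.

```shell
# Hypothetical helper: read `vagrant ssh-config` output on stdin and print
# the IdentityFile path(s) with surrounding double quotes removed.
identity_file() {
  sed -n 's/^[[:space:]]*IdentityFile[[:space:]]*//p' | tr -d '"'
}

# Example use on the host, after copying the new key out of the guest
# (new_private_key is a placeholder path):
# key_path=$(vagrant ssh-config | identity_file)
# cp new_private_key "$key_path"
# chmod 600 "$key_path"
# vagrant ssh
```

As the quoted answer warns, overwrite the key only if it belongs to this project; for a shared key, pointing config.ssh.private_key_path at a separate file is the safer route.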

Building

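The most annoying part of the build is, again, the network. If you have a VPN that gets past the firewall, configure it and turn it on. Also note that the build needs a lot of memory: the default 4 GB is nowhere near enough, so change v.memory and v.cpus in the project's Vagrantfile.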
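Editing the Vagrantfile by hand is simple enough, but the change can also be scripted. The sketch below assumes the Vagrantfile contains assignments of the form v.memory = 4096 and v.cpus = 2; the function name set_vm_resources and the values 8192 and 4 are illustrative choices of mine, not figures from the project.

```shell
# Hypothetical helper: rewrite the numeric values assigned to v.memory and
# v.cpus in a Vagrantfile. Writes through a temp file to stay portable
# across GNU and BSD sed.
set_vm_resources() {
  local file=$1 mem=$2 cpus=$3 tmp
  tmp=$(mktemp) || return 1
  sed -e "s/\(v\.memory[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1${mem}/" \
      -e "s/\(v\.cpus[[:space:]]*=[[:space:]]*\)[0-9][0-9]*/\1${cpus}/" \
      "$file" > "$tmp" && mv "$tmp" "$file"
}

# Example: give the VM 8 GB of RAM and 4 cores, then restart it.
# set_vm_resources Vagrantfile 8192 4
# vagrant reload
```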

The build commands are as follows:

$ patch -p0 </vagrant/undelete.patch # patch make-distribution.sh so the build output includes SparkR and other components
$ ./make-distribution.sh -DskipTests \
    -Dhadoop.version=2.6.0-cdh5.8.0 \
    -Phadoop-2.6 \
    -Pyarn \
    -Psparkr \
    -Phive \
    -Pflume-provided \
    -Phadoop-provided \
    -Phbase-provided \
    -Phive-provided \
    -Pparquet-provided \
    -Phive-thriftserver
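Once the build finishes, it is worth confirming that SparkR actually made it into the output. The check below is a sketch that assumes the upstream Spark layout, where make-distribution.sh writes to a dist directory and copies the R package to R/lib/SparkR inside it; the CDH script may differ, and the helper name dist_has_sparkr is hypothetical.

```shell
# Hypothetical helper: succeed if the distribution directory contains the
# SparkR package. Assumes the layout <dist>/R/lib/SparkR.
dist_has_sparkr() {
  local dist=${1:-dist}
  [ -d "$dist/R/lib/SparkR" ]
}

# Example, run from the Spark source root after the build:
# dist_has_sparkr dist && echo "SparkR is in the distribution" \
#   || echo "SparkR is missing; check that the patch was applied"
```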

References

Compiling Cloudera CDH's Spark from source
Installing the R environment and SparkR on an Ubuntu 14.04 + CDH 5.7 cluster
