Spark development environment on Windows
Development is generally easier on Linux, but Windows offers more tools and is more convenient for everyday use, so it is well worth setting up a development environment on Windows. These are notes from my own work on configuring a Windows environment that can run Python, Spark, and other programs. The sections may not flow smoothly; I mostly wrote things down as they occurred to me, accumulating them bit by bit.
Installing Git
Download Git and generate your SSH key on Windows; it is placed under /c/Users/<your computer name>/.ssh:
id_rsa id_rsa.pub known_hosts known_hosts.old
Next, copy the key string from id_rsa.pub into GitLab. If the repository host changes, remove the stale entry with:
ssh-keygen -f "/***/****/.ssh/known_hosts" -R <repository host or domain>
For domain name resolution on Windows, there is a hosts file in /c/Windows/System32/drivers/etc/; add one line per mapping, with the domain and IP separated by a space. Note that editing it requires administrator privileges; if you do not have them, elevate your user account and grant it write permission on the file.
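As a quick sanity check, a short Python sketch (my own addition; the hostname below is a made-up placeholder) can confirm that a hosts entry actually resolves to the IP you wrote:

import socket

# Hypothetical hostname added to the hosts file; replace with your own entry.
hostname = "gitlab.internal.example"
try:
    # gethostbyname goes through the OS resolver, which consults the hosts file first.
    print("%s resolves to %s" % (hostname, socket.gethostbyname(hostname)))
except socket.gaierror as err:
    print("resolution failed, check the hosts entry: %s" % err)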
Configuring the JDK
A development environment is nothing without the JDK. To configure it on Windows:
1. Download the JDK (jdk1.8.0_101 as the example here) from the official JDK site;
2. Set the environment variables:
JAVA_HOME = D:\programs\Java\jdk1.8.0_101
CLASSPATH = %JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar
Path = D:\programs\Java\jdk1.8.0_101\bin
3. Test in cmd: java -version
On Linux, set the environment variables in /etc/profile:
export JAVA_HOME=/home/min/programs/jdk/jdk1.8.0_74
export PATH=$JAVA_HOME/bin:$PATH
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=./:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
Installing Eclipse
Download Eclipse IDE for Java Developers, Version: Neon Release (4.6.0), from the Eclipse site, then install the plugins:
Help -> Install New Software
add: scala http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site
add: python http://pydev.org/updates
Then quietly wait for the installation to finish.
Installing sbt
On Windows, first download the sbt package, then configure the paths.
1. sbt/conf/repo.properties is the file from which sbt reads the repositories for open-source packages:
[repositories]
local
//comp-maven: http://repo.data.1verge.net/nexus/content/groups/public/
// store_cn: http://maven.oschina.net/content/groups/public/
//store_mir: http://mirrors.ibiblio.org/maven2/
// store_0: http://maven.net.cn/content/groups/public/
store_1: http://repo.typesafe.com/typesafe/ivy-releases/
//store_2: http://repo2.maven.org/maven2/
sbt-releases-repo: http://repo.typesafe.com/typesafe/ivy-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
sbt-plugins-repo: http://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/, [organization]/[module]/(scala_[scalaVersion]/)(sbt_[sbtVersion]/)[revision]/[type]s/[artifact](-[classifier]).[ext]
maven-central: http://repo1.maven.org/maven2/
2. sbt/conf/sbtconfig.txt sets the JVM options and the paths where sbt stores downloaded packages:
# Set the java args to high
-Xmx512M
-XX:MaxPermSize=256m
-XX:ReservedCodeCacheSize=128m
# Set the extra SBT options
-Dsbt.log.format=true
-Dsbt.ivy.home=D:/programs/sbt/.ivy2
-Dsbt.global.base=D:/programs/sbt/.sbt
-Dsbt.repository.config=D:/programs/sbt/conf/repo.properties
3. Environment variable: add D:\programs\sbt\bin to PATH.
On Linux, create a script named sbt under the bin directory with the following content (-Dsbt.override.build.repos=true makes sbt resolve only against the repositories file):
SBT_OPTS="-Xms512M -Xmx1536M -Xss1M -XX:+CMSClassUnloadingEnabled -XX:MaxPermSize=256M -Dsbt.override.build.repos=true"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
Write the repositories sbt should download packages from into $HOME/.sbt/repositories:
[repositories]
local
my-flinkspector:https://dl.bintray.com/ottogroup/maven/
my-sbt-releases:https://repository.apache.org/content/repositories/releases/
my-sbt-public:https://repository.apache.org/content/repositories/public/
my-maven-public:http://repo.maven.apache.org/maven2/
my-maven-proxy-releases: http://maven.nlpcn.org/
Installing Maven
Download the Maven package, then configure the environment variables.
On Linux, in /etc/profile:
export MAVEN_HOME=/home/min/programs/maven/maven3.0.4
export PATH=$MAVEN_HOME/bin:$PATH
On Windows:
Path = D:\programs\maven-3.0.4\bin
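Since the JDK, sbt, and Maven sections all boil down to "get the tool on PATH", here is a small Python sketch (my own addition, not part of the original setup) that checks all three at once:

import shutil
import subprocess

# Confirm each tool is reachable through PATH (on Windows this also finds .bat/.cmd wrappers).
for tool in ("java", "sbt", "mvn"):
    found = shutil.which(tool)
    print("%-4s -> %s" % (tool, found if found else "NOT FOUND, check PATH"))

# "java -version" prints to stderr, hence the redirect.
out = subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT)
print(out.decode("utf-8", "replace"))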
Configuring Hadoop
On Windows, you can try one of the prebuilt third-party Hadoop environments; download the Hadoop package:
hadoop-common-2.2.0-bin/bin:
hadoop, hadoop.cmd, hadoop.dll, hadoop.exp, hadoop.lib, hadoop.pdb, winutils.exe, etc.
Set the environment variables:
HADOOP_HOME = E:\workspace\native-hadoop-bin\hadoop-common-2.2.0-bin
Path = %HADOOP_HOME%\bin
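A missing winutils.exe or hadoop.dll is the most common cause of Hadoop/Spark failures on Windows, so a quick check is worthwhile; this sketch assumes HADOOP_HOME is set as above:

import os

# Falls back to the path configured above if HADOOP_HOME is not set in this shell.
hadoop_home = os.environ.get("HADOOP_HOME",
                             r"E:\workspace\native-hadoop-bin\hadoop-common-2.2.0-bin")
for name in ("winutils.exe", "hadoop.dll"):
    path = os.path.join(hadoop_home, "bin", name)
    print("%s: %s" % (path, "ok" if os.path.exists(path) else "MISSING"))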
Steps to configure Hadoop on Linux:
1. Make sure you can ssh into your own machine; making it passwordless is optional and only saves typing your own machine's password.
2. Download the hadoop-2.2.0 release;
3. Configure the files, all under hadoop-2.2.0/etc/hadoop:
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/min/programs/hadoop/hadoop-2.2.0/tmp</value>
</property>
</configuration>
hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/home/min/programs/jdk/jdk1.8.0_74
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/min/programs/hadoop/hadoop-2.2.0/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/min/programs/hadoop/hadoop-2.2.0/dfs/data</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
4. Start Hadoop:
cd /***/hadoop-2.2.0/bin
./hadoop namenode -format
cd /***/hadoop-2.2.0/sbin
./start-all.sh
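Once start-all.sh finishes, one way to confirm HDFS came up is to hit the NameNode web UI, which listens on port 50070 by default in Hadoop 2.x; a rough check using the requests package (installed in the Python section below):

import requests

# Hadoop 2.x NameNode web UI default port; adjust if you changed it.
try:
    r = requests.get("http://localhost:50070", timeout=5)
    print("NameNode UI reachable, HTTP %d" % r.status_code)
except requests.ConnectionError:
    print("NameNode UI not reachable; check that start-all.sh succeeded")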
Installing HBase
1. Download the hbase-1.0.3 release;
2. Configure:
hbase-site.xml
<configuration>
<property>
<name>hbase.rootdir</name>
<value>/home/min/programs/hbase-1.0.3/hbaseData</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>127.0.0.1</value>
</property>
<property>
<name>hbase.rest.port</name>
<value>8080</value>
</property>
<property>
<name>hbase.rest.readonly</name>
<value>true</value>
</property>
<property>
<name>hbase.rest.authentication.type</name>
<value>kerberos</value>
</property>
<property>
<name>hbase.rest.authentication.kerberos.principal</name>
<value>HTTP/_HOST@HADOOP.LOCALDOMAIN</value>
</property>
<property>
<name>hbase.rest.authentication.kerberos.keytab</name>
<value>$KEYTAB</value>
</property>
</configuration>
hbase-env.sh
# The java implementation to use. Java 1.7+ required.
export JAVA_HOME=/home/min/programs/jdk/jdk1.8.0_74
3. Start HBase:
cd /**/hbase-1.0.3/bin
./start-hbase.sh
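The hbase-site.xml above also sets up the REST gateway on port 8080. Note the gateway is started separately (it is not part of start-hbase.sh), and with the Kerberos settings shown, an unauthenticated request would be rejected; assuming a plain local gateway without those settings, a minimal probe looks like this:

import requests

# /version/cluster is a standard HBase REST endpoint returning the cluster version.
r = requests.get("http://localhost:8080/version/cluster",
                 headers={"Accept": "text/plain"}, timeout=5)
print("HBase cluster version: %s" % r.text.strip())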
Installing Python
On Windows, just download the installer from the official Python site, install it, and set the environment variable:
Path = D:\programs\python35
Any extra Python packages can be installed with pip. Go into D:\programs\python35\Scripts and run:
pip install requests
pip install beautifulsoup4
pip install jieba
pip install py4j -i https://pypi.douban.com/simple
pip install hmmlearn
Copy the pyspark directory under spark-1.6.0-bin-hadoop2.6\python into python35\Lib\site-packages\.
To run a Spark Python program on Windows, you need to point it at the local Spark directory:
import os
os.environ["SPARK_HOME"] = r"E:\programs\spark\spark-2.0.0-bin-hadoop2.6"
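Putting it together, a minimal end-to-end smoke test (the local[2] master and the tiny counting job are my illustrative choices, not requirements):

import os

# Point at the local Spark distribution before importing pyspark.
os.environ["SPARK_HOME"] = r"E:\programs\spark\spark-2.0.0-bin-hadoop2.6"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[2]").setAppName("smoke-test")
sc = SparkContext(conf=conf)

# Tiny job: count the even numbers in 0..99; expect 50.
evens = sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()
print("even numbers: %d" % evens)

sc.stop()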