Introduction to Hadoop
Hadoop: the Adam and Eve of the open-source big data world.
Its core is the HDFS distributed storage system and the MapReduce distributed computing framework.
HDFS
The principle: chop large data into blocks, replicate each block three times, and place the copies on three cheap machines, so that three live replicas of every block are always available as mutual backups. A read is served from any one of the replicas, and the block is there.
The machines that store the data are called datanodes (the storage cells); the machine that manages the datanodes is called the namenode (the overseer).
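The chunk-and-replicate idea can be mimicked with everyday shell tools (a toy sketch only; real HDFS uses 128 MB blocks and its own on-disk format, and all file and directory names here are invented for illustration):

```shell
# Toy sketch of the HDFS idea: split a file into fixed-size blocks,
# then keep three copies of every block on three pretend "datanodes".
mkdir -p dn1 dn2 dn3                 # three pretend datanode directories
seq 1 1000 > bigfile                 # some data to store
split -b 512 bigfile blk_            # 512-byte blocks for the demo (HDFS uses 128 MB)
for b in blk_*; do                   # replication factor 3
  cp "$b" dn1/; cp "$b" dn2/; cp "$b" dn3/
done
cat dn2/blk_* > restored             # any single replica set can rebuild the file
cmp bigfile restored && echo "replica readable"   # prints: replica readable
```

Losing any one directory (machine) loses no data, because every block still has two other copies; that is the availability argument in miniature.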
MapReduce
The principle: a big task is first processed in separate piles (Map), then the partial results are aggregated (Reduce). Both the splitting and the aggregation run in parallel across multiple servers, which is where the power of the cluster shows. The difficulty lies in decomposing a task into splits and aggregations that fit the MapReduce model, and in deciding what the intermediate <k,v> inputs and outputs should be.
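The classic Unix word-count pipeline is a handy analogy for this model (an analogy only, not Hadoop code): tr plays the mapper that emits one key per line, sort is the shuffle that groups identical keys, and uniq -c is the reducer that sums each group.

```shell
# map:     emit one word per line          -> <word>
# shuffle: group identical keys            -> sorted words
# reduce:  sum each group                  -> <count, word>
echo "apple pear apple banana pear apple" \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
# prints:
#       3 apple
#       2 pear
#       1 banana
```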
Single-Machine Hadoop
Anyone learning Hadoop internals or developing on Hadoop needs a Hadoop system to work with. However:
- configuring one is such a headache that many people give up midway through the setup;
- you may have no servers available to use.
This article introduces a configuration-free way to install and run a single-machine Hadoop, so you can quickly run Hadoop examples to support learning, development, and testing.
Prerequisites: a Linux virtual machine on your laptop, with Docker installed inside the VM.
Installation
Pull the sequenceiq/hadoop-docker:2.7.0 image with Docker and run it.
[root@bogon ~]# docker pull sequenceiq/hadoop-docker:2.7.0
2.7.0: Pulling from sequenceiq/hadoop-docker
860d0823bcab: Pulling fs layer
e592c61b2522: Pulling fs layer
When the download succeeds it ends with:
Digest: sha256:a40761746eca036fee6aafdf9fdbd6878ac3dd9a7cd83c0f3f5d8a0e6350c76a
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.7.0
Startup
[root@bogon ~]# docker run -it --privileged=true sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
Starting sshd: [ OK ]
Starting namenodes on [b7a42f79339c]
b7a42f79339c: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-b7a42f79339c.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-b7a42f79339c.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-b7a42f79339c.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-b7a42f79339c.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-b7a42f79339c.out
After a successful start the shell drops you straight into the Hadoop container; there is no need for docker exec. Inside the container, go to /usr/local/hadoop/sbin and run ./start-all.sh and ./mr-jobhistory-daemon.sh start historyserver, as follows:
bash-4.1# cd /usr/local/hadoop/sbin
bash-4.1# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [b7a42f79339c]
b7a42f79339c: namenode running as process 128. Stop it first.
localhost: datanode running as process 219. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 402. Stop it first.
starting yarn daemons
resourcemanager running as process 547. Stop it first.
localhost: nodemanager running as process 641. Stop it first.
bash-4.1# ./mr-jobhistory-daemon.sh start historyserver
chown: missing operand after `/usr/local/hadoop/logs'
Try `chown --help' for more information.
starting historyserver, logging to /usr/local/hadoop/logs/mapred--historyserver-b7a42f79339c.out
Hadoop is up, and that is all it takes.
If you wonder how painful a distributed deployment is, just count the configuration files. I once watched a Hadoop veteran spend a whole morning failing to bring an environment up, simply because the new server's hostname contained a hyphen "-".
Running the Bundled Example
Go back to the Hadoop home directory and run the sample program:
bash-4.1# cd /usr/local/hadoop
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
20/07/05 22:34:41 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
20/07/05 22:34:43 INFO input.FileInputFormat: Total input paths to process : 31
20/07/05 22:34:43 INFO mapreduce.JobSubmitter: number of splits:31
20/07/05 22:34:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1594002714328_0001
20/07/05 22:34:44 INFO impl.YarnClientImpl: Submitted Application application_1594002714328_0001
20/07/05 22:34:45 INFO mapreduce.Job: The url to track the job: http://b7a42f79339c:8088/proxy/application_1594002714328_0001/
20/07/05 22:34:45 INFO mapreduce.Job: Running job: job_1594002714328_0001
20/07/05 22:35:04 INFO mapreduce.Job: Job job_1594002714328_0001 running in uber mode : false
20/07/05 22:35:04 INFO mapreduce.Job: map 0% reduce 0%
20/07/05 22:37:59 INFO mapreduce.Job: map 11% reduce 0%
20/07/05 22:38:05 INFO mapreduce.Job: map 12% reduce 0%
When the MapReduce computation completes, it ends with output like this:
20/07/05 22:55:26 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=230541
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=569
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=5929
Total time spent by all reduces in occupied slots (ms)=8545
Total time spent by all map tasks (ms)=5929
Total time spent by all reduce tasks (ms)=8545
Total vcore-seconds taken by all map tasks=5929
Total vcore-seconds taken by all reduce tasks=8545
Total megabyte-seconds taken by all map tasks=6071296
Total megabyte-seconds taken by all reduce tasks=8750080
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=132
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=159
CPU time spent (ms)=1280
Physical memory (bytes) snapshot=303452160
Virtual memory (bytes) snapshot=1291390976
Total committed heap usage (bytes)=136450048
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
View the result with the hdfs command:
bash-4.1# bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.server.namenode.
2 dfs.period
2 dfs.audit.log.maxfilesize
2 dfs.audit.log.maxbackupindex
1 dfsmetrics.log
1 dfsadmin
1 dfs.servers
1 dfs.replication
1 dfs.file
Walking Through the Example
grep here is a MapReduce program that matches a regular expression against the input and reports each matching string together with its occurrence count.
Unlike shell grep, which shows the whole matching line, this program shows only the matched substring within the line.
grep input output 'dfs[a-z.]+'
The regular expression dfs[a-z.]+ matches strings that start with dfs, followed by one or more characters that are either lowercase letters or a literal dot (inside a bracket expression, '.' is not a wildcard).
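Ordinary shell grep can confirm what the expression matches; -o prints only the matched part, mirroring what the MapReduce grep records (the sample text is invented):

```shell
# Inside [a-z.] the dot is literal, so the match stops at '=':
echo "set dfs.replication=3 in core-site.xml" | grep -oE 'dfs[a-z.]+'
# prints: dfs.replication
```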
The input is all the files in the input directory:
bash-4.1# ls -lrt
total 48
-rw-r--r--. 1 root root 690 May 16 2015 yarn-site.xml
-rw-r--r--. 1 root root 5511 May 16 2015 kms-site.xml
-rw-r--r--. 1 root root 3518 May 16 2015 kms-acls.xml
-rw-r--r--. 1 root root 620 May 16 2015 httpfs-site.xml
-rw-r--r--. 1 root root 775 May 16 2015 hdfs-site.xml
-rw-r--r--. 1 root root 9683 May 16 2015 hadoop-policy.xml
-rw-r--r--. 1 root root 774 May 16 2015 core-site.xml
-rw-r--r--. 1 root root 4436 May 16 2015 capacity-scheduler.xml
The results are written to output.
The computation flow is as follows.
The slight twist is that there are two reduce passes here: the second reduce sorts the results by occurrence count. Developers may compose map and reduce stages however they like, as long as each stage's output lines up with the next stage's input.
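That count-then-sort chaining can be sketched as a shell pipeline over invented sample lines (a sketch of the data flow only, not the real Hadoop job):

```shell
# job 1: match and count  -> <count, pattern>
# job 2: sort by count    -> results in descending order
printf 'dfs.class\ndfs.period\ndfs.class\ndfsadmin\n' \
  | grep -oE 'dfs[a-z.]+' \
  | sort | uniq -c | sort -rn
# first line printed: "      2 dfs.class"
```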
The Web Management UIs
Hadoop provides web-based management UIs on the following ports:
Port   Purpose
50070  Hadoop NameNode UI
50075  Hadoop DataNode UI
50090  Hadoop SecondaryNameNode
50030  JobTracker monitoring
50060  TaskTracker
8088   YARN job monitoring
60010  HBase HMaster monitoring UI
60030  HBase HRegionServer
8080   Spark monitoring UI
4040   Spark job UI
Add Command-Line Arguments
The docker run command needs port mappings before the UI pages can be reached:
docker run -it --privileged=true -p 50070:50070 -p 8088:8088 -p 50075:50075 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
After running this command you can browse the UIs from the host machine (or from inside the Linux VM, if it has a browser; my Linux has no GUI, so I view them from the host).
50070: Hadoop NameNode UI
50075: Hadoop DataNode UI
8088: YARN job monitoring
Completed and running MapReduce jobs can both be viewed on port 8088; the screenshot above shows the grep and wordcount jobs.
A Few Pitfalls
1. ./sbin/mr-jobhistory-daemon.sh start historyserver must be run, otherwise jobs fail with:
20/06/29 21:18:49 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:10020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
java.io.IOException: java.net.ConnectException: Call From 87a4217b9f8a/172.17.0.1 to 0.0.0.0:10020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
2. ./start-all.sh must be run, otherwise jobs fail with errors like Unknown Job job_1592960164748_0001.
3. The docker run command must include --privileged=true, otherwise jobs fail with java.io.IOException: Job status not available.
4. Note that Hadoop does not overwrite result files by default, so re-running the example above fails until you delete ./output first (for example with bin/hdfs dfs -rm -r output), or switch the output directory, say to output01.
Summary
The approach described here gets Hadoop installed and configured at very low cost, which helps with learning, development, and testing alike. To run your own Hadoop program, build it into a jar, upload it to the share/hadoop/mapreduce/ directory, and execute
bin/hadoop jar share/hadoop/mapreduce/yourtest.jar
to run it and observe the results.






