Hadoop Series. Installing Hadoop 3 - Basic Installation
In the previous post we took a light look at what Hadoop, and HDFS in particular, is. This time I want to install it myself and get some hands-on experience with Hadoop.
While putting this post together, I went back and forth on which Hadoop to install.
CDH? HDP?
After some deliberation, I decided that for learning the fundamentals, the open-source Apache Hadoop was the way to go.
In particular, I chose Hadoop 3, which has been drawing a lot of attention recently.
Cluster Layout
I'll build the Hadoop cluster out of 6 nodes in total: 3 master nodes and 3 worker nodes.
Node | Role | Notes
---|---|---
story-hadoop-master01 | Active NameNode | |
story-hadoop-master02 | Standby NameNode | |
story-hadoop-master03 | Standby NameNode | Observer NameNode test planned |
story-hadoop-worker01 | DataNode | |
story-hadoop-worker02 | DataNode | |
story-hadoop-worker03 | DataNode | |
# NameNode HA with 3 nodes?
Since Hadoop 3 it has been possible to configure more than two NameNodes, so in this post I'll go with a three-NameNode setup.
Skimming through the documentation, I came across the following note:
Note: The minimum number of NameNodes for HA is two, but you can configure more. Its suggested to not exceed 5 - with a recommended 3 NameNodes - due to communication overheads.
Because of the communication overhead, it seems best not to go beyond 5 NameNodes; 3 is the recommended number.
# Observer NameNode?
This was a new concept for me as well.
From a quick read of the documentation, an Observer NameNode appears to be a mechanism that serves read requests itself, reducing the read load on the Active NameNode.
NOTE: the feature for Observer NameNode to participate in failover is not implemented yet. Therefore, as described in the next section, you should only use transitionToObserver to bring up an observer and put it outside the ZooKeeper controlled failover group. You should not use transitionToStandby since the host for the Observer NameNode cannot have ZKFC running.
The `Note` above caught my eye while reading, so I've quoted it here.
It seems that, for now, an Observer NameNode cannot run with ZKFC and has to stay outside the ZooKeeper-controlled failover group. Hopefully that limitation gets resolved soon.
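Just to note the command the quoted passage refers to: once a third NameNode is up as a standby, promoting it to an observer would look roughly like the sketch below (nn3 is the NameNode ID used later in this post; I'll walk through this properly in the Observer post).
# Run from a node with the client configuration; nn3 is the intended observer.
hdfs haadmin -transitionToObserver nn3
hdfs haadmin -getAllServiceState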
Step 1. Preparation
Step 1-1. Installing Java
$ sudo apt-get install openjdk-8-jdk -y
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
source ~/.bashrc
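This needs to be done on all six nodes. A quick sanity check that both the JDK and the exported JAVA_HOME are in place:
java -version
echo $JAVA_HOME   # should print /usr/lib/jvm/java-8-openjdk-amd64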
Step 1-2. Installing ZooKeeper
For the ZooKeeper installation, please refer to my earlier post.
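Before moving on, it's worth confirming the ensemble is reachable on the 2181 ports used later in `ha.zookeeper.quorum`. A minimal check, assuming zkCli.sh from your ZooKeeper installation is on the PATH (adjust to however you installed it):
# Each ensemble member should respond to a simple "ls /".
for host in story-hadoop-master01 story-hadoop-master02 story-hadoop-master03; do
  zkCli.sh -server ${host}:2181 ls /
done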
Step 2. Installing Hadoop
Step 2-1. Downloading Hadoop
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
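Apache publishes a .sha512 file next to each release tarball, so you can optionally verify the download before extracting it (assuming GNU coreutils' sha512sum is available and accepts the file's checksum format):
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz.sha512
$ sha512sum -c hadoop-3.2.2.tar.gz.sha512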
Step 2-2. Extracting the archive
I extracted it under /opt.
sudo tar -zxvf hadoop-3.2.2.tar.gz -C /opt/
Step 2-3. Creating a symbolic link
For easier access, create a symbolic link as well.
sudo ln -s /opt/hadoop-3.2.2/ /opt/hadoop
Since I'm working under an account named deploy, I changed the ownership to that user.
sudo chown -R deploy. /opt/hadoop
Step 2-4. Creating log directories
# Create the log directories
sudo mkdir -p /var/log/hadoop
sudo mkdir -p /var/log/hadoop-mapreduce
# Grant ownership
sudo chown -R deploy. /var/log/hadoop
sudo chown -R deploy. /var/log/hadoop-mapreduce
Step 2-5. Splitting out a conf path
I linked the Hadoop configuration directory to a separate location that's easier for me to remember.
# Create the directory
sudo mkdir -p /etc/hadoop
# Create the symbolic link
sudo ln -s /opt/hadoop/etc/hadoop /etc/hadoop/conf
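Since the link just points back into /opt/hadoop/etc/hadoop, the daemons will find the same files either way. If you'd rather have everything resolve the /etc/hadoop/conf path explicitly, one option (my suggestion, not something the rest of this post depends on) is to export HADOOP_CONF_DIR:
echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> ~/.bashrc
source ~/.bashrc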
Step 2-6. PID directory
Create a directory to hold Hadoop's PID files.
sudo mkdir -p /hadoop
sudo mkdir -p /hadoop/pid
sudo chown -R deploy. /hadoop
Step 2-7. bashrc configuration
Register the Hadoop bin directory on the PATH so the commands can be invoked conveniently.
echo 'export PATH=$PATH:/opt/hadoop/bin' >> ~/.bashrc
source ~/.bashrc
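A quick check that the new PATH entry is picked up:
which hdfs    # should resolve to /opt/hadoop/bin/hdfs
hdfs version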
Step 3. Configuration
Personally, this is the part of Hadoop administration I find the hardest: managing the configuration files.
Precisely because it's hard, I've kept things to the bare essentials. The files are fairly long, so I've attached them in full below.
Step 3-1. core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://story-test</value>
<final>true</final>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>story-hadoop-master01:2181,story-hadoop-master02:2181,story-hadoop-master03:2181</value>
</property>
</configuration>
Step 3-2. hadoop-env.sh
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}
export HADOOP_HEAPSIZE="1024"
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
HADOOP_JOBTRACKER_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xmx1024m -Dhadoop.security.logger=INFO,DRFAS -Dmapred.audit.logger=INFO,MRAUDIT -Dhadoop.mapreduce.jobsummary.logger=INFO,JSA ${HADOOP_JOBTRACKER_OPTS}"
HADOOP_TASKTRACKER_OPTS="-server -Xmx1024m -Dhadoop.security.logger=ERROR,console -Dmapred.audit.logger=ERROR,console ${HADOOP_TASKTRACKER_OPTS}"
SHARED_HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms1024m -Xmx2048m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT"
export HADOOP_DATANODE_OPTS="-server -XX:ParallelGCThreads=4 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/var/log/hadoop/hs_err_pid%p.log -XX:NewSize=200m -XX:MaxNewSize=200m -Xloggc:/var/log/hadoop/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xms1024m -Xmx2048m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,DRFAAUDIT ${HADOOP_DATANODE_OPTS} -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx${HADOOP_HEAPSIZE}m $HADOOP_CLIENT_OPTS"
HADOOP_NFS3_OPTS="-Xmx1024m -Dhadoop.security.logger=ERROR,DRFAS ${HADOOP_NFS3_OPTS}"
HADOOP_BALANCER_OPTS="-server -Xmx1024m ${HADOOP_BALANCER_OPTS}"
# On secure datanodes, user to run the datanode as after dropping privileges
export HADOOP_SECURE_DN_USER=${HADOOP_SECURE_DN_USER:-""}
# Extra ssh options. Empty by default.
export HADOOP_SSH_OPTS="-o ConnectTimeout=5 -o SendEnv=HADOOP_CONF_DIR"
# Where log files are stored. $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=/var/log/hadoop
export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-"INFO,DRFA"}
export HADOOP_SECURITY_LOGGER=${HADOOP_SECURITY_LOGGER:-"INFO,DRFAS"}
export HDFS_AUDIT_LOGGER=${HDFS_AUDIT_LOGGER:-"INFO,DRFA,NullAppender"}
# History server logs
export HADOOP_MAPRED_LOG_DIR=/var/log/hadoop-mapreduce
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=/var/log/hadoop
# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1
# The directory where pid files are stored. /tmp by default.
export HADOOP_PID_DIR=/hadoop/pid
export HADOOP_SECURE_DN_PID_DIR=/hadoop/pid
# History server pid
export HADOOP_MAPRED_PID_DIR=/hadoop/pid
YARN_RESOURCEMANAGER_OPTS="-Dyarn.server.resourcemanager.appsummary.logger=INFO,RMSUMMARY"
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
# Add database libraries
JAVA_JDBC_LIBS=""
if [ -d "/usr/share/java" ]; then
for jarFile in `ls /usr/share/java | grep -E "(mysql|ojdbc|postgresql|sqljdbc)" 2>/dev/null`
do
JAVA_JDBC_LIBS=${JAVA_JDBC_LIBS}:$jarFile
done
fi
# Add libraries to the hadoop classpath - some may not need a colon as they already include it
export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}${JAVA_JDBC_LIBS}
# Setting path to hdfs command line
export HADOOP_LIBEXEC_DIR=/opt/hadoop/libexec
# Mostly required for hadoop 2.0
export JAVA_LIBRARY_PATH=${JAVA_LIBRARY_PATH}:/opt/hadoop/lib/native
# Fix temporary bug, when ulimit from conf files is not picked up, without full relogin.
# Makes sense to fix only when runing DN as root
if [ "$command" == "datanode" ] && [ "$EUID" -eq 0 ] && [ -n "$HADOOP_SECURE_DN_USER" ]; then
ulimit -n 128000
fi
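One caveat about the file above: it carries over several Hadoop 2 / HDP-era variable names (for example, SHARED_HADOOP_NAMENODE_OPTS is defined but never handed to the NameNode, and HADOOP_SECURE_DN_USER is deprecated). As far as I know, Hadoop 3 prefers per-daemon variables and logs deprecation warnings for the old ones; a rough sketch of the Hadoop 3-style equivalents, if you want to wire the JVM options in explicitly, would be:
# Hadoop 3 per-daemon variables (the older names still work but are deprecated).
export HDFS_NAMENODE_OPTS="${SHARED_HADOOP_NAMENODE_OPTS}"   # was HADOOP_NAMENODE_OPTS
export HDFS_DATANODE_OPTS="${HADOOP_DATANODE_OPTS}"          # was HADOOP_DATANODE_OPTS
export HDFS_DATANODE_SECURE_USER="${HADOOP_SECURE_DN_USER}"  # was HADOOP_SECURE_DN_USER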
Step 3-3. hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop/hdfs/namenode</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop/hdfs/data</value>
<final>true</final>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/hadoop/hdfs/journal/edit</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>story-test</value>
</property>
<property>
<name>dfs.client.failover.proxy.provider.story-test</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.namenodes.story-test</name>
<value>nn1,nn2,nn3</value>
</property>
<property>
<name>dfs.namenode.http-address.story-test.nn1</name>
<value>story-hadoop-master01:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.story-test.nn2</name>
<value>story-hadoop-master02:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.story-test.nn3</name>
<value>story-hadoop-master03:50070</value>
</property>
<property>
<name>dfs.namenode.rpc-address.story-test.nn1</name>
<value>story-hadoop-master01:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.story-test.nn2</name>
<value>story-hadoop-master02:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.story-test.nn3</name>
<value>story-hadoop-master03:8020</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>
qjournal://story-hadoop-master01:8485;story-hadoop-master02:8485;story-hadoop-master03:8485/story-test
</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/hadoop/hdfs/namesecondary</value>
</property>
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>
${dfs.namenode.checkpoint.dir}
</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.fslock.fair</name>
<value>false</value>
</property>
<property>
<name>dfs.ha.tail-edits.in-progress</name>
<value>true</value>
</property>
<property>
<name>dfs.ha.tail-edits.period</name>
<value>0ms</value>
</property>
<property>
<name>dfs.ha.tail-edits.period.backoff-max</name>
<value>10ms</value>
</property>
<property>
<name>dfs.journalnode.edit-cache-size.bytes</name>
<value>1048576</value>
</property>
<property>
<name>dfs.namenode.accesstime.precision</name>
<value>0</value>
</property>
</configuration>
Step 4. Startup (NameNodes)
Step 4-1. Starting the JournalNodes
Run this on all three master servers (master01 through master03).
hdfs --daemon start journalnode
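A quick way to confirm each JournalNode came up before touching the NameNodes (8485 is the port referenced by dfs.namenode.shared.edits.dir; assuming ss is available, or substitute netstat):
jps | grep JournalNode
ss -tlnp | grep 8485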
Step 4-2. Starting the Active NameNode
- story-hadoop-master01
On the Active NameNode, format the NameNode and prepare the HA components with `initializeSharedEdits` and `zkfc -formatZK`, as shown below.
# pre-setting
hdfs namenode -format
hdfs namenode -initializeSharedEdits -force
hdfs zkfc -formatZK -force
# run
hdfs --daemon start namenode
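Before bootstrapping the standbys, I like to check that the process actually came up; hitting the web UI port configured in hdfs-site.xml (50070) is an easy way to do that, assuming curl is installed:
jps | grep NameNode
curl -s -o /dev/null -w "%{http_code}\n" http://story-hadoop-master01:50070/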
Step 4-3. Starting the Standby NameNodes
- story-hadoop-master02
- story-hadoop-master03
On the Standby NameNodes, bootstrap the standby state as shown below and then start the process.
# pre-setting
hdfs namenode -bootstrapStandby -force
# run
hdfs --daemon start namenode
Step 4-4. Starting ZKFC
- story-hadoop-master01
- story-hadoop-master02
- story-hadoop-master03
Start ZKFC so that active/standby failover is managed automatically.
hdfs --daemon start zkfc
Step 4-5. Verifying the NameNodes
$ jps
690 JournalNode
931 NameNode
1913 DFSZKFailoverController
2844 Jps
$ hdfs haadmin -getAllServiceState
story-hadoop-master01:8020 active
story-hadoop-master02:8020 standby
story-hadoop-master03:8020 standby
Step 5. Startup (DataNodes)
Step 5-1. Starting the DataNodes
sudo mkdir -p /var/lib/hadoop-hdfs
sudo chown -R deploy:deploy /var/lib/hadoop-hdfs
hdfs --daemon start datanode
Step 5-2. Verifying the DataNodes
$ jps
14889 Jps
14812 DataNode
$ hdfs dfsadmin -report
Configured Capacity: 155545079808 (144.86 GB)
Present Capacity: 115403522048 (107.48 GB)
DFS Remaining: 115403448320 (107.48 GB)
DFS Used: 73728 (72 KB)
DFS Used%: 0.00%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Live datanodes (3):
Name: 10.202.103.113:9866 (10.202.103.113)
Hostname: story-hadoop-worker01
Decommission Status : Normal
Configured Capacity: 51848359936 (48.29 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 13331693568 (12.42 GB)
DFS Remaining: 38499864576 (35.86 GB)
DFS Used%: 0.00%
DFS Remaining%: 74.25%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Sep 22 17:14:15 KST 2021
Last Block Report: Wed Sep 22 17:14:12 KST 2021
Num of Blocks: 0
Name: 10.202.122.76:9866 (10.202.122.76)
Hostname: story-hadoop-worker02
Decommission Status : Normal
Configured Capacity: 51848359936 (48.29 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 13553762304 (12.62 GB)
DFS Remaining: 38277795840 (35.65 GB)
DFS Used%: 0.00%
DFS Remaining%: 73.83%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Sep 22 17:14:15 KST 2021
Last Block Report: Wed Sep 22 17:14:12 KST 2021
Num of Blocks: 0
Name: 10.202.123.24:9866 (10.202.123.24)
Hostname: story-hadoop-worker03
Decommission Status : Normal
Configured Capacity: 51848359936 (48.29 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 13205770240 (12.30 GB)
DFS Remaining: 38625787904 (35.97 GB)
DFS Used%: 0.00%
DFS Remaining%: 74.50%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Wed Sep 22 17:14:15 KST 2021
Last Block Report: Wed Sep 22 17:14:12 KST 2021
Num of Blocks: 0
Step 6. Testing
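Before the TestDFSIO benchmark, a simple write/read round trip against the new filesystem is a nice sanity check (the path below is just an example):
hdfs dfs -mkdir -p /tmp/smoke-test
echo "hello hadoop3" | hdfs dfs -put - /tmp/smoke-test/hello.txt
hdfs dfs -cat /tmp/smoke-test/hello.txt
hdfs dfs -rm -r /tmp/smoke-test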
Step 6-1. Write Test
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
$ cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Wed Sep 22 17:18:39 KST 2021
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 182.91
Average IO rate mb/sec: 187.19
IO rate std deviation: 28.48
Test exec time sec: 58.08
Step 6-2. Read Test
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
$ cat TestDFSIO_results.log
...
----- TestDFSIO ----- : read
Date & time: Wed Sep 22 17:22:16 KST 2021
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 412.88
Average IO rate mb/sec: 464.31
IO rate std deviation: 166.27
Test exec time sec: 26.8
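TestDFSIO leaves its test data on HDFS (by default under /benchmarks/TestDFSIO), so once you're done it can be removed with the same jar's -clean option:
$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -clean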
Closing Thoughts
In this post we covered installing Hadoop 3, with a focus on the three-NameNode configuration.
In the next post I'll take a closer look at the Observer NameNode.