I. Background
The BI cluster has 60+ nodes and over 2 PB of data, and every machine has been in service for more than three years.
II. Symptoms
1. YARN jobs were severely delayed and sometimes even failed with timeouts.
2. After a delayed YARN job was manually killed and rerun, it succeeded most of the time.
III. Investigation
At first we suspected that the jobs simply lacked resources at the time they ran, and for a while we treated this purely as a resource-shortage problem.
1. Reviewed the logs of the delayed jobs.
2. Reviewed the logs on the individual nodes.
3. Analyzed the tasks and containers of the affected jobs and found that one particular node took exceptionally long whenever it ran a task.
4. Concluded that this machine was probably at fault and focused the investigation on it:
Tracing that node's logs, the YARN logs looked basically normal, but the Hadoop DataNode log contained exceptions. Searching on those exception messages produced no conclusion. A representative excerpt of the DataNode log:
2021-09-27 16:03:12,467 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378156_3517112822, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected local=/10.xx.xx.xx:50010 remote=/10.204.114.146:55280]. 447424 millis timeout left.
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:06,623 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113 src: /10.ee.ee.ee:19545 dest: /10.xx.xx.xx:50010
2021-09-27 16:26:07,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.ee.ee.ee:19545, dest: /10.xx.xx.xx:50010, bytes: 5730, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-857996562_138, offset: 0, srvID: ff8d66b8-7176-4c2e-a530-6b5038d64e52, blockid: BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113, duration: 604633141
2021-09-27 16:26:07,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2021-09-27 16:26:07,399 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1165)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714 received exception java.io.IOException: Connection reset by peer
2021-09-27 16:26:07,406 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.ee.ee.ee:31015 dst: /10.xx.xx.xx:50010
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,560 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022, type=HAS_DOWNSTREAM_IN_PIPELINE
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022 received exception java.io.IOException: Connection reset by peer
2021-09-27 16:30:39,561 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.ee.ee.ee:2043 dst: /10.xx.xx.xx:50010
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:197)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
    at java.io.DataInputStream.read(DataInputStream.java:149)
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475389081_3517122514 src: /10.ee.ee.ee:34679 dest: /10.xx.xx.xx:50010
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022 src: /10.ee.ee.ee:6123 dest: /10.xx.xx.xx:50010
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_4475378492_3517117022, RBW
  getNumBytes()     = 44179902
  getBytesOnDisk()  = 44179902
  getVisibleLength()= 44179902
  getVolume()       = /data/dfs/data/current
  getBlockFile()    = /data/dfs/data/current/BP-1382344001-10.204.25.17-1458873906864/current/rbw/blk_4475378492
  bytesAcked=44179902
  bytesOnDisk=44179902
2021-09-27 18:27:02,533 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1165)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run():
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:65)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
    at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
    at java.io.DataOutputStream.flush(DataOutputStream.java:123)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 received exception java.io.IOException: Premature EOF from inputStream
2021-09-27 18:27:02,536 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.ee.ee.ee:3268 dst: /10.xx.xx.xx:50010
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
    at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 (numBytes=6635939) to /10.216.5.16:50010
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 src: /10.ee.ee.ee:3340 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_4475546259_3517286755, RBW
  getNumBytes()     = 6635939
  getBytesOnDisk()  = 6635939
  getVisibleLength()= 6635939
  getVolume()       = /data/dfs/data/current
  getBlockFile()    = /data/dfs/data/current/BP-1382344001-10.204.25.17-1458873906864/current/rbw/blk_4475546259
  bytesAcked=6635939
  bytesOnDisk=6635939
2021-09-27 18:27:03,630 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475550789_3517286896 src: /10.ee.ee.ee:25889 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:03,670 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475550790_3517286897 src: /10.ee.ee.ee:48873 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:03,673 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.ee.ee.ee:48873, dest: /10.xx.xx.xx:50010, bytes: 19994, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1365176392_177806, offset: 0, srvID: ff8d66b8-7176-4c2e-a530-6b5038d64e52, blockid: BP-1382344001-10.204.25.17-1458873906864:blk_4475550790_3517286897, duration: 2383047
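As an aside, a quick way to triage a wall of DataNode errors like the excerpt above is to count the distinct exception types before reading individual stack traces. A minimal sketch (the log path in the usage comment is only an example; use your cluster's actual DataNode log location):

```shell
# Count distinct Java exception class names in a log file, most frequent first.
summarize_dn_exceptions() {
  grep -oE '(java|org)\.[A-Za-z0-9.$]+Exception' "$1" | sort | uniq -c | sort -rn
}

# Usage (hypothetical path):
# summarize_dn_exceptions /var/log/hadoop-hdfs/hadoop-hdfs-datanode.log
```

On the excerpt above this immediately shows that "Connection reset by peer" and "Premature EOF" dominate, which hints at a transport problem rather than a disk or HDFS logic problem.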
1) Suspected data skew: perhaps the data used by some job was unevenly distributed and concentrated on this node, so it had far more to process and took much longer. This was later ruled out.
2) Suspected a configuration problem on this machine. While checking every setting we found the JDK versions were inconsistent: this machine ran OpenJDK 1.8 while other machines ran Oracle JDK 1.8 (and some also ran OpenJDK 1.8). To rule out the JDK, we switched this machine to Oracle JDK 1.8 and restarted the services, but the problem persisted. (Along the way we discovered the cluster was running several different minor JDK versions. One has to wonder why the operators didn't keep new nodes consistent with the existing ones when expanding the cluster.)
In between we also suspected an inconsistent CentOS Linux kernel version; we checked and found nothing wrong, although one minor kernel version is reportedly buggy.
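A version audit like the one above can be scripted. A hedged sketch, assuming a nodes.txt host list (one hostname per line) and passwordless ssh to each node:

```shell
# Collect the JDK banner from every node. `java -version` writes to stderr,
# hence the 2>&1. nodes.txt and ssh access are assumptions.
collect_jdk_versions() {
  while read -r host; do
    printf '%s\t%s\n' "$host" "$(ssh -o ConnectTimeout=5 "$host" 'java -version 2>&1 | head -1')"
  done < nodes.txt
}

# Input: "host<TAB>version banner" lines; output: host count per distinct banner.
jdk_version_report() {
  cut -f2 | sort | uniq -c | sort -rn
}

# Usage: collect_jdk_versions | jdk_version_report
# More than one line of output means the cluster is running mixed JDK builds.
```

The same two-stage pattern (collect per host, then aggregate) works for kernel versions by swapping the remote command for `uname -r`.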
3) Since the machine was old, we suspected failing disks. We ran disk checks and full hardware fault diagnostics; every component came back healthy, ruling out the disks.
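For reference, a minimal sketch of this kind of disk health sweep, assuming smartmontools is installed and the data disks appear as /dev/sd* (both are assumptions; the source does not say which tools were used):

```shell
# Extract the overall-health verdict from `smartctl -H` output.
smart_status() { awk -F': *' '/overall-health/ {print $2}'; }

# Sweep every sd* disk; must run as root. NVMe devices or disks behind a
# hardware RAID controller need different smartctl flags (-d ...).
check_all_disks() {
  for dev in /dev/sd?; do
    printf '%s\t%s\n' "$dev" "$(smartctl -H "$dev" | smart_status)"
  done
}
# check_all_disks
```

Anything other than PASSED in the second column is worth a closer look with `smartctl -a`.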
4) Rebooted the machine; the problem persisted. But it yielded one discovery: while this machine was down for the reboot, cluster jobs ran fast. That all but confirmed this single machine was dragging down the whole cluster.
5) Suspected the network. Initial checks all looked normal, but a sustained ping eventually revealed about 1% packet loss.
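The key detail was a long-running ping rather than a quick one: a handful of packets will usually look clean at 1% loss. A sketch (hostname and packet count are placeholders):

```shell
# Pull the loss percentage out of ping's summary line.
loss_pct() { grep -oE '[0-9.]+% packet loss' | cut -d'%' -f1; }

# A short ping often looks clean; 1000+ packets gives 0.1% resolution and is
# the kind of sustained run that surfaced the ~1% loss in this incident.
check_loss() {
  ping -c 1000 -i 0.2 "$1" | loss_pct
}
# check_loss datanode88.bi
```

Any persistent non-zero result between cluster nodes is abnormal on a LAN and worth escalating to the network team.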
6) That pointed to the network cable or the port itself. The problem was finally resolved by switching to a different fiber port (from port A to port B).
IV. Conclusions
1. A sustained ping is a useful way to judge whether the network is healthy.
2. Network problems can cause all kinds of unexpected failures.
The YARN jobs misbehaved, the telltale exceptions sat in the DataNode logs, and the root cause turned out to be a fiber port. The ending of an investigation always manages to surprise.