How to Use the Java API to Operate HDFS


王林

Apr 19, 2023 pm 02:28 PM


1. Traversing all files and folders under a directory

The listStatus method covers this requirement.
Its signature is as follows:

  /**
   * List the statuses of the files/directories in the given path if the path is
   * a directory.
   * 
   * @param f given path
   * @return the statuses of the files/directories in the given path
   * @throws FileNotFoundException when the path does not exist;
   *         IOException see specific implementation
   */
  public abstract FileStatus[] listStatus(Path f) throws FileNotFoundException, 
                                                         IOException;

As the signature shows, listStatus takes a single Path argument and returns an array of FileStatus.
FileStatus holds the following information:

/** Interface that represents the client side information for a file.
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class FileStatus implements Writable, Comparable {

  private Path path;
  private long length;
  private boolean isdir;
  private short block_replication;
  private long blocksize;
  private long modification_time;
  private long access_time;
  private FsPermission permission;
  private String owner;
  private String group;
  private Path symlink;
  ....

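As the fields show, FileStatus carries the file path, length, whether the entry is a directory, block_replication, blocksize, and other metadata. The example below calls the API from Scala and takes the Hadoop configuration from a SparkContext.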

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.slf4j.LoggerFactory

object HdfsOperation {

	val logger = LoggerFactory.getLogger(this.getClass)

	// Recursively walk `path` and log the metadata of every entry found.
	def tree(sc: SparkContext, path: String): Unit = {
		// FileSystem.get returns a cached, shared instance, so it is not closed here.
		val fs = FileSystem.get(sc.hadoopConfiguration)
		val fsPath = new Path(path)
		// listStatus returns the direct children (files and directories) of fsPath.
		val status = fs.listStatus(fsPath)
		for (filestatus <- status) {
			logger.info("getPermission is: {}", filestatus.getPermission)
			logger.info("getOwner is: {}", filestatus.getOwner)
			logger.info("getGroup is: {}", filestatus.getGroup)
			logger.info("getLen is: {}", filestatus.getLen)
			logger.info("getModificationTime is: {}", filestatus.getModificationTime)
			logger.info("getReplication is: {}", filestatus.getReplication)
			logger.info("getBlockSize is: {}", filestatus.getBlockSize)
			if (filestatus.isDirectory) {
				val dirpath = filestatus.getPath.toString
				logger.info("directory path is: {}", dirpath)
				// Descend into the subdirectory.
				tree(sc, dirpath)
			} else {
				val fullname = filestatus.getPath.toString
				val filename = filestatus.getPath.getName
				logger.info("full path is: {}", fullname)
				logger.info("file name is: {}", filename)
			}
		}
	}
}

When a FileStatus turns out to be a directory, tree calls itself recursively on it, so the whole subtree is traversed.
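The values logged above by getLen, getBlockSize and getModificationTime are raw numbers. The following is a minimal Java sketch of how they might be interpreted. FileStatusMath and its two methods are hypothetical helpers, not part of Hadoop, and the block count assumes the usual layout in which every block except the last is full.

```java
import java.time.Instant;

// Hypothetical helper (not a Hadoop class): interprets raw FileStatus values.
final class FileStatusMath {

    // Number of blocks a file of `length` bytes occupies when it is split into
    // blocks of `blockSize` bytes. Assumes only the last block may be partial.
    static long blockCount(long length, long blockSize) {
        if (length < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("length must be >= 0 and blockSize > 0");
        }
        // Ceiling division written without risking overflow.
        return length / blockSize + (length % blockSize == 0 ? 0 : 1);
    }

    // getModificationTime() reports milliseconds since the Unix epoch (UTC).
    static Instant modifiedAt(long modificationTimeMillis) {
        return Instant.ofEpochMilli(modificationTimeMillis);
    }
}
```

Under these assumptions, a 300 MiB file with a 128 MiB block size gives blockCount(300L * 1024 * 1024, 128L * 1024 * 1024) == 3, and an empty file gives 0.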

2. Traversing all files

The method above traverses both files and folders. To traverse files only, use the listFiles method.

	// Log every file under `path`, including files in subdirectories.
	def findFiles(sc: SparkContext, path: String): Unit = {
		val fs = FileSystem.get(sc.hadoopConfiguration)
		val fsPath = new Path(path)
		// The second argument (recursive = true) makes listFiles descend into subdirectories.
		val files = fs.listFiles(fsPath, true)
		while (files.hasNext) {
			val filestatus = files.next()
			val fullname = filestatus.getPath.toString
			val filename = filestatus.getPath.getName
			logger.info("full path is: {}", fullname)
			logger.info("file name is: {}", filename)
			logger.info("file size is: {}", filestatus.getLen)
		}
	}

  /**
   * List the statuses and block locations of the files in the given path.
   * 
   * If the path is a directory, 
   *   if recursive is false, returns files in the directory;
   *   if recursive is true, return files in the subtree rooted at the path.
   * If the path is a file, return the file's status and block locations.
   * 
   * @param f is the path
   * @param recursive if the subdirectories need to be traversed recursively
   *
   * @return an iterator that traverses statuses of the files
   *
   * @throws FileNotFoundException when the path does not exist;
   *         IOException see specific implementation
   */
  public RemoteIterator<LocatedFileStatus> listFiles(
      final Path f, final boolean recursive)
  throws FileNotFoundException, IOException {
  ...

The source shows that listFiles returns an iterator, RemoteIterator<LocatedFileStatus>, whereas listStatus returns an array. In addition, listFiles returns only files, never directories.
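Because a RemoteIterator is not a java.util.Iterator (its hasNext and next methods can throw IOException), it cannot be used directly in a for-each loop. The Java sketch below shows one way to drain such an iterator into a list. RemoteIteratorLike is a stand-in declared here only so the example compiles without Hadoop on the classpath; it mirrors the two methods of org.apache.hadoop.fs.RemoteIterator. RemoteIterators.toList is a hypothetical helper name.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in with the same two methods as org.apache.hadoop.fs.RemoteIterator.
// Assumption: in real code the Hadoop interface would be used instead.
interface RemoteIteratorLike<E> {
    boolean hasNext() throws IOException;
    E next() throws IOException;
}

// Hypothetical helper (not a Hadoop class).
final class RemoteIterators {

    // Drains the iterator into a list. An IOException thrown by either
    // hasNext() or next() propagates to the caller.
    static <E> List<E> toList(RemoteIteratorLike<E> it) throws IOException {
        List<E> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

Collecting everything holds all entries in memory at once, so for very large directory trees the plain while loop over hasNext and next is likely the better fit.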

3. Creating a folder

	// Create `path`, including any missing parent directories.
	def mkdirToHdfs(sc: SparkContext, path: String): Unit = {
		val fs = FileSystem.get(sc.hadoopConfiguration)
		val result = fs.mkdirs(new Path(path))
		if (result) {
			logger.info("mkdirs succeeded")
		} else {
			logger.error("mkdirs failed")
		}
	}

4. Deleting a folder

	// Delete `path`; the second argument (recursive = true) also removes its contents.
	def deleteOnHdfs(sc: SparkContext, path: String): Unit = {
		val fs = FileSystem.get(sc.hadoopConfiguration)
		val result = fs.delete(new Path(path), true)
		if (result) {
			logger.info("delete succeeded")
		} else {
			logger.error("delete failed")
		}
	}

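5. Uploading a file

	// Copy a local file to HDFS.
	def uploadToHdfs(sc: SparkContext, localPath: String, hdfsPath: String): Unit = {
		val fs = FileSystem.get(sc.hadoopConfiguration)
		fs.copyFromLocalFile(new Path(localPath), new Path(hdfsPath))
		// fs is a cached instance shared with the rest of the application, so it is
		// left open; closing it can cause "Filesystem closed" errors elsewhere.
	}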

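6. Downloading a file

	// Copy a file from HDFS to the local file system.
	def downloadFromHdfs(sc: SparkContext, localPath: String, hdfsPath: String): Unit = {
		val fs = FileSystem.get(sc.hadoopConfiguration)
		fs.copyToLocalFile(new Path(hdfsPath), new Path(localPath))
		// fs is a cached instance shared with the rest of the application, so it is
		// left open; closing it can cause "Filesystem closed" errors elsewhere.
	}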


This article is reproduced from 亿速云.
