2017-10-16

Reading Large Files with Limited Memory

Contents

1. Reading in Memory the Traditional Way
2. Streaming Through the File
3. Streaming with Apache Commons IO
4. Conclusion

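This article addresses out-of-memory errors when reading large files in Java: how to process a large file without reading it repeatedly and without exhausting memory.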

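1. Reading in Memory the Traditional Way

The standard way to read the lines of a file is to load them into memory. Both Guava and Apache Commons IO provide a quick way to do this: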

Files.readLines(new File(path), Charsets.UTF_8);
FileUtils.readLines(new File(path));

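The problem with this approach is that every line of the file is held in memory, so a sufficiently large file quickly causes the program to throw an OutOfMemoryError.
For example, reading a file of about 1 GB: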

@Test
public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException {
  String path = ...
  Files.readLines(new File(path), Charsets.UTF_8);
}

At the start, this approach uses very little memory (close to 0 MB consumed):

[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 128Mb
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 116Mb

However, once the whole file has been read into memory, the final figures show that about 2 GB was consumed:

[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 2666Mb
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 490Mb

This means the process used about 2.1 GB of memory. The reason is simple: every line of the file is now stored in memory.
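The log excerpts do not show how the figures were obtained. As a minimal sketch, assuming they come from `Runtime.totalMemory()` and `Runtime.freeMemory()`, the consumed amount is simply total minus free. The names `MemoryLogger`, `usedMb`, and `logMemory` are hypothetical and used only for illustration:

```java
// Hypothetical helper; assumes the logged figures come from java.lang.Runtime.
class MemoryLogger {
    private static final long MB = 1024L * 1024L;

    // Memory in use, in megabytes: heap currently reserved by the JVM minus its free part.
    static long usedMb(long totalBytes, long freeBytes) {
        return (totalBytes - freeBytes) / MB;
    }

    // Prints the same two figures that appear in the log excerpts.
    static void logMemory() {
        Runtime rt = Runtime.getRuntime();
        System.out.println("Total Memory: " + rt.totalMemory() / MB + "Mb");
        System.out.println("Free Memory: " + rt.freeMemory() / MB + "Mb");
    }

    public static void main(String[] args) {
        logMemory();
        // Figures from the second log excerpt: 2666 Mb total, 490 Mb free.
        System.out.println("Used: " + usedMb(2666 * MB, 490 * MB) + "Mb");
    }
}
```

With the figures from the second excerpt, 2666 Mb minus 490 Mb gives 2176 Mb, which is the roughly 2.1 GB quoted above.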
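Keeping the entire contents of a file in memory quickly exhausts the available memory, however much there actually is.
Moreover, we usually do not need all the lines in memory at once. We only need to iterate over the lines, process each one, and discard it afterwards. That is exactly what we will do next: iterate line by line instead of holding every line in memory.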

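2. Streaming Through the File

Now let's look at that solution. We will use the java.util.Scanner class to scan the contents of the file and read it sequentially, one line at a time: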

FileInputStream inputStream = null;
Scanner sc = null;
try {
  inputStream = new FileInputStream(path);
  sc = new Scanner(inputStream, "UTF-8");
  while (sc.hasNextLine()) {
    String line = sc.nextLine();
    // System.out.println(line);
  }
  // note that Scanner suppresses exceptions
  if (sc.ioException() != null) {
    throw sc.ioException();
  }
} finally {
  if (inputStream != null) {
    inputStream.close();
  }
  if (sc != null) {
    sc.close();
  }
}

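This solution iterates over every line in the file, allowing each line to be processed without keeping a reference to it. In short, the lines are not stored in memory (roughly 150 MB consumed):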

[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 763Mb
[main] INFO org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 605Mb
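On Java 7 and later, the same loop can be written more compactly with try-with-resources, which closes the Scanner and the underlying stream automatically. The following is only a sketch: the class `ScannerLineCounter` and the method `countLines` are hypothetical, and counting lines stands in for whatever per-line processing is actually needed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

// Hypothetical example class, not part of the original code.
class ScannerLineCounter {

    // Streams through the file and counts its lines without keeping them in memory.
    static long countLines(Path path) throws IOException {
        long count = 0;
        try (InputStream in = Files.newInputStream(path);
             Scanner sc = new Scanner(in, "UTF-8")) {
            while (sc.hasNextLine()) {
                sc.nextLine(); // process the line here, then let it go
                count++;
            }
            // Scanner suppresses I/O errors, so check for one explicitly.
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Assumes the file path is passed as the first command-line argument.
        System.out.println(countLines(Paths.get(args[0])));
    }
}
```

Because each line is dropped as soon as it has been counted, the memory footprint should stay roughly constant regardless of file size.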

3. Streaming with Apache Commons IO

The same can be achieved with the Commons IO library, using the custom LineIterator it provides:

LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
  while (it.hasNext()) {
    String line = it.nextLine();
    // do something with line
  }
} finally {
  LineIterator.closeQuietly(it);
}

Because the file is never held in memory in its entirety, memory consumption is again quite conservative (roughly 150 MB consumed):

[main] INFO o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752Mb
[main] INFO o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564Mb
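Where adding Commons IO is not an option, the JDK itself (Java 8 and later) offers a comparable lazy iteration through `Files.lines`. Note that this is a swapped-in alternative rather than the Commons IO API. The class `LazyLineStats` and the method `longestLineLength` are hypothetical names chosen for this sketch:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical example class, not part of the original code.
class LazyLineStats {

    // Returns the length of the longest line (0 for an empty file).
    // Files.lines reads lazily, so the whole file is never held in memory.
    static int longestLineLength(Path path) throws IOException {
        // The stream holds an open file handle and must be closed.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.mapToInt(String::length).max().orElse(0);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumes the file path is passed as the first command-line argument.
        System.out.println(longestLineLength(Paths.get(args[0])));
    }
}
```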

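4. Conclusion

This short article showed how to process a large file without reading it repeatedly and without exhausting memory, which provides a useful approach to large-file processing.

The complete code follows: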

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;
import org.apache.commons.io.Charsets;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

/**
 * Reading large files (larger than available memory)
 * @author ZSL
 *
 */
public class BigFileRead {

    /**
     * Reads the file line by line with Scanner.
     * @param path path of the file to read
     * @throws IOException if closing the input stream fails
     */
    public static void readScanner(String path) throws IOException{
        FileInputStream inputStream=null;
        Scanner scan=null;
        try {
            inputStream=new FileInputStream(path);
            scan=new Scanner(inputStream, "UTF-8");
            while(scan.hasNextLine()){
                String line=scan.nextLine();
                System.out.println(line);
            }
            // Scanner suppresses I/O errors; rethrow only if one actually occurred
            if(scan.ioException()!=null)
                throw scan.ioException();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }finally {
            if (inputStream != null) {
                inputStream.close();
            }
            if (scan != null) {
                scan.close();
            }
        }
    }
    
    /**
     * Reads the file line by line with Apache Commons IO.
     * @param path path of the file to read
     */
    public void readApacheCommon(String path){
        LineIterator it=null;
        try {
            it = FileUtils.lineIterator(new File(path),Charsets.UTF_8.name());
            while(it.hasNext()){
                 String line = it.nextLine();
                 System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }finally{
            LineIterator.closeQuietly(it);
        }
    }

    /**
     * Reads the file line by line through a BufferedReader with a 10 MB buffer.
     * @param path path of the file to read
     */
    public void readBuffer(String path){
        File file=new File(path);
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader=new BufferedReader(new FileReader(file), 10*1024*1024)) {
            String line=null;
            while((line=reader.readLine())!=null){
                System.out.println(line);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
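
Trying these methods out requires a large input file. One possible way to produce one is sketched below; the class `BigFileGenerator`, the method `generate`, and the line format are assumptions made for illustration. The file is written line by line, so generating it does not need much memory either.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for producing test input; not part of the original code.
class BigFileGenerator {

    // Writes lineCount lines of the form "line <n>" to the target file,
    // replacing the file if it already exists.
    static void generate(Path target, long lineCount) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (long i = 0; i < lineCount; i++) {
                writer.write("line " + i);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed arguments: output path and number of lines to write.
        generate(Paths.get(args[0]), Long.parseLong(args[1]));
    }
}
```

The resulting path can then be passed to any of the read methods above.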