WORD解析：使用poi解析doc和docx

1.pom依赖

<!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
	<dependency>
		<groupId>org.apache.poi</groupId>
		<artifactId>poi</artifactId>
		<version>4.1.1</version>
	</dependency>
       <!--处理2007 excel-->
       <dependency>
           <groupId>org.apache.poi</groupId>
           <artifactId>poi-ooxml</artifactId>
           <version>4.1.1</version>
       </dependency>
       <!-- 用于处理当前的word文档2007的老版本 -->
	<dependency>
           <groupId>org.apache.poi</groupId>
           <artifactId>poi-scratchpad</artifactId>
           <version>4.1.1</version>
       </dependency>

这里需要注意：老版本的word是doc结尾，新版本的word是docx结尾，如果出现这个异常：org.apache.poi.openxml4j.exceptions.OLE2NotOfficeXmlFileException: The supplied data appears to be in the OLE2 Format. You are calling the part of POI that deals with OOXML (Office Open XML) Documents. You need to call a different part of POI to process this data (eg HSSF instead of XSSF)j就表示当前解析的版本不对需要使用老版本的word解析器

2.创建word文档

1.创建test.doc和test.docx文档，并且写入以下内容：
在这里插入图片描述

3.创建通用的解析方法用于解析当前的word文档

由于当前的docx和doc的类不兼容没有统一的接口所以需要区别对待

public void handlerWordFile(File file) throws Exception  {
		String fileName = file.getName();
		int lastIndexOf =fileName .lastIndexOf(".");
		if(lastIndexOf==-1) {
			throw new IllegalArgumentException("当前传入的文件格式不合法！");
		}
		String suffex=fileName.substring(lastIndexOf+1,fileName.length());
		try (InputStream is=new FileInputStream(file)){
			switch (suffex) {
			case "doc":
				handlerByDocFile(is);
				break;
			case "docx":
				handlerByDocxFile(is);
				break;
			default:
				throw new IllegalArgumentException("不能解析的文档类型，请输入正确的word文档类型的文件！");
			}
		}catch (Exception e) {
			throw e;
		}
			
	}

4.创建解析docx的方法

通过XWPFDocument对象可以解析word文档(docx)

public void handlerByDocxFile(InputStream is) throws IOException, InvalidFormatException {		
		XWPFDocument xwpfDocument = new XWPFDocument(is);
		Iterator<IBodyElement> bodyElementsIterator = xwpfDocument.getBodyElementsIterator();
		List<Object> datas=new ArrayList<>();
		while (bodyElementsIterator.hasNext()) {
			IBodyElement bodyElement = bodyElementsIterator.next();
			String content = handlerByBodyType(bodyElement,bodyElement.getPartType());
			datas.add(content);
		}
		xwpfDocument.close();
		is.close();
		printAllDatas(datas);
	}
	
	public void printAllDatas(Collection<?> datas) {
		System.out.println(datas);
	}
	
	//开始处理当前的身体元素
	public String handlerByIBodyElement(IBodyElement bodyElement) {
		String content=null;
		//用于处理XWPFParagraph
		if(bodyElement instanceof XWPFParagraph) {
			System.out.println("当前获取的元素类型为：XWPFParagraph");
			content=handlerXWPFParagraphType(bodyElement);
		}
		return content;
	}

	//用于处理当前的XWPFParagraph类型的数据
	public String handlerXWPFParagraphType(IBodyElement bodyElement) {
		XWPFParagraph xwpfParagraph = (XWPFParagraph) bodyElement;
		BodyElementType elementType = xwpfParagraph.getElementType();
		String content = getStringByBodyElementType(xwpfParagraph,elementType);
		System.out.println("当前文本的内容为："+content);
		return content;
	}
	
	//通过当前的类型和元素进行相对应的处理
	public String getStringByBodyElementType(XWPFParagraph xwpfParagraph,BodyElementType bodyElementType) {
		System.out.println(bodyElementType);//当前测试结果为：PARAGRAPH
		String content="";
		switch (bodyElementType) {
		case CONTENTCONTROL:
			//如果使用的是文本控件
			break;
		case PARAGRAPH:
			//如果是段落的处理结果
			content=xwpfParagraph.getParagraphText();
			break;
		case TABLE:
			//如果当前的的元素部分为表格
			break;

		default:
			break;
		}
		return content;
	}

	//通过身体类型来处理
	public String handlerByBodyType(IBodyElement bodyElement ,BodyType partType) {
		System.out.println("当前的BodyType为："+partType);
		String content=null;
		switch (partType) {
		case CONTENTCONTROL:
			break;
		case DOCUMENT:
			content=handlerByIBodyElement(bodyElement);
			break;
		case HEADER:

			break;
		case FOOTER:

			break;
		case FOOTNOTE:

			break;
		case TABLECELL:

			break;
		default:
			throw new IllegalArgumentException("there is no this document type !please check this type!");
		}
		return content;
	}

5.测试解析后的test.docx

	public static void main(String[] args) throws Exception {
		WordTest test=new WordTest();
		URL resource = Thread.currentThread().getContextClassLoader().getResource("test.docx");
		File file=new File(resource.getFile());
		test.handlerWordFile(file);
	}

执行的结果为：
在这里插入图片描述

6.创建解析doc文档的方法

解析doc文件需要使用HWPFDocument这个类

	public void handlerByDocFile(InputStream is) {
		String content=null;
		HWPFDocument hwpfDocument=null;
		try {
			hwpfDocument=new HWPFDocument(is);
			content=getAllDocText(hwpfDocument);			
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}finally {
			closeDocFile(hwpfDocument);
		}
		System.out.println(content);
	}
	
	//用于获取当前的doc word文档中的所有的内容
	public String getAllDocText(HWPFDocument hwpfDocument) {
		return hwpfDocument.getDocumentText();
		//这里也可以通过range获取数据
	}
	public void closeDocFile(HWPFDocument hwpfDocument) {
		try {
			if(hwpfDocument!=null)
			hwpfDocument.close();
		} catch (IOException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}

可以直接通过HWPFDocument .getDocumentText()获取当前word文档中的所有的文本内容！

结果为：
在这里插入图片描述
发现解析后的内容出现了[?]这几个字符，存在缺点

7.总结

1.使用poi解析docx和doc的时候注意当前的版本和类是不兼容的所以需要单独处理，解析docx需要使用XWPFDocument，而解析doc需要使用HWPFDocument

2.使用老版本的很容易就可以解析到当前word的所有文本内容，而新版本的需要通过判断和当前的类型才能获取数据，新版本的不会出现[?]这个问题！

以上纯属个人见解，如有问题请联系本人！