Since 2016 I have been working on a project in which I am mainly responsible for the file parsing module. I made all kinds of mistakes along the way, but it is finally done, and all the parts of the project are now integrated. Here I summarize how these files are parsed, for future reference. The project parses Office files, pdf, csv, rtf, txt and jtd files, emails in eml, msg and pst formats, and rar and zip archives. While unpacking archives we also came across files in the mlf format. Neither I nor my senior colleagues could find a way to parse them, so that format has been dropped for now. Everything else is done, and I will cover each format one by one in later articles. For file parsing I use Apache Tika.
Today we start with jtd files. Some readers may not know what a jtd file is, so let me explain first:
A jtd file is the file format produced by Ichitaro (一太郎), a Japanese word processor.
You can think of a jtd file as the counterpart of the Word documents we normally use, except that it has to be opened and edited with Ichitaro.
I was taken aback when I first saw this requirement. How was I supposed to do this? Ichitaro is Japanese software, I could not read the documentation even when I looked it up, and I found nothing on Baidu or Stack Overflow. Fortunately a senior colleague who reads Japanese found a solution on a Japanese website: http://d.hatena.ne.jp/satorufujimori/20070227/1172549793
The solution is to use a VBS script to convert the jtd file into a txt file and then parse that txt file to obtain the content. The script from the website is as follows:
    ' taro2txt.vbs
    Set taro = CreateObject("JXW.Application")
    taro.Visible = True
    taro.Documents.Open "c:\taro\a.jtd"
    taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" ' ※1
    taro.Quit
Note the 10: it is a format identifier, and 10 means saving the jtd file as a txt file. To convert to another format you would replace 10 with a different identifier. Unfortunately we could not find any documentation listing which number stands for which format, so I tried every value from 0 to 100. Most of them produced unusable output. The only useful one was 10, which means that in practice we can only convert jtd files to txt. All images in the original file are lost in the process. However, our use case is to read the file content and index it in Solr for search, so losing the images is acceptable, and we adopted this approach.
The script above converts jtd files that have no password, but our jtd files are password protected. Fortunately this was solved in the end. I no longer remember how we worked it out, but the solution is as follows:
    ' taro2txt.vbs
    Set taro = CreateObject("JXW.Application")
    taro.Visible = True
    taro.Documents.Open "c:\taro\a.jtd", password ' supply the password here
    taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" ' ※1
    taro.Quit
Once the script is written, run it to convert the given jtd file into a txt file, then process the txt file to extract the content (content extraction for txt files will be covered in a later article).
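As a minimal sketch of the reading step only: the script above saves the txt file with the "ShiftJIS" argument, so the Java side would need to decode it with a matching charset rather than the platform default. The class name ConvertedTextReader is hypothetical, and the use of Java's "Shift_JIS" charset is an assumption about what Ichitaro actually writes.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the original project.
class ConvertedTextReader {

    // Assumption: the txt file was saved with the "ShiftJIS" argument shown in the script.
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Reads the whole converted file into a String, decoding it as Shift_JIS.
    static String read(Path txt) throws IOException {
        return new String(Files.readAllBytes(txt), SHIFT_JIS);
    }
}
```

If the decoded text comes out garbled, the encoding assumption is the first thing to check.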
That solves the conversion itself, but one problem remains: I cannot write a separate script for every jtd file, and I do not know in advance which files the customer has. So I decided to pass parameters to the VBS script. I do not know VBS syntax, but I put the script together from examples on the Internet. The script is as follows:
    Option Explicit
    Dim a0 : a0 = WScript.Arguments(0)
    Dim a1 : a1 = WScript.Arguments(1)
    Dim a2 : a2 = WScript.Arguments(2)
    Dim taro
    ExchangeFile a0, a1, a2

    Sub ExchangeFile(src, dest, password)
        Set taro = CreateObject("JXW.Application")
        taro.Visible = True
        taro.Documents.Open src, password
        taro.ActiveDocument.SaveAs dest, "", "", "", 10, ""
        taro.Quit
    End Sub
Here a0 is the path of the jtd file, a1 is the path of the txt file to generate, and a2 is the password of the jtd file. The script simply passes these arguments to the ExchangeFile subroutine.
With the script in place, the next question is how to call a VBS script from Java. I found the answer on Stack Overflow. The call looks like this:
    public static void main(String[] args) {
        try {
            Runtime.getRuntime().exec("wscript D:/Send_Mail_updated.vbs");
        } catch (IOException e) {
            System.out.println(e);
            System.exit(0);
        }
    }
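The Stack Overflow snippet launches a fixed script without arguments. As a rough sketch of how the parameterized script might be called instead, the following uses ProcessBuilder to pass the three arguments and waits a bounded time for the script host to exit. The class name JtdScriptRunner, the method names and the timeout parameter are hypothetical and not from the original project. Using cscript with //B and //Nologo instead of wscript is also an assumption, chosen to keep the script host itself from showing prompts. It does not hide Ichitaro's own windows.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the original project.
class JtdScriptRunner {

    // Builds the command line: script host, its options, the script path,
    // then the three script arguments (a0 = src, a1 = dest, a2 = password).
    static List<String> buildCommand(String script, String src, String dest, String password) {
        return List.of("cscript", "//B", "//Nologo", script, src, dest, password);
    }

    // Runs the script and reports whether the txt file exists afterwards.
    static boolean convert(String script, String src, String dest, String password,
                           long timeoutSeconds) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(script, src, dest, password))
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .redirectError(ProcessBuilder.Redirect.DISCARD)
                .start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // This stops the script host only; Ichitaro itself may keep running.
            process.destroyForcibly();
            return false;
        }
        // The script's exit code is not assumed to signal failure,
        // so the presence of the output file is checked instead.
        return Files.exists(Paths.get(dest));
    }
}
```

Passing each argument as its own list element lets ProcessBuilder handle quoting, which matters when paths or passwords contain spaces. A call might look like JtdScriptRunner.convert("D:/taro2txt.vbs", "c:\\taro\\a.jtd", "c:\\out\\a.txt", password, 60), where the script path and the 60-second timeout are placeholders.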
With the steps above, jtd files can be converted into txt files successfully, but several problems remain:
1. Calling the VBS script from Java returns no value that tells whether the txt file was actually generated. If the password is wrong, no txt file is produced. My approach is to check periodically whether the txt file exists and to declare the conversion failed after a certain number of checks, with the number of checks depending on the file size. For example, a 10M file is checked every 5 seconds, 10 times in total, and if the txt file still does not exist the conversion is treated as failed. This wastes time when trying passwords. Also, if the file is large or the machine is slow, the txt file may still be generated after the checks have run out, by which time the conversion has already been judged a failure;
2. Every run of the VBS script opens the Ichitaro application, and if a password is wrong while trying passwords, a Windows error dialog pops up on the server where the application is deployed. Ichitaro's process is killed in the end, but until then the customer can clearly see the Ichitaro window and the error message, which is awkward;
3. If the jtd file is large, for example 30M, the conversion is very slow. As noted in problem 2, the customer can see the Ichitaro window on the server during the conversion, and if they close Ichitaro in the meantime the conversion will certainly fail;
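The periodic check described in the first problem can be sketched roughly as follows. The class name OutputFilePoller and both method names are hypothetical, and the sizing rule (10 checks per started 10M of input) is only an assumption extrapolated from the single 10M example in the text, not the project's actual rule.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper, not part of the original project.
class OutputFilePoller {

    private static final long TEN_MB = 10L * 1024 * 1024;

    // Assumed rule: 10 checks for every started 10M of input, with a minimum of 10.
    static int attemptsFor(long sizeBytes) {
        long blocks = Math.max(1, (sizeBytes + TEN_MB - 1) / TEN_MB);
        return (int) Math.min(Integer.MAX_VALUE, blocks * 10);
    }

    // Checks for the file up to `attempts` times, sleeping `interval` after each miss.
    // Returns true as soon as the file exists, false once the checks run out.
    static boolean waitForFile(Path file, Duration interval, int attempts)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (Files.exists(file)) {
                return true;
            }
            Thread.sleep(interval.toMillis());
        }
        return Files.exists(file);
    }
}
```

With the example from the text, a 10M file would be polled as waitForFile(txtPath, Duration.ofSeconds(5), attemptsFor(size)), where txtPath and size are placeholders. A file that exists may still be only partially written, so existence alone is a weak success signal.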
These problems have not been solved yet, and more may surface depending on how things go once the system is deployed at the customer's site. If the customer's jtd files are all under 10M, there should not be much of a problem. If files exceed 30M, however, the conversion will certainly be slow, and there is always the risk that Ichitaro gets killed mid-conversion. We will have to see how the customer's trial goes.
That is all for now on parsing jtd files. I will write later about extracting content from the txt files produced by the conversion.