
jtd format file conversion analysis

巴扎黑
2017-06-26 09:58:24

Since 2016 I have been busy with a project in which my main responsibility is the file parsing module. I made all kinds of mistakes and ran into plenty of trouble along the way, but it is finally done. Now that all the pieces of the project have been put together, I want to summarize how each of these file types is parsed, for future reference. The project parses Office files, pdf, csv, rtf, txt and jtd files, emails in eml, msg and pst formats, and rar and zip archives. While unpacking archives we also came across files in the mlf format; after some research by me and by senior colleagues at the company we could not crack it, so that format has been set aside for now. Everything else is done, and I will summarize the formats one by one in later articles. For file parsing I use Apache Tika.

Today we start with parsing jtd files. Some readers may not know what a jtd file is, so let me explain first:

The jtd format is the file format produced by Ichitaro (一太郎), a Japanese word-processing program.

You can think of a jtd file as the equivalent of the Word files we normally use, except that it has to be opened and edited with Ichitaro. This is what Ichitaro looks like:

[Image: screenshot of the Ichitaro software]

I was quite taken aback when I first saw this requirement. How was I supposed to do this? It is Japanese software, so I could not understand the documentation even when I looked it up, and I found nothing on Baidu or Stack Overflow. Fortunately, a senior colleague who reads Japanese found a solution on a Japanese website: http://d.hatena.ne.jp/satorufujimori/20070227/1172549793

The solution is to use a VBS script to convert the jtd file to a txt file, and then parse the resulting txt file to get the content. The script on the website is as follows:

' taro2txt.vbs
Set taro = CreateObject("JXW.Application")
taro.Visible = True
taro.Documents.Open "c:\taro\a.jtd"
taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" ' ※1
taro.Quit

Note the 10: it is an identifier meaning "save as txt". To convert a jtd file to some other format you would replace 10 with a different identifier. Unfortunately we could not find any documentation listing which number stands for which format, so I tried every value from 0 to 100. Most of them produced unusable output, and the only useful one was 10. In practice, then, we can only convert jtd files to txt, and all images in the original file are lost. Since our business requirement is to read the file content and index it in Solr for search, losing the images is acceptable, and this is the approach we adopted.

The script above converts jtd files that have no password, but our jtd files are password-protected. Fortunately this was solved in the end. I no longer remember how we worked it out, but the solution is as follows:

' taro2txt.vbs
Set taro = CreateObject("JXW.Application")
taro.Visible = True
' password: a variable holding the document password, passed as the second argument
taro.Documents.Open "c:\taro\a.jtd", password
taro.ActiveDocument.SaveAs "c:\out\a.txt", "", "", "", 10, "ShiftJIS" ' ※1
taro.Quit

Once the script is written, run it to convert the given jtd file into a txt file, then process the txt file to extract the content (content extraction from txt files will be covered in a separate article).
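Since the script asks for "ShiftJIS" output, the txt file has to be decoded with a matching charset when it is read back in Java. The following is only a minimal sketch, not the project's actual code: the class and method names are made up for illustration, and it assumes that Ichitaro's "ShiftJIS" corresponds to Java's `Shift_JIS` charset (if characters come out wrong, the Windows variant `windows-31j` may be worth trying instead).

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for reading the txt file produced by the conversion script. */
final class ConvertedTextReader {

    // Assumption: the "ShiftJIS" argument in the script maps to Java's Shift_JIS charset.
    static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    /** Decodes raw bytes as Shift_JIS text. */
    static String decode(byte[] bytes) {
        return new String(bytes, SHIFT_JIS);
    }

    /** Reads the whole converted file into memory and decodes it. */
    static String read(Path txtFile) throws IOException {
        return decode(Files.readAllBytes(txtFile));
    }
}
```

Reading the whole file into memory keeps the sketch short; for very large files a streaming reader created with the same charset would be the safer choice.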

That solves the problem above, but another remains: I cannot write a separate script for every jtd file, and I do not know in advance what files the customer has. So I decided to pass parameters to the VBS script. I do not know VBS syntax, but I put the following together from what I found online:

Option Explicit

' Command-line arguments: source jtd path, destination txt path, password
Dim a0 : a0 = WScript.Arguments(0)
Dim a1 : a1 = WScript.Arguments(1)
Dim a2 : a2 = WScript.Arguments(2)
Dim taro

ExchangeFile a0, a1, a2

' Opens src in Ichitaro with the given password and saves it as text (format 10) to dest
Sub ExchangeFile(src, dest, password)
    Set taro = CreateObject("JXW.Application")
    taro.Visible = True
    taro.Documents.Open src, password
    taro.ActiveDocument.SaveAs dest, "", "", "", 10, ""
    taro.Quit
End Sub

Here a0 is the path of the jtd file, a1 is the path of the txt file to be generated, and a2 is the password of the jtd file. The script simply passes these arguments on to the function.

With the script in place, the next question is how to call a VBS script from Java. I found the answer on Stack Overflow; the call looks like this:

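import java.io.IOException;

public class RunVbs {
    public static void main(String[] args) {
        try {
            // Launch the script with Windows Script Host; exec() returns without waiting for it
            Runtime.getRuntime().exec("wscript D:/Send_Mail_updated.vbs");
        } catch (IOException e) {
            System.out.println(e);
            System.exit(1); // non-zero exit code signals the failure
        }
    }
}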
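The snippet above runs a fixed script with no arguments. To call the parameterized script with a source path, destination path and password, something along the following lines could work. This is only a sketch under assumptions, not the code used in the project: the class and method names are hypothetical, the script path is whatever location the VBS file was saved to, and it assumes `wscript` exits once the script has finished.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper that runs the conversion script with its three arguments. */
final class JtdConverter {

    /** Builds the command line: wscript, the script, then src, dest and password. */
    static List<String> buildCommand(String scriptPath, String src, String dest, String password) {
        // Each argument is its own list element, so no manual quoting is done here.
        return Arrays.asList("wscript", scriptPath, src, dest, password);
    }

    /**
     * Starts the script and waits up to timeoutSeconds for wscript to exit.
     * Returns true if it exited in time. This does not prove that the txt file
     * was written; the caller still has to check for the output file.
     */
    static boolean convert(String scriptPath, String src, String dest, String password,
                           long timeoutSeconds) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(scriptPath, src, dest, password)).start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            // Stops wscript only; a running Ichitaro instance may need separate cleanup.
            process.destroyForcibly();
            return false;
        }
        return true;
    }
}
```

The script as written reads all three arguments, so all three must be supplied. Waiting on the process with a timeout gives the Java side at least a rough signal of when the script has ended, which the bare `exec` call does not.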

With the steps above you can successfully convert jtd files into txt files, but several problems remain:

  1. Calling the VBS script from Java gives no return value indicating whether the txt file was actually generated. If the password is wrong, no txt file is produced. My approach is to check at intervals whether the txt file exists and, after a certain number of checks, to treat the conversion as failed. The number of checks depends on the file size: for a 10 MB file, for example, I check every 5 seconds, 10 times in total, and if the txt file has still not appeared the conversion is judged a failure. This wastes a lot of time when trying passwords. Also, if the file is large or the machine is slow, the txt file may be generated only after the checking period has ended, in which case the conversion is wrongly judged to have failed;

  2. Every run of the VBS script opens Ichitaro, and if a wrong password is tried, a Windows error dialog pops up on the server where the application is deployed. Ichitaro's process is killed in the end, but until then the customer can plainly see the Ichitaro window and the error prompt, which is rather embarrassing;

  3. If the jtd file is large, say 30 MB, the conversion is very slow. As noted in problem 2, the customer can see Ichitaro on the server while the conversion is running, and if the customer kills Ichitaro during that time the conversion is certain to fail;

These problems have not been solved yet, and more may surface depending on how the system is used once deployed at the customer's site. If the customer's jtd files are all under 10 MB there should not be much of a problem. If files exceed 30 MB, however, conversion will certainly be slow, and there is always the risk that Ichitaro gets killed mid-conversion. How this plays out depends on the customer's trial use.
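For reference, the polling check described in problem 1 might look roughly like the sketch below. The class and method names are hypothetical, and the interval and number of checks are parameters so they can be scaled with file size. Note that it only tests whether the file exists, which says nothing about whether Ichitaro has finished writing it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical poller that waits for the converted txt file to appear. */
final class OutputFilePoller {

    /**
     * Checks up to maxChecks times, sleeping intervalMillis between checks.
     * Returns true as soon as the file exists, false if it never appears
     * or the waiting thread is interrupted.
     */
    static boolean waitForFile(Path file, long intervalMillis, int maxChecks) {
        for (int i = 0; i < maxChecks; i++) {
            if (Files.exists(file)) {
                return true;
            }
            if (i < maxChecks - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.out.println("usage: OutputFilePoller <path-to-txt>");
            return;
        }
        // 10 checks, 5 seconds apart, as in the 10 MB example from the text.
        boolean found = waitForFile(Paths.get(args[0]), 5000, 10);
        System.out.println(found ? "txt file found" : "conversion judged as failed");
    }
}
```

A fixed schedule like this still has the weakness noted in problem 1: a slow machine or a large file can finish after the last check.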

That is all for now on parsing jtd files. I will write later about extracting the content after the jtd files have been converted to txt.
