The official implementation of the DCL, DML and DQL of the QQL. It's a corpus server.
Project maintained by quicktext
When I completed my doctoral dissertation, I realized that I still had not found an effective way to manage patent corpora.
My doctoral dissertation is about US patents.
I downloaded the US patents from Google Patents and then cleaned the data.
I ran into many problems!
Those problems still exist today!
The problems are as follows:
- The US patent files are in XML format, but there are many XML schema versions!
- I wanted to clean the XML files. The official cleaning program is based on dom4j. I also tried other programs, such as Gabe Fierro’s solution, which is written in Python and stores the data in a MySQL database. Cleaning still cost me a lot of time!
- I first intended to store the XML files in a relational database, but that was slow. Then I used an XML database, the Sedna XML Database, but it crashed many times! Finally I tried storing the files on the file system and processing them with a full-text engine such as Apache Lucene. That is not a good choice either, because patent files are semi-structured! If the XML files are indexed by Lucene, I can’t analyze the data directly!
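To make the schema-version problem concrete, here is a minimal Python sketch of dispatching on the root element of a patent file before extracting a field. It is only an illustration and not part of this project: the two sample documents are heavily simplified, and the names `SAMPLE_OLD`, `SAMPLE_NEW`, `TITLE_PATHS`, `schema_version` and `extract_title`, as well as the element paths, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified samples. Real patent files are much
# larger and nest these fields far more deeply.
SAMPLE_OLD = '<PATDOC DTD="2.5"><TITLE>Widget</TITLE></PATDOC>'
SAMPLE_NEW = (
    '<us-patent-grant dtd-version="v4.5 2014-04-03">'
    '<invention-title>Widget</invention-title>'
    '</us-patent-grant>'
)

# Illustrative mapping: one title path per root tag (assumed, not official).
TITLE_PATHS = {
    "PATDOC": ".//TITLE",
    "us-patent-grant": ".//invention-title",
}


def schema_version(root):
    """Return (root tag, version string) read from the root element."""
    version = root.get("dtd-version") or root.get("DTD") or "unknown"
    return root.tag, version


def extract_title(xml_text):
    """Parse one document and pick the title path that matches its schema."""
    root = ET.fromstring(xml_text)
    tag, version = schema_version(root)
    path = TITLE_PATHS.get(tag)
    if path is None:
        raise ValueError(f"unsupported schema: {tag} {version}")
    node = root.find(path)
    title = node.text if node is not None else None
    return {"schema": tag, "version": version, "title": title}
```

Calling `extract_title(SAMPLE_OLD)` and `extract_title(SAMPLE_NEW)` returns the same title through two different paths. Every additional schema version needs another entry of this kind, which is why the cleaning step costs so much time.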
In recent years, I have designed a new domain-specific language (DSL) to manage paper corpora.
I will design a new corpus server oriented to US patents.
I have read the source code of some relational databases, such as Apache Derby and SQLite.
I will soon design a new corpus server based on the principles of DSL, RDB and index technology.
There are many differences between database query languages and my solution.
Please see the features:
For more information, please visit :