The purpose of the project is to help a moderator determine, without manually reviewing PDF files, whether a given file is a magazine and assign it to a specific category. To work with the tool, the user first creates a list of search tags, which are grouped into categories. PDF files are then uploaded, and the parser counts how many times each tag occurs in each document. The result is a list of files, each with a preview (the first page) and additional information (number of pages, original title, etc.).
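The tag counting described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's PHP code. The function names, the sample categories, and the assumptions that tags are single words matched case-insensitively are all hypothetical.

```python
import re
from collections import Counter


def count_tags(text, categories):
    """Return {category: {tag: count}} for whole-word, case-insensitive matches.

    Assumption: every tag is a single word.
    """
    words = Counter(re.findall(r"\w+", text.lower()))
    return {
        category: {tag: words[tag.lower()] for tag in tags}
        for category, tags in categories.items()
    }


def best_category(counts):
    """Pick the category with the most tag hits, or None if nothing matched."""
    totals = {category: sum(tags.values()) for category, tags in counts.items()}
    best = max(totals, key=totals.get)
    return best if totals[best] > 0 else None


# Hypothetical sample data; in the real tool the text comes from the PDF.
sample_text = "Fashion week: fashion trends and a maple tree."
sample_categories = {"style": ["fashion", "trends"], "nature": ["tree", "oak"]}
sample_counts = count_tags(sample_text, sample_categories)
```

Counting all words once with a `Counter` keeps the cost per document independent of the number of tags, which matters when the tag list grows.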
The user can also create filters to produce specific results. For example, a file must contain the word "tree" but not the word "maple", or a magazine must contain at least 20 occurrences of the word "fashion" to be placed in a certain category. The user can also view the list of parsed files and, if a file was not recognized as a magazine for some reason, manually assign it to a specific category. The file list, with names and additional information, can then be exported.
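One possible way to model such filters is sketched below, again in Python rather than the project's PHP. The `FilterRule` class, its fields, and the category names are hypothetical; the sketch assumes word counts are already available as a dictionary and that the first matching rule wins.

```python
from dataclasses import dataclass, field


@dataclass
class FilterRule:
    """A hypothetical filter: minimum counts for required words, plus banned words."""
    category: str
    required: dict = field(default_factory=dict)   # word -> minimum occurrences
    forbidden: set = field(default_factory=set)    # words that must not appear

    def matches(self, word_counts):
        has_required = all(
            word_counts.get(word, 0) >= minimum
            for word, minimum in self.required.items()
        )
        has_forbidden = any(word_counts.get(word, 0) > 0 for word in self.forbidden)
        return has_required and not has_forbidden


def assign_category(word_counts, rules):
    """Return the category of the first matching rule, or None."""
    for rule in rules:
        if rule.matches(word_counts):
            return rule.category
    return None


# The two examples from the text, expressed as rules.
rules = [
    FilterRule("Botany", required={"tree": 1}, forbidden={"maple"}),
    FilterRule("Fashion magazines", required={"fashion": 20}),
]
```

A file for which `assign_category` returns `None` would be a candidate for the manual assignment described above.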
Progress: at the moment, the parser is ready. It supports drag-and-drop file upload, keeps a parsing history that displays the list of files with previews, counts the number of tags, and extracts metadata from each file.
Solution
To implement this project, we use the Laravel 5 PHP framework. Xpdf, a C++ library, lets us extract text, images, and metadata. Ghostscript is used to remove protection from protected files.
The parser works fairly quickly: 50 random files ranging from 1 MB to 80 MB (with and without protection) are processed in about 1 minute.
Development was carried out locally on Windows, but the project can be adapted for Linux and macOS.
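As a rough illustration of how these tools are typically driven from a script, the sketch below builds command lines for Xpdf's `pdftotext` and `pdfinfo` utilities and for Ghostscript's `pdfwrite` device. It is written in Python, not the project's PHP, and the exact options the project uses are not stated in this description, so the flags shown are an assumption based on common usage. The helper names are hypothetical, and on Windows the Ghostscript console binary is usually `gswin64c` rather than `gs`.

```python
import subprocess


def pdftotext_cmd(pdf_path):
    # "-" as the output file asks pdftotext to write the text to stdout.
    return ["pdftotext", pdf_path, "-"]


def pdfinfo_cmd(pdf_path):
    # pdfinfo prints metadata as "Key: value" lines (Title, Pages, ...).
    return ["pdfinfo", pdf_path]


def unlock_cmd(src_path, dst_path, gs_binary="gs"):
    # Assumption: re-writing the file through the pdfwrite device is how
    # protection is removed; the project's actual options are not documented.
    return [
        gs_binary, "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite", f"-sOutputFile={dst_path}", src_path,
    ]


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key: value" lines into a dictionary of strings."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def run(cmd):
    """Run a command and return its stdout; requires the tool to be installed."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
```

With the tools installed, `parse_pdfinfo(run(pdfinfo_cmd("magazine.pdf")))` would yield the page count and title shown in the file list, where `magazine.pdf` is a placeholder file name.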
Technologies
Laravel 5, Xpdf, Ghostscript, PHP
Team
A team of 3 specialists worked on this project:
- Project manager: communication with the customer, distribution and control of tasks;
- Web developer: development of the project;
- Tester: testing of the project.