PDFBox

Java PDFBox – Crawling Function Design

– Create a Database Table with File Path, File Content and File Hash Columns.
– Develop a Scheduled Job to Perform the Crawling Step (Read the PDF Files under a Specific Folder Path).
   – Iterate over the Files in the Specific Folder Path
      – If the File Item does not exist in the Database Table,
         – Execute PDFBox to retrieve the File Content
         – Insert a Record with the PDF File Hash Value into the Database Table
      – If the File Item exists in the Database Table and its Hash is different from the stored Hash,
         – Execute PDFBox to retrieve the File Content
         – Update the Record with the new PDF File Hash Value in the Database Table
      – If the File Item exists in the Database Table and its Hash is the same as the stored Hash,
         – Nothing to do
– Develop a Web Function and UI to search for Files by File Name or File Content Keyword.
   – Prepare a SQL Statement to search for the Keyword in the File Name / File Content Fields …
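
The crawling step above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not a finished implementation: the class PdfCrawlSketch, its Action enum and the FileRecord holder are hypothetical names, an in-memory Map stands in for the Database Table, SHA-256 is an assumed choice of File Hash, and the extractor function is a placeholder for the place where PDFBox text extraction (for example PDFTextStripper.getText) would be plugged in.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class PdfCrawlSketch {

    /** Outcome of processing one file; mirrors the three branches of the outline. */
    enum Action { INSERTED, UPDATED, UNCHANGED }

    /** Hypothetical row: File Content and File Hash, keyed by File Path in the map below. */
    static final class FileRecord {
        final String content;
        final String hash;
        FileRecord(String content, String hash) { this.content = content; this.hash = hash; }
    }

    // Assumption: an in-memory map stands in for the real Database Table.
    private final Map<String, FileRecord> table = new HashMap<>();

    // Assumption: the caller supplies the text extraction (e.g. a wrapper around PDFBox).
    private final Function<byte[], String> extractor;

    PdfCrawlSketch(Function<byte[], String> extractor) {
        this.extractor = extractor;
    }

    /** Hex-encoded SHA-256 digest of the raw file bytes (assumed hash algorithm). */
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Decide between insert, update and skip for one file. */
    Action process(String path, byte[] bytes) {
        String hash = sha256Hex(bytes);
        FileRecord existing = table.get(path);
        if (existing != null && existing.hash.equals(hash)) {
            return Action.UNCHANGED;              // same hash: nothing to do
        }
        // New file or changed hash: extract the content and store it with the hash.
        table.put(path, new FileRecord(extractor.apply(bytes), hash));
        return existing == null ? Action.INSERTED : Action.UPDATED;
    }

    /** Iterate over the *.pdf files directly under the folder and process each one. */
    void crawl(Path folder) throws IOException {
        try (DirectoryStream<Path> pdfs = Files.newDirectoryStream(folder, "*.pdf")) {
            for (Path pdf : pdfs) {
                process(pdf.toString(), Files.readAllBytes(pdf));
            }
        }
    }

    /** Stored content for a path, or null if the path has no record. */
    String contentOf(String path) {
        FileRecord r = table.get(path);
        return r == null ? null : r.content;
    }
}
```

A Scheduled Job would call crawl(folder) periodically. In a real version the Map would be replaced by INSERT and UPDATE statements against the Database Table, and the extractor by a call into PDFBox.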