At contextere, our vision is to enable people to achieve their productive potential. To accomplish this, we’re building an intelligent personal agent for blue-collar workers that delivers contextually relevant information, guidance, and data to the end-user on the ‘Last Tactical Mile’ – where warm hands touch cold steel.
Much of that contextually relevant information is stored in commonly used documents such as manuals, service bulletins, and maintenance records, most of which are in PDF format. To determine and extract what may be relevant, contextere uses Natural Language Processing (NLP) techniques to analyze the information in those and other documents. While this may seem straightforward, a key challenge is that NLP algorithms operate on plain text, not on formatted documents.
Natural Language Processing (NLP)
Put simply, NLP refers to techniques that enable computers to process and understand human language. Analyzing documents with NLP algorithms lets computers perform tasks such as part-of-speech tagging, semantic analysis, machine translation, and question answering.
Why Is It Challenging to Read PDF Documents?
As mentioned above, NLP input data must be plain text (for example, a TXT file); all other formats, including PDF files, need to be converted first. Therefore, to use NLP to analyze the information in a PDF file, we must first extract and collect the text from the original file. This extraction process presents a significant challenge because we need the resulting data to be complete and semantically correct.
Current PDF converters (e.g. PyPDF, PDF2TEXT, and PDF2HTML) can convert PDF files to other document formats, such as TXT, Excel, and HTML, but cannot perform semantic analysis of the objects (e.g. words and characters) in the file. Other PDF extractors, such as Tabula and the tools embedded in Adobe Acrobat, can extract tables and figures, respectively. However, neither of these tools checks for text that is related to, and semantically important for, the identified table or figure, such as captions, legends, and notes.
For example, a TXT file produced by the existing tools will typically contain all words and characters from paragraphs, tables, and figures, without recording where each piece of text originated (i.e. a paragraph, table, or figure).
For some NLP tasks, such as training a word embedding model, this kind of conversion is too noisy to train on. This is because the text from tables and figures generally does not consist of semantically continuous, grammatically correct sentences of the kind found in standard paragraphs.
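To make the noise problem concrete, here is a minimal Python sketch of what a flat conversion might look like and why table and figure text stands out from paragraph text. The sample lines, the function name looks_like_sentence, and its thresholds are all hypothetical and chosen for illustration only. This crude heuristic is not how the contextere extractor works.

```python
def looks_like_sentence(line: str) -> bool:
    """Crude, illustrative check: does this line read like running prose?

    Assumption: prose lines have several words, are mostly alphabetic,
    and end with sentence punctuation. Real documents need far more care.
    """
    tokens = line.split()
    if len(tokens) < 4:
        return False
    # Count tokens that are purely alphabetic once edge punctuation is removed.
    alphabetic = sum(token.strip(".,;:()").isalpha() for token in tokens)
    ends_like_sentence = line.rstrip().endswith((".", "!", "?"))
    return ends_like_sentence and alphabetic / len(tokens) >= 0.7


# Hypothetical output of a flat PDF-to-text conversion: paragraph, table,
# and figure text arrive interleaved, with no record of their origin.
flat_lines = [
    "Remove the access panel before inspecting the hydraulic pump.",
    "Part No.  Qty  Torque (Nm)",
    "A-1042  4  12.5",
    "Figure 3. Pump assembly",
]

flags = [looks_like_sentence(line) for line in flat_lines]
```

Only the first line is flagged as prose here. The table header, the table row, and the figure caption are exactly the kind of fragments that add noise when everything is exported as one undifferentiated text file.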
The contextere Approach: Solving the Complete PDF Extraction Process
To address these challenges, contextere has developed and refined a PDF extraction engine that enables us to create the complete datasets required for our NLP algorithms. The contextere PDF extractor identifies and stores tables, figures, and text and conducts a semantic analysis of every object in the document. Based on the analysis results, it groups objects to form structured Table Objects, Figure Objects, and Paragraph Objects. A variety of attributes are automatically determined for each Object, and the user can select which of those attributes will be exported during the extraction process.
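As a rough illustration of what structured output with selectable attributes could look like, the Python sketch below models extracted objects and an export step. The class name ExtractedObject, its fields (kind, page, text, caption), and the function export_attributes are assumptions made for this example. The post does not describe the extractor's actual data model.

```python
from dataclasses import dataclass


@dataclass
class ExtractedObject:
    """Hypothetical record for one object found in a PDF."""
    kind: str          # "paragraph", "table", or "figure"
    page: int          # page on which the object appears
    text: str          # text content belonging to the object
    caption: str = ""  # related caption, legend, or note, if any


def export_attributes(objects, attributes):
    """Export only the attributes the user selected, one dict per object."""
    return [{name: getattr(obj, name) for name in attributes} for obj in objects]


# Illustrative objects, not taken from a real document.
objects = [
    ExtractedObject("paragraph", 1, "Remove the access panel."),
    ExtractedObject("table", 2, "A-1042 4 12.5", caption="Table 1. Torque values"),
    ExtractedObject("figure", 2, "", caption="Figure 3. Pump assembly"),
]

exported = export_attributes(objects, ["kind", "caption"])
```

Keeping the caption on the same record as its table or figure reflects the point made earlier: related text such as captions, legends, and notes stays attached to the object it describes.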
From the structured output, the user can configure well-structured and clean datasets for specific purposes. For example, contextere’s word embedding training uses datasets that contain only text from paragraphs, eliminating text in tables and figures to reduce noise. Word embedding vectors are then used by our information curation algorithms to determine critical information relevant to user context and questions.
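The paragraph-only dataset described above can be sketched in a few lines of Python. This assumes the structured output is available as a list of dictionaries with "kind" and "text" keys, which is a hypothetical format chosen for the example. The sentence splitting and tokenization are deliberately simple stand-ins for a proper NLP pipeline.

```python
import re


def paragraph_corpus(objects):
    """Build a tokenized corpus from paragraph objects only.

    Table and figure text is skipped so that it does not add noise
    to word embedding training.
    """
    corpus = []
    for obj in objects:
        if obj["kind"] != "paragraph":
            continue
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", obj["text"].strip()):
            tokens = re.findall(r"[a-z0-9]+", sentence.lower())
            if tokens:
                corpus.append(tokens)
    return corpus


# Illustrative structured output, not taken from a real document.
structured_output = [
    {"kind": "paragraph", "text": "Remove the panel. Inspect the seal."},
    {"kind": "table", "text": "A-1042 4 12.5"},
    {"kind": "figure", "text": "Pump assembly"},
]

corpus = paragraph_corpus(structured_output)
```

The resulting list of token lists is the general shape that many word embedding trainers accept as input, although the exact format depends on the library used.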
Being able to properly convert PDF documents to pure text enables contextere to curate, assemble, and deliver the contextually relevant information blue-collar workers require to answer the prevalent “Now What?” question, empowering them to do their jobs more safely and effectively, and reducing equipment downtime.