Using Document Conversion Server to convert PDF, PPT, XLSX, etc. to a dataset

Below are the details of the file ingestion process that we tried and tested:

 

  1. There are many different file types and formats. To name a few, we have pdf, doc, docx, docm, xls, xlsx, ppt, pptx, pptm, html, msg, etc.
  2. We are looking for an automated way to read and extract the content of these document files and store it in a database table.
  3. We wrote a Python script that goes through about 200k files and reads and extracts the content of each file (where possible).
  4. We found files whose content should have been extractable (i.e. files that are not image, audio or video files such as jpeg, gif, png, wav, mp4, etc.) but which our Python script did not extract.
  5. Identifying the files that could have been extracted but were not, and then updating the Python script accordingly, is currently an arduous process.
  6. We have tested file ingestion using SAS CA. We created a project in SAS CA that reads from a directory containing 1 year’s worth of files.
  7. We realised that SAS CA was able to ingest more files than the Python script for the 1 year’s worth of files that we tested.
  8. We found a limitation with this method: the SAS dataset that is generated truncates the content of each file. Only 5000 characters were extracted and saved in the SAS dataset.
  9. We subsequently tried to make use of the “score code” generated by SAS CA. Please refer to this link for more details: http://support.sas.com/kb/60/158.html
  10. However, we realised that the output SAS dataset from the “score code” does not contain the content of the documents. It seems that the “score code” bypasses the extraction step and simply reads each file’s content and categorises it.
  11. Only the file path was returned in the SAS dataset.
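
To make the triage in steps 4 and 5 more concrete, here is a minimal Python sketch of one way to list the files that should have been extractable but have no extracted content yet. It is not the script described above. The extension sets are taken from the examples in this post (with ".jpg" added as an assumption), and the function names `classify`, `missed_files` and `scan` are hypothetical.

```python
from pathlib import Path

# Extension sets based on the examples in this post; both are assumptions
# and would need adjusting to the real file population.
TEXT_BEARING = {".pdf", ".doc", ".docx", ".docm", ".xls", ".xlsx",
                ".ppt", ".pptx", ".pptm", ".html", ".msg"}
NON_TEXT = {".jpeg", ".jpg", ".gif", ".png", ".wav", ".mp4"}


def classify(path):
    """Return 'text', 'non_text' or 'unknown' based on the file extension."""
    ext = Path(path).suffix.lower()
    if ext in TEXT_BEARING:
        return "text"
    if ext in NON_TEXT:
        return "non_text"
    return "unknown"


def missed_files(all_paths, extracted_paths):
    """List text-bearing files for which no content has been extracted."""
    done = {str(p) for p in extracted_paths}
    return sorted(str(p) for p in all_paths
                  if classify(p) == "text" and str(p) not in done)


def scan(root):
    """Recursively collect every file under a (hypothetical) root directory."""
    return [p for p in Path(root).rglob("*") if p.is_file()]
```

Calling `missed_files(scan(root), extracted)` with the list of paths the script already handled would give the backlog to investigate. Files classified as 'unknown' would still need a manual look, since the extension alone does not say whether they contain text.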

 

Hence, we would like to know:

  1. Is there an option or parameter within the “score code” that we can update to allow extraction of the file content?
  2. Is there a way to use the SAS Document Conversion Server to create an automated file ingestion process?

 

Thanks very much for your generous help!!

 

Best Regards,

Justin