VTA bulk PDF/Docx analysis

carlosGoetz · Posted 08-02-2019 07:11 AM

Good morning everybody.

I need to analyze a huge number of legal documents in order to find out which ones have certain clauses and which ones don't. I'd like to know how to proceed. I'm using Visual Text Analytics on SAS Viya 3.4, but it seems to me that it's impossible to do something like that.

Can you help me with this issue, please?

Thank you very much!

Jason7 · Posted 08-08-2019 01:19 AM

Hello Carlos -

You can import many PDF files into Viya for use in VTA with the data import function:

https://go.documentation.sas.com/?docsetId=datahub&docsetTarget=p1sv89vo4n8f03n0zvq0k90i8g3t.htm&doc...

In VTA, you can define rules that categorize documents according to whether they include the clauses you require:

https://go.documentation.sas.com/?activeCdc=ctxtcdc&cdcId=capcdc&cdcVersion=8.4&docsetId=ctxtug&docs...

VTA tests your model on the imported PDF files, but you can also apply the model to new data (the scoring process), as described here:

https://go.documentation.sas.com/?activeCdc=ctxtcdc&cdcId=capcdc&cdcVersion=8.4&docsetId=ctxtug&docs...

Hope it helps!

carlosGoetz · Posted 08-09-2019 03:23 AM

Thank you very much.

I have another question: if I only need to check whether a set of documents contain the word "Wexner" or not, can I create a pipeline with just two nodes, Data and Categories?

I've created a category named Bueno with the rule (NOT,("Wexner")), but when I run the node I get the following error message:

An error occurred while running the pipeline. See the node logs for more details.

... and the log says:

Exception occurred while querying categories table: category document table with the specified taxonomyId not found: 4a7f7f286c4c6558016c751433ff0004

Can you tell me what I'm doing wrong?
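
As a point of comparison, the has/doesn't-have check described above can be sketched outside VTA in plain Python. This is only a minimal sketch for sanity-checking expected results, not VTA rule syntax or any SAS API. The function name `split_by_word`, the sample file names, and the sample texts are hypothetical, and matching whole words case-insensitively is an assumption about the desired behavior.

```python
import re


def split_by_word(docs, word):
    """Split document names into (with_word, without_word) lists.

    docs maps a document name to its already-extracted plain text.
    Matching is whole-word and case-insensitive (an assumption).
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    with_word, without_word = [], []
    for name, text in docs.items():
        if pattern.search(text):
            with_word.append(name)
        else:
            without_word.append(name)
    return with_word, without_word


# Hypothetical sample data; real text would come from the PDF/Docx files.
docs = {
    "contract_a.pdf": "This agreement is signed by Mr. Wexner.",
    "contract_b.pdf": "This agreement contains no such party.",
    "contract_c.pdf": "WEXNER is named as guarantor.",
}

with_word, without_word = split_by_word(docs, "Wexner")
print("Contain the word:", with_word)
print("Do not contain the word:", without_word)
```

The documents in the second list are the ones a "does not contain Wexner" category would be expected to pick up, which gives a baseline to compare against the category node's output once it runs.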

Also, I've attached a screenshot of the matches on a document. Three of the four documents contain the word Anova, as listed in the lower part of the screen, so why do I see 0 matches on the right? What does that mean?

Thank you very much!

Best regards,

Carlos

carlosGoetz · Posted 08-14-2019 04:05 AM

Can anybody help me with this issue, please?
Thank you very much.
