About RussAlbright

RussAlbright · ‎11-13-2023

Hi @azaras and @sbxkoenk , I don't think you need epcode in this environment. You can just run the AstoreScoreCode.sas after setting some of the macro variables. Directions are in the file itself. Can you give that a try, @azaras ? Thanks Russ

RussAlbright · ‎06-07-2022

Hi dylanleong78, I think you pretty much have the right understanding of the process for the action. The spell correction is statistical in nature: it selects rare terms as candidate misspellings, but only when there is a similar, frequently occurring term that can serve as the correct spelling. It is designed to work on large data, not so much a small set like you have. The dictionary input prevents rare terms that are in the dictionary from being mapped as potential misspellings, but that does not help you here. I don't think the detailed complexTag will necessarily be of much help either, although maybe you have found a pattern that helps in your cases. The "inc" there essentially represents "unknown". Without frequently occurring terms to map to, the algorithm won't suggest a correction.

There are a couple of things that come to mind you might try.

1. If you find some well-edited documents (100s or 1000s of them) to accompany your 6, you may find that some of your incorrect words do get corrected by including these. Some of your tuning parameters will then become relevant as well. (If you want to go really extreme, you could create sentences, and ultimately documents, by randomly and repeatedly selecting dictionary words to create random content. You would want to automate the construction with a DATA step, because you want to create a large number of these and you want your dictionary words repeating in several documents. Then combine your new random documents with the 6 you are trying to spell check.)

And a second approach that requires more work and doesn't use the action:

2. SAS has DATA step functions such as spedis (spelling edit distance) that you could potentially use to compare every term of the offset table from tpParse with every term in your dictionary, looking for similar spellings. This means you loop through every term of your collection, comparing each one to every term of your dictionary. When a term doesn't exactly match a dictionary term but is close to some other term, you could flag it. You might be surprised at how much Type I and Type II errors can come into play here, though.
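As a rough illustration of that second approach, here is a small Python sketch of the all-pairs comparison. It is not SAS code and does not use spedis: plain Levenshtein edit distance stands in for it, and the term list, the dictionary, and the threshold of 1 are made-up values for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (a stand-in for spedis)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def flag_candidates(terms, dictionary, max_distance=1):
    """Map each non-dictionary term to the dictionary words within max_distance."""
    known = set(dictionary)
    flagged = {}
    for term in terms:
        if term in known:
            continue  # exact dictionary match: nothing to flag
        close = [w for w in dictionary if edit_distance(term, w) <= max_distance]
        if close:
            flagged[term] = close
    return flagged


# Hypothetical parsed terms and dictionary
terms = ["analysis", "anlysis", "topic", "topik", "svd"]
dictionary = ["analysis", "topic", "tonic", "cluster"]
result = flag_candidates(terms, dictionary)
print(result)
```

A term like "svd" is left alone here only because nothing in the dictionary is close to it. A valid word that happens to sit one edit away from a dictionary entry would be flagged incorrectly, which is the kind of error mentioned above.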

RussAlbright · ‎01-03-2022

hmtan916, I am not sure why you don't see any documents there. I suppose it could be that none of the documents actually meet the underlying threshold for a document to belong to a topic. The amount of data you have looks kind of small. Try to reduce your number of topics to just 2 or 3 and turn off part-of-speech tagging to increase the potential of intersection based on terms. See if that run gives you anything. If not, you may want to contact tech support. Thanks.

RussAlbright · ‎04-27-2021

The tpSpell action outputs the mapping of suggested misspellings to corrected spellings. We do not have an action to rewrite the input documents with the corrected term in place. Is this what you would like, and then to run the concept node on the corrected version? This is something we can consider adding. For now you would need to do a tpParse/tpSpell action run and then modify the input documents yourself. Since the byte offset location of each word you want to change is specified in the output of tpSpell, you could make the change programmatically using this information. Once you make the change, you can rerun your VTA diagram on the revised documents without spell checking. I hope that helps. Russ
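To make the offset-based rewrite concrete, here is a minimal Python sketch. The (byte offset, byte length, replacement) tuple layout is a hypothetical format chosen for illustration, not the actual tpSpell output schema, and the offsets are assumed to be 0-based; adjust both to whatever the action really returns.

```python
def apply_corrections(text: str, corrections, encoding: str = "utf-8") -> str:
    """Apply (byte_offset, byte_length, replacement) edits to one document."""
    data = text.encode(encoding)
    # Work from the last offset back to the first so that earlier
    # offsets remain valid after each replacement changes the length.
    for offset, length, replacement in sorted(corrections, reverse=True):
        data = data[:offset] + replacement.encode(encoding) + data[offset + length:]
    return data.decode(encoding)


# Hypothetical document and corrections
doc = "The custmer called about a biling error."
corrections = [(4, 7, "customer"), (27, 6, "billing")]
fixed = apply_corrections(doc, corrections)
print(fixed)
```

Working on the encoded bytes rather than the string matters when the text contains multi-byte characters, because byte offsets and character positions then differ.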

RussAlbright · ‎12-12-2019

I am talking about built in macros that are called to do parts of the computation that the procedure does not do. If you right click on a node in your flow and choose "Export path to sas code" you can save the code that is run when your flow runs. If you look at that code you will see the names of these macros. Also, if you add options mprint; when you run the path, you will see a printout of many of the macros executing. The actual source of these macros is not visible otherwise. Russ

RussAlbright · ‎12-10-2019

You have to truncate the U matrix using the technique I described and then re-form the docpro data set. PROC HPTMINE does not do this. The process has quite a few steps and may be a challenge to re-implement. Have you considered just saving out the SAS code from your flow? Depending on what you're trying to accomplish, this code will allow you to submit the whole flow programmatically. You would apply the term cutoffs to U, not V. U is number of terms by number of topics. Then you re-form docpro and apply the document cutoffs to docpro.

RussAlbright · ‎12-05-2019

See the answer here https://communities.sas.com/t5/SAS-Text-and-Content-Analytics/Proc-Hptmine-What-is-the-formula-behind-assigning-a-topic-to-a/m-p/609577/highlight/false#M898 to this follow up question. Thanks

RussAlbright · ‎12-05-2019

Here is a quick summary: The U factor is number-of-terms by number-of-topics. For each column (topic), calculate the mean and standard deviation of the absolute values of the entries. I believe the default cutoff is 1 standard deviation above the mean. Set every entry whose absolute value is below that cutoff to zero. Now re-form the document projections from your updated U. Then repeat the procedure on that result; this time you will be applying it to documents. Russ
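The summary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Text Miner code: it uses the sample standard deviation, and it re-forms the document projections as a plain product of a document-by-term frequency matrix with U, with no weighting or normalization. The matrices are made-up toy values.

```python
from statistics import mean, stdev


def apply_cutoff(matrix, n_std=1.0):
    """Zero each entry whose absolute value is below its column's mean + n_std * std."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    result = [row[:] for row in matrix]
    for c in range(n_cols):
        abs_col = [abs(matrix[r][c]) for r in range(n_rows)]
        cutoff = mean(abs_col) + n_std * stdev(abs_col)  # assumption: sample std
        for r in range(n_rows):
            if abs_col[r] < cutoff:
                result[r][c] = 0.0
    return result


def project_documents(doc_term, u):
    """Assumed projection: (documents x terms) times U (terms x topics)."""
    n_terms, n_topics = len(u), len(u[0])
    return [[sum(d[t] * u[t][k] for t in range(n_terms)) for k in range(n_topics)]
            for d in doc_term]


# Toy U: 4 terms by 2 topics
u = [[0.9, 0.1], [0.1, -0.8], [0.2, 0.1], [0.0, 0.2]]
# Toy document-by-term frequencies: 4 documents by 4 terms
doc_term = [[3, 0, 1, 0], [0, 3, 0, 1], [1, 1, 0, 0], [1, 0, 0, 2]]

u_cut = apply_cutoff(u)                        # term cutoff applied to U
docpro = project_documents(doc_term, u_cut)    # re-form document projections
doc_cut = apply_cutoff(docpro)                 # document cutoff on projections
print(u_cut)
print(doc_cut)
```

In this toy run the last two documents end up with all-zero rows, so they belong to no topic. That is the expected behavior of a cutoff like this: not every document is guaranteed a topic.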

RussAlbright · ‎12-05-2019

Unfortunately, it isn't possible to use PROC HPTMINE to get the full topic results found in Text Miner. There is quite a lot of SAS code in Text Miner that executes after PROC HPTMINE to do these calculations. If you submit "options mprint;" in your startup code, you will see it. Our newer action on Viya, the tmMine action, does have the full computation contained within the action, so if you move to that you will have direct access to the computation there. Russ

RussAlbright · ‎11-11-2019

JinHong, If you want complete coverage, with every document belonging to a topic, you could look at clustering rather than topics. You do have some control of topics through macro variables that you can set in your startup code. Take a look at these two, found in the Text Miner doc under "Macro Variables, Macros, and Functions":

TMM_DOCCUTOFF 0.001 document cutoff value is for any user-created topic. It is used to determine the default document cutoff for user topics (excluding those that are modified multi-term or single-term topics) in the Topic table. Higher values decrease the number of documents assigned to a topic.

TMM_TERM_CUTOFF cutoff value is for any user-created or multi-term topic. It is used to determine the default term cutoff for user topics (excluding those that are modified multi-term or single-term topics) and for multi-term in the Topic table. Higher values decrease the number of documents assigned to a topic. If this macro variable is set to blank or not set, then the mean topic weight + 1 standard deviation is set for topic cutoff for each topic.

As far as the optimal number of clusters, SAS Text Miner uses a heuristic based on your max number of dimensions, taking a certain percentage explained from that. Ideally we would like to take the percentage from the complete SVD, not the truncated one, but that is computationally not feasible with large text. I always treat this value as one to be tuned, typically along with the entries on my stop list. I experiment with changing the number of topics from 5-25 or so until I find one that seems useful. I will also look at the descriptive terms for the topics and add terms to the stop list that seem non-informative given the context. Repeat until you get some useful insights.

RussAlbright · ‎07-11-2019

Hi, This sounds a little more like a data cleansing or fuzzy matching type of task. Take a look at something like this for SAS functions and programs to help you standardize the input: https://www.lexjansen.com/sesug/2018/SESUG2018_Paper-143_Final_PDF.pdf Text Miner is based on how terms tend to co-occur within documents. The learning occurs across the collection, based on these patterns of co-occurrence. In your example, where you mostly have a single term per document, there is no co-occurrence going on, so Text Miner is not the best tool for this kind of task.

RussAlbright · ‎04-18-2019

Jonathon, You can use the parent table in the workspace directory. It has the form of triples: termnum, document, frequency. In the end, in order to interpret results, you just have to map the termnum back to the term string from the terms table. Russ
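As a small illustration of that mapping, here is a Python sketch with made-up data. The term numbers, document ids, and term strings are hypothetical; in practice both tables would be read from the workspace directory.

```python
from collections import defaultdict

# Hypothetical terms table: termnum -> term string
terms_table = {1: "claim", 2: "policy", 3: "refund"}

# Hypothetical parent table rows: (termnum, document, frequency)
parent_triples = [(1, 101, 2), (2, 101, 1), (1, 102, 1), (3, 102, 4)]


def docs_as_term_counts(triples, terms):
    """Group the triples by document and replace each termnum with its term string."""
    docs = defaultdict(dict)
    for termnum, document, frequency in triples:
        docs[document][terms[termnum]] = frequency
    return dict(docs)


readable = docs_as_term_counts(parent_triples, terms_table)
print(readable)
```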

RussAlbright · ‎04-17-2019

The Text Parse node creates an underlying representation in the Terms table (which you mentioned you saw) and a term-by-document frequency table that we refer to as the parent table. You cannot directly see this unless you look in your workspace project directory. When you follow the Text Parse node with a Text Filter node and other Text Mining nodes, these representations are used and not the original input text in that export table. So the stopped terms are being used. It is not until you use a Text Cluster node or a Text Topics node that you see the change on the exported table. And even then, the change is in a set of columns that are the numeric representation of the document (taking into account your stopped terms). The actual raw input text is never changed and exported. Russ

RussAlbright · ‎02-19-2019

Check out the Search Action Set in the standard SAS Viya distribution. https://go.documentation.sas.com/?docsetId=casanpg&docsetTarget=cas-search-TblOfActions.htm&docsetVersion=8.3&locale=en

RussAlbright · ‎02-16-2019

You will have to go to code to do this, I think, unless you want to manually add paths for each zip code level, and it sounds like you may have too many to do that. To do this in code, you can create a flow with one zip code chosen. Make sure that is working properly. Then right-click on the path and choose Export Path as SAS Program. Once you have the SAS code, you can wrap it in a %do loop to cycle through the subsets of the data. With each subset, you run the same cluster code that you saved out.

Online Status	Offline
Date Last Visited	‎12-08-2023 08:50 PM

Re: score data set in sas vfl

Re: SAS Viya 4 - SAS Visual Text Analytics - tpParse and tpSpell using...

Re: Topic node not showing document data in text analytics model studi...

Re: Viya Text Analytics - Using Text Parsing node result for concept

Re: Proc Hptmine: What is the formula behind assigning a topic to a do...

Re: Proc Hptmine: What is the formula behind assigning a topic to a do...

Re: SAS Text Analytics Text Topic Node

Re: Proc Hptmine: What is the formula behind assigning a topic to a do...

Re: TEXTTOPIC_TRAIN dataset: is it possible to construct it in PROC HP...

Re: SAS Text Analytics Text Topic Node

Ranking Teams with Matrix Computations

Re: How to Standardize Text Values with Text Miner in SAS Enterprise M...

Re: Question about exported data from text parsing in SAS Enterprise M...

Re: Question about exported data from text parsing in SAS Enterprise M...

Re: Natural Language Querying in SAS Viya

Re: text mining - alternative tool for start node, end node for groupi...