DATA Step, Macro, Functions and more

Hash find vs datastep find

Reply
Contributor
Posts: 29

Hash find vs datastep find

[ Edited ]

I have two datasets. One has clean variable text another has uncleaned with page numbers. I want to identify clean text into uncleaned text, like find function in data step. is there any solution?

 

I have tried below code it is not searching like index/find in data step.

 

P.S: I do not have any common variable to merge in datastep. I choose hash table to find page numbers from another dataset. I can not use Perl regular expression since variable acrf is unstructured.

 

data final;
    if _n_ eq 1 then do;
        if 0 then set acrftxt;
        dcl hash h(dataset:'acrftxt');
        h.definekey('acrf');
        h.definedata('pages');
        h.definedone();
    end;
    set gdb;
    if h.find(key: question) eq 0 then page=pages;
run;

acrftxt dataset eg

 

data ACRFTXT;
infile datalines dsd truncover;
input acrf:$32767. pages:$12.;
datalines4;
19. Preferred Term Code [hidden],1
[Preferred Term Code],1
20. High Level Term Code [hidden],1
,2
[High Level Term Code],2
,2
21.,2
,2
High Level Group Term Code [hidden],2
,2
[High Level Group Term Code],2
,2
22.,2
,2
Body System or Organ Class Code [hidden],2
1.,4
2.,4
3.,4
4.,4
5.,4
6.,4
7.,4
,4
,4
Microbiology Subcategory,4
,4
[Microbiology Subcategory],4
,4
Specimen Type,4
,4
[Specimen Type],4
,4
What was the site of specimen?,4
,4
[Site of Specimen],4
9.,4
10.,4
11.,4
12.,4
13.,4
14.,4
,4
,4
D,4
,4
,4
Were any isolates obtained from this specimen?,4
,4
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]",4
,4
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]",4
,4
[Oxacillin-susceptible],5
,5
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]",5
,5
Organism Genus/Species,5
,5
Entry,5
,5
Organism Genus/Species,5
,5
[Organism Genus/Species],5
;;;;
run;

 

gdb dataset eg

 

data GDB;
infile datalines dsd truncover;
input question:$200.;
datalines4;
Preferred Term Code [hidden]
High Level Term Code [hidden]
High Level Group Term Code [hidden]
Body System or Organ Class Code [hidden]
Microbiology Subcategory
What was the site of specimen?
Specimen Type
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]"
Were any isolates obtained from this specimen?
Organism Genus/Species
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]"
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]"
;;;;
run;

 

Attachment
Respected Advisor
Posts: 4,736

Re: Hash find vs datastep find

Please post your data as two working SAS data steps creating this data and not as screenshots.

 

A hash key lookup works only over an exact match so it's not suitable for your use case.

 

If your problem is only the potentially leading page numbers (digits, dot blank) at the beginning of a string then just clean up the string. A simple RegEx can do this job.

Super User
Posts: 10,217

Re: Hash find vs datastep find

Please post your example datasets as data steps with datalines. NEVER post pictures of data, we've got better things to do than typing things off screenshots.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Contributor
Posts: 29

Re: Hash find vs datastep find

Posted in reply to KurtBremser
Thank you. I have attached dataset
Super User
Posts: 10,217

Re: Hash find vs datastep find

Posting .sas7bdat files may often not be helpful, as encodings or byte order may prevemt users from using them. That's why we STRONGLY recommend to post example data in a data step with datalines, as code can be simply copy/pasted and submitted on any environment.

A macro to do that conversion automatically can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
How to post code
Contributor
Posts: 29

Re: Hash find vs datastep find

Posted in reply to KurtBremser
@KurtBremser

I have used that macro and added SAS code
Super User
Super User
Posts: 9,599

Re: Hash find vs datastep find

" I can not use Perl regular expression since variable acrf is unstructured." - if its unstructured then nothing will work on it, there needs to be logical methodology to find one string in another.  You could generate the code from one to the other as another option (still needs the logic though);

data _null_;
  set gdb end=last;
  if _n_=1 then call execute('data want;  set have;');
  call execute(cats('if index(acrf,"'question,'") then found=1;'));
  if last then call execute('run;');
run;

This will create a datastep with one if for each row in gdb.

Contributor
Posts: 29

Re: Hash find vs datastep find

Thank you RW9. It is working but taking longer time to execute. around 10min. Since dataset contains larger number of observations.
Respected Advisor
Posts: 4,736

Re: Hash find vs datastep find

Just to demonstrate how the clean-up could look like:

data demo;
  have_str='3. At what time did the adverse event start';
  want_str=prxchange('s/(^\d+\.\s)(.*)/\2/o',1,strip(have_str));
run;

For anything else/more code: Post your data in the form of working data steps as I'm like others not going to do this work for you.

 

Contributor
Posts: 29

Re: Hash find vs datastep find

Thank you Patrick I have attached dataset
Highlighted
Respected Advisor
Posts: 4,736

Re: Hash find vs datastep find

@Rajaram

Something is missing in your data: In ds ACRFTXT there are only blanks for column ACRF.

Contributor
Posts: 29

Re: Hash find vs datastep find

@Patrick

Yes you are correct, It has blank values in the column. I will give you background. aCRF it is a text from directly coming from PDF (SaveAs Other/PDFTOTEXT). When I saved from PDF all the formatting went because of text file nature. In SAS I can not import PDF files directly so i have used text format to search text and pages from PDF.
Super User
Super User
Posts: 9,599

Re: Hash find vs datastep find

Why are you reading in the aCRF pDF in the first place.  This is a specification document for the database, so just extract the metadata directly from the database.  Reading in a PDF is going to be a lot of work, worthless duplication of what is already around, and not robust - how will you handle changes etc.

Contributor
Posts: 29

Re: Hash find vs datastep find

@RW9
Thanks for replying, I am in process of automating aCRF for SDTM submission. I am following the paper https://www.pharmasug.org/proceedings/2015/AD/PharmaSUG-2015-AD07.pdf
Super User
Super User
Posts: 9,599

Re: Hash find vs datastep find

Well, I don't know that paper, however the process itself sounds backwards.  PDF's are outputs, they are only used for people to look at, they are not conducive to any other process.  Most databases (Oracle, Medidata Rave etc.) have modules designed for standard CRF builds.  These are accessible for Data Management staff - who are responsible for this part - and other users.  It is down to the DM group to create standardised CRF libraries, then use these to implement database builds.  As a programmer, you can simply extract this metadata directly from the database.  This is a preferred method as then all the information done and entered in one place (hence one of the main reasons we use databases in the first place), it is stored in a usable format, and provides the option to extract as raw data or produce reports.  Doing this process the other way, getting an output, then reading that in and processing loses all of this - i.e. if anything changes you need to start again by getting the output and processing it.

Ask a Question
Discussion stats
  • 14 replies
  • 170 views
  • 4 likes
  • 4 in conversation