Hash find vs datastep find

Rajaram · Posted 03-07-2018 07:27 AM

I have two datasets. One has clean variable text another has uncleaned with page numbers. I want to identify clean text into uncleaned text, like find function in data step. is there any solution?

I have tried below code it is not searching like index/find in data step.

P.S: I do not have any common variable to merge in datastep. I choose hash table to find page numbers from another dataset. I can not use Perl regular expression since variable acrf is unstructured.

data final;
    if _n_ eq 1 then do;
        if 0 then set acrftxt;
        dcl hash h(dataset:'acrftxt');
        h.definekey('acrf');
        h.definedata('pages');
        h.definedone();
    end;
    set gdb;
    if h.find(key: question) eq 0 then page=pages;
run;

acrftxt dataset eg

data ACRFTXT;
 infile datalines dsd truncover;
 input acrf:$32767. pages:$12.;
datalines4;
19. Preferred Term Code [hidden],1
[Preferred Term Code],1
20. High Level Term Code [hidden],1
,2
[High Level Term Code],2
,2
21.,2
,2
High Level Group Term Code [hidden],2
,2
[High Level Group Term Code],2
,2
22.,2
,2
Body System or Organ Class Code [hidden],2
1.,4
2.,4
3.,4
4.,4
5.,4
6.,4
7.,4
,4
,4
Microbiology Subcategory,4
,4
[Microbiology Subcategory],4
,4
Specimen Type,4
,4
[Specimen Type],4
,4
What was the site of specimen?,4
,4
[Site of Specimen],4
9.,4
10.,4
11.,4
12.,4
13.,4
14.,4
,4
,4
D,4
,4
,4
Were any isolates obtained from this specimen?,4
,4
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]",4
,4
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]",4
,4
[Oxacillin-susceptible],5
,5
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]",5
,5
Organism Genus/Species,5
,5
Entry,5
,5
Organism Genus/Species,5
,5
[Organism Genus/Species],5
;;;;
run;

gdb dataset eg

data GDB;
 infile datalines dsd truncover;
 input question:$200.;
datalines4;
Preferred Term Code [hidden]
High Level Term Code [hidden]
High Level Group Term Code [hidden]
Body System or Organ Class Code [hidden]
Microbiology Subcategory
What was the site of specimen?
Specimen Type
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]"
Were any isolates obtained from this specimen?
Organism Genus/Species
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]"
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]"
;;;;
run;

Patrick · Posted 03-07-2018 07:36 AM

Please post your data as two working SAS data steps creating this data and not as screenshots.

A hash key lookup works only over an exact match so it's not suitable for your use case.

If your problem is only the potentially leading page numbers (digits, dot blank) at the beginning of a string then just clean up the string. A simple RegEx can do this job.

Kurt_Bremser · Posted 03-07-2018 07:36 AM

Please post your example datasets as data steps with datalines. NEVER post pictures of data, we've got better things to do than typing things off screenshots.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Rajaram · Posted 03-08-2018 01:12 AM

Thank you. I have attached dataset

Kurt_Bremser · Posted 03-08-2018 02:18 AM

Posting .sas7bdat files may often not be helpful, as encodings or byte order may prevemt users from using them. That's why we STRONGLY recommend to post example data in a data step with datalines, as code can be simply copy/pasted and submitted on any environment.

A macro to do that conversion automatically can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Rajaram · Posted 03-09-2018 03:49 AM

@Kurt_Bremser

I have used that macro and added SAS code

RW9 · Posted 03-07-2018 08:06 AM

" I can not use Perl regular expression since variable acrf is unstructured." - if its unstructured then nothing will work on it, there needs to be logical methodology to find one string in another. You could generate the code from one to the other as another option (still needs the logic though);

data _null_;
  set gdb end=last;
  if _n_=1 then call execute('data want;  set have;');
  call execute(cats('if index(acrf,"'question,'") then found=1;'));
  if last then call execute('run;');
run;

This will create a datastep with one if for each row in gdb.

Rajaram · Posted 03-08-2018 01:21 AM

Thank you RW9. It is working but taking longer time to execute. around 10min. Since dataset contains larger number of observations.

Patrick · Posted 03-07-2018 08:43 AM

Just to demonstrate how the clean-up could look like:

data demo;
  have_str='3. At what time did the adverse event start';
  want_str=prxchange('s/(^\d+\.\s)(.*)/\2/o',1,strip(have_str));
run;

For anything else/more code: Post your data in the form of working data steps as I'm like others not going to do this work for you.

Rajaram · Posted 03-08-2018 01:12 AM

Thank you Patrick I have attached dataset

Patrick · Posted 03-08-2018 02:58 PM

@Rajaram

Something is missing in your data: In ds ACRFTXT there are only blanks for column ACRF.

Rajaram · Posted 03-09-2018 03:55 AM

@Patrick

Yes you are correct, It has blank values in the column. I will give you background. aCRF it is a text from directly coming from PDF (SaveAs Other/PDFTOTEXT). When I saved from PDF all the formatting went because of text file nature. In SAS I can not import PDF files directly so i have used text format to search text and pages from PDF.

RW9 · Posted 03-09-2018 04:11 AM

Why are you reading in the aCRF pDF in the first place. This is a specification document for the database, so just extract the metadata directly from the database. Reading in a PDF is going to be a lot of work, worthless duplication of what is already around, and not robust - how will you handle changes etc.

Rajaram · Posted 03-09-2018 05:19 AM

@RW9
Thanks for replying, I am in process of automating aCRF for SDTM submission. I am following the paper https://www.pharmasug.org/proceedings/2015/AD/PharmaSUG-2015-AD07.pdf

RW9 · Posted 03-09-2018 05:43 AM

Well, I don't know that paper, however the process itself sounds backwards. PDF's are outputs, they are only used for people to look at, they are not conducive to any other process. Most databases (Oracle, Medidata Rave etc.) have modules designed for standard CRF builds. These are accessible for Data Management staff - who are responsible for this part - and other users. It is down to the DM group to create standardised CRF libraries, then use these to implement database builds. As a programmer, you can simply extract this metadata directly from the database. This is a preferred method as then all the information done and entered in one place (hence one of the main reasons we use databases in the first place), it is stored in a usable format, and provides the option to extract as raw data or produce reports. Doing this process the other way, getting an output, then reading that in and processing loses all of this - i.e. if anything changes you need to start again by getting the output and processing it.

Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Re: Hash find vs datastep find

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away