I have two datasets. One has clean variable text another has uncleaned with page numbers. I want to identify clean text into uncleaned text, like find function in data step. is there any solution?
I have tried below code it is not searching like index/find in data step.
P.S: I do not have any common variable to merge in datastep. I choose hash table to find page numbers from another dataset. I can not use Perl regular expression since variable acrf is unstructured.
data final;
if _n_ eq 1 then do;
if 0 then set acrftxt;
dcl hash h(dataset:'acrftxt');
h.definekey('acrf');
h.definedata('pages');
h.definedone();
end;
set gdb;
if h.find(key: question) eq 0 then page=pages;
run;
acrftxt dataset eg
data ACRFTXT;
infile datalines dsd truncover;
input acrf:$32767. pages:$12.;
datalines4;
19. Preferred Term Code [hidden],1
[Preferred Term Code],1
20. High Level Term Code [hidden],1
,2
[High Level Term Code],2
,2
21.,2
,2
High Level Group Term Code [hidden],2
,2
[High Level Group Term Code],2
,2
22.,2
,2
Body System or Organ Class Code [hidden],2
1.,4
2.,4
3.,4
4.,4
5.,4
6.,4
7.,4
,4
,4
Microbiology Subcategory,4
,4
[Microbiology Subcategory],4
,4
Specimen Type,4
,4
[Specimen Type],4
,4
What was the site of specimen?,4
,4
[Site of Specimen],4
9.,4
10.,4
11.,4
12.,4
13.,4
14.,4
,4
,4
D,4
,4
,4
Were any isolates obtained from this specimen?,4
,4
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]",4
,4
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]",4
,4
[Oxacillin-susceptible],5
,5
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]",5
,5
Organism Genus/Species,5
,5
Entry,5
,5
Organism Genus/Species,5
,5
[Organism Genus/Species],5
;;;;
run;
gdb dataset eg
data GDB;
infile datalines dsd truncover;
input question:$200.;
datalines4;
Preferred Term Code [hidden]
High Level Term Code [hidden]
High Level Group Term Code [hidden]
Body System or Organ Class Code [hidden]
Microbiology Subcategory
What was the site of specimen?
Specimen Type
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]"
Were any isolates obtained from this specimen?
Organism Genus/Species
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]"
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]"
;;;;
run;
Please post your data as two working SAS data steps creating this data and not as screenshots.
A hash key lookup works only over an exact match so it's not suitable for your use case.
If your problem is only the potentially leading page numbers (digits, dot blank) at the beginning of a string then just clean up the string. A simple RegEx can do this job.
Please post your example datasets as data steps with datalines. NEVER post pictures of data, we've got better things to do than typing things off screenshots.
Posting .sas7bdat files may often not be helpful, as encodings or byte order may prevemt users from using them. That's why we STRONGLY recommend to post example data in a data step with datalines, as code can be simply copy/pasted and submitted on any environment.
A macro to do that conversion automatically can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...
" I can not use Perl regular expression since variable acrf is unstructured." - if its unstructured then nothing will work on it, there needs to be logical methodology to find one string in another. You could generate the code from one to the other as another option (still needs the logic though);
data _null_; set gdb end=last; if _n_=1 then call execute('data want; set have;'); call execute(cats('if index(acrf,"'question,'") then found=1;')); if last then call execute('run;'); run;
This will create a datastep with one if for each row in gdb.
Just to demonstrate how the clean-up could look like:
data demo;
have_str='3. At what time did the adverse event start';
want_str=prxchange('s/(^\d+\.\s)(.*)/\2/o',1,strip(have_str));
run;
For anything else/more code: Post your data in the form of working data steps as I'm like others not going to do this work for you.
Something is missing in your data: In ds ACRFTXT there are only blanks for column ACRF.
Why are you reading in the aCRF pDF in the first place. This is a specification document for the database, so just extract the metadata directly from the database. Reading in a PDF is going to be a lot of work, worthless duplication of what is already around, and not robust - how will you handle changes etc.
Well, I don't know that paper, however the process itself sounds backwards. PDF's are outputs, they are only used for people to look at, they are not conducive to any other process. Most databases (Oracle, Medidata Rave etc.) have modules designed for standard CRF builds. These are accessible for Data Management staff - who are responsible for this part - and other users. It is down to the DM group to create standardised CRF libraries, then use these to implement database builds. As a programmer, you can simply extract this metadata directly from the database. This is a preferred method as then all the information done and entered in one place (hence one of the main reasons we use databases in the first place), it is stored in a usable format, and provides the option to extract as raw data or produce reports. Doing this process the other way, getting an output, then reading that in and processing loses all of this - i.e. if anything changes you need to start again by getting the output and processing it.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.