BookmarkSubscribeRSS Feed
Rajaram
Obsidian | Level 7

I have two datasets. One has clean variable text another has uncleaned with page numbers. I want to identify clean text into uncleaned text, like find function in data step. is there any solution?

 

I have tried below code it is not searching like index/find in data step.

 

P.S: I do not have any common variable to merge in datastep. I choose hash table to find page numbers from another dataset. I can not use Perl regular expression since variable acrf is unstructured.

 

data final;
    if _n_ eq 1 then do;
        if 0 then set acrftxt;
        dcl hash h(dataset:'acrftxt');
        h.definekey('acrf');
        h.definedata('pages');
        h.definedone();
    end;
    set gdb;
    if h.find(key: question) eq 0 then page=pages;
run;

acrftxt dataset eg

 

data ACRFTXT;
infile datalines dsd truncover;
input acrf:$32767. pages:$12.;
datalines4;
19. Preferred Term Code [hidden],1
[Preferred Term Code],1
20. High Level Term Code [hidden],1
,2
[High Level Term Code],2
,2
21.,2
,2
High Level Group Term Code [hidden],2
,2
[High Level Group Term Code],2
,2
22.,2
,2
Body System or Organ Class Code [hidden],2
1.,4
2.,4
3.,4
4.,4
5.,4
6.,4
7.,4
,4
,4
Microbiology Subcategory,4
,4
[Microbiology Subcategory],4
,4
Specimen Type,4
,4
[Specimen Type],4
,4
What was the site of specimen?,4
,4
[Site of Specimen],4
9.,4
10.,4
11.,4
12.,4
13.,4
14.,4
,4
,4
D,4
,4
,4
Were any isolates obtained from this specimen?,4
,4
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]",4
,4
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]",4
,4
[Oxacillin-susceptible],5
,5
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]",5
,5
Organism Genus/Species,5
,5
Entry,5
,5
Organism Genus/Species,5
,5
[Organism Genus/Species],5
;;;;
run;

 

gdb dataset eg

 

data GDB;
infile datalines dsd truncover;
input question:$200.;
datalines4;
Preferred Term Code [hidden]
High Level Term Code [hidden]
High Level Group Term Code [hidden]
Body System or Organ Class Code [hidden]
Microbiology Subcategory
What was the site of specimen?
Specimen Type
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]"
Were any isolates obtained from this specimen?
Organism Genus/Species
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]"
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]"
;;;;
run;

 

14 REPLIES 14
Patrick
Opal | Level 21

Please post your data as two working SAS data steps creating this data and not as screenshots.

 

A hash key lookup works only over an exact match so it's not suitable for your use case.

 

If your problem is only the potentially leading page numbers (digits, dot blank) at the beginning of a string then just clean up the string. A simple RegEx can do this job.

Kurt_Bremser
Super User

Please post your example datasets as data steps with datalines. NEVER post pictures of data, we've got better things to do than typing things off screenshots.

Rajaram
Obsidian | Level 7
Thank you. I have attached dataset
Kurt_Bremser
Super User

Posting .sas7bdat files may often not be helpful, as encodings or byte order may prevemt users from using them. That's why we STRONGLY recommend to post example data in a data step with datalines, as code can be simply copy/pasted and submitted on any environment.

A macro to do that conversion automatically can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...

Rajaram
Obsidian | Level 7
@Kurt_Bremser

I have used that macro and added SAS code
RW9
Diamond | Level 26 RW9
Diamond | Level 26

" I can not use Perl regular expression since variable acrf is unstructured." - if its unstructured then nothing will work on it, there needs to be logical methodology to find one string in another.  You could generate the code from one to the other as another option (still needs the logic though);

data _null_;
  set gdb end=last;
  if _n_=1 then call execute('data want;  set have;');
  call execute(cats('if index(acrf,"'question,'") then found=1;'));
  if last then call execute('run;');
run;

This will create a datastep with one if for each row in gdb.

Rajaram
Obsidian | Level 7
Thank you RW9. It is working but taking longer time to execute. around 10min. Since dataset contains larger number of observations.
Patrick
Opal | Level 21

Just to demonstrate how the clean-up could look like:

data demo;
  have_str='3. At what time did the adverse event start';
  want_str=prxchange('s/(^\d+\.\s)(.*)/\2/o',1,strip(have_str));
run;

For anything else/more code: Post your data in the form of working data steps as I'm like others not going to do this work for you.

 

Rajaram
Obsidian | Level 7
Thank you Patrick I have attached dataset
Patrick
Opal | Level 21

@Rajaram

Something is missing in your data: In ds ACRFTXT there are only blanks for column ACRF.

Rajaram
Obsidian | Level 7
@Patrick

Yes you are correct, It has blank values in the column. I will give you background. aCRF it is a text from directly coming from PDF (SaveAs Other/PDFTOTEXT). When I saved from PDF all the formatting went because of text file nature. In SAS I can not import PDF files directly so i have used text format to search text and pages from PDF.
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Why are you reading in the aCRF pDF in the first place.  This is a specification document for the database, so just extract the metadata directly from the database.  Reading in a PDF is going to be a lot of work, worthless duplication of what is already around, and not robust - how will you handle changes etc.

Rajaram
Obsidian | Level 7
@RW9
Thanks for replying, I am in process of automating aCRF for SDTM submission. I am following the paper https://www.pharmasug.org/proceedings/2015/AD/PharmaSUG-2015-AD07.pdf
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, I don't know that paper, however the process itself sounds backwards.  PDF's are outputs, they are only used for people to look at, they are not conducive to any other process.  Most databases (Oracle, Medidata Rave etc.) have modules designed for standard CRF builds.  These are accessible for Data Management staff - who are responsible for this part - and other users.  It is down to the DM group to create standardised CRF libraries, then use these to implement database builds.  As a programmer, you can simply extract this metadata directly from the database.  This is a preferred method as then all the information done and entered in one place (hence one of the main reasons we use databases in the first place), it is stored in a usable format, and provides the option to extract as raw data or produce reports.  Doing this process the other way, getting an output, then reading that in and processing loses all of this - i.e. if anything changes you need to start again by getting the output and processing it.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 4071 views
  • 4 likes
  • 4 in conversation