03-07-2018 07:27 AM - edited 03-09-2018 03:47 AM
I have two datasets. One has clean variable text another has uncleaned with page numbers. I want to identify clean text into uncleaned text, like find function in data step. is there any solution?
I have tried below code it is not searching like index/find in data step.
P.S: I do not have any common variable to merge in datastep. I choose hash table to find page numbers from another dataset. I can not use Perl regular expression since variable acrf is unstructured.
data final; if _n_ eq 1 then do; if 0 then set acrftxt; dcl hash h(dataset:'acrftxt'); h.definekey('acrf'); h.definedata('pages'); h.definedone(); end; set gdb; if h.find(key: question) eq 0 then page=pages; run;
acrftxt dataset eg
infile datalines dsd truncover;
input acrf:$32767. pages:$12.;
19. Preferred Term Code [hidden],1
[Preferred Term Code],1
20. High Level Term Code [hidden],1
[High Level Term Code],2
High Level Group Term Code [hidden],2
[High Level Group Term Code],2
Body System or Organ Class Code [hidden],2
What was the site of specimen?,4
[Site of Specimen],4
Were any isolates obtained from this specimen?,4
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]",4
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]",4
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]",5
gdb dataset eg
infile datalines dsd truncover;
Preferred Term Code [hidden]
High Level Term Code [hidden]
High Level Group Term Code [hidden]
Body System or Organ Class Code [hidden]
What was the site of specimen?
"If a Gram-positive pathogen was cultured, is it vancomycin-susceptible?(only applies to Enterococcus, Pediococcus, Lactobacillus or Leuconostoc) [hidden]"
Were any isolates obtained from this specimen?
"If the pathogen cultured is Staphylococcus aureus, is it oxacillin-susceptible? (methicillin-susceptible) [hidden]"
"If a Gram-negative pathogen was cultured, is it aztreonam susceptible? [hidden]"
03-07-2018 07:36 AM
Please post your data as two working SAS data steps creating this data and not as screenshots.
A hash key lookup works only over an exact match so it's not suitable for your use case.
If your problem is only the potentially leading page numbers (digits, dot blank) at the beginning of a string then just clean up the string. A simple RegEx can do this job.
03-07-2018 07:36 AM
Please post your example datasets as data steps with datalines. NEVER post pictures of data, we've got better things to do than typing things off screenshots.
03-08-2018 02:18 AM
Posting .sas7bdat files may often not be helpful, as encodings or byte order may prevemt users from using them. That's why we STRONGLY recommend to post example data in a data step with datalines, as code can be simply copy/pasted and submitted on any environment.
A macro to do that conversion automatically can be found here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat...
03-07-2018 08:06 AM
" I can not use Perl regular expression since variable acrf is unstructured." - if its unstructured then nothing will work on it, there needs to be logical methodology to find one string in another. You could generate the code from one to the other as another option (still needs the logic though);
data _null_; set gdb end=last; if _n_=1 then call execute('data want; set have;'); call execute(cats('if index(acrf,"'question,'") then found=1;')); if last then call execute('run;'); run;
This will create a datastep with one if for each row in gdb.
03-07-2018 08:43 AM
Just to demonstrate how the clean-up could look like:
data demo; have_str='3. At what time did the adverse event start'; want_str=prxchange('s/(^\d+\.\s)(.*)/\2/o',1,strip(have_str)); run;
For anything else/more code: Post your data in the form of working data steps as I'm like others not going to do this work for you.
03-09-2018 03:55 AM
03-09-2018 04:11 AM
Why are you reading in the aCRF pDF in the first place. This is a specification document for the database, so just extract the metadata directly from the database. Reading in a PDF is going to be a lot of work, worthless duplication of what is already around, and not robust - how will you handle changes etc.
03-09-2018 05:19 AM
03-09-2018 05:43 AM
Well, I don't know that paper, however the process itself sounds backwards. PDF's are outputs, they are only used for people to look at, they are not conducive to any other process. Most databases (Oracle, Medidata Rave etc.) have modules designed for standard CRF builds. These are accessible for Data Management staff - who are responsible for this part - and other users. It is down to the DM group to create standardised CRF libraries, then use these to implement database builds. As a programmer, you can simply extract this metadata directly from the database. This is a preferred method as then all the information done and entered in one place (hence one of the main reasons we use databases in the first place), it is stored in a usable format, and provides the option to extract as raw data or produce reports. Doing this process the other way, getting an output, then reading that in and processing loses all of this - i.e. if anything changes you need to start again by getting the output and processing it.