BookmarkSubscribeRSS Feed
chinna0369
Pyrite | Level 9

Hi,

 

I have data as below I would like to create a numeric variable based the character value below. There are multiple values, is there any way to get the number only from those sentences? I have provided some data, and there are lot of different sentences so it should understand the sentence and take the number that is how many are missing pills or tablets? I have attached picture, which shows the numeric value we need to get in that new variable.

Data.PNG

 

data have;
input id $2. val $200.;
datalines;
1	1 MINI-PILL MISSING FROM COUNT PER MOM MAY HAVE DROPPED A MINI-PILL
2	2 MINI-TABLETS REMAINED STUCK IN THE BOTTLE
3	PER SOURCE DOCUMENT PATIENT ONLY RECEIVED 200MG (100MG EACH DOSE).
4	PER SOURCE DOCUMENT PATIENT ONLY RECEIVED 200MG (100MG EACH DOSE).
5	MISSKING 10 TABLETS DURING V12-V13
6	MISSED 4 TIMES ON UNKNOWN DATES,ONE TIME IS 10 TABLETS DURING V13-V14.
7	MISSKING 10 TABLETS DATA VALUE CHANGED
8	MISSED 10 TABLETS BETWEEN V15 AND V16
9	MISSING DOSE OF JUL-04 AM, JUL-18 AM, AUG-11 AM, AUG-18 PM, AND 65 PILLS WERE MISSED ON AN UNKNOWN DATE
10	THREE STUDY DRUGS WERE MISSED, AND THE TIME AND REASON WERE NOT KNOWN
11	TWO TABLETS WERE MISSED DURING V7, AND THE PATIENTS FAMILY MEMBERS FORGOT THE SPECIFIC DATE OF THE MISSED DOSE
12	7 STUDY DRUGS WERE MISSED, REMOTE ACCESS, NO COLLECTION
13	1 STUDY DRUG WAS MISSED, TIME AND REASON UNKNOWN
14	MISSED TAKING 1 DOSE, 12 TABLETS
15	THE PATIENT HAD MISSED 50 OF THE STUDY PILLS, AND THE EXACT TIME AND REASON FOR THE MISSED DOSE HAD BEEN FORGOTTEN
16	THE PATIENT TOOK 8 MORE TABLETS FROM THE EVENING OF AUGUST 3 TO AUGUST 8, AND THE TIME AND REASON OF THE EXTRA TABLETS HAVE BEEN FORGOTTEN.
17	3 TABLETS WERE MISSED, THE REASON, DAY AND MONTH WERE UNKNOWN.
18	1 TABLET WAS MISSED, REASON, DAY AND MONTH WERE UNKNOWN.
19	SUBJECT MISSED TAKING 10 TABLETS IN 2020YEAR
20	1 TABLET WAS MISSED, REASON, DAY AND MONTH WERE UNKNOWN.
21	THE PATIENT MISSED 20 STUDY DRUGS DURING V4-V5
22	MISSING 4 PIECES DURING V10-V11
23	13 PILLS WERE MISSED, 7 IN THE MORNING AND 6 IN THE AFTERNOON
24	8 CAPSULES WERE MISSED FROM FEBRUARY 25 TO FEBRUARY 27
25	FROM V3 TO V4, THERE WERE 26 PILLS NOT BE RETURNED WHICH INCLUDES PILLS DROPPED OFF ON THE GROUND AND UNEXPECTED DOSE BUT THE UNEXPECTED DOSE CAN NOT BE CONFIRMED
26	46 PILLS MISSED FROM V12 TO V13.
27	2 TABLETS WERE MISSED ON AN UNKNOWN DATE, AND MISSED BETWEEN 27TH FEB AND 22TH MAY.
28	19TABS MISSED TO BE TAKEN FROM V6 TO V7 DATE UNKNOWN
29	1 TABLETS WERE MISSED ON AN UNKNOWN DATE, AND MISSED BETWEEN 31TH MAY AND 15TH AUG.
30	28 PILLS WERE MISSED BETWEEN MAY 16, 2023, AND NOVEMBER 13, 2023, AT AN UNKNOWN TIME
31	17TABS MISSED TO BE TAKEN FROM V6 TO V7 DATE UNKNOWN
32	10TABS MISSED TO BE TAKEN FROM V12 TO V13, DATE UNKNOWN
33	1 TABLET WERE MISSED ON AN UNKNOWN DATE, AND MISSED BETWEEN 6TH JAN AND 31TH MARCH.
34	3TABLES MISSED FROM AUGUST 31, 2020 TO NOVEMBER 17, 2020 WITH TIME UNKNOWN
35	4TABLES MISSED FROM 17NOV2020 TO 29DEC2020 WITH TIME UNKNOWN
36	3TABLES MISSED FROM 26APR2021 TO 05AUG2021 WITH TIME UNKNOWN
37	11TABLES MISSED FROM 17MAY2022 TO 17AUG2022 WITH TIME UNKNOWN
38	47 TABLETS MISSED FROM 09MAR2023 TO 06SEP2023
39	43# TOTAL TABLETS MISSED FROM 01JUN2020 PM TO 22JUN2020 PM DOSED AT 360 MG/DAY PER CAREGIVER MISUNDERSTANDING OF THE ASSIGNED DOSE
40	THE SUBJECT DID NOT ADJUST THE MEDICINE ACCORDING TO THE PLAN, FORGOT TO ADJUST THE MEDICINE, AND MISSED TAKING
41	THE SUBJECT DID NOT ADJUST THE MEDICINE ACCORDING TO THE PLAN SO MISSED TAKING 2 TABLETS
42	1 TBALET WAS MISSED TO TAKE, REASON WAS UNKNOWN.
43	9 TABLETS WERE MISSED TO TAKE, REASON WAS UNKNOWN
44	SUBJECT MISSED TAKING 14 TABLETS
45	1 TABLET WAS FORGOTTEN TO TAKE.
46	SUBJECTS FORGOT TO TAKE 10 PILLS
47	SUBJECTS FORGOT TO TAKE 21 PILLS
48	V9-V10 SUBJECTS FORGOT TO TAKE 58 PILLS
49	V10-V11 SUBJECTS FORGOT TO TAKE 43 PILLS
50	17 JUN 2022-03 AUG 2022 SUBJECTS FORGOT TO TAKE 11 PILLS
;
run;
2 REPLIES 2
mkeintz
PROC Star

What do you want the result to look like?

 

I see you have example lines that have no desired numbers.  Do you ever have more than 1 desired number in a line of text?  I only see examples of 1 number and no numbers.

 

Given you sometimes have numbers in the form of words (e.g. "three" for ID=10), what range of values must you accommodate (e.g.  "one thousand three hundred twenty-one")?

 

It seems you are looking for numbers that represent a counts, (not a size or date component) - is that correct?  If so, you will need to develop rules to exclude numbers that are not counts, probably by looking at neighboring words (i.e. "tablets", "pills" are "good", but "January" is bad).  I.e. you might have to set up a list of "good" words and "bad" words.

 

There will be other responses, I think, with other questions.  From folks with more experience in text processing.

 

In the end, I think you will have to accept that you can't guarantee there will be no false positives or false negatives.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Patrick
Opal | Level 21

Also that it would be possible to define a Regular Expression that will select the desired numbers in your sample data, I guess that there will be additional text patterns not the least using different terms for the units like drops, strips, .... 

How much data do you have? Like would it be a feasible approach to create a RegEx for your current sample data for you to then eyeball all the source data where nothing has been selected to find additional cases - to then eventually amend the RegEx or define a 2nd one for these cases?

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 247 views
  • 1 like
  • 3 in conversation