BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Cruise
Ammonite | Level 13

Hi All, 

 

Help is appreciated in creating distinct list of chemicals from 5 columns of strings where texts are separated by multiple different types of delimiters: hyphen, comma, space, period, brackets, pound sign. Please see my unsuccessful attempt. I found other examples but they were concerned with only one type of delimiter. 

 


data chemicals(drop=waste:);
  length chemicals $3000;
  set have(keep=waste:);
  chemicals = catx(':',waste1,waste2,waste3,waste4,waste5);
 run;

data temp; set chemicals;
 do i=1 to countw(chemicals,' ');
 v1=scan(chemicals,i,' ','m');output;
 end;
 drop i;
run;

Please see data I HAVE AND WANT below. 

 

DATA HAVE;
LENGTH waste1 $600 waste2 $600 waste3 $600 waste4 $600 waste5 $600;
INFILE DATALINES DLM='#'; 
INPUT id waste1 $ waste2 $ waste3 $ waste4 $ waste5 $;
DATALINES;
1  #  Filter Cake-Celite, Carbon, Plasticizer    # (60% by wt) Toluene, Sulfonic                     # acid catalyst							  # Plasticizers, consisting of 26 TM			  # Trimillitate and DOA Adipate Plasticizers                             
2  #  Chromium sludge                            # Paint sludge                                      # Metals                                     # Chlorinated solvents                          #   
3  #  Trichlorethylene (F001)                    # Tetrachloroethene {(PCE or "perc.") F002}         # 1,1,1-Trichloroethane                      # Cyanide (F007)                                # Arsenic  (D003)
4  #  Chromium sludge                            # Paint sludge                                      # Metals                                     # Chlorinated solvents                          # 
5  #  Plant #1 Ester Sump: Residual organics.    # 2 Ethylhexanol, alcohols, diethylene glycol,      # ethylene glycol, plasticizer, adipic acid, # polyester                                     # Plant #2 polmyer solids
6  #  Cadmium, chromium, dichlorobenzene,        # Tetrachloroethylene. dieldrin.                    # Plating and painting wastes                #                                               # 
7  #  Toluene                                    # Xylene                                            # Ethyl Benzene                              #                                               # 
8  #  1,1,1-Trichloroethane {TCA (F001)}         #                                                   #                                            #                                               # 
9  #  Plating sludges                            # Starchy clay sludge                               # (D006, D007, D008 wastes)                  # Trichloroethylene, PVC sludge                 # Tetrachloroethylene, Vinyl chloride
10 #  Chlorinated Solvents (FOO1) (FOO1)         #                                                   #                                            #                                               # 
11 #  Toluene                                    # 1-2, Dichloroethylene                             # 1,1,1-Trichloroethane (FOO1) (FOO2)        # Trichloroethylene (TCE)                       # 
12 #  Fuel oil, gasoline, solvents               # Chlorinated hydrocarbons (FOO1 & FOO2)            #                                            #                                               # 
13 #  Chlorinated hydrocarbons and solvents      # (FOO1 & FOO2)                                     #                                            #                                               # 
14 #  Heavy Metals, Fe, Cr, Cu, Zn, Ni           # (D006) (D007)                                     #                                            #                                               # 
15 #  Waste solvents, phenols                    # Trichloroethylene (TCE)                           # Xylene, Dichlorobenzene (FOO1) (FOO2)      # Tetrachloroethylene (PCE or "perc.")          # 
16 #  1,1,1-Trichloroethane (FOO1 & FOO2)        # Methylene chloride                                #                                            #                                               # 
17 #  Tetrachloroethylene (PCE or "perc.")       # Trichloroethylene (TCE)                           # 1,1,1-Trichloroethane (TCA)                #                                               # 
18 #  Copper, lead (D008)                        # 1,1,1 Trichloroethane (F002)                      #                                            #                                               # 
19 #  Methylene chloride                         # 1,1,1-Trichloroethane, toluene, xylene,           # Ethyl benzene,  1,2-dichloropropane        # Aluminum, arsenic, copper, cadmium, chromium, # Lead, 1,1- dichloroethene, ethane and zinc
20 #  Inks                                       # Solvents (toluene) (FOO5)                         # Ethylbenzene                               # Ethylacetate                                  # 
21 #  1,1,1-Trichloroethane {(TCA) (F001 waste)} # Tetrachloroethylene  {(PCE or "perc.") (F001)}    #                                            #                                               #                                           
;

 data want(drop=waste:);
  length combined $3000;
  set have;
  combined = catx(':',waste1,waste2,waste3,waste4,waste5);
 run;

data want;
length Chemicals $50;
input Chemicals $;
cards;
Filter Cake-Celite
Carbon
Plasticizer
Chromium sludge
Trichlorethylene
Chromium sludge
Plant Ester Sump 
Residual organics
Ester Sump
Cadmium
Toluene
Trichloroethane 
Plating sludges
Chlorinated Solvents
Toluene
Fuel oil
Gasoline
Solvents
Chlorinated hydrocarbons
Solvents
Heavy Metals
Fe
Cr
Cu
Zn
Ni
Tetrachloroethylene
Copper
Lead
Methylene chrloride
Inks
Trichloroethane
;

 

1 ACCEPTED SOLUTION

Accepted Solutions
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, this demonstrates the basic principal.  You haven't mentioned all the rules though, some data is not split by comma, but includes it, some data in brackets is not used, I can only apply rules I know about:

data have;
  length waste1 $600 waste2 $600 waste3 $600 waste4 $600 waste5 $600;
  infile datalines dlm='#'; 
  input id waste1 $ waste2 $ waste3 $ waste4 $ waste5 $;
datalines;
1  #  Filter Cake-Celite, Carbon, Plasticizer    # (60% by wt) Toluene, Sulfonic                     # acid catalyst                       # Plasticizers, consisting of 26 TM           # Trimillitate and DOA Adipate Plasticizers                             
2  #  Chromium sludge                            # Paint sludge                                      # Metals                                     # Chlorinated solvents                          #   
3  #  Trichlorethylene (F001)                    # Tetrachloroethene {(PCE or "perc.") F002}         # 1,1,1-Trichloroethane                      # Cyanide (F007)                                # Arsenic  (D003)
4  #  Chromium sludge                            # Paint sludge                                      # Metals                                     # Chlorinated solvents                          # 
5  #  Plant #1 Ester Sump: Residual organics.    # 2 Ethylhexanol, alcohols, diethylene glycol,      # ethylene glycol, plasticizer, adipic acid, # polyester                                     # Plant #2 polmyer solids
6  #  Cadmium, chromium, dichlorobenzene,        # Tetrachloroethylene. dieldrin.                    # Plating and painting wastes                #                                               # 
7  #  Toluene                                    # Xylene                                            # Ethyl Benzene                              #                                               # 
8  #  1,1,1-Trichloroethane {TCA (F001)}         #         
;
run;

proc transpose data=have out=want;
  by id;
  var waste:;
run;

data want;
  set want;
  length wrd $2000;
  do i=1 to countw(col1,",");
    wrd=scan(col1,i,",");
    output;
  end;
run;

View solution in original post

12 REPLIES 12
RW9
Diamond | Level 26 RW9
Diamond | Level 26

The scan function allows a list of delimiters to be used:

http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000214639.htm

E.g.

want=scan(have,1,"!:,; ");
Cruise
Ammonite | Level 13

@RW9

Thanks for help. I just tried following which didn't work. What am I doing wrong here? It only outputs one column with the first string, which was for example 'Filter' of 'Filter Cake-Celite'.

 

data d.have1(drop=waste:);
length chemicals $3000;
set d.have(keep=waste:);
chemicals = catx(':',waste1,waste2,waste3,waste4,waste5);
run;

data d.have2; set d.have1;
var=scan(chemicals,1,"!:,-;#()");
run;

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Please use the code window - its the {i} above post, so code formatting is retained.

You have only told it to take the first string with those delimiter.  It should not split the string you give, as space is not part of the delimiter list.  If you want to loop over them then you need a loop, this for instance outputs each string block as a new obs:

data d.have2;
  set d.have1;
  do i=1 to countw(chemicals,"!:,-;#()");
    var=scan(chemicals,i,"!:,-;#()");
    output;
  end;
run;

Or array:

data d.have2;
  set d.have1;
num_iter=countw(chemicals,"!:,-;#()");
array str{num_iter} $2000; do i=1 to num_iter; str{i}=scan(chemicals,i,"!:,-;#()"); end; run;

Will create variables for each string block.

 

 

 

Cruise
Ammonite | Level 13

I will use code window more consistently in the future, however, it sometimes flattens out my code to one line. 

Now I understand the logic. But the code example you shared produced quite bizarre result. 

Any further hints as to how to correct this further?

 

have 1 vs have 2have 1 vs have 2

 

 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Yes, I know the code window isn't great, its been pointed out to the forum admins.  Its better than smiley faces all over though.

I don't see your issue below.  You stipulate the list of delimiters, and it is correctly splitting on them.  So obs one is split because - is found, obs 2 to the comma, obs three to next comma.  Do you only want it to split on commas?  If so only put commas in the list of delimiters.

Reeza
Super User
This is why I said transpose first and then do scan. This approach is harder.
Cruise
Ammonite | Level 13

@Reeza

 

proc transpose data=d.have out=d.have_long;
by id;
run;

NOTE: No variables to transpose.
NOTE: There were 21 observations read from the data set D.HAVE.
NOTE: The data set D.HAVE_LONG has 0 observations and 3 variables.

Cruise
Ammonite | Level 13

Thanks. I finally transposed. But how does it help? I still end up with multiple columns with chemical names. I'm tempted to manually weed out using excel 🙂 

 

proc transpose data=d.have out=d.have_long NAME=TRANSPOSED;
VAR WASTE1 WASTE2 WASTE3 WASTE4 WASTE5; 
ID ID;
run;
 
Reeza
Super User
You should now end up with one column to fix.
And then you’re using SCAN once.
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, this demonstrates the basic principal.  You haven't mentioned all the rules though, some data is not split by comma, but includes it, some data in brackets is not used, I can only apply rules I know about:

data have;
  length waste1 $600 waste2 $600 waste3 $600 waste4 $600 waste5 $600;
  infile datalines dlm='#'; 
  input id waste1 $ waste2 $ waste3 $ waste4 $ waste5 $;
datalines;
1  #  Filter Cake-Celite, Carbon, Plasticizer    # (60% by wt) Toluene, Sulfonic                     # acid catalyst                       # Plasticizers, consisting of 26 TM           # Trimillitate and DOA Adipate Plasticizers                             
2  #  Chromium sludge                            # Paint sludge                                      # Metals                                     # Chlorinated solvents                          #   
3  #  Trichlorethylene (F001)                    # Tetrachloroethene {(PCE or "perc.") F002}         # 1,1,1-Trichloroethane                      # Cyanide (F007)                                # Arsenic  (D003)
4  #  Chromium sludge                            # Paint sludge                                      # Metals                                     # Chlorinated solvents                          # 
5  #  Plant #1 Ester Sump: Residual organics.    # 2 Ethylhexanol, alcohols, diethylene glycol,      # ethylene glycol, plasticizer, adipic acid, # polyester                                     # Plant #2 polmyer solids
6  #  Cadmium, chromium, dichlorobenzene,        # Tetrachloroethylene. dieldrin.                    # Plating and painting wastes                #                                               # 
7  #  Toluene                                    # Xylene                                            # Ethyl Benzene                              #                                               # 
8  #  1,1,1-Trichloroethane {TCA (F001)}         #         
;
run;

proc transpose data=have out=want;
  by id;
  var waste:;
run;

data want;
  set want;
  length wrd $2000;
  do i=1 to countw(col1,",");
    wrd=scan(col1,i,",");
    output;
  end;
run;
Cruise
Ammonite | Level 13
Thanks RW9 and Reeza!!!

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 975 views
  • 5 likes
  • 3 in conversation