BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.

trying to make my ETL code more succint for my SAS explore paper.

 

I want to pull up all records that have the strings "$" and "million" in them. Here's the code that I'm currently using. Wondering if anyone has a chance to look at it & provide more succinct code. maybe 4-5 lines?

any help much appreciated.

 

*create table;

data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;

*find string;

data TestCount;
length MillionText $45;
set testcount;
*where find(Notes,'million')>0;

Notes=put(Notes,$hex1000.);
Notes=tranwrd(Notes,'C2A0','20');
Notes=input(Notes, $hex1000.);

MillionCount = 0;
NumWords=countw(Notes);

do i= 1 to NumWords;
word = scan(Notes, i, ' ');
if compress(word,',') = 'million' then MillionCount=i;
end;

if MillionCount ne 0 then MillionText=compress(catx(' ', scan(notes,MillionCount-1,' ') , scan(notes,MillionCount,' ') ), ',');
drop date Location 'Name or event(s)'n city countryname;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @sqlGoddess 

 

The following code extracts the same as your code from your test data, but it doesn't exactly match your specification. To make it work like your code, I made it scan for € as well as $. This is easy to correct, just replace [$|€] with $.

 

data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;

data testcount2;
  length MillionText $45;
  set testcount;
  Notes=tranwrd(Notes,'C2A0'x,'20'x);
  MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes);
run;

View solution in original post

9 REPLIES 9
sqlGoddess
SAS Employee
here are some notes from my helper on rationale .
• Since the NBSP spaces aren’t recognized by SCAN, it pulls incorrect strings.
• Using the ‘A0’ specification for hex, made the the entire data set unreadable. but for some reason if I try TRANWRD using the hex specification ‘20’ for a regular space, it works fine, but if I use ‘A0’ anywhere in the code, the entire data set that's produced becomes unreadable. It's a UTF 8/LATIN1 thing but I can't figure out what the issue is.
• I decided to convert the entire notes column to hex, replace the ‘A0’ characters in that string to ‘20’ and that failed as well.
• After a little investigation I learned that the conversion was giving me ‘C2A0’ not just ‘A0’ as the hex specification for NBSP.
• I changed my TRANWRD to replace ‘C2A0’ with ‘20’, then converted that hex string back to readable text, And there are of the program ran fine, pulling out the million specifications wherever it occurred.
ballardw
Super User

Normally when using a hex value the quoted text is followed by an x. 'A0'x for example.

 

It may help us if you showed how you read the source into SAS. That might be the simplest place to fix things.

To truly represent that source you may have to attach a file. When I copy your data step and run the code there aren't any of your problem characters. Quite likely because a data step that I run will use my language settings and so none of the conversion between UTF and WLatin are involved.

 

 

You will have to show exactly what you did in "Using the ‘A0’ specification for hex, made the the entire data set unreadable. ".  I have a hard time seeing how such could happen unless possibly used as a delimiter for an Infile statement when not actually present.

sqlGoddess
SAS Employee

Forgot I could attach a file, here is the excel file that I read in with the libname statement

 

%let path=
libname xl xlsx "&path\cdata.xlsx";

Tom
Super User Tom
Super User

Any character that is not a normal 7-bit ASCII code will require 2 or more bytes in UTF-8.

Using TRANWRD(), which does not understand multi-byte characters), can lead to changing just some of a multi-byte character and result in invalid UTF-8 codes.

 

Use KTRANSLATE() to convert characters in UTF-8 strings.  So to replace non-breaking spaces with normal spaces use:

string = ktranslate(string,' ','C2A0'x);
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @sqlGoddess 

 

The following code extracts the same as your code from your test data, but it doesn't exactly match your specification. To make it work like your code, I made it scan for € as well as $. This is easy to correct, just replace [$|€] with $.

 

data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;

data testcount2;
  length MillionText $45;
  set testcount;
  Notes=tranwrd(Notes,'C2A0'x,'20'x);
  MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes);
run;
sqlGoddess
SAS Employee

Just what I was hoping to get, some dense PERL Code. thank you. @ErikLund_Jensen 

sqlGoddess
SAS Employee

next question, How Would I locate just those 3 rows of data that have the string million in them.

my original thought was to craft a where clause in proc print against the original data, but I would love to see if prxchange can do more like a boolean expression,  & return only those rows with a match..

ballardw
Super User

@sqlGoddess wrote:

next question, How Would I locate just those 3 rows of data that have the string million in them.

my original thought was to craft a where clause in proc print against the original data, but I would love to see if prxchange can do more like a boolean expression,  & return only those rows with a match..


If you are looking for a relatively infrequent occurrence and already using a data step to add something like your MillionText variable, check if it is not missing and when present write to another data set.

data testcount2 Millions;
  length MillionText $45;
  set testcount;
  Notes=tranwrd(Notes,'C2A0'x,'20'x);
  MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes);
  if upcase(scan(milliontext,2,' '))='MILLION' then output millions;
  output testcount2;
run;

Though perhaps @ErikLund_Jensen can come up with a PRX that only assigns milliontext when million is present. I added some dummy note values and it copied their values to milliontext.

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @sqlGoddess 

 

Here are some different ways. My primitive prxchange extracts MillionText from Notes if a string containing corrency + number + 'million' is present, otherwise it returns the full Notes-text, so they are all based on a comparison between the extractes text and the full note. I think it could be done more elegant.

 

* Test data - row 3 doesn't contain million;
data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Cirque du Soleil created an expensive production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;

* Extract notes containing 'million', generate Milliontext;
data want1;
  length MillionText $45;
  set testcount;
  Notes=tranwrd(Notes,'C2A0'x,'20'x);
  MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes);
  if MillionText not =: Notes then output;
run;

* Extract notes containing 'million', generate Milliontext;
proc sql;
  create table want2 as
    select 
      Notes, 
      prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) as MillionText
    from testcount
    where calculated MillionText ne Notes;
quit;

* Extract notes containing 'million';
proc sql;
  create table want3 as
    select Notes
    from TestCount
    where prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) ne Notes;
quit;

* Extract notes containing 'million';
data want4;
  set testcount(where=(prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) not =: Notes));
run;

* Print notes containing 'million';
proc print data=testcount(where=(prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) not =: Notes));
run;

 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1454 views
  • 4 likes
  • 4 in conversation