Hello everyone,
I am new to SAS programming, and working on a personal project to analyze used cars listing on Craigslist between April and May 2021. Here is the dataset that I am using: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
I found a similar blog post on the towards data science website (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regr...) where the author, Jose Portillo, used an older dataset with the same variables. However, Jose coded his project in Python whereas I am doing mine in SAS.
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]
for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace= True)There are some minor differences such as the drive, type, and num_of_cylinders. However, almost all the variables are the same. If we hold all the variables constant, how can I write the same code in SAS to extract data from the description column to fill in the missing values in the manufacturer, condition, fuel, transmission, and so on columns?
I think I have created the list correctly, however here is what I have done so far to format each column and its observations into a list.
%let manufacturer = (gmc,hyundai,toyota,mitsubishi,ford,chevrolet,ram,buick,jeep,dodge,subaru,nissan,audi,rover,lexus,honda,chrysler,mini,pontiac,mercedes-benz,cadillac,bmw,kia,volvo,volkswagen,jaguar,acura,saturn,mazda,mercury,lincoln,infiniti,ferrari,fiat,tesla,land rover,harley-davidson,datsun,alfa-romeo,morgan,aston-martin,porche,hennessey);
%let condition = (excellent,good,fair,like new,salvage,new);
%let fuel = (gas,hybrid,diesel,electric);
%let title_status = (clean,lien,rebuilt,salvage,missing,parts only);
%let transmission = (automatic,manual);
%let drive = (4wd,fwd,rwd);
%let size = (mid-size,full-size,compact,sub-compact);
%let type = (sedan,truck,SUV,mini-van,wagon,hatchback,coupe,pickup,convertible,van,bus,offroad);
%let paint_color = (red,grey,blue,white,custom,silver,brown,black,purple,green,orange,yellow);
%let cylinders = (s[1-9] cylinders?, s1[0-6]? cylinders?);
Try this:
data want;
  set have;
  array columns  manufacturer condition fuel title_status transmission drive size type paint_color  ;
  array strings [9] $300 _temporary_ (
    /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus 
    | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | 
    mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche 
    | hennessey'
    /*condition*/ 'excellent | good | fair | like new | salvage | new'
    /*fuel*/ 'gas | hybrid | diesel |electric'
    /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
    /*transmission*/ 'automatic | manual'
    /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
    /*size*/ 'mid-size | full-size | compact | sub-compact'
    /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
    /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
  );
  length word $50 ;
  do col=1 to dim(columns);
    do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
      word=left(scan(strings[col],index,'|'));
      if findw(description,word,,'it') then columns[col]=word;
    end;
  end;
  if missing(cylinders) then do;
    index=findw(description,'cylinders',,'it');
    if index then do;
      cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
    end;
  end;
  drop word col index ;
run;The columns array is the variables that you want to update (when they are missing). The strings temporary array is the list of words for each variable.
So basically it is finding the first word in the list that it finds in the description.
For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.
I am unsure what you want, but the cylinders variable contains a (corrupt) regular expression rather than a list like the other variables.
I don't know what the python code does.
As for the regex, yes SAS can use them, and the macro variable can contain a regex. It just depends how the macro variable is used.
The other macros variables do not contain a regex, so this one stuck out.
Note that it'sall regular expressions in python, so I'm unsure why you built the SAS values the way you did.
In a nutshell, I am seeking the equivalent SAS code of the following Python code shown below:
import re
manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'
keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]
for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()
database.drop('description', axis=1, inplace= True)I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables. How can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?
Yes, I received your response. Thank you for that.
For completeness or anyone else looking to tackle this:
You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.
data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");
do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;
run;
Other ways of solving this include making a list of your lookup values and categories such as:
Category Value
Fuel gas
Fuel diesel
Cylinders 1 Cylinder
Cylinders 2 Cylinder
....
Then take your description column and expand it so that each word has it's own row, see example below.
https://gist.github.com/statgeek/bed5ea2c12903b38fdcf19f3f1f1aae9
Then merge the two via a left join to get your missing values. One benefit of this approach is that you can then use some fuzzy join logic as well which is a bit harder in the direct match approach used above. For example if the post says 1 cyl that won't match to 1 cyl but you could do partial searches. I realized there's actually no regex in your original program except for the cylinders specification and I'm not 100% sure what the fillNA is doing with the matches.
It would probably help if you posted a few lines of the data and the output you get from python.
Try this:
data want;
  set have;
  array columns  manufacturer condition fuel title_status transmission drive size type paint_color  ;
  array strings [9] $300 _temporary_ (
    /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus 
    | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | 
    mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche 
    | hennessey'
    /*condition*/ 'excellent | good | fair | like new | salvage | new'
    /*fuel*/ 'gas | hybrid | diesel |electric'
    /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
    /*transmission*/ 'automatic | manual'
    /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
    /*size*/ 'mid-size | full-size | compact | sub-compact'
    /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
    /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
  );
  length word $50 ;
  do col=1 to dim(columns);
    do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
      word=left(scan(strings[col],index,'|'));
      if findw(description,word,,'it') then columns[col]=word;
    end;
  end;
  if missing(cylinders) then do;
    index=findw(description,'cylinders',,'it');
    if index then do;
      cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
    end;
  end;
  drop word col index ;
run;The columns array is the variables that you want to update (when they are missing). The strings temporary array is the list of words for each variable.
So basically it is finding the first word in the list that it finds in the description.
For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.
Nice! This is what I was looking for 🙂 I will try the code and let you know how it goes.
Unfortunately, the code did not work. Do you know what the error messages mean?
3004  DATA want;
3005      SET have;
3006
3007      array columns  manufacturer condition fuel title_status transmission drive size type paint_color;
3008      array strings [9] $300 _temporary_ (
3009      /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus
3010      | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda |
3011      mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche
NOTE: The quoted string currently being processed has become more than 262 characters long.  You might have unbalanced quotation marks.
3012      | hennessey'
3013      /*condition*/ 'excellent | good | fair | like new | salvage | new'
3014      /*fuel*/ 'gas | hybrid | diesel |electric'
3015      /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
3016      /*transmission*/ 'automatic | manual'
3017      /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
3018      /*size*/ 'mid-size | full-size | compact | sub-compact'
3019      /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
3020      /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
3021    );
3022
3023    length word $50;
3024    do col=1 to dim(columns);
3025      do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
                                                -
                                                388
                                                200
                                                76
ERROR 388-185: Expecting an arithmetic operator.
ERROR 200-322: The symbol is not recognized and will be ignored.
ERROR 76-322: Syntax error, statement will be ignored.
3026        word=left(scan(strings[col],index,'|'));
3027        if findw(description,word,,'it') then columns[col]=word;
3028      end;
3029    end;
3030
3031    if missing(cylinders) then do;
3032      index=findw(description,'cylinders',,'it');
3033      if index then do;
3034        cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
3035      end;
3036    end;
3037
3038    drop word col index ;
3039  run;There is an extra closing bracket that shouldn't be there. Try this:
do index=1 to countw(strings[col],'|') while (missing(columns[col])) ;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
