BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Vicente95
Obsidian | Level 7

Hello everyone,

 

I am new to SAS programming, and working on a personal project to analyze used cars listing on Craigslist between April and May 2021. Here is the dataset that I am using: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

 

I found a similar blog post on the towards data science website (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regr...) where the author, Jose Portillo, used an older dataset with the same variables. However, Jose coded his project in Python whereas I am doing mine in SAS. 

 

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

There are some minor differences such as the drive, type, and num_of_cylinders. However, almost all the variables are the same. If we hold all the variables constant, how can I write the same code in SAS to extract data from the description column to fill in the missing values in the manufacturer, condition, fuel, transmission, and so on columns? 

 

I think I have created the list correctly, however here is what I have done so far to format each column and its observations into a list. 

 

%let manufacturer = (gmc,hyundai,toyota,mitsubishi,ford,chevrolet,ram,buick,jeep,dodge,subaru,nissan,audi,rover,lexus,honda,chrysler,mini,pontiac,mercedes-benz,cadillac,bmw,kia,volvo,volkswagen,jaguar,acura,saturn,mazda,mercury,lincoln,infiniti,ferrari,fiat,tesla,land rover,harley-davidson,datsun,alfa-romeo,morgan,aston-martin,porche,hennessey);

 

%let condition = (excellent,good,fair,like new,salvage,new);

 

%let fuel = (gas,hybrid,diesel,electric);

 

%let title_status = (clean,lien,rebuilt,salvage,missing,parts only);

 

%let transmission = (automatic,manual);

 

%let drive = (4wd,fwd,rwd);

 

%let size = (mid-size,full-size,compact,sub-compact);

 

%let type = (sedan,truck,SUV,mini-van,wagon,hatchback,coupe,pickup,convertible,van,bus,offroad);

 

%let paint_color = (red,grey,blue,white,custom,silver,brown,black,purple,green,orange,yellow);

 

%let cylinders = (s[1-9] cylinders?, s1[0-6]? cylinders?);

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Try this:

data want;
  set have;

  array columns  manufacturer condition fuel title_status transmission drive size type paint_color  ;
  array strings [9] $300 _temporary_ (
    /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus 
    | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | 
    mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche 
    | hennessey'
    /*condition*/ 'excellent | good | fair | like new | salvage | new'
    /*fuel*/ 'gas | hybrid | diesel |electric'
    /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
    /*transmission*/ 'automatic | manual'
    /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
    /*size*/ 'mid-size | full-size | compact | sub-compact'
    /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
    /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
  );
  length word $50 ;
  do col=1 to dim(columns);
    do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
      word=left(scan(strings[col],index,'|'));
      if findw(description,word,,'it') then columns[col]=word;
    end;
  end;

  if missing(cylinders) then do;
    index=findw(description,'cylinders',,'it');
    if index then do;
      cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
    end;
  end;

  drop word col index ;
run;

The columns array is the variables that you want to update (when they are missing).  The strings temporary array is the list of words for each variable.

So basically it is finding the first word in the list that it finds in the description.

 

For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.

View solution in original post

12 REPLIES 12
ChrisNZ
Tourmaline | Level 20

I am unsure what you want, but the cylinders variable contains a (corrupt) regular expression rather than a list like the other variables.

Vicente95
Obsidian | Level 7
So, I want the equivalent SAS code of what is contained in the screenshot since the author wrote it in Python. What would the SAS code resemble?

As for the cylinders list, it would be:

%let cylinders=(3 cylinders,4 cylinders,5 cylinders,6 cylinders,8 cylinders,10 cylinders,12 cylinders);

Is there a regex to shorten the cylinders list in SAS?
ChrisNZ
Tourmaline | Level 20

I don't know what the python code does.

As for the regex, yes SAS can use them, and the macro variable can contain a regex. It just depends how the macro variable is used.

The other macros variables do not contain a regex, so this one stuck out. 

 

Note that it'sall regular expressions in python, so I'm unsure why you built the SAS values the way you did.

Reeza
Super User
A lot of what that code is doing is not required in SAS though....you've truncated the code in the picture. I took a quick look at the data and there's no description field in the text so you need to clarify what you're asking for help with.
Also, since the data is CSV you control the the type and format when importing the data so you can ensure the data is read in correctly.

Vicente95
Obsidian | Level 7

In a nutshell, I am seeking the equivalent SAS code of the following Python code shown below:

 

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

  1. cylinders
  2. condition
  3. drive
  4. paint_color
  5. type
  6. manufacturer
  7. title_status
  8. model
  9. fuel
  10. transmission
  11. description
  12. region
  13. state
  14. price (num)
  15. posting_date (num)
  16. odometer (num)
  17. year (num)

After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables. How can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?

Reeza
Super User
I posted an example in your StackOverflow question - yeah, Kaggle only shows the first few columns by default apparently. But it's just doing a basic text search on the description field so you can just replace it with FIND() or INDEX() and some loops. REGEX is also an option but I hate writing regex so someone else will have to help you with that.
Vicente95
Obsidian | Level 7

Yes, I received your response. Thank you for that. 

Reeza
Super User

For completeness or anyone else looking to tackle this:

 

You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.

 

data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");

do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;

run;


Other ways of solving this include making a list of your lookup values and categories such as:

Category Value
Fuel gas
Fuel diesel
Cylinders 1 Cylinder
Cylinders 2 Cylinder
....

Then take your description column and expand it so that each word has it's own row, see example below.

https://gist.github.com/statgeek/bed5ea2c12903b38fdcf19f3f1f1aae9

 

Then merge the two via a left join to get your missing values. One benefit of this approach is that you can then use some fuzzy join logic as well which is a bit harder in the direct match approach used above. For example if the post says 1 cyl that won't match to 1 cyl but you could do partial searches. I realized there's actually no regex in your original program except for the cylinders specification and I'm not 100% sure what the fillNA is doing with the matches. 

 

It would probably help if you posted a few lines of the data and the output you get from python. 

Tom
Super User Tom
Super User

Try this:

data want;
  set have;

  array columns  manufacturer condition fuel title_status transmission drive size type paint_color  ;
  array strings [9] $300 _temporary_ (
    /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus 
    | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | 
    mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche 
    | hennessey'
    /*condition*/ 'excellent | good | fair | like new | salvage | new'
    /*fuel*/ 'gas | hybrid | diesel |electric'
    /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
    /*transmission*/ 'automatic | manual'
    /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
    /*size*/ 'mid-size | full-size | compact | sub-compact'
    /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
    /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
  );
  length word $50 ;
  do col=1 to dim(columns);
    do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
      word=left(scan(strings[col],index,'|'));
      if findw(description,word,,'it') then columns[col]=word;
    end;
  end;

  if missing(cylinders) then do;
    index=findw(description,'cylinders',,'it');
    if index then do;
      cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
    end;
  end;

  drop word col index ;
run;

The columns array is the variables that you want to update (when they are missing).  The strings temporary array is the list of words for each variable.

So basically it is finding the first word in the list that it finds in the description.

 

For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.

Vicente95
Obsidian | Level 7

Nice! This is what I was looking for 🙂 I will try the code and let you know how it goes.

Vicente95
Obsidian | Level 7

Unfortunately, the code did not work. Do you know what the error messages mean?

 

3004  DATA want;
3005      SET have;
3006
3007      array columns  manufacturer condition fuel title_status transmission drive size type paint_color;
3008      array strings [9] $300 _temporary_ (
3009      /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus
3010      | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda |
3011      mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche
NOTE: The quoted string currently being processed has become more than 262 characters long.  You might have unbalanced quotation marks.
3012      | hennessey'
3013      /*condition*/ 'excellent | good | fair | like new | salvage | new'
3014      /*fuel*/ 'gas | hybrid | diesel |electric'
3015      /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
3016      /*transmission*/ 'automatic | manual'
3017      /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
3018      /*size*/ 'mid-size | full-size | compact | sub-compact'
3019      /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
3020      /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
3021    );
3022
3023    length word $50;
3024    do col=1 to dim(columns);
3025      do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
                                                -
                                                388
                                                200
                                                76
ERROR 388-185: Expecting an arithmetic operator.

ERROR 200-322: The symbol is not recognized and will be ignored.

ERROR 76-322: Syntax error, statement will be ignored.

3026        word=left(scan(strings[col],index,'|'));
3027        if findw(description,word,,'it') then columns[col]=word;
3028      end;
3029    end;
3030
3031    if missing(cylinders) then do;
3032      index=findw(description,'cylinders',,'it');
3033      if index then do;
3034        cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
3035      end;
3036    end;
3037
3038    drop word col index ;
3039  run;
SASKiwi
PROC Star

There is an extra closing bracket that shouldn't be there. Try this:

do index=1 to countw(strings[col],'|') while (missing(columns[col])) ;

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 1615 views
  • 4 likes
  • 5 in conversation