BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rxp1970
Fluorite | Level 6

Hello, I have a dataset (Portuguese Banking dataset for term deposits)  The data set imports the 20 or so variables correctly except the variable AGE that's imported as Char and I want it to be numeric.  I tried the below...but Age2 is created with just a "."... (searching for a few hours now and not finding a solution.)  I tried the method below to create a new var (age2)...not ideal but been stuck on this for hours...

 

data capstone;
set capstone_new;
length age2 3.0;
age2=age;

run;

 

 

I don't want to "reimport" the data and have to specify every the formats for the other 20 variables that's coming in fine for the 1 variable that's not... Is there a way to create a new variable ( within the existing and newly imported dataset) ...create a new var something like: Age2 and then just copy in the AGE variables values (currently viewed as char in imported dataset)

 

Can someone provide the code to either change Age to numeric or to drop Age and put its values into Age2 ???

 

Thanks for any guidance...still a newbie at this.

 

Below is a proc contents of the newly imported dataset showing AGE as char:

 

sasimportiskillingme.png

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@rxp1970 

The name of the file you're using in your filename statement indicates that you've actually used the Excel and converted it yourself to a .csv.

 

What I've done:

I've dowloaded the Excel to my environment, used the EG import wizard, selected to include the data into the generated code and then run it.

Attached is the generated EG generated code with the data included. You should be able to just run this in your environment.

 

The proper way of doing things would be to define an attrib and input statement manually based on the data descriptions provided. But if what the import wizard gives you is good enough then attached the quick-fix for you.

View solution in original post

13 REPLIES 13
tsap
Pyrite | Level 9

I generated a dummy table with an age field that aligned with the details from your Proc Contents screenshot:

 

DATA capstone;
FORMAT 		ID 8. Age $3.; 
INFORMAT	ID 8. Age $3.; 
INPUT  		ID    Age; 
CARDS;
12345	21
23456	25
34567	35
45678	55
56789	75
;

 

Here is what the Proc Contents for the dummy data and the actual table looks like:

Capstone.PNG

 

And here is the logic used to resolve the issue that you are encountering:

 

DATA WORK.Capstone_New (RENAME=(Age2=Age));
	SET WORK.Capstone;
	FORMAT Age2 3.0;
	LENGTH Age2 3;
	Age2=INPUT(Age,3.0);
	DROP Age;
RUN;

 

 

By dropping the Original Age column the end of the datastep, that leaves just Age2 which will then be renamed to Age afterwards.

 

And then utilized this logic to end up with an age column that is numeric and contains all of the (previously character formatted) values from the original age column.

 

Here is the resulting Proc Contents and table after the new logic was used:

Capstone_new.PNG

 

Hope this helps.

rxp1970
Fluorite | Level 6

Hello Tsap...thank you for the quick reply I tried to do what you said and here's the output I got back.  (I cut and pasted so not sure what was missed with the informat...as the source informat for char for the original set is there...when I look at the values in Age2 var they show missing values... (i.e. ".")  Any more guidance you can give on this?

 

ChangeAgeVar2.png

rxp1970
Fluorite | Level 6

TSAP, here's the dataset (linked to onedrive)...I just want to get this into SAS <sad face>

 

https://1drv.ms/u/s!AnS_2wxQJn8uhq16CaWrK_PqIFLOvQ

 

This is the link to the public description of the data set identifying variables....

 

https://archive.ics.uci.edu/ml/datasets/Bank%2BMarketing

 

I've looked at the age var as the other poster recommended but that was little to no help.

ballardw
Super User

Here's how to post an example of a CSV file. Copy lines from the file using a plain text editor, open a code box on the forum with either the {I} or "running man" icon. Paste.

Client ID,Clinic Site,Clinician,Sex,Date of Birth,Zip,Ethnicity,Race (Select All That Apply),Collection Date,Program Area,Insurance,CT Results,GC Results,Syphilis Results,
174680,4313,463,1,12/2/1973,83605,2,1,1/9/2019,4,3,n/a,n/a,2,
174734,4313,464,1,6/30/1958,83641,2,1,1/17/2019,4,3,n/a,n/a,2,
174735,4313,478,2,8/17/1965,83641,2,1,1/17/2019,4,1,n/a,n/a,2,
174905,4313,478,2,10/21/1955,83687,2,1,1/25/2019,4,1,n/a,n/a,2,

I am not going to attempt to open a file of unknown provenance from a site of unknow safety. Not to mention than many users here will be blocked from accessing your linked file.

 

I am willing to bet that your AGE variable has some value such as N/A in my example or NULL or similar. Just because a document says "numeric" does not mean that someone has exceptions and did not provide in the documentation what the exception code.

OR if you open the file in a TEXT editor you may find a number of rows at the bottom of the file that look like

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

As Excel is known for keeping "phantom lines" with no actual data in some circumstances and just doesn't show them.

 

Paste a few lines from the CSV that you are getting errors with. The LOG should show something with the line number of the read file so you could find it. DO not copy from a spreadsheet program that may open a CSV as they typically reformat data in a number of ways depending on the individual program.

Tom
Super User Tom
Super User

Why are you using PROC IMPORT when the site has posted documentation on what is in the file? Why force SAS to guess when you already know?

Attribute Information:

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone') 
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric) 
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric) 
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
Astounding
PROC Star

This is a data problem first.  Once you resolve that, it becomes a programming problem.

 

The values of AGE contain some sort of character that is not convertible to numeric.  You have to actually dig in to the data and find what they are before trying to program a solution.  It may be an obvious non-numeric, or it may be something like a tab character that doesn't appear.  So first examine the data by eye, but be prepared to get more detailed, such as:

 

proc print data=have;

var age;

format age $hex6.;

run;

rxp1970
Fluorite | Level 6

Astounding...

Above is a link to the dateset file in csv format along with the command I used to import... as I mentioned I'm a novice and don't see anything readily apparent in the 41k dataset for the age variable...maybe you can see something given I've posted the actual dataset.

ballardw
Super User

1) Show the code you used to read the data, copy from your editor and paste into a code box using the forum's {I} or "running man" icon

2) What does your input file look like? Best is to copy some lines and paste into a code box as above.

 

IF the file was a spreadsheet then that is a complication.

rxp1970
Fluorite | Level 6

hello ballardw...

 

Above are links to the original dataset on onedrive and the other URL is to the source dataset website describing each var... Below is the import statement that I used:

=======================================================================================================

 

%let path=/folders/myfolders/capstone;
libname capstone "&path";

FILENAME REFFILE '/folders/myfolders/capstone/bank-additional-full-separated.csv' TERMSTR=CR;

PROC IMPORT DATAFILE=REFFILE
DBMS=CSV
Replace
OUT=capstone.bankdata;
GETNAMES=YES;
GUESSINGROWS=20000;
RUN;

PROC CONTENTS DATA=capstone.bankdata;
RUN;

DATA capstone.bankdata2 (RENAME=(Age2=Age));
SET capstone.bankdata;
FORMAT Age2 3.0;
LENGTH Age2 3;
Age2=INPUT(Age,3.0);
DROP Age;
RUN;

PROC CONTENTS DATA=capstone.bankdata2;
RUN;

 

 

rxp1970
Fluorite | Level 6

Here are the top 12 or so entries in the csv file....   (I put a space between the first line (headers) and the first record...)  If someone can help me form a datalines that would be appreciated... save the lectures...I don't need them... I've searched I've tried various methods so if you don't think I'm trying hard enough or assume I should know something then don't respond....again I don't need the lectures!

 

 

age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y


56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
45,services,married,basic.9y,unknown,no,no,telephone,may,mon,198,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
59,admin.,married,professional.course,no,no,no,telephone,may,mon,139,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,217,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
24,technician,single,professional.course,no,yes,no,telephone,may,mon,380,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
25,services,single,high.school,no,yes,no,telephone,may,mon,50,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no
41,blue-collar,married,unknown,unknown,no,no,telephone,may,mon,55,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191,no

Patrick
Opal | Level 21

@rxp1970 

The name of the file you're using in your filename statement indicates that you've actually used the Excel and converted it yourself to a .csv.

 

What I've done:

I've dowloaded the Excel to my environment, used the EG import wizard, selected to include the data into the generated code and then run it.

Attached is the generated EG generated code with the data included. You should be able to just run this in your environment.

 

The proper way of doing things would be to define an attrib and input statement manually based on the data descriptions provided. But if what the import wizard gives you is good enough then attached the quick-fix for you.

rxp1970
Fluorite | Level 6

Patrick,

I was on my way to attempting a datalines type import as you've done (easy to see when I look at my code that was failing) and then look at your beautiful code.  I can use this code again and you've shown me how to import a complicated dataset (for me at least)

 

Thank you for your help...I was feeling a bit beat up and disgruntled with this community (I admitted my rookie status when I walked in and thought the fact that I searched for hours...no days...would at least by me some gentle guidance but no)

 

I actually got all the variables listed...but was miles away from where you got me...

 

You're a gentleman and a scholar...(and you've helped me along my academic path)...thank you sir.

 

 

Patrick
Opal | Level 21

@rxp1970 

Thank you for the compliment. 

The folks in this forum are actually really supportive and helpful. It's just not always that simple to figure out how and on what level to give an answer. Seems I've been lucky this time. 

When you continue using the forum here you'll figure out how to ask to get quicker responses - and people will learn to know you so they better understand what and on what level you need answers.

 

As for reading data from Excel spreadsheets: If this is an XLSX and your data (with or without column header) starts in cell A1 without any fancy layout (so just data without empty or summary rows) then you can also use a simple libname with the XLSX engine.

The Excel file needs then to be under a path accessible by the SAS server (so not where your client sits but where SAS executes).

 

Here how such code looks like for your Excel reading sheet IN.

libname mydata xlsx "/serverpath/bank-additional-full-separated.xlsx";
data want;
  set mydata.in;
run;

 

Using the XLSX engine requires SAS/Access to PC Files licensed. A lot of site have it and I believe it's also part of the UA edition. 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1979 views
  • 3 likes
  • 6 in conversation