BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Tom
Super User Tom
Super User

So you just have two variables to worry about (in terms of CONFLICT). Looks like you should also force NETWORK to be 15 characters long as one of the sources had it as only 14 characters.

ballardw
Super User

@Sleo007 wrote:

Hello,

I am relatively new to SAS and am wondering whether values are being truncated when I import. I have read quite a bit on importing, length statement, informat, and format statement, as well as importing an Excel workbook using proc import, infile statement, etc. My ultimate goal is to merge paid search, organic search, Google Analytics, and Facebook analytics data sets (4) which I downloaded through Supermetrics for Google sheets for funnel conversion insight. The data set I am stuck on (#1 unfortunately) is Google paid search data, with tabs as months. All my data sets are Excel with tabs as months, so if I figure this out, it should hopefully work for the rest!

 

 

The only problem is that mainly the variable Campaign_Name is a text string with varying lengths (36-88) depending on the month, so I am wondering:

a) if there is some way to standardize (or change) the variable length (and type) while importing into SAS to avoid truncation and additional missing values due to values that are shorter or not the same length?

b) how to treat the many missing values for the numeric variables in the data set (ex: not sure if I should leave them as missing or replace them with zeros)?

The majority of the other variables are numeric, 8 bytes, BEST12. The variables are marketing metrics (ex: CPC, CTR, Conversion value, etc.).

I have to do a report for Friday so any help is VERY MUCH appreciated! 😃


Proc Import "guesses" the length of variables for each individual proc call. The length and type of variables by default are set by examining the first 20 rows of data. So Proc Import is often a poor choice for production work involving multiple files with changing data as you mention your Campaign_name does.

 

Have you checked to see if any of your (possibly) sparse variables have changed type from character to numeric or vice versa between imported sets?

Sleo007
Obsidian | Level 7

Hi Ballard,

Thanks for your reply! Yes, while trying Tom's code above, I actually received a notification ERROR: Variable Unique_Cookies has been defined as both character and numeric. Therefore, I am looking to force variable length and type while avoiding the chance of truncation.

ballardw
Super User

@Sleo007 wrote:

Hi Ballard,

Thanks for your reply! Yes, while trying Tom's code above, I actually received a notification ERROR: Variable Unique_Cookies has been defined as both character and numeric. Therefore, I am looking to force variable length and type while avoiding the chance of truncation.


That will likely involve something done with each tab of your spread sheet separately mapping the original variable(s) to standard new variables. Something like:

libname paid xlsx 'E:\DI_Project\Data\RawData\Paidsearch2019.xlsx'  access=readonly;
data work.jan19 ;
  length Campaign_Name $40 ;
  set paid.Jan19 ;
  format _character_ ;
  /* test type of variable to create new consistent variable*/
  /* this creates a new character variable*/
  length NewCookies $ 15;
  if vtype(Unique_cookies)='C'  then NewCookies=Unique_cookies;
  else NewCookies = put(Unique_cookies,best15.);
  Drop Unique_cookies;
run;

And then make the All data set by combining the work.Jan19 work.Feb19 etc. The above is just an example of the logic. You should know whether you want a numeric or character version and how to convert to the desired type. (Input instead of Put and an appropriate informat which likely should NOT include a specified length such as Input(Unique_cookies,F.) to avoid implied decimals ending up in your data unless you know they are wanted.

Sleo007
Obsidian | Level 7

I tried your suggestion with the Campaign_Name variable with this code:

 

libname paid xlsx 'E:\DI_Project\Data\RawData\Paidsearch2019.xlsx' access=readonly;

data test19 ;
length Campaign_Name $100;
set paid.Jan19;
format _character_ ;
length CName $100;
if vlength(Campaign_Name)=100.
then CName=Campaign_Name;
else CName = put(Campaign_Name,$CHAR100.);
Drop Campaign_Name;

run;

 

It seems to have worked. Any idea how I can make it a do loop/macro for the rest of the months to not have to repeat coding? Thanks Ballard! 🙂

Sleo007
Obsidian | Level 7
So I just tried running your code for all data sheets, with the following log. Truncation still happening past January. 😞

25 GOPTIONS ACCESSIBLE;
26 data all_paid;
27 length Campaign_Name $100;
28 set Jan19 Feb19 Mar19 Apr19 May19 Jun19 Jul19 Aug19 Sep19 Oct19 Nov19 Dec19;
29 drop Campaign_Name;
30 format _character_;
31 run;

NOTE: Variable Campaign_Name is uninitialized.
NOTE: There were 1981 observations read from the data set WORK.JAN19.
NOTE: There were 3859 observations read from the data set WORK.FEB19.
NOTE: There were 5479 observations read from the data set WORK.MAR19.
NOTE: There were 6575 observations read from the data set WORK.APR19.
NOTE: There were 6090 observations read from the data set WORK.MAY19.
NOTE: There were 5549 observations read from the data set WORK.JUN19.
NOTE: There were 5506 observations read from the data set WORK.JUL19.
NOTE: There were 6320 observations read from the data set WORK.AUG19.
NOTE: There were 9969 observations read from the data set WORK.SEP19.
NOTE: There were 5057 observations read from the data set WORK.OCT19.
NOTE: There were 4174 observations read from the data set WORK.NOV19.
NOTE: There were 1985 observations read from the data set WORK.DEC19.
NOTE: The data set WORK.ALL_PAID has 62544 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
Tom
Super User Tom
Super User

It does not look like your datasets actual have a variable with that name.

NOTE: Variable Campaign_Name is uninitialized.
Sleo007
Obsidian | Level 7

Okay, I just tried:

data all_paid;
length CName $100;
set Jan19 Feb19 Mar19 Apr19 May19 Jun19 Jul19 Aug19 Sep19 Oct19 Nov19 Dec19;
format _character_;
run;

and received no errors in the log, yet still nothing after January.

Tom
Super User Tom
Super User

So FEB19 has multiple observations, but the values are all missing?

Note that the sheets contain different variables then the observations from sheets that don't have those variables will have missing values.  So if JAN19 has three variables CNAME, JAN19V1 and JAN19V2 and FEB19 has three variables named CNAME, FEB19V1 and FEB19V2 then you will see this type of diagonal layout of the data after concatenating them.

CNAME JAN19V1 JAN19V2 FEB19V1 FEB19V2
XXXX  10 10  .  .
YYYY   .  . 11 11

If that is what your data looks like them perhaps you meant to MERGE the datasets instead of just stacking them together. 

Sleo007
Obsidian | Level 7
No, all variable names are the same & Feb19 has values. To prevent problems, I actually created valid SAS variable names & simply copied the 1st row & pasted it onto every subsequent sheet.
Tom
Super User Tom
Super User

If FEB19 has data then using:

data both; set jan19 feb19; run;

will not remove any of that data.

 

One issue you do see a lot in Excel sheets is that there are many empty rows in the sheet.  If you ever touch a cell on a row that is 10 rows past the last line that has data then you will get 10 empty rows at the end of your dataset.

 

So you might need to add some logic to remove those empty rows.

Tom
Super User Tom
Super User
The header row in Excel is used for the names. That is why the number of observations is one less than the number of rows in the sheet.
Sleo007
Obsidian | Level 7
Yes, I realized that. That's why I deleted the message. So, I just tried your suggestion with Network (code below) & the same is still happening.

data jan19;
length Campaign_Name $100 Network$15;
set paid.Jan19;
format _character_ ;
length CName $100 NNetworNetwork $15;
if vlength(Campaign_Name)=100.
then CName=Campaign_Name;
else CName = put(Campaign_Name,$CHAR100.);
if vlength(Network)=15.
then NNetwork=Network;
else NNetwork = put(Network,$CHAR15.);
Drop Campaign_Name Network;
run;

data feb19;
length Campaign_Name $100 Network$15;
set paid.feb19;
format _character_ ;
length CName $100;
if vlength(Campaign_Name)=100.
then CName=Campaign_Name;
else CName = put(Campaign_Name,$CHAR100.);
if vlength(Network)=15.
then NNetwork=Network;
else NNetwork = put(Network,$CHAR15.);
Drop Campaign_Name Network;
run;

data both;
length CName $100 NNetwork $15;
set jan19 feb19;
run;

At this point I am thinking perhaps I should deal with missing values first in the original dataset. Since the metrics are %, $, & amounts, I was thinking of changing all missing values to 0. Would this affect the data in any way for a regression analysis?
Sleo007
Obsidian | Level 7

Ok, I replaced all missing values to 0 with:

 

proc stdize data=paid.jan19 reponly missing=0 out=jan19;
var _numeric_;
run;

 

for both Jan & Feb, & then I re-ran the code for forcing variable length (both months):

 

data jan_temp;
length Campaign_Name $100 Network$15;
set Jan19;
format _character_ ;
length CName $100 NNetwork $15;
if vlength(Campaign_Name) = 100.
then CName = Campaign_Name;
else CName = put(Campaign_Name,$CHAR100.);
if vlength(Network)= 15.
then NNetwork = Network;
else NNetwork = put(Network,$CHAR15.);
Drop Campaign_Name Network;
run;

 

The data still truncated, however, this time it seems it might be a date issue (see pic)?

Capture.PNG

Tom
Super User Tom
Super User

So that definitely looks like empty rows. If you set a date to 0 days then that means 01JAN1960.

Open the actual Excel file and see if it actually has any data. And if so how many rows of data in each sheet.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 37 replies
  • 1506 views
  • 6 likes
  • 3 in conversation