BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Sleo007
Obsidian | Level 7

Hello,

I am relatively new to SAS and am wondering whether values are being truncated when I import. I have read quite a bit on importing, length statement, informat, and format statement, as well as importing an Excel workbook using proc import, infile statement, etc. My ultimate goal is to merge paid search, organic search, Google Analytics, and Facebook analytics data sets (4) which I downloaded through Supermetrics for Google sheets for funnel conversion insight. The data set I am stuck on (#1 unfortunately) is Google paid search data, with tabs as months. All my data sets are Excel with tabs as months, so if I figure this out, it should hopefully work for the rest!

 

My goal was to import the Excel workbook and merge the months. The best solution I found for this was:

 

*Impporting Paidsearch data & creating SAS dataset;
%macro pim(sheet);                                                                           /*Creating macro variable for proc import step*/

 

proc import out=paid_temp                                                                /*Creating work.paid_temp data set*/

datafile='E:\DI_Project\Data\RawData\Paidsearch2019.xlsx'            /*Importing Excel Workbook*/

dbms=xlsx                                                                                         /*Defining DBMS*/

replace;                                                                      /*Replace = will allow you to replace the exported data set if you re-run the code*/

sheet="&sheet";                                                                               /*referring to macro variable for sheet names*/

getnames=yes;

run;


%mend piim; /*Defining macro variables for each individual sheet in workbook*/

%pim(Jan19);

%pim(Feb19);

%pim(Mar19);

%pim(Apr19);

%pim(May19);

%pim(Jun19);

%pim(Jul19);

%pim(Aug19);

%pim(Sep19);

%pim(Oct19);

%pim(Nov19);

%pim(Dec19);

 

The only problem is that mainly the variable Campaign_Name is a text string with varying lengths (36-88) depending on the month, so I am wondering:

a) if there is some way to standardize (or change) the variable length (and type) while importing into SAS to avoid truncation and additional missing values due to values that are shorter or not the same length?

b) how to treat the many missing values for the numeric variables in the data set (ex: not sure if I should leave them as missing or replace them with zeros)?

The majority of the other variables are numeric, 8 bytes, BEST12. The variables are marketing metrics (ex: CPC, CTR, Conversion value, etc.).

I have to do a report for Friday so any help is VERY MUCH appreciated! 😃

1 ACCEPTED SOLUTION

Accepted Solutions
Sleo007
Obsidian | Level 7

i FINALLY FOUND THE SOLUTION!!!

 

There are multiple aspects involved in this process so hopefully this will help whoever wants to use Excel files to import into SAS!

 

/*Assigning librefs*/
     libname SASData "E:\DI_Project\Data\SAS";
     libname WorkData "E:\DI_Project\Data\WorkData";

 

/*Excel LIBNAME REF FOR DATA IMPORT*/
     libname Paid "E:\DI_Project\Data\RawData\Paidsearch2019.xlsx"
          header=yes                  /*Tells SAS variable names are in 1st row*/
          dbsaslabel=none         /*Removes labels*/
          scantime=yes              /*Tells SAS to read date & time variables as datetime*/
          mixed=yes                  /*Tells SAS not to put missing values if a variable has both character & numeric data*/
          stringDates=no;          /*Tells SAS NOT to treat dates as text strings or character variables*/

 

/*VERIFY VARIABLE LENGTHS & TYPES*/
proc contents data=paid."jan19$"n; run;
proc contents data=paid."feb19$"n; run;
proc contents data=paid."mar19$"n; run;
proc contents data=paid."apr19$"n; run;
proc contents data=paid."may19$"n; run;
proc contents data=paid."jun19$"n; run;
proc contents data=paid."jul19$"n; run;
proc contents data=paid."aug19$"n; run;
proc contents data=paid."sep19$"n; run;
proc contents data=paid."oct19$"n; run;
proc contents data=paid."nov19$"n; run;
proc contents data=paid."dec19$"n; run;

 

/*JANUARY IMPORT*/
/*Formatting January Date Variable*/
data jandate;                                 /*Output file or File being written to*/
     retain Date;                              /*Tells SAS to retain the date formatting from Excel (treat as a date)*/
     format Date ddmmyy10.;         /*Tells SAS to format date from date9. to ddmmyy10., but you can choose whatever format desired*/
     set paid."Jan19$A1:U470"n;    /*Input file or File being read from*/
run;

 

/*VERIFY DATE FORMAT CHANGE*/
proc contents data=jandate;
run;

 

/*METHOD 1: RE-ASSIGNING VARIABLE LENGTHS & TYPES*/
/*#1 - Create macrovariable varlist containing a list of variable names from Jan*/
proc contents data=jandate noprint out=metaclass;
run;

 

options nocenter nodate;
     title 'Meta Data for jandate';
proc print data=metaclass;
     var name type length varnum;
run;

 

data _null_;
     length allvars $1000;
     retain allvars '';
     set metaclass end=eof;
     allvars = trim(left(allvars))||''||left(name);
     if eof then call symput('varlist',allvars);
run;

 

%put &varlist;          /*Macrovariable is created!!!*/

 

/*#2 - MODIFY VARIABLE TYPE & LENGTH IN JANDATE*/

/*Essentially this is renaming the variables, creating new variables with the old names & inputting the data, so it's like you modified the types & lengths, when in reality you created new variables altogether.*/


data sasdata.jan19 (drop=v1-v21);                  /*Creating new permanent SAS dataset for merging later & renaming variables*/
     retain &varlist;                                             *<-- preserve variable order ;
     length                                                          *<-- define new types/lengths;
          Campaign_name $100 Campaign_ID $15 Network $15
          Impressions Ave_pageviews Ave_timeonsite_sec Ave_frequency Clicks Conversions Channel
          Bounce_rate Interaction_rate CTR Conversion_rate Daily_budget Cost CPC CPA Conversion_Value ROAS 8;
     format                                                           *<-- recreate formats;
          Campaign_name $CHAR100. Campaign_ID $CHAR15. Network $CHAR15.
          Date ddmmyy10.
          Impressions Ave_pageviews Ave_timeonsite_sec Ave_frequency Clicks Conversions 10.
          Channel 1.
          Bounce_rate Interaction_rate CTR Conversion_rate 4.2
          Daily_budget Cost CPC CPA Conversion_Value ROAS DOLLAR10.2;
     set jandate (rename=(  Campaign_name=v1 Campaign_ID=v2 Date=v3 Network=v4 Channel=v5 Impressions=v6
                                          Bounce_rate=v7 Ave_pageviews=v8 Ave_timeonsite_sec=v9 Ave_frequency=v10
                                          Interaction_rate=v11 Daily_budget=v12 Clicks=v13 CTR=v14 Cost=v15 CPC=v16
                                          Conversions=v17 Conversion_rate=v18 CPA=v19 Conversion_Value=v20 ROAS=v21));

          Campaign_name = put(v1,char100.);         *<-- redefine variable;
          Campaign_ID = put(v2,char15.);                *<-- redefine variable;
          Date = v3;
          Network = put(v4,char15.);                         *<-- redefine variable;
          Channel = v5;
          Impressions = v6;
          Bounce_rate = v7;
          Ave_pageviews = v8;
          Ave_timeonsite_sec = v9;
          Ave_frequency = v10;
          Interaction_rate = v11;
          Daily_budget = v12;
          Clicks = v13;
          CTR = v14;
          Cost = v15;
          CPC = v16;
          Conversions = v17;
          Conversion_rate = v18;
          CPA = put(v19,8.);                                    *<-- redefine variable;
          Conversion_Value = v20;
          ROAS = v21;
run;

 

/*FEBRUARY IMPORT: REPEAT EVERYTHING EXCEPT THE CREATION OF THE MACRO VARIABLE*/
/*Formatting January Date Variable*/
data febdate;
     retain Date;
     format Date ddmmyy10.;
     set paid."feb19$A1:U374"n;
run;

 

/*Verifying Date Format Change*/
proc contents data=febdate;
run;

 

/*RE-ASSIGNING VARIABLE LENGTHS & TYPES*/
/* Modifying variable type and length of febdate */
data sasdata.feb19 (drop=v1-v21);                  /*Creating new permanent SAS dataset for merging later & renaming variables*/
     retain &varlist;                                             *<-- preserve variable order ;
     length                                                          *<-- define new types/lengths;
          Campaign_name $100 Campaign_ID $15 Network $15
          Impressions Ave_pageviews Ave_timeonsite_sec Ave_frequency Clicks Conversions Channel
          Bounce_rate Interaction_rate CTR Conversion_rate Daily_budget Cost CPC CPA Conversion_Value ROAS 8;
     format                                                           *<-- recreate formats;
          Campaign_name $CHAR100. Campaign_ID $CHAR15. Network $CHAR15.
          Date ddmmyy10.
          Impressions Ave_pageviews Ave_timeonsite_sec Ave_frequency Clicks Conversions 10.
          Channel 1.
          Bounce_rate Interaction_rate CTR Conversion_rate 4.2
          Daily_budget Cost CPC CPA Conversion_Value ROAS DOLLAR10.2;
     set febdate (rename=(  Campaign_name=v1 Campaign_ID=v2 Date=v3 Network=v4 Channel=v5 Impressions=v6
                                          Bounce_rate=v7 Ave_pageviews=v8 Ave_timeonsite_sec=v9 Ave_frequency=v10
                                          Interaction_rate=v11 Daily_budget=v12 Clicks=v13 CTR=v14 Cost=v15 CPC=v16
                                          Conversions=v17 Conversion_rate=v18 CPA=v19 Conversion_Value=v20 ROAS=v21));

          Campaign_name = put(v1,char100.);         *<-- redefine variable;
          Campaign_ID = put(v2,char15.);                *<-- redefine variable;
          Date = v3;
          Network = put(v4,char15.);                         *<-- redefine variable;
          Channel = v5;
          Impressions = v6;
          Bounce_rate = v7;
          Ave_pageviews = v8;
          Ave_timeonsite_sec = v9;
          Ave_frequency = v10;
          Interaction_rate = v11;
          Daily_budget = v12;
          Clicks = v13;
          CTR = v14;
          Cost = v15;
          CPC = v16;
          Conversions = v17;
          Conversion_rate = v18;
          CPA = put(v19,8.);                                    *<-- redefine variable;
          Conversion_Value = v20;
          ROAS = v21;
run;

 

/*Verifying sasdata.feb19*/
proc contents data=sasdata.feb19;
run;

 

/*MERGE DATASETS*/
data sasdata.all_paid;
     set sasdata.jan19 sasdata.feb19;
run;
/*DATA NOT TRUNCATED!!!! YAAAASSSSSSS :)*/

 

*****NOTE: Obviously, the reassignment of the variables types & lengths part could be turned into a macro ultimately, or do-loop, but since I do not have time for this project, I will not cover this in this post. Enjoy!*****

View solution in original post

37 REPLIES 37
Tom
Super User Tom
Super User

One of the problems with using a spreadsheet for data entry.  There isn't an easy way to tell SAS to ignore what your Excel sheet has.  But you should be able to adjust to it pretty easily if the variable names (or as Excel sees them column headers) are the same in each sheet. 

 

In fact you should be able to use the XLSX libname and so that you don't need the macro.

 

When combining the datasets just define the length of the variables BEFORE the SET statement to avoid truncation.  Make sure to remove the formats that SAS will mistakenly attach to the character variables.  If you have variable that is 40 characters long but use the $30. format with it then you will only see the first 30 characters of the values.

libname paid xlsx 'E:\DI_Project\Data\RawData\Paidsearch2019.xlsx'  access=readonly;
data all ;
  length Campaign_Name $40 ;
  set paid.Jan19 paid.Feb19 paid.Mar19 paid.Apr19 paid.May19 paid.Jun19
      paid.Jul19 paid.Aug19 paid.Sep19 paid.Oct19 paid.Nov19 paid.dec19
  ;
  format _character_ ;
run;

 

Sleo007
Obsidian | Level 7
Hi Tom,

Thanks for the quick reply! 🙂 Yes, the variable names are all the same & I purposely changed them in Excel to shorten the time to find a proper solution. I will shortly try your code. Just to be sure, am I right in assuming that _character_ is to be replaced by the variable name (in my case Campaign_Name)?
Tom
Super User Tom
Super User
No need to change it. _CHARACTER_ is an special variable list. It will resolve to every character variable in the current data step (at least that the compiler has seen at the point it compiles that statement). When you use a format statement with a variable list that is not followed by any format specification that tells SAS to attach no format to the variables.
You could attach formats to your character variables in that data step to override the one's that the XSLX engine will generate for each sheet/dataset. But attaching $xx. formats to character variables is not needed since SAS already knows how to display character values and so you don't have to give special instructions for how to display character variables (which is what a format is). Plus as I said before $xx. formats attached to character varaibles can actually cause harm when the format width is different than the variable's length.
Sleo007
Obsidian | Level 7
Hi Tom,
So to simplify my life I deleted the Unique_Cookies variable & ran your code, but it seems to have only imported the January sheet for some reason. All other sheets are truncated. Any idea why?
Thanks.

The Log shows the following:
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='E:\DI_Project\FINAL.egp';
6 %LET _CLIENTPROJECTPATHHOST='JM-SCBTM11';
7 %LET _CLIENTPROJECTNAME='FINAL.egp';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HtmlBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHome/SASEnterpriseGuide/7.1/Styles/HtmlBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26 data all_paid ;
27 length Campaign_Name $100;
28 set paid.Jan19 paid.Feb19 paid.Mar19 paid.Apr19 paid.May19 paid.Jun19
29 paid.Jul19 paid.Aug19 paid.Sep19 paid.Oct19 paid.Nov19 paid.dec19;
30 format _character_;
31 run;

NOTE: The import data set has 1981 observations and 21 variables.
NOTE: The import data set has 3859 observations and 21 variables.
NOTE: The import data set has 5479 observations and 21 variables.
NOTE: The import data set has 6575 observations and 21 variables.
NOTE: The import data set has 6090 observations and 21 variables.
NOTE: The import data set has 5549 observations and 21 variables.
NOTE: The import data set has 5506 observations and 21 variables.
NOTE: The import data set has 6320 observations and 21 variables.
NOTE: The import data set has 9969 observations and 21 variables.
NOTE: The import data set has 5057 observations and 21 variables.
NOTE: The import data set has 4174 observations and 21 variables.
NOTE: The import data set has 1985 observations and 21 variables.
NOTE: There were 1981 observations read from the data set PAID.Jan19.
NOTE: There were 3859 observations read from the data set PAID.Feb19.
NOTE: There were 5479 observations read from the data set PAID.Mar19.
NOTE: There were 6575 observations read from the data set PAID.Apr19.
NOTE: There were 6090 observations read from the data set PAID.May19.
NOTE: There were 5549 observations read from the data set PAID.Jun19.
NOTE: There were 5506 observations read from the data set PAID.Jul19.
NOTE: There were 6320 observations read from the data set PAID.Aug19.
NOTE: There were 9969 observations read from the data set PAID.Sep19.
NOTE: There were 5057 observations read from the data set PAID.Oct19.
NOTE: There were 4174 observations read from the data set PAID.Nov19.
NOTE: There were 1985 observations read from the data set PAID.dec19.
NOTE: The data set WORK.ALL_PAID has 62544 observations and 21 variables.
2 The SAS System 13:33 Wednesday, December 18, 2019

NOTE: DATA statement used (Total process time):
real time 7.32 seconds
cpu time 7.31 seconds


32
33 GOPTIONS NOACCESSIBLE;
34 %LET _CLIENTTASKLABEL=;
35 %LET _CLIENTPROCESSFLOWNAME=;
36 %LET _CLIENTPROJECTPATH=;
37 %LET _CLIENTPROJECTPATHHOST=;
38 %LET _CLIENTPROJECTNAME=;
39 %LET _SASPROGRAMFILE=;
40 %LET _SASPROGRAMFILEHOST=;
41
42 ;*';*";*/;quit;run;
43 ODS _ALL_ CLOSE;
44
45
46 QUIT; RUN;
Tom
Super User Tom
Super User
Why do you say that?
The NOTES clearly say it read only 1,981 observations from the JAN19 sheet and wrote a total of 62,544 observations.
Sleo007
Obsidian | Level 7
Yes, but after all the January observations, the rest are all truncated. So I have 11 sheets which appear as missing values (like 62,544-1981 are all missing values). In reality, they are not actually missing.
Tom
Super User Tom
Super User

That is a surprising result.

And if you read them separately and then combine them does the same thing happen?

data jan19; set paid.jan19; run;
data feb19; set paid.feb19; run;
data both; set jan19 feb19; run;
Sleo007
Obsidian | Level 7

Yes, initially when trying your code, I received the following log Note:

WARNING: Multiple lengths were specified for the variable Campaign_name by input data set(s). This can cause truncation of data.
NOTE: There were 1981 observations read from the data set WORK.JAN19.
NOTE: There were 3859 observations read from the data set WORK.FEB19.
NOTE: The data set WORK.BOTH has 5840 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

when I placed a length statement after each data statement, that was not the case.

NOTE: There were 1981 observations read from the data set WORK.JAN19.
NOTE: There were 3859 observations read from the data set WORK.FEB19.
NOTE: The data set WORK.BOTH has 5840 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

 

Perhaps I should be using a macro like the one I had in my original post or at the least a do loop with a macro? That would technically reiterate the process, initializing the length statement each time, no?

Tom
Super User Tom
Super User
Truncation is different than totally missing values.
Also you cannot fix type mismatches as easily. You cannot combine two datasets that use the same name to mean different types of variables. If you are lucky those are just caused by one column being completely empty in one of the spreadsheets. In that case you could drop the offending column and not have to worry about fixing it.
Sleo007
Obsidian | Level 7
So I just tried your code with the added length statement after the data statement & still the same results. Log shows:

26 data both; length Campaign_Name :$100.; set jan19 feb19; run;

WARNING: Multiple lengths were specified for the variable Campaign_name by input data set(s). This can cause truncation of data.
NOTE: There were 1981 observations read from the data set WORK.JAN19.
NOTE: There were 3859 observations read from the data set WORK.FEB19.
NOTE: The data set WORK.BOTH has 5840 observations and 21 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.00 seconds

So it's definitely a problem with the length. Should I be using an informat to read in the data from paid.jan19, paid.feb19, etc.?
Tom
Super User Tom
Super User

If you know the maximum length for each character variable then defining it before the SET statement will make sure no values are truncated.  Just because you get the warning does not mean anything is actually truncated. But it might be that other variables are being truncated. 

Sleo007
Obsidian | Level 7
Aside from the Network variable, which is CHAR15, all other variables are defined as numeric 8 bytes, BEST12. So how do I solve the truncation occurring because of potentially other variables if I do not know which variables are causing the issue?
Tom
Super User Tom
Super User

Run PROC CONTENTS and check.

This SQL code will print any variable that has more than one TYPE or more than one LENGTH.

proc contents data=paid._all_ noprint out=contents; run;
proc sql ;
 select upcase(name) as uname,name,memname,type,length
  from contents 
  group by calculated uname
  having count(distinct type)>1 
      or count(distinct length)>1
  order by 1,2,3
  ;
quit;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 37 replies
  • 1708 views
  • 6 likes
  • 3 in conversation