BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lsandell
Obsidian | Level 7

Hi there, I am working through a simple example of in-stream data. When I use DLM = ',' and without colon modifying the "ID" variable, the output is perfect. However, when I use DSD in place of DLM, the data output is incorrect and is only fixed when the ID input has a colon modified. 

 

I am curious why this is happening? Using the DLM option is fine for now, but the reason behind this issue is making me curious! Thanks, and see code below.

 

THIS CODE does Not Work

DATA Work.question2;
INFILE DATALINES DSD;
INPUT ID $3.
GROUP :$7.
WEIGHT @@;

DATALINES;
data 
;

 

 

THIS CODE Works

DATA Work.question2;

INFILE DATALINES DLM = ',';
INPUT ID $3.
GROUP :$7.
WEIGHT @@;

DATALINES;
data
;

1 ACCEPTED SOLUTION

Accepted Solutions
jimbarbour
Meteorite | Level 14

@lsandell,

 

SAS has a series of arcane and sometimes obscure rules for reading in data.

 

If one codes something like $3 as an informat, this is a "hard" input definition.  SAS will literally take three characters and then position the cursor four to the right.  Hard informats override SAS' rules for determining variable boundaries in the input.  If I remove the ":" from the Group variable in your non-working example, like this:

DATA Work.question2A;
	INFILE DATALINES DSD;
	INPUT 	ID 			$3.
			GROUP 		$7.
			WEIGHT
			@@;

DATALINES;
101,Placebo,146,102,Placebo,97,103,Placebo,174
201,Drug A,188,202,Drug A,143,203,Drug A,110
301,Drug B,139,302,Drug B,128,303,Drug B,160
;
RUN;

You can see that I'm getting a comma, the character in the 4th position, in Group.  DSD does not override this behavior.

jimbarbour_0-1601013313954.png

 

I can get around this by not coding a hard informat.  I can do that by adding colons in front of each informat.  The colon basically tells SAS to use its rules for distinguishing one variable from another in the input and to then use the $3.  Alternatively, adding length statements prior to the INPUT statement will achieve essentially the same thing.  Here's an example with Length statements:

DATA Work.question2C;
	LENGTH	ID			$3;
	LENGTH	Group		$7;
	INFILE DATALINES DSD;
	INPUT 	ID
			GROUP
			WEIGHT
			@@;

DATALINES;
101,Placebo,146,102,Placebo,97,103,Placebo,174
201,Drug A,188,202,Drug A,143,203,Drug A,110
301,Drug B,139,302,Drug B,128,303,Drug B,160
;
RUN;

Without the hard informats, the data is read correctly:

jimbarbour_1-1601013865476.png

 

Now, let's remove all the colon modifiers from the DLM = ',' version of the code.

DATA Work.question2D;
	INFILE DATALINES DLM = ',';
	INPUT 	ID		$3.
			GROUP 	$7.
			WEIGHT
			@@;

DATALINES;
101,Placebo,146,102,Placebo,97,103,Placebo,174
201,Drug A,188,202,Drug A,143,203,Drug A,110
301,Drug B,139,302,Drug B,128,303,Drug B,160
;
RUN;

The results are maybe a little better, but they aren't much better, not really.

jimbarbour_2-1601014071788.png

 

The main take away here:  Use colon modifiers on delimited data.  Unless there's a clear reason to not use them, colon modifiers are the best way to go with delimited data.  Avoid hard informats with delimited data.  Hard formats are great (essential) for fixed position data, so they have their place, just not with delimited data.  This doesn't completely explain the difference between the behavior of DLM = ',' vs. DSD -- but the rule applies to either:  Use the colon modifier with delimited data.  Even with DLM = ',' you can get into trouble if you omit the colon modifier.

 

Hope that helps,

 

Jim

View solution in original post

5 REPLIES 5
jimbarbour
Meteorite | Level 14

@lsandell,

 

SAS has a series of arcane and sometimes obscure rules for reading in data.

 

If one codes something like $3 as an informat, this is a "hard" input definition.  SAS will literally take three characters and then position the cursor four to the right.  Hard informats override SAS' rules for determining variable boundaries in the input.  If I remove the ":" from the Group variable in your non-working example, like this:

DATA Work.question2A;
	INFILE DATALINES DSD;
	INPUT 	ID 			$3.
			GROUP 		$7.
			WEIGHT
			@@;

DATALINES;
101,Placebo,146,102,Placebo,97,103,Placebo,174
201,Drug A,188,202,Drug A,143,203,Drug A,110
301,Drug B,139,302,Drug B,128,303,Drug B,160
;
RUN;

You can see that I'm getting a comma, the character in the 4th position, in Group.  DSD does not override this behavior.

jimbarbour_0-1601013313954.png

 

I can get around this by not coding a hard informat.  I can do that by adding colons in front of each informat.  The colon basically tells SAS to use its rules for distinguishing one variable from another in the input and to then use the $3.  Alternatively, adding length statements prior to the INPUT statement will achieve essentially the same thing.  Here's an example with Length statements:

DATA Work.question2C;
	LENGTH	ID			$3;
	LENGTH	Group		$7;
	INFILE DATALINES DSD;
	INPUT 	ID
			GROUP
			WEIGHT
			@@;

DATALINES;
101,Placebo,146,102,Placebo,97,103,Placebo,174
201,Drug A,188,202,Drug A,143,203,Drug A,110
301,Drug B,139,302,Drug B,128,303,Drug B,160
;
RUN;

Without the hard informats, the data is read correctly:

jimbarbour_1-1601013865476.png

 

Now, let's remove all the colon modifiers from the DLM = ',' version of the code.

DATA Work.question2D;
	INFILE DATALINES DLM = ',';
	INPUT 	ID		$3.
			GROUP 	$7.
			WEIGHT
			@@;

DATALINES;
101,Placebo,146,102,Placebo,97,103,Placebo,174
201,Drug A,188,202,Drug A,143,203,Drug A,110
301,Drug B,139,302,Drug B,128,303,Drug B,160
;
RUN;

The results are maybe a little better, but they aren't much better, not really.

jimbarbour_2-1601014071788.png

 

The main take away here:  Use colon modifiers on delimited data.  Unless there's a clear reason to not use them, colon modifiers are the best way to go with delimited data.  Avoid hard informats with delimited data.  Hard formats are great (essential) for fixed position data, so they have their place, just not with delimited data.  This doesn't completely explain the difference between the behavior of DLM = ',' vs. DSD -- but the rule applies to either:  Use the colon modifier with delimited data.  Even with DLM = ',' you can get into trouble if you omit the colon modifier.

 

Hope that helps,

 

Jim

lsandell
Obsidian | Level 7

Jim, thank you so much for the thorough explanation. I appreciate the walk-through of why and how SAS reads in the data in these situations. I will take your advice and use the colon modifier for delimited data!

 

Best, 

Lauren

ballardw
Super User

@lsandell wrote:

Jim, thank you so much for the thorough explanation. I appreciate the walk-through of why and how SAS reads in the data in these situations. I will take your advice and use the colon modifier for delimited data!

 

Best, 

Lauren


Another option is to use and INFORMAT statement prior to the INPUT and do not include any informats on the input statement at all.

 

One advantage of this approach is that if you have multiple variables that should have the same informat you only need to list is once:

Informat v1 v2 v3 $25.  v5 v6 v7 Comma18.  v8 v9 v10  mmddyy10.;

 

Variables not listed in the informat but appear on the input statement would be read as numeric with a best informat.

 

A somewhat esoteric side-effect of this approach is that the variables appear in the order they are on the informat statements not the order of the input. So you can either re-arrange table appearance or deal with data sources that change column order without warning and only have to move the variable in the input statement and not worry about the informat.

jimbarbour
Meteorite | Level 14

Absolutely.  And here's the real irony in this case: We'd actually do better with no informats at all with this data.  However, a caution.  There's a reason that we want to use either a Length or Informat.  If we don't specify a length or Informat, SAS will give us a default length of 8.  That works for the drugs in this data, but what if we have the drug Remdesivir which has a length of 10?

 

I've changed the code below so that there is no Length or Informat.  There is only the character designator, "$".  Notice that I've added Remdesivir to the Datalines.

DATA Work.question2E;
	INFILE DATALINES DLM = ',';
	INPUT 	ID		$
			GROUP 	$
			WEIGHT
			@@;

DATALINES;
101,Placebo,146,102,Placebo,97,103,Placebo,174
201,Drug A,188,202,Drug A,143,203,Drug A,110
301,Drug B,139,302,Drug B,128,303,Drug B,160
401,Remdesivir,215
;
RUN;

Our results look like the below.  Notice anything odd about Remdesivir?  It's truncated at 8 characters.  So while "hard" Informats are probably not a good idea as part of the Input statement, some kind of Length or Informat is typically needed.  My preference is the colon modifier before the Informat, but to @ballardw's point, coding the Informat before the Input statement will work equally well.  The nice thing about placing a Length or Informat before the input statement is that you can control the order of the variables.  Say the ID is last in the data, but I want to see ID first in my output.  If I code the Informats the way that @ballardw has suggested, SAS will order the variables in the Output in the order that I put the Informat statements.  SAS orders the variables in the order it encounters them in the code.  Statements like ATTRIB, FORMAT, LABEL, INFORMAT, and LENGTH can be used to place the variables in any order you so desire.  Caution:  RETAIN will set the order of a variable if RETAIN is the first reference, but generally RETAIN should not be used to order variables. RETAIN might affect the output data which one needs to be very careful of.

 

jimbarbour_0-1601055421118.png

 

Jim

TomKari
Onyx | Level 15

Another option that I like to use on REALLY unstructured data takes it out of the hands of SAS completely:

 

data want(drop=_:);
length _InString $32767;
infile "c:\have.txt" lrecl=32767;
input;
_InString = _infile_;
/* Do whatever processing you need to parse the input string into your variables */
run;

In your case I could do something like (untested);

 

id = scan(_InString, 1, ","); group = scan(_InString, 2, ","); weight = scan(_InString, 3, ","); output;

id = scan(_InString, 4, ","); group = scan(_InString, 5, ","); weight = scan(_InString, 6, ","); output;

id = scan(_InString, 7, ","); group = scan(_InString, 8, ","); weight = scan(_InString, 9, ","); output;

 

Tom

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 1580 views
  • 4 likes
  • 4 in conversation