- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Experts,
i'd like to understand the difference between DSD and DLM=',' options in the infile statement. Although, there is enough documentation, long sentences are sometimes not precise in understanding the functionality. What i exactly want to know is that, it seems reading comma separated values with a DSD can incorrectly produce all missing values when reading a numeric variable or in other words when not used correctly. Also, where and when you would choose to use either dlm= or dsd. And most importantly, i have also noticed some using both in the same infile statement. I wanna know on what circumstances would DSD read incorrectly, and the same is read correctly by dlm=.
Please pardon me for asking basic question but if you don't mind the time to offer an easy answer with a simple/medium complex example, i'd most appreciate it. Thanks!!
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As basic as this is, it's something I also happen to fall over from time to time. As a rule of thumb I would say either assign the variable a permanent informat (using a informat or attrib statement) or use the column modifier.
From SAS(R) 9.4 Statements: Reference, Third Edition
Comparisons
Reading above for formatted input "input cust_id ... ddmmyy10. " should only be reading max. 10 columns, so even if "The pointer moves the length that is specified in the informat" supersedes the delimiter definition, the length of "10" should actually stop before the next delimiter of "comma" is in the string.
I haven't found a really satisfying explanation in the docu which explains such behavior in a way that I feel I fully understand it.
data jansales1;
infile datalines truncover dsd;
input item $10. amount comma5.;
datalines;
trucks 1,382
vans 1,235
sedans 2,391
;
run;
data jansales2;
infile datalines truncover dsd;
input item : $10. amount comma5.;
datalines;
trucks 1,382
vans 1,235
sedans 2,391
;
run;
data jansales3;
infile datalines truncover dsd;
informat item $10. amount comma5.;
input item amount;
datalines;
trucks 1,382
vans 1,235
sedans 2,391
;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
DSD does 3 things:
1. dlm=','
2. if there are two comma in a row, it sees a missing value.
3. if there are quotes around data elements, it removes them.
dlm is just dlm. When using dlm= and dsd together, the value defined by dlm= overrides comma.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
4. If there is a delimiter ( comma ) in the quote , SAS will ignore it . E.X.
"xx,yy",zz
SAS will take it as two variables not three.
So as far as I know they are not different . You can remove delimiter=',' when you are using DSD .
Xia Keshan
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
DLM or the delimiter to specify what separates fields is a key piece in reading many data sets, not just comma separated. So the option lets you lead files where fields are separated by tabs, | , : and just about any other character you may want. You can also used multiple delimiters for a single file such as dlm = ',|' which would tread any comma or | not as the start of a new field.
DSD is critical when you have text entered that may include commas such as "This is a comment, that contains a comma"
without the DSD option this could be treated as two fields containing "This is a comment and that contains a comma"
It sounds like you have a concern about specific data, possibly numeric values with commas such as 123,345,789 if so there are other things that may be needed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Ok here is an example, that is getting me rather confused:
/*this code reads trxn numeric values correctly*/
data one;
infile datalines dlm=',' ; /*using dlm','*/
input cust_id date ddmmyy10. trxn nat $;
cards;
1,10/01/2015,5000,dr
1,13/01/2015,2000,cr
1,15/01/2015,2000,dr
1,18/01/2015,2000,dr
1,21/01/2015,2000,cr
1,23/01/2015,2000,cr
1,24/01/2015,2000,cr
1,26/01/2015,2000,dr
1,28/01/2015,2000,dr
1,31/01/2015,2000,dr
;
run;
whereas the following code below with DSD incorrectly reads trxn and outputs all missing values
data one;
infile datalines dsd;
input cust_id date ddmmyy10. trxn nat $;
cards;
1,10/01/2015,5000,dr
1,13/01/2015,2000,cr
1,15/01/2015,2000,dr
1,18/01/2015,2000,dr
1,21/01/2015,2000,cr
1,23/01/2015,2000,cr
1,24/01/2015,2000,cr
1,26/01/2015,2000,dr
1,28/01/2015,2000,dr
1,31/01/2015,2000,dr
;
run;
So, where did i go wrong in my understanding?Please help. I am using SAS EG 6.1
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you're using a proper INformat then things will work (note the colon in front of the date informat).
data one;
infile datalines dsd;
input cust_id date :ddmmyy10. trxn nat $;
format date ddmmyy10.;
cards;
1,10/01/2015,5000,dr
1,13/01/2015,2000,cr
1,15/01/2015,2000,dr
1,18/01/2015,2000,dr
1,21/01/2015,2000,cr
1,23/01/2015,2000,cr
1,24/01/2015,2000,cr
1,26/01/2015,2000,dr
1,28/01/2015,2000,dr
1,31/01/2015,2000,dr
;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
My oh My!, missing out that one : in front of the informat was the issue and made the difference? dear o dear, jeez, what does : do here and it's functionality in general? Does that mean I will need to specify : before an informat every single time I use dsd option in the infile statement?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As basic as this is, it's something I also happen to fall over from time to time. As a rule of thumb I would say either assign the variable a permanent informat (using a informat or attrib statement) or use the column modifier.
From SAS(R) 9.4 Statements: Reference, Third Edition
Comparisons
Reading above for formatted input "input cust_id ... ddmmyy10. " should only be reading max. 10 columns, so even if "The pointer moves the length that is specified in the informat" supersedes the delimiter definition, the length of "10" should actually stop before the next delimiter of "comma" is in the string.
I haven't found a really satisfying explanation in the docu which explains such behavior in a way that I feel I fully understand it.
data jansales1;
infile datalines truncover dsd;
input item $10. amount comma5.;
datalines;
trucks 1,382
vans 1,235
sedans 2,391
;
run;
data jansales2;
infile datalines truncover dsd;
input item : $10. amount comma5.;
datalines;
trucks 1,382
vans 1,235
sedans 2,391
;
run;
data jansales3;
infile datalines truncover dsd;
informat item $10. amount comma5.;
input item amount;
datalines;
trucks 1,382
vans 1,235
sedans 2,391
;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The difference between DSD and non-DSD that is causing the trouble with your example program is how it moves the cursor after reading a field. Without DSD it moves past all of the delimiters that it sees before reading the next field. With DSD it does not. I assume this is what allows it to treat adjacent delimiters as indicating a missing value.
See this example:
data one;
infile cards dlm=',' TRUNCOVER ; /*using dlm','*/
input cust_id date ddmmyy10. A $ B $ C $;
cards;
1,10/01/2015,5000,dr
;
run;
data two;
infile cards dsd TRUNCOVER ;
input cust_id date ddmmyy10. A $ B $ C $;
cards;
1,10/01/2015,5000,dr
;
run;
proc compare data=one compare=two;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi Tom,
I am here to learn. I tried your code and get table one with varaible C as missing, while table two with variable A as mssing, but still not understand why.
Could you explain how the cursor move between input date and A in these dlm=',' and dsd method?
Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The issue is when it starts to read A the cursor is pointing at the comma (column 13) in both cases.
With the DSD option SAS takes this to mean that there is a missing value before that comma. Without the DSD option it will skip over the comma (the same way that normal list input where space is the delimiter would skip over extra spaces in a line) and so the character after the comma is read as the value of A.
You can see the same behaviour by reading a line that starts with the delimiter.
data one ;
infile cards dlm=',' ;
input x $ ;
put x= ;
cards;
,AAA
;;;;;
data two;
infile cards dsd dlm=',';
input x $ ;
put x=;
cards;
,AAA
;;;;