I have been working with SQL XML Bulk Load (version 4.0) over the last period of time, and suddenly I encountered an issue where the imported data didn’t match the data in the XML files I received.
Looking closer into the issue, I discovered it was because the SQL XML Bulk Load removed my line breaks entirely from the imported data.
In XML line breaks are defined as standard using only “line feed” (LF). However working in a windows environment line breaks are almost always both carriage return and line feed combined (CR+LF).
However when the SQL XML Bulk Load parser (I would think it is a parsing issue) encounters a CR+LF in a node value it strips both characters from the text which results in the data being imported without the line break. If however the parser encounters only a LF, then that LF is retained in the imported data.
Now this seemed like a strange behaviour, because why would both CR and LF be removed and not only the CR, if it was to rely on the standard. However even more annoying, when the CR+LF is used as line breaks most every other place in Windows environments, why even remove those two characters to begin with?
I contacted Microsoft technical support which seems to have confirmed that it isn’t an error on my account, and that they maybe would look at it for the next release. (*fingers crossed*)
Whether or not they view it as a bug I do not know – but there is definitely something wrong there. And I would call it a bug, because removing extra information can’t be intentional.
To illustrate my issue I made a very simple XML file containing two nodes:
<root>
<myNode>Before CRLF
After CRLF</myNode>
<myNode>Before LF
After LF</myNode>
</root>
In this file, the first line break was made with a CR+LF (or rather Environment.NewLine via .NET, but parsing it matches CR+LF characters) and the second was made only with the line feed as a line break.
I then imported that file into a database using the schema file:
<?xml version=“1.0“ encoding=“utf-8“?>
<xs:schema xmlns:xs=“http://www.w3.org/2001/XMLSchema”
xmlns:dt=“urn:schemas-microsoft-com:datatypes”
xmlns:sql=“urn:schemas-microsoft-com:mapping-schema“>
<xs:element name=“root“ sql:is-constant=“1“>
<xs:complexType>
<xs:sequence>
<xs:element xml:space=“preserve“ type=“xs:token“ dt:type=“string“ name=“myNode“ sql:relation=“LineBreak“ />
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
Where I tried to do all sorts of XSD “keep my line breaks as they are” tricks (space=”preserve” and so on).
I then ran the import (vb.net code)
Dim strInFile As String = “c:\linebreak.xml”
Dim strSchema As String = “c:\linebreak.xsd”
Dim objBL As Object = CreateObject(“SQLXMLBulkLoad.SQLXMLBulkload.4.0″)
Try
objBL.ConnectionString = “connection string”
objBL.KeepIdentity = False
objBL.XMLFragment = True
objBL.ErrorLogFile = “error log file path”
objBL.Execute(strSchema, strInFile)
Catch ex As Exception
Debug.WriteLine(ex.ToString)
Finally
If Not objBL Is Nothing Then
objBL = Nothing
End If
End Try
And when I viewed the imported data, the CR+LF was removed from the first “myNode” and it had no breaks what so ever, and in the second “myNode” the LF was retained.
I was asked by the Microsoft employee I was in contact with to create the issue on Microsoft Connect, so I eagerly await what comes of this for the next release(s) of SQL XML Bulk Load. Hopefully they’ll have this fixed.
The only work around I have found myself is to pre-/post process the files. Before importing, run through the XML file and replace your line breaks with some special character not present in the file (usually ¤ is a good candidate) – and once the import has run, then post process your data and replace the used character with CR+LF again.
It is not a beautiful solution, but it works around the issue and helped myself move forward with the project.