XML validation and fixing invalid characters in PHP/Python
To validate an XML file, you can use the website below, which reports invalid XML characters: http://www.xmlvalidation.com/
When you parse XML in PHP with a function such as simplexml_load_string(), which follows the XML 1.0 rules, and the file contains character references that XML 1.0 does not allow (for example, a file written for XML 1.1), you may run into this issue. Your script may report an error like the following:
simplexml_load_string(): Entity: line 166858: parser error : xmlParseCharRef: invalid xmlChar value 0
or
simplexml_load_string(): Entity: line 166858: parser error : xmlParseCharRef: invalid xmlChar value 14
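This is because XML 1.0 does not allow character references such as &#x0; or &#xE;. Below #x20, only tab (&#x9;), line feed (&#xA;) and carriage return (&#xD;) are allowed, so every other reference matching /&#x0*(?:[0-8BCEF]|1[0-9A-F]);/i is invalid. The pattern covers hexadecimal references only; decimal ones such as &#14; need the same treatment. We can replace them in the XML string like below to fix this problem:
$string = file_get_contents($xmllink);
// preg_replace() returns the cleaned string, so assign the result
$string = preg_replace('/&#x0*(?:[0-8BCEF]|1[0-9A-F]);/i', ' ', $string);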
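The error messages above report decimal values (0 and 14), and a character reference can be written in either decimal (&#14;) or hexadecimal (&#xE;) form. As a minimal sketch, assuming the goal is to drop every numeric reference that falls outside the XML 1.0 character ranges, the check can be done on the code point instead of on the spelling. The names CHAR_REF, is_valid_xml10_char and strip_invalid_char_refs are hypothetical helpers made up for this illustration, and the sketch does not treat CDATA sections specially.

```python
import re

# Matches any numeric character reference, decimal (&#14;) or hex (&#xE;).
CHAR_REF = re.compile(r'&#(?:x([0-9a-fA-F]+)|([0-9]+));')


def is_valid_xml10_char(cp):
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(text, replacement=' '):
    """Replace character references that XML 1.0 forbids; keep valid ones."""
    def fix(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits is not None else int(dec_digits)
        # Valid references are returned unchanged, invalid ones are replaced.
        return match.group(0) if is_valid_xml10_char(cp) else replacement
    return CHAR_REF.sub(fix, text)
```

For example, strip_invalid_char_refs('a&#x0;b&#14;c&#xA;d') replaces the first two references with a space and leaves the line feed reference &#xA; untouched.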
The cleanup is similar in Python. To process a big .xml file, we do not read it all at once; we process it line by line, as follows:
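import re

# Hex references to control characters that XML 1.0 forbids
# (tab &#x9;, line feed &#xA; and carriage return &#xD; are valid and kept)
invalid_xml = re.compile(r'&#x0*(?:[0-8bcef]|1[0-9a-f]);', re.IGNORECASE)
file_old = 'path/file_old.xml'
file_new = 'path/file_new.xml'
# Read and write line by line so the whole file is never held in memory
with open(file_old) as fopen, open(file_new, 'w') as xmlfile:
    for line in fopen:
        vline, count = invalid_xml.subn('', line)
        if count > 0:
            print('Removed %s illegal characters from XML feed' % count)
        xmlfile.write(vline)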
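After the cleanup it is worth confirming that the new file actually parses. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module is acceptable for the check; check_well_formed is a hypothetical helper name, not part of any library. It streams the file with iterparse, so a big file is not loaded in one piece.

```python
import xml.etree.ElementTree as ET


def check_well_formed(path):
    """Stream-parse the file; return None if it parses, else the error text."""
    try:
        for _event, elem in ET.iterparse(path, events=('end',)):
            elem.clear()  # drop content of finished elements to limit memory use
    except ET.ParseError as err:
        return str(err)  # the message includes the line and column of the failure
    return None
```

Calling check_well_formed('path/file_new.xml') should return None once all invalid references are gone; otherwise the returned message points at the first remaining problem.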