XML validation and a fix for invalid characters in PHP/Python

To validate an XML file, the website below does the trick and shows you the invalid XML characters: http://www.xmlvalidation.com/
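
If you would rather check a file locally, here is a minimal sketch using Python's standard xml.etree.ElementTree module. It assumes the XML fits in memory as a string, and the helper name find_xml_error is made up for this example. It only tells you where the parser stopped, not every problem in the file.

```python
import xml.etree.ElementTree as ET

def find_xml_error(xml_text):
    """Return None if xml_text parses, else a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position holds the (line, column) where parsing stopped
        line, column = err.position
        return line, column, str(err)
    return None

# A clean document, then one with a forbidden control-character reference
print(find_xml_error('<feed><item>ok</item></feed>'))
print(find_xml_error('<feed><item>&#xE;</item></feed>'))
```

The first call prints None; the second prints the position and message of the first error the parser hits.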

When you use an XML parser in PHP such as simplexml_load_string(), which parses according to XML 1.0, to process an XML file written for version 1.1, you may run into this issue. Your script may report an error like the one below:
simplexml_load_string(): Entity: line 166858: parser error : xmlParseCharRef: invalid xmlChar value 0
or 
simplexml_load_string(): Entity: line 166858: parser error : xmlParseCharRef: invalid xmlChar value 14
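This is because XML 1.0 does not allow character references to control characters such as &#x0; or &#xE; (decimal 14, the value in the second error). The only code points below #x20 that XML 1.0 accepts are #x9 (tab), #xA (line feed) and #xD (carriage return), so every hexadecimal reference matching /&#x0*(?:[0-8BCEF]|1[0-9A-F]);/i is invalid. We can replace them in the XML string like below to fix this problem:
$string = file_get_contents($xmllink);
$string = preg_replace('/&#x0*(?:[0-8BCEF]|1[0-9A-F]);/i', ' ', $string);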

The XML cleanup is similar in Python. To process a big .xml file, we do not read it all at once; we clean it line by line as follows:

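import re

# Hex references to control characters that XML 1.0 forbids (tab, LF and CR are kept)
invalid_xml = re.compile(r'&#x0*(?:[0-8BCEF]|1[0-9A-F]);', re.IGNORECASE)
file_old = 'path/file_old.xml'
file_new = 'path/file_new.xml'

# Stream line by line so the whole file never has to fit in memory
with open(file_old) as fopen, open(file_new, 'w') as xmlfile:
    for line in fopen:
        vline, count = invalid_xml.subn('', line)
        if count > 0:
            print('Removed %s illegal characters from XML feed' % count)
        xmlfile.write(vline)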
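
The regular expression above only catches references written in hexadecimal. The same character can also be written in decimal (for example &#14; instead of &#xE;), so here is a rough sketch that decodes every numeric reference and checks it against the character ranges XML 1.0 allows. The names is_xml10_char and strip_invalid_refs are made up for this example, and it only looks at references, not at raw control bytes in the file.

```python
import re

# Any numeric character reference: hex (&#xE;) or decimal (&#14;)
CHAR_REF = re.compile(r'&#(?:x([0-9a-fA-F]+)|([0-9]+));')

def is_xml10_char(cp):
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def strip_invalid_refs(text, replacement=''):
    """Replace references to characters XML 1.0 forbids; keep valid ones."""
    def repl(match):
        if match.group(1) is not None:
            cp = int(match.group(1), 16)   # hexadecimal form
        else:
            cp = int(match.group(2))       # decimal form
        return match.group(0) if is_xml10_char(cp) else replacement
    return CHAR_REF.sub(repl, text)

print(strip_invalid_refs('a&#x0;b&#14;c&#x9;d&#65;e'))  # abc&#x9;d&#65;e
```

You can call strip_invalid_refs(line) in place of invalid_xml.subn('', line) in the loop above if you do not need the count.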


  
