Chris Pedro

Just a guy from Bermuda


Filter XML Data to remove elements using Python

Last Updated: 2019-07-08

I recently had to setup a script that would remove certain elements from XML data. I found code examples online on how to do this in Python using ElementTree. However, nothing was successful in removing sub-elements as well. The issue is, the remove() function requires you to pass the parent element, but there’s no easy way to find the parent of a node when looping through XML data.

To work around this, I first specified a dictionary, with the element as the key and the path as the value. Then, find the ‘parent’ by doing a findall() on value and then call remove() on the parent.

In the snippet below, <revision> is on the root (.), and <last_rule_upd_time> is under <installedpackages> ... <snortglobal>.

Code

import xml.etree.ElementTree as ET

def filterconfig(cfg):
    """Filter out elements from config.
    """
    # Update this dictionary to filter more.
    # Format is '<element>': '<path>'
    # See below URL for path format:
    # https://docs.python.org/3.4/library/xml.etree.elementtree.html#elementtree-xpath
    toRemove = {
        'revision': '.',
        'last_rule_upd_time': './installedpackages/snortglobal' 
    }
    
    tree = ET.parse(cfg)
    root = tree.getroot()

    for element in toRemove:
        for parent in root.findall(toRemove[element]):
            for child in parent.findall(element):
                parent.remove(child)

    tree.write(cfg)

Caveats

Because I’m using a dictionary, which doesn’t support duplicate keys, if you have multiple elements with the same name, but under different paths, only the first instance as defined in the dictionary will be removed. If they’re under the same path, they will all be removed, since findall() is being used. If you need to remove elements named the same under different paths, a different method will need to be used.

Background

I used this method for backing up pfSense firewall configs. The issue I had was I was using RANCID to keep a track of changes in configs. However, we were using Snort on the firewalls, which was auto-updating rulesets. Each time a rule set was updated, we would get something similar to below in from RANCID:

diff -u -4 -r1.52
@@ -3611,11 +3611,11 @@
      <hideversion></hideversion>
      <dnssecstripped></dnssecstripped>
    </unbound>
    <revision>
-     <time>1562267070</time>
-     <description><![CDATA[admin@10.201.0.170 (Local Database): Configuration saved on completion of the pfSense setup wizard.]]></description>
-     <username>admin@10.201.0.170 (Local Database)</username>
+     <time>1562282138</time>
+     <description><![CDATA[(system): Snort pkg: updated status for updated rules package(s) check.]]></description>
+     <username>(system)</username>
    </revision>
    <vlans>
      <vlan>
        <if>vmx0</if>
@@ -3980,10 +3980,10 @@
        <rm_blocked>1h_b</rm_blocked>
        <autorulesupdate7>6h_up</autorulesupdate7>
        <rule_update_starttime>00:05</rule_update_starttime>
        <forcekeepsettings>on</forcekeepsettings>
        <last_rule_upd_status>success</last_rule_upd_status>
-       <last_rule_upd_time>1562252703</last_rule_upd_time>
+       <last_rule_upd_time>1562282138</last_rule_upd_time>
        <rule>
          <interface>wan</interface>
          <enable>on</enable>
          <uuid>9156</uuid>

Obviously, the <revision>, <last_rule_upd_time> elements in the XML data were a bit irrelevant, and not really needed to backup configs. So I decided to filter out these elements, but the method above can be used for any elements in XML data.