I need to extract the location and radius data from a large XML file that is formatted as below, and store the data in a 2-dimensional ndarray. This is my first time using Python and I can't find anything about the best way to do this.
<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
Edit: I mean "large" by human standards; I am not having any memory issues with it.
You essentially have CSV data in the XML text value.
Use `ElementTree` to parse the XML, then use `numpy.genfromtxt()` to load that text into an array:

```python
import numpy
from xml.etree import ElementTree as ET

tree = ET.parse('yourxmlfilename.xml')
species = tree.find(".//species[@name='MyHeterotrophEPS']")
names = species.attrib['header']
array = numpy.genfromtxt(
    (line.rstrip(';') for line in species.text.splitlines()),
    delimiter=',', names=names)
```
Note the generator expression with the `str.splitlines()` call; this turns the text of the XML element into a sequence of lines, which `genfromtxt()` is quite happy to receive. The `rstrip(';')` removes the trailing `;` character from each line.
For your sample input (minus the `.` lines), this results in:
```python
array([(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206,
        -0.1001871531330136, -0.0013358287084401814, 4.523853439106942,
        234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604,
        -0.1411462759900182, -0.001881950346533576, 1.0429122163754276,
        144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)],
      dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'),
             ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'),
             ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'),
             ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'),
             ('radius', '<f8'), ('totalRadius', '<f8')])
```
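Since the original goal was a 2-D ndarray of the location and radius data, the named fields of such a structured array can be stacked into plain columns. A minimal sketch, assuming the field names from the header above (the two rows here are abbreviated stand-ins for what `genfromtxt()` returns):

```python
import numpy as np

# abbreviated stand-in for the structured array genfromtxt returns
array = np.array(
    [(4.523853439106942, 234.14575280979898, 123.92820420047076, 0.6259920275663835),
     (1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.7017581019907897)],
    dtype=[('locationX', '<f8'), ('locationY', '<f8'),
           ('locationZ', '<f8'), ('totalRadius', '<f8')])

# stack the named fields column-wise into a plain 2-D float ndarray
fields = ['locationX', 'locationY', 'locationZ', 'totalRadius']
coords = np.column_stack([array[f] for f in fields])
```

`coords` is then a `(rows, 4)` float array; swap `'totalRadius'` for `'radius'` if that is the column you need.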
If your XML is just that `species` node, it's pretty simple, and Martijn Pieters has already explained it better than I can.
But if you've got a ton of `species` nodes in the document, and it's too large to fit the whole thing into memory, you can use `iterparse` instead of `parse`:
```python
import numpy as np
import xml.etree.ElementTree as ET

for event, node in ET.iterparse('species.xml'):
    if node.tag == 'species':
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array.
        node.clear()  # free the finished node so memory use stays bounded
```
This won't help if you just have one super-gigantic `species` node, because even `iterparse` (or similar solutions like a SAX parser) parses one entire node at a time. You'd need to find an XML library that lets you stream the text of large nodes, and off the top of my head, I can't think of any stdlib or popular third-party parsers that can do that.
If the file is really large, use a streaming parser such as `iterparse` or SAX. If the file is not that large (i.e. it fits into memory), `minidom` might be easier to work with.
Each line seems to be a simple string of comma-separated numbers (with a trailing semicolon), so you can simply split each line and convert the fields to float.
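A minimal sketch of that approach with `minidom`, using an inline stand-in document (the element text and numbers here are made up for illustration; for a real file you would call `minidom.parse(filename)` instead of `parseString`):

```python
from xml.dom import minidom
import numpy as np

# made-up stand-in for the real document
xml = """<species name="MyHeterotrophEPS" header="locationX,locationY,locationZ,radius">
1.0,2.0,3.0,0.5;
4.0,5.0,6.0,0.7;
</species>"""

species = minidom.parseString(xml).documentElement
text = species.firstChild.data  # the CSV-like text inside the element

rows = []
for line in text.splitlines():
    line = line.strip().rstrip(';')  # drop blank lines and trailing semicolons
    if line:
        rows.append([float(x) for x in line.split(',')])

array = np.array(rows)  # a plain 2-D ndarray, one row per data line
```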