
python - How do I parse some of the data from a large xml file?

Question:

I need to extract the location and radius data from a large XML file that is formatted as below and store the data in a 2-dimensional ndarray. This is my first time using Python and I can't find anything about the best way to do this.

<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">

0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;

0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;

.

.

.

</species>

Edit: I mean "large" by human standards. I am not having any memory issues with it.

Answer:

You essentially have CSV data in the XML text value.

Use ElementTree to parse the XML, then use numpy.genfromtxt() to load that text into an array:

import numpy
from xml.etree import ElementTree as ET

tree = ET.parse('yourxmlfilename.xml')
# Locate the <species> element by its name attribute.
species = tree.find(".//species[@name='MyHeterotrophEPS']")
# The header attribute is a comma-separated list of column names.
names = species.attrib['header']
array = numpy.genfromtxt((line.rstrip(';') for line in species.text.splitlines()),
    delimiter=',', names=names)

Note the generator expression, with a str.splitlines() call; this turns the text of the XML element into a sequence of lines, which .genfromtxt() is quite happy to receive. We do remove the trailing ; character from each line.

For your sample input (minus the . lines), this results in:

array([ (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 77.0645361927206, -0.1001871531330136, -0.0013358287084401814, 4.523853439106942, 234.14575280979898, 123.92820420047076, 0.0, 0.6259920275663835),
       (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 108.5705297969604, -0.1411462759900182, -0.001881950346533576, 1.0429122163754276, 144.1066875513379, 72.24884428367467, 0.0, 0.7017581019907897)], 
      dtype=[('family', '<f8'), ('genealogy', '<f8'), ('generation', '<f8'), ('birthday', '<f8'), ('biomass', '<f8'), ('inert', '<f8'), ('capsule', '<f8'), ('growthRate', '<f8'), ('volumeRate', '<f8'), ('locationX', '<f8'), ('locationY', '<f8'), ('locationZ', '<f8'), ('radius', '<f8'), ('totalRadius', '<f8')])
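
Since the question asks for only the location and radius values in a plain 2-dimensional ndarray, the structured array can be reduced to those columns. The following is a minimal sketch, not part of the original answer: SAMPLE is a hypothetical in-memory stand-in for the real file (wrapped in an assumed root element), and the names wanted and data are chosen here for illustration.

```python
import io
from xml.etree import ElementTree as ET

import numpy as np

# Hypothetical sample: the question's two rows inside an assumed <root> element.
SAMPLE = """<root>
<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
</species>
</root>"""

tree = ET.parse(io.StringIO(SAMPLE))
species = tree.find(".//species[@name='MyHeterotrophEPS']")

# Drop blank lines and the trailing ';' before handing the rows to NumPy.
lines = [line.strip().rstrip(';')
         for line in species.text.splitlines() if line.strip()]
array = np.genfromtxt(lines, delimiter=',', names=species.attrib['header'])

# Pick the fields of interest by name and stack them as columns.
wanted = ['locationX', 'locationY', 'locationZ', 'radius']
data = np.column_stack([array[name] for name in wanted])

print(data.shape)
```

Each row of data then holds one individual's x, y, z and radius, and data[:, 3] gives all the radii.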

Answer:

If your XML is just that species node, it's pretty simple, and Martijn Pieters has already explained it better than I can.

But if you've got a ton of species nodes in the document, and it's too large to fit the whole thing into memory, you can use iterparse instead of parse:

import numpy as np
import xml.etree.ElementTree as ET

for event, node in ET.iterparse('species.xml'):
    # iterparse reports 'end' events by default, so node.text is complete here.
    if node.tag == 'species':
        name = node.attrib['name']
        names = node.attrib['header']
        csvdata = (line.rstrip(';') for line in node.text.splitlines())
        array = np.genfromtxt(csvdata, delimiter=',', names=names)
        # do something with the array.
        node.clear()  # free this element's text before moving on

This won't help if you just have one super-gigantic species node, because even iterparse (or similar solutions like a SAX parser) parses one entire node at a time. You'd need to find an XML library that lets you stream the text of large nodes, and off the top of my head, I can't think of any stdlib or popular third-party parsers that can do that.

Answer:

If the file is really large, use ElementTree or SAX.

If the file is not that large (i.e. fits into memory), minidom might be easier to work with.

Each line seems to be a simple string of comma-separated numbers, so you can simply do line.split(',').
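
As a rough sketch of the minidom plus line.split(',') route, assuming the file fits in memory: SAMPLE below is a hypothetical stand-in for the real document (with an assumed root element), and the variable names are illustrative rather than taken from the answer.

```python
from xml.dom import minidom

import numpy as np

# Hypothetical sample: the question's two rows inside an assumed <root> element.
SAMPLE = """<root>
<species name="MyHeterotrophEPS" header="family,genealogy,generation,birthday,biomass,inert,capsule,growthRate,volumeRate,locationX,locationY,locationZ,radius,totalRadius">
0,0,0,0.0,0.0,0.0,77.0645361927206,-0.1001871531330136,-0.0013358287084401814,4.523853439106942,234.14575280979898,123.92820420047076,0.0,0.6259920275663835;
0,0,0,0.0,0.0,0.0,108.5705297969604,-0.1411462759900182,-0.001881950346533576,1.0429122163754276,144.1066875513379,72.24884428367467,0.0,0.7017581019907897;
</species>
</root>"""

doc = minidom.parseString(SAMPLE)
rows = []
for node in doc.getElementsByTagName('species'):
    if node.getAttribute('name') != 'MyHeterotrophEPS':
        continue
    # Map the wanted column names to their positions in the header.
    header = node.getAttribute('header').split(',')
    wanted = [header.index(name)
              for name in ('locationX', 'locationY', 'locationZ', 'radius')]
    # Join the element's text children into one string.
    text = ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)
    for line in text.splitlines():
        line = line.strip().rstrip(';')
        if not line:
            continue  # skip blank lines between records
        fields = line.split(',')
        rows.append([float(fields[i]) for i in wanted])

result = np.array(rows)
print(result.shape)
```

This avoids genfromtxt entirely, at the cost of doing the float conversion by hand.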
