Parsing a huge (~100 MB) KML (XML) file takes *hours* without any sign of actual parsing

I am currently trying to parse a very large KML (XML) file with Ruby (Nokogiri) and I am having a few problems.

The parsing code itself is fine. I will share it for reference, although it does not have much to do with my problem:

    geofactory = RGeo::Geographic.projected_factory(
      :projection_proj4 => "+proj=lcc +lat_1=34.83333333333334 +lat_2=32.5 +lat_0=31.83333333333333 +lon_0=-81 +x_0=609600 +y_0=0 +ellps=GRS80 +to_meter=0.3048 +no_defs",
      :projection_srid => 3361
    )

    f = File.open("horry_parcels.kml")
    kmldoc = Nokogiri::XML(f)

    kmldoc.css("//Placemark").each_with_index do |placemark, i|
      puts i
      tds = Nokogiri::HTML(placemark.search("//description").children[0].to_html).search("tr > td")
      h = HorryParcel.new
      h.owner_name = tds.shift.text
      tds.shift
      tds.each_slice(2) do |k, v|
        col = k.text.downcase
        eval("h.#{col} = v.text")
      end
      coords = kmldoc.search("//MultiGeometry")[i].text.gsub("\n", "").gsub("\t", "").split(",0 ").map { |x| x.split(",") }
      points = coords.map { |lon, lat| geofactory.parse_wkt("POINT (#{lon} #{lat})") }
      geo_shape = geofactory.polygon(geofactory.linear_ring(points))
      proj_shape = geo_shape.projection
      h.geo_shape = geo_shape
      h.proj_shape = proj_shape
      h.save
    end

Anyway, I checked this code with a much smaller kml sample and it works.

However, when I load the real file, Ruby just sits there as if it were processing something. This "processing" has gone on for several hours while I do other things. As you can see, I have a counter (each_with_index) on the array of Placemark tags, and during those many hours not a single value of i was printed to the command line. Oddly enough, it has not timed out yet, but even if it does work, there must be a better way to do this.

I know that I can open the KML file in Google Earth (Google Earth Pro here) and save the data as smaller, more manageable KML files, but as far as I can tell that would be a very manual, unprofessional process.

Here is a sample KML file (with only one Placemark), if that helps.

    <?xml version="1.0" encoding="UTF-8"?>
    <kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom">
    <Document>
      <name>justone.kml</name>
      <Style id="PolyStyle00">
        <LabelStyle>
          <color>00000000</color>
          <scale>0</scale>
        </LabelStyle>
        <LineStyle>
          <color>ff0000ff</color>
        </LineStyle>
        <PolyStyle>
          <color>00f0f0f0</color>
        </PolyStyle>
      </Style>
      <Folder>
        <name>justone</name>
        <open>1</open>
        <Placemark id="ID_010161">
          <name>STUART CHARLES A JR</name>
          <Snippet maxLines="0"></Snippet>
          <description>""</description>
          <styleUrl>#PolyStyle00</styleUrl>
          <MultiGeometry>
            <Polygon>
              <outerBoundaryIs>
                <LinearRing>
                  <coordinates>
                    -78.941896,33.867893,0 -78.942514,33.868632,0 -78.94342899999999,33.869705,0 -78.943708,33.870083,0 -78.94466799999999,33.871142,0 -78.94511900000001,33.871639,0 -78.94541099999999,33.871776,0 -78.94635,33.872216,0 -78.94637899999999,33.872229,0 -78.94691400000001,33.87248,0 -78.94708300000001,33.87256,0 -78.94783700000001,33.872918,0 -78.947889,33.872942,0 -78.948655,33.873309,0 -78.949589,33.873756,0 -78.950164,33.87403,0 -78.9507,33.873432,0 -78.95077000000001,33.873384,0 -78.950867,33.873354,0 -78.95093199999999,33.873334,0 -78.952518,33.871631,0 -78.95400600000001,33.869583,0 -78.955254,33.867865,0 -78.954606,33.867499,0 -78.953833,33.867172,0 -78.952994,33.866809,0 -78.95272799999999,33.867129,0 -78.952139,33.866803,0 -78.95152299999999,33.86645,0 -78.95134299999999,33.866649,0 -78.95116400000001,33.866847,0 -78.949281,33.867363,0 -78.948936,33.866599,0 -78.94721699999999,33.866927,0 -78.941896,33.867893,0
                  </coordinates>
                </LinearRing>
              </outerBoundaryIs>
            </Polygon>
          </MultiGeometry>
        </Placemark>
      </Folder>
    </Document>
    </kml>

EDIT: 99.9% of the data I work with is in *.shp format, so I have simply ignored this problem for the past week. But I am going to start this process on my desktop computer (instead of my laptop) and let it run until it either times out or finishes.

    class ClassName
      attr_reader :before, :after

      def go
        @before = Time.now
        run_actual_code
        @after = Time.now
        # Subtracting two Time objects gives the elapsed seconds as a Float.
        puts "process took #{@after - @before} seconds to complete"
      end

      def run_actual_code
        # ...
      end
    end

The code above should tell me how much time has elapsed. From that (if it ever finishes), we should be able to work out a crude rule of thumb for how long you should expect your (otherwise sound) code to run without SAX parsing or breaking the document's text up into smaller pieces.

+4
2 answers

For a huge XML file, you should not use Nokogiri's default XML parser, because it builds a DOM: the whole document is loaded into memory as a tree. The right parsing strategy for large XML files is SAX. Fortunately, Nokogiri supports SAX.

The disadvantage is that with a SAX parser, all of your logic has to be implemented with callbacks. The idea is simple: the SAX parser starts reading the file and notifies you when it finds something interesting, such as an opening tag, a closing tag, or text. You can bind callbacks to these events and extract whatever you need.

Of course, you do not want to use the SAX parser to load the entire file into memory and work with it there; that is exactly what SAX is meant to avoid. You will need to do whatever you want to do with this file piece by piece.

So this basically means rewriting your parsing code around callbacks. To learn more about XML DOM vs. SAX parsing, you can check this FAQ at cs.nmsu.edu.

+7

Actually, I got a copy of the data from a more accessible source, but I came back here because I wanted to suggest a possible solution to a common problem: less. less was written a long time ago and ships by default with most Unix systems.

http://en.wikipedia.org/wiki/Less_%28Unix%29

Not to be confused with the stylesheet language ("LESS"), less is a text viewer (it cannot edit files, only read them) that does not load the entire document until you have actually paged through all of it. It loads the first "page", so to speak, and waits for you to ask for the next one.

If a Ruby script could somehow take in the "pages" of text one at a time... oh wait, the XML structure would not allow that, because each page would be missing the closing tags that sit at the end of the part of the file not yet read. So you would have to do some custom work up front: strip off the first couple of parent tags so that you can pick through the child elements one at a time, and let the final closing parent tags end the script, since the parser would think it had finished and then run into another closing tag, I think.

I have not tried this and have nothing to try it on. But if I did, I would probably try feeding blocks of n lines of text into Ruby (or Python, etc.) through less or something like it. Maybe something more primitive than less; I'm not sure.
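To make the idea a bit more concrete, here is an untested-on-real-data sketch that skips less entirely and uses Ruby's own `IO#each_line` with a custom separator, so the file is read one `</Placemark>` at a time. The helper name `each_placemark_fragment` is made up for the example, and it assumes that Placemark elements are never nested and that the closing tag is written exactly as `</Placemark>`.

```ruby
require "stringio"

# Hypothetical helper: yields each <Placemark>...</Placemark> fragment from an
# IO, reading only as far as the next closing tag each time.
# Assumes Placemark elements are never nested and that the closing tag is
# written exactly as "</Placemark>".
def each_placemark_fragment(io)
  closing = "</Placemark>"
  io.each_line(closing) do |chunk|
    start = chunk.index("<Placemark")
    # Skip the trailing chunk, which holds only the closing parent tags.
    next unless start && chunk.end_with?(closing)

    # Drop whatever came before the opening tag (header, styles, whitespace).
    yield chunk[start..-1]
  end
end

# Tiny in-memory stand-in for the real file.
sample = StringIO.new(
  '<kml><Document><Placemark id="a"><name>A</name></Placemark>' \
  '<Placemark id="b"><name>B</name></Placemark></Document></kml>'
)

fragments = []
each_placemark_fragment(sample) { |xml| fragments << xml }
puts fragments.length
```

With the real file this would be wrapped in something like `File.open("horry_parcels.kml") { |f| each_placemark_fragment(f) { |xml| ... } }`, and each fragment could then be handed to a normal parser on its own. Note that the fragments lose the namespace declarations from the root `<kml>` element, so a parser may complain about any prefixed elements inside them.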

+1

Source: https://habr.com/ru/post/1500811/

