To do some analysis, I want to import XML into a data framework using R and an XML package. Example XML file:
<watchers shop_name="TEST" created_at="September 14, 2012 05:44"> <watcher channel="Site Name"> <code>123456</code> <search_key>TestKey</search_key> <date>September 14, 2012 04:15</date> <result>Found</result> <link>http://www.test.com/fakeurl</link> <price>100.0</price> <shipping>0.0</shipping> <origposition>0</origposition> <name>Name Test</name> <results> <result position="1"> <c_name>CTest1</c_name> <c_price>599.49</c_price> <c_shipping>0.0</c_shipping> <c_total_price>599.49</c_total_price> <c_rating>8.3</c_rating> <c_delivery/> </result><result position="2"> <c_name>CTest2</c_name> <c_price>654.0</c_price> <c_shipping>0.0</c_shipping> <c_total_price>654.0</c_total_price> <c_rating>9.8</c_rating> <c_delivery/> </result> <result position="3"> <c_name>CTest3</c_name> <c_price>654.0</c_price> <c_shipping>0.0</c_shipping> <c_total_price>654.0</c_total_price> <c_rating>8.8</c_rating> <c_delivery/> </result> </results> </watcher> </watchers>
I want to have data frame rows containing the following fields:
shop_name created_at code search_key date result link price shipping origposition name position c_name c_price c_shipping c_total_price c_rating c_delivery
This means that the child nodes must also be taken into account, which will lead to the fact that in this example a three-dimensional series of data will be presented (since the results show 3 positions). Fields
shop_name created_at code search_key date result link price shipping origposition name
are the same for each of these lines.
I can go through the XML file, but I cannot get the framework with the fields I want. When I convert a dataframe to a dataframe, I get the following fields:
"code" "search_key" "date" "result" "link" "price" "shipping" "origposition" "name" "results"
Here are the fields
shop_name created_at
are missing at the beginning, and the โresultsโ are combined in a row in the โresultsโ column.
It should be possible to get the required data framework, but I don't know how to do it.
UPDATE
The solution provided by @MvG works fine in the XML test file mentioned above. However, the column result may also have the value "Not Found." Records with this value will skip certain fields (always the same) and, therefore, give "the number of argument columns does not match" -error when the solution starts. I would like these records to also be placed in the dataframe, and the fields that are not present are left blank. I do not understand how to include this scenario.
test.xml
<watchers shop_name="TEST" created_at="September 14, 2012 05:44"> <watcher channel="Site Name"> <code>123456</code> <search_key>TestKey</search_key> <date>September 14, 2012 04:15</date> <result>Found</result> <link>http://www.test.com/fakeurl</link> <price>100.0</price> <shipping>0.0</shipping> <origposition>0</origposition> <name>Name Test</name> <results> <result position="1"> <c_name>CTest1</c_name> <c_price>599.49</c_price> <c_shipping>0.0</c_shipping> <c_total_price>599.49</c_total_price> <c_rating>8.3</c_rating> <c_delivery/> </result><result position="2"> <c_name>CTest2</c_name> <c_price>654.0</c_price> <c_shipping>0.0</c_shipping> <c_total_price>654.0</c_total_price> <c_rating>9.8</c_rating> <c_delivery/> </result> <result position="3"> <c_name>CTest3</c_name> <c_price>654.0</c_price> <c_shipping>0.0</c_shipping> <c_total_price>654.0</c_total_price> <c_rating>8.8</c_rating> <c_delivery/> </result> </results> </watcher> <watcher channel="Shopping"> <code>12804</code> <search_key></search_key> <date></date> <result>Not found</result> <link>https://www.test.com/testing1323p</link> <price>0.0</price> <shipping>0.0</shipping> <origposition>0</origposition> <name>MOOVM6002020</name> <results> </results> </watcher> </watchers>