How to use lxml to find element text in an XHTML document

I have been banging my head against this for a long time; I must be doing something stupid.

I am trying to extract all the languages supported by Wikipedia and write them to a text file by going through the tables on List_of_Wikipedias.

Here is my Python code so far, which only tries to extract one of the tables:

    import httplib
    from lxml import etree

    def main():
        conn = httplib.HTTPConnection("meta.wikimedia.org")
        conn.request("GET", "/wiki/List_of_Wikipedias")
        res = conn.getresponse()
        root = etree.fromstring(res.read())
        table = root.xpath('//table')
        print table

    main()

On my machine, this only prints an empty list. To speed things up, I saved the page locally and used:

    wikipage = open("wikipage.html")
    root = etree.parse(wikipage)

but this made no difference (apart from the obvious speedup). I also tried

 lxml.find('table') 

and

    for element in root.iter():
        print("%s - %s" % (element.tag, element.text))

which successfully prints all the elements, so I know that the tree is being created.

What am I doing wrong?

Any help would be greatly appreciated. Thanks.

3 answers
 I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias 

Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that match such element names is the most frequently asked XPath question, and there are many good answers under the xpath tag on SO. Just search for them.

Here is the complete solution:

Use:

    (//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()

with the prefix "x" bound to the XHTML namespace ("http://www.w3.org/1999/xhtml").

I evaluated this XPath expression against the document retrieved from http://s23.org/wikistats/wikipedias_html

Because I was working locally and did not have the XHTML DTD, I had to add the following at the beginning of the document. You may not need this:

    <!DOCTYPE html [
      <!ENTITY uarr "&#8593;">
      <!ENTITY darr "&#8595;">
      <!ENTITY ccedil "&#231;">
      <!ENTITY oslash "&#248;">
      <!ENTITY aacute "&#225;">
      <!ENTITY aring "&#229;">
      <!ENTITY agrave "&#224;">
      <!ENTITY egrave "&#232;">
      <!ENTITY ograve "&#242;">
      <!ENTITY ocirc "&#244;">
    ]>
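A minimal sketch of why such declarations matter when the page is parsed as XML with lxml. The sample markup is made up for illustration and declares only the one entity it uses; it assumes lxml's default parser settings, which expand entities declared in the internal DTD subset.

```python
from lxml import etree

# Made-up sample: an XHTML fragment that uses an HTML entity name.
BODY = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Norwegian (Bokm&aring;l)</p></body></html>"
)

# Internal DTD subset declaring just the entity this sample needs.
DOCTYPE = '<!DOCTYPE html [ <!ENTITY aring "&#229;"> ]>'

try:
    etree.fromstring(BODY)
    parsed_without_dtd = True
except etree.XMLSyntaxError:
    # An XML parser does not know HTML entity names on its own.
    parsed_without_dtd = False

# With the declaration in place, the entity is expanded while parsing.
root = etree.fromstring(DOCTYPE + BODY)
paragraph = root.find(".//{http://www.w3.org/1999/xhtml}p")

print(parsed_without_dtd)
print(paragraph.text)
```

If the parser can load the real XHTML DTD, the hand-written declarations should not be necessary.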

The result of applying the above XPath expression to this document:

  English German French Polish Italian Japanese Spanish Portuguese Dutch Russian Swedish Chinese Catalan Norwegian (Bokmål) Finnish Ukrainian Czech Hungarian Romanian Korean Turkish Vietnamese Indonesian Danish Arabic Esperanto Serbian Lithuanian Slovak Volapük Persian Hebrew Bulgarian Slovenian Malay Waray-Waray Croatian Estonian Newar / Nepal Bhasa Simple English Hindi Galician Thai Basque Norwegian (Nynorsk) Aromanian Greek Haitian Azerbaijani Tagalog Latin Telugu Georgian Macedonian Cebuano Serbo-Croatian Breton Piedmontese Marathi Latvian Luxembourgish Javanese Belarusian (Taraškievica) Welsh Icelandic Bosnian Albanian Tamil Belarusian Bishnupriya Manipuri Aragonese Occitan Bengali Swahili Ido Lombard West Frisian Gujarati Afrikaans Low Saxon Malayalam Quechua Sicilian Urdu Kurdish Cantonese Sundanese Asturian Neapolitan Samogitian Armenian Yoruba Irish Chuvash Walloon Nepali Ripuarian Western Panjabi Kannada Tajik Tarantino Venetian Yiddish Scottish Gaelic Tatar Min Nan Ossetian Uzbek Alemannic Kapampangan Sakha Egyptian Arabic Kazakh Maori Limburgian Amharic Nahuatl Upper Sorbian Gilaki Corsican Gan Mongolian Scots Interlingua Central_Bicolano Burmese Faroese Võro Dutch Low Saxon Sinhalese Turkmen West Flemish Sanskrit Bavarian Malagasy Manx Ilokano Divehi Norman Pangasinan Banyumasan Sorani Romansh Northern Sami Zazaki Mazandarani Wu Friulian Uyghur Ligurian Maltese Bihari Novial Tibetan Anglo-Saxon Kashubian Sardinian Classical Chinese Fiji Hindi Khmer Ladino Zamboanga Chavacano Pali Franco-Provençal/Arpitan Pashto Hakka Cornish Punjabi Navajo Silesian Kalmyk Pennsylvania German Hawaiian Saterland Frisian Interlingue Somali Komi Karachay-Balkar Crimean Tatar Tongan Acehnese Meadow Mari Picard Erzya Lingala Kinyarwanda Extremaduran Guarani Kirghiz Emilian-Romagnol Assyrian Neo-Aramaic Papiamentu Aymara Chechen Lojban Wolof Banjar Bashkir North Frisian Greenlandic Tok Pisin Udmurt Kabyle Tahitian Sranan Zealandic Hill Mari Komi-Permyak Lower Sorbian 
Abkhazian Gagauz Igbo Oriya Lao Kongo Avar Moksha Mirandese Romani Old Church Slavonic Karakalpak Samoan Moldovan Tetum Gothic Kashmiri Bambara Inupiak Sindhi Bislama Lak Nauruan Norfolk Inuktitut Pontic Assamese Cherokee Min Dong Swati Palatinate German Hausa Ewe Tigrinya Oromo Zulu Zhuang Venda Tsonga Kirundi Dzongkha Sango Cree Chamorro Luganda Buginese Buryat (Russia) Fijian Chichewa Akan Sesotho Xhosa Fula Tswana Kikuyu Tumbuka Shona Twi Cheyenne Ndonga Sichuan Yi Choctaw Marshallese Afar Kuanyama Hiri Motu Muscogee Kanuri Herero 

Note: every second selected node is a whitespace-only text node. If you do not want these selected, use:

 (//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()] 
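In lxml, one way to bind the prefix is the `namespaces` argument of `xpath()`. The following is a minimal sketch on a made-up three-row table rather than the real page; the names `SAMPLE`, `NS`, `all_text` and `names` are only illustrative.

```python
from lxml import etree

# Tiny stand-in for the real page; the markup is invented for illustration.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td> <a href="#">English</a> </td></tr>
<tr><td>2</td><td> <a href="#">German</a> </td></tr>
</table>
</body>
</html>"""

# Bind the prefix "x" to the XHTML namespace for XPath evaluation.
NS = {"x": "http://www.w3.org/1999/xhtml"}

root = etree.fromstring(SAMPLE)

base = "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()"

# Every text node under the second cell, whitespace-only ones included.
all_text = root.xpath(base, namespaces=NS)

# Only the text nodes that still contain something after normalization.
names = root.xpath(base + "[normalize-space()]", namespaces=NS)

print(all_text)
print(names)
```

The prefix itself is arbitrary; only the namespace URI it maps to has to match the one declared in the document.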

Parse it as HTML.

    from lxml import html

    url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias'
    tree = html.parse(url)
    languages = tree.xpath('//table/tr/td[2]/a/text()')
    print('\n'.join(languages))

Output

 English German French Polish Italian Japanese Spanish Portuguese Dutch Russian Swedish Chinese Catalan Norwegian (Bokmål) Finnish Ukrainian Czech Hungarian Romanian Korean Turkish Vietnamese Indonesian Danish Arabic Esperanto Serbian Lithuanian Slovak Volapük Persian Hebrew Bulgarian Slovenian Malay Waray-Waray Croatian Estonian Newar / Nepal Bhasa Simple English Hindi Galician Thai Basque Norwegian (Nynorsk) Aromanian Greek Haitian Azerbaijani Tagalog Latin Telugu Georgian Macedonian Cebuano Serbo-Croatian Breton Piedmontese Marathi Latvian Luxembourgish Javanese Belarusian (Taraškievica) Welsh Icelandic Bosnian Albanian Tamil Belarusian Bishnupriya Manipuri Aragonese Occitan Bengali Swahili Ido Lombard West Frisian Gujarati Afrikaans Low Saxon Malayalam Quechua Sicilian Urdu Kurdish Cantonese Sundanese Asturian Neapolitan Samogitian Armenian Yoruba Irish Chuvash Walloon Nepali Ripuarian Western Panjabi Kannada Tajik Tarantino Venetian Yiddish Scottish Gaelic Tatar Min Nan Ossetian Uzbek Alemannic Kapampangan Sakha Kazakh Egyptian Arabic Maori Amharic Limburgian Nahuatl Upper Sorbian Gilaki Corsican Gan Mongolian Scots Interlingua Central_Bicolano Burmese Faroese Võro Dutch Low Saxon Sinhalese Turkmen West Flemish Sanskrit Bavarian Malagasy Manx Ilokano Divehi Norman Pangasinan Banyumasan Sorani Romansh Northern Sami Zazaki Mazandarani Wu Friulian Uyghur Ligurian Maltese Bihari Novial Tibetan Anglo-Saxon Kashubian Sardinian Classical Chinese Fiji Hindi Khmer Ladino Zamboanga Chavacano Pali Franco-Provençal/Arpitan Pashto Hakka Cornish Punjabi Navajo Silesian Kalmyk Pennsylvania German Hawaiian Saterland Frisian Interlingue Somali Komi Karachay-Balkar Crimean Tatar Tongan Acehnese Meadow Mari Picard Kinyarwanda Erzya Lingala Extremaduran Guarani Kirghiz Emilian-Romagnol Assyrian Neo-Aramaic Papiamentu Aymara Chechen Lojban Wolof Banjar Bashkir North Frisian Greenlandic Tok Pisin Udmurt Kabyle Tahitian Sranan Zealandic Hill Mari Komi-Permyak Lower Sorbian Abkhazian 
Gagauz Igbo Oriya Lao Kongo Avar Moksha Mirandese Romani Old Church Slavonic Karakalpak Samoan Moldovan Tetum Gothic Kashmiri Bambara Inupiak Sindhi Bislama Lak Nauruan Norfolk Inuktitut Pontic Assamese Cherokee Min Dong Palatinate German Swati Hausa Ewe Tigrinya Oromo Zulu Zhuang Venda Tsonga Kirundi Cree Dzongkha Sango Chamorro Luganda Buginese Buryat (Russia) Fijian Chichewa Akan Sesotho Xhosa Fula Tswana Kikuyu Tumbuka Shona Twi Cheyenne Ndonga Sichuan Yi Choctaw Marshallese Afar Kuanyama Hiri Motu Muscogee Kanuri Herero 
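A small sketch of why the unprefixed expression works here: lxml's HTML parser does not put elements into a namespace, even when the markup carries an `xmlns` attribute. The sample below is invented and parsed from a string instead of the live URL, so it can be run offline.

```python
from lxml import html

# Invented stand-in for the page, including the XHTML namespace declaration.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td><a href="#">English</a></td></tr>
<tr><td>2</td><td><a href="#">German</a></td></tr>
</table>
</body>
</html>"""

root = html.document_fromstring(SAMPLE)

# The tag name is plain "table", with no namespace part.
table = root.xpath("//table")[0]
print(table.tag)

# So the same unprefixed expression as above finds the language names.
languages = root.xpath("//table/tr/td[2]/a/text()")
print(languages)
```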

The XPath needs a namespace. The downloaded page begins with:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">

So you really want

 xpath('//html:table') 

where html is the prefix associated with "http://www.w3.org/1999/xhtml"

You will need to learn how to associate namespaces in lxml; I'm not a Python expert.

If this is your problem, I sympathize - it caught me and many others!
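For the lxml side, a minimal sketch that reproduces the empty list from the question and then applies the `html` prefix. The one-cell document is made up; `XHTML_NS` and the variable names are only illustrative.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Made-up document using the same root element attributes as the real page.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">'
    "<body><table><tr><td>English</td></tr></table></body></html>"
)

root = etree.fromstring(SAMPLE)

# With the XML parser, the namespace becomes part of every tag name.
print(root.tag)

# An unprefixed name only matches elements in no namespace: nothing found.
unprefixed = root.xpath("//table")

# Mapping the prefix "html" to the namespace URI makes the expression match.
prefixed = root.xpath("//html:table", namespaces={"html": XHTML_NS})

# find() and iter() accept the "{uri}name" form directly instead of a prefix.
clark = root.find(".//{%s}table" % XHTML_NS)

print(unprefixed, len(prefixed), clark.tag)
```

This would also explain why iterating over the tree in the question printed every element while `//table` returned nothing.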

Source: https://habr.com/ru/post/1336640/

