I am trying to retrieve all of the languages supported by Wikipedia and write them to a text file by traversing the tables on List_of_Wikipedias.
Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that match such element names is the most frequently asked question about XPath, and it has many good answers under the xpath tag on SO. Just search for them.
Here is the complete solution:
Use:
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()
where the prefix "x" has been registered for the XHTML namespace ( "http://www.w3.org/1999/xhtml" ).
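As a rough illustration of what "registering the prefix" means in practice, here is a minimal Python sketch. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the [not(x:th)] predicate and the text() step are emulated in plain Python rather than passed to the engine. The SAMPLE document and the function name second_column_texts are made up for this example and are not taken from the real page. A full XPath 1.0 engine should be able to take the expression as written.

```python
import xml.etree.ElementTree as ET

# Map the prefix "x" to the XHTML namespace; the prefix itself is arbitrary.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Tiny stand-in for the real page (assumed layout: one header row, then data rows).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><th>No</th><th>Language</th></tr>
      <tr><td>1</td><td><a href="#">English</a></td></tr>
      <tr><td>2</td><td><a href="#">German</a></td></tr>
    </table>
  </body>
</html>"""


def second_column_texts(xml_text):
    """Return the text of the second cell of every non-header row of the first table."""
    root = ET.fromstring(xml_text)
    # find() returns the first match in document order, like (//x:table)[1].
    table = root.find(".//x:table", XHTML_NS)
    if table is None:
        return []
    names = []
    for row in table.findall("x:tr", XHTML_NS):
        if row.find("x:th", XHTML_NS) is not None:
            continue  # emulate the [not(x:th)] predicate
        cells = row.findall("x:td", XHTML_NS)
        if len(cells) >= 2:
            # Join all descendant text of the second cell and trim it.
            text = "".join(cells[1].itertext()).strip()
            if text:
                names.append(text)
    return names


print(second_column_texts(SAMPLE))  # ['English', 'German']
```

Note that an unprefixed lookup such as root.find(".//table") returns None for this sample, which is the symptom described in the question. Unlike the XPath expression, this sketch joins the text nodes of each cell into one string instead of returning them separately.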
I evaluated the XPath expression above against the document retrieved from: http://s23.org/wikistats/wikipedias_html
I needed to add the following at the beginning of the document because I worked locally and did not have a DTD for XHTML. You may not need this:
<!DOCTYPE html [
  <!ENTITY uarr "↑">
  <!ENTITY darr "↓">
  <!ENTITY ccedil "ç">
  <!ENTITY oslash "ø">
  <!ENTITY aacute "á">
  <!ENTITY aring "å">
  <!ENTITY agrave "à">
  <!ENTITY egrave "è">
  <!ENTITY ograve "ò">
  <!ENTITY ocirc "ô">
]>
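To see why such a declaration can be necessary, here is a small Python sketch, again with the standard library's xml.etree.ElementTree. The BODY string and the helper parse_ok are invented for the illustration, and only one entity is declared. The point is just that an XML parser with no DTD available rejects a named entity such as &aring; until it is declared in the internal subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that uses a named entity, as the real page does.
BODY = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>Bokm&aring;l</p></body></html>"
)

# Internal DTD subset declaring only the entity this sample needs.
DOCTYPE = '<!DOCTYPE html [ <!ENTITY aring "&#229;"> ]>'


def parse_ok(text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(parse_ok(BODY))            # the entity is undefined without a declaration
print(parse_ok(DOCTYPE + BODY))  # the declaration lets the parser resolve it
```

If your toolchain can load the XHTML DTD, or parses the page as HTML rather than XML, this step is probably unnecessary.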
The result of applying the above XPath expression to this document:
English German French Polish Italian Japanese Spanish Portuguese Dutch Russian Swedish Chinese Catalan Norwegian (Bokmål) Finnish Ukrainian Czech Hungarian Romanian Korean Turkish Vietnamese Indonesian Danish Arabic Esperanto Serbian Lithuanian Slovak Volapük Persian Hebrew Bulgarian Slovenian Malay Waray-Waray Croatian Estonian Newar / Nepal Bhasa Simple English Hindi Galician Thai Basque Norwegian (Nynorsk) Aromanian Greek Haitian Azerbaijani Tagalog Latin Telugu Georgian Macedonian Cebuano Serbo-Croatian Breton Piedmontese Marathi Latvian Luxembourgish Javanese Belarusian (Taraškievica) Welsh Icelandic Bosnian Albanian Tamil Belarusian Bishnupriya Manipuri Aragonese Occitan Bengali Swahili Ido Lombard West Frisian Gujarati Afrikaans Low Saxon Malayalam Quechua Sicilian Urdu Kurdish Cantonese Sundanese Asturian Neapolitan Samogitian Armenian Yoruba Irish Chuvash Walloon Nepali Ripuarian Western Panjabi Kannada Tajik Tarantino Venetian Yiddish Scottish Gaelic Tatar Min Nan Ossetian Uzbek Alemannic Kapampangan Sakha Egyptian Arabic Kazakh Maori Limburgian Amharic Nahuatl Upper Sorbian Gilaki Corsican Gan Mongolian Scots Interlingua Central_Bicolano Burmese Faroese Võro Dutch Low Saxon Sinhalese Turkmen West Flemish Sanskrit Bavarian Malagasy Manx Ilokano Divehi Norman Pangasinan Banyumasan Sorani Romansh Northern Sami Zazaki Mazandarani Wu Friulian Uyghur Ligurian Maltese Bihari Novial Tibetan Anglo-Saxon Kashubian Sardinian Classical Chinese Fiji Hindi Khmer Ladino Zamboanga Chavacano Pali Franco-Provençal/Arpitan Pashto Hakka Cornish Punjabi Navajo Silesian Kalmyk Pennsylvania German Hawaiian Saterland Frisian Interlingue Somali Komi Karachay-Balkar Crimean Tatar Tongan Acehnese Meadow Mari Picard Erzya Lingala Kinyarwanda Extremaduran Guarani Kirghiz Emilian-Romagnol Assyrian Neo-Aramaic Papiamentu Aymara Chechen Lojban Wolof Banjar Bashkir North Frisian Greenlandic Tok Pisin Udmurt Kabyle Tahitian Sranan Zealandic Hill Mari Komi-Permyak Lower Sorbian Abkhazian 
Gagauz Igbo Oriya Lao Kongo Avar Moksha Mirandese Romani Old Church Slavonic Karakalpak Samoan Moldovan Tetum Gothic Kashmiri Bambara Inupiak Sindhi Bislama Lak Nauruan Norfolk Inuktitut Pontic Assamese Cherokee Min Dong Swati Palatinate German Hausa Ewe Tigrinya Oromo Zulu Zhuang Venda Tsonga Kirundi Dzongkha Sango Cree Chamorro Luganda Buginese Buryat (Russia) Fijian Chichewa Akan Sesotho Xhosa Fula Tswana Kikuyu Tumbuka Shona Twi Cheyenne Ndonga Sichuan Yi Choctaw Marshallese Afar Kuanyama Hiri Motu Muscogee Kanuri Herero
Note: every second selected node is a whitespace-only text node. If you do not want these selected, use:
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
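The effect of the [normalize-space()] predicate can be sketched in Python as follows. The CELL fragment is hypothetical, and ElementTree has no text() step, so itertext() stands in for //text() here. The filter keeps only strings that contain something other than XML whitespace, which is what the predicate tests.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell, pretty-printed so that whitespace surrounds the link.
CELL = """<td xmlns="http://www.w3.org/1999/xhtml">
  <a href="#">English</a>
</td>"""

# The characters XPath's normalize-space() treats as whitespace.
XML_WHITESPACE = " \t\r\n"

cell = ET.fromstring(CELL)

# All descendant text, including the whitespace-only pieces around <a>.
all_text = list(cell.itertext())

# Keep only pieces that are non-empty after stripping XML whitespace.
non_blank = [t for t in all_text if t.strip(XML_WHITESPACE)]

print(len(all_text), non_blank)  # 3 ['English']
```

Writing the result to a text file is then a matter of joining non_blank with newlines and writing the string out.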