HtmlUnit and Fragment Identifiers

Question

I am wondering how to handle fragment identifiers. The link I want to scrape contains a fragment identifier, but HtmlUnit seems to drop the "#/db4mj" part of my URL and therefore loads the original URL.

Does anyone know how to handle fragment identifiers? (I can post sample code to further explain if necessary)

EDIT

Since this has not had many views (and no answers), I am adding a bounty. Sorry it is only 50, but I only had 79 to start with.

EDIT

Here is some sample code, as requested.

Our URL will look like this: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0

If you look at the content behind that link, you will see several brushes, each with its own URL. My script grabs a URL such as: http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4

As you can see, it carries the fragment identifier #/dbwam4. I am trying to grab the content shown for this URL, but HtmlUnit still behaves as if it were at the original URL.

Here is the part of my script that fails on the URL with the fragment identifier but has no problem with the original URL.

    client = new WebClient(BrowserVersion.FIREFOX_3)
    client.javaScriptEnabled = false
    page = client.getPage(url) // url with fragment identifier
    // this element exists only on the page for the fragment URL, not on the original URL
    img = page.getByXPath("//*[@id='gmi-ResViewSizer_img']")

I expected to be able to capture certain information from the URL with the fragment identifier, but I cannot access it.

+4

url identity fragment htmlunit

StartingGroovy Jan 03 '11 at 20:43

1 answer

Mark mclaren · Accepted Answer · 2011-01-12T13:54:44+0000

There is good news and bad news.

First, the good news: HtmlUnit is working correctly.

If you open the URL with the fragment identifier in a browser with JavaScript turned off (for example, using the Firefox QuickJava plugin), you will not see the "single brush" view that you want.

So, to get this page, you need a WebClient with JavaScript enabled (setJavaScriptEnabled(true)).
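As background on why JavaScript matters here: the fragment part of a URL is not sent to the server in the HTTP request, so whatever the page shows for #/dbwam4 has to be produced by script running in the client. The following is only a small sketch using java.net.URI to show which part of the URL is actually requested; the class and method names (FragmentDemo, withoutFragment, fragmentOf) are made up for illustration and are not part of HtmlUnit.

```java
import java.net.URI;

class FragmentDemo {

    /** Returns the URL as a server would see it, i.e. without the "#..." part. */
    static String withoutFragment(String url) {
        String fragment = URI.create(url).getRawFragment();
        if (fragment == null) {
            return url; // no fragment present
        }
        // strip the fragment plus the leading '#'
        return url.substring(0, url.length() - fragment.length() - 1);
    }

    /** Returns the fragment (the text after '#'), or null if there is none. */
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
        System.out.println("requested from server: " + withoutFragment(url));
        System.out.println("left to the client:    " + fragmentOf(url));
    }
}
```

With JavaScript disabled, nothing ever acts on the second value, which matches the behaviour described in the question.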

And now the bad news:

I was unable to get the single-brush view using HtmlUnit with JavaScript enabled (I don't know why), although I was able to retrieve the full page.

The real problem is that the returned HTML is in such a bad state that it resisted every attempt to parse it (I tried TagSoup, jsoup, Jaxen, etc.). I therefore suspect that parsing the page with XPath might not work for you.

So I think you may need to resort to regular expressions (which is far from ideal) or even some variant of String.indexOf("gmi-ResViewSizer_img").
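A rough sketch of that fallback in plain Java is shown below. It assumes the element is a single tag whose attributes are double-quoted and that the value of interest is its src attribute; both are assumptions about the markup, not something confirmed from the real page. The class name ImgSrcFinder, the method findSrc, and the sample HTML in main are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcFinder {

    // Assumption: attributes are written as name="value" with double quotes.
    private static final Pattern SRC = Pattern.compile("\\bsrc\\s*=\\s*\"([^\"]*)\"");

    /** Returns the src of the tag carrying the given id, or null if not found. */
    static String findSrc(String html, String id) {
        int idPos = html.indexOf("id=\"" + id + "\"");
        if (idPos < 0) {
            return null; // id not present in the text at all
        }
        // Cut out the tag that surrounds the id attribute.
        int tagStart = html.lastIndexOf('<', idPos);
        int tagEnd = html.indexOf('>', idPos);
        if (tagStart < 0 || tagEnd < 0) {
            return null; // markup too broken to delimit the tag
        }
        Matcher m = SRC.matcher(html.substring(tagStart, tagEnd + 1));
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Made-up fragment of HTML, only to exercise the method.
        String html = "<div><img src=\"http://example.com/brush.png\" id=\"gmi-ResViewSizer_img\"></div>";
        System.out.println(findSrc(html, "gmi-ResViewSizer_img"));
    }
}
```

The input string could come from the raw response body rather than from a parsed DOM, which sidesteps the parsers that choke on the page, at the cost of being brittle if the markup changes.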

Hope this helps.

EDIT

I managed to get something that works sporadically. I'm afraid I haven't converted it to Groovy yet, so it is in plain old Java.

I have not looked at the HtmlUnit source, but it is almost as if something in the process of saving the page helps the parsing along. Without the save, I seem to get NullPointerExceptions.

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.WebRequest;
    import com.gargoylesoftware.htmlunit.WebResponse;
    import com.gargoylesoftware.htmlunit.html.HtmlElement;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import com.gargoylesoftware.htmlunit.util.FalsifyingWebConnection;
    import java.io.File;
    import java.io.IOException;

    public class TestProblem {

        public static void main(String[] args) throws IOException {
            WebClient client = new WebClient(BrowserVersion.FIREFOX_3_6);
            client.setJavaScriptEnabled(true);
            client.setCssEnabled(false);
            String url = "http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4";
            client.setThrowExceptionOnScriptError(false);
            client.setThrowExceptionOnFailingStatusCode(false);

            client.setWebConnection(new FalsifyingWebConnection(client) {
                @Override
                public WebResponse getResponse(final WebRequest request) throws IOException {
                    if ("www.google-analytics.com".equals(request.getUrl().getHost())) {
                        return createWebResponse(request, "", "application/javascript"); // -> empty script
                    }
                    if ("d.unanimis.co.uk".equals(request.getUrl().getHost())) {
                        return createWebResponse(request, "", "application/javascript"); // -> empty script
                    }
                    if ("edge.quantserve.com".equals(request.getUrl().getHost())) {
                        return createWebResponse(request, "", "application/javascript"); // -> empty script
                    }
                    if ("b.scorecardresearch.com".equals(request.getUrl().getHost())) {
                        return createWebResponse(request, "", "application/javascript"); // -> empty script
                    }
                    if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6core_jc.js")) {
                        WebResponse wr = super.getResponse(request);
                        return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                    }
                    if (request.getUrl().toString().startsWith("http://st.deviantart.net/css/v6loggedin_jc.js")) {
                        WebResponse wr = super.getResponse(request);
                        return createWebResponse(request, wr.getContentAsString(), "application/javascript");
                    }
                    return super.getResponse(request);
                }
            });

            HtmlPage page = client.getPage(url); // url with fragment identifier

            File saveFile = new File("saved.html");
            if (saveFile.exists()) {
                saveFile.delete();
                saveFile = new File("saved.html");
            }
            page.save(saveFile);

            HtmlElement img = page.getElementById("gmi-ResViewSizer_img");
            System.out.println(img.toString());
        }
    }
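The four host checks in getResponse all do the same thing, so they could be collapsed into a single lookup. The sketch below shows only that lookup, using the host names from the code above; the class name BlockedHosts and the method isBlocked are hypothetical, and Set.of assumes Java 9 or later.

```java
import java.util.Set;

class BlockedHosts {

    // Hosts whose scripts the code above replaces with an empty response.
    static final Set<String> HOSTS = Set.of(
            "www.google-analytics.com",
            "d.unanimis.co.uk",
            "edge.quantserve.com",
            "b.scorecardresearch.com");

    /** Returns true if requests to this host should get an empty script. */
    static boolean isBlocked(String host) {
        return host != null && HOSTS.contains(host);
    }

    public static void main(String[] args) {
        System.out.println(isBlocked("www.google-analytics.com"));
        System.out.println(isBlocked("st.deviantart.net"));
    }
}
```

Inside getResponse, one possible use would be a single check such as isBlocked(request.getUrl().getHost()) followed by the same createWebResponse(request, "", "application/javascript") call that the original branches make.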
