This is what I'm trying to do (I want to use jsoup):
- pass in a single URL to parse
- search for the date(s) that appear inside the content of the web page
- retrieve at least one date from each page's content
- convert that date to a standard format
So, for point 1, this is what I have now:
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();
Now I want to understand what a "Document" actually is: has it already parsed the HTML (or whatever type of web page this is), or what?
Then, for point 2, I now have:
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);
Here I try to match a date regex against the page to find the dates and store them in a string for later use (point 3), but I'm sure I'm not on the right track. I need help here.
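To make point 3 concrete, here is a minimal sketch of pulling the date strings themselves out of plain text with java.util.regex. It assumes the page text has already been obtained as a String (for example from the Document's text); the class name DateExtractor and the method extractDates are hypothetical names used only for illustration, and the word boundaries are my own addition to the pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every yyyy-MM-dd looking substring from a text.
public class DateExtractor {
    // Same date pattern as in the question, with \b added (an assumption)
    // so it does not match inside a longer run of digits.
    private static final Pattern DATE =
            Pattern.compile("\\b\\d{4}-[01]\\d-[0-3]\\d\\b");

    public static List<String> extractDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        // find() advances through the text one match at a time
        while (m.find()) {
            dates.add(m.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        String sample = "asked 2015-01-26, edited 2015-02-03 by someone";
        System.out.println(extractDates(sample)); // prints [2015-01-26, 2015-02-03]
    }
}
```

Unlike getElementsMatchingOwnText, which returns whole elements, this returns only the matched substrings, which seems closer to what point 3 asks for.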
I have already done point 4.
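A sketch of how point 4 could look, assuming "standard format" means an ISO-8601 calendar date handled with java.time. The class name DateNormalizer, the method parseIsoDate, and the dd/MM/yyyy output pattern are hypothetical choices for illustration only. Note that the regex alone accepts impossible values such as 2015-19-39, so the parse step doubles as validation.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper: turns a matched string into a validated LocalDate.
public class DateNormalizer {
    public static Optional<LocalDate> parseIsoDate(String raw) {
        try {
            // ISO_LOCAL_DATE expects yyyy-MM-dd and rejects impossible dates
            return Optional.of(LocalDate.parse(raw, DateTimeFormatter.ISO_LOCAL_DATE));
        } catch (DateTimeParseException ex) {
            // The regex matched, but the value is not a real calendar date
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Assumed output pattern; swap in whatever format is actually needed
        DateTimeFormatter out = DateTimeFormatter.ofPattern("dd/MM/yyyy");
        parseIsoDate("2015-01-26")
                .ifPresent(d -> System.out.println(d.format(out))); // prints 26/01/2015
        System.out.println(parseIsoDate("2015-19-39").isPresent()); // prints false
    }
}
```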
Thanks in advance!
My code so far:
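import java.io.IOException;
import java.util.regex.Pattern;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public static void main(String[] args) {
    try {
        final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";

        // Fetch the page and parse it into a Document
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        Document htmlDocument = connection.get();

        // Print the text of every <p> element.
        // attr("abs:p") reads an attribute named "p", which <p> does not have,
        // so it always printed an empty string; text() returns the content.
        Elements paragraphs = htmlDocument.getElementsByTag("p");
        for (Element src : paragraphs) {
            System.out.println("text = [" + src.text() + "]");
        }

        // Find the elements whose own text contains something shaped like yyyy-MM-dd
        Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Elements elements = htmlDocument.getElementsMatchingOwnText(p);
        for (Element e : elements) {
            System.out.println("element = [" + e + "]");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}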