My program falls into a cycle that never ends, and I do not see how it falls into this trap or how to avoid it.
It parses the wikipedia data, and I think it’s just connected to the connected component around and around.
Perhaps I can save the pages that I visited already in the set, and if the page is in this set, I will not return to it?
This is my project , its quite small, only three short classes.
This is a link to the data that it creates, I stopped it, otherwise it went on and on.
This is a ridiculously small toy that spawned this mess.
This is the same project that I worked on when I asked this question .
The following is all the code.
Main class:
public static void main(String[] args) throws Exception
{
String name_list_file = "/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/02/test/people_test.txt";
String single_name;
try (
InputStream stream_for_name_list_file = new FileInputStream( name_list_file );
InputStreamReader stream_reader = new InputStreamReader( stream_for_name_list_file , Charset.forName("UTF-8"));
BufferedReader line_reader = new BufferedReader( stream_reader );
)
{
while (( single_name = line_reader.readLine() ) != null)
{
String associated_alias = URLEncoder.encode( single_name , "UTF-8");
String platonic_key = single_name;
System.out.println("now processing: " + platonic_key);
Wikidata_Q_Reader.getQ( platonic_key, associated_alias );
}
}
Wikidata_Q_Reader.print_data();
}
Wikipedia reader / value robber:
static Map<String, HashSet<String> > q_valMap = new HashMap<String, HashSet<String> >();
public static void getQ( String platonic_key, String associated_alias ) throws Exception
{
String URL_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=Search";
URL wikidata_page = new URL(URL_czech);
HttpURLConnection wiki_connection = (HttpURLConnection)wikidata_page.openConnection();
InputStream wikiInputStream = null;
try
{
wiki_connection.connect();
wikiInputStream = wiki_connection.getInputStream();
}
catch(IOException e)
{
wikiInputStream = wiki_connection.getErrorStream();
}
BufferedReader wiki_data_pagecontent = new BufferedReader(
new InputStreamReader(
wikiInputStream ));
String line_by_line;
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
if (disambig_indicator.matches())
{
Wikipedia_Disambig_Fetcher.all_possibilities( platonic_key, associated_alias );
}
else
{
Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
"wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
"href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
if ( match_Q_component.matches() )
{
String Q = match_Q_component.group(1);
put_to_hash( platonic_key, Q );
}
}
}
wiki_data_pagecontent.close();
for (Map.Entry<String, HashSet<String> > entry : q_valMap.entrySet())
{
System.out.println(entry.getKey()+" : " + Arrays.deepToString(q_valMap.entrySet().toArray()) );
}
}
public static HashSet<String> put_to_hash(String key, String value )
{
HashSet<String> valSet;
if (q_valMap.containsKey(key)) {
valSet = q_valMap.get(key);
} else {
valSet = new HashSet<String>();
q_valMap.put(key, valSet);
}
valSet.add(value);
return valSet;
}
public static void print_data()
{
System.out.println("THIS IS THE FINAL DATA SET!!!");
for (Map.Entry<String, HashSet<String> > entry : q_valMap.entrySet())
{
System.out.println(entry.getKey()+" : " + Arrays.deepToString(q_valMap.entrySet().toArray()) );
}
}
referring to value pages:
public static void all_possibilities( String platonic_key, String associated_alias ) throws Exception
{
System.out.println("this is a disambig page");
String URL_czech = "https://en.wikipedia.org/wiki/" + associated_alias;
URL wikidata_page = new URL(URL_czech);
HttpURLConnection wiki_connection = (HttpURLConnection)wikidata_page.openConnection();
InputStream wikiInputStream = null;
try
{
wiki_connection.connect();
wikiInputStream = wiki_connection.getInputStream();
}
catch(IOException e)
{
wikiInputStream = wiki_connection.getErrorStream();
}
Document docx = Jsoup.parse(wikiInputStream, null, wikidata_page.getProtocol()+"://"+wikidata_page.getHost()+"/");
Elements linx = docx.select( "p:contains(" + associated_alias + ") ~ ul a:eq(0)" );
for (Element linq : linx)
{
System.out.println(linq.text());
String linq_nospace = URLEncoder.encode( linq.text() , "UTF-8");
Wikidata_Q_Reader.getQ( platonic_key, linq_nospace );
}
}