I am trying to automatically determine the type of an English-language website. I load the site's home page, parse the HTML, and extract the text content of the page. For example, the text extracted from the CNN.com home page is shown below. I then want to derive the page's keywords by matching that text against my database. If the keywords include "news" or "latest news", the site is classified as a news website. If there are words like "healthy" or "medical", it is classified as a medical website.
There are some tools that can perform text segmentation, but it's not easy to find one that respects semantics. For example, "online shopping" is a single keyword and shouldn't be split into two words, because the combination is the useful information. "online" and "shopping" on their own are less useful, since each of them can appear anywhere on the Internet ...
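To make the phrase problem concrete, here is a minimal sketch of one possible approach: look up word n-grams in the keyword database and prefer the longest match at each position, so that "online shopping" is matched as a unit. The `KEYWORDS` table, the `find_phrases` helper, and the `max_len` limit are all hypothetical and only illustrate the idea; a real database would be much larger.

```python
import re

# Hypothetical keyword database: category -> keywords or multi-word phrases.
KEYWORDS = {
    "shopping": ["online shopping", "shopping cart"],
    "news": ["latest news", "news"],
    "medical": ["medical", "healthy"],
}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def find_phrases(text, keywords=KEYWORDS, max_len=3):
    """Return (category, phrase) pairs, trying the longest n-gram first."""
    # Map each phrase, as a tuple of words, to its category.
    lookup = {tuple(p.split()): cat for cat, ps in keywords.items() for p in ps}
    tokens = tokenize(text)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            gram = tuple(tokens[i:i + n])
            if len(gram) == n and gram in lookup:
                hits.append((lookup[gram], " ".join(gram)))
                i += n  # consume the whole phrase so it is not split again
                break
        else:
            i += 1  # no phrase starts here; move to the next word
    return hits
```

With this table, "online" and "shopping" on their own produce no hit, while the two-word phrase does. Whether longest-match-first is the right policy depends on the database, so treat it as an assumption.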
* Newark, JFK airports reopen
* LaGuardia Airport reopened 1 runway
* Over 4,155 flights were canceled on Monday
* FULL STORY
* LaGuardia Airport snowplows busy Video
* Are you stranded? | Airport delays
* Safety tips for winter weather
* Frosty fun Video | Small dog, deep snow
Latest news
* Easter eggs used to smuggle cocaine
* Salmonella forces cilantro, parsley recall
* Obama's surprising verdict on Vick
* Blue Note baritone Bernie Wilson dead
* Busch aide to 911: She's not waking up
* Girl, 15, last seen working at store in '90
* Teena Marie death shocks fans
* Terror network 'dismantled' in Morocco
* Saudis: 'Militant' had al Qaeda ties
* Ticker: Gov. blasts Obama 'birthers'
* Game show goof is 800K mistake Video
* Chopper saves calf on frozen pond Video
* Pickpocketing becomes hands-free Video
* Chilean miners going to Disney World
* Who's the most intriguing of 2010?
* Natalie Portman is pregnant, engaged
* 'Convert all gifts from aunt' CNNMoney
* Who controls the thermostat at home?
* This Just In: CNN news blog
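The overall pipeline described above (parse the HTML, keep the visible text, match it against a keyword database) could be sketched roughly as follows, using only the Python standard library. The `CATEGORY_KEYWORDS` table, the `TextExtractor` class, and the simple hit-count scoring are assumptions made for illustration, not a tested classifier. Fetching the page itself is left out; `urllib.request.urlopen` would be one way to obtain the HTML string.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical keyword database: category -> indicative single words.
CATEGORY_KEYWORDS = {
    "news": {"news", "story", "blog", "headlines"},
    "medical": {"medical", "healthy", "health"},
}

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def classify(text, database=CATEGORY_KEYWORDS):
    """Return the category with the most keyword hits, or None if none match."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {cat: sum(words[w] for w in kws) for cat, kws in database.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For a page like the one above, words such as "news" and "blog" would push the score toward the news category. Raw counts are a crude measure, so weighting by how rare a word is across sites would probably be needed in practice.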