I'm coding a scraper like Facebook's that extracts information from a given URL. I'm one step away from completing the basic functionality. The remaining problem is removing unwanted images. For example, when I run it on a random URL, I get all of these images:
Scraper Object
(
    [url] => http://buzz.money.cnn.com/2012/07/23/spain-italy-short-selling/?iid=HP_Highlight
    [title] => Spain and Italy ban short selling - The Buzz - Investment and Stock Market News
    [description] => The Euronext 100 stock index falls sharply on renewed concerns about Spain. Securities regulators in Spain and Italy both instituted short-selling bans Monday as financial markets tumbled. The move is designed to limit the downward pressure
    [imageUrls] => Array
        (
            [0] => http://cnnmoneybuzzblog.files.wordpress.com/2012/07/chart_ws_index_euronext100_201272310932-09.png
            [1] => http://i.cdn.turner.com/money/.e1m/img/5.0/data/feargreed/scale.316x95.png
            [2] => http://i2.cdn.turner.com/money/.element/img/5.0/sections/contributors/ben_rooney_130.jpg
            [3] => http://i2.cdn.turner.com/money/.element/img/5.0/sections/contributors/catherine_tymkiw.02.jpg
            [4] => http://i2.cdn.turner.com/money/.element/img/5.0/sections/contributors/paul_lamonica.02.jpg
            [5] => http://i2.cdn.turner.com/money/.element/img/5.0/sections/contributors/hibah_yousuf.02.jpg
            [6] => http://i2.cdn.turner.com/money/.element/img/5.0/sections/contributors/maureen_farrell.02.jpg
            [7] => http://i2.cdn.turner.com/money/.element/img/5.0/sections/contributors/ben_rooney.02.jpg
            [8] => http://i.cdn.turner.com/money/.element/img/4.0/services/button_login.gif
            [9] => http://www.bizographics.com/collect/?fmt=gif&pid=311
            [10] => http://pixel.quantserve.com/pixel/p-5dyPa639IrgIw.gif
            [11] => http://i.cdn.turner.com/money/.element/img/1.0/misc/1.gif
            [12] => http://buzz.money.cnn.com/2012/07/23/spain-italy-short-selling/?iid=HP_Highlight//pixel.quantserve.com/pixel/p-18-mFEk4J448M.gif?labels=%2Clanguage.en%2Ctype.wpcom%2Cposttag.bonds%2Cposttag.dow%2Cposttag.ibex%2Cposttag.italy%2Cposttag.lehman%2Cposttag.milan%2Cposttag.nasdaq%2Cposttag.sp-500%2Cposttag.short-selling%2Cposttag.spain%2Cposttag.stock%2Cposttag.yields%2Cvip.cnnmoneybuzzblog
            [13] => http://stats.wordpress.com/b.gif?v=noscript
        )

)
I need a way to remove every image ending in .gif or .png and keep only the .jpg entries in the array, so that the user can look through them and choose the one that suits the article.
I tried some array functions, but I suspect it takes some regular-expression magic to make this work with almost any given URL.
P.S. I can access all the data in the object using $info->url, $info->description, and so on. Only the imageUrls array needs to be filtered, and then it will be ready.
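To make the goal concrete, here is a minimal sketch of the kind of filter I have in mind. It is not taken from the scraper itself: the function name keepJpegUrls is hypothetical, the sample list is a shortened copy of the output above, and I'm assuming that checking the extension of the URL path (rather than the whole URL string) is good enough to tell .jpg from .gif and .png.

```php
<?php
// Hypothetical helper (not part of the original scraper): keeps only URLs
// whose path ends in .jpg or .jpeg, ignoring the query string and letter case.
function keepJpegUrls(array $urls): array
{
    $jpegs = array_filter($urls, function ($url) {
        // parse_url() isolates the path, so a query such as "?fmt=gif" is ignored.
        $path = parse_url($url, PHP_URL_PATH);
        if (!is_string($path)) {
            return false; // malformed URL or no path component
        }
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        return in_array($ext, ['jpg', 'jpeg'], true);
    });

    // array_filter() preserves keys; reindex so the result is a plain list.
    return array_values($jpegs);
}

// Shortened sample taken from the scraper output above.
$imageUrls = [
    'http://i.cdn.turner.com/money/.e1m/img/5.0/data/feargreed/scale.316x95.png',
    'http://i2.cdn.turner.com/money/.element/img/5.0/sections/contributors/ben_rooney_130.jpg',
    'http://www.bizographics.com/collect/?fmt=gif&pid=311',
    'http://stats.wordpress.com/b.gif?v=noscript',
];

print_r(keepJpegUrls($imageUrls));
```

With the real object this would presumably be called as keepJpegUrls($info->imageUrls). Filtering by extension alone would still let through a .jpg tracking pixel, so it only covers the case described above.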