How to get online HTML tables into a local program automatically using C#

To keep it short:

I have several different sites with tables containing information that I would like to query locally.

I have been looking into my options, and I have some ideas:

  • In Excel, I found a feature that lets me go to a web page and copy data from a table. The problem is that this only happens once. The data in the tables will be updated every week, so I need Excel to refresh automatically every time I open my program (see the sketch after this list).

  • I could use a crawler, but then I would have to write a separate solution for each table and find a way to store the results.
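
A minimal sketch of what a self-refreshing web query (the first idea in the list above) could look like when created from C# through Excel interop; the URL, table index, and save path below are placeholders, not details from the question:

using System;
using Excel = Microsoft.Office.Interop.Excel;

class WebQuerySetup
{
    static void Main()
    {
        var app = new Excel.Application();
        Excel.Workbook wb = app.Workbooks.Add();
        Excel.Worksheet ws = (Excel.Worksheet)wb.Worksheets[1];

        // A "URL;" connection string tells Excel to run a web query against the page.
        Excel.QueryTable qt = ws.QueryTables.Add(
            "URL;http://example.com/league-table.html", // hypothetical page
            ws.Range["A1"],
            Type.Missing);

        qt.WebSelectionType = Excel.XlWebSelectionType.xlSpecifiedTables;
        qt.WebTables = "1";           // pull only the first <table> on the page
        qt.RefreshOnFileOpen = true;  // re-query every time the workbook opens
        qt.Refresh(false);            // run it once now, synchronously

        wb.SaveAs(@"C:\data\LeagueTable.xlsx"); // placeholder path
        wb.Close();
        app.Quit();
    }
}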

I have a MySQL database that contains a lot of information I need in my program, so a solution that requires a database would be fine.

About my program: it will be written in C#, first as a local program and later as an MVC project. Suggestions for both are very welcome, and if you need more information, just comment and I will try to describe it in more detail. :)

EDIT 1

I am sorry that I didn't tell you from the beginning about the tables I was talking about, but when I posted this question I still had to find them all. Now I have picked a few of them to show you the different kinds of tables I will have to work with. About the project, I should tell you that the program I am planning is for private use only and not for sale. I am not sure about the crawling rules on public sites, which is why I am keeping this private.

Table 2, Table 3 (links to example tables)

As you can see, the football data is displayed in very different ways, so I need to know which approach is best for collecting it, because it will be easier for me to design a database with that knowledge.

+6
6 answers

Anders, Excel has a built-in way to get data, and you only have to set it up once. After that you just refresh the query. See this link.

html cricinfo scorecard parsing
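
If you would rather trigger that refresh from C# than by hand, a rough interop sketch (the workbook path is a placeholder) could look like this:

using Excel = Microsoft.Office.Interop.Excel;

class RefreshQueries
{
    static void Main()
    {
        var app = new Excel.Application();
        // Placeholder path: a workbook whose web query was set up earlier.
        Excel.Workbook wb = app.Workbooks.Open(@"C:\data\LeagueTable.xlsx");
        // Re-runs every query in the workbook; note that background queries
        // may still be finishing when this call returns.
        wb.RefreshAll();
        wb.Save();
        wb.Close();
        app.Quit();
    }
}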

Followup

Try looking at this page: soccernet.espn.go.com/stats/_/league/eng.1/... There are 3 tables, but it doesn't seem to find them. :( – Anders Guerner 7 min ago

On that particular website, if you look at the source, you will see that the tables don't have an id. All three tables have the same class, "tablehead". If you want, you can loop through all the tables in the workbook's Open event and retrieve the data. Your job is simplified since all 3 tables share the same class.

Alternatively, you can also do this:

In Excel, click File | Open and enter the URL directly in the dialog box. You will notice that Excel lays the data out neatly.

In fact, you can write a small macro/code that opens a temp workbook from the URL and then simply extracts the tables from the temp workbook into your own workbook. My estimate is that with a good internet connection the whole process should take no more than 15 seconds.
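
A rough, untested sketch of that temp-workbook idea in C# interop terms (URL and paths are placeholders):

using Excel = Microsoft.Office.Interop.Excel;

class TempWorkbookRip
{
    static void Main()
    {
        var app = new Excel.Application();
        // Excel parses the page's tables into a temporary workbook.
        Excel.Workbook temp = app.Workbooks.Open("http://example.com/stats.html"); // placeholder URL
        Excel.Worksheet source = (Excel.Worksheet)temp.Worksheets[1];

        Excel.Workbook mine = app.Workbooks.Open(@"C:\data\MyBook.xlsx"); // placeholder path
        Excel.Worksheet target = (Excel.Worksheet)mine.Worksheets[1];

        // Copy everything Excel extracted into our own workbook.
        source.UsedRange.Copy(target.Range["A1"]);

        mine.Save();
        temp.Close(false); // discard the temp workbook
        mine.Close();
        app.Quit();
    }
}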

+7

If I just need to read information from a web page, I find HtmlAgilityPack extremely useful. It makes it easy to use LINQ to find specific tags with identifying information and then navigate through the child tags. So you can find a <table>, easily get to its <tr> and <td> elements, and grab the InnerText property to read the contents of a cell.
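
A minimal sketch of that LINQ navigation (the URL, and the assumption that the first table is the one you want, are placeholders):

using System;
using System.Linq;
using HtmlAgilityPack;

class TableRead
{
    static void Main()
    {
        var doc = new HtmlWeb().Load("http://example.com/page.html"); // placeholder URL
        // Find the first <table>, then walk its rows and cells with LINQ.
        HtmlNode table = doc.DocumentNode.Descendants("table").First();
        foreach (HtmlNode row in table.Descendants("tr"))
        {
            var cells = row.Descendants("td").Select(td => td.InnerText.Trim());
            Console.WriteLine(string.Join(" | ", cells));
        }
    }
}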

+1

You can use Visual Web Ripper; it has an API you can use from .NET, and you can build a template with its designer to pull out the data you need. It is very simple to use; my company used it to scrape reviews from sites, even ones with search and paging.

+1

My approach would be to use a tool to create an RSS feed for each of the URLs containing your table data, and then display the data in your user interface (be it WPF, WinForms, or ASP.NET). That way you can easily configure additional "feeds" when you find a new website to extract data from: your job is just to normalize the new website into the standard RSS format (configurable in one of these tools). You can even make your UI pull in an additional feed based on configuration, so you don't need to recompile when you add a new site.

You can save the feed data to a database or simply display it in real time, and also cache/update the data automatically at regular intervals. The main premise of the approach is to standardize the different table formats of each site into one common format (RSS or otherwise) and then only worry about consuming that one standard format in your application. This can be set up in a class library that presents the data in the common format, and that library can then be used by both your C# application and your web application.
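
As a sketch of the consuming side, .NET's built-in SyndicationFeed (in System.ServiceModel.Syndication, which requires a reference to System.ServiceModel.dll) can read whatever feed such a tool produces; the feed URL here is hypothetical:

using System;
using System.ServiceModel.Syndication;
using System.Xml;

class FeedReader
{
    static void Main()
    {
        // Hypothetical feed produced by one of the feed-builder tools.
        using (XmlReader reader = XmlReader.Create("http://example.com/table-feed.rss"))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                Console.WriteLine("{0}: {1}", item.PublishDate, item.Title.Text);
            }
        }
    }
}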

Edit: here is a link to good information on several tools you can use to create an RSS feed from any website: http://profy.com/2007/09/30/7-tools-to-make-an-rss-feed-of-any-website/

0

You could use Selenium (a tool for automated web testing). It is an extremely useful tool; its API lets you do things like find a specific table by XPath, CSS, or DOM.

You can control Selenium through its remote control from many different languages. See: http://seleniumhq.org/projects/remote-control/

See for example for C #: http://www.theautomatedtester.co.uk/tutorials/selenium/selenium_csharp_nunit.htm

See Stack Overflow for some examples: How to get text in a table column using Selenium RC?
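
A minimal sketch of driving Selenium RC from C# (the server address, browser string, page, and locators are all placeholders):

using System;
using Selenium; // Selenium RC .NET client driver

class TableViaSelenium
{
    static void Main()
    {
        // Assumes a Selenium RC server is running locally on the default port.
        ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "http://example.com/");
        selenium.Start();
        selenium.Open("/stats.html"); // placeholder page

        // GetTable takes "tableLocator.row.column" (zero-based row/column).
        string cell = selenium.GetTable("standings.1.2"); // "standings" is a hypothetical table id
        // Or read a single cell directly by XPath.
        string other = selenium.GetText("//table[1]//tr[2]/td[3]");

        Console.WriteLine("{0} / {1}", cell, other);
        selenium.Stop();
    }
}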

0

Here is sample code using HtmlAgilityPack:

using System;
using System.Collections.Generic;
using System.Web;
using System.Xml.XPath;
using HtmlAgilityPack;

namespace TableRipper
{
    class Program
    {
        // Turn one row's cells into CSV-safe strings.
        static List<string> SerializeColumnSet(XPathNodeIterator columnSet)
        {
            List<string> serialized = new List<string>();
            while (columnSet.MoveNext())
            {
                string value = HttpUtility.HtmlDecode(columnSet.Current.Value.ToString().Trim());
                // Quote values containing commas or quotes, doubling embedded quotes.
                if (value.Contains(",") || value.Contains("\""))
                {
                    value = string.Concat('"', value.Replace("\"", "\"\""), '"');
                }
                serialized.Add(value);
            }
            return serialized;
        }

        // Load a page and extract the table matched by the XPath expression.
        static List<List<string>> RipTable(string url, string xpath, bool includeHeaders = true)
        {
            HtmlWeb web = new HtmlWeb();
            HtmlDocument document = web.Load(url);
            XPathNavigator navigator = document.CreateNavigator();
            XPathNodeIterator tableElementSet = navigator.Select(xpath);
            List<List<string>> table = new List<List<string>>();

            if (tableElementSet.MoveNext())
            {
                XPathNavigator tableElement = tableElementSet.Current;
                // Rows may live under <tbody> or directly under <table>.
                XPathNavigator tableBodyElement = tableElement.SelectSingleNode("tbody") ?? tableElement;
                XPathNodeIterator tableRowSet = tableBodyElement.Select("tr");
                bool hasRows = tableRowSet.MoveNext();

                if (hasRows)
                {
                    if (includeHeaders)
                    {
                        // Headers may come from <thead> or from <th> cells in the first row.
                        XPathNavigator tableHeadElement = tableElement.SelectSingleNode("thead");
                        XPathNodeIterator tableHeadColumnSet = null;

                        if (tableHeadElement != null)
                        {
                            tableHeadColumnSet = tableHeadElement.Select("tr/th");
                        }
                        else if ((tableHeadColumnSet = tableRowSet.Current.Select("th")).Count > 0)
                        {
                            hasRows = tableRowSet.MoveNext();
                        }

                        // Count check avoids emitting an empty header row.
                        if (tableHeadColumnSet != null && tableHeadColumnSet.Count > 0)
                        {
                            table.Add(SerializeColumnSet(tableHeadColumnSet));
                        }
                    }

                    if (hasRows)
                    {
                        do
                        {
                            table.Add(SerializeColumnSet(tableRowSet.Current.Select("td")));
                        }
                        while (tableRowSet.MoveNext());
                    }
                }
            }

            return table;
        }

        static void Main(string[] args)
        {
            // Usage: TableRipper <url> <table xpath>; prints the table as CSV.
            foreach (List<string> row in RipTable(args[0], args[1]))
            {
                Console.WriteLine(string.Join(",", row));
            }
        }
    }
}

Tested against:

http://www.msn.com "//table[@summary='Market Update']"

http://www.worldclimate.com/cgi-bin/data.pl?ref=N48W121+2200+450672C "//table[1]"

This is far from perfect; for example, it will not handle colspan or rowspan. But it is a start.

0

Source: https://habr.com/ru/post/908385/

