Scan a dynamic web page for data using scrapy

I am trying to get some data from official NBA statistics that will be used for some data analysis. I use scrapy as the main tool for scraping. However, after checking the elements of the webpage, I found that it was generated dynamically using javascript. I am completely unfamiliar with javascript and cannot understand how it works (that the js file is called, how it is loaded, which contains the data table, and is there an easier way to get the data). I also have found some json file on the network, and I have no idea how this is used.

http://stats.nba.com/teamLineups.html?TeamID=1610612739&pageNo=1&rowsPerPage=100&Season=2008-09&sortField=MIN&sortOrder=DES&PerMode=Per48

Anyone who can kindly guide me using the above URL and tell me how the website actually functions to load data and how they process the data in such a way that it displays in such a way?

The key part is still the question of how to get the data. I saw answers that use the POST method to return data (sorry I am not even familiar with GET / POST), but I still could not understand how this relates to this context.

Thanks for your generous guidance!

+4
source share
3 answers

Javascript , - - . javascript, , , . Firebug Firefox Chrome (ctrl + shift + J windows, cmd + opt + J ​​Mac). Chrome "", -.

, cleveland "2008-09", javascript . , , , , : http://stats.nba.com/stats/teamdashlineups?PlusMinus=N&pageNo=1&GroupQuantity=5&TeamID=1610612739&GameID=&Location=&SeasonType=Regular+Season&Season=2008-09&PaceAdjust=N&DateFrom=&sortOrder=DES&VsConference=&OpponentTeamID=0&DateTo=&GameSegment=&LastNGames=0&VsDivision=&LeagueID=00&Outcome=&GameScope=&MeasureType=Base&PerMode=Per48&sortField=MIN&SeasonSegment=&Period=0&Rank=N&Month=0&rowsPerPage=100

. LineupItem, scrapy crawl stats -o output.json.

import json
from scrapy.spider import Spider
from scrapy.http import Request
from nba.items import LineupItem
from urllib import urlencode


class StatsSpider(Spider):
    name = "stats"
    allowed_domains = ["stats.nba.com"]
    start_urls = (
        'http://stats.nba.com/',
        )

    def parse(self, response):
        return self.get_lineup('1610612739','2008-09')

    def get_lineup(self, team_id, season):
        params = {
            'Season':         season,
            'SeasonType':     'Regular Season',
            'LeagueID':       '00',
            'TeamID':         team_id,
            'MeasureType':    'Base',
            'PerMode':        'Per48',
            'PlusMinus':      'N',
            'PaceAdjust':     'N',
            'Rank':           'N',
            'Outcome':        '',
            'Location':       '',
            'Month':          '0',
            'SeasonSegment':  '',
            'DateFrom':       '',
            'DateTo':         '',
            'OpponentTeamID': '0',
            'VsConference':   '',
            'VsDivision':     '',
            'GameSegment':    '',
            'Period':         '0',
            'LastNGames':     '0',
            'GroupQuantity':  '5',
            'GameScope':      '',
            'GameID':         '',
            'pageNo':         '1',
            'rowsPerPage':    '100',
            'sortField':      'MIN',
            'sortOrder':      'DES'
        }
        return Request(
            url="http://stats.nba.com/stats/teamdashlineups?" + urlencode(params),
            dont_filter=True,
            callback=self.parse_lineup
        )

    def parse_lineup(self,response):
        data = json.loads(response.body)
        for lineup in data['resultSets'][1]['rowSet']:
            item = LineupItem()
            item['group_set'] = lineup[0]
            item['group_id'] = lineup[1]
            item['group_name'] = lineup[2]
            item['gp'] = lineup[3]
            item['w'] = lineup[4]
            item['l'] = lineup[5]
            item['w_pct'] = lineup[6]
            item['min'] = lineup[7]
            item['fgm'] = lineup[8]
            item['fga'] = lineup[9]
            item['fg_pct'] = lineup[10]
            item['fg3m'] = lineup[11]
            item['fg3a'] = lineup[12]
            item['fg3_pct'] = lineup[13]
            item['ftm'] = lineup[14]
            item['fta'] = lineup[15]
            item['ft_pct'] = lineup[16]
            item['oreb'] = lineup[17]
            item['dreb'] = lineup[18]
            item['reb'] = lineup[19]
            item['ast'] = lineup[20]
            item['tov'] = lineup[21]
            item['stl'] = lineup[22]
            item['blk'] = lineup[23]
            item['blka'] = lineup[24]
            item['pf'] = lineup[25]
            item['pfd'] = lineup[26]
            item['pts'] = lineup[27]
            item['plus_minus'] = lineup[28]
            yield item

json-, :

{"gp": 30, "fg_pct": 0.491, "group_name": "Ilgauskas,Zydrunas - James,LeBron - Wallace,Ben - West,Delonte - Williams,Mo", "group_set": "Lineups", "w_pct": 0.833, "pts": 103.0, "min": 484.9866666666667, "tov": 13.3, "fta": 21.6, "pf": 16.0, "blk": 7.7, "reb": 44.2, "blka": 3.0, "ftm": 16.6, "ft_pct": 0.771, "fg3a": 18.7, "pfd": 17.2, "ast": 23.3, "fg3m": 7.4, "fgm": 39.5, "fg3_pct": 0.397, "dreb": 32.0, "fga": 80.4, "plus_minus": 18.4, "stl": 8.3, "l": 5, "oreb": 12.3, "w": 25, "group_id": "980 - 2544 - 1112 - 2753 - 2590"}
+4

Scrapy javascript, javascript- - Python Scrapy , javascript ( URL- , ) script. - Firebug Firefox, Python Scrapy.

, , Selenium ( - ), javascript. Selenium, , ..


import requests
import json

# set request as GET
response = requests.get('http://stats.nba.com/stats/teamdashlineups?Season=2008-09&SeasonType=Regular+Season&LeagueID=00&TeamID=1610612739&MeasureType=Base&PerMode=Per48&PlusMinus=N&PaceAdjust=N&Rank=N&Outcome=&Location=&Month=0&SeasonSegment=&DateFrom=&DateTo=&OpponentTeamID=0&VsConference=&VsDivision=&GameSegment=&Period=0&LastNGames=0&GroupQuantity=5&GameScope=&GameID=&pageNo=1&rowsPerPage=100&sortField=MIN&sortOrder=DES')

# change json into dictionary
data =  json.loads(response.text)

#print data

import pprint

pprint.pprint(data)

for x in data['resultSets']:
    print x['rowSet']
+1

, , , .

, GET , , , " " . , "src", , , GET.

<script src="/js/libs/modernizr.custom.16166.js"></script>

JavaScript

jsfile.js:

function myFunction() {
//
//do stuff
//
}

myFunction();

nba ajax GET.

nba, , ajax GET " jquery.statrequest.js "" team-lineups.js", , .

, urllib, , .js JavaScript, nba.

- Mechanize, JavaScript.

I hope this gives you some idea of ​​what you wanted to know, I am not so familiar with the internal workings of browsers. You might want to find a website with a free API for nda game statistics.

Here is a link from nba that may be relevant to you.

0
source

Source: https://habr.com/ru/post/1548164/


All Articles