requests.get() not retrieving the correct URL in Python 2.7

Question

I'm trying to fetch a URL and then parse its contents based on tags. My code:

import requests
from lxml import html

page = requests.get('https://support.apple.com/downloads/')
self.tree = html.fromstring(page.content)
names = self.tree.xpath("//span[@class='truncate_name']//text()")

Problem: the variable page contains the data from 'https://support.apple.com/' rather than from the URL I requested. I'm new to Python 2.7 and to its encoding issues. I'm using unicode-escape as my default encoding. The encoding of the resource at https://support.apple.com/downloads/ is UTF-8, whereas the encoding of the resource at https://support.apple.com/ varies. Does this have something to do with the problem? Please suggest a solution.


| osx   | python-2.7   | python-requests   2016-09-29 05:09 1 Answers

Answers ( 1 )

  1. 2016-09-29 10:09

    It has nothing to do with encoding. What you are looking for is created dynamically, so it is not in the source you get back; a series of AJAX calls populates the data. To get the product names and other details from the carousel where you see span.truncate_name in your browser:

    params = {"page": "products",
              "locale": "en_US",
              "doctype": "DOWNLOADS",
              }
    js = requests.get("https://km.support.apple.com/kb/index", params=params).content
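
To see which request that params dict produces, the query string can be built offline with the standard library. This is only a sketch: build_url is a hypothetical helper, and the keys are sorted here purely to make the output predictable. requests appends the same key/value pairs to the URL, though possibly in a different order.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

BASE_URL = "https://km.support.apple.com/kb/index"


def build_url(base, params):
    """Return base with params appended as a query string (hypothetical helper)."""
    # Sorting the items keeps the result stable regardless of dict ordering.
    return base + "?" + urlencode(sorted(params.items()))


params = {"page": "products", "locale": "en_US", "doctype": "DOWNLOADS"}
url = build_url(BASE_URL, params)
print(url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw response before writing any parsing code.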
    

    Normally we could call .json() on the response object, but in this case we need to decode the content with "unicode_escape" and then call loads:

    from json import loads
    js2 = loads(js.decode("unicode_escape"))
    print(js2)
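
The answer does not say why .json() fails for this endpoint. One plausible cause, assumed for the sketch below, is that the payload contains backslash escapes that strict JSON does not allow. The helper parse_payload and the sample bytes are made up for illustration; the idea is to try strict parsing first and fall back to "unicode_escape" only when that fails.

```python
import json


def parse_payload(raw):
    """Parse JSON bytes, retrying with unicode_escape if strict parsing fails.

    Hypothetical helper for illustration; it is not part of requests.
    """
    try:
        return json.loads(raw.decode("utf-8"))
    except ValueError:
        # Strict JSON rejects escapes such as \x26, so resolve them first.
        return json.loads(raw.decode("unicode_escape"))


# Made-up sample containing a \x26 escape (an ampersand), which is invalid JSON.
sample = b'{"name": "Servers \\x26 Enterprise"}'
data = parse_payload(sample)
print(data["name"])
```

Trying the strict parse first matters because "unicode_escape" also turns an escaped quote inside a string into a bare quote, which can break a payload that was valid JSON to begin with.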
    

    Which gives you a huge dict of data like:

    {u'products': [{u'name': u'Servers and Enterprise', u'urlpath': u'serversandenterprise', u'order': u'', u'products': .............
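
Once the data is a dict, the product names can be collected by walking the nested "products" lists. The sketch below assumes each nested entry has the same shape as the top-level one, which the truncated output above does not confirm; iter_product_names and the "Example Child" entry are hypothetical.

```python
def iter_product_names(node):
    """Yield every 'name' found under nested 'products' lists."""
    for product in node.get("products") or []:
        if not isinstance(product, dict):
            continue  # skip anything that is not a product record
        if "name" in product:
            yield product["name"]
        # Recurse into child products, if any.
        for name in iter_product_names(product):
            yield name


# Sample shaped like the output above; the child entry is invented.
sample = {
    "products": [
        {
            "name": "Servers and Enterprise",
            "urlpath": "serversandenterprise",
            "order": "",
            "products": [{"name": "Example Child"}],
        }
    ]
}
names = list(iter_product_names(sample))
print(names)
```

With the real response, the same call would be list(iter_product_names(js2)).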
    

    You can see the request in the Network tab of Chrome's developer tools.

    We leave off callback:ACDownloadSearch.customCallBack as we want to get back valid JSON.
