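Scraping web data generated dynamically by jQuery: Shanghai-Hong Kong Stock Connect historical data on Eastmoney as an example
This post uses the Shanghai-Hong Kong Stock Connect (沪港通) historical data on Eastmoney (东方财富) as an example to show how to fetch web page data that jQuery generates dynamically.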
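1. What to scrape
Tushare only provides the net buy amount for Shanghai-Hong Kong Stock Connect (= buy turnover - sell turnover), but I also need the buy turnover and the sell turnover separately. Eastmoney happens to have both, as shown below:
Figure 1: Shanghai-Hong Kong Stock Connect historical data
However, the page numbers in the pager are not links to real URLs; each one is just href="javascript:", as shown below:
Figure 2: Page navigation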
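2. Problem analysis
The HTML for page 2 is <a target="_self" href="javascript:" ;="" data-page="2">2</a>. A javascript: URL means that clicking the link runs JavaScript instead of navigating to another address. Here the code after javascript: is empty, so the link itself does nothing.
That raises a question: the <a></a> element has nothing like onclick="js_method()" either, so which JS function runs and loads the data? Presumably the click handler is attached from a script (for example through jQuery, keyed on the data-page attribute) rather than inline in the HTML. Either way, the request it triggers is visible in the browser's developer tools, and that is all we need.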
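3. Scraping the data
The data we want is loaded by JavaScript. In the browser, right-click --> Inspect --> Network, and find the JSON data returned by the JS request.
Figure 3: The JS request for the Shanghai-Hong Kong Stock Connect historical data
The request shows up as get?callback=jQuery112309076618069356868_1612589090585&... and the complete Request URL is:
http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p=1&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)
(The callback name differs between the two captures because jQuery generates a fresh one on every page load.)
What the response actually does is call the function named in the callback parameter, for example jQuery112309076618069356868_1612589090585(), with the data as its argument. This pattern is known as JSONP. jQuery is a JavaScript library that greatly simplifies JS programming, and a callback is defined as follows:
A callback is a function that is passed as an argument to another function and is executed after its parent function has completed.
On the Headers tab you can see the Query String Parameters (p is the page number):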
callback: jQuery112304606706430005292_1612531660704
st: DetailDate
sr: -1
ps: 10
p: 1
type: HSGTHIS
token: 894050c76af8597a853f5b408b759f5d
js: {pages:(tp),data:(x)}
filter: (MarketType=1)
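To see how the Request URL maps onto the parameters listed above, here is a minimal sketch using only the standard library. The helper name url_for_page is made up for illustration. Note that re-encoding the query may escape some characters differently from the browser (for example the parentheses), which is normally equivalent but has not been checked against this server.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

# Request URL copied from the browser's Network tab (as shown above).
REQUEST_URL = (
    "http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get"
    "?callback=jQuery112304606706430005292_1612531660704"
    "&st=DetailDate&sr=-1&ps=10&p=1&type=HSGTHIS"
    "&token=894050c76af8597a853f5b408b759f5d"
    "&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)"
)


def url_for_page(url, page):
    """Return `url` with its `p` (page number) parameter replaced.

    Hypothetical helper, not part of the original post.
    """
    parts = urlsplit(url)
    params = parse_qs(parts.query)         # values are lists, e.g. {'p': ['1']}
    params["p"] = [str(page)]
    query = urlencode(params, doseq=True)  # doseq=True flattens the lists again
    return urlunsplit(parts._replace(query=query))


params = parse_qs(urlsplit(REQUEST_URL).query)
print(params["js"][0])      # percent-decoded value of the js parameter
print(params["filter"][0])  # percent-decoded value of the filter parameter
print(url_for_page(REQUEST_URL, 2))
```

The post itself takes the simpler route of putting a {page} placeholder into the URL string, which works just as well here.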
The Response is:
jQuery112304606706430005292_1612531660704({pages:145,data:[{"MarketType":1.0,"DetailDate":"2021-02-05T00:00:00","DRZJLR":3962.18,"DRYE":48037.82,"LSZJLR":636794.880000001,"DRCJJME":2715.71,"MRCJE":28084.09,"MCCJE":25368.38,"LCGCode":"600704","LCG":"物产中大","LCGZDF":10.0962,"SSEChange":3496.33,"SSEChangePrecent":-0.00157916078883799},{"MarketType":1.0,"DetailDate":"2021-02-04T00:00:00","DRZJLR":2369.05,"DRYE":49630.95,"LSZJLR":634079.170000001,"DRCJJME":1069.98,"MRCJE":28478.16,"MCCJE":27408.18,"LCGCode":"600583","LCG":"海油工程","LCGZDF":10.0467,"SSEChange":3501.86,"SSEChangePrecent":-0.00439256136081261},{"MarketType":1.0,"DetailDate":"2021-02-03T00:00:00","DRZJLR":1923.2,"DRYE":50076.8,"LSZJLR":633009.190000001,"DRCJJME":566.299999999999,"MRCJE":28155.03,"MCCJE":27588.73,"LCGCode":"600596","LCG":"新安股份","LCGZDF":10.0379,"SSEChange":3517.31,"SSEChangePrecent":-0.00463256435217674},{"MarketType":1.0,"DetailDate":"2021-02-02T00:00:00","DRZJLR":1684.64,"DRYE":50315.36,"LSZJLR":632442.890000001,"DRCJJME":298.84,"MRCJE":27462.14,"MCCJE":27163.3,"LCGCode":"600803","LCG":"新奥股份","LCGZDF":10.0248,"SSEChange":3533.68,"SSEChangePrecent":0.00810206317326993},{"MarketType":1.0,"DetailDate":"2021-02-01T00:00:00","DRZJLR":4212.37,"DRYE":47787.63,"LSZJLR":632144.050000001,"DRCJJME":2846.62,"MRCJE":26253.6,"MCCJE":23406.98,"LCGCode":"600740","LCG":"山西焦化","LCGZDF":10.0318,"SSEChange":3505.28,"SSEChangePrecent":0.00637655861065096},{"MarketType":1.0,"DetailDate":"2021-01-29T00:00:00","DRZJLR":1922.0,"DRYE":50078.0,"LSZJLR":629297.430000001,"DRCJJME":485.43,"MRCJE":27270.84,"MCCJE":26785.41,"LCGCode":"600970","LCG":"中材国际","LCGZDF":10.0637,"SSEChange":3483.07,"SSEChangePrecent":-0.00630780730233531},{"MarketType":1.0,"DetailDate":"2021-01-28T00:00:00","DRZJLR":-2238.38,"DRYE":54238.38,"LSZJLR":628812.000000001,"DRCJJME":-3660.4,"MRCJE":23665.29,"MCCJE":27325.69,"LCGCode":"601216","LCG":"内蒙君正","LCGZDF":10.1031,"SSEChange":3505.18,"SSEChangePrecent":-0.0190745912787477},{"MarketType":1.0,"DetailDate":"2021-01-27T00:00:00","DRZJLR":226.730000000003,"DRYE":51773.27,"LSZJLR":632472.400000001,"DRCJJME":-1231.26,"MRCJE":25394.48,"MCCJE":26625.74,"LCGCode":"600143","LCG":"金发科技","LCGZDF":10.0114,"SSEChange":3573.34,"SSEChangePrecent":0.00109541299311103},{"MarketType":1.0,"DetailDate":"2021-01-26T00:00:00","DRZJLR":-1547.56,"DRYE":53547.56,"LSZJLR":633703.660000001,"DRCJJME":-2916.14,"MRCJE":26748.18,"MCCJE":29664.32,"LCGCode":"603687","LCG":"大胜达","LCGZDF":10.0327,"SSEChange":3569.43,"SSEChangePrecent":-0.0151231706509503},{"MarketType":1.0,"DetailDate":"2021-01-25T00:00:00","DRZJLR":2456.61,"DRYE":49543.39,"LSZJLR":636619.800000001,"DRCJJME":1626.93,"MRCJE":33213.59,"MCCJE":31586.66,"LCGCode":"600516","LCG":"方大炭素","LCGZDF":10.0629,"SSEChange":3624.24,"SSEChangePrecent":0.00484924100644619}]})
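Now it is straightforward. The plan: request the Request URL with the page parameter p running from 1 to 145 (the pages value in the Response), then extract the data we want from each Response.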
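The upper bound 145 comes from the pages:145 field at the start of the Response, and it will grow as new trading days are added. As a rough sketch, it could be read from the text instead of being hard-coded. The function name total_pages and the shortened sample string are made up for illustration.

```python
import re

# Shortened stand-in for one response body; the record list is left empty here.
sample = 'jQuery112304606706430005292_1612531660704({pages:145,data:[]})'


def total_pages(text):
    """Read the page count from the `pages:<n>` field of the wrapped response."""
    match = re.search(r"\bpages:(\d+)", text)
    if match is None:
        raise ValueError("no 'pages:' field found in response")
    return int(match.group(1))


print(total_pages(sample))  # 145
```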
Step 1: Request the URL to get the Response shown above
import requests
request_url = 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p={page}&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)'
response = requests.get(request_url.format(page=1))  # page=1 fetches the first page
print(response.text)
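Step 2: Extract the data and parse it as JSON
response.text is a string. It cannot be passed to json.loads as a whole, because of the callback wrapper and because the outer keys pages and data are unquoted. The data we care about sits inside the square brackets [...] after data:, so use a regular expression to pull out the [...] part:
import re
s = re.findall(r'\[.*?\]', response.text)[0]  # non-greedy: from the first '[' to the first ']'
Parse the resulting substring as JSON:
import json
l = json.loads(s)  # returns a list of dicts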
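The non-greedy regular expression stops at the first ']', which is fine for this Response but would cut the array short if a field value ever contained that character. One alternative, sketched here and not taken from the original post, is to let the JSON decoder find the end of the array with json.JSONDecoder.raw_decode. The function name extract_records is made up, and the sample is a trimmed copy of the Response above with only a few fields kept.

```python
import json


def extract_records(text):
    """Parse the JSON array that follows `data:` in the wrapped response."""
    start = text.index("data:") + len("data:")
    # raw_decode parses exactly one JSON value starting at `start` and ignores
    # whatever follows, so the trailing `})` of the wrapper does not matter.
    records, _end = json.JSONDecoder().raw_decode(text, start)
    return records


# Trimmed sample: two records, with a subset of the real fields.
sample = (
    'jQuery112304606706430005292_1612531660704({pages:145,data:['
    '{"DetailDate":"2021-02-05T00:00:00","MRCJE":28084.09,"MCCJE":25368.38,"LCG":"物产中大"},'
    '{"DetailDate":"2021-02-04T00:00:00","MRCJE":28478.16,"MCCJE":27408.18,"LCG":"海油工程"}]})'
)

records = extract_records(sample)
print(len(records), records[0]["MRCJE"])
```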
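Step 3: Extract the desired data
json.loads(s) returns a list in which each element is a dict holding one day's record. On the web page the first record is displayed as follows (the JSON keys in parentheses are matched by value against the Response above):
Date (DetailDate): 2021-02-05
Net buy amount for the day (DRCJJME): 27.16亿元
Buy turnover (MRCJE): 280.84亿元
Sell turnover (MCCJE): 253.68亿元
Cumulative net buy amount (LSZJLR): 6367.95亿元
Capital inflow for the day (DRZJLR): 39.62亿元
Remaining balance for the day (DRYE): 480.38亿元
Top gainer (LCG): 物产中大
Top gainer change (LCGZDF): 10.10%
SSE Composite Index (SSEChange): 3496.33
Index change (SSEChangePrecent): -0.16%
The page shows amounts in 亿元 (100 million yuan) while the JSON uses million yuan, so 280.84亿元 corresponds to MRCJE = 28084.09.
Here we only extract the buy turnover and the sell turnover. The key code is: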
# headers and the datetime import are defined in the full code in section 4
lists = [['trade_date', 'hgt_in', 'hgt_out', 'hgt_net', 'hgt_total']]  # unit: million
for i in range(1, 146):
    response = requests.get(request_url.format(page=i), headers=headers)
    s = re.findall(r'\[.*?\]', response.text)[0]  # Extract substrings between Square brackets
    for d in json.loads(s):  # json.loads(s) returns a list of dicts
        #{'MarketType': 1.0, 'DetailDate': '2021-02-05T00:00:00', 'DRZJLR': 3962.18, 'DRYE': 48037.82, 'LSZJLR': 636794.880000001, 'DRCJJME': 2715.71, 'MRCJE': 28084.09, 'MCCJE': 25368.38, 'LCGCode': '600704', 'LCG': '物产中大', 'LCGZDF': 10.0962, 'SSEChange': 3496.33, 'SSEChangePrecent': -0.00157916078883799}
        # change the format of trade_date
        dt_trade_date = datetime.datetime.strptime(d['DetailDate'], '%Y-%m-%dT%H:%M:%S')
        trade_date = dt_trade_date.strftime('%Y%m%d')
        hgt_in = d['MRCJE']
        hgt_out = d['MCCJE']
        print(trade_date)
        lists.append([trade_date, hgt_in, hgt_out, hgt_in-hgt_out, hgt_in+hgt_out])
Done :-)
The resulting data looks like this (unit: million yuan):
trade_date,hgt_in,hgt_out,hgt_net,hgt_total
20210205,28084.09,25368.38,2715.709999999999,53452.47
20210204,28478.16,27408.18,1069.9799999999996,55886.34
20210203,28155.03,27588.73,566.2999999999993,55743.759999999995
20210202,27462.14,27163.3,298.84000000000015,54625.44
20210201,26253.6,23406.98,2846.619999999999,49660.58
20210129,27270.84,26785.41,485.4300000000003,54056.25
20210128,23665.29,27325.69,-3660.399999999998,50990.979999999996
20210127,25394.48,26625.74,-1231.260000000002,52020.22
20210126,26748.18,29664.32,-2916.1399999999994,56412.5
20210125,33213.59,31586.66,1626.9299999999967,64800.25
...
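Values such as 2715.709999999999 in the listing above are ordinary floating-point noise from the subtraction. The sketch below rounds the derived columns to two decimals and cross-checks the computed net against the DRCJJME field, which appears to be the site's own net buy figure. The helper name record_to_row and the rounding are additions for illustration; the code in this post writes the unrounded values.

```python
import datetime
import math


def record_to_row(d):
    """Turn one record (a dict) into [trade_date, in, out, net, total]."""
    day = datetime.datetime.strptime(d["DetailDate"], "%Y-%m-%dT%H:%M:%S")
    hgt_in, hgt_out = d["MRCJE"], d["MCCJE"]
    # Rounding is an addition in this sketch; it hides float noise in the
    # derived columns and leaves the two source values untouched.
    return [day.strftime("%Y%m%d"), hgt_in, hgt_out,
            round(hgt_in - hgt_out, 2), round(hgt_in + hgt_out, 2)]


# First record of the Response above, reduced to the fields used here.
record = {"DetailDate": "2021-02-05T00:00:00", "DRCJJME": 2715.71,
          "MRCJE": 28084.09, "MCCJE": 25368.38}
row = record_to_row(record)
print(row)

# Cross-check: the computed net should agree with DRCJJME to within a cent.
assert math.isclose(row[3], record["DRCJJME"], abs_tol=0.01)
```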
4. Full code
The code is short, so I did not bother putting it on GitHub; it is pasted right here at the end of the post:
#!/usr/bin/env python3
import requests
import json
import re
import csv
import datetime


def main():
    # Step 1: Extract data [trade_date, hgt_in, hgt_out, hgt_in-hgt_out, hgt_in+hgt_out]
    # 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p=1&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)'
    request_url = 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?callback=jQuery112304606706430005292_1612531660704&st=DetailDate&sr=-1&ps=10&p={page}&type=HSGTHIS&token=894050c76af8597a853f5b408b759f5d&js=%7Bpages%3A(tp)%2Cdata%3A(x)%7D&filter=(MarketType%3D1)'
    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Mobile Safari/537.36'}
    lists = [['trade_date', 'hgt_in', 'hgt_out', 'hgt_net', 'hgt_total']]  # unit: million
    for i in range(1, 146):  # 145 pages at the time of writing
        response = requests.get(request_url.format(page=i), headers=headers)
        s = re.findall(r'\[.*?\]', response.text)[0]  # Extract substrings between Square brackets
        for d in json.loads(s):  # json.loads(s) returns a list of dicts
            #{'MarketType': 1.0, 'DetailDate': '2021-02-05T00:00:00', 'DRZJLR': 3962.18, 'DRYE': 48037.82, 'LSZJLR': 636794.880000001, 'DRCJJME': 2715.71, 'MRCJE': 28084.09, 'MCCJE': 25368.38, 'LCGCode': '600704', 'LCG': '物产中大', 'LCGZDF': 10.0962, 'SSEChange': 3496.33, 'SSEChangePrecent': -0.00157916078883799}
            # change the format of trade_date
            dt_trade_date = datetime.datetime.strptime(d['DetailDate'], '%Y-%m-%dT%H:%M:%S')
            trade_date = dt_trade_date.strftime('%Y%m%d')
            hgt_in = d['MRCJE']
            hgt_out = d['MCCJE']
            print(trade_date)
            lists.append([trade_date, hgt_in, hgt_out, hgt_in-hgt_out, hgt_in+hgt_out])
    # Step 2: save to file
    # newline='' is what the csv module expects; it avoids blank lines between rows on Windows
    with open('hgt_in_out.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(lists)


if __name__ == '__main__':
    main()
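The loop above hard-codes 145 pages. As a rough sketch of one way to avoid that, the page count can be taken from the first Response and the remaining pages fetched from there. The names scrape_all, fetch, delay and fake_fetch are all hypothetical; fetch is any callable that maps a URL to the response text, which lets the loop be tried offline with a fake two-page server.

```python
import json
import re
import time


def scrape_all(fetch, url_template, delay=0.0):
    """Collect the records from every page.

    `fetch` is a hypothetical hook: any callable that takes a URL and returns
    the response text. `url_template` must contain a `{page}` placeholder.
    """
    first = fetch(url_template.format(page=1))
    pages = int(re.search(r"\bpages:(\d+)", first).group(1))
    records = []
    for page in range(1, pages + 1):
        # Reuse the first response instead of requesting page 1 twice.
        text = first if page == 1 else fetch(url_template.format(page=page))
        start = text.index("data:") + len("data:")
        page_records, _end = json.JSONDecoder().raw_decode(text, start)
        records.extend(page_records)
        time.sleep(delay)  # optional pause between requests
    return records


def fake_fetch(url):
    """Offline stand-in for the server: two pages, one record each."""
    page = int(re.search(r"[?&]p=(\d+)", url).group(1))
    return 'cb({pages:2,data:[{"page":%d}]})' % page


result = scrape_all(fake_fetch, "http://example.invalid/get?ps=10&p={page}")
print(result)
```

With requests, fetch could be something like lambda url: requests.get(url, headers=headers, timeout=10).text, using the request_url and headers from the code above.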