xpath和lxml结合起来使用
一、往上复制一段html
到本地文件中
<table class="tablelist" cellpadding="0" cellspacing="0">
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=44203&keywords=&tid=87&lid=0">23295-互娱数据营销平台DMP后台开发工程师(深圳)</a><span
class="hot"> </span></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2018-09-13</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=44205&keywords=&tid=87&lid=0">27092-应用宝高级算法工程师(深圳)</a><span
class="hot"> </span></td>
<td>技术类</td>
<td>2</td>
<td>深圳</td>
<td>2018-09-13</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=44200&keywords=&tid=87&lid=0">27193-互娱Web前端多媒体互动开发工程师(深圳)</a>
</td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2018-09-13</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=44189&keywords=&tid=87&lid=0">21062-移动游戏高级后台开发工程师(深圳)</a><span
class="hot"> </span></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2018-09-13</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=44194&keywords=&tid=87&lid=0">SA-腾讯社交广告客户端开发工程师(深圳)</a>
</td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2018-09-13</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=44198&keywords=&tid=87&lid=0">18428-银行业务测试工程师(深圳)</a>
</td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2018-09-13</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=44183&keywords=&tid=87&lid=0">EC-iOS开发工程师(深圳)</a><span
class="hot"> </span></td>
<td>技术类</td>
<td>1</td>
<td>深圳</td>
<td>2018-09-13</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=44190&keywords=&tid=87&lid=0">SNG02-腾讯智慧教育后台研发工程师(北京)</a>
</td>
<td>技术类</td>
<td>1</td>
<td>北京</td>
<td>2018-09-13</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=44178&keywords=&tid=87&lid=0">15575-《王者荣耀》高级引擎开发工程师(成都)</a><span
class="hot"> </span></td>
<td>技术类</td>
<td>2</td>
<td>成都</td>
<td>2018-09-13</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=44180&keywords=&tid=87&lid=0">WXG02-116
微信小程序专项测试工程师(广州)</a></td>
<td>技术类</td>
<td>1</td>
<td>广州</td>
<td>2018-09-13</td>
</tr>
</table>
二、基本的操作使用
1、获取全部的
tr
from lxml import etree html = etree.parse('./demo1.html', etree.HTMLParser(encoding='utf8')) # 获取全部的tr标签 trs = html.xpath('//tr') for tr in trs: # 打印字符出来 print(etree.tostring(tr, encoding='utf8').decode('utf8')) print('-' * 30)
2、获取第二个
tr
tr = html.xpath('//tr')[2]
3、获取小于4的
tr
trs = html.xpath('//tr[position() < 4]') for tr in trs: print(etree.tostring(tr, encoding='utf8').decode('utf8')) print('-' * 30)
4、获取
tr
标签class='odd'
的全部节点trs = html.xpath('//tr[@class="odd"]') for tr in trs: print(etree.tostring(tr, encoding='utf8').decode('utf8')) print('-' * 30)
5、获取全部
a
标签且带href
属性的a_list = html.xpath('//a[@href]') for item in a_list: print(etree.tostring(item, encoding='utf8').decode('utf8')) print('-' * 30)
6、获取倒数第二个元素
trs = html.xpath('//tr[last() - 1]') # trs = html.xpath('//tr')[-2:-1] 这样也可以的 for tr in trs: print(etree.tostring(tr, encoding='utf8').decode('utf8')) print('-' * 30)
7、多级别获取
# 先获取全部的tr,再获取tr里面最后一个td trs = html.xpath('//tr') for tr in trs: tds = tr.xpath('./td[last()]') for td in tds: print(etree.tostring(td, encoding='utf8').decode('utf8')) print('-' * 30)
8、获取纯文本
trs = html.xpath('//tr') for tr in trs: tds = tr.xpath('./td[last()]/text()') for td in tds: print(td) print('-' * 30)
9、获取节点属性
links = html.xpath('//tr//a/@href') for link in links: print(link)
10、使用
get(属性)
来获取属性值