xpath和lxml结合起来使用

一、往上复制一段html到本地文件中

<table class="tablelist" cellpadding="0" cellspacing="0">
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44203&keywords=&tid=87&lid=0">23295-互娱数据营销平台DMP后台开发工程师(深圳)</a><span
                class="hot">&nbsp;</span></td>
        <td>技术类</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44205&keywords=&tid=87&lid=0">27092-应用宝高级算法工程师(深圳)</a><span
                class="hot">&nbsp;</span></td>
        <td>技术类</td>
        <td>2</td>
        <td>深圳</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44200&keywords=&tid=87&lid=0">27193-互娱Web前端多媒体互动开发工程师(深圳)</a>
        </td>
        <td>技术类</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44189&keywords=&tid=87&lid=0">21062-移动游戏高级后台开发工程师(深圳)</a><span
                class="hot">&nbsp;</span></td>
        <td>技术类</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44194&keywords=&tid=87&lid=0">SA-腾讯社交广告客户端开发工程师(深圳)</a>
        </td>
        <td>技术类</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44198&keywords=&tid=87&lid=0">18428-银行业务测试工程师(深圳)</a>
        </td>
        <td>技术类</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44183&keywords=&tid=87&lid=0">EC-iOS开发工程师(深圳)</a><span
                class="hot">&nbsp;</span></td>
        <td>技术类</td>
        <td>1</td>
        <td>深圳</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44190&keywords=&tid=87&lid=0">SNG02-腾讯智慧教育后台研发工程师(北京)</a>
        </td>
        <td>技术类</td>
        <td>1</td>
        <td>北京</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="even">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44178&keywords=&tid=87&lid=0">15575-《王者荣耀》高级引擎开发工程师(成都)</a><span
                class="hot">&nbsp;</span></td>
        <td>技术类</td>
        <td>2</td>
        <td>成都</td>
        <td>2018-09-13</td>
    </tr>
    <tr class="odd">
        <td class="l square"><a target="_blank" href="position_detail.php?id=44180&keywords=&tid=87&lid=0">WXG02-116
            微信小程序专项测试工程师(广州)</a></td>
        <td>技术类</td>
        <td>1</td>
        <td>广州</td>
        <td>2018-09-13</td>
    </tr>
</table>

二、基本的操作使用

  • 1、获取全部的tr

    from lxml import etree
    
    html = etree.parse('./demo1.html', etree.HTMLParser(encoding='utf8'))
    
    # 获取全部的tr标签
    trs = html.xpath('//tr')
    for tr in trs:
        # 打印字符出来
        print(etree.tostring(tr, encoding='utf8').decode('utf8'))
        print('-' * 30)
    
  • 2、获取第二个tr

    tr = html.xpath('//tr')[2]
    
  • 3、获取小于4的tr

    trs = html.xpath('//tr[position() < 4]')
    for tr in trs:
        print(etree.tostring(tr, encoding='utf8').decode('utf8'))
        print('-' * 30)
    
  • 4、获取tr标签class='odd'的全部节点

    trs = html.xpath('//tr[@class="odd"]')
    for tr in trs:
        print(etree.tostring(tr, encoding='utf8').decode('utf8'))
        print('-' * 30)
    
  • 5、获取全部a标签且带href属性的

    a_list = html.xpath('//a[@href]')
    for item in a_list:
        print(etree.tostring(item, encoding='utf8').decode('utf8'))
        print('-' * 30)
    
  • 6、获取倒数第二个元素

    trs = html.xpath('//tr[last() - 1]')
    # trs = html.xpath('//tr')[-2:-1] 这样也可以的
    for tr in trs:
        print(etree.tostring(tr, encoding='utf8').decode('utf8'))
        print('-' * 30)
    
  • 7、多级别获取

    # 先获取全部的tr,再获取tr里面最后一个td
    trs = html.xpath('//tr')
    for tr in trs:
        tds = tr.xpath('./td[last()]')
        for td in tds:
            print(etree.tostring(td, encoding='utf8').decode('utf8'))
            print('-' * 30)
    
  • 8、获取纯文本

    trs = html.xpath('//tr')
    for tr in trs:
        tds = tr.xpath('./td[last()]/text()')
        for td in tds:
            print(td)
            print('-' * 30)
    
  • 9、获取节点属性

    links = html.xpath('//tr//a/@href')
    for link in links:
        print(link)
    
  • 10、使用get(属性)来获取属性值

results matching ""

    No results matching ""