A Python function for quickly filtering specified HTML tags


"""
@author: MR.N
@created: 2022/3/30 Wed.
@version: 1.0
"""
 
import io
import re
 
 
def filter_html_tags(text):
    """Remove common HTML tags from ``text`` and collapse blank lines."""
    htmltags = ['div', 'ul', 'li', 'ol', 'p', 'span', 'form', 'br',
                'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
                'hr', 'input',
                'title', 'table', 'tbody', 'a',
                'i', 'strong', 'b', 'big', 'small', 'u', 's', 'strike',
                'img', 'center', 'dl', 'dt', 'font', 'em',
                'code', 'pre', 'link', 'meta', 'iframe', 'ins']
    blocktags = ['script', 'style']
    tabletags = ['tr', 'th', 'td']
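    # Remove <script> and <style> blocks together with their contents
    # before stripping individual tags.
    for block in blocktags:
        re_block = re.compile(rf'<\s*{block}\b[^>]*>[\S\s]*?<\s*/\s*{block}\s*>', re.I)
        text = re_block.sub('', text)
    for tag in htmltags:
        # Strip opening tags (with any attributes) and closing tags.
        # \b keeps a short name such as 's' from also matching <script> or <span>.
        text = re.sub(rf'<{tag}\b[^<>]*>', '', text, flags=re.I)
        text = re.sub(rf'</{tag}\s*>', '', text, flags=re.I)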

    # Table markup is handled line by line so that the cells of a row
    # end up joined on a single line.
    buffer = io.StringIO(text)
    text = ''
    for line in buffer:
        for tag in tabletags:
            if re.search(rf'</?{tag}\b', line, flags=re.I):
                # drop the newline so the next line continues this one
                line = line.replace('\n', '')
                line = re.sub(rf'<{tag}\b[^<>]*>', '', line, flags=re.I)
                line = re.sub(rf'</{tag}\s*>', '', line, flags=re.I)
                # remove every space on lines that held table markup
                # (note: this also joins space-separated words)
                line = line.replace(' ', '')
        text += line
 
    # filter multiple empty lines
    while '\n\n' in text:
        text = text.replace("\n\n", '\n')
    return text
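
Regex-based filtering like the function above only handles the tags it lists and can be confused by unusual markup. As a point of comparison, here is a minimal sketch of a different technique: the standard library's `html.parser.HTMLParser`, which drops every tag and skips the contents of `<script>` and `<style>`. The names `TextExtractor` and `strip_tags` are hypothetical and not part of the original post, and unlike `filter_html_tags` this sketch does not join table rows or collapse blank lines.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    SKIPPED = {'script', 'style'}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # tag names arrive lowercased from the parser
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html_text):
    """Return the text of html_text with all tags removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return ''.join(parser.parts)


if __name__ == '__main__':
    sample = '<p>Hello <b>world</b></p><script>var x = "<div>";</script>'
    print(strip_tags(sample))
```

Because the parser tokenizes the markup itself, there is no tag list to maintain; the trade-off is that every tag is removed, not only a chosen subset.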


Category: Python programming
Published: 2023-08-15 00:13:00
Link: https://www.yelongauto.com/PYTHONbiancheng/2072.html
