Java Learning Path

0 votes

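UnicodeEncodeError: character maps to <undefined>

I'm having trouble running this code with Python 3.4 in PyCharm. The program stops when I pass the variable html_text to BeautifulSoup (I'm using BeautifulSoup4). The error message is: UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 52793: character maps to <undefined> Why does this happen, and how can I fix it?...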

python python-3.x beautifulsoup
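The asker's code is not visible in this excerpt, so the following is only a minimal sketch of how this error message can arise and two common ways around it. It assumes the text is being encoded to cp1252 (a frequent Windows console encoding); the sample string and the variable names are hypothetical.

```python
import io

# Hypothetical sample text; U+FFFD is the character named in the error message.
text = "caf\u00e9 \ufffd"

# cp1252 has no mapping for U+FFFD, so a strict encode raises UnicodeEncodeError.
try:
    text.encode("cp1252")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)

# Option 1: substitute unencodable characters with "?" instead of raising.
safe_bytes = text.encode("cp1252", errors="replace")

# Option 2: write through a text stream that uses UTF-8, which can encode
# every Unicode code point.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="utf-8")
stream.write(text)
stream.flush()
utf8_bytes = buffer.getvalue()
```

Option 1 loses information but keeps the target encoding; option 2 keeps every character but requires whatever reads the output to expect UTF-8.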
59 votes

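UnicodeEncodeError: 'ascii' codec can't encode character with special name [duplicate]

This question already has answers here: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128) (19 answers) My Python (ver 2.7) script runs fine and gets some company names from local HTML files, but when it comes to one particular country name it gives this error “U...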

python unicode encoding beautifulsoup ascii
0 votes

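UnicodeEncodeError: 'charmap' codec can't encode character u'\xfd' in Python 2.7

I download company names from different websites from my localhost. Sometimes I run into this problem, which interrupts the download program. My script works fine for other countries, but this type of error occurs when I download the Czech Republic. Total companies processed so far: 0 Traceback (most recent call last): File "process1.py", line 261, print "Company name:" hit.text File "C:\Python27\lib\encodings\...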

mysql python-2.7 beautifulsoup lxml command-prompt
0 votes

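Encoding error with Python and BeautifulSoup

I get the error: File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input, self.errors, encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u0106' in position 73: character maps to <undefined> Here is my code: import reque...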

python encoding beautifulsoup python-3.4
1 votes

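How do I strip <p> tags with BeautifulSoup and pass the text back into the soup?

I'm trying to replace any <p> tags with their contents in my soup. This is in the middle of other processing I'm doing with BeautifulSoup. This is slightly different from a similar question on extracting the text. Sample input: ... </p> ... <p>Here is some text</p> ... and some more Desired out...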

python beautifulsoup
0 votes

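BeautifulSoup exception when parsing HTML tags

I'm extracting some information from HTML files. But some files don't have the tag <p class="p p1"> date </p>, which returns AttributeError: 'NoneType' object has no attribute 'strip', and in some files the date is not inside that tag. One I found is: <time content="2005-11-...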

python web-scraping beautifulsoup tags html-parsing
34 votes

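BeautifulSoup: just get inside of a tag, no matter how many enclosing tags there are

I'm trying to scrape all the inner HTML from the <p> elements in a web page using BeautifulSoup. There are internal tags, but I don't care, I just want to get the internal text. For example, for: <p>Red</p> <p><i>Blue</i></p> <p>Yellow</p> <p>Light <b>g...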

python beautifulsoup
10 votes

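BeautifulSoup sibling structure with br tags

I'm trying to parse an HTML document using the BeautifulSoup Python library, but the structure is getting distorted by <br> tags. Let me give an example. Input HTML: <div> some text <br> <span> some more text </span> <br> <span> and more text &lt...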

python beautifulsoup
1 votes

Extract data between span tags using BeautifulSoup in Python

I want to extract the data between span tags. Here is a sample of the HTML code: <p> <span class="html-italic">3-Acetyl-</span> <span class="html-italic">(4-acetyl-5-(β</span> "-&quo...

python beautifulsoup
4 votes

BeautifulSoup: get only the "regular" text in a td tag, without anything in nested tags

Say my HTML looks like this: <td>Potato1 <span somestuff...>Potato2</span></td> ... <td>Potato9 <span somestuff...>Potato10</span></td> I have BeautifulSoup doing this: for tag in soup.fin...

python beautifulsoup
0 votes

Extract text between link tags using BeautifulSoup in Python

I have HTML code like the following: <a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style=&quo...

python html web-scraping beautifulsoup
1 votes

Extract text from a span class tag using BeautifulSoup

I'm trying to extract some text elements from a span class on a website. Here is a snippet of the HTML code: <h1>2 Some address</h1> </div> <div id="smi-summary-items"> <div id=&quot...

html web-scraping beautifulsoup html-parsing
1 votes

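BeautifulSoup not extracting the text of a specific tag

I'm having trouble collecting information from a specific tag using BeautifulSoup. I want to extract the text of "Item 4" between the html tags, but the code below gets the text related to "Item 1". What am I doing wrong (e.g., slicing)? Code: primary_detail = page_section.findAll('div', {'class': 'detail-item'}) for item_4 in page_secti...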

python web-scraping beautifulsoup
1 votes

Extract text without tags using BeautifulSoup

I have HTML like the following to parse, and I'm trying to extract the text in the same order. <b> <i> Data </i> Data Summary </b> Data Description <pre>Data paragraph which contains huge string</pre> <pre></pr...

python beautifulsoup python-requests
15 votes

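Extract text between line breaks (e.g. <br /> tags) using BeautifulSoup

I have the following HTML inside a larger document: Important Text 1 Not Important Text Important Text 2 Important Text 3 Non Important Text Important Text 4 I'm currently using BeautifulSoup to get other elements in the HTML, but I can't find a way to get the important lines of text between the tags. I can isolate and...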

python html html-parsing beautifulsoup
1 votes

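Extract text between tags using BeautifulSoup

I'm trying to use BeautifulSoup to extract text from a series of web pages that all follow a similar format. The HTML for the text I want to extract is below. The actual link is here: http://www.p2016.org/ads1/bushad120215.html . <p><span style="color: rgb(153, 153, 153);"></span><f...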

python regex web-scraping beautifulsoup bs4
0 votes

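Extract text between two text strings from a web page using BeautifulSoup and Python

There's lots of material on BeautifulSoup, but I can't find anything that solves this... I want to extract the text between two bits of HTML by specifying the bits of text before and after it in code. I can do this with the Outwit Python module, but this time I need to use BeautifulSoup... The bit of the page I want is the username below: <a class="generic_class" href="/people/us...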

python python-2.7 csv web-scraping beautifulsoup
0 votes

BeautifulSoup doesn't get the complete extracted class

I'm trying to extract data from craigslist using BeautifulSoup. As a preliminary test, I wrote the following: import urllib2 from bs4 import BeautifulSoup, NavigableString link = 'http://boston.craigslist.org/search/jjj/index100.html' print link soup = B...

beautifulsoup web-crawler
2 votes

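Preserve HTML file structure after modifying it with BeautifulSoup

I use Python and BeautifulSoup to find and replace some text on an HTML page. My problem is that I need to keep the file structure (indentation, whitespace, newlines, etc.) unchanged and change only the required elements. How can I do that? Both str(soup) and soup.prettify() alter the source file in many ways. P.S. Sample code: soup = BeautifulSoup(text) for element in soup...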

python beautifulsoup
1 votes

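Split a BeautifulSoup tree into two soup trees

There are multiple ways to split a BeautifulSoup parse tree: get a list of elements, or get the strings of tags. But there seems to be no way to keep the tree intact while splitting. I want to split the snippet (soup) below on [...]. That is trivial with strings, but I want to keep the structure; I want a list of parse trees. s="""<p> foo <a href="http://...html" ta...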

python html beautifulsoup
0 votes

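.text in BeautifulSoup prints data/information twice

I'm getting started with BeautifulSoup and trying to understand the difference between the text and string attributes of soup objects. Here is the HTML code I'm using: html_doc = """ <html> <body> <table> <tr> <td></td&...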

html python-3.x beautifulsoup
16 votes

Unicode error when writing Python script output to a file

Here is the code: print '"' + title.decode('utf-8', errors='ignore') + '",' \ ' "' + title.decode('utf-8', errors='ignore') + '", ' \ '"' + desc.decode('utf-8', errors='ignore...

python unicode beautifulsoup
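The question's code is truncated and the error itself is not shown, so this is only a hedged sketch of one common approach in Python 3 (the excerpt uses Python 2 print statements): write to the file directly and name the encoding, rather than relying on whatever encoding the redirected output stream happens to use. The sample value and the output path are hypothetical.

```python
import os
import tempfile

# Hypothetical sample value containing non-ASCII characters.
title = "Caf\u00e9 \u2026"

# Hypothetical output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# Naming the encoding explicitly avoids depending on the platform default,
# which often differs between an interactive terminal and a redirected file.
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write('"' + title + '",\n')

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, encoding="utf-8") as fh:
    round_trip = fh.read()
```

Reading the file back with the same encoding should return the original string unchanged.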
24 votes

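UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

I'm learning urllib2 and Beautiful Soup, and on my first test I got the following error: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128) There seem to be lots of posts about this type of error, and I've tried the solutions I can understand, but there seem to be catch-22s with them...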

python python-2.7 unicode beautifulsoup urllib2
0 votes

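Get the date from a span tag

Using Beautiful Soup, I want to extract dates from a text file containing a list of URLs, where the date is defined in a span tag with div class = update. When I try the code below, the result I get is <span id="time"></span> but not the exact time. Please help. For example, the type of link in sabah_url.txt is “http://www.dailys...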

python web-scraping beautifulsoup
-1 votes

BeautifulSoup: extract between href and class?

I want to store the dates from the following block of text: newsoup = '''<html><body><a href="/president/washington/speeches/speech-3460">Proclamation of Pardons in Western Pennsylvania (July 10, 1795)</a>, &l...

python python-2.7 web-scraping beautifulsoup
0 votes

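BeautifulSoup - extracting text inside a tag using "text="

I've been reading the book "Web Scraping with Python" and it's pretty good, but sometimes it (frustratingly) glosses over code the reader needs to use, without showing output or mentioning relevant limitations. I spent 4 hours trying to figure out why: fullText.findAll('a', text="bees") returns an empty string for the following tag: <a class="search">Why are ...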

search beautifulsoup extract partial findall
1 votes

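Extract src from a BeautifulSoup tag

I'm trying to scrape newegg's product names, descriptions, prices, and images using BeautifulSoup. I have the following bs4.element.Tag type, and I want to extract the "src" link from the tag. Here is my tag: df = <a class="itemImage" href="http://www.newegg.com/Product/Product.aspx?Item=N82E168751...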

python-2.7 beautifulsoup
1 votes

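Beautiful Soup - select the text of the next span element with no class

I'm trying to scrape movie quotes from rottentomatoes.com with Beautiful Soup. The page source is interesting because the quotes are directly preceded by a span of class "bold quote_actor", but the quote itself is in a span with no class, e.g. (https://www.rottentomatoes.com/m/happy_gilmore/quotes/): screenshot of web source I want to use Beautiful ...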

python web-scraping beautifulsoup
99 votes

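BeautifulSoup: grab visible webpage text

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. I mainly want to get the body text (the article) and maybe even a few tab names here and there. I have tried the suggestion in this SO question, which returns lots of <script> tags and HTML comments. I can't figure out the arguments the function findAll() needs in order to get just the visible text on a webpage. So, how should I find everything except scripts...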

python text beautifulsoup html-content-extraction
6 votes

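BeautifulSoup: strip specified attributes, but preserve the tag and its contents

I'm trying to 'defrontpagify' the HTML of an MS FrontPage generated website, and I'm writing a BeautifulSoup script to do it. However, I've gotten stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet: REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','f...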

python web-scraping beautifulsoup scraper frontpage

Popular questions