When writing a crawler to scrape information from websites, you will find that some sites enforce security policies, so requests made through WebClient fail with the error: "Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host."

Here is the code I started with:
```csharp
// Requires: using System; using System.Net; using System.Text; using System.Threading.Tasks;
public static Task<string> getHtmlByUrl(string url)
{
    // Bridge WebClient's event-based async pattern to a Task
    var taskCompletionSource = new TaskCompletionSource<string>();
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    var webClient = new WebClient(); // note: never disposed in this version
    webClient.Encoding = Encoding.UTF8;
    webClient.DownloadStringCompleted += (s, e) =>
    {
        if (e.Error != null)
            taskCompletionSource.TrySetException(e.Error);
        else if (e.Cancelled)
            taskCompletionSource.TrySetCanceled();
        else
            taskCompletionSource.TrySetResult(e.Result);
    };
    webClient.DownloadStringAsync(new Uri(url));
    return taskCompletionSource.Task;
}
```
If the target server imposes no request restrictions, the code above runs fine. Some servers, however, do impose them. The most obvious workaround is to forge some headers, which gives the following code:
```csharp
public static Task<string> getHtmlByUrl(string url)
{
    var taskCompletionSource = new TaskCompletionSource<string>();
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    var webClient = new WebClient();
    webClient.Encoding = Encoding.UTF8;
    // Forged headers (note the HTTP header is spelled "Referer")
    webClient.Headers["Referer"] = url;
    webClient.Headers["Content-Type"] = "application/x-www-form-urlencoded";
    webClient.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36";
    webClient.DownloadStringCompleted += (s, e) =>
    {
        if (e.Error != null)
            taskCompletionSource.TrySetException(e.Error);
        else if (e.Cancelled)
            taskCompletionSource.TrySetCanceled();
        else
            taskCompletionSource.TrySetResult(e.Result);
    };
    webClient.DownloadStringAsync(new Uri(url));
    return taskCompletionSource.Task;
}
```
That still wasn't enough. After reading this article: https://blog.csdn.net/zikizhh/article/details/104531875/, I tried one of its suggestions: dispose of the WebClient once the request finishes. So here I wrap the WebClient in a using block:
```csharp
public static Task<string> getHtmlByUrl(string url)
{
    // Bridge WebClient's event-based async pattern to a Task
    var taskCompletionSource = new TaskCompletionSource<string>();
    ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
    var webClient = new WebClient();
    webClient.Encoding = Encoding.UTF8;
    // Forged headers
    webClient.Headers["Referer"] = url;
    webClient.Headers["Content-Type"] = "application/x-www-form-urlencoded";
    webClient.Headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36";
    webClient.DownloadStringCompleted += (s, e) =>
    {
        // Dispose inside the completion handler: wrapping the whole method
        // body in using would dispose the client when the method returns,
        // racing with the still-pending async download
        using (webClient)
        {
            if (e.Error != null)
                taskCompletionSource.TrySetException(e.Error);
            else if (e.Cancelled)
                taskCompletionSource.TrySetCanceled();
            else
                taskCompletionSource.TrySetResult(e.Result);
        }
    };
    webClient.DownloadStringAsync(new Uri(url));
    return taskCompletionSource.Task;
}
```
This still doesn't eliminate the error entirely, but the probability of hitting it dropped dramatically.
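Another mitigation often suggested for this particular error (not part of my original code, so treat it as a sketch) is to disable HTTP keep-alive, so each request opens a fresh connection instead of reusing one the server may have silently dropped. With WebClient that means overriding GetWebRequest in a subclass; the class name and the CreateRequest helper below are my own:

```csharp
using System;
using System.Net;

// Sketch: a WebClient subclass that turns off HTTP keep-alive for
// every request it issues, avoiding reuse of connections the server
// may already have closed.
public class NoKeepAliveWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        var request = base.GetWebRequest(address);
        if (request is HttpWebRequest httpRequest)
        {
            httpRequest.KeepAlive = false; // one connection per request
        }
        return request;
    }

    // Exposes the configured request so it can be inspected; purely for demonstration
    public WebRequest CreateRequest(Uri address) => GetWebRequest(address);
}
```

You would then instantiate NoKeepAliveWebClient instead of WebClient in the method above; everything else stays the same.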
If you like, you can also quietly randomize the User-Agent, which may help further; feel free to randomize a few more of the digits:
```csharp
// Here I sprinkle some random digits into the browser version numbers.
// Use a single Random instance: two created back to back can share a
// time-based seed on older .NET and produce identical values.
var random = new Random();
webClient.Headers["User-Agent"] =
    $"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/5{random.Next(0, 10)}7.36 " +
    $"(KHTML, like Gecko) Chrome/11{random.Next(0, 10)}.0.0.0 Safari/537.36";
```
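Going one step further than tweaking digits, you could pick a whole User-Agent string at random from a small pool of realistic ones. This is my own variant, not from the original code; the helper name and the pool contents are illustrative:

```csharp
using System;

// Sketch: pick a complete, realistic User-Agent string at random
// instead of perturbing digits inside a single one.
public static class RandomUserAgent
{
    private static readonly Random Rng = new Random();

    private static readonly string[] Pool =
    {
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15",
    };

    public static string Next()
    {
        return Pool[Rng.Next(Pool.Length)];
    }
}
```

Usage would simply be `webClient.Headers["User-Agent"] = RandomUserAgent.Next();` before issuing the request.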