C# 爬虫 HttpClient 之 https 踩坑记录

发布时间 2023-04-20 21:20:06作者: C余L小R鱼

背景

有一个网页(https),请求返回是一串json,可通过模拟浏览器获取,也可以通过api请求获取,其中通过C#的httpclient的Get发起请求,会出现不定时的返回结果乱码

请求代码

[HttpPost(Name = "GetTestNoParams")]
        public async Task<string> GetTestNoParams(string url)
        {
            var ret = string.Empty;
            try
            {
                var httpClientHandler = new HttpClientHandler
                {
                    ServerCertificateCustomValidationCallback = (sender, cert, chain, sslPolicyErrors) => true,
                };

                var client = new HttpClient(httpClientHandler);
                client.Timeout = new TimeSpan(0, 0, 5);
                client.DefaultRequestHeaders.Add("Host", "www.southernfund.com");
                client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
                var response = await client.GetAsync(new Uri("url"));
                if (response.IsSuccessStatusCode)
                {
                    return await response.Content.ReadAsStringAsync();
                }
                else
                    return $"请求失败:{response.RequestMessage}";
            }
            catch (Exception ex)
            {

                return ex.Message;
            }
        }

首先乱码排查

通过浏览器直接访问,发现该网页一段时间返回结果会进行压缩,一段时间结果直接返回,直接是json串 【Content-Encoding: gzip 可以判断内容进行了gzip压缩】

此时代码运行返回乱码

乱码解决

知道内容进行了gzip压缩,那我们就对内容进行GZIP解压缩,代码如下

[HttpPost(Name = "GetTestNoParams")]
        public async Task<string> GetTestNoParams(string url)
        {
            var ret = string.Empty;
            try
            {
                var httpClientHandler = new HttpClientHandler
                {
                    ServerCertificateCustomValidationCallback = (sender, cert, chain, sslPolicyErrors) => true,
                };

                var client = new HttpClient(httpClientHandler);
                client.Timeout = new TimeSpan(0, 0, 5);
                client.DefaultRequestHeaders.Add("Host", "www.southernfund.com");
                client.DefaultRequestHeaders.Add("Accept-Encoding", "gzip, deflate, br");
                var response = await client.GetAsync(new Uri("url"));
                if (response.IsSuccessStatusCode)
                {
                    var stream = await response.Content.ReadAsStreamAsync();
                    var bts = Decompress(stream);
                    return Encoding.UTF8.GetString(bts);
                    //return await response.Content.ReadAsStringAsync();
                }
                else
                    return $"请求失败:{response.RequestMessage}";
            }
            catch (Exception ex)
            {

                return ex.Message;
            }
        }

 public static byte[] Decompress(Stream stream)
        {
            var gzipStream = new GZipStream(stream, CompressionMode.Decompress);
            var mmStream = new MemoryStream();
            byte[] block = new byte[1024];
            while (true)
            {
                int byteRead = gzipStream.Read(block, 0, block.Length);
                if (byteRead <= 0)
                    break;
                else
                    mmStream.Write(block, 0, byteRead);
            }
            mmStream.Close();
            return mmStream.ToArray();
        }

返回结果正常

bug

上述看似问题处理 但是这里面有个bug ,就是C#通过httpclient发起get请求的response的headers中不包含 Content-Encoding: gzip这个头信息,程序无法自动判断解压缩,上述代码也是强解压,当对方服务器切换返回非压缩的内容时,上述代码仍旧无法获取正常内容

C#通过httpclient发起get请求的response的headers

不过楼主发现通过python是可以获取到的

紧急的就用python做 后续期望C#能找到解决办法