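Part 2 of the AliExpress Product Scraping Series: Scraping in Practice

Source: reposted

    The previous post analyzed AliExpress product scraping: which product information to collect and how to collect it. This post puts that analysis into practice. The relevant techniques were covered last time, so let's go straight to building the project.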

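I. Create the solution and write the scraping code

1. Create a solution named “CollectorSolution” and add an empty ASP.NET MVC project named “Collector” to it. The solution structure is as follows: [screenshot]

2. In the “Collector” project, add a “CollectingController” controller and its view, and change the default route from Home -> Index to Collecting -> Index: [screenshot]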

Change RouteConfig to the following:

using System.Web.Mvc;
using System.Web.Routing;

namespace Collector
{
    public class RouteConfig
    {
        public static void RegisterRoutes(RouteCollection routes)
        {
            routes.IgnoreRoute("{resource}.axd/{*pathInfo}");

            // The default route now points at Collecting/Index instead of Home/Index.
            routes.MapRoute(
                name: "Default",
                url: "{controller}/{action}/{id}",
                defaults: new { controller = "Collecting", action = "Index", id = UrlParameter.Optional }
            );
        }
    }
}

3. Add three view models, “CollectionViewModel”, “CollectedProductViewModel” and “CollectedProductImageViewModel”, plus a struct named “ParseProductPatterns” that holds the regular expressions. The code for each is as follows:

1.> CollectionViewModel

using System.Collections.Generic;

namespace Collector.Models
{
    public class CollectionViewModel
    {
        public CollectionViewModel()
        {
            ProductViews = new List<CollectedProductViewModel>();
        }

        public string CollectionUrl { get; set; }
        public IEnumerable<CollectedProductViewModel> ProductViews { get; set; }
    }
}

2.> CollectedProductViewModel

using System.Collections.Generic;

namespace Collector.Models
{
    public class CollectedProductViewModel
    {
        public CollectedProductViewModel()
        {
            ProductImages = new List<CollectedProductImageViewModel>();
        }

        public string ProductName { get; set; }
        public decimal ProductPrice { get; set; }
        public decimal ProductDiscountPrice { get; set; }
        public string ProductCurrency { get; set; }
        public string ProductColor { get; set; }
        public string ProductSize { get; set; }
        public IEnumerable<CollectedProductImageViewModel> ProductImages { get; set; }
    }
}

3.> CollectedProductImageViewModel

namespace Collector.Models
{
    public class CollectedProductImageViewModel
    {
        public string ImageUrl { get; set; }
        public int Sort { get; set; }
    }
}

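4.> ParseProductPatterns

namespace Collector.Models
{
    // Regular expressions used to pull product data out of the page HTML.
    // Each one uses a lookbehind and a lookahead so that only the value between two markers is matched.
    public struct ParseProductPatterns
    {
        public static readonly string ProductNamePattern = "(?<=<h1 class=\"product-name\" itemprop=\"name\">).*?(?=</h1>)";

        public static readonly string ProductJsnPattern = @"(?<=var skuProducts=).*?(?=;\s*var skuAttrIds=)";

        // Dots are escaped so they match a literal '.' rather than any character.
        public static readonly string ProductImageJsonPattern = @"(?<=window\.runParams\.imageBigViewURL=).*?(?=;)";

        public static readonly string ProductCurrencyPattern = "(?<=window\\.runParams\\.currencyCode=\").*?(?=\";)";

        // {0} is replaced with the SKU property id (color) via string.Format.
        public static readonly string ProductColorPattern =
            "(?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-1-{0}\" title=\").*?(?=\")";

        // {0} is replaced with the SKU property id (size) via string.Format.
        public static readonly string ProductSizePattern =
            "(?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-2-{0}\" href=\"javascript:void\\(0\\)\"\\s+><span>).*?(?=</)";
    }
}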

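Most of this is easy to follow, so I won't go through it line by line. The one thing worth noting is that every pattern wraps a lazy `.*?` in a lookbehind and a lookahead, so the match contains only the value between two known markers, and that the `{0}` placeholders in the color and size patterns are filled in with a SKU id before matching.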
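To make the patterns concrete, here is a minimal sketch of how a lookbehind/lookahead pair extracts just the value between two markers. The sample markup is made up for illustration and is not copied from a real AliExpress page; `PatternDemo` and `Extract` are hypothetical names that do not appear in the project.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Same shape as ProductNamePattern: (?<=prefix) value (?=suffix).
    public const string NamePattern =
        "(?<=<h1 class=\"product-name\" itemprop=\"name\">).*?(?=</h1>)";

    // Same shape as ProductColorPattern; {0} is replaced by the SKU id.
    public const string ColorPattern =
        "(?<=<a data-role=\"sku\" data-sku-id=\"{0}\" id=\"sku-1-{0}\" title=\").*?(?=\")";

    // Returns the matched text, or null when the pattern is not found.
    public static string Extract(string pattern, string input)
    {
        var match = Regex.Match(input, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return match.Success ? match.Value : null;
    }

    public static void Main()
    {
        // Made-up sample markup, for illustration only.
        var html = "<h1 class=\"product-name\" itemprop=\"name\">Yoga Top</h1>" +
                   "<a data-role=\"sku\" data-sku-id=\"193\" id=\"sku-1-193\" title=\"Black\">";

        Console.WriteLine(Extract(NamePattern, html));                        // Yoga Top
        Console.WriteLine(Extract(string.Format(ColorPattern, "193"), html)); // Black
        Console.WriteLine(Extract(string.Format(ColorPattern, "999"), html) ?? "(no match)");
    }
}
```

Returning null on a failed match, rather than an empty string, makes it easier for the caller to tell "not found" apart from "found but empty".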

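4. The view layout is very simple: [screenshot]

The collection URL is an AliExpress product URL; store and category URLs are not supported here. The table displays the scraped product information.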
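Since only product URLs are supported, one option is to reject other URLs before fetching anything. The sketch below is an assumption, not part of the project: `ProductUrlCheck` and `LooksLikeProductUrl` are hypothetical names, and the path rule is inferred only from the shape of the two example URLs used in this post (`/store/product/<name>/<storeId>_<productId>.html`), so it may not cover every product URL format.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ProductUrlCheck
{
    // Path shape inferred from the example URLs: /store/product/<name>/<storeId>_<productId>.html
    private static readonly Regex ProductPath =
        new Regex(@"^/store/product/[^/]+/\d+_\d+\.html$", RegexOptions.IgnoreCase);

    public static bool LooksLikeProductUrl(string url)
    {
        Uri uri;
        if (!Uri.TryCreate((url ?? string.Empty).Trim(), UriKind.Absolute, out uri))
        {
            return false;
        }

        var isHttp = uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;
        var isAliExpress = uri.Host.Equals("aliexpress.com", StringComparison.OrdinalIgnoreCase)
            || uri.Host.EndsWith(".aliexpress.com", StringComparison.OrdinalIgnoreCase);

        // The query string (spm=..., sdom=...) is ignored; only the path is checked.
        return isHttp && isAliExpress && ProductPath.IsMatch(uri.AbsolutePath);
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeProductUrl(
            "http://www.aliexpress.com/store/product/Some-Product-Name/1025110_32620359354.html?spm=x")); // True
        Console.WriteLine(LooksLikeProductUrl("http://www.aliexpress.com/store/1025110"));               // False
        Console.WriteLine(LooksLikeProductUrl("not a url"));                                              // False
    }
}
```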

5. The controller and view code are as follows

1.> CollectingController

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Web.Mvc;
using Collector.Models;
using Newtonsoft.Json.Linq;
using RestSharp;

namespace Collector.Controllers
{
    public class CollectingController : Controller
    {
        // GET: Collecting
        public ActionResult Index()
        {
            return View();
        }

        // POST: Collecting
        [HttpPost]
        public ActionResult Index(CollectionViewModel collectionView)
        {
            collectionView = CollectWithParse(collectionView);
            return View(collectionView);
        }

        // Helper methods are private so MVC does not expose them as actions.
        private CollectionViewModel CollectWithParse(CollectionViewModel collectionView)
        {
            if (collectionView == null || string.IsNullOrWhiteSpace(collectionView.CollectionUrl))
            {
                return collectionView;
            }
            var client = new RestClient(collectionView.CollectionUrl.Trim());
            var request = new RestRequest(Method.GET);
            var response = client.Execute(request);
            var htmlContent = response.Content ?? string.Empty;
            collectionView.ProductViews = ParseProducts(htmlContent);
            return collectionView;
        }

        private IEnumerable<CollectedProductViewModel> ParseProducts(string productHtmlContent)
        {
            var productName = RegexMatchValue(ParseProductPatterns.ProductNamePattern, productHtmlContent);
            var productCurrency = RegexMatchValue(ParseProductPatterns.ProductCurrencyPattern, productHtmlContent);

            var productJson = RegexMatchValue(ParseProductPatterns.ProductJsnPattern, productHtmlContent);
            if (string.IsNullOrWhiteSpace(productJson))
            {
                // No SKU data found (wrong URL or changed page markup): nothing to show.
                return new List<CollectedProductViewModel>();
            }

            var productJsonArray = JArray.Parse(productJson);
            var products =
                productJsonArray.Select(pja =>
                {
                    var colorWithSizeCode = pja["skuPropIds"].ToString().Split(',');
                    var priceJson = pja["skuVal"];
                    return new
                    {
                        ColorCode = colorWithSizeCode.First(),
                        SizeCode = colorWithSizeCode.Last(),
                        // A missing price becomes 0; the JToken cast parses culture-independently.
                        Price = (decimal?)priceJson["skuPrice"] ?? 0m,
                        DiscountPrice = (decimal?)priceJson["actSkuPrice"] ?? 0m,
                    };
                }).ToList();

            var collectedImages = ParseProductImages(productHtmlContent);

            var collectedProducts = products.Select(p => new CollectedProductViewModel
            {
                ProductName = productName,
                ProductPrice = p.Price,
                ProductDiscountPrice = p.DiscountPrice,
                ProductCurrency = productCurrency,
                ProductColor = SetProductColorWithSize(ParseProductPatterns.ProductColorPattern, p.ColorCode, productHtmlContent),
                ProductSize = SetProductColorWithSize(ParseProductPatterns.ProductSizePattern, p.SizeCode, productHtmlContent),
                ProductImages = collectedImages
            }).ToList();
            return collectedProducts;
        }

        private IEnumerable<CollectedProductImageViewModel> ParseProductImages(string productHtmlContent)
        {
            var imagesJson = RegexMatchValue(ParseProductPatterns.ProductImageJsonPattern, productHtmlContent);
            if (string.IsNullOrWhiteSpace(imagesJson))
            {
                return new List<CollectedProductImageViewModel>();
            }

            var images = JArray.Parse(imagesJson).ToObject<List<string>>();
            return images.Select((t, i) => new CollectedProductImageViewModel
            {
                ImageUrl = t,
                Sort = i
            }).ToList();
        }

        // Fills the {0} placeholder in a color/size pattern with the SKU id, then matches.
        private string SetProductColorWithSize(string pattern, string colorWithSizeCode, string input)
        {
            var newPattern = string.Format(pattern, colorWithSizeCode);
            return RegexMatchValue(newPattern, input);
        }

        // Returns the matched text, or an empty string when there is no match.
        private string RegexMatchValue(string pattern, string input, RegexOptions regexOptions = RegexOptions.IgnoreCase | RegexOptions.Singleline)
        {
            var regex = new Regex(pattern, regexOptions);
            var match = regex.Match(input);
            return match.Value;
        }
    }
}

2.> Collecting->Index

@model Collector.Models.CollectionViewModel
<!DOCTYPE html>

<html>
<head>
    <meta name="viewport" content="width=device-width" />
    <title></title>
    <style type="text/css">
        table.gridtable {
            font-family: verdana,arial,sans-serif;
            font-size: 11px;
            color: #333333;
            border-width: 1px;
            border-color: #666666;
            border-collapse: collapse;
        }

        table.gridtable th {
            border-width: 1px;
            padding: 8px;
            border-style: solid;
            border-color: #666666;
            background-color: #dedede;
        }

        table.gridtable td {
            border-width: 1px;
            padding: 8px;
            border-style: solid;
            border-color: #666666;
            background-color: #ffffff;
        }
    </style>
</head>
<body>
    <div>
        @using (Html.BeginForm("Index", "Collecting", FormMethod.Post))
        {
            <table>
                <tr>
                    <td>Collection URL:</td>
                    <td>
                        @Html.TextAreaFor(m => m.CollectionUrl, 4, 0, new { style = "width:1500px;" })
                    </td>
                </tr>
                <tr><td colspan="2" style="text-align: right;"><input type="submit" value="Start scraping" /></td></tr>
            </table>
        }
    </div>
    <div>
        <table class="gridtable">
            <thead>
                <tr>
                    <th width="5%">No.</th>
                    <th width="5%">Image</th>
                    <th width="30%">Product name</th>
                    <th width="10%">Unit price</th>
                    <th width="10%">Reference unit price</th>
                    <th width="10%">Currency</th>
                    <th width="10%">Color</th>
                    <th width="10%">Size</th>
                </tr>
            </thead>
            <tbody>
                @* Render rows only when there is data, so the closing tags below are always emitted. *@
                @if (Model != null && Model.ProductViews != null)
                {
                    var i = 0;
                    foreach (var collectedProduct in Model.ProductViews)
                    {
                        i++;
                        var firstImage = collectedProduct.ProductImages.FirstOrDefault();
                        <tr>
                            <td align="center">@i</td>
                            <td>
                                @if (firstImage != null)
                                {
                                    <img src="@firstImage.ImageUrl" width="60" height="60" />
                                }
                            </td>
                            <td>@collectedProduct.ProductName</td>
                            <td>@collectedProduct.ProductDiscountPrice</td>
                            <td>@collectedProduct.ProductPrice</td>
                            <td>@collectedProduct.ProductCurrency</td>
                            <td>@collectedProduct.ProductColor</td>
                            <td>@collectedProduct.ProductSize</td>
                        </tr>
                    }
                }
            </tbody>
        </table>
    </div>
</body>
</html>

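Note that this post shows only a small slice of what scraping involves, so I have kept things simple and have not encapsulated anything rigorously, on either the front end or the back end. I hope you'll bear with that. I am also not fond of comments in code, since to my mind the code itself is the comment, so the listings carry very few.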

II. Test results: deploy the MVC project to IIS on port 1005 and see how it runs.

1. Test with the AliExpress product URL from the previous post:

http://www.aliexpress.com/store/product/Yoga-Tops-Women-Women-Yoga-Shirts-Womens-Sportswear-Gym-Woman-Running-Shirt-Camisetas-Deporte-Mujer-Gym/1025110_32620359354.html?spm=a2g01.8032156.template-section-container.27.wcM8ES&sdom=3514.555719.493653.0_32620359354

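The result looks like this: [screenshot]

When I scraped it just now, I found that AliExpress is no longer discounting the product at the URL used in the previous post, so there is no discount price.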
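When a product is not discounted, the `actSkuPrice` field is absent from the SKU data and the discount price ends up as 0. One possible alternative, sketched below, is to fall back to the regular price instead. This is an assumption about how one might handle it, not what the project does: `SkuPriceDemo` and `EffectivePrice` are hypothetical names, and the JSON is made-up sample data that only mimics the fields the controller reads (`skuPropIds`, `skuVal`, `skuPrice`, `actSkuPrice`). It requires the Newtonsoft.Json package that the project already references.

```csharp
using System;
using Newtonsoft.Json.Linq;

public static class SkuPriceDemo
{
    // Discounted price when present, otherwise the regular price, otherwise 0.
    public static decimal EffectivePrice(JToken skuVal)
    {
        // The (decimal?) cast yields null for a missing field instead of throwing.
        return (decimal?)skuVal["actSkuPrice"] ?? (decimal?)skuVal["skuPrice"] ?? 0m;
    }

    public static void Main()
    {
        // Made-up sample data: the second SKU has no discount.
        const string json = @"[
            { ""skuPropIds"": ""193,100014064"", ""skuVal"": { ""skuPrice"": ""12.99"", ""actSkuPrice"": ""9.99"" } },
            { ""skuPropIds"": ""175,100014065"", ""skuVal"": { ""skuPrice"": ""12.99"" } }
        ]";

        foreach (var sku in JArray.Parse(json))
        {
            // First SKU resolves to the discounted price, second to the regular price.
            Console.WriteLine("{0}: {1}", sku["skuPropIds"], EffectivePrice(sku["skuVal"]));
        }
    }
}
```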

2. Scrape another URL:

http://www.aliexpress.com/store/product/LEVEL-4-shock-Professional-running-intensive-training-without-rims-snow-sports-bra-open-front-zipper-style/1025110_32357688343.html?spm=2114.12010108.1000013.1.uvJqBj

The result: [screenshot]

This product has so many variants that a single page cannot show them all.
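If the long variant list becomes a problem, one common remedy is to page the results. The project does not do this; the following is only a sketch using LINQ's `Skip` and `Take`, with `PagingDemo`, `GetPage` and `PageCount` as hypothetical names and a page size of 20 chosen arbitrarily.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PagingDemo
{
    // Returns the items on the given 1-based page.
    public static List<T> GetPage<T>(IEnumerable<T> items, int page, int pageSize)
    {
        if (page < 1) throw new ArgumentOutOfRangeException(nameof(page));
        if (pageSize < 1) throw new ArgumentOutOfRangeException(nameof(pageSize));
        return items.Skip((page - 1) * pageSize).Take(pageSize).ToList();
    }

    // Number of pages needed to show totalItems, rounding up.
    public static int PageCount(int totalItems, int pageSize)
    {
        return (totalItems + pageSize - 1) / pageSize;
    }

    public static void Main()
    {
        // 45 placeholder variants standing in for scraped products.
        var variants = Enumerable.Range(1, 45).Select(n => "variant-" + n).ToList();

        Console.WriteLine(PageCount(variants.Count, 20));           // 3
        Console.WriteLine(string.Join(", ", GetPage(variants, 3, 20)));
        // variant-41, variant-42, variant-43, variant-44, variant-45
    }
}
```

In the view, the same call could be applied to `Model.ProductViews` with the page number taken from a query-string parameter.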

Source code: https://github.com/haibozhou1011/Collector

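Summary:

That wraps up the AliExpress product scraping series. On the whole, the techniques involved in scraping are ones most of us use regularly; the real work is the up-front analysis that uncovers the rules for extracting product information. Every site has its own patterns, and if you look carefully you will find clues that lead to a breakthrough. If you have a better scraping approach, please do share it.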

 
