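Scraping Page Data with Python: Word Segmentation, Statistics, and Display

Scraping JD.com Review Data with Python to Generate a Word Cloud

Goal: scrape product reviews and analyze how often each keyword appears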


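Approach

  • Open a JD.com SKU detail page, view the reviews, and go to the next page; the XHR request this triggers reveals the review data API
  • Python packages used:
    • requests,
    • jieba,
    • numpy,
    • pandas,
    • matplotlib,
    • wordcloud,
    • PIL (Pillow)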

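Detailed steps

  • Build the review API URL by string concatenation
  • The response is JSONP, so slice the string to get the JSON payload (the callback parameter can be customized)
  • Word segmentation
  • Stop-word removal
  • Word counts: wordData……agg(total='count')
  • Saving the data: separate logic can archive the results into a database
  • Word-cloud image generation and display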
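
The JSONP step above can be sketched as follows. This is a minimal illustration rather than the post's own code: the helper name `unwrap_jsonp` and the sample payload are made up for the example, and it assumes the response has the shape `callback({...});`.

```python
import json
import re


def unwrap_jsonp(text, callback):
    """Return the JSON payload inside a JSONP response such as callback({...});"""
    # Match the callback name, "(", the payload, ")" and an optional ";".
    pattern = r'^\s*' + re.escape(callback) + r'\((.*)\)\s*;?\s*$'
    match = re.match(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("response is not JSONP for callback " + callback)
    return json.loads(match.group(1))


# Hypothetical sample response, shaped like the review API's JSONP output.
sample = 'fetchJSON_comment91({"comments": [{"content": "质量很好"}]});'
data = unwrap_jsonp(sample, "fetchJSON_comment91")
print(data["comments"][0]["content"])
```

Compared with slicing by a fixed offset, matching on the callback name fails loudly when the server returns something other than the expected JSONP wrapper, such as an error page.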

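Minor issues

  • Throttle the request rate to the data API; there is no need to rush
  • Encoding used when reading the text file: GBK
  • wordData......agg(total='count')
  • Font for the word cloud: pick one that ships with the system, or copy your font file into the system font directory
    • Win: simsun.ttc
    • Mac: [ ~ | /System]/Library/Fonts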
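
The throttling advice above could look like the sketch below. It is an assumption about how one might pace the requests, not something taken from the post: `fetch_pages`, its parameters, and the delay values are hypothetical, and the fetch function is a stand-in so the example runs without network access.

```python
import random
import time


def fetch_pages(fetch, pages, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(page) for each page, pausing a random interval between calls."""
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(page))
    return results


# Stand-in fetch function; a real one would call requests.get on the page URL.
pages = fetch_pages(lambda page: "page-%d" % page, range(1, 5),
                    min_delay=0.0, max_delay=0.0)
print(pages)
```

Passing `sleep` as a parameter keeps the pacing logic easy to exercise without actually waiting.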

Full code

#!/usr/bin/python
# -*- coding: utf-8 -*-

import json
import os
import re
from os import path

import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from PIL import Image
from wordcloud import WordCloud


# Data scraping module
def get_comments():
    all_comments = ""
    fetchJSON_comment = "fetchJSON_comment9"
    skuID = "1109759"  # "4093841" # "100004549676"
    for i in range(1, 5):
        page = str(i)
        callback = fetchJSON_comment + page
        finalurlc = ('https://club.jd.com/comment/productPageComments.action?callback='
                     + callback + '&productId=' + skuID
                     + '&score=0&sortType=5&page=' + page
                     + '&pageSize=10&isShadowSku=0&rid=0&fold=1')
        xba = requests.get(finalurlc)
        # The response is JSONP, e.g. fetchJSON_comment91({...});
        # Drop the leading "callback(" and the trailing ");" to get the JSON.
        print(finalurlc, xba.text[0:len(callback) + 1])
        data = json.loads(xba.text[len(callback) + 1:-2])
        for j in data['comments']:
            all_comments = all_comments + j['content']
        print(i, xba.text[0:20])
    return all_comments


# Data cleaning module
def data_clear(xt):
    # Keep only runs of Chinese characters
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    filedata = re.findall(pattern, xt)
    xx = ''.join(filedata)
    clear = jieba.lcut(xx)  # word segmentation
    cleared = pd.DataFrame({'keywords': clear})
    # The stop-word file is GBK-encoded, one word per line
    stopwords = pd.read_csv("chineseStopWords.txt", index_col=False,
                            quoting=3, sep="\t", names=['stopword'], encoding='GBK')
    cleared = cleared[~cleared.keywords.isin(stopwords.stopword)]
    # Count occurrences of each word, most frequent first
    count_words = cleared.groupby('keywords')['keywords'].agg(total='count')
    count_words = count_words.reset_index().sort_values(
        by=["total"], ascending=False)
    # Optional: save the counts to a CSV file
    # if os.path.exists("count_words.csv"):
    #     os.remove('count_words.csv')
    # count_words.to_csv('count_words.csv', encoding='GBK')
    return count_words


# Word-cloud display module
def make_wordclound():
    # Optional: shape the cloud with a mask image
    # d = path.dirname(__file__)
    # msk = np.array(Image.open(path.join(d, "151.jpg")))
    count_words = data_clear(get_comments())
    word_frequence = {x[0]: x[1] for x in count_words.head(200).values}
    wordcloud = WordCloud(font_path="simsun.ttc",  # mask=msk,
                          background_color="#EEEEEE", max_font_size=250,
                          width=2100, height=1200)
    wordcloud = wordcloud.fit_words(word_frequence)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()


if __name__ == "__main__":
    make_wordclound()
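
To see what the filtering and counting step produces without installing pandas or jieba, it can be approximated with the standard library. This is a stand-in, not the post's method: it uses `collections.Counter` in place of `groupby(...).agg(total='count')`, the function name `count_keywords` is hypothetical, and the token list and stop words are made-up samples rather than real jieba output.

```python
from collections import Counter


def count_keywords(tokens, stopwords, top_n=200):
    """Drop stop words, then return (word, count) pairs, most frequent first."""
    kept = [token for token in tokens if token not in stopwords]
    return Counter(kept).most_common(top_n)


# Made-up tokens standing in for the output of jieba.lcut.
tokens = ["手机", "很", "好", "手机", "快递", "很", "快", "手机"]
stopwords = {"很"}
frequencies = dict(count_keywords(tokens, stopwords))
print(frequencies)
```

The resulting word-to-count dictionary has the same shape as the `word_frequence` mapping that the script passes to `fit_words`.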

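Deploying a Sonar Code Inspection Service

Deploying a SonarQube Code Inspection Service

Preface: a code inspection service for use within the team
Deployment: Docker

Service components

  • PostgreSQL
  • SonarQube

Images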


Integrating the SonarQube Scanning Service into GitLab CI/CD

Environment: Docker

gitlab
gitlab-runner
sonarqube
postgresql

Usage example

# .gitlab-ci.yml

stages:
  - sonar-scanner
sonar:
  stage: sonar-scanner
  script:
    - sonar-scanner
      -Dsonar.projectKey=cd_demo
      -Dsonar.sources=.
      -Dsonar.host.url=http://10.18.27.80:9823
      -Dsonar.login=a138bc0d36c7130bb30aebbaffbc44148b6ab8e4
  tags:
    - sonar
  when: always

Sonar service

Integrating Sonar with GitLab

  • As admin, install the Git plugin: Administration -> Marketplace -> search "git" -> restart the server
  • ALM Integrations
    sonar-gitlab
  • Gitlab Application Token-Secret
    Gitlab Application T-S

Gitlab-runner service

  • Build a gitlab-runner image that bundles node and sonar-scanner
  • Register the Runner
    • gitlab-runner register
    • Enter the gitlab-host
    • Enter the runner-token
    • Enter the tag
    • Choose shell as the executor
  • Allow the Runner to run jobs whose tags do not match, in the Runner settings under CI/CD -> runner

Registration


Gitlab-runner container orchestration

docker-compose.yml
runner:
  container_name: 'gitlab-runner'
  build: ../../server/gitlab-runner/
  restart: always
  ports:
    - '8093:8093'
  # volumes:
  #   - '$GITLAB_HOME/gitlab-runner/config:/etc/gitlab-runner'
  #   - '/var/run/docker.sock:/var/run/docker.sock'
web:
  image: 'gitlab/gitlab-ee:latest'
  restart: always
  hostname: 'gitlab.example.com'
  environment:
    GITLAB_OMNIBUS_CONFIG: |
      external_url 'http://gitlab.example.com:8929'
      gitlab_rails['gitlab_shell_ssh_port'] = 2224
  ports:
    - '8929:8929'
    - '2224:22'
  volumes:
    - '$GITLAB_HOME/config:/etc/gitlab'
    - '$GITLAB_HOME/logs:/var/log/gitlab'
    - '$GITLAB_HOME/data:/var/opt/gitlab'

gitlab-runner/Dockerfile
FROM gitlab/gitlab-runner:latest

LABEL MAINTAINER=[email protected]

# ENV persists into later layers and the running container; a RUN export does not
ENV LANG=en_US.UTF-8 LANGUAGE=en_US

RUN apt-get update && apt-get install -y nodejs vim unzip
RUN cd /opt && \
    wget https://binaries.sonarsource.com/Distribution/sonar-scanner-cli/sonar-scanner-cli-4.0.0.1744-linux.zip && \
    unzip sonar-scanner-cli-4.0.0.1744-linux.zip && \
    mv sonar-scanner-4.0.0.1744-linux sonar-scanner

RUN ln -s /opt/sonar-scanner/bin/sonar-scanner /usr/bin/sonar-scanner && sonar-scanner -v


Notes

  • The system locale must be set: LC_ALL, LANGUAGE, LANG=en_US.UTF-8
[submodule "golang/example"]
    active = true
    url = [email protected]:baqianxin/examples.git
[submodule "spider/chineseocr_lite"]
    url = [email protected]:baqianxin/chineseocr_lite.git
    active = true

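HTTPS certificate setup for web services: nginx/golang

Certificates

  • Information used for signing
  • .crt/.cer: certificate (Certificate)
  • .key: key/private key (Private Key)
  • .csr: certificate signing request (Certificate signing request)
  • *.pem: base64-encoded text storage format; it can hold a certificate or a key on its own, or both together. The base64 encoding is the block of seemingly random characters between the two ----- lines
  • *.der: binary storage format for certificates (rarely used)

Notes on using self-signed certificates

  • Add the root certificate
  • Add the signed certificate

How to generate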
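
The relationship between the two storage formats can be shown in a few lines of Python. This is a minimal sketch under stated assumptions: the helper names `der_to_pem` and `pem_to_der` are made up for the example, the input is placeholder bytes rather than a real certificate, and only the base64 wrapping is demonstrated, not certificate parsing.

```python
import base64


def der_to_pem(der_bytes, label="CERTIFICATE"):
    """Wrap binary DER data as PEM text: base64 in 64-character lines between markers."""
    body = base64.b64encode(der_bytes).decode("ascii")
    lines = [body[i:i + 64] for i in range(0, len(body), 64)]
    return "\n".join(["-----BEGIN %s-----" % label] + lines
                     + ["-----END %s-----" % label]) + "\n"


def pem_to_der(pem_text):
    """Strip the ----- marker lines and decode the base64 body back to bytes."""
    lines = [line.strip() for line in pem_text.strip().splitlines()]
    body = "".join(line for line in lines if line and not line.startswith("-----"))
    return base64.b64decode(body)


der = bytes(range(100))  # placeholder bytes, not a real certificate
pem = der_to_pem(der)
print(pem)
```

Decoding the printed text with `pem_to_der` gives back the original bytes, which is why a .pem file and a .der file can carry the same certificate.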
