5ch – Fetching category URLs

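Goal
- Fetch the 5ch category URLs, then fetch the articles inside each category
  - Fetching the articles does not work yet (future work)
Conclusion
- Running the script below outputs a JSON file that maps category names to category URLs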

Script


#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests
import json

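# URL of the 5ch top page
URL = 'https://www2.5ch.net/5ch.html'

# Output path for the temporary file
FILE_OUTPUT_PATH = '!!specify a path'

# Output path for the JSON file
OUTPUT_FILE_PATH = '!!specify a path'

# Dictionary that collects the results (category name -> category URL)
result = {}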

# Fetch the top page with an HTTP GET request
response = requests.get(URL)
response_list = response.text

# Write the page to a temporary file so it can be read back line by line
with open(FILE_OUTPUT_PATH, 'w') as f:
    f.write(response_list)

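# Read the temporary file back and process it one line at a time
with open(FILE_OUTPUT_PATH) as read_file:
    for data in read_file:
        # Example of the link text found in data:
        # ノートPC

        # Skip lines that do not contain a link
        if 'A HREF' not in data:
            continue
        # The line contains a link
        # Extract the category name (the text between the tags)
        category_name = data.split('>')[1].split('<')[0]
        # Examples:
        # バスケット
        # テニス
        # バレーボール
        # Extract only the URL (the first double-quoted value)
        category_url = data.split('"')[1]
        # Example:
        # https://rio2016.5ch.net/kokusai/
        # Store the pair in the dictionary
        result[category_name] = category_url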

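# Write the collected categories to the JSON file once, after all lines are processed
with open(OUTPUT_FILE_PATH, 'w') as f:
    json.dump(result, f, indent=2, ensure_ascii=False)

# Process each category URL
# TODO: do this later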
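
The script above extracts each link by splitting the line on `>`, `<`, and `"`, which depends on the exact layout of every line. As a rough alternative, the same extraction can be sketched with the standard library's `html.parser`, with no temporary file. This is only a sketch: `CategoryLinkParser`, `extract_categories`, and `SAMPLE_HTML` are hypothetical names introduced here, and the sample markup assumes links of the form `<A HREF="...">name</A>`.

```python
from html.parser import HTMLParser


class CategoryLinkParser(HTMLParser):
    """Collect {link text: href} pairs from anchor tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text pieces seen inside that anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <A HREF="..."> arrives here as 'a' with an 'href' attribute.
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            name = ''.join(self._text).strip()
            if name:
                self.links[name] = self._href
            self._href = None


def extract_categories(html):
    """Return a dict mapping category name to category URL."""
    parser = CategoryLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample markup; the real page layout may differ.
SAMPLE_HTML = '''
<A HREF="https://rio2016.5ch.net/base/">プロ野球</A><br>
<A HREF="https://kizuna.5ch.net/soccer/">国内サッカー</A><br>
'''

categories = extract_categories(SAMPLE_HTML)
print(categories)
```

With the real page, `extract_categories(response.text)` would take the place of the file round trip, assuming the page uses the link form shown in the sample.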

Output

{
  "5chの入り口": "https://www.5ch.net/",
  "5ch総合案内": "https://info.5ch.net/",
  "5chプレミアム浪人": "https://premium.5ch.net/",
  "検索": "https://find.5ch.net/",
  "超スレタイ検索": "https://dig.5ch.net/",
  "5ch投稿数": "https://stat.5ch.net/SPARROW",
  "お絵描き観測所": "https://o.5ch.net/",
  "スマホメニュー": "https://itest.5ch.net/",
  "過去ログ倉庫": "https://www.5ch.net/kakolog.html",
  "地震headline": "https://headline.5ch.net/bbynamazu/",
  "地震速報": "https://egg.5ch.net/namazuplus/",
  "臨時地震": "https://mao.5ch.net/eq/",
  "臨時地震+": "https://sora.5ch.net/eqplus/",
  "緊急自然災害": "https://rio2016.5ch.net/lifeline/",
  "プロ野球": "https://rio2016.5ch.net/base/",
  "海外サッカー": "https://kizuna.5ch.net/football/",
  "国内サッカー": "https://kizuna.5ch.net/soccer/",
  "日本代表蹴球": "https://mevius.5ch.net/eleven/",
  "ニュース速報+": "https://asahi.5ch.net/newsplus/",
  "芸スポ速報+": "https://hayabusa9.5ch.net/mnewsplus/"
~~~~~~~~~
}
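
The output mixes general site links (such as the entrance page and the past-log archive) with links that look like actual boards. As a possible first step toward processing each category URL, the sketch below keeps only URLs whose path has the form `/name/`. That pattern is a heuristic assumed from the sample output, not a documented 5ch rule, and `is_board_url` and `board_urls` are hypothetical helper names.

```python
import re
from urllib.parse import urlparse

# Assumed heuristic: a board URL has exactly one path segment with a trailing slash.
BOARD_PATH = re.compile(r'^/[A-Za-z0-9]+/$')


def is_board_url(url):
    """Return True if the URL looks like a board under a 5ch.net host."""
    parsed = urlparse(url)
    return parsed.netloc.endswith('.5ch.net') and bool(BOARD_PATH.match(parsed.path))


def board_urls(categories):
    """Keep only the entries whose URL looks like a board."""
    return {name: url for name, url in categories.items() if is_board_url(url)}


# A few entries copied from the output above.
sample = {
    "5chの入り口": "https://www.5ch.net/",
    "5ch投稿数": "https://stat.5ch.net/SPARROW",
    "過去ログ倉庫": "https://www.5ch.net/kakolog.html",
    "プロ野球": "https://rio2016.5ch.net/base/",
    "ニュース速報+": "https://asahi.5ch.net/newsplus/",
}

boards = board_urls(sample)
print(boards)
```

Because the filter looks only at the URL shape, some non-board pages that happen to match the pattern may still pass, so the result is best treated as a candidate list.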

Related