
I have developed a web scraping script in Python that takes data from Hattrick.org's matches and returns it in a table so it can be mined, used to estimate the likelihood of goals, etc.

The problem is that it is really slow, returning about 12,000 rows in 5 hours or so.

My question is whether there is a way to improve the web scraping technique so it does not take that long.

This is the Python code.

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

ini = 631163587
q = 200000 # Change to q = 10 to try a sample

# Use a list (not a set) so the column order is deterministic
Cols = ['01. Local MF',
        '02. Away MF',
        '03. Local RD',
        '04. Away RD',
        '05. Local CD',
        '06. Away CD',
        '07. Local LD',
        '08. Away LD',
        '09. Local RA',
        '10. Away RA',
        '11. Local CA',
        '12. Away CA',
        '13. Local LA',
        '14. Away LA',
        '15. Local IndD',
        '16. Away IndD',
        '17. Local IndA',
        '18. Away IndA',
        '19. Local Attitude',
        '20. Away Attitude',
        '21. Local Tactic',
        '22. Away Tactic',
        '23. Local Tactic Level',
        '24. Away Tactic Level',
        '25. Local Score',
        '26. Away Score']

df_ht = pd.DataFrame(data=np.nan,index=range(ini,ini+q),columns=Cols)
cont = []  # match IDs that failed to parse

for i in range(ini,ini+q):
    url2 = 'https://www74.hattrick.org/Club/Matches/Match.aspx?matchID='+str(i)
    response = requests.get(url2)
    soup = BeautifulSoup(response.text, 'html.parser')
    s1 = soup.findAll('td')

    m = soup.findAll('meta')[10].attrs['content']
    d = re.findall('[ ,.,A-Z,a-z,0-9]* - [., ,A-Z,a-z,0-9]*',m)
    d2 = re.findall('[0-9]+',d[1])

    partido = d[0]

    try:
        D = {'01. Local MF': float(s1[3].contents[0]),
                          '02. Away MF': float(s1[4].contents[0]),
                          '03. Local RD': float(s1[10].contents[0]),
                          '04. Away RD': float(s1[11].contents[0]),
                          '05. Local CD': float(s1[17].contents[0]),
                          '06. Away CD': float(s1[18].contents[0]),
                          '07. Local LD': float(s1[24].contents[0]),
                          '08. Away LD': float(s1[25].contents[0]),
                          '09. Local RA': float(s1[31].contents[0]),
                          '10. Away RA': float(s1[32].contents[0]),
                          '11. Local CA': float(s1[38].contents[0]),
                          '12. Away CA': float(s1[39].contents[0]),
                          '13. Local LA': float(s1[45].contents[0]),
                          '14. Away LA': float(s1[46].contents[0]),
                          '15. Local IndD': float(s1[54].contents[0]),
                          '16. Away IndD': float(s1[55].contents[0]),
                          '17. Local IndA': float(s1[61].contents[0]),
                          '18. Away IndA': float(s1[62].contents[0]),
                          '19. Local Attitude': (s1[67].contents[0]),
                          '20. Away Attitude': (s1[68].contents[0]),
                          '21. Local Tactic': s1[70].contents[0],
                          '22. Away Tactic': s1[71].contents[0],
                          '23. Local Tactic Level': s1[75].contents[0],
                          '24. Away Tactic Level': s1[76].contents[0],
                          '25. Local Score': float(d2[0]),
                          '26. Away Score': float(d2[1])}


        df_ht.loc[i,:] = D

    except Exception:
        # Page layout differed (e.g. match not played); record the ID and move on
        cont.append(i)

    df_ht.to_csv(r"Datos9.csv")
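As a sketch of one possible speed-up (not part of the original code), assuming the bottleneck is the sequential HTTP requests: reuse a single `requests.Session` (so TCP connections are kept alive) and download many match pages concurrently with a thread pool, then parse the results afterwards. `match_url`, `fetch_match`, and `fetch_all` are hypothetical helper names, and `max_workers=16` is an arbitrary starting point.

```python
# Sketch: concurrent page downloads with a shared Session.
# The parsing step (BeautifulSoup etc.) would run on the returned HTML
# exactly as in the original loop.
import concurrent.futures as cf

import requests

BASE = 'https://www74.hattrick.org/Club/Matches/Match.aspx?matchID='


def match_url(match_id):
    """Build the match URL for a given match ID."""
    return BASE + str(match_id)


def fetch_match(session, match_id):
    """Download one match page; return (match_id, html) or (match_id, None) on failure."""
    try:
        resp = session.get(match_url(match_id), timeout=10)
        resp.raise_for_status()
        return match_id, resp.text
    except requests.RequestException:
        return match_id, None


def fetch_all(ids, max_workers=16):
    """Fetch many pages concurrently, reusing one Session across all workers."""
    results = {}
    with requests.Session() as session:
        with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(fetch_match, session, i) for i in ids]
            for fut in cf.as_completed(futures):
                mid, html = fut.result()
                results[mid] = html
    return results
```

Two further notes: the original loop calls `df_ht.to_csv(...)` on every iteration, rewriting the whole file each time; collecting row dicts in a list and writing once at the end (or every N rows as a checkpoint) avoids that cost. Also check the site's terms of service and robots.txt before raising the request rate.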
