Questions tagged [crawling]

14 questions
28
votes
7 answers

Publicly available social network datasets/APIs

As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if, along with a link to the dataset/API, characteristics…
Rubens
11
votes
5 answers

LinkedIn web scraping

I recently discovered a new R package for connecting to the LinkedIn API. Unfortunately the LinkedIn API seems pretty limited to begin with; for example, you can only get basic data on companies, and this is detached from data on individuals. I'd…
8
votes
5 answers

How to scrape a website with a search bar

How do I scrape a website that basically looks like Google, with just a giant search bar in the middle of the screen? From it you can search for various companies and their stats. I have a list of 1000 companies I want to get information about. I…
Ceylon
4
votes
2 answers

Web Scraping - a scientific database

I am searching a scientific database for abstracts of papers containing the words "project management". Here is the link: To get an abstract, I need to click on a paper and open a new page. How can I do that for 68 papers? I program in R and…
Hamideh
2
votes
4 answers

Format for storing textual data

For an upcoming project, I'm mining textual posts from an online forum, using Scrapy. What is the best way to store this text data? I'm thinking of simply exporting it into a JSON file, but is there a better format? Or does it not matter?
cakesofwrath
2
votes
3 answers

Crawling customer reviews from Amazon

I want to know if there is any way to crawl customer reviews for particular products from Amazon without being blocked. At the moment, my crawler gets blocked after a few attempts. Any ideas would be appreciated.
bensw
2
votes
0 answers

How can I find company descriptions for a long list of companies?

I'm going to train an ML algorithm to qualify potential sales leads based upon company descriptions. To do this, I need to find the company descriptions programmatically. For example, given a long list of company names, how can I find descriptions for these…
Per Borgen
2
votes
0 answers

Is there a way to scrape tweets in realtime from a list of specified users?

I am trying to build a scraper that runs continuously and saves the tweets from a list of users instantaneously, or within seconds of the user posting them. It could save the tweet details to a continuously updated CSV file.
niusoski
1
vote
1 answer

Publicly available news APIs/datasets?

In addition to our list of publicly available datasets, I'd like to know if there is any list of publicly available news datasets/crawling APIs. It would be very nice if, along with a link to the dataset/API, characteristics of the data available…
stevec
1
vote
2 answers

Data extraction using crawlers

I have a rather simple data scraping task, but my knowledge of web scraping is limited. I have an Excel file containing the names of 500 cities in a column, and I'd like to find their distance from a fixed city, say Montreal. I have found this…
Jay
0
votes
1 answer

Corpus development for plagiarism detection

There are many simple plagiarism detection algorithms that rely on search engines such as Google. I want to have an indexed corpus of the whole internet to serve as a back-end database for my plagiarism detection software. What should be the…
Shiva
0
votes
0 answers

Scraping Number of Customer Transactions Completed on a Marketplace

I'm looking to build a web scraper/crawler that counts the number of transactions completed on a given website that sells a given product. I know purchasing is typically handled by a third party and not the website itself. I don't want to…
Okeith
0
votes
1 answer

Is there a ubiquitous web crawler that can generate a good language-specific dataset for training a transformer?

It seems like a lot of noteworthy AI tools (Facebook Translate, GPT-3) are being trained on datasets generated by web crawlers rather than on human-edited, human-compiled corpora. In general, it sounds preferable to have an automatic and universal way…
hmltn
-3
votes
4 answers

Looking for a web scraping tool for unstructured data

I want to scrape some data from a website. I have used import.io but am still not satisfied with it. Can any of you suggest an alternative? What is the best tool for getting unstructured data from the web?
cap