Intricacies of web scraping in 2023 with Pierluigi Vinciguerra, founder of The Web Scraping Club
This episode is a great opportunity to learn more about the man behind the Web Scraping Club project and get his perspective on the industry and its future.
Quotes
1. “If we are talking about the success of a small web scraping project, the most important thing is the quality of the output. If you're selling this project you need to create trust between you as a provider and a user and you need to put all the effort you can to provide quality data. To do so you need to set up a process of data quality with the most common techniques like human count regression, trends forecasting, etc. For large-scale projects, this applies as well but you also need to think about your scraping architecture. If you're building something that you're going to scale you need to standardize your processes.”
2. “Web scraping is becoming harder and more expensive. 10 years ago there was no need to have any proxy unless you needed to bypass a geo-fence of a website. Now you need many more tools - proxies, headless browsers...”
3. “Many in the industry try to sell their APIs for automatic extraction from websites. This is a trend I've seen started four or five years ago and I think it's a good trend for the data sourcing industry because it resolves quite a number of issues.”
4. “There is more attention to the sourcing of the IP from many proxy providers, the narrative of the proxy provider about the proxy industries moved to the ethical sourcing of the IP. It's good for this industry because web scraping has always been seen as shady. But it's totally legit if you do it in a proper way.”
3 questions we ask all guests:
1. Who in the world of tech/data would Pier take out for lunch?
Scrapy - an open-source and collaborative framework for extracting data from websites.
3. What real-life problem did Pier solve using data?
Wrote a scraper to help him buy a TV, which eventually saved him 300-400 EUR.
Episode Resources
Subscribe, rate, and review the "Ethical Data, Explained" podcast on Apple Podcasts.
Follow the "Ethical Data, Explained" podcast on Spotify.
Follow the "Ethical Data, Explained" podcast on Google Podcasts.
Watch full episodes of the "Ethical Data, Explained" podcast on YouTube.
To learn more about SOAX, visit the website.