Intricacies of web scraping in 2023 with Pierluigi Vinciguerra

Episode #009

“Web scraping is becoming harder and more expensive. Ten years ago there was no need for any proxy unless you needed to bypass a website's geo-fence. Now you need many more tools, e.g. proxies, headless browsers...”
Pierluigi Vinciguerra, LinkedIn
Description

In this episode of Ethical Data, Explained, Henry Ng is joined by Pierluigi Vinciguerra, founder of The Web Scraping Club and co-founder and CTO of Re Analytics - Databoutique.com. Pier is a web scraping professional with more than 15 years of experience in data sourcing. We discussed web scraping past, present, and future: how the technology evolves and what to expect in the coming years, which trends are emerging and driving the market, and whether the future of web scraping for business lies with in-house or outsourced teams. We also talked about what determines the success of a web scraping project and how to choose a proxy provider for your project.
 
This episode is a great opportunity to learn more about the man behind the Web Scraping Club project and get his perspective on the industry and its future. 
Tune in for more!

Transcript

Henry NG 00:00 Welcome to Ethical Data, Explained. Join us as we discuss data-related obstacles and opportunities with entrepreneurs, cyber security specialists, lawmakers, and even hackers to get a better understanding of how to handle data ethically and legally. Here to keep you informed in this data-saturated world is your host, Henry NG. Good afternoon and good evening, everyone. Welcome back to Ethical Data, Explained. I'm your host, Henry NG, and today we have a very special guest: the co-founder and CTO at Re Analytics, who has spent ten years creating web scrapers and who shares web scraping knowledge through the Web Scraping Club. Our special guest today is Pierluigi. Yeah. Luigi, how are you today?


Pierluigi 00:45 Hi, Henry. Thank you very much for inviting me here. Yeah, really happy to be here today.


Henry NG 00:51 Perfect. And as we said before we started this introduction, I pronounce Pierluigi's surname quite badly. So Pierluigi, if you want to tell the viewers and listeners how to pronounce your surname, that would be great.


Pierluigi 01:05 So my full name is Pierluigi Vinciguerra, but you can call me Pier.


Henry NG 01:11 Well, tell us a little bit about yourself and a little bit about your career, because it would be great to kind of give some context on why we've asked you to be our guest today.


Pierluigi 01:20 Yeah, sure. So I started working, I think, 15 years ago. No, more than 15, because it was 2009. So yeah, more or less 15 years ago at Accenture. I was in a big consultancy firm as a data guy, basically doing business intelligence projects for banks and insurance companies. There I met my colleague at the time, Andrea, on my first day of work, and now he is my co-founder at Re Analytics. So basically we are getting married soon, I suppose, if we were not already married to our wives.


Henry NG 02:11 Spending too much time in the office. 


Pierluigi 02:14 Yeah. We have spent too much time together, and together we saw this flow of web data coming. We thought it could be interesting for companies to have access to this kind of data, because we were doing business intelligence on internal data, and integrating that with external data coming from the web would be much more interesting. So we started working on some prototypes, yes, ten years ago. Seven years ago we started Re Analytics.


Henry NG 03:07 Perfect. 


Pierluigi 03:08 Yes. 


Henry NG 03:08 So Re Analytics. And does that feed into Databoutique.com? How do those two sides of the business integrate?


Pierluigi 03:19 Yeah, so basically Re Analytics is the actual business. After some turnarounds, today we are scraping a lot of e-commerce websites, mainly in fashion, both for investors in the fashion industry and for fashion brands themselves. We basically work with most of the Italian and European luxury brands. And it works well, of course, but it's more like a consultancy job, because every customer wants his own web data project, so you really cannot sell the same thing twice. So basically six months ago we started to think about Databoutique.com. Databoutique is a marketplace for web-scraped data, where sellers like us can sell their data. The platform handles quality control, ensures that everything is fine for the buyer, and then makes some marketing efforts to bring buyers to the website.


Henry NG 05:02 I see. And in terms of figuring out what those data sets look like and what's going to be in demand, how do you gauge that, and from what you've seen, what data sets are the most popular among your clients at the moment?


Pierluigi 05:18 Well, actually, as I have said, we are at the initial stage of Databoutique. We started with one industry because we already had the data, and we wanted to try the full cycle of buying and selling data ourselves. So we wanted to sell our data and see if someone would buy it. Basically, we are selling e-commerce data on fashion. We chose three or four datasets: maybe some customers only need the price, some customers need the price plus the product description. So there are three or four different data structures for the buyer to choose from. And then the buyer can choose whether he wants the data one-off, monthly, weekly, daily, or at whatever frequency he wants.


Narrator 06:26 This podcast is brought to you by SOAX, a leading proxy provider enabling your business to unlock the world of publicly available data. Get data at scale with SOAX.


Henry NG 06:42 Okay. And obviously, when you're gathering this data, we know from a lot of the customers we work with that all websites are different. So do you essentially have to create a new scraper for every single site to extract data, or do you vary your tech stack? What does that look like?


Pierluigi 06:58 No. Unfortunately, we need to create a dedicated scraper for each website, because not only is every website different, but some of the logic also differs from website to website. Maybe you encounter a website with five types of product IDs: which one is interesting for our customer? So, unfortunately, yes, at the moment in Re Analytics we have, I guess, 300 scrapers, more or less, running daily or weekly. At least one for each website. And that's a challenge for Databoutique, because the sellers can all be different, so we potentially have an unlimited number of scrapers selling data to databoutique.com.
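To make this concrete, here is a minimal sketch of what one such per-site spider might look like in Scrapy, the framework Pier mentions later in the episode. The site URL, CSS selectors, and field names are hypothetical; in practice every one of them has to be rewritten for each new website, which is exactly why each site needs its own scraper.

```python
# A minimal, hypothetical per-site Scrapy spider. Every selector and field
# name is site-specific, which is why each website needs its own spider.
import scrapy


class ExampleFashionSpider(scrapy.Spider):
    name = "example_fashion"
    start_urls = ["https://www.example-fashion-shop.com/collections/all"]  # placeholder

    def parse(self, response):
        # The product listing layout differs per site, so these selectors
        # would have to be rewritten for every new website.
        for product in response.css("div.product-card"):
            yield {
                "product_id": product.css("::attr(data-sku)").get(),
                "name": product.css("h2.title::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # Pagination is site-specific too.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```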


Henry NG 08:09 I see. And on that gathering and scraping side, as a proxy company ourselves, I'm assuming that you use proxies as part of your process. So how do you choose your proxy providers, and what do you look out for in terms of green flags and red flags?


Pierluigi 08:27 The first thing is the sourcing of the proxy provider's IPs, because we also work with hedge funds, and they are in the most regulated industry at the moment. So you cannot provide information that was gathered in a way that is in any way shady. We want to be sure that the IPs we are using for scraping are perfectly clean.


Henry NG 09:05 Apart from the quality, is there anything else that you would say helps you determine which proxy provider to go for? Or do you only really use one proxy provider?


Pierluigi 09:15 No, no. Leaving the legal part aside, of course, for running operations at scale one key factor is pricing. And, depending on the case, also the number of IPs in the provider's pool, because we have encountered some large websites that block many IPs, so you need a large pool of IPs to gather all of the website's content. So yes: mainly pricing and, depending on the project, IP pool size.
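For illustration, here is a hedged sketch of how a crawl can be spread across a proxy pool in Scrapy: its built-in HttpProxyMiddleware picks up a proxy from each request's meta, so rotating through a pool is a one-line choice per request. The proxy endpoints below are placeholders, not real provider URLs.

```python
# Hypothetical sketch: rotating each request through a pool of proxies in
# Scrapy. The proxy URLs are placeholders for a real provider's endpoints.
import random

import scrapy

PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]


class RotatingProxySpider(scrapy.Spider):
    name = "rotating_proxy_demo"

    def start_requests(self):
        # Scrapy's built-in HttpProxyMiddleware reads the "proxy" key from
        # request.meta, so picking a random pool member per request spreads
        # the crawl across many IPs.
        yield scrapy.Request(
            "https://www.example-shop.com/products",  # placeholder site
            meta={"proxy": random.choice(PROXY_POOL)},
        )

    def parse(self, response):
        self.logger.info("Fetched %s via %s", response.url, response.meta.get("proxy"))
```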


Henry NG 10:01 I see. And for those web scraping projects, barring the quality of the proxy IPs, what essentially determines the success of a web scraping project?


Pierluigi 10:12 Well, how many hours do we have to answer this question? If we are talking about a small project, maybe the most important thing is the quality of the output, because whether you're selling this project internally in your company or to another company, you need to create trust between you as a provider and the user, and you create distrust if your data is incorrect. So you need to put in all the effort you can to provide quality data, and to do so you need to set up a data quality process with the most common techniques: row counts, regression, trend forecasting, or whatever you can use. That's something we also do at a large scale. On a large-scale project the quality is of course still important, but to reach that quality you also need to think about your scraping architecture and setup, because if you're building something that you're going to scale, you need to standardize your processes and your logs. The architecture is quite important to think about at a large scale. So I would say these are the key factors in large-scale projects.
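As one concrete example of the kind of quality check Pier describes, here is a toy row-count monitor that flags a scrape whose output deviates too far from recent runs. The tolerance and the sample numbers are illustrative assumptions.

```python
# Toy sketch of a scrape-quality check: compare today's row count against
# the recent average and flag runs that deviate too much. The 30% tolerance
# and the sample history are assumptions for illustration.
from statistics import mean


def row_count_ok(today_rows: int, history: list[int], tolerance: float = 0.30) -> bool:
    """Return True if today's count is within +/- tolerance of the recent mean."""
    if not history:
        return True  # nothing to compare against yet
    baseline = mean(history)
    return abs(today_rows - baseline) / baseline <= tolerance


# Example: the last five daily runs yielded these product counts.
history = [10_250, 10_310, 10_198, 10_402, 10_275]
if not row_count_ok(6_000, history):
    print("Row-count anomaly: today's scrape may be incomplete; hold the delivery.")
```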


Henry NG 12:12 Okay. You've been working in the scraping industry for over ten years as you mentioned. Would you say that it's become more or less labor-intensive? And what changes have you seen in the market for web scraping?


Pierluigi 12:28 Yes, it's becoming harder and also more expensive. Ten years ago there was no need for any proxy unless you needed to bypass a website's geo-fence to appear from a different location. And you basically did not need to change the default request headers in Scrapy. So ten years ago it was much easier. Now you need many more tools: proxies, headless browsers, and so on. So data sourcing is becoming more expensive. And the challenge, I think, is that some anti-bot systems are becoming very aggressive; sometimes you can't even access a website if you're using a VPN, and you're a human. So that's it.
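To show what changing the request headers in Scrapy means in practice, here is a hedged sketch of overriding the framework's defaults, which today is often the bare minimum a site expects. The user-agent string and site are placeholders.

```python
# Hypothetical sketch: overriding Scrapy's default request headers so the
# crawler presents a browser-like identity instead of Scrapy's default
# "Scrapy/x.y (+https://scrapy.org)" user agent. Values are placeholders.
import scrapy


class BrowserLikeSpider(scrapy.Spider):
    name = "browser_like"
    start_urls = ["https://www.example-shop.com/"]  # placeholder site

    custom_settings = {
        "USER_AGENT": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
        ),
        "DEFAULT_REQUEST_HEADERS": {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
        },
    }

    def parse(self, response):
        self.logger.info("Got %s with browser-like headers", response.status)
```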


Henry NG 13:40 Exactly. Obviously, that's where we are and where we've come from. In your opinion, looking into the future, what are the new trends and changes in data extraction that you're seeing? Obviously, you're in contact with the industry very regularly, so how do you see it changing in the future?


Pierluigi 14:01 I've seen many proxy providers and other companies in the industry trying to sell APIs for automatic extraction from websites. It's a trend that started four or five years ago, but now I see many companies launching these API products. I think it's a good trend for the data-sourcing industry because it resolves quite a number of issues, but it has a cost, of course.


Henry NG 14:59 You obviously work a lot with the collaboration between proxies and web scraping solution providers. Do you see any developments in those tendencies and those offerings? And do you think both proxies and web scraping are moving in the right direction together?


Pierluigi 15:16 As I was saying, apart from the extraction APIs these proxy providers are actually trying to sell, I see more attention to IP sourcing as well. For many providers, the narrative of the proxy industry has moved to the ethical sourcing of IPs. For me, that's good for the industry, because web scraping is always seen as a shady business, but it's totally legitimate if you do it, of course, in a proper way. I like the way the industry is moving in this direction.


Henry NG 16:12 Good. Obviously, data scraping and proxies are becoming more accessible to everyone. So, in your view, for all these data-hungry companies that are coming to Databoutique or using their own scrapers, how do you see the direction of data extraction for them in the next decade? Do you see them bringing it in-house? Do you see them using platforms like Databoutique? What approach do you think they'll take?


Pierluigi 16:42 Actually, I've seen a growing interest in web data from companies. But as I was saying before, web scraping is getting more and more expensive, so there's a mismatch with the demand. At the moment only big companies can afford a great web scraping project. If you are a small company trying to run a web scraping project in-house, the results will be mixed, because you don't have the skills to do it. That's why we started Databoutique: we wanted to sell pre-scraped data at a low price for everyone.


Henry NG 17:32 I see. So it could technically go both ways depending on how advanced people want to get in terms of their personal skills and developing their in-house solution. But it does sound like there is a slight gap between where they are and where they need to be moving forward.


Pierluigi 17:50 Yeah, you're right.


Henry NG 17:51 Perfect. And one of the last topics we really wanted to discuss today was Web Scraping Club. Obviously, in the fall, you launched Web Scraping Club. It's a newsletter about all things scraping. What made you finally share your expertise in that type of format?


Pierluigi 18:10 As you know, I've been scraping for years, and of course I took notes on how to do this or that. And last year I thought: why are these notes only for me? Why not share them with every person who, like me, is struggling to find the right information online? There's also a sort of paranoia in the web scraping industry: people think that if you talk openly about how you solved something, the company will read your notes and say, 'let's change it so people cannot scrape us'. But it's not true, not technically true. Basically, I wanted to share my expertise. I had needed to find other people's expertise online to get past some technical issues myself. So I said, okay, why not? Why don't I share how I would do things myself and get feedback from the readers? Maybe I'm doing something wrong.


Henry NG 19:33 Fair enough. So it sounds like you just had a big heart and said, you know what, I want to share my knowledge with everyone else as well. That's great to hear. And looking at the growth that you've had - you've gained a large number of subscribers in a very short period of time - and considering it's a niche topic in a lot of people's minds, did you expect to receive that much attention? And what has been your biggest post so far? What's got the most interest?


Pierluigi 20:01 No, I did not expect these subscribers at all. We got to nearly 1,300 subscribers in less than a month, and as I said, it's a niche topic. One post made me think there could be an audience for this topic: it was, I think, my first post where I shared some code, and it was about scraping data from an app. I shared it on Hacker News and got 100 subscribers in one week because it reached the front page. That means there is interest in this topic.


Henry NG 20:58 So it sounds like you shot to web scraping fame on LinkedIn and with Web Scraping Club. So what are your plans for the newsletter? Have you got any new formats you wanted to bring forward or any new posts that you have in mind that you're going to look into in terms of topics?


Pierluigi 21:17 Well, I'm actually thinking about giving it a bit more structure, because the Web Scraping Club is not my main job and I need to find time to write the posts. Until now it was a bit disorganized; I didn't really have a publishing plan. But starting this month, I want to try to tell a story, basically linking all the articles in a month around one theme. This month will be AI month: we will invite some experts on AI in web scraping and will test some AI tools. Next month there will be another topic. I haven't decided yet, but yes, I'm trying to produce better content, because you can always do better.


Henry NG 22:32 Of course. I mean, you recently posted an article about writing a scraper using ChatGPT. Can you tell us a little bit more about that experiment, and where do you think AI fits in the field of web scraping?


Pierluigi 22:46 Yeah. I tried to write a scraper with ChatGPT, first using ChatGPT itself and then a public repository on GitHub called Scrape Ghost. It was a bit of a game; nothing that could be used in production, partly because of the time it takes. But it's an interesting evolution, because ChatGPT and the GPT models are not trained at the moment to scrape or to extract information from a website. Given their high level of language comprehension, though, I think we are not so far from making GPT models understand how HTML works; we have already done it with browsers. So I don't see it as a huge gap to fill. But my issue with AI-based products is that you're basically using a black box. Let's say you're using a pre-trained model, or a solution that's already on the market, to scrape a website and you get the wrong results. You cannot do anything, because you basically cannot fix anything: you cannot change the model, you cannot fix the API. It's only a matter of using it or not. I think until we solve this issue, we shouldn't trust AI too much for larger web scraping projects.
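To make the pattern concrete, here is a hedged sketch of the general LLM-as-extractor idea Pier is describing: hand the model raw HTML plus a description of the fields you want, and parse structured output back. The model name, prompt, and helper function are illustrative assumptions, not how Scrape Ghost is implemented.

```python
# Hypothetical sketch of the LLM-as-extractor pattern: pass raw HTML to a
# language model with a field description and parse JSON back. The model
# name and prompt are illustrative; error handling is minimal on purpose.
import json

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def extract_products(html: str) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system",
             "content": "You extract data from HTML. Reply with a JSON array only."},
            {"role": "user",
             "content": ("From this HTML, return objects with keys 'name' and "
                         f"'price' for every product listed:\n\n{html}")},
        ],
    )
    # The black-box problem Pier mentions: if this output is wrong, there is
    # no selector to fix; you can only tweak the prompt or swap the model.
    return json.loads(response.choices[0].message.content)
```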


Henry NG 25:07 That's a great insight, especially when so many people are starting to make ChatGPT part of their day-to-day work. It's good to get the other side of things and hear from someone who is a little bit apprehensive rather than diving straight in. So that's the end of our main interview questions. We have three questions that we always ask all guests, so I'd like to share them with you. The first one is: who in the world of data would you most like to take to lunch?


Pierluigi 25:38 There are many, but I think Seattle Data Guy would be a good choice. He has a very popular and interesting data newsletter. I think it would be interesting to meet him.


Henry NG 26:03 Okay. And what piece of software could you not live without and that you use on a daily basis?


Pierluigi 26:10 I know you are expecting me to say ChatGPT, but I won't say it. No. I still have to rely on Scrapy.


Henry NG 26:21 I see. And the final question is: when have you used data to solve a real-world problem that you've had? It can be at work or outside of it.


Pierluigi 26:32 Well, real-world? Okay. On a personal level, I wrote some scrapers because I needed to buy a new TV and wanted to find a good bargain. Honestly, I don't know if it's the same elsewhere, but here in Italy I had to monitor a website for a month to get a bargain and save €300 or €400 on a TV.
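A toy version of that kind of personal price monitor might look like the sketch below; the URL, the CSS selector, and the target price are hypothetical, and a real run would be scheduled daily with cron or similar.

```python
# Toy sketch of a personal price monitor: fetch a product page and alert
# when the price drops below a target. URL, selector, and threshold are
# hypothetical; schedule it daily with cron to watch for a bargain.
import re

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

URL = "https://www.example-electronics.it/tv/model-xyz"  # placeholder
TARGET_PRICE = 900.0  # alert threshold in euros


def current_price() -> float:
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.select_one("span.price").get_text()  # site-specific selector
    # Normalize a European price string like "1.199,00 €" to a float.
    return float(re.sub(r"[^\d,]", "", text).replace(",", "."))


if __name__ == "__main__":
    price = current_price()
    if price <= TARGET_PRICE:
        print(f"Bargain alert: the TV is down to €{price:.2f}")
```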


Henry NG 27:07 I will need to borrow that next time I buy a new TV. It sounds like a great tool to have on hand. That's all the questions that we had for today and all the time that we have. I'd like to thank all the listeners for joining in and listening to Ethical Data, Explained, and I'd like to thank our guest, Pier, for joining us and answering all of our questions. It's been great having you. Do you have any final words that you want to share with our listeners?


Pierluigi 27:33 Thank you. Thank you, Henry, for having me here, and thank you, SOAX, for this podcast. I find it very informative and it's a real pleasure to listen to it. And I'm really glad that there's more and more content about web scraping around, because we need to get out of our niche.


Henry NG 27:59 Of course. Thank you very much again. For everyone listening, thank you, and thank you, Pier.


Narrator 28:03 Ethical Data, Explained is brought to you by SOAX, a reputable provider of premium residential and mobile proxies. The gateway to data worldwide at scale. Make sure to search for Ethical Data, Explained in Apple Podcasts, Spotify, and Google Podcasts, or anywhere else podcasts are found, and hit Subscribe so you never miss an episode. On behalf of the team here at SOAX, thanks for listening.


Pierluigi Vinciguerra

Co-founder and CTO at Re Analytics and Databoutique.com. Pier has more than 15 years of experience in web scraping and recently founded the Web Scraping Club, where he shares news, findings, and insights about everything related to web scraping.
