EDge Game profile

Member
81

May 1st 2014, 18:15:06

I'm trying to build something that will crawl and scrape the files from https://marketplace.spp.org/...guest/binding-constraints

I haven't had much luck with the traditional methods, so I was hoping someone could help me out. Preferably in Python or C#, but a simple way to grab the URLs for each file after you go through the folders would help too. It looks like the page uses a file-browser widget, which makes the easy way of grabbing links a little tough.

Thanks!

Pang Game profile

Administrator
Game Development
5731

May 1st 2014, 19:40:06

I guess I'm sort of familiar with scraping....

Here are 2 ways you could do this:
1) Watch the Network tab in your browser's inspector as you navigate the site. Write scripts to mimic those requests. Get the response JSON from the calls and use that to appropriately build subsequent requests. Recursively do this until you have traversed the tree completely to get the data you desire.
2) Write a script in-browser (Tampermonkey?) using JavaScript that will mimic user interactions to traverse the links and collect relevant data. This doesn't work if you want a remotely hosted solution running on a rack somewhere.

That list is not exhaustive, but those are the two easiest lifts, IMO.
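
For what it's worth, option 1 could be sketched in Python roughly like this. Nothing here is taken from the actual site: the `path` query parameter and the "type"/"path" keys in the JSON are assumptions, so swap in whatever the Network tab really shows. `http_list_folder` and `collect_files` are just illustrative names.

```python
import json
import urllib.parse
import urllib.request


def http_list_folder(base_url, path):
    """Fetch one folder listing and return it as parsed JSON.

    ASSUMPTION: the listing endpoint takes the folder as a `path` query
    parameter and answers with a JSON array. Adjust to the real request.
    """
    url = base_url + "?" + urllib.parse.urlencode({"path": path})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect_files(list_folder, path="/"):
    """Walk the folder tree depth-first and return every file path found.

    `list_folder(path)` must return a list of dicts. The keys used here
    ("type" and "path") are hypothetical and must match the real response.
    """
    files = []
    for entry in list_folder(path):
        if entry["type"] == "folder":
            # Recurse into sub-folders, reusing the same listing call.
            files.extend(collect_files(list_folder, entry["path"]))
        else:
            files.append(entry["path"])
    return files
```

You'd call it as `collect_files(lambda p: http_list_folder(listing_url, p))`, where `listing_url` is the listing URL you find in the inspector. Passing the listing function in as an argument also lets you try the traversal against canned data before pointing it at the real site.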
-=Pang=-
Earth Empires Staff
pangaea [at] earthempires [dot] com

Boxcar - Earth Empires Clan & Alliance Hosting
http://www.boxcarhosting.com

iScode Game profile

Member
5718

May 1st 2014, 20:20:35

Originally posted by Pang:
I guess I'm sort of familiar with scraping....




lol
iScode
God of War


DEATH TO SOV!

EDge Game profile

Member
81

May 1st 2014, 21:15:26

Thanks for the ideas. I'm looking to build a script that can be run via cron on a regular basis to grab new data. I tried using Fiddler to inspect the requests while manually navigating the site, without making much headway. Although Tampermonkey won't work for this specific task, it looks pretty useful for other things, thanks!

BILL_DANGER Game profile

Member
524

May 1st 2014, 21:19:48

Originally posted by EDge:
Thanks for the ideas. I'm looking to build a script that can be run via cron on a regular basis to grab new data. I tried using Fiddler to inspect the requests while manually navigating the site, without making much headway. Although Tampermonkey won't work for this specific task, it looks pretty useful for other things, thanks!


I HAVEN'T USED FIDDLER MUCH.. FIREBUG (ADDON FOR FIREFOX ONLY AS THE NAME MAY IMPLY) IS A MUST-HAVE THOUGH, AND SHOULD ENABLE WHAT PANG DESCRIBES. IN THEIR CONSOLE LOG YOU'LL BE ABLE TO SEE FULL DETAILS INCLUDING HEADERS, POSTS, ETC FOR EVERY REQUEST YOU MAKE WHILE IT'S ENABLED.

I USED FIREBUG EXTENSIVELY WHEN WRITING MY BOT THAT EVERYONE THINKS IS SAM DANGER!

HA!
BILL

Pang Game profile

Administrator
Game Development
5731

May 1st 2014, 23:34:54

Use the Chrome inspector and watch the Network tab while you're using that site. It's better than trying to use Fiddler for this sort of thing. :)

At that point it's just a matter of language preference as to how to make requests and parse responses.
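
If you go with Python, replaying a request you copied out of the Network tab might look something like this, using only the standard library. The URL, headers, and payload are whatever the inspector shows you; `build_request` and `fetch_json` are just illustrative names, and this assumes the endpoint talks JSON.

```python
import json
import urllib.request


def build_request(url, headers=None, payload=None):
    """Build a request that mirrors one seen in the browser inspector.

    With a payload it becomes a POST carrying a JSON body;
    without one it is a plain GET.
    """
    headers = dict(headers or {})
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers.setdefault("Content-Type", "application/json")
    return urllib.request.Request(url, data=data, headers=headers)


def fetch_json(request, timeout=30):
    """Send the request and parse the response body as JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Copy over any headers the site seems to care about (cookies, Accept, and so on), then feed each response into building the next request.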