Download of Complex Websites and Archives for Further Analysis

 
 
Sulkhan
 
Reply Wed 20 Aug, 2008 11:36 am
My problem is that I am working with a very large online database of the UN, the Treaty database. I need to download all the pages available online and then use my content analysis tool to evaluate the data. Most of my time is wasted because I have to save all the pages manually (25 results are displayed per page, and sometimes I have 3000 - 4000 results). I am looking for a tool that can help me download all the data at once. That would save me a lot of time. Does anyone have any suggestions?
 
Robert Gentel
 
  3  
Reply Wed 20 Aug, 2008 11:42 am
@Sulkhan,
I've used HTTrack in the past and it works pretty well for that.
Sulkhan
 
  2  
Reply Wed 20 Aug, 2008 12:27 pm
@Robert Gentel,
Thank you very much. I am trying this. Well, it is not working with sub pages, but the program is downloading the entire database (I guess), so let's see what happens. Thank you again!
Robert Gentel
 
  1  
Reply Wed 20 Aug, 2008 03:23 pm
@Sulkhan,
You may have to tweak the settings a bit to get what you want.
 
hingehead
 
  2  
Reply Wed 20 Aug, 2008 04:26 pm
@Sulkhan,
Have you contemplated actually contacting the UN? They might assist. Depending on your purposes.
High Seas
 
  1  
Reply Wed 20 Aug, 2008 04:32 pm
@Sulkhan,
Sulkhan - if you can give any indication of the order of magnitude involved, many posters here can answer you. Tera- , Peta-, whatever. And btw, HOW did YOU (never heard of you before, for the record) come to have a "negative" rating of MINUS 50?
Quote:
* Answered Questions: 0
* Reputation: -50.00
* Posts: 2
* Location:
* Occupation:

Not that it matters to anyone in mathematics (we DO use Arabic numerals) but I'm just curious. Tks.
Sulkhan
 
  2  
Reply Wed 20 Aug, 2008 04:54 pm
@hingehead,
Oh yes, I did. Actually, they cannot do much about that. They have a huge database and seem quite confused by the amount of work (that is my impression).
Robert Gentel
 
  3  
Reply Wed 20 Aug, 2008 04:56 pm
@Sulkhan,
Don't worry about the ratings, some old members don't like them (they are a new feature) and are trying to stir up more angst about them by rating new people down and pointing out how it's unhelpful to the community.

This will pass as their maturity that deserted them returns.
Sulkhan
 
  3  
Reply Wed 20 Aug, 2008 04:57 pm
@High Seas,
I do not really know how this site works exactly; I just needed online help and used the search engine. Why do people here not like me? No idea, maybe my name is too complicated :)

To be honest, I cannot exactly understand what you mean by "order of magnitude", I am sorry.
Sulkhan
 
  2  
Reply Wed 20 Aug, 2008 04:59 pm
@Robert Gentel,
Well, that could be a possible explanation too :)
High Seas
 
  0  
Reply Wed 20 Aug, 2008 05:00 pm
@Robert Gentel,
The question was about mathematics, which you signally failed to grasp.

YOU may be one of those OLDER members who can't grasp a goddam THING, MR (dr?) Gentel, but your I DI O TIC comment couldn't possibly apply to METADATA - look it UP ya DIMWIT.

Sorry to the original poster - I know the UN database and will get back to you.
 
High Seas
 
  0  
Reply Wed 20 Aug, 2008 05:02 pm
@Sulkhan,
well Sulkhan, no offense, but if you got passwords for accessing UN databases, then surely you got access to their technical support as well

If you do NOT have access to the databases then what IS your problem? Pls start at the beginning. Thanks
 
Robert Gentel
 
  1  
Reply Wed 20 Aug, 2008 05:05 pm
@Sulkhan,
If you can do this on a Linux box, you also might want to try wget. And here is an article that helps explain the options you'd use to download a whole website:

http://linuxreviews.org/quicktips/wget/

Be careful with your rules, primarily about what offsite links it will follow. When I've spidered sites I often stop the spider right away when I find it is too broad and would take too long (after all, you can end up telling it to try to spider the whole internet) or tweak the priorities and thresholds to get it to the pages I need it to go to.
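
To make the "be careful with your rules" point concrete, here is a minimal Python sketch of the kind of check a spider makes before following a link: stay on the starting host, skip non-web links, and stop at a fixed depth. The function name `should_follow`, the `max_depth` default, and the example.org URLs are all made up for illustration; this is not how wget or HTTrack implement their own rules.

```python
from urllib.parse import urljoin, urlparse


def should_follow(base_url, link, depth, max_depth=3):
    """Return the absolute URL to crawl next, or None if the link is out of bounds.

    Hypothetical helper: the rules here (same host, http/https only,
    fixed depth limit) are illustrative assumptions.
    """
    # Stop descending once the depth limit is reached.
    if depth >= max_depth:
        return None

    # Resolve relative links against the page they were found on.
    absolute = urljoin(base_url, link)
    parsed = urlparse(absolute)

    # Ignore mailto:, javascript:, ftp: and similar links.
    if parsed.scheme not in ("http", "https"):
        return None

    # Refuse offsite links so the spider cannot wander across the whole internet.
    if parsed.netloc != urlparse(base_url).netloc:
        return None

    return absolute


if __name__ == "__main__":
    base = "http://example.org/db/index.html"
    print(should_follow(base, "page2.html", depth=0))
    print(should_follow(base, "http://other.example.com/", depth=0))
```

A real spider would call a check like this for every link it finds and only queue the ones that come back as a URL.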
 
Sulkhan
 
  2  
Reply Wed 20 Aug, 2008 05:10 pm
@Sulkhan,
I am trying this program (you kindly provided the link). Well, I do not know what is wrong. It does not seem so difficult that I cannot understand it. Actually, what I want is a very simple thing: I would like to download the database so that I do not have to jump from page to page, which costs me too much time. Funny that it is so difficult to find a tool that can perform such an easy task.

Now something is wrong with the program or with the database. I want to download the database by country, let's say the USA. So I enter the link in the program and tell it to download the US database (the sites only). I have now been waiting for 45 minutes; the program has downloaded 50 MB but is still working. 45 minutes per country and it is not done yet? Then it is better to do everything manually again. Confused...
Robert Gentel
 
  2  
Reply Wed 20 Aug, 2008 05:22 pm
@Sulkhan,
Can you give a link to what you are trying to download? Or is it behind a membership login?
Sulkhan
 
  2  
Reply Wed 20 Aug, 2008 05:36 pm
@Robert Gentel,
No, it is free to access now. Sure here is the link:

http://157.150.195.4/LibertyIMS::/sidqLpzWYHM3DQ9iNSQ/cmd%3DXmlGetWebPage%3BCmdFile%3DXmlAdvSearch.cmd

This is the advanced search mode. There you can select any country from the list and simply press the search button, leaving everything else at the default. You will then get the results that I would like to download. The example for Canada is here:

http://157.150.195.4/LibertyIMS::/sidqLpzWYHM3DQ9iNSQ/Cmd%3D%24%24B8D1gZoxtumv91dda%3BAvBV%3D%23fP

That is what I would like to download: the complete results (my goal is to have them as plain text in the end).

Sometimes these links fail to work. In that case you can get to the page with the following steps. Go to: http://untreaty.un.org/

Press English

Go to the end of the page and press: Access to Databases

From the quick links, pick United Nations Treaty Series

Press Advanced Mode. There you are. Pick any country you like. So how can I save that database? :)
Robert Gentel
 
  1  
Reply Wed 20 Aug, 2008 07:49 pm
@Sulkhan,
Now I see where it's having problems. The spiders I recommended don't do well with this kind of "walled garden": the data is hidden behind POST forms, and the URL doesn't reference the results in any way (e.g. your Canada example doesn't show the info you saw when I visit your link).
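
To show what "hidden behind POST forms" means in practice, here is a rough Python sketch of how a script would have to ask for each page of results: the search terms travel in the request body, not in the URL, so a spider that only follows links never sees them. The field names (`country`, `start`, `count`) and the example.org address are pure assumptions; the real form almost certainly uses different names and probably also depends on the session token visible in the links above, which this sketch does not handle. The code only builds the requests and does not send anything.

```python
from urllib.parse import urlencode
from urllib.request import Request


def page_count(total_results, per_page=25):
    """Number of result pages needed (ceiling division)."""
    return -(-total_results // per_page)


def build_page_request(search_url, country, page, per_page=25):
    """Build (but do not send) a POST request for one page of results.

    The form field names are hypothetical placeholders.
    """
    fields = {
        "country": country,
        "start": (page - 1) * per_page + 1,  # index of first result on this page
        "count": per_page,
    }
    body = urlencode(fields).encode("ascii")
    return Request(search_url, data=body, method="POST")


if __name__ == "__main__":
    # With 25 results per page, 4000 results means 160 separate requests.
    print(page_count(4000))
    req = build_page_request("http://example.org/search", "Canada", page=2)
    print(req.get_method(), req.full_url, req.data)
```

Passing each request to `urllib.request.urlopen` would actually fetch the page, but only once the real field names and session handling have been worked out from the form itself.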

It's a much more complicated problem to solve because it requires the spider to emulate a human a bit more. Even Google is just now starting to test this kind of crawling with their spiders.

There are some web content extractors I have heard of that may be able to help you, but they often cost a few hundred dollars and you still need to train them in ways that might be more complex than your task merits.

So my recommendation now is that you look for software that lets you make macros or automate your own actions.

Things like AutoHotkey or, for a great Mac example, the Automator software, which let you use your mouse to train the computer to repeat your actions, may be useful.

I don't have personal experience using any of them in the exact way you are trying, but I would recommend you start with AutoHotkey (link above) and this one (in the free browser versions):

http://www.iopus.com/imacros/compare/all/

Thing is, there's really no way around doing some programmatic work to automate your manual work, because what you are doing isn't easy for a computer to replicate without precise instructions. In this case a website downloader or spider isn't going to work; you are in the realm of content scrapers and macros if you really want to automate this.
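
Since the end goal is plain text for a content analysis tool, the scraping half of the job can be small once the pages are saved, however they were saved. Below is a minimal sketch using only Python's standard library; the class and function names are made up for illustration, and the sample HTML is invented, not taken from the UN site.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, ignoring script and style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    sample = "<html><body><h1>Treaty</h1><p>Signed in <b>1950</b></p></body></html>"
    print(html_to_text(sample))
```

Running this over every saved results page and appending the output to one file would give a single text file to feed into the analysis tool.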
 
 
