I'm looking for some volunteers familiar with (or at least, not afraid of!) Python to help me compile data from the legacy Lenovo support site before it disappears.
The goal is to preserve the Machine Type + Model number + Description details from the http://support1.lenovo.com/ website before Lenovo removes that site from the web as they warn they will (https://forums.lenovo.com/t5/Feedback-o ... 9835#M2245).
What Data Are You Trying to Save?
Currently, I'm just trying to save the original summary of the specifications for each model, so that if you have the Machine Type - Model number (e.g., 2008-01U), you would be able to look up the ThinkPad family ("ThinkPad T60") and the following set of specs: "T2500(2GHz), 512MB RAM, 80GB 5400rpm HD, 14.1in 1024x768 LCD, 64MB ATI Radeon X1300, CDRW/DVDRW, Intel 802.11abg wireless, Bluetooth/Modem, 1Gb Ethernet, UltraNav, Secure chip, Fingerprint reader, 6c Li-Ion batt, WinXP Pro".
Web Crawling Strategy
To do this, I have written a web crawling spider for Scrapy (http://scrapy.org/), a Python-based application framework for building web crawlers. To capture all the data, the crawler needs to run once for every four-digit Machine Type, but there are hundreds of Machine Types and a single run currently takes me about 6-10 hours. The spider uses a brute-force method, crawling through every possible three-character Model sequence for each Machine Type, which means that a single run consists of some 40,000+ page requests. Capturing all the data on my own would therefore take months and months. So I'm looking for help from Thinkpadders who would be willing to run the scraper on their local machine(s) and then send me the output (CSV files) to compile together.
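To give a feel for the size of the brute-force search, here is a minimal sketch of how the candidate Model numbers for one Machine Type could be enumerated. It is not the spider itself. The function name `candidate_models` is made up for illustration, and the alphabet (digits 0-9 plus uppercase A-Z, 36 symbols) is an assumption on my part; with that alphabet there are 36 x 36 x 36 = 46,656 candidates, which is in line with the "40,000+" requests per run.

```python
from itertools import product
from string import ascii_uppercase, digits

# Assumption: Model codes are built from digits and uppercase letters only.
ALPHABET = digits + ascii_uppercase


def candidate_models(machine_type):
    """Yield every possible 'TYPE-XXX' Machine Type - Model string."""
    for chars in product(ALPHABET, repeat=3):
        yield f"{machine_type}-{''.join(chars)}"


candidates = list(candidate_models("2008"))
print(len(candidates))            # number of page requests for one Machine Type
print(candidates[0], candidates[-1])
```

In a real spider, each of these strings would be turned into one page request against the support site; the URL pattern is deliberately left out here.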
Where Would the Data Go?
I would share the compiled data with all project helpers and with the ThinkPad community, probably in the form of a single CSV data file. I have also been thinking of setting up a website where the community could access this data in a friendlier way; if I do that, this data would form the starting point for that site's database.
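Compiling the helpers' files into a single CSV could look roughly like the sketch below. The column names in `FIELDS` and the function name `merge_csv_files` are hypothetical (the actual scraper output may be laid out differently), and dropping repeated Machine Type + Model rows is my assumption about how overlapping runs would be handled.

```python
import csv

# Hypothetical column names; the real scraper output may use different ones.
FIELDS = ["machine_type", "model", "family", "description"]


def merge_csv_files(paths, out_path):
    """Merge several CSV files into one, keeping the first row seen for
    each (machine_type, model) pair. Returns the number of rows written."""
    seen = set()
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    key = (row["machine_type"], row["model"])
                    if key in seen:
                        continue  # same model already contributed by another file
                    seen.add(key)
                    writer.writerow(row)
                    written += 1
    return written
```

Usage would be something like `merge_csv_files(["2007.csv", "2008.csv"], "all_models.csv")`, where the file names are placeholders.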
How You Can Help
- If you are a Python developer, then you could help me optimize my spider. This is my first Python project, and I'm sure my code could be improved...a lot!
- If you are a Python user or an intrepid Thinkpadder who is willing to install Python just to help out with this, then I could set you up with a list of Machine Types and get you started with the spider. I will probably be putting a copy of the spider on GitHub over the next week.
- If you are a ThinkPad specialist, then you could help me make sure that I understand exactly how the Machine Type and Model system works in relation to the legacy support site. I've got a couple of questions right away (see below).
- On the legacy support site (http://support1.lenovo.com/), you can use a selector to find drivers for your specific model of ThinkPad by narrowing down your selection from Series > Subseries > Machine Type > Model. However, when I use my scraper, I uncover many, many more models than those listed in the selector. Does anyone know why only a subset of the models is shown in the selector?
- When I reviewed the data from one Machine Type, I found that more than half of the Models have details that start with something like "Based on ...", presumably because the specs for that model were based on some other core model. But there are tons that appear to be "Based on ...-CTO", which suggests they are based on a custom (configure-to-order) configuration, and that doesn't make sense to me. I'm guessing this is just some weird way that IBM stored these details in the original database, but I wonder if anyone can shed any further light on why a specific model (e.g., 2008-BR7) would be based on a CTO model (2008-CTO).
I was partly inspired to do this by a thread on the German ThinkPad discussion forum where someone had done something similar and shared their copies of the scraper output for the T60/T60p/T61/T61p series:
http://thinkpad-forum.de/threads/178429 ... ost1798388 [in German!]
However, when I compared my scraper output to the output from this other scraper, I found that mine produced a list of models that was at least twice as long for each Machine Type. I believe this is because the other scraper only looked for models that appeared in the product selector list, or perhaps because it missed some data due to timeouts or too little delay between requests.
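For anyone curious about the timeout and delay side of things, these are the kinds of Scrapy settings that control it. The setting names are standard Scrapy settings, but every value below is only an assumed starting point for illustration, not what my spider actually uses and not something I have tuned against the support site.

```python
# settings.py (sketch): all values are assumptions, adjust after testing.

# Wait between requests to the same site, with some random jitter.
DOWNLOAD_DELAY = 1.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Keep the load on the legacy server low.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Give slow pages time to respond, and retry failures instead of dropping them.
DOWNLOAD_TIMEOUT = 60
RETRY_ENABLED = True
RETRY_TIMES = 5

# Let Scrapy slow down automatically when the server responds slowly.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```

A longer delay makes each run slower, but a request that times out and is never retried would silently leave a model out of the output.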
If you are interested in helping, you can post a reply in this thread or send me a private message.