Screen Scraping

Many people use the term "screen scraping" when it really isn't appropriate. It is an older term for a strongly discouraged practice: trying to recover information that a program displayed on the screen by comparing the pixels on the screen with the pixels each character would generate. The practice is looked down upon because it is such a messy way of doing inter-program communication.

However, since XML is supposed to simplify such communication problems, I think the term is wrong in this context. RDF extraction is a perfectly legitimate way of generating RDF data. XML is designed to be a structured data format -- generating RDF-type information from it is exactly the type of thing it was designed for. "Scraping" implies that, as with the screen scraping technique, a program is trying to get at information that wasn't designed to be gotten at.

The term has also been used in the case of HTML, where it is a much better fit. Generating XML files (especially RSS, as Moreover does) from an HTML site has been called "site scraping". This makes much more sense, since HTML was designed to be displayed by a browser, not interpreted by a program.

However, the system I suggest is a perfectly reasonable way of dealing with RDF. XML is designed to represent custom structured data formats, so it seems silly to "reinvent the wheel" and force people to use a specialized syntax to gain the benefits of RDF. I see this as bringing RDF to the people: they don't have to learn the complexities of DLGs and predicate calculus, yet they still gain the benefits that RDF provides.

So not only am I trying to extract RDF from a structured format like XML, but in many cases I'm trying to extract it from a format like RSS, which had the RDF model in mind from the beginning but didn't use the RDF syntax.
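
To make the idea concrete, here is a rough sketch in Python of what this kind of extraction might look like. It is only an illustration under stated assumptions, not the actual system: the sample feed, the `VOCAB` predicate prefix, and the `extract_triples` function are all made up for this example. It reads a plain RSS-style document with the standard library's XML parser and emits (subject, predicate, object) triples, using each element's link as its identifier.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS-style sample; the feed content is invented for illustration.
SAMPLE_RSS = """<rss version="0.91">
  <channel>
    <title>Example Feed</title>
    <link>http://example.org/</link>
    <item>
      <title>First post</title>
      <link>http://example.org/first</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/second</link>
    </item>
  </channel>
</rss>"""

# Assumed predicate prefix; a real mapping would pick an established vocabulary.
VOCAB = "http://example.org/vocab#"


def extract_triples(rss_text):
    """Return (subject, predicate, object) tuples read from plain RSS elements."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    # Use the channel's link as its identifier (an assumption of this sketch).
    channel_uri = channel.findtext("link")
    triples = [(channel_uri, VOCAB + "title", channel.findtext("title"))]
    for item in channel.findall("item"):
        item_uri = item.findtext("link")
        # Relate the channel to each item, then describe the item itself.
        triples.append((channel_uri, VOCAB + "item", item_uri))
        triples.append((item_uri, VOCAB + "title", item.findtext("title")))
    return triples


for subject, predicate, obj in extract_triples(SAMPLE_RSS):
    print(subject, predicate, obj)
```

Nothing here guesses at pixels or page layout: the element names already say what each value means, so the mapping to triples is a straightforward walk over the structure.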

Part of LogicError. Powered by Blogspace, an Aaron Swartz project.