Archive for the ‘Web Tech’ Category


If you’re responsible for monitoring Twitter for conversations about your brand, you’re faced with a challenge: You need to know what people are saying about your brand at all times AND you don’t want to live your entire life in front of Twitter Search.

Over the years, a number of social media applications have been released specifically for brand managers and social media teams, but most of those applications (especially the free/inexpensive ones) differentiate themselves only by the quality of their analytics and how real-time their data is reported. If that’s what you need, you have plenty of fantastic options. Those differentiators don’t really help you if you want to take a more passive role in monitoring Twitter search … You still have to log into the application to see your fancy dashboards with all of the information. Why can’t the data come to you?

About three weeks ago, Hazzy stopped by my desk and asked if I’d help build a tool that uses the Twitter Search API to collect brand keyword mentions and send an email alert with those mentions in digest form every 30 minutes. The social media team had been using Twilert for these types of alerts since February 2012, but over the last few months, messages had been delayed due to issues connecting to Twitter search … It seems that the service is so popular that it hits Twitter’s limits on API calls. An email digest scheduled to be sent every thirty minutes ends up going out ten hours late, and ten hours is an eternity in social media time. We needed something a little more timely and reliable, so I got to work on a simple “Twitter Monitor” script to find all mentions of our keyword(s) on Twitter, email those results in a simple digest format, and repeat the process every 30 minutes when new mentions are found.

With Bear’s Python-Twitter library on GitHub, connecting to the Twitter API is a breeze. Why did we use Bear’s library in particular? Just look at his profile picture. Yeah … ’nuff said. So with that Python wrapper to the Twitter API in place, I just had to figure out how to use the tools Twitter provided to get the job done. For the most part, the process was very clear, and Twitter actually made querying the search service much easier than we expected. The Search API finds all mentions of whatever string of characters you designate, so instead of creating an elaborate Boolean search for “SoftLayer OR #SoftLayer OR @SoftLayer …” or any number of combinations of arbitrary strings, we could simply search for “SoftLayer” and have all of those results included. If you want to see only @ replies or hashtags, you can limit your search to those alone, but because “SoftLayer” isn’t a word that gets thrown around much without referencing us, we wanted to see every instance. This is the code we ended up working with for the search functionality:

def status_by_search(search):
    # Query the Twitter Search API for the given term.
    statuses = api.GetSearch(term=search)
    # Keep only Tweets newer than the last one we reported (see tweet.id below).
    results = list(filter(lambda x: x.id > get_log_value(), statuses))
    returns = []
    if len(results) > 0:
        for result in results:
            returns.append(format_status(result))

        # Record the newest Tweet ID so the next run skips what we've already seen.
        new_tweets(results)
        return returns, len(returns)
    else:
        # Nothing new since the last run -- stop without sending an email.
        exit()

If you walk through the script, you’ll notice that we want to return only unseen Tweets to our email recipients. Shortly after we got the Twitter Monitor up and running, we noticed how easy it would be to get spammed with the same messages every time the script ran, so we had to filter our results accordingly. Twitter’s API allows you to request Tweets with an ID greater than one that you specify; however, when I tried designating that “oldest” Tweet ID, we had mixed results … Whether due to my ignorance or a fault in the implementation, we were getting fewer results than we should have. Tweet IDs are unique and numerically sequential, so they can be relied upon as much as a datetime (and they’re far easier to work with, to boot), so I decided to use the highest Tweet ID from each batch of processed messages to filter the next set of results. The script stores that Tweet ID and uses a little bit of logic to determine which Tweets are newer than the last Tweet reported.

def new_tweets(results):
    # Only update the stored ID if this batch contains something newer.
    newest_id = max(result.id for result in results)
    if get_log_value() < newest_id:
        set_log_value(newest_id)
        return True

def get_log_value():
    # Read the highest Tweet ID we've reported so far from the tweet.id file.
    with open('tweet.id', 'r') as f:
        return int(f.read())

def set_log_value(messageId):
    # Persist the newest Tweet ID for the next run to compare against.
    with open('tweet.id', 'w+') as f:
        f.write(str(messageId))

Once we culled out our new Tweets, we needed the script to email those results to our social media team. Luckily, we didn’t have to reinvent the wheel here: A few additional lines let us send an HTML-formatted email over any SMTP server. One downside of the script is that the login credentials for your SMTP server are stored in plaintext, so if you can come up with an alternative that adds a layer of security to those credentials (or lets you authenticate with a different kind of credential), we’d love for you to share it.
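If you want a feel for what that email step involves, here is a minimal sketch using Python’s standard smtplib and email modules. The server address, port, credentials, and sender/recipient addresses below are placeholders, and the actual Twitter Monitor script structures this a little differently, so treat it as an illustration rather than the script itself:

import smtplib
from email.mime.text import MIMEText

def send_digest(tweets, subject="Twitter Monitor Digest"):
    # Build a simple HTML body from the formatted Tweet strings.
    body = "<br/>".join(tweets)
    msg = MIMEText(body, "html")
    msg["Subject"] = "%s (%d new)" % (subject, len(tweets))
    msg["From"] = "monitor@example.com"      # placeholder sender
    msg["To"] = "social-team@example.com"    # placeholder recipient

    # Credentials in plaintext, as noted above -- swap in something safer if you can.
    server = smtplib.SMTP("mail.example.com", 587)
    server.starttls()
    server.login("monitor@example.com", "password")
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()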

From that point, we could run the script manually from the server (or a laptop for that matter), and an email digest would be sent with new Tweets. Because we wanted to automate that process, I added a cron job that would run the script at the desired interval. As a bonus, if the script doesn’t find any new Tweets since the last time it was run, it doesn’t send an email, so you won’t get spammed by “0 Results” messages overnight.

The script has been in action for a couple of weeks now, and it has gotten our social media team’s seal of approval. We’ve added a few features here and there (like adding the number of Tweets in an email to the email’s subject line), and I’ve enlisted the help of Kevin Landreth to clean up the code a little. Now, we’re ready to share the SoftLayer Twitter Monitor script with the world via GitHub!

SoftLayer Twitter Monitor on GitHub

The script should work well right out of the box in any Python environment with the required libraries after a few simple configuration changes:

  • Get your Twitter Consumer Key, Consumer Secret, Access Token and Access Token Secret from https://dev.twitter.com/
  • Copy/paste that information where noted in the script (a rough sketch of that configuration block follows this list).
  • Update your search term(s).
  • Enter your mailserver address and port.
  • Enter your email account credentials if you aren’t working with an open relay.
  • Set the self.from_ and self.to values to your preference.
  • Ensure all of the Python requirements are met.
  • Configure a cron job to run the script at your desired interval. For example, if you want to send emails every 10 minutes: */10 * * * * <path to python> <path to script> > /dev/null 2>&1
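For reference, the configuration block you’re editing looks roughly like the following when using Bear’s python-twitter library. The variable names here are illustrative placeholders rather than the script’s exact names, so follow the comments in the actual file:

import twitter  # Bear's python-twitter library (pip install python-twitter)

# Placeholders -- paste in the values from https://dev.twitter.com/
api = twitter.Api(
    consumer_key='YOUR_CONSUMER_KEY',
    consumer_secret='YOUR_CONSUMER_SECRET',
    access_token_key='YOUR_ACCESS_TOKEN',
    access_token_secret='YOUR_ACCESS_TOKEN_SECRET')

search_term = 'SoftLayer'          # the keyword(s) you want to monitor
smtp_server = 'mail.example.com'   # your mail server address
smtp_port = 25                     # and port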

As soon as you add your information, you should be in business. You’ll have an in-house Twitter Monitor that delivers a simple email digest of your new Twitter mentions at whatever interval you specify!

Like any good open source project, we want the community’s feedback on how it can be improved or other features we could incorporate. This script uses the Search API, but we’re also starting to play around with the Stream API and SoftLayer Message Queue to make some even cooler tools to automate brand monitoring on Twitter.


Breaking Down ‘Big Data’

Posted: 01/08/2013 in Web Tech

Forrester defines big data as “techniques and technologies that make capturing value from data at an extreme scale economical.” Gartner says, “Big data is the term adopted by the market to describe extreme information management and processing issues which exceed the capability of traditional information technology along one or multiple dimensions to support the use of the information assets.” Big data demands extreme horizontal scale that traditional IT management can’t handle, and it’s not a challenge exclusive to the Facebooks, Twitters and Tumblrs of the world … Just look at the Google search volume for “big data” over the past eight years:

[Chart: Google search interest in “big data” over the past eight years]

Developers are collectively facing information overload. As storage has become more and more affordable, it’s easier to justify collecting and saving more data. Users are more comfortable with creating and sharing content, and we’re able to track, log and index metrics and activity that previously would have been deleted in consideration of space constraints or cost. As the information age progresses, we are collecting more and more data at an ever-accelerating pace, and we’re sharing that data at an incredible rate.

To understand the different facets of this increased usage and demand, Gartner came up with the three V’s of big data that vary significantly from traditional data requirements: Volume, Velocity and Variety. Larger, more abundant pieces of data (“Volume”) are coming at a much faster speed (“Velocity”) in formats like media and walls of text that don’t easily fit into a column-and-row database structure (“Variety”). Given those equally important factors, many of the biggest players in the IT world have been hard at work to create solutions that provide the scale and speed developers need when they build social, analytics, gaming, financial or medical apps with large data sets.

When we talk about scaling databases here, we’re talking about scaling horizontally across multiple servers rather than scaling vertically by upgrading a single server — adding more RAM, increasing HDD capacity, etc. It’s important to make that distinction because it leads to a unique challenge shared by all distributed computer systems: The CAP Theorem. According to the CAP theorem, a distributed storage system must choose to sacrifice either consistency (that everyone sees the same data) or availability (that you can always read/write) while having partition tolerance (where the system continues to operate despite arbitrary message loss or failure of part of the system).

Let’s take a look at a few of the most common database models, what their strengths are, and how they handle the CAP theorem compromise of consistency v. availability:

Relational Databases

What They Do: Stores data in rows/columns. Parent-child records can be joined remotely on the server. Provides speed over scale. Some capacity for vertical scaling, poor capacity for horizontal scaling. This type of database is where most people start.
Horizontal Scaling: In a relational database system, horizontal scaling is possible via replication (sharing data between redundant nodes to ensure consistency), and some people have success with sharding (horizontal partitioning of data; a minimal sketch follows this list), but those techniques add a lot of complexity.
CAP Balance: Prefer consistency over availability.
When to use: When you have highly structured data, and you know what you’ll be storing. Great when production queries will be predictable.
Example Products: Oracle, SQLite, PostgreSQL, MySQL
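To make the sharding idea a bit more concrete, here is a minimal sketch of application-level hash sharding across several database hosts. The host names and the get_connection helper are hypothetical, and real sharding layers also have to deal with resharding, cross-shard joins and failover, which is where the added complexity comes from:

import hashlib

# Hypothetical shard hosts -- in practice each is a separate database server.
SHARDS = ["db0.example.com", "db1.example.com", "db2.example.com"]

def shard_for(customer_id):
    """Route a record to a shard by hashing its key (simple modulo placement)."""
    digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# All reads and writes for customer 42 go to the same shard:
host = shard_for(42)
# connection = get_connection(host)   # hypothetical connection helper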

Document-Oriented Databases

What They Do: Stores data in documents. Parent-child records can be stored in the same document and returned in a single fetch operation with no join (a quick sketch follows this list). The server is aware of the fields stored within a document, can query on them, and can return their properties selectively.
Horizontal Scaling: Horizontal scaling is provided via replication, or replication + sharding. Document-oriented databases also usually support relatively low-performance MapReduce for ad-hoc querying.
CAP Balance: Generally prefer consistency over availability.
When to Use: When your concept of a “record” has relatively bounded growth, and can store all of its related properties in a single doc.
Example Products: MongoDB, CouchDB, BigCouch, Cloudant
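As a quick illustration of the “parent and children in one document” point, here is a sketch using pymongo against a hypothetical local MongoDB instance; the database, collection and field names are made up:

from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; database/collection names are invented.
client = MongoClient("mongodb://localhost:27017")
posts = client.blog.posts

# The parent record and its child comments live in one document...
posts.insert_one({
    "title": "Breaking Down Big Data",
    "author": "khaz",
    "comments": [
        {"user": "aravind", "text": "Great overview"},
        {"user": "reader2", "text": "Very helpful"},
    ],
})

# ...so a single fetch returns the parent and children together, with no join.
doc = posts.find_one({"title": "Breaking Down Big Data"})
print(doc["comments"])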

Key-Value Stores

What They Do: Stores an arbitrary value at a key. Most can perform simple operations on a single value. Typically, each property of a record must be fetched in multiple trips, with Redis being an exception. Very simple, and very fast.
Horizontal Scaling: Horizontal scale is provided via sharding.
CAP Balance: Generally prefer consistency over availability.
When to Use: Very simple schemas, caching of upstream query results, or extreme speed scenarios like real-time counters (a quick sketch follows this list).
Example Products: CouchBase, Redis, PostgreSQL HStore, LevelDB
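Here is a short sketch of the caching and real-time-counter use cases with the redis-py client. The key names and TTL are arbitrary, and it assumes a Redis server running on localhost:

import redis

# Assumes a Redis server running locally on the default port.
r = redis.Redis(host="localhost", port=6379)

# Real-time counter: atomic increment, no read-modify-write round trip.
r.incr("pageviews:homepage")

# Cache an upstream query result for five minutes.
r.setex("cache:top-tweets", 300, "<expensive query result>")

cached = r.get("cache:top-tweets")  # returns bytes, or None once the TTL expires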

BigTable-Inspired Databases

What They Do: Data is put into column-oriented stores inspired by Google’s BigTable paper. These systems have tunable CAP parameters and can be adjusted to prefer either consistency or availability. Both of the example products below are somewhat operationally intensive (a quick sketch of the column-family data model follows this list).
Horizontal Scaling: Good speed and very wide horizontal scale capabilities.
CAP Balance: Prefer consistency over availability.
When to Use: When you need consistency and write performance that scales past the capabilities of a single machine. HBase in particular has been used with around 1,000 nodes in production.
Example Products: HBase, Cassandra (inspired by both BigTable and Dynamo)
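For a feel of the column-family data model, here is a sketch using the happybase client against a hypothetical HBase cluster with a Thrift gateway on localhost; the table, column-family and row-key names are made up:

import happybase

# Assumes an HBase Thrift gateway on localhost and an existing 'tweets' table
# with a 'd' column family; all names here are invented for illustration.
connection = happybase.Connection('localhost')
table = connection.table('tweets')

# Each row key maps to a sparse set of column-family:qualifier cells.
table.put(b'softlayer|348790123', {
    b'd:user': b'@example_user',
    b'd:text': b'Just spun up a new cloud instance',
})

# Read the whole row back (or a subset of columns) by key.
row = table.row(b'softlayer|348790123')
print(row[b'd:text'])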

Dynamo-Inspired Databases

What They Do: Distributed key/value stores inspired by Amazon’s Dynamo paper. A key written to a Dynamo ring is persisted on several nodes at once before a successful write is reported (a toy sketch of the ring follows this list). Riak also provides a native MapReduce implementation.
Horizontal Scaling: Dynamo-inspired databases usually provide the best scale and extremely strong data durability.
CAP Balance: Prefer availability over consistency.
When to Use: When the system must always be available for writes and effectively cannot lose data.
Example Products: Cassandra, Riak, BigCouch
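To illustrate the “persisted on several nodes at once” behavior, here is a toy consistent-hashing sketch of how a Dynamo-style ring might pick the N replica nodes for a key. Real systems layer virtual nodes, hinted handoff and read repair on top of this, so treat it purely as a mental model:

import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
N_REPLICAS = 3  # a write succeeds only after N_REPLICAS nodes persist it

def _hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

# Place each node at a position on the ring, sorted by hash.
ring = sorted((_hash(node), node) for node in NODES)

def replicas_for(key):
    """Walk clockwise from the key's position and take the next N distinct nodes."""
    start = bisect.bisect(ring, (_hash(key), ""))
    chosen = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == N_REPLICAS:
            break
    return chosen

print(replicas_for("tweet:348790123"))  # e.g. ['node-c', 'node-d', 'node-a']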

Each of these database models has strengths and weaknesses, and there are huge communities supporting each of the open source examples I gave above. If your database is a bottleneck, or you’re not getting the flexibility and scalability you need to handle your application’s volume, velocity and variety of data, start looking at some of these “big data” solutions.

@NK Aravind


Big Data is likely to have a major impact on our world, amounting to a significant technological shift over the next decade. Don’t believe us? Watch this video to find out about some real-world examples of how Big Data is already making an impact.
Thanks to IBM. Great success.


Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set. The target moves due to constant improvement in traditional DBMS technology as well as new databases like NoSQL and their ability to handle larger amounts of data. Because of this difficulty, new platforms of “big data” tools are being developed to handle various aspects of large quantities of data.

IBM Big Data Platform

Posted: 21/07/2013 in Web Tech

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”
As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data. Scientists regularly encounter limitations due to large data sets in many areas, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research. The limitations also affect Internet search, finance and business informatics. Data sets grow in size in part because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification readers, and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 quintillion (2.5×10¹⁸) bytes of data were created. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.
Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. What is considered “big data” varies depending on the capabilities of the organization managing the set, and on the capabilities of the applications that are traditionally used to process and analyze the data set in its domain. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.”

Big Data Analytics

Posted: 21/07/2013 in Web Tech


Between now and 2020, the sheer volume of digital information is predicted to increase to 35 trillion gigabytes – much of it coming from new sources including blogs, social media, internet search, and sensor networks.

Teradata can help you manage this onslaught with big data analytics for structured big data within an integrated relational database – and now Teradata’s Aster Data Analytic Platform can help you deal with emerging big data that typically has unknown relationships and includes non-relational data types. Together, these two powerful technologies provide greater insight than ever for smarter, faster decisions.

Teradata’s Aster Data Analytic Platform powers next-generation big data analytic applications with a massively parallel processing (MPP) analytics engine that stores big data and processes the analytics alongside that data. As a result, it delivers breakthrough performance and scalability to give you a competitive edge in these critical areas:

  • Enable new analytics: A big data analytics framework with pattern and graph analysis – workloads that are hard to define and execute in SQL – enables valuable new applications, including digital marketing optimization, fraud detection and prevention, and social network and relationship analysis.
  • Accelerate analytics development: A unique analytics architecture combined with a pre-built library of analytic modules, a visual development environment, and local testing capability simplifies and streamlines analytic development. A variety of languages are supported – including C, C++, C#, Java, Python, Perl, and R – to simplify development and embedding of rich analytics in the MPP data store.
  • High-performance and elastic scalability: Patented high parallelism and massive scalability for complex analytics that enable iterative, on-the-fly data exploration and analysis to rapidly uncover new and changing patterns in data.
  • Cost-effective big data analytics: Uses commodity hardware to provide lower cost to scale than alternative approaches.

Introduction

A very common use case for WebCenter applications is content contribution projects. WebCenter Portal and WebCenter Content are designed to work seamlessly with each other. Once set up correctly, a content contributor can make all content changes from within the Portal application without going into the Content administration application.

The following article shows the steps required to start a new WebCenter project for content contribution.

 

Main Article

JDeveloper Setup

Create a new Content repository connection in JDeveloper and add the following values.

 

Parameter                     Value
RIDC Socket Type              socket
Server Host Name              <Content Host name>
Listener Port                 4444
Context Root                  /cs
Cache Invalidation Interval   2
Authentication                Identity Propagation

 

In our example, the content server is running on a local VM, so the host name is localhost.

[Screenshot: Content repository connection settings in JDeveloper]

 

Application Setup

Create a new Portal Framework Application

[Screenshot: Creating a new Portal Framework Application]

Index Page: Change the default page in index.html to be your home page. This will show a pretty URL when the home page is rendered.

[Screenshot: Default page setting in index.html]

 

Login Success Page: To show a pretty URL after login, change the login_success page in faces-config.xml to your home page. This will always redirect to the home page after login, so do not change it if users can log in from any page in your application.

[Screenshot: login_success entry in faces-config.xml]

Page Hierarchy: Once the necessary design-time pages are created, add them to the page hierarchy and set the appropriate title and security options. Failure to do so will prevent the pages from being shown in the UI.

Contrary to the name, the page hierarchy shown in the UI is controlled by the Navigation Model.

[Screenshot: Page hierarchy settings]

 

Navigation Model: Best practice states that all pages should be added as Page Links within the Navigation Model. The navigation hierarchy created here dictates how page URLs are created.

[Screenshot: Navigation model with page links]

 

Page Template: Most users choose to create their own custom template. Our recommendation is to keep task flows to a minimum for optimal page performance. To allow users to edit the page at runtime, add the page editor tags shown below either within the template or within the page.

 

<!-- Wraps the page content so it can be customized at runtime -->
<pe:pageCustomizable id="hm_pgc1">
  <cust:panelCustomizable id="hm_pnc1" layout="scroll">
    <!-- The page's actual content is injected here by the template -->
    <af:facetRef facetName="content"/>
  </cust:panelCustomizable>
  <f:facet name="editor">
    <!-- Composer's page editor panel, shown when the user switches to edit mode -->
    <pe:pageEditorPanel id="pep1"/>
  </f:facet>
</pe:pageCustomizable>

 

Content Presenter: Content can be added directly via JDeveloper or at runtime via Composer. Custom Content Presenter templates should be created as per your requirements.

When adding the Content Presenter via JDeveloper, surround it with a showDetailFrame so that it can be edited at runtime by Composer.

 

[Screenshot: Adding a Content Presenter]

 

Configuration Settings

High Availability Support: Most production-grade applications run on multiple servers in a cluster, so “High Availability for ADF Scopes” should be checked so that JDeveloper can warn the developer when high availability violations are detected.

[Screenshot: High Availability for ADF Scopes option]

Cookie Path: Cookies are enabled by default in weblogic.xml and should remain that way. Additionally, the Cookie Trigger Path should be set to a unique value; a good rule of thumb is to set it to the same value as the context root.

[Screenshot: Cookie Trigger Path setting in weblogic.xml]

 

Context Root: Needless to say, the context root should always be set to an appropriate name. This is done via the Java EE Application settings in the Portal project properties. Optionally, the deployment profile name and application name can also be changed for clarity.

[Screenshot: Context root in Portal project properties]

 

Refer to performance blogs for other application configuration settings.

 

Deployment

Integrated Server deployment: For design/development-time testing, the application can be deployed by running the index.html page.
