Big Data Developer Resume
Pittsburgh, PA
SUMMARY
- 8+ years of experience in the design, analysis, and development of software applications using Java, Big Data/Hadoop, and Spark technologies.
- Worked on open-source Apache Hadoop, Cloudera Enterprise (CDH), and Hortonworks Data Platform (HDP).
- Hands-on experience with major components of the Hadoop ecosystem such as MapReduce, HDFS, Hive, Pig, HBase, ZooKeeper, and Sqoop.
- Hands-on experience ingesting live stream data using Spark Streaming and Apache Kafka.
- Capable of processing large structured, semi-structured, and unstructured data sets.
- Experience with job workflow scheduling and coordination tools such as Oozie and ZooKeeper.
- Proficient in designing and querying NoSQL databases such as HBase, Cassandra, and MongoDB, as well as the relational database PostgreSQL.
- Good knowledge of Spark Core and Spark SQL.
- Expert in data analysis, gap analysis, coordination with business stakeholders, requirements gathering, and technical documentation. Experience with multiple Hadoop distributions, including Hortonworks and Cloudera.
- Managed data coming from different sources and was involved in HDFS maintenance and the loading of structured and unstructured data.
- Good experience in extending the core functionality of Hive and Pig by developing user-defined functions to provide custom capabilities to these languages.
- Hands-on experience working with file formats such as Parquet, JSON, and ORC.
- Worked on Extraction, Transformation, and Loading (ETL) of data from multiple sources like Flat files, XML files and Databases.
- Familiar with streaming microservices using Kafka.
- Experience working with build tools such as Maven, Ant, and SBT, and application containers such as Apache Tomcat.
- Experience with scheduling and monitoring workflows using Apache Airflow.
- Experience with ETL tools such as Informatica and DataStage, and with the Snowflake cloud data warehouse.
- Proficient in implementing Object-Oriented Programming (OOP) concepts.
- Good knowledge of HTML, CSS, JavaScript, and web-based applications.
- Used Agile Development Methodology and Scrum for the development process.
- Familiar with Jenkins for CI/CD process implementation.
- Good knowledge in containerizing applications using Docker.
- Familiar with managing containerized workloads and services using Kubernetes.
- Hands-on experience working with GitHub and GitBucket repositories.
TECHNICAL SKILLS
Programming & Scripting Languages: Python, Scala, Java, R, C, Shell Scripting.
Databases: MySQL, PL/SQL, SQL Server, Postgres, SparkSQL, MongoDB, Oracle DB, HBase, Cassandra.
Cloud Services: AWS, S3, EMR, Lambda, Redshift, EC2, Glue, Athena.
Platforms: Windows, Linux (Ubuntu), Mac OS, CentOS (Cloudera).
Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Impala, ZooKeeper, Flume, Airflow, Informatica, Databricks, DataStage, Snowflake, Kafka, Cloudera.
Software Methodologies: Agile, Scrum, Waterfall.
PROFESSIONAL EXPERIENCE
Big Data Developer
Confidential - Pittsburgh, PA
Responsibilities:
- Designed and implemented data pipelines that launch several Spark clusters, integrated with Glue, to read datasets from various data sources, perform transformations and analytics, and store the results for the application.
- Implemented a generic framework that handles different data collection methodologies from the client's primary data sources, validates and transforms the data using Spark, and loads it into S3.
- Provided a SQL engine over the S3 data lake by adopting the Parquet storage format with Spark SQL as the query engine.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time.
- Set up and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Integrated AWS Kinesis with the on-premises Kafka cluster.
- Managed data pipelines using Airflow.
- Migrated historical data to S3 and developed a reliable mechanism for processing the incremental updates.
- Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Collaborated with the infrastructure, network, database, application, and BI teams to ensure data quality and availability.
Environment: AWS, S3, Glue, Athena, Redshift, Python, Scala, Spark, SparkSQL, Airflow, Kafka, Hive, Linux, Slack, IntelliJ, Git, UNIX Shell Scripting.
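Illustrative sketch for the incremental-update work listed above. The resume does not describe the actual mechanism, so this shows only one common approach under stated assumptions: records are keyed by a hypothetical `id` field and carry a hypothetical `updated_at` timestamp, and a newer record replaces an older one. Plain Python dictionaries stand in for data stored in S3.

```python
from datetime import datetime


def apply_incremental_updates(snapshot, updates):
    """Merge a batch of updates into a snapshot keyed by record id.

    A record from `updates` replaces the snapshot record only when its
    `updated_at` timestamp is strictly newer, so replaying the same batch
    leaves the result unchanged (idempotent).
    """
    merged = dict(snapshot)  # copy so the caller's snapshot is left untouched
    for record in updates:
        key = record["id"]
        current = merged.get(key)
        if current is None or record["updated_at"] > current["updated_at"]:
            merged[key] = record
    return merged


# Hypothetical historical snapshot, as it might look after the initial migration.
snapshot = {
    1: {"id": 1, "value": "a", "updated_at": datetime(2020, 1, 1)},
    2: {"id": 2, "value": "b", "updated_at": datetime(2020, 1, 1)},
}

# Hypothetical incremental batch: one newer record, one stale record, one new key.
updates = [
    {"id": 1, "value": "a2", "updated_at": datetime(2020, 2, 1)},
    {"id": 2, "value": "stale", "updated_at": datetime(2019, 12, 1)},
    {"id": 3, "value": "c", "updated_at": datetime(2020, 2, 1)},
]

result = apply_incremental_updates(snapshot, updates)
```

Because stale and duplicate records are ignored, a failed batch can be re-run safely, which is one way such a mechanism can be made reliable.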
Data Engineer
Confidential - Seattle, Washington
Responsibilities:
- Created, optimized, updated, and maintained logical and physical data models for various databases, applications, and systems.
- Led design reviews of data models and relevant metadata to ensure consistency, quality, accuracy, and integrity.
- Collaborated with database administrators in creating physical data schemas from the logical and physical data models to ensure compliance with business requirements.
- Designed and developed data mapping and transformation scripts to support data warehouse development, structural changes to multiple RDBMSs, and data analytics efforts, and designed effective ETL logic and code as required.
- Defined and governed data modeling and design standards, tools, best practices, and related development methodologies as required.
- Utilized data modeling tools and associated graphical methods to depict and analyze conceptual, logical, and physical data schemas.
- Evaluated data models, business data objects and physical databases for proper usage or re-use of data models in different environments.
Environment: Erwin Data Modeler, Oracle Designer, Toad database management toolset, ER/Studio software, Snowflake, DataStage, JIRA, Oracle SQL Developer Data Modeler.
Big Data/Hadoop Developer
Confidential - Madison, NJ
Responsibilities:
- Participated in the requirements-gathering phase of the SDLC and helped the team break the project into modules.
- Developed MapReduce programs to parse the raw data, populate staging tables and store the refined data in partitioned tables.
- Involved in data ingestion into HDFS using Sqoop and Flume from a variety of sources.
- Responsible for managing data from various sources.
- Used Kafka to publish streamed data to topics and to consume that data.
- Worked with NoSQL databases like Cassandra to load large sets of semi-structured data coming from various sources.
- Configured and wrote Hive UDFs that helped spot market trends.
- Involved in loading data from the UNIX file system to HDFS.
- Created Hive tables to load the data and wrote Hive queries to analyze the data.
Environment: HDFS, Pig, Hive, Cassandra, Python, Java, Spark, Oozie, Sqoop, Kafka, AWS, Linux Shell Scripting.
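Illustrative sketch of the parse-and-partition pattern described in the MapReduce item above. This is not the original job: the raw `date,symbol,price` line format and the choice of date as partition key are assumptions made for the example, and plain Python stands in for Hadoop MapReduce and Hive partitioned tables.

```python
from collections import defaultdict


def parse_line(line):
    """Map step: turn one raw 'date,symbol,price' line into (partition_key, row).

    Returns None for malformed lines so they can be skipped rather than
    failing the whole job.
    """
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        return None
    date, symbol, price = parts
    try:
        return date, {"symbol": symbol, "price": float(price)}
    except ValueError:
        return None


def partition_rows(lines):
    """Reduce step: group parsed rows by partition key (the date here)."""
    partitions = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            key, row = parsed
            partitions[key].append(row)
    return dict(partitions)


# Hypothetical raw input, including two malformed lines that get skipped.
raw = [
    "2020-01-01,ABC,10.5",
    "2020-01-01,XYZ,7.25",
    "2020-01-02,ABC,11.0",
    "bad line",
    "2020-01-02,XYZ,not_a_number",
]

partitions = partition_rows(raw)
```

Each key of `partitions` corresponds to what would be one partition of the refined table, so queries that filter on the partition key only need to read the matching group.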
