
Big Data Developer Resume


San Jose, CA

SUMMARY

  • Accomplished IT professional with 7+ years of experience, specializing in Big Data and Data Warehousing tools and technologies.
  • Enthusiastic about exploring how big data analytics benefits different industry verticals - Banking, Insurance, Healthcare, Retail, Manufacturing, and Transportation.
  • In-depth understanding of key big data concepts - distributed file systems, parallel processing, high availability, fault tolerance, and scalability.
  • Worked with the major big data ecosystem tools and frameworks - Hadoop, Sqoop, Hive, PySpark, Kafka, Oozie, and ZooKeeper.
  • Experience working with major Hadoop distributions - Cloudera and Hortonworks.
  • Expertise in writing custom UDFs to extend Hive core functionality.
  • Built data pipelines between RDBMS sources and the Hadoop Distributed File System using Sqoop to import and export data.
  • Experience in building real-time data pipelines with Kafka Connect and Spark Streaming.
  • In-depth knowledge of Spark architecture and its key components - Spark Core, Spark SQL, DataFrames, and Spark Streaming - and its two pivotal abstractions, the RDD and the DAG.
  • Experience working with the main optimized file formats in Hadoop - Optimized Row Columnar (ORC), Avro, and Parquet.
  • Translated Informatica ETL mappings into equivalent PySpark jobs on Hadoop (a sketch of this pattern follows this list).
  • Experience in designing both time-driven and data-driven automated workflows using Oozie.
  • Extensive experience working in Oracle, SQL Server and MySQL databases.
  • Hands-on experience working with NoSQL databases like MongoDB, HBase, and Cassandra and their integration with the Hadoop cluster.
  • Worked with several Python libraries like NumPy, Pandas and Matplotlib.
  • Sound experience in building ETL pipelines between several source systems and the Enterprise Data Warehouse using Informatica PowerCenter.
  • Extensive experience in developing Stored Procedures, Functions, Views, Triggers, and complex SQL queries using Oracle PL/SQL.
  • Experience in resolving ongoing maintenance issues and bug fixes; monitoring Informatica sessions as well as performance tuning of mappings and sessions.
  • Experience in different Linux shell scripting flavors like Bash and KornShell.
  • Experience in Dimensional Data Modeling using Star and Snowflake Schema.
  • Experience working with Git and SVN enterprise version control systems.
  • Extensive experience writing UNIX shell scripts and automating ETL processes with them.
  • Experience in all phases of data warehouse development, from requirements gathering through coding, unit testing, and documentation.
  • Experience in using Automation Scheduling tools like Control-M, Tidal.
  • Expertise in complex troubleshooting, root-cause analysis and solution development.
  • Sound knowledge of programming concepts like OOP, multithreading, collections, and exception handling.
  • Experience with different project management software/ tools like Jira and HP ALM.
  • Experience in working with IDEs like Eclipse, PyCharm.
  • Worked in both Agile and Waterfall project management methodologies.
  • Possess excellent people skills and rational thinking in decision making.
  • Quick learner and hard worker, committed to delivering quality results.
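
Below is a minimal sketch of the Informatica-to-PySpark translation pattern referenced in the summary. It assumes a Spark 2.x SparkSession with Hive support; the database, table, and column names are hypothetical and used only for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical database/table/column names, used only to illustrate the pattern.
    spark = (SparkSession.builder
             .appName("informatica-to-pyspark-etl")
             .enableHiveSupport()
             .getOrCreate())

    # Source-qualifier, expression, and filter logic from the original mapping,
    # re-expressed as DataFrame transformations.
    orders = spark.table("staging.orders_raw")
    cleaned = (orders
               .filter(F.col("order_status").isNotNull())
               .withColumn("order_amount", F.col("order_amount").cast("decimal(18,2)"))
               .withColumn("load_dt", F.current_date()))

    # Persist the curated layer as partitioned ORC, matching the warehouse target table.
    (cleaned.write
            .mode("overwrite")
            .format("orc")
            .partitionBy("load_dt")
            .saveAsTable("curated.orders"))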

TECHNICAL SKILLS

Big Data Ecosystem: Cloudera Hadoop Distribution (CDH4, CDH5), Hadoop (HDFS, MapReduce, YARN), Sqoop, Hive, Spark, Oozie, HBase, Zookeeper, Kafka, Flume

ETL Tools: Informatica PowerCenter 9.6.1/9.5.1/8.x

Languages: Python, SQL, Bash/ Korn Shell Scripting

Databases: Oracle 11g/10g, MySQL, SAP HANA v2.3.25

Scheduling Tools: BMC Control-M, Tidal

Version Control Systems: Git, SVN

Methodologies: Agile, Waterfall

Operating Systems: Linux, Windows

Other Tools: Oracle SQL Developer, Eclipse IDE, PyCharm, Putty, WinSCP, Jira, HP ALM

PROFESSIONAL EXPERIENCE

Big Data Developer

Confidential, San Jose, CA

Responsibilities:

  • Worked on the Growth and Retention track of the subscription analytics domain.
  • Used Sqoop to import and export data between SAP HANA database and Hadoop.
  • Wrote Python scripts to extract data from different APIs.
  • Implemented incremental loads and SCD Type 2 logic to capture changed data.
  • Applied Hive optimization techniques to improve query performance.
  • Developed Hive UDFs to implement specific business logic.
  • Developed a real-time data streaming pipeline using Kafka and Spark Streaming (see the sketch after this list).
  • Leveraged different file formats (ORC and Parquet) and Snappy compression codec to optimize storage and processing.
  • Used different SerDes in Hive to read different file formats.
  • Involved in converting HQL queries into Spark transformations using Spark RDDs.
  • Automated jobs using shell scripts and scheduled them with Tidal and Oozie.
  • Created denormalized tables in SAP HANA specifically for reporting purposes.
  • Developed Hive queries to process the data and generate data cubes for visualization.
  • Built PySpark based applications for both batch and streaming requirements.
  • Optimized the performance of Spark applications in Hadoop by tuning configurations around SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Responsible for managing data coming from various sources.
  • Developed Oozie workflows to schedule various jobs in Hadoop ecosystem.
  • Helped in debugging the Tableau dashboards.
  • Used Git as the code repository and for version control.
  • Used Jira as the project management tool.
  • Involved in various activities like development, code review, QA and deployment.
  • Involved in grooming sessions and sprint planning with the Project Manager/Scrum Master.
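
Below is a minimal sketch of the Kafka-to-Spark Streaming pipeline described above, assuming the DStream-based KafkaUtils API available in the Spark 1.x line listed in the environment. The topic name, broker addresses, and event format are hypothetical.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    # Hypothetical topic and broker names, for illustration only.
    sc = SparkContext(appName="subscription-events-stream")
    ssc = StreamingContext(sc, batchDuration=30)  # 30-second micro-batches

    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["subscription_events"],
        kafkaParams={"metadata.broker.list": "broker1:9092,broker2:9092"})

    # Each record arrives as a (key, value) pair; count events per type in every batch.
    event_counts = stream.map(lambda kv: kv[1].split(",")[0]).countByValue()
    event_counts.pprint()

    ssc.start()
    ssc.awaitTermination()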

Environment: Cloudera CDH 5.16.1, MapReduce, HDFS, Hive v1.1.0, Spark v1.6.0, Python v2.7.5, Sqoop v1.4.6, Kafka v0.10.2, Oozie v4.1.0, SAP HANA v2.3.25, MySQL, Tableau v2019.2.4, Linux, Git, Perforce, PyCharm, JIRA.

Hadoop Developer

Confidential, Pittsburgh, PA

Responsibilities:

  • Involved in collecting business requirements, designing multiple data pipelines, and monitoring the data flow in the Cloudera Hue UI.
  • Imported and exported data between RDBMS systems and HDFS/HBase using Sqoop.
  • Stored data in the Parquet file format since it uses less space and has a high ingestion rate.
  • Used Gzip and Snappy compression codecs to compress files for efficient storage and processing.
  • Created external tables (both transactional and non-transactional) from compressed files in Hive.
  • Performed ad-hoc queries on structured data using HiveQL and used partitioning, bucketing techniques and joins with Hive for faster data access.
  • Worked on Spark Streaming for real time data processing.
  • Developed highly optimized Spark applications using Python to perform data cleansing, validation, transformation, and summarization activities according to requirements.
  • Developed Python scripts and UDFs using DataFrames/SQL/Datasets in Spark 2.1 for data aggregation and queries.
  • Built a data pipeline consisting of Spark, Hive, Kafka, Sqoop, and custom-built input adapters to ingest, transform, and analyze operational data.
  • Designed and developed jobs to validate data post migration, such as comparing reporting fields between source and destination systems using Spark SQL, RDDs, and DataFrames/Datasets (a sketch follows this list).
  • Worked on query performance, optimizing it through aggregation and other optimization techniques.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, the correct level of parallelism, and tuning memory.
  • Coordinated with the TMS team to gather data from the Kafka producers’ team and wrote Spark Core jobs to meet the business requirements.
  • Coordinated with the offshore team daily via teleconference to discuss roadblocks, issues, and developments.
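
Below is a minimal sketch of the post-migration validation jobs described above, assuming Spark 2.1 DataFrames with Hive support. The source/destination table names and the compared columns are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical table and column names; the real source/destination objects are project-specific.
    spark = (SparkSession.builder
             .appName("post-migration-validation")
             .enableHiveSupport()
             .getOrCreate())

    src = spark.table("source_db.orders")
    dst = spark.table("target_db.orders")

    # Compare row counts and a summed measure per business date across the two systems.
    src_agg = src.groupBy("order_date").agg(F.count("*").alias("src_rows"),
                                            F.sum("amount").cast("double").alias("src_amt"))
    dst_agg = dst.groupBy("order_date").agg(F.count("*").alias("dst_rows"),
                                            F.sum("amount").cast("double").alias("dst_amt"))

    # A date missing on one side shows up as zero counts after the fill.
    summary = src_agg.join(dst_agg, on="order_date", how="full_outer").na.fill(0)
    mismatches = summary.filter((F.col("src_rows") != F.col("dst_rows")) |
                                (F.col("src_amt") != F.col("dst_amt")))
    mismatches.show(truncate=False)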

Environment: Cloudera, MapReduce, HDFS, Hive, PySpark, Sqoop, Oozie, Oracle, Linux, Git, JIRA.

Hadoop Developer

Confidential, Pittsburgh, PA

Responsibilities:

  • Involved in installation and configuration of parcels for various Hadoop ecosystem components, including HDFS, Hive, Sqoop, and HBase.
  • Developed data pipeline using Sqoop to ingest customer behavioural data into HDFS for analysis.
  • Imported data into Hive from RDBMS systems using Sqoop.
  • Worked on Hive partitioning, bucketing and performed different types of joins on Hive tables and implemented Hive SerDe.
  • Used HBase to perform fast, random reads and writes to all the data stored and integrated with other components like Hive.
  • Added authorization to the server, using each user’s Kerberos identity to determine their role and which operations they could perform.
  • Monitored and debugged Hadoop jobs/applications running in production.
  • Performed cluster co-ordination and assisted with data capacity planning and node forecasting using Zookeeper.
  • Developed job flows in Oozie to automate the workflow for extraction of data from warehouses and weblogs.
  • Knowledge of performance troubleshooting and tuning of Hadoop clusters.
  • Provided user support and application support on the Hadoop infrastructure.
  • Used Spark SQL to analyze web logs and used Spark transformations and actions to compute statistics for web server monitoring (see the sketch after this list).
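
Below is a minimal sketch of the web-log analysis described above, assuming a Spark 2.x SparkSession. The log path, delimiter, and column layout are hypothetical; real web-log formats vary by server.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical log location and column layout.
    spark = SparkSession.builder.appName("weblog-monitoring").getOrCreate()

    logs = (spark.read.csv("/data/weblogs/access_log", sep=" ", inferSchema=True)
                 .toDF("host", "identity", "user", "timestamp", "request", "status", "bytes"))

    # Requests and server-error rate per host, as simple web-server monitoring statistics.
    stats = (logs.groupBy("host")
                 .agg(F.count("*").alias("requests"),
                      F.avg((F.col("status") >= 500).cast("int")).alias("error_rate")))
    stats.orderBy(F.desc("requests")).show(20)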

Environment: Cloudera, MapReduce, HDFS, Hive, PySpark, Sqoop, Oozie, Oracle, Linux, Git, JIRA.

ETL Developer

Confidential

Responsibilities:

  • Prepared technical requirements documents along with design and mapping documents.
  • Worked with the BA and Data Modeler/Architect teams in designing the data model for the project.
  • Used Informatica Power Center to create mappings, mapplets, User defined functions, workflows, worklets, sessions and tasks.
  • Designed ETL process using Informatica tool to load from Sources to Targets through data Transformations.
  • Designed an ETL system by creating Refresh mappings and Workflows for Siebel-OBIEE interface to load daily data into corresponding Dimension and Fact tables.
  • Involved in Maintaining the Repository Manager for creating Repositories, user groups, folders and migrating code from Dev to Test, Test to Prod environments.
  • Fine-tuned ETL processes by eliminating the bottlenecks at source and target levels.
  • Used push down optimization techniques for Load balancing, database performance tuning and capacity monitoring.
  • Involved in Creation of SQL, Packages, Functions, Procedures, Views, and Database Triggers.
  • Involved in data validation, data integrity, database performance, field size validations, check constraints, and data manipulation and updates using SQL single-row functions.
  • Designed and Developed ODS to Data Mart Mappings/Sessions/Workflows.
  • Created various Oracle database objects like Indexes, stored procedures, Materialized views, synonyms and functions for Data Import/Export.
  • Created reusable mapplets, worklets and workflows.
  • Used TOAD and MS SQL Server to run SQL queries and validate the data in warehouse and mart.
  • Populated error tables as part of the ETL process to capture the records that failed the migration.
  • Partitioned the Sessions for better performance.
  • Trained end users in using full client OBIEE for analysis and reporting.
  • Prepared extensive documentation on the design, development, implementation, daily loads, and process flow of the mappings.
  • Extensively used Informatica client tools Source Analyzer, Warehouse designer, Mapping Designer, Mapplet Designer, Transformation Developer, Informatica Repository Manager and Informatica Workflow Manager.
  • Designed and developed ETL routines using Informatica PowerCenter, making extensive use of Lookup, Aggregator, Java, XML, Rank, and Mapplet components, connected and unconnected stored procedures/functions/lookups, SQL overrides in Lookups, source filters in Source Qualifiers, and Routers to manage data flow into multiple targets.
  • Created complex mappings using Unconnected Lookup, Sorter, Aggregator, Union, Rank, Normalizer, Update Strategy, and Router transformations to populate target tables efficiently.
  • Created Type 1 and Type 2 SCD mappings.
  • Created UNIX shell scripts for FTP, error handling, error reports, parameter files, etc.
  • Created Stored Procedures, Packages and Triggers for complex requirements.
  • Experience in writing and optimizing SQL code across different databases.
  • Performed Unit testing and worked closely with offshore testing team in preparing ETL Test Plans, Test Cases, Test Scripts for Unit and System Integration testing.
  • Worked with the scheduling team to create and schedule jobs in Control-M.

Environment: Informatica, Oracle, MySQL, Toad, Linux, Control-M, SVN, JIRA.
